Category Archives: Data sources

Phil 9.14.19

Document This document describes the Facebook Privacy-Protected URLs-light release, resulting from a collaboration between Facebook and Social Science One. It was originally prepared for Social Science One grantees and describes the dataset’s scope, structure, and fields.

As part of this project, we are pleased to announce that we are making data from the URLs service available to the broader academic community for projects concerning the effect of social media on elections and democracy. This unprecedented dataset consists of web page addresses (URLs) that have been shared on Facebook starting January 1, 2017 through to and including February 19, 2019. URLs are included if shared by more than on average 100 unique accounts with public privacy settings. Read the complete Request for Proposals for more information.

Phil 2.25.19

7:00 – 2:30 ASRC TL

Start of first draft of whitepaper
Had a nice chat with Flynn about the paper. He agrees generally, and points out that physical security will converge with cybersecurity. I think that also hovering on a potential horizon is the addition of social security with physical and cyber security.
- Cybersecurity:
- Social Security:
  - Cyber China: Upgrading Propaganda, Public Opinion Work and Social Management for the Twenty-First Century
  - Inside China’s Dystopian Dreams: A.I., Shame and Lots of Cameras

2:30 – 4:30 PhD

Fix directory code of LMN so that it remembers the input and output directories – done
Add time bucketing capabilities. Do this by taking the complete conversation and splitting the results into N sublists. Take the beginning and ending time from each list and then use those to set the timestamp start and stop for each player’s posts.
Thinking about a time-series LMN tool that can chart the relative occurrence of the sorted terms over time. I think this could be done with tkinter. I would need to create and executable as described here, though the easiest answer seems to be pyinstaller.
Here are two papers that show the advantages of herding over nomadic behavior:
- Phagotrophy by a flagellate selects for colonial prey: A possible origin of multicellularity
  - Predation was a powerful selective force promoting increased morphological complexity in a unicellular prey held in constant environmental conditions. The green alga, Chlorella vulgaris, is a well-studied eukaryote, which has retained its normal unicellular form in cultures in our laboratories for thousands of generations. For the experiments reported here, steady-state unicellular C. vulgaris continuous cultures were inoculated with the predator Ochromonas vallescia, a phagotrophic flagellated protist (‘flagellate’). Within less than 100 generations of the prey, a multicellular Chlorella growth form became dominant in the culture (subsequently repeated in other cultures). The prey Chlorella first formed globose clusters of tens to hundreds of cells. After about 10–20 generations in the presence of the phagotroph, eight-celled colonies predominated. These colonies retained the eight-celled form indefinitely in continuous culture and when plated onto agar. These self-replicating, stable colonies were virtually immune to predation by the flagellate, but small enough that each Chlorella cell was exposed directly to the nutrient medium.
- De novo origins of multicellularity in response to predation
  - The transition from unicellular to multicellular life was one of a few major events in the history of life that created new opportunities for more complex biological systems to evolve. Predation is hypothesized as one selective pressure that may have driven the evolution of multicellularity. Here we show that de novo origins of simple multicellularity can evolve in response to predation. We subjected outcrossed populations of the unicellular green alga Chlamydomonas reinhardtii to selection by the filter-feeding predator Paramecium tetraurelia. Two of five experimental populations evolved multicellular structures not observed in unselected control populations within ~750 asexual generations. Considerable variation exists in the evolved multicellular life cycles, with both cell number and propagule size varying among isolates. Survival assays show that evolved multicellular traits provide effective protection against predation. These results support the hypothesis that selection imposed by predators may have played a role in some origins of multicellularity. \

Phil 12.7.18

7:00 – 4:30 ASRC NASA/PhD

Aaron’s making good progress on building the linear dungeon and recruiting users for the first live test!
Got the final version of the EAD 2019 paper to review 11 days ago and never saw the email. Needs to be done by the 12th!
Installed XlsxWriter on my home dev box
Information Extraction from Online Text — from Opinions to Arguments to Persuasion – Friday, December 7, 2018, 11:00 am-12:00 pm
The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Here’s a partial list of subjects from the call
- Dialogue and Interactive Systems
- Discourse and Pragmatics
- Document Analysis
- Generation
- Information Extraction and Text Mining
- Sentiment Analysis and Argument Mining
- Social Media
- Tagging, Chunking, Syntax and Parsing
- Textual Inference and Other Areas of Semantics
Starting Using argumentative structure to interpret debates in online deliberative democracy and eRulemaking
- Lead author: John Laurence (Scholar), Centre for Argument Technology, University of Dundee, UK
- Internet Argument Corpus
  - Overview: The Internet Argument Corpus (IAC) is a corpus for research in political debate on internet forums. It consists of ~11,000 disscussions, ~390,000 posts, and some ~73,000,000 words. Subsets of the data have been annotated for topic, stance, agreement, sarcasm, and nastiness among others.
  - The Data: The data is stored in JSON files with most annotations in CSV format (see included readme for details). Python code to load and use the data is included. The zip archive is 158MB.

Phil 11.1.18

7:00 – 4:30 ASRC PhD

Quick thought. Stampedes may be recognized not just from low variance (density of connections), but also the speed that a new term moves into the lexicon (stiffness)
The Junk News Aggregator, the Visual Junk News Aggregator and the Top 10 Junk News Aggregator are research projects of the Computational Propaganda group (COMPROP) of the Oxford Internet Institute (OII)at the University of Oxford.These aggregators are intended as tools to help researchers, journalists, and the public see what English language junk news stories are being shared and engaged with on Facebook, ahead of the 2018 US midterm elections on November 6, 2018.The aggregators show junk news posts along with how many reactions they received, for all eight types of post reactions available on Facebook, namely: Likes, Comments, Shares, and the five emoji reactions: Love, Haha, Wow, Angry, and Sad.
Reading Charles Perrow’s Normal Accidents. Riveting. All about dense, tightly connected networks with hidden information
- From The Montreal Review
  - Normal Accident drew attention to two different forms of organizational structure that Herbert Simon had pointed to years before, vertical integration, and what we now call modularity. Examining risky systems in the Accident book, I focused upon the unexpected interactions of different parts of the system that no designer could have expected and no operator comprehend or be able to interdict. Reading Charles Perrow’s Normal Accidents. Riveting. All about dense, tightly connected networks with hidden information

Building generators.

Need to change the “stepsize” in the Torrance generator to be variable – done. Here’s my little ode to The Shining:

#confg: {"rows":100, "sequence_length":26, "step":26, "type":"words"}
all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes 
jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work 
and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a 
dull boy all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no 
play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a dull boy 
all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes 
jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work 
and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a 
dull boy all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no 
play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a dull boy 
all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes 
jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a dull boy all work 
and no play makes jack a dull boy all work and no play makes jack a dull boy all work and no play makes jack a

Need to be able to turn out a numeric equivalent. Done with floating point. This:

#confg: {"function":math.sin(xx)*math.sin(xx/2.0)*math.cos(xx/4.0), "rows":100, "sequence_length":20, "step":1, "delta":0.4, "type":"floating_point"}
0.0,0.07697897630719268,0.27378318599563484,0.5027638400821064,0.6604469814238397,0.6714800165989514,0.519596709539434,0.2524851001382131,-0.04065231596017931,-0.2678812526747579,-0.37181365763470914,-0.34898182120310306,-0.24382057359778858,-0.12182487479311599,-0.035942415169752356,-0.0027892469005274916,0.00019865778200507415,0.016268713740310237,0.07979661440830532,0.19146155036709192,
0.07697897630719312,0.2737831859956355,0.5027638400821071,0.6604469814238401,0.6714800165989512,0.5195967095394334,0.2524851001382121,-0.04065231596018022,-0.26788125267475843,-0.37181365763470925,-0.3489818212031028,-0.24382057359778805,-0.12182487479311552,-0.0359424151697521,-0.0027892469005274395,0.0001986577820050832,0.016268713740310397,0.07979661440830574,0.19146155036709248,0.31158944024296154,
0.2737831859956368,0.502763840082108,0.6604469814238405,0.6714800165989508,0.5195967095394324,0.25248510013821085,-0.04065231596018143,-0.2678812526747592,-0.37181365763470936,-0.34898182120310245,-0.24382057359778747,-0.12182487479311502,-0.03594241516975184,-0.002789246900527388,0.00019865778200509222,0.01626871374031056,0.07979661440830614,0.191461550367093,0.311589440242962,0.3760334615921674,
0.5027638400821092,0.6604469814238411,0.6714800165989505,0.5195967095394312,0.25248510013820913,-0.040652315960182955,-0.26788125267476015,-0.37181365763470964,-0.348981821203102,-0.24382057359778667,-0.12182487479311428,-0.03594241516975145,-0.0027892469005273107,0.00019865778200510578,0.016268713740310803,0.07979661440830675,0.1914615503670939,0.3115894402429629,0.3760334615921675,0.3275646734005755,
0.660446981423842,0.6714800165989498,0.5195967095394289,0.2524851001382062,-0.04065231596018568,-0.2678812526747618,-0.37181365763471,-0.34898182120310123,-0.24382057359778553,-0.1218248747931133,-0.03594241516975093,-0.0027892469005272066,0.00019865778200512388,0.016268713740311122,0.07979661440830756,0.19146155036709495,0.31158944024296387,0.3760334615921676,0.3275646734005745,0.1475692800414062,
0.671480016598949,0.5195967095394267,0.25248510013820324,-0.04065231596018842,-0.2678812526747636,-0.3718136576347104,-0.34898182120310045,-0.24382057359778414,-0.12182487479311209,-0.03594241516975028,-0.002789246900527077,0.0001986577820051465,0.016268713740311528,0.07979661440830856,0.19146155036709636,0.3115894402429648,0.37603346159216783,0.32756467340057344,0.1475692800414041,-0.12805444308254293,
0.5195967095394245,0.2524851001382003,-0.04065231596019116,-0.2678812526747653,-0.3718136576347107,-0.3489818212030998,-0.24382057359778303,-0.12182487479311109,-0.03594241516974975,-0.0027892469005269733,0.00019865778200516457,0.016268713740311847,0.07979661440830936,0.19146155036709747,0.3115894402429657,0.37603346159216794,0.32756467340057244,0.147569280041402,-0.1280544430825456,-0.41793663502550105,
0.2524851001381973,-0.04065231596019389,-0.26788125267476703,-0.3718136576347111,-0.3489818212030989,-0.2438205735977817,-0.12182487479310988,-0.0359424151697491,-0.002789246900526843,0.00019865778200518717,0.01626871374031225,0.07979661440831039,0.1914615503670989,0.3115894402429671,0.3760334615921681,0.3275646734005709,0.14756928004139883,-0.1280544430825496,-0.41793663502550454,-0.6266831461371138,

Gives this:
Need to write a generator that reads in text (words and characters) and produces data tables with stepsizes
Need to write a generator that takes an equation as a waveform

USPTO Meeting. Use NN to produce multiple centrality / laplacians that user interact with
Working on my 810 tasks
- Potentially useful for mapmaking: Learning the Preferences of Ignorant, Inconsistent Agents
  - An important use of machine learning is to learn what people value. What posts or photos should a user be shown? Which jobs or activities would a person find rewarding? In each case, observations of people’s past choices can inform our inferences about their likes and preferences. If we assume that choices are approximately optimal according to some utility function, we can treat preference inference as Bayesian inverse planning. That is, given a prior on utility functions and some observed choices, we invert an optimal decision-making process to infer a posterior distribution on utility functions. However, people often deviate from approximate optimality. They have false beliefs, their planning is sub-optimal, and their choices may be temporally inconsistent due to hyperbolic discounting and other biases. We demonstrate how to incorporate these deviations into algorithms for preference inference by constructing generative models of planning for agents who are subject to false beliefs and time inconsistency. We explore the inferences these models make about preferences, beliefs, and biases. We present a behavioral experiment in which human subjects perform preference inference given the same observations of choices as our model. Results show that human subjects (like our model) explain choices in terms of systematic deviations from optimal behavior and suggest that they take such deviations into account when inferring preferences.
- An Overview of the Schwartz Theory of Basic Values (Added to normative map making)
  - This article presents an overview of the Schwartz theory of basic human values. It discusses the nature of values and spells out the features that are common to all values and what distinguishes one value from another. The theory identifies ten basic personal values that are recognized across cultures and explains where they come from. At the heart of the theory is the idea that values form a circular structure that reflects the motivations each value expresses. This circular structure, that captures the conflicts and compatibility among the ten values is apparently culturally universal. The article elucidates the psychological principles that give rise to it. Next, it presents the two major methods developed to measure the basic values, the Schwartz Value Survey and the Portrait Values Questionnaire. Findings from 82 countries, based on these and other methods, provide evidence for the validity of the theory across cultures. The findings reveal substantial differences in the value priorities of individuals. Surprisingly, however, the average value priorities of most societal groups exhibit a similar hierarchical order whose existence the article explains. The last section of the article clarifies how values differ from other concepts used to explain behavior—attitudes, beliefs, norms, and traits.

Phil 10.18.18

7:00 – 9:00, 12:00 – ASRC PhD

Reading the New Yorker piece How Russia Helped Swing the Election for Trump, about Kathleen Hall Jamieson‘s book Cyberwar: How Russian Hackers and Trolls Helped Elect a President—What We Don’t, Can’t, and Do Know. Some interesting points with respect to Adversarial Herding:
- Jamieson’s Post article was grounded in years of scholarship on political persuasion. She noted that political messages are especially effective when they are sent by trusted sources, such as members of one’s own community. Russian operatives, it turned out, disguised themselves in precisely this way. As the Times first reported, on June 8, 2016, a Facebook user depicting himself as Melvin Redick, a genial family man from Harrisburg, Pennsylvania, posted a link to DCLeaks.com, and wrote that users should check out “the hidden truth about Hillary Clinton, George Soros and other leaders of the US.” The profile photograph of “Redick” showed him in a backward baseball cap, alongside his young daughter—but Pennsylvania records showed no evidence of Redick’s existence, and the photograph matched an image of an unsuspecting man in Brazil. U.S. intelligence experts later announced, “with high confidence,” that DCLeaks was the creation of the G.R.U., Russia’s military-intelligence agency.
- Jamieson argues that the impact of the Russian cyberwar was likely enhanced by its consistency with messaging from Trump’s campaign, and by its strategic alignment with the campaign’s geographic and demographic objectives. Had the Kremlin tried to push voters in a new direction, its effort might have failed. But, Jamieson concluded, the Russian saboteurs nimbly amplified Trump’s divisive rhetoric on immigrants, minorities, and Muslims, among other signature topics, and targeted constituencies that he needed to reach.
Twitter released IRA dataset (announcement, archive), and Kate Starbird’s group has done some preliminary analysis
- A First Glimpse through the Data Window onto the Internet Research Agency’s Twitter Operations
Need to do something about the NESTA Call for Ideas, which is due “11am on Friday 9th November“
Continuing with Market-Oriented Programming
- Some thoughts on what the “cost” for a trip can reference
  - Passenger
    - Ticket price
      - provider: Current price, refundability, includes taxes
        
        carbon
        
        congestion
        
        other?
      - consumer: Acceptable range
    - Travel time
    - Departure time
    - Arrival time (plus arrival time confidence)
    - comfort (legroom, AC)
    - Number of stops (related to convenience)
    - Number of passengers
    - Time to wait
    - Externalities like airport security, which adds +/- 2 hours to air travel
  - Cargo
    - Divisibility (ship as one or more items)
    - Physical state for shipping (packaged, indivisible solid, fluid, gas)
      - Waste to food grade to living (is there a difference between algae and cattle? Pets? Show horses?
      - Refrigerated/heated
      - Danger
      - Stability/lifespan
      - weight
  - Aggregators provide simpler combinations of transportation options
- Any exchange that supports this format should be able to participate. Additionally, each exchange should contain a list of other exchanges that a consumer can request, so we don’t need another level of hierarchy. Exchanges could rate other exchanges as a quality measure
  - It also occurs to me that there could be some kind of peer-to-peer or mesh network for degraded modes. A degraded mode implies a certain level of emergency, which would affect the (now small-scale) allocation of resources.
- Some stuff about Mobility as a Service. Slide deck (from Canada Intelligent Transportation Service), and an app (Whim)
PSC AI/ML working group 9:00 – 12:00, plus writeup
- PSC will convene a working group meeting on Thursday, Oct. 18 from 9am – 10am to identify actions and policy considerations related to advancing the use of AI solutions in government. Come prepared to share your ideas and experience. We would welcome your specific feedback on these questions:
  - How can PSC help make the government a “smarter buyer” when it comes to AI/ML?
  - How are agencies effectively using AI/ML today?
  - In what other areas could these technologies be deployed in government today?
    - Looking for bad sensors on NOAA satellites
  - What is the current federal market and potential future market for AI/ML?
  - Notes:
    - How to help our members – federal contracts. Help make the federal market frictionless
    - Kevin – SmartForm? What are the main gvt concerns? Is it worry about False positives?
      - Competitiveness – no national strategy
      - Appropriate use, particularly law enforcement
      - Robotic Process Automation (RPA) Security, Compliancy, and adoption. Compliancy testing.
      - Data trust. Humans make errors. When ML makes the same errors, it’s worse.
    - A system that takes time to get accurate watching people perform is not the kind of system that the government can buy.
      - This implies that there has to be immediate benefit, and can have the possibility of downstream benefit.
    - Dell would love to participate (in what?) Something about cloud
    - Replacing legacy processes with better approaches
    - Fedramp-like compliance mechanism for AI. It is a requirement if it is a cloud service.
    - Perceived, implicit bias is the dominant narrative on the government side. Specific applications like facial recognition
    - Take a look at all the laws that might affect AI, to see how the constraints are affecting adoption/use with an eye towards removing barriers
    - Chris ?? There isn’t a very good understanding or clear linkage between the the promise and the current problems, such as staffing, tagged data, etc
    - What does it mean to be reskilled and retrained in an AI context?
    - President’s Management Agenda
    - The killer app is cost savings, particularly when one part of government is getting a better price than another part.
    - Federal Data Strategy
    - Send a note to Kevin about data availability. The difference between NOAA sensor data (clean and abundant), vs financial data, constantly changing spreadsheets that are not standardized. Maybe the creation of tools that make it easier to standardize data than use artisanal (usually Excel-based) solutions. Wrote it up for Aaron to review. It turned out to be a page.

Phil 4.26.18

Too much stuff posted yesterday, so I’m putting Kate Starbird’s new paper here:

Ecosystem or Echo-System? Exploring Content Sharing across Alternative Media Domains
- This research examines the competing narratives about the role and function of Syria Civil Defence, a volunteer humanitarian organization popularly known as the White Helmets, working in war-torn Syria. Using a mixed-method approach based on seed data collected from Twitter, and then extending out to the websites cited in that data, we examine content sharing practices across distinct media domains that functioned to construct, shape, and propagate these narratives. We articulate a predominantly alternative media “echo-system” of websites that repeatedly share content about the White Helmets. Among other findings, our work reveals a small set of websites and authors generating content that is spread across diverse sites, drawing audiences from distinct communities into a shared narrative. This analysis also reveals the integration of government funded media and geopolitical think tanks as source content for anti-White Helmets narratives. More broadly, the analysis demonstrates the role of alternative newswire-like services in providing content for alternative media websites. Though additional work is needed to understand these patterns over time and across topics, this paper provides insight into the dynamics of this multi-layered media ecosystem.

7:00 – 5:00 ASRC MKT

Referencing for Aanton at 5:00
Call Charlestown about getting last two years of payments
Benjamin D. Horne, Sara Khedr, and Sibel Adali. “Sampling the News Producers: A Large News and Feature Data Set for the Study of the Complex Media Landscape” ICWSM 2018
Continuing From I to We: Group Formation and Linguistic Adaption in an Online Xenophobic Forum
Anchor-Free Correlated Topic Modeling
- In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has an anchor word, which may be fragile in practice, because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurence statistics to come up with tensor factorization models, but identifiability still hinges on additional assumptions. In this work, we propose a new topic identification criterion using second order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization, and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and a slight sacrifice in accuracy. A variety of real text copora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word based approaches under various evaluation metrics.
Cleaning up the Angular/PHP example. Put on GitHub?
- Bob diagnosed the POST problem so that’s fixed
- Client and server on on github
- Worked out stories for next spring. Looks like it’s going to be Agent text generation using GANs

viztales

Dimension reduction, State, Orientation, and Speed