Monthly Archives: June 2021

Phil 6.30.21


Nice day at the TdF yesterday!


  • Roll in Clay’s last-minute changes
\caption[position=bottom]{Title of Figure\label{fig:F1}}
  • Submit!


  • Keep on adding content to new project

GPT Agents

  • Getting started with the Twitter API v2 for academic research
    • Welcome to this ‘101 course’ on getting started with academic research using the Twitter API. The objective of this course is to help academic researchers learn how to get Twitter data using the new Twitter API v2.
  • Meaningful measures of human society in the twenty-first century
    • Science rarely proceeds beyond what scientists can observe and measure, and sometimes what can be observed proceeds far ahead of scientific understanding. The twenty-first century offers such a moment in the study of human societies. A vastly larger share of behaviours is observed today than would have been imaginable at the close of the twentieth century. Our interpersonal communication, our movements and many of our everyday actions, are all potentially accessible for scientific research; sometimes through purposive instrumentation for scientific objectives (for example, satellite imagery), but far more often these objectives are, literally, an afterthought (for example, Twitter data streams). Here we evaluate the potential of this massive instrumentation—the creation of techniques for the structured representation and quantification—of human behaviour through the lens of scientific measurement and its principles. In particular, we focus on the question of how we extract scientific meaning from data that often were not created for such purposes. These data present conceptual, computational and ethical challenges that require a rejuvenation of our scientific theories to keep up with the rapidly changing social realities and our capacities to capture them. We require, in other words, new approaches to manage, use and analyze data.
  • Start testing model when it’s ready
  • Still chunking along:

Phil 6.29.21

Put out soaker hose! – Done!

Stewardship of Ourselves

  • The first (and perhaps foremost) of my concerns is the impact that the perturbation of our social dynamics may have on our collective cognitive abilities. In Cognitive Democracy, Henry Farrell and Cosma Shalizi make the (credible) case that democracy is intrinsically better at solving complex problems (of the kind that have rugged solution landscapes) than markets or hierarchies/bureaucracies.
  • I am far more concerned with how uniform these algorithms are across huge populations. The underlying insight that explains why diverse groups are better at complex problems is that a diverse set of intellectual tools and viewpoints will be better at finding solutions on a rugged landscape. In mediating so much of humankind’s discovery through the tiny funnel of a handful of systems, we are creating an unprecedented impoverishment of our intellectual toolbox. I am far less concerned about filter bubbles than I am about turning a complex, likely scale-free network of discovery into a fully-mediated hub-and-spokes structure in which everything flows through a system of very limited variety.


  • Roll in Clay’s changes today!
  • Done????


  • Start putting all the chapters in the same place. Take out all the placeholders and let’s see what we have

GPT Agents

  • Start training
    • yelp_American: started 7:55
  • 3:00 Meeting

Phil 6.28.21

I had a pretty wild dream last night. I was working at Google building physical neural networks. I think we were precipitating them out of a metallic semiconductor solution. My sense is that it was something where the input buffer was the cathode and the output was the anode. The finished systems were placed in mineral oil tanks, so they were basically artificial brains in a bucket. They looked something like this:

Google Brain?

Netron is a viewer for neural network, deep learning and machine learning models.

Sentence Transformers in the Hugging Face Hub

  • Sentence Transformers is a framework for sentence, paragraph and image embeddings. This allows deriving semantically meaningful embeddings, which is useful for applications such as semantic search or multi-lingual zero shot classification. As part of Sentence Transformers v2 release, there are a lot of cool new features:

RAM ProMaster Lift Kit (2014-2020)

Start enjoying those backcountry roads with the OHV 3″ lift kit engineered specifically for the RAM ProMaster chassis. Out of the factory, the van is naturally lower on the front end, a 3″ lift on the front axle, and a 2.25″ lift in the rear helps level out the vehicle. In addition to greater clearance, the lift kit also increases protection for any gear you may have mounted underneath. The ProMaster lift kit is truly designed to let you go anywhere.

GPT Agents

  • Create a view for reviews and businesses. Done
  • Search for types and start pulling out reviews + stars. Done. Here’s the estimated number of rows for each type, extrapolated from how many rows it takes to get to 100 samples:

The same info as a chart:

  • Once I figure that out start making training corpora. I think I’ll stick to those cuisines that have more than 100k estimated reviews – code is done, running the queries and creating the test/train corpora. I’m adding useful_votes, funny_votes, and cool_votes for some more ground truth numbers to look at. The format should work for Excel too, so the stats can be computed from there
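
The extrapolation behind that estimate can be sketched roughly as follows. This is a minimal sketch, not the code that ran the queries: it assumes rows are scanned in an order unrelated to cuisine type, and the function name, the 2,000-row scan, and the 1,000,000-row table size are made-up illustration values. Only the 100-sample target and the 100k cutoff come from the notes.

```python
def estimate_matching_rows(rows_scanned, samples_found, total_rows):
    """Extrapolate how many rows in the whole table match a type.

    If `samples_found` matches turned up in the first `rows_scanned` rows,
    the same hit rate is assumed to hold for all `total_rows` rows.
    """
    if rows_scanned <= 0:
        raise ValueError("rows_scanned must be positive")
    return round(samples_found * total_rows / rows_scanned)


# Hypothetical numbers: 100 samples found after scanning 2,000 of 1,000,000 rows.
estimate = estimate_matching_rows(rows_scanned=2_000, samples_found=100, total_rows=1_000_000)
print(estimate)  # 50000

# Apply the "more than 100k estimated reviews" cutoff from the notes.
keep = estimate > 100_000
print(keep)  # False
```

The estimate is only as good as the assumption that the scanned prefix is representative. If the table is ordered by business or by date, the hit rate in the first rows may differ from the rate overall.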


  • Roll in Upendra’s changes
  • Start updating Overleaf


  • 4:00 Tagup. Ping Andreea

Phil 6.27.21

Learning to hesitate

  • We investigate how people make choices when they are unsure about the value of the options they face and have to decide whether to choose now or wait and acquire more information first. In an experiment, we find that participants deviate from optimal information acquisition in a systematic manner. They acquire too much information (when they should only collect little) or not enough (when they should collect a lot). We show that this pattern can be explained as naturally emerging from Fechner cognitive errors. Over time participants tend to learn to approximate the optimal strategy when information is relatively costly.
  • Overall, participants make their decisions too quickly (sample too little information) when information is relatively cheap. Inversely, they hesitate too long (sample too much information) when information is relatively expensive. They stop after approximately 9 draws in the $0.10 treatment, 7 draws in the $0.50 treatment and 4 draws in the $1 treatment. In the lowest-cost treatment, this average is below the theoretical prediction, and in the two other cost treatments it is above it. The average stopping time is significantly different from the theoretical one in each treatment (p<0.001 for a Wilcoxon signed-rank test in $0.10 and $1 treatments, p=0.0085 in the $0.50 treatment).


  • People are drawn to the easy and to the easiest side of the easy. But it is clear that we must hold ourselves to the difficult, as it is true for everything alive. Everything in nature grows and defends itself in its own way and against all opposition, straining from within and at any price, to become distinctively itself. It is good to be solitary, because solitude is difficult, and that a thing is difficult must be even more of a reason for us to undertake it.


  • Read whatever I finished writing on Friday and edit/fix – done!

Phil 6.25.21


  • Went up to Princeton yesterday to have lunch with Barna Donovan to discuss conspiracy theories and GPT-3. Fun!
  • 2:00 Meeting with Michelle. Need to move all the content over to Overleaf


  • Writing – done with the first draft!


  • Ingest the DB on the ML box
  • Ping David Mclure about GPT-3 Mapping? He used AllenNLP for this, which looks like it’s worth a deep dive.

Phil 6.23.21

Drop truck off for service – done

Look! A map with beliefs!


  • Doing a readthrough
  • Add some content on modern maps – Done!

GPT Agents

  • Changing the primary key to row_id. Slow! Done!


  • 10:00 Meeting – all kinds of IT-related Zoom and Google Meet problems
  • Working on final report. Working on folding the first two reports into the methods/results narrative

Phil 6.22.21

Back from a long weekend off with some interesting adventures. And I managed to break the RV again. Need to contact Jim Donnie’s today and get something scheduled. Done! Drop off tomorrow.

Kats is a toolkit to analyze time series data, a lightweight, easy-to-use, and generalizable framework to perform time series analysis. Time series analysis is an essential component of Data Science and Engineering work at industry, from understanding the key statistics and characteristics, detecting regressions and anomalies, to forecasting future trends. Kats aims to provide the one-stop shop for time series analysis, including detection, forecasting, feature extraction/embedding, multivariate analysis, etc. Kats is released by Facebook’s Infrastructure Data Science team. It is available for download on PyPI.

CutPaste: Self-Supervised Learning for Anomaly Detection and Localization

  • We aim at constructing a high performance model for defect detection that detects unknown anomalous patterns of an image without anomalous data. To this end, we propose a two-stage framework for building anomaly detectors using normal training data only. We first learn self-supervised deep representations and then build a generative one-class classifier on learned representations. We learn representations by classifying normal data from the CutPaste, a simple data augmentation strategy that cuts an image patch and pastes at a random location of a large image. Our empirical study on MVTec anomaly detection dataset demonstrates the proposed algorithm is general to be able to detect various types of real-world defects. We bring the improvement upon previous arts by 3.1 AUCs when learning representations from scratch. By transfer learning on pretrained representations on ImageNet, we achieve a new state-of-the-art 96.6 AUC. Lastly, we extend the framework to learn and extract representations from patches to allow localizing defective areas without annotations during training.
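
The cut-and-paste augmentation described in the abstract can be illustrated with a toy sketch. This is only a simplified reading of that one sentence, not the paper's implementation: the image is a plain nested list, the patch size is fixed by the caller, and the function name `cutpaste` and its parameters are hypothetical. Details from the paper beyond "cut a patch, paste it at a random location" are not reproduced here.

```python
import random


def cutpaste(image, patch_h, patch_w, rng):
    """Return a copy of `image` with one rectangular patch copied to a random spot.

    `image` is a list of equal-length rows. The source and destination
    positions are drawn independently, so they may overlap.
    """
    h, w = len(image), len(image[0])
    if not (0 < patch_h <= h and 0 < patch_w <= w):
        raise ValueError("patch must fit inside the image")

    # Top-left corners of the source patch and of the paste location.
    src_y, src_x = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    dst_y, dst_x = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)

    out = [row[:] for row in image]  # leave the input untouched
    for dy in range(patch_h):
        for dx in range(patch_w):
            # Read from the original so overlapping regions copy cleanly.
            out[dst_y + dy][dst_x + dx] = image[src_y + dy][src_x + dx]
    return out


# Toy 4x4 "image" with distinct pixel values.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = cutpaste(image, patch_h=2, patch_w=2, rng=random.Random(0))
for row in augmented:
    print(row)
```

In the framework the abstract describes, a classifier would then be trained to tell untouched images from augmented ones like `augmented`; that training step is outside this sketch.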


Andrej Karpathy (Tesla): CVPR 2021 Workshop on Autonomous Vehicles


  • Starting to really dig into the phase 1 final report. Trying to make it useful as a source for a conference paper


  • Continue on the Diana section. Done?

GPT-2 Agents

  • The re-indexing is done! Need to change the primary key to row_id and generate some text
  • 3:00 Meeting

Phil 6.16.21

The BBC did a show on using outside technology rather than HR as a way of tracking harassment. The idea seems to be that you can report an incident, and if the app detects enough activity around one person (which can be anonymously reported), then the company can be contacted. This sounds a lot like a model for trustworthy anonymous citizen journalism. Need to look into it some more.

Also, I realize that this is in some ways coordination without (explicit) communication. It could also be a framework for a more updated form of collective action such as unions. The issue to solve with that would be the ability to negotiate with the union as a whole, rather than an individual negotiator.
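
The threshold idea can be sketched as a small escrow that holds reports until enough distinct people have reported the same person. This is a guess at the mechanism based on the description above, not how the app from the show works: the class name `ReportEscrow`, the opaque reporter tokens, and the threshold of 3 are all assumptions for illustration.

```python
from collections import defaultdict


class ReportEscrow:
    """Hold anonymous reports and flag a subject once enough distinct reporters agree."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # arbitrary illustration value
        self._reporters = defaultdict(set)  # subject -> set of reporter tokens

    def report(self, reporter_token, subject):
        """Record a report; return True if the subject has reached the threshold.

        Tokens are stored in a set, so repeated reports from the same
        reporter count once.
        """
        self._reporters[subject].add(reporter_token)
        return len(self._reporters[subject]) >= self.threshold


escrow = ReportEscrow(threshold=3)
results = [
    escrow.report("token-a", "person-x"),
    escrow.report("token-a", "person-x"),  # duplicate reporter, still one voice
    escrow.report("token-b", "person-x"),
    escrow.report("token-c", "person-x"),  # third distinct reporter
]
print(results)  # [False, False, False, True]
```

Nobody coordinates directly here. Each reporter only talks to the escrow, and the escalation happens when independent reports line up, which is the "coordination without (explicit) communication" part.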

3pm – 3:40pm meeting with Upendra


  • 7:00 Waikato meeting


  • Abstract for 007 – done
  • Abstract for 008 – done
  • Cyber training – done. Painful!

Phil 6.15.21

I saw millions compromise their Facebook accounts to fuel fake engagement

  • During my time at Facebook, I saw compromised accounts functioning in droves in Latin America, Asia, and elsewhere. Most of these accounts were commandeered through autolikers: online programs which promise users automatic likes and other engagement for their posts. Signing up for the autoliker, however, requires the user to hand over account access. Then, these accounts join a bot farm, where their likes and comments are delivered to other autoliker users, or sold en masse, even while the original user maintains ownership of the account. Although motivated by money rather than politics — and far less sophisticated than government-run human troll farms — the sheer quantity of these autoliker programs can be dangerous.


  • Princess Di and others!
  • Maybe a postscript?


  • First drafts of 008 and 007 are done! Time to clean up and edit. Done. Just need to do an abstract for each


  • At 7.5M reviews indexed
  • Need to bow out of meeting today, but maybe send a copy of the article draft?

Phil 6.14.21

Childhood cross-ethnic exposure predicts political behavior seven decades later: Evidence from linked administrative data

  • Does contact across social groups influence sociopolitical behavior? This question is among the most studied in the social sciences with deep implications for the harmony of diverse societies. Yet, despite a voluminous body of scholarship, evidence around this question is limited to cross-sectional surveys that only measure short-term consequences of contact or to panel surveys with small samples covering short time periods. Using advances in machine learning that enable large-scale linkages across datasets, we examine the long-term determinants of sociopolitical behavior through an unprecedented individual-level analysis linking contemporary political records to the 1940 U.S. Census. These linked data allow us to measure the exact residential context of nearly every person in the United States in 1940 and, for men, connect this with the political behavior of those still alive over 70 years later. We find that, among white Americans, early-life exposure to black neighbors predicts Democratic partisanship over 70 years later.
  • Diversity injection works!


  • Try to get a pass through 007
  • Add Peter’s bits to 008


  • 4:30 Meeting with Andreea

Phil 6.11.21


  • More writing – send current state to Michelle this morning!
  • 2:00 Meeting


  • More writing
    • Done with the first draft of 008!

Phil 6.10.21

Social Cooling is a name for the long-term negative side effects of living in a reputation economy

Big Sleep (Github) – Ryan Murdock has done it again, combining OpenAI’s CLIP and the generator from a BigGAN! This repository wraps up his work so it is easily accessible to anyone who owns a GPU.

Brain rhythms that help us to detect borders

  • Oscillations in neuronal activity in the medial temporal lobe of the human brain encode proximity to boundaries such as walls, both when navigating while walking and when watching another person do so.
  • Lots of very interesting projects here, too


  • More writing


  • 9:15 standup
  • cybersecurity training
  • More writing

Phil 6.9.21

Thinking ahead: spontaneous prediction in context as a keystone of language in humans and machines

  • Departing from traditional linguistic models, advances in deep learning have resulted in a new type of predictive (autoregressive) deep language models (DLMs). These models are trained to generate appropriate linguistic responses in a given context using a self-supervised prediction task. We provide empirical evidence that the human brain and autoregressive DLMs share two computational principles: 1) both are engaged in continuous prediction; 2) both represent words as a function of the previous context. Behaviorally, we demonstrate a match between humans and DLM’s next-word predictions given sufficient contextual windows during the processing of a real-life narrative. Neurally, we demonstrate that the brain, like autoregressive DLMs, constantly predicts upcoming words in natural speech, hundreds of milliseconds before they are perceived. Finally, we show that DLM’s contextual embeddings capture the neural representation of context-specific word meaning better than arbitrary or static semantic embeddings. Our findings suggest that autoregressive DLMs provide a novel and biologically feasible computational framework for studying the neural basis of language.


  • Even though I have Windows updates turned off, it seems that MS rebooted my machine last night. Figuring out where the updates stopped so that I can pick up in a reasonable way. Currently,
select count(*) from table_review where row_id is not null;
  • has taken 25 minutes to return. Grrr. And I need a new valve in the shower. Grrr!
  • Update – it took 89 minutes and 32 seconds. There are 3,954,779 values set
    • Back to adding row numbers. Had to figure out where in the table we were, but an hour of coding beats the hell out of a few days of redundant inserts!
  • Pinged Antonio about meeting at 10:00 on Friday
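
For reference, the "find where we were and pick up from there" step might look something like the sketch below. It is not the code from that hour of work. It uses Python's built-in `sqlite3` with an in-memory table so that it runs anywhere, while the real database engine is not stated in these notes. The `review_id` key column, the batch size, and the ten-row demo data are assumptions; only `table_review` and `row_id` come from the query above.

```python
import sqlite3


def resume_row_ids(conn, batch_size=1000):
    """Assign row_id to rows that still lack one, continuing after the highest existing value."""
    cur = conn.cursor()
    cur.execute("SELECT COALESCE(MAX(row_id), 0) FROM table_review")
    next_id = cur.fetchone()[0] + 1
    while True:
        # Fetch the next batch of rows that have not been numbered yet.
        cur.execute(
            "SELECT review_id FROM table_review "
            "WHERE row_id IS NULL ORDER BY review_id LIMIT ?",
            (batch_size,),
        )
        keys = [row[0] for row in cur.fetchall()]
        if not keys:
            break
        cur.executemany(
            "UPDATE table_review SET row_id = ? WHERE review_id = ?",
            [(next_id + i, key) for i, key in enumerate(keys)],
        )
        conn.commit()  # commit per batch so an interruption loses at most one batch
        next_id += len(keys)
    return next_id - 1  # highest row_id now in use


# Demo: ten rows, the first four already numbered before the "reboot".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_review (review_id TEXT PRIMARY KEY, row_id INTEGER)")
conn.executemany(
    "INSERT INTO table_review VALUES (?, ?)",
    [(f"r{i:02d}", i if i <= 4 else None) for i in range(1, 11)],
)
conn.commit()

last_id = resume_row_ids(conn, batch_size=3)
remaining = conn.execute(
    "SELECT COUNT(*) FROM table_review WHERE row_id IS NULL"
).fetchone()[0]
print(last_id, remaining)  # 10 0
```

Without an index on `row_id`, both the `MAX` lookup and the `IS NULL` filter are likely to scan the table, which may be part of why the count took so long. Whether adding an index is worth the build time on a table this size is a separate question.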


  • More Writing


  • More writing

Phil 6.8.21

Re-ran my COVID states model, basically to see how Georgia is doing. It’s still a mess.
  • Manim is an engine for precise programmatic animations, designed for creating explanatory math videos.

Computer-Assisted Keyword and Document Set Discovery from Unstructured Text

  • The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depend in practice on the inadequate keyword counting or matching methods they are designed to replace. Improved methods of keyword selection would also be valuable in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings, Chinese social media posts designed to evade censorship, among others.


  • More writing


  • More writing