Welcome to this ‘101 course’ on getting started with academic research using the Twitter API. The objective of this course is to help academic researchers learn how to get Twitter data using the new Twitter API v2.
Science rarely proceeds beyond what scientists can observe and measure, and sometimes what can be observed proceeds far ahead of scientific understanding. The twenty-first century offers such a moment in the study of human societies. A vastly larger share of behaviours is observed today than would have been imaginable at the close of the twentieth century. Our interpersonal communication, our movements and many of our everyday actions are all potentially accessible for scientific research; sometimes through purposive instrumentation for scientific objectives (for example, satellite imagery), but far more often these objectives are, literally, an afterthought (for example, Twitter data streams). Here we evaluate the potential of this massive instrumentation (the creation of techniques for the structured representation and quantification of human behaviour) through the lens of scientific measurement and its principles. In particular, we focus on the question of how we extract scientific meaning from data that often were not created for such purposes. These data present conceptual, computational and ethical challenges that require a rejuvenation of our scientific theories to keep up with the rapidly changing social realities and our capacities to capture them. We require, in other words, new approaches to manage, use and analyze data.
The first (and perhaps foremost) of my concerns is the impact that the perturbation of our social dynamics may have on our collective cognitive abilities. In Cognitive Democracy, Henry Farrell and Cosma Shalizi make the (credible) case that democracy is intrinsically better at solving complex problems (of the kind that have rugged solution landscapes) than markets or hierarchies/bureaucracies.
I am far more concerned with how uniform these algorithms are across huge populations. The underlying insight that explains why diverse groups are better at complex problems is that a diverse set of intellectual tools and viewpoints will be better at finding solutions on a rugged landscape. In mediating so much of humankind’s discovery through the tiny funnel of a handful of systems, we are creating an unprecedented impoverishment of our intellectual toolbox. I am far less concerned about filter bubbles than I am about turning a complex, likely scale-free network of discovery into a fully-mediated hub-and-spokes structure in which everything flows through a system of very limited variety.
Roll in Clay’s changes today!
Start putting all the chapters in the same place. Take out all the placeholders and let’s see what we have
I had a pretty wild dream last night. I was working at Google building physical neural networks. I think we were precipitating them out of a metallic semiconductor solution. My sense is that it was something where the input buffer was the cathode and the output was the anode. The finished systems were placed in mineral oil tanks, so they were basically artificial brains in a bucket. They looked something like this:
Netron is a viewer for neural network, deep learning and machine learning models.
Sentence Transformers is a framework for sentence, paragraph and image embeddings. It makes it possible to derive semantically meaningful embeddings, which are useful for applications such as semantic search or multi-lingual zero shot classification. As part of the Sentence Transformers v2 release, there are a lot of cool new features:
Start enjoying those backcountry roads with the OHV 3″ lift kit engineered specifically for the RAM ProMaster chassis. Out of the factory, the van is naturally lower on the front end; a 3″ lift on the front axle and a 2.25″ lift in the rear help level out the vehicle. In addition to greater clearance, the lift kit also increases protection for any gear you may have mounted underneath. The ProMaster lift kit is truly designed to let you go anywhere.
Create a view for reviews and businesses. Done
Search for types and start pulling out reviews + stars. Done. Here’s the estimate of the number of rows based on the number of rows it takes to get to 100 samples:
The same info as a chart:
Once I figure that out start making training corpora. I think I’ll stick to those cuisines that have more than 100k estimated reviews – code is done, running the queries and creating the test/train corpora. I’m adding useful_votes, funny_votes, and cool_votes for some more ground truth numbers to look at. The format should work for Excel too, so the stats can be computed from there
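A minimal sketch of the estimate-and-filter step, assuming the estimate simply scales the table size by the hit rate seen while scanning for the first 100 samples. The function names, the cuisine names and every number below are hypothetical placeholders, not values from the actual tables.

```python
def estimate_matching_rows(total_rows: int, rows_scanned: int, samples_found: int = 100) -> int:
    """Scale the hit rate seen in the scanned rows up to the whole table."""
    if rows_scanned <= 0:
        raise ValueError("rows_scanned must be positive")
    return round(total_rows * samples_found / rows_scanned)


def select_cuisines(rows_to_100: dict[str, int], total_rows: int,
                    min_reviews: int = 100_000) -> dict[str, int]:
    """Keep only cuisines whose estimated review count exceeds min_reviews."""
    estimates = {c: estimate_matching_rows(total_rows, n) for c, n in rows_to_100.items()}
    return {c: est for c, est in estimates.items() if est > min_reviews}


# Hypothetical inputs: rows scanned before 100 matching reviews were found.
rows_to_100 = {"mexican": 2_000, "italian": 4_000, "ethiopian": 50_000}
selected = select_cuisines(rows_to_100, total_rows=7_500_000)
print(selected)
```

The estimate is only as good as the assumption that matching reviews are spread evenly through the table; if the first rows scanned are unrepresentative, the numbers will be off.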
We investigate how people make choices when they are unsure about the value of the options they face and have to decide whether to choose now or wait and acquire more information first. In an experiment, we find that participants deviate from optimal information acquisition in a systematic manner. They acquire too much information (when they should only collect little) or not enough (when they should collect a lot). We show that this pattern can be explained as naturally emerging from Fechner cognitive errors. Over time participants tend to learn to approximate the optimal strategy when information is relatively costly.
Overall, participants make their decisions too quickly (sample too little information) when information is relatively cheap. Inversely, they hesitate too long (sample too much information) when information is relatively expensive. They stop after approximately 9 draws in the $0.10 treatment, 7 draws in the $0.50 treatment and 4 draws in the $1 treatment. In the lower cost treatment, this average is below the theoretical prediction, and in the two other cost treatments it is above it. The average stopping time is significantly different from the theoretical one in each treatment (p<0.001 for a Wilcoxon signed-rank test in $0.10 and $1 treatments, p=0.0085 in the $0.50 treatment).
People are drawn to the easy and to the easiest side of the easy. But it is clear that we must hold ourselves to the difficult, as it is true for everything alive. Everything in nature grows and defends itself in its own way and against all opposition, straining from within and at any price, to become distinctively itself. It is good to be solitary, because solitude is difficult, and that a thing is difficult must be even more of a reason for us to undertake it.
Back from a long weekend off with some interesting adventures. And I managed to break the RV again. Need to contact Jim Donnie’s today and get something scheduled. Done! Drop off tomorrow.
Kats is a toolkit to analyze time series data, a lightweight, easy-to-use, and generalizable framework to perform time series analysis. Time series analysis is an essential component of Data Science and Engineering work in industry, from understanding the key statistics and characteristics, detecting regressions and anomalies, to forecasting future trends. Kats aims to provide the one-stop shop for time series analysis, including detection, forecasting, feature extraction/embedding, multivariate analysis, etc. Kats is released by Facebook’s Infrastructure Data Science team. It is available for download on PyPI.
We aim at constructing a high performance model for defect detection that detects unknown anomalous patterns of an image without anomalous data. To this end, we propose a two-stage framework for building anomaly detectors using normal training data only. We first learn self-supervised deep representations and then build a generative one-class classifier on learned representations. We learn representations by classifying normal data from the CutPaste, a simple data augmentation strategy that cuts an image patch and pastes at a random location of a large image. Our empirical study on the MVTec anomaly detection dataset demonstrates that the proposed algorithm is general enough to detect various types of real-world defects. We bring the improvement upon previous arts by 3.1 AUCs when learning representations from scratch. By transfer learning on pretrained representations on ImageNet, we achieve a new state-of-the-art 96.6 AUC. Lastly, we extend the framework to learn and extract representations from patches to allow localizing defective areas without annotations during training.
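A minimal sketch of the cut-and-paste step as the abstract describes it: copy a patch from one place in an image and paste it at a random location. This is an illustration only, not the authors' implementation. The function name `cutpaste`, the nested-list image representation and the patch sizes are assumptions, and any further details of the published augmentation are not covered.

```python
import random


def cutpaste(image, patch_h, patch_w, rng):
    """Return a copy of image with one patch copied to a random location.

    image is a list of rows (lists of pixel values). The source and
    destination top-left corners are returned so the result can be checked.
    """
    height, width = len(image), len(image[0])
    if not (0 < patch_h <= height and 0 < patch_w <= width):
        raise ValueError("patch must fit inside the image")
    # Pick where to cut from and where to paste to.
    src_r = rng.randint(0, height - patch_h)
    src_c = rng.randint(0, width - patch_w)
    dst_r = rng.randint(0, height - patch_h)
    dst_c = rng.randint(0, width - patch_w)
    # Cut: copy the patch out of the original image.
    patch = [row[src_c:src_c + patch_w] for row in image[src_r:src_r + patch_h]]
    # Paste: write the patch into a copy, leaving the input untouched.
    augmented = [list(row) for row in image]
    for dr, patch_row in enumerate(patch):
        augmented[dst_r + dr][dst_c:dst_c + patch_w] = patch_row
    return augmented, (src_r, src_c), (dst_r, dst_c)


# Toy 8x8 "image" whose pixel values encode their own position.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
augmented, src, dst = cutpaste(image, patch_h=2, patch_w=3, rng=random.Random(0))
```

For the self-supervised task, the unmodified images and their augmented copies would be given different labels so that a classifier learns to tell them apart.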
Also, I realize that this is in some ways coordination without (explicit) communication. It could also be a framework for a more updated form of collective action such as unions. The issue to solve with that would be the ability to negotiate with the union as a whole, rather than an individual negotiator.
During my time at Facebook, I saw compromised accounts functioning in droves in Latin America, Asia, and elsewhere. Most of these accounts were commandeered through autolikers: online programs which promise users automatic likes and other engagement for their posts. Signing up for the autoliker, however, requires the user to hand over account access. Then, these accounts join a bot farm, where their likes and comments are delivered to other autoliker users, or sold en masse, even while the original user maintains ownership of the account. Although motivated by money rather than politics — and far less sophisticated than government-run human troll farms — the sheer quantity of these autoliker programs can be dangerous.
Princess Di and others!
Maybe a postscript?
First drafts of 008 and 007 are done! Time to clean up and edit. Done. Just need to do an abstract for each.
At 7.5M reviews indexed
Need to bow out of meeting today, but maybe send a copy of the article draft?
Does contact across social groups influence sociopolitical behavior? This question is among the most studied in the social sciences with deep implications for the harmony of diverse societies. Yet, despite a voluminous body of scholarship, evidence around this question is limited to cross-sectional surveys that only measure short-term consequences of contact or to panel surveys with small samples covering short time periods. Using advances in machine learning that enable large-scale linkages across datasets, we examine the long-term determinants of sociopolitical behavior through an unprecedented individual-level analysis linking contemporary political records to the 1940 U.S. Census. These linked data allow us to measure the exact residential context of nearly every person in the United States in 1940 and, for men, connect this with the political behavior of those still alive over 70 years later. We find that, among white Americans, early-life exposure to black neighbors predicts Democratic partisanship over 70 years later.
Oscillations in neuronal activity in the medial temporal lobe of the human brain encode proximity to boundaries such as walls, both when navigating while walking and when watching another person do so.
Departing from traditional linguistic models, advances in deep learning have resulted in a new type of predictive (autoregressive) deep language models (DLMs). These models are trained to generate appropriate linguistic responses in a given context using a self-supervised prediction task. We provide empirical evidence that the human brain and autoregressive DLMs share two computational principles: 1) both are engaged in continuous prediction; 2) both represent words as a function of the previous context. Behaviorally, we demonstrate a match between humans and DLM’s next-word predictions given sufficient contextual windows during the processing of a real-life narrative. Neurally, we demonstrate that the brain, like autoregressive DLMs, constantly predicts upcoming words in natural speech, hundreds of milliseconds before they are perceived. Finally, we show that DLM’s contextual embeddings capture the neural representation of context-specific word meaning better than arbitrary or static semantic embeddings. Our findings suggest that autoregressive DLMs provide a novel and biologically feasible computational framework for studying the neural basis of language.
Even though I have Windows updates turned off, it seems that MS rebooted my machine last night. Figuring out where the updates stopped so that I can pick up in a reasonable way. Currently,
select count(*) from table_review where row_id is not null;
has taken 25 minutes to return. Grrr. And I need a new valve in the shower. Grrr!
Update – it took 89 minutes and 32 seconds. There are 3,954,779 values set
Back to adding row numbers. Had to figure out where in the table we were, but an hour of coding beats the hell out of a few days of redundant inserts!
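A minimal sketch of resuming the row numbering after an interruption, assuming the numbering continues from the highest row_id already set. It uses an in-memory SQLite database as a stand-in for the real one. Only table_review and row_id come from the query above; the review_key column, the batch size and the sample rows are hypothetical.

```python
import sqlite3


def resume_row_numbering(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Assign row_id to rows that lack one, continuing from the highest value set."""
    cur = conn.cursor()
    # Find where the previous run stopped.
    (last_id,) = cur.execute(
        "SELECT COALESCE(MAX(row_id), 0) FROM table_review"
    ).fetchone()
    next_id = last_id + 1
    while True:
        keys = cur.execute(
            "SELECT review_key FROM table_review "
            "WHERE row_id IS NULL ORDER BY review_key LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not keys:
            break
        updates = [(next_id + i, key) for i, (key,) in enumerate(keys)]
        cur.executemany(
            "UPDATE table_review SET row_id = ? WHERE review_key = ?", updates
        )
        conn.commit()  # commit per batch so an interruption loses at most one batch
        next_id += len(keys)
    return next_id - 1


# Stand-in data: ten reviews, the first four numbered before the "reboot".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_review (review_key TEXT PRIMARY KEY, row_id INTEGER)")
conn.executemany(
    "INSERT INTO table_review VALUES (?, ?)",
    [(f"r{i:02d}", i if i <= 4 else None) for i in range(1, 11)],
)
highest = resume_row_numbering(conn, batch_size=3)
print(highest)
```

Committing after each batch means a later interruption only requires rerunning the same function rather than working out by hand where the table was left.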
The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depends in practice on the inadequate keyword counting or matching methods they are designed to replace. Improved methods of keyword selection would also be valuable in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings and Chinese social media posts designed to evade censorship, among others.