Author Archives: pgfeldman

Phil 7.31.21

I had some perplexing runs trying different epoch counts on corpora recently. Here’s the 6k:

As you can see, only the 32 epoch can figure out there are no 0 star ratings, though the best overall match is the 16 epoch.

The 100k is weirder, and makes more sense when you look at the raw data:

As you can see, the 16 and 32 epoch miss the two star rating entirely, though aside(!) from that the fit isn’t bad:

Looking at this, I’d say that 2-4 epochs seem to work well, and when I take out the explicit epoch argument, run_clm uses 3 epochs, which is the Transformers default:

[INFO|] 2021-07-31 06:31:02,447 >> ***** Running training *****
[INFO|] 2021-07-31 06:31:02,447 >>   Num examples = 1649
[INFO|] 2021-07-31 06:31:02,448 >>   Num Epochs = 3
[INFO|] 2021-07-31 06:31:02,448 >>   Instantaneous batch size per device = 1
[INFO|] 2021-07-31 06:31:02,449 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|] 2021-07-31 06:31:02,449 >>   Gradient Accumulation steps = 1
[INFO|] 2021-07-31 06:31:02,449 >>   Total optimization steps = 4947
{'loss': 0.1737, 'learning_rate': 4.494643218111988e-05, 'epoch': 0.3}
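The numbers in this log can be checked by hand. Below is a minimal sketch, assuming the run used the Transformers defaults that I did not override: a base learning rate of 5e-5, linear decay with no warmup, and logging every 500 steps. The function names are made up for illustration and are not part of run_clm.

```python
import math

def total_optimization_steps(num_examples, num_epochs, batch_size=1, grad_accum=1):
    """Number of optimizer updates for a run (single device assumed)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * num_epochs

def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Learning rate after `step` updates under linear decay to zero.
    base_lr=5e-5 and zero warmup are assumed defaults, not values from the log."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

steps = total_optimization_steps(num_examples=1649, num_epochs=3)
print(steps)                        # 4947, matching "Total optimization steps"
print(linear_decay_lr(500, steps))  # close to the logged learning_rate
print(round(500 / 1649, 2))         # fraction of the first epoch at step 500
```

Under those assumptions, the first logged line corresponds to step 500, which is why it reports epoch 0.3 and a learning rate a little under 4.5e-5.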

I’m going to try an ensemble of these 3-epoch models to see what that looks like.

Phil 7.30.21

Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization

  • The successes of deep learning critically rely on the ability of neural networks to output meaningful predictions on unseen data — generalization. Yet despite its criticality, there remain fundamental open questions on how neural networks generalize. How much do neural networks rely on memorization — seeing highly similar training examples — and how much are they capable of human-intelligence styled reasoning — identifying abstract rules underlying the data? In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explore the limits of neural network generalization. While PVR tasks can consist of visual as well as symbolic inputs, each with varying levels of difficulty, they all have a simple underlying rule. One part of the PVR task input acts as a pointer, giving the location of a different part of the input, which forms the value (and output). We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture. The interaction of position, values and the pointer rule also allow the development of nuanced tests of generalization, by introducing distribution shift and increasing functional complexity. These reveal both subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

  • This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub “prompt-based learning”. Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks, the original input x is modified using a template into a textual string prompt x’ that has some unfilled slots, and then the language model is used to probabilistically fill the unfilled information to obtain a final string x̂, from which the final output y can be derived. This framework is powerful and attractive for a number of reasons: it allows the language model to be pre-trained on massive amounts of raw text, and by defining a new prompting function the model is able to perform few-shot or even zero-shot learning, adapting to new scenarios with few or no labeled data. In this paper we introduce the basics of this promising paradigm, describe a unified set of mathematical notations that can cover a wide variety of existing work, and organize existing work along several dimensions, e.g., the choice of pre-trained models, prompts, and tuning strategies. To make the field more accessible to interested beginners, we not only make a systematic review of existing works and a highly structured typology of prompt-based concepts, but also release other resources, e.g., a website including constantly-updated survey, and paperlist.

GPT Agents

  • This looks interesting for paragraph clustering: Sentence Transformers in the Hugging Face Hub
    • Sentence Transformers is a framework for sentence, paragraph and image embeddings. This makes it possible to derive semantically meaningful embeddings, which is useful for applications such as semantic search or multi-lingual zero shot classification. As part of Sentence Transformers v2 release, there are a lot of cool new features:
      • Sharing your models in the Hub easily.
      • Widgets and Inference API for sentence embeddings and sentence similarity.
      • Better sentence-embeddings models available (benchmark and models in the Hub).
  • Finished the 6k epoch tests yesterday. Maybe finish creating models for 100k today?


  • Abstract! Done!
  • See how Rukan is doing. Tell him about cubes and other shapes – done. Looks good


  • Meeting with Michelle at 2:00. Worked on the positioning statement. It’s almost ready to go to an agent!

4:00 NLP Meeting – cancelled

Phil 7.29.21

When the Echo Chamber Shatters: Examining the Use of Community-Specific Language Post-Subreddit Ban

  • Community-level bans are a common tool against groups that enable online harassment and harmful speech. Unfortunately, the efficacy of community bans has only been partially studied and with mixed results. Here, we provide a flexible unsupervised methodology to identify in-group language and track user activity on Reddit both before and after the ban of a community (subreddit). We use a simple word frequency divergence to identify uncommon words overrepresented in a given community, not as a proxy for harmful speech but as a linguistic signature of the community. We apply our method to 15 banned subreddits, and find that community response is heterogeneous between subreddits and between users of a subreddit. Top users were more likely to become less active overall, while random users often reduced use of in-group language without decreasing activity. Finally, we find some evidence that the effectiveness of bans aligns with the content of a community. Users of dark humor communities were largely unaffected by bans while users of communities organized around white supremacy and fascism were the most affected. Altogether, our results show that bans do not affect all groups or users equally, and pave the way to understanding the effect of bans across communities.


  • 9:15 standup – nope
  • Write abstract!

GPT Agents

  • Creating spreadsheets to see how the different epochs affect accuracy in addition to format.
  • Running all the models (1 – 32 epochs) and putting the results in the database

4:30 NLP Meeting

Add this to the Saturday ride calendar

Phil 7.28.21

How epidemic psychology works on Twitter

  • During 2020, based on changes on the use of language on Twitter, three distinct phases were identified. The first was the refusal phase: people in the US refused to accept reality despite the increasing numbers of deaths in other countries. The second was the anger phase, started after the announcement of the first death in the country: people’s fear translated into anger about the looming feeling that things were about to change. The third phase was the acceptance phase, started after the authorities imposed physical-distancing measures: people found a “new normal” for their daily activities. During the year, as cases surged in waves, so did anger, re-emerging cyclically at each wave. These results suggest the concrete future possibility of embedding epidemic psychology derived from the use of language on social media into more traditional epidemiological models.

Research note: Examining potential bias in large-scale censored data

  • We examine potential bias in Facebook’s 10-trillion cell URLs dataset, consisting of URLs shared on its platform and their engagement metrics. Despite the unprecedented size of the dataset, it was altered to protect user privacy in two ways: 1) by adding differentially private noise to engagement counts, and 2) by censoring the data with a 100-public-share threshold for a URL’s inclusion. To understand how these alterations affect conclusions drawn from the data, we estimate the prevalence of fake news in the massive, censored URLs dataset and compare it to an estimate from a smaller, representative dataset. We show that censoring can substantially alter conclusions that are drawn from the Facebook dataset. Because of this 100-public-share threshold, descriptive statistics from the Facebook URLs dataset overestimate the share of fake news and news overall by as much as 4X. We conclude with more general implications for censoring data.


  • SMD: Register for conference, book hotel and flight – Done. That took hours! There are few flights to Huntsville, and I had to use Delta, which doesn’t integrate with car rental. A lot of hotels were booked, too.
  • LAIC: Finish writeup. Had a meeting with Aaron about what was needed, and then we worked on his SPIE presentation
  • IRAD: Coordinate with Rukan. We reworked the JSON file so that it is human readable. Rukan will use this as a basis for rendering the targets and platforms, as well as running the scenarios. Once that’s working we’ll add ordnance.

GPT Agents

  • Set up finetuning folder. See if it works with the current setup and upgrade PyTorch, TF, and Transformers as needed. Could not get that to work, so I went back to the run_clm CLI approach. I rebased the transformers project, and found that there are now TF and PyTorch versions. I’m using PyTorch for this.
  • Try training using the 6k corpus for fixed epochs.
  • Built models for 1, 2, 4, 8, 16, and 32 epochs. You can see the formatting results improve until 16 epochs. I need to build spreadsheets that show the values and compare them to the ground truth to see if this is a better approximation. Then move on to larger corpora.
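The comparison against ground truth could also be done in a few lines of code instead of a spreadsheet. This is only a sketch of one way to do it: it turns each list of star ratings into a distribution and scores the mismatch with total variation distance, which is my choice of metric here and not necessarily what the spreadsheets use. The function names and the sample ratings are made up.

```python
from collections import Counter

def star_distribution(ratings, stars=range(0, 6)):
    """Fraction of ratings at each star value. 0 is included because
    some models generate 0-star reviews that the real data never has."""
    counts = Counter(ratings)
    total = sum(counts[s] for s in stars)
    return {s: (counts[s] / total if total else 0.0) for s in stars}

def total_variation(p, q):
    """Half the summed absolute difference: 0.0 is identical, 1.0 is disjoint."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)

# Hypothetical ratings, for illustration only
ground_truth = [1, 2, 3, 4, 5, 5, 5, 4]
model_output = [0, 1, 3, 4, 5, 5, 5, 5]

error = total_variation(star_distribution(ground_truth),
                        star_distribution(model_output))
print(error)  # 0.25
```

Running this once per epoch count would give a single error number per model, which makes it easy to see where extra epochs stop helping.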


  • 7:00 Meeting

Phil 7.27.21

In a fake battle for Taiwan, U.S. forces lost network access almost immediately.

Curation Bubbles: Domain Versus URL Level Analysis of Partisan News Sharing on Social Media

  • Empirical inquiries of political news consumption are typically based on analysis at the level of the news source: a given web domain can be assigned a partisanship score reflective of its relative tendency to be shared by Democrats or Republicans. This practical, tractable approach represents an important methodological advance which has allowed for large-scale empirical studies of how democratic citizens consume political information online. However, despite strong evidence that information sharing is dominated by in-group bias, previous work has also found that most users are exposed to information from a balanced variety of mainstream sources. Such conflicting findings around filter bubbles and echo chambers highlights the need to be able to estimate partisanship at the more fine-grained level of individual stories. It may be that individuals tend to consume politically homogeneous content which originates from a relatively heterogeneous collection of sources. Rather than never sharing stories associated with their political opponents, partisans may selectively share out-group content precisely when that information is favorable to them. Using a panel of 1.6 million Twitter users linked to administrative data, we test this dynamic by examining within-domain sharing patterns by user partisanship over time. Consistent with previous work, we find that, in aggregate, partisans do consume news from a variety of sources. However, we find notable story-level differences suggesting that, despite the heterogeneity of sources, the news curated from partisan’s social networks contains politically homogeneous information. Our findings suggest that domain-level analyses of information sharing gives a false impression of exposure to politically diverse content, and raises new concerns regarding polarization in the consumption and sharing of digital media
  • This really fits with my experience, where Fox News viewers share links to the NYTimes that are mentioned on Fox, often without reading them


  • Add 200k data to rollup spreadsheet
  • Here’s the 200k added to the stars counts for each model vs the Yelp 75k ground truth
  • It seems to be better at the lower star counts, but worse at 5. To make sure this wasn’t an artifact of the training data, here’s a measure of the error vs the specific data used to create the training corpora:
  • 3:30 Meeting
    • Have the same size corpora (100k) and the same number of training steps, and prompt with “stars = “.
    • Fine-tuning a pretrained model: In this tutorial, we will show you how to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly trained using Keras and the fit method. In PyTorch, there is no generic training loop so the 🤗 Transformers library provides an API with the class Trainer to let you fine-tune or train a model from scratch easily. Then we will show you how to alternatively write the whole training loop in PyTorch.


  • 9:15 Standup
  • More proposal?
  • Loop back to the simulator and Rukan
  • Start working on LAIC tasking splits for IP

Phil 7.26.21

There is this thing called PyTorch Lightning that supports scaling of your models and integrates with Grid.AI

Grid.AI is designed for developing and training deep learning models at scale.

  • Create a DATASTORE with your dataset.
  • Spin up an interactive SESSION to develop, analyze and prototype models/ideas.
  • When you have something that works, train it at scale via RUN.
  • Grid allocates all the machines and GPUs you need on demand, so you only pay for what you need when you need it.


  • Talk to Rukan about the simulator – brief discussion, then pulled back into the proposal
  • Work on slides – finished first draft


  • Good chat with Michelle on Friday. The proposal is coming along!


  • Finished the 200k Yelp run
  • 4:30 Meeting with Andreea. Nice discussion. We might be able to set up some research that involves understanding the use of metaphor by the GPT

Phil 7.23.21

Got my 40 hours in already this week, so no SBIR stuff

The Wikipedia article on Moral panic is great! It includes a link to The ‘Momo Challenge’ and the ‘Blue Whale Game’: Online Suicide Game Conspiracies

  • The Momo Challenge is a repackaged version of an older, nearly identical and largely debunked suicide game called the Blue Whale Game. In 2017 scary warnings circulated on social media asking parents, teachers, and police to beware of a hidden threat to children: a sinister online “game” that can lead to death.

GPT Agents

  • Create a 200k corpus and start training the model. Started training at 9:00am
  • Start analysis on stored data. Results!

Phil 7.22.21

Pushshift will soon provide API endpoints to do audio to text transcribing (at a combined aggregate rate of 10 minutes of audio per second). The upcoming API will also have convenience methods where you can provide a link to a youtube video and..

Welcome to OpenNRE’s documentation!


  • 9:15 IRAD Standup
  • Need to get started on the slides for the Huntsville talk
  • 10:30 LAIC followup
  • 5:30 – 6:00 Phase II meeting

GPT Agents

  • Validating the 100, 50, 25, 12, and 6 models
  • 100 and 50 look fine
  • 25 starts to have problems, so the parser is going to have to be a bit better:
  • 12 looks about as bad:
  • 6 may look a little worse? I’ll write the parser to handle this version:
  • This makes me think that 150k – 250k lines may be a sweet spot for best accuracy with the least data. It may actually be possible to figure out a regression that will tell us this
  • Starting on the parser. I’ll need to take the first of each. If there is no value, use zero.
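import re  # needed for the two patterns below
# A key, one or more separator characters, then an integer value, e.g. "stars = 5"
line_regex = re.compile(r"\w+[,.=(): ]+\d+")
key_val_regex = re.compile(r"\w+")  # pulls the key and the value out of a matched line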
  • Starting the run that creates synthetic entries for all the models – done
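Here is a rough sketch of how the parser rule above (take the first of each, use zero if there is no value) might look when built on those two patterns. The key names other than "stars" and the sample text are hypothetical, and the real generated text is messier than this.

```python
import re

# Same idea as the patterns above: key, separator(s), integer value
line_regex = re.compile(r"\w+[,.=(): ]+\d+")
key_val_regex = re.compile(r"\w+")

def parse_generated(text, keys):
    """Return {key: int} for the requested keys.
    Keeps the first value seen for each key; missing keys default to 0."""
    result = {k: 0 for k in keys}
    seen = set()
    for match in line_regex.finditer(text):
        tokens = key_val_regex.findall(match.group())
        key, value = tokens[0], tokens[-1]
        if key in result and key not in seen:
            result[key] = int(value)
            seen.add(key)
    return result

# Hypothetical model output with a repeated key and a missing one
sample = "stars = 5, votes: 3\nstars = 1"
print(parse_generated(sample, keys=("stars", "votes", "funny")))
# {'stars': 5, 'votes': 3, 'funny': 0}
```

Tracking seen keys separately from the defaults matters because a legitimate first value of 0 should not be overwritten by a later match.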

4:00 – 6:00 Arpita’s defense! Looks like she did a great job. It makes me wonder whether larger databases with features that indicate where data came from for multiple platforms could be better than separate smaller models for each platform. In the satellite anomaly detection systems I build, we typically train one model per vehicle. This work implies that it might be better to have one model for all vehicles with domain knowledge about each vehicle included in the feature vector

Phil 7.21.21

The Myth of Panic

  • …when cities in China, Europe, and finally the United States descended into lockdown, there was no mass panic. There was fear, yes, plenty of it—but that fear did not lead to irrational, hysterical, or violent group behavior. Our fear did not lead to looting, pogroms, or unrest. The fearful of Wuhan did not rise up in rebellion against the Communist Party; even when Italian doctors began rationing medical equipment and supplies, the fearful of Milan did not loot stores or disrupt the medical system; the fearful of New York did not duel each other to the death over toilet paper rolls.
  • I do think that panics do happen when dimensions are sufficiently reduced. We have examples of human stampedes in confined areas, as well as runaway conditions in constrained belief spaces like stock market bubbles and crashes. And lastly, there are examples of herding such as the Rwandan genocide and the current moral panic about Critical Race Theory (the most recent of many). So it is more complex. That being said, when dimensions are high, I think the article is exactly right.

GPT Agents

  • Quick meeting yesterday. We all agree that the results look good.
  • The 50k model is done. Training the 25k model now


  • A lot more writing. Need to get the proper charge number – done
  • At a first complete draft, except for section 4
  • LAIC meeting went well too. Money will get turned on shortly?

Phil 7.20.21

Only one more thing to push out the door and then things start to settle down?


  • Sprint planning, get stories in before 9:15 (DSR 628, 629)
  • 1:30 proposal technical approach meeting
    • Install IE to handle antique version of sharepoint
  • 3:00 LAIC followup
  • 4:00 Phase 2 Tagup

GPT Agents

  • Create corpora for 50k, 25k, 12.5k and 6k rows and start training models
    • Generated corpora
    • Training 50k model
  • Meeting at 3:30?

Phil 7.18.21

Ping Andreea that I can’t make tomorrow’s meeting

A real-world example of stampede vs. flock behavior?

There is also a more detailed exploration here. What I think is really important here is the idea that populations that would not normally be included in a stampede can be “pulled along” by the structure of the surrounding belief environment.

… the lower-left quadrant (counties which are deep blue politically but also have a low vaccination rate), since that would seem, on the surface, to go completely against my “Red/Blue = Vaccination Status” theme. What’s going on there? Shouldn’t these deep-blue counties have higher-than-average vaccination rates? … 62 of them are more than 40% Black (in fact, 55 of those are majority Black counties). Of the remaining 13 counties, 7 are majority Native American (over 80% of the population, in fact), while 1 in Texas (Zavala County) is “91.22% Hispanic or Latino of any race” according to Wikipedia.


  • Working on slide deck for a bit before I head home

Phil 7.15.21

Arpita’s defense is next week! Practice tomorrow


  • Chat yesterday with Aaron about the proposal. We still don’t have Peter’s contributions. Maybe Loren can do it?
  • Had a thought about the communication without coordination concept. The simulations can be compressed using something like run-length encoding at some specified quantization level. The compressed models can then be compared (maybe just by number of bytes?) to give an idea of how well they agree. There should be some level of granularity at which the representations (and hence the underlying models) diverge. That level should indicate how much coordinating data is needed, and what kind.
  • Working on slides this afternoon after the ride
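To make the run-length encoding thought a bit more concrete, here is a toy sketch. It quantizes two simulated traces at several step sizes, run-length encodes them, and reports the finest step at which the encodings still match. Everything here is an assumption for illustration: the traces are invented, the function names are mine, and comparing encodings for exact equality is just one simple stand-in for "agreement" (comparing byte counts would be another).

```python
from itertools import groupby

def quantize(values, step):
    """Snap each value to an integer number of `step`-sized bins."""
    return [round(v / step) for v in values]

def rle(seq):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(k, len(list(g))) for k, g in groupby(seq)]

def finest_agreeing_step(a, b, steps):
    """Smallest quantization step at which both traces compress
    to the same run-length encoding, or None if they never agree."""
    for step in sorted(steps):
        if rle(quantize(a, step)) == rle(quantize(b, step)):
            return step
    return None

# Two hypothetical simulation traces that roughly track each other
sim_a = [0.0, 0.1, 1.0, 1.1]
sim_b = [0.02, 0.13, 1.04, 1.24]

print(rle(quantize(sim_a, 0.5)))                             # [(0, 2), (2, 2)]
print(finest_agreeing_step(sim_a, sim_b, [0.1, 0.5, 1.0]))   # 0.5
```

In this toy case the traces disagree at a step of 0.1 but agree at 0.5, so 0.5 would be the granularity below which the two simulations need coordinating data to stay in sync.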

Phil 7.14.21


  • Working on presentation of final report. While on vacation. Sigh
  • 4:00 Skype test
    • Everything worked!
    • 1 hour’s worth of slides. Expect questions during

AI and Compute

  • We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period).[1] Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.
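A quick back-of-the-envelope check on those growth figures, as a sketch. It only uses the 300,000x and 3.4-month numbers from the quote; the elapsed time is derived from them, not taken from the article, so the result is approximate.

```python
import math

def doublings_for_growth(factor):
    """How many doublings it takes to grow by `factor`."""
    return math.log2(factor)

def growth_over(months, doubling_months):
    """Growth factor after `months` at a given doubling period."""
    return 2 ** (months / doubling_months)

n_doublings = doublings_for_growth(300_000)
elapsed_months = n_doublings * 3.4           # implied time span of the trend
moore_growth = growth_over(elapsed_months, 24)

print(round(n_doublings, 1))     # about 18 doublings
print(round(elapsed_months))     # about 62 months, a bit over 5 years
print(round(moore_growth, 1))    # single-digit growth at a 2-year doubling
```

So 300,000x is about 18 doublings, which at 3.4 months each spans a little over five years. A 2-year doubling over that same span gives roughly 6x, the same ballpark as the 7x quoted; the exact figure depends on the start and end dates, which the excerpt does not give.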

Phil 7.13.21

NIST Proposes Approach for Reducing Risk of Bias in Artificial Intelligence

  • NIST outlines the approach in A Proposal for Identifying and Managing Bias in Artificial Intelligence (NIST Special Publication 1270), a new publication that forms part of the agency’s broader effort to support the development of trustworthy and responsible AI. NIST is accepting comments on the document until Sept. 10, 2021 (extended from the original deadline of Aug. 5, 2021), and the authors will use the public’s responses to help shape the agenda of several collaborative virtual events NIST will hold in coming months. This series of events is intended to engage the stakeholder community and allow them to provide feedback and recommendations for mitigating the risk of bias in AI. Comments are sought on the publication, which is part of NIST’s effort to develop trustworthy AI.

Working on slide deck