Phil 7.8.21

Evaluating Large Language Models Trained on Code

  • We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
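
Note to self on the "100 samples per problem" figure: a minimal sketch of how a repeated-sampling score can be computed, assuming a problem counts as solved when at least one of k samples passes its unit tests. The abstract above does not spell out the formula, so the function names and the example counts below are hypothetical; the estimator draws n samples per problem, counts the c that pass, and computes the chance that a random subset of k contains at least one passing sample.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples of which c are correct,
    passes the tests: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the result is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass_at_k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems: (samples drawn, samples passing).
example_results = [(100, 0), (100, 3), (100, 50)]
print(mean_pass_at_k(example_results, k=1))
print(mean_pass_at_k(example_results, k=10))
```

With k=1 this reduces to the plain fraction of passing samples, which is why single-sample and many-sample scores can differ so much on the same model.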

Apocalypse now and then: How a biblical genre shapes American politics

  • Today’s white evangelicals in the U.S.—along with many conservative white Catholics and mainline Protestants—imagine themselves to be the persecuted faithful, victims of state oppression in the mold of biblical apocalypses. While this might seem ludicrous to outsiders, it aptly captures their sense of the disorder of the last half century as they’ve been compelled to share cultural and political power with other groups. As it did centuries ago, apocalypse channels the persecuted group’s fear, focusing their resentment and properly directing their anger. Apocalypse’s crucial component for U.S. politics today is this extreme moral dualism, not the imminent End Times.

GPT Agents

  • Sent a note with preliminary results to the team
  • Back up db
  • Put some text together for Jarod’s proposal – done
  • Set up technical meeting for the 20th – done
  • More writing – got through section 3