Monthly Archives: October 2023

Phil 10.31.2023

Plumber?

SBIRs

  • M30 White paper
  • 9:00 standup
  • 2:00 BMD meeting
  • 3:00 M30 meeting

GPT Agents

  • Disabled the token ring buffer to see the right orientation. Looks like it’s correct.
  • I think that prompt length (and the ring buffer) might be a good way to map out a space. A short buffer should have less “direction” and should meander more.
  • Projecting the embedding for each layer as the narrative progresses may be helpful (see the sketch after this list).
  • Need to set up an overleaf project to capture this
  • Need to export to spreadsheets with text and sheets by layer
  • Write IUI 2024 review (done) and start next paper
  • 3:30 call with Greg and ContextTest – found a bug with CORS and cross-site posting. Told Zach, and we will work on a fix.
  • ContextTest with Stacey
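
A minimal sketch of what that per-layer projection could look like, assuming a Hugging Face GPT-2 checkpoint and the transformers and scikit-learn packages; the model name, the prompt, and the PCA projection are illustrative stand-ins, not the actual mapping code:

# Sketch: pull out each layer's hidden state for a prompt and project to 2D.
import torch
from sklearn.decomposition import PCA
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2', output_hidden_states=True)
model.eval()

inputs = tokenizer('The game begins as white', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding layer plus one tensor per block,
# each of shape (batch, seq_len, hidden_dim). Take the last token's vector.
layer_vectors = [h[0, -1, :].numpy() for h in outputs.hidden_states]

# Project the per-layer vectors into 2D so the trajectory can be plotted.
points = PCA(n_components=2).fit_transform(layer_vectors)
for i, (x, y) in enumerate(points):
    print(f'layer {i}: ({x:.2f}, {y:.2f})')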

Phil 10.30.2023

RV in for winterizing!

Plumber! Sent email

SBIRs

  • Working on LLM mapping. Got my first view of all the layers as angles from the average working. This is for “the game begins as [white]”.
  • Turns out I was taking the wrong axis of the vectors. This is more like what it should look like. Need to work out which axis is which, but this is all the parts working more or less correctly (a sketch of the angle calculation follows).
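
A minimal sketch of the angle-from-the-average calculation, assuming the per-layer vectors have already been collected (e.g. as in the hidden-state sketch above); the random stand-in data is only there to make it runnable:

# Sketch: express each layer as an angle from the average layer vector.
import numpy as np

def angles_from_average(layer_vectors: np.ndarray) -> np.ndarray:
    """Angle (radians) between each row and the mean of all rows."""
    mean_vec = layer_vectors.mean(axis=0)
    dots = layer_vectors @ mean_vec
    norms = np.linalg.norm(layer_vectors, axis=1) * np.linalg.norm(mean_vec)
    cos_sim = np.clip(dots / norms, -1.0, 1.0)
    return np.arccos(cos_sim)

# Random stand-in for 13 layers of 768-dimensional activations.
rng = np.random.default_rng(0)
fake_layers = rng.normal(size=(13, 768))
print(np.degrees(angles_from_average(fake_layers)))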

Phil 10.27.2023

Woke up at 3:00 with heartburn, which has kinda thrown off the day

GPT Agents

  • Ran ContextTest through its paces yesterday. Made some tweaks and updated. Time for the “friends and family” round
  • Need to set up Box account
  • Reviewing next paper for IUI

SBIRs

  • 10:00 meeting with Rukan over M30 paper
  • 2:00 Technology fellow debrief

Phil 10.26.2023

Got my COVID shot!

Plumber – not done

SBIRs

  • Expense report! Done
  • Lots of driving and meetings all day
  • Research council went well. Good questions with an involved audience
  • M30 meeting. Late because I got hung up at the gate. Good discussion though. I think there are several phases in roughly this order (put these in an Overleaf project):
    • RCSNN hierarchy for both systems, varying only by the bottom layers. Top layers could be LLM-driven, which would be fun to evaluate. Probably a lot of context prompting and a GPT-3.5 back end?
    • Simulator acceleration. There is never enough data to explore outlier states, so adding SimAccel autoencoding -> autoregression would increase the data available for effective training. Because all simulators are based on implicit assumptions, this data will almost certainly be wrong, which will be addressed with…
    • Simulator cleaning. Like data cleaning, but for data generators. The quality of the generated data can potentially be evaluated by the way that the trained model has “behavior attractors” that can be identified through something like salience analysis. These would be examples of bias, either intentional or unintentional. Imagine a car simulator that is extended to airplanes. The choice to use Euler angles (rather than Quaternions) for orientation, something that makes sense for a vehicle that navigates in 2D, will completely screw up an airplane doing basic fighter maneuvers such as an Immelmann, Split S, or wingover (see the sketch after this list). The inability to produce that kind of data would produce artifacts in the model that could either be identified on their own or by comparison with other models (e.g. MAST vs. NGTS).
    • Coevolution of AI and Simulators towards the goal of useful models. Each iteration of training and Simulator cleaning will have impacts on the understanding of the system as a whole. Consideration of this iterative development needs to be part of the process.
    • System trust. As the AI/ML simulator becomes better, the pressure to deploy it prematurely will increase. To counter this, a UI that is purposefully “low fidelity” or “wireframed” should be used for demonstrations and recordings to indicate the actual level of progress in the system’s capability.
  • Get the GPT-2 layer display first pass working
  • Start slides?
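
A small sketch of the Euler-angle point above, assuming scipy; the specific angles are arbitrary. At 90 degrees of pitch (nose straight up, as at the top of an Immelmann’s half loop), yaw and roll collapse onto a single degree of freedom, which is exactly the kind of structural blind spot a car-derived simulator would bake into its data:

# Sketch: gimbal lock, why Euler angles break down for vertical maneuvers.
import numpy as np
from scipy.spatial.transform import Rotation as R

# Intrinsic yaw-pitch-roll ('ZYX') with pitch = 90 degrees: nose straight up.
r1 = R.from_euler('ZYX', [10.0, 90.0, 20.0], degrees=True)  # yaw=10, roll=20
r2 = R.from_euler('ZYX', [30.0, 90.0, 40.0], degrees=True)  # yaw=30, roll=40

# Different (yaw, roll) pairs collapse onto the same orientation: only their
# difference survives, so the parameterization has lost a degree of freedom.
print(np.allclose(r1.as_matrix(), r2.as_matrix()))  # True

# A quaternion carries the same orientation without the degenerate spot,
# which is why flight dynamics codes generally prefer them.
print(r1.as_quat())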

GPT Agents

  • Clear and test ContextApp one last time before going live! Done
  • First official result!
  • Set up Box account – not done
  • Did finish my first review for IUI 2024

Phil 10.24.2023

SBIRs

  • 9:00 sprint planning – long
  • 1:00 M30 meeting – weird, but good
  • 2:30 AI Ethics?
  • Try to hook up the graphing to the layer activations – good progress!

GPT Agents

  • Back to reviewing papers

Phil 10.23.2023

Got a nice hike in with a bald eagle sighting 🙂

3:00 podcast meeting

SBIRs

GPT Agents

  • The IRB is submitted? Waiting for a response.
  • Apparently this technique is now called retrieval-augmented generation? (A minimal sketch of RAG follows this list.)
  • Start on slides, and make sure all the software works on the laptop
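
A bare-bones sketch of what retrieval-augmented generation boils down to; the TF-IDF retriever stands in for a real embedding model, and the toy documents and question are made up, so this is just the shape of the idea:

# Sketch: retrieve the most relevant documents and prepend them to the prompt.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    'The IRB application describes how participant data is anonymized.',
    'ContextApp stores conversations in a password-protected folder.',
    'White always moves first in chess.',
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query, k=2):
    """Return the k documents most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = 'How is participant data protected?'
context = '\n'.join(retrieve(question))
prompt = f'Answer using only this context:\n{context}\n\nQuestion: {question}'
print(prompt)  # this is what would get sent to the LLM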

Phil 10.20.2023

I have a new dishwasher. It works so well!

SBIRs

  • Need to send an email stating I intend to submit a paper for the RHB prize – done
  • Register for the ETF – done
  • Added all the reviewer comments to the “venues” section of the Overleaf doc
  • 3:00 meeting to go over RC slides

GPT Agents

  • Need to start an outline for the JHU and UMBC guest lectures
  • JHU
    • Trust and coordination at scale
    • Stories
    • Dimension reduction
    • Network density and stiffness
    • Diversity for intelligence, hierarchy for speed
    • Student activities
  • UMBC
    • Why does conflict between Nation States happen?
    • Why do we have combat?
    • Why attack/invade/occupy?
    • Why defend?
    • What does AI bring to combat?
    • What can go wrong, and how do/can we fix that?
    • What happens when communication is denied?
    • What are the implications of massive, patient, reasonably smart weapons in the information and physical domain?
    • What if AI is better (safer, more targeted, less confused) at combat than people? What does that mean for other jobs?

Phil 10.19.2023

GPT Agents

  • Make IRB changes. I think it’s basically: data will be stored in a password-protected Dropbox folder. There will be two files. One is a list of names, email addresses, and dates with an associated anonymous string (e.g. “P1”, “E1”, or “D1”). The other file will contain all experiment data with all names and dates replaced with the substitutes from the first list (a sketch of this scheme follows this list). Mostly done. Need to change the website text to not mention Supabase and convert items to PDF. – Done
  • Read IUI paper # 1. Good! Need to write the review
  • 2:00 LLM meeting
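
A rough sketch of that two-file scheme, just to convince myself it stays simple; the filenames, columns, and participant list are made-up placeholders:

# Sketch: one key file that maps identities to anonymous codes, and data
# files where every name has been replaced by its code.
import csv
import re

participants = [
    {'name': 'Alice Example', 'email': 'alice@example.com', 'date': '2023-10-19'},
    {'name': 'Bob Example', 'email': 'bob@example.com', 'date': '2023-10-19'},
]

# File 1: the key, kept separately in the protected folder.
codes = {p['name']: f'P{i + 1}' for i, p in enumerate(participants)}
with open('participant_key.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'email', 'date', 'code'])
    for p in participants:
        writer.writerow([p['name'], p['email'], p['date'], codes[p['name']]])

# File 2: experiment data with names (and, handled the same way, dates) substituted.
raw_record = 'Alice Example asked Bob Example about the chess model.'
anonymized = raw_record
for name, code in codes.items():
    anonymized = re.sub(re.escape(name), code, anonymized)
print(anonymized)  # 'P1 asked P2 about the chess model.'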

SBIRs

  • 9:00 standup – done!
  • Research Council slide review – done!
  • Incorporate Clay’s comments into IPT deck – done!
  • Fill out forms for Clay – done!

Phil 10.18.2023

SBIRs

  • Start on the other slide deck due this week. Get a meeting with Aaron for more context
  • The War Elephants presentation got nominated for best presentation at MORS. I need to submit “A complete paper (in PDF format), not to exceed 40 pages or 10,000 words including appendices. Please see accompanying formatting guidelines for additional information.” Need to put in the reviewer suggestions and submit by Feb 29 2024.
  • Roll in changes for the research council slides and distribute.

GPT Agents

  • Make IRB changes
  • Alden meeting
  • Review IUI paper # 1

Phil 10.17.2023

This is the key to making trustworthy models:

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

  • If you reduce the parameter count in an LLM, it tends to lose recall of facts before it gets worse at learning from examples in the prompt. This holds for parameter count reductions via both pruning and using a smaller dense model.
  • How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques — weight pruning and simply training a smaller or larger model, which we refer to as dense scaling — and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60–70% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.

The thing is that for sociology, the large pretrained (not finetuned) models will probably be best.

SBIRs

  • Add a 3 point Research Council story – done
  • 9:00 standup – done
  • 1:00 Dr. Banerjee – done. Fun!
  • 2:00 BMD – done. Did a slide walkthrough and got some action items
  • 2:30 AI Ethics
  • 3:00 AIMSS?

GPT Agents

  • Thinking more about how to watch the changes of the model under prompting. I think a ring buffer prompt, where the oldest tokens drop off as new ones are added, makes the most sense (a minimal sketch follows the atproto example below). I checked, and the Llama-2 models do come in pretrained and finetuned (chat) flavors.
  • Put in a request for Llama-2 access – got it! That was quick. Yep, pretrained and chat.
  • My talk is back at its original time!
  • The atproto sdk looks very nice!
from atproto import Client, models


def main():
    # Log in to the AT Protocol (Bluesky) service with a handle and password.
    client = Client()
    profile = client.login('my-handle', 'my-password')
    print('Welcome,', profile.display_name)

    # Publish a post, then like it via a strong reference to the new record.
    response = client.send_post(text='Hello World from Python!')
    client.like(models.create_strong_ref(response))


if __name__ == '__main__':
    main()
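
And a minimal sketch of the ring-buffer prompt mentioned above: a fixed-size window of token ids where the oldest tokens fall off as new ones are generated. The GPT-2 checkpoint, the buffer size, and greedy decoding are placeholder choices:

# Sketch: generate with a sliding window so old tokens drop off the prompt.
from collections import deque

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

buffer = deque(maxlen=32)  # oldest token ids are evicted from the left
buffer.extend(tokenizer.encode('The game begins as white'))

with torch.no_grad():
    for _ in range(20):
        input_ids = torch.tensor([list(buffer)])
        logits = model(input_ids).logits
        next_id = int(torch.argmax(logits[0, -1]))
        buffer.append(next_id)  # appending past maxlen drops the oldest id

print(tokenizer.decode(list(buffer)))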

Phil 10.16.2023

Spamming where the skies are blue

  • Includes Python code that uses the atproto package for consuming public data. All you need is a login and password!

Internet Archive Scholar

SBIRs

  • Start research council slide deck – Friday 20th!
  • Start Futures IPT slide deck – Wednesday 25th!
  • 2:00 MDA meeting. Offer updated report

GPT Agents:

  • The chess model has had 23 downloads in 4 days!
  • Slide deck for AI Ethics class – Wednesday 18th!

Phil 10.13.2023

I have papers to review by November 17

Speaking of reviews: Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

  • Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in consideration (e.g., child-readability), using proprietary LLMs as an evaluator is unreliable due to the closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4’s evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied. We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on customized score rubric provided by the user. Experimental results show that Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882), and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 with 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus’s capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-sourced reward models explicitly trained on human preference datasets, highlighting its potential as an universal reward model. We open-source our code, dataset, and model at this https URL.

GPT Agents

  • Had a good discussion with Jimmy and Shimei yesterday about bias and the chess model. In chess, white always moves first. That’s bias. Trying to get the model to have black move first is hard and maybe impossible. That, and the fact that some chess moves are much more common than others, might be a good way to evaluate how successfully bias in a model can be treated without destroying the model.
  • My personal thought is that there may need to be either “mapping functions” attached to the prompt vector that steer the machine in certain ways, or even entire models whose purpose is to detect and mitigate bias (a sketch of the first idea follows this list).
  • Started on getting ecco data to build maps. I need to install the project in editable mode (pip install -e .) so that I can tweak it. Here’s how: pip.pypa.io/en/latest/topics/local-project-installs
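
A very rough sketch of the first idea (the “mapping function”), done as a steering vector added to one GPT-2 block’s activations through a PyTorch forward hook; the layer index, the random vector, and the prompt are placeholders, not a worked-out bias-mitigation method:

# Sketch: nudge a model's activations with a steering vector via a hook.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

steering_vector = 0.1 * torch.randn(model.config.n_embd)  # placeholder direction

def add_steering(module, inputs, output):
    # A GPT2Block returns a tuple whose first element is the hidden states.
    return (output[0] + steering_vector,) + output[1:]

hook = model.transformer.h[6].register_forward_hook(add_steering)

inputs = tokenizer('1. e4', return_tensors='pt')
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(out[0]))

hook.remove()  # detach the steering function when done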

SBIRs

  • Need to add references to figures in the white paper – done
  • 10:30 IPT meeting – just getting ducks in a row

Phil 10.12.2023

SBIRs

  • Review fixes to white paper
  • 9:00 standup – done
  • 11:30 CSC touch point – done
  • 12:00 SEG ROM discussion – done
  • 1:00 JSC meeting
  • 3:30 M30 meeting

GPT Agents

  • 2:00 Weekly meeting
  • Created a Huggingface model card for the chess model to use with map research. I think I’m going to try to build color maps for each layer as tokens are generated and see how they change as a game is generated

Phil 10.11.2023

Clean bathroom!

SBIRs

  • 2:00 Weekly MDA meeting
  • Had a good discussion yesterday with Rukan about the demo app. Made the cutest wireframe you have ever seen using emojis
  • Got the Organizational Lobotomy story accepted into Future Tense!

GPT Agents

  • Finished my IUI bidding

Phil 10.10.2023

This weekend’s fun:

SBIRs

  • 9:00 Demos – need to make slides!
  • 11:30 Rocket review
  • 2:30 AI Ethics
  • 3:00 M30 meeting
  • 4:00 Sprint planning

GPT Agents

  • Started bidding on the IUI papers. Need to see if there are any more to mark as “reluctant”.