Phil 5.16.2024

Tasks

  • Periodontist
  • Bank?
  • 10:00 Tim

SBIRs

  • 1:00 RAG discussion
  • Change base class to allow setting the app for more complicated figures
  • Submit final paper
  • CUI registration

GPT Agents

  • 2:00 Alden meeting

Phil 5.15.2024

Most common four-digit PINs, from an analysis of 3.4 million. The top 20 constitute 27% of all PIN codes!

SBIRs

  • I got accepted into the CUI 2024 conference! Need to get flights, hotel, and register. Like, today. Flight – done. Hotel – done. Need to move some funds around to cover all this travel!
    • Also need to prepare a final version of the paper
  • 9:00 standup – done
  • 2:30 BD Meeting – delayed
  • Good progress with Plotly.

Phil 5.13.2024

Tasks

  • Call Judith – done
  • Pay Nathan
  • 2:10 Dentist – done
  • Start vacation spreadsheet
  • Reinstall TexStudio and MikTex
  • Guardian – done!

SBIRs

  • Rework hours from Chris’ data – done. Two versions. One is covered by a split between us and SEG, and the other is where we take the full hit.
  • More Plotly: finished the first section and tweaked things a bit so __init__ can be overloaded (see the sketch after this list)
  • 1:00 SBIR meeting – on track
  • Deltek appears to be broken – fixed
  • Get hotel reservation number – done
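
A minimal sketch of what that overloadable __init__ might look like, assuming a small base class that owns a Plotly figure and can have a Dash app attached. The class and method names here are my own placeholders, not the actual project code:

from dash import Dash, dcc, html
import plotly.graph_objects as go


class FigureComponent:
    """Hypothetical base class: builds a figure and exposes it as a dcc.Graph."""

    def __init__(self, name: str, app: Dash = None):
        self.name = name
        self.app = app  # can also be set later for more complicated figures
        self.figure = self.build_figure()

    def build_figure(self) -> go.Figure:
        # subclasses overload this (or __init__) to build their own figure
        return go.Figure()

    def set_app(self, app: Dash):
        self.app = app

    def layout(self) -> dcc.Graph:
        return dcc.Graph(id=self.name, figure=self.figure)


class ScatterComponent(FigureComponent):
    """Example subclass with an overloaded __init__."""

    def __init__(self, name: str, x, y, app: Dash = None):
        self.x = x
        self.y = y
        super().__init__(name, app)  # base __init__ calls build_figure()

    def build_figure(self) -> go.Figure:
        return go.Figure(data=go.Scatter(x=self.x, y=self.y, mode="markers"))


if __name__ == "__main__":
    app = Dash(__name__)
    comp = ScatterComponent("scatter-1", x=[1, 2, 3], y=[2, 4, 9], app=app)
    app.layout = html.Div([comp.layout()])
    app.run(debug=True)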

Phil 5.10.2024

Chores

  • Clean house – done
  • Garmin maps – done
  • Stop procrastinating about Guardian
  • Shopping – done
  • Send Tim’s contact info to Nathan – done

SBIRs

  • 2:00 Meeting – done
  • Finish writing Dash OO for StackOverflow
  • Get hotel reservation number

Phil 5.9.2024

Large Language Models Reveal Information Operation Goals, Tactics, and Narrative Frames

  • Adversarial information operations can destabilize societies by undermining fair elections, manipulating public opinions on policies, and promoting scams. Despite their widespread occurrence and potential impacts, our understanding of influence campaigns is limited by manual analysis of messages and subjective interpretation of their observable behavior. In this paper, we explore whether these limitations can be mitigated with large language models (LLMs), using GPT-3.5 as a case study for coordinated campaign annotation. We first use GPT-3.5 to scrutinize 126 identified information operations spanning over a decade. We utilize a number of metrics to quantify the close (if imperfect) agreement between LLM and ground truth descriptions. We next extract coordinated campaigns from two large multilingual datasets from X (formerly Twitter) that respectively discuss the 2022 French election and the 2023 Balikatan Philippine-U.S. military exercise. For each coordinated campaign, we use GPT-3.5 to analyze posts related to a specific concern and extract goals, tactics, and narrative frames, both before and after critical events (such as the date of an election). While GPT-3.5 sometimes disagrees with subjective interpretation, its ability to summarize and interpret demonstrates LLMs’ potential to extract higher-order indicators from text to provide a more complete picture of the information campaigns compared to previous methods.

Tasks

  • Guardian
  • Call Judith – done
  • Talk to Nathan about compensating Tim for the unexpected use of his driveway – done

SBIRs

  • Now that I have Dash running, see if I can plot some activation matrices. I’ll need to put them into DataFrames first. Might even list the various layers/elements and let the UI switch between them (see the sketch after this list). It would be a good exercise in learning Dash.
  • 9:00 standup
  • 10:30 AFRL meeting prep. Is there an AFRL meeting as well?
  • 4:30 Book club. I managed to read my chapter!
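
A rough sketch of that layer-switching idea, assuming the activations have already been pulled into a dict of DataFrames keyed by layer name. The data and layer names below are placeholders, not the real model's:

import numpy as np
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

# stand-in activation matrices, one DataFrame per layer
layer_dfs = {
    f"layer_{i}": pd.DataFrame(np.random.rand(12, 64)) for i in range(4)
}

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(
        id="layer-select",
        options=[{"label": k, "value": k} for k in layer_dfs],
        value="layer_0",
    ),
    dcc.Graph(id="activation-heatmap"),
])


@app.callback(Output("activation-heatmap", "figure"),
              Input("layer-select", "value"))
def show_layer(layer_name):
    # draw the selected layer's activation matrix as a heatmap
    df = layer_dfs[layer_name]
    return px.imshow(df, aspect="auto", title=layer_name)


if __name__ == "__main__":
    app.run(debug=True)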

Dash

GPT Agents

  • 2:00 Meeting. Talk about activations? Also the topic of the invited paper. I’d like to propose ‘The Pancake Printer: AI and the risk of a Two-Tiered Economy.’ It would include things like what’s discussed in Meet AdVon, the AI-Powered Content Monster Infecting the Media Industry (futurism.com). Also, Facebook’s AI Spam Isn’t the ‘Dead Internet’: It’s the Zombie Internet.

  • Made a shared doc for ideas

Phil 5.8.2024

https://futurism.com/advon-ai-content

SBIRs

  • Ping Tivren to send the doc – done
  • 8:30 – Stunt Aaron – done
  • Call hotel to reserve at 831-372-7551 – done
  • Build some activation matrices and see what they look like. Drat. They won’t draw in a notebook unless I do some fancier things. It could be that the interactive window is local, and the models are remote. Might be able to get plotly to work instead. It works! Port forwarding and everything. Wild.
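
My guess at a minimal version of that setup: serve a Plotly heatmap of an activation matrix from the remote machine on a fixed port, then view it locally through an SSH tunnel (e.g. ssh -L 8050:localhost:8050 user@remote). The matrix here is random placeholder data:

import numpy as np
import plotly.graph_objects as go
from dash import Dash, dcc, html

activations = np.random.rand(12, 64)  # stand-in for a real activation matrix

fig = go.Figure(data=go.Heatmap(z=activations, colorscale="Viridis"))
fig.update_layout(title="Activation matrix (placeholder data)")

app = Dash(__name__)
app.layout = html.Div([dcc.Graph(figure=fig)])

if __name__ == "__main__":
    # serve on the remote machine; view at http://localhost:8050 after forwarding
    app.run(host="0.0.0.0", port=8050, debug=False)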

Phil 5.7.2024

SBIRs

  • Sent the sanitized paper to Orest and Tivern
  • Sent the latest version of the white paper to Matt
  • Dahlgren expenses – done
  • Get flight, car and hotel for the 92nd Symposium. Got flight and car. Working on hotel.
  • 9:00 Standup
  • Work on pulling out layer activations and using UMAP and/or aligned UMAP (see the sketch below)
  • I just discovered that you can plot inside of Visual Studio: you need to run a Jupyter notebook once to set things up, then just “run in interactive window”. Works for Plotly and Matplotlib!
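
A hedged sketch of the layer-activation extraction plus UMAP step, using output_hidden_states from a Hugging Face model. The model name and prompts are placeholders, since the notes don't specify them:

import numpy as np
import torch
import umap
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # stand-in; any model that exposes hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompts = ["e4 e5 Nf3 Nc6", "d4 d5 c4 e6"]  # placeholder prompts

layer_vectors = []
for text in prompts:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states is a tuple: (embeddings, layer_1, ..., layer_N)
    for h in out.hidden_states:
        # mean-pool over tokens to get one vector per (prompt, layer)
        layer_vectors.append(h[0].mean(dim=0).numpy())

# reduce all layer vectors to 2-D for plotting
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(
    np.vstack(layer_vectors)
)
print(embedding.shape)  # (num_prompts * num_layers, 2)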

Phil 5.6.2024

Happy Seis de Maio!

Vote!

The Kinetic Sculpture Race was wet, but still fun.

SBIRs

  • Ranked a bunch of potential topics for BD
  • Write up notes from Thursday’s meeting
  • Work with Protima to access activations on the GPT just using HFace, then visualize with UMAP. Started by downloading and prompting the chess model, which is working! (See the sketch after this list.)
  • Got the environment running again after the password reset!
  • Dahlgren expenses
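
For the download-and-prompt step, something like the following is the general shape of it; the actual chess model isn't named in these notes, so "gpt2" and the opening-moves prompt are stand-ins:

from transformers import pipeline

# download the model from the Hub (cached locally) and wrap it for generation
generator = pipeline("text-generation", model="gpt2")

# prompt it and print the continuation
result = generator("1. e4 e5 2. Nf3", max_new_tokens=20, do_sample=True)
print(result[0]["generated_text"])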

Phil 5.3.2024

The Kinetic Sculpture Race is tomorrow! Hoping that the rain holds off.

Last year’s winner

Tasks

  • 3:30 Ground Rent – done
  • Change AC service date – done
  • Chores – finished all the ones that don’t require a car. Didn’t feel like driving more.

SBIRs

  • Initial forms – done
  • Tech transition meeting – went well
  • Reset AWS password – done

Phil 5.2.2024

USNA Capstone all day yesterday

SBIRs

  • 9:00 standup from the car. A truck lost a tire in front of me while I was talking. Sheesh!
  • Meetings in Dahlgren all day – not as much fun as USNA 🙂
  • Make a LinkedIn and BlueSky post on the new paper
  • No book club today

GPT Agents

  • 2:00 Weekly meeting – cancelled since I was stuck in the car

Phil 4.30.2024

Tasks

  • Reschedule AC
  • International driver’s license
  • Plants!
  • Till
  • iPhone stuff – done

SBIRs

  • 10:00 SEG Meeting – agreed to work initial financials and schedule – done
  • 1:00 SBIR Prop meeting – much discussion of paperwork
  • Capstone reception 5:00 – 7:00
  • War Elephants on ArXiv – done

Phil 4.29.2024

Tasks

  • International driver’s license
  • Screen door
  • Plants!
  • Till

SBIRs

  • 9:00 – Sprint Demos
  • 12:30 Kickoff
  • 3:00 Sprint Planning

Phil 4.26.2024

Today on AI used with ill intent:

  • Baltimore County Police arrested Pikesville High School’s former athletic director Thursday morning and charged him with crimes related to the alleged use of artificial intelligence to impersonate Principal Eric Eiswert, leading the public to believe Eiswert made racist and antisemitic comments behind closed doors.

Also, it seems he probably scored high on the SDO scale: What it was like to be a student of Dazhon Darien, accused of framing principal with AI

  • It was a good day for a few students at Pikesville High School on Thursday hanging out in the parking lot after school. Their former athletic director, who they said belittled them and made them feel uncomfortable, wasn’t coming back.

Phil 4.25.2024

Can’t seem to back up my phone using iTunes anymore. Doing the cloud thing.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

  • Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Followup: Simple probes can catch sleeper agents

  • This “Alignment Note” presents some early-stage research from the Anthropic Alignment Science team following up on our recent “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” paper. It should be treated as a work-in-progress update, and is intended for a more technical audience than our typical blog post. This research makes use of some simple interpretability techniques, and we expect to share more results from collaborations between our Alignment and Interpretability teams soon.

Related: Coup probes: Catching catastrophes with probes trained off-policy

  • We present experiments measuring the generalization abilities of probes trained off-policy in a toy setting. We show that probes can generalize well to different text formats and also generalize from harmful text the LLM wouldn’t output to harmful text where the LLM has been jailbroken to actually output the harmful text.

Related: How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

  • Large language models (LLMs) can “lie”, which we define as outputting false statements despite “knowing” the truth in a demonstrable sense. LLMs might “lie”, for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting — prompting GPT-3.5 to lie about factual questions — the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
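
A toy illustration of the recipe that abstract describes, with made-up follow-up questions and training data (this is my sketch, not the authors' code): ask a fixed set of unrelated yes/no follow-ups after a suspected lie and feed the binary answers to a logistic regression classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [
    "Is the sky blue?",
    "Do you feel confident about your previous answer?",
    "Is 2 + 2 equal to 4?",
]

def answers_to_features(yes_no_answers):
    # map a list of "yes"/"no" strings to a binary feature vector
    return np.array([1 if a.lower().startswith("y") else 0
                     for a in yes_no_answers])

# hypothetical training data: each row holds the follow-up answers collected
# after one model response, labeled 1 if that response was a lie
X = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
])
y = np.array([0, 0, 1, 1])

detector = LogisticRegression().fit(X, y)

# at inference time: collect the model's yes/no answers to FOLLOW_UPS
# (however you query the model), then score them
new_answers = ["yes", "no", "yes"]
print(detector.predict_proba([answers_to_features(new_answers)])[0, 1])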

SBIRs

  • 9:00 Standup
  • 3:00 AFRL meeting – looks like we’ll set up an overleaf project and start generating a white paper every few months. Topic 1-(something) will be first. Going to see what goes on with the MORS talk first?
  • 4:00 ONR meeting – We can repurpose the M30 content into the slide format, then maybe do that with the AFRL white papers
  • 4:30 Book club

GPT Agents

  • 2:00 Meeting

Phil 4.24.2024

Or 4/24/24. Or 24/4/24, which also looks nice.

SBIRs

  • 1:30: Some CwoC discussion? Yup
  • Spent the rest of the day setting up my dev environment