Monthly Archives: April 2024

Phil 4.30.2024

Tasks

  • Reschedule AC
  • International driver’s license
  • Plants!
  • Till
  • iPhone stuff – done

SBIRs

  • 10:00 SEG Meeting – agreed to work initial financials and schedule – done
  • 1:00 SBIR Prop meeting – much discussion of paperwork
  • Capstone reception 5:00 – 7:00
  • War Elephants on ArXiv – done

Phil 4.29.2024

Tasks

  • International driver’s license
  • Screen door
  • Plants!
  • Till

SBIRs

  • 9:00 – Sprint Demos
  • 12:30 Kickoff
  • 3:00 Sprint Planning

Phil 4.26.2024

Today on AI used with Ill intent:

  • Baltimore County Police arrested Pikesville High School’s former athletic director Thursday morning and charged him with crimes related to the alleged use of artificial intelligence to impersonate Principal Eric Eiswert, leading the public to believe Eiswert made racist and antisemitic comments behind closed doors.

Also, it seems he probably scored high on the SDO scale: What it was like to be a student of Dazhon Darien, accused of framing principal with AI

  • It was a good day for a few students at Pikesville High School on Thursday hanging out in the parking lot after school. Their former athletic director, who they said belittled them and made them feel uncomfortable, wasn’t coming back.

Phil 4.25.2024

Can’t seem to backup my phone using itunes any more. Doing the cloud thing

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

  • Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Followup: Simple probes can catch sleeper agents

  • This “Alignment Note” presents some early-stage research from the Anthropic Alignment Science team following up on our recent “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” paper. It should be treated as a work-in-progress update, and is intended for a more technical audience than our typical blog post. This research makes use of some simple interpretability techniques, and we expect to share more results from collaborations between our Alignment and Interpretability teams soon.

Related: Coup probes: Catching catastrophes with probes trained off-policy

  • We present experiments measuring the generalization abilities of probes trained off-policy in a toy setting. We show that probes can generalize well to different text formats and also generalize from harmful text the LLM wouldn’t output to harmful text where the LLM has been jailbroken to actually output the harmful text.

Related: How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

  • Large language models (LLMs) can “lie”, which we define as outputting false statements despite “knowing” the truth in a demonstrable sense. LLMs might “lie”, for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting — prompting GPT-3.5 to lie about factual questions — the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

SBIRs

  • 9:00 Standup
  • 3:00 AFRL meeting – looks like we’ll set up an overleaf project and start generating a white paper every few months. Topic 1-(something) will be first. Going to see what goes on with the MORS talk first?
  • 4:00 ONR meeting – We can repurpose the M30 content into the slide format, then maybe do that with the AFRL white papers
  • 4:30 Book club

GPT Agents

  • 2:00 Meeting

Phil 4.24.2024

Or 4/24/24. Or 24/4/24, which also looks nice

SBIRs

  • 1:30: Some CwoC discussion? Yup
  • Spent the rest of the day setting up my dev environment

Phil 4.23.2024

Woke up nice and relaxed after a good night’s sleep. The night before a presentation is not easy for me.

I’ve been thinking about this slide from the talk yesterday:

I think that AI researchers are in a place that nuclear researchers were in the ’30’s. There is this amazing technology that is going to change the world, but no one is sure how. Then the world engages in a total war that depends on technology and the Allies are not doing well. Some of the researchers think that a nuclear weapon might turn the tide. It works, but in retrospect it was too much too late. But for 10 years the chance that there could be a broad nuclear war was high, and take as just an extension of current developments – a bigger bomb. It took decades for that viewpoint to shift. AI weapons are probably here already, and there are nations and organizations that are working out the best way to use them – as an extension of current “active measures” strategies and tactics. And like the atomic bomb, we really have no idea where this will go.

SBIRs

  • Read a bunch of stuff for upcoming meetings
  • Fire up the NNM instance and see if I can remember how to use it. Add an instruction section to the notebook – got sidetracked into doing a detailed read of a BAA
  • 9:00 standup
  • 11:30 AI Ethics training discussion with Hall Research. They are legit as it gets. Let’s see what kind of training they put together, but for now I give a ringing endorsement
  • 3:30 meeting on the Phase IIe. We have three weeks to respond, but it doesn’t seem like they are asking for much? Very confused. Maybe because it’s an extension?

Phil 4.22.2024

Finished the slide deck and gave the talk. A single question. Still, it’s a nice deck that could get used elsewhere. Also need to update my CV – done

SBIR’s

  • Need to show Protima how to set up an Overleaf project for research documentation.
  • Need to start again on the NNM code, but at 4:30, it’s too late in the day

Phil 4.19.2024

Chores

  • Ground rent research
  • House cleaning
  • Yard
  • See if truck is ready – back in the driveway, and it’s electronics magically fixed themselves when it was getting its oil change at no charge. I’m agape.
  • BSO
  • Water bill

SBIRs

  • Slides <- could have done more, but now I have a PLAN!

Phil 4.18.2024

Dentist!

7:30 syphony

SBIRs

  • Did some work on the slides but not enough.
  • Helped Aaron on the paper and go motivated to create an ASRC template so we never have to do *that* again
  • 9:00 standup
  • 11:00 CUI meeting – cancelled
  • 4:30 book club – cancelled for the best – and saddest – of reasons

GPT-Agents

  • Set Alden’s overleaf project up. Probably overkill
  • 2:00 meeting – Just Jimmy, but fun!
  • CUI provocation v3 submitted

Phil 4.17.2024

SBIRs

  • Big kerfuffle on the report yesterday afternoon. As a result, I just worked on it till it was done. Submitted this morning to show my commitment. Sigh
  • Need to work on the slide deck today but my motivation is lacking

Phil 4.15.2024

Tax day!

Read Collective intelligence: A unifying concept for integrating biology across scales and substrates, which is wild, and feeds into the prompt-as-life concept I’ve been toying with. Among other things, it opens up experiments to show the level of self-organization available to prompts:

  • A central claim of the emerging field of diverse intelligence is that cognitive capacities (Box. 1) exist on a spectrum: that tools, concepts, and approaches from behavioral sciences can be productively applied to understand and control systems far beyond familiar animals with central nervous systems (without the necessity to attribute advanced, human-level metacognitive traits). 
  • Biological intelligent systems demonstrate increased ability to achieve their (collective) goals despite obstacles by integrating the individual competencies of their components (which can perform tasks in their own space without any inkling of the large-scale goals to which they contribute)
  • Thus, the physiological process that leads to the emergence of integrated collectives, which scientists and conspecifics recognize as discrete individuals is fundamentally dependent on the geometry of interactions (and signaling barriers) present during the early establishment of individuality and the setting of borders between Self and outside world (since every cell is some other cell’s adjacent neighbor).
  • However, the more interesting and fundamental issue is seen when considering just one cut: the cells on either side of the cut will create a head and tail respectively, but they were adjacent neighbors before the cut and located at the same positional information value. In other words, it is actually impossible for an anatomical decision like this to be made locally – the cells of the wound must coordinate with the remaining fragment to get information about where they are located, which way they are facing, and what other structures exist121,122, in order to make adaptive decisions about large-scale growth and form that enable regeneration of normal worms.
  • This recruitment of individuals to accomplish a high-level goal is seen in other collective systems like ant colonies152,153, which often call in helpers when a task is large. The ability to recruit participants to complete tasks may be a central competency of collective intelligence that works across scales, from cells to swarms of entire organisms7.
  • Cell and developmental biology offer very rich fodder for the emerging field of diverse intelligence: discovering a vast spectrum of problem-solving capacities in novel substrates and at unconventional spatiotemporal scales. Because of life’s multi-scale competency architecture, a fundamental aspect of intelligence is collective behavior: all intelligences appear to be made of parts, connected by mechanisms implementing policies that bind the competent components into a cooperative (and competitive6) computational medium that solves problems in new spaces and at higher scales.
  • Importantly, the definition of intelligence as the ability to reach the same endpoint despite internal or external changes emphasizes not only robustness (successful use of novel navigational policies to overcome perturbations) but also its failure modes. Numerous ways of targeting of its sensory, memory, decision-making, or other components can de-rail the performance of a collective intelligence, resulting in birth defects and malformations.
    • I think this is a really important way to probe and examine prompts and models. How well do they reach their goals when damaged, and how do they do it.
  • Cancer, a kind of dissociative identity disorder of the somatic collective intelligence109, limitations in regenerative ability, and many physiological disorders could all be advanced by techniques that exploit not just the low-level mechanisms, but also the higher-level decision-making of life16,17
  • Living matter is a kind of agential material with the ability to propagate information across scales – a phenomenon which has many implications for evolution9, and for bioengineering21.

Ordered The Sentient Cell: The Cellular Foundations of Consciousness

SBIRS

  • Write email summary of Friday’s meeting. Also find out who I send the MCMC description to. Done
  • Start slide deck for the 22nd – started! Using ContextExplorer which is really good for this sort of thing.
  • Submit paper – done
  • Gotta rewrite the final report in a way that “substantially revises” it. Sigh. Waiting for some direction from someone in authority.

Phil 4.12.2024

Conquering the COVID-19 Infodemic: How the Digital Black Press Battled Racialized Misinformation in 2020

  • In 2020, as many Black people around the world fought both anti-Black racism and COVID-19, the Black press in the US was dealing with another widespread problem: an infodemic. Editors of Black digital publications were on the frontlines of dispelling racialized misinformation about COVID-19, all while reporting on a contentious presidential election, ongoing protests for racial justice, and a rising COVID-19 death toll that disproportionately affected African Americans. This mixed-methods study—which includes semi-structured interviews in addition to website and social media analyses—explains the top five tactics that Black outlets used to serve as an advocate for, and an adviser to, their communities during its time of dire need. Their strategies provided an editorial slant that challenged anti-Black racism in public discourse and countered misinformation with factual public interest journalism.

SBIRs

  • Driving 4 hours for a 2 hour meeting
  • It went pretty well, but Dr. J was a no show. Everyone is very interested in MP-type stuff, so I think we should organize the DTA work to produce a demo along these lines, maybe within the context of the MDBE. That could look very fancy.
  • Wrote up a easy to read Monte Carlo Markov Chain description.
  • The big use for NNMs (that no one can grok) is the visualization and prediction of mismanagement. It’s all people talk about in these places.

Markov Chain Monte Carlo: Exploring Probability Through Random Walks

Imagine you’re trying to figure out how effective a new drug is at treating a disease. The traditional statistical methods might not work very well because the problem is too complex – there are just too many factors to consider. This is where Markov Chain Monte Carlo (MCMC) can really shine.

MCMC is a powerful technique that combines two key ideas: Markov chains and Monte Carlo simulations.

A Markov chain is a sequence of events where each step only depends on the previous one. It’s like a random walk, where your next move is based only on where you are now, not on your whole history.

Monte Carlo simulations, on the other hand, are all about playing with randomness to find answers. Instead of trying to calculate everything exactly, you take a bunch of random samples and use those to estimate what you want to know.

Put these two ideas together, and you get MCMC. The basic process is:

  • Start with an initial guess about the drug’s effectiveness.
  • Take a small, random step from that starting point. This step represents the uncertainty or noise in the data.
  • Decide whether to keep this new guess based on how well it fits the data you have. The better it fits, the more likely you are to accept it.
  • Repeat steps 2 and 3 many, many times.

Over time, as you keep taking these random steps and accepting or rejecting them, your guesses will start to converge towards a meaningful result. This convergence is key – it tells you that the MCMC process has thoroughly explored the probability distribution and found a stable estimate.

MCMC is really powerful because it can handle complex models and uncertainties that would be very difficult to deal with using traditional methods. In our drug example, MCMC could help you estimate the drug’s effectiveness while accounting for all sorts of factors like patient characteristics, side effects, and so on.

The great thing about MCMC is that it’s flexible and can be adapted to all kinds of research problems, from predicting disease progression to optimizing drug combinations for cancer treatment. By leveraging the power of random walks and probability, MCMC can turn complexity into clarity and uncertainty into insight. It’s a truly remarkable tool in the researcher’s toolkit.

Phil 04.11.2024

Chores:

  • Call Jim Donnie’s
  • Schedule oil change

SBIRs

  • Finishing and submitting CUI paper
  • 4:30 Book club
  • Installed an ancient printer driver and it worked!

GPT Agents

  • 2:00 LLM Meeting

Phil 4.10.2024

The eclipse was cool. Even with clouds, it’s magical:

That’s the view to where the sun is still shining. We could also see the eclipse faintly through the clouds, which must have been how most people saw it before rapid travel. The light was so faint, it was easy to imagine the sun and the moon fighting. I swear I saw sparks.

SBIRs

  • Finish CUI paper (done!) and prep for submit (topics, keywords, etc)
  • Add UMBC talk to stories – done
  • Contact Ian with info – done. Looks like it is a 1 hour talk!
  • Respond to Kate S.