Phil 6.11.2024

Tasks:

SBIRs

Performance goals – done
Letter to Anthropic
- I spent the whole day reading Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. It’s very good and very interesting. The Anthropic folks are looking at features, not sequences. That being said, their feature work is really good, and the UMAP relationships they show are very map-like, in much the same way that text embeddings are. That makes sense, since those embeddings also come from LLMs (frequently OpenAI’s text-embedding-ada-002). There is also some really interesting work on using the features to help the model detect manipulative content, which aligns with my White Hat AI concept.
- I was thinking that it might make sense to wait to contact Anthropic until after getting some layer mapping done, but I now think it makes sense to reach out as planned, particularly since they have a concept of features that are “smeared” across layers, which I hadn’t thought about before but which makes sense. They call this Cross-Layer Superposition.
- Anyway, I’ll write up an email tomorrow. Note – Include a picture from the conspiracy theory map!
- I really wonder if dictionary learning with sparse autoencoders could be used to extract sequences rather than features.
Change out images in presentation and resubmit – done. Decided to leave the LLM embeddings
Work on book – nope

GPT-Agents

Ping Shimei & Jimmy to see if they’d like to meet over the next two weeks – done
Conflict of Interest (COI) disclosure

viztales