
Phil 11.27.2025

Happy Tday to those who celebrate!

Early science acceleration experiments with GPT-5

  • AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

CIFAR10 hyperlightspeedbench is a very fast-training neural network implementation that originally started as a painstaking reproduction of David Page’s ultra-fast CIFAR-10 implementation on a single GPU, but was rewritten nearly from the ground up to be extremely rapid-experimentation-friendly. Part of the benefit of this is that we now hold the world record for single-GPU training speed on CIFAR10, for example.

What we’ve added:

  • custom architecture that is somehow even faster
  • way too much hyperparameter tuning
  • miscellaneous architecture trimmings (see the patch notes)
  • memory format changes (and more!) to better use tensor cores/etc
  • dirac initializations on non-depth-transitional layers (information passthrough on init; see the sketch after these lists)
  • and more!

What we’ve removed:

  • explicit residual layers. yep.
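Those two notes are related: a Dirac-initialized convolution is an identity map at init, so activations pass through unchanged even without an explicit residual connection. A minimal sketch of the idea, assuming PyTorch; this snippet is illustrative, not the repo’s actual code:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
    nn.init.dirac_(conv.weight)  # center tap = 1, all else 0: an identity kernel

    x = torch.randn(1, 64, 32, 32)
    assert torch.allclose(conv(x), x, atol=1e-6)  # information passthrough at init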

This code, in comparison to David’s original code, is in a single file and extremely flat, but is not as durable for long-term production-level bug maintenance. You’re meant to check out a fresh repo whenever you have a new idea. It is excellent for rapid idea exploration — almost everywhere in the pipeline is exposed and built to be user-friendly. I truly enjoy using this code personally, and hope you do as well! 😀 Please let me know if you have any feedback. I hope to continue publishing updates in the future, so your support is encouraged. Share this repo with someone you know who might like it!

Phil 11.22.2025

I did a Big Deal Thing. Anyone want a nice house in Catonsville, MD?

Tasks

  • Print tags – done
  • Pack – done
  • Moar paperwork
  • Get a ride in? Yes!

Phil 11.21.2025

I think the Aztecs had it right about winter. Their year was 18 months of 20 days, with 5 days left over at the winter solstice to try to get the sun to start rising earlier. Their methods were horrific, but I can appreciate the sentiment.

Tasks

  • Bills – done
    • Ricardo, Sande, and Edwin’s – done
  • Print tags
  • Chores – done
  • Dishes – done
  • Lawn – done
  • Keys – done
  • Shelving – done
  • Storage run
  • Recycling – done
  • Light groceries

Phil 11.20.2025

What Donald Trump Has Taught Us about American Political Institutions | Political Science Quarterly | Oxford Academic

  • Generations of political scientists have viewed the American constitutional system and its surrounding pluralist civil society as stable touchstones that safeguard against the threat of authoritarian leadership. Capitalizing on changes that go back several decades—the rise of nationalized polarization, the development of the unitary executive theory, and the growing sway of populist conservatives within the Republican Party—Donald Trump has demonstrated that the sources of countervailing power in the U.S. political system are far more fragile than previously understood. Trump has prevailed upon congressional Republicans to surrender their core constitutional responsibilities, has eviscerated critical foundations of the modern administrative state, and upended the relationship between the federal government and major civil society actors. Political scientists did not anticipate the potential for democratic breakdown that has emerged; we must now direct our energies to understanding this new constellation of power, as well as the pathways available for opponents to respond.

SBIRs

  • 9:00 Standup – done
  • 11:00 Phase2+ – done
  • 4:00 MDA – done
  • Create walk sequences of cluster trajectories (ignore -1) and make an index2vec model. Let’s see what it looks like!
    • Wrote a create_csv_from_story_embedding.py script that builds a walk-sequence csv file and creates a model card
    • Trained a 2D and a 3D model!
    • Got everything working! The embeddings are different from the topic embeddings; they appear to be more linear-ish. I think I need a good deal more data, because the 2D model seems to be better than the 3D one, and multiple points are placed at the same coordinate. So: 1) try smaller clusters, which should give me 20% more data right there, and 2) generate more scenarios. Looking at Gutenberg just to see what re-embedding means is also appropriate as a next step. (A sketch of the index2vec training follows this list.)
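A minimal sketch of what the index2vec step might look like, assuming gensim’s Word2Vec and treating each story’s walk through cluster space as a “sentence” (the csv layout and column names here are hypothetical):

    import pandas as pd
    from gensim.models import Word2Vec

    # each row of the (hypothetical) walk-sequence csv is one story's
    # trajectory through cluster space, e.g. "c3 c3 c7 c2"
    df = pd.read_csv('walk_sequences.csv')
    walks = [str(s).split() for s in df['sequence']]
    walks = [[tok for tok in w if tok != '-1'] for w in walks]  # ignore -1 (noise)

    # 2D model; use vector_size=3 for the 3D variant
    model = Word2Vec(sentences=walks, vector_size=2, window=5, min_count=1, sg=1, epochs=100)
    print(model.wv['c3'])  # the learned 2D coordinate for cluster 3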

Phil 11.19.2025

Need to try this: Generative UI: A rich, custom, visual interactive user experience for any prompt

  • We introduce a novel implementation of generative UI, enabling AI models to create immersive experiences and interactive tools and simulations, all generated completely on the fly for any prompt. This is now rolling out in the Gemini app and Google Search, starting with AI Mode.

Scammers net nearly $100k in Chesapeake catfish – The Baltimore Banner

  • The phone numbers checked out. The emails seemed to come from McCain Foods. Even the name on the order matched an executive at the multinational frozen foods company.

How to disable all AI stuff in Visual Studio Code

SBIRs

  • Add an “unclustered” count – done. There are 1,091 unclustered points out of a total of 5,326
  • Work on scenario trajectories. Working! This is the raw-embedding trajectory for Scenario 2; it seems to show that even though the topics are close together, the trajectory through the space is somewhat chaotic. It’ll be interesting to see what the index2vec training does (a plotting sketch follows this list):
  • Added some detail:
  • Maybe start on walk sequences – not quite.
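A minimal sketch of the trajectory plot, assuming a DataFrame with one row per story step and hypothetical version/step/x/y/z columns from the reduced embeddings:

    import plotly.express as px

    # df columns assumed: version, step, x, y, z
    df = df.sort_values(['version', 'step'])
    fig = px.line_3d(df, x='x', y='y', z='z', color='version')
    fig.update_traces(mode='lines+markers', marker=dict(size=3))
    fig.show()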

GPT Agents

  • 2:30 meeting – done

Phil 11.18.2025

White nationalist talking points and racial pseudoscience: welcome to Elon Musk’s Grokipedia | Elon Musk | The Guardian

  • “Grokipedia is a copy of Wikipedia but one where in each instance that Wikipedia disagrees with the richest man in the world, it’s ‘rectified’ so that it’s congruent with them.”

Tasks

  • Send Nellie a response on lead and keys – done
  • Vanessa xfer – done!
  • See if there is anything to pull from the garden

SBIRs

  • Did some clustering on the sentence embedding data to see what it looks like. It seems as though the lowest number of dimensions results in the best clustering. Not surprising, but good to know the curse-of-dimensionality intuition holds
  • Here’s a set of screenshots for each of the UMAP/HDBSCAN variations (the sweep is sketched after this list):
  • Write a class that reads in one scenario and then plots lines for each version. See how to animate dots that move along the lines.
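A minimal sketch of the kind of sweep behind those screenshots, assuming `embeddings` is an (n, d) array of the sentence embeddings:

    import hdbscan
    import umap

    for n_components in (2, 3, 5, 10):  # target dimensions to compare
        reduced = umap.UMAP(n_components=n_components).fit_transform(embeddings)
        clusterer = hdbscan.HDBSCAN().fit(reduced)
        n_clusters = clusterer.labels_.max() + 1
        n_noise = int((clusterer.labels_ == -1).sum())
        print(f"{n_components}D: {n_clusters} clusters, {n_noise} unclustered")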

Phil 11.17.2025

Disrupting the first reported AI-orchestrated cyber espionage campaign \ Anthropic

  • This campaign has substantial implications for cybersecurity in the age of AI “agents”—systems that can be run autonomously for long periods of time and that complete complex tasks largely independent of human intervention. Agents are valuable for everyday work and productivity—but in the wrong hands, they can substantially increase the viability of large-scale cyberattacks.

Tasks

  • 4:00 Meeting with Nellie – done

SBIRs

  • Slides! Done!
  • 9:00 Sprint demos – done
  • 3:00 Sprint planning – done
  • Got sentence-level embeddings done (a sketch of the approach follows this list)
  • Now I need to see how clustering looks. Definitely some different regions, though there may just be a big blob too.
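A minimal sketch of the sentence-level pass, assuming the OpenAI embeddings API (the post doesn’t say which embedding backend is actually in use):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    sentences = story_text.split('. ')  # crude sentence split, for illustration
    resp = client.embeddings.create(model='text-embedding-3-small', input=sentences)
    vectors = [d.embedding for d in resp.data]  # one vector per sentence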

Phil 11.16.2025

Along the lines of last Thursday, I wonder if the layers of an LLM could help identify the text that is most useful for identifying a topic. In particular, I’m thinking of Jay Alammar’s work on using NMF to visualize what’s going on in the layers of a model (Interfaces for Explaining Transformer Language Models)
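A minimal sketch of that NMF-on-activations idea, loosely after Alammar’s Ecco work, assuming `acts` is a (tokens × neurons) matrix of FFN activations pulled from the model:

    import numpy as np
    from sklearn.decomposition import NMF

    acts = np.clip(acts, 0.0, None)  # NMF needs non-negative input
    nmf = NMF(n_components=8, max_iter=500)
    token_factors = nmf.fit_transform(acts)  # each token's loading on 8 factors

    # the highest-loading tokens per factor are candidates for the text
    # that most identifies a topic
    top = token_factors[:, 0].argsort()[::-1][:10]
    print(top)  # indices of the 10 strongest tokens for factor 0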

Added this thought to the project documentation and tweaked the layout so there is now a “prompts and stories” appendix. Makes things read better.

Phil 11.14.2025

Had some interesting thoughts about the embedding space results from yesterday. I want to look at how each variation of a particular scenario relates to the others within the scenario. That could be interesting and a way of showing the “probability cone” of LLM narratives.

The other thing to try is to do an embedding at the sentence level and see what that looks like. Since all the tools are in place and embedding is ludicrously inexpensive, this should be straightforward and affordable.
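A minimal sketch of one way to quantify that “probability cone”, assuming `variations` holds the embedding vectors for all versions of a single scenario:

    import numpy as np

    def cone_spread(variations):
        """Mean cosine distance of each variation from the scenario centroid."""
        vecs = np.asarray(variations, dtype=float)
        centroid = vecs.mean(axis=0)
        sims = (vecs @ centroid) / (
            np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid))
        return float(1.0 - sims.mean())  # wider cone -> larger value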

Tasks

  • Bills – done
  • Chores – done
  • Dishes – done
  • Print and sign things this afternoon, maybe. Nope
  • Submitted ticket for broken BeliefSpaces email – fixed

Phil 11.13.2025

Tasks

  • Windows today. Noonish? Done and lovely
  • Slow cook a chicken – cooking! Cooked

SBIRs

  • There may be a GPT-5.1? Need to check the available models (a quick check is sketched after this list)
  • 9:00 Standup – done
  • 10:00 Ron’s meeting – done
  • 3:00 SEG – done. Fast Matt apparently forgot. I need to read his notes later
  • 4:00 ADS. Mention that I won’t be able to make it next week – canceled
  • UMAP! Working!
  • These are embeddings of 5 scenarios that should be in a roughly similar space. I’m a bit surprised that they don’t overlap. Probably need a lot more scenarios. I’ll make a few more and see how that changes things
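Checking for a 5.1 model is a quick pass over the models endpoint; a sketch assuming the OpenAI Python client:

    from openai import OpenAI

    client = OpenAI()
    for m in client.models.list():
        if 'gpt-5' in m.id:
            print(m.id)  # any gpt-5.1 variant would show up here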

Phil 11.12.2025

Tasks

  • Groceries! Done
  • Goodwill – done
  • Nag Aaron for paperwork – done

SBIRs

  • 10:00 Ron’s meeting – postponed
  • Now that I have the displays up and running, either get started on UMAP or play with HDBSCAN
  • I moved all the Plotly code into a DfScatter3D class, since the clusterers and reducers shouldn’t have to care about rendering.
  • I created a Python script that calls HDBSCAN on the same data I’ve been using but without the assigned clustering. It’s really easy to use:
    import pandas as pd
    import hdbscan

    df = pd.DataFrame(l)  # l is the list of point dicts from before
    blobs = df.values.tolist()  # plain [x, y, z] coordinate lists
    # prediction_data=True lets approximate_predict() work later on
    clusterer = hdbscan.HDBSCAN(prediction_data=True)
    clusterer.fit(blobs)
    df['cluster'] = clusterer.labels_  # -1 marks points HDBSCAN could not assign
  • And that gives results that are almost as good as the assigned values:
  • These results are with the default values. Note that the points that can’t be assigned have dark coloring
  • The next thing I’ll do is create a line along the X axis and use hdbscan.approximate_predict(clusterer, x_axis_points) to see where they get assigned.
  • It’s the same sort of pattern as before, though you do have to concat the two DataFrames together for proper rendering:
    df2 = pd.DataFrame(l)  # l now holds the points sampled along the X axis
    test_points = df2.values.tolist()
    test_labels, strengths = hdbscan.approximate_predict(clusterer, test_points)
    df2['cluster'] = test_labels  # predicted cluster for each test point
    df3 = pd.concat([df, df2])  # combine both sets for a single rendering pass
  • Which gives us this:

UMBC

  • 3:00 Alden

Phil 11.11.2025

OpenRouter bills itself as “the first LLM marketplace” that “has grown to become the largest and most popular AI gateway for developers. We eliminate vendor lock-in while offering better prices, higher uptime, and enterprise-grade reliability.” They have all kinds of interesting data about the models they serve (rankings), and piles of big-name and obscure models.
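OpenRouter’s endpoint is OpenAI-compatible, so trying one of those obscure models is mostly a base_url swap; a sketch (the model id is just an example from their catalog):

    from openai import OpenAI

    client = OpenAI(
        base_url='https://openrouter.ai/api/v1',
        api_key='sk-or-...',  # an OpenRouter key, not an OpenAI one
    )
    resp = client.chat.completions.create(
        model='meta-llama/llama-3.1-8b-instruct',
        messages=[{'role': 'user', 'content': 'Say hello.'}],
    )
    print(resp.choices[0].message.content)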

Mapping the Latent Past: Assessing Large Language Models as Digital Tools through Source Criticism

  • This article examines how digital historians can use large language models (LLMs) as research tools while critically assessing their limitations through source criticism of their underlying training data. Case studies of LLM performance on historical knowledge benchmarks, oral history transcriptions, and OCR corrections reveal how these technologies encode patterns of whose history has been digitised and made computationally legible. These variations in performance across linguistic and temporal domains reveal the uneven terrain of knowledge encoded within generative AI systems. By mapping this “jagged frontier” of AI capabilities, historians can evaluate LLMs not just as tools but as historical sources shaped by the scale and diversity of their training. The article concludes by examining how historians can develop new forms of source criticism to navigate generative AI’s uneven potential while contributing to broader debates about these technologies’ societal impact.

Tasks

  • Finish slides – scroll through notes for links – done
  • Check in and ping Sande – done
  • 4:30 class – done!

SBIRs

  • Change df so that cluster id is a column and see if I can get that to work
  • That works nicely. Here’s the code that creates the df:
    import numpy as np
    import pandas as pd

    num_populations = 5
    num_samples = 1000 * num_populations
    l = []
    scalar = 5.0  # spacing between cluster centers along x
    for i in range(num_samples):
        c = np.random.randint(0, num_populations)
        # one Gaussian blob per cluster, offset along x so the populations separate
        d = {'cluster': f"c{c}",
             'x': np.random.normal() + (float(c) - num_populations / 2.0) * scalar,
             'y': np.random.normal(),
             'z': np.random.normal()}
        l.append(d)

    df = pd.DataFrame(l)
  • Here’s the rendering code:
    import plotly.express as px

    fig = px.scatter_3d(df,
                x='x',
                y='y',
                z='z',
                color='cluster'  # one colored trace per cluster id
        )
    fig.update_traces(marker=dict(size=3))  # smaller markers read better with 5,000 points
    fig.show()

And here are the results:

Phil 11.10.2025

Tasks

  • Finalize LLC paperwork (signed?) Some typos and edits. Sent back and asked about signing
    • Bank stuff – started
  • Ping Aaron and see if he wants to do BS stuff – done
  • Turn off water – done
  • Slides – started
  • Pay dentist
  • Barbara noonish – fun!

SBIRs

  • Start on plotting, UMAP, and clustering.
  • Split off the StoryEmbedding class since that will be needed for this effort too
  • Got the 3D scatterplot running in Plotly/dash, and got it hooked up to a DataFrame (a minimal sketch of the hookup follows this list):
  • Tomorrow we’ll try getting UMAP to work
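Assuming `fig` is the px.scatter_3d figure built from that DataFrame, the Dash side might look like:

    from dash import Dash, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([dcc.Graph(figure=fig)])  # the 3D scatter, live in the browser

    if __name__ == '__main__':
        app.run(debug=True)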

Phil 11.7.2025

Computational Turing Test Reveals Systematic Differences Between Human and AI Language

  • Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations — testing whether humans can distinguish AI from human output — despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies — including fine-tuning, stylistic prompting, and context retrieval — benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations — and offer a cautionary note about their current limitations in capturing human communication.
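The “BERT-based detectability” half of that framework is easy to picture: if a simple classifier over text embeddings can separate human posts from LLM output, the LLM text is still detectable. A sketch under assumed inputs (not the paper’s actual setup):

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # human_texts and llm_texts: assumed lists of posts from each source
    encoder = SentenceTransformer('all-MiniLM-L6-v2')
    X = encoder.encode(human_texts + llm_texts)
    y = [0] * len(human_texts) + [1] * len(llm_texts)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(acc)  # well above 0.5 means LLM text remains distinguishable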

Tasks

  • Bills – done
  • Dishes – done
  • Chores – done
  • LLC stuff – skimmed and found a typo. Need to decide on text for “purpose.” I’m thinking of something along the lines of the development and application of ethical machine-learning and generative AI solutions, and to promote awareness of malicious and nefarious uses of these technologies. Sent draft to Aaron – done
  • Slides for Tuesday
  • Water plants – done
  • Mow lawn – done
  • Bike shoes – done
  • Storage run – done