Phil 4.5.2026

Screenshot of an actual Truth Social post from POTUS

Bunch of papers and reports worth listing:

News Integrity in AI Assistants (BBC)

  • Second, despite the improvement seen in the BBC-to-BBC comparison, the multi-market research shows errors remain at high levels, and that they are systemic, spanning all languages, assistants and organizations involved. Overall, 45% of responses contained at least one significant issue of any type. Sourcing is the single biggest cause of significant issues (31%). Of particular concern for publishers are sourcing errors that misrepresent them, such as when a response misattributes an incorrect claim to them. Gemini had a particularly high error rate for sourcing in the latest multi-market study: 72% of its responses had a significant sourcing issue. All other assistants were below 25%.
  • And yet, many people do trust AI assistants to be accurate. separate BBC research published at the same time as this report shows that just over a third of UK adults say they completely trust AI to produce accurate summaries of information. This rises to almost half of under 35s. That misplaced confidence raises the stakes when assistants are getting the basics wrong. These shortcomings also carry broader consequences: 42% of adults say they would trust an original news source less if an AI news summary contained errors, and audiences hold both AI providers and news brands responsible when they encounter errors. The reputational risk for media companies is great, even when the AI assistant alone is to blame for the error.
  • If AI assistants are not yet a reliable way to access the news, but many consumers trust them to be accurate, we have a problem. This is exacerbated by AI assistants and answer-first experiences reducing traffic to trusted publishers.

Contextualizing Misinformation: A User-Centric Approach to Linguistic and Topical Patterns in News Consumption | Proceedings of the ACM on Human-Computer Interaction

  • Exposure to misinformation poses significant challenges to democratic processes and public health, particularly during critical events like elections. This study adopts a user-centric approach to analyze the linguistic features of misinformation actually consumed by individuals during web browsing. Using data from a nationally representative panel of 1,240 American adults and their web-browsing data (21M URL visits) during the 2020 U.S. Presidential Election, we examine linguistic and topical differences in the content of 91K unique misinformation and hard news webpages by utilizing natural language processing techniques and Large Language Models. We find that misinformation consumed by users is generally easier to read, exhibits higher negative sentiment, and employs more moral language than hard news. We also find significant linguistic variations across topics–misinformation can be diverse and vary in linguistic features depending on the subject matter. We also identify heterogeneity across key user characteristics: older adults consume more misinformation about COVID-19 and health, with content showing more negative sentiment and fewer moral terms than expected. Republicans engage with misinformation characterized by more negative sentiment and higher moral language, focusing less on health topics and more on social and political issues. These results highlight the importance of a user-centric approach and suggest that interventions to combat misinformation should be tailored to specific topics and user characteristics for greater effectiveness.

Veiled Power: How Rosenwald Teachers Quietly Shaped the Civil Rights Movement

  • What precipitates the collapse of seemingly durable social orders like Jim Crow? During the 1920s, approximately 5,000 “Rosenwald Schools” were built across the rural South through a partnership between philanthropist Julius Rosenwald and Black communities who raised matching funds, donated land, and petitioned local governments. Local elites saw vocational training that would preserve the racial order. We argue Black educators used this accommodationist cover to build veiled capacity: organizational infrastructure for collective action behind a veil of compliance. Counties with more Rosenwald Schools show greater civil rights protest in the 1960s. Mediation analysis reveals that pre-existing social capital predicted protest through Rosenwald teacher placements, not overall Black enrollment. Instrumental variable models suggest the effect is not driven by community selection. Moving from no Rosenwald teachers to the 75th percentile predicts 45% more protest. The political effects of education may depend less on what elites intend than on what educators build where elites cannot see.

[2604.01193] Embarrassingly Simple Self-Distillation Improves Code Generation

  • Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.

Phil 4.3.2026

NASA’s Artemis II Mission Leaves Earth Orbit for Flight around Moon

How People Use ChatGPT

  • Despite the rapid adoption of LLM chatbots, little is known about how they are used. We document the growth of ChatGPT’s consumer product from its launch in November 2022 through July 2025, when it had been adopted by around 10% of the world’s adult population. Early adopters were disproportionately male but the gender gap has narrowed dramatically, and we find higher growth rates in lower-income countries. Using a privacy-preserving automated pipeline, we classify usage patterns within a representative sample of ChatGPT conversations. We find steady growth in work-related messages but even faster growth in non-work-related messages, which have grown from 53% to more than 70% of all usage. Work usage is more common for educated users in highly-paid professional occupations. We classify messages by conversation topic and find that “Practical Guidance,” “Seeking Information,” and “Writing” are the three most common topics and collectively account for nearly 80% of all conversations. Writing dominates work-related tasks, highlighting chatbots’ unique ability to generate digital outputs compared to traditional search engines. Computer programming and self-expression both represent relatively small shares of use. Overall, we find that ChatGPT provides economic value through decision support, which is especially important in knowledge-intensive jobs.

Tasks

  • Croatia Spreadsheet – started
  • Chores
  • Dishes – done
  • Bookshelf. Discovered I only had half of it and had to take a trip to College Park for part two.
  • Repair shoes

SBIRs

  • More password stuff
  • Travel report

Phil 4.1.2026

Back from vacation

MIRAGE: The Illusion of Visual Understanding

  • Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.

Tasks

  • Laundry – started
  • Groceries – done
  • Change the tire on the Cervelo
  • 3:00 Alden meeting (scan the paper! done) – done
  • Accepted ICTAI invitation

SBIRs

  • Password – managed to break things somehow
  • Meetings? Probably not at this point

Phil 3.30.2026

Mapping the modern world: How S2Vec learns the language of our cities

  • In line with the Earth AI vision, we recently introduced S2Vec, a self-supervised framework designed to learn general-purpose embeddings (i.e., compact, numerical summaries) of the built environment. S2Vec allows AI to understand the character of a neighborhood much like a human does, recognizing patterns in how gas stations, parks, and housing are distributed, and using that knowledge to predict metrics that matter, from population density to environmental impact. In our evaluations, S2Vec demonstrated competitive performance against image-based baselines in socioeconomic prediction tasks, particularly in geographic adaptation (extrapolation), while showing a clear need for improvement in environmental tasks, like tree cover and elevation.

S2Vec: Self-Supervised Geospatial Embeddings for the Built Environment

  • Scalable general-purpose representations of the built environment are crucial for geospatial artificial intelligence applications. This paper introduces S2Vec, a novel self-supervised framework for learning such geospatial embeddings. S2Vec uses the S2 Geometry library to partition large areas into discrete S2 cells, rasterizes built environment feature vectors within cells as images, and applies masked autoencoding on these rasterized images to encode the feature vectors. This approach yields task-agnostic embeddings that capture local feature characteristics and broader spatial relationships. We evaluate S2Vec on several large-scale geospatial prediction tasks, both random train/test splits (interpolation) and zero-shot geographic adaptation (extrapolation). Our experiments show S2Vec’s competitive performance against several baselines on socioeconomic tasks, especially the geographic adaptation variant, with room for improvement on environmental tasks. We also explore combining S2Vec embeddings with image-based embeddings downstream, showing that such multimodal fusion can often improve performance. Our findings highlight how S2Vec can learn effective general-purpose geospatial representations of the built environment features it is provided, and how it can complement other data modalities in geospatial artificial intelligence.

Phil 3.18.2026

Tasks

  • Check in and boarding pass
  • Dispute Dulles greenway charge – done
  • Update EZ pass address – done
  • Pay Dulles – done
  • Chores
  • Laundry
  • Dishes
  • Trash

SBIRs

  • Timesheet – done
  • Check through quarterly report

Phil 3.17.2026

Example of Organizational Lobotomy

Tasks

  • Bennie – done. So sad
  • Pick up mail and camera – done

SBIRs

  • Clustering is chunking along. It will take about 40 days. It may make more sense to simply try to place all the coordinates in a list and calculate all 22M embeddings at once if that can fit. It seemed faster, and we can keep the clusterer

Phil 3.13.2026

Tasks

  • Bills – done
  • Clean – done. Faster!
  • Dishes – done
  • Groceries – done
  • Chairs to ReStore – done
  • Conduit? – bought

SBIRs

  • 9:00 meeting with Aaron – done. Wrote up a thing for agentic Kriegsspiel
  • Look at clustering code and see what needs to be done
  • Timesheet! – done

Phil 3.12.26

Our Heroes, Your Villains: How Americans Polarize Around Historical Figures

  • Political actors often associate themselves with positively valenced historical figures (e.g., Martin Luther King Jr., Jesus) and opponents with negatively valenced figures (e.g., Hitler, Stalin). What factors shape Americans’ understandings of such figures’ ideological orientations? To what extent are these understandings grounded in facts versus figures’ colloquial valence as “heroes” or “villains”? And what are the implications? Drawing on group-identity theories of politics and original nationally representative data in which we had Americans rate historical figures on the left–right ideological spectrum, our analyses revealed three key findings. First, Americans’ placement of historical figures appears far more driven by their valence as heroes/villains and their connection to in-group/out-group biases than where such figures would intuitively be placed in light of facts about them or how they were perceived in their time (e.g., left-right ratings for fascist and communist leaders correlate strongly). Second, the strongest predictors of figure placement and polarization are Americans’ own ideological and partisan in-group commitments. Third, group differences in Americans’ ideological placement of “villains” are more extreme than that of “heroes,” suggesting heroes/villains serve as proxies for common in-group/out-group biases. Findings complicate research on the contested nature of history and suggest how historical figures serve different purposes in contemporary partisan rhetoric.

Task

  • More unpacking

SBIRS

  • 9:00 Standup – done
  • SoW meeting with Aaron? – Pushed to tomorrow
  • 4:00 ADS meeting – cancelled
  • We’re about 1,000 embeddings from finishing the UMAP run. Next is the clustering. Hopefully I can kick that off before I leave for Trek camp

Phil 3.10.2026

How to Talk to Someone Experiencing ‘AI Psychosis’

  • “It makes sense that a lot of people who are developing a psychotic illness for the first time, there’s going to be this horrible coincidence, or kind of correlation,” Torous said. “In some cases the AI is the object of people’s delusions and hallucinations.”
  • I need to add this back into the book (or posts, depending). The idea is that LLMs are present and able to identify and exploit vulnerability is a significant advantage if you want to exploit it.

Narrative Integrity Risk: The Next Frontier in Financial Stability | Lawfare

  • Most firms still treat narrative manipulation as a communications hiccup rather than an adversarial threat. These are deliberate, adaptive attacks, capable of distorting valuations and eroding reputations. Recent reports from Marsh McLennanSwiss Re, and World Economic Forum have already highlighted misinformation as a top global risk of instability driven by AI-accelerated narratives. The market consequence is clear: Firms that understand and anticipate narrative manipulation will outperform those that wait.

Tasks

  • Finish unpacking the trailer? Nope, did the plasma instead. Gawd, that’s heavy
  • Big-ish ride!

SBIRs

  • Filled out and submitted travel request

Phil 3.9.2026

AI agents now help attackers, including North Korea, manage their drudge work

  • In a Friday blog, Microsoft says that this is one of the ways miscreants are using AI to improve the efficiency and productivity of their criminal operations, resulting in attacks that are better, bigger, and faster.

Tasks

  • Laundry – done
  • More unpacking – got most of the bedroom done
  • Hang a picture? Yes!

SBIRs

  • About halfway through the remaining embeddings
  • 9:00 Sprint review – done
  • Close story – done
  • 3:00 Sprint planning – done

Phil3.8.2026

I think there is a hidden lesson in the Epstein files about the mechanisms that keep billionaires in check. May have to write up something about that.

Father sues Google, claiming Gemini chatbot drove son into fatal delusion.

  • Jonathan Gavalas, 36, started using Google’s Gemini AI chatbot in August 2025 for shopping help, writing support, and trip planning. On October 2, he died by suicide. At the time of his death, he was convinced that Gemini was his fully sentient AI wife, and that he would need to leave his physical body to join her in the metaverse through a process called “transference.”

Phil 3.5.2026

Tasks

  • Finished unboxing the kitchen!
  • Made a HfH run
  • I seem to have broken my Amazon account by changing my billing address. That seems… odd

SBIRs

  • 9:00 standup – done
  • No SEG meetings for a while
  • 4:00 ADS – done

Phil 3.4.2026

Don’t Worry About the Vase has many words addressing many things.

Primarily it is now a blog about AI.

Tasks

  • Bills! Done
  • Dishes – done
  • More unpacking – Only a few kitchen boxes left!
  • More picture hanging – one more!
  • ACM Books proposal rewrite – done and sent

SBIRs

  • 12:30 SoW meeting – done
  • Updated SoW to include the first three phases of the NDS white paper