Phil 12.18.2024

Johns Hopkins is still tracking COVID. Here’s Maryland:

Replication for Language Models: Problems, Principles, and Best Practice for Political Science

  • Excitement about Large Language Models (LMs) abounds. These tools require minimal researcher input and yet make it possible to annotate and generate large quantities of data. While LMs are promising, there has been almost no systematic research into the reproducibility of research using them. This is a potential problem for scientific integrity. We give a theoretical framework for replication in the discipline and show that much LM work is wanting. We demonstrate the problem empirically using a rolling iterated replication design in which we compare crowdsourcing and LMs on multiple repeated tasks, over many months. We find that LMs can be (very) accurate, but the observed variance in performance is often unacceptably high. In many cases the LM findings cannot be re-run, let alone replicated. This affects “downstream” results. We conclude with recommendations for best practice, including the use of locally versioned ‘open source’ LMs.
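# Convert a spreadsheet picked from a file dialog into a LaTeX table
import pandas as pd
from tkinter import filedialog

# Returns an empty string if the dialog is cancelled
filename = filedialog.askopenfilename(filetypes=(("XLSX files", "*.xlsx"),("All Files", "*.*")), title="Load XLSX Files")
if filename:
    print("opening {}".format(filename))

    df = pd.read_excel(filename)
    # Backslashes are doubled so Python does not read \c, \l, or \e as (invalid) escape sequences
    print("\\begin{table}[]\n\\centering")
    print(df.to_latex())
    print("\\caption{Caption}\n\\label{tab:my_label}\n\\end{table}")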

SBIRs

  • Got some feedback. Need to roll it in. Also, shorter bios and start trimming to 5 pages.
  • Maaaaaaaaaaaaayyyyyyyybeeee get back to some coding.

GPT Agents

  • Start working on diagram. Maybe tweak the paper to mention the above?

Phil 12.17.2024

Learned a new word today:

Tasks

  • Ping Carlos – done
  • Check for TW Ellis – done
  • Call Dentist – and now I have a new filling

SBIRs

  • Frame out a rough draft – done

Phil 12.16.2024

Email for Carlos

SBIRs

  • 9:00 Sprint Demos – done
  • 3:00 Sprint Planning – stories are written
  • Started on the SBIR proposal. Template is done
  • Two unplanned meetings that might mean a trip to Huntsville again
  • Got together with Aaron to figure out the approach to the SBIR

Phil 12.14.2024

I Traded My News Apps for Rumble, the Right-Wing YouTube. Here’s What I Saw

  • Blame for any hiccups in Mr. Trump’s strategy was assigned to Democrats or even Republicans who were not sufficiently obedient.

Speaking of detecting and disrupting manipulation: A phone company developed an AI ‘granny’ to beat scammers at their own game. The number seeding is a literal counterattack/exploit in its own right.

  • O2, the company behind the scam-baiting granny, said the AI technology can keep scammers on the phone for 40 minutes at a time. Daisy was trained with the help of YouTuber and software engineer Jim Browning, who has made an online career exposing scammers to his community of 4.4 million subscribers.
  • In order to bait scammers into time-wasting calls, the company utilized the practice of “number seeding,” which put the AI granny’s number on lists used by scammers to find their victims. The granny gimmick’s goal is twofold: to keep scammers away from real people and to raise awareness about the dangers of risky phone hoaxes.

Phil 12.13.2024

Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

  • Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI’s GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document the amount of data leaked to these models during the first year after the model’s release. We report that these models have been globally exposed to ∼4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on https://leak-llm.github.io/, where other researchers can contribute to our efforts.

SBIRs

  • 9:00 Meeting
  • 12:50 USNA

Phil 12.12.2024

Atlantic circulation collapse? New clues on the fate of a crucial conveyor belt

  • So in summary, we have at least some reassurance from the North Atlantic data that a full-on AMOC collapse hasn’t begun. And it’s unlikely that any future collapse would reach its end point any sooner than the early to mid-2100s. Yet there’s also legitimate concern – stoked by recent work from climate modelers and statisticians – that a tipping point toward eventual collapse could arrive as soon as the next several decades, especially if fossil-fuel emissions aren’t cut sharply. In part two of this two-part post, we’ll look at some of the new research on early-warning signs of AMOC collapse, what those scientists and other assessments are telling us about the threat, and how we can help limit the odds of an AMOC collapse happening in the first place.

SBIRs

  • 9:00 standup
  • 11:00 drop off car / pick up car – done
  • 2:30 “AI” Meeting – could not work it in
  • 4:30 Book club – Rukan couldn’t make it
  • Good progress today, but still not quite right

GPT Agents

  • LLM meeting – worked on the diagram for the egalitarian AI paper

Phil 12.11.2024

Ugh:

SBIRs

  • See how the line segment intersection code improves things. Maybe make some test files?
  • Also, I want to see how fast I could calculate a 256×256 grid given a source, destination, and a spectator.
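For reference, a minimal sketch of a 2D line segment intersection test, which could serve as a starting point for test files. This is not the project code: the function name, the tuple-based point representation, and the choice to treat parallel or collinear segments as non-intersecting are all assumptions made for illustration.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def segment_intersection(p1: Point, p2: Point, p3: Point, p4: Point,
                         eps: float = 1e-12) -> Optional[Point]:
    """Return the point where segment p1-p2 crosses segment p3-p4, or None.

    Hypothetical helper: solves p1 + t*r = p3 + u*s for t and u using
    2D cross products, and accepts the hit only if both lie in [0, 1].
    """
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]  # direction of the first segment
    sx, sy = p4[0] - p3[0], p4[1] - p3[1]  # direction of the second segment
    denom = rx * sy - ry * sx              # cross product r x s
    if abs(denom) < eps:
        # Parallel or collinear: assumed here to mean "no single intersection"
        return None
    qx, qy = p3[0] - p1[0], p3[1] - p1[1]
    t = (qx * sy - qy * sx) / denom        # position along the first segment
    u = (qx * ry - qy * rx) / denom        # position along the second segment
    if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0:
        return (p1[0] + t * rx, p1[1] + t * ry)
    return None
```

Running this over every cell of a 256×256 grid in pure Python would mean 65,536 calls per pass, so a vectorized version is probably worth trying if speed matters.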

GPT Agents

  • 3:00 Alden meeting. Bring up the GPTZero / Chess model as human result.
  • Work on book?

Phil 12.10.2024

Open source maintainers are drowning in junk bug reports written by AI

  • “Recently I’ve noticed an uptick in extremely low-quality, spammy, and LLM-hallucinated security reports to open source projects,” he wrote, pointing to similar findings from the Curl project in January. “These reports appear at first glance to be potentially legitimate and thus require time to refute.”

SBIRs

  • 9:00 standup
  • Finish refactoring and integrate with trajectories?

GPT Agents

  • Add a section on Salt Typhoon

Phil 12.9.2024

Write up something about the chess model fooling GPTZero:

SBIRs

  • 9:00 tax thing?
  • 3:00 Tradeshow demo tagup
  • Work on getting foms generated. Small runs first! Got some good refactoring done and then pulled into 2025 planning

GPT Agents

  • Good progress on the KA book! Need to bring in some content from the slide deck – nah. Didn’t really work
  • Reach out to Dr. Bryson about proposal – done
  • Ping Greg too – done

Phil 12.6.2024

Tasks

  • Bills – done
  • Clean house – done
  • Dishes – done
  • Laundry – done
  • Groceries – done
  • Tires? Goodyear is closed weekends. Scheduled for Thursday
  • And it seems I have to do a self-assessment – done

Phil 12.5.2024

Signed up for https://www.arliai.com/ as an AI inference service.

Tires! (410) 415-1411 10:00am – 7:00pm – nope

Passport: https://travel.state.gov/content/travel/en/passports/how-apply/processing-times.html – Nope, too soon. Has to be within a year

Translated from Romanian (source):

  • The secret services declassified the information about Călin Georgescu: Support from people who threatened Romania’s sovereignty
  • The activity on TikTok was reportedly coordinated by a state actor
  • Votes purchased
  • Similar Russian campaign in Ukraine

Here’s an English version from dw.com: EU probes TikTok after surprise win in Romania election

SBIRs

  • Update password!
  • 9:00 standup
  • 9:15 proposal go/no go meeting
  • 12:45 USNA? Yeah. Not much fun
  • 2:30 Hall Research. Forgot about this one
  • 4:30 Book club – delayed – Rukan can’t make it
  • In between all these things, work on the demo code. I need a simpler trajectory for the baseline, so I need to do that first. Done. And fixed a bunch of stuff. Also put placekeepers in GitLab

GPT Agents

  • 2:45 LLM meeting

Phil 12.4.2024

Going to this because it is the front line for human rights these days.

And then there is this. It’s a great read: Six hours under martial law in Seoul

SBIRs

  • More demo. Try to generate a spreadsheet with the fom curves? Done!

GPT Agents

  • Along the lines of White Hat AI, looking to put together a labeler that recognizes manipulation techniques and also LLM detection. I think I could host everything on https://www.arliai.com/, and maybe use CLIP to look for classes of memes

Phil 12.2.2024

Did more rewriting of the proposal and sent the current version to Carlos

SBIRs

  • Working on the demo – fixed the problem I was having last week. Good progress overall, but I’m not quite at the point where I can calculate the distance to the trajectory line from the intercept point.
  • 1:30 CoA meeting
  • 3:00 demo tagup
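A rough sketch of the distance from a point to a trajectory line, assuming a 2D case where the trajectory is a straight line through two known points. The function name and arguments are hypothetical and not taken from the demo code; a 3D trajectory would need the vector cross product instead.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def distance_point_to_line(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the infinite line through a and b.

    Hypothetical helper: uses |(b - a) x (p - a)| / |b - a|. If a and b
    coincide, there is no line, so the distance to the point a is returned.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    return abs(dx * (p[1] - a[1]) - dy * (p[0] - a[0])) / length
```

Here p would be the intercept point, and a and b any two points on the trajectory line.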

Phil 11.29.2024

It’s looking a lot like winter. Going to have to pull out the warm things:

Tasks

  • Bills – done
  • Laundry
  • Clean House
  • Ignite paperwork
  • Put together a Gantt chart and a tasking table for the proposal (spreadsheet in assets), then send to Greg, Carlos, and Thorsten