Monthly Archives: May 2021

Phil 5.29.21

Adding an incrementing value to an existing table in MySQL

I’ve been working on the Yelp dataset, and realized that I had forgotten to include a simple way to order the table. There is a review ID and date, but those can take a lot of time to work with. I wanted to add a row_id field after creating the table, and then fill it with incrementing numbers. That took a little work to figure out, but here’s a full toy example based on this Stack Overflow post. The table is very simple:

I initially populate it with only str values:

insert into table_test(str) values ('qwerty'), ('asdfgh'), ('zxcvbn'), ('qwerty');

That sets values in the table:

I then create the procedure with a delimiter:

/* set delimiter */
DELIMITER $$
/* remove procedure if exists... */
DROP PROCEDURE IF EXISTS insert_it $$
/* create procedure */
CREATE PROCEDURE insert_it ()
BEGIN
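    /* next row_id value to assign */
    DECLARE varcount INT DEFAULT 1;
    /* number of rows to fill; 4 matches the rows inserted above */
    DECLARE varmax INT DEFAULT 4;

    /* each pass fills exactly one row that has no row_id yet */
    WHILE varcount <= varmax DO
        UPDATE table_test SET row_id = varcount WHERE row_id IS NULL LIMIT 1;
        SET varcount = varcount + 1;
    END WHILE;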
END $$
/* reset delimiter back to normal */
DELIMITER ;

Then you can run it and check the results:

/* call procedure */
CALL insert_it();
select * from table_test;

Which fills out the row_id in the table!
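The procedure above hard-codes varmax to the four toy rows. As a minimal sketch of the same idea that derives the count from the table instead, here is a Python version that uses the standard-library sqlite3 module as a stand-in for MySQL. The schema (a str column and a nullable row_id) is an assumption, since the table definition isn't reproduced here. SQLite's UPDATE doesn't accept LIMIT by default, so the sketch picks one unfilled row per pass through a subquery on SQLite's built-in rowid.

```python
import sqlite3


def fill_row_ids(conn: sqlite3.Connection) -> int:
    """Assign 1..N to row_id for the N rows where it is NULL; return N.

    Assumes no row has a row_id yet, as in the toy example.
    """
    cur = conn.cursor()
    # Count the unfilled rows instead of hard-coding the loop limit.
    (todo,) = cur.execute(
        "SELECT COUNT(*) FROM table_test WHERE row_id IS NULL"
    ).fetchone()
    for count in range(1, todo + 1):
        # Fill exactly one unfilled row per pass, in insertion order.
        cur.execute(
            "UPDATE table_test SET row_id = ? WHERE rowid = ("
            "SELECT rowid FROM table_test WHERE row_id IS NULL "
            "ORDER BY rowid LIMIT 1)",
            (count,),
        )
    conn.commit()
    return todo


conn = sqlite3.connect(":memory:")
# Assumed schema: the original table definition is not shown.
conn.execute("CREATE TABLE table_test (str TEXT, row_id INTEGER)")
conn.executemany(
    "INSERT INTO table_test(str) VALUES (?)",
    [("qwerty",), ("asdfgh",), ("zxcvbn",), ("qwerty",)],
)
filled = fill_row_ids(conn)
rows = conn.execute(
    "SELECT str, row_id FROM table_test ORDER BY row_id"
).fetchall()
```

On a table the size of the Yelp reviews, one UPDATE per row would likely be slow, so this is only meant to mirror the toy procedure.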

Phil 5.28.21

Martin Vargic has a new map of the internet (available here)

Automatic detection of influential actors in disinformation networks

The weaponization of digital communications and social media to conduct disinformation campaigns at immense scale, speed, and reach presents new challenges to identify and counter hostile influence operations (IOs). This paper presents an end-to-end framework to automate detection of disinformation narratives, networks, and influential actors. The framework integrates natural language processing, machine learning, graph analytics, and a network causal inference approach to quantify the impact of individual actors in spreading IO narratives. We demonstrate its capability on real-world hostile IO campaigns with Twitter datasets collected during the 2017 French presidential elections and known IO accounts disclosed by Twitter over a broad range of IO campaigns (May 2007 to February 2020), over 50,000 accounts, 17 countries, and different account types including both trolls and bots. Our system detects IO accounts with 96% precision, 79% recall, and 96% area-under-the precision-recall (P-R) curve; maps out salient network communities; and discovers high-impact accounts that escape the lens of traditional impact statistics based on activity counts and network centrality. Results are corroborated with independent sources of known IO accounts from US Congressional reports, investigative journalism, and IO datasets provided by Twitter.

The geometry of decision-making

Choosing among spatially-distributed options is a central challenge for animals, from deciding among alternative potential food sources or refuges, to choosing with whom to associate. Using an integrated theoretical and experimental approach (employing immersive virtual reality), we consider the interplay between movement and vectorial integration during decision-making regarding two, or more, options in space. In computational models of this process we reveal the occurrence of spontaneous and abrupt “critical” transitions (associated with specific geometrical relationships) whereby organisms spontaneously switch from averaging vectorial information among, to suddenly excluding one, among the remaining options. This bifurcation process repeats until only one option—the one ultimately selected—remains. Thus we predict that the brain repeatedly breaks multi-choice decisions into a series of binary decisions in space-time. Experiments with fruit flies, desert locusts, and larval zebrafish reveal that they exhibit these same bifurcations, demonstrating that across taxa and ecological context, we show that there exist fundamental geometric principles that are essential to explain how, and why, animals move the way they do.

Book

Working on map chapter/article
2:00 Meeting with Michelle

SBIR

Pulled some papers for Ron
Need to sync up with Rukan – done! Really nice work. We need to produce better statistics for analyzing ensembles
More writing. The abstracts are due Monday! Uploaded map:

GPT Agents

Finished processing the Yelp files. Backing up the DB
3:00 meeting?
Training corpora
- Only star ratings
- Business name, type, review, then star ratings
Generate 1,000,000 line samples that are based on different businesses?
Automatic ablation study of 10k, 20k, … 1M corpora
Look for different ways to name the same thing that tells something about who you are. (look for racist ways of describing food?) Analog of #chinavirus and #Sars-Cov-2
- Paki vs. Pakistani, Curry vs. Indian, Chinese vs. Takeout.
- Invite all for a presentation next Tuesday at 3:00 for 90 minutes (include Fatima and Arpita)

Phil 5/22 – 5/26

On vacation, but still keeping track of a few things

Truth, Lies, and Automation: How Language Models Could Change Disinformation

Growing popular and industry interest in high-performing natural language generation models has led to concerns that such models could be used to generate automated disinformation at scale. This report examines the capabilities of GPT-3–a cutting-edge AI system that writes text–to analyze its potential misuse for disinformation. A model like GPT-3 may be able to help disinformation actors substantially reduce the work necessary to write disinformation while expanding its reach and potentially also its effectiveness.

A quick thought about organizing topics from the GPT-3.

For each topic, have the GPT define the phrase – something like “___ is a complex subject. Here is a one-paragraph overview of ___”
Using Doc2Vec or something similar, cluster all the overview paragraphs
Order the topic names by occurrence, possibly with some similarity filtering as well
Use the best topic, and keep the descriptions for enhancing the map display using popups
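A rough sketch of the "order by occurrence, with some similarity filtering" step from the list above. It does not cover the Doc2Vec clustering; plain string similarity from the standard-library difflib module is swapped in as a stand-in for embedding similarity. The function name, the 0.8 threshold and the sample names are all assumptions for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def rank_topics(names, threshold=0.8):
    """Order topic names by occurrence, folding near-duplicates together."""
    counts = Counter(n.strip().lower() for n in names)
    ranked = []  # [representative_name, total_count] pairs
    for name, count in counts.most_common():
        for entry in ranked:
            # Fold this name into the first sufficiently similar representative.
            if SequenceMatcher(None, name, entry[0]).ratio() >= threshold:
                entry[1] += count
                break
        else:
            ranked.append([name, count])
    # Folding can change totals, so sort again (stable for ties).
    ranked.sort(key=lambda e: e[1], reverse=True)
    return [(name, count) for name, count in ranked]


# Hypothetical topic names, as if returned by several prompts.
topics = rank_topics(["Vaccines", "vaccines", "vaccine", "autism", "Autism", "5G"])
best_topic = topics[0][0]
```

Here "vaccine" folds into "vaccines", so the most frequent spelling ends up as the representative name for the group.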

Phil 5.21.21

Another big writing day

GPT-Agents

Currently at 7.5M reviews ingested
Need to integrate the DB into the interactive code
Need to clean up the interactive code so that there is a callback dispatcher that handles all the ins and outs, rather than the current multiple callbacks
Need to make a component class that keeps the html/dash elements along with names, Inputs and Outputs so that the important elements aren’t scattered all over the code
3:30 Meeting with Sim to go over Twitter API

SBIR

Good discussion with Rukan. We were able to do a bit of regression analysis on loss with respect to parameters, though the Bayesian search got stuck in an odd place. Turns out that less is better. Trying a grid search next
Coordination without communication abstract
Commercialization section
Slides

Book

Made more progress on the article than I thought I would
2:00 Meeting with Michelle – she likes the direction it’s going! Need something for the beginning. Also, incorporate her edits on Scratch.

Phil 5.20.21

Another big writing day

GPT-Agents

Currently at 6.5M reviews ingested
Need to integrate the DB into the interactive code
Need to clean up the interactive code so that there is a callback dispatcher that handles all the ins and outs, rather than the current multiple callbacks
Need to make a component class that keeps the html/dash elements along with names, Inputs and Outputs so that the important elements aren’t scattered all over the code
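One possible shape for the dispatcher and component class described above, as a framework-free sketch. It does not import Dash, and the class names, method names and the "seed-text.value" output id are hypothetical; only 'save-selected-btn' comes from the interactive code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Component:
    """Keeps an element's name next to the property ids it reads and writes."""
    name: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


class CallbackDispatcher:
    """Routes a triggering prop_id to the handler registered for it."""

    def __init__(self) -> None:
        self.components: Dict[str, Component] = {}
        self.handlers: Dict[str, Callable[..., object]] = {}

    def register(self, component: Component, handler: Callable[..., object]) -> None:
        self.components[component.name] = component
        # Every input of the component triggers the same handler.
        for prop_id in component.inputs:
            self.handlers[prop_id] = handler

    def dispatch(self, prop_id: str, *args):
        handler = self.handlers.get(prop_id)
        if handler is None:
            return None  # nothing registered for this trigger
        return handler(*args)


seeds: List[str] = []


def save_selected(labels):
    """Toy handler: append the checked labels and return the seed text."""
    seeds.extend(labels)
    return ", ".join(seeds)


dispatcher = CallbackDispatcher()
dispatcher.register(
    Component(
        name="save",
        inputs=["save-selected-btn.n_clicks"],
        outputs=["seed-text.value"],  # hypothetical output id
    ),
    save_selected,
)
result = dispatcher.dispatch("save-selected-btn.n_clicks", ["vaccines", "autism"])
```

In a real Dash app, the single callback would read the triggering prop_id from the callback context and hand it to something like dispatch().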

SBIR

9:15 Standup
11:00 Meeting with Bob
More abstracts. Use some of the text from the note to Antonio?
While generating content, I tweaked the conspiracy map:

https://viztales.com/wp-content/uploads/2021/05/vaccines_cause_autism_2a.png

More phase2 proposal

Book

Continue with article

Phil 5.19.21

Big writing day

GPT-Agents

Currently at 5.6M reviews ingested, and I have an interesting football and IMDB dataset to work with later
Need to integrate the DB into the interactive code
Need to clean up the interactive code so that there is a callback dispatcher that handles all the ins and outs, rather than the current multiple callbacks
Need to make a component class that keeps the html/dash elements along with names, Inputs and Outputs so that the important elements aren’t scattered all over the code

SBIR

9:30 Meeting with Rukan
10:00 Weekly meeting
Abstracts (make a map?)
Proposal

Book

Outline article

Phil 5.18.21

Flynn successfully defended yesterday!

I am fascinated by this Flyby chart from Strava from the Giro yesterday:

https://www.strava.com/activities/5312798411

It shows the ride of Thomas De Gendt, who stayed with the main peloton (the black line), and how others diverged from it. You can see the breakaway (green line at the top), “nature breaks” (the small, sharp drops that then rise back), the attack by Bora–Hansgrohe on the final climb, the people getting dropped (then forming the autobus), and the high-speed run-in at the end of the race. It’s the whole race in a single chart.

https://twitter.com/earbatli/status/1394579470019928064

GPT Agents

At nearly 5 million reviews, so we’re a bit over halfway through. Should be finished by Friday
More work on the interactive map app.
Here’s how you get the context for the click and avoid the click-counting hack:

import dash

def save_selected(self, n_clicks, nodes_index_list):
    # find out which input fired this callback
    ctx = dash.callback_context
    prop_id = ctx.triggered[0]['prop_id']
    if nodes_index_list is None:
        nodes_index_list = []

    # only save when the button was the trigger, not the checklist
    if 'save-selected-btn' in prop_id:
        for i in nodes_index_list:
            d = self.checkbox_list[int(i)]
            print(d)
            self.seed_list.append(d['label'])
        # return the updated seed text, and clear out the checkboxes
        return ", ".join(self.seed_list), []

    # any other trigger: leave the seed text and the checks unchanged
    return ", ".join(self.seed_list), nodes_index_list

Need to group similar and update the list
- Look through the existing nodes for matches. As they are found, delete from list
- Look through the remaining and create temp nodes. For each temp node, iterate over the rest of the list as above. Produce a global dictionary of name-node pairs
- Produce the checklist from the names of the nodes in the dict
- Add checked nodes to the graph and clear the dictionary
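The grouping steps above could be sketched roughly like this. What counts as a "match" isn't specified, so the sketch assumes case-insensitive string similarity from difflib with a 0.8 threshold; the function names and sample names are hypothetical.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumed matching rule: case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def group_candidates(existing, candidates):
    """Return {name: [variants]} for candidates that match no existing node."""
    # Step 1: drop candidates that match a node already in the graph.
    remaining = [c for c in candidates
                 if not any(similar(c, e) for e in existing)]
    # Step 2: each leftover name starts a temp node that absorbs
    # the look-alikes further down the list.
    temp_nodes = {}
    while remaining:
        name = remaining.pop(0)
        variants = [r for r in remaining if similar(name, r)]
        remaining = [r for r in remaining if r not in variants]
        temp_nodes[name] = [name] + variants
    return temp_nodes


temp = group_candidates(
    existing=["Vaccines"],
    candidates=["vaccine", "Chemtrails", "chemtrail", "Flat Earth"],
)
# Step 3: the checklist labels come from the names in the dict.
checklist = list(temp)
```

Once the checked names have been added to the graph, the dictionary would be cleared before the next pass.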
3:00 Meeting
- Looking for other social-media-like data with ground truth, and found some interesting soccer and IMDB data
- Built my first good conspiracy map using the interactive map and showed it off

https://viztales.com/wp-content/uploads/2021/05/vaccines_cause_autism_1.png

SBIR

Standup
Post-standup meeting with Rukan
More work on the proposal and abstracts

Phil 5.17.21

We lost power on Thursday when a tree lost a GIANT limb that fell on a power line, and took out the Verizon lines as well. I got some things back up when the power was restored, though that took longer than just turning on the house. The current spike took out some hardware, including a power strip (yay! Not the computer!), but I didn’t have a spare strip (Boo!). And Friday afternoon I was using the phone as a hotspot.

Anyway, everything’s mostly back to normal

GPT-Agents

At 4 million reviews ingested
Working in the interactive graph tool. It’s going to have to go on the back burner for a week, but I want to stub out InteractiveNode, which will handle similarity matching, links, and saving out to the DB
Built out the InteractiveNode, then spent about an hour figuring out how to do it in Plotly. There are two tricks. Selected checks are in an array. An empty array clears them out, which is handled as an output. But I also need the list of selected checks to build my graph before I clean them out, and that also triggers the callback. So I have to watch n_clicks for the button as it compares to a global value. Kind of a hack, but I can’t think of anything better

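def save_selected(self, n_clicks, nodes_index_list):
    if nodes_index_list is None:
        nodes_index_list = []
    # click count unchanged, so the checklist fired the callback: change nothing
    if n_clicks == self.save_selected_clicks:
        return ", ".join(self.seed_list), nodes_index_list
    # a new button click: remember the count, then save the checked items
    self.save_selected_clicks = n_clicks
    for i in nodes_index_list:
        d = self.checkbox_list[int(i)]
        print(d)
        self.seed_list.append(d['label'])
    # return the updated seed text, and an empty list to clear the checkboxes
    return ", ".join(self.seed_list), []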

SBIR

Got a lot of catching up to do
Write the two abstracts for the NATO conference – roughed them out and put them on Overleaf
Compute meeting. Looks like we might buy some nice hardware, because IT is so wrapped up with security that we can’t develop on an AWS or Azure instance, which would be much cheaper.

5:30 – 7:00 Meeting with Andreea. We talked about a lot, but the idea of training a Transformer to translate between English and Maori-English slang seems particularly interesting. Also some exploration about how the GPT-3 might afford some insight into perceptions about this. Here’s an example (prompt is in bold)

Vision Mātauranga is polarizing because it is both radical and conservative. Both the radical part and the conservative part are necessary for mana motuhake. The radical part is about opening up to new ideas, new ways of seeing things, rejecting old ideas and systems that are no longer relevant to our needs and ways of living. The conservative part is about having the humility to keep the things that work, that are relevant, that are meaningful. We need to be able to accept that there are many ways of knowing and understanding the world, and we (as Māori) have our own way of knowing and understanding the world.

Phil 5.13.21

Normally, I’d be doing my plots of COVID deaths for the month of April, but the disease is now working its way through countries that are not accurately reporting counts. I heard today on the BBC that India’s counts could be 2-8 times higher than reported.

GPT Agents

Good Gephi filters tutorial
After making a bunch of maps yesterday, and in particular, struggling with the conspiracy theory map that has no useful Wikipedia ground truth to eliminate cruft, I realize I’m going to have to build a more interactive tool. It should be useful for other things, like Antonio’s concept mapper. It can also support multiple prompts, like
- “A short list of {}”
- “A short list of {} that are similar to {}”
- “A short list of the elements that make up {}”
The human chooses the nodes that make sense, and intermediate networks are drawn at each pass through the results. The exit is manual, and writing out a gml file can happen at any time
Going to try Plotly for this. If I can make dynamic lists of checkboxes, then I should be ok, otherwise TKinter
- Making progress with Plotly!
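A small sketch of how the prompt templates listed above might be filled from the nodes the human picks. The three template strings are the ones quoted above; the helper names and the rule of matching templates by their number of {} slots are assumptions.

```python
from string import Formatter

TEMPLATES = [
    "A short list of {}",
    "A short list of {} that are similar to {}",
    "A short list of the elements that make up {}",
]


def slot_count(template: str) -> int:
    """Number of {} placeholders in a template."""
    return sum(1 for _, field_name, _, _ in Formatter().parse(template)
               if field_name is not None)


def build_prompts(seeds):
    """Fill every template whose slot count matches the number of seeds."""
    return [t.format(*seeds) for t in TEMPLATES if slot_count(t) == len(seeds)]


one_seed = build_prompts(["conspiracy theories"])
two_seeds = build_prompts(["languages", "Samoan"])
```

With one seed this yields the first and third prompts, and with two seeds only the "similar to" prompt.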

Got everything working! Going to make it a class now

5:00 Meeting

SBIR

9:15 Standup
Meet with Rukan after to see how things are going
Create final report template with material from previous reports
Set up meeting with Clay to discuss commercialization strategy

Phil 5.12.21

SBIR

9:30 Meeting with Rukan to see what our results are from the overnight runs
10:00 Group meeting. Need to discuss proposal and share Overleaf template

GPT Agents

Still filling up the Yelp db. Currently at around 500,000 reviews
Language map – Send a copy to Andreea when done. This one is based on the same repeated prompt, because I screwed up the template code

https://viztales.com/wp-content/uploads/2021/05/image-2.png

Language map using seeds of English, Chinese, and Samoan

https://viztales.com/wp-content/uploads/2021/05/language_3.png

Philosophy Map using seeds of Utilitarianism and Hedonism

https://viztales.com/wp-content/uploads/2021/05/philosophy_1.png

Food Map using seeds of Pasta, Hamburger, Lettuce, Avocado and Cheese

https://viztales.com/wp-content/uploads/2021/05/food_3-1.png

Conspiracy theories seeded with “vaccines cause autism”

https://viztales.com/wp-content/uploads/2021/05/conspiracy_1-2.png

JuryRoom

7:00 Meeting

Phil 5.11.21

Deep Learning applications for COVID-19

This survey explores how Deep Learning has battled the COVID-19 pandemic and provides directions for future research on COVID-19. We cover Deep Learning applications in Natural Language Processing, Computer Vision, Life Sciences, and Epidemiology. We describe how each of these applications vary with the availability of big data and how learning tasks are constructed. We begin by evaluating the current state of Deep Learning and conclude with key limitations of Deep Learning for COVID-19 applications. These limitations include Interpretability, Generalization Metrics, Learning from Limited Labeled Data, and Data Privacy. Natural Language Processing applications include mining COVID-19 research for Information Retrieval and Question Answering, as well as Misinformation Detection, and Public Sentiment Analysis. Computer Vision applications cover Medical Image Analysis, Ambient Intelligence, and Vision-based Robotics. Within Life Sciences, our survey looks at how Deep Learning can be applied to Precision Diagnostics, Protein Structure Prediction, and Drug Repurposing. Deep Learning has additionally been utilized in Spread Forecasting for Epidemiology. Our literature review has found many examples of Deep Learning systems to fight COVID-19. We hope that this survey will help accelerate the use of Deep Learning for COVID-19 research.

Word embeddings quantify 100 years of gender and ethnic stereotypes

Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding helps to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 y of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts—e.g., the women’s movement in the 1960s and Asian immigration into the United States—and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embedding opens up a fruitful intersection between machine learning and quantitative social science

How to make a racist AI without really trying

SBIR

Sprint planning – I’m going to be busy
More work with Rukan. We’re going to focus on some simple spikes
- The simple spikes look great. We’re going to do a sensitivity analysis on the MDS data now
Got my fancy query working

create or replace view view_combined as
    select distinct e.id, e.name, e.description, s1.value as dimension_size, s2.value as layers,
                    r1.value as avg_cos_loss, r2.value as avg_l1_loss from
        table_experiment e
            join table_settings s1 on e.id = s1.experiment_id and s1.name = 'dimension_size'
            join table_settings s2 on e.id = s2.experiment_id and s2.name = 'layers'
            join table_results r1 on e.id = r1.experiment_id and r1.name = 'avg cosine loss'
            join table_results r2 on e.id = r2.experiment_id and r2.name = 'avg l1 loss';
select * from view_combined where id = 100;
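To show what the view does, here is a minimal sketch that runs the same joins against toy tables, using Python's sqlite3 module as a stand-in for MySQL. The column types and every row of data are made up for illustration, and SQLite needs a plain "create view" because it has no "create or replace view".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- assumed column types; the real schema is not shown
create table table_experiment (id integer primary key, name text, description text);
create table table_settings (experiment_id integer, name text, value text);
create table table_results (experiment_id integer, name text, value real);

-- made-up rows, for illustration only
insert into table_experiment values (100, 'toy run', 'illustrative only');
insert into table_settings values (100, 'dimension_size', '64'), (100, 'layers', '4');
insert into table_results values (100, 'avg cosine loss', 0.25), (100, 'avg l1 loss', 0.5);

create view view_combined as
    select distinct e.id, e.name, e.description, s1.value as dimension_size, s2.value as layers,
                    r1.value as avg_cos_loss, r2.value as avg_l1_loss from
        table_experiment e
            join table_settings s1 on e.id = s1.experiment_id and s1.name = 'dimension_size'
            join table_settings s2 on e.id = s2.experiment_id and s2.name = 'layers'
            join table_results r1 on e.id = r1.experiment_id and r1.name = 'avg cosine loss'
            join table_results r2 on e.id = r2.experiment_id and r2.name = 'avg l1 loss';
""")

# Each name/value row becomes a column of the single combined row.
row = conn.execute("select * from view_combined where id = 100").fetchone()
```

Because these are inner joins, an experiment that is missing any one of the four setting or result rows drops out of the view entirely.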

GPT-Agents

Parsing Yelp
Pew Center: Search Results For: covid
3:00 Meeting

Phil 5.10.21

3:00 Dentist

GPT-Agents

Yelp parser
Try maps of food, fashion(!), movies, books, politicians, etc?
4:30 meeting with Andreea

SBIR

Make slides for sprint review
Sprint review

Phil 5.6.21

Get trailer!

https://twitter.com/__kolesnikov__/status/1390006566796107777

GPT-Agents

Posted some pix to the OpenAI slack channel. Let’s see if there is any response. I should also post to r/dataisbeautiful

SBIR

9:15 standup
Talk to Rukan about this and this
GELU stands for Gaussian Error Linear Unit
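For reference, the exact form is GELU(x) = x * Phi(x), where Phi is the standard normal CDF. A minimal sketch in plain Python; deep learning libraries often use a faster approximation, which is not shown here.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Near-linear for large positive x, near zero for large negative x.
samples = [gelu(v) for v in (-3.0, 0.0, 3.0)]
```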

1:30 Data Science tag-up

Book

More editing

Phil 5.5.21

GPT-Agents

Update and submit paper (ArXiv and SocialSens) – done!

SBIR

Phase 2 proposal kickoff
Weekly tagup
AI/ML tagup (mention paper acceptance)

Book

Continue rolling in changes

JuryRoom

Worked on the intro to Pryvank’s paper
7:00 Meeting

Phil 5.4.21

Amazing Animated Star Wars Fighter Ships - Best Animations — May the Fourth be with you and all that

See if I can get this trailer – done!

SBIR

9:15 status meeting. It looks like I’ll be working on the phase 2 proposal for the rest of the week?
8:45 pre-standup with Rukan to see how things are going
- Looks like we are going to improve our experiment pipeline since we seem to be losing data. Rukan is looking into what it takes to get MySQL installed on his instance

GPT Agents

3:00 Meeting
- https://ai.stanford.edu/~amaas/data/sentiment/ train a model and get the distributions
- https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/ppc.pdf
- https://www.yelp.com/dataset
- Working to identify bias in the data and mitigate bias in the system
- A list of countries that share a border with {}, separated by commas
I still haven’t entirely fixed my UTF-8 problem
Start writing up something about the belief maps to add to the chess paper, and maybe as an overall article
- Country counts (150 vs 195 with no false positives, excluding six prompt countries, 76% coverage) Missing countries include Guadalupe, Guyana, Israel, Jordan, Lebanon, Madagascar, Liberia, Micronesia, Niger, Paraguay, Senegal, Sri Lanka, Tunisia, Uruguay, Venezuela, and Yemen
- Religion counts?
- New favorite map:

https://viztales.com/wp-content/uploads/2021/05/world_4.png

Central America insert

Compared with actual map

Credit: By Cacahuate, amendments by Joelf – Own work based on the blank world map, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=22746265

Book

Start working on edits
Send Chris email. Done!