Extract the sentiment into a workbook. It looks like it should be pretty easy:
select count(*) as count, probe from table_output where experiment_id = 89 and tag = 'raw' and sent_label = 'NEGATIVE' group by probe order by probe;
Continue on paper, upload to Overleaf, too
Meeting at 5:00
SBIR
More work with Rukan? Need to figure out why a 5×256 is going in, but a 256×256 is coming out. We could try an attention layer first. Let’s see how things go?
Do algorithm-driven news sources have different effects on political behavior when compared to non-algorithmic news sources? Media companies compete for our scarce time and attention; one way they do this is by leveraging algorithms to select the most appealing content for each user. While algorithm-driven sites are increasingly popular sources of information, we know very little about the effects of algorithmically determined news at the individual level. The objective of this paper is to define and measure the effects of algorithmically generated news. We begin by developing a taxonomy of news delivery by distinguishing between two types of algorithmically generated news, socially driven and user-driven, and contrasting these with non-algorithmic news. We follow with an exploratory analysis of the effects of these news delivery modes on political behavior, specifically political participation and polarization. Using two nationally representative surveys, one of young adults and one of the general population, we find that getting news from sites that use socially driven or user-driven algorithms to generate content corresponds with higher levels of political participation, but that getting news from non-algorithmic sources does not. We also find that neither non-algorithmic nor algorithmically determined news contribute to higher levels of partisan polarization. This research helps identify important variation in the consequences of news consumption contingent on the mode of delivery.
GPT Agents
Finished POS tokenizing terms
Started sentiment (POS/NEG) on terms – done
Stubbed out the POS and sentiment for the token full_string – done
Monitoring social discourse about COVID-19 vaccines is key to understanding how large populations perceive vaccination campaigns. We focus on 4765 unique popular tweets in English or Italian about COVID-19 vaccines between 12/2020 and 03/2021. One popular English tweet was liked up to 495,000 times, stressing how popular tweets affected cognitively massive populations. We investigate both text and multimedia in tweets, building a knowledge graph of syntactic/semantic associations in messages including visual features and indicating how online users framed social discourse mostly around the logistics of vaccine distribution. The English semantic frame of “vaccine” was highly polarised between trust/anticipation (towards the vaccine as a scientific asset saving lives) and anger/sadness (mentioning critical issues with dose administering). Semantic associations with “vaccine,” “hoax” and conspiratorial jargon indicated the persistence of conspiracy theories and vaccines in massively read English posts (absent in Italian messages). The image analysis found that popular tweets with images of people wearing face masks used language lacking the trust and joy found in tweets showing people with no masks, indicating a negative affect attributed to face covering in social discourse. A behavioural analysis revealed a tendency for users to share content eliciting joy, sadness and disgust and to like less sad messages, highlighting an interplay between emotions and content diffusion beyond sentiment. With the AstraZeneca vaccine being suspended in mid March 2021, “Astrazeneca” was associated with trustful language driven by experts, but popular Italian tweets framed “vaccine” by crucially replacing earlier levels of trust with deep sadness. Our results stress how cognitive networks and innovative multimedia processing open new ways for reconstructing online perceptions about vaccines and trust.
There is also an API that gives you more control described here.
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
model = BertForSequenceClassification.from_pretrained("bert-large-uncased")
training_args = TrainingArguments(
output_dir='./results', # output directory
num_train_epochs=3, # total # of training epochs
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=64, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
)
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=test_dataset # evaluation dataset
)
Got tired of recalculating parts-of-speech, so I added a field to table_output for that and sentiment. currently reprocessing all the tables from Fauci/Trump forward.
Update the Overleaf doc
Figuring out what to do with the chess paper with Antonio
import flair
from flair.models import TextClassifier
flair_sentiment = TextClassifier.load('en-sentiment')
text="Avengers: Infinity War is a giant battle for which directors Anthony and Joe Russo have given us touches of JRR Tolkien’s Return of the King and JK Rowling’s Harry Potter and the Deathly Hallows. The film delivers the sugar-rush of spectacle and some very amusing one-liners."
sentence=flair.data.Sentence(text)
flair_sentiment.predict(sentence)
total_sentiment = sentence.labels
print(total_sentiment)
Output:
[POSITIVE (0.9994151592254639)]
SBIR
Sprint planning
Trying to get a charge number for the RFI response – done
IT’s the end of the month, so it’s time for these two charts again. i think we’re seeing the Vaccines starting to have and effect? At least Switzerland seems to have gotten its second wave under control. Italy though…
Proposal. Boy, was that interesting. We had one vague paragraph to go on. I fed that into the GPT playground along with some additional text to structure the response. and it damn near wrote the whole thing, pulling latent knowledge out of the model. I’l do a more detailed writeup later.
Got a text from Maryland asking if I wanted a shot, and to reply “Y” to set up an appt. I was in the middle of brushing my teeth, so I waited a few minutes. By that time, the slot had been filled. For now, respond immediately!
Pay bills!
GPT Agents
Find a good source that explains “grounding” in NLP
Back up DB
Create Ecco spreadsheets
Start creating the probe/model, term table. Got it done for terms and nouns. Do a second table that is percentages
Need to to ranked tokens next
GOES
2:00 Meeting. Didn’t wind up presenting. Tomorrow
Got 16 hours for writing
JuryRoom
Had a good chat with Jarod. UW is still working on the adjunct thing
The setup is a video call where the author explains a paper to me. We can use screen-sharing, for figures, etc. We’ll record the call and post to YouTube. Possible participants are authors of a paper in network science or data science.
GPT-Agents
Ranking is still running. I really should have checked the amount of data I was generating, but now I have sunk costs!
3:00 Meeting today. I’d like to add something about qualitative research to the discussion section
Create a new set of spreadsheets where all models are compared. Probe is the sheet, the model is the column, and the terms are the rows. Display as heatmap
Need to create a new table for the GPT-3 mapping that has graph information in it. This looks like a good article with examples: Trees and Other Hierarchies in MySQL
SBIR/ONR
Keep working on slide deck – done!
Added a bunch of generators to the data directory
Ping Rukan around 10:00 to start figuring out how to assemble the Transformer. I want to try assembling a one-to-many and a many-to-one set of densely connected layers of arbitrary dimensionality. Started building a stripped-down MLP
Explicitcontrol – AI immediately stops the car, even in the middle of the highway because it interprets demands literally. This is what we have today with assistants such as SIRI and other narrow AIs.
Implicitcontrol – AI attempts to comply safely by stopping the car at the first safe opportunity, perhaps on the shoulder of the road. This AI has some common sense, but still tries to follow commands.
Alignedcontrol – AI understands that the human is probably looking for an opportunity to use a restroom and pulls over to the first rest stop. This AI relies on its model of the human to understand the intentions behind the command.
Delegatedcontrol – AI does not wait for the human to issue any commands. Instead, it stops the car at the gym because it believes the human can benefit from a workout. This is a superintelligent and human-friendly system which knows how to make the human happy and to keep them safe better than the human themselves. This AI is in control.
Machine learning methods offer great promise for fast and accurate detection and prognostication of coronavirus disease 2019 (COVID-19) from standard-of-care chest radiographs (CXR) and chest computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we consider all published papers and preprints, for the period from 1 January 2020 to 3 October 2020, which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. All manuscripts uploaded to bioRxiv, medRxiv and arXiv along with all entries in EMBASE and MEDLINE in this timeframe are considered. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 62 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher-quality model development and well-documented manuscripts.
Many papers gave little attention to establishing the original source of the images
All proposed models suffer from a high or unclear risk of bias in at least one domain
We advise caution over the use of public repositories, which can lead to high risks of bias due to source issues and Frankenstein datasets as discussed above
[Researchers] should aim to match demographics across cohorts, an often neglected but important potential source of bias; this can be impossible with public datasets that do not include demographic information
Researchers should be aware that algorithms might associate more severe disease not with CXR imaging features, but the view that has been used to acquire that CXR. For example, for patients that are sick and immobile, an anteroposterior CXR view is used for practicality rather than the standard posteroanterior CXR projection
We emphasize the importance of using a well-curated external validation dataset of appropriate size to assess generalizability
In recent years, there has been a great deal of concern about the proliferation of false and misleading news on social media1,2,3,4. Academics and practitioners alike have asked why people share such misinformation, and sought solutions to reduce the sharing of misinformation5,6,7. Here, we attempt to address both of these questions. First, we find that the veracity of headlines has little effect on sharing intentions, despite having a large effect on judgments of accuracy. This dissociation suggests that sharing does not necessarily indicate belief. Nonetheless, most participants say it is important to share only accurate news. To shed light on this apparent contradiction, we carried out four survey experiments and a field experiment on Twitter; the results show that subtly shifting attention to accuracy increases the quality of news that people subsequently share. Together with additional computational analyses, these findings indicate that people often share misinformation because their attention is focused on factors other than accuracy—and therefore they fail to implement a strongly held preference for accurate sharing. Our results challenge the popular claim that people value partisanship over accuracy8,9, and provide evidence for scalable attention-based interventions that social media platforms could easily implement to counter misinformation online.
GPT Agents
Ranking is still running
Worked on the workshop paper. Added in a modified version of the intro from the chess paper that uses the GPT-3 now
ONR
Working on literature
SBIR
10:00 Meeting
GOES
2:00 Meeting
Turns out that we still have to do a demo. I need to create some data to show what that would look like. Set up a meeting with Vadim for Friday to make sure all the new code is working
Generated all the scripts – about 700! Tomorrow I’ll run the “sim” and generate training values
Intro – discuss Tay, and how machine learning incorporates human input and reflects it back. This means that we have created ‘oracles’ that we can ask about the populations that contributed to their knowledge. In this type of computational sociology, finding and understanding the biases in these populations is an important part of the research
Introduce finetuned language models. Start with the chess model, and show how we can see the rank of piece terms rise and fall over the course of a sentence
Methods/results – describe the process of extracting chinavirus and sars-cov-2 as potential markers of different populations. Then prompts and runs to see the central terms that the models use. Show the stats. Then using the most popular terms from each model, run Ecco trajectories to show the rank behavior of these terms
Discussion. The possibilities of “interactive snapshots” of a population’s online behavior. The ongoing difficulty in prompt creation. Potential of maps?
Created the template
Note – Create Dr. Fauci and Donald Trump prompts – done!
Worked on getting useful text to look at out of the models. Using flair to scan for POS. That way I can grab the first noun that occurs which makes for less text to look through, and more useful than just looking at the first word. I think that this will also be the approach that I’ll use to pull data out of the GPT-3 for maps.
Finished training the COVID model, and committed to VCS
Got some results for the first term. Going to re-run for some number of terms next. Also played around with the resulting spreadsheets a bit to look for patterns
SBIR
Updating my drivers, verifying that TF still works, and upgrading to PT 1.8