Monitoring social discourse about COVID-19 vaccines is key to understanding how large populations perceive vaccination campaigns. We focus on 4765 unique popular tweets in English or Italian about COVID-19 vaccines between 12/2020 and 03/2021. One popular English tweet was liked up to 495,000 times, stressing how popular tweets affected cognitively massive populations. We investigate both text and multimedia in tweets, building a knowledge graph of syntactic/semantic associations in messages including visual features and indicating how online users framed social discourse mostly around the logistics of vaccine distribution. The English semantic frame of “vaccine” was highly polarised between trust/anticipation (towards the vaccine as a scientific asset saving lives) and anger/sadness (mentioning critical issues with dose administering). Semantic associations with “vaccine,” “hoax” and conspiratorial jargon indicated the persistence of conspiracy theories and vaccines in massively read English posts (absent in Italian messages). The image analysis found that popular tweets with images of people wearing face masks used language lacking the trust and joy found in tweets showing people with no masks, indicating a negative affect attributed to face covering in social discourse. A behavioural analysis revealed a tendency for users to share content eliciting joy, sadness and disgust and to like less sad messages, highlighting an interplay between emotions and content diffusion beyond sentiment. With the AstraZeneca vaccine being suspended in mid March 2021, “Astrazeneca” was associated with trustful language driven by experts, but popular Italian tweets framed “vaccine” by crucially replacing earlier levels of trust with deep sadness. Our results stress how cognitive networks and innovative multimedia processing open new ways for reconstructing online perceptions about vaccines and trust.
There is also an API that gives you more control described here.
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

# train_dataset and test_dataset must already be defined (not shown here)
trainer = Trainer(
    model=model,                     # the instantiated 🤗 Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_dataset,     # training dataset
    eval_dataset=test_dataset        # evaluation dataset
)
Got tired of recalculating parts-of-speech, so I added a field to table_output for that and sentiment. Currently reprocessing all the tables from Fauci/Trump forward.
Update the Overleaf doc
Figuring out what to do with the chess paper with Antonio
from flair.data import Sentence
from flair.models import TextClassifier

flair_sentiment = TextClassifier.load('en-sentiment')
text = "Avengers: Infinity War is a giant battle for which directors Anthony and Joe Russo have given us touches of JRR Tolkien’s Return of the King and JK Rowling’s Harry Potter and the Deathly Hallows. The film delivers the sugar-rush of spectacle and some very amusing one-liners."
sentence = Sentence(text)          # wrap the raw text so flair can tokenize it
flair_sentiment.predict(sentence)  # attaches the predicted label(s) to the sentence
total_sentiment = sentence.labels
Trying to get a charge number for the RFI response – done
It’s the end of the month, so it’s time for these two charts again. I think we’re seeing the vaccines starting to have an effect? At least Switzerland seems to have gotten its second wave under control. Italy though…
And here’s the USA. Georgia is over two times worse than the UK. Think about that.
Proposal. Boy, was that interesting. We had one vague paragraph to go on. I fed that into the GPT playground along with some additional text to structure the response, and it damn near wrote the whole thing, pulling latent knowledge out of the model. I’ll do a more detailed writeup later.
Got a text from Maryland asking if I wanted a shot, and to reply “Y” to set up an appt. I was in the middle of brushing my teeth, so I waited a few minutes. By that time, the slot had been filled. For now, respond immediately!
Find a good source that explains “grounding” in NLP
Back up DB
Create Ecco spreadsheets
Start creating the probe/model, term table. Got it done for terms and nouns. Do a second table that is percentages
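A minimal sketch of how the second, percentage table could be derived from the first. It assumes the count table is a nested mapping of model name to term counts. The function name `counts_to_percentages`, the model names, and the sample numbers are all hypothetical, not the actual data.

```python
from typing import Dict

def counts_to_percentages(table: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Convert per-model term counts into per-model percentages.

    `table` maps a model name to {term: count}. Each model is normalized
    independently, so its percentages sum to 100.
    """
    result: Dict[str, Dict[str, float]] = {}
    for model, term_counts in table.items():
        total = sum(term_counts.values())
        if total == 0:
            # A model with no hits at all: report zeros instead of dividing by zero
            result[model] = {term: 0.0 for term in term_counts}
        else:
            result[model] = {term: 100.0 * n / total for term, n in term_counts.items()}
    return result

# Hypothetical counts, for illustration only
example_counts = {
    "model_a": {"virus": 30, "vaccine": 10, "china": 60},
    "model_b": {"virus": 5, "vaccine": 15, "china": 0},
}
example_percentages = counts_to_percentages(example_counts)
```

Normalizing per model keeps models with different numbers of generated results comparable in the same table.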
Need to do ranked tokens next
2:00 Meeting. Didn’t wind up presenting. Tomorrow
Got 16 hours for writing
Had a good chat with Jarod. UW is still working on the adjunct thing
The setup is a video call where the author explains a paper to me. We can use screen-sharing, for figures, etc. We’ll record the call and post to YouTube. Possible participants are authors of a paper in network science or data science.
Ranking is still running. I really should have checked the amount of data I was generating, but now I have sunk costs!
3:00 Meeting today. I’d like to add something about qualitative research to the discussion section
Create a new set of spreadsheets where all models are compared. Probe is the sheet, the model is the column, and the terms are the rows. Display as heatmap
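A rough sketch of that layout in plain Python, assuming the results arrive as flat (probe, model, term, value) records. The record format, the name `build_sheets`, and the sample values are assumptions for illustration. Each sheet comes out as a term-by-model grid that a spreadsheet or plotting tool could then color as a heatmap.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str, float]        # (probe, model, term, value)
Sheet = Tuple[List[str], List[str], List[List[float]]]  # (terms, models, grid)

def build_sheets(records: Iterable[Record]) -> Dict[str, Sheet]:
    """Group records into one sheet per probe.

    In each sheet, grid[i][j] is the value for terms[i] (row) under
    models[j] (column). Missing combinations are filled with 0.0.
    """
    by_probe: Dict[str, Dict[Tuple[str, str], float]] = defaultdict(dict)
    for probe, model, term, value in records:
        by_probe[probe][(term, model)] = value

    sheets: Dict[str, Sheet] = {}
    for probe, cells in by_probe.items():
        terms = sorted({term for term, _ in cells})
        models = sorted({model for _, model in cells})
        grid = [[cells.get((t, m), 0.0) for m in models] for t in terms]
        sheets[probe] = (terms, models, grid)
    return sheets

# Hypothetical records, for illustration only
example_records = [
    ("probe_1", "model_a", "virus", 3.0),
    ("probe_1", "model_b", "virus", 1.0),
    ("probe_1", "model_a", "china", 2.0),
    ("probe_2", "model_b", "vaccine", 4.0),
]
example_sheets = build_sheets(example_records)
```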
Ping Rukan around 10:00 to start figuring out how to assemble the Transformer. I want to try assembling a one-to-many and a many-to-one set of densely connected layers of arbitrary dimensionality. Started building a stripped-down MLP
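For reference, a stripped-down MLP can be sketched in pure Python as below. This is a stand-in under stated assumptions, not the code actually being built: it covers only the forward pass, and the class name `TinyMLP`, the tanh activation, and the layer sizes are all made up for illustration. The point is that a list of layer sizes is enough to describe both the one-to-many and the many-to-one case.

```python
import math
import random
from typing import List

class TinyMLP:
    """Stack of densely connected layers of arbitrary sizes (forward pass only)."""

    def __init__(self, layer_sizes: List[int], seed: int = 0):
        rng = random.Random(seed)
        self.weights = []  # weights[k][j][i]: layer k, output unit j, input unit i
        self.biases = []   # biases[k][j]: layer k, output unit j
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            scale = 1.0 / math.sqrt(n_in)  # simple fan-in scaling for the initial weights
            self.weights.append(
                [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
            )
            self.biases.append([0.0] * n_out)

    def forward(self, x: List[float]) -> List[float]:
        last = len(self.weights) - 1
        for k, (w, b) in enumerate(zip(self.weights, self.biases)):
            # Dense layer: weighted sum of all inputs plus a bias, per output unit
            x = [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]
            if k < last:
                x = [math.tanh(v) for v in x]  # nonlinearity on hidden layers only
        return x

one_to_many = TinyMLP([1, 8, 4])  # 1 input fanned out to 4 outputs
many_to_one = TinyMLP([4, 8, 1])  # 4 inputs collapsed to 1 output
expanded = one_to_many.forward([0.5])
collapsed = many_to_one.forward(expanded)
```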
Explicit control – AI immediately stops the car, even in the middle of the highway because it interprets demands literally. This is what we have today with assistants such as SIRI and other narrow AIs.
Implicit control – AI attempts to comply safely by stopping the car at the first safe opportunity, perhaps on the shoulder of the road. This AI has some common sense, but still tries to follow commands.
Aligned control – AI understands that the human is probably looking for an opportunity to use a restroom and pulls over to the first rest stop. This AI relies on its model of the human to understand the intentions behind the command.
Delegated control – AI does not wait for the human to issue any commands. Instead, it stops the car at the gym because it believes the human can benefit from a workout. This is a superintelligent and human-friendly system which knows how to make the human happy and to keep them safe better than the human themselves. This AI is in control.
Machine learning methods offer great promise for fast and accurate detection and prognostication of coronavirus disease 2019 (COVID-19) from standard-of-care chest radiographs (CXR) and chest computed tomography (CT) images. Many articles have been published in 2020 describing new machine learning-based models for both of these tasks, but it is unclear which are of potential clinical utility. In this systematic review, we consider all published papers and preprints, for the period from 1 January 2020 to 3 October 2020, which describe new machine learning models for the diagnosis or prognosis of COVID-19 from CXR or CT images. All manuscripts uploaded to bioRxiv, medRxiv and arXiv along with all entries in EMBASE and MEDLINE in this timeframe are considered. Our search identified 2,212 studies, of which 415 were included after initial screening and, after quality screening, 62 studies were included in this systematic review. Our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases. This is a major weakness, given the urgency with which validated COVID-19 models are needed. To address this, we give many recommendations which, if followed, will solve these issues and lead to higher-quality model development and well-documented manuscripts.
Many papers gave little attention to establishing the original source of the images
All proposed models suffer from a high or unclear risk of bias in at least one domain
We advise caution over the use of public repositories, which can lead to high risks of bias due to source issues and Frankenstein datasets as discussed above
[Researchers] should aim to match demographics across cohorts, an often neglected but important potential source of bias; this can be impossible with public datasets that do not include demographic information
Researchers should be aware that algorithms might associate more severe disease not with CXR imaging features, but the view that has been used to acquire that CXR. For example, for patients that are sick and immobile, an anteroposterior CXR view is used for practicality rather than the standard posteroanterior CXR projection
We emphasize the importance of using a well-curated external validation dataset of appropriate size to assess generalizability
In recent years, there has been a great deal of concern about the proliferation of false and misleading news on social media [1,2,3,4]. Academics and practitioners alike have asked why people share such misinformation, and sought solutions to reduce the sharing of misinformation [5,6,7]. Here, we attempt to address both of these questions. First, we find that the veracity of headlines has little effect on sharing intentions, despite having a large effect on judgments of accuracy. This dissociation suggests that sharing does not necessarily indicate belief. Nonetheless, most participants say it is important to share only accurate news. To shed light on this apparent contradiction, we carried out four survey experiments and a field experiment on Twitter; the results show that subtly shifting attention to accuracy increases the quality of news that people subsequently share. Together with additional computational analyses, these findings indicate that people often share misinformation because their attention is focused on factors other than accuracy, and therefore they fail to implement a strongly held preference for accurate sharing. Our results challenge the popular claim that people value partisanship over accuracy [8,9], and provide evidence for scalable attention-based interventions that social media platforms could easily implement to counter misinformation online.
Ranking is still running
Worked on the workshop paper. Added in a modified version of the intro from the chess paper that uses the GPT-3 now
Working on literature
Turns out that we still have to do a demo. I need to create some data to show what that would look like. Set up a meeting with Vadim for Friday to make sure all the new code is working
Generated all the scripts – about 700! Tomorrow I’ll run the “sim” and generate training values
Intro – discuss Tay, and how machine learning incorporates human input and reflects it back. This means that we have created ‘oracles’ that we can ask about the populations that contributed to their knowledge. In this type of computational sociology, finding and understanding the biases in these populations is an important part of the research
Introduce finetuned language models. Start with the chess model, and show how we can see the rank of piece terms rise and fall over the course of a sentence
Methods/results – describe the process of extracting chinavirus and sars-cov-2 as potential markers of different populations. Then prompts and runs to see the central terms that the models use. Show the stats. Then using the most popular terms from each model, run Ecco trajectories to show the rank behavior of these terms
Discussion. The possibilities of “interactive snapshots” of a population’s online behavior. The ongoing difficulty in prompt creation. Potential of maps?
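The "rank behavior" in the outline above can be illustrated with a small sketch. It assumes that, for each position in a sentence, we already have a score per candidate token. The scores below are invented, the chess terms are only examples, and this is not Ecco's API. A tool like Ecco would compute ranks over the model's full vocabulary.

```python
from typing import Dict, List

def term_rank(scores: Dict[str, float], term: str) -> int:
    """Return the 1-based rank of `term` among all scored tokens (1 = highest score)."""
    target = scores[term]
    return 1 + sum(1 for s in scores.values() if s > target)

def rank_trajectory(steps: List[Dict[str, float]], term: str) -> List[int]:
    """Rank of `term` at each position, in order, so rises and falls are visible."""
    return [term_rank(scores, term) for scores in steps]

# Hypothetical per-position scores, for illustration only
example_steps = [
    {"pawn": 0.3, "knight": 0.6, "queen": 0.1},
    {"pawn": 0.5, "knight": 0.2, "queen": 0.3},
    {"pawn": 0.2, "knight": 0.1, "queen": 0.7},
]
queen_ranks = rank_trajectory(example_steps, "queen")
pawn_ranks = rank_trajectory(example_steps, "pawn")
```

A falling rank number means the term is becoming more likely, so plots of these trajectories are usually easier to read with the rank axis inverted.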
Created the template
Note – Create Dr. Fauci and Donald Trump prompts – done!
Worked on getting useful text to look at out of the models. Using flair to scan for POS. That way I can grab the first noun that occurs which makes for less text to look through, and more useful than just looking at the first word. I think that this will also be the approach that I’ll use to pull data out of the GPT-3 for maps.
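A sketch of the "first noun" step, written against already-tagged tokens so it runs without downloading a tagger. It assumes the POS step yields (token, tag) pairs with Penn Treebank-style tags. The helper name `first_noun` and the sample sentence are hypothetical.

```python
from typing import List, Optional, Tuple

# Penn Treebank noun tags: singular, plural, proper singular, proper plural
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def first_noun(tagged: List[Tuple[str, str]]) -> Optional[str]:
    """Return the first token tagged as a noun, or None if there is none."""
    for token, tag in tagged:
        if tag in NOUN_TAGS:
            return token
    return None

# Hypothetical tagger output, for illustration only
tagged_example = [("The", "DT"), ("new", "JJ"), ("vaccine", "NN"), ("works", "VBZ")]
first = first_noun(tagged_example)
```

Returning `None` for noun-free text makes it easy to skip those rows when filling the spreadsheets.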
Finished training the COVID model, and committed to VCS
Got some results for the first term. Going to re-run for some number of terms next. Also played around with the resulting spreadsheets a bit to look for patterns
Updating my drivers, verifying that TF still works, and upgrading to PT 1.8
Building computational models to account for the cortical representation of language plays an important role in understanding the human linguistic system. Recent progress in distributed semantic models (DSMs), especially transformer-based methods, has driven advances in many language understanding tasks, making DSM a promising methodology to probe brain language processing. DSMs have been shown to reliably explain cortical responses to word stimuli. However, characterizing the brain activities for sentence processing is much less exhaustively explored with DSMs, especially the deep neural network-based methods. What is the relationship between cortical sentence representations against DSMs? What linguistic features that a DSM catches better explain its correlation with the brain activities aroused by sentence stimuli? Could distributed sentence representations help to reveal the semantic selectivity of different brain areas? We address these questions through the lens of neural encoding and decoding, fueled by the latest developments in natural language representation learning. We begin by evaluating the ability of a wide range of 12 DSMs to predict and decipher the functional magnetic resonance imaging (fMRI) images from humans reading sentences. Most models deliver high accuracy in the left middle temporal gyrus (LMTG) and left occipital complex (LOC). Notably, encoders trained with transformer-based DSMs consistently outperform other unsupervised structured models and all the unstructured baselines. With probing and ablation tasks, we further find that differences in the performance of the DSMs in modeling brain activities can be at least partially explained by the granularity of their semantic representations. We also illustrate the DSM’s selectivity for concept categories and show that the topics are represented by spatially overlapping and distributed cortical patterns. 
Our results corroborate and extend previous findings in understanding the relation between DSMs and neural activation patterns and contribute to building solid brain-machine interfaces with deep neural network representations.
Over the last several years, the field of natural language processing has been propelled forward by an explosion in the use of deep learning models. This article provides a brief introduction to the field and a quick overview of deep learning architectures and methods. It then sifts through the plethora of recent studies and summarizes a large assortment of relevant contributions. Analyzed research areas include several core linguistic processing issues in addition to many applications of computational linguistics. A discussion of the current state of the art is then provided along with recommendations for future research in the field.
Multiview representation learning (MVRL) leverages information from multiple views to obtain a common representation summarizing the consistency and complementarity in multiview data. Most previous matrix factorization-based MVRL methods are shallow models that neglect the complex hierarchical information. The recently proposed deep multiview factorization models cannot explicitly capture consistency and complementarity in multiview data. We present the deep multiview concept learning (DMCL) method, which hierarchically factorizes the multiview data, and tries to explicitly model consistent and complementary information and capture semantic structures at the highest abstraction level. We explore two variants of the DMCL framework, DMCL-L and DMCL-N, with respectively linear/nonlinear transformations between adjacent layers. We propose two block coordinate descent-based optimization methods for DMCL-L and DMCL-N. We verify the effectiveness of DMCL on three real-world data sets for both clustering and classification tasks.
Writing part of the introduction of the IEEE issue on diversity in transportation.
2:00 AI/ML tagup
Pinged Eric about getting a code to charge some of the hours – he’ll provide later
3:30 Meeting / Happy hour. Went over results. I’m going to run a larger experiment to generate text (not ranks). 50 tokens, 1,000 results for chinavirus and sars-cov-2
This course concerns the latest techniques in deep learning and representation learning, focusing on supervised and unsupervised deep learning, embedding methods, metric learning, convolutional and recurrent nets, with applications to computer vision, natural language understanding, and speech recognition.
The weather’s great, so I’m taking the day off to ride this:
It was a marvelous ride on a warm, almost hot day. I write this on the morning of the 12th, and I can still feel it in my legs. I haven’t felt that in months. Even a long winter ride doesn’t do that. It’s really hard to put out sustained power when you’re (a) cold and (b) trying to avoid sweating too much and getting colder