Monthly Archives: May 2020

Phil 5.12.20

D20

#COVID Aseel’s docs don’t seem to be in the proper Unicode? I tried downloading a version of the Quran from here, and that seems to be working. Hmmm. Trying to train on the Quran with these args:

--output_dir=output --model_type=gpt2 --model_name_or_path=gpt2 --per_gpu_train_batch_size=1 --do_train --train_data_file=..\input\quran-simple.txt

GPT-2 Agents

  • Added basic moves for all the pieces. Still need to handle hints
    Evaluating move [d4 Nf6]
     Fred Van der Vliet moves white pawn from d2 to d4.
     Loek Van Wely moves black knight from g8 to f6.
    Evaluating move [c4 g6]
     Fred Van der Vliet moves white pawn from c2 to c4.
     Loek Van Wely moves black pawn from g7 to g6.
    Evaluating move [g3 Bg7]
     Fred Van der Vliet moves white pawn from g2 to g3.
     Loek Van Wely moves black bishop from f8 to g7.
    Evaluating move [Bg2 O-O]
     Fred Van der Vliet moves white bishop from f1 to g2.
     Loek Van Wely kingside castles.

GOES

  • Working on NoiseGAN
    • Seems to be training without blowing up…
    • It ran, but the results are weird.

[image: acc_loss]

[image: Noise_trained]

    • As you can see, the fake data seems to have learned the noise well, but the scale is wrong.
    • It does seem to be able to learn about the scale though:

[image: Noise_trained]

    • Adding dropout seems to help:

[image: Noise_trained]

    • The discriminator so far:
      # needs: from tensorflow.keras.models import Sequential
      #        from tensorflow.keras.layers import Dense, Dropout
      self.d_model = Sequential()
      self.d_model.add(Dense(64, activation='relu', kernel_initializer='he_uniform', input_dim=self.vector_size))
      self.d_model.add(Dropout(0.2))
      # input_dim only matters on the first layer, so it's dropped here
      self.d_model.add(Dense(25, activation='relu', kernel_initializer='he_uniform'))
      self.d_model.add(Dropout(0.2))
      self.d_model.add(Dense(1, activation='sigmoid'))
      # compile model
      self.d_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

[image: Noise_trained]


Phil 5.11.20

Cut my hair for the second time. It looks ok from the front…

I’m also having dreams with crowds in them. Saturday night I dreamed I was at some job with a lot of people in a large building. Last night I dreamed I was sharing a dorm at the Naval Academy?

A foolproof way to shrink deep learning models

  • Train the model, prune its weakest connections, retrain the model at its fast, early training rate, and repeat, until the model is as tiny as you want. 
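  • A minimal sketch of that loop, assuming a Keras model of Dense layers (this is my reading of the recipe, not MIT’s code; num_prune_steps, x_train, etc. are placeholders, and a real version would keep masks so pruned weights stay zero while retraining):
    import numpy as np

    def prune_smallest(model, fraction:float):
        # zero out the smallest-magnitude weights in each layer with a kernel
        for layer in model.layers:
            weights = layer.get_weights()
            if not weights:
                continue
            w = weights[0]
            cutoff = np.quantile(np.abs(w), fraction)
            w[np.abs(w) < cutoff] = 0.0
            layer.set_weights([w] + weights[1:])

    # train, prune the weakest 20%, retrain at the early rate, repeat
    for step in range(num_prune_steps):
        model.fit(x_train, y_train, epochs=epochs_per_step)
        prune_smallest(model, fraction=0.2)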

Graph Neural Networks (GNN)

  • Graph neural networks (GNNs) are connectionist models that capture the dependence of graphs via message passing between the nodes of graphs. Unlike standard neural networks, graph neural networks retain a state that can represent information from its neighborhood with arbitrary depth.
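  • A toy numpy sketch of one message-passing step (the mean-over-neighbors, GCN-ish flavor; the graph, features, and weights are made up), just to make the idea concrete:
    import numpy as np

    # 4-node graph: each node's new state aggregates its neighbors' states
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    A_hat = A + np.eye(4)                      # add self-loops
    A_hat /= A_hat.sum(axis=1, keepdims=True)  # row-normalize
    H = np.random.randn(4, 8)                  # node states, 8 features each
    W = np.random.randn(8, 8)                  # learned transform (random here)
    H_next = np.maximum(A_hat @ H @ W, 0)      # aggregate, transform, ReLU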

D20

  • Zach’s having issues getting the map to work on mobile
  • Need to start pulling off controlled entities like China and Diamond Princess
  • Made a duplicate of the trending code to play with

GPT-2 Agents

  • More PGNtoEnglish
  • I have pawns and knights moving!

[image: chessboard]

  • With expanded text!
    • ‘Fred Van der Vliet moves white pawn from d2 to d4’
    • ‘Loek Van Wely moves black knight from g8 to f6’

GOES

  • Continue with NoiseGAN
  • Isolating noise. Done!

[image: noise]

  • Now I need to subsample to produce the training and test sets. Seems to be working (a minimal sketch of the idea is at the end of this section)
  • Fitting the timeseries sampling into the GAN

[image: Noise_untrained]

  • Try training the GAN?
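  • Here’s a minimal sketch of the kind of subsampling I mean; the function and values are illustrative, not the actual code:
    import numpy as np

    def subsample(master:np.ndarray, vector_size:int, count:int) -> np.ndarray:
        # pull 'count' random fixed-length windows out of the master series
        starts = np.random.randint(0, len(master) - vector_size, size=count)
        return np.stack([master[s:s + vector_size] for s in starts])

    master = np.sin(np.linspace(0, 20 * np.pi, 5000))  # stand-in master series
    train_set = subsample(master, vector_size=100, count=64)
    test_set = subsample(master, vector_size=100, count=16)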

Fika

  • Community Spaces for Interdisciplinary Science and Engagement
    • Dr. Lisa Scheifele is an Associate Professor at Loyola University Maryland and head of the Build-a-Genome research network, where her research focuses on designing and programming cells for new and complex functions. She is also Executive Director at the Baltimore Underground Science Space (BUGSS) community lab. BUGSS provides unique and creative projects to members of the public who have few other opportunities to engage with modern science. As an informal and nontraditional science space, BUGSS’ activities blend biotechnology research, computational tools, artistic expression, and design principles to accomplish interdisciplinary projects driven by community interest and need.

Phil 5.8.20

D20

  • Really have to fix the trending. Places like Brazil, where the disease is likely to be chronic, are not working any more
  • Aaron and I agreed to pull the site down if it’s not updated by 5/15

GPT-2 Agents

  • More PGNtoEnglish
  • Worked out a way to search for pieces in a rules-based range. It’ll work for pawns, knights, and kings right now. Will need to add rooks, bishops, and queens. A sketch of the idea is below
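  • The flavor of that range search, using the knight as the example (an illustrative helper, not the actual PGNtoEnglish code):
    # which squares could a knight have come from to reach (col, row)?
    KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2),
                      (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

    def knight_sources(col:int, row:int) -> list:
        squares = []
        for dc, dr in KNIGHT_OFFSETS:
            c, r = col + dc, row + dr
            if 0 <= c < 8 and 0 <= r < 8:  # stay on the board
                squares.append((c, r))
        return squares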

#COVID

  • Try finetuning the model on Arabic to see what happens. Don’t see the txt files?

GOES

  • The time taken for all the DB calls is substantial. I need to change the Measurements class so that there is a set of master Measurements that are big enough to subsample other Measurements from. Done. Much faster!
  • Start building noise query, possibly using a high pass filter? Otherwise, subtract the “real” signal from the simulated one
    • Starting with the subtraction, since I have to set up queries anyway, and this will help me debug them (a sketch of the subtraction is at the end of this section)
    • Created NoiseGAN class that extends OneDGAN
    • Pulling over table building code from InfluxTestTrainBase()
    • Success!
    • "D:\Program Files\Python37\python.exe" D:/Development/Sandboxes/Influx2_ML/Influx2_ML/NoiseGAN.py
      2020-05-08 14:45:36.077292: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
      OneDGAN.reset()
      NoiseGAN.reset()
      query = from(bucket:"org_1_bucket") |> range(start:2020-04-13T13:30:00Z, stop:2020-04-13T13:40:00Z) |> filter(fn:(r) => r.type == "noisy_sin" and (r.period == "8"))
      vector size = 100, query returns = 590
    • Probably a good place to stop for the day
  • 10:00 Meeting. Vadim seems to be making good progress. Check in on Tuesday
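  • For reference, the subtraction idea in miniature (illustrative values; the window length echoes the 590-point query above):
    import numpy as np

    t = np.linspace(0, 2 * np.pi, 590)                  # one query window
    clean = np.sin(8 * t)                               # the "real" signal
    noisy = clean + np.random.normal(0, 0.25, t.shape)  # simulated signal
    noise = noisy - clean                               # what the NoiseGAN trains on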

Phil 5.7.20

D20

  • Everything is silent again.

GPT-2 Agents

  • Continuing with PGNtoEnglish
    • Building out move text
    • Changing board to a dataframe, since I can display it as a table in pyplot – done!

[image: chessboard]

  • Here’s the code for making the chessboard table in pyplot:
    from typing import List

    import pandas as pd
    import matplotlib.pyplot as plt

    # 'pieces' is an Enum of piece glyphs defined elsewhere in the project
    class Chessboard:
        board:pd.DataFrame
        rows:List
        cols:List
    
        def __init__(self):
            self.reset()
    
        def reset(self):
            self.cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
            self.rows = [8, 7, 6, 5, 4, 3, 2, 1]
            self.board = pd.DataFrame(columns=self.cols, index=self.rows)
            for number in self.rows:
                for letter in self.cols:
                    self.board.at[number, letter] = pieces.NONE.value
    
            self.populate_board()
            self.print_board()
    
        def populate_board(self):
            self.board.at[1, 'a'] = pieces.WHITE_ROOK.value
            self.board.at[1, 'h'] = pieces.WHITE_ROOK.value
            self.board.at[1, 'b'] = pieces.WHITE_KNIGHT.value
            self.board.at[1, 'g'] = pieces.WHITE_KNIGHT.value
            self.board.at[1, 'c'] = pieces.WHITE_BISHOP.value
            self.board.at[1, 'f'] = pieces.WHITE_BISHOP.value
            self.board.at[1, 'd'] = pieces.WHITE_QUEEN.value
            self.board.at[1, 'e'] = pieces.WHITE_KING.value
    
            self.board.at[8, 'a'] = pieces.BLACK_ROOK.value
            self.board.at[8, 'h'] = pieces.BLACK_ROOK.value
            self.board.at[8, 'b'] = pieces.BLACK_KNIGHT.value
            self.board.at[8, 'g'] = pieces.BLACK_KNIGHT.value
            self.board.at[8, 'c'] = pieces.BLACK_BISHOP.value
            self.board.at[8, 'f'] = pieces.BLACK_BISHOP.value
            # the queen goes on the d-file, the king on the e-file
            self.board.at[8, 'd'] = pieces.BLACK_QUEEN.value
            self.board.at[8, 'e'] = pieces.BLACK_KING.value
    
            for letter in self.cols:
                self.board.at[2, letter] = pieces.WHITE_PAWN.value
                self.board.at[7, letter] = pieces.BLACK_PAWN.value
    
        def print_board(self):
            fig, ax = plt.subplots()
    
            # hide axes
            fig.patch.set_visible(False)
            ax.axis('off')
            ax.axis('tight')
    
            ax.table(cellText=self.board.values, colLabels=self.cols, rowLabels=self.rows, loc='center')
    
            fig.tight_layout()
    
            plt.show()

GOES

  • Continuing with the MLP sequence-to-sequence NN
  • Writing
  • Reading
    • Hmm. Just realized that the input vector being defined by the query is a bit problematic. I think I need to define the input vector size and then ensure that the query creates sufficient points. Fixed. It now stores the model with the specified input vector size (a sketch of the MLP is at the end of this section):

[image: model_name]

  • And here’s the loaded model on newly-retrieved data:
  • Here’s the model learning two waveforms. Went from 400×2 neurons to 3200×2:
  • Combining with GAN
    • Subtract the sin from the noisy_sin to get the noise and train on that
  • Start writing paper? What are other venues beyond GVSETS?
  • 2:00 status meeting
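  • For reference, a guess at the shape of that MLP (layer sizes from the “2-layer, 400 neuron” note elsewhere in these pages; activations and loss are assumptions, not the actual code):
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    vector_size = 100  # the specified input vector size

    model = Sequential([
        Dense(400, activation='relu', input_dim=vector_size),
        Dense(400, activation='relu'),
        Dense(vector_size, activation='tanh'),  # one output per input point
    ])
    model.compile(loss='mse', optimizer='adam')
    # model.fit(low_fidelity_vectors, high_fidelity_vectors, epochs=...)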

JuryRoom

  • 3:30 Meeting
  • 6:00 Meeting

Phil 5.6.20

#COVID

  • I looked at the COVID-19-TweetIDs GitHub project, and it is in fact lists of ids:
    1219755883690774529
    1219755875407224832
    1219755707001659393
    1219755610494861312
    1219755586272813057
    1219755378428338181
    1219755293397012480
    1219755288988798981
    1219755197645279233
    1219755157438828545
  • These can work by appending that number to the string “twitter.com/anyuser/status/”, like this: twitter.com/anyuser/status/1219755883690774529
  • The way to get the text in Python appears to be tweepy. This snippet from stackoverflow appears to show how to do it, but I haven’t verified it yet.
    import tweepy

    # fill these in with real app credentials
    consumer_key = "xxxx"
    consumer_secret = "xxxx"
    access_token = "xxxx"
    access_token_secret = "xxxx"

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    api = tweepy.API(auth)

    tweets = api.statuses_lookup(id_list) # id_list is the list of tweet ids
    tweet_txt = []
    for i in tweets:
        tweet_txt.append(i.text)
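  • One caveat, if I’m remembering the API docs right: statuses_lookup only accepts up to 100 ids per call, so a long id list will need to be chunked.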


GPT-2 Agents

  • Continuing with PGNtoEnglish
    • Figuring out how to parse the moves text, using the wonderful regex101 site
  • 4:30 meeting
    • We set up an Overleaf project with the goal of submitting to the Harvard/Kennedy Misinformation Review
    • We talked about the GPT-2 as a way of clustering tweets. Going to try finetuning with some Arabic novels first to see if it can work in that language

GOES

  • Continuing with the MLP sequence-to-sequence NN
    • Getting the data to fit into nice, rectangular arrays, which is not straightforward, since the time window of the query can return a varying number of results. So I have to run the query, then trim the arrays down so that they are all the length of the shortest (a minimal sketch is at the end of this section). Here are the results:
  • I’ve got the training and prediction working pretty well. Stopping for the day
  • Tomorrow I’ll get the models to write out and read in
  • 2:00 status meeting
    • Two weeks to getting the sim running?
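  • The trimming step in miniature (the 'results' list is hypothetical):
    import numpy as np

    results = [np.random.randn(n) for n in (103, 98, 101)]  # stand-in query results
    shortest = min(len(r) for r in results)
    batch = np.stack([r[:shortest] for r in results])       # rectangular: (3, 98)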

Phil 5.5.20

D20

[image: Cubic]

  • Just goes to show that you shouldn’t take regression fits as correct

GPT-2 Agents

  • More PGNtoEnglish
  • Discovered typing.TextIO. I love typing to death 🙂
  • Finished parsing meta information

#COVID

GOES

  • Progress meeting with Vadim and Isaac
  • Train and save a 2-layer, 400 neuron MLP. No ensembles for now
  • Set up GAN to add noise


Phil 5.4.20

It is a Chopin sort of morning

D20

  • Zach got maps and lists working over the weekend. Still a lot more to do though
  • Need to revisit the math to work over the past days

GPT-2 Agents

  • Working on PGN to English.
    • Added game class that contains all the information for a game and reads it in. Games are created and managed by the PGNtoEnglish class
  • Rebased the transformers project. It updates fast

GOES

  • Figure out how to save and load models. I’m really not sure what to save, since you need access to the latent space and the discriminator? So far, it’s:
    # needs: import os and import tensorflow as tf
    def save_models(self, directory:str, prefix:str):
        p = os.getcwd()
        os.chdir(directory)
        self.d_model.save("{}_discriminator.tf".format(prefix))
        self.g_model.save("{}_generator.tf".format(prefix))
        self.gan_model.save("{}_GAN.tf".format(prefix))
        os.chdir(p)

    def load_models(self, directory:str, prefix:str):
        p = os.getcwd()
        os.chdir(directory)
        self.d_model = tf.keras.models.load_model("{}_discriminator.tf".format(prefix))
        self.g_model = tf.keras.models.load_model("{}_generator.tf".format(prefix))
        self.gan_model = tf.keras.models.load_model("{}_GAN.tf".format(prefix))
        os.chdir(p)
    • Here’s the initial run. Very nice for 10,000 epochs!

[images: acc_loss, GAN_inputs, GAN_trained]

    • And here’s the results from the loaded model:

[image: GAN_trained]

    • The discriminator works as well:
      real accuracy = 100.00%, fake accuracy = 100.00%
      real loss = 0.0154, fake loss = 0.0947%
    • An odd thing is that I can save the GAN model, but can’t load it?
      ValueError: An empty Model cannot be used as a Layer.

      I can rebuild it from the loaded generator and discriminator models though (see the sketch at the end of this section)

  • Set up MLP to convert low-fidelity sin waves to high-fidelity
    • Get the training and test data from InfluxDB
      • input is square, output is sin, and the GAN should be noisy_sin minus sin. Randomly move the sample through the domain
    • Got the queries working:
    • Train and save a 2-layer, 400 neuron MLP. No ensembles for now
  • Set up GAN to add noise
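  • The rebuild step from above, sketched out (the file names follow the save_models() prefix convention and are illustrative):
    import tensorflow as tf
    from tensorflow.keras.models import Sequential

    g_model = tf.keras.models.load_model("gan_generator.tf")
    d_model = tf.keras.models.load_model("gan_discriminator.tf")
    d_model.trainable = False  # freeze D when training G through the stacked model
    gan_model = Sequential([g_model, d_model])
    gan_model.compile(loss='binary_crossentropy', optimizer='adam')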

Fika

  • Ask question about what the ACM and CHI are doing, beyond providing publication venues, to fight misinformation that lets millions of people find fabricated evidence that supports dangerous behavior.
  • Effects of Credibility Indicators on Social Media News Sharing Intent
    • In recent years, social media services have been leveraged to spread fake news stories. Helping people spot fake stories by marking them with credibility indicators could dissuade them from sharing such stories, thus reducing their amplification. We carried out an online study (N = 1,512) to explore the impact of four types of credibility indicators on people’s intent to share news headlines with their friends on social media. We confirmed that credibility indicators can indeed decrease the propensity to share fake news. However, the impact of the indicators varied, with fact checking services being the most effective. We further found notable differences in responses to the indicators based on demographic and personal characteristics and social media usage frequency. Our findings have important implications for curbing the spread of misinformation via social media platforms.

Phil 5.1.20

Geez, it’s May! What a weird time

D20

  • Chatted with Zach. He’s bogged down in database issues, but I think it’s coming along

GPT-2 Agents

  • Upgrade TF, Torch, transformers, Nvidia, and CUDA on laptop
  • Set up input and output files
  • Pull char count of probe out and add that to the total generated
  • Try training on Moby Dick as per these instructions
    • The following example fine-tunes GPT-2 on WikiText-2. We’re using the raw WikiText-2 (no tokens were replaced before the tokenization). The loss here is that of causal language modeling.
      export TRAIN_FILE=/path/to/dataset/wiki.train.raw
      export TEST_FILE=/path/to/dataset/wiki.test.raw
      
      python run_language_modeling.py \
          --output_dir=output \
          --model_type=gpt2 \
          --model_name_or_path=gpt2 \
          --do_train \
          --train_data_file=$TRAIN_FILE \
          --do_eval \
          --eval_data_file=$TEST_FILE
      

      This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches a score of ~20 perplexity once fine-tuned on the dataset.

  • Ran with this command
    python run_language_modeling.py --output_dir=output .\gpt2data\moby_dick_model --model_type=gpt2 --model_name_or_path=gpt2 --do_train --train_data_file=.\gptdata\moby_dick_train.txt --do_eval --eval_data_file=.\gptdata\moby_dick_test.txt

    Which started the task correctly, but…

    RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 8.00 GiB total capacity; 6.26 GiB already allocated; 77.55 MiB free; 6.31 GiB reserved in total by PyTorch)

    Guess I’ll try running it on my work machine. If it runs there, I guess it’s time to upgrade my graphics card

  • That was not the problem! There is something going on with batch size. Added --per_gpu_train_batch_size=1
  • Couldn’t use links. os.path.isfile() chokes
  • The model doesn’t seem to be saved? Looks like it is:
    05/01/2020 09:43:49 - INFO - transformers.trainer -   Saving model checkpoint to output
    05/01/2020 09:43:49 - INFO - transformers.configuration_utils -   Configuration saved in output\config.json
    05/01/2020 09:43:50 - INFO - transformers.modeling_utils -   Model weights saved in output\pytorch_model.bin
    05/01/2020 09:43:50 - INFO - __main__ -   *** Evaluate ***
    05/01/2020 09:43:50 - INFO - transformers.trainer -   ***** Running Evaluation *****
    05/01/2020 09:43:50 - INFO - transformers.trainer -     Num examples = 97
    05/01/2020 09:43:50 - INFO - transformers.trainer -     Batch size = 16
    Evaluation: 100%|██████████| 7/7 [00:06<00:00,  1.00it/s]
    05/01/2020 09:43:57 - INFO - __main__ -   ***** Eval results *****
    05/01/2020 09:43:57 - INFO - __main__ -     perplexity = 43.311306196182095
  • Found it. It defaults to the output directory in transformers/examples
  • To get this version, which is a PyTorch model, you have to add the from_pt=True argument:
    model = TFGPT2LMHeadModel.from_pretrained("../data/moby_dick_model", pad_token_id=tokenizer.eos_token_id, from_pt=True)
  • And the results are great!
    I enjoy walking with my cute dog:
    	[0]: I enjoy walking with my cute dog, and then I like to take pictures! But, as for you, you will have to go all the way round for the proper weather! Here, I have some water in my belly! How am I
    	[1]: I enjoy walking with my cute dog when I walk in the yard, and when we have been going in, I am always excited to try a little bit of the wildest stuff. I like to see my dogs do it. I like
    	[2]: I enjoy walking with my cute dog because he has no fear of you leaving him alone. In that case, let me explain that I am a retired Sperm Whale in my Sperm Whale breeding herd. I was recently the leader of the
    
    Far out in the uncharted backwaters of the unfashionable end:
    	[0]: Far out in the uncharted backwaters of the unfashionable end of the Indian Ocean, you will see whales of many great variety. “Wherever they go, their mouths may be wide open, or they may be so packed
    	[1]: Far out in the uncharted backwaters of the unfashionable end of the planet. On his way, it seemed that he was about to embark upon something which no mortal could have foreseen; it being the Cape Horn of the Pacific
    	[2]: Far out in the uncharted backwaters of the unfashionable end. A curious discovery is made of the whale-whale. How much is he? I wonder how many sperm whales have there! I am still trying to get
    
    It was a pleasure to burn. :
    	[0]: It was a pleasure to burn. His teeth were the first thing to slide down to the side of his cheeks—a pointless thing—while my face stood there in this hideous position. It was my last, and only,
    	[1]: It was a pleasure to burn. But, as the day wore on, another peculiarity was discovered in the method. When this first method was advanced to be used for preparing the best lye, it was found that it was, instead
    	[2]: It was a pleasure to burn. “Sir, “aye, that’s true—” said I with a sort of exasperation. I then took one of the other boats and in a very similar
    
    It was a bright cold day in April, and the clocks were striking thirteen. :
    	[0]: It was a bright cold day in April, and the clocks were striking thirteen. It seemed that Captain Peleg had had just arrived, and was sitting in his Captain-Commander's cabin, and was trying to get up some time; but Pe
    	[1]: It was a bright cold day in April, and the clocks were striking thirteen. One of us, who had been living in the tent for six days, still felt like the moon. I saw him. I saw him again. He looked just like
    	[2]: It was a bright cold day in April, and the clocks were striking thirteen. “Good afternoon, sir, it was the very first Sabbath of the year, and the New Year is the first time the people of the world have an


  • Need to get the chess database and build a corpus. Working on a PGN to English translator. Doesn’t look toooooo bad

GOES

    • Continue with GANs. Maybe explore 1D CNNs?
    • The run with the high-frequency data actually looks pretty good:

      I think it may be a better use of my time to assemble all the components for a first-pass proof-of-concept

  • 10:00 Meeting with Vadim and Isaac
    • I walked through the whole controller architecture from the base class to the running version. Vadim will start implementing a Sim2 version using the base classes and the dictionary. Then we can work on writing to and reading from InfluxDB