
Phil 4.27.20

Took the motorcycle for its weekly spin and rode past the BWI terminal. By far the most Zombie Apocalypse thing I’ve seen so far.

The repository contains an ongoing collection of tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2); collection began on January 28, 2020.

D20

  • Reworked the regression code to use only the last 14 days of data. It seems to account for the slowing rate of change better
  • That could be a nice interactive feature to add to the website. A JS version of regression curve fitting is here.

ACSOS

  • Got Antonio’s revisions back and enlarged the two charts for better readability

GPT-2 Agents

  • Going to try the GPT-2 Client and see how it works.
  • Whoops, needs TF 2.1. Upgraded that and the drivers – done

GOES

  • Step through the GAN code and look for ways of restricting the latent space to stay near the simulation output
  • Here’s the GAN trying to fit a bit of a sine wave from the beginning of the day: [image: GAN2Sin]
  • And here’s the evolution of the GAN using hints and 5 latent dimensions from the end of the day: [image: GAN_fit]
  • And here are the accuracy outputs:
    epoch = 399, real accuracy = 87.99999952316284%, fake accuracy = 37.99999952316284%
    epoch = 799, real accuracy = 43.99999976158142%, fake accuracy = 56.99999928474426%
    epoch = 1199, real accuracy = 81.00000023841858%, fake accuracy = 25.999999046325684%
    epoch = 1599, real accuracy = 81.00000023841858%, fake accuracy = 40.99999964237213%
    epoch = 1999, real accuracy = 87.99999952316284%, fake accuracy = 25.999999046325684%
    epoch = 2399, real accuracy = 89.99999761581421%, fake accuracy = 20.000000298023224%
    epoch = 2799, real accuracy = 87.00000047683716%, fake accuracy = 46.00000083446503%
    epoch = 3199, real accuracy = 80.0000011920929%, fake accuracy = 47.999998927116394%
    epoch = 3599, real accuracy = 76.99999809265137%, fake accuracy = 43.99999976158142%
    epoch = 3999, real accuracy = 68.99999976158142%, fake accuracy = 30.000001192092896%
    epoch = 4399, real accuracy = 75.0%, fake accuracy = 33.000001311302185%
    epoch = 4799, real accuracy = 63.999998569488525%, fake accuracy = 28.00000011920929%
    epoch = 5199, real accuracy = 50.0%, fake accuracy = 56.00000023841858%
    epoch = 5599, real accuracy = 36.000001430511475%, fake accuracy = 56.00000023841858%
    epoch = 5999, real accuracy = 49.000000953674316%, fake accuracy = 60.00000238418579%
    epoch = 6399, real accuracy = 34.99999940395355%, fake accuracy = 58.99999737739563%
    epoch = 6799, real accuracy = 70.99999785423279%, fake accuracy = 43.00000071525574%
    epoch = 7199, real accuracy = 70.99999785423279%, fake accuracy = 30.000001192092896%
    epoch = 7599, real accuracy = 47.999998927116394%, fake accuracy = 50.0%
    epoch = 7999, real accuracy = 40.99999964237213%, fake accuracy = 52.99999713897705%
    epoch = 8399, real accuracy = 23.000000417232513%, fake accuracy = 82.99999833106995%
    epoch = 8799, real accuracy = 23.000000417232513%, fake accuracy = 75.0%
    epoch = 9199, real accuracy = 31.00000023841858%, fake accuracy = 69.9999988079071%
    epoch = 9599, real accuracy = 37.99999952316284%, fake accuracy = 68.00000071525574%
    epoch = 9999, real accuracy = 23.000000417232513%, fake accuracy = 83.99999737739563%
    
  • Found a bug in the short-regression code. Need to roll in the fix. [image: regression]
  • Here’s the working code (xsub and ysub are the last-14-day slices of the full x and y series):
    from scipy import stats
    import numpy as np

    # fit a line to just the most recent samples
    slope, intercept, r_value, p_value, std_err = stats.linregress(xsub, ysub)
    # slope, intercept = np.polyfit(x, y, 1)
    yn = np.polyval([slope, intercept], xsub)

    # if the trend is downward, estimate how many steps until it reaches zero
    steps = 0
    if slope < 0:
        steps = abs(y[-1] / slope)

    # build the regression line over the sample window for plotting
    reg_x = []
    reg_y = []
    start = len(yl) - max_samples
    yval = intercept + slope * start
    for i in range(start, len(yl) - offset):
        reg_x.append(i)
        reg_y.append(yval)
        yval += slope

 

Phil 4.24.20

It is very wet today

[image: radar]

Spent far too much time trying to upload a picture to the graduation site. It appears to be broken

D20

  • Changed the CONTROLLED days to < 2, since things are generally looking better

ACSOS

  • Sent the revised draft to Antonio

GPT-2 Agents

  • Found what appears to be just what I’m looking for. Searching on GitHub for GPT-2 tensorflow led me to this project, GPT-2 Client. I’ll give that a try and see how it works. The developer, Rishabh Anand, seems to have solid skills, so I have some hope that this could work. I don’t have the energy to start this on a Friday and then switch to GANs for the rest of the day, though. Sunday looks like another wet one, so maybe then.

GOES

[image: block_3_conv2] More looking at layers. This is ImageNet’s block3_conv3.

  • Advanced CNNs
  • Start GANs? Yes!
    • Got this version working. Now I need to step through it. But here are some plots of it learning:
    • I had dreams about this, so I’m going to record the thinking here:
      • An MLP should be able to get from a simple simulation (a square wave) to a more accurate(?) simulated sine wave. The data set is queries into the DB at various start points and frequencies, with the matching (“real”/noisy) series as the test. My intuition is that the noise will be lost, so that’s the part we’re going to have to get back with the GAN.
      • So I think there is a two-step process (a sketch follows this list):
        • Train the initial NN that will produce the generalized solution
        • Use the output of the NN and the “real” data to train the GAN for fine-tuning
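  • A minimal sketch of that two-step shape (layer sizes and names are my own placeholders, not the working code):
    import tensorflow as tf

    # Step 1: supervised MLP that maps the simple simulation (square wave)
    # to the generalized solution (clean sine wave), both as 100-sample vectors
    sim_to_clean = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(100)])
    sim_to_clean.compile(optimizer='adam', loss='mse')

    # Step 2: GAN fine-tuning - the generator starts from the MLP's output and
    # learns to put back the noise that the supervised fit averaged away
    generator = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
        tf.keras.layers.Dense(100)])
    discriminator = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
        tf.keras.layers.Dense(1, activation='sigmoid')])
    discriminator.compile(optimizer='adam', loss='binary_crossentropy')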

Phil 4.23.20

Transformer Architecture: The Positional Encoding

  • In this article, I don’t plan to explain its architecture in depth, as there are currently several great tutorials on this topic (here, here, and here); instead, I want to discuss one specific part of the transformer’s architecture – the positional encoding.
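  • As a reminder of what that encoding looks like, here’s a minimal NumPy sketch of the standard sinusoidal formulation from “Attention Is All You Need” (my illustration, not the article’s code):
    import numpy as np

    def positional_encoding(max_len, d_model):
        # sin on the even dimensions, cos on the odd ones, with wavelengths
        # that grow geometrically across the embedding
        pos = np.arange(max_len)[:, np.newaxis]
        i = np.arange(d_model)[np.newaxis, :]
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angle[:, 0::2])
        pe[:, 1::2] = np.cos(angle[:, 1::2])
        return pe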

D20

  • Add centroids for states – done
  • Return the number of neighbors as an argument – done
  • Chatted with Aaron and Zach. More desire to continue than abandon

ACSOS

  • More revisions. Swap steps for discussion and future work

GOES

    • IRS proposal went in yesterday
    • Continue with GANs
    • Using the VGG model now, with much better results. Also figured out how to load weights and read the probabilities in the output layer: [image: vgg]
    • Same thing using the pre-trained model from Keras:
      from tensorflow.keras.applications.vgg16 import VGG16
      # prebuild model with pre-trained weights on imagenet
      model = VGG16(weights='imagenet', include_top=True)
      model.compile(optimizer='sgd', loss='categorical_crossentropy')

      [image: vggPretrained]
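    • For reference, a minimal sketch of reading those output probabilities back as labels, using Keras’s bundled helpers (the ‘cat.jpg’ file name is a stand-in):
      import numpy as np
      from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
      from tensorflow.keras.preprocessing import image

      model = VGG16(weights='imagenet', include_top=True)
      img = image.load_img('cat.jpg', target_size=(224, 224))  # VGG16 expects 224x224 inputs
      x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
      preds = model.predict(x)
      print(decode_predictions(preds, top=3)[0])  # [(class_id, label, probability), ...]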

    • Trying to visualize a layer using this code. Using that code as a starting point, I had to explore how to slice up the tensors in the right way. A CNN layer has a set of “filters” whose outputs are square grids of pixels, but the data is stored as an array of per-filter values at each x, y coordinate, so I had to figure out how to pull out one image at a time. Here’s my toy:
      import numpy as np
      import matplotlib.pyplot as plt
      
      n_rows = 4
      n_cols = 8
      depth = 4
      
      my_list = []
      
      for r in range(1, n_rows):
          row = []
          my_list.append(row)
          for c in range(1, n_cols):
              cell = []
              row.append(cell)
              for d in range(depth):
                  cell.append(d+c*10+r*100)
      
      print(my_list)
      nl = np.array(my_list)
      for d in range(depth):
          print("\nlayer {} = \n{}".format(d, nl[:, :, d]))
          plt.figure(d)
          plt.imshow(nl[:, :, d], aspect='auto', cmap='plasma')
      
      plt.show()
    • This gets features from a cat image at one of the pooling layers (model and the preprocessed input x come from the VGG code above). The color map is completely arbitrary:
      import numpy as np
      import matplotlib.pyplot as plt

      # get the features from this block
      features = model.predict(x)
      print(features.shape)
      farray = np.array(features[0])
      print("{}".format(farray[:, :, 0]))

      # plot the first four feature maps, one figure each
      for d in range(4):
          plt.figure(d)
          plt.imshow(farray[:, :, d], aspect='auto', cmap='plasma')
      plt.show()
    • But we get some cool pix!

Phil 4.22.20

  • Amsterdam, 24 April 2020
  • This workshop aims to bring together researchers and practitioners from the emerging fields of Graph Representation Learning and Geometric Deep Learning. The workshop will feature invited talks and a poster session. There will be ample opportunity for discussion and networking.
  • Invited talks will be live-streamed on YouTube: https://www.youtube.com/watch?v=Zf_nLR4kMo4
  • Looking for an online seminar that presents the latest advances in reinforcement learning theory? You just found it! We aim to bring you a virtual seminar (approximately) every Tuesday at 5pm UTC featuring the latest work in theoretical reinforcement learning.

D20

  • Added P-threshold to json file. I’m concerned that everyone is too busy to participate any more. Aaron hasn’t even asked about the project since he got better and is complaining about how overworked he is. Zach seems to be equally busy. If no one steps up by the end of the week, I think it’s time to either take over the project entirely or shut it down.

ACSOS

  • Started working on Antonio’s changes
  • Changed the MappApp so that the trajectory lines are blue

GOES

  • Finish CNN chapter
  • Enable Tensorflow profiling
    • Installed the plugin: pip install tensorboard_plugin_profile
    • Updated setup_tensorboard():
      import os
      import shutil
      from typing import List
      import tensorflow as tf

      def setup_tensorboard(dir_str: str, windows_slashes: bool = True) -> List:
          if windows_slashes:
              dir_str = dir_str.replace("/", "\\")
          try:
              shutil.rmtree(dir_str)  # clear out any old logs
          except OSError:
              print("no file {} at {}".format(dir_str, os.getcwd()))

          # use TensorBoard, princess Aurora!
          callbacks = [tf.keras.callbacks.TensorBoard(log_dir=dir_str, profile_batch='500,510')]
          return callbacks
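    • A quick usage sketch (the model and training data are assumed; the log path is arbitrary):
      # pass the returned callbacks to fit(), then run "tensorboard --logdir=logs"
      # and open the Profile tab
      callbacks = setup_tensorboard("logs/fit")
      model.fit(x_train, y_train, epochs=5, callbacks=callbacks)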
  • Huh. Looks like scipy.misc.imresize() and scipy.misc.imread() are both deprecated and out of the library. Trying opencv
    • pip install opencv-python
    • Here’s how I did it, with some debugging thrown in to verify that everything was working correctly:
      import cv2
      import numpy as np

      img_names = ['cat.jpg', 'steam-locomotive.jpg']
      img_list = []
      for name in img_names:
          img = cv2.imread(name)
          res = np.array(cv2.resize(img, dsize=(32, 32), interpolation=cv2.INTER_CUBIC))
          cv2.imwrite(name.replace(".jpg", "_32x32.jpg"), res)
          img_list.append(res)

      imgs = np.transpose(img_list, (0, 2, 1, 3))
      imgs = np.array(imgs) / 255  # normalize the transposed images, not the original list
  • This forced me to go down a transpose() in multiple dimensions rabbit hole that’s worth documenting. First, here’s code that takes some tiny images in an array and transposes them:
    import numpy as np
    
    img_list = [
        # image 1
        [[[10, 20, 30],
          [11, 21, 31],
          [12, 22, 32],
          [13, 23, 33]],
    
         [[255, 255, 255],
          [48, 45, 58],
          [101, 150, 205],
          [255, 255, 255]],
    
         [[255, 255, 255],
          [43, 56, 75],
          [77, 110, 157],
          [255, 255, 255]],
    
         [[255, 255, 255],
          [236, 236, 238],
          [76, 104, 139],
          [255, 255, 255]]],
        # image 2
        [[[100, 200, 300],
          [101, 201, 301],
          [102, 202, 302],
          [103, 203, 303]],
    
         [[159, 146, 145],
          [89, 74, 76],
          [207, 207, 210],
          [212, 203, 203]],
    
         [[145, 155, 164],
          [52, 40, 36],
          [166, 160, 163],
          [136, 132, 134]],
    
         [[61, 56, 60],
          [36, 32, 35],
          [202, 195, 195],
          [172, 165, 177]]]]
    
    np_imgs = np.array(img_list)
    print("np_imgs shape = {}".format(np_imgs.shape))
    
    imgs = np.transpose(img_list, (0, 2, 1, 3))
    print("imgs shape = {}".format(np_imgs.shape))
    #imgs = np.array(imgs) / 255
    
    print("pix 0: \n{}".format(np_imgs[0]))
    print("transposed pix 0: \n{}".format(imgs[0]))
    print("\n------------------------\n")
    print("pix 1: \n{}".format(np_imgs[1]))
    print("transposed pix 1: \n{}".format(imgs[1]))
  • So this is a rank-4 array, with a shape of (2, 4, 4, 3). What we want to do is flip each image (the inner 4, 4) across its diagonal by transposing it. The way to understand Numpy’s transpose is that it interchanges axes. The trick is understanding how.
  • For this matrix, applying a transpose that does nothing means writing this:
    imgs = np.transpose(img_list, (0, 1, 2, 3))
  • Think of it as an identity transpose. What we want to do is swap the inner two axes (the 4, 4), which we do like this:
    imgs = np.transpose(img_list, (0, 2, 1, 3))
  • That’s it! Now the second “4” axis is swapped with the first. You can do this with any of the axes. So
    imgs = np.transpose(img_list, (3, 2, 1, 0))
  • Reverses everything!
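  • A quick shape check makes the swaps easy to see (my own example, with distinct inner sizes so the swap is visible):
    import numpy as np

    a = np.zeros((2, 4, 5, 3))
    print(np.transpose(a, (0, 1, 2, 3)).shape)  # (2, 4, 5, 3) - identity
    print(np.transpose(a, (0, 2, 1, 3)).shape)  # (2, 5, 4, 3) - inner axes swapped
    print(np.transpose(a, (3, 2, 1, 0)).shape)  # (3, 5, 4, 2) - everything reversed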
  • Ok, so things are working, but the results are crap. Not really worrying about it for now, because it’s CIFAR and I always have this problem:
    ./images\airplane.jpg = [8] ship
    ./images\automobile.jpg = [0] airplane 
    ./images\bird.jpg = [4] deer
    ./images\cat.jpg = [0] airplane 
    ./images\cat2.jpg = [6] frog
    ./images\cat3.jpg = [8] ship
    ./images\deer.jpg = [8] ship
    ./images\dog.jpg = [2] bird
    ./images\horse.jpg = [8] ship
    ./images\ship.jpg = [0] airplane 
    ./images\steam-locomotive.jpg = [2] bird
    ./images\truck.jpg = [3] cat
    [8 0 4 0 6 8 8 2 8 0 2 3]

     

  • Meeting

Phil 4.21.20

D20

  • Add P-threshold value to the json meta values. Items lower than this should be pale colors with black text

ML Seminar

  • finish reading paper, put together some quick slides – Done! Had to turn one table into charts
  • Good discussion. It makes me think that the way to have agents play games is to learn on the text up to the latest point in the game as a probe, and then add the question to the end of the probe, something like “<game text> + <character name>”. I’ll have to try that with some novel text and Talk to Transformer (a sketch of the probe idea follows this list). Here’s an example from the beginning of O. Henry’s “The Skylight Room”. The text begins: First Mrs. Parker would show you the double parlours. You would not dare to interrupt her description of their advantages and of the merits of the gentleman who had occupied them for eight years. Then you would manage to stammer forth the confession that you were neither a doctor nor a dentist. Mrs. Parker’s
    • O Henry: manner of receiving the admission was such that you could never afterward entertain the same feeling toward your parents, who had neglected to train you up in one of the professions that fitted Mrs. Parker’s parlours.
    • TTT 1: eyes would turn on you with the ferocity of her surgeon’s. “If I see a cripple with no teeth, I get him to be my apprentice, and he stays a month. I put him to work in a laundry. I give him a salary of seven hundred pounds a year, with a board and everything to live upon.”
    • TTT 2: memory is rather dim, and she was not accustomed to be questioned directly; and she was anxious to see that you could remember as little as possible. You felt sure that by telling a story from childhood that had so often caught your fancy, you would escape this forced duty. So you went in, decided to visit the fair each evening, and spent an hour or two putting it off.
    • TTT 3: vicarage would be a long, square room overlooking a grassy acre. Not less than four hundred acres were required to cover the floor of the park with his striking landscape. The grounds were monotonous, both in their beauty and in their repetition. To name a single flower would be so pedestrian an exercise that I am surprised that any of my country neighbours ever took that course. 
  • So there is something there, but working out the right probe is still an issue.
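  • Just to pin the probe idea down, a sketch of how one might be assembled (entirely hypothetical names; the hard part is what goes in the strings):
    def build_probe(game_text: str, character_name: str) -> str:
        # everything up to the latest point in the game, then the character's
        # name as a cue so the model continues in that character's voice
        return "{} {}: ".format(game_text.strip(), character_name)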

GOES

  • CNNs
  • The frontier of simulation-based inference
    • Many domains of science have developed complex simulations to describe phenomena of interest. While these simulations provide high-fidelity models, they are poorly suited for inference and lead to challenging inverse problems. We review the rapidly developing field of simulation-based inference and identify the forces giving new momentum to the field. Finally, we describe how the frontier is expanding so that a broad audience can appreciate the profound change these developments may have on science.

Phil 4.20.20

GOES

  • Reading the Distill article on Gaussian processes (highlighted page here)
  • Copy over neural-tangents code from notebook to IDE
  • Working on regression
  • Ran into a problem with Tensorboard
    Traceback (most recent call last):
      File "d:\program files\python37\lib\runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "d:\program files\python37\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "D:\Program Files\Python37\Scripts\tensorboard.exe\__main__.py", line 7, in 
      File "d:\program files\python37\lib\site-packages\tensorboard\main.py", line 75, in run_main
        app.run(tensorboard.main, flags_parser=tensorboard.configure)
      File "d:\program files\python37\lib\site-packages\absl\app.py", line 299, in run
        _run_main(main, args)
      File "d:\program files\python37\lib\site-packages\absl\app.py", line 250, in _run_main
        sys.exit(main(argv))
      File "d:\program files\python37\lib\site-packages\tensorboard\program.py", line 289, in main
        return runner(self.flags) or 0
      File "d:\program files\python37\lib\site-packages\tensorboard\program.py", line 305, in _run_serve_subcommand
        server = self._make_server()
      File "d:\program files\python37\lib\site-packages\tensorboard\program.py", line 409, in _make_server
        self.flags, self.plugin_loaders, self.assets_zip_provider
      File "d:\program files\python37\lib\site-packages\tensorboard\backend\application.py", line 183, in standard_tensorboard_wsgi
        flags, plugin_loaders, data_provider, assets_zip_provider, multiplexer
      File "d:\program files\python37\lib\site-packages\tensorboard\backend\application.py", line 272, in TensorBoardWSGIApp
        tbplugins, flags.path_prefix, data_provider, experimental_plugins
      File "d:\program files\python37\lib\site-packages\tensorboard\backend\application.py", line 345, in __init__
        "Duplicate plugins for name %s" % plugin.plugin_name
    ValueError: Duplicate plugins for name projector
  • After poking around a bit online with the ValueError: Duplicate plugins for name projector error, I found this diagnostic, which basically asked me to reinstall everything. That didn’t work, so I went into Python37\Lib\site-packages and deleted the duplicate TensorBoard packages by hand. TensorBoard now runs, but now I need to upgrade my CUDA install so that I have cudart64_101.dll
    • Installed the minimum set of items from the Nvidia Package Launcher (cuda_10.1.105_418.96_win10.exe)
    • Installed the cuDNN drivers from here: https://developer.nvidia.com/rdp/cudnn-download
  • The regular (e.g. MNIST) demos work, but when I try the distribution code I got this error: tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op ‘NcclAllReduce’. It turns out that there are only two viable MirroredStrategy cross-device ops for Windows, and the default is not one of them. These are the valid calls:
      distribution = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.ReductionToOneDevice())
      distribution = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
    • And this call is not valid on Windows:
      # distribution = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.NcclAllReduce()) # <-- not valid for Windows
  • Funny thing. After reinstalling and getting everything to work, I tried the diagnostic again. It seems it always says to reinstall everything
  • And Tensorboard is working! Here’s the call that puts data in the directory:
    linear_est = tf.estimator.LinearRegressor(feature_columns=feature_columns, model_dir = 'logs/boston/')
  • And when launched on the command line pointing at the same directory:
    D:\Development\Tutorials\Deep Learning with TensorFlow 2 and Keras\Chapter 3>tensorboard --logdir=.\logs\boston
    2020-04-20 11:36:42.999208: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
    W0420 11:36:46.005735 18544 plugin_event_accumulator.py:300] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
    W0420 11:36:46.006743 18544 plugin_event_accumulator.py:312] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
    Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
    TensorBoard 2.1.1 at http://localhost:6006/ (Press CTRL+C to quit)
  • I got this! [image: tensoboard]
  • Of course, we’re not done yet. When attempting to use the Keras callback, I get the following error: tensorflow.python.eager.profiler.ProfilerNotRunningError: Cannot stop profiling. No profiler is running. It turns out that you have to specify the log folder like this:
      • command line:
        tensorboard --logdir=.\logs
      • in code:
        logpath = '.\\logs'

         

     

  • That seems to be working! [image: RunningTBNN]
  • Finished regression chapter

ASRC

  • Submitted RFI response for review

ACSOS

  • Got Antonio’s comments back

D20

  • Need to work on the math to find second bumps
    • If the rate has been < x% (maybe 2.5%), calculate an offset that leaves a value of 100 for each day. When the rate jumps by more than y% (e.g. going from 100 to 120 is a 20% jump), freeze that number until the rate settles down again, then repeat the process (a sketch of one reading of this follows this list)
    • Change the number of samples to be the last x days
  • Work with Zach to get maps up?
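  • A minimal sketch of one reading of that bump logic (names, thresholds, and structure are all my guesses):
    def second_bump_days(counts, calm_rate=0.025, jump_rate=0.20):
        # counts: cumulative cases by day. While day-over-day growth stays
        # below calm_rate, keep rebasing the baseline; when growth jumps past
        # jump_rate (e.g. 100 to 120 is a 20% jump), record the day and the
        # frozen baseline, then wait for things to settle before rebasing again
        baseline = counts[0]
        bumps = []
        frozen = False
        for i in range(1, len(counts)):
            rate = (counts[i] - counts[i - 1]) / max(counts[i - 1], 1)
            if rate < calm_rate:
                baseline = counts[i]  # settled: move the baseline up
                frozen = False
            elif rate > jump_rate and not frozen:
                bumps.append((i, baseline))  # flag a possible second bump
                frozen = True
        return bumps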

ML seminar

Phil 4.15.20

Fix siding from wind!

D20

  • Talked to Aaron about taking a derivative of the regression slope to see what it looks like. There may be common features in the pattern of rates, or of the slopes of the regressions changing over time
  • Still worried about countries that don’t report well. I’d like to be able to use rates from neighboring countries as some kind of check
  • Got the first pass on a world map json file done
  • Spread of SARS-CoV-2 in the Icelandic Population
    • As of April 4, a total of 1221 of 9199 persons (13.3%) who were recruited for targeted testing had positive results for infection with SARS-CoV-2. Of those tested in the general population, 87 (0.8%) in the open-invitation screening and 13 (0.6%) in the random-population screening tested positive for the virus. In total, 6% of the population was screened. Most persons in the targeted-testing group who received positive tests early in the study had recently traveled internationally, in contrast to those who tested positive later in the study. Children under 10 years of age were less likely to receive a positive result than were persons 10 years of age or older, with percentages of 6.7% and 13.7%, respectively, for targeted testing; in the population screening, no child under 10 years of age had a positive result, as compared with 0.8% of those 10 years of age or older. Fewer females than males received positive results both in targeted testing (11.0% vs. 16.7%) and in population screening (0.6% vs. 0.9%). The haplotypes of the sequenced SARS-CoV-2 viruses were diverse and changed over time. The percentage of infected participants that was determined through population screening remained stable for the 20-day duration of screening.

ACSOS

  • Finished first pass of the lit review. Now at 13 pages

GOES

  • Start looking at GANs. Also work on fixing Optevolver for multiple CPUs
    • Starting Deep Learning with TensorFlow 2 and Keras: Regression, ConvNets, GANs, RNNs, NLP, and more with TensorFlow 2 and the Keras API, 2nd Edition. Chapter six is GANs, which is what I’m interested in, but I’m ok with getting some review in first.
    • Working on embeddings with the IMDB sentiment-analysis project. It’s the first time I’ve seen an embedding layer, which is 1) cool, and 2) something to play with. I’d noticed when I was working with Word2Vec for my research that embeddings didn’t seem to change shape much as a function of the number of dimensions. It seemed like a lot of information was being kept at very low dimensions, like three, rather than the more accepted 128 or so:

[image: place-embeddings]

    • Well, this example gave me an opportunity to test that with some accuracy numbers. Here’s what I get:

[image: EmbeddingDimensions]

    • That is super interesting. It basically means that model building, testing, and visualization can happen at low dimensions. That makes everything faster, with about a 10% accuracy improvement still available as one of the last steps by moving back to higher dimensions.
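    • For the record, a minimal sketch of the kind of sweep behind that chart (my own reconstruction, not the book’s code):
      from tensorflow.keras import layers, models

      def build_model(embedding_dim, vocab_size=10000, max_len=200):
          # identical model with a swappable embedding width, so accuracy can
          # be compared across dimensions
          model = models.Sequential([
              layers.Embedding(vocab_size, embedding_dim, input_length=max_len),
              layers.GlobalAveragePooling1D(),
              layers.Dense(1, activation='sigmoid')])
          model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
          return model

      for dim in (3, 8, 32, 128):
          model = build_model(dim)  # then fit on the padded IMDB data and record test accuracy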
    • Continuing with book.
  • Wrote up a response to Mike M’s questions about the white paper. Probably pointless, and has pretty much wasted my afternoon. And it was pointless! Now what?
  • Slides for John?

Phil 4.14.20

Fix siding from wind!

D20

  • I want to try taking a second derivative of the rates to see what it looks like. There may be common features in the pattern of rates, or of the slopes of the regressions changing over time
  • I’m also getting worried about countries that don’t report well. I’d like to be able to use rates from neighboring countries as some kind of check
  • Work with Zach on cleanup and map integration?

COVID Twitter

  • Finished ingesting the new data. It took almost 24 hours

ACSOS

  • Finished first pass of the introduction. Still at 14 pages

GOES

Phil 4.13.20

That was a very solitary weekend. I fixed some bikes, planted some herbs and vegetables, cleaned house, and procrastinated about pretty much everything else. I pinged Don and Wayne about D20 ideas, and got a ping for more info from Don, then silence. Everyone seems to be wrapped up tight in their worlds.

And for good reason. Maryland is looking grim:

[image: Maryland_4_13_2020]

D20

  • Worked with Zach to get states in. It’s working!

[image: D20USA]

COVID Twitter

  • Went looking for new data to ingest, but didn’t see anything new? It wasn’t there yet. Ingesting now
  • 1:30 Meeting

ACSOS

  • Reading through paper and pulling out all the parts from Simple Trick
  • Ping Antonio to let him know I’m working

GOES

  • Get absolute queries working in InfluxDB2. It took some looking, but here’s an example from the API reference on range(). Done!
    • Everything is in GMT. As usual, the parser is picky about the format, which is ISO-8601:
      range_args = "start:2020-04-13T13:30:00Z, stop:2020-04-13T13:30:10Z"
  • Start on TF2/GANs for converting square waves to noisy sine waves of varying frequencies, using saved InfluxDB data
    • First, pull a square, sine, and noisy sine wave and plot them using matplotlib so we know we have good vectors. Success!

[image: Waveforms]

Fika

Phil 4.10.20

Went grocery shopping yesterday. I go a little less than once every two weeks, and every time I go, the world has changed. Now, everyone wears masks. We wait in a spread-out line to enter the store. I bring everything home and stage it in the basement before cleaning it and bringing it to the kitchen. And yet it’s also spring, and the world is sunny and smells of growing things. I wake up to birds chirping in the morning, and opened the windows a couple of times this week.

D20

  • Trying to get ahold of Zach. Finally connected in the late afternoon and made some tweaks. Leaflet might be a good map API

GOES

  • I think today’s goal is to build a little Python app that I can run from the command line that loads samples in real time into InfluxDB: square waves (SqW), sine waves (SiW), and noisy sine waves (NoW). Then I want to build one network that produces a NoW from a SqW, and another that tries to detect the difference between an actual NoW and a synthesized one.
  • Working!
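  • A minimal sketch of the generator side (my own naming; the real app also writes each sample to InfluxDB):
    import math, random

    def square_wave(t, period=1.0, amplitude=1.0, offset=0.0):
        # square wave with the same phase convention as the sine below
        return amplitude if math.sin(t / period + offset) >= 0 else -amplitude

    def sin_wave(t, period=1.0, amplitude=1.0, offset=0.0):
        return math.sin(t / period + offset) * amplitude

    def noisy_sin_wave(t, period=1.0, amplitude=1.0, offset=0.0, noise=0.1):
        return sin_wave(t, period, amplitude, offset) + random.gauss(0.0, noise)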

[image: influxAndSim]

  • Queries are returning as well. Here’s the last ten seconds from sin_p1_a1.0_o0.5:
FluxTable() columns: 8, records: 8
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 34, 54, tzinfo=datetime.timezone.utc), '_value': 0.9690184703994814, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 34, 55, tzinfo=datetime.timezone.utc), '_value': 0.9395197317147641, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 34, 56, tzinfo=datetime.timezone.utc), '_value': 0.9006336224346869, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 34, 57, tzinfo=datetime.timezone.utc), '_value': 0.8527486797091374, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 34, 58, tzinfo=datetime.timezone.utc), '_value': 0.7963433540571716, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 34, 59, tzinfo=datetime.timezone.utc), '_value': 0.7319812288475823, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 35, tzinfo=datetime.timezone.utc), '_value': 0.6603053891601736, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
{'result': '_result', 'table': 0, '_start': datetime.datetime(2020, 4, 10, 14, 34, 53, 868766, tzinfo=datetime.timezone.utc), '_stop': datetime.datetime(2020, 4, 10, 14, 35, 3, 868766, tzinfo=datetime.timezone.utc), '_time': datetime.datetime(2020, 4, 10, 14, 35, 1, tzinfo=datetime.timezone.utc), '_value': 0.5820319962922194, '_field': 'val', '_measurement': 'sin_p1_a1.0_o0.5'}
    • Now I need to extract the useful info for ML processing. That was easy:
      from typing import Dict

      for table in tables:
          print(table)
          rd: Dict
          for record in table.records:
              rd = record.values
              print("time = {}, name = {}, value = {}".format(rd["_time"], rd["_measurement"], rd["_value"]))
    • Hmmm. I can do a specific query:
      tables = query_api.query('from(bucket:"{}") |> range(start: -10s) |> filter(fn:(r) => r._measurement == "sin_p1_a1.0_o0.5")'.format(bucket))

      but using wildcards like * chokes

      tables = query_api.query('from(bucket:"{}") |> range(start: -10s) |> filter(fn:(r) => r._measurement == "sin_*")'.format(bucket))
  • Time to RTFM.
    • The syntax is specified using Extended Backus-Naur Form (“EBNF”). EBNF is the same notation used in the Go programming language specification, which can be found here. Not so coincidentally, InfluxDB is written in Go.
      • I wonder if these folks are ex-Googlers?
    • Wrong manual – the link above is for version 1.7. The query reference for 2.0 is here.
  • Basically, it’s a very simple query language, which is why you need all the tags. So here’s how it works.
    • First, create well-tagged data:
      def write_point(self, name: str, val: float, tags: Dict = {}):
          p = Point(name).field(self.keyfield, val).time(self.current_time)
          for key, tag_val in tags.items():
              p.tag(key, tag_val)
          self.write_api.write(bucket=self.bucket_name, record=p)
          print("\tto_influx {}".format(p.to_line_protocol()))

      def sin_wave(self, t: float, period: float = 1.0, amplitude: float = 1.0, offset: float = 0, name: str = None):
          tags = {"type": "sin", "period": period, "amplitude": amplitude, "offset": offset}
          if name is None:
              name = "sin_p{}_a{}_o{}".format(period, amplitude, offset)
          val = math.sin(t / period + offset) * amplitude
          self.write_point(name, val, tags)

      Here we have two methods: one that creates a value for a point on a sine wave, and one that writes the point. In this case, all the tags are stored in a Dict and passed as an argument to write_point(), which is used by all the various wave functions. The output looks like this:

      to_influx noisy_sin_p7_a1.0_o3.5,amplitude=1.0,offset=3.5,period=7,type=noisy_sin val=0.13146298019922603 1586545970000000000
      to_influx square_p7_a1.0_o3.5,amplitude=1.0,offset=3.5,period=7,type=square val=0.0006153287497587468 1586545970000000000
      to_influx sin_p8_a1.0_o4.0,amplitude=1.0,offset=4.0,period=8,type=sin val=0.8523503891730094 1586545970000000000
      to_influx noisy_sin_p8_a1.0_o4.0,amplitude=1.0,offset=4.0,period=8,type=noisy_sin val=0.717585870814358 1586545970000000000
    • To query this, we do the following:
      query_api = self.client.query_api()
      # Queries have these basic components, connected by the pipe-forward operator (|>), which applies each step in sequence:
      # source: from(bucket: "my-bucket")
      # range (relative): range(start: -1h, stop: -10m). The stop is optional; if left out, all results up to the present will be returned
      # range (absolute): range(start: 2018-11-05T23:30:00Z, stop: 2018-11-06T00:00:00Z)
      # filter: an anonymous function that compares string values with the >, <, and == comparators. There are no wildcards. This is why tagging is important.

      Just remember that all tags are regarded as strings (as you can see in the filter_func string below), so be careful in generating them if they represent floating-point values!

      filter_func = 'r.type == "sin" and r.period == "4"'
      tables = query_api.query('from(bucket:"{}") |> range(start: -10s) |> filter(fn:(r) => {})'.format(bucket, filter_func))
    • This query gives the following result:
      FluxTable() columns: 12, records: 7
      type = sin period = 4, time = 2020-04-10 19:16:44+00:00, name = sin_p4_a1.0_o2.0, value = -0.7178200203799832
      type = sin period = 4, time = 2020-04-10 19:16:45+00:00, name = sin_p4_a1.0_o2.0, value = -0.7349996180484573
      type = sin period = 4, time = 2020-04-10 19:16:46+00:00, name = sin_p4_a1.0_o2.0, value = -0.7517198648809216
      type = sin period = 4, time = 2020-04-10 19:16:47+00:00, name = sin_p4_a1.0_o2.0, value = -0.7679703112673733
      type = sin period = 4, time = 2020-04-10 19:16:48+00:00, name = sin_p4_a1.0_o2.0, value = -0.7837408012077955
      type = sin period = 4, time = 2020-04-10 19:16:49+00:00, name = sin_p4_a1.0_o2.0, value = -0.7990214786593275
      type = sin period = 4, time = 2020-04-10 19:16:50+00:00, name = sin_p4_a1.0_o2.0, value = -0.8138027936959693
  • And that’s enough for the day/week, I think

Phil 4.8.20

D20:

  • Talk to Zach about chart size bug?
    • Yes! The charts are fixed. We also went through the rest of the punch list.
    • Had to update the json file to handle date and other meta information
  • We are going to need a top-level dashboard, something like the number of countries in the DANGER, WARNING, and CONTROLLED buckets

COVID Twitter

  • Continue getting spreadsheets ingested.
  • Got the first one in, trying all of them now. Had to remember about INSERT IGNORE
  • It’s chugging along!

GOES

  • Got the db behaving! [image: influxWithData]
  • The first and most important thing is that you have to multiply unixtime by 1,000,000,000 for it to work (line-protocol timestamps default to nanoseconds). Got that from this page in the 1.7 guide
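    In code, that’s just (a one-line sketch):

    import time
    ns_timestamp = int(time.time()) * 1_000_000_000  # line-protocol timestamps default to nanoseconds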
  • Second is how tags can be added in code:
    p = Point(self.name).field(self.keyfield, t[0]).time(t[1])
    for key, val in self.tags_dict.items():
        p.tag(key, val)

    That’s pretty nice.

  • Another nice feature that I discovered looking through the code is that there is a to_line_protocol() method, which produces correct lines. It looks like the InfluxDB parser doesn’t like spaces. Here’s an example of correct lines that I am reading in:
    measure_1,tagKey_1=tagValue_11,tagKey_2=tagValue_12,tagKey_3=tagValue_13 val_1=0.0 1586352302000000000
    measure_1,tagKey_1=tagValue_11,tagKey_2=tagValue_12,tagKey_3=tagValue_13 val_1=0.09983341664682815 1586352312000000000
    measure_1,tagKey_1=tagValue_11,tagKey_2=tagValue_12,tagKey_3=tagValue_13 val_1=0.19866933079506122 1586352322000000000
    measure_1,tagKey_1=tagValue_11,tagKey_2=tagValue_12,tagKey_3=tagValue_13 val_1=0.29552020666133955 1586352332000000000
    measure_1,tagKey_1=tagValue_11,tagKey_2=tagValue_12,tagKey_3=tagValue_13 val_1=0.3894183423086505 1586352342000000000
    measure_1,tagKey_1=tagValue_11,tagKey_2=tagValue_12,tagKey_3=tagValue_13 val_1=0.479425538604203 1586352352000000000

    The reason that I’m reading in data is that the direct, SYNCHRONOUS writes to the database are pretty slow. Looking into that.

  • Coming up next, queries

Phil 4.7.20

D20:

  • Talk to Zach about chart size bug?
  • We are going to need a top-level dashboard, something like the number of countries in the DANGER, WARNING, and CONTROLLED buckets
  • Should look into using scipy’s linregress method to get accuracy values – done!
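    • A tiny sketch of what that looks like (made-up data; r² is the accuracy measure I’m after):
      import numpy as np
      from scipy import stats

      y = np.array([3.0, 4.1, 5.2, 5.9, 7.1])  # made-up daily values
      x = np.arange(len(y))
      slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
      print("r^2 = {:.3f}, std_err = {:.3f}".format(r_value ** 2, std_err))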

COVID Twitter

    • Read xls files into db (using this)
    • Wow, you can recursively get files in three lines, including the import:
      import glob
      for filename in glob.iglob("./" + '**/*.xls', recursive=True):
          print(filename)
    • Had to do a bunch of things to get Arabic to store correctly. I think I need to set the database to:
      alter database covid_misinfo character set utf8 collate utf8_general_ci;

      , then set the table to utf-8, like so

      DROP TABLE IF EXISTS `table_tweets`;
      /*!40101 SET @saved_cs_client     = @@character_set_client */;
      /*!40101 SET character_set_client = utf8 */;
      CREATE TABLE `table_tweets` (
        `GUID` bigint(20) NOT NULL,
        `date` datetime NOT NULL,
        `URL` varchar(255) DEFAULT NULL,
        `contents` mediumtext NOT NULL,
        `translation` varchar(255) DEFAULT NULL,
        `author` varchar(255) DEFAULT NULL,
        `name` varchar(255) DEFAULT NULL,
        `country` varchar(255) DEFAULT NULL,
        `city` varchar(255) DEFAULT NULL,
        `category` varchar(255) DEFAULT NULL,
        `emotion` varchar(255) DEFAULT NULL,
        `source` varchar(255) DEFAULT NULL,
        `gender` varchar(16) DEFAULT NULL,
        `posts` int(11) DEFAULT NULL,
        `followers` int(11) DEFAULT NULL,
        `following` int(11) DEFAULT NULL,
        `influence_score` float DEFAULT NULL,
        `post_title` varchar(255) DEFAULT NULL,
        `post_type` varchar(255) DEFAULT NULL,
        `image_url` varchar(255) DEFAULT NULL,
        `brand` varchar(255) DEFAULT NULL,
        PRIMARY KEY (`GUID`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

      Anyway, it’s now working! (RT @naif_khalaf رحلة تطوير لقاح وقائي لمرض كورونا. استغرقت ٤ سنوات من المعمل لحيوانات التجارب للدراسات الحقلية على الإبل ثم للدراسة السريرية الأولية على البشر المتطوعين. ولازالت مستمرة. https://t.co/W3MjaFOAoC — roughly: “The journey of developing a preventive vaccine for corona: it took four years, from the lab, to lab animals, to field studies on camels, and then to the initial clinical study on human volunteers. And it’s still ongoing.”)

GOES

  • Write, visualize, and query test data
        • Writing seems to be working? I don’t get any errors, but I can’t see anything show up
        • Here’s an example of the data in what I think is correct line format:
          measure_1, tagKey_1=tagValue_11 val_1=0.0 1586270395
          measure_1, tagKey_1=tagValue_11 val_1=0.09983341664682815 1586270405
          measure_1, tagKey_1=tagValue_11 val_1=0.19866933079506122 1586270415
          measure_1, tagKey_1=tagValue_11 val_1=0.2955202066613396 1586270425
          measure_1, tagKey_1=tagValue_11 val_1=0.3894183423086505 1586270435
          measure_1, tagKey_1=tagValue_11 val_1=0.479425538604203 1586270445
          measure_1, tagKey_1=tagValue_11 val_1=0.5646424733950355 1586270455
          measure_1, tagKey_1=tagValue_11 val_1=0.6442176872376911 1586270465
          measure_1, tagKey_1=tagValue_11 val_1=0.7173560908995228 1586270475
          measure_1, tagKey_1=tagValue_11 val_1=0.7833269096274834 1586270485
          measure_1, tagKey_1=tagValue_11 val_1=0.8414709848078965 1586270495
          measure_1, tagKey_1=tagValue_11 val_1=0.8912073600614354 1586270505

          Here’s how I’m writing it:

          def to_influx(self, client: InfluxDBClient, bucket_name: str, org_name: str):
              write_api = client.write_api(write_options=SYNCHRONOUS)
              for i in range(len(self.measurement_list)):
                  t = self.measurement_list[i]
                  # note: this writes one point per tag; all the tags really belong on a single point
                  for key, val in self.tags_dict.items():
                      p = Point(self.name).tag(key, val).field(self.keyfield, t[0])
                      write_api.write(bucket=bucket_name, record=p)
                      print("writing {}, {}={}, {}={} {}".format(self.name, key, val, self.keyfield, t[0], t[1]))

          That seems to work. Here’s the output while it’s storing:

          writing measure_10, tagKey_1=tagValue_101, val_10=-0.34248061846961253 1586277701
          writing measure_10, tagKey_1=tagValue_101, val_10=-0.2469736617366209 1586277691
          writing measure_10, tagKey_1=tagValue_101, val_10=-0.1489990258141953 1586277681
          writing measure_10, tagKey_1=tagValue_101, val_10=-0.04953564087836742 1586277671
          writing measure_10, tagKey_1=tagValue_101, val_10=0.05042268780681122 1586277661

          I get no warnings or errors, but the Data Explorer is blank: [image: influxdb]

        • Oh, you have to use Unix Timestamps in milliseconds (timestamp * 1000):
          mm.add_value(val, ts*1000)
        • Ok, it’s working, but my times are wrong: [image: wrong_times]

       

  • 1:00 IRAD meeting

ML Seminar

Phil 4.6.20

Based on a chat with David K, I’m going to see if I can add a field to the detail view that says whether the estimate is better or worse than yesterday’s. Something like “Today’s estimate is x days better/worse than yesterday’s.”

  • And it seems to be working. Need to get it on the website next

Get twitter parser to MySQL converter built

  • Created the table in mySQL
  • Dumped the .sql file (with just the table) to src/data

Continue to set up influx on laptop.

    • Set a fixed IP address – done! In Lubuntu, it’s done through Settings->Advanced Network Configuration. I tried just setting the address manually, but it didn’t like that. So I let DHCP automatically find an address and didn’t delete the static one, and now I can reach both?
      Pinging 192.168.1.183 with 32 bytes of data:
      Reply from 192.168.1.183: bytes=32 time=1ms TTL=64
      Reply from 192.168.1.183: bytes=32 time<1ms TTL=64
      Reply from 192.168.1.183: bytes=32 time<1ms TTL=64
      Reply from 192.168.1.183: bytes=32 time<1ms TTL=64

      Ping statistics for 192.168.1.183:
          Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
      Approximate round trip times in milli-seconds:
          Minimum = 0ms, Maximum = 1ms, Average = 0ms

      C:\Users\Phil>ping 192.168.1.111
      
      Pinging 192.168.1.111 with 32 bytes of data:
      Reply from 192.168.1.111: bytes=32 time=297ms TTL=64
      Reply from 192.168.1.111: bytes=32 time<1ms TTL=64
      Reply from 192.168.1.111: bytes=32 time<1ms TTL=64
      Reply from 192.168.1.111: bytes=32 time<1ms TTL=64
      
      Ping statistics for 192.168.1.111:
          Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
      Approximate round trip times in milli-seconds:
          Minimum = 0ms, Maximum = 297ms, Average = 74ms
    • And I’m connected over the local network! [image: influxRunning]
    • Generate a set of square and sine waves, then store and retrieve them.
    • Built a generator and can save to a file, but it looks like I need to use the API? Here’s the Python page.
    • How to do it?
      from influxdb_client import InfluxDBClient, Point, WriteOptions

      _client = InfluxDBClient(url="http://localhost:9999", token="my-token", org="my-org")
      _write_client = _client.write_api(write_options=WriteOptions(batch_size=500,
                                                                   flush_interval=10_000,
                                                                   jitter_interval=2_000,
                                                                   retry_interval=5_000))

      """
      Write Line Protocol formatted as string
      """
      _write_client.write("my-bucket", "my-org", "h2o_feet,location=coyote_creek water_level=1.0 1")
      _write_client.write("my-bucket", "my-org", ["h2o_feet,location=coyote_creek water_level=2.0 2",
                                                  "h2o_feet,location=coyote_creek water_level=3.0 3"])
  • COVID Misinfo meeting
    • Talked about the lateral thinking paper, and that we basically want to automate that.
    • We’re going to put some tweet threads together for misinfo, ambiguous, and trustworthy and have a small Twitter party next week
    • Finish getting the extractor running. There will be other tables as we start to figure things out

And I am done for the day. I hate this fucking timeline

 

 

Phil 4.3.20

Temp is up a bit this morning, which, of course, I’m overreacting to.

Need to get started on State information from here: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv

Generated some favicons from here: https://favicon.io/favicon-generator/, which, of course, we didn’t use

Getting close to something that we can release

GOES:

  • Update Linux on laptop and try Influx there. Nope. The laptop is hosed. [image: hosed]
  • Grabbing another computer to configure. I mean, worst case, I can set up the work laptop as an Ubuntu box. I’d love to know if Influx would work FIRST, though. Looks like I have to. My old dev box won’t boot. Backing up.
  • Installed Debian on the work laptop. It seems to be booting? Nope.
  • I guess we’ll try Ubuntu again? Nope. Trying one more variant.
  • Trying lubuntu. It uses different drivers for some things, and so far hasn’t frozen or blocked yet. It works!
  • And now the Docker version (docker run --name influxdb -p 9999:9999 quay.io/influxdb/influxdb:2.0.0-beta) works too. Maybe because the system got upgraded?
  • 11:00 IRAD Meeting
    • Send note about NOAA being a customer for simulated anomalies for machine learning

Phil 4.2.20

Wake up, shower, write some code. The linear estimate is now integrated with the predictions. I think that tells the story well. Rather than Italy, let’s look at Switzerland:

[image: Switzerland_4_2_2020]

Now I need to put together a punch list of final issues (I’m worried most about load times and performance under high demand), get the URL(s) and publish!

Punch list is done. Zach is getting the domains.

Here’s the UI description: [image: help diagram]

GOES: 8 hours

  • Status report for March
  • InfluxDB!
  • Hmm – I appear to have broken Docker? No, just being dumb with commands. Here’s what I needed:
    • docker container run -it --name influx2 ubuntu /bin/sh
    • Success! [image: teminal3]
  • Time to RTFM: v2.docs.influxdata.com/v2.0/get-started/
  • getting my Ubuntu image current, using
    • apt update
    • apt upgrade
    • apt install wget
    • wget https://dl.influxdata.com/influxdb/releases/influxdb_2.0.0-beta.7_linux_amd64.tar.gz
    • tar -xvzf influxdb_2.0.0-beta.7_linux_amd64.tar.gz
    • created a user so I can have things like tab complete (adduser). Created phil with regular test pwd
    • Hmm influxd isn’t on the path. Going to try running it in its directory
    • Things are happening! [image: teminal4]
    • But the webserver isn’t visible at localhost:9999 (ERR_CONNECTION_REFUSED). Drat! Still happens when I run as root
  • Town hall meeting