Monthly Archives: April 2017

Phil 4.12.17

7:00 – 8:00 Research

Just found the HCIC Boaster Poster description
Resistbot is kind of along the lines of what I was thinking about anonymous news input. An article on chatbot technology from CIO. One of the interesting platforms is ChatFuel, which has a non-programming (example-based) creation process. It’s really tied into FB. Not sure I want to do that without setting up a specific account.
Gupshup is another bot system that deploys to a lot of platforms (FB, Twitter, Slack, etc.)
NLP/NLU services. Here’s Google’s documentation for NLP and Prediction, which seems related
Downloading the Qt community IDE. Not sure if it has Designer or not. Also, there was only the option to download versions 5.x, so the V4 options of PyQt may be problematic. Big install. Looks like there is lots of documentation, though.
Downloaded and running. Tutorials tomorrow

8:30 – 1:00 BRC

Adding stats dataframe – done. Interesting results in the test128 data:
Sprint grooming. And while that’s running, reading Thoughtful Machine Learning. Downloaded accompanying code from GitHub

Phil 4.11.17

7:00 – 8:00 Research

Information ecology thoughts. A lot of what I am doing can be framed from this perspective. Note that collective intelligence and knowledge ecosystem are referenced on the Wikipedia page
Working on card
Junk News and Bots during the U.S. Election: What Were Michigan Voters Sharing Over Twitter? And here’s the full paper.
- Reading the paper. This also makes me think about Kate Starbird‘s work. She frames her data looking at the Boston bombing, but if you look at Google trends, you’ll see that ‘False Flag’ goes at least as far back as the Aurora shootings. What I thought of as ‘patient zero’, gamergate, is later than both of those events:

8:30 – 4:00 BRC

Running clustering on local machine.
Need to add a stats DataFrame that has eps, min_size, c.num_clusters, c.get_num_clustered, c.get_num_items. Write that out at the end. Tomorrow
Added reading of csv file and conversion to a DataFrame xlsx. Played around with that in Excel. Some nice results if I take the log of the value. Need to try clustering on that, but I want to add the stats output first
Jonker-Volgenant Algorithm + t-SNE = Super Powers. This could be a way of producing a map view for the research browser
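The stats DataFrame mentioned above could look something like the following. This is a minimal sketch, assuming cluster labels arrive as a flat list with -1 marking unclustered items. `summarize_run`, the sample labels, and the output filename are hypothetical, standing in for the `c.num_clusters`, `c.get_num_clustered`, and `c.get_num_items` accessors in the real code.

```python
import pandas as pd

STAT_COLUMNS = ["eps", "min_size", "num_clusters", "num_clustered", "num_items"]


def summarize_run(eps, min_size, labels):
    """Return one row of run statistics for a list of cluster labels.

    Assumption: a label of -1 means the item was not assigned to a cluster.
    """
    clustered = [label for label in labels if label != -1]
    return {
        "eps": eps,
        "min_size": min_size,
        "num_clusters": len(set(clustered)),
        "num_clustered": len(clustered),
        "num_items": len(labels),
    }


# Hypothetical label lists from two runs with different parameters.
runs = [
    summarize_run(0.1, 5, [0, 0, 1, -1, 1, -1]),
    summarize_run(0.2, 5, [0, 0, 0, 0, 1, -1]),
]
stats_df = pd.DataFrame(runs, columns=STAT_COLUMNS)

# Write the table out once all runs have finished (filename is illustrative).
# stats_df.to_csv("cluster_stats.csv", index=False)
```

Collecting one dict per run and building the DataFrame at the end keeps the clustering loop free of DataFrame bookkeeping.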

Phil 4.10.17

7:00 – 8:30, 3:00 – 6:00 Research

Continuing submission. Figuring out PhySH. That’s a pretty nice system!
Done!

05Apr2017 es2017apr05_602 Physical Review E (Regular Article)
Status: Submitted 10Apr2017-07:31 EDT
Title: Modeling the Law of Group Polarization
Authors: Feldman,Philip / Engel,Don
Starting on business cards

9:00 – 2:30 BRC

Sprint retrospective
I seem to have lost the cluster I ran last week. Rerunning. I thought there might be an error since there were no (-1), but it’s because the DataFrame has the values of the item replaced with the cluster ID +2. This means that a row with a cluster ID of zero is completely empty, and +1 is unclustered.
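To keep the +2 offset straight, here is a minimal sketch of the encoding as I read it from the note above. The names `EMPTY`, `UNCLUSTERED`, `encode_label`, and `decode_cell` are hypothetical, and the assumption is that the clustering step uses -1 for unclustered items.

```python
EMPTY = 0        # cell had no value at all
UNCLUSTERED = 1  # clustering label -1 shifted by +2


def encode_label(label):
    """Shift a cluster label by +2 so that 0 can be reserved for empty cells."""
    return label + 2


def decode_cell(cell):
    """Map a stored cell value back to a description of what it means."""
    if cell == EMPTY:
        return "empty"
    if cell == UNCLUSTERED:
        return "unclustered"
    return "cluster %d" % (cell - 2)
```

With this mapping, the absence of -1 in the output is expected: unclustered items show up as 1 instead.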

Phil 4.8.17

8:00pm – 12:00pm, plus an hour on the 9th

Spent much time trying to figure out why the PDF wouldn’t print on the website from the LaTeX file. As a result, I learned how to read the log file, fixed some image scaling issues, and learned to be VERY careful in checking the case of the file names. Also, as per here, I learned how to take the bib file and convert it to a bbl file, which can then be pasted into the tex file, resulting in a single file for submission. No errors, and only minor quibbles on citations!

BRC

And this – Open sourcing Sonnet – a new library for constructing neural networks
- It’s now nearly a year since DeepMind made the decision to switch the entire research organisation to using TensorFlow (TF). It’s proven to be a good choice – many of our models learn significantly faster, and the built-in features for distributed training have hugely simplified our code. Along the way, we found that the flexibility and adaptiveness of TF lends itself to building higher level frameworks for specific purposes, and we’ve written one for quickly building neural network modules with TF. We are actively developing this codebase, but what we have so far fits our research needs well, and we’re excited to announce that today we are open sourcing it. We call this framework Sonnet.
- Code page

Phil 4.7.17

7:15 – 8:15 Research

So this happened: Dozens of U.S. Missiles Hit Air Base in Syria. Wonder how it will play out. Like Infinite Reach? That actually had more justification, since Americans were killed in the preceding events…
Direction matching for sparse movement datasets: determining interaction rules in social groups. Data is here, in the very cool Movebank. Reminds me of GLOBE
Still working on getting paper submitted. Word is hanging, which is weird.
The APS journal submissions login page is here

Length Check
Please be aware that this is only an estimate and meant to help avoid delays associated with excessive length.

Journal/article type is not length constrained (4281)

 *** Word count calculation may be inaccurate for the following reasons ***
   * No acknowledgment environment found - acknowledgments may have been counted
   * No bibliography environment found - references may have been counted

  Figure   Aspect Ratio   Wide?   Word Equivalent
     1         1.33         No        132
     2         0.98         No        173
     3         0.93         No        181
     4         1.01         No        168
     5         1.75         No        105
     6         1.67         No        109
     7         2.99         No         70
     8         1.22         No        142
     9         0.95         No        177
    10         1.66         No        110

WORD COUNT SUMMARY
    Note: Text word count excludes title, abstract, byline, PACS,
          receipt date, acknowledgments, and references.

              Text word count   2914 
        Equations word equiv.      0 
          Figures word equiv.   1367
           Tables word equiv.      0
                              ------
                       TOTAL    4281 (Maximum length is unlimited)
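
The figure word equivalents in this report look consistent with the rule for single-column figures in the APS length guide: 150 divided by the aspect ratio, plus 20 words. A minimal sketch that reproduces the table, assuming the result is truncated to an integer (that part is inferred from the numbers, not stated in the report). `figure_word_equivalent` is a hypothetical name.

```python
def figure_word_equivalent(aspect_ratio):
    """Estimate the word equivalent of a single-column figure.

    Assumption: 150 / aspect_ratio + 20, truncated to an integer.
    """
    return int(150 / aspect_ratio + 20)


# Aspect ratios copied from the length-check table (none of the figures is wide).
aspect_ratios = [1.33, 0.98, 0.93, 1.01, 1.75, 1.67, 2.99, 1.22, 0.95, 1.66]
figure_words = [figure_word_equivalent(a) for a in aspect_ratios]

# Text word count from the summary plus the figure equivalents.
total_words = 2914 + sum(figure_words)
```

Since the table only shows aspect ratios rounded to two decimals, a figure close to an integer boundary could come out one word off.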

Getting PDF conversion errors. Not sure what to do next…

WARNING: Supplemental file: ModelingGroupPolarizationNotes.bib.
Checking figure order in 'ModelingGroupPolarization.tex'...
PDF file creation failed:
ERROR: Following figure files not called into TeX:
DTWmatrix.png,toolScreenshot.png,ExplorerPDF.png
AND TeX file calls in the following missing figure files:
toolscreenshot.png,explorerPDF.png,DTWMatrix.png
Please check and fix these file names.
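
The two lists in the error differ only by letter case, which matters on the submission server even if the local file system ignores it. A minimal sketch of a check that could be run before uploading: `check_figure_names` is a hypothetical helper, and the `\includegraphics` lines are made up from the names in the error message, not copied from the real tex file.

```python
import re

# Matches \includegraphics{name} with an optional [options] block.
INCLUDE_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")


def check_figure_names(tex_source, files_on_disk):
    """Compare \\includegraphics targets with file names, case-sensitively.

    Returns (name_in_tex, name_on_disk) pairs that differ only by case.
    """
    called = INCLUDE_RE.findall(tex_source)
    by_lower = {name.lower(): name for name in files_on_disk}
    mismatches = []
    for name in called:
        on_disk = by_lower.get(name.lower())
        if on_disk is not None and on_disk != name:
            mismatches.append((name, on_disk))
    return mismatches


# Sample input built from the names in the error message.
tex = r"""
\includegraphics[width=\columnwidth]{toolscreenshot.png}
\includegraphics{explorerPDF.png}
\includegraphics[width=0.9\columnwidth]{DTWMatrix.png}
"""
files = ["DTWmatrix.png", "toolScreenshot.png", "ExplorerPDF.png"]
mismatches = check_figure_names(tex, files)
```

In practice `files` would come from something like `os.listdir()` on the figure directory.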

8:30 – 2:30 BRC

Imperative programming in TensorFlow
Working on reading in the old integrity as a DataFrame from Excel
Had to install xlrd

TypeError: Image data can not convert to float. This pattern seems useful:

mat = df.as_matrix()  # get the data matrix
mat = mat.astype(np.float64)  # force it to float64
df.update(mat)  # replace the 'float' mat with the 'float64' mat

Well, it was before, but now it won’t update the int64 to float64. This worked though. When in doubt, build a new mat:

mat = mat.astype(np.float64)  # force it to float64
indices = df.index.values
cols = df.columns.values
df = pandas.DataFrame(mat, indices, cols)
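
A possibly simpler alternative, assuming every column is numeric: `DataFrame.astype` returns a new frame with the requested dtype and keeps the index and columns. The data and labels below are hypothetical stand-ins for the sheet read from Excel.

```python
import numpy as np
import pandas as pd

# Hypothetical integer data standing in for the sheet read from Excel.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["row_a", "row_b"],
    columns=["col_x", "col_y"],
)

# astype returns a new DataFrame with the same index and columns,
# so there is no need to rebuild it by hand from the matrix.
df_float = df.astype(np.float64)
```

This would fail on columns holding non-numeric strings, in which case building the matrix by hand as above gives more control.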

Got the data in though. Interesting. This is the densest corner of the sorted matrix:

Phil 4.6.17

8:30 – 5:30 BRC

Worked too late yesterday and slept in. Will try to get the submission issues worked on once the sprint review gets going
Added code that deletes previous file if it exists
Refactored the fitness landscape test to be 100 eps increments by 10 cluster increments, starting at the min_cluster. Seems to produce more useful results
Sprint review at ~~9:30~~, ~~11:00~~, 1:00 Done! Went over my stuff too fast.
- Need to figure out a way to cluster diagnosis codes
- Discovered that the data is still the same integrity data from the last sprint. Need to add a read_dataframe(file_name: str) -> pandas.DataFrame method to hdfs_csv_reader tomorrow.

Phil 4.5.17

7:00 – 8:00 Research

Finishing up poster?
Starting submission for Phys Rev E, based on my notes here.
- Created account
- Created ORCID account
- Started submission. Errors! There were problems with the following fields:
  - AllExploit_SI0.0.psd: Classification description value must be selected for each file and cannot be left unknown.
  - 90_ExploitR10_10_ExploreR0.psd: Classification description value must be selected for each file and cannot be left unknown.
  - polarized.jpg: Classification description value must be selected for each file and cannot be left unknown.
  - clusterMembership.png: Classification description value must be selected for each file and cannot be left unknown.
  - toolScreenshot.png: Classification description value must be selected for each file and cannot be left unknown.
  - ModelingGroupPolarization2col.pdf: Classification description value must be selected for each file and cannot be left unknown.
  - exploring.jpg: Classification description value must be selected for each file and cannot be left unknown.
  - ModelingGroupPolarization.pdf: Classification description value must be selected for each file and cannot be left unknown.
  - psheader.txt: Classification description value must be selected for each file and cannot be left unknown.
  - influenced.jpg: Classification description value must be selected for each file and cannot be left unknown.
  - ExplorerPDF.png: Classification description value must be selected for each file and cannot be left unknown.
  - ModelingGroupPolarization.synctex.gz: Classification description value must be selected for each file and cannot be left unknown.
  - AllExploit_SI10.0.jpg: Classification description value must be selected for each file and cannot be left unknown.
  - ModelingGroupPolarizationDraft.pdf: Classification description value must be selected for each file and cannot be left unknown.
  - flocking.jpg: Classification description value must be selected for each file and cannot be left unknown.
  - DTWmatrix.png: Classification description value must be selected for each file and cannot be left unknown.
  - ModelingGroupPolarization.log: Classification description value must be selected for each file and cannot be left unknown.
  - ModelingGroupPolarization.dvi: Classification description value must be selected for each file and cannot be left unknown.
  - AllExploit_SI10.0.psd: Classification description value must be selected for each file and cannot be left unknown.
  - ModelingGroupPolarization.blg: Classification description value must be selected for each file and cannot be left unknown.
  - 90_ExploitR10_10_ExploreR0.jpg: Classification description value must be selected for each file and cannot be left unknown.
  - AllExploit_SI0.2.psd: Classification description value must be selected for each file and cannot be left unknown.
  - SimilarPaths.png: Classification description value must be selected for each file and cannot be left unknown.
  - RevTex41_example.aux: Classification description value must be selected for each file and cannot be left unknown.
  - label 0 is duplicated
  - ModelingGroupPolarization.aux: Classification description value must be selected for each file and cannot be left unknown.
  - revtex41_template.blg: Classification description value must be selected for each file and cannot be left unknown.
  - RevTex41_example.blg: Classification description value must be selected for each file and cannot be left unknown.
  - DTWclusters.png: Classification description value must be selected for each file and cannot be left unknown.
  - Main text file is required

9:00 – ~~5:00~~, ~~6:00~~, 7:30 BRC

Working on HDFS reading and writing
Integrating code
Compiled! Now waiting for things to blow up

Success! After fixing this:

writer.write(u'%s' % cstr) # good
writer.write('%s', cstr) # bad
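
Why the second form fails, as far as I can tell: `write()` on a text stream takes a single string and does no printf-style formatting, so the format has to be applied before the call. A minimal sketch using `io.StringIO` as a stand-in for the HDFS writer, which is an assumption. The real writer may report the problem differently, and `cstr` here is hypothetical sample text.

```python
import io

cstr = "eps,min_size\n0.1,5\n"  # hypothetical CSV text
writer = io.StringIO()          # stands in for the HDFS writer

# Good: the string is formatted first, then passed as a single argument.
writer.write(u"%s" % cstr)

# Bad: write() does not do printf-style formatting; it accepts one argument.
try:
    writer.write("%s", cstr)
    second_call_failed = False
except TypeError:
    second_call_failed = True
```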

Phil 4.4.17

7:00 – 8:30 Research

Add Amundsen/Scott and implications for design. Done, but too many ‘incorporate’. Need better words. Scott & Amundsen look good though
SpaCy – Industrial-Strength Natural Language Processing in Python. Video tutorial
Got Thoughtful Machine Learning with Python. Looks like it hits all the bases from classification to unit testing

9:00 – 5:30 BRC

Need to submit paperwork for collective intelligence
Discussion with Aaron and Jeremy about curation webapp
Test re-hydration code. Had a few minor issues in the code that produced the tests, but all working now
Start on HDFS? Aaron is close to testing his Python code. Integration after that?

Aaron 4.3.17

ML Architecture
- Spent a bunch of time last Friday meeting with Phil to discuss the proposed path for the Machine Learning epics to develop the research browser.
- Our plan uses a thin-client Angular 2 app for the bulk of the annotation/tagging process, with an optional companion browser plugin developed later to do in-document tagging, which will capture the URL, and snippet text.
- We’re intending to use a simple Naive Bayesian classifier for document categories, and to use more complex classifiers (DNNs) for snippet content and user behaviors in the future.
- Given this we’re feeling pretty confident about the proposed timeframe. It’s unclear how we’ll implement the Bayesian classifier. Since it’s already been developed in Weka/Java, it may not be in our best interests to rewrite it as a Python-based version.
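For a sense of how small a Python version of the document-category classifier could be, here is a minimal sketch of a multinomial Naive Bayes with add-one smoothing. It is not the existing Weka/Java implementation. The class name, the whitespace tokenizer, the category names, and the training snippets are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocabulary = set()

    def train(self, text, category):
        words = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            denom = total_words + len(self.vocabulary)
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training snippets and category names.
clf = NaiveBayesClassifier()
clf.train("clustering dbscan eps cluster", "machine_learning")
clf.train("neural network tensorflow training", "machine_learning")
clf.train("election twitter bots junk news", "politics")
clf.train("voters sharing news election", "politics")
prediction = clf.classify("twitter election news")
```

Whether something like this is worth it over calling the existing Weka code depends on how much of the surrounding feature extraction would also need porting.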
Python integration
- Using ProcessBuilder works for the simple case where we want to do essentially batch clustering, but it is very difficult to debug in CI/Prod instances as it becomes a “black box”. There are methods to make it more communicative, but we should investigate looking at a Python based WSO2 secured microservice. It would make it far easier to integrate Python code into our stack.
- I looked at multiple methods to do HDFS integration using Python, and found some canonical recent examples with Python 3.x.
  - http://wesmckinney.com/blog/python-hdfs-interfaces/
  - https://pypi.python.org/pypi/hdfs/
Hadoop is dead, long live ML?
- http://www.datasciencecentral.com/profiles/blogs/goodbye-age-of-hadoop-hello-cambrian-explosion-of-deep-learning
- https://www.thoughtworks.com/radar
ClusteringService
- Reviewed the MapReduce code for the service. It’s pretty straightforward, using the mapper to build the row data and the reducer to format it for output.
- The actual table it needs to pull from is currently missing… so tests do not pass if set to the real table, but once my new laptop is loaded I will be able to make changes.

Phil 4.3.17

7:00 – 8:30, 3:00 – 4:00 Research

Finished the first cut at the Illustrator version of the poster. I think I like it better? Everything fits!
Fika

8:30 – 2:30

Hotel for Collective intelligence!
Read data into DataFrame – done!
Next step is to tie this into HDFS and then PyUnit?

viztales

Dimension reduction, State, Orientation, and Speed
