Monthly Archives: April 2017

Phil 4.12.17

7:00 – 8:00 Research

  • Just found the HCIC Boaster Poster description
  • Resistbot is kind of along the lines of what I was thinking about anonymous news input. An article on chatbot technology from CIO. One of the interesting platforms is ChatFuel, which has a non-programming (example-based) creation process. It’s really tied into FB. Not sure I want to do that without setting up a specific account.
  • Gupshup is another bot system that deploys to a lot of platforms (FB, Twitter, Slack, etc.)
  • NLP/NLU services. Here’s Google’s documentation for NLP and Prediction, which seems related
  • Downloading the Qt community IDE. Not sure if it has Designer or not. Also, there was only the option to download versions 5.x, so the V4 options of PyQt may be problematic. Big install. Lots of documentation, though, it looks like.
  • Downloaded and running. Tutorials tomorrow

8:30 – 1:00 BRC

  • Adding stats dataframe – done. Interesting results in the test128 data: clustersVsUnclustered
  • Sprint grooming. And while that’s running, reading Thoughtful Machine Learning. Downloaded accompanying code from GitHub

Phil 4.11.17

7:00 – 8:00 Research

8:30 – 4:00 BRC

  • Running clustering on local machine.
  • Need to add a stats DataFrame that has eps, min_size, c.num_clusters, c.get_num_clustered, c.get_num_items. Write that out at the end. Tomorrow
  • Added reading of csv file and conversion to a DataFrame xlsx. Played around with that in Excel. Some nice results if I take the log of the value. Need to try clustering on that, but I want to add the stats output first integrity3
  • Jonker-Volgenant Algorithm + t-SNE = Super Powers. This could be a way of producing a map view for the research browser
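
The stats DataFrame planned above could be collected roughly like this. It is only a sketch: `ClusterResult` is a hypothetical stand-in for the clustering object `c` (its real interface is not shown in these notes), the numbers are made up, and the output file name is an assumption.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class ClusterResult:
    """Hypothetical stand-in for the clusterer `c` mentioned in the notes."""
    num_clusters: int
    num_clustered: int
    num_items: int


COLUMNS = ["eps", "min_size", "num_clusters", "num_clustered", "num_items"]


def build_stats(runs):
    """Build one stats row per (eps, min_size, result) run."""
    rows = [
        {
            "eps": eps,
            "min_size": min_size,
            "num_clusters": c.num_clusters,
            "num_clustered": c.num_clustered,
            "num_items": c.num_items,
        }
        for eps, min_size, c in runs
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


# Made-up numbers, purely to show the shape of the table.
runs = [
    (0.1, 5, ClusterResult(3, 40, 128)),
    (0.2, 5, ClusterResult(2, 90, 128)),
]
stats_df = build_stats(runs)

# Written once, after all runs have finished ("stats.csv" is an assumed name):
# stats_df.to_csv("stats.csv", index=False)
```

Collecting plain dicts and building the DataFrame once at the end avoids growing the frame row by row inside the clustering loop.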

Phil 4.10.17

7:00 – 8:30, 3:00 – 6:00 Research

  • Continuing submission. Figuring out PhySH. That’s a pretty nice system!
  • Done!

    05Apr2017 es2017apr05_602 Physical Review E (Regular Article)
    Status: Submitted 10Apr2017-07:31 EDT
    Title: Modeling the Law of Group Polarization
    Authors: Feldman,Philip / Engel,Don

  • Starting on business cards

9:00 – 2:30 BRC

  • Sprint retrospective
  • I seem to have lost the cluster I ran last week. Rerunning. I thought there might be an error since there were no (-1) values, but it’s because the DataFrame has the values of the item replaced with the cluster ID +2. That means a value of zero marks an empty cell (a row of all zeros is completely empty), and 1 (that is, -1 + 2) is unclustered.
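
A small illustration of that encoding, as I read it. The labels and the presence grid are made up, and the -1 = unclustered convention is taken from the note above; the real DataFrame layout may differ.

```python
import numpy as np

# Cluster labels per row: -1 means unclustered, 0..n are cluster IDs.
labels = np.array([-1, 0, 0, 1, -1])

# Made-up grid: True where a row actually has a value in that column.
present = np.array([
    [True, False, True],
    [True, True, False],
    [False, True, True],
    [True, False, False],
    [False, False, False],
])

# Replace each present value with (label + 2); absent cells stay 0.
encoded = np.where(present, (labels + 2)[:, None], 0)

# 0 = empty cell, 1 = unclustered, >= 2 = cluster ID + 2
print(encoded)
```

With this offset there is never a -1 in the output, which is why its absence is not a sign of an error.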

Phil 4.8.17

8:00pm – 12:00am, plus an hour on the 9th

  • Spent much time trying to figure out why the PDF wouldn’t print on the website from the LaTeX file. As a result, I learned how to read the log file, fixed some image scaling issues, and learned to be VERY careful in checking the letter case of the file names. Also, as per here, I learned how to take the bib file and convert it to a bbl file, which can then be pasted into the tex file, resulting in a single file for submission. No errors, and only minor quibbles on citations!

BRC

Phil 4.7.17

7:15 – 8:15 Research

  • So this happened: Dozens of U.S. Missiles Hit Air Base in Syria. Wonder how it will play out. Like Infinite Reach? That actually had more justification, since Americans were killed in the preceding events…
  • Direction matching for sparse movement datasets: determining interaction rules in social groups. Data is here, in the very cool Movebank. Reminds me of GLOBE
  • Still working on getting paper submitted. Word is hanging, which is weird.
  • The APS journal submissions login page is here
  • Length Check
    Please be aware that this is only an estimate and meant to help avoid delays associated with excessive length.
    
    Journal/article type is not length constrained (4281)
    
     *** Word count calculation may be inaccurate for the following reasons ***
       * No acknowledgment environment found - acknowledgments may have been counted
       * No bibliography environment found - references may have been counted
    
      Figure   Aspect Ratio   Wide?   Word Equivalent
         1         1.33         No        132
         2         0.98         No        173
         3         0.93         No        181
         4         1.01         No        168
         5         1.75         No        105
         6         1.67         No        109
         7         2.99         No         70
         8         1.22         No        142
         9         0.95         No        177
        10         1.66         No        110
    
    WORD COUNT SUMMARY
        Note: Text word count excludes title, abstract, byline, PACS,
              receipt date, acknowledgments, and references.
    
                  Text word count   2914 
            Equations word equiv.      0 
              Figures word equiv.   1367
               Tables word equiv.      0
                                  ------
                           TOTAL    4281 (Maximum length is unlimited)
  • Getting PDF conversion errors. Not sure what to do next…
    WARNING: Supplemental file: ModelingGroupPolarizationNotes.bib.
    Checking figure order in 'ModelingGroupPolarization.tex'...
    PDF file creation failed:
    ERROR: Following figure files not called into TeX:
    DTWmatrix.png,toolScreenshot.png,ExplorerPDF.png
    AND TeX file calls in the following missing figure files:
    toolscreenshot.png,explorerPDF.png,DTWMatrix.png
    Please check and fix these file names.
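
The error above comes down to file names that differ only by letter case. A check like the following could catch that before uploading. It is a sketch under assumptions: it supposes the figures are pulled in with `\includegraphics{...}` with the extension spelled out, and the sample TeX lines are invented to mirror the names in the error message.

```python
import re
from pathlib import Path

# Matches \includegraphics{file} and \includegraphics[options]{file}.
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")


def find_case_mismatches(tex_source, available_files):
    """Return (referenced, actual) pairs that differ only by letter case."""
    by_lower = {name.lower(): name for name in available_files}
    mismatches = []
    for ref in GRAPHICS_RE.findall(tex_source):
        ref_name = Path(ref).name
        actual = by_lower.get(ref_name.lower())
        if actual is not None and actual != ref_name:
            mismatches.append((ref_name, actual))
    return mismatches


# Invented TeX fragment using the names from the error message.
tex = r"""
\includegraphics[width=\columnwidth]{toolscreenshot.png}
\includegraphics{explorerPDF.png}
\includegraphics{DTWMatrix.png}
\includegraphics{flocking.jpg}
"""
files = ["DTWmatrix.png", "toolScreenshot.png", "ExplorerPDF.png", "flocking.jpg"]

mismatches = find_case_mismatches(tex, files)
for ref_name, actual in mismatches:
    print(f"{ref_name} -> {actual}")
```

In practice `files` would come from something like `os.listdir()` on the submission folder. A case-insensitive file system (Windows, default macOS) hides these mismatches locally, which is why they only show up on the journal's server.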

8:30 – 2:30 BRC

  • Imperative programming in TensorFlow
  • Working on reading in the old integrity as a DataFrame from Excel
  • Had to install xlrd
  • TypeError: Image data can not convert to float. This pattern seems useful:
    mat = df.as_matrix()  # get the data matrix
    mat = mat.astype(np.float64)  # force it to float64
    df.update(mat)  # replace the 'float' mat with the 'float64' mat
  • Well, it was before, but now it won’t update the int64 to float64. This worked though. When in doubt, build a new mat:
    mat = mat.astype(np.float64)  # force it to float64
    indices = df.index.values
    cols = df.columns.values
    df = pandas.DataFrame(mat, indices, cols)
  • Got the data in though. Interesting. This is the densest corner of the sorted matrix: integrity
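
For reference, here is the rebuild approach next to a shorter cast, on a tiny made-up frame standing in for the Excel data. Note that `as_matrix()` was removed in pandas 1.0, so this sketch uses `.values` instead.

```python
import numpy as np
import pandas as pd

# Small made-up integer frame standing in for the Excel data.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r0", "r1"])

# Rebuild approach from the notes: new float64 matrix, same index and columns.
mat = df.values.astype(np.float64)
rebuilt = pd.DataFrame(mat, index=df.index, columns=df.columns)

# Shorter alternative: cast the whole frame in one call.
cast = df.astype(np.float64)

print(rebuilt.dtypes)
print(cast.dtypes)
```

Both routes return a new frame, so the result has to be assigned back (`df = df.astype(np.float64)`) for later code to see the float64 data.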

Phil 4.6.17

8:30 – 5:30 BRC

  • Worked too late yesterday and slept in. Will try to get the submission issues worked on once the sprint review gets going
  • Added code that deletes previous file if it exists
  • Refactored the fitness landscape test to be 100 eps increments by 10 cluster increments, starting at the min_cluster. Seems to produce more useful results

     

  • Sprint review at 9:30, 11:00, 1:00 Done! Went over my stuff too fast.
    • Need to figure out a way to cluster diagnosis codes
    • Discovered that the data is still the same integrity data from the last sprint. Need to add a read_dataframe(file_name: str) -> pandas.DataFrame method to hdfs_csv_reader tomorrow.
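
A possible shape for that read_dataframe method, written as a standalone function. The `open_fn` parameter is an assumption added here so the same code can read from the local file system or from an HDFS client's open call; the actual hdfs_csv_reader interface is not shown in these notes, and the file name and data are made up.

```python
import io

import pandas as pd


def read_dataframe(file_name: str, open_fn=open) -> pd.DataFrame:
    """Read a CSV file into a DataFrame.

    `open_fn` is an assumed hook: pass an HDFS client's open function
    to read from HDFS instead of the local file system.
    """
    with open_fn(file_name) as handle:
        return pd.read_csv(handle)


# Usage with an in-memory stand-in for a file (no HDFS needed to try it).
def fake_open(_name):
    return io.StringIO("id,value\n1,0.5\n2,1.5\n")


df = read_dataframe("integrity.csv", open_fn=fake_open)
print(df)
```

Injecting the opener also makes the method easy to unit test without a cluster.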

Phil 4.5.17

7:00 – 8:00 Research

  • Finishing up poster? CI_GP_Poster
  • Starting submission for Phys Rev E, based on my notes here.
    • Created account
    • Created ORCID account
    • Started submission. Errors! There were problems with the following fields:
      • AllExploit_SI0.0.psd: Classification description value must be selected for each file and cannot be left unknown.
      • 90_ExploitR10_10_ExploreR0.psd: Classification description value must be selected for each file and cannot be left unknown.
      • polarized.jpg: Classification description value must be selected for each file and cannot be left unknown.
      • clusterMembership.png: Classification description value must be selected for each file and cannot be left unknown.
      • toolScreenshot.png: Classification description value must be selected for each file and cannot be left unknown.
      • ModelingGroupPolarization2col.pdf: Classification description value must be selected for each file and cannot be left unknown.
      • exploring.jpg: Classification description value must be selected for each file and cannot be left unknown.
      • ModelingGroupPolarization.pdf: Classification description value must be selected for each file and cannot be left unknown.
      • psheader.txt: Classification description value must be selected for each file and cannot be left unknown.
      • influenced.jpg: Classification description value must be selected for each file and cannot be left unknown.
      • ExplorerPDF.png: Classification description value must be selected for each file and cannot be left unknown.
      • ModelingGroupPolarization.synctex.gz: Classification description value must be selected for each file and cannot be left unknown.
      • AllExploit_SI10.0.jpg: Classification description value must be selected for each file and cannot be left unknown.
      • ModelingGroupPolarizationDraft.pdf: Classification description value must be selected for each file and cannot be left unknown.
      • flocking.jpg: Classification description value must be selected for each file and cannot be left unknown.
      • DTWmatrix.png: Classification description value must be selected for each file and cannot be left unknown.
      • ModelingGroupPolarization.log: Classification description value must be selected for each file and cannot be left unknown.
      • ModelingGroupPolarization.dvi: Classification description value must be selected for each file and cannot be left unknown.
      • AllExploit_SI10.0.psd: Classification description value must be selected for each file and cannot be left unknown.
      • ModelingGroupPolarization.blg: Classification description value must be selected for each file and cannot be left unknown.
      • 90_ExploitR10_10_ExploreR0.jpg: Classification description value must be selected for each file and cannot be left unknown.
      • AllExploit_SI0.2.psd: Classification description value must be selected for each file and cannot be left unknown.
      • SimilarPaths.png: Classification description value must be selected for each file and cannot be left unknown.
      • RevTex41_example.aux: Classification description value must be selected for each file and cannot be left unknown.
      • label 0 is duplicated
      • ModelingGroupPolarization.aux: Classification description value must be selected for each file and cannot be left unknown.
      • revtex41_template.blg: Classification description value must be selected for each file and cannot be left unknown.
      • RevTex41_example.blg: Classification description value must be selected for each file and cannot be left unknown.
      • DTWclusters.png: Classification description value must be selected for each file and cannot be left unknown.
      • Main text file is required

9:00 – 5:00, 6:00, 7:30 BRC

  • Working on HDFS reading and writing
  • Integrating code
  • Compiled! Now waiting for things to blow up
  • Success! After fixing this:
    writer.write(u'%s' % cstr) # good
    writer.write('%s', cstr) # bad
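
Why the second form fails, shown with `io.StringIO` as a stand-in for the real writer (an assumption: the HDFS writer used here may report the problem differently). `write()` does no formatting of its own, so the string has to be formatted before it is passed in.

```python
import io

cstr = "a,b,c"          # made-up CSV line
writer = io.StringIO()  # stand-in for the real writer

# Good: the string is formatted first, then passed as a single argument.
writer.write(u'%s' % cstr)

# Bad: two positional arguments; write() accepts only the text to write.
try:
    writer.write('%s', cstr)
    error = None
except TypeError as exc:
    error = exc

print(repr(writer.getvalue()))
print(type(error).__name__)
```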
    

Phil 4.4.17

7:00 – 8:30 Research

  • Add Amundsen/Scott and implications for design. Done, but too many uses of ‘incorporate’. Need better words. Scott & Amundsen look good though
  • SpaCy – Industrial-Strength Natural Language Processing in Python. Video tutorial
  • Got Thoughtful Machine Learning with Python. Looks like it hits all the bases from classification to unit testing

9:00 – 5:30 BRC

  • Need to submit paperwork for collective intelligence
  • Discussion with Aaron and Jeremy about curation webapp
  • Test re-hydration code. Had a few minor issues in the code that produced the tests, but all working now
  • Start on HDFS? Aaron is close to testing his Python code. Integration after that?

Aaron 4.3.17

  • ML Architecture
    • Spent a bunch of time last Friday meeting with Phil to discuss the proposed path for the Machine Learning epics to develop the research browser.
    • Our plan uses a thin-client Angular 2 app for the bulk of the annotation/tagging process, with an optional companion browser plugin developed later to do in-document tagging, which will capture the URL, and snippet text.
    • We’re intending to use a simple Naive Bayes classifier for document categories, and to use more complex classifiers (DNNs) for snippet content and user behaviors in the future.
    • Given this we’re feeling pretty confident about the proposed timeframe. It’s unclear how we’ll implement the Bayesian classifier: since it’s already been developed in Weka/Java, it may not be in our best interests to rewrite it in Python.
  • Python integration
    • Using ProcessBuilder works for the simple case where we want to do essentially batch clustering, but it is very difficult to debug in CI/Prod instances as it becomes a “black box”. There are methods to make it more communicative, but we should investigate a Python-based WSO2 secured microservice. It would make it far easier to integrate Python code into our stack.
    • I looked at multiple methods to do HDFS integration using Python, and found some canonical recent examples with Python 3.x.
  • Hadoop is dead, long live ML?
  • ClusteringService
    • Reviewed the MapReduce code for the service. It’s pretty straightforward, using the mapper to build the row data and the reducer to format it for output.
    • The actual table it needs to pull from is currently missing… so tests do not pass if set to the real table, but once my new laptop is loaded I will be able to make changes.
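
For a sense of scale, should the document-category classifier mentioned under ML Architecture ever be rewritten in Python, a multinomial Naive Bayes with add-one smoothing fits in a few dozen lines. This is a generic textbook sketch, not the existing Weka/Java implementation; the class name, training documents, and category labels are all made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data with two invented categories.
docs = [
    "clustering eps density",
    "poster deadline submission",
    "density clustering noise",
]
labels = ["ml", "admin", "ml"]

model = NaiveBayes().fit(docs, labels)
prediction = model.predict("clustering noise")
print(prediction)
```

Working in log space keeps long documents from underflowing to zero probability.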

Phil 4.3.17

7:00 – 8:30, 3:00 – 4:00 Research

  • Finished the first cut at the Illustrator version of the poster. I think I like it better? Everything fits! CI_GP_Poster
  • Fika

8:30 – 2:30

  • Hotel for Collective intelligence!
  • Read data into DataFrame – done!
  • Next step is to tie this into HDFS and then PyUnit?