Category Archives: Phil

Phil 5.25.16

7:00 – 4:30 VTX

  • Took the weekend off for the ESCN. Bailed on Saturday because of rain, then dodged rain for two days, then got a nice ride in on Tuesday.
  • Chatting with Aaron last night, I discovered that the REST API won’t work for Demo, so I’ll need to get a new SQL dump from Heath. No, actually, it works just fine in that it’s accessible, but anything other than empty result sets times out.
  • Need to try to build a new jar file for the CorpusManager so it can have its own executable. Put the Manifest in the CM directory? Not sure how to do that (a minimal sketch is at the end of this entry).
  • Writing
  • Looks like my old laptop finally bit the dust. Chromebook time?
  • Working on the Corpus Manager to pull in links from the config file. Done!
  • So I’m having all kinds of problems getting the flag info from Jeremy’s REST service. I did realize that I can use the dashboard, though, harvest the URLs by following the links, and build my list that way. Except that the flags are crap. Back to Moby Dick for the moment.
  • Actually, those are pretty bad too. Margarita put three URLs up on Confluence.
  • Got url scanning done through config file.
  • Ingested the first four chapters of Moby Dick. Pretty interesting. I’ll try those three files tomorrow and we’ll see what we’ve got, at least for a sense of .gov sites…
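  • For the executable-jar question above, one way to do it (the class and path names here are placeholders, not the real CM layout) is a MANIFEST.MF in the CM directory with a Main-Class line (the file has to end with a newline), then building with jar’s m flag:

    Main-Class: corpusmanager.CorpusManager

    jar cfm CorpusManager.jar MANIFEST.MF -C out/production/CorpusManager .
    java -jar CorpusManager.jar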

Phil 5.20.16

7:00 – 3:30

  • Writing
  • Going to try LSI. I think the term clustering is simply the sum of the TF-IDF across docs by term. That should give a topic list. Then use that for centrality calculations? Take the top n words?
    • Actually, then the user could group words into concepts and that could make a smaller matrix where the concept count is the union of the counts of its component terms.
  • Have an LSI-lite version going that sums the TF-IDF scores and ranks each term by (sum of all its scores) * (number of docs with a score / number of docs), then takes the top n terms (a sketch is at the end of this entry).
  • Need to multiply the matrix by something so that the count gets populated with something reasonable. Maybe 100? Tried that – it looks good.
  • Got the PDF parsing working. Need to get it to work with web pages next and try it on Moby Dick. Then output from the flag data:
    https://dockerapps5.eip.nj.vistronix.com:9443/authenticationendpoint/login.do?client_id=w674kmsNj7flgKkTp_t_8ArPES0a&commonAuthCallerPath=%2Foauth2%2Fauthorize&forceAuth=false&passiveAuth=false&redirect_uri=http%3A%2F%2Fdockerapps.vistronix.com%2Flogin&response_type=code&scope=openid&state=RrKxRY&tenantDomain=carbon.super&sessionDataKey=fbcaf4a0-679a-4eed-93df-5464bca702ff&relyingParty=w674kmsNj7flgKkTp_t_8ArPES0a&type=oidc&sp=EIP-CI&isSaaSApp=false&authenticators=BasicAuthenticator:LOCAL
    
    http://dockerapps.vistronix.com/gtc-server/physicianservice/flags
  • Need to make sure that I use the above pointing at the demo system. From Andy’s email:

    Yes …looks you are looking at dev….in Confluence, search on environment details…that Will give you the urls for the dashboards on dev, ci and demo…we are working on demo now.
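  • A minimal sketch of the LSI-lite scoring described above (the names are hypothetical, not the actual tool code): sum each term’s TF-IDF across documents, scale by the fraction of documents containing the term, sort, and keep the top n:

    import java.util.*;
    import java.util.stream.Collectors;

    public class LsiLiteRanker {
        // tfidf: term -> (document name -> TF-IDF score of the term in that document)
        public static List<String> topTerms(Map<String, Map<String, Double>> tfidf,
                                            int totalDocs, int n) {
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Map<String, Double>> e : tfidf.entrySet()) {
                double sum = e.getValue().values().stream().mapToDouble(Double::doubleValue).sum();
                double coverage = (double) e.getValue().size() / totalDocs; // docs with term / total docs
                scores.put(e.getKey(), sum * coverage);
            }
            return scores.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }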

Phil 5.19.16

7:00 – 5:00 VTX

  • Looks like I saved the wrong version of the code to Dropbox, so I can’t update the app image.
  • More writing
  • System and Social trust, revisited: Algorithms, clickworkers, and the befuddled fury around Facebook Trends
  • GDELT uses some of the world’s most sophisticated natural language and data mining algorithms to extract more than 300 categories of “events” and the networks of people, organizations, locations, themes, and emotions that tie them together.
  • Working on the Corpus Processing Tool
    • Need to break apart calculateAndSave
    • Need to build matrix
    • Need to save spreadsheets
    • Need name to save too.
    • Start and stopwords
    • Add Latent Semantic Indexing? I have most of the pieces.

Phil 5.18.16

7:00 – 4:30 VTX

  • Writing
    • Wanted to show how a network could be used for intercoder agreement, so I had to refresh my understanding of Cohen’s kappa (a quick computation sketch is at the end of this entry)
    • It occurs to me that if one coder’s rank can be mapped to another coder’s rank, we have a kind of information distance measure, although the math to do that eludes me. Rank comparison could make a lot of sense for comparing centrality. Another possibility is to compare the network measures?
  • Adding ‘Filter’ field and button to LMN
  • This appears to be how you do it.
  • And it worked like a charm 🙂
  • Worked through scoring math with Aaron
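  • For reference, Cohen’s kappa is just (Po - Pe) / (1 - Pe), where Po is the observed agreement and Pe is the agreement expected by chance from the two coders’ marginals. A small sketch against a confusion matrix of code assignments (the counts are made-up):

    public class KappaSketch {
        // counts[i][j]: items coder A put in category i and coder B put in category j
        public static double cohensKappa(int[][] counts) {
            int k = counts.length;
            double total = 0, agree = 0;
            double[] rowSum = new double[k], colSum = new double[k];
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++) {
                    total += counts[i][j];
                    rowSum[i] += counts[i][j];
                    colSum[j] += counts[i][j];
                    if (i == j) agree += counts[i][j];
                }
            }
            double po = agree / total;  // observed agreement
            double pe = 0;              // agreement expected by chance
            for (int i = 0; i < k; i++) pe += (rowSum[i] / total) * (colSum[i] / total);
            return (po - pe) / (1.0 - pe);
        }

        public static void main(String[] args) {
            // Po = 0.7, Pe = 0.5, so kappa comes out to 0.4
            System.out.println(cohensKappa(new int[][]{{20, 5}, {10, 15}}));
        }
    }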

Phil 5.17.16

7:00 – 7:00

  • Great discussion with Greg yesterday. Very encouraging.
  • Some thoughts that came up during Fahad’s (Successful!) defense
    • It should be possible to determine the ‘deletable’ codes at the bottom of the ranking by setting the allowable difference between the initial ranking and the trimmed rank.
    • The ‘filter’ box should also be set by clicking on one of the items in the list of associations for the selected items. This way, selection is a two-step process in this context.
    • Suggesting grouping of terms based on connectivity? Maybe second degree? Allows for domain independence?
    • Using a 3D display to show the shared second, third and nth degree as different layers
    • NLP tagged words for TF-IDF to produce a more characterized matrix?
    • 50 samples per iteration, 2,000 iterations? Check! And add info to spreadsheet! Done, and it’s 1,000 iterations
  • Writing
  • Parsing Jeremy’s JSON file (a minimal json-simple sketch is at the end of this entry)
    • Moving the OptionalContent and JsonLoadable over to JavaUtils2
    • Adding javax.persistence-2.1.0
    • Adding json-simple-1.1.1
    • It worked, but it’s junk. It looks like these are uncurated pages.
  • Long discussion with Aaron about calculating flag rollups.
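  • For the record, reading a dump of that JSON with json-simple-1.1.1 is only a few lines. The file name and field names below are placeholders, not the real feed:

    import java.io.FileReader;
    import org.json.simple.JSONArray;
    import org.json.simple.JSONObject;
    import org.json.simple.parser.JSONParser;

    public class FlagFileSketch {
        public static void main(String[] args) throws Exception {
            JSONParser parser = new JSONParser();
            // assumes the root element is an array of objects
            JSONArray items = (JSONArray) parser.parse(new FileReader("flags.json"));
            for (Object o : items) {
                JSONObject item = (JSONObject) o;
                System.out.println(item.get("url") + " -> " + item.get("flag"));
            }
        }
    }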

Phil 5.16.16

7:00 – 4:00 VTX

  • Writing
  • Extracting PDFs? Works! Built NaiveParser and added it to JavaUtils2. Updating SVN and then trying it against a small corpus of PDFs (a rough extraction sketch is at the end of this entry).
    • Need to add a way of adding strings with wildcard characters for things to delete, grab content, etc.
  • Fahad’s defense
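  • NaiveParser’s guts aren’t worth pasting here, but if you lean on something like Apache PDFBox (an assumption on my part, not necessarily what NaiveParser wraps), getting raw text out of a PDF is roughly:

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfTextSketch {
        public static void main(String[] args) throws Exception {
            // the path is a placeholder; point it at anything in the test corpus
            try (PDDocument doc = PDDocument.load(new File("sample.pdf"))) {
                System.out.println(new PDFTextStripper().getText(doc));
            }
        }
    }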

Phil 5.13.16

7:00 – 4:00 VTX

  • More writing
  • Looking for a discussion of ‘Thinking in the World’ by Edwin Hutchins
  • Fix workspace. Delete .idea folder from project and reload from svn?
    • Tried that. Turns out that the jar file was corrupted somehow. Wound up splitting off JavaUtils2 with the math and stanfordNLP libraries
    • Committed JavaUtils2. Still need to clean up JavaUtils or make a JavaUtils1
  • Talk on modelling user behavior using social media
  • Talking with Aaron at lunch, I need a button to hide items from rank calculations. That allows for iterative refinement?
  • Start on parsing PDFs?

Phil 5.12.16

7:00 – 5:00 VTX

  • Found a more detailed version of The Law of Group Polarization from the University of Chicago Law School
  • More paper writing.
    • Not sure if the Hemingway App is good for these kinds of papers, but it might be interesting to try
    • Working through code frequency counts
  • Finish TF-IDF – Done
  • Added sorting based on value in a Map (sketch at the end of this entry).
  • Build term spreadsheet maker
    • TF-IDF term document matrix with weights at intersections
    • DF-ITF term document matrix with counts at intersections
    • DF-ITF as above with rank column?
  • Run several known corpora through and validate
  • Moved PageReader to JavaUtils
  • Had a nice chat with Stan.
  • Top ten DF-ITF terms get Moby Dick. (Library lie company quote Whale Melville Herman Summary search she)
  • Corrupted my LanguageModelTest workspace.
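  • The sort-by-value bit mentioned above is basically a one-liner with Java 8 streams (generic names here, not the actual tool code):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class MapSortSketch {
        // returns a copy of the map ordered by descending value, e.g. TF-IDF weight per term
        public static <K> Map<K, Double> sortByValue(Map<K, Double> map) {
            return map.entrySet().stream()
                    .sorted(Map.Entry.<K, Double>comparingByValue().reversed())
                    .collect(Collectors.toMap(
                            Map.Entry::getKey, Map.Entry::getValue,
                            (a, b) -> a, LinkedHashMap::new));
        }
    }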

Phil 5.11.16

7:00 – 4:30 VTX

  • Continuing paper – working on the ‘motivations’ section
  • Need to set the mode to interactive after a successful load
  • Need to find out where the JSON ratings are in the medicalpractitioner db? Or just rely on Jeremy’s interface? I guess it depends on what gets blown away. But it doesn’t seem like the JSON is in the db.
  • Added a stanfordNLP package to JavaUtils (a minimal CoreNLP usage sketch is at the end of this entry)
    • NLPtoken stores all the extracted information about a token (word, lemma, index, POS, etc)
    • DocumentStatistics holds token data across one or more documents
    • StringAnnotator parses strings into NLPtokens.
  • Fixed a bunch of math issues (in Excel, too), but here are the two versions:
    am = 1.969
    be = 2.523
    da = 0.984
    do = 1.892
    i = 1.761
    is = 1.130
    it = 1.130
    let = 1.130
    not = 1.380
    or = 3.523
    thfor = 1.380
    think = 1.380
    to = 1.469
    what = 1.380

    And Excel:

     da	is	 it	 let	 not	 thfor	 think	 what	 to	 i	 do	 am	 be	 or
    0.984	1.130	1.130	1.130	1.380	1.380	1.380	1.380	1.469	1.761	1.892	1.969	2.523	3.523
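  • The plumbing above lives in my own classes; the underlying CoreNLP 3.6 calls that produce the word/lemma/POS info (roughly what NLPtoken stores) look something like this, with the sentence text just an example:

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class TokenDumpSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // needs the models jar on the classpath

            Annotation doc = new Annotation("Call me Ishmael.");
            pipeline.annotate(doc);

            for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    System.out.printf("%s\t%s\t%s%n",
                            token.word(),
                            token.get(CoreAnnotations.LemmaAnnotation.class),
                            token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
                }
            }
        }
    }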
    

Phil 5.10.16

7:00 – 4:00 VTX

  • Paper. Slow progress
  • Meeting with Wayne at 4:00.
    • Default to ‘interactive’ on LMP
    • Write the ‘motivations’ section – ‘The Lit Review that ate Itself’
  • Doing morning webpage rating
    • Need to clear out contents of notes and paste after a save
  • DF-ITF today
  • Labeled the spreadsheet.
  • Putting in the StanfordNLP plumbing
    • To solve the ‘missing models’ load bug, I went to the github download:
      C:\Development\stanford-corenlp-full-2015-12-09\stanford-corenlp-full-2015-12-09

      and linked to the

      stanford-corenlp-3.6.0-models.jar
  • Heath pulled me into a vortex of getting the derived datastore up and running on my Windows box instance of Postgres. You need an eip user and eip role to do this. The command that restores from a tar file is:
    pg_restore -c -i -U eip -d medicalpractitioner -v "qa-derived.tar"

    Here’s a screenshot of the populated DB: MedicalPractitionerDB and here’s the role screenshot: EipLoginRole

Phil 5.9.16

7:00 – 4:00 VTX

  • Started the paper describing the slider interface
  • TF-IDF today!
    • Read docs from web and PDF
    • Calculate the rank
    • Create matrix of terms and documents, weighted by occurrence.
  • Hmm. What I’m actually looking for are the lowest-occurring terms within a document that occur across the largest number of documents. I’ve used this page as a starting point. After flailing for many hours in Java, I wound up walking through the algorithm in Excel and I think I’ve got it. This is the spreadsheet that embodies my delusional thinking ATM.
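  • For reference, the standard TF-IDF weighting (not the inverted DF-ITF idea above) works out to the sketch below; the map layout is just one way to organize it, not the tool’s actual data structures:

    import java.util.*;

    public class TfIdfSketch {
        // docs: document name -> list of tokens; returns term -> (document name -> TF-IDF weight)
        public static Map<String, Map<String, Double>> weights(Map<String, List<String>> docs) {
            int nDocs = docs.size();
            Map<String, Integer> docFreq = new HashMap<>();                  // documents containing each term
            Map<String, Map<String, Integer>> termCounts = new HashMap<>();  // per-document term counts

            for (Map.Entry<String, List<String>> d : docs.entrySet()) {
                Map<String, Integer> counts = new HashMap<>();
                for (String t : d.getValue()) counts.merge(t, 1, Integer::sum);
                termCounts.put(d.getKey(), counts);
                for (String t : counts.keySet()) docFreq.merge(t, 1, Integer::sum);
            }

            Map<String, Map<String, Double>> result = new HashMap<>();
            for (Map.Entry<String, Map<String, Integer>> d : termCounts.entrySet()) {
                int docLen = d.getValue().values().stream().mapToInt(Integer::intValue).sum();
                for (Map.Entry<String, Integer> tc : d.getValue().entrySet()) {
                    double tf = (double) tc.getValue() / docLen;                       // term frequency
                    double idf = Math.log((double) nDocs / docFreq.get(tc.getKey()));  // inverse document frequency
                    result.computeIfAbsent(tc.getKey(), k -> new HashMap<>())
                          .put(d.getKey(), tf * idf);
                }
            }
            return result;
        }
    }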

Phil 5.6.16

7:00 – 4:00 VTX

  • Today’s shower thought is to compare the variance of the difference of two (unitized) rank matrices. The maximum difference would be (matrix size), so we do have a scale. If we assume a binomial distribution (there are many ways to be slightly different, only two ways to be completely different), then we can use a binomial (one-tailed?) distribution centered on zero and ending at (matrix size). That should mean that I can see how far one item is from the other? But it will be within the context of a larger distribution (all zeros vs. all ones)…
  • Before going down that rabbit hole, I decided to use the bootstrap method just to see if the concept works. It looks mostly good.
    • Verified that scaling a low-ranked item (ACLED) by 10 has less impact than scaling the highest ranking item (P61) by 1.28.
    • Set the stats text to red if it’s outside 1 SD and green if it’s within.
    • I think the terms can be played around with more because the top one (Pertinence) gets ranked at .436, while P61 has a rank of 1.
    • There are some weird issues with the way the matrix recalculates. Some states are statistically similar to others. I think I can do something with the thoughts above, but later.
  • There seems to be a bug calculating the current mean when compared to the unit mean. It may be that the values are so small? It’s occasional….
  • Got the ‘top’ button working.
  • And that’s it for the week…

LMT With Data2

Oh yeah – Everything You Ever Wanted To Know About Motorcycle Safety Gear

Phil 5.5.16

7:00 – 5:30 VTX

  • Continuing An Introduction to the Bootstrap.
  • This helped a lot. I hope it’s right…
  • Had a thought about how to build the Bootstrap class. Build it using RealVector and then use the RealVectorPreservingVisitor interface to do whatever calculation is desired. Default methods for Mean, Median, Variance, and StdDev. It will probably need arguments for max iterations and epsilon.
  • Didn’t do that at all. Wound up using ArrayRealVector for the population and Percentile to hold the mean and variance values. I can add something else later (a resampling sketch is at the end of this entry).
  • To capture how the centrality affects the makeup of the data in a matrix, I think it makes sense to use the normalized eigenvector to multiply the counts in the initial matrix and submit that population (the whole matrix) to the Bootstrap.
  • Meeting with Wayne? Need to finish tool updates though.
  • Got bogged down in understanding the Percentile class and how binomial distributions work.
  • Built and then fixed a copy ctor for Labled2DMatrix.
  • Testing. It looks ok, but I want to try multiplying the counts by the eigenVec. Tomorrow.
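  • A minimal sketch of that ArrayRealVector + Percentile approach (the population values and interval bounds are placeholders): resample with replacement, collect the statistic, and read a confidence interval off the percentiles:

    import java.util.Random;
    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.stat.StatUtils;
    import org.apache.commons.math3.stat.descriptive.rank.Percentile;

    public class BootstrapSketch {
        // resample the population with replacement and return the bootstrap distribution of the mean
        public static double[] bootstrapMeans(ArrayRealVector population, int iterations, long seed) {
            Random rand = new Random(seed);
            int n = population.getDimension();
            double[] means = new double[iterations];
            double[] sample = new double[n];
            for (int i = 0; i < iterations; i++) {
                for (int j = 0; j < n; j++) {
                    sample[j] = population.getEntry(rand.nextInt(n));
                }
                means[i] = StatUtils.mean(sample);
            }
            return means;
        }

        public static void main(String[] args) {
            ArrayRealVector pop = new ArrayRealVector(new double[]{1.2, 0.4, 2.3, 0.9, 1.7, 0.2});
            double[] means = bootstrapMeans(pop, 2000, 42L);
            Percentile p = new Percentile();
            System.out.printf("95%% CI on the mean: [%.3f, %.3f]%n",
                    p.evaluate(means, 2.5), p.evaluate(means, 97.5));
        }
    }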

Phil 5.4.16

7:00 – 5:30

  • Had a thought about looking at the difference between a re-weighted network and the ‘original’ network. If the re-weighted network is, say, 95% similar to the original, then the re-weighting can be considered not significant and can therefore be read as a viable hypothesis. If, on the other hand, the difference is greater than that, then the degree of difference is an indication of how poorly the concepts(?) match the data.
  • And with that in mind, starting on An Introduction to the Bootstrap. Here’s hoping it’s readable… So far, so good. Made it through chapter one understanding most(?) things?

  • Added exponential mapping for the weight slider (one possible mapping is sketched at the end of this entry)

  • Commented out the lines that changed the weight in the docList and termList, and added them back in for the case where ‘use single counts’ is being changed.
  • Added the ‘top’ button. Need to implement it.
  • Adding a simple difference calculation
  • Figured out most of bootstrap in Excel.
  • Sprint planning.
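  • The exponential mapping amounts to treating equal slider steps as equal multiplicative steps in weight. One way to write it (the bounds here are made-up, not the LMN values):

    public class SliderMappingSketch {
        // t: slider position in [0, 1]; minW/maxW: weight bounds, both assumed > 0
        public static double exponentialWeight(double t, double minW, double maxW) {
            return minW * Math.pow(maxW / minW, t);
        }

        public static void main(String[] args) {
            for (double t = 0.0; t <= 1.0; t += 0.25) {
                System.out.printf("t=%.2f -> weight=%.3f%n", t, exponentialWeight(t, 0.1, 10.0));
            }
        }
    }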

Phil 5.3.16

7:00 – 3:30 VTX

  • Out riding, I realized that I could have a column called ‘counts’ that would add up the total number of ‘terms per document’ and ‘documents per term’. Unitizing the values would then show the number of unique terms per document. That’s useful, I think.
  • Helena pointed to an interesting CHI 2016 site. This is sort of the other side of extracting pertinence from relevant data. I wonder where they got their data from?
    • Found it! It’s in a public set of Google Docs, in XML and JSON formats. I found it by looking at the GitHub home page. In the example code there was this structure:
      source: {
          gdocId: '0Ai6LdDWgaqgNdG1WX29BanYzRHU4VHpDUTNPX3JLaUE',
          tables: "Presidents"
        }

      That gave me a hint of what to look for in the document source of the demo, where I found this:

      var urlBase = 'https://ca480fa8cd553f048c65766cc0d0f07f93f6fe2f.googledrive.com/host/0By6LdDWgaqgNfmpDajZMdHMtU3FWTEkzZW9LTndWdFg0Qk9MNzd0ZW9mcjA4aUJlV0p1Zk0/CHI2016/';
      

      And that’s the link from above.

    • There appear to be other useful data sets as well. For example, there is an extensive CHI paper database sitting behind this demo.
    • So this makes generalizing the PageRank approach much simpler since it looks like I can pull the data down pretty easily. In my case I think the best thing would be to write small apps that pull down the data and build Excel spreadsheets that are read in by the tool for now.
  • Exporting a new data set from Atlas. Done and committed. I need to do runs before meeting with Wayne.
  • Added Counts in and refactored a bit.
  • I think I want a list of what a doc or term is directly linked to and the number of references. Added the basics. Wiring up next. Done! But now I want to click on an item in the counts list and have it be selected? Or at least highlighted?
  • Stored the new version on dropbox: https://www.dropbox.com/s/92err4z2posuaa1/LMN.zip?dl=0
  • Meeting with Wayne
    • There’s some bug with counts. Add it to the WeightedItem.toString() and test.
    • Add a ‘move to top’ button near the weight slider that adds just enough weight to move the item to the top of the list. This could be iterative?
    • Add code that compares the population of ranks with the population of scaled ranks. Maybe bootstrapping? Apache Commons Math has KolmogorovSmirnovTest, which has public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict), which looks promising (a usage sketch is at the end of this entry).
  • Added ability to log out of the rating app.
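  • A quick usage sketch of that KolmogorovSmirnovTest call (the two rank populations below are placeholders, not real runs):

    import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

    public class RankCompareSketch {
        public static void main(String[] args) {
            double[] originalRanks = {0.91, 0.44, 0.38, 0.22, 0.15, 0.09};
            double[] scaledRanks   = {0.88, 0.51, 0.33, 0.25, 0.12, 0.08};

            KolmogorovSmirnovTest ks = new KolmogorovSmirnovTest();
            // a small p-value suggests the two rank populations don't come from the same distribution
            double pValue = ks.kolmogorovSmirnovTest(originalRanks, scaledRanks, false);
            System.out.println("KS p-value: " + pValue);
        }
    }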