Author Archives: pgfeldman

Phil 5.16.16

7:00 – 4:00 VTX

  • Writing
  • Extracting PDFs? Works! Built NaiveParser and added to JavaUtils2. Updating SVN and then trying against a small corpus of PDFs.
    • Need to add a way of adding strings with wildcard characters for things to delete, grab content, etc.
  • Fahad’s defense
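
The wildcard note above could be sketched roughly as follows. This is a minimal illustration only, not the NaiveParser code: the class name `WildcardFilter`, its methods, and the choice of `*` (any run of characters) and `?` (one character) as wildcards are all assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper: turns a wildcard string into a regex and deletes matches.
final class WildcardFilter {
    // '*' matches any run of characters (as few as possible), '?' matches one character.
    // Every other character is quoted so it is matched literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*?");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Removes every substring of text that matches the wildcard string.
    static String delete(String text, String wildcard) {
        return toPattern(wildcard).matcher(text).replaceAll("");
    }
}
```

The reluctant `.*?` keeps a pattern such as `Page * of 10` from swallowing everything between the first and last page footer in a document. Grabbing content instead of deleting it would use the same pattern with `Matcher.find()`.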

Phil 5.13.16

7:00 – 4:00 VTX

  • More writing
  • Looking for a discussion of ‘Thinking in the World’ by Edwin Hutchins
  • Fix workspace. Delete .idea folder from project and reload from svn?
    • Tried that. Turns out that the jar file was corrupted somehow. Wound up splitting off JavaUtils2 with the math and stanfordNLP libraries
    • Committed JavaUtils2. Still need to clean up JavaUtils or make a JavaUtils1
  • Talk on modelling user behavior using social media
  • Talking with Aaron at lunch, I need a button to hide items from rank calculations. That allows for iterative refinement?
  • Start on parsing PDFs?

Phil 5.12.16

7:00 – 5:00 VTX

  • Found a more detailed version of The Law of Group Polarization from the University of Chicago Law School
  • More paper writing.
    • Not sure if hemingwayapp is good for these kinds of papers, but it might be interesting to try
    • Working through code frequency counts
  • Finish TF-IDF – Done
  • Added sorting based on value in a Map.
  • Build term spreadsheet maker
    • TF-IDF term document matrix with weights at intersections
    • DF-ITF term document matrix with counts at intersections
    • DF-ITF as above with rank column?
  • Run several known corpora through and validate
  • Moved PageReader to JavaUtils
  • Had a nice chat with Stan.
  • Top ten DF-ITF terms get moby dick.(Library lie company quote Whale Melville Herman Summary search she)
  • Corrupted my LanguageModelTest workspace.
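
For the 'sorting based on value in a Map' item above, one common way to do it in Java is shown below. This is a sketch under assumptions: the class and method names are made up for illustration, and the JavaUtils version may well look different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: returns a copy of a term -> weight map ordered by weight.
final class MapSort {
    static <K> LinkedHashMap<K, Double> sortByValueDescending(Map<K, Double> map) {
        List<Map.Entry<K, Double>> entries = new ArrayList<>(map.entrySet());
        // Highest value first.
        entries.sort(Map.Entry.<K, Double>comparingByValue().reversed());
        // LinkedHashMap preserves insertion order, so iteration follows the sort.
        LinkedHashMap<K, Double> sorted = new LinkedHashMap<>();
        for (Map.Entry<K, Double> e : entries) sorted.put(e.getKey(), e.getValue());
        return sorted;
    }
}
```

Taking the top ten terms is then just a matter of reading the first ten keys of the returned map.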

Phil 5.11.16

7:00 – 4:30 VTX

  • Continuing paper – working on the ‘motivations’ section
  • Need to set the mode to interactive after a successful load
  • Need to find out where the JSON ratings are in the medicalpractitioner db? Or just rely on Jeremy’s interface? I guess it depends on what gets blown away. But it doesn’t seem like the JSON is in the db.
  • Added a stanfordNLP package to JavaUtils
    • NLPtoken stores all the extracted information about a token (word, lemma, index, POS, etc)
    • DocumentStatistics holds token data across one or more documents
    • StringAnnotator parses strings into NLPtokens.
  • Fixed a bunch of math issues (in Excel, too), but here are the two versions:
    am = 1.969
    be = 2.523
    da = 0.984
    do = 1.892
    i = 1.761
    is = 1.130
    it = 1.130
    let = 1.130
    not = 1.380
    or = 3.523
    thfor = 1.380
    think = 1.380
    to = 1.469
    what = 1.380

    And Excel:

     da	is	 it	 let	 not	 thfor	 think	 what	 to	 i	 do	 am	 be	 or
    0.984	1.130	1.130	1.130	1.380	1.380	1.380	1.380	1.469	1.761	1.892	1.969	2.523	3.523
    

Phil 5.10.16

7:00 – 4:00 VTX

  • Paper. Slow progress
  • Meeting with Wayne at 4:00.
    • Default to ‘interactive’ on LMP
    • Write the ‘motivations’ section – ‘The Lit Review that ate Itself’
  • Doing morning webpage rating
    • Need to clear out contents of notes and paste after a save
  • DF-ITF today
  • Labeled the spreadsheet.
  • Putting in the StanfordNLP plumbing
    • To solve the ‘missing models’ load bug, I went to the github download:
      C:\Development\stanford-corenlp-full-2015-12-09\stanford-corenlp-full-2015-12-09

      and linked to the

      stanford-corenlp-3.6.0-models.jar
  • Heth pulled me into a vortex of getting the derived datastore up and running on my windows box instance of PostGres. You need an eip user and an eip role to do this. The command that restores from a tar file is:
    pg_restore -c -i -U eip -d medicalpractitioner -v "qa-derived.tar"

    Here’s a screenshot of the populated DB: MedicalPractitionerDB and here’s the role screenshot: EipLoginRole

Phil 5.9.16

7:00 – 4:00 VTX

  • Started the paper describing the slider interface
  • TF-IDF today!
    • Read docs from web and PDF
    • Calculate the rank
    • Create matrix of terms and documents, weighted by occurrence.
  • Hmm. What I’m actually looking for is the lowest-occurring terms within a document that occur over the largest number of documents. I’ve used this page as a starting point. After flailing for many hours in java, I wound up walking through the algorithm in Excel and I think I’ve got it. This is the spreadsheet that embodies my delusional thinking ATM.
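
As a baseline for the notes above, here is a sketch of textbook TF-IDF, where tf is the term count divided by document length and idf is ln(N / document frequency). This is the standard formulation and not the variant worked out in the spreadsheet; the DF-ITF idea described above weights terms differently, and its exact formula is not given here. The class name and the input shape (documents as lists of already-tokenized terms) are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of textbook TF-IDF over pre-tokenized documents.
final class TfIdfSketch {
    // Returns one map per document: term -> tf * idf.
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: how many documents contain each term at least once.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                docFreq.merge(term, 1, Integer::sum);

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) counts.merge(term, 1, Integer::sum);

            Map<String, Double> weights = new TreeMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = e.getValue() / (double) doc.size();
                double idf = Math.log(docs.size() / (double) docFreq.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```

A term that appears in every document gets idf = ln(1) = 0 under this formulation, so the widely shared terms that the notes above are after would need a different weighting.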

Phil 5.6.16

7:00 – 4:00 VTX

  • Today’s shower thought is to compare the variance of the difference of two (unitized) rank matrices. The maximum difference would be (matrix size), so we do have a scale. If we assume a binomial distribution (there are many ways to be slightly different, only two ways to be completely different), then we can use a binomial (one tailed?) distribution centered on zero and ending at (matrix size). That should mean that I can see how far one item is from the other? But it will be within the context of a larger distribution (all zeros vs all ones)…
  • Before going down that rabbit hole, I decided to use the bootstrap method just to see if the concept works. It looks mostly good.
    • Verified that scaling a low-ranked item (ACLED) by 10 has less impact than scaling the highest ranking item (P61) by 1.28.
    • Set the stats text to red if it’s outside 1 SD and green if it’s within.
    • I think the terms can be played around with more because the top one (Pertinence) gets ranked at .436, while P61 has a rank of 1.
    • There are some weird issues with the way the matrix recalculates. Some states are statistically similar to others. I think I can do something with the thoughts above, but later.
  • There seems to be a bug calculating the current mean when compared to the unit mean. It may be that the values are so small? It’s occasional….
  • Got the ‘top’ button working.
  • And that’s it for the week…

LMT With Data2

Oh yeah – Everything You Ever Wanted To Know About Motorcycle Safety Gear

Phil 5.5.16

7:00 – 5:30 VTX

  • Continuing An Introduction to the Bootstrap.
  • This helped a lot. I hope it’s right…
  • Had a thought about how to build the Bootstrap class. Build it using RealVector and then use Interface RealVectorPreservingVisitor to do whatever calculation is desired. Default methods for Mean, Median, Variance and StdDev. It will probably need arguments for max iteration and epsilon.
  • Didn’t do that at all. Wound up using ArrayRealVector for the population and Percentile to hold the mean and variance values. I can add something else later
  • I think that to capture how the centrality affects the makeup of the data in a matrix, it makes sense to use the normalized eigenvector to multiply the counts in the initial matrix and submit that population (the whole matrix) to the Bootstrap
  • Meeting with Wayne? Need to finish tool updates though.
  • Got bogged down in understanding the Percentile class and how binomial distributions work.
  • Built and then fixed a copy ctor for Labled2DMatrix.
  • Testing. It looks ok, but I want to try multiplying the counts by the eigenVec. Tomorrow.
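
The core of the bootstrap idea in this entry can be sketched with plain arrays. This is not the Bootstrap class described above (which uses ArrayRealVector and Percentile from Apache Commons Math); the class name, the fixed iteration count in place of an epsilon test, and the seed parameter are assumptions made for illustration.

```java
import java.util.Random;

// Hypothetical sketch: bootstrap estimate of the mean and its spread.
final class BootstrapSketch {
    // Returns {mean of the resampled means, standard deviation of the resampled means}.
    // iterations must be at least 2.
    static double[] bootstrapMean(double[] population, int iterations, long seed) {
        Random rng = new Random(seed);
        double[] means = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            // Resample with replacement, same size as the original population.
            double sum = 0;
            for (int j = 0; j < population.length; j++)
                sum += population[rng.nextInt(population.length)];
            means[i] = sum / population.length;
        }
        double mean = 0;
        for (double m : means) mean += m;
        mean /= iterations;
        double var = 0;
        for (double m : means) var += (m - mean) * (m - mean);
        double sd = Math.sqrt(var / (iterations - 1));
        return new double[] {mean, sd};
    }
}
```

Flattening the whole matrix into one array and passing it in as the population would match the 'submit the whole matrix' idea above.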

Phil 5.4.16

7:00 – 5:30

  • Had a thought about looking at the difference of a re-weighted network and the ‘original’ network. If the reweighted network is, say, 95% similar to the original, then the re-weighting can be considered not to be significant and can therefore be read as a viable hypothesis. If, on the other hand, the difference is greater than that, then the degree of difference is an indication of how poor the match of the concepts(?) vs the data is.
  • And with that in mind, starting on An Introduction to the Bootstrap. Here’s hoping it’s readable… So far, so good. Made it through chapter one understanding most(?) things?

  • Added exponential mapping for weight slider

  • Commented out the lines that changed the weight in the docList and termList. And added them back in if the ‘use single counts’ option is being changed.
  • Added the ‘top’ button. Need to implement
  • Adding a simple difference calculation
  • Figured out most of bootstrap in Excel.
  • Sprint planning.
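
One plausible shape for the exponential slider mapping mentioned above is sketched here. The slider range of [-1, 1], the `maxScale` parameter, and the class name are assumptions; the actual mapping in the tool is not described in this entry.

```java
// Hypothetical sketch: exponential mapping between a slider position and a weight.
final class SliderMapping {
    // Maps a slider position in [-1, 1] to a multiplier in [1/maxScale, maxScale].
    // The centre position (0) always maps to a weight of 1, i.e. 'no change'.
    static double toWeight(double sliderPos, double maxScale) {
        return Math.pow(maxScale, sliderPos);
    }

    // Inverse mapping, e.g. for positioning the slider when a weight is loaded from a file.
    static double toSliderPos(double weight, double maxScale) {
        return Math.log(weight) / Math.log(maxScale);
    }
}
```

With this shape, equal slider movements up or down multiply or divide the weight by the same factor, which is the usual reason for choosing an exponential mapping over a linear one.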

Phil 5.3.16

7:00 – 3:30 VTX

  • Out riding, I realized that I could have a column called ‘counts’ that would add up the total number of ‘terms per document’ and ‘documents per term’. Unitizing the values would then show the number of unique terms per document. That’s useful, I think.
  • Helena pointed to an interesting CHI 2016 site. This is sort of the other side of extracting pertinence from relevant data. I wonder where they got their data from?
    • Found it! It’s in a public set of Google docs, in XML and JSON formats. I found it by looking at the GitHub home page. In the example code there was this structure:
      source: {
          gdocId: '0Ai6LdDWgaqgNdG1WX29BanYzRHU4VHpDUTNPX3JLaUE',
          tables: "Presidents"
        }

      That gave me a hint of what to look for in the document source of the demo, where I found this:

      var urlBase = 'https://ca480fa8cd553f048c65766cc0d0f07f93f6fe2f.googledrive.com/host/0By6LdDWgaqgNfmpDajZMdHMtU3FWTEkzZW9LTndWdFg0Qk9MNzd0ZW9mcjA4aUJlV0p1Zk0/CHI2016/';
      

      And that’s the link from above.

    • There appear to be other useful data sets as well. For example, there is an extensive CHI paper database sitting behind this demo.
    • So this makes generalizing the PageRank approach much more simple since it looks like I can pull the data down pretty simply. In my case I think the best thing would be to write small apps that pull down the data and build Excel spreadsheets that are read in by the tool for now.
  • Exporting a new data set from Atlas. Done and committed. I need to do runs before meeting with Wayne.
  • Added Counts in and refactored a bit.
  • I think I want a list of what a doc or term is directly linked to and the number of references. Added the basics. Wiring up next. Done! But now I want to click on an item in the counts list and have it be selected? Or at least highlighted?
  • Stored the new version on dropbox: https://www.dropbox.com/s/92err4z2posuaa1/LMN.zip?dl=0
  • Meeting with Wayne
    • There’s some bug with counts. Add it to the WeightedItem.toString() and test.
    • Add a ‘move to top’ button near the weight slider that adds just enough weight to move the item to the top of the list. This could be iterative?
    • Add code that compares the population of ranks with the population of scaled ranks. Maybe bootstrapping? Apache Commons Math has KolmogorovSmirnovTest, which has public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict), which looks promising.
  • Added ability to log out of the rating app.
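
To make the rank-comparison idea from the meeting notes concrete, here is a hand-rolled two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distributions. It is a sketch only and computes the statistic D, not a p-value. The `kolmogorovSmirnovTest` method from Apache Commons Math quoted above would be the natural thing to use in practice. The class and method names here are assumptions.

```java
import java.util.Arrays;

// Hypothetical sketch: two-sample Kolmogorov-Smirnov statistic D (no p-value).
final class KsSketch {
    // Returns sup |F_x(v) - F_y(v)| over all v, where F is the empirical CDF.
    static double statistic(double[] x, double[] y) {
        double[] a = x.clone(), b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0;
        double d = 0;
        while (i < a.length && j < b.length) {
            // Step both samples past the next distinct value (handles ties).
            double v = Math.min(a[i], b[j]);
            while (i < a.length && a[i] <= v) i++;
            while (j < b.length && b[j] <= v) j++;
            d = Math.max(d, Math.abs(i / (double) a.length - j / (double) b.length));
        }
        return d;
    }
}
```

A value of 0 means the two populations of ranks have identical empirical distributions, and 1 means they do not overlap at all.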

Phil 5.2.16

7:00 – 3:00 VTX

  • How to get funding using Web of Science
  • http://www.grants.gov/web/grants/search-grants.html
  • http://www.research.gov/
  • Finished  Supporting Reflective Public Thought with ConsiderIt
    • Watched the ConsiderIt demo. I love the histogram that shows how the issue polarization is characterized.
  • Back to  Informed Citizenship in a Media-Centric Way of Life
    • Page 225 – Conclusions: As prescriptive as it may sound, it is time to suspend the normative traditions that envelop journalism and democracy, take stock of how knowledge is explicated and operationalized, and calibrate research practice to accommodate an explication of informed citizenship and democratic participation fitted to contemporary life. Doing so strays from the dominant research paradigm, grounded in convictions about the supremacy of rational thought, verbal information, news as cold hard facts, and electoral activities as the gold standard of participatory practices. We advanced arguments for a departure from tradition and elaborated on how the very notions of informed citizenship and political participation are mutating in (and because of) the current media environment.
    • And this is kind of scary: Freedom is on the longest global downward trajectory in 40 years (Freedom House, 2011), democratic failure is at the highest rate since the mid-1980s (Diamond, 1999), and there are indicators of qualitative erosion in democratic practice worldwide (Bertelsmann Foundation, 2012). The people’s view on democratic life appears tepid, in several parts of the world, there are reports of a so-called authoritarian nostalgia among citizens who live in Asian countries that are transforming to democratic systems of governance (Chang, Chu, & Park, 2007) while a mere half (or fewer) of Russians, Poles, Ukrainians, and Indonesians expressed strong support for democratic rule (World Public Opinion.org, 2015).
      • Make America Great Again.
    • Done. Reading this makes me feel more like a connectivist/AI revolution is coming that will either tend towards isolating us more or finding ways to bring us together. The thing is that we’re wired to do both. So this really is a design problem.
  • ————————————
  • Well drat, was going to do some light work on developing the ranking app, but it looks like I forgot to check in the latest version of Java Utils
  • Installed Launch4j
  • TODO:
    • Add a ‘session name’ text field – done
    • Add an ‘interactive’ checkbox. If it’s selected, then a change in the weight slider will fire calculate(). Done
    • Fixed the ‘Reset Weights’
    • Got the ‘Use Unit Weights’ option working. I just replace all the non-zero values in the derived symmetric matrix with 1.0. I have a suspicion that this will come back to bite me, but for now I can’t think of a reason. The only thing that I really don’t like is that there is no obvious change in the data. The ‘Weights’ column actually means ‘scalar’. The issue is that the whole matrix would have to be shown, since the weight exists at the intersection of two items. So a row or column is sort of a sum of weights.
    • Start TF-IDF app. It should do the following:
      • Take a list of URIs (local or remote, pdf, html, text). These are the documents
      • Read each of the documents into a data structure that has
        • Document title
        • Keywords (if called out)
        • Word list (lemmatized)
          • Word
          • Document count
          • Parts Of Speech(?)
      • Run TF-IDF to produce an ordered list of terms
      • Build a co-occurrence matrix of terms and documents
      • Output matrix to Excel.
  • The end of a good day:

LMT With Data

Phil 5.1.16

  • I have Supporting Reflective Public Thought with ConsiderIt for homework, but it’s worth adding to the Lit review
    • ConsiderIt is still around. It’s looking pretty nice, actually. Not much in the way of backlinks though (https___consider).
    • I like the inclusion of Nudge Theory. It’s an important point that design that affects masses of people has to take this basic consideration to heart. I contend that nudging is happening now, towards fragmentation and Group Polarization. The forces that drive advertising (and through association, content) to ever more targeted audiences means that each of these audiences can be nudged in different directions without knowing that they are even part of the group that is polarizing.

Phil 4.29.16

7:00 – 5:00 VTX

  • Expense reports and timesheets! Done.
  • Continuing Informed Citizenship in a Media-Centric Way of Life
    • The pertinence interface may be an example of a UI affording the concept of monitorial citizenship.
      • Page 219: The monitorial citizen, in Schudson’s (1998) view, does environmental surveillance rather than gathering in-depth information. By implication, citizens have social awareness that spans vast territory without having in-depth understanding of specific topics. Related to the idea of monitorial instead of informed citizenship, Pew Center (2008) data identified an emerging group of young (18–34) mobile media users called news grazers. These grazers find what they need by switching across media platforms rather than waiting for content to be served.
    • Page 222: Risk as Feelings. The abstract is below. There is an emotional hacking aspect here that traditional journalism has used (heuristically?) for most(?) of its history.
      • Virtually all current theories of choice under risk or uncertainty are cognitive and consequentialist. They assume that people assess the desirability and likelihood of possible outcomes of choice alternatives and integrate this information through some type of expectation-based calculus to arrive at a decision. The authors propose an alternative theoretical perspective, the risk-as-feelings hypothesis, that highlights the role of affect experienced at the moment of decision making. Drawing on research from clinical, physiological, and other subfields of psychology, they show that emotional reactions to risky situations often diverge from cognitive assessments of those risks. When such divergence occurs, emotional reactions often drive behavior. The risk-as-feelings hypothesis is shown to explain a wide range of phenomena that have resisted interpretation in cognitive–consequentialist terms.
    • At page 223 – Elections as the canon of participation

  • Working on getting tables to sort – Done

  • Loading excel file -done
  • Calculating – done
  • Using weights -done
  • Reset weights – done
  • Saving (don’t forget to add sheet with variables!) – done
  • Wrapped in executable – done
  • Uploading to dropbox. Wow – the files with JavaFX are *much* bigger than Swing.

Phil 4.28.16

7:00 – 5:00 VTX

  • Reading Informed Citizenship in a Media-Centric Way of Life
    • Jessica Gall Myrick
    • This is a bit out of the concentration of the thesis, but it addresses several themes that relate to system and social trust. And I’m thinking that behind these themes of social vs. system is the Designer’s Social Trust of the user. Think of it this way: If the designer has a high Social Trust intention with respect to the benevolence of the users, then a more ‘human’ interactive site may result with more opportunities for the user to see more deeply into the system and contribute more meaningfully. There are risks in this, such as hellish comment sections, but also rewards (see the YouTube comments section for The Idea Channel episodes). If the designer has a System Trust intention with respect to, say, the reliability of the user watching ads, then different systems get designed, ones that learn to generate click-bait using neural networks (such as clickotron). Or, closer to home, Instagram might decide to curate a feed for you without affordances to support changing of feed options. The truism goes ‘If you’re not paying, then you’re the product’. And products aren’t people. Products are systems.
    • Page 218: Graber (2001) argues that researchers often treat the information value of images as a subsidiary to verbal information, rather than having value themselves. Slowly, studies employing visual measures and examining how images facilitate knowledge gain are emerging (Grabe, Bas, & van Driel, 2015; Graber, 2001; Prior, 2014). In a burgeoning media age with citizens who overwhelmingly favor (audio)visually distributed information, research momentum on the role of visual modalities in shaping informed citizenship is needed. Paired with it, reconsideration of the written word as the preeminent conduit of information and rational thought are necessary.
      • The rise of infographics  makes me believe that it’s not image and video per se, but clear information with low cognitive load.
  • ————————–
  • Bob had a little trouble with inappropriate and unclear identity, as well as education, info and other
  • Got tables working for terms and docs.
  • Got callbacks working from table clicks
  • Couldn’t get the table to display. Had to use this ugly hack.
  • Realized that I need name, weight and eigenval. Sorting is by eigenval. Weight is the multiplier of the weights in a row or column associated with a term or document. Mostly done.
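
The name / weight / eigenval relationship in the last item can be illustrated with a small sketch: scale the row and column for one item by its weight, then take the principal eigenvector by power iteration and sort on it. This is an assumption-heavy illustration, not the tool's code: the class and method names are invented, the 2x2 matrix in the usage note is a toy example, and power iteration is only one way to get the eigenvector (it also may not converge for every symmetric matrix, e.g. a purely bipartite term-document adjacency).

```java
import java.util.Arrays;

// Hypothetical sketch: weight one item in a symmetric matrix and re-rank by eigenvector.
final class WeightedRankSketch {
    // Returns a copy of m with row k and column k multiplied by weight
    // (the diagonal entry m[k][k] is multiplied once).
    static double[][] applyWeight(double[][] m, int k, double weight) {
        int n = m.length;
        double[][] out = new double[n][];
        for (int r = 0; r < n; r++) out[r] = m[r].clone();
        for (int i = 0; i < n; i++) {
            out[k][i] *= weight;
            if (i != k) out[i][k] *= weight;
        }
        return out;
    }

    // Power iteration: returns an estimate of the principal eigenvector,
    // scaled so that its largest entry is 1.
    static double[] principalEigenvector(double[][] m, int iterations) {
        int n = m.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double max = 0;
            for (int r = 0; r < n; r++) {
                for (int c = 0; c < n; c++) next[r] += m[r][c] * v[c];
                max = Math.max(max, Math.abs(next[r]));
            }
            if (max == 0) return v; // all-zero matrix: nothing to rank
            for (int r = 0; r < n; r++) next[r] /= max;
            v = next;
        }
        return v;
    }
}
```

For the toy matrix {{2, 1}, {1, 2}} both items start with equal eigenvector entries. After `applyWeight(m, 0, 2.0)` the first item's entry stays at 1 while the second drops to about 0.618, so the weighted item moves to the top of the sorted list.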

Phil 4.27.16

7:00 – 5:30 VTX

  • Finished A fistful of bitcoins: characterizing payments among men with no names
    • In reading the discussion about ‘peeling’, I wonder whether, in a similar way, if someone returns to a story repeatedly, an adversary would be able to find out anything useful? Or, if Bitcoin were used to pay for stories, would tracking transactions do anything as well? One of the nice things about using aliases for BC addresses is that other than the initial mapping, the address can be hidden in the system.
    • Page 93: ...even the most motivated Bitcoin users (i.e., criminals) are engaging in idioms of use that allow us to erode their anonymity.
      • This is an important point. As with biometrics at the small scale, we are identifiable through our behaviors. In this case, idioms or patterns of usage.
  • Rating app
    • Add people – done
    • Add John’s suggestions – done
    • Build and deploy – Done. Waiting on Andy.
  • Write up TF_IDF story
    • Basic capability – 11 points
      • The initial part of the effort is to scan over the collection of documents and produce a list of words ordered by TF-IDF. This means iterating over all the documents and producing a Set<String> of words that are then run over the set of documents. The output should be an excel file that lists the documents in the corpus, and the list of words.
        • Documents should be listed in a file (xml?) as URIs. HTML docs can be read by jsoup, PDF by PDFBox.
        • The TF-IDF algorithm is discussed here: https://guendouz.wordpress.com/2015/02/17/implementation-of-tf-idf-in-java/
    • Pull pages from approved flags – 3 points
      • The second part of the effort is to use Jeremy’s REST interface to extract the URLs of ‘cleared’ flags to use as the input to the app, via the input file (or call from within the app, though there may be certs issues)
    • Report with new term recommendations – 3 points
      • Using the rating app, we should be able to try using these new terms and see if they improve results. One of the items that will need to be returned from the DB (that’s already stored in the QueryObject2) so we can see if we’re getting cleaner results.
  • LanguageModelNetworks
    • Read in a spreadsheet (xls and xlsx)
    • Write out spreadsheets (page containing the data information)
      • File
      • User
      • Date run
      • Settings used
    • allow for manipulation of row and column values (in this case, papers and codes, but the possibilities are endless)
      • Select the value to manipulate (reset should be an option)
      • Spinner/entry field to set changes (original value in label)
      • ‘Calculate’ button
      • Sorted list(s) of rows and columns. (indicate +/- change in rank)
    • Reset all button
    • Normalize all button
    • Progress for today! Lots of wiring up to do though: LMT