Phil 6.14.16

7:00 – 3:00 VTX

  • Working on finishing the papers of the CSCW chairs
  • Built a new version of RatingApp and sent it over to Andy to deploy.
  • More rating. Once done, run through the domains and see what comes up.
    • Finished!
    • Ok, I have a mix of html, pdf and msword docs.
      • Change corpusManager so the config file can handle multiple types
      • Convert the docs to pdf. Done
      • Parsing. Ran into a java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider error. Added the org.bouncycastle:bcprov-jdk16:1.46 dependency and now all seems fine…?
  • Need to trap a reset connection and resubmit. Done
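
A minimal sketch of the "trap a reset connection and resubmit" idea, not the actual fix. It assumes the reset surfaces as a java.net.SocketException; the names ResetRetry, Request and sendWithRetry are hypothetical, as are the attempt count and backoff.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.SocketException;

/** Hypothetical helper: resubmits a request when the connection is reset. */
final class ResetRetry {
    /** Any network call that may fail with an IOException. */
    interface Request<T> {
        T send() throws IOException;
    }

    static <T> T sendWithRetry(Request<T> request, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.send();
            } catch (SocketException e) {
                // Assumption: "Connection reset" arrives as a SocketException.
                last = e;
                if (attempt < maxAttempts) {
                    pause(backoffMillis * attempt); // linear backoff before resubmitting
                }
            } catch (IOException e) {
                // Other I/O failures are not retried in this sketch.
                throw new UncheckedIOException(e);
            }
        }
        throw new UncheckedIOException(last);
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }
}
```

Usage would look like ResetRetry.sendWithRetry(() -> fetch(url), 3, 500L), where fetch is whatever call currently hits the reset.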

Phil 6.13.16

6:30 – 2:30 VTX

Phil 6.9.16

6:00 – 12:00 Writing

  • Going to go through the RQs and describe how to address them
  • Start with the back end and my local cohort, which I can assume to be diversity-seeking because of where they are.
  • Iteratively develop tool so that it gets used for diversity-related activities
  • Logs and questionnaires.
  • Scraping for Google Scholar and CaseLaw? Java code is here.
  • Looks like Google Scholar has also started to add the concept of pertinence in?
  • Finished the Research Plan. Do need a timeline.
  • Finished discussion/conclusion. Done(ish)!

Phil 6.8.16

6:30 – 4:30 Writing

  • Wondering if I should add a section on trust and credibility
  • Huh – just saw this on Google image search. You get a bar of context words that allow for drilling down into a result
  • Reworked a lot of the paper since the whole anonymous part has been shelved
  • Started on the hypotheses and research questions in the plan section

Phil 6.6.16

6:30 – 1:00 Writing

  • Realized that I had forgotten to go into how the information-seeking behavior of IR users can potentially be used to vet the quality of the information they are looking at.
  • Working my way through the lit review.

Phil 6.5.16

8:00 – 2:00 Writing

Phil 6.4.16

7:30 – 1:30 Writing

  • More on libraries and serendipity. Found lots, and then went on to look for mentions in electronic retrieval. Found Foster’s A Nonlinear Model of Information-Seeking Behavior, which also has some spiffy citations. Going to take a break from writing and actually read this one, because I just realized that interdisciplinary researchers are the rough academic equivalent of the explorer pattern.
  • Investigating Information Seeking Behavior Using the Concept of Information Horizons
    • Page 3 – To design and develop a new research method we used Sonnenwald’s (1999) framework for human information behavior as a theoretical foundation. This theoretical framework suggests that within a context and situation is an ‘information horizon’ in which we can act. For a particular individual, a variety of information resources may be encompassed within his/her information horizon. They may include social networks, documents, information retrieval tools, and experimentation and observation in the world. Information horizons, and the resources they encompass, are determined socially and individually. In other words, the opinions that one’s peers hold concerning the value of a particular resource will influence one’s own opinions about the value of that resource and, thus, its position within one’s information horizon. 

Phil 6.2.16

7:00 – 5:00 VTX

  • Writing
  • Write up sprint story – done
    • Develop a ‘training’ corpus of known bad actors (KBA) for each domain.

      • KBAs will be pulled from http://w3.nyhealth.gov/opmc/factions.nsf, which provides a large list.
      • List of KBAs will be added to the content rating DB for human curation
      • HTML and PDF data will be used to populate a list of documents that will then be scanned and analyzed to prepare TF-IDF and LSI term-document tables.
      • The resulting table will in turn be analyzed using term centrality, with the output being an ordered list of terms to be evaluated for each domain.

  • Building view to get person, rating and link from the db – done, or at least V1
    CREATE VIEW view_ratings AS
      SELECT io.link, qo.search_type, po.first_name, po.last_name, po.pp_state, ro.person_characterization FROM item_object io
        INNER JOIN query_object qo ON io.query_id = qo.id
        INNER JOIN rating_object ro ON io.id = ro.result_id
        INNER JOIN poi_object po ON qo.provider_id = po.id;
  • Took results from w3.nyhealth.gov and ran them through the whole system. The full results are in the Corpus file under w3.nyhealth.gov-PDF-centrality_06_02_16-13_12_09.xlsx and w3.nyhealth.gov-WEB-centrality_06_02_16-13_12_09.xlsx. The results seem to make incredibly specific searches. Here are the first two examples. Note that there are very few .com sites.
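
For reference, the TF-IDF step in the sprint story can be sketched roughly as follows. This is one common variant (term frequency normalized by document length, natural-log inverse document frequency) and not necessarily the weighting the corpus tooling uses; the class and method names are hypothetical, and the documents are assumed to be already tokenized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of building a TF-IDF term-document table. */
final class TfIdfSketch {
    /** Returns one term -> weight map per document, in input order. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents each term appears.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> table = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = e.getValue() / (double) doc.size();
                double idf = Math.log(docs.size() / (double) docFreq.get(e.getKey()));
                row.put(e.getKey(), tf * idf);
            }
            table.add(row);
        }
        return table;
    }
}
```

With this variant, a term that appears in every document gets a weight of zero, which is what pushes the ordered term list toward domain-specific vocabulary.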

Phil 6.1.16

7:00 – 2:00 VTX

Phil 5.31.16

7:00 – 4:30 VTX

  • Writing. Working on describing how maintaining many codes in a network contains more (and more subtle) information than grouping similar codes.
  • Working on the UrlChecker
    • In the process, I discovered that the annotation.xml file is unique only for the account and not for the CSE. All CSEs for one account are contained in one annotation file
    • Created a new annotation called ALL_annotations.xml
    • fixed a few things in Andy’s file
    • Reading in everything. Now to produce the new sets of lists.
    • I think it’s just easier to delete all the lists and start over.
    • Done and verified. You run UrlChecker from the command line, with the input file being a list of domains (one per line) and the ALL_annotations.xml file.
  • https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2
  • Need to add a Delete or Hide button to reduce a large corpus to a more effective size.
  • Added. Tomorrow I’ll wire up the deletion of a row or column and the recreation of the initialMatrix.
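
A rough sketch of the kind of check UrlChecker does: read the annotations XML and report which input domains are not covered yet. This is not the real UrlChecker. It assumes the annotation file stores each site pattern in the about attribute of an Annotation element, and the class name, method names and crude substring match are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: find domains that have no annotation yet. */
final class AnnotationCheckSketch {
    /** Collects the "about" pattern of every Annotation element (assumed layout). */
    static Set<String> annotatedPatterns(String xml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("Annotation");
            Set<String> patterns = new HashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                patterns.add(((Element) nodes.item(i)).getAttribute("about"));
            }
            return patterns;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse annotations XML", e);
        }
    }

    /** Returns the domains that no annotation pattern mentions (substring match). */
    static List<String> missingDomains(List<String> domains, Set<String> patterns) {
        List<String> missing = new ArrayList<>();
        for (String domain : domains) {
            boolean found = false;
            for (String pattern : patterns) {
                if (pattern.contains(domain)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(domain);
            }
        }
        return missing;
    }
}
```

In a command-line version, the XML string would come from ALL_annotations.xml and the domain list from the one-per-line input file.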

Phil 5.30.16

7:00 – 10:00 Thesis/VTX

  • Built a new matrix for the coded lit review. I had coded a couple more papers.
  • Working on copying over the read papers into a new folder that I can run text analytics over
  • After carefully reading through the doc manager list and copying over each paper, I just discovered I could have exported selected.
  • Ooops: Exception in thread “JavaFX Application Thread” java.lang.IllegalArgumentException: Invalid column index (16384).  Allowable column range for EXCEL2007 is (0..16383) or (‘A’..’XFD’)
    • Going to add a limit of
      SpreadsheetVersion.EXCEL2007.getMaxColumns()-8

      columns for now. Clearly that can be cut down.

    • Figuring out where to cut the terms. I’m summing the columns of the LSI calculation, then accumulating those sums starting at the highest value and dividing the running total by the sum of all values. The top 20% of rank weights gives 280 columns. Going to try that first.
    • Success! Some initial thoughts
      • The coded version is much more ‘crisp’
      • There are interesting hints in the LSI version
      • Clicking on a term or paper to see the associated items is really nice.
      • I think that document subgroups might be good/better, and it might be possible to use the tool to help build those subgroups. This goes back to the ‘hiding’ concept. (hide item / hide item and associated)
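
One reading of the term cutoff described above, as a small sketch: sort the column sums from highest to lowest and keep columns until their running total reaches a given fraction of the overall sum. The class and method names are hypothetical, and the real calculation may differ in detail.

```java
import java.util.Arrays;

/** Hypothetical sketch of the cumulative-weight column cutoff. */
final class ColumnCutSketch {
    /**
     * Returns how many of the highest-weighted columns are needed for their
     * combined weight to reach the given fraction of the total weight.
     */
    static int columnsToKeep(double[] columnSums, double fraction) {
        double[] sorted = columnSums.clone();
        Arrays.sort(sorted); // ascending, so walk it from the end
        double total = 0.0;
        for (double v : sorted) {
            total += v;
        }
        if (total <= 0.0) {
            return 0; // nothing to rank
        }
        double running = 0.0;
        int kept = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            running += sorted[i];
            kept++;
            if (running / total >= fraction) {
                break;
            }
        }
        return kept;
    }
}
```

Calling columnsToKeep(sums, 0.2) corresponds to the "top 20% of rank weights" cut; the result would still need to stay under the spreadsheet column limit mentioned above.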

Phil 5.27.16

7:00 – 2:00 VTX

  • Wound up writing the introduction and saving the old intro to a new document – Themesurfing
  • Renamed the current document
  • Got the parser working. Old artifact settings.
  • Added some tweaks to show progress better. I’m kinda stuck with the single thread in JavaFX having to execute before text can get shown.
  • Need an XML parser to find out what sites have already been added. Added an IntelliJ project to the GoogleCseConfigFiles SVN file. Should be able to finish it on Tuesday.

Phil 5.26.16

7:00 – 5:00 VTX

Phil 5.25.16

7:00 – 4:30 VTX

  • Took the weekend off for the ESCN. Bailed on Saturday because of rain, then dodged rain for two days, then got a nice ride in on Tuesday.
  • Chatting with Aaron last night, I discovered that the REST API won’t work for Demo. I’ll need to get a new SQL dump from Heath. No, actually it works just fine in that it is accessible, but anything other than empty sets is a timeout.
  • Need to try to build a new jar file for the CorpusManager so it can have its own executable. Put the Manifest in the CM directory? Not sure how to do that.
  • Writing
  • Looks like my old laptop finally bit the dust. Chromebook time?
  • Working on the Corpus manager to pull in links in the config file. Done!
  • So I’m having all kinds of problems getting the flag info from Jeremy’s REST service. I did realize that I can use the dashboard though and harvest the urls by following the links and build my list that way. Except that the flags are crap. Back to Moby Dick for the moment.
  • Actually, those are pretty bad too. Margarita put three urls up on confluence.
  • Got url scanning done through config file.
  • Ingested the first four chapters of Moby Dick. Pretty interesting. I’ll try those three files tomorrow and we’ll see what we’ve got, at least for a sense of .gov sites…