Author Archives: pgfeldman

Phil 2.3.16

7:00 – 3:00 VTX

  • Just discovered Publius – a Web publishing system that is highly resistant to censorship and provides publishers with a high degree of anonymity. No longer active, but it produced a paper.
  • Continuing On the Accuracy of Media-based Conflict Event Data. Currently starting Matching Media-based Conflict Reports with Military Records
  • Back to Googlehacking
    • Since I’ve got the provider JSON, setting up objects that I can use for more in-depth parsing. Thinking that this could be an example of ‘code’ in the dictionary. A word can be an object that knows how to look through a section of text to see if it can find itself.
    • I think running several dictionaries over a document could be interesting. For example, using a medical and a legal dictionary on a document would let the system infer malpractice as opposed to a document on foreign aid.
    • Generated the right queries, and they work in the browser:
      "Ram Singh"
      	ALL_GOV(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+sanctions
      	ALL_GOV(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+criminal
      	ALL_GOV(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+malpractice
      	ALL_GOV(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+board+actions
      	ALL_US(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+sanctions
      	ALL_US(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+criminal
      	ALL_US(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+malpractice
      	ALL_US(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+board+actions
      	ALL_ORG(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+sanctions
      	ALL_ORG(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+criminal
      	ALL_ORG(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+malpractice
      	ALL_ORG(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+board+actions
      	RESTRICTED_COM(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+sanctions
      	RESTRICTED_COM(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+criminal
      	RESTRICTED_COM(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+malpractice
      	RESTRICTED_COM(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+board+actions
      	ALL_EDU(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+sanctions
      	ALL_EDU(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+criminal
      	ALL_EDU(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+malpractice
      	ALL_EDU(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+board+actions
  • So the next thing is to start running these queries and looking at the results to see if there are patterns. And I would be further along, but IntelliJ choked when I tried to add JPA. After flailing for a while I just gave up, created a new project, copied all the lib src and persistence directories over, updated the structure, and it all works. Grumble grumble.
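The query generation above could be sketched roughly like this (QueryBuilder, buildQuery, and the API_KEY placeholder are my names, not the actual project code; the engine ID is taken from the ALL_GOV queries above):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryBuilder {
    // Placeholder -- the real key comes from the Google API console
    static final String API_KEY = "YOUR_API_KEY";

    // Build one custom-search URL: the quoted name, the state, and a dictionary
    // term, scoped to a custom search engine (the cx parameter). URLEncoder
    // produces the %22...%22 and + escapes seen in the queries above.
    static String buildQuery(String engineId, String name, String state, String term) {
        String q = URLEncoder.encode("\"" + name + "\" " + state + " " + term,
                StandardCharsets.UTF_8);
        return "https://www.googleapis.com/customsearch/v1?key=" + API_KEY
                + "&cx=" + engineId + "&q=" + q;
    }

    public static void main(String[] args) {
        for (String term : new String[]{"sanctions", "criminal", "malpractice", "board actions"}) {
            System.out.println("ALL_GOV(" + term + "): "
                    + buildQuery("017379340413921634422:lqt7ih7tgci", "Ram Singh", "VA", term));
        }
    }
}
```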

Phil 2.2.16

7:00 –

Phil 2.1.16

9:00 – 4:00 VTX

Phil 1.29.16

7:00 – 3:30 VTX

Phil 1.28.16

5:30 – 3:30 VTX

  • Continuing The Hybrid Representation Model for Web Document Classification. Good stuff, well written. This paper (An Efficient Algorithm for Discovering Frequent Subgraphs) may be good for recognizing patterns between stories. Possibly also images.
  • Useful page for set symbols that I can never remember: http://www.rapidtables.com/math/symbols/Set_Symbols.htm
  • Finally discovered why the RdfStatementNodes aren’t assembling properly. There is no root statement… Fixed! We can now go from:
    <rdf:RDF
      xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
      xmlns:vCard='http://www.w3.org/2001/vcard-rdf/3.0#'>

      <rdf:Description rdf:about="http://somewhere/JohnSmith/">
        <vCard:FN>John Smith</vCard:FN>
        <vCard:N rdf:parseType="Resource">
          <vCard:Family>Smith</vCard:Family>
          <vCard:Given>John</vCard:Given>
        </vCard:N>
      </rdf:Description>

      <rdf:Description rdf:about="http://somewhere/RebeccaSmith/">
        <vCard:FN>Becky Smith</vCard:FN>
        <vCard:N rdf:parseType="Resource">
          <vCard:Family>Smith</vCard:Family>
          <vCard:Given>Rebecca</vCard:Given>
        </vCard:N>
      </rdf:Description>

      <rdf:Description rdf:about="http://somewhere/SarahJones/">
        <vCard:FN>Sarah Jones</vCard:FN>
        <vCard:N rdf:parseType="Resource">
          <vCard:Family>Jones</vCard:Family>
          <vCard:Given>Sarah</vCard:Given>
        </vCard:N>
      </rdf:Description>

      <rdf:Description rdf:about="http://somewhere/MattJones/">
        <vCard:FN>Matt Jones</vCard:FN>
        <vCard:N
          vCard:Family="Jones"
          vCard:Given="Matthew"/>
      </rdf:Description>

    </rdf:RDF>

    to this:

    [1]: http://somewhere/SarahJones/
    --[5] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Sarah Jones"
    --[4] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffd)
    ----[6] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Sarah"
    ----[7] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Jones"
    [3]: http://somewhere/MattJones/
    --[15] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Matt Jones"
    --[14] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffc)
    ----[11] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Jones"
    ----[10] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Matthew"
    [0]: http://somewhere/RebeccaSmith/
    --[3] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Becky Smith"
    --[2] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffe)
    ----[9] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Smith"
    ----[8] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Rebecca"
    [2]: http://somewhere/JohnSmith/
    --[12] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7fff)
    ----[1] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Smith"
    ----[0] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "John"
    --[13] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "John Smith"
  • Some thoughts about information retrieval using graphs
  • Sent a note to Theresa asking for people to do manual flag extraction
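The root-statement fix could be sketched like this (StatementTree and its nested Statement record are hypothetical stand-ins for the actual RdfStatementNode classes): a subject that never appears as the object of another statement is a root, and the tree is assembled by recursing from roots into blank-node objects.

```java
import java.util.*;

public class StatementTree {
    // A minimal stand-in for a Jena Statement: subject, predicate, object.
    record Statement(String subject, String predicate, String object) {}

    // The missing piece in the bug above: a subject that is never the object
    // of some other statement has no parent, so it must be treated as a root.
    static List<String> findRoots(List<Statement> stmts) {
        Set<String> subjects = new LinkedHashSet<>();
        Set<String> objects = new HashSet<>();
        for (Statement s : stmts) {
            subjects.add(s.subject());
            objects.add(s.object());
        }
        subjects.removeAll(objects);
        return new ArrayList<>(subjects);
    }

    // Depth-first print, indenting by "--" per level as in the output above.
    static void print(List<Statement> stmts, String subject, String indent) {
        for (Statement s : stmts) {
            if (!s.subject().equals(subject)) continue;
            System.out.println(indent + "Subject: " + s.subject()
                    + ", Predicate: " + s.predicate() + ", Object: " + s.object());
            print(stmts, s.object(), indent + "--"); // recurse into blank nodes
        }
    }

    public static void main(String[] args) {
        List<Statement> stmts = List.of(
                new Statement("http://somewhere/JohnSmith/", "vCard:FN", "John Smith"),
                new Statement("http://somewhere/JohnSmith/", "vCard:N", "_:b0"),
                new Statement("_:b0", "vCard:Given", "John"),
                new Statement("_:b0", "vCard:Family", "Smith"));
        for (String root : findRoots(stmts)) {
            System.out.println(root);
            print(stmts, root, "--");
        }
    }
}
```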

Phil 1.27.16

7:00 – 4:00 VTX

Phil 1.26.16

7:00 – 3:00 VTX

  • Finished the Crowdseeding paper. I was checking out the authors, and went to Macartan Humphreys’ website. He’s been doing interesting work, and he’s up in NYC at Columbia, so it would be possible to visit. One paper looks very interesting: Mixing Methods: A Bayesian Approach. It’s about inferring information from quantitative and qualitative sources. It sounds related, both to how I’m putting together my proposal and to how the overall system should(?) work.
  • Reviewing a paper. Don’t forget to mention other analytic systems like Palantir Gotham
  • On to Theme-based Retrieval of Web News. And in looking at papers that cite this, found The Hybrid Representation Model for Web Document Classification. Not too impressed with the former. The latter looks like it contains some good overview in the previous works section. One of the authors: Mark Last (lots of data discovery in large data sets)
  • Downloading new IntelliJ. Ok, back to normal and the tutorial.
    • Huh. Tried loading the (compact) “N-TRIPLES” format, which barfed, even though Jena wrote out the file. The (pretty) “RDF/XML-ABBREV” works for read and write, though. Maybe I’m using the wrong read() method? Pretty is good for now; the goal is a human-readable RDF format anyway.
    • Can do some primitive search and navigation-like behavior, but not getting where I want to go. For example, it’s possible to list all the resources:
      // List each resource with the property, then print that resource's values
      ResIterator resIter = model.listResourcesWithProperty(prop);
      while(resIter.hasNext()){
          Resource r = resIter.nextResource();
          StmtIterator stmtIter = r.listProperties(prop);
          while(stmtIter.hasNext()){
              System.out.println("\t"+stmtIter.nextStatement().getObject().toString());
          }
      }
    • But getting the parent of any of those resources is not supported. It looks like this requires using the Jena Ontology API, so on to the next tutorial…
    • Got Gregg’s simpleCredentials.owl file and was able to parse. Now I need to unpack it and create a dictionary.
    • Finished with the Jena Ontology API. No useful navigation, so very disappointing. Going to take model.listStatements and see if I can assemble a tree (with relationships?) for the dictionary taxonomy conversion tomorrow.

Phil 1.25.16

8:00 – 4:00 VTX

  • Working from home today
  • I think a good goal is to put together a human-readable dictionary input-output file. Need to ask Gregg about file formats he uses.
  • Downloaded the sandbox files for JPA and SNLP projects
  • Updating my Intellij
    • Indexing…
    • Installing plugin updates
    • Still indexing…
    • Testing.
      • Stanford NLP: Missing the ‘models’ jar file – fixed
      • JavaJPA: Worked first time
  • Updating Java to 8u72
  • Pinged Gregg about what file format he uses. It’s RDF. He’s sending an example that I’m going to try to import with Apache Jena.
  • Created Jena project.
  • After a frustrating detour into Maven with Intellij, imported the Jena libraries directly.
  • Whoops, forgot to set log4j.
  • Starting the tutorial.
  • Ok, good progress. I can create a model, add resources, and print out the XML representation. I think a variation of this should be fine to describe the dictionary:
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#">
      <rdf:Description rdf:about="http://somewhere/JohnSmith">
        <vcard:N rdf:parseType="Resource">
          <vcard:Family>Smith</vcard:Family>
          <vcard:Given>John</vcard:Given>
        </vcard:N>
        <vcard:FN>John Smith</vcard:FN>
      </rdf:Description>
    </rdf:RDF>
    
  • But now, I’m going to try my snowshoes out for lunch…
  • Ok, back from the adventure.
  • Writing out files – done
  • Reading in files – done
  • Checking project into Subversion – done

Phil 1.24.16

7:00 – 9:00(am)

  • Boy, that was a lot of snow…
  • Finished Security-Control Methods for Statistical Databases. Lots of good stuff, but the main takeaway is that data from each user could be adjusted by a fixed value so that its means and variances would be indistinguishable from some other user’s. We’d have to save those offsets for differentiation, but those are small values that can be encrypted and even stored offline.
  • Starting Crowdseeding Conflict Data.
    • Just found out about FrontlineSMS and SimLab
    • ACLED (Armed Conflict Location & Event Data Project)
    • We close with reflections on the ethical implications of taking a project like this to scale. During the pilot project we faced no incidents that threatened the safety of the phone holders. However, this might be different when the project is scaled up and the attention of armed groups is drawn to it. For both humanitarian and research purposes a project such as Voix des Kivus becomes truly useful only when it is taken to scale; but those are precisely the conditions which might create the greatest risks. We did not assess these risks because we could not bear them ourselves. But given the importance and utility of the data these are risks that others might be better placed to bear.
    • Internal validation seems to help a lot. This raises the question of what the interface should look like to enforce conformity without causing information overload.
    • So restrict the user choice (like the codes used here), or have the system infer categories? A mix? Maybe like the search autocomplete?
    • Remember, this needs to work for mobile, even SMS. I’m thinking that maybe a system that has a simple question/answer interaction that leads down a tree might be general enough. As the system gets more sophisticated, the text could get more conversational.
    • This could be tested on Twitter as a bot. It would need to keep track of the source’s id to maintain the conversation, and could ask for posts of images, videos, etc.
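The fixed-offset idea from the Security-Control Methods notes above could be sketched as an affine adjustment (Indistinguish and adjust are my names, not anything from the paper): shift and scale each user's samples so their mean and variance match a shared target, and save the small per-user parameters so the mapping can be inverted later for differentiation.

```java
import java.util.Arrays;

public class Indistinguish {
    // Per-user adjustment x' = scale * x + shift so the user's samples take on a
    // shared target mean and standard deviation. The (scale, shift) pair is the
    // small per-user secret that would be stored (encrypted, even offline).
    static double[] adjust(double[] data, double targetMean, double targetSd,
                           double[] savedParams) {
        double mean = Arrays.stream(data).average().orElse(0);
        double var = Arrays.stream(data).map(x -> (x - mean) * (x - mean)).sum() / data.length;
        double scale = targetSd / Math.sqrt(var);
        double shift = targetMean - mean * scale;
        savedParams[0] = scale; // needed later to undo the mapping
        savedParams[1] = shift;
        return Arrays.stream(data).map(x -> x * scale + shift).toArray();
    }

    public static void main(String[] args) {
        double[] params = new double[2];
        double[] adjusted = adjust(new double[]{3, 7, 11, 19}, 0.0, 1.0, params);
        System.out.println(Arrays.toString(adjusted));
    }
}
```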

Phil 1.22.16

6:45 – 2:15 VTX

  • Timesheet day? Nope. Next week.
  • Ok, now that I think I understand Laplace Transforms and why they matter, I think I can get back to Calibrating Noise to Sensitivity in Private Data Analysis. Ok, kinda hit the wall on the math on this one. These aren’t formulas that I would be using at this point in the research. It’s nice to know that they’re here, and can probably help me determine the amount of noise that would be needed in calculating the biometric projection (which inherently removes information/adds noise).
  • Starting on Security-Control Methods for Statistical Databases: A Comparative Study
  • Article on useful AI chatbots. Sent SemanticMachines an email asking about their chatbot technology.
  • Got the name disambiguation working pretty well. Here’s the text:
    • – RateMDs Name Signup | Login Claim Doctor Profile | Claim Doctor Profile See what’s new! Account User Dashboard [[ doctor.name ]] Claim Doctor Profile Reports Admin Sales Admin: Doctor Logout Toggle navigation Menu Find A Doctor Find A Facility Health Library Health Blog Health Forum Doctors › Columbia › Family Doctor / G.P. › Unfollow Follow Share this Doctor: twitter facebook Dr. Robert S. Goodwin Family Doctor / G.P. 29 reviews #9 of 70 Family Doctors / G.P.s in Columbia, Maryland Male Dr Goodwin & Associates Unavailable View Map & ……………plus a lot more ………………..Hospitalizes Infant In Spain Wellness How Did Google Cardboard Save This baby’s life? Health 7 Amazing Stretches To Do On a Plane Follow Us You may also like Dr. Charles L. Crist Family Doctor / G.P. 24 reviews Top Family Doctors / G.P.s in Columbia, MD Dr. Mark V. Sivieri 21 reviews #1 of 70 Dr. Susan B. Brown Schoenfeld 8 reviews #2 of 70 Dr. Nj Udochi 4 reviews #3 of 70 Dr. Sarah L. Connor 4 reviews #4 of 70 Dr. Kisa S. Crosse 7 reviews #5 of 70 Sign up for our newsletter and get the latest health news and tips. Name Email Address Subscribe About RateMDs About Press Contact FAQ Advertise Privacy & Terms Claim Doctor Profile Top Specialties Family G.P. Gynecologist/OBGYN Dentist Orthopedics/Sports Cosmetic Surgeon Dermatologist View all specialties > Top Local Doctors New York Chicago Houston Los Angeles Boston Toronto Philadelphia Follow Us Facebook Twitter Google+ ©2004-2016 RateMDs Inc. – The original and largest doctor rating site.
    • Here’s the list of extracted people:
      PERSON: Robert S. Goodwin
      PERSON: Robert S. Goodwin
      PERSON: L. Crist
      PERSON: Goodwin
      PERSON: Goodwin
      PERSON: Goodwin
      PERSON: Goodwin
      PERSON: Goodwin
      PERSON: G
      PERSON: Robert S. Goodwin
      PERSON: Goodwin
      PERSON: Goodwin
      PERSON: Goodwin
      PERSON: Ajay Kumar
      PERSON: Charles L. Crist
      PERSON: Mark V. Sivieri
      PERSON: B. Brown Schoenfeld
      PERSON: L. Connor
      PERSON: S. Crosse
    • And here are some tests against that set (lower scores are better; Information Distance):
      Best match for Robert S. Goodwin is PERSON: Robert S. Goodwin (score = 0.0)
      Best match for Goodwin Robert S. is PERSON: Robert S. Goodwin (score = 0.0)
      Best match for Dr. Goodwin is PERSON: Robert S. Goodwin (score = 1.8)
      Best match for Bob Goodwin is PERSON: Robert S. Goodwin (score = 2.0)
      Best match for Rob Goodman is PERSON: Robert S. Goodwin (score = 2.6)
  • So I can cluster together similar (and misspelled) words, and SNLP hands me information about DATE, DURATION, PERSON, ORGANIZATION, LOCATION
  • Don’t know why I didn’t see this before – this is the page for the NER with associated papers. That’s about as close to a guide as I think you’ll find in this system.

Phil 1.21.16

7:00 – 4:00 VTX

  • Inverse Laplace examples
  • Dirac delta function
  • Useful link of the day: Firefox user agent strings
  • Design Overview presentation.
  • Working on (simple!) name disambiguation
    • Building word chains of sequential tokens that are entities (PERSON and ORGANIZATION) – done
    • Given a name, split by spaces and get the best match on the last name, then look ahead one or two words for the best match on the first name. If both sets are triples, then check the middle. Wound up iterating over all the elements looking for the best match, which does let things like reversed order work. Not sure if it’s the best approach.
    • Checks need to look for initials for first and middle names in source and target. Still working on this one.
    • Results (lower is better):
      ------------------------------
      Robert S. Goodwin
      PERSON: Robert S. Goodwin score = 0.0
      PERSON: Robert S. Goodwin score = 0.0
      PERSON: L. Crist score = 6.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: G score = 2.0
      PERSON: Robert S. Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Ajay Kumar score = 9.0
      PERSON: Charles L. Crist score = 13.0
      PERSON: Mark V. Sivieri score = 10.0
      PERSON: B. Brown Schoenfeld score = 13.0
      PERSON: L. Connor score = 6.0
      PERSON: S. Crosse score = 6.0
      
      ------------------------------
      Goodwin Robert S.
      PERSON: Robert S. Goodwin score = 0.0
      PERSON: Robert S. Goodwin score = 0.0
      PERSON: L. Crist score = 6.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: G score = 2.0
      PERSON: Robert S. Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Goodwin score = 0.0
      PERSON: Ajay Kumar score = 9.0
      PERSON: Charles L. Crist score = 13.0
      PERSON: Mark V. Sivieri score = 10.0
      PERSON: B. Brown Schoenfeld score = 13.0
      PERSON: L. Connor score = 6.0
      PERSON: S. Crosse score = 6.0
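The matching heuristic described above could look roughly like this (NameMatcher, editDistance, and score are hypothetical stand-ins; the real code uses edu.stanford.nlp.util.EditDistance and the last-name/look-ahead pass): each query token is matched against its closest candidate token, so reversed orderings like “Goodwin Robert S.” still score zero.

```java
public class NameMatcher {
    // Plain Levenshtein edit distance (a stand-in for edu.stanford.nlp.util.EditDistance)
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Order-insensitive token score: sum, over query tokens, of the distance to
    // the closest candidate token. Lower is better; 0.0 is an exact token match.
    static double score(String query, String candidate) {
        double total = 0;
        for (String qt : query.split("\\s+")) {
            int best = Integer.MAX_VALUE;
            for (String ct : candidate.split("\\s+"))
                best = Math.min(best, editDistance(qt, ct));
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] people = {"Robert S. Goodwin", "Charles L. Crist", "Mark V. Sivieri"};
        String query = "Goodwin Robert S.";
        String bestName = null;
        double bestScore = Double.MAX_VALUE;
        for (String p : people) {
            double s = score(query, p);
            if (s < bestScore) { bestScore = s; bestName = p; }
        }
        // prints: Best match for Goodwin Robert S. is Robert S. Goodwin (score = 0.0)
        System.out.println("Best match for " + query + " is " + bestName
                + " (score = " + bestScore + ")");
    }
}
```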

Phil 1.20.16

7:00 – 5:30 VTX


Phil 1.19.16

7:00 – 4:00 VTX

  • Laplace Transforms 2 – Laplace Transforms 6
  • While cleaning up my post from yesterday, I discovered GloVe, another item from the stanfordnlp group. “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.” Could be good, but it’s written in C (and I mean straight struct-and-function C), so it would have to be translated to be used. Still, it could be useful for a more sophisticated dictionary. Each entry would simply have to store its coordinates, or a pointer to the trained data.
  • The Stanford NLP JavaDoc index page
  • Ok! Parsing is working (using Moby Dick again). Lemma works and so does edit distance. Now I need to think about building the entries, dictionaries, and using them to parse text.
  • Wondering about using lemmas to build hierarchies in the dictionary. It could be redundant (it’s already in the NLP data). But if we want to make specialty dictionaries (Java vs. Java vs. Java), it might be needed.
  • First, I really need to get familiar with the POS annotations. Then I can start to see what the putative candidates are for creating a dictionary from scratch. That essentially creates the annotated (overloaded term!) bag-of-words that is the dictionary. The dictionary will need to be edited, so it might as well be able to be read in and written out as a JSON or XML file. Then something about synonyms leading to concepts, maybe?
  • Results for today:
    Sentence [7] is:
    If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.
    
    Sentence [7] tokens are:
    	almost	(POS:RB, Lemma:almost)
    	men	(POS:NNS, Lemma:man)
    	degree	(POS:NN, Lemma:degree)
    	time	(POS:NN, Lemma:time)
    	other	(POS:JJ, Lemma:other)
    	cherish	(POS:JJ, Lemma:cherish)
    	very	(POS:RB, Lemma:very)
    	nearly	(POS:RB, Lemma:nearly)
    	same	(POS:JJ, Lemma:same)
    	feelings	(POS:NNS, Lemma:feeling)
    	ocean	(POS:NN, Lemma:ocean)
    		close match between 'osean' and 'ocean'

Phil 1.18.16

7:00 – 4:00 VTX

  • Started Calibrating Noise to Sensitivity in Private Data Analysis.
    • In TAJ, I think the data source (what’s been typed into the browser) may need to be perturbed before it gets to the server in a way that someone looking at the text can’t figure out who wrote it. The trick here is to create a mapping function that can recognize but not reconstruct. My intuition is that this would resemble a noisy mapping function (Which is why this paper is in the list). Think of a 3D shape. It can cast a shadow that can be recognizable, and with no other information, could not be used to reconstruct the 3D shape. However, multiple samples over time as the shape rotates could be used to reconstruct the shape. To get around that, either the original 3D or the derived 2D shape might have to have noise introduced in some way.
    • And reading the paper means that I have to brush up on Laplace Transforms. Hello, Khan Academy….
  • Next step is getting the dictionary to produce networks. Time to drill down more into Stanford NLP. Looking at the paper and the book to begin with. Chapter 18 looks to be particularly useful. Also downloaded all of 3.6 for reference. It contains the Stanford typed dependencies manual, which is also looking useful (but impossible to use without this guide to the Penn Treebank tags). There don’t seem to be any tutorials to speak of. Interestingly, the Cognitive Computation Group at Urbana has similar research and better documentation (example), including Medical NLP Packages. Fallback?
  • Checking through the documentation, and both lemmas (edu.stanford.nlp.process.Morphology) and edit distance (edu.stanford.nlp.util.EditDistance) appear to be supported in a straightforward way.
  • Getting an Exception in thread “main” java.lang.RuntimeException: edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model.
  • Which seems to be caused by: Unable to resolve “edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger” as either class path, filename or URL
  • Which is not in the code that I downloaded. Did a full download from GitHub. Huh. Not there either.
  • Ah! It’s in the stanford-corenlp-xxx-models.jar.
  • Ok, everything works. It’s installed from the Maven Repo, so it’s version 3.5.2, except for the models, which are 3.6 and contained in the download mentioned above. I also pulled out the models directory, since some of the examples want to use some files explicitly. Anyway, I’m not sure what all the pieces do, but I can start playing with parts.

Phil 1.15.16

7:00 – 4:00 VTX

  • Finished Communication Power and Counter-power in the Network Society
  • Started The Future of Journalism: Networked Journalism
  • Here’s a good example of a page with a lot of outbound links, videos and linked images. It’s about the Tunisia uprising before it got real traction. So can we now vet it as a trustworthy source? Is this a good pattern? The post is by Ethan Zuckerman. He directs the Center for Civic Media at MIT, among other things.
  • Public Insight Network: “Every day, sources in the Public Insight Network add context, depth, humanity and relevance to news stories at trusted newsrooms around the country.”
  • Hey, my computer wasn’t restarted last night. Picking up JPA at Queries and Uncommitted Changes.
  • Updating all the nodes as objects:
    //@NamedQuery(name = "BaseNode.getAll", query = "SELECT bn FROM base_nodes bn")
    TypedQuery<BaseNode> getNodes = em.createNamedQuery("BaseNode.getAll", BaseNode.class);
    List<BaseNode> nodeList = getNodes.getResultList();
    Date date = new Date();
    em.getTransaction().begin();
    for(BaseNode bn : nodeList){
        bn.setLastAccessedOn(date);
        bn.setAccessCount(bn.getAccessCount()+1);
        em.persist(bn);
    }
    em.getTransaction().commit();
  • Updating all nodes with a JPQL call:
    //@NamedQuery(name = "BaseNode.touchAll", query = "UPDATE base_nodes bn set bn.accessCount = (bn.accessCount+1), bn.lastAccessedOn = :lastAccessed")
    em.getTransaction().begin();
    Query touchAllQuery = em.createNamedQuery("BaseNode.touchAll"); // bulk updates use an untyped Query
    touchAllQuery.setParameter("lastAccessed", new Date());
    touchAllQuery.executeUpdate();
    em.getTransaction().commit();
  • And we can even add in query logic. This updates the accessed date and increments the accessed count if it’s not null:
    @NamedQuery(name = "BaseNode.touchAll", query = "UPDATE base_nodes bn " +
            "set bn.accessCount = (bn.accessCount+1), " +
            "bn.lastAccessedOn = :lastAccessed " +
            "where NOT (bn.accessCount IS NULL )")