Monthly Archives: January 2016

Phil 1.29.16

7:00 – 3:30 VTX

Continuing The Hybrid Representation Model for Web Document Classification.
- Finished. That one’s a keeper.
Based on a discussion with a retired cop (Deputy Sheriff of Worcester County, Criminal, Narcotic, etc) who mentioned the Reid technique for evaluating truth telling, I thought it might be a good idea to look for an overview of the field (snowball methods!). So I’m starting Eliciting Information and Detecting Lies in Intelligence Interviewing: An Overview Of Recent Research. Both authors, Pär Anders Granhag and Aldert Vrij, have additional publications on credibility. There are quite a few papers on cognitive load, so that would be an interesting piece to incorporate into the interface…
Mothballing the Ontology to Dictionary work for a bit
Stanford Entity Resolution Framework (SERF)
Learning-based Entity Resolution with MapReduce
Palantir Gotham
IBM Infosphere
A taxonomy of tools that support the fluent and flexible use of visualizations
Modern Information Retrieval: A Brief Overview (By Google in 2001. Describes how all the pieces work)
Starting on White Paper
- Definitions
  - Precision – the fraction of retrieved instances that are relevant
    - We can measure this. In the top N results from our test query, how many were useful?
  - Recall – the fraction of relevant instances that are retrieved. We can’t measure this from Google, but we could with a static repository like CommonCrawl.
  - Rank – the ordering of the returned result, determined by some algorithm (i.e. The Eigenvector from PageRank)
  - Entity Resolution
- Previous Work
  - Research
  - Other Systems
- The problems as I see them
  - Finding the Corpus to search for entities (best signal-to-noise)
    - Finding reputable documents also needs human-evaluated documents
      - Guidelines for raters. An interview with a rater describing the work
    - Look for words or the words in back links pointing to the document
  - Finding correct entities within the corpus
  - Finding information that correspond to to Flags
  - Associating Flags with Entities
  - Ordering
- The current model
  - No baseline data currently exists
  - Building ‘Gold Standard’ data to aid in productionAlso, here’s a Google video showing how Google uses human raters to build ‘gold standard’ data to evaluate information retrieval quality: https://www.youtube.com/watch?v=nmo3z8pHX1E
- Improving the current model
  - Mechanical Turk
- Alternate models
  - Finding the Corpus to search for entities (best signal-to-noise)
  - Finding correct entities within the corpus
  - Finding information that correspond to to Flags
  - Associating Flags with Entities
- Conclusions and Recommendations

Phil 1.28.16

5:30 – 3:30 VTX

Continuing The Hybrid Representation Model for Web Document Classification. Good stuff, well written. This paper (An Efficient Algorithm for Discovering Frequent Subgraphs) may be good for recognizing patterns between stories. Possibly also images.
Useful page for set symbols that I can never remember: http://www.rapidtables.com/math/symbols/Set_Symbols.htm

Finally discovered why the RdfStatementNodes aren’t assembling properly. There is no root statement… Fixed! We can now go from:

<rdf:RDF
  xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  xmlns:vCard='http://www.w3.org/2001/vcard-rdf/3.0#'
   >

  <rdf:Description rdf:about="http://somewhere/JohnSmith/">
    <vCard:FN>John Smith</vCard:FN>
    <vCard:N rdf:parseType="Resource">
   <vCard:Family>Smith</vCard:Family>
   <vCard:Given>John</vCard:Given>
    </vCard:N>
  </rdf:Description>

  <rdf:Description rdf:about="http://somewhere/RebeccaSmith/">
    <vCard:FN>Becky Smith</vCard:FN>
    <vCard:N rdf:parseType="Resource">
   <vCard:Family>Smith</vCard:Family>
   <vCard:Given>Rebecca</vCard:Given>
    </vCard:N>
  </rdf:Description>

  <rdf:Description rdf:about="http://somewhere/SarahJones/">
    <vCard:FN>Sarah Jones</vCard:FN>
    <vCard:N rdf:parseType="Resource">
   <vCard:Family>Jones</vCard:Family>
   <vCard:Given>Sarah</vCard:Given>
    </vCard:N>
  </rdf:Description>

  <rdf:Description rdf:about="http://somewhere/MattJones/">
    <vCard:FN>Matt Jones</vCard:FN>
    <vCard:N
   vCard:Family="Jones"
   vCard:Given="Matthew"/>
  </rdf:Description>

</rdf:RDF>

to this:

[1]: http://somewhere/SarahJones/
--[5] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Sarah Jones"
--[4] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffd)
----[6] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Sarah"
----[7] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Jones"
[3]: http://somewhere/MattJones/
--[15] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Matt Jones"
--[14] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffc)
----[11] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Jones"
----[10] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Matthew"
[0]: http://somewhere/RebeccaSmith/
--[3] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Becky Smith"
--[2] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffe)
----[9] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Smith"
----[8] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Rebecca"
[2]: http://somewhere/JohnSmith/
--[12] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7fff)
----[1] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Smith"
----[0] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "John"
--[13] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "John Smith"

Some thoughts about information retrieval using graphs
- Characterise our current data
  - URL
  - Domain
  - HTML text- allows for tagged information <tite>, etc. Outbound links (how many hops?)
  - Cleaned text
  - Word Bag?
  - Date retrieved
  - Query (items returned)
    - Currently 4 queries per entity – few 100k to mbytes.
      - Felony – top 10
      - 3 others
  - Flags produced
- From that, we can analyze
  - Most productive pages for flags (power law?)
  - Temporal patterns
  - Network characteristics for good/bad medical providers, pages producing flags, etc.
    - Diameter
    - Flag page adjacency and connecting paths
  - Statistical differences for above, including the same query over time
- Once we know the size of the network we use, we can look to how we might do our own crawl.
  - Impact of Similarity Measures on Web-page Clustering
    - Look up “Standard Graph Representation”
    - An Efficient Algorithm for Discovering Frequent Subgraphs
    - Efficient Graph-Based Representation of Web Documents
- Patient satisfaction is systemic, criminal is isolated? Implies a signal to noise problem.
- Can we produce a training set of documents?
- Can we test against other search engines? Bing? Mechanical Turk?
- Data Sets
  - CMU World Wide Knowledge Base (Web->KB) project
- Using Google better
  - Only use Google to find new sites and exclude the ones that we know about: https://support.google.com/customsearch/answer/2631038?hl=en
  - Can we infer a bad/good provider? Google Similarity Distance If so, we can only run extensive queries about bad ones. (A quick example)
  - How often do we really need to do a deep crawl on a provider. Are there inferential triggers that we could use?
Sent a note to Theresa asking for people to do manual flag extraction

Phil 1.27.16

7:00 – 4:00VTX

The 401k is still coming out of my paycheck and not going into my account. W. T. F.
Starting The Hybrid Representation Model for Web Document Classification.
- Citation: On a relation between graph edit distance and maximum common subgraph – might be useful for identifying bad behaviors…
- Very nice related work section
Working on creating a tree of RDF statements that I can traverse.
Hmm. It works on my toy examples but blows up on Gregg’s files..
- Actually, it doesn’t. Getting an additional entry.
Google custom search
Web Structure Mining from Advanced Techniques in Web Intelligence – I
Answering Enumeration Queries with the Crowd
IM’d with Bob about CSE sources. Meeting set up for tomorrow at 2:00

Phil 1.26.16

7:00 – 3:00 VTX

Finished the Crowdseeding paper. I was checking out the authors, and went to Macartan Humphreys’ website. He’s been doing interesting work, and he’s up in NYC at Colombia, so it would be possible to visit. Anyway, there is one paper that looks very interesting: Mixing Methods: A Bayesian Approach. It’s about inferring information from quantitative and qualitative sources. Anyway, it sounds related, both to how I’m putting together my proposal and how the overall system should(?) work.
Reviewing a paper. Don’t forget to mention other analytic systems like Palantir Gotham
On to Theme-based Retrieval of Web News. And in looking at papers that cite this, found The Hybrid Representation Model for Web Document Classification. Not too impressed with the former. The latter looks like it contains some good overview in the previous works section. One of the authors: Mark Last (lots of data discovery in large data sets)
Downloading new IntelliJ. Ok, back to normal and the tutorial.
- Huh. Tried loading the (compact) “N-TRIPLES” format, which barfed, even though Jena wrote out the file. The (pretty) “RDF/XML-ABBREV” works for read and write though. Maybe I’m using the wrong read() method? Pretty is good for now anyway. The goal is to have a human-readable / RDF format anyway.
- Can do some primitive search and navigation-like behavior, but not getting where I want to go. For example, it’s possible to list all the resources:
```
ResIterator iter = model.listResourcesWithProperty(prop);
while(iter.hasNext()){
    Resource r = iter.nextResource();
    StmtIterator iter = resource.listProperties(prop);
    while(iter.hasNext()){
        System.out.println("\t"+iter.nextStatement().getObject().toString());
    }
}
```
- But getting the parent of any of those resources is not supported. It looks like this requires using the Jena Ontology API, so on to the next tutorial…
- Got Gregg’s simpleCredentials.owl file and was able to parse. Now I need to unpack it and create a dictionary.
- Finished with the Jena Ontology API . No useful navigation, so very disappointing. Going to take the model.listStatements and see if I can assemble a tree (with relationships?) for the dictionary taxonomy conversion tomorrow.

Phil 1.25.16

8:00 – 4:00 VTX

Working from home today
I think a good goal is to put together a human-readable dictionary input-output file. Need to ask Gregg about file formats he uses.
Downloaded the sandbox files for JPA and SNLP projects
Updating my Intellij
- Indexing…
- Installing plugin updates
- Still indexing…
- Testing.
  - Stanford NLP: Missing the ‘models’ jar file – fixed
  - JavaJPA: Worked first time
Updating Java to 8u72
Pinged Gregg about what file format he uses. It’s RDF. He’s sending an example that I’m going to try to import with Apache Jana.
Created Jena project.
After a frustrating detour into Maven with Intellij, imported the Jena libraries directly.
Whoops, forgot to set log4j.
Starting the tutorial.

Ok, good progress. I can create a model, add resources, and print out the XML representation. I think a variation of this should be fine to describe the dictionary:

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#">
 <rdf:Description rdf:about="http://somewhere/JohnSmith">
 <vcard:N rdf:parseType="Resource">
 <vcard:Family>Smith</vcard:Family>
 <vcard:Given>John</vcard:Given>
 </vcard:N>
 <vcard:FN>John Smith</vcard:FN>
 </rdf:Description>
</rdf:RDF>

But now, I’m going to try my snowshoes out for lunch…
Ok, back from the adventure.
Writing out files – done
Reading in files – done
Checking project into Subversion done

Phil 1.24.16

7:00 – 9:00(am)

Boy, that was a lot of snow…
Finished Security-Controlled Methods for Statistical Databases. Lots of good stuff, but the main takeaway is that data from each user could be adjusted by a fixed value so that its means and variances would be indistinguishable from some other user. We’d have to save those offsets for differentiation, but those are small values that can be encrypted and even stored offline.
Starting Crowdseeding Conflict Data.
- Just found out about FrontlineSMS and SimLab
- ACLED (Armed Conflict Location & Event Data Project)
  - Selected articles and book chapters
  - Guide for Media Users
  - Users Guide (includes instructions about datasets)
- We close with reflections on the ethical implications of taking a project like this to scale. During the pilot project we faced no incidents that threatened the safety of the phone holders. However, this might be dierent when the project is scaled up and the attention of armed groups is drawn to it. For both humanitarian and research purposes a project such as Voix des Kivus becomes truly useful only when it is taken to scale; but those are precisely the conditions which might create the greatest risks. We did not assess these risks because we could not bear them ourselves. But given the importance and utility of the data these are risks that others might be better placed to bear.
- Internal validation seems to help a lot. This really does beg the question as to what the interface should look like to enforce conformity without leading to information overload.
- So restrict the user choice (like the codes used here), or have the system infer categories? A mix? Maybe like the search autocomplete?
- Remember, this needs to work for mobile, even SMS. I’m thinking that maybe a system that has a simple question/answer interaction that leads down a tree might be general enough. As the system gets more sophisticated, the text could get more conversational.
- This could be tested on Twitter as a bot. It would need to keep track of the source’s id to maintain the conversation, and could ask for posts of images, videos, etc.

Phil 1.22.16

6:45 – 2:15 VTX

Timesheet day? Nope. Next week.
Ok, now that I think I understand Laplace Transforms and why they matter, I think I can get back to Calibrating Noise to Sensitivity in Private Data Analysis. Ok, kinda hit the wall on the math on this one. These aren’t formulas that I would be using at this point in the research. It’s nice to know that they’re here, and can probably help me determine the amount of noise that would be needed in calculating the biometric projection (which inherently removes information/adds noise).
Starting on Security-Control Methods for Statistical Databases: A Comparative Study
Article on useful AI chatbots. Sent SemanticMachines an email asking about their chatbot technology.
Got the name disambiguation working pretty well. Here’s the text:
- – RateMDs Name Signup | Login Claim Doctor Profile | Claim Doctor Profile See what’s new! Account User Dashboard [[ doctor.name ]] Claim Doctor Profile Reports Admin Sales Admin: Doctor Logout Toggle navigation Menu Find A Doctor Find A Facility Health Library Health Blog Health Forum Doctors › Columbia › Family Doctor / G.P. › Unfollow Follow Share this Doctor: twitter facebook Dr. Robert S. Goodwin Family Doctor / G.P. 29 reviews #9 of 70 Family Doctors / G.P.s in Columbia, Maryland Male Dr Goodwin & Associates Unavailable View Map & ……………plus a lot more ………………..Hospitalizes Infant In Spain Wellness How Did Google Cardboard Save This baby’s life? Health 7 Amazing Stretches To Do On a Plane Follow Us You may also like Dr. Charles L. Crist Family Doctor / G.P. 24 reviews Top Family Doctors / G.P.s in Columbia, MD Dr. Mark V. Sivieri 21 reviews #1 of 70 Dr. Susan B. Brown Schoenfeld 8 reviews #2 of 70 Dr. Nj Udochi 4 reviews #3 of 70 Dr. Sarah L. Connor 4 reviews #4 of 70 Dr. Kisa S. Crosse 7 reviews #5 of 70 Sign up for our newsletter and get the latest health news and tips. Name Email Address Subscribe About RateMDs About Press Contact FAQ Advertise Privacy & Terms Claim Doctor Profile Top Specialties Family G.P. Gynecologist/OBGYN Dentist Orthopedics/Sports Cosmetic Surgeon Dermatologist View all specialties > Top Local Doctors New York Chicago Houston Los Angeles Boston Toronto Philadelphia Follow Us Facebook Twitter Google+ ©2004-2016 RateMDs Inc. – The original and largest doctor rating site.
- Here’s the list of extracted people:
```
PERSON: Robert S. Goodwin
PERSON: Robert S. Goodwin
PERSON: L. Crist
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: G
PERSON: Robert S. Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Ajay Kumar
PERSON: Charles L. Crist
PERSON: Mark V. Sivieri
PERSON: B. Brown Schoenfeld
PERSON: L. Connor
PERSON: S. Crosse
```
- And here some tests against that set (low scores are better. Information Distance):
```
Best match for Robert S. Goodwin is PERSON: Robert S. Goodwin (score = 0.0)
Best match for Goodwin Robert S. is PERSON: Robert S. Goodwin (score = 0.0)
Best match for Dr. Goodwin is PERSON: Robert S. Goodwin (score = 1.8)
Best match for Bob Goodwin is PERSON: Robert S. Goodwin (score = 2.0)
Best match for Rob Goodman is PERSON: Robert S. Goodwin (score = 2.6)
```
So I can cluster together similar (and misspelled) words, and SNLP hands me information about DATE, DURATION, PERSON, ORGANIZATION, LOCATION
Don’t know why I didn’t see this before – this is the page for the NER with associated papers. That’s kind as close to a guide as I think you’ll find in this system

Phil 1.21.16

7:00 – 4:00 VTX

Inverse Laplace examples
Dirac delta function
Useful link of the day: Firefox user agent strings
Design Overview presentation.

Working on (simple!) name disambiguation

Building word chains of sequential tokens that are entities (PERSON and ORGANIZATION) Done
Given a name, split by spaces and get best match on last name, then look ahead one or two words for best match on first name. If both sets are triples, then check the middle. Wound up iterating over all the elements looking for the best match. This does let things like reverse order work. Not sure if it’s best
Checks need to look for initials for first and middle name in source and target. Still working on this one.

Results (lower is better):

------------------------------
Robert S. Goodwin
PERSON: Robert S. Goodwin score = 0.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: L. Crist score = 6.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: G score = 2.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Ajay Kumar score = 9.0
PERSON: Charles L. Crist score = 13.0
PERSON: Mark V. Sivieri score = 10.0
PERSON: B. Brown Schoenfeld score = 13.0
PERSON: L. Connor score = 6.0
PERSON: S. Crosse score = 6.0

------------------------------
Goodwin Robert S.
PERSON: Robert S. Goodwin score = 0.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: L. Crist score = 6.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: G score = 2.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Ajay Kumar score = 9.0
PERSON: Charles L. Crist score = 13.0
PERSON: Mark V. Sivieri score = 10.0
PERSON: B. Brown Schoenfeld score = 13.0
PERSON: L. Connor score = 6.0
PERSON: S. Crosse score = 6.0

Phil 1.20.16

7:00 – 5:30 VTX

More Laplace Transform Tools
Laplace transform of t:L{t} (revisit og lo res lesson)
Laplace transform of t^n: L{t^n} (revisit og lo res lesson, with some variation)
Laplace transform of the unit step function
And then things became a bit surreal, when my water stopped. Called it in.
Wound up spending some time listening to the Ontology discussion
Grabbed the Kaggle health insurance marketplace datafiles
Pondering how to semiautomate taxonomy creation.
- This helps: Formatting NER output from Stanford Corenlp
- Got a good deal of the NER code working and trying to figure out how to disambiguate names and so forth.
Wrote up a design overview for presentation tomorrow.

Phil 1.19.16

7:00 – 4:00 VTX

Laplace Transforms 2 – Laplace Transforms 6
While cleaning up my post from yesterday, I discovered GloVe, another item from the stanfordnlp group. “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space“. Could be good, but it’s written in C (and I mean straight struct and function C) so it would have to be translated to be used. Still, could be useful for a more sophisticated dictionary. Each entry would simply have to store its coordinates. or a pointer to the trained data.
The Stanford NLP JavaDoc index page
Ok! Parsing is working (using Moby Dick again). Lemma works and so does edit distance. Now I need to think about building the entries, dictionaries, and using them to parse text.
Wondering about using lemmas to build hierarchies in the dictionary. It could be redundant (it’s already in the NLP data). But if we want to make specialty dictionaries (Java vs. Java vs. Java), it might be needed.
First, I really need to get familiar with the POS annotations. Then I can start to see what are the putative candidates for creating a dictionary from scratch. That essentially creates the annotated (overloaded term!) bag-of-words that is the dictionary. The dictinoary will need to be edited, so it might as well be able to be read in and written out as a JSON or XML file. Then something about synonyms leading to concepts maybe?

Results for today:

Sentence [7] is:
If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.

Sentence [7] tokens are:
	almost	(POS:RB, Lemma:almost)
	men	(POS:NNS, Lemma:man)
	degree	(POS:NN, Lemma:degree)
	time	(POS:NN, Lemma:time)
	other	(POS:JJ, Lemma:other)
	cherish	(POS:JJ, Lemma:cherish)
	very	(POS:RB, Lemma:very)
	nearly	(POS:RB, Lemma:nearly)
	same	(POS:JJ, Lemma:same)
	feelings	(POS:NNS, Lemma:feeling)
	ocean	(POS:NN, Lemma:ocean)
		close match between 'osean' and 'ocean'

Phil 1.18.16

7:00 – 4:00 VTX

Started Calibrating Noise to Sensitivity in Private Data Analysis.
- In TAJ, I think the data source (what’s been typed into the browser) may need to be perturbed before it gets to the server in a way that someone looking at the text can’t figure out who wrote it. The trick here is to create a mapping function that can recognize but not reconstruct. My intuition is that this would resemble a noisy mapping function (Which is why this paper is in the list). Think of a 3D shape. It can cast a shadow that can be recognizable, and with no other information, could not be used to reconstruct the 3D shape. However, multiple samples over time as the shape rotates could be used to reconstruct the shape. To get around that, either the original 3D or the derived 2D shape might have to have noise introduced in some way.
- And reading the paper means that I have to brush up on Laplace Transforms. Hello, Khan Academy….
Next step is getting the dictionary to produce networks. Time to drill down more into the Stanford NLP Looking at the paper and the book to begin with. Chapter 18 looks to be particularly useful. Also downloaded all of 3.6 for reference. It contains the Stanford typed dependencies manual, which is also looking useful (But impossible to use without this guide to the Penn Treebank tags). There don’t seem to be any tutorials to speak of. Interestingly, the Cognitive Computation Group at Urbana has similar research and better documentation (example), including Medical NLP Packages. Fallback?
Checking through the documentation, and both lemmas (edu.stanford.nlp.process.Morphology) and edit distance (edu.stanford.nlp.util.EditDistance) appear to be supported in a straightforward way.
Getting a Exception in thread “main” java.lang.RuntimeException: edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model.
Which seems to be caused by: Unable to resolve “edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger” as either class path, filename or URL
Which is not in the code that I downloaded. Making a fill download from Github. Huh. Not there either.
Ah! It’s in the stanford-corenlp-xxx-models.jar.
Ok, everything works. It’s installed from the Maven Repo, so it’s version 3.5.2, except for the models, which are 3.6, which are contained in the download mentioned above. I also pulled out the models directory, since some of the examples want to use some files explicitly. Anyway, I’m not sure what all the pieces do, but I can start playing with parts.

Phil 1.15.16

7:00 – 4:00 VTX

Finished Communication Power and Counter-power in the Network Society
Started The Future of Journalism: Networked Journalism
Here’s a good example of a page with a lot of outbound links, videos and linked images. It’s about the Tunisia uprising before it got real traction. So can we now vet it as a trustworthy source? Is this a good pattern? The post is by Ethan Zuckerman. He directs the Center for Civic Media at MIT, among other things.
Public Insight Network: “Every day, sources in the Public Insight Network add context, depth, humanity and relevance to news stories at trusted newsrooms around the country.”
Hey, my computer wasn’t restarted last night. Picking up JPA at Queries and Uncommitted Changes.

Updating all the nodes as objects:

//@NamedQuery(name = "BaseNode.getAll", query = "SELECT bn FROM base_nodes bn")
TypedQuery<BaseNode> getNodes = em.createNamedQuery("BaseNode.getAll", BaseNode.class);
List<BaseNode> nodeList = getNodes.getResultList();
Date date = new Date();
em.getTransaction().begin();
for(BaseNode bn : nodeList){
    bn.setLastAccessedOn(date);
    bn.setAccessCount(bn.getAccessCount()+1);
    em.persist(bn);
}
em.getTransaction().commit();

Updating all nodes with a JPQL call:

//@NamedQuery(name = "BaseNode.touchAll", query = "UPDATE base_nodes bn set bn.accessCount = (bn.accessCount+1), bn.lastAccessedOn = :lastAccessed")
em.getTransaction().begin();
TypedQuery<BaseNode> touchAllQuery = em.createNamedQuery("BaseNode.touchAll", BaseNode.class);
touchAllQuery.setParameter("lastAccessed", new Date());
touchAllQuery.executeUpdate();
em.getTransaction().commit();

And we can even add in query logic. This updates the accessed date and increments the accessed count if it’s not null:

@NamedQuery(name = "BaseNode.touchAll", query = "UPDATE base_nodes bn " +
        "set bn.accessCount = (bn.accessCount+1), " +
        "bn.lastAccessedOn = :lastAccessed " +
        "where NOT (bn.accessCount IS NULL )")

Phil 1.14.16

7:00 – 4:00 VTX

Good Meeting with Thom Lieb
- Here’s a good checklist for reporting on different types of stories: http://www.sbcc.edu/journalism/manual/checklist/index.php
- Ordered Melvin Mencher’s News Reporting and Writing
- Discussed Chatbots, fashion in technology, and NewsTrust, a fact-checking site that piloted out of Baltimore in 2011. This post explains why it wound up folding. Important note: Tie into social media for inputs and outputs!!!
Added Communication Power and Counter-power in the Network Society to the corpus
Manuel Castells is the author of the above. Really clear thinking. Added another paper, The Future of Journalism: Networked Journalism
Had an interesting chat with an ex-cop about trustworthiness. He’s a fan of the Reid Technique and had a bunch of perspectives that I hadn’t considered. Looking for applications to text, I came across this, which looks potentially relevant: Eliciting Information and Detecting Lies in Intelligence Interviewing: An Overview Of Recent Research
Todd Schneider analyzes big data in interesting posts on his blog.

Chapter 7 Using Queries

JPQL
Totally digging the @NamedQuery annotation.

How to paginate a result:

int pageSize = 15;
int maxPages = 10;
for(int curPage = 0; curPage < maxPages; ++curPage){
    List l = nt.runRawPagedQuery(GuidBase.class, curPage, pageSize, "SELECT gb.id, gb.name, gb.guid FROM guid_base gb");
    if(l == null || l.size() == 0){
        break;
    }else{
        System.out.println("Batch ["+curPage+"]");
        nt.printListContents(l);
    }
    System.out.println();
}

Stopping at Queries and Uncommitted Changes, in case my computer is rebooted under me tonight.

Phil 1.13.16

7:00 – 3:00 VTX

More document coding
Review today?
On to Chapter 6.
Thinking about next steps.
- Server
  - Produce a dictionary from a combination of manual entry and corpus extraction
  - Add word-specific code like stemming, edit distance
  - Look into synonyms. They are dictionary specific (Java as in drink, Java as in Language, Java as in island)
  - Analyze documents using the dictionary to produce the master network of items and associations. This resides on the server. I think this negates the need for flags, since the Eigenrank of the doctor will be explained by the associations, and the network can be interrogated by looking for explanatory items within some number of hops. The dictionary entry that was used to extract that item is also added to the network as an item
    - PractitionerDictionary finds medical practitioners <membership roles?>. Providers are added to the item table and to the master network
      - Each practitioner is checked for associations like office, hospital, specialty. New items are created as needed and associations are created
    - LegalDictionary finds (disputes and findings?) in legal proceedings, and adds legal items that are associated with items currently in the network. Items that are associated with GUILTY get low (negative?) weight. A directly attributable malpractice conviction should be a marker that is always relevant. Maybe a reference to it is part of the practitioner record directly?
    - SocialDictionary finds rating items from places like Yelp. High ratings provide higher weight, low ratings produce lower weight. The weight of a rating shouldn’t be more important than a conviction, but a lot of ratings should have a cumulative effect.
    - Other dictionaries? Healthcare providers? Diseases? Medical Schools?
    - Link age. Should link weight move back to the default state as a function of time?
    - Matrix calculation. I think we calculate the rank of all items and their adjacency once per day. Queries are run against the matrix
  - Client
    - Corporate
      - The user is presented with an dashboard ordered by pre-specified criteria (“show new bad practitioners?”). This is calculated by the server looking through the eigenrank starting at the top looking for N items that contain text/tags that match the query (high Jacquard index?). It returns the set to eliminate duplication. The dictionary entries that were associated with the creation of the item are also returned.
    - Consumer
      - The user types in a search: “cancer specialist maryland carefirst”
      - The search looks through the eigenrank starting at the top looking for N items that contain text/tags that match the query (high Jacquard index?). It returns the set to eliminate duplication. The dictionary entries that were associated with the creation of the item are also returned.
    - Common
      - In the browser, the section(s) of the network are reproduced, and the words associated with the items are displayed beside search results, along with sliders that adjust their weights on the local browser network. If the user increases the slider items associated with that entry rise (as does the entry in the list?). This allows the user to reorder their results based on interactive refinement of their preferences.
      - When the user clicks on a result, the position of the clicked item, the positions of the other items, and the settings of the entry sliders is recorded on the server (with the user info?). These weights can be fed back into the master network so that the generalized user preferences are reflected. If we just want to adjust things to the particular user, the Eigenrank will have to be recalculated on a per user basis. I think this does not have to include a full network recalculation.

Phil 1.12.16

7:00 – 4:00 VTX

So I ask myself, is there some kind of public repository of crawled data? Why, of course there is! Common Crawl. So there is a way of getting the deep link structure for a given site without crawling it. That could give me the ability to determine how ‘bubbly’ a site is. I’m thinking there may be a ratio of bidirectional to unidirectional links (per site?) that could help here.
More lit review and integration.
Making diagrams for the Sprint review today
- Overview
  - The purpose of this effort is to provide a capability for the system to do more sophisticated queries that do several things
    - Allow the user to emphasize/de-emphasize words or phrases that relate to the particular search and to do this interactively based on linguistic analysis of the returned text.
    - Get user value judgments on the information provided based on the link results reordering
    - Use this to feed back to the selection criteria for provider Flags.
  - This work leans on the paper PageRank without Hyperlinks if you want more background/depth.
- Eiphcone 129 – Design database table schema.
  - Took my existing MySql db schema and migrated it to Java Persistent Entities. Basically this meant taking a db that was designed for precompiled query access and retrieval (direct data access for adding data, views for retrieval) and restructuring it. So we go from:
  - to
  - The classes are annotated POJOs in a simple hierarchy. The classes that have ‘Base’ in their names I expect to be extended, though there may be enough capability here. GuidBase has some additional capability to make adding data to one class that has a data relation to another class gets filled out properly in both: Since multiple dictionary entries can be present in multiple corpora BaseDictionaryEntry and Corpus both have a <Set> of BaseEntryContext that connects the corpora and entries with additional information that might be useful, such as counts.
  - This manifests itself in the database as the following: It’s not the prettiest drawing, but I can’t get IntelliJ to draw any better. You can see that the tables match directly to the classes. I used the InheritanceType.JOINED strategy since Jeremy was concerned about wasted space in the tables.
  - The next steps will be to start to create test cases that allow for tuning and testing of this setup at different data scales.
- Eiphcone 132 – Document current progress on relationship/taxonomy design & existing threat model
  - Currently, a threat is extracted by comparing a set of known entities to surrounding text for keywords. In the model shown above, practitioners would exist in a network that includes items like the practice, attending hospitals, legal representation, etc. Because of this relationship, flags could be extended to the other members of the network. If a near neighbor in this network has a Flag attached, it will weight the surrounding edges and influence the practitioner. So if one doctor in a practice is convicted of malpractice, then other doctors in the practice will get lower scores.
  - The dictionary and corpus can interact as their own network to determine the amount of wight that is given to a particular score. For example, words in a dictionary that are used to extract data from a legal corpus may have more weight than a social media corpus.
- Eiphcone 134 – Design/document NER processing in relation to future taxonomy
  - I compiled and ran the NER codebase and also walked though the Stanford NLP documentation. The current NER system looks to be somewhat basic, but solid and usable. Using it to populate the dictionaries and annotating the corpus appears to be straightforward addition of the capabilities already present in the Stanford API.
- Demo – I don’t really have a demo, unless people want to see some tests compile and run. To save the time, I have this exiting printout that shows the return of dynamically created data:

[EL Info]: 2016-01-12 14:09:40.481--ServerSession(1842102517)--EclipseLink, version: Eclipse Persistence Services - 2.6.1.v20150916-55dc7c3
[EL Info]: connection: 2016-01-12 14:09:40.825--ServerSession(1842102517)--/file:/C:/Development/Sandboxes/JPA_2_1/out/production/JPA_2_1/_NetworkService login successful

Users
firstName(firstname_0), lastName(lastname_0), login(login_0), networks( network_0)
firstName(firstname_1), lastName(lastname_1), login(login_1), networks( network_4)
firstName(firstname_2), lastName(lastname_2), login(login_2), networks( network_3)
firstName(firstname_3), lastName(lastname_3), login(login_3), networks( network_1 network_2)
firstName(firstname_4), lastName(lastname_4), login(login_4), networks()

Networks
name(network_0), owner(login_0), type(WAMPETER), archived(false), public(false), editable(true)
	[92]: name(DataNode_6_to_BaseNode_8), guid(network_0_DataNode_6_to_BaseNode_8), weight(0.5708945393562317), type(IDENTITY), network(network_0)
		Source: [86]: name('DataNode_6'), type(ENTITIES), annotation('annotation_6'), guid('50836752-221a-4095-b059-2055230d59db'), double(18.84955592153876), int(6), text('text_6')
		Target: [88]: name('BaseNode_8'), type(COMPUTED), annotation('annotation_8'), guid('77250282-3b5e-416e-a469-bbade10c5e88')
	[91]: name(BaseNode_5_to_UrlNode_4), guid(network_0_BaseNode_5_to_UrlNode_4), weight(0.3703539967536926), type(COMPUTED), network(network_0)
		Source: [85]: name('BaseNode_5'), type(RATING), annotation('annotation_5'), guid('bf28f478-626d-4e8f-9809-b4a37f2ad504')
		Target: [84]: name('UrlNode_4'), type(IDENTITY), annotation('annotation_4'), guid('bffe13ae-bb70-46a6-b1b4-9f58cadad04e'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')
	[98]: name(BaseNode_5_to_UrlNode_1), guid(network_0_BaseNode_5_to_UrlNode_1), weight(0.4556456208229065), type(ENTITIES), network(network_0)
		Source: [85]: name('BaseNode_5'), type(RATING), annotation('annotation_5'), guid('bf28f478-626d-4e8f-9809-b4a37f2ad504')
		Target: [81]: name('UrlNode_1'), type(UNKNOWN), annotation('annotation_1'), guid('f9693110-6b5b-4888-9585-99b97062a4e4'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')

name(network_1), owner(login_3), type(WAMPETER), archived(false), public(false), editable(true)
	[96]: name(BaseNode_2_to_UrlNode_1), guid(network_1_BaseNode_2_to_UrlNode_1), weight(0.5733484625816345), type(URL), network(network_1)
		Source: [82]: name('BaseNode_2'), type(ITEM), annotation('annotation_2'), guid('c5867557-2ac3-4337-be34-da9da0c7e25d')
		Target: [81]: name('UrlNode_1'), type(UNKNOWN), annotation('annotation_1'), guid('f9693110-6b5b-4888-9585-99b97062a4e4'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')
	[95]: name(DataNode_0_to_UrlNode_7), guid(network_1_DataNode_0_to_UrlNode_7), weight(0.85154128074646), type(MERGE), network(network_1)
		Source: [80]: name('DataNode_0'), type(USER), annotation('annotation_0'), guid('e9b7fa0a-37f1-41bd-a2c1-599841d1507a'), double(0.0), int(0), text('text_0')
		Target: [87]: name('UrlNode_7'), type(QUERY), annotation('annotation_7'), guid('b9351194-d10e-4f6a-b997-b84c61344fcf'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')
	[94]: name(DataNode_9_to_BaseNode_5), guid(network_1_DataNode_9_to_BaseNode_5), weight(0.72845458984375), type(KEYWORDS), network(network_1)
		Source: [89]: name('DataNode_9'), type(USER), annotation('annotation_9'), guid('5bdb67de-5319-42db-916e-c4050dc682dd'), double(28.274333882308138), int(9), text('text_9')
		Target: [85]: name('BaseNode_5'), type(RATING), annotation('annotation_5'), guid('bf28f478-626d-4e8f-9809-b4a37f2ad504')

name(network_2), owner(login_3), type(EXPLICIT), archived(false), public(false), editable(true)
	[90]: name(BaseNode_8_to_UrlNode_7), guid(network_2_BaseNode_8_to_UrlNode_7), weight(0.2619180679321289), type(WAMPETER), network(network_2)
		Source: [88]: name('BaseNode_8'), type(COMPUTED), annotation('annotation_8'), guid('77250282-3b5e-416e-a469-bbade10c5e88')
		Target: [87]: name('UrlNode_7'), type(QUERY), annotation('annotation_7'), guid('b9351194-d10e-4f6a-b997-b84c61344fcf'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')

name(network_3), owner(login_2), type(EXPLICIT), archived(false), public(false), editable(true)
	[93]: name(UrlNode_4_to_DataNode_3), guid(network_3_UrlNode_4_to_DataNode_3), weight(0.7689594030380249), type(ITEM), network(network_3)
		Source: [84]: name('UrlNode_4'), type(IDENTITY), annotation('annotation_4'), guid('bffe13ae-bb70-46a6-b1b4-9f58cadad04e'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')
		Target: [83]: name('DataNode_3'), type(UNKNOWN), annotation('annotation_3'), guid('e7565935-6429-451f-b7f4-cc2d612ca3fd'), double(9.42477796076938), int(3), text('text_3')
	[97]: name(DataNode_3_to_DataNode_0), guid(network_3_DataNode_3_to_DataNode_0), weight(0.5808262825012207), type(URL), network(network_3)
		Source: [83]: name('DataNode_3'), type(UNKNOWN), annotation('annotation_3'), guid('e7565935-6429-451f-b7f4-cc2d612ca3fd'), double(9.42477796076938), int(3), text('text_3')
		Target: [80]: name('DataNode_0'), type(USER), annotation('annotation_0'), guid('e9b7fa0a-37f1-41bd-a2c1-599841d1507a'), double(0.0), int(0), text('text_0')

name(network_4), owner(login_1), type(ITEM), archived(false), public(false), editable(true)
	[99]: name(UrlNode_4_to_UrlNode_7), guid(network_4_UrlNode_4_to_UrlNode_7), weight(0.48601675033569336), type(WAMPETER), network(network_4)
		Source: [84]: name('UrlNode_4'), type(IDENTITY), annotation('annotation_4'), guid('bffe13ae-bb70-46a6-b1b4-9f58cadad04e'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')
		Target: [87]: name('UrlNode_7'), type(QUERY), annotation('annotation_7'), guid('b9351194-d10e-4f6a-b997-b84c61344fcf'), Date(2016-01-11 11:51Z), html(some text), text('some text'), link('http://source.com/source.html'), image('http://source.com/soureImage.jpg')


Dictionaries
[30]: name(dictionary_0), guid(943ea8b6-6def-48ea-8b0f-a4e52e53954f), Owner(login_0), archived(false), public(false), editable(true)
	Entry = word_11
	Parent = word_10
	word_11 has 790 occurances in corpora0_chapter_1

	Entry = word_14
	word_14 has 4459 occurances in corpora1_chapter_2

	Entry = word_1
	Parent = word_0
	word_1 has 3490 occurances in corpora1_chapter_2

	Entry = word_10
	word_10 has 3009 occurances in corpora3_chapter_4

	Entry = word_4
	word_4 has 2681 occurances in corpora3_chapter_4

	Entry = word_5
	Parent = word_4
	word_5 has 5877 occurances in corpora1_chapter_2


[31]: name(dictionary_1), guid(c7b62a4b-b21a-4ebe-a939-0a71a891a3f9), Owner(login_0), archived(false), public(false), editable(true)
	Entry = word_3
	Parent = word_2
	word_3 has 4220 occurances in corpora0_chapter_1

	Entry = word_6
	word_6 has 4852 occurances in corpora2_chapter_3

	Entry = word_17
	Parent = word_16
	word_17 has 8394 occurances in corpora2_chapter_3

	Entry = word_2
	word_2 has 1218 occurances in corpora3_chapter_4

	Entry = word_19
	Parent = word_18
	word_19 has 8921 occurances in corpora2_chapter_3

	Entry = word_8
	word_8 has 4399 occurances in corpora3_chapter_4



Corpora
[27]: name(corpora1_chapter_2), guid(08803d93-deeb-4699-bdb2-ffa9f635c373), totalWords(1801), importer(login_1), url(http://americanliterature.com/author/herman-melville/book/moby-dick-or-the-whale/chapter-2-the-carpet-bag)
	word_15 has 5338 occurances in corpora1_chapter_2
	word_13 has 2181 occurances in corpora1_chapter_2
	word_14 has 4459 occurances in corpora1_chapter_2
	word_1 has 3490 occurances in corpora1_chapter_2
	word_5 has 5877 occurances in corpora1_chapter_2
	word_16 has 2625 occurances in corpora1_chapter_2

[EL Info]: connection: 2016-01-12 14:09:41.116--ServerSession(1842102517)--/file:/C:/Development/Sandboxes/JPA_2_1/out/production/JPA_2_1/_NetworkService logout successful

Sprint review delayed. Tomorrow
Filling in some knowledge holes in JPA. Finished Chapter 4.
Tried getting enumerated types to work. No luck…?

viztales

Dimension reduction, State, Orientation, and Speed

Monthly Archives: January 2016

Phil 1.29.16

Phil 1.28.16

Phil 1.27.16

Phil 1.26.16

Phil 1.25.16

Phil 1.24.16

Phil 1.22.16

Phil 1.21.16

Phil 1.20.16

Phil 1.19.16

Phil 1.18.16

Phil 1.15.16

Phil 1.14.16

Phil 1.13.16

Phil 1.12.16