Phil 2.9.16

7:00 – 4:00 VTX

Finished Publius: A robust, tamper-evident, censorship-resistant web publishing system
Starting Anonymity Loves Company – Anonymous Web Transactions with Crowds by Mike Reiter and Aviel Rubin, who was one of the co-authors on the Publius paper.
- Crowds could probably be built with PeerJS. The ISP would still know traffic, but that’s it.
Found this nice article in Communications of the ACM: Schema.org: Evolution of Structured Data on the Web. Nice overview. Very current.
The Big List of Naughty Strings
Time to combine everything
- Optional generation of Providers and queries – default is to load them from the DB
- Run queries from the DB
  - Show the number available and allow a request – done
  - Iterating over the queries and pages. Need to create, append and persist a rating – done
  - Named queries for
    - Queries that have the lowest number of results.ratings – done-ish. Currently it looks for -1 as a flag. Should also look for queries that have unrated results.
    - Queries associated with ‘bad’ providers
    - Queries associated with ‘good’ providers
  - Connect to DB remotely
- Wrap the app (done, with Launch4j. Very nice!) and test it on the other laptop. Note, it doesn’t have enough disk to install java on. That will have to wait.
- Packing up the laptop. Debating bringing multi monitor support. I’ll have the other laptop…
- Gratuitous screenshot:

Phil 2.8.16

7:00 – 5:00 VTX

My 401k still isn’t being done right. Sheesh.
More Publius: A robust, tamper-evident, censorship-resistant web publishing system
- Very good introduction, then it dives into the weeds of how the system was implemented and the cryptographic challenges. Good stuff, and should be addressed. It does imply that the information stored in my system could be encrypted and sharded as an additional layer of protection against malicious editing. That fits this case, since text can have annotations pointing to it but the source should be archival.
- I think I also need to set up a new doc db of news items that I can use to make the story more readable.
  - Stories of people fooled by misinformation
  - Stories of people damaged by lack of anonymity
  - Stories about citizen journalism
  - Stories about computational journalism
  - Something about CSCW, Wikipedia maybe?
- Anderson’s Eternity Service?
  - https://freenetproject.org/
Need to make the ProviderObject persistent. Done
Need a rating object – date, who, the rating, anything else? Done-ish
Need to make a quick & dirty swing app for people to use – started. Once that’s working, then build the rating object that it will create
Need to connect to a remote DB
- Will also need summary statistics and charts to see how queries do.
- Will also need to store the good (“match” and “flaggable”) pages for later training.
Should make the app stand-alone-ish. JSmooth?
Discussion with Mike G., Heath, Bob H., and Theresa on how to integrate current NLP/NER

Phil 2.5.16

6:45 – 4:15 VTX

Starting Publius: A robust, tamper-evident, censorship-resistant web publishing system
- Marc Waldman

Change the JsonLoaded class to only look at declared fields – done
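A minimal sketch of the declared-fields idea, assuming the parsed JSON arrives as a `Map<String, Object>`. `DeclaredFieldLoader` and `ResultItem` are hypothetical stand-ins, not the actual JsonLoaded class. The point is that `getDeclaredFields()` returns only the fields declared on the concrete class, so inherited fields are never touched.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;

// Hypothetical base class: copies values from a parsed-JSON map into the
// fields declared directly on the concrete subclass.
class DeclaredFieldLoader {
    // Returns the number of fields that were set from the map.
    int load(Map<String, Object> json) {
        int loaded = 0;
        // getDeclaredFields() skips inherited fields.
        for (Field f : getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue;
            }
            Object value = json.get(f.getName());
            if (value == null) {
                continue; // key missing in the JSON: keep the default
            }
            try {
                f.setAccessible(true);
                f.set(this, value);
                loaded++;
            } catch (IllegalAccessException | IllegalArgumentException e) {
                // Inaccessible field or type mismatch: skipped in this sketch.
            }
        }
        return loaded;
    }
}

// Hypothetical result item with two declared fields.
class ResultItem extends DeclaredFieldLoader {
    String title;
    String link;
}
```

Keys in the map that have no matching declared field are simply ignored.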
Register for Periscope Charts – done. Callback on Monday?
Working on parsing the query result.
- Had to set the charset to UTF-8. Huh.
- Can we pull back items by cacheId? Then we don’t need to load the primary store with internet info.
- Had a STUPID mistake in getting JPA set up. Had all the annotations pointing at each other, but forgot when creating the result objects that I had to pass the ‘parent’ query object in to get the mapping. Sigh.
- Adding a dirt-simple rating scheme
  - Java app iterates over all the urls returned and the user can pick from:
```
1 - not appropriate at all
2 - medical and/or legal
3 - Correct person
4 - Correct person with flaggable
```
    The Java app then either opens the page or downloads and opens the file with the default application.
  - The user picks the value, the result object persists with the rating and we move on to the next item. Right now the DB is on my local machine, but if we made it networkable everyone could rate a few pages. Most of the results should only take a few seconds to evaluate.
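The rating scheme above could be sketched roughly as follows. `PageRating` and `PageOpener` are hypothetical names, and opening the page through `java.awt.Desktop` is an assumption about how the app launches the default browser, not a description of the actual code.

```java
import java.awt.Desktop;
import java.io.IOException;
import java.net.URI;

// Hypothetical enum mirroring the four rating levels.
enum PageRating {
    NOT_APPROPRIATE(1),          // not appropriate at all
    MEDICAL_OR_LEGAL(2),         // medical and/or legal
    CORRECT_PERSON(3),           // correct person
    CORRECT_PERSON_FLAGGABLE(4); // correct person with flaggable content

    final int code;

    PageRating(int code) {
        this.code = code;
    }

    // Maps the number the user picks back to a rating.
    static PageRating fromCode(int code) {
        for (PageRating r : values()) {
            if (r.code == code) {
                return r;
            }
        }
        throw new IllegalArgumentException("Unknown rating code: " + code);
    }
}

class PageOpener {
    // Opens the url in the default browser when the platform supports it.
    // Returns false instead of throwing on headless or unsupported systems.
    static boolean open(String url) {
        if (!Desktop.isDesktopSupported()) {
            return false;
        }
        Desktop desktop = Desktop.getDesktop();
        if (!desktop.isSupported(Desktop.Action.BROWSE)) {
            return false;
        }
        try {
            desktop.browse(URI.create(url));
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Storing the numeric `code` rather than the enum ordinal keeps persisted ratings stable if levels are reordered later.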
I have the Google/db code running in one sandbox and the user eval running in another. Monday I’ll integrate them.

Phil 2.4.16

7:00 – 4:00 VTX

The way to handle multidimensional (human) ranking of documents (i.e. web pages) is to take the dimensions and webpages and put them in a matrix? Each page has a greater or lesser score on each dimension. Then apply PageRank. Tweak weights until the pages order the way we think they should.
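A rough sketch of the matrix-and-weights part of that idea. It deliberately stops short of the PageRank step and just orders pages by a weighted sum of their per-dimension scores. `DimensionRanker`, the score layout and the weights are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DimensionRanker {
    // scores[p][d] is the human score of page p on dimension d;
    // weights[d] is the tunable importance of dimension d.
    static double[] combine(double[][] scores, double[] weights) {
        double[] total = new double[scores.length];
        for (int p = 0; p < scores.length; p++) {
            for (int d = 0; d < weights.length; d++) {
                total[p] += scores[p][d] * weights[d];
            }
        }
        return total;
    }

    // Returns page indices ordered from highest to lowest combined score.
    static List<Integer> order(double[][] scores, double[] weights) {
        double[] total = combine(scores, weights);
        List<Integer> idx = new ArrayList<>();
        for (int p = 0; p < total.length; p++) {
            idx.add(p);
        }
        idx.sort(Comparator.comparingDouble((Integer p) -> -total[p]));
        return idx;
    }
}
```

Tweaking `weights` and re-running `order` is the "adjust until the pages sort the way we expect" loop.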
Does “authority” mean quality? predicting expert quality ratings of Web documents
LandScan (Oak Ridge Labs)
Uppsala Conflict Data Program Geo-referenced Event Dataset
Nils Weidmann Dataverse (University of Konstanz)
Continuing On the Accuracy of Media-based Conflict Event Data. Done. Wow. And look at all the databases ^^^ !
Microsoft bot API
- Creating a generic bot
Back to GoogleHacking
- Added ‘CredEngine1’ as BASELINE search engine
- Looks like we blew through our limits. Using my key. Verified that the BASELINE search runs. That does mean that the current 4 queries factor out to 24 searches (6 search engines * 4 queries)
- Building search persistent object
- Building result item object. Actually, building a JsonLoadable base class, since this trick is going to be used for the query items and info object
- Need a result info object that stores the meta information.
- Just stumbled across a GCS twitter search. Neat.
- Hitting the CSE and getting results. Tomorrow I’ll finish off the classes that will persist the search results. I’ve got a buffered search result to use instead of hitting Google, although it will still need to pull down the document referenced in the result. I wonder how Jsoup handles PDF and Word documents?

Phil 2.3.16

7:00 – 3:00 VTX

Just discovered Publius – a Web publishing system that is highly resistant to censorship and provides publishers with a high degree of anonymity. No longer active, but produced a paper.
Continuing On the Accuracy of Media-based Conflict Event Data. Currently starting Matching Media-based Conflict Reports with Military Records

Back to Googlehacking

Since I’ve got the provider JSON, setting up objects that I can use for more in-depth parsing. Thinking that this could be an example of ‘code’ in the dictionary. A word can be an object that knows how to look through a section of text to see if it can find itself.
I think running several dictionaries over a document could be interesting. For example, using a medical and a legal dictionary on a document would let the system infer malpractice as opposed to a document on foreign aid.
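A minimal sketch of the several-dictionaries idea, assuming each dictionary is just a set of lower-case words. `DictionaryScan` and the word lists are hypothetical, and a real version would presumably use the NLP pipeline rather than a regex split.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class DictionaryScan {
    // Counts how many tokens of the text appear in each named dictionary.
    static Map<String, Integer> hits(String text, Map<String, Set<String>> dictionaries) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : dictionaries.entrySet()) {
            int n = 0;
            for (String t : tokens) {
                if (e.getValue().contains(t)) {
                    n++;
                }
            }
            counts.put(e.getKey(), n);
        }
        return counts;
    }

    // A document is a malpractice candidate only if both dictionaries fire.
    static boolean looksLikeMalpractice(String text, Set<String> medical, Set<String> legal) {
        Map<String, Set<String>> dictionaries = new LinkedHashMap<>();
        dictionaries.put("medical", medical);
        dictionaries.put("legal", legal);
        Map<String, Integer> counts = hits(text, dictionaries);
        return counts.get("medical") > 0 && counts.get("legal") > 0;
    }
}
```

A foreign-aid document that only trips the legal dictionary would not be flagged, which is the distinction the note is after.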

Generating the right queries and they work in the browser:

"Ram Singh"
	ALL_GOV(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+sanctions
	ALL_GOV(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+criminal
	ALL_GOV(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+malpractice
	ALL_GOV(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+board+actions
	ALL_US(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+sanctions
	ALL_US(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+criminal
	ALL_US(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+malpractice
	ALL_US(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:9qwxkhnqoi0&q=%22Ram+Singh%22+VA+board+actions
	ALL_ORG(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+sanctions
	ALL_ORG(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+criminal
	ALL_ORG(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+malpractice
	ALL_ORG(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:ux1lfnmx3ou&q=%22Ram+Singh%22+VA+board+actions
	RESTRICTED_COM(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+sanctions
	RESTRICTED_COM(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+criminal
	RESTRICTED_COM(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+malpractice
	RESTRICTED_COM(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:swl1wknfxia&q=%22Ram+Singh%22+VA+board+actions
	ALL_EDU(sanctions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+sanctions
	ALL_EDU(criminal): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+criminal
	ALL_EDU(malpractice): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+malpractice
	ALL_EDU(board actions): https://www.googleapis.com/customsearch/v1?key=AIzaSyAj6wa-zWuNWXrjeJ4FteuBMKj92mRP4vo&cx=017379340413921634422:lqt7ih7tgci&q=%22Ram+Singh%22+VA+board+actions
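
A sketch of how URLs of this shape can be generated, one per (engine, term) pair. `CseQueryBuilder` is a hypothetical name and `YOUR_API_KEY` is a placeholder. The `cx` values are passed through unencoded to match the listing above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class CseQueryBuilder {
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Builds one Custom Search request URL. The name is quoted so it is
    // searched as a phrase; URLEncoder turns quotes into %22 and spaces into +.
    static String build(String apiKey, String cx, String name, String state, String term) {
        String q = "\"" + name + "\" " + state + " " + term;
        try {
            return ENDPOINT + "?key=" + apiKey + "&cx=" + cx
                    + "&q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // One URL for every combination of search engine id and query term.
    static List<String> buildAll(String apiKey, List<String> engineIds, String name,
                                 String state, List<String> terms) {
        List<String> urls = new ArrayList<>();
        for (String cx : engineIds) {
            for (String term : terms) {
                urls.add(build(apiKey, cx, name, state, term));
            }
        }
        return urls;
    }
}
```

With five engines and four terms, `buildAll` yields the twenty requests per provider shown above, which is worth keeping in mind against the API quota.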

So the next thing is to start running these queries and looking at the results to see if there are patterns. And I would be further along, but IntelliJ choked when I tried to add JPA. After flailing for a while I just gave up, created a new project, copied all the lib src and persistence directories over, updated the structure, and it all works. Grumble grumble.

Phil 2/2/16

7:00 –

My Google Developer Dashboard. I keep forgetting how to find this thing.
Starting On the Accuracy of Media-based Conflict Event Data. One quick note – these datasets could easily go into GLOBE. Wouldn’t that be interesting…
Jacob N. Shapiro Associate Professor of Politics and International Affairs
Ok, back to entity extraction. Trying the Google CSE
- You have to include the www for healthgrades.com. Huh.
- added linkedin and top npi to exclude
- It definitely works better than the ‘-site’ version of google search
- Trying it with a rest interface. Google’s documentation
- My key was wrong. Using the pro key
  - All us: cx=017379340413921634422:9qwxkhnqoi0 – verified
  - All com (no healthgrades) : cx=017379340413921634422:swl1wknfxia – verified
  - All gov : cx=017379340413921634422:lqt7ih7tgci – verified
- Going to build a small test app that produces a list of providers and domains that we can look through in more detail and test out different analytics
  - Get back links?
  - Jsoup + NLP
  - Change over time
- Was just getting started on loading all the providers from JSON, when my IntelliJ stopped inserting breakpoints. Installing a new version

Phil 2.1.16

9:00 – 4:00 VTX

Seminar today from 10:00 – 12:00. Need to send Coleen an email asking how to charge that. Done
Need to write a brief description of each coding term. How convenient! Atlas allows you to edit code description and then changes the icon. I am liking this package…
Back to reading about interrogation. Done. Not directly related to what I’m doing but still interesting was the section on the Scharff technique
Adding the Armed Conflict & Event Data Project User’s Manual. A nice example of good coding and definitions, I think. Also pointed to On the Accuracy of Media-based Conflict Event Data, which looks like a must-read.
Ok, Let’s get back to better searches
Looked at common crawl and the common crawl index some more. I’m worried that it misses smaller targets, as philfeldman.com doesn’t show, and that’s been up for years. We’ll come back to that later if I can’t make Google play nicer.
Playing with the google search API(s)
- This lovely example (from Google yet) seems to provide everything you need in JSON. Even without a key…
- Their example: https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=Paris%20Hilton&userip=USERS-IP-ADDRESS
- A version that excludes all .com sites and irs.gov: https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=1040+-site%3A*.com+-site%3Airs.gov
- Wow – you can get back links
- Many options, including start and num. Num doesn’t seem to work in JSON, but start does (the first value is zero). So you seem to be limited to 4 returns at a time?
- Same query starting at the 20th result
- Looks like a complete list of operators
So now I’m going to try getting better provider queries
- https://www.google.com/search?safe=off&q=ramh+singh+malpractice+-site%3A+healthgrades.com
- This kinda works. It seems to exclude a lot more than I was expecting. healthgrades is gone, but so is a bunch of other sites like doctorwiki.com
- Regions work though: https://www.google.com/search?q=%22ram+singh%22+malpractice+site%3A.org&cr=countryUS
When Aaron is in tomorrow, I’ll ask him how the CSE/JSON integration works, and where to get ids. I got one from Google, but it sure doesn’t look right.

Phil 1.29.16

7:00 – 3:30 VTX

Continuing The Hybrid Representation Model for Web Document Classification.
- Finished. That one’s a keeper.
Based on a discussion with a retired cop (Deputy Sheriff of Worcester County, Criminal, Narcotic, etc) who mentioned the Reid technique for evaluating truth telling, I thought it might be a good idea to look for an overview of the field (snowball methods!). So I’m starting Eliciting Information and Detecting Lies in Intelligence Interviewing: An Overview Of Recent Research. Both authors, Pär Anders Granhag and Aldert Vrij, have additional publications on credibility. There are quite a few papers on cognitive load, so that would be an interesting piece to incorporate into the interface…
Mothballing the Ontology to Dictionary work for a bit
Stanford Entity Resolution Framework (SERF)
Learning-based Entity Resolution with MapReduce
Palantir Gotham
IBM Infosphere
A taxonomy of tools that support the fluent and flexible use of visualizations
Modern Information Retrieval: A Brief Overview (By Google in 2001. Describes how all the pieces work)
Starting on White Paper
- Definitions
  - Precision – the fraction of retrieved instances that are relevant
    - We can measure this. In the top N results from our test query, how many were useful?
  - Recall – the fraction of relevant instances that are retrieved. We can’t measure this from Google, but we could with a static repository like CommonCrawl.
  - Rank – the ordering of the returned results, determined by some algorithm (e.g. the eigenvector from PageRank)
  - Entity Resolution
- Previous Work
  - Research
  - Other Systems
- The problems as I see them
  - Finding the Corpus to search for entities (best signal-to-noise)
    - Finding reputable documents also needs human-evaluated documents
      - Guidelines for raters. An interview with a rater describing the work
    - Look for words or the words in back links pointing to the document
  - Finding correct entities within the corpus
  - Finding information that corresponds to Flags
  - Associating Flags with Entities
  - Ordering
- The current model
  - No baseline data currently exists
  - Building ‘Gold Standard’ data to aid in production
  - Also, here’s a Google video showing how Google uses human raters to build ‘gold standard’ data to evaluate information retrieval quality: https://www.youtube.com/watch?v=nmo3z8pHX1E
- Improving the current model
  - Mechanical Turk
- Alternate models
  - Finding the Corpus to search for entities (best signal-to-noise)
  - Finding correct entities within the corpus
  - Finding information that corresponds to Flags
  - Associating Flags with Entities
- Conclusions and Recommendations
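The precision definition in the outline above is the one that can be measured against Google today. A minimal sketch, assuming a human rater has marked each returned result as useful or not (`PrecisionAtN` is a hypothetical name):

```java
import java.util.List;

class PrecisionAtN {
    // relevant.get(i) is true when a rater judged the i-th returned result useful.
    // If fewer than n results came back, only the returned ones are counted.
    static double precisionAtN(List<Boolean> relevant, int n) {
        int limit = Math.min(n, relevant.size());
        if (limit <= 0) {
            return 0.0;
        }
        int useful = 0;
        for (int i = 0; i < limit; i++) {
            if (relevant.get(i)) {
                useful++;
            }
        }
        return (double) useful / limit;
    }
}
```

Recall would need the total number of relevant documents in the denominator, which is why it needs a static repository rather than a live search engine.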

Phil 1.28.16

5:30 – 3:30 VTX

Continuing The Hybrid Representation Model for Web Document Classification. Good stuff, well written. This paper (An Efficient Algorithm for Discovering Frequent Subgraphs) may be good for recognizing patterns between stories. Possibly also images.
Useful page for set symbols that I can never remember: http://www.rapidtables.com/math/symbols/Set_Symbols.htm

Finally discovered why the RdfStatementNodes aren’t assembling properly. There is no root statement… Fixed! We can now go from:

<rdf:RDF
  xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  xmlns:vCard='http://www.w3.org/2001/vcard-rdf/3.0#'
   >

  <rdf:Description rdf:about="http://somewhere/JohnSmith/">
    <vCard:FN>John Smith</vCard:FN>
    <vCard:N rdf:parseType="Resource">
   <vCard:Family>Smith</vCard:Family>
   <vCard:Given>John</vCard:Given>
    </vCard:N>
  </rdf:Description>

  <rdf:Description rdf:about="http://somewhere/RebeccaSmith/">
    <vCard:FN>Becky Smith</vCard:FN>
    <vCard:N rdf:parseType="Resource">
   <vCard:Family>Smith</vCard:Family>
   <vCard:Given>Rebecca</vCard:Given>
    </vCard:N>
  </rdf:Description>

  <rdf:Description rdf:about="http://somewhere/SarahJones/">
    <vCard:FN>Sarah Jones</vCard:FN>
    <vCard:N rdf:parseType="Resource">
   <vCard:Family>Jones</vCard:Family>
   <vCard:Given>Sarah</vCard:Given>
    </vCard:N>
  </rdf:Description>

  <rdf:Description rdf:about="http://somewhere/MattJones/">
    <vCard:FN>Matt Jones</vCard:FN>
    <vCard:N
   vCard:Family="Jones"
   vCard:Given="Matthew"/>
  </rdf:Description>

</rdf:RDF>

to this:

[1]: http://somewhere/SarahJones/
--[5] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Sarah Jones"
--[4] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffd)
----[6] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Sarah"
----[7] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Jones"
[3]: http://somewhere/MattJones/
--[15] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Matt Jones"
--[14] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffc)
----[11] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Jones"
----[10] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Matthew"
[0]: http://somewhere/RebeccaSmith/
--[3] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "Becky Smith"
--[2] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffe)
----[9] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Smith"
----[8] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "Rebecca"
[2]: http://somewhere/JohnSmith/
--[12] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7fff)
----[1] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal:  "Smith"
----[0] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal:  "John"
--[13] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal:  "John Smith"
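
A library-free sketch of how statements like these can be assembled into a tree. This is not the actual RdfStatementNode code: `TripleTree` and `Triple` are hypothetical stand-ins for the Jena types, and the rule used here (a root is any subject that never appears as the object of another statement) is one assumed way to supply the missing root.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TripleTree {
    // Minimal stand-in for an RDF statement; not the Jena Statement class.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;
        final boolean objectIsResource;

        Triple(String subject, String predicate, String object, boolean objectIsResource) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.objectIsResource = objectIsResource;
        }
    }

    // Subjects that are never the object of another statement are the roots.
    static List<String> roots(List<Triple> triples) {
        Set<String> objects = new HashSet<>();
        for (Triple t : triples) {
            if (t.objectIsResource) {
                objects.add(t.object);
            }
        }
        List<String> roots = new ArrayList<>();
        for (Triple t : triples) {
            if (!objects.contains(t.subject) && !roots.contains(t.subject)) {
                roots.add(t.subject);
            }
        }
        return roots;
    }

    // Renders each root followed by its statements, indenting nested resources.
    static String render(List<Triple> triples) {
        Map<String, List<Triple>> bySubject = new LinkedHashMap<>();
        for (Triple t : triples) {
            bySubject.computeIfAbsent(t.subject, k -> new ArrayList<>()).add(t);
        }
        StringBuilder out = new StringBuilder();
        for (String root : roots(triples)) {
            out.append(root).append('\n');
            append(out, bySubject, root, "--", new HashSet<>());
        }
        return out.toString();
    }

    private static void append(StringBuilder out, Map<String, List<Triple>> bySubject,
                               String subject, String indent, Set<String> seen) {
        if (!seen.add(subject)) {
            return; // guard against cycles in the graph
        }
        for (Triple t : bySubject.getOrDefault(subject, new ArrayList<>())) {
            out.append(indent).append(t.predicate).append(" -> ").append(t.object).append('\n');
            if (t.objectIsResource) {
                append(out, bySubject, t.object, indent + "--", seen);
            }
        }
    }
}
```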

Some thoughts about information retrieval using graphs
- Characterise our current data
  - URL
  - Domain
  - HTML text – allows for tagged information (<title>, etc.). Outbound links (how many hops?)
  - Cleaned text
  - Word Bag?
  - Date retrieved
  - Query (items returned)
    - Currently 4 queries per entity – few 100k to mbytes.
      - Felony – top 10
      - 3 others
  - Flags produced
- From that, we can analyze
  - Most productive pages for flags (power law?)
  - Temporal patterns
  - Network characteristics for good/bad medical providers, pages producing flags, etc.
    - Diameter
    - Flag page adjacency and connecting paths
  - Statistical differences for above, including the same query over time
- Once we know the size of the network we use, we can look to how we might do our own crawl.
  - Impact of Similarity Measures on Web-page Clustering
    - Look up “Standard Graph Representation”
    - An Efficient Algorithm for Discovering Frequent Subgraphs
    - Efficient Graph-Based Representation of Web Documents
- Patient satisfaction is systemic, criminal is isolated? Implies a signal to noise problem.
- Can we produce a training set of documents?
- Can we test against other search engines? Bing? Mechanical Turk?
- Data Sets
  - CMU World Wide Knowledge Base (Web->KB) project
- Using Google better
  - Only use Google to find new sites and exclude the ones that we know about: https://support.google.com/customsearch/answer/2631038?hl=en
  - Can we infer a bad/good provider, e.g. with Google Similarity Distance? If so, we can run extensive queries only about the bad ones. (A quick example)
  - How often do we really need to do a deep crawl on a provider? Are there inferential triggers that we could use?
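For reference, the Google Similarity Distance mentioned in the list above is usually published as the Normalized Google Distance (Cilibrasi and Vitányi). A minimal sketch of the formula, assuming the hit counts have already been fetched; pairing a provider name with a term such as "malpractice" is an assumption about how it would be used here, and `GoogleDistance` is a hypothetical name.

```java
class GoogleDistance {
    // fx, fy: hit counts for each term alone; fxy: hits for both terms together;
    // n: (estimated) number of pages indexed. All values must be positive.
    // NGD = (max(log fx, log fy) - log fxy) / (log n - min(log fx, log fy))
    static double ngd(double fx, double fy, double fxy, double n) {
        double lx = Math.log(fx);
        double ly = Math.log(fy);
        double lxy = Math.log(fxy);
        return (Math.max(lx, ly) - lxy) / (Math.log(n) - Math.min(lx, ly));
    }
}
```

Terms that always occur together score 0, and the score grows as the co-occurrence count shrinks relative to the individual counts.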
Sent a note to Theresa asking for people to do manual flag extraction

Phil 1.27.16

7:00 – 4:00 VTX

The 401k is still coming out of my paycheck and not going into my account. W. T. F.
Starting The Hybrid Representation Model for Web Document Classification.
- Citation: On a relation between graph edit distance and maximum common subgraph – might be useful for identifying bad behaviors…
- Very nice related work section
Working on creating a tree of RDF statements that I can traverse.
Hmm. It works on my toy examples but blows up on Gregg’s files..
- Actually, it doesn’t. Getting an additional entry.
Google custom search
Web Structure Mining from Advanced Techniques in Web Intelligence – I
Answering Enumeration Queries with the Crowd
IM’d with Bob about CSE sources. Meeting set up for tomorrow at 2:00

Phil 1.26.16

7:00 – 3:00 VTX

Finished the Crowdseeding paper. I was checking out the authors, and went to Macartan Humphreys’ website. He’s been doing interesting work, and he’s up in NYC at Columbia, so it would be possible to visit. There is one paper that looks very interesting: Mixing Methods: A Bayesian Approach. It’s about inferring information from quantitative and qualitative sources. It sounds related, both to how I’m putting together my proposal and how the overall system should(?) work.
Reviewing a paper. Don’t forget to mention other analytic systems like Palantir Gotham
On to Theme-based Retrieval of Web News. And in looking at papers that cite this, found The Hybrid Representation Model for Web Document Classification. Not too impressed with the former. The latter looks like it contains some good overview in the previous works section. One of the authors: Mark Last (lots of data discovery in large data sets)
Downloading new IntelliJ. Ok, back to normal and the tutorial.
- Huh. Tried loading the (compact) “N-TRIPLES” format, which barfed, even though Jena wrote out the file. The (pretty) “RDF/XML-ABBREV” works for read and write though. Maybe I’m using the wrong read() method? Pretty is good for now anyway. The goal is to have a human-readable / RDF format anyway.
- Can do some primitive search and navigation-like behavior, but not getting where I want to go. For example, it’s possible to list all the resources:
```
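import org.apache.jena.rdf.model.ResIterator;  // com.hp.hpl.jena.rdf.model in Jena 2.x
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.StmtIterator;

// model and prop come from the surrounding tutorial code.
// List every resource that has the property, then print each of its values.
ResIterator resIter = model.listResourcesWithProperty(prop);
while (resIter.hasNext()) {
    Resource r = resIter.nextResource();
    StmtIterator stmtIter = r.listProperties(prop);
    while (stmtIter.hasNext()) {
        System.out.println("\t" + stmtIter.nextStatement().getObject().toString());
    }
}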
```
- But getting the parent of any of those resources is not supported. It looks like this requires using the Jena Ontology API, so on to the next tutorial…
- Got Gregg’s simpleCredentials.owl file and was able to parse. Now I need to unpack it and create a dictionary.
- Finished with the Jena Ontology API . No useful navigation, so very disappointing. Going to take the model.listStatements and see if I can assemble a tree (with relationships?) for the dictionary taxonomy conversion tomorrow.

Phil 1.25.16

8:00 – 4:00 VTX

Working from home today
I think a good goal is to put together a human-readable dictionary input-output file. Need to ask Gregg about file formats he uses.
Downloaded the sandbox files for JPA and SNLP projects
Updating my Intellij
- Indexing…
- Installing plugin updates
- Still indexing…
- Testing.
  - Stanford NLP: Missing the ‘models’ jar file – fixed
  - JavaJPA: Worked first time
Updating Java to 8u72
Pinged Gregg about what file format he uses. It’s RDF. He’s sending an example that I’m going to try to import with Apache Jena.
Created Jena project.
After a frustrating detour into Maven with Intellij, imported the Jena libraries directly.
Whoops, forgot to set log4j.
Starting the tutorial.

Ok, good progress. I can create a model, add resources, and print out the XML representation. I think a variation of this should be fine to describe the dictionary:

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#">
 <rdf:Description rdf:about="http://somewhere/JohnSmith">
 <vcard:N rdf:parseType="Resource">
 <vcard:Family>Smith</vcard:Family>
 <vcard:Given>John</vcard:Given>
 </vcard:N>
 <vcard:FN>John Smith</vcard:FN>
 </rdf:Description>
</rdf:RDF>

But now, I’m going to try my snowshoes out for lunch…
Ok, back from the adventure.
Writing out files – done
Reading in files – done
Checking project into Subversion – done

Phil 1.24.16

7:00 – 9:00(am)

Boy, that was a lot of snow…
Finished Security-Control Methods for Statistical Databases. Lots of good stuff, but the main takeaway is that data from each user could be adjusted by a fixed value so that its means and variances would be indistinguishable from some other user. We’d have to save those offsets for differentiation, but those are small values that can be encrypted and even stored offline.
Starting Crowdseeding Conflict Data.
- Just found out about FrontlineSMS and SimLab
- ACLED (Armed Conflict Location & Event Data Project)
  - Selected articles and book chapters
  - Guide for Media Users
  - Users Guide (includes instructions about datasets)
- We close with reflections on the ethical implications of taking a project like this to scale. During the pilot project we faced no incidents that threatened the safety of the phone holders. However, this might be different when the project is scaled up and the attention of armed groups is drawn to it. For both humanitarian and research purposes a project such as Voix des Kivus becomes truly useful only when it is taken to scale; but those are precisely the conditions which might create the greatest risks. We did not assess these risks because we could not bear them ourselves. But given the importance and utility of the data these are risks that others might be better placed to bear.
- Internal validation seems to help a lot. This really does beg the question as to what the interface should look like to enforce conformity without leading to information overload.
- So restrict the user choice (like the codes used here), or have the system infer categories? A mix? Maybe like the search autocomplete?
- Remember, this needs to work for mobile, even SMS. I’m thinking that maybe a system that has a simple question/answer interaction that leads down a tree might be general enough. As the system gets more sophisticated, the text could get more conversational.
- This could be tested on Twitter as a bot. It would need to keep track of the source’s id to maintain the conversation, and could ask for posts of images, videos, etc.

Phil 1.22.16

6:45 – 2:15 VTX

Timesheet day? Nope. Next week.
Ok, now that I think I understand Laplace Transforms and why they matter, I think I can get back to Calibrating Noise to Sensitivity in Private Data Analysis. Ok, kinda hit the wall on the math on this one. These aren’t formulas that I would be using at this point in the research. It’s nice to know that they’re here, and can probably help me determine the amount of noise that would be needed in calculating the biometric projection (which inherently removes information/adds noise).
Starting on Security-Control Methods for Statistical Databases: A Comparative Study
Article on useful AI chatbots. Sent SemanticMachines an email asking about their chatbot technology.
Got the name disambiguation working pretty well. Here’s the text:
- – RateMDs Name Signup | Login Claim Doctor Profile | Claim Doctor Profile See what’s new! Account User Dashboard [[ doctor.name ]] Claim Doctor Profile Reports Admin Sales Admin: Doctor Logout Toggle navigation Menu Find A Doctor Find A Facility Health Library Health Blog Health Forum Doctors › Columbia › Family Doctor / G.P. › Unfollow Follow Share this Doctor: twitter facebook Dr. Robert S. Goodwin Family Doctor / G.P. 29 reviews #9 of 70 Family Doctors / G.P.s in Columbia, Maryland Male Dr Goodwin & Associates Unavailable View Map & ……………plus a lot more ………………..Hospitalizes Infant In Spain Wellness How Did Google Cardboard Save This baby’s life? Health 7 Amazing Stretches To Do On a Plane Follow Us You may also like Dr. Charles L. Crist Family Doctor / G.P. 24 reviews Top Family Doctors / G.P.s in Columbia, MD Dr. Mark V. Sivieri 21 reviews #1 of 70 Dr. Susan B. Brown Schoenfeld 8 reviews #2 of 70 Dr. Nj Udochi 4 reviews #3 of 70 Dr. Sarah L. Connor 4 reviews #4 of 70 Dr. Kisa S. Crosse 7 reviews #5 of 70 Sign up for our newsletter and get the latest health news and tips. Name Email Address Subscribe About RateMDs About Press Contact FAQ Advertise Privacy & Terms Claim Doctor Profile Top Specialties Family G.P. Gynecologist/OBGYN Dentist Orthopedics/Sports Cosmetic Surgeon Dermatologist View all specialties > Top Local Doctors New York Chicago Houston Los Angeles Boston Toronto Philadelphia Follow Us Facebook Twitter Google+ ©2004-2016 RateMDs Inc. – The original and largest doctor rating site.
- Here’s the list of extracted people:
```
PERSON: Robert S. Goodwin
PERSON: Robert S. Goodwin
PERSON: L. Crist
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: G
PERSON: Robert S. Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Goodwin
PERSON: Ajay Kumar
PERSON: Charles L. Crist
PERSON: Mark V. Sivieri
PERSON: B. Brown Schoenfeld
PERSON: L. Connor
PERSON: S. Crosse
```
- And here are some tests against that set (lower scores are better; the score is an information distance):
```
Best match for Robert S. Goodwin is PERSON: Robert S. Goodwin (score = 0.0)
Best match for Goodwin Robert S. is PERSON: Robert S. Goodwin (score = 0.0)
Best match for Dr. Goodwin is PERSON: Robert S. Goodwin (score = 1.8)
Best match for Bob Goodwin is PERSON: Robert S. Goodwin (score = 2.0)
Best match for Rob Goodman is PERSON: Robert S. Goodwin (score = 2.6)
```
So I can cluster together similar (and misspelled) words, and SNLP hands me information about DATE, DURATION, PERSON, ORGANIZATION, LOCATION
Don’t know why I didn’t see this before – this is the page for the NER with associated papers. That’s about as close to a guide as I think you’ll find in this system.

Phil 1.21.16

7:00 – 4:00 VTX

Inverse Laplace examples
Dirac delta function
Useful link of the day: Firefox user agent strings
Design Overview presentation.

Working on (simple!) name disambiguation

Building word chains of sequential tokens that are entities (PERSON and ORGANIZATION) Done
Given a name, split by spaces and get best match on last name, then look ahead one or two words for best match on first name. If both sets are triples, then check the middle. Wound up iterating over all the elements looking for the best match. This does let things like reverse order work. Not sure if it’s best
Checks need to look for initials for first and middle name in source and target. Still working on this one.
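A rough sketch of the token-by-token matching described above. It assumes Levenshtein edit distance as the per-token score, which is a swapped-in measure and will not reproduce the numbers in the results below. Initials are not treated specially. `NameMatcher` is a hypothetical name.

```java
import java.util.Locale;

class NameMatcher {
    // Classic Levenshtein edit distance between two strings.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Lower-cases, drops periods and splits on whitespace.
    static String[] tokens(String name) {
        return name.toLowerCase(Locale.ROOT).replace(".", "").trim().split("\\s+");
    }

    // For every token in the query, take the distance to its closest token in
    // the candidate and sum the results. Token order does not matter, so a
    // reversed name scores the same as the original. Lower is better.
    static int score(String query, String candidate) {
        String[] cand = tokens(candidate);
        int total = 0;
        for (String q : tokens(query)) {
            int best = Integer.MAX_VALUE;
            for (String c : cand) {
                best = Math.min(best, editDistance(q, c));
            }
            total += best;
        }
        return total;
    }

    // Returns the candidate with the lowest score; the first one wins ties.
    static String bestMatch(String query, String[] candidates) {
        String best = null;
        int bestScore = Integer.MAX_VALUE;
        for (String c : candidates) {
            int s = score(query, c);
            if (s < bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }
}
```

Because the score only sums over query tokens, a short query such as "Goodwin" matches "Robert S. Goodwin" with a score of zero, which mirrors the behaviour in the results.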

Results (lower is better):

------------------------------
Robert S. Goodwin
PERSON: Robert S. Goodwin score = 0.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: L. Crist score = 6.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: G score = 2.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Ajay Kumar score = 9.0
PERSON: Charles L. Crist score = 13.0
PERSON: Mark V. Sivieri score = 10.0
PERSON: B. Brown Schoenfeld score = 13.0
PERSON: L. Connor score = 6.0
PERSON: S. Crosse score = 6.0

------------------------------
Goodwin Robert S.
PERSON: Robert S. Goodwin score = 0.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: L. Crist score = 6.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: G score = 2.0
PERSON: Robert S. Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Goodwin score = 0.0
PERSON: Ajay Kumar score = 9.0
PERSON: Charles L. Crist score = 13.0
PERSON: Mark V. Sivieri score = 10.0
PERSON: B. Brown Schoenfeld score = 13.0
PERSON: L. Connor score = 6.0
PERSON: S. Crosse score = 6.0
