Monthly Archives: February 2016

Phil 2.29.16

7:00 – 3:00 VTX

Seminar today, sent Aaron a reminder.
- Some discussion about my publication quantity. Amy suggests 8 papers as the baseline for credibility, so here are some preliminary thoughts about what could come out of my work:
  - Page Rank document return sorting Pertinence
  - User Interfaces for trustworthy input
  - Rating the raters / harnessing the Troll
  - Trustworthiness inference using network shape
  - Adjusting relevance through GUI pertinence
  - Something about ranking of credibility cues – Video, photos, physical presence, etc.
  - Something about the patterns of posting indicating the need for news. Sweden vs. Gezi. And how this can be indicative of emerging crisis informatics need
  - Something about fragment synthesis across disciplines and being able to use it to ‘cross reference’ information?
  - Fragment synthesis vs. community fragmentation.
- 2013 SenseCam paper
- Narrative Clip
Continuing Incentivizing High-quality User-Generated Content.
- Looking at the authors
  - Arpita Ghosh. Lots of good stuff. Revisit later
    - Social Computing and User-generated Content: A Game-Theoretic Approach. Arpita Ghosh. SIGecom Exchanges, Vol 11.2, December 2012.
    - Incentives in Human Computation: HCOMP, November 2013.
    - Truthful Assignment without Money. Shaddin Dughmi, Arpita Ghosh. Proc. 11th ACM Conference on Electronic Commerce (EC), 2010.
  - R. Preston McAfee currently works as chief economist at Microsoft. Previously, he was an economist at Google. Before that he was a Vice President and Research Fellow at Yahoo! Research where he led the Microeconomics and Social Systems group. Also has the ugliest home page I’ve seen since the late ’90s.
    - The Wisdom of Smaller, Smarter Crowds, EC 2014: Proceedings of the 15th ACM Conference on Economics and Computation, 2014 (with Dan Goldstein and Sid Suri).
- The proportional mechanism therefore improves upon the baseline mechanism by disincentivizing q = 0, i.e., it eliminates the worst reviews. Ideally, we would like to be able to drive the equilibrium qualities to 1 in the limit as the number of viewers, M, diverges to infinity; however, as we saw above, this cannot be achieved with the proportional mechanism.
- This reflects my intuition. The lower the quality of the rating, the worse the proportional rating system is, and the lower the bar for quality for the contributor. The three places that I can think of offhand that have high-quality UGC (Idea Channel, StackOverflow and Wikipedia) all have people rating the data (contextually!!!) rather than a simple up/downvote.
  - Idea Channel – The main content creators read the comments and incorporate the best in the subsequent episode.
  - StackOverflow – Has become a place to show off knowledge, and there are community mechanisms of enforcement, and the number of answers is low enough that it’s possible to look over all of them.
- Others that might be worth thinking about:
  - Quora? Seems to be an odd mix of questions. Some just seem lazy (how do I become successful) or very open ended (What kind of guy is Barack Obama). The quality of the writing is usually good, but I don’t wind up using it much. So why is that?
  - Reddit. So ugly that I really don’t like using it. Is there a System Quality/Attractiveness as well as System Trust?
  - Slashdot. Good headline service, but low information in the comments. Occasionally something insightful, but often it seems like rehearsed talking points.
- So the better the raters, the better the quality. How can the System evaluate rater quality? Link analysis? Pertinence selection? And if we know a rater is low-quality, can we use that as a measure in its own right?
Trying to test the redundant web page filter, but the URLs for most identical pages are actually slightly different:
- http://archive.org/stream/recordofhampd7421999hamp/recordofhampd7421999hamp_djvu.txt
- https://archive.org/stream/recordofhampd4321969hamp/recordofhampd4321969hamp_djvu.txt
I think tomorrow I might parse the URL or look at page content. Tomorrow.
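A minimal sketch of the two options mentioned above (parsing the URL versus looking at page content), in Java since that is what the rating app is written in. The class and method names (`PageFingerprint`, `urlKey`, `contentKey`) are hypothetical and not from the actual code base. Note that the two archive.org URLs above differ in the path as well as the scheme, so a URL key alone would not match them; a hash of the extracted text would, assuming the text really is identical.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: two ways to decide whether a page has been seen before.
class PageFingerprint {

    // Key that ignores scheme, a leading "www.", query, fragment and a trailing slash.
    static String urlKey(String url) {
        URI uri = URI.create(url.trim());
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        if (host.startsWith("www.")) {
            host = host.substring(4);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return host + path;
    }

    // SHA-256 (hex) of the page text after collapsing whitespace and lower-casing.
    static String contentKey(String pageText) {
        String normalized = pageText.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

The content key is probably the safer test for "identical pages", with the URL key as a cheap first pass.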

Phil 2.26.16

7:00 – 4:30 VTX

Continuing Incentivizing High-quality User-Generated Content.
- Suppose there are M viewers. The distribution of the total available attention from these M viewers amongst the K participating contributors is determined by the mechanism M being used to display the content. But what if it’s the Idea Channel’s hybrid approach? Google does some ranking of the replies (that has to do with viewer rating?), but then (Mike Rugnetta? Staff?) go through some sample of the comments looking for those that are worth incorporating into the show? Oh, wait, are the comments on Reddit? Or is that where we go to comment on the comments? I’m confused. There does seem to be more dialog on Reddit. Is this cultural? Design? Both?
- After poking around a bit, I discovered that YouTube creators have special tools to search through their comments: https://www.youtube.com/comments
More rating tool stuff
- Working on Blacklist. Kinda done? The JPA query that uses LIKE isn’t behaving in the way I think it should. Using ‘flaggable match’ for now instead of ‘match’. Oh. Duh. You need to use ‘%’ to indicate the match %pattern%. Now I’m done.
- Create a loop that changes all the QueryObjects so that qo.getUnquotedName() is used and persist. Done.
- Moving on to eliminating redundant URLs that have the same rating per person (maybe also start skipping when the same rating for two people?)
- I think it’s done – need to test. I’m a bit worried about recursion in loadNextPage/loadNextQuery. Might have to clean that up a bit.
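The LIKE fix noted above can be sketched as a small helper that wraps a term in `%` wildcards before it is bound as a query parameter. This is an illustration only: the class name, the `RatingObject` entity in the JPQL string and the choice of `!` as the escape character are assumptions, not the app's actual code.

```java
// Hypothetical helper for building a "contains" pattern for a JPQL LIKE query.
class LikePattern {

    // Illustrative JPQL; RatingObject and its rating field are assumed names.
    static final String CONTAINS_JPQL =
            "SELECT r FROM RatingObject r WHERE r.rating LIKE :pattern ESCAPE '!'";

    // Wraps the term in % wildcards, escaping any wildcard characters inside it.
    static String contains(String term) {
        String escaped = term
                .replace("!", "!!")   // escape the escape character first
                .replace("%", "!%")   // literal percent sign
                .replace("_", "!_");  // literal underscore
        return "%" + escaped + "%";
    }
}
```

With JPA this would be bound along the lines of `query.setParameter("pattern", LikePattern.contains("match"))`, which also matches ‘flaggable match’.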

Phil 2.25.16

7:00 – 5:00 VTX

Thinking more about the economics of contributing trustworthy information. Recently, I’ve discovered the PBS Idea Channel, which is a show that explores pop culture with a philosophical bent (LA Times review here). For example, Deadpool is explored from a phenomenology perspective. But what’s really interesting and seems unique to me is the relationship of the show with its commenters. For each show, there is a follow-on show where the most interesting comments are discussed by the host, Mike Rugnetta. And the comments are surprisingly cogent and good. I think that this is because Rugnetta is acting like the anchor of an interactive news program where the commenters are the reporters. He sets up the topic, gets the ball rolling, and then incorporates the best comments (stories) to wrap up the story. Interestingly, in a recent comment section on aesthetic (which I can’t find now?), he brings up a comment about science and philosophy, invites the commenter into a deeper discussion, and also discusses the potential of an episode about that.
To get a flavor, here’s one of the longer comments (with 25 replies on its own) from the Deadpool show:
TasteDatRainbow1 week ago

I could actually buy that DeadPool’s ability to understand the medium he’s in if it weren’t for one thing he does very often: references to our world. If his fourth wall breaks were limited to interacting with the panels, making quips and nods about the idea of “readers”, and joking about general comic book (or video game or movie) tropes, then I’d be on board with the idea that he is hyper-aware due to his constant physical torment and knowledge of his own perceptions. however, he somehow has knowledge of things that do not seem to exist in the world he inhabits, such as memes, pop culture references, and things like “Leeroy Jenkins”. His hypersensitivity can explain his knowledge of the medium he’s in (an integral part of the reality he inhabits), but I don’t see a way that it could explain him knowing about things that, as far as I’m aware, do not exist in his reality.
Compare that to the comments for the MIT OpenCourseWare intro to MIT 6.034, which I ‘took’ and found well presented and deeply interesting, though not as flashy. Here’s a rough equivalent (with 21 replies):
DroidSage1 year ago

wow ..it’s such an overwhelming feeling for a guy like me ..who had no chance in hell of ever getting into MIT or any other ivy’s to be able to listen and learn from this lectures online and that too free. :’)
To me, it seems like the Deadpool post is deeply involved with the subject matter of the episode, while the MIT comment is more typical of a YouTube comment in that it is more about the commenter and less about the content. This does imply that working on providing value to good commenting through inclusion in the content of the show can improve the quality and relevance of the comments.
To continue the ‘News Anchor’ thought from above, it might be possible to structure a news entity of some kind where different areas (sports, entertainment, local/regional, etc) could have their own anchors that produce interactive content with their commenters. Some additional capability to handle multimedia uploads from commenters should probably be supported and better navigation, but this sounds more to me like a 21st century news product than many other things that I’ve seen. It’s certainly the opposite of the Sweden paper.
And speaking of papers, here’s one on YouTube comments: Commenting on YouTube Videos: From Guatemalan Rock to El Big Bang
Starting on Incentivizing High-quality User-Generated Content.
- References look really good. Only 8? For a WWW paper?
- This is starting to look like what I was trying to find. Nash Equilibrium. Huh. The model predicts, as observed in practice, that if exposure is independent of quality, there will be a flood of low quality contributions in equilibrium. An ideal mechanism in this context would elicit both high quality and high participation in equilibrium.
Need to add ‘change password’ option. Done. And now that I know my way around JPA, I like it a lot.
Added role-based enabling of menu choices
The code base could really use a cleanup. We have the classic research->production problem…
Adding match/nomatch and blacklist queries. Note that blacklist needs to be by search engine
- Finished match
- Finished nomatch
- Working on Blacklist
- Create a loop that changes all the QueryObjects so that qo.getUnquotedName() is used and persist.

Phil 2.24.16

7:00 – 4:00 VTX

Continuing Information Rules – A Strategic Guide to the Network Economy.
- This is feeling very much like a book about the pricing of commercial items with an emphasis on the profit motive. But this assumes that the profit motive is the primary motivation, which is certainly the case for business, but not necessarily for individual information producers. Consider the effect of ‘badges’ with user generated content. Honor and Glory have value, particularly if the visible rewards are scarce?
- Went back to Wayne’s email and poked around on the Cornell link from Michael Macy, which led to two new papers to take a look at:
  - The evolution of trust and cooperation between strangers: A computational model
  - Network Formation in the Presence of Contagious Risk – Larry Blume
- This also looks promising, based on the search string ‘user generated content incentives’: Incentivizing high-quality user-generated content
Need to add a document classification story. The notes should point at The Hybrid Representation Model for Web Document Classification. Basically need to implement that so that the machine learning algorithms have something to work with. Done
Adding users and roles to the rating app – Added users and got the query working. Time to work on the roles.
Disable ratings but not reports when no one is logged in
Oh, good grief. I forgot about how if you want a fancy dialog in Swing, you’ve got to extend the whole thing. Flashback city, daddy-o.

Phil 2.23.16

7:00 – 3:30 VTX

Much needed vacation is now history. I started Information Rules – A Strategic Guide to the Network Economy. Probably not going to read the whole book, but it does address the economics issues I’m thinking about, though with more of a focus on financial transactions. For example, it discusses how people place different values on information, which makes me think about the differences between Sweden, Egypt and Turkey, as well as Crisis Informatics in general.
- From my notes: This is why there is no blogging in Sweden. Since the reporting is good enough for most people, the only reason people blogged was to cover things that weren’t in the news – personal expression or similar arcana. Where news is not available, the value of this kind of information goes up, and people who respond to the perceived need step in to fill the gap. This is important – an individual can have an information need, but also a perception of information needs in others, and have a need/desire to provide for that need.
And based on one of Paul Krugman’s blog entries, I went and found the Wikipedia entry on information economics, which looks like it will be worth looking at. This part in particular leaped out at me: The subject of “information economics” is treated under Journal of Economic Literature classification code JEL D8 – Information, Knowledge, and Uncertainty. The present article reflects topics included in that code. There are several subfields of information economics. Information as signal has been described as a kind of negative measure of uncertainty. It includes complete and scientific knowledge as special cases. The first insights in information economics related to the economics of information goods.
Submitting paperwork for CHIR
And, back to normal… Continue to refine the rating app?
- Make uploading a super user thing. Which means user accounts and passwords. Probably add everyone to a DB and just let them put in/change passwords.
- Add code to scan the DB for previous pages that had the same rating for the same doctor (and the same term?)
- Add an analytics app that looks for ratings that disagree, either as outliers (watch out for that reviewer) or there is disagreement (are we having problems with terms, fuzzy matching, or what?)
- Add a second app that tags the ontology onto the ‘Flaggable Match’
- Write up a guidance manual for edge conditions. Comes up when you click ‘help’
- Add a ‘total MATCH’ search. That shows how many relevant documents were returned
- Add a ‘total NO MATCH’ search. That shows how many non-relevant documents were returned – basically
```
select search_type, count(*) as matches, total_results from view_rated_items2 where rating NOT LIKE '%match%' group by search_type;
```
- Add a blacklist query that lists all root domains that only show up in non-match results
- Incorporate Flywaydb
  - Verified that I can generate just the table structure with mysqldump: mysqldump -u xxx -pyyy -d googlecse1 > gcse1Tables.sql
- Get DB deployed somewhere and validate – talk to Damien and specify what’s needed. He’ll cost out hours. Done
- Build a web repo that contains gold standard data that we can point a special test GoogleCSE and keep track of return changes.
- Machine Learning framework
  - Get back up to speed on WEKA
  - Probably have to write some Java data translator generator code
  - Run some tests, get some results in the interactive mode
  - Redo programmatically, so a collection of URLs (text? Yeah, extracted text. Compare Stanford and Alchemy?)
  - Data flow:
    - Raw pages,
    - Cleaned content
    - Machine learning (per provider?) returns scored pages
    - Extraction of flags from highly-ranked pages
Took all of the above and rolled it into stories. For points I built an Excel spreadsheet. Turns out that Excel doesn’t have Fibonacci, so I used this version.

Phil 2.18.16

7:00 – 6:00 VTX

Continuing Connectivism: A Learning Theory for the Digital Age. Done. Very interesting. It’s George Siemens’ most-cited work at over 3,100 citations. He talks about how blogs, which are a feature of the new information networks, are challenging traditional media, but I don’t think that’s exactly it.

I think that this is more an issue of information economics. The incentives in social publication are honor, glory and followers. Maybe some money from ad revenue sharing (though this is changing?). Traditional news media offers a more direct model where the product (news) is sold to readers and/or advertisers so that the news-making product can be made.

Connectivism states that there is now an emphasis on learning how to find information as opposed to knowing the information (since information obsolescence happens more rapidly, the value of the information is lower than the knowledge of how to find current knowledge).

Traditional news media tends to aggregate information to produce stories because that makes learning entertaining and worth the price paid (cash or time watching commercials). However, if the friction to finding free alternatives of the initial information for the story is low, then the value of the story becomes lower, since now all you’re paying for is a pleasing presentation.

Blogs and other free sources make this more difficult for the consumer, since what appears credible may not be, but may be confused with an actual information source nonetheless. Or, looking at confirmation bias, a free pleasing story may have higher value for a consumer than a (non-free) well researched story that disputes the reader’s beliefs.

There is also an emotional cost for checking rumors that you agree with. Going to Snopes to find out that the politician that you hate didn’t actually do that stupid thing you just saw in your feed. So the traditional few-channel media is being subsumed by networks that we construct to support our biases?

Banged away at the white paper. Done! Off to Key West for a long weekend!

Phil 2.17.16

7:00 – 5:00 VTX

Starting to list strawman hypotheses
Reading Connectivism paper. Very good so far.
Albert-László Barabási – publications Google Scholar Profile
LexRank: graph-based lexical centrality as salience in text
Talked to Thresea about the human rating app/results and sent her this article on Schema.org
Add doctor disambiguation popup – done

Add a ‘total results’ search. That shows how many relevant documents exist.

MariaDB [googlecse1]> select distinct search_type, total_results from query_object where total_results > 0 order by total_results desc;
+------------------------------------------------+---------------+
| search_type                                    | total_results |
+------------------------------------------------+---------------+
| RESTRICTED_COM(Ram Singh: board actions)       |         12600 |
| RESTRICTED_COM(Ram Singh: criminal)            |          7490 |
| ALL_ORG(Ram Singh: board actions)              |          4200 |
| BASELINE(Ram Singh: board actions)             |          3360 |
| BASELINE(Ram Singh: criminal)                  |          1880 |
| RESTRICTED_COM(Ram Singh: sanctions)           |          1580 |
| ALL_ORG(Ram Singh: criminal)                   |          1390 |
| ALL_ORG(Ram Singh: sanctions)                  |           539 |
| ALL_GOV(Ram Singh: board actions)              |           401 |
| BASELINE(Ram Singh: sanctions)                 |           284 |
| ALL_US(Ram Singh: board actions)               |           157 |
| ALL_EDU(Ram Singh: criminal)                   |           126 |
| ALL_EDU(Ram Singh: board actions)              |           125 |
| RESTRICTED_COM(Ram Singh: malpractice)         |           108 |
| ALL_US(Ram Singh: criminal)                    |           103 |
| ALL_GOV(Ram Singh: criminal)                   |            57 |
| ALL_EDU(Ram Singh: sanctions)                  |            50 |
| BASELINE(Ram Singh: malpractice)               |            34 |
| ALL_ORG(Ram Singh: malpractice)                |            31 |
| ALL_GOV(Ram Singh: sanctions)                  |            15 |
| RESTRICTED_COM(Russell Johnson: criminal)      |             9 |
| ALL_US(Ram Singh: sanctions)                   |             8 |
| RESTRICTED_COM(Tommy Osborne: criminal)        |             8 |
| ALL_EDU(Ram Singh: malpractice)                |             7 |
| RESTRICTED_COM(Russell Johnson: board actions) |             7 |
| RESTRICTED_COM(Tommy Osborne: board actions)   |             7 |
| RESTRICTED_COM(Tommy Osborne: malpractice)     |             7 |
| ALL_ORG(Tommy Osborne: board actions)          |             5 |
| ALL_GOV(Ram Singh: malpractice)                |             4 |
| ALL_US(Ram Singh: malpractice)                 |             4 |
| ALL_ORG(Tommy Osborne: malpractice)            |             3 |
| BASELINE(Tommy Osborne: board actions)         |             3 |
| BASELINE(Tommy Osborne: malpractice)           |             3 |
| ALL_GOV(Tommy Osborne: board actions)          |             2 |
| ALL_GOV(Tommy Osborne: criminal)               |             2 |
| ALL_GOV(Tommy Osborne: malpractice)            |             2 |
| ALL_GOV(Tommy Osborne: sanctions)              |             2 |
| ALL_ORG(Tommy Osborne: criminal)               |             2 |
| RESTRICTED_COM(Tommy Osborne: sanctions)       |             2 |
| BASELINE(Tommy Osborne: criminal)              |             1 |
| BASELINE(Tommy Osborne: sanctions)             |             1 |
| RESTRICTED_COM(Russell Johnson: malpractice)   |             1 |
| RESTRICTED_COM(Russell Johnson: sanctions)     |             1 |
+------------------------------------------------+---------------+
43 rows in set (0.00 sec)

Need to run about 30 doctors through the system to get statistical significance for making recommendations
CommonCrawl vs. Google approximation. For this analysis, I listed all the domains that produced a ‘flaggable match’ and fed them into the common crawl index search for November 2015 (the most recent at the time of this writing). In the results listed below, the number indicates the number of blocks stored in the CommonCrawl. A value of zero indicates that the CommonCrawl index did not contain any reference to that domain:
```
1 - w3.health.state.ny.us
6 - www.consumerwatchdog.org
2 - law.resource.org
3 - www.ncmedboard.org
40 - caselaw.findlaw.com
0 - www.courtlistener.com
1 - www.rfhha.org
1 - www.dhp.virginia.gov
2 - www.vahealthprovider.com
0 - w3.nyhealth.gov
2 - medboard.nv.gov
2 - www.courts.state.va.us
0 - www.physicianus.org
0 - wwwapps.ncmedboard.org
240 - www.healthgrades.com
0 - www.dos.pa.gov
3 - law.justia.com
3 - ezdoctor.com
```
As can be seen, 5 out of 18 domains, or approximately 28% of the domains containing useful information, are missing. Of the remaining sites, it is an open question as to whether the crawl contains the full data from the site.
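For the record, the missing fraction can be recomputed from the listing above with a few lines of Java. The class and method names are made up for this sketch; the block counts are copied straight from the list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: share of 'flaggable match' domains absent from the CommonCrawl index.
class CrawlCoverage {

    // Domain -> number of blocks found in the November 2015 index (from the listing above).
    static Map<String, Integer> blocks() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("w3.health.state.ny.us", 1);
        m.put("www.consumerwatchdog.org", 6);
        m.put("law.resource.org", 2);
        m.put("www.ncmedboard.org", 3);
        m.put("caselaw.findlaw.com", 40);
        m.put("www.courtlistener.com", 0);
        m.put("www.rfhha.org", 1);
        m.put("www.dhp.virginia.gov", 1);
        m.put("www.vahealthprovider.com", 2);
        m.put("w3.nyhealth.gov", 0);
        m.put("medboard.nv.gov", 2);
        m.put("www.courts.state.va.us", 2);
        m.put("www.physicianus.org", 0);
        m.put("wwwapps.ncmedboard.org", 0);
        m.put("www.healthgrades.com", 240);
        m.put("www.dos.pa.gov", 0);
        m.put("law.justia.com", 3);
        m.put("ezdoctor.com", 3);
        return m;
    }

    // Number of domains with no blocks at all in the index.
    static long missing(Map<String, Integer> blocks) {
        return blocks.values().stream().filter(count -> count == 0).count();
    }

    // Missing domains as a percentage of all listed domains.
    static double missingPercent(Map<String, Integer> blocks) {
        return 100.0 * missing(blocks) / blocks.size();
    }
}
```

`CrawlCoverage.missingPercent(CrawlCoverage.blocks())` comes out at roughly 27.8.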

Here are the ratios of hits (pertinence) to search results (relevance)

search type                                     pertinence  relevance     ratio
ALL_GOV(Tommy Osborne: board actions)                    2          2   100.00%
ALL_GOV(Tommy Osborne: criminal)                         2          2   100.00%
ALL_GOV(Tommy Osborne: malpractice)                      2          2   100.00%
ALL_GOV(Tommy Osborne: sanctions)                        2          2   100.00%
BASELINE(Tommy Osborne: criminal)                        1          1   100.00%
BASELINE(Tommy Osborne: sanctions)                       1          1   100.00%
RESTRICTED_COM(Russell Johnson: malpractice)             1          1   100.00%
ALL_ORG(Tommy Osborne: malpractice)                      2          3    66.67%
ALL_ORG(Tommy Osborne: board actions)                    3          5    60.00%
RESTRICTED_COM(Tommy Osborne: board actions)             4          7    57.14%
ALL_GOV(Ram Singh: malpractice)                          2          4    50.00%
RESTRICTED_COM(Tommy Osborne: sanctions)                 1          2    50.00%
BASELINE(Tommy Osborne: board actions)                   1          3    33.33%
BASELINE(Tommy Osborne: malpractice)                     1          3    33.33%
RESTRICTED_COM(Russell Johnson: board actions)           2          7    28.57%
RESTRICTED_COM(Tommy Osborne: malpractice)               2          7    28.57%
ALL_US(Ram Singh: malpractice)                           1          4    25.00%
ALL_GOV(Ram Singh: sanctions)                            2         15    13.33%
RESTRICTED_COM(Tommy Osborne: criminal)                  1          8    12.50%
ALL_ORG(Ram Singh: malpractice)                          3         31     9.68%
ALL_GOV(Ram Singh: criminal)                             1         57     1.75%
ALL_GOV(Ram Singh: board actions)                        4        401     1.00%
ALL_US(Ram Singh: criminal)                              1        103     0.97%
RESTRICTED_COM(Ram Singh: malpractice)                   1        108     0.93%
ALL_ORG(Ram Singh: criminal)                             2       1390     0.14%
ALL_ORG(Ram Singh: board actions)                        3       4200     0.07%
RESTRICTED_COM(Ram Singh: criminal)                      2       7490     0.03%
RESTRICTED_COM(Ram Singh: board actions)                 2      12600     0.02%
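
The ratio column appears to be simply the pertinence count divided by the relevance count, formatted as a percentage to two decimals; that assumption matches every row above. A small Java sketch (hypothetical class and method names) that reproduces it:

```java
import java.util.Locale;

// Hypothetical helper: pertinent hits as a percentage of total results.
class PertinenceRatio {

    // Formats pertinent/total as a percentage with two decimal places.
    static String percent(long pertinent, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        // Locale.ROOT keeps the decimal separator a '.' regardless of system locale.
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * pertinent / total);
    }
}
```

This could replace the Excel step if the counts are pulled directly from the database.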

Phil 2.16.16

7:00 – 4:00 VTX

Interesting stuff from Stephen Wolfram’s blog: Data Science of the Facebook World. Makes me wonder if you can infer age and gender from writing. Is this global or just US?
Meeting today with Wayne at 4:00
Added Config load
Added provider load
Added query generation. I realized that there is no need to generate a new query that is the same as a query that has never been run, so after generating all the potential new queries, I compare them to the untested list and remove any common items before persisting.
- HOWEVER, while doing the calculation, I was adding all the QueryObjects to the ProviderObjects and then deleting them, so that when I persisted, I was adding HUGE numbers of lines. Moved the testing around so that it happens before a potential QueryObject is created.
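The "test before creating" idea above can be sketched as a filter over candidate query strings, so that nothing is instantiated or attached to a provider unless it is actually new. This is a simplified illustration that uses plain strings in place of QueryObjects; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: keep only candidate queries that are not already waiting to be run.
class QueryDeduper {

    // Returns candidates that are absent from the untested set, in order, without repeats.
    static List<String> onlyNew(List<String> candidates, Set<String> untested) {
        Set<String> fresh = new LinkedHashSet<>();
        for (String candidate : candidates) {
            if (!untested.contains(candidate)) {
                fresh.add(candidate); // LinkedHashSet drops duplicates but keeps order
            }
        }
        return new ArrayList<>(fresh);
    }
}
```

Only the strings that survive this filter would then be turned into QueryObjects and persisted.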

Phil 2.15.16

7:30 – 1:30 VTX

Wound up writing a lot of thoughts about Participatory journalism – the (r)evolution that wasn’t. Content and user behavior in Sweden 2007–2013. I also found a really interesting paper on Connectivism that might tie a lot of my thinking together. News is a form of ongoing education, and education is a form of fragment synthesis. Human-Centered Information ecologies produce fragment synthesis as their (main?) output. Connectivism might tie all that together.
It looks like my laptop had to go through a major reboot. Waiting to get the latest uploaded to SVN…
Back to reading the config file. Done. Need to fold it in to the UserFeedback app.

Phil 2.12.16

6:30 – 4:30 VTX

Continuing Participatory journalism – the (r)evolution that wasn’t. Content and user behavior in Sweden 2007–2013
Create xml configuration file
Integrate Flyway?
Meeting on rating tool. Thoughts:
- Add an ‘I goofed’ button to the GUI (or maybe a ‘back’ button that lets you change the rating?)
- Add more info that pops up about the medical provider.
- Add an analytics app that looks for ratings that disagree, either as outliers (watch out for that reviewer) or there is disagreement (are we having problems with terms, fuzzy matching, or what?)
- Add a second app that tags the ontology onto the ‘Flaggable Match’
- Write up a guidance manual for edge conditions. Comes up when you click ‘help’
- When a URL comes up that has already been reviewed more than N times and the reviews match substantially (A majority? – means odd numbers of reviews) for the same provider, don’t run that result item, just add a copy of the rating object with the name of (‘computed’)
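One possible reading of the last item above, sketched in Java: if a URL has more than N ratings for the same provider and one label holds a strict majority, return that label so it can be stored as a ‘computed’ rating. The class name, method name and the strict-majority rule are assumptions for illustration, not a settled design.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: decide whether existing ratings are enough to skip a human review.
class ComputedRating {

    // Returns the majority label when there are more than n ratings and one label
    // accounts for more than half of them; otherwise returns empty.
    static Optional<String> majority(List<String> ratings, int n) {
        if (ratings.size() <= n) {
            return Optional.empty(); // not reviewed often enough yet
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String rating : ratings) {
            counts.merge(rating, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() * 2 > ratings.size()) {
                return Optional.of(entry.getKey()); // at most one label can exceed half
            }
        }
        return Optional.empty(); // tie or no clear winner: send to a human
    }
}
```

A tie returns empty, which sidesteps the odd-number-of-reviews question by falling back to a human rating.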
Return from NJ

Phil 2.11.16

6:00 – 4:00 VTX

Continuing Participatory journalism – the (r)evolution that wasn’t. Content and user behavior in Sweden 2007–2013
Need to see if I can get this on Monday: Rethinking Journalism: trust and participation in a transformed news landscape. Got the kindle book.
Need to add a menubar to the GUI app that has a ‘data’ and ‘queries’ tab. Data runs the data generation code. Queries has a list of questions that clears the output and then sends the results to the text area.
Still need to move the db to a server. Just realized that it could be a MySQL db on Dreamhost too. Having trouble with that. It might be the eclipse jar? Here’s the Hibernate jar location in Maven:
```
<dependency>
    <groupId>org.hibernate.javax.persistence</groupId>
    <artifactId>hibernate-jpa-2.0-api</artifactId>
    <version>1.0.1.Final</version>
</dependency>
```
Gave up on connecting to Dreamhost. I think it’s a permissions thing. Asked Heath to look into creating a stable DB somewhere. He needs to talk to Damien.
Webhose.io – direct access to live & structured data from millions of sources.
Search by date: https://support.google.com/news/answer/3334?hl=en
- Google news search that produces JSON for the last 24 hours:
```
?q=malpractice&safe=off&hl=en&gl=us&authuser=0&tbm=nws&source=lnt&tbs=qdr:d
```
Played around with a bunch of queries, but in the end, I figured that it was better to write the whole works out in a .csv file and do pivot tables in Excel.
Adding the ability to read a config file to set the search engines, labels, etc. for generation.

Data Architecture Meeting 2.11.16

Testing what we have

Relevance score
Pertinence score
Charts for management

Vinny

Terminology
gov
Bias towards trustworthy unstructured sources.
What about getting structured data.

Aaron

Isolate V1 capability
Metrics!
We need the structured data!!

Matt

Dsds

Scott

Questions about unstructured query

Phil 2.10.16

8:00 – 6:00 VTX

Finished Anonymity Loves Company – Anonymous Web Transactions with Crowds
Figured out how to use code families. Not obvious at all from the documentation (too many types of families!), but obvious once you see it. Just select one or more codes in the code manager, right-click in the ‘family’ pane and select ‘New from Selected Items’
Enough with the cryptography and back to people! Participatory journalism – the (r)evolution that wasn’t. Content and user behavior in Sweden 2007–2013
Up to NJ with Aaron for the rest of the week.
Start adding capability to rate existing query results. Done

Some output!

MariaDB [googlecse1]> select search_type, display_link, rating, date_rated, user_name from view_rated_items order by rating;
+-------------------------------------+------------------------------+-----------------+---------------------+-----------+
| search_type                         | display_link                 | rating          | date_rated          | user_name |
+-------------------------------------+------------------------------+-----------------+---------------------+-----------+
| ALL_ORG(Ram Singh: malpractice)     | www.consumerwatchdog.org     | flaggable match | 2016-02-10 15:43:38 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | caselaw.findlaw.com          | flaggable match | 2016-02-10 15:37:25 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | www.consumerwatchdog.org     | flaggable match | 2016-02-10 15:26:19 | Phil      |
| ALL_US(Ram Singh: criminal)         | w3.health.state.ny.us        | flaggable match | 2016-02-10 15:17:02 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | www.consumerwatchdog.org     | flaggable match | 2016-02-10 15:33:06 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | law.resource.org             | flaggable match | 2016-02-10 15:27:10 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | www.courtlistener.com        | flaggable match | 2016-02-10 15:39:12 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | www.ncmedboard.org           | flaggable match | 2016-02-10 15:31:59 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | law.resource.org             | flaggable match | 2016-02-10 15:32:12 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | www.rfhha.org                | flaggable match | 2016-02-10 15:43:25 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | www.ncmedboard.org           | flaggable match | 2016-02-10 15:44:43 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | www.alasu.edu                | legal           | 2016-02-10 15:36:26 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | imageserver.library.yale.edu | legal           | 2016-02-10 15:36:28 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | www.academia.edu             | legal           | 2016-02-10 15:35:44 | Phil      |
| ALL_US(Ram Singh: criminal)         | www.co.jefferson.tx.us       | legal           | 2016-02-10 15:16:41 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | indiankanoon.org             | legal           | 2016-02-10 15:25:51 | Phil      |
| ALL_US(Ram Singh: criminal)         | docslide.us                  | legal           | 2016-02-10 15:15:23 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | archive.org                  | legal           | 2016-02-10 15:45:13 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | indiankanoon.org             | legal           | 2016-02-10 15:26:00 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | indiankanoon.org             | legal           | 2016-02-10 15:32:34 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.legalindia.com           | legal           | 2016-02-09 14:57:59 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | www.norcobar.org             | legal           | 2016-02-10 15:40:44 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | www.indianbarassociation.org | legal           | 2016-02-10 15:34:02 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | indiankanoon.org             | legal           | 2016-02-10 15:30:54 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | www.indiankanoon.com         | legal           | 2016-02-10 15:38:38 | Phil      |
| ALL_US(Ram Singh: board actions)    | docslide.us                  | legal           | 2016-02-09 14:59:35 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | indiankanoon.org             | legal           | 2016-02-10 15:43:52 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | ww3.lawschool.cornell.edu    | legal           | 2016-02-10 15:36:20 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | www.clarkcountymedical.org   | match           | 2016-02-10 15:41:51 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.healthgrades.com         | match           | 2016-02-09 14:57:29 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | www.intelius.com             | match           | 2016-02-10 15:38:22 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | jmidlifehealth.org           | medical         | 2016-02-10 15:44:17 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | mic.com                      | Not appropriate | 2016-02-10 15:37:09 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | indiankanoon.org             | Not appropriate | 2016-02-10 15:42:24 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | www.vacouncilofchurches.org  | Not appropriate | 2016-02-10 15:33:18 | Phil      |
| ALL_ORG(Ram Singh: malpractice)     | www.pbs.org                  | Not appropriate | 2016-02-10 15:45:57 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | wtkr.com                     | Not appropriate | 2016-02-10 15:39:23 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | www.law.fsu.edu              | Not appropriate | 2016-02-10 15:34:38 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | modelminority.com            | Not appropriate | 2016-02-10 15:38:56 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | www.alasu.edu                | Not appropriate | 2016-02-10 15:34:42 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | wiki.verkata.com             | Not appropriate | 2016-02-10 15:38:30 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | www.facebook.com             | Not appropriate | 2016-02-10 15:37:55 | Phil      |
| RESTRICTED_COM(Ram Singh: criminal) | search.ancestry.com          | Not appropriate | 2016-02-10 15:37:40 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | www.academia.edu             | Not appropriate | 2016-02-10 15:35:18 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | lists.washlaw.edu            | Not appropriate | 2016-02-10 15:36:36 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | lists.washlaw.edu            | Not appropriate | 2016-02-10 15:35:53 | Phil      |
| ALL_EDU(Ram Singh: criminal)        | www.utexas.edu               | Not appropriate | 2016-02-10 15:34:55 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | netsecu.org                  | Not appropriate | 2016-02-10 15:32:47 | Phil      |
| ALL_US(Ram Singh: board actions)    | www.gutenberg.us             | Not appropriate | 2016-02-09 14:59:57 | Phil      |
| ALL_US(Ram Singh: board actions)    | www.leg.state.mn.us          | Not appropriate | 2016-02-09 14:59:13 | Phil      |
| ALL_US(Ram Singh: board actions)    | www.nhusd.k12.ca.us          | Not appropriate | 2016-02-09 14:59:02 | Phil      |
| ALL_US(Ram Singh: board actions)    | www.acoe.k12.ca.us           | Not appropriate | 2016-02-09 14:58:59 | Phil      |
| ALL_US(Ram Singh: board actions)    | www.nhusd.k12.ca.us          | Not appropriate | 2016-02-09 14:58:30 | Phil      |
| ALL_US(Ram Singh: board actions)    | datab.us                     | Not appropriate | 2016-02-09 14:58:16 | Phil      |
| ALL_US(Ram Singh: board actions)    | newweb.altoona.k12.wi.us     | Not appropriate | 2016-02-09 14:58:11 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.linkedin.com             | Not appropriate | 2016-02-09 14:57:11 | Phil      |
| BASELINE(Ram Singh: board actions)  | en.wikipedia.org             | Not appropriate | 2016-02-09 14:57:06 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.dailymail.co.uk          | Not appropriate | 2016-02-09 14:57:02 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.ndtv.com                 | Not appropriate | 2016-02-09 14:56:56 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.india.com                | Not appropriate | 2016-02-09 14:56:52 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.firstpost.com            | Not appropriate | 2016-02-09 14:52:41 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.youtube.com              | Not appropriate | 2016-02-09 14:48:13 | Phil      |
| ALL_US(Ram Singh: board actions)    | www.curatedobject.us         | Not appropriate | 2016-02-09 15:00:04 | Phil      |
| ALL_US(Ram Singh: board actions)    | datab.us                     | Not appropriate | 2016-02-09 15:00:10 | Phil      |
| ALL_US(Ram Singh: criminal)         | www.curatedobject.us         | Not appropriate | 2016-02-10 15:14:14 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | www.acoe.org                 | Not appropriate | 2016-02-10 15:31:06 | Phil      |
| ALL_ORG(Ram Singh: board actions)   | en.wikipedia.org             | Not appropriate | 2016-02-10 15:30:21 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | fr.wikipedia.org             | Not appropriate | 2016-02-10 15:28:13 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | en.wikipedia.org             | Not appropriate | 2016-02-10 15:26:40 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | www.vacouncilofchurches.org  | Not appropriate | 2016-02-10 15:26:35 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | ca.wikipedia.org             | Not appropriate | 2016-02-10 15:25:21 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | en.wikisource.org            | Not appropriate | 2016-02-10 15:24:59 | Phil      |
| ALL_ORG(Ram Singh: criminal)        | ca.wikipedia.org             | Not appropriate | 2016-02-10 15:24:43 | Phil      |
| ALL_US(Ram Singh: criminal)         | hodges-directory.us          | Not appropriate | 2016-02-10 15:18:46 | Phil      |
| ALL_US(Ram Singh: criminal)         | docslide.us                  | Not appropriate | 2016-02-10 15:15:52 | Phil      |
| ALL_US(Ram Singh: criminal)         | www.nhusd.k12.ca.us          | Not appropriate | 2016-02-10 15:15:37 | Phil      |
| ALL_US(Ram Singh: criminal)         | www.nhusd.k12.ca.us          | Not appropriate | 2016-02-10 15:15:34 | Phil      |
| ALL_US(Ram Singh: criminal)         | www.acoe.k12.ca.us           | Not appropriate | 2016-02-10 15:15:31 | Phil      |
| ALL_US(Ram Singh: criminal)         | www.gutenberg.us             | Not appropriate | 2016-02-10 15:14:33 | Phil      |
| BASELINE(Ram Singh: board actions)  | www.firstpost.com            | Not appropriate | 2016-02-09 14:46:59 | Phil      |
+-------------------------------------+------------------------------+-----------------+---------------------+-----------+
80 rows in set (0.02 sec)

Phil 2.9.16

7:00 – 4:00 VTX

Finished Publius: A robust, tamper-evident, censorship-resistant web publishing system
Starting Anonymity Loves Company – Anonymous Web Transactions with Crowds by Mike Reiter and Aviel Rubin, who was one of the co-authors on the Publius paper.
- Crowds could probably be built with PeerJS. The ISP would still know traffic, but that’s it.
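A minimal sketch of the forwarding rule as I read it from the Crowds paper: the initiator always hands the request to a random member, and each member after that flips a biased coin to decide whether to forward again or submit to the server. The class and method names here are hypothetical, and this only simulates the path; it does no networking and has nothing to do with PeerJS itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical simulation of Crowds-style forwarding; names are illustrative only.
class CrowdsSketch {
    /**
     * Returns the sequence of crowd member indices a request passes through
     * before it is submitted to the end server.
     * forwardProbability must be below 1.0 or the loop may never end.
     */
    static List<Integer> buildPath(int initiator, int crowdSize,
                                   double forwardProbability, Random rng) {
        List<Integer> path = new ArrayList<>();
        path.add(initiator);
        // The initiator always forwards to a randomly chosen member (possibly itself).
        path.add(rng.nextInt(crowdSize));
        // Each later member flips a biased coin: forward again, or submit to the server.
        while (rng.nextDouble() < forwardProbability) {
            path.add(rng.nextInt(crowdSize));
        }
        return path;
    }

    public static void main(String[] args) {
        List<Integer> path = buildPath(0, 10, 0.75, new Random());
        System.out.println("Path through the crowd: " + path);
    }
}
```

The end server only ever sees the last member on the path, which is why the receiver cannot tell who the initiator was.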
Found this nice article in Communications of the ACM: Schema.org: Evolution of Structured Data on the Web. Nice overview. Very current.
The Big List of Naughty Strings
Time to combine everything
- Optional generation of Providers and queries – default is to load them from the DB
- Run queries from the DB
  - Show the number available and allow a request – done
  - Iterating over the queries and pages. Need to create, append and persist a rating – done
  - Named queries for
    - Queries that have the lowest number of results.ratings – done-ish. Currently it looks for -1 as a flag. Should also look for queries that have unrated results.
    - Queries associated with ‘bad’ providers
    - Queries associated with ‘good’ providers
  - Connect to DB remotely
- Wrap the app (done, with Launch4j. Very nice!) and test it on the other laptop. Note: it doesn’t have enough disk space to install Java, so that will have to wait.
- Packing up the laptop. Debating bringing multi monitor support. I’ll have the other laptop…
- Gratuitous screenshot:
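Going back to the named query for under-rated queries above, here is a rough in-memory sketch of the selection rule: keep the queries that still have unrated results (flagged with -1) and order them by how few ratings they have. `StoredQuery` and its fields are hypothetical stand-ins for the persisted objects; the real version would presumably be a named query against the DB rather than a Java loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for the persisted query/result objects.
class QueryPicker {
    static final int UNRATED = -1; // the flag value mentioned in the notes

    static class StoredQuery {
        final String text;
        final int[] resultRatings; // one entry per result; UNRATED if not yet rated

        StoredQuery(String text, int... resultRatings) {
            this.text = text;
            this.resultRatings = resultRatings;
        }

        long ratedCount() {
            return Arrays.stream(resultRatings).filter(r -> r != UNRATED).count();
        }

        boolean hasUnratedResults() {
            return Arrays.stream(resultRatings).anyMatch(r -> r == UNRATED);
        }
    }

    /** Queries that still need ratings, fewest rated results first. */
    static List<StoredQuery> needingRatings(List<StoredQuery> all) {
        List<StoredQuery> out = new ArrayList<>();
        for (StoredQuery q : all) {
            if (q.hasUnratedResults()) {
                out.add(q);
            }
        }
        out.sort(Comparator.comparingLong(StoredQuery::ratedCount));
        return out;
    }

    public static void main(String[] args) {
        List<StoredQuery> all = Arrays.asList(
                new StoredQuery("query A", 1, 3, UNRATED),
                new StoredQuery("query B", UNRATED, UNRATED),
                new StoredQuery("query C", 2, 4));
        for (StoredQuery q : needingRatings(all)) {
            System.out.println(q.text + " has " + q.ratedCount() + " rated results");
        }
    }
}
```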

Phil 2.8.16

7:00 – 5:00 VTX

My 401k still isn’t being done right. Sheesh.
More Publius: A robust, tamper-evident, censorship-resistant web publishing system
- Very good introduction, then it dives into the weeds of how the system was implemented and the cryptographic challenges. Good stuff, and should be addressed. It does imply that the information stored in my system could be encrypted and sharded as an additional layer of protection against malicious editing, since in this case text can have annotations pointing to it but the source should be archival.
- I think I also need to set up a new doc db of news items that I can use to make the story more readable.
  - Stories of people fooled by misinformation
  - Stories of people damaged by lack of anonymity
  - Stories about citizen journalism
  - Stories about computational journalism
  - Something about CSCW, Wikipedia maybe?
- Anderson’s Eternity Service?
  - https://freenetproject.org/
Need to make the ProviderObject persistent. Done
Need a rating object – date, who, the rating, anything else? Done-ish
Need to make a quick & dirty swing app for people to use – started. Once that’s working, then build the rating object that it will create
Need to connect to a remote DB
- Will also need summary statistics and charts to see how queries do.
- Will also need to store the good (“match” and “flaggable”) pages for later training.
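A quick sketch of what the per-provider summary could look like: count how many rated pages each provider returned and how many of those were "good" (match or flaggable match). The `RatingSummary` class is hypothetical, and the rows are hand-typed samples in the `PROVIDER(query)` and rating format used in this log's result tables, not a DB read.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical summary helper; input rows are {"PROVIDER(query)", "rating"} pairs.
class RatingSummary {
    /** Returns provider -> [good count, total count]. */
    static Map<String, int[]> summarize(String[][] rows) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            int paren = row[0].indexOf('(');
            // The provider name is whatever precedes the "(query)" part.
            String provider = paren < 0 ? row[0] : row[0].substring(0, paren);
            String rating = row[1];
            int[] c = counts.computeIfAbsent(provider, k -> new int[2]);
            if (rating.equals("match") || rating.equals("flaggable match")) {
                c[0]++; // a "good" page worth keeping for training
            }
            c[1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] sample = {
            {"ALL_ORG(Ram Singh: board actions)", "flaggable match"},
            {"ALL_ORG(Ram Singh: malpractice)", "legal"},
            {"ALL_EDU(Ram Singh: criminal)", "Not appropriate"},
            {"BASELINE(Ram Singh: board actions)", "match"},
            {"BASELINE(Ram Singh: board actions)", "Not appropriate"},
        };
        summarize(sample).forEach((provider, c) ->
                System.out.printf("%s: %d of %d good%n", provider, c[0], c[1]));
    }
}
```

The same counts could feed a chart later; keeping good and total separate avoids dividing by zero for providers with no rated pages.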
Should make the app stand-alone-ish. JSmooth?
Discussion with Mike G., Heath, Bob H., and Theresa on how to integrate current NLP/NER

Phil 2.5.16

6:45 – 4:15 VTX

Starting Publius: A robust, tamper-evident, censorship-resistant web publishing system
- Marc Waldman

Change the JsonLoaded class to only look at declared fields – done
Register for Periscope Charts – done. Callback on Monday?
Working on parsing the query result.
- Had to set the charset to UTF-8. Huh.
- Can we pull back items by cacheId? Then we don’t need to load the primary store with internet info.
- Had a STUPID mistake in getting JPA set up. Had all the annotations pointing at each other, but forgot when creating the result objects that I had to pass the ‘parent’ query object in to get the mapping. Sigh.
- Adding a dirt-simple rating scheme
  - Java app iterates over all the urls returned and the user can pick from:
```
1 - not appropriate at all
2 - medical and/or legal
3 - Correct person
4 - Correct person with flaggable
```
    The Java app then either opens the page or downloads and opens the file with the default application.
  - The user picks the value, the result object persists with the rating and we move on to the next item. Right now the DB is on my local machine, but if we made it networkable everyone could rate a few pages. Most of the results should only take a few seconds to evaluate.
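The four-value scheme above could be sketched as an enum plus a helper that opens a URL with the default browser. This is only an illustration: `RatingSketch`, `PageRating` and `openInBrowser` are made-up names, not the actual app code, and the persistence step is left out entirely.

```java
import java.awt.Desktop;
import java.io.IOException;
import java.net.URI;
import java.util.Optional;

// Hypothetical sketch of the rating scheme; not the actual app classes.
class RatingSketch {
    enum PageRating {
        NOT_APPROPRIATE(1, "not appropriate at all"),
        MEDICAL_OR_LEGAL(2, "medical and/or legal"),
        CORRECT_PERSON(3, "correct person"),
        CORRECT_PERSON_FLAGGABLE(4, "correct person with flaggable");

        final int code;
        final String label;

        PageRating(int code, String label) {
            this.code = code;
            this.label = label;
        }

        /** Maps the number the user picked back to a rating, if it is valid. */
        static Optional<PageRating> fromCode(int code) {
            for (PageRating r : values()) {
                if (r.code == code) {
                    return Optional.of(r);
                }
            }
            return Optional.empty();
        }
    }

    /** Opens the URL in the default browser; returns false if that is not supported here. */
    static boolean openInBrowser(String url) throws IOException {
        if (!Desktop.isDesktopSupported()) {
            return false;
        }
        Desktop desktop = Desktop.getDesktop();
        if (!desktop.isSupported(Desktop.Action.BROWSE)) {
            return false;
        }
        desktop.browse(URI.create(url));
        return true;
    }

    public static void main(String[] args) {
        // Print the menu the user would pick from.
        for (PageRating r : PageRating.values()) {
            System.out.println(r.code + " - " + r.label);
        }
    }
}
```

Keeping the numeric code on the enum means the DB can store a small integer while the UI shows the label.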
I have the Google/db code running in one sandbox and the user eval running in another. Monday I’ll integrate them.
