Author Archives: pgfeldman

Phil 6.7.16

6:00 – 10:00, 12:00 – 5:00 Writing

Good progress on the lit review section. Found NewsCube: Delivering Multiple Aspects of News to Mitigate Media Bias, which I’ll be adding to the corpus.

Phil 6.6.16

6:30 – 1:00 Writing

Realized that I had forgotten to go into how information seeking behavior of the IR users can potentially be used to vet the quality of the information they are looking at.
Working my way through the lit review.

Phil 6.5.16

8:00 – 2:00 – Writing

Some cleanup and organization
Working on navigation and personalization Think I’m done with the introduction and motivation. Lit Review and plan next.
Well, this looks useful: A new understanding: What makes people trust and rely on news. From the American Press Institute.
And another cool thing: How to break away from articles and invent new story forms. This is another research question – what kind of interface makes for the most accurate, informative and entertaining citizen journalism?
Is ONA 2016 a good thing to go to?

Phil 6.4.16

7:30 – 1:30 Writing

More on libraries and serendipity. Found lots, and then went on to look for metions in electronic retrieval. Found Foster’s A Nonlinear Model of Information-Seeking Behavior, which also has some spiffy citations. Going to take a break from writing and actually read this one. Because, I just realized that interdisciplinary researchers are the rough academic equivalent of the explorer pattern.
Investigating Information Seeking BehaviorUsing the Concept of Information Horizons
- Page 3 – To design and develop a new research method we used Sonnenwald’s (1999) framework for human information behavior as a theoretical foundation. This theoretical framework suggests that within a context and situation is an ‘information horizon’ in which we can act. For a particular individual, a variety of information resources may be encompassed within his/her information horizon. They may include social networks, documents, information retrieval tools, and experimentation and observation in the world. Information horizons, and the resources they encompass, are determined socially and individually. In other words, the opinions that one’s peers hold concerning the value of a particular resource will influence one’s own opinions about the value of that resource and, thus, its position within one’s information horizon.

Phil 6.2.16

7:00 – 5:00 VTX

Writing
- Loaded extracted terms into LMT.
  - removed method and datum
- Working through current IR and group polarization
- A document retrieval system for man-machine interaction. From 1964, folks. Also, the ACM is mapping concepts to Wikipedia entries now.
Write up sprint story – done
- Develop a ‘training’ corpus known bad actors (KBA) for each domain.
  - KBAs will be pulled from http://w3.nyhealth.gov/opmc/factions.nsf, which provides a large list.
  - List of KBAs will be added to the content rating DB for human curation
  - HTML and PDF data will be used to populate a list of documents that will then be scanned and analyzed to prepare TF-IDF and LSI term-document tables.
  - The resulting table will in turn be analyzed using term centrality, with the output being an ordered list of terms to be evaluated for each domain.

Building view to get person, rating and link from the db – done, or at least V1

CREATE VIEW view_ratings AS
  select io.link, qo.search_type, po.first_name, po.last_name, po.pp_state, ro.person_characterization from item_object io
    INNER JOIN query_object qo ON io.query_id = qo.id
    INNER JOIN rating_object ro on io.id = ro.result_id
    INNER JOIN poi_object po on qo.provider_id = po.id;

Took results from w3.nyhealth.gov and ran them through the whole system. The full results are in the Corpus file under w3.nyhealth.gov-PDF-centrality_06_02_16-13_12_09.xlsx and w3.nyhealth.gov-WEB-centrality_06_02_16-13_12_09.xlsx. The results seem to make incredibly specific searches. Here are the two first examples. Note that there are very few .com sites.:
- PDF-derived: respondent consent committee probation agreement pursuant
  - This is very medical for some reason, though not very region specific
- Web-derived: physician license professional return number effective
  - This is very medical, but also biased towards NY state.
- Prepending physician names seems (so far) to lead to consistent hits in the search results on standard Google

Phil 6.1.16

7:00 – 2:00VTX

Checking in the large matrix for working through ‘delete’
- Finished. Also deletes any other items that are wholly dependent on the deleted item.
If opening a pdf, look for the file in the directory where the input is, if not found, bring up the dialog. It’s going to be a bit tricky, since the string isn’t quite the file name. Done. If it’s not in the default directory, it’ll bring up a file chooser.
Writing
Need to write up the sprint plan for generating terms. It’ll be based on the information in this site.
Goal Question Metric
Serendipity of walking through the library and getting distracted. What Is Used during Cognitive Processing in Information Retrieval and Library Searching? Eleven Sources of Search Information

Phil 5.31.16

7:00 – 4:30 VTX

Writing. Working on describing how maintaining many codes in a network contains more (and more subtle) information than grouping similar codes.
Working on the UrlChecker
- In the process, I discovered that the annotation.xml file is unique only for the account and not for the CSE. All CSEs for one account are contained in one annotation file
- Created a new annotation called ALL_annotations.xml
- fixed a few things in Andy’s file
- Reading in everything. Now to produce the new sets of lists.
- I think it’s just easier to delete all the lists and start over.
- Done and verified. You run UrlChecker from the command line, with the input file being a list of domains (one per line) and the ALL_annotations.xml file.
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2
Need to add a Delete or Hide button to reduce down a large corpus to a more effective size.
Added. Tomorrow I’ll wire up the deletion of a row or cilumn and the recreation of the initialMatrix

Phil 5.30.16

7:00 – 10:00 Thesis/VTX

Built a new matrix for the coded lit review. I had coded a couple of more papers
Working on copying over the read papers into a new folder that I can run text analytics over
After carefully reading through the doc manager list and copying over each paper, I just discovered I could have exported selected.
Ooops: Exception in thread “JavaFX Application Thread” java.lang.IllegalArgumentException: Invalid column index (16384). Allowable column range for EXCEL2007 is (0..16383) or (‘A’..’XFD’)
- Going to add a limit of
```
SpreadsheetVersion.EXCEL2007.getMaxColumns()-8
```
  columns for now. Clearly that can be cut down.
- Figuring out where to cut the terms. I’m summing the columns of the LSI calculation, starting at the highest value and then dividing that by the sum of all values. The top 20% of rank weights gives 280 columns. Going to try that first
- Success! Some initial thoughts
  - The coded version is much more ‘crisp’
  - There are interesting hints in the LSI version
  - Clicking on a term or paper to see the associated items is really nice.
  - I think that document subgroups might be good/better, and it might be possible to use the tool to help build those subgroups. This goes back to the ‘hiding’ concept. (hide item / hide item and associated)

Phil 5.27.16

7:00 – 2:00 VTX

Wound up writing the introduction and saving the old intro to a new document – Themesurfing
Renamed the current document
Got the parser working. Old artifact settings.
Added some tweaks to show progress better. I’m kinda stuck with the single thread in JavaFx having to execute before text can get shown.
Need an XML parser to find out what sites have already been added. Added an IntelliJ project to the GoogleCseConfigFiles SVN file. Should be able to finish it on Tuesday.

Phil 5.26.16

7:00 – 5:00 VTX

Writing. Working through the auto-generated terms
Create config file for the three sites from Margarita

Done. And got reasonable results!

Term	Weight	Eigenvalue
narcotic	1	0.02394
cocaine	1	0.021204
defendant	1	0.015849
heroin	1	0.014715
dealer	1	0.01026
sentence	1	0.009586
bronx	1	0.008208
pound	1	0.007524
indictment	1	0.00297
jersey	1	0.001524

Had one problem that I’ll need to look for. The page ‘https://www.justice.gov/usao-edla/pr/superseding-indictment-charges-fourteen-individuals-conspiracy-commit-30-million‘ did not read in correctly the first time. Need to add a check on that. At least a warning popup.
Also, here’s a way to redirect console output to the app while its running: http://stackoverflow.com/questions/8708342/redirect-console-output-to-string-in-java

Start report? Done: Extracting Better Search Terms.

Phil 5.25.16

7:00 – 4:30 VTX

Took the weekend off for the ESCN. Bailed on Saturday because of rain, then dodged rain for two days, then got a nice ride in on Tuesday.
Chatting with Aaron last night, I discovered that the REST API won’t work for Demo. I’ll need to get a new SQL dump from Heath. No, actually it works just fine in that it is accessable, but anything other than empty sets is a timeout.
Need to try to build a new jar file for the CorpusManager so it can have its own executable. Put the Manifest in the CM directory? not sure how to do that.
- Yay stack overflow!
Writing
Looks like my old laptop finally bit the dust. Chromebook time?
Working on the Corpus manager to pull in links in the config file. Done!
So I’m having all kinds of problems getting the flag info from Jeremy’s rest service. I did realize that I can use the dashboard though and harvest the urls by following the links and build my list that way. Except that the flags are crap. Back to moby dick for the moment.
Actually, those are pretty bad too. Margarita put three urls up on confluence.
Got url scanning done through config file.
Ingested the first four chapters of Moby Dick. Pretty interesting. Ill try those three files tomorrow and we’ll see what we’ve got, at least for a sense of .gov sites…

Phil 5.20.16

7:00 – 3:30

Writing
Going to try LSI. I think the term clustering is simply the sum if the TF-IDF across docs by term. That should give a topic list. Then use that for centrality calculations? Take the top n words?
- Actually, then the user could group words into concepts and that could make a smaller matrix where the concept count is the union of the counts of its component terms.
Have a LSI-lite version going that sums the TF-IDF scores and then sorts based on the sum of all scores * (number of docs with score / number of docs). Then sort and take the top n terms.
Need to multiply the matrix by something so that the count gets populated with something reasonable. Maybe 100? Tried that – it looks good.

Got the PDF parsing working. Need to get it to work with webpages next and try it on Moby Dick. Then output from the flag data

https://dockerapps5.eip.nj.vistronix.com:9443/authenticationendpoint/login.do?client_id=w674kmsNj7flgKkTp_t_8ArPES0a&commonAuthCallerPath=%2Foauth2%2Fauthorize&forceAuth=false&passiveAuth=false&redirect_uri=http%3A%2F%2Fdockerapps.vistronix.com%2Flogin&response_type=code&scope=openid&state=RrKxRY&tenantDomain=carbon.super&sessionDataKey=fbcaf4a0-679a-4eed-93df-5464bca702ff&relyingParty=w674kmsNj7flgKkTp_t_8ArPES0a&type=oidc&sp=EIP-CI&isSaaSApp=false&authenticators=BasicAuthenticator:LOCAL

http://dockerapps.vistronix.com/gtc-server/physicianservice/flags

Need to make sure that I use the above pointing at the demo system. From Andy’s email:

Yes …looks you are looking at dev….in Confluence, search on environment details…that Will give you the urls for the dashboards on dev, ci and demo…we are working on demo now.

Phil 5.19.16

7:00 – 5:00 VTX

Looks like I saved the wrong version of the code to dropbox, so I can’t update the app image.
More writing
- Looking for a clear, concise description of bootstrapping led here: Resampling: The New Statistics
System and Social trust, revisited: Algorithms, clickworkers, and the befuddled fury around Facebook Trends
GDELT “uses some of the world’s most sophisticated natural language and data mining algorithms to extract more than 300 categories of “events” and the networks of people, organizations, locations, themes, and emotions that tie them together.“
Working on the Corpus Processing Tool
Need to break apart calculateAndSave
Need to build matrix
Need to save spreadsheets
Need name to save too.
Start and stopwords
Add Latent Semantic Indexing? I have most of the pieces.

Phil 5.18.16

7:00 – 4:30 VTX

Writing
- Wanted to show how a network could be used for intercoder agreement so I had to refresh my understanding of Cohen’s kappa
- It occurs to me that if one coder’s rank can be mapped to another coder’s rank we have a kind of information distance measure. Although the math to do that eludes me. Rank comparison could make a lot of sense to compare centrality. Another possibility is to compare the network measures?
Adding ‘Filter’ field and button to LMN
This appears to be how you do it.
And it worked like a charm 🙂
Worked through scoring math with Aaron

Phil 5.17.16

7:00 -7:00

Great discussion with Greg yesterday. Very encouraging.
Some thoughts that came up during Fahad’s (Successful!) defense
- It should be possible to determine the ‘deletable’ codes at the bottom of the ranking by setting the allowable difference between the initial ranking and the trimmed rank.
- The ‘filter’ box should also be set by clicking on one of the items in the list of associations for the selected items. This way, selection is a two-step process in this context.
- Suggesting grouping of terms based on connectivity? Maybe second degree? Allows for domain independence?
- Using a 3D display to show the shared second, third and nth degree as different layer
- NLP tagged words for TF-IDF to produce a more characterized matrix?
- 50 samples per iteration, 2,000 iterations? Check! And add info to spreadsheet! Done, and it’s 1,000 iterations
Writing
Parsing Jeremy’s JSON file
- Moving the OptionalContent and JsonLoadable over to JavaJtils2
- Adding javax.persistence-2.1.0
- Adding json-simple-1.1.1
- It worked, but it’s junk. It looks like these are un-curated pages
Long discussion with Aaron about calculating flag rollups.

viztales

Dimension reduction, State, Orientation, and Speed

Author Archives: pgfeldman

Phil 6.7.16

Phil 6.6.16

Phil 6.5.16

Phil 6.4.16

Phil 6.2.16

Phil 6.1.16

Phil 5.31.16

Phil 5.30.16

Phil 5.27.16

Phil 5.26.16

Phil 5.25.16

Phil 5.20.16

Phil 5.19.16

Phil 5.18.16

Phil 5.17.16