5:30 – 3:30 VTX
- Continuing The Hybrid Representation Model for Web Document Classification. Good stuff, well written. This paper (An Efficient Algorithm for Discovering Frequent Subgraphs) may be good for recognizing patterns between stories. Possibly also images.
- Useful page for set symbols that I can never remember: http://www.rapidtables.com/math/symbols/Set_Symbols.htm
- Finally discovered why the RdfStatementNodes aren’t assembling properly. There is no root statement… Fixed! We can now go from:
<rdf:RDF
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xmlns:vCard='http://www.w3.org/2001/vcard-rdf/3.0#' >
  <rdf:Description rdf:about="http://somewhere/JohnSmith/">
    <vCard:FN>John Smith</vCard:FN>
    <vCard:N rdf:parseType="Resource">
      <vCard:Family>Smith</vCard:Family>
      <vCard:Given>John</vCard:Given>
    </vCard:N>
  </rdf:Description>
  <rdf:Description rdf:about="http://somewhere/RebeccaSmith/">
    <vCard:FN>Becky Smith</vCard:FN>
    <vCard:N rdf:parseType="Resource">
      <vCard:Family>Smith</vCard:Family>
      <vCard:Given>Rebecca</vCard:Given>
    </vCard:N>
  </rdf:Description>
  <rdf:Description rdf:about="http://somewhere/SarahJones/">
    <vCard:FN>Sarah Jones</vCard:FN>
    <vCard:N rdf:parseType="Resource">
      <vCard:Family>Jones</vCard:Family>
      <vCard:Given>Sarah</vCard:Given>
    </vCard:N>
  </rdf:Description>
  <rdf:Description rdf:about="http://somewhere/MattJones/">
    <vCard:FN>Matt Jones</vCard:FN>
    <vCard:N vCard:Family="Jones" vCard:Given="Matthew"/>
  </rdf:Description>
</rdf:RDF>
to this:
[1]: http://somewhere/SarahJones/
--[5] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal: "Sarah Jones"
--[4] Subject: http://somewhere/SarahJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffd)
----[6] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal: "Sarah"
----[7] Subject: b81a776:1528928f544:-7ffd, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal: "Jones"
[3]: http://somewhere/MattJones/
--[15] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal: "Matt Jones"
--[14] Subject: http://somewhere/MattJones/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffc)
----[11] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal: "Jones"
----[10] Subject: b81a776:1528928f544:-7ffc, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal: "Matthew"
[0]: http://somewhere/RebeccaSmith/
--[3] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal: "Becky Smith"
--[2] Subject: http://somewhere/RebeccaSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7ffe)
----[9] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal: "Smith"
----[8] Subject: b81a776:1528928f544:-7ffe, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal: "Rebecca"
[2]: http://somewhere/JohnSmith/
--[12] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#N, Object(b81a776:1528928f544:-7fff)
----[1] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Family, Object Literal: "Smith"
----[0] Subject: b81a776:1528928f544:-7fff, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#Given, Object Literal: "John"
--[13] Subject: http://somewhere/JohnSmith/, Predicate: http://www.w3.org/2001/vcard-rdf/3.0#FN, Object Literal: "John Smith"
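For reference, a minimal sketch of the root-finding idea behind that fix. This is not the actual RdfStatementNode code: the function name, the 4-tuple triple format, and the shortened predicate names are all made up for illustration. The assumption is that a root is any subject that never shows up as the non-literal object of another statement, and that everything else hangs off a root.

```python
from collections import defaultdict


def build_statement_tree(triples):
    """Nest (subject, predicate, object, is_literal) triples under root subjects.

    Hypothetical helper: a root is a subject that is never the
    non-literal object of another statement.
    """
    by_subject = defaultdict(list)
    for s, p, o, is_literal in triples:
        by_subject[s].append((p, o, is_literal))

    # Resources that appear as objects (e.g. blank nodes) cannot be roots.
    object_nodes = {o for _, _, o, is_literal in triples if not is_literal}
    roots = [s for s in by_subject if s not in object_nodes]

    def render(subject, depth, seen, lines):
        prefix = "--" * depth
        for p, o, is_literal in by_subject.get(subject, []):
            if is_literal:
                lines.append(f'{prefix}Predicate: {p}, Object Literal: "{o}"')
            else:
                lines.append(f"{prefix}Predicate: {p}, Object({o})")
                if o not in seen:  # guard against cycles in the graph
                    render(o, depth + 1, seen | {o}, lines)

    lines = []
    for root in roots:
        lines.append(root)
        render(root, 1, {root}, lines)
    return roots, lines


if __name__ == "__main__":
    demo = [
        ("http://somewhere/SarahJones/", "vCard:FN", "Sarah Jones", True),
        ("http://somewhere/SarahJones/", "vCard:N", "_:b0", False),
        ("_:b0", "vCard:Given", "Sarah", True),
        ("_:b0", "vCard:Family", "Jones", True),
    ]
    for line in build_statement_tree(demo)[1]:
        print(line)
```

Note that if every subject is also an object somewhere (a pure cycle), this finds no roots at all, which is the same kind of "no root statement" failure described above.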
- Some thoughts about information retrieval using graphs
- Characterise our current data
- URL
- Domain
- HTML text: allows for tagged information (<title>, etc.). Outbound links (how many hops?)
- Cleaned text
- Word Bag?
- Date retrieved
- Query (items returned)
- Currently 4 queries per entity – a few hundred kilobytes to megabytes.
- Felony – top 10
- 3 others
- Flags produced
- From that, we can analyze
- Most productive pages for flags (power law?)
- Temporal patterns
- Network characteristics for good/bad medical providers, pages producing flags, etc.
- Diameter
- Flag page adjacency and connecting paths
- Statistical differences for above, including the same query over time
- Once we know the size of the network we use, we can look to how we might do our own crawl.
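A rough sketch of how the diameter and the connecting paths between flag pages could be computed once the pages and their links are in a graph. It assumes an undirected, unweighted link graph small enough to hold in memory; the node names and helper functions are hypothetical, and it uses plain breadth-first search from the standard library rather than any particular graph package.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from (page_a, page_b) link pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, start):
    """Hop counts from start to every page reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def diameter(graph):
    """Longest shortest path over reachable pairs (graph must be non-empty)."""
    return max(max(bfs_distances(graph, n).values()) for n in graph)


def shortest_path(graph, src, dst):
    """One shortest connecting path between two pages, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in graph.get(node, ()):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None
```

Running BFS from every node is quadratic-ish in the size of the network, so this is only reasonable for the per-entity result sets, not for a full crawl.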
- Impact of Similarity Measures on Web-page Clustering
- Look up “Standard Graph Representation”
- An Efficient Algorithm for Discovering Frequent Subgraphs
- Efficient Graph-Based Representation of Web Documents
- Patient satisfaction is systemic, criminal is isolated? That implies a signal-to-noise problem.
- Can we produce a training set of documents?
- Can we test against other search engines? Bing? Mechanical Turk?
- Data Sets
- Using Google better
- Only use Google to find new sites and exclude the ones that we know about: https://support.google.com/customsearch/answer/2631038?hl=en
- Can we infer a bad/good provider, e.g. with Google Similarity Distance? If so, we can run extensive queries only about the bad ones. (A quick example)
- How often do we really need to do a deep crawl on a provider? Are there inferential triggers that we could use?
- Characterise our current data
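For the Google Similarity Distance idea, a small sketch of the normalized distance computed from hit counts. The formula is the standard one, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f is the number of pages matching a term and N is the total number of indexed pages. The function name is hypothetical, and where the hit counts and N would come from (which search engine, which API) is left open here.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from hit counts.

    fx, fy : pages matching term x, term y
    fxy    : pages matching both terms
    n      : total pages indexed (an estimate; an assumption of this sketch)
    Returns math.inf when any count is zero, i.e. the terms never co-occur.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

Smaller values mean the two terms tend to appear on the same pages, so a provider name that sits close to a term like "felony" might be a candidate for the more extensive queries. Any threshold would have to be tuned against manually flagged providers.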
- Sent a note to Theresa asking for people to do manual flag extraction
