Phil 3.2.16

5:00-ish 4:00 – VTX

  • Call Charlestown
  • Meeting with Dr. Pan
    • The new ground truth framework looks good. Saving outbound and inbound links is also worth doing.
    • Beware of low-percentage patterns. Finding the 1% answer is very hard for machine learning, while finding the 49% answer is much more tractable.
    • SVMs are probably a good way to start, since they are resistant to overfitting.
    • Multiple passes may be required to filter the data to get a meaningful result. Patterns like the .edu/.gov ratio may be very helpful.
    • The subreddit Change My View is an interesting UGC site that should provide good examples of information networks on both sides of a controversial point, along with a measure of success. It would certainly be interesting to do a link analysis.
  • Starting on A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web. If I’m right, I should have a Game Theory/Information Economics model to frame this. Here’s hoping.
    • As an aside, parsing my saved documents to get authors, general terms, and ACM Reference Format terms should be done to compare the produced networks. Looks like PDFBox should do the trick.
    • Elaheh Momeni – Lots of stuff on UGC
      • Data Mining
      • Collective Intelligence
      • Machine Learning
      • User Generated Content Mining
      • Social Computing
    • Claire Cardie
      • argument mining and argument generation including the identification of supported vs. unsupported claims and opinions,
      • social-computational methods for improving communication and interactions in on-line settings,
      • NLP for e-rulemaking,
      • sentiment analysis: extraction and summarization of fine-grained opinions in text,
      • discourse-aware methods for opinion and argument extraction,
      • deception detection in on-line reviews,
      • noun phrase coreference resolution.
    • Nick Diakopoulos
      • Research in computational and data journalism with an emphasis on algorithmic accountability, narrative data visualization, and social computing in the news.
  • New Weapon in Day Laborers’ Fight Against Wage Theft: A Smartphone App – NYTimes. Short documentary on YouTube. Sol Aramendi is the author?
  • Spent time when I should have been sleeping thinking about rating webpages. Rather than the current single list, I think at least four categories are needed:
    • Accessible – yes/no (404, etc.)
    • Match – did the person show up? yes/no/possible (can’t tell)
    • Target Characterization
      • Positive – gave to charity, published a paper
      • Neutral – phone book listing
      • Negative – conviction, confession
    • Source type
      • Official Document
      • Home Page
      • Microblog
      • Blog
      • News organization
      • Federal Government
      • State Government
      • Commercial Entity – Rating site, etc
      • Non-commercial Entity – Nonprofit, clubs, interest group
      • Educational – yearbook, program, course listing
      • Machine-generated for unclear purpose
      • Spam
    • Content Characterization (can be multiple)
      • Medical
      • Legal
      • Commercial
      • Official
      • Marketing
      • Other
      • Spam
    • Quality Characterization
      • Low – confusing, conflicting, or unrelated information
      • Minimal – some useful information (Machine harvested from better sources)
      • High – clear, providing high quality information
    • Source Characterization
      • Very trustworthy – I’d give them my SSN
      • Trustworthy – I’d use a credit card here
      • Credible – I’d use this site to support an argument
      • Neutral – Not sure, but wouldn’t avoid
      • Not Credible – Not rooted in things that I believe/trust
      • Distrustworthy – I’m pretty sure this site is misinformation
      • Very Distrustworthy – Conspiracy theories, Lizardmen, etc
    • Relevant Text – In addition, I think we need a text area where the user can paste text from the webpage that shows the match in context, or something that exemplifies the source characterization
    • Notes – To cover anything that’s not covered above
  • So now Gregg is handling Crawl Service file generation?
  • Discussion with Katy and Jeremy about the list above?
  • Pondering how to adjust the ratingObject. Everything is a string, except for content characterization, which can have multiple values. I could do a bitfield or a separate table. Leaning towards the bitfield.
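
A minimal sketch of the bitfield option, in Python for illustration only. The `Content` flag names mirror the content characterization list above; the `Rating` class and its field names are hypothetical stand-ins for the ratingObject, whose actual layout isn't shown here.

```python
from dataclasses import dataclass
from enum import IntFlag


class Content(IntFlag):
    """Content characterization: one bit per category, so several can be set at once."""
    MEDICAL = 1
    LEGAL = 2
    COMMERCIAL = 4
    OFFICIAL = 8
    MARKETING = 16
    OTHER = 32
    SPAM = 64


@dataclass
class Rating:
    # Hypothetical field names; the single-valued categories stay plain strings.
    accessible: str
    match: str
    source_type: str
    quality: str
    # The one multi-valued category is packed into a single integer.
    content: Content = Content(0)


rating = Rating(
    accessible="yes",
    match="possible",
    source_type="Blog",
    quality="Minimal",
    content=Content.MEDICAL | Content.LEGAL,
)

stored = int(rating.content)   # what would go into one integer column
restored = Content(stored)     # round-trip back into flags when reading
has_legal = Content.LEGAL in restored
has_spam = Content.SPAM in restored
```

The tradeoff against a separate table: the bitfield keeps each rating in a single row, but adding a category means assigning a new bit, and queries like "all ratings tagged Legal" need a bitwise AND rather than a join.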
