5:00-ish 4:00 – VTX
- Call Charlestown
- Meeting with Dr. Pan
- The new ground truth framework looks good. Saving outbound and inbound links is also worth doing.
- Beware of low-percentage patterns. Finding the 1% answer is very hard for machine learning, while finding the 49% answer is much easier.
- SVMs are probably a good way to start, since they are resistant to overfitting.
- Multiple passes may be required to filter the data to get a meaningful result. Patterns like the .edu/.gov ratio may be very helpful.
- The subReddit Change My View is an interesting UGC site that should provide good examples of information networks on both sides of a controversial point, and a measure of success. It would certainly be interesting to do a link analysis.
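One way the .edu/.gov ratio mentioned above might be computed over a page's saved outbound links. This is a minimal sketch, not the project's code: the function name `edu_gov_ratio` and the sample URLs are hypothetical, and treating a hostname suffix of `.edu` or `.gov` as the test is an assumption.

```python
from urllib.parse import urlparse


def edu_gov_ratio(links):
    """Return the fraction of links whose hostname ends in .edu or .gov.

    Hypothetical helper: links without a parseable hostname are skipped,
    and an empty input yields 0.0 rather than raising.
    """
    hosts = [urlparse(link).hostname or "" for link in links]
    hosts = [h for h in hosts if h]  # drop links with no hostname
    if not hosts:
        return 0.0
    hits = sum(1 for h in hosts if h.endswith((".edu", ".gov")))
    return hits / len(hosts)


# Made-up outbound links for illustration only.
sample_links = [
    "https://www.example.edu/paper.pdf",
    "https://agency.example.gov/report",
    "http://blog.example.com/post",
    "http://shop.example.com/item",
]
print(edu_gov_ratio(sample_links))  # 0.5
```

A ratio like this could serve as one feature in an early filtering pass, before anything is handed to a classifier.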
- Starting on A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web. If I’m right, I should have a Game Theory/Information Economics model to frame this. Here’s hoping.
- As an aside, parsing my saved documents to get authors, general terms, and ACM Reference Format terms should be done to compare the produced networks. Looks like PDFBox should do the trick.
- Elaheh Momeni – Lots of stuff on UGC
- Data Mining
- Collective Intelligence
- Machine Learning
- User Generated Content Mining
- Social Computing
- Claire Cardie
- argument mining and argument generation including the identification of supported vs. unsupported claims and opinions,
- social-computational methods for improving communication and interactions in on-line settings,
- NLP for e-rulemaking,
- sentiment analysis: extraction and summarization of fine-grained opinions in text,
- discourse-aware methods for opinion and argument extraction,
- deception detection in on-line reviews,
- noun phrase coreference resolution.
- Nick Diakopoulos
- Research in computational and data journalism with an emphasis on algorithmic accountability, narrative data visualization, and social computing in the news.
- New Weapon in Day Laborers’ Fight Against Wage Theft: A Smartphone App – NYTimes. Short documentary on YouTube. Sol Aramendi is the author?
- Spent time when I should have been sleeping thinking about rating webpages. Rather than the current single list, I think at least four categories are needed:
- Accessible yes/no (404, etc)
- Match – did the person show up yes/no/possible-can’t tell
- Target Characterization
- Positive – gave to charity, published a paper
- Neutral – phone book listing
- Negative – conviction, confession
- Source type
- Official Document
- Home Page
- Microblog
- Blog
- News organization
- Federal Government
- State Government
- Commercial Entity – Rating site, etc
- Non-commercial Entity – Nonprofit, clubs, interest group
- Educational – yearbook, program, course listing
- Machine-generated for unclear purpose
- Spam
- Content Characterization (can be multiple)
- Medical
- Legal
- Commercial
- Official
- Marketing
- Other
- Spam
- Quality Characterization
- Low – confusing, conflicting, or unrelated information
- Minimal – some useful information (Machine harvested from better sources)
- High – clear, providing high quality information
- Source Characterization
- Very trustworthy – I’d give them my SSN
- Trustworthy – I’d use a credit card here
- Credible – I’d use this site to support an argument
- Neutral – Not sure, but wouldn’t avoid
- Not Credible – Not rooted in things that I believe/trust
- Distrustworthy – I’m pretty sure this site is misinformation
- Very Distrustworthy – Conspiracy theories, Lizardmen, etc
- Relevant Text – In addition, I think we need a text area where the user can paste text from the webpage that contains the match in context, or something that exemplifies the source characterization
- Notes – To cover anything that’s not covered above
- So now Gregg is handling Crawl Service file generation?
- Discussion with Katy and Jeremy about the list above?
- Pondering how to adjust the ratingObject. Everything is a string, except for content characterization, which can have multiple values. I could do a bitfield or a separate table. Leaning towards the bitfield.
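A rough sketch of the bitfield option for content characterization, using Python's `enum.IntFlag`. The class name `ContentFlag`, the bit assignments, and the helper `flag_names` are assumptions for illustration; the real ratingObject may look quite different.

```python
from enum import IntFlag


class ContentFlag(IntFlag):
    """Hypothetical bit assignments for the content characterization list."""
    MEDICAL = 1
    LEGAL = 2
    COMMERCIAL = 4
    OFFICIAL = 8
    MARKETING = 16
    OTHER = 32
    SPAM = 64


def flag_names(value):
    """List the names of the single-bit flags set in a stored integer."""
    flags = ContentFlag(value)
    return [f.name for f in ContentFlag if f in flags]


# A page rated as both medical and legal collapses to one integer field.
rating = ContentFlag.MEDICAL | ContentFlag.LEGAL
stored = int(rating)           # what would be saved alongside the string fields
restored = ContentFlag(stored)

print(stored)                  # 3
print(flag_names(stored))      # ['MEDICAL', 'LEGAL']
print(ContentFlag.SPAM in restored)  # False
```

The bit values are written out explicitly so that stored integers stay valid if the list is later reordered. The tradeoff against a separate table is that a single integer keeps the record flat, while a table would make it easier to query by category without bit operations.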
