Phil 6.22.16

6:45 – 4:45 VTX

  • Running analytics on the CSCW corpora
  • My codebase at home was out of date, and I was hitting my missing bouncycastle jar issue again, so I updated the development folders. I also started updating IntelliJ, which has taken 10 minutes so far…
  • First pass of running the CSCW data through the tool.
    There are three categories:
    • CSCW17 – these are the submitted papers
    • MostCited – These are (generally) each author’s most cited paper on which they are first or last author. It took me a while to start doing this, so the set isn’t consistent.
    • MostRecent – These are the most recent papers that I could get copies of. Same constraints and caveats as above.
    I also deleted the term ‘participants’, as it overwhelmed the rest of the relationships and is a pretty standard methods element that I don’t think contributes to the story the data tell.
    Here are the top ten items, ranked by the influence of terms within the top 52 items in the LSI ranking. It’s kind of interesting…
    CSCW2017    Most Cited (terms)   Most Recent (terms)   Most Cited (papers)     Most Recent (papers)
    older       social               media                 Sean P. Goggins.pdf     Donald McMillan.pdf
    ageism      student              privacy               Chinmay Kulkarni.pdf    Mark Rouncefield.pdf
    adult       photo                twitter               Airi Lampinen.pdf       Sarah Vieweg.pdf
    blogger     awareness            behavior              Cliff Lampe.pdf         Jeffrey T. Hancock.pdf
    ageist      object               device                Anne Marie Piper.pdf    David Randall.pdf
    platform    class                interview             Frank Bentley.pdf       Cliff Lampe.pdf
    workplace   facebook             notification          Mor Naaman.pdf          Sean P. Goggins.pdf
    woman       friend               deception             Morgan G. Ames.pdf      Airi Lampinen.pdf
    gender      flickr               phone                 Gabriela Avram.pdf      Wayne Lutters.pdf
    snapchat    software             facebook              Lior Zalmanson.pdf      Vivek K. Singh.pdf
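    The influence ranking above can be sketched end to end: build a TF-IDF matrix, drop ‘participants’, take a truncated SVD (which is all LSI is), and score each term by the magnitude of its loadings across the retained components. A minimal NumPy sketch with a toy three-document corpus (the documents, stop list, and k=2 are stand-ins, not the real pipeline):

    ```python
    import numpy as np

    # Toy stand-in corpus; the real input is the text of the CSCW PDFs.
    docs = [
        "participants older adults social media adoption workplace",
        "participants ageism gender online platform interview",
        "participants snapchat woman blogger photo deception",
    ]

    # Drop 'participants' before ranking, since it overwhelms the other terms.
    stop = {"participants"}
    vocab = sorted({w for d in docs for w in d.split() if w not in stop})

    # Plain TF-IDF: raw counts weighted by log inverse document frequency.
    tf = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float)
    idf = np.log(len(docs) / (tf > 0).sum(axis=0)) + 1.0
    tfidf = tf * idf

    # LSI = truncated SVD of the TF-IDF matrix; keep k components and score
    # each term by the total magnitude of its loadings across them.
    k = 2
    U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
    influence = np.abs(Vt[:k]).sum(axis=0)

    ranked = sorted(zip(vocab, influence), key=lambda p: -p[1])[:10]
    for term, score in ranked:
        print(f"{term:12s} {score:.3f}")
    ```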
  • Finished rating! 530 pages. Now I need to get the outputs into Excel. I think the view_ratings view should be enough…?
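    A minimal sketch of that export (the view_ratings name is from the tool, but the SQLite backing and the schema here are my assumptions) that dumps the view to a CSV file Excel opens directly:

    ```python
    import csv
    import sqlite3

    # In-memory stand-in database; the table schema is a guess, and only the
    # view_ratings name comes from the actual tool.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE ratings (page TEXT, rating INTEGER);
        INSERT INTO ratings VALUES ('page_001', 4), ('page_002', 2);
        CREATE VIEW view_ratings AS SELECT page, rating FROM ratings;
    """)

    # Dump the view, header row included, to a CSV that Excel can open.
    cur = con.execute("SELECT * FROM view_ratings")
    with open("ratings.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur.fetchall())
    ```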
  • I don’t have results for names alone, but I’m going to assume that the initial set of queries (‘board actions’, ‘criminal’, ‘malpractice’ and ‘sanctions’) only modestly improves the search. So, using that as a proxy for the current system, my small data set gives the following results:
    • Hits or near misses – 46 pages, or 16.7% of the total pages evaluated
    • Misses – 230 pages, or 83.3% of the total pages evaluated

    With the new CSE configuration (exactTerms=<name permutation>, query=<full state name>, orTerms=<TF-IDF string>), we get much better results:

    • Hits or near misses – 252 pages, or 78% of the total pages evaluated
    • Misses – 71 pages, or 22% of the total pages evaluated
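    The new configuration maps directly onto parameters of the Google Custom Search JSON API. A sketch of building that request URL (the key, engine id, name, state, and orTerms values are all placeholders; only the parameter names and their roles come from the entry):

    ```python
    from urllib.parse import urlencode

    params = {
        "key": "YOUR_API_KEY",          # placeholder Google API key
        "cx": "YOUR_ENGINE_ID",         # placeholder custom search engine id
        "exactTerms": "John Q Public",  # one name permutation; must appear verbatim
        "q": "Maryland",                # full state name as the base query
        "orTerms": "license discipline revoked",  # stand-in for the TF-IDF string
    }
    url = "https://www.googleapis.com/customsearch/v1?" + urlencode(params)
    print(url)
    ```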

    So it looks like we can expect something on the order of a 450% improvement in results (46 hits before vs. 252 after).
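    A quick back-of-envelope check on that figure: the ~450% matches the growth in raw hit counts, while measured as a per-page hit rate (the two runs evaluated slightly different totals) the improvement is closer to 4.7x:

    ```python
    # Counts from the two evaluations above (note the totals differ: 276 vs. 323).
    old_hits, old_total = 46, 276
    new_hits, new_total = 252, 323

    raw_gain = (new_hits / old_hits - 1) * 100                   # growth in raw hits
    rate_gain = (new_hits / new_total) / (old_hits / old_total)  # hit-rate ratio
    print(f"{raw_gain:.0f}% more hits; hit rate up {rate_gain:.1f}x")
    ```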

  • Good presentation on document similarity
