Category Archives: Work

Phil 8.16.18

7:00 – 4:30 ASRC MKT

  • R2D3 is an experiment in expressing statistical thinking with interactive design. Find us at @r2d3us
  • Foundations of Temporal Text Networks
    • Davide Vega (Scholar)
    • Matteo Magnani (Scholar)
    • Three fundamental elements to understand human information networks are the individuals (actors) in the network, the information they exchange, that is often observable online as text content (emails, social media posts, etc.), and the time when these exchanges happen. An extremely large amount of research has addressed some of these aspects either in isolation or as combinations of two of them. There are also more and more works studying systems where all three elements are present, but typically using ad hoc models and algorithms that cannot be easily transferred to other contexts. To address this heterogeneity, in this article we present a simple, expressive and extensible model for temporal text networks, that we claim can be used as a common ground across different types of networks and analysis tasks, and we show how simple procedures to produce views of the model allow the direct application of analysis methods already developed in other domains, from traditional data mining to multilayer network mining.
      • Ok, I’ve been reading the paper and if I understand it correctly, it’s pretty straightforward and also clever. It relates a lot to the way that I do term-document matrices, and then extends the concept to include time, agents, and implicitly anything else you want. To illustrate, here’s a picture of a tensor-as-matrix (tensorIn2D). The important thing to notice is that there are multiple dimensions represented in a single square matrix. We have:
        • agents
        • documents
        • terms
        • steps
      • This picture in particular is of an undirected adjacency matrix, but there are ways to handle in-degree and out-degree, probably best done by keeping one matrix for in-degree and one for out-degree.
      • Because it’s a square matrix, we can calculate the steps between any two nodes in the matrix, and the centrality, simply by squaring the matrix and keeping track of the steps until the eigenvector settles (see the sketch after this list). We can also weight a node by multiplying that node’s row and column by a scalar. That changes the centrality, but not the connectivity. We can also drop out components (steps, for example) to see how that changes the underlying network properties.
      • If we want to see how time affects the development of the network, we can start with all the step nodes set to a zero weight, then add them in sequentially. This means, for example, that clustering could be performed on the nonzero nodes.
      • Some or all of the elements could be factorized using NMF, resulting in smaller, faster matrices.
      • Network embedding could be useful too, since it gives us distances between nodes. And this looks really important: Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec
      • I think I can use any and all of the above methods on the network tensor I’m describing. This is very close to a mapping solution.
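      • A minimal numpy sketch of the squaring/centrality idea above: a toy adjacency matrix over mixed node types, power iteration until the eigenvector settles, and a scalar reweighting of one node’s row and column. All labels and values are made up for illustration.
        import numpy as np

        # Toy undirected adjacency matrix over mixed node types
        # (agents, documents, terms, steps), as in the tensor-as-matrix picture.
        labels = ["agent_a", "agent_b", "doc_1", "term_x", "step_0"]
        A = np.array([[0, 1, 1, 0, 1],
                      [1, 0, 1, 1, 0],
                      [1, 1, 0, 1, 1],
                      [0, 1, 1, 0, 0],
                      [1, 0, 1, 0, 0]], dtype=float)

        # (A @ A)[i, j] counts the two-step walks from i to j, so repeated
        # squaring shows how many steps separate any two nodes.
        A2 = A @ A
        print((A2 > 0).astype(int))  # which pairs are reachable in two steps

        # Eigenvector centrality: multiply until the vector settles.
        v = np.ones(len(A))
        for _ in range(100):
            v_new = A @ v
            v_new /= np.linalg.norm(v_new)
            if np.allclose(v, v_new):
                break
            v = v_new

        # Reweight a node: scale its row and column by a scalar. This shifts
        # centrality without changing which nodes are connected. (Not used
        # further here; shown for the weighting idea above.)
        W = A.copy()
        W[0, :] *= 2.0
        W[:, 0] *= 2.0

        for name, c in zip(labels, v):
            print(f"{name}: {c:.3f}")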
  • The Shifting Discourse of the European Central Bank: Exploring Structural Space in Semantic Networks (cited by the above paper)
    • Convenient access to vast and untapped collections of documents generated by organizations is a valuable resource for research. These documents (e.g., Press releases, reports, speech transcriptions, etc.) are a window into organizational strategies, communication patterns, and organizational behavior. However, the analysis of such large document corpora does not come without challenges. Two of these challenges are 1) the need for appropriate automated methods for text mining and analysis and 2) the redundant and predictable nature of the formalized discourse contained in these collections of texts. Our article proposes an approach that performs well in overcoming these particular challenges for the analysis of documents related to the recent financial crisis. Using semantic network analysis and a combination of structural measures, we provide an approach that proves valuable for a more comprehensive analysis of large and complex semantic networks of formal discourse, such as the one of the European Central Bank (ECB). We find that identifying structural roles in the semantic network using centrality measures jointly reveals important discursive shifts in the goals of the ECB which would not be discovered under traditional text analysis approaches.
  • Comparative Document Analysis for Large Text Corpora
    • This paper presents a novel research problem, Comparative Document Analysis (CDA), that is, joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains—scientific papers and news—demonstrate the effectiveness and robustness of the proposed framework on comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly as the corpus size increases. Our case study on comparing news articles published at different dates shows the power of the proposed method on comparing sets of documents.
  • Social and semantic coevolution in knowledge networks
    • Socio-semantic networks involve agents creating and processing information: communities of scientists, software developers, wiki contributors and webloggers are, among others, examples of such knowledge networks. We aim at demonstrating that the dynamics of these communities can be adequately described as the coevolution of a social and a socio-semantic network. More precisely, we will first introduce a theoretical framework based on a social network and a socio-semantic network, i.e. an epistemic network featuring agents, concepts and links between agents and between agents and concepts. Adopting a relevant empirical protocol, we will then describe the joint dynamics of social and socio-semantic structures, at both macroscopic and microscopic scales, emphasizing the remarkable stability of these macroscopic properties in spite of a vivid local, agent-based network dynamics.
  • Tensorflow 2.0 feedback request
    • Shortly, we will hold a series of public design reviews covering the planned changes. This process will clarify the features that will be part of TensorFlow 2.0, and allow the community to propose changes and voice concerns. Please join developers@tensorflow.org if you would like to see announcements of reviews and updates on process. We hope to gather user feedback on the planned changes once we release a preview version later this year.

Phil 8.12.18

7:00 – 4:00 ASRC MKT

  • Having an interesting chat on recommenders with Robin Berjon on Twitter
  • Long, but it looks really good: Neural Processes as distributions over functions
    • Neural Processes (NPs) caught my attention as they essentially are a neural network (NN) based probabilistic model which can represent a distribution over stochastic processes. So NPs combine elements from two worlds:
      • Deep Learning – neural networks are flexible non-linear functions which are straightforward to train
      • Gaussian Processes – GPs offer a probabilistic framework for learning a distribution over a wide class of non-linear functions

      Both have their advantages and drawbacks. In the limited data regime, GPs are preferable due to their probabilistic nature and ability to capture uncertainty. This differs from (non-Bayesian) neural networks which represent a single function rather than a distribution over functions. However the latter might be preferable in the presence of large amounts of data as training NNs is computationally much more scalable than inference for GPs. Neural Processes aim to combine the best of these two worlds.
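    • A quick numpy sketch of the ‘distribution over functions’ point: each draw from a GP prior with an RBF kernel is one plausible function, where a trained (non-Bayesian) NN would commit to a single one. Everything here is a toy value.
      import numpy as np

      # RBF (squared exponential) kernel between two sets of 1-D inputs.
      def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
          d = x1[:, None] - x2[None, :]
          return variance * np.exp(-0.5 * (d / length_scale) ** 2)

      x = np.linspace(-3, 3, 100)
      K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for stability

      # Five draws from the prior = five plausible functions at these inputs.
      samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=5)
      print(samples.shape)  # (5, 100)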

  • How The Internet Talks (Well, the mostly young and mostly male users of Reddit, anyway)
    • To get a sense of the language used on Reddit, we parsed every comment since late 2007 and built the tool above, which enables you to search for a word or phrase to see how its popularity has changed over time. We’ve updated the tool to include all comments through the end of July 2017.
  • Add breadcrumbs to slides
  • Download videos – done! Put these in the ppt backup
  • Fix the DTW emergent population chart on the poster and in the slides. Print!
  • Set up the LaTeX Army BAA framework
  • Olsson
  • Slide walkthrough. Good timing. Working on the poster some more (AdversarialHerding2)

Phil 6.25.18

7:00 – 9:00 ASRC MKT

  • Update laptop – Intellij, Java, GroupPolarazation codebase
  • Add XML output for influence – done!
  • Refactored the GUI to work with smaller (laptop) screens

9:00 – 2:30 ASRC A2P

  • Debug what’s going on with the Excel reading. Try a new config file first?
  • Ground slowly through the options
    • Replaced the config file
    • Stepped through the debugger, and noticed that the worksheet was null. Tried a different worksheet/config, and that was *not* null
    • Created a new workbook and copied everything over without formatting. That worked on the converter, but didn’t work with A2P
    • Reformatted the new workbook and wound up using the Funding Summary Details data with the formatting, which is *crazy*….
    • Had some issues getting connected to the server. Pageant forgot my key.

3:00 – 4:00 ASRC MKT

  • Fika. No, not really. Wound up chatting with Will

Phil 2.21.18

7:00 – 6:00 ASRC MKT

  • Wow – I’m going to the Tensorflow Summit! Need to get a hotel.
  • Dimension reduction + velocity in this thread
  • Global Pose Estimation with an Attention-based Recurrent Network
    • The ability for an agent to localize itself within an environment is crucial for many real-world applications. For unknown environments, Simultaneous Localization and Mapping (SLAM) enables incremental and concurrent building of and localizing within a map. We present a new, differentiable architecture, Neural Graph Optimizer, progressing towards a complete neural network solution for SLAM by designing a system composed of a local pose estimation model, a novel pose selection module, and a novel graph optimization process. The entire architecture is trained in an end-to-end fashion, enabling the network to automatically learn domain-specific features relevant to the visual odometry and avoid the involved process of feature engineering. We demonstrate the effectiveness of our system on a simulated 2D maze and the 3D ViZ-Doom environment.
  •  Slides
    • Location
    • Orientation
    • Velocity
    • IR context -> Sociocultural context
  • Writing Fika. Make a few printouts of the abstract
    • It kinda happened. W
  • Write up LMN4A2P thoughts. Took the following and put them in a LMN4A2P roadmap document in Google Docs
    • Storing corpora (raw text, BoW, TF-IDF, Matrix)
      • Uploading from file
      • Uploading from link/crawl
      • Corpora labeling and exploring
    • Index with ElasticSearch
    • Production of word vectors or ‘effigy documents’ (a rough sketch of the effigy idea follows this list)
    • Effigy search using Google CSE for public documents that are similar
      • General
      • Site-specific
      • Semantic (Academic, etc)
    • Search page
      • Lists (reweightable) of terms and documents
      • Cluster-based map (pan/zoom/search)
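    • The effigy sketch referenced above, using sklearn TF-IDF. The corpus is hypothetical, and the notion that an ‘effigy’ is a document’s top TF-IDF terms turned into a search query is my assumption about the roadmap item.
      from sklearn.feature_extraction.text import TfidfVectorizer

      # Hypothetical corpus; in LMN4A2P this would come from the stored corpora.
      docs = ["adversarial herding in belief space simulation",
              "neural network training with tensorflow and keras",
              "group polarization and opinion dynamics models"]

      vectorizer = TfidfVectorizer(stop_words="english")
      tfidf = vectorizer.fit_transform(docs)
      terms = vectorizer.get_feature_names_out()

      def effigy_query(doc_index, top_n=5):
          """Top-N TF-IDF terms of one document, joined into a search query."""
          row = tfidf[doc_index].toarray().ravel()
          top = row.argsort()[::-1][:top_n]
          return " ".join(terms[i] for i in top if row[i] > 0)

      print(effigy_query(0))  # feed this to Google CSE to find similar documents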
  • I’m as enthusiastic about the future of AI as (almost) anyone, but I would estimate I’ve created 1000X more value from careful manual analysis of a few high quality data sets than I have from all the fancy ML models I’ve trained combined. (Thread by Sean Taylor on Twitter, 8:33 Feb 19, 2018)
  • Prophet is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers. (A minimal usage sketch is at the end of this entry.)
  • Done with Angular fundamentals. redirectTo isn’t working though; per the error below, a route can declare redirectTo or a component, but not both.
    • zone.js:405 Unhandled Promise rejection: Invalid configuration of route '': redirectTo and component cannot be used together ; Zone: <root> ; Task: Promise.then ; Value: Error: Invalid configuration of route '': redirectTo and component cannot be used together
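  • Back to Prophet from above: a minimal usage sketch. daily_metric.csv is a stand-in for whatever daily series you have, with the ds/y columns Prophet expects.
    import pandas as pd
    from fbprophet import Prophet  # pip install fbprophet

    # Prophet wants a dataframe with columns 'ds' (date) and 'y' (value).
    df = pd.read_csv("daily_metric.csv")

    m = Prophet()  # yearly/weekly seasonality handled automatically
    m.fit(df)

    future = m.make_future_dataframe(periods=30)  # forecast 30 days ahead
    forecast = m.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())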

Phil 11.30.17

7:00 – 4:30 ASRC MKT

  • Need to get this book: Simulating Social Complexity
  • Continuing Alignment in social interactions here. This looks much better. Found this book on game theory for groups. It took me a while to determine whether ‘joint action’ meant the coordinated (‘joint’) action of two people, or the coordinated action of two people’s joints. It’s neuroscience, after all.
  • Continuing on the Research Browser white paper. Based on the HHS requests, I added default performance logging
  • In a lot of meetings as an Aaron proxy
  • Good progress on the white paper. Just about finished the background section. Need to add pix of the current prototype

Phil 3.24.16

7:00 – 10:00, 11:00 – 3:00 VTX

  • Was going to continue The Law of Group Polarization, but got sucked into the following. On a related note, I peeked at the group sensemaking paper from CSCW and realized that they are dealing with group polarization issues.
  • Soooooooooo, I went back to check the links that the google search “link:http://dotearth.blogs.nytimes.com” brings up. In looking at the pages (mostly other blog-like sites), the link to dotearth is almost always in the blogroll list that’s off to the side on many of these sites. For example look at the lower right on climatecentral.org, and you’ll see the link.
  • I think this makes sense. These are the generic pages that point to other generic pages. So I went back to Google and searched for ‘Paul Krugman blog’ and then looked for the oldest post that I could find in the result, which was this one from January 16. Top ratings means that it has to be linked to a lot, so I tried “link:krugman.blogs.nytimes.com/2016/01/23/how-to-make-donald-trump-president/”. Alas, that doesn’t return anything, though “link:krugman.blogs.nytimes.com” does.
  • So I went to the Wikipedia most-referenced-pages page. Top ranked was Geographic coordinate system, which has over 600k inbound links. But –
  • Apparently, this is Google being coy. Searching for backlinks can be expensive. Moz has plans that start at $500/month. Bing also seems to have something with an API. Starting to check that out.
    • Added philfeldman.com to my Bing webmaster profile. Had to add BingSiteAuth.xml to the site.
    • Nope, looks like it’s just the verified pages
  • Looking at SEMrush. Pretty straightforward and $15 buys you 7,500 lines of results.
    • Here’s the REST-ish API
    • Here’s the first format I’ve tried:
      http://api.semrush.com/analytics/v1/?key=xxxxxxxxxxxxxxxxxxxxxx&target=boardsanctions.com/&type=backlinks&target_type=root_domain&display_sort=page_score_desc&display_limit=10
    • The first thing I tried out was on my angular blog entry, and this is what comes back:
      page_score;source_title;source_url;target_url;anchor;external_num;internal_num;first_seen;last_seen
      1;Philip Feldman;http://philfeldman.com/resume.html;https://phifel.wordpress.com/;blog;7;2;1435698192;1452178691
      1;Phil Feldman Resume (WebGL);http://philfeldman.com/;https://phifel.wordpress.com/;My Primary Blog;15;4;1424207638;1452178080
      1;Phil Feldman Resume (WebGL);http://www.philfeldman.com/;https://phifel.wordpress.com/;My Primary Blog;15;4;1435689880;1452178091
    • Pretty good! Very clean. Then I tried boardsanctions.com:
      page_score;source_title;source_url;target_url;anchor;external_num;internal_num;first_seen;last_seen
      0;Plastic Surgery - Avoiding The Nightmare Case - Social Gaming Wiki FR;http://fr.socialgamingwiki.com/index.php/Plastic_Surgery_-_Avoiding_The_Nightmare_Case;http://boardsanctions.com/;Georgia Medical Board Actions;4;32;1454582397;1454582397
      0;Plastic Surgeon - Advice To Allow You Choose – TFC;http://www.tvfc.de/index.php?printable=yes&title=Plastic_Surgeon_-_Advice_To_Allow_You_Choose;http://boardsanctions.com/;Doctors to avoid;2;28;1452634501;1452634501
      0;Finding A Plastic Surgeon In Your Area – TheorieWiki;http://theoriewiki.org/index.php?oldid=8721&title=Finding_A_Plastic_Surgeon_In_Your_Area;http://boardsanctions.com/;Ohio Medical Board Actions;4;40;1451297137;1451297137
      0;How To Prepare For Your Breast Augmentation – TheorieWiki;http://theoriewiki.org/index.php?title=How_To_Prepare_For_Your_Breast_Augmentation;http://boardsanctions.com/;Doctor Complaints;4;33;1444916428;1453210146
      0;Finding A Plastic Surgeon In Your Area: Unterschied zwischen den Versionen – TheorieWiki;http://theoriewiki.org/index.php?diff=8723&oldid=8721&title=Finding_A_Plastic_Surgeon_In_Your_Area;http://boardsanctions.com/;Florida Medical Board Sanctions;4;39;1457400844;1457400844
      0;Benutzer:FelicaAngelo06 – TheorieWiki;http://theoriewiki.org/index.php?title=Benutzer%3AFelicaAngelo06;http://boardsanctions.com/;NC Medical Board Actions;5;35;1448297485;1458043290
      0;Benutzer:FelicaAngelo06 – TheorieWiki;http://theoriewiki.org/index.php?title=Benutzer%3AFelicaAngelo06;http://boardsanctions.com/;http://boardsanctions.com/;5;35;1448297485;1458043290
      0;Benutzer:FelicaAngelo06 – TheorieWiki;http://theoriewiki.org/index.php?printable=yes&title=Benutzer%3AFelicaAngelo06;http://boardsanctions.com/;NC Medical Board Actions;5;30;1456257160;1457931212
      0;Benutzer:FelicaAngelo06 – TheorieWiki;http://theoriewiki.org/index.php?printable=yes&title=Benutzer%3AFelicaAngelo06;http://boardsanctions.com/;http://boardsanctions.com/;5;30;1456257160;1457931212
      0;Finding A Plastic Surgeon In Your Area – TheorieWiki;http://theoriewiki.org/index.php?title=Finding_A_Plastic_Surgeon_In_Your_Area;http://boardsanctions.com/;Florida Medical Board Sanctions;4;33;1443858328;1457622408
    • Note that it’s a good thing I’m limiting the results to 10! The second thing to notice is that every one of these links is SEO garbage. This one is my favorite. Now, this is ordered according to rank (however that’s calculated), and maybe there are better ways to order the results, but this does make me nervous about using backlinks without some checking. Maybe cosine similarity? (See the sketch at the end of this entry.)
    • So the last thing, if we want to spend some money, would be to use the Common Crawl for backlinks. Not sure if it would make any difference, but there would be more insight. As an example, there’s wikireverse, which did exactly that.
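    • The cosine-similarity check mentioned above, sketched in Python against the semicolon-separated response format shown earlier. The key is redacted, and fetch_text() is a placeholder that would need real HTML stripping.
      import csv
      import io

      import requests
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Same backlinks call as the URL above, just with named params.
      resp = requests.get("http://api.semrush.com/analytics/v1/", params={
          "key": "XXXX", "target": "boardsanctions.com/", "type": "backlinks",
          "target_type": "root_domain", "display_sort": "page_score_desc",
          "display_limit": 10})

      # The response is semicolon-separated with a header row.
      rows = list(csv.DictReader(io.StringIO(resp.text), delimiter=";"))

      def fetch_text(url):
          # Placeholder: a real version would strip HTML and boilerplate.
          return requests.get(url, timeout=10).text

      # Score each linking page by similarity to the target page's text;
      # SEO garbage should land near zero.
      target_text = fetch_text("http://boardsanctions.com/")
      texts = [target_text] + [fetch_text(r["source_url"]) for r in rows]
      tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
      sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()

      for r, s in zip(rows, sims):
          print(f"{s:.2f}  {r['source_url']}")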

Phil 3.11.16

8:00 – VTX

  • Created new versions of the Friday crawl scheduler, one for GOV, one for ORG.
  • The gap between inaccurate viral news stories and the truth is 13 hours, based on this paper: Hoaxy – A Platform for Tracking Online Misinformation
  • Here’s a rough list of reasons why UGC stored in a graph might be the best way to handle the BestPracticesService.
    • Self generating, self correcting information using incentivized contributions (every time a page you contributed to is used, you get money/medals/other…)
    • Graph database, maybe document elements rather than documents. BPS has its own network, but it connects to doctors and possibly patients (anonymized?) and their symptoms.
    • Would support results-driven medicine from a variety of interesting dimensions. For example, we could calculate the best ‘route’ from symptoms to treatment using A* (see the sketch after this list). Conversely, we could see how far from the optimal route some providers are.
    • Because it’s UGC, there can be a robust mechanism for keeping information current (think Wikipedia) as well as handling disputes
    • Could be opened up as its own diagnostic/RDM tool.
    • A graph model allows for easy determination of provenance.
    • A good paper to look at: http://www.mdpi.com/1660-4601/6/2/492/htm. One of the social sites it looked at was Medscape, which seems to be UGC
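    • A toy version of the A* ‘route’ idea using networkx. Every node, edge, and weight here is invented for illustration.
      import networkx as nx

      # Tiny symptom -> finding -> treatment graph; weights might encode
      # cost, risk, or time in a real system.
      G = nx.Graph()
      G.add_edge("chest pain", "ecg", weight=1)
      G.add_edge("ecg", "abnormal rhythm", weight=1)
      G.add_edge("abnormal rhythm", "beta blocker", weight=2)
      G.add_edge("chest pain", "stress test", weight=3)
      G.add_edge("stress test", "beta blocker", weight=2)

      # Without a domain heuristic A* degenerates to Dijkstra, which is
      # fine for a sketch; a real system would supply an informed heuristic.
      best = nx.astar_path(G, "chest pain", "beta blocker", weight="weight")
      print(best)

      # How far from optimal is an observed provider path?
      observed = ["chest pain", "stress test", "beta blocker"]
      print(nx.path_weight(G, observed, weight="weight")
            - nx.path_weight(G, best, weight="weight"))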
  • Got the new Rating App mostly done. Still need to look into inbound links
  • Updated the blacklists on everything

Phil 3.10.16

7:00 – 3:30 VTX

  • Today’s thought. Trustworthiness is a state that allows for betrayal.
  • Since it’s pledge week on WAMU, I was listening to KQED this morning, starting around 4:45 am. Somewhere around 5:30(?) they ran an environment section that talked about computer-generated hypotheses. Trying to run that down with no luck.
  • Continuing A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web.
    • End-user–based framework approaches use different methods to allow for the differences between individual end-users for adaptive, interactive, or personalized assessment and ranking of UGC. They utilize computational methods to personalize the ranking and assessment process or give an individual end-user the opportunity to interact with the system, explore content, personally define the expected value, and rank content in accordance with individual user requirements. These approaches can also be categorized in two main groups: human centered approaches, also referred to as interactive and adaptive approaches, and machine-centered approaches, also referred to as personalized approaches. The main difference between interactive and adaptive systems compared to personalized systems is that they do not explicitly or implicitly use users’ previous common actions and activities to assess and rank the content. However, they give users opportunities to interact with the system and explore the content space to find content suited to their requirements.
    • Looks like section 3.1 is the prior research part for the Pertinence Slider Concept.
    • Evaluating the algorithm reveals that enrichment of text (by calling out to search engines) outperforms other approaches by using simple syntactic conversion

      • This seems to work, although the dependency on a Google black box is kind of scary. It really makes me wonder what the links created by a search of each sentence (where the subject is contained in the sentence?) would look like, and what we could learn. I took the On The Media retweet of a Google Trends tweet [“Basta” just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate] (https://twitter.com/GoogleTrends/status/707756376072843268) and fed that into Google, which returned:
        4 results (0.51 seconds)
        Search Results
        Hillary Clinton said 'basta' and America went nuts | Sun ...
        national.suntimes.com/.../7/.../hillary-clinton-basta-cnn-univision-debate/
        9 hours ago - America couldn't get enough of a line Hillary Clinton dropped during Wednesday night's CNN/Univision debate after she ... "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate.
        Hillary is Asked If Trump is 'Racist' at Debate, But It Gets ...
        https://www.ijreview.com/.../556789-hillary-was-asked-if-trump-was-raci...
        "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate. — GoogleTrends (@GoogleTrends) March 10, 2016.
        Election 2016 | Reuters.com
        live.reuters.com/Event/Election_2016?Page=93
        Reuters
        Happening during tonight's #DemDebate, below are the first three tracks: ... "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during # ...
        Maysoon Zayid (@maysoonzayid) | Twitter
        https://twitter.com/maysoonzayid?lang=en
        Maysoon Zayid added,. GoogleTrends @GoogleTrends. "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate.
    • Found Facilitating Diverse Political Engagement with the Living Voters Guide, which I think is another study of the Seattle system presented at CSCW in Baltimore. The survey indicates that it has a good focus on bubbles.
    • Encouraging Reading of Diverse Political Viewpoints with a Browser Widget. Possibly more interesting are the papers that cite this…
    • Can you hear me now? Mitigating the echo chamber effect by source position indicators
    • Does offline political segregation affect the filter bubble? An empirical analysis of information diversity for Dutch and Turkish Twitter users
    • Events and controversies: Influences of a shocking news event on information seeking
  • Finished and committed the CrawlService changes. Jenkins wasn’t working for some reason, so we spun on that for a while. Tested and validated on the Integration system.
  • Worked some more on the Rating App. It compiles all the new persisted types in the new DB. Realized that the full website text should be in the result, not the rating.
  • Modified Margarita’s test file to use Theresa’s list of doctors.
  • Wrote up some notes on why a graph DB and UGC might be a really nice way to handle the best practices part of the task

Phil 3.9.16

7:00 – 2:30 VTX

  • Good discussion with Wayne yesterday about getting lost in a car with a passenger.
    • A trapper situated in an environment who may not know exactly where he is, but is not lost, is analogous to people exchanging information where the context is well understood but new information is being created in that context. Think of sports enthusiasts or researchers. More discussion will happen about the actions in the game than about the stadium it was played in. Similarly, the focus of a research paper is the results, as opposed to where the authors appear in the document. Events can transpire to change that discussion (the power failure at the 2013 Super Bowl, for example), but even then most of the discussion involves how the blackout affected gameplay.
    • Trustworthy does not mean infallible. GPS gets things wrong, but we still depend on it. It has very high system trust. Interestingly, a Google Search of ‘GPS Conspiracy’ returns no hits about how GPS is being manipulated, while ‘Google Search Conspiracy’ returns quite a few appropriate hits.
    • GPS can also be considered a potential analogy to how our information gathering behaviors will evolve. Where current search engines index and rank existing content, a GPS synthesises a dynamic route based on an ever-increasing set of constraints (road type, tolls, traffic, weather, etc). Similarly, computational content generation (of which computational journalism is just one of the early trailblazers) will also generate content that is appropriate for the current situation (in 500 feet turn right). Imagine a system that can take a goal “I want to go to the moon” and creates an assistant that constantly evaluates the information landscape to create a near optimal path to that goal with turn-by-turn directions.
    • Studying how to create Trustworthy Anonymous Citizen Journalism is important then for:
      • Recognising individuals for who they are rather than who they say they are
      • Synthesizing trustworthy (quality?) content from the patterns of information as much as the content (Sweden = boring commute, Egypt = one lost, 2016 Republican Primaries = lost and frustrated direction asking, etc). The dog that doesn’t bark is important.
      • Determining the kind of user interfaces that create useful trustworthy information on the part of the citizen reporters and the interfaces and processes that organize, synthesise, curate and rank the content to the news consumer.
      • Providing a framework and perspective to provide insight into how computational content generation potentially reshapes Information Retrieval as it transitions to Information Goal Setting and Navigation.
  • Continuing A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web.
  • Finish tests – Done. Found a bug!
  • Submit paperwork for Wall trip in Feb. Done
  • Get back to JPA
    • Set up new DB.
    • Did the initial populate. Now I need to add in all the new data bits.
  • Margarita sent over a test json file. Verified that it worked and gave her kudos.

Phil 3.8.16

7:00 – 3:00 VTX

  • Continuing A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web. Dense paper, slow going.
    • Ok, Figure 3 is terrible. Blue and slightly darker blue in an area chart? Sheesh.
    • Here’s a nice nugget though regarding detecting fake reviews using machine learning: For assessing spam product reviews, three types of features are used [Jindal and Liu 2008]: (1) review-centric features, which include rating- and text-based features; (2) reviewer-centric features, which include author based features; and (3) product-centric features. The highest accuracy is achieved by using all features. However, it performs as efficiently without using rating-based features. Rating-based features are not effective factors for distinguishing spam and nonspam because ratings (feedback) can also be spammed [Jindal and Liu 2008]. With regard to deceptive product reviews, deceptive and truthful reviews vary concerning the complexity of vocabulary, personal and impersonal use of language, trademarks, and personal feelings. Nevertheless, linguistic features of a text are simply not enough to distinguish between false and truthful reviews. (Comparison of deceptive and truthful travel reviews). Here’s a later paper that cites the previous. Looks like some progress has been made: Using Supervised Learning to Classify Authentic and Fake Online Reviews
    • And here’s a good nugget on calculating credibility. Correlating with expert sources has been very important: Examining approaches for assessing credibility or reliability more closely indicates that most of the available approaches use supervised learning and are mainly based on external sources of ground truth [Castillo et al. 2011; Canini et al. 2011]—features such as author activities and history (e.g., a bio of an author), author network and structure, propagation (e.g., a resharing tree of a post and who shares), and topical-based affect source credibility [Castillo et al. 2011; Morris et al. 2012]. Castillo et al. [2011] and Morris et al. [2012] show that text- and content-based features are themselves not enough for this task. In addition, Castillo et al. [2011] indicate that authors’ features are by themselves inadequate. Moreover, conducting a study on explicit and implicit credibility judgments, Canini et al. [2011] find that the expertise factor has a strong impact on judging credibility, whereas social status has less impact. Based on these findings, it is suggested that to better convey credibility, improving the way in which social search results are displayed is required [Canini et al. 2011]. Morris et al. [2012] also suggest that information regarding credentials related to the author should be readily accessible (“accessible at a glance”) due to the fact that it is time consuming for a user to search for them. Such information includes factors related to consistency (e.g., the number of posts on a topic), ratings by other users (or resharing or number of mentions), and information related to an author’s personal characteristics (bio, location, number of connections).
    • On centrality in finding representative posts, from Beyond trending topics: Real-world event identification on twitter: The problem is approached in two concrete steps: first by identifying each event and its associated tweets using a clustering technique that clusters together topically similar posts, and second, for each event cluster, selecting the posts that best represent the event. Centrality-based techniques are used to identify relevant posts with high textual quality and are useful for people looking for information about the event. Quality refers to the textual quality of the messages—how well the text can be understood by any person. Of three centrality-based approaches (Centroid, LexRank [Radev 2004], and Degree), Centroid is found to be the preferred way to select tweets given a cluster of messages related to an event [Becker et al. 2012] (a sketch of the centroid heuristic is just below). Furthermore, Becker et al. [2011a] investigate approaches for analyzing the stream of tweets to distinguish between relevant posts about real-world events and nonevent messages. First, they identify each event and its related tweets by using a clustering technique that clusters together topically similar tweets. Then, they compute a set of features for each cluster to help determine which clusters correspond to events and use these features to train a classifier to distinguish between event and nonevent clusters.
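    • The centroid heuristic, sketched with a made-up cluster: compute the TF-IDF centroid, then pick the post closest to it.
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      # Hypothetical cluster of topically similar event posts.
      cluster = ["power outage at the superdome delays the game",
                 "lights out at the superbowl, players waiting on the field",
                 "half the stadium just went dark during the third quarter"]

      tfidf = TfidfVectorizer(stop_words="english").fit_transform(cluster)
      centroid = np.asarray(tfidf.mean(axis=0))

      # The post closest to the cluster centroid represents the event.
      sims = cosine_similarity(centroid, tfidf).ravel()
      print(cluster[int(sims.argmax())])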
  • Meeting with Wayne at 4:15
  • Crawl Service
    • Had the ‘&q=’ part in the wrong place
    • Was setting the key equal to the CSE engine id in the payload, which caused a lot of errors. And it’s working now! Here’s the full payload:
      {
       "query": "phil+feldman+typescript+angular+oop",
       "engineId": "cx=017379340413921634422:swl1wknfxia",
       "keyId": "key=AIzaSyBCNVJb3v-FvfRbLDNcPX9hkF0TyMfhGNU",
       "searchUrl": "https://www.googleapis.com/customsearch/v1?",
       "requestId": "0101016604"
      }
    • Only the “query” field is required. There are hard-coded defaults for engineId, keyId, and searchUrlPrefix. (A sketch of the URL assembly is at the end of this entry.)
    • Ok, time for tests, but before I try them in the Crawl Service, I’m going to try out Mockito in a sandbox
    • Added mockito-core to the GoogleCSE2 sandbox. Starting on the documentation. Ok – that makes sense
    • Added SearchRequestTest to CrawlService
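    • The URL assembly, sketched in Python just to pin down the format; the service itself is Java, and the key here is redacted. Note the payload values already carry their ‘cx=’/‘key=’ prefixes, which is what bit me earlier.
      import requests

      payload = {"query": "phil+feldman+typescript+angular+oop",
                 "engineId": "cx=017379340413921634422:swl1wknfxia",
                 "keyId": "key=XXXX",
                 "searchUrl": "https://www.googleapis.com/customsearch/v1?"}

      # key and cx first, then the '&q=' part in the right place this time.
      url = (payload["searchUrl"] + payload["keyId"] + "&"
             + payload["engineId"] + "&q=" + payload["query"])

      resp = requests.get(url)
      print(resp.json().get("searchInformation", {}).get("totalResults"))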

Phil 3.2.16

5:00-ish 4:00 – VTX

  • Call Charlestown
  • Meeting with Dr. Pan
    • The new ground truth framework looks good. Saving outbound and inbound links is also worth doing.
    • Beware of low-percentage patterns. Finding the 1% answer is very hard for machine learning, while finding the 49% answer is much easier.
    • SVMs are probably a good way to start since they are resistant to overfitting
    • Multiple passes may be required to filter the data to get a meaningful result. Patterns like the .edu/.gov ratio may be very helpful
    • The subReddit Change My View is an interesting UGC site that should provide good examples of information networks on both sides of a controversial point, and a measure of success. It would certainly be interesting to do a link analysis.
  • Starting on A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web. If I’m right, I should have a Game Theory/Information Economics model to frame this. Here’s hoping.
    • As an aside, parsing my saved documents to get authors, general terms, and ACM Reference Format terms should be done to compare the produced networks. Looks like PDFBox should do the trick.
    • Elaheh Momeni – Lots of stuff on UGC
      • Data Mining
      • Collective Intelligence
      • Machine Learning
      • User Generated Content Mining
      • Social Computing
    • Claire Cardie
      • argument mining and argument generation including the identification of supported vs. unsupported claims and opinions,
      • social-computational methods for improving communication and interactions in on-line settings,
      • NLP for e-rulemaking,
      • sentiment analysis: extraction and summarization of fine-grained opinions in text,
      • discourse-aware methods for opinion and argument extraction,
      • deception detection in on-line reviews,
      • noun phrase coreference resolution.
    • Nick Diakopoulos
      • Research in computational and data journalism with an emphasis on algorithmic accountability, narrative data visualization, and social computing in the news.
  • New Weapon in Day Laborers’ Fight Against Wage Theft: A Smartphone App – NYTimes. Short documentary on YouTube. Sol Aramendi is the author?
  • Spent time when I should be sleeping thinking about rating webpages. Rather than the current single list, I think at least four categories are needed:
    • Accessible yes/no (404, etc)
    • Match – did the person show up yes/no/possible-can’t tell
    • Target Characterization
      • Positive – gave to charity, published a paper
      • Neutral – phone book listing
      • Negative – conviction, confession
    • Source type
      • Official Document
      • Home Page
      • Microblog
      • Blog
      • News organization
      • Federal Government
      • State Government
      • Commercial Entity – Rating site, etc
      • Non-commercial Entity – Nonprofit, clubs, interest group
      • Educational – yearbook, program, course listing
      • Machine-generated for unclear purpose
      • Spam
    • Content Characterization (can be multiple)
      • Medical
      • Legal
      • Commercial
      • Official
      • Marketing
      • Other
      • Spam
    • Quality Characterization
      • Low – confusing, conflicting unrelated information
      • Minimal – some useful information (Machine harvested from better sources)
      • High – clear, providing high quality information
    • Source Characterization
      • Very trustworthy – I’d give them my SSN
      • Trustworthy – I’d use a credit card here
      • Credible – I’d use this site to support an argument
      • Neutral – Not sure, but wouldn’t avoid
      • Not Credible – Not rooted in things that I believe/trust
      • Distrustworthy – I’m pretty sure this site is misinformation
      • Very Distrustworthy – Conspiracy theories, Lizardmen, etc
    • Relevant Text – In addition, I think we need a text area into which the user can paste text from the webpage that contains the match in context, or something that exemplifies the source characterisation
    • Notes – To cover anything that’s not covered above
  • So now Gregg is handling Crawl Service file generation?
  • Discussion with Katy and Jeremy about the list above?
  • Pondering how to adjust the ratingObject: everything is a string, except for content characterization, which can have multiple values. I could do a bitfield or a separate table. Leaning towards the bitfield.
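  • A bitfield sketch, using a Python IntFlag just to show the encoding; the real rating object is Java, so there it would be an int column plus constants.
    from enum import IntFlag

    # Content characterization as a bitfield, since a page can be several
    # of these at once; the other rating fields stay plain strings.
    class ContentFlags(IntFlag):
        MEDICAL = 1
        LEGAL = 2
        COMMERCIAL = 4
        OFFICIAL = 8
        MARKETING = 16
        OTHER = 32
        SPAM = 64

    flags = ContentFlags.MEDICAL | ContentFlags.COMMERCIAL
    print(int(flags))                     # 5, the value to persist
    print(ContentFlags.MEDICAL in flags)  # True on read-back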

Phil 2.29.16

7:00 – 3:00 VTX

  • Seminar today, sent Aaron a reminder.
    • Some discussion about my publication quantity. Amy suggests 8 papers as the baseline for credibility. So here are some preliminary thoughts about what could come out of my work:
      • Page Rank document return sorting Pertinence
      • User Interfaces for trustworthy input
      • Rating the raters / harnessing the Troll
      • Trustworthiness inference using network shape
      • Adjusting relevance through GUI pertinence
      • Something about ranking of credibility cues – Video, photos, physical presence, etc.
      • Something about the patterns of posting indicating the need for news. Sweden vs. Gezi. And how this can be indicative of emerging crisis informatics need
      • Something about fragment synthesis across disciplines and being able to use it to ‘cross reference’ information?
      • Fragment synthesis vs. community fragmentation.
    • 2013 SenseCam paper
    • Narrative Clip
  • Continuing Incentivizing High-quality User-Generated Content.
    • Looking at the authors
    • The proportional mechanism therefore improves upon the baseline mechanism by disincentivizing q = 0, i.e., it eliminates the worst reviews. Ideally, we would like to be able to drive the equilibrium qualities to 1 in the limit as the number of viewers, M, diverges to infinity; however, as we saw above, this cannot be achieved with the proportional mechanism.
    • This reflects my intuition. The lower the quality of the rating, the worse the proportional rating system is, and the lower the bar for quality for the contributor. The three places that I can think of offhand that have high-quality UGC (Idea Channel, StackOverflow, and Wikipedia) all have people rating the data (contextually!!!) rather than a simple up/downvote.
      • Idea Channel – The main content creators read the comments and incorporate the best in the subsequent episode.
      • StackOverflow – Has become a place to show off knowledge; there are community mechanisms of enforcement, and the number of answers is low enough that it’s possible to look over all of them.
      • Others that might be worth thinking about:
        • Quora? Seems to be an odd mix of questions. Some just seem lazy (how do I become successful) or very open ended (What kind of guy is Barack Obama). The quality of the writing is usually good, but I don’t wind up using it much. So why is that?
        • Reddit. So ugly that I really don’t like using it. Is there a System Quality/Attractiveness as well as System Trust?
        • Slashdot. Good headline service, but low information in the comments. Occasionally something insightful, but often it seems like rehearsed talking points.
    • So the better the raters, the better the quality. How can the System evaluate rater quality? Link analysis? Pertinence selection? And if we know a rater is low-quality, can we use that as a measure in its own right?
  • Trying to test the redundant web page filter, but the URLs for most identical pages are actually slightly different.
  • I think tomorrow I might parse the URL or look at page content. Tomorrow.
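  • A starting point for the URL parsing: canonicalize before comparing, so trivially different URLs for the same page collapse to one key. Which query parameters are safe to drop is site-specific; this sketch drops them all.
    from urllib.parse import urlsplit, urlunsplit

    def canonical(url):
        # Lowercase only scheme and host; paths can be case-sensitive.
        parts = urlsplit(url.strip())
        path = parts.path.rstrip("/") or "/"
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           path, "", ""))

    print(canonical("HTTP://Example.com/about/?utm_source=feed#top"))
    # -> http://example.com/about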

Phil 2.25.16

7:00 – 5:00 VTX

  • Thinking more about the economics of contributing trustworthy information. Recently, I’ve discovered the PBS Idea Channel, which is a show that explores pop culture with a philosophical bent (LA Times review here). For example, Deadpool is explored from a phenomenology perspective. But what’s really interesting, and seems unique to me, is the relationship of the show with its commenters. For each show, there is a follow-on show where the most interesting comments are discussed by the host, Mike Rugnetta. And the comments are surprisingly cogent and good. I think that this is because Rugnetta is acting like the anchor of an interactive news program where the commenters are the reporters. He sets up the topic, gets the ball rolling, and then incorporates the best comments (stories) to wrap up the story. Interestingly, in a recent comment section on aesthetics (which I can’t find now?), he brings up a comment about science and philosophy, invites the commenter into a deeper discussion, and floats the potential of an episode about that.
  • To get a flavor, here’s one of the longer comments (with 25 replies on its own) from the Deadpool show:
    I could actually buy that DeadPool’s ability to understand the medium he’s in if it weren’t for one thing he does very often: references to our world. If his fourth wall breaks were limited to interacting with the panels, making quips and nods about the idea of “readers”, and joking about general comic book (or video game or movie) tropes, then I’d be on board with the idea that he is hyper-aware due to his constant physical torment and knowledge of his own perceptions. however, he somehow has knowledge of things that do not seem to exist in the world he inhabits, such as memes, pop culture references, and things like “Leeroy Jenkins”. His hypersensitivity can explain his knowledge of the medium he’s in (an integral part of the reality he inhabits), but I don’t see a way that it could explain him knowing about things that, as far as I’m aware, do not exist in his reality.
  • Compare that to the comments for the MIT opencourseware intro to MIT 6.034, which I ‘took’ and found well presented and deeply interesting, though not as flashy. Here’s a rough equivalent (with 21 replies):
    wow ..it’s such an overwhelming feeling for a guy like me ..who had no chance in hell of ever getting into MIT or any other ivy’s to be able to listen and learn from this lectures online and that too free. :’)
  • To me, it seems like the Deadpool post is deeply involved with the subject matter of the episode, while the MIT comment is more typical of a YouTube comment, in that it is more about the commenter and less about the content. This implies that rewarding good comments by including them in the content of the show can improve the quality and relevance of the comments.
  • To continue the ‘News Anchor’ thought from above, it might be possible to structure a news entity of some kind where different areas (sports, entertainment, local/regional, etc.) could have their own anchors that produce interactive content with their commenters. Some additional capability to handle multimedia uploads from commenters should probably be supported, along with better navigation, but this sounds more to me like a 21st-century news product than many other things that I’ve seen. It’s certainly the opposite of the Sweden paper.
  • And speaking of papers, here’s one on YouTube comments: Commenting on YouTube Videos: From Guatemalan Rock to El Big Bang
  • Starting on Incentivizing High-quality User-Generated Content.
    • References look really good. Only 8? For a WWW paper?
    • This is starting to look like what I was trying to find. Nash Equilibrium. Huh. The model predicts, as observed in practice, that if exposure is independent of quality, there will be a flood of low quality contributions in equilibrium. An ideal mechanism in this context would elicit both high quality and high participation in equilibrium.
  • Need to add ‘change password’ option. Done. And now that I know my way around JPA, I like it a lot
  • Added role-based enabling of menu choices
  • The code base could really use a cleanup. We have the classic research->production problem…
  • Adding match/nomatch and blacklist queries. Note that blacklist needs to be by search engine
    • Finished match
    • Finished nomatch
    • Working on Blacklist
    • Create a loop that changes all the QueryObjects so that qo.getUnquotedName() is used and persist.

Phil 2.18.16

7:00 – 6:00 VTX

I think that this is more an issue of information economics. The incentives in social publication are honor, glory, and followers, plus maybe some money from ad revenue sharing (though this is changing?). Traditional news media offers a more direct model, where the product (news) is sold to readers and/or advertisers so that the news-making product can be made.

Connectivism states that there is now an emphasis on learning how to find information as opposed to knowing the information (since information obsolescence happens more rapidly, the value of the information itself is lower than the value of knowing how to find current information).

Traditional news media tends to aggregate information into stories because that makes learning entertaining and worth the price paid (cash, or time watching commercials). However, if the friction of finding free alternatives to the story’s underlying information is low, then the value of the story drops, since now all you’re paying for is a pleasing presentation.

Blogs and other free sources make this more difficult for the consumer, since what appears credible may not be, but may be confused with an actual information source nonetheless. Or, looking at confirmation bias, a free pleasing story may have higher value for a consumer than a (non-free) well researched story that disputes the reader’s beliefs.

There is also an emotional cost to checking rumors that you agree with: going to Snopes to find out that the politician you hate didn’t actually do that stupid thing you just saw in your feed. So the traditional few-channel media is being subsumed by networks that we construct to support our biases?

  • Banged away at the white paper. Done! Off to Key West for a long weekend!

Phil 2.15.16

7:30 – 1:30 VTX