Category Archives: Derived DB

Phil 12.8.15

7:00 – 4:30 VTX

Phil 12.4.15

8:00 – VTX

  • Scrum
  • Found an interesting tidbit on the WaPo this morning. It implies that if there is a pattern of statement followed by a search for confirming information followed by a public citation of confirming information could be the basic unit of an information bubble. For this to be a bubble, I think the pertinent information extracted from the relevant search results would have to be somehow identifiable as a minority view. This could be done by comparing the Jaccard index of the adjusted results with the raw returns of a search? In other words, if the world (relevant search)  has an overall vector in one direction and the individual preferences produce a pertinent result that is pointing in the opposite direction (large dot product), then the likelihood of those results being the result of echo-chamber processes are higher?
  • If the Derived DB depends on analyst examination of the data, this could be a way of flagging analyst bias.
  • Researching WebScaleSQL, I stumbled on another db from Facebook. This one,  RocksDB, is more focused on speed. From the splash page:
    • RocksDB can be used by applications that need low latency database accesses. A user-facing application that stores the viewing history and state of users of a website can potentially store this content on RocksDB. A spam detection application that needs fast access to big data sets can use RocksDB. A graph-search query that needs to scan a data set in realtime can use RocksDB. RocksDB can be used to cache data from Hadoop, thereby allowing applications to query Hadoop data in realtime. A message-queue that supports a high number of inserts and deletes can use RocksDB.
  • Interestingly, RocksDB appears to have integration with MongoDB and is working on MySQL integration. Cassandra appears to be implementing similar optimizations.
  • Just discovered reported.ly, which is a social medial sourced, reporter curated news stream. Could be a good source of data to compare against things like news feeds from Google or major news venues.
  • Control System Meeting
    • Send RCS and Search Competition to Bob
    • Seems like this whole system is a lot like what Databricks is doing?

Phil 12.3.15

7:00 – 5:00 VTX

  • Learning: Genetic Algorithms
    • Rank space (probability is based on unsorted values??)
    • Simulated annealing – reducing step size.
    • Diversity rank (from the previous generation) plus fitness rank
  • Some more timing results. The view test (select count(*) from tn_view_network_items where network_id = 1) for the small network_1 is about the same as the pull for the large network_8, about .75 sec. The pull from the association table without the view is very fast – 0.01 for network_1 and 0.02 for network_8. So this should mean that a 1,000,000 item pull would take 1-2 seconds.
  • mysql> select count(*) from tn_associations where network_id = 1;
     11 
    1 row in set (0.01 sec)
    
    mysql> select count(*) from tn_associations where network_id = 8;
     10000 
    1 row in set (0.01 sec)
    
    mysql> select count(*) from tn_view_network_items where network_id = 8;
     10000 
    1 row in set (0.88 sec)
    
    mysql> select count(*) from tn_view_network_items where network_id = 1;
     11 
    1 row in set (0.71 sec)
  • Field trip to Wall NJ
    • Learned more about the project, started to put faces to names
    • Continued to look at DB engines for the derived DB. Discovered WebScaleSQL, which is a collaboration between Alibaba, Facebook, Google, LinkedIn, and Twitter to produce a big(!!) version of MySql.
    • More discussions with Aaron D. about control systems, which means I’m going to be leaning on my NIST work again.