Phil 8.25.16

7:00 – 3:30 ASRC

  • Paper
  • Code
    • Build class(es) that use some of the CorpusBuilder (or just add to output?) codebase to:
    • Access webpages based on xml config file
    • Read in, lemmatize, and build bag-of-words per page (configurable max). Done. Took out the TF-IDF code and replaced it with BagOfWords in DocumentStatistics.
    • Write out .arff file that includes the following elements
      • @method (TF-IDF, LSI, BOW)
      • @source (loomings, the carpet bag, the spouter inn, the counterpane)
      • @title (Moby-dick, Tarzan)
      • @author (Herman Melville, Edgar Rice Burroughs)
      • @words (nantucket,harpooneer,queequeg,landlord,euroclydon,bedford,lazarus,passenger,circumstance,civilized,water,thousand,about,awful,slowly,supernatural,reality,sensation,sixteen,awake,explain,savage,strand,curbstone,spouter,summer,northern,blackness,embark,tempestuous,expensive,sailor,purse,ocean,tomahawk,black,night,dream,order,follow,education,broad,stand,after,finish,world,money,where,possible,morning,light)
    • So a line should look something like
      • LSI, chapter-1-loomings, Moby-dick, Herman Melville, 0,0,0,0,0,0,0,5,0,0,7,4,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,3,3,0,0,0,0,2,0,0,0,5,0,0,3,4,2,0,0,0
      • Updated LabledMatrix2D to generate arff files.
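
A minimal sketch of the bag-of-words-to-.arff step described above, in Python rather than in the actual codebase. The function names (bag_of_words, quote, to_arff), the "word_" attribute prefix, and the relation name are all hypothetical and are not taken from CorpusBuilder, DocumentStatistics, or LabledMatrix2D. It also assumes the standard ARFF layout (@relation, @attribute, @data), with method, source, title, and author written as string attributes and one numeric attribute per word, which may differ from the header elements listed above.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in a page's lemmatized tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def quote(value):
    """Single-quote a string value so spaces survive in the ARFF data section."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(relation, vocabulary, rows):
    """Build ARFF text. Each row is (method, source, title, author, counts)."""
    lines = ["@relation " + relation, ""]
    for name in ("method", "source", "title", "author"):
        lines.append("@attribute %s string" % name)
    for word in vocabulary:
        # Hypothetical "word_" prefix keeps word columns from clashing with metadata names.
        lines.append("@attribute word_%s numeric" % word)
    lines += ["", "@data"]
    for method, source, title, author, counts in rows:
        meta = [quote(v) for v in (method, source, title, author)]
        lines.append(",".join(meta + [str(c) for c in counts]))
    return "\n".join(lines) + "\n"


# Example with a three-word vocabulary and a made-up token list.
vocab = ["nantucket", "water", "sailor"]
tokens = ["water", "sailor", "water", "ocean"]
counts = bag_of_words(tokens, vocab)
arff = to_arff("chapters", vocab,
               [("BOW", "chapter-1-loomings", "Moby-dick", "Herman Melville", counts)])
print(arff)
```

Words outside the vocabulary ("ocean" here) are simply dropped, which matches the idea of a configurable maximum number of words per page.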
