Monthly Archives: July 2020

Phil 7.8.20

A brief history of high-speed trading (via the Museum of American Finance)

  • In the late 1830s, Philadelphia broker William C. Bridges operated a private signal station between New York and Philadelphia which disseminated stock market news to him and his backers (and to no one else). The signals were transmitted through an “optical telegraph,” which consisted of a series of boards on a pole, mounted on hills that could be seen by a telescope.

DtZ

  • The IHME site has improved to the point that we should pull down our site

GPT-2 Agents

  • Need to think about how to show that interrogating a language model is sufficiently similar to interrogating actual data.
    • At this point, I know that the language model comes up with legal moves
    • I need to compare the statistics of actual moves to synthetic moves to see if the populations are sufficiently similar. This means that I need to get the training and evaluation data into the database. Once that’s done, I can compare the frequency of move types (e.g. “At move 10, White moves pawn from a2 to a4”), and the moves from a particular location (e.g. “e2” can have moves to “e3” and “e4” with the pawn, or diagonals with the “f1” bishop or the white queen).
    • The level of similarity should indicate if the biases of the players are represented in the language model.
      • There should be a way of determining a lower bound of data?
      • Once this is shown, then the idea of generalizing to other human interactions can be justified.
  • Started PGNtoDB, which will populate table_actual
    • Ignoring castling for now
    • Chunking into the database! And by chunking, I refer to the sound of the drive 🙂
    • And now I have a catalog of 188,324 human chess moves

chess_moves_db

GOES

  • 10:00 Meeting with Vadim
  • 2:00 Status
  • Last training for a while!

Phil 7.7.20

The opportunity cost of this is going to be so steep. I wonder what country will set up an effective, open, online university?

f1

GPT-2 Agents

  • Working through the texthero examples. Spent a lot of time figuring out how to print elements from a row in a Dataframe, which was ridiculously hard. Instead, I just turned it into a dict and worked with that
    # print the first n rows of a dataframe using the specified columns. Use a -1 for printing all rows 
    def print_df(df:pd.DataFrame, headers:List, num_rows:int = 4, max_chars:int = 80):
        s:pd.Series
        rows = 0
    
        d:Dict = df.to_dict('index')
        rd:Dict
        for index, rd in d.items():
            st = ""
            keys = rd.keys()
            for key in headers:
                if key in keys:
                    val = rd[key]
                    st += "{}: {}, ".format(key, val[:max_chars])
            print(st)
            rows += 1
            if num_rows != -1 and rows > num_rows:
                break
  • The scatterplot appears to use plotly, since it’s presented in the browser. That’s kind of cool, since it implies that the plotting functions of plotly are free somehow? After going to the plotly.com website, I see that “Plotly.py is free and open source and you can view the source, report issues or contribute on GitHub.” That would be worth digging into some more then. Here’s the PCA plot:

pca

  • You can make word clouds easily, too

WordCloud

GOES

  • Finish training? Ooops, forgot
  • Some discussion with Vadim about the structure of the control

ML Seminar

  • Good discussion on topic extraction over time. Basically, create k topics from the entire corpora. Each topic is a ranking of all the words in the corpus. Behavior over time is the amount of the top words from each topic k in each time sample t.

Phil 7.6.20

GPT-2 Agents

  • Search the db for the appropriate “from to” text snippet (e.g. “Black moves pawn from e2 to e3”), with a count of the number of times this move was done using that piece. Done!
  • Add a “fewest hops” (A* – traditional network approach), closest (each step finds the closest node to the target) in addition to the line following algorithm. There will have to be some user testing to see what makes the most sense, if any
  • Played around a bit with Summarization, but it didn’t work that well
  • TextHero came across my Twitter feed. It might be good for topic extraction? Trying it out, but the documentation is… sparse. Checking out
    • Installing many things:
      • unidecode
      • spacy (which installs many other things)
      • plotly (I thought you had to pay to use this?)
      • wordcloud
    • Working through the example, which is broken. Trying to fix based on the Getting Started

GOES

  • 11:00 Meeting with Vadim
  • Got the DataDictionary streaming to InfluxDBL

influx_ddict

  • More Satern – one more course down

Phil 7.4.20

Starting to think about topic modeling.Here are some resources:

I also want to search the db for the appropriate “from to” text snippet (e.g. “Black moves pawn from e2 to e3”), maybe with a count of the number of times this move was done using that piece

Also, I think it makes sense to have a “fewest hops” (A* – traditional network approach), closest (each step finds the closest node to the target) in addition to the line following algorithm. There will have to be some user testing to see what makes the most sense, if any

The map is based on the single jumps, and shows the big jumps as arcs

Phil 7.3.20

Today is a federal holiday, so no rocket science

Huggingface has a pipeline interface now that is pretty abstract. This works:

from transformers import pipeline

translator = pipeline("translation_en_to_fr")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
  • [{‘translation_text’: ‘Hugging Face est une entreprise technologique basée à New York et à Paris.’}]

Wow: GPT-3 writes code!

DtZ is back up! Too many countries have the disease and the histories had to be cropped to stay under the data cap for the free service

GPT-2 Agents

  • Work on more granular path finding
    • Going to try the hypotenuse of distance to source and line first – nope
    • Trying looking for the distances of each and doing a nested sort
    • I had a problem where I was checking to see whether a point was between the current node and the target node using the original line between the source and target nodes. Except that I was checking on a lone from the current node to the target, and failing the test. Oops! Fixed
    • I went back to the hypotenuse version now that the in_between test isn’t broken and look at that!

granular

    • Added the option for coarse or granular paths
  • Start thinking about topic extraction for a given corpus

#COVID

  • Evaluate Arabic to English translation. Got it working!
    from transformers import MarianTokenizer, MarianMTModel
    from typing import List
    src = 'ar'  # source language
    trg = 'en'  # target language
    sample_text = "لم يسافر أبي إلى الخارج من قبل"
    sample_text2 = "الصحة_السعودية تعلن إصابة أربعيني بفيروس كورونا بالمدينة المنورة حيث صنفت عدواه بحالة أولية مخالطة الإبل مشيرة إلى أن حماية الفرد من(كورونا)تكون باتباع الإرشادات الوقائية والمحافظة على النظافة والتعامل مع #الإبل والمواشي بحرص شديد من خلال ارتداء الكمامة "
    mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
    
    model = MarianMTModel.from_pretrained(mname)
    tok = MarianTokenizer.from_pretrained(mname)
    batch = tok.prepare_translation_batch(src_texts=[sample_text2])  # don't need tgt_text for inference
    gen = model.generate(**batch)  # for forward pass: model(**batch)
    words: List[str] = tok.batch_decode(gen, skip_special_tokens=True) 
    print(words)
  • It took a few tries to find the right model. The naming here is very haphazard.
  • Asked for a sanity check from the group
    • This:
      الصحة_السعودية تعلن إصابة أربعيني بفيروس كورونا بالمدينة المنورة حيث صنفت عدواه بحالة أولية مخالطة الإبل مشيرة إلى أن حماية الفرد من(كورونا)تكون باتباع الإرشادات الوقائية والمحافظة على النظافة والتعامل مع #الإبل والمواشي بحرص شديد من خلال ارتداء الكمامة
    • Translates to this:
      Saudi health announces a 40-year-old corona virus in the city of Manora, where his enemy was classified as a primary camel conglomerate, indicating that the protection of the individual from Corona would be through preventive guidance, hygiene, and careful handling of the Apple and the cattle by wearing the gag.

       

  • Write script that takes a batch of rows and adds translations until all the rows in the table are complete

Book chat

Phil 7.2.20

Emergence of polarized ideological opinions in multidimensional topic spaces

  • Opinion polarization is on the rise, causing concerns for the openness of public debates. Additionally, extreme opinions on different topics often show significant correlations. The dynamics leading to these polarized ideological opinions pose a challenge: How can such correlations emerge, without assuming them a priori in the individual preferences or in a preexisting social structure? Here we propose a simple model that reproduces ideological opinion states found in survey data, even between rather unrelated, but sufficiently controversial, topics. Inspired by skew coordinate systems recently proposed in natural language processing models, we solidify these intuitions in a formalism where opinions evolve in a multidimensional space where topics form a non-orthogonal basis. The model features a phase transition between consensus, opinion polarization, and ideological states, which we analytically characterize as a function of the controversialness and overlap of the topics. Our findings shed light upon the mechanisms driving the emergence of ideology in the formation of opinions.

DtZ has broken

dtzfail

GPT2-Agents

  • Continue working on the trajectory. I think that a plot that works entirely on distance to target can result in spirals, so there needs to be some kind of system that looks at the distance to the center line first, and if there is a fail, move the last node from the trajectory list to a dirty list. Then the search restores the cur node to the previous, and continue the search with the trajectory and dirty list nodes ignored?
  • Found an example to fix: A6 – H7
    • get_closest_node() line = [337.0, 44.0, 581.0, 499.0], cur_node = h1, node_list = [‘a6’, ‘b6’, ‘c7’, ‘d7’, ‘e6’, ‘c5’, ‘b7’, ‘g7’, ‘h6’, ‘g6’, ‘c6’, ‘e7’, ‘f7’, ‘g8’, ‘f6’, ‘d8’, ‘a8’, ‘e8’, ‘d6’, ‘b4’, ‘b8’, ‘c8’, ‘c4’, ‘e5’, ‘d5’, ‘d4’, ‘b5’, ‘c3’, ‘e4’, ‘f5’, ‘f8’, ‘f4’, ‘g5’, ‘g4’, ‘h5’, ‘h4’, ‘f3’, ‘d3’, ‘c2’, ‘e3’, ‘d2’, ‘e2’, ‘b2’, ‘b1’, ‘c1’, ‘e1’, ‘d1’, ‘a1’, ‘f1’, ‘g3’, ‘h3’, ‘g2’, ‘f2’, ‘g1’, ‘h2’, ‘h1’]
    • It does fine until it gets to E6, where it chooses c5
    • Adding a target distance-based search if the distance to line search fails seems to have fixed it:
      nlist = list(nx.all_neighbors(self.gml_model, cur_node))
      print("\tneighbors = {}".format(nlist))
      dist_dict = {}
      sx, sy = self.get_center(cur_node)
      
      for n in nlist:
          if n not in node_list:
              newx, newy = self.get_center(n)
              newa = [newx, newy]
              print("\tline dist checking {} at {}".format(n, newa))
              x, y = self.point_to_line([l[0], l[1]], [l[2], l[3]], newa)
              ca = [x, y]
              ib = self.is_between([sx, sy], [l[2], l[3]], [x, y])
              if ib:
                  # option 1: Find the closest to the line
                  dist = np.linalg.norm(np.array(newa)-np.array(ca))
                  dist_dict[n] = dist
                  print("\tis BETWEEN = {}, dist = {}".format(ib, dist))
      if len(dist_dict) == 0:
          ta = [self.get_center(self.target_node)]
          for n in nlist:
              if n not in node_list:
                  newx, newy = self.get_center(n)
                  newa = [newx, newy]
                  print("\ttarget dist checking {} at {}".format(n, newa))
                  # option 2: Find the closest to the target node
                  dist = np.linalg.norm(np.array(newa)-np.array(ta))
                  dist_dict[n] = dist
                  print("\tis CLOSEST: dist = {}".format(dist))
  • Got legal trajectories working. Below is a set of jumps that are legal (rook to c1, bishop to e3 and then h6, then rook the rest of the way) I think I want to also sort based on closest distance to the current node.

legal_moves

GOES

  • Add InfluxDB streaming to DD
  • 10:00 Sim meeting
  • 2:00 Status meeting

Phil 7.1.20

I should be riding across southern Spain right now

#COVID19

  • Huggingface has fixed the Marian Model/Toekenizer! Need to try Arabic, and if it works, translate the tweets in the db

GPT-2 Agents

  • Shimei had a good point last night that belief maps may be directed. For example, it may be easier for a person to go from smoking to harder drugs than to go back to just smoking. The path from hard drugs might involve 12-step programs, which would be unlikely to be reached from smoking. Other beliefs can swing back and forth, as we see with, for example, the desirability of deficit spending when in power.
  • Finishing up direct selection of source and target nodes as a way of procrastinating on calc_direct, which is going to be harder. I’m always nervous with recursion!
  • Added a test to see that one of the neighbors is the target node first
  • Broke out the angle calculations
  • Need to sort by distance to line and distance to target. It may be necessary to step away from the target occasionally. For now, it’s an option as I try to figure out what’s best:
    # option 1: Find the closest to the line
    # dist = np.linalg.norm(np.array(na)-np.array(ca))
    # option 2: Find the closest to the target node
    dist = np.linalg.norm(np.array(newa)-np.array(ta))
    dist_dict[n] = dist

GOES

  • Status report for June
  • 1:30 Sim progress meeting – things are working! Need to hook up the data dictionary to influx for monitoring and debugging. Add Erik to the invites
  • 2:00 Status meeting
  • I did a NASA/GSFC training module early!