From the Washington Post this morning

Book
- Read and annotate Michelle’s outline, and add something about attention. That’s also the core of my response to Antonio
- More cults
- 2:00 Meeting
- Thinking about how design must address American Gnosticism, the dangers and opportunities of online “research”, and how things like maps and diversity injection could have a profound impact
GOES
- Update test code to use the least-squares/quaternion technique (a rough sketch follows this list)
- Look into the celluloid package for animating pyplot (second sketch after this list)
- 3:00 Meeting
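A very rough sketch of what the least-squares/quaternion update could look like, assuming that technique means fitting a rotation quaternion to pairs of measured and reference vectors (Wahba’s problem, solved here with Davenport’s q-method). The function name, the [x, y, z, w] ordering, and the vector-pair framing are my assumptions, not the project’s settled approach:

import numpy as np

def quat_from_vector_pairs(v_body, v_ref, weights=None):
    # Fit the rotation quaternion that best maps reference vectors onto body
    # vectors in the least-squares sense (Davenport's q-method)
    v_body = np.asarray(v_body, dtype=float)
    v_ref = np.asarray(v_ref, dtype=float)
    if weights is None:
        weights = np.ones(len(v_body))
    # attitude profile matrix
    B = sum(w * np.outer(b, r) for w, b, r in zip(weights, v_body, v_ref))
    S = B + B.T
    sigma = np.trace(B)
    z = np.array([B[1, 2] - B[2, 1], B[2, 0] - B[0, 2], B[0, 1] - B[1, 0]])
    K = np.zeros((4, 4))
    K[:3, :3] = S - sigma * np.eye(3)
    K[:3, 3] = z
    K[3, :3] = z
    K[3, 3] = sigma
    # the optimal quaternion is the eigenvector with the largest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(K)
    q = eigvecs[:, np.argmax(eigvals)]  # [x, y, z, w]
    return q / np.linalg.norm(q)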
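And a quick sketch of how celluloid drives a pyplot animation (the sine sweep here is throwaway data just to exercise the API; saving to mp4 assumes ffmpeg is available):

import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera  # pip install celluloid

fig = plt.figure()
camera = Camera(fig)  # wraps the figure and records frames

x = np.linspace(0, 2 * np.pi, 100)
for phase in np.linspace(0, 2 * np.pi, 60):
    plt.plot(x, np.sin(x + phase), color='steelblue')
    camera.snap()  # capture the current artists as one frame

anim = camera.animate(interval=50)  # returns a matplotlib ArtistAnimation
anim.save('test_animation.mp4')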
GPT-2 Agents
- I’m ingesting data from phase 12!
- Generate an Arabic dataset with less meta information for the sim to run as a baseline
- Try to get embeddings from GPT? This might work:
- github.com/huggingface/transformers/issues/1458
- I think this is working?
import tensorflow as tf

# pip install git+https://github.com/huggingface/transformers.git
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# options are (from https://huggingface.co/transformers/pretrained_models.html)
# 'gpt2'        : 12-layer,  768-hidden, 12-heads,  117M parameters. OpenAI GPT-2 English model
# 'gpt2-medium' : 24-layer, 1024-hidden, 16-heads,  345M parameters. OpenAI's Medium-sized GPT-2 English model
# 'gpt2-large'  : 36-layer, 1280-hidden, 20-heads,  774M parameters. OpenAI's Large-sized GPT-2 English model
# 'gpt2-xl'     : 48-layer, 1600-hidden, 25-heads, 1558M parameters. OpenAI's XL-sized GPT-2 English model

tokenizer = GPT2Tokenizer.from_pretrained("../models/chess_model")

# add the EOS token as PAD token to avoid warnings
# model = TFGPT2LMHeadModel.from_pretrained("../models/gpt2-medium", pad_token_id=tokenizer.eos_token_id)
model = TFGPT2LMHeadModel.from_pretrained("../models/chess_model", pad_token_id=tokenizer.eos_token_id, from_pt=True)

wte = model.transformer.wte
wpe = model.transformer.wpe

word_embeddings:tf.Variable = wte.weight  # Word Token Embeddings
print("\nword_embeddings.shape = {}".format(word_embeddings.shape))

terms = ['black', 'white', 'king', 'queen', 'rook', 'bishop', 'knight', 'pawn']
for term in terms:
    text_index_list = tokenizer.encode(term)
    print("\nindex for {} = {}".format(term, text_index_list))
    for ti in text_index_list:
        vec = word_embeddings[ti]
        print("{}[{}] = {}...{}".format(term, ti, vec[:3], vec[-3:]))
- It gives the following results:
word_embeddings.shape = (50257, 768)

index for black = [13424]
black[13424] = [ 0.1466832  -0.03205131  0.13073246]…[ 0.03556942  0.2691384  -0.15679955]

index for white = [11186]
white[11186] = [ 0.01213744 -0.08717686  0.09657521]…[-0.01646501  0.05803612 -0.14158668]

index for king = [3364]
king[3364] = [ 0.07679952 -0.36437798  0.04769149]…[-0.2532825   0.11794613 -0.22853516]

index for queen = [4188, 268]
queen[4188] = [ 0.01280739 -0.12996083  0.10692213]…[0.03401601 0.01343785 0.30656403]
queen[268] = [-0.17423214 -0.14485645  0.04941033]…[-0.16350408 -0.10608979 -0.03318951]

index for rook = [305, 482]
rook[305] = [ 0.08708595 -0.13490516  0.17987011]…[-0.17060643  0.07456072  0.04632077]
rook[482] = [-0.07434712 -0.01915449  0.04398194]…[ 0.02418434 -0.06441653  0.26534158]

index for bishop = [27832]
bishop[27832] = [-0.05137009 -0.11024677  0.0080909 ]…[-0.02372078  0.00158158 -0.08555448]

index for knight = [74, 3847]
knight[74] = [ 0.10828184 -0.20851855  0.2618368 ]…[0.10234124 0.1747297  0.15052234]
knight[3847] = [-0.15940899 -0.14975397  0.13490209]…[ 0.01935775  0.056772   -0.08009521]

index for pawn = [79, 3832]
pawn[79] = [-0.02358418 -0.18336709  0.08343078]…[ 0.23536623  0.06735501 -0.13106383]
pawn[3832] = [0.12719391 0.05303555 0.12345099]…[-0.15112995 0.14558738 -0.05049708]
- These values are stable from run to run. Note that the tokenizer can split a single term into multiple BPE tokens (e.g. “queen” → [4188, 268]), so one term can map to more than one embedding vector
- On Monday, I’ll try to project these embeddings into 2D using scikit-learn’s manifold learning
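A first sketch of that projection, assuming the 768-dimensional vectors from the script above get collected into a {label: vector} dict first; the choice of t-SNE (rather than MDS or Isomap) and the tiny perplexity are placeholders until I see how it behaves on so few points:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings_2d(vector_dict: dict):
    labels = list(vector_dict.keys())
    vecs = np.array([vector_dict[label] for label in labels])
    # perplexity must be smaller than the number of samples, and we only
    # have a dozen or so chess-term vectors
    reduced = TSNE(n_components=2, perplexity=min(5, len(labels) - 1),
                   random_state=0).fit_transform(vecs)
    plt.scatter(reduced[:, 0], reduced[:, 1])
    for (x, y), label in zip(reduced, labels):
        plt.annotate(label, (x, y))
    plt.show()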