(Re)Discovering Protein Structure and Function Through Language Modeling (ArXiv) (Code)

- In our study, we show how a language model, trained simply to predict a masked (hidden) amino acid in a protein sequence, recovers high-level structural and functional properties of proteins. In particular, we show how the **Transformer** language model uses **attention** to (1) capture the **folding structure** of proteins, connecting regions that are apart in the underlying sequence but spatially close in the protein structure, and (2) target **binding sites**, a key functional component of proteins. We also introduce a three-dimensional **visualization** of the interaction between attention and protein structure. Our findings align with biological processes and provide a tool to aid scientific discovery. The code for the visualization tool and experiments is available at https://github.com/salesforce/provis.
- **TL;DR:** Trained solely on language modeling, the Transformer’s attention mechanism recovers high-level structural and functional properties of proteins.
- *We explored the degree to which attention captures these contact relationships by analyzing the attention patterns of 5,000 protein sequences and comparing them to ground-truth contact maps. Our analysis revealed that one particular head, the **12th layer’s 4th head**, denoted as head 12-4, aligned remarkably well with the contact map. For “high confidence” attention (> 0.9), 76% of this head’s total attention connected amino acids that were in contact. In contrast, the background frequency of contacts among all amino acid pairs in the dataset is just 1.3%.*
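To make the "76% of high-confidence attention lands on contacts" number concrete, here is a minimal sketch of one way such a fraction could be computed for a single head. This is my own illustration, not the paper's code: the function name `contact_attention_fraction`, the weighting by attention value, and the toy 3x3 matrices are all assumptions, and the exact metric in the paper may differ.

```python
import numpy as np


def contact_attention_fraction(attn, contacts, threshold=0.9):
    """Share of high-confidence attention weight that falls on contacting pairs.

    attn:     (L, L) attention weights for one head (hypothetical input)
    contacts: (L, L) boolean contact map (hypothetical input)
    """
    attn = np.asarray(attn, dtype=float)
    contacts = np.asarray(contacts, dtype=bool)
    high = attn > threshold              # keep only "high confidence" attention
    total = attn[high].sum()
    if total == 0.0:
        return float("nan")              # no attention above the threshold
    return float(attn[high & contacts].sum() / total)


# Toy data, made up for illustration only
attn = np.array([
    [0.02, 0.95, 0.03],
    [0.05, 0.03, 0.92],
    [0.94, 0.03, 0.03],
])
contacts = np.array([
    [False, True, False],
    [True, False, True],
    [False, True, False],
])

frac = contact_attention_fraction(attn, contacts)

# Background contact frequency over all off-diagonal pairs, for comparison
off_diag = ~np.eye(len(contacts), dtype=bool)
background = float(contacts[off_diag].mean())

print("attention on contacts = {:.3f}, background = {:.3f}".format(frac, background))
```

The comparison against `background` mirrors the 76% vs. 1.3% contrast in the quote: a head is only interesting if its high-confidence attention hits contacts far more often than chance.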

GPT-2 Agents

- Add a menu that writes node spatial information to the DB
- Add a “Graph from DB” menu that assembles the edge information from the move table and the node information from the new table, above.
- Continue on path finding
- Distance between a point and a line using numpy (stackoverflow). Not exactly what I need, which is the point of intersection *and* the distance. There is a stackoverflow post that is close, but here’s a version that tests the results and plots it:

      import math

      import matplotlib.pyplot as plt
      import numpy as np

      p1 = np.array([1.0, 1.0])  # the point
      l1 = np.array([0.0, 1.0])  # start of the line
      l2 = np.array([1.0, 0.0])  # end of the line

      # unit vector along the line
      lvec = l2 - l1
      lvec /= np.linalg.norm(lvec, 2)

      # project p1 onto the line to get the point of intersection
      p2 = l1 + lvec * np.dot(p1 - l1, lvec)
      print("intersection = {}".format(p2))  # [0.5 0.5]

      # distance from p1 to the line, and the signed angle between lvec and pvec
      pvec = p2 - p1
      dist = np.linalg.norm(pvec, 2)
      pvec /= dist
      det = np.linalg.det([lvec, pvec])
      dot = np.dot(lvec, pvec)
      rads = math.atan2(det, dot)
      print("distance = {}, angle = {}".format(dist, math.degrees(rads)))

      # plot the line and the perpendicular from p1
      plt.plot([l1[0], l2[0]], [l1[1], l2[1]])
      plt.plot([p1[0], p2[0]], [p1[1], p2[1]])
      plt.show()

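- Here’s the test for seeing if a point is on a line segment (between the two endpoints, within a tolerance). Again, loosely based on a stackoverflow post:

      def is_between(self, l1: [int, int], l2: [int, int], p1: [int, int], epsilon: float = .1) -> bool:
          # np.float was removed from recent NumPy releases; the builtin float does the same job
          p1 = np.array(p1).astype(float)
          l1 = np.array(l1).astype(float)
          l2 = np.array(l2).astype(float)
          s1 = np.linalg.norm(l1 - p1)  # distance from p1 to one end
          s2 = np.linalg.norm(l2 - p1)  # distance from p1 to the other end
          d = np.linalg.norm(l2 - l1)   # length of the segment
          # print("d = {}, s1 + s2 = {}".format(d, s1+s2))
          # p1 is on the segment when the two partial distances add up to the full length
          return bool(abs(d - (s1 + s2)) < epsilon)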

- Got graphical node selection working. Need to tie that back into the menus for start and stop
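The two snippets above (point of intersection plus distance, and the on-the-line test) could also be folded into one helper. The sketch below is an alternative formulation, not the code I'm actually running: `project_onto_segment` and `is_on_segment` are hypothetical names, and here `epsilon` is a perpendicular distance to the segment rather than a tolerance on the sum of the two endpoint distances, so the two tests won't agree exactly near the threshold.

```python
import numpy as np


def project_onto_segment(p, a, b):
    """Return (closest_point, distance, t) for point p and segment a-b.

    t is the unclamped position of the projection along the segment
    (0 at a, 1 at b); values outside [0, 1] mean the projection falls
    beyond an endpoint, in which case closest_point is that endpoint.
    """
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    ab = b - a
    denom = float(np.dot(ab, ab))
    if denom == 0.0:
        # degenerate segment: both endpoints are the same point
        return a, float(np.linalg.norm(p - a)), 0.0
    t = float(np.dot(p - a, ab) / denom)
    t_clamped = min(1.0, max(0.0, t))
    closest = a + t_clamped * ab
    return closest, float(np.linalg.norm(p - closest)), t


def is_on_segment(p, a, b, epsilon=0.1):
    """True when p lies within epsilon of the segment a-b."""
    _, dist, _ = project_onto_segment(p, a, b)
    return dist < epsilon


# Same point and line as the plotted example
closest, dist, t = project_onto_segment([1.0, 1.0], [0.0, 1.0], [1.0, 0.0])
print("closest = {}, distance = {}, t = {}".format(closest, dist, t))
```

Returning `t` alongside the point is handy for path finding, since it says how far along an edge the nearest position is without a second calculation.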

Proposal

- Looks like no writing today. Done, maybe?

GOES

- 10:00 CASSIE demo – really good
- 12:00 All Hands – need to catch up on my training. Something for the afternoons?

ML Seminar

- Status report
- Participated in some triage on Arpita’s and Fatima’s paper