The last several years have seen a huge expansion in our capabilities both to collect large amounts of data and to analyze them in meaningful ways. One of the exciting results of this shift in the natural language processing realm is that large datasets of human language (i.e., 'corpora') can be used to develop statistical models of language acquisition and construction.
Some of the most obvious things to assess are measures like the information content of language. A simple example: if I am describing my work in biology to a layperson and I am about to use a bit of jargon (or even an unusual word), my speech typically slows just prior to delivery of that word. This accommodates the listener's processing of a low-probability, high-information piece of content.

Interestingly, there are easy analogies to draw with biology, specifically with regard to how biological data, in the form of nucleic acid or protein, is stored and transferred. One example I have been curious about for ages is the use of rare, synonymous codons. Many amino acids, the monomeric units that are linked together to form proteins, can be encoded by more than one codon. Each amino acid is delivered as part of a transfer RNA (tRNA), which carries both the amino acid and an anticodon complementary to the codon being read from the mRNA. The abundance of each of these 'synonymous' tRNAs is not necessarily equal. What this means is that if some synonymous tRNAs are very abundant and others are not, the choice of one codon over another can actually slow the rate of mRNA translation. A gene could exploit this strategically to slow the translation and folding of nascent polypeptides: for example, if a protein has a region downstream of some amino acid that is very difficult to fold, the gene could slow translation at that point not by changing the amino acid sequence but by incorporating 'rare' codons. In the hopes of exploring this type of analysis (drawing analogies from the natural language processing world as a means of discovering new features of biological encoding), I have paired up with Dr. Melody Dye at the Berkeley Institute for Data Science.
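To make the analogy concrete, here is a toy sketch of my own (not one of the statistical models mentioned above): treating codons like words in a corpus, we can count their frequencies in a coding sequence and assign each one an information content of -log2(frequency), so that rare codons carry more bits, just as rare words do. The sequence below is a made-up example, not a real gene.

```python
from collections import Counter
from math import log2

def codon_counts(seq):
    """Count codons in an in-frame coding sequence (length must be a multiple of 3)."""
    seq = seq.upper()
    assert len(seq) % 3 == 0, "sequence must be in-frame"
    return Counter(seq[i:i + 3] for i in range(0, len(seq), 3))

def codon_information(counts):
    """Per-codon information content in bits: -log2(relative frequency).
    By analogy with words, rarer codons carry more information."""
    total = sum(counts.values())
    return {codon: -log2(n / total) for codon, n in counts.items()}

# Hypothetical toy sequence: ATG (start), four synonymous alanine codons, TAA (stop)
seq = "ATGGCTGCAGCTGCGTAA"
counts = codon_counts(seq)   # GCT appears twice; every other codon once
info = codon_information(counts)
```

In a real analysis the frequencies would come from a whole genome or transcriptome rather than a single sequence, so that a codon's "rarity" reflects the organism-wide codon usage (and, by proxy, tRNA abundance) rather than one gene's composition.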
Melody is an expert in the application of these methodologies to language, and by simply updating the types of information plugged into her models as 'language', we can ask whether statistical modeling of DNA or protein sequences will reveal hidden features. Below is a preliminary example, but we hope to use this type of analysis both to look for information content as a predictor of protein folding rate across gene families such as the olfactory receptors, and to assess higher-level features of DNA, such as how its three-dimensional organization leads to information 'branching'. Bokeh visualizations of amino acid/codon frequency can be found here, here, and here.