Epistatic models and Latent Generative Landscapes of sequence space as roadmaps for protein functional diversity, evolution, and design
Faruck Morcos
Characterizing the enormous space of protein sequences is an arduous task given the complex nature of the rules determining the functional effects of sequence variability. The current availability of sequencing data, high-throughput experiments, and novel algorithms that model the joint probability of sequence composition is shifting the panorama from an intractable problem to one in which we can not only characterize but also generate novel functional sequences. One example of these learning models is the Variational Autoencoder (VAE), which, when applied to sequence data, can be used to classify members of a protein family and to generate diverse members of that family while still satisfying the higher-order statistics of the training data.
In our current work, we evaluate the underlying latent manifold of VAEs in which sequence information is embedded. We use an amino-acid epistatic Potts model such as direct coupling analysis (DCA) and its sequence Hamiltonian to investigate the properties of the latent manifold. Together, the latent manifold and the Hamiltonian constitute what we call a latent generative landscape (LGL). In addition to understanding the sequence space of proteins, we can also use learned epistasis to enrich a model of sequence evolution called Sequence Evolution with Epistatic Contributions (SEEC), which faithfully unifies different models of evolution and is consistent with observed statistics of sequence data. We test SEEC in an in vivo system by evolving beta-lactamases and evaluating whether this model of neutral evolution can provide viable sequences that confer antibiotic resistance even after a large number of evolutionary steps. We propose that using epistatic parameters for both LGLs and SEEC is key to inferring fitness in extant and designed protein sequences.
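To make the role of the sequence Hamiltonian concrete, the following is a minimal sketch of the standard Potts form used in DCA-style models, H(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), together with one way an epistatic evolutionary step could be written: resampling a single site from its conditional distribution given the rest of the sequence. The function names, the dictionary layout for the couplings, and the toy parameters are all hypothetical, and the single-site step is an illustrative assumption rather than a statement of how SEEC is implemented.

```python
import math
import random

def potts_energy(seq, h, J):
    """Potts Hamiltonian: H(s) = -sum_i h[i][s_i] - sum_{i<j} J[(i, j)][s_i][s_j]."""
    energy = 0.0
    for i in range(len(seq)):
        energy -= h[i][seq[i]]
        for j in range(i + 1, len(seq)):
            energy -= J[(i, j)][seq[i]][seq[j]]
    return energy

def site_conditional(seq, i, h, J, q):
    """P(s_i = a | all other sites), proportional to
    exp(h_i(a) + sum_{j != i} J_ij(a, s_j))."""
    logits = []
    for a in range(q):
        x = h[i][a]
        for j in range(len(seq)):
            if j < i:
                x += J[(j, i)][seq[j]][a]
            elif j > i:
                x += J[(i, j)][a][seq[j]]
        logits.append(x)
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def evolve_step(seq, h, J, q, rng):
    """Assumed epistatic step: pick a random site and resample it
    from its context-dependent conditional distribution."""
    i = rng.randrange(len(seq))
    probs = site_conditional(seq, i, h, J, q)
    new_seq = list(seq)
    new_seq[i] = rng.choices(range(q), weights=probs)[0]
    return new_seq

# Toy parameters (hypothetical): 3 sites, 2 states per site.
h_toy = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.0]]
J_toy = {
    (0, 1): [[1.0, 0.0], [0.0, 1.0]],
    (0, 2): [[0.0, 0.0], [0.0, 0.0]],
    (1, 2): [[0.0, 0.5], [0.5, 0.0]],
}

rng = random.Random(0)
seq = [1, 1, 0]
for _ in range(5):
    seq = evolve_step(seq, h_toy, J_toy, q=2, rng=rng)
    print(seq, potts_energy(seq, h_toy, J_toy))
```

In this sketch, lower energy corresponds to higher model probability, so the printed energies give a rough sense of how favorable each evolved sequence is under the toy parameters. A real application would use fields and couplings inferred from a multiple sequence alignment, with 21 states per site for amino acids plus a gap.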