Data-driven supervised learning of a viral protease specificity landscape from deep sequencing and molecular simulations.
Manasi A Pethe, Aliza B Rubenstein, Sagar D Khare
Published in: Proceedings of the National Academy of Sciences of the United States of America (2018)
Biophysical interactions between proteins and peptides are key determinants of molecular recognition specificity landscapes. However, an understanding of how molecular structure and residue-level energetics at protein-peptide interfaces shape these landscapes remains elusive. We combine information from yeast-based library screening, next-generation sequencing, and structure-based modeling in a supervised machine learning approach to report the comprehensive sequence-energetics-function mapping of the specificity landscape of the hepatitis C virus (HCV) NS3/4A protease, whose function (site-specific cleavages of the viral polyprotein) is a key determinant of viral fitness. We screened a library of substrates in which five residue positions were randomized and measured cleavability of ∼30,000 substrates (∼1% of the library) using yeast display and fluorescence-activated cell sorting followed by deep sequencing. Structure-based models of a subset of experimentally derived sequences were used in a supervised learning procedure to train a support vector machine to predict the cleavability of 3.2 million substrate variants by the HCV protease. The resulting landscape allows identification of previously unidentified HCV protease substrates, and graph-theoretic analyses reveal extensive clustering of cleavable and uncleavable motifs in sequence space. Specificity landscapes of known drug-resistant variants are similarly clustered. The described approach should enable the elucidation and redesign of specificity landscapes of a wide variety of proteases, including human-origin enzymes. Our results also suggest a possible role for residue-level energetics in shaping plateau-like functional landscapes predicted from viral quasispecies theory.
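The library arithmetic and the idea of clustering in sequence space can be illustrated with a minimal Python sketch. It is not the authors' analysis: the abstract does not state how the sequence graph was built, so the sketch assumes that two substrates are connected when they differ at exactly one of the five randomized positions. The function names and the toy 5-mers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues
N_POSITIONS = 5                       # randomized positions, per the abstract


def library_size(n_positions=N_POSITIONS, alphabet=AMINO_ACIDS):
    """Number of distinct substrates when each position takes any residue."""
    return len(alphabet) ** n_positions


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_single_mutation(sequences):
    """Connected components of a graph linking sequences one mutation apart.

    Assumption: an edge means Hamming distance 1. This is an illustrative
    choice, not a definition taken from the paper.
    """
    seqs = list(dict.fromkeys(sequences))  # drop duplicates, keep order
    neighbors = defaultdict(set)
    for a, b in combinations(seqs, 2):
        if hamming(a, b) == 1:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen = set()
    clusters = []
    for start in seqs:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in neighbors[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Hypothetical toy 5-mers standing in for substrates with the same label.
toy_sequences = ["DEMEE", "DEMEC", "DVMEC", "AAAAA", "AAAAG", "WWWWW"]
toy_clusters = cluster_by_single_mutation(toy_sequences)

# Fraction of the full library covered by ~30,000 measured substrates.
measured_fraction = 30_000 / library_size()
```

Five positions over 20 residues give 20^5 = 3.2 million variants, which matches the number of substrates the abstract says were scored, and 30,000 measured sequences come to just under 1% of that space. The pairwise comparison is quadratic in the number of sequences, so it suits only small sets; for a full landscape one would instead enumerate the 95 single-residue neighbors of each sequence and look them up in a set.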
Keyphrases
- hepatitis c virus
- single cell
- machine learning
- drug resistant
- sars cov
- amino acid
- rna seq
- structural basis
- human immunodeficiency virus
- copy number
- multidrug resistant
- deep learning
- artificial intelligence
- endothelial cells
- body composition
- acinetobacter baumannii
- randomized controlled trial
- dna methylation
- minimally invasive
- genome wide
- high resolution
- zika virus
- mesenchymal stem cells
- protein protein
- physical activity
- small molecule
- cell wall
- cystic fibrosis
- circulating tumor
- phase iii
- antiretroviral therapy
- phase ii
- bone marrow