TopoFormer: Multiscale Topology-enabled Structure-to-Sequence Transformer for Protein-Ligand Interaction Predictions.
Guo-Wei WeiDong ChenJian LiuPublished in: Research square (2024)
Pre-trained deep Transformers have had tremendous success in a wide variety of disciplines. However, in computational biology, essentially all Transformers are built upon the biological sequences, which ignores vital stereochemical information and may result in crucial errors in downstream predictions. On the other hand, three-dimensional (3D) molecular structures are incompatible with the sequential architecture of Transformer and natural language processing (NLP) models in general. This work addresses this foundational challenge by a topological Transformer (TopoFormer). TopoFormer is built by integrating NLP and a multiscale topology techniques, the persistent topological hyperdigraph Laplacian (PTHL), which systematically converts intricate 3D protein-ligand complexes at various spatial scales into a NLP-admissible sequence of topological invariants and homotopic shapes. Element-specific PTHLs are further developed to embed crucial physical, chemical, and biological interactions into topological sequences. TopoFormer surges ahead of conventional algorithms and recent deep learning variants and gives rise to exemplary scoring accuracy and superior performance in ranking, docking, and screening tasks in a number of benchmark datasets. The proposed topological sequences can be extracted from all kinds of structural data in data science to facilitate various NLP models, heralding a new era in AI-driven discovery.
Keyphrases
- deep learning
- protein protein
- amino acid
- artificial intelligence
- machine learning
- big data
- electronic health record
- small molecule
- public health
- mental health
- physical activity
- high throughput
- molecular dynamics
- autism spectrum disorder
- working memory
- patient safety
- social media
- data analysis
- dna methylation
- body composition
- health information
- mass spectrometry