RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.
Marcell Szikszai, Marcin Magnus, Siddhant Sanghi, Sachin Kadyan, Nazim Bouatta, Elena Rivas
Published in: bioRxiv: the preprint server for biology (2024)
While there is a recent surge in applying deep learning to RNA structure prediction, domain experts have raised concerns about generalization and current trends in benchmarking. Many of the concerns relate to how novel RNA families (i.e., families unseen in the training set) are benchmarked, and whether the models are effective at handling such cases. Performance on benchmarks reflective of real-world applications, such as CASP15 and RNA-Puzzles, is poor for RNA deep learning models. We present a dataset, RNA3DB, that is designed for training and benchmarking deep learning models for RNA structure prediction. RNA3DB provides coverage of all RNA chains found in the Protein Data Bank (PDB). RNA3DB is clustered into groups that are both sequentially and structurally non-redundant, providing a robust way of creating training, validation, and testing sets for deep learning models. Along with the dataset, we also provide a transparent methodology as well as the source code, making our tool both reproducible and customizable.
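
To illustrate why cluster-level grouping gives a robust split, here is a minimal sketch of assigning whole clusters, rather than individual chains, to training, validation, and testing sets. It is not the RNA3DB implementation: the function name `split_by_cluster`, the chain-to-cluster mapping format, the example identifiers, and the 80/10/10 fractions are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_cluster(chain_to_cluster, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/valid/test so no cluster spans two splits.

    chain_to_cluster: hypothetical mapping of chain ID -> cluster ID
    (not the actual RNA3DB file format).
    """
    # Group chains by their cluster label.
    clusters = defaultdict(list)
    for chain, cluster in chain_to_cluster.items():
        clusters[cluster].append(chain)

    # Shuffle cluster IDs deterministically, then cut by cluster count.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_train = int(fractions[0] * len(cluster_ids))
    n_valid = int(fractions[1] * len(cluster_ids))

    groups = {
        "train": cluster_ids[:n_train],
        "valid": cluster_ids[n_train:n_train + n_valid],
        "test": cluster_ids[n_train + n_valid:],
    }
    # Expand each split back into its member chains.
    return {
        name: sorted(chain for cid in ids for chain in clusters[cid])
        for name, ids in groups.items()
    }


# Hypothetical example: 10 clusters with 2 chains each.
example = {f"chain{c}_{k}": f"cluster{c}" for c in range(10) for k in range(2)}
splits = split_by_cluster(example)
```

Because the cut is made over clusters, every chain in a test cluster is dissimilar (by whatever criterion defined the clusters) from every training chain, which is the property that chain-level random splits do not guarantee.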