Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline.

Joshua M Mitchell, Yuanye Chi, Maheshwor Thapa, Zhiqiang Pang, Jianguo Xia, Shuzhao Li

Published in: bioRxiv : the preprint server for biology (2024)

All life processes involve the consumption, creation, and interconversion of metabolites. Metabolomics is the comprehensive study of these small molecules, often using mass spectrometry, to provide critical information on health and disease. Automated processing of such metabolomics data is desirable, especially with tools and infrastructure familiar to the bioinformatics community. Despite Python's popularity in bioinformatics and machine learning, the Python ecosystem in computational metabolomics still lacks a complete data pipeline. We have developed an end-to-end computational metabolomics data processing pipeline, based on the raw data preprocessor Asari [1]. Our pipeline takes experimental data in .mzML or .raw format and outputs annotated feature tables for subsequent biological interpretation. We demonstrate the application of this pipeline to multiple metabolomics and lipidomics datasets. Accompanying the pipeline, we have designed a set of reusable data structures, released as the MetDataModel package, which should promote more consistent terminology and software interoperability in this area.
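To make the idea of shared data structures concrete, the following is a minimal sketch of what a common model for an annotated feature table might look like. It is an illustration only and does not reproduce the MetDataModel API: the class names `Feature` and `FeatureTable`, their fields, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical LC-MS feature: an m/z and retention-time pair
    with one intensity value per sample."""
    feature_id: str
    mz: float
    rtime: float  # retention time, assumed here to be in seconds
    intensities: List[float] = field(default_factory=list)
    annotation: Optional[str] = None  # None until an annotation step fills it


@dataclass
class FeatureTable:
    """Hypothetical container tying features to a fixed list of samples."""
    sample_names: List[str]
    features: List[Feature] = field(default_factory=list)

    def add(self, feature: Feature) -> None:
        # Enforce one intensity per sample so downstream tools can
        # rely on a rectangular table.
        if len(feature.intensities) != len(self.sample_names):
            raise ValueError(
                f"{feature.feature_id}: expected "
                f"{len(self.sample_names)} intensities, "
                f"got {len(feature.intensities)}"
            )
        self.features.append(feature)

    def annotated(self) -> List[Feature]:
        """Return only the features that carry an annotation."""
        return [f for f in self.features if f.annotation is not None]


# Usage with made-up placeholder values (not real measurements).
table = FeatureTable(sample_names=["sample_1", "sample_2"])
table.add(Feature("F1", 150.0, 30.0, [1000.0, 1200.0], "example_compound"))
table.add(Feature("F2", 200.0, 60.0, [500.0, 450.0]))
print([f.feature_id for f in table.annotated()])
```

Agreeing on a small set of typed containers like these is one way that separate tools could exchange features and annotations without each defining its own terminology.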

Keyphrases