Improving the classification of neuropsychiatric conditions using gene ontology terms as features.
Thomas P Quinn, Samuel C Lee, Svetha Venkatesh, Thin Nguyen
Published in: American journal of medical genetics. Part B, Neuropsychiatric genetics : the official publication of the International Society of Psychiatric Genetics (2019)
Although neuropsychiatric disorders have an established genetic background, their molecular foundations remain elusive. This has prompted many investigators to search for explanatory biomarkers that can predict clinical outcomes. One approach uses machine learning to classify patients based on blood mRNA expression. However, these endeavors typically fail to achieve the high level of performance, stability, and generalizability required for clinical translation. Moreover, these classifiers can lack interpretability because not all genes have relevance to researchers. For this study, we hypothesized that annotation-based classifiers can improve classification performance, stability, generalizability, and interpretability. To this end, we evaluated the models of four classification algorithms on six neuropsychiatric data sets using four annotation databases. Our results suggest that the Gene Ontology Biological Process database can transform gene expression into an annotation-based feature space that is accurate and stable. We also show how annotation features can improve the interpretability of classifiers: because annotations are used to assign biological importance to genes, the biological importance of an annotation-based feature is the feature itself. In evaluating the annotation features, we find that top-ranked annotations tend to contain top-ranked genes, suggesting that the most predictive annotations are a superset of the most predictive genes. Based on this, and the fact that annotations are used routinely to assign biological importance to genetic data, we recommend transforming gene-level expression into annotation-level expression prior to the classification of neuropsychiatric conditions.
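The recommended transformation from gene-level to annotation-level expression can be illustrated with a minimal sketch. The abstract does not state how member genes are combined, so averaging the expression of each annotation's genes is an assumption made here for illustration only. The function name `to_annotation_space`, the gene and annotation labels, and the toy values are all hypothetical.

```python
import numpy as np

def to_annotation_space(expr, genes, gene_sets, agg=np.mean):
    """Collapse a samples x genes matrix into a samples x annotations matrix.

    expr      : 2-D array, one row per sample, one column per gene
    genes     : gene names, in the same order as the columns of expr
    gene_sets : dict mapping an annotation ID to its member gene names
    agg       : aggregation over member genes (mean is an assumption,
                not necessarily the method used in the paper)
    """
    index = {g: i for i, g in enumerate(genes)}
    names, columns = [], []
    for name, members in gene_sets.items():
        cols = [index[g] for g in members if g in index]
        if not cols:
            continue  # skip annotations with no measured member genes
        names.append(name)
        columns.append(agg(expr[:, cols], axis=1))
    if not columns:
        return names, np.empty((expr.shape[0], 0))
    return names, np.column_stack(columns)

# Toy data: 3 samples x 4 genes (made-up values, hypothetical labels)
genes = ["G1", "G2", "G3", "G4"]
expr = np.array([[1.0, 3.0, 5.0, 7.0],
                 [2.0, 2.0, 2.0, 2.0],
                 [0.0, 4.0, 8.0, 0.0]])
gene_sets = {
    "GO:A": ["G1", "G2"],
    "GO:B": ["G2", "G3", "G4"],
    "GO:C": ["G9"],  # no measured genes, so it is dropped
}

names, features = to_annotation_space(expr, genes, gene_sets)
print(names)
print(features)
```

The resulting `features` matrix has one column per retained annotation and could be passed to any classifier in place of the gene-level matrix. Because a gene may belong to several annotations, the annotation-level columns can overlap in the genes they summarize.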
Keyphrases
- machine learning
- genome wide
- genome wide identification
- deep learning
- big data
- dna methylation
- rna seq
- copy number
- gene expression
- artificial intelligence
- genome wide analysis
- poor prognosis
- transcription factor
- end stage renal disease
- ejection fraction
- single cell
- chronic kidney disease
- emergency department
- high resolution
- binding protein
- patient reported
- data analysis