Machine learning models identify gene predictors of waggle dance behaviour in honeybees.
Marcell Veiner, Juliano Morimoto, Ellouise Leadbeater, Fabio Manfredini. Published in: Molecular Ecology Resources (2022)
The molecular characterization of complex behaviours is challenging because many different factors often contribute to the observed phenotype. An established approach, known as 'neurogenomics', is to examine the overall expression levels of brain genes and select the candidates that best associate with patterns of interest. However, traditional neurogenomic analyses have well-known limitations: above all, the number of biological replicates is usually small compared to the number of genes tested, a problem known as the "curse of dimensionality". In this study we implemented a machine learning (ML) approach that can complement more established methods of transcriptomic analysis. We tested three supervised learning algorithms (Random Forests, Lasso and Elastic-Net Regularized Generalized Linear Models, and Support Vector Machines) for their performance in characterizing transcriptomic patterns and identifying genes associated with the honeybee waggle dance. We then matched the results of these analyses with the traditional outputs of differential gene expression analyses and identified two promising candidates for the neural regulation of the waggle dance: boss and hnRNP A1. Overall, our study demonstrates the application of ML to analyse transcriptomic data and identify candidate genes underlying social behaviour. This approach has great potential for a wide range of scenarios in evolutionary ecology that investigate the genomic basis of complex phenotypic traits, and it can offer clear advantages over the established tools of gene expression analysis, making it a valuable complement for future studies.
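The "curse of dimensionality" mentioned in the abstract can be illustrated with a small simulation. The sketch below is not taken from the study: the function name `count_spurious_separators`, the group size of three replicates, and the gene counts are all hypothetical choices made for illustration. It draws pure-noise expression values for two groups and counts how many genes separate the groups perfectly by chance alone.

```python
import random


def count_spurious_separators(n_genes, n_per_group=3, seed=0):
    """Count pure-noise genes whose values perfectly separate two groups.

    Hypothetical illustration only: group sizes and gene counts are
    assumptions, not the design of the study described above.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n_genes):
        # Expression values are random noise: no gene truly differs between groups.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Perfect separation: every value in one group exceeds every value in the other.
        if min(group_a) > max(group_b) or min(group_b) > max(group_a):
            count += 1
    return count


if __name__ == "__main__":
    # More genes tested with the same few replicates means more chance "hits".
    for n_genes in (10, 1000, 10000):
        print(n_genes, count_spurious_separators(n_genes))
```

With three replicates per group, 2 of the 20 possible rank arrangements separate the groups completely, so roughly one noise gene in ten looks like a perfect marker. This is why candidate lists drawn from many genes and few replicates benefit from cross-checking across independent methods.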
Keyphrases
- machine learning
- genome wide
- genome wide identification
- gene expression
- dna methylation
- single cell
- big data
- deep learning
- artificial intelligence
- copy number
- climate change
- poor prognosis
- bioinformatics analysis
- rna seq
- genome wide analysis
- brain injury
- long non coding rna
- binding protein
- electronic health record
- subarachnoid hemorrhage
- cerebral ischemia