Predicting genetically regulated gene expression on the X chromosome.
Xueyi Zhang, Piper Below, Adam C Naj, Brian Kunkle, Eden Martin, William S Bush
Published in: bioRxiv: the preprint server for biology (2023)
Despite the potential importance of genetic variation on the X chromosome, it is often omitted from disease association studies. This exclusion has propagated into the post-GWAS era: transcriptome-wide association studies (TWAS) also ignore the X chromosome because adequate models of X chromosome gene expression are lacking. In this work, we trained elastic net penalized models in brain cortex and whole blood using whole genome sequencing (WGS) and RNA-seq data. To make generalizable recommendations, we evaluated multiple modeling strategies in a homogeneous study population of 175 whole blood samples for 600 genes and 126 brain cortex samples for 766 genes. SNPs (MAF > 0.05) within each gene's two-megabase flanking window were used to train the tissue-specific model of that gene. We tuned the shrinkage parameter and evaluated model performance with nested cross-validation. Across different mixing parameters, sample sex, and tissue types, we trained 511 significant gene models in total, predicting the expression of 229 genes (98 genes in whole blood and 144 genes in brain cortex). The average model coefficient of determination (R²) was 0.11 (range 0.03 to 0.34). We tested a range of mixing parameters (0.05, 0.25, 0.5, 0.75, 0.95) for the elastic net regularization and compared sex-stratified and sex-combined modeling on the X chromosome. We further investigated genes escaping X chromosome inactivation to determine whether their genetic regulation patterns were distinct. Based on our findings, sex-stratified elastic net models with a balanced penalty (50% LASSO and 50% ridge) are the optimal approach for predicting the expression levels of X chromosome genes, regardless of X chromosome inactivation status. The predictive capacity of the optimal models in whole blood and brain cortex was confirmed through validation using DGN and MayoRNAseq temporal cortex cohort data. The R² of the tissue-specific prediction models ranges from 9.94 × 10⁻⁵ to 0.091. These models can be used in TWAS to identify putative causal X chromosome genes by integrating genotype, imputed gene expression, and phenotype information.
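The modeling strategy described in the abstract (one elastic net per gene, with the shrinkage parameter tuned inside nested cross-validation) can be illustrated with a small sketch. This is not the authors' pipeline: the data are simulated, and every name here (`fit_elastic_net`, `nested_cv_r2`, `simulate`, the lambda grid, the fold counts) is hypothetical. The penalty is assumed to follow the common glmnet-style parameterization, lambda * (alpha * L1 + (1 - alpha) / 2 * L2), so that `alpha=0.5` corresponds to the balanced penalty.

```python
import random


def soft_threshold(z, gamma):
    """Soft-thresholding operator arising from the L1 (LASSO) part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_elastic_net(X, y, lam, alpha=0.5, n_iter=100, tol=1e-6):
    """Coordinate descent for (1/2n)*RSS + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centered dosages
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        max_step = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:  # SNP is monomorphic in this training set
                continue
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            step = new - beta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= step * Xc[i][j]
                beta[j] = new
                max_step = max(max_step, abs(step))
        if max_step < tol:  # converged
            break
    intercept = y_mean - sum(b * m for b, m in zip(beta, means))
    return intercept, beta


def predict(model, X):
    intercept, beta = model
    return [intercept + sum(b * x for b, x in zip(beta, row)) for row in X]


def r_squared(y, y_hat):
    m = sum(y) / len(y)
    ss_res = sum((v - h) ** 2 for v, h in zip(y, y_hat))
    return 1.0 - ss_res / sum((v - m) ** 2 for v in y)


def k_folds(indices, k):
    return [indices[i::k] for i in range(k)]


def inner_cv_mse(X, y, lam, alpha, k):
    """Mean squared error of held-out predictions for one candidate lambda."""
    n, sse = len(y), 0.0
    for held in k_folds(list(range(n)), k):
        held_set = set(held)
        train = [i for i in range(n) if i not in held_set]
        model = fit_elastic_net([X[i] for i in train], [y[i] for i in train], lam, alpha)
        preds = predict(model, [X[i] for i in held])
        sse += sum((y[i] - q) ** 2 for i, q in zip(held, preds))
    return sse / n


def nested_cv_r2(X, y, lambdas, alpha=0.5, outer_k=4, inner_k=3, seed=0):
    """Inner loop picks lambda; outer loop scores the tuned model on unseen samples."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    out_of_fold, chosen = [0.0] * len(y), []
    for held in k_folds(order, outer_k):
        held_set = set(held)
        train = [i for i in order if i not in held_set]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        best = min(lambdas, key=lambda l: inner_cv_mse(X_tr, y_tr, l, alpha, inner_k))
        model = fit_elastic_net(X_tr, y_tr, best, alpha)
        for i, q in zip(held, predict(model, [X[i] for i in held])):
            out_of_fold[i] = q
        chosen.append(best)
    return r_squared(y, out_of_fold), chosen


def simulate(n=100, p=10, maf=0.3, seed=1):
    """Toy data (an assumption): 0/1/2 dosages, two causal SNPs plus Gaussian noise."""
    rng = random.Random(seed)
    X = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(p)] for _ in range(n)]
    y = [0.8 * row[0] - 0.6 * row[3] + rng.gauss(0.0, 0.7) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    r2, chosen = nested_cv_r2(X, y, lambdas=[0.01, 0.1, 0.5])
    print(f"nested-CV R^2 = {r2:.3f}; lambda chosen per outer fold = {chosen}")
```

Under these assumptions, a sex-stratified analysis would amount to calling `nested_cv_r2` separately on the male and female samples, and comparing mixing parameters would amount to repeating the call for each `alpha` value. Because the lambda choice never sees the outer held-out fold, the reported R² is not inflated by the tuning step.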
Keyphrases
- genome wide
- copy number
- gene expression
- dna methylation
- genome wide identification
- functional connectivity
- rna seq
- resting state
- bioinformatics analysis
- genome wide analysis
- single cell
- transcription factor
- white matter
- magnetic resonance
- risk assessment
- body composition
- artificial intelligence
- binding protein
- case control