Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia.
Georgia Doing, Alexandra J Lee, Samuel L Neff, Taylor Reiter, Jacob D Holt, Bruce A Stanton, Casey S Greene, Deborah Ann Hogan
Published in: mSystems (2022)
Thousands of Pseudomonas aeruginosa RNA sequencing (RNA-seq) gene expression profiles are publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). In this work, the transcriptional profiles from hundreds of studies performed by over 75 research groups were reanalyzed in aggregate to create a powerful tool for hypothesis generation and testing. Raw sequence data were uniformly processed using the Salmon pseudoaligner, and this read mapping method was validated by comparison to a direct alignment method. We developed filtering criteria to exclude samples with aberrant levels of housekeeping gene expression or an unexpected number of genes with no reported values and normalized the filtered compendia using the ratio-of-medians method. The filtering and normalization steps greatly improved gene expression correlations for genes within the same operon or regulon across the 2,333 samples. Since the RNA-seq data were generated using diverse strains, we report the effects of mapping samples to noncognate reference genomes by separately analyzing all samples mapped to cDNA reference genomes for strains PAO1 and PA14, two divergent strains that were used to generate most of the samples. Finally, we developed an algorithm to incorporate new data as they are deposited into the SRA. Our processing and quality control methods provide a scalable framework for taking advantage of the troves of biological information hibernating in the depths of microbial gene expression data and yield useful tools for P. aeruginosa RNA-seq data to be leveraged for diverse research goals.
IMPORTANCE Pseudomonas aeruginosa is a causative agent of a wide range of infections, including chronic infections associated with cystic fibrosis. These P. aeruginosa infections are difficult to treat and often have negative outcomes. To aid in the study of this problematic pathogen, we mapped, filtered for quality, and normalized thousands of P. aeruginosa RNA-seq gene expression profiles that were publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The resulting compendia facilitate analyses across experiments, strains, and conditions. Ultimately, the workflow that we present could be applied to analyses of other microbial species.
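The filtering and normalization steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' code: the sample and gene names, the thresholds `min_hk` and `max_zero_genes`, and the helper functions are all hypothetical. The normalization shown is a simplified median-based scaling that may differ in detail from the ratio-of-medians method used in the paper.

```python
from statistics import median

def count_zero_genes(profile):
    """Number of genes with no reported expression in one sample."""
    return sum(1 for value in profile.values() if value == 0)

def passes_filters(profile, housekeeping, min_hk, max_zero_genes):
    """Keep a sample only if its housekeeping genes are expressed at a
    plausible level and it does not have too many zero-valued genes.
    Both thresholds are hypothetical and would need to be chosen from
    the distribution of the real data."""
    hk_values = [profile.get(gene, 0.0) for gene in housekeeping]
    return (median(hk_values) >= min_hk
            and count_zero_genes(profile) <= max_zero_genes)

def normalize_by_median_ratio(profiles):
    """Scale each sample so that the median of its nonzero values matches
    a common reference (the median of all per-sample medians).
    Assumes every sample has at least one nonzero value."""
    sample_medians = {
        sample: median([v for v in profile.values() if v > 0])
        for sample, profile in profiles.items()
    }
    reference = median(sample_medians.values())
    return {
        sample: {gene: v * reference / sample_medians[sample]
                 for gene, v in profile.items()}
        for sample, profile in profiles.items()
    }

# Toy data: s2 is s1 sequenced twice as deeply; s3 is a low-quality sample.
profiles = {
    "s1": {"hk1": 10.0, "geneB": 20.0, "geneC": 30.0},
    "s2": {"hk1": 20.0, "geneB": 40.0, "geneC": 60.0},
    "s3": {"hk1": 0.0, "geneB": 0.0, "geneC": 5.0},
}

kept = {
    sample: profile
    for sample, profile in profiles.items()
    if passes_filters(profile, housekeeping=["hk1"], min_hk=1.0, max_zero_genes=1)
}
normalized = normalize_by_median_ratio(kept)
```

In this toy case, `s3` is dropped by the filter, and `s1` and `s2` end up with identical profiles after scaling, which is the behavior one would want when two samples differ only in sequencing depth.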
Keyphrases
- rna seq
- single cell
- gene expression
- pseudomonas aeruginosa
- electronic health record
- dna methylation
- escherichia coli
- genome wide
- cystic fibrosis
- big data
- quality control
- quality improvement
- biofilm formation
- machine learning
- microbial community
- acinetobacter baumannii
- health information
- genome wide identification
- single molecule
- high resolution
- type diabetes
- healthcare
- magnetic resonance
- staphylococcus aureus
- public health
- drug resistant
- multidrug resistant
- magnetic resonance imaging
- metabolic syndrome
- mass spectrometry
- bioinformatics analysis