Login / Signup

Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES.

Anshu GuptaSiavash MirarabYatish Turakhia
Published in: bioRxiv : the preprint server for biology (2024)
Inference of species trees plays a crucial role in advancing our understanding of evolutionary relationships and has immense significance for diverse biological and medical applications. Extensive genome sequencing efforts are currently in progress across a broad spectrum of life forms, holding the potential to unravel the intricate branching patterns within the tree of life. However, estimating species trees starting from raw genome sequences is quite challenging, and the current cutting-edge methodologies require a series of error-prone steps that are neither entirely automated nor standardized. In this paper, we present ROADIES, a novel pipeline for species tree inference from raw genome assemblies that is fully automated, easy to use, scalable, free from reference bias, and provides flexibility to adjust the tradeoff between accuracy and runtime. The ROADIES pipeline eliminates the need to align whole genomes, choose a single reference species, or pre-select loci such as functional genes found using cumbersome annotation steps. Moreover, it leverages recent advances in phylogenetic inference to allow multi-copy genes, eliminating the need to detect orthology. Using the genomic datasets released from large-scale sequencing consortia across three diverse life forms (placental mammals, pomace flies, and birds), we show that ROADIES infers species trees that are comparable in quality with the state-of-the-art approaches but in a fraction of the time. By incorporating optimal approaches and automating all steps from assembled genomes to species and gene trees, ROADIES is poised to improve the accuracy, scalability, and reproducibility of phylogenomic analyses.
Keyphrases
  • genome wide
  • single cell
  • machine learning
  • genetic diversity
  • deep learning
  • copy number
  • dna methylation
  • rna seq
  • gene expression
  • risk assessment
  • transcription factor
  • quality improvement
  • data analysis