PhyloAln: A Convenient Reference-Based Tool to Align Sequences and High-Throughput Reads for Phylogeny and Evolution in the Omic Era.

Yu-Hao Huang, Yi-Fei Sun, Hao Li, Hao-Sen Li, Hong Pang
Published in: Molecular Biology and Evolution (2024)
Phylogenetic and evolutionary analyses now rely predominantly on omic data. However, before the core analyses can begin, traditional methods typically involve intricate and time-consuming procedures, including assembly from high-throughput reads, decontamination, gene prediction, homology search, orthology assignment, multiple sequence alignment, and matrix trimming. These steps significantly impede research efficiency when data sets are large. In this study, we develop PhyloAln, a convenient reference-based tool that directly aligns high-throughput reads or complete sequences to existing alignments used as a reference for phylogenetic and evolutionary analyses. In tests on simulated data sets of species spanning the tree of life, PhyloAln performs consistently and robustly compared with other reference-based tools across different data types, sequencing technologies, coverages, and species, with percent completeness and identity at least 50 percentage points higher in the alignments. Additionally, we show that PhyloAln removes at least 90% of foreign contamination and 70% of cross-contamination, both of which are prevalent in sequencing data but often overlooked by other tools. Moreover, we showcase the broad applicability of PhyloAln by generating alignments (completeness mostly above 80%, identity above 90%) and reconstructing robust phylogenies from real data sets of transcriptomes of ladybird beetles, plastid genes of peppers, and ultraconserved elements of turtles. With these advantages, PhyloAln is expected to facilitate phylogenetic and evolutionary analyses in the omic era. The tool is accessible at https://github.com/huangyh45/PhyloAln.
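The abstract reports percent completeness and identity of the resulting alignments but does not define them. As a rough illustration only, the sketch below computes both for one aligned sequence against a trusted reference row of the same alignment. The definitions are assumptions, not taken from the paper or from PhyloAln's code: completeness is taken as the share of non-gap reference sites that the candidate also covers, and identity as the share of those covered sites where the characters match. The function name `completeness_and_identity` and the gap symbol `-` are likewise hypothetical choices for this example.

```python
GAP = "-"  # assumed gap symbol; real alignments may also use "?" or "N"


def completeness_and_identity(reference: str, candidate: str) -> tuple[float, float]:
    """Return (percent completeness, percent identity) for two aligned rows.

    Assumed definitions (not from the paper):
      completeness = covered reference sites / non-gap reference sites
      identity     = matching sites / covered reference sites
    """
    if len(reference) != len(candidate):
        raise ValueError("aligned sequences must have the same length")

    # Columns where the reference has a residue (not a gap).
    ref_sites = [i for i, ch in enumerate(reference) if ch != GAP]
    # Of those, columns where the candidate also has a residue.
    covered = [i for i in ref_sites if candidate[i] != GAP]
    # Case-insensitive character matches among covered columns.
    matches = sum(reference[i].upper() == candidate[i].upper() for i in covered)

    completeness = 100.0 * len(covered) / len(ref_sites) if ref_sites else 0.0
    identity = 100.0 * matches / len(covered) if covered else 0.0
    return completeness, identity


if __name__ == "__main__":
    comp, ident = completeness_and_identity("ATGCATGC", "ATGA--GC")
    print(f"completeness={comp:.1f}% identity={ident:.1f}%")
```

Under these assumed definitions, a candidate that covers 6 of 8 reference sites with 5 matching characters scores 75% completeness and about 83% identity. A comparison like the one in the abstract would average such values over many genes and species.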
Keyphrases
  • high throughput
  • electronic health record
  • big data
  • single cell
  • genome wide
  • risk assessment
  • gene expression
  • transcription factor
  • copy number
  • data analysis
  • artificial intelligence
  • genome wide identification