Reconstructing phylogeny from reduced-representation genome sequencing data without assembly or alignment.
Huan FanAnthony R IvesYann Surget-GrobaPublished in: Molecular ecology resources (2018)
Reduced-representation genome sequencing such as RADseq aids the analysis of genomes by reducing the quantity of data, thereby lowering both sequencing costs and computational burdens. RADseq was initially designed for studying genetic variation across genomes at the population level, but has also proved to be suitable for interspecific phylogeny reconstruction. RADseq data pose challenges for standard phylogenomic methods, however, due to incomplete coverage of the genome and large amounts of missing data. Alignment-free methods are both efficient and accurate for phylogenetic reconstructions with whole genomes and are especially practical for nonmodel organisms; nonetheless, alignment-free methods have not been applied with reduced genome sequencing data. Here, we test a full-genome assembly- and alignment-free method, AAF, in application to RADseq data and propose two procedures for reads selection to remove reads from restriction sites that were not found in taxa being compared. We validate these methods using both simulations and real data sets. Reads selection improved the accuracy of phylogenetic construction in every simulated scenario and the two real data sets, making AAF as good or better than a comparable alignment-based method, even though AAF had much lower computational burdens. We also investigated the sources of missing data in RADseq and their effects on phylogeny reconstruction using AAF. The AAF pipeline modified for RADseq or other reduced-representation sequencing data, phyloRAD, is available on github (https://github.com/fanhuan/phyloRAD).