Identifiability of species network topologies from genomic sequences using the logDet distance.
Elizabeth S. Allman, Hector Baños, John A. Rhodes
Published in: Journal of Mathematical Biology (2022)
Inference of network-like evolutionary relationships between species from genomic data must address the interwoven signals from both gene flow and incomplete lineage sorting. The heavy computational demands of standard approaches to this problem severely limit the size of datasets that may be analyzed, in both the number of species and the number of genetic loci. Here we provide a theoretical pointer to more efficient methods, by showing that logDet distances computed from genomic-scale sequences retain sufficient information to recover network relationships in the level-1 ultrametric case. This result is obtained under the Network Multispecies Coalescent model combined with a mixture of General Time-Reversible sequence evolution models across individual gene trees. It applies both to unlinked site data, such as SNPs, and to sequence data in which many contiguous sites may have evolved on a common tree, such as concatenated gene sequences. Thus, under standard stochastic models, statistically justifiable inference of network relationships from sequences can be accomplished without consideration of individual genes or gene trees.
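To make the central quantity concrete, the following is a minimal sketch of a pairwise logDet distance computed from two aligned DNA sequences. It is an illustration only, not the authors' implementation: the function name `logdet_distance`, the choice to skip sites with gaps or ambiguous characters, and the particular normalization (one commonly used form, d = -1/4 [ln|det F| - 1/2 ln(product of the row and column sums of F)], where F is the 4x4 matrix of relative site-pattern frequencies) are all assumptions made for this example.

```python
import numpy as np

BASES = "ACGT"  # assumed state order; any fixed order gives the same distance


def logdet_distance(seq_a, seq_b):
    """Hypothetical helper: logDet distance between two aligned DNA sequences.

    Assumes the normalization
        d = -1/4 * ( ln|det F| - 1/2 * (sum ln row_sums + sum ln col_sums) ),
    where F holds the relative frequencies of aligned base pairs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    index = {base: i for i, base in enumerate(BASES)}
    counts = np.zeros((4, 4))
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Sites with gaps or ambiguity codes are skipped (an assumption).
        if x in index and y in index:
            counts[index[x], index[y]] += 1

    total = counts.sum()
    if total == 0:
        raise ValueError("no comparable sites")
    F = counts / total

    # slogdet returns the sign and ln|det F| without underflow problems.
    sign, log_abs_det = np.linalg.slogdet(F)
    if sign == 0:
        raise ValueError("frequency matrix is singular; distance undefined")

    row_sums = F.sum(axis=1)  # base frequencies in seq_a
    col_sums = F.sum(axis=0)  # base frequencies in seq_b
    correction = 0.5 * (np.log(row_sums).sum() + np.log(col_sums).sum())
    return -0.25 * (log_abs_det - correction)
```

Applying such a function to every pair of taxa in a concatenated alignment yields a distance matrix computed without reference to individual genes or gene trees, which is the kind of input the identifiability result concerns.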