Inferring Optimal Species Trees in the Presence of Gene Duplication and Loss: Beyond Rooted Gene Trees.

Md Shamsuzzoha Bayzid
Published in: Journal of computational biology : a journal of computational molecular cell biology (2022)
Estimating species trees from multiple genes is challenging because of gene tree-species tree discordance. Gene duplication and loss events are one of the basic explanations for differences between gene trees and species trees. Minimize Gene Duplication and Loss (MGDL) is a popular technique for inferring species trees from gene trees when the gene trees are discordant due to gene duplications and losses. Previously, exact algorithms for estimating species trees from rooted, binary gene trees under MGDL were proposed. However, gene trees are usually estimated using time-reversible mutation models, which result in unrooted trees. In this article, we propose a dynamic programming (DP) algorithm that gives an exact but exponential-time solution for the case when gene trees are not rooted. We also show that a constrained version of this problem can be solved by this DP algorithm in time that is polynomial in the number of gene trees and taxa. We prove important structural properties that allow us to extend the algorithms for rooted gene trees to unrooted gene trees. We propose a linear-time algorithm for finding the optimal rooted version of an unrooted gene tree, given a rooted species tree, so that the duplication and loss cost is minimized. Moreover, we prove that the optimal rooting under MGDL is also optimal under the MDC (minimize deep coalescence) criterion. The proposed methods can be applied to both orthologous genes and gene families, which by definition include both paralogs and orthologs. Therefore, we hope that these techniques will be useful for estimating species trees from genes sampled throughout the whole genome.
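To make the optimal-rooting problem concrete, the following is a minimal sketch, not the linear-time algorithm of the article. It tries every rooting of a binary gene tree by brute force and scores each one against a rooted binary species tree. The scoring assumes the usual LCA-mapping reconciliation: a gene node is a duplication if it maps to the same species node as one of its children, and losses are counted from the depth difference between a node's mapping and each child's mapping. Trees are nested tuples with gene leaves labeled by species name; all function names (`index_species_tree`, `dup_loss_cost`, `all_rootings`, `best_rooting`) are hypothetical and chosen for illustration only.

```python
from itertools import count

def index_species_tree(tree):
    """Return (parent, depth) dicts keyed by species-tree node (subtree tuple or leaf name)."""
    parent, depth = {}, {}
    def visit(node, par, d):
        parent[node], depth[node] = par, d
        if isinstance(node, tuple):
            for child in node:
                visit(child, node, d + 1)
    visit(tree, None, 0)
    return parent, depth

def lca(a, b, parent, depth):
    """Least common ancestor of two species-tree nodes, by walking up from the deeper one."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a

def dup_loss_cost(gene, parent, depth):
    """Return (duplications, losses) for a rooted binary gene tree under LCA mapping."""
    dups = losses = 0
    def visit(g):
        nonlocal dups, losses
        if not isinstance(g, tuple):
            return g  # a gene leaf maps to the species it was sampled from
        left, right = (visit(c) for c in g)
        m = lca(left, right, parent, depth)
        is_dup = m == left or m == right
        dups += is_dup
        for child_map in (left, right):
            dist = depth[child_map] - depth[m]
            # Duplication: one loss per species edge skipped; speciation: one fewer.
            losses += dist if is_dup else dist - 1
        return m
    visit(gene)
    return dups, losses

def all_rootings(gene):
    """Yield one rooted tree per edge of the unrooted tree underlying `gene`."""
    adj, label, ids = {}, {}, count()
    def build(node):
        nid = next(ids)
        adj[nid] = []
        if isinstance(node, tuple):
            for c in node:
                cid = build(c)
                adj[nid].append(cid)
                adj[cid].append(nid)
        else:
            label[nid] = node
        return nid
    root = build(gene)
    # Suppress the degree-2 root so that only the unrooted topology remains.
    a, b = adj.pop(root)
    adj[a].remove(root)
    adj[b].remove(root)
    adj[a].append(b)
    adj[b].append(a)
    def subtree(node, came_from):
        if node in label:
            return label[node]
        return tuple(subtree(n, node) for n in adj[node] if n != came_from)
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                yield (subtree(u, v), subtree(v, u))

def best_rooting(gene, species):
    """Brute-force search for a rooting that minimizes duplications + losses."""
    parent, depth = index_species_tree(species)
    return min(all_rootings(gene),
               key=lambda r: sum(dup_loss_cost(r, parent, depth)))
```

For example, with the species tree `(("A", "B"), ("C", "D"))`, the caterpillar rooting `("A", ("B", ("C", "D")))` of a four-leaf gene tree costs one duplication and three losses, while `best_rooting` returns a rooting that matches the species tree and costs nothing. Each of the O(n) rootings is rescored from scratch, so this sketch takes at least quadratic time; it serves only as a reference point for what a linear-time method must compute.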
Keyphrases
  • genome wide
  • genome wide identification
  • copy number
  • machine learning
  • dna methylation
  • genome wide analysis
  • gene expression
  • transcription factor