Navigating the seven challenges of taxonomic reference databases in metabarcoding analyses.

François Keck, Marjorie Couton, Florian Altermatt

Published in: Molecular Ecology Resources (2022)

Assessment of biodiversity using metabarcoding data, such as from bulk or eDNA sampling, is becoming increasingly relevant in ecology, biodiversity sciences, and monitoring. The taxonomic identification of species from their DNA sequences relies strongly on reference databases that link genetic sequences to taxonomic names. These databases vary in completeness and availability, depending on the taxonomic group studied and the genetic region targeted. The incompleteness of reference databases is often invoked to explain why metabarcoding fails to detect species that are supposedly present. However, there are further, generally overlooked problems with reference databases that can lead to false or inaccurate taxonomic assignments. Here, we synthesize all possible problems inherent to reference databases. In particular, we identify a complete, mutually non-exclusive list of seven classes of challenges in selecting, developing, and using a reference database for taxonomic assignment. These are: 1) mislabeling, 2) sequencing errors, 3) sequence conflict, 4) taxonomic conflict, 5) low taxonomic resolution, 6) missing taxon, and 7) missing intraspecific variant. For each problem identified, we describe its possible consequences for the taxonomic assignment process. We illustrate each problem with examples taken from the literature or obtained by quantitative analyses of public databases, such as GenBank or BOLD. Finally, we discuss possible solutions to the identified problems and how to navigate them. Only by raising users' awareness of the limitations of metabarcoding data and DNA reference databases will adequate interpretations of these data be achieved.

Keyphrases