Quantifying the degree of bias from using county-scale data in species distribution modeling: Can increasing sample size or using county-averaged environmental data reduce distributional overprediction?

Steven D Collins, John C Abbott, Nancy E McIntyre
Published in: Ecology and Evolution (2017)
Citizen-science databases have been used to develop species distribution models (SDMs), although records for many taxa may be georeferenced only to county. It is tacitly assumed that SDMs built from county-scale data should be less precise than those built with more accurate localities, but the extent of the bias is currently unknown. Our aims in this study were to illustrate the effects of using county-scale data on the spatial extent and accuracy of SDMs relative to true locality data and to compare potential compensatory methods (including increased sample size and using overall county environmental averages rather than point locality environmental data). To do so, we developed SDMs in MaxEnt with PRISM-derived BIOCLIM parameters for 283 species of odonates (dragonflies and damselflies) and 230 species of butterflies, using five subsets from the OdonataCentral and Butterflies and Moths of North America citizen-science databases: (1) a true locality dataset, (2) a corresponding sister dataset of county-centroid coordinates, (3) a dataset where the average environmental conditions within each county were assigned to each record, (4) a 50/50% mix of true localities and county-centroid coordinates, and (5) a 50/50% mix of true localities and records assigned the average environmental conditions within each county. These mixtures allowed us to quantify the degree of bias from county-scale data. Models developed with county centroids overpredicted the extent of suitable habitat by 15% on average compared to true locality models, although larger sample sizes (>100 locality records) reduced this disparity. Assigning county-averaged environmental conditions, however, did not offer consistent improvement. Because county-level data are of limited value for developing SDMs except for species that are widespread and well collected or that inhabit regions where small, climatically uniform counties predominate, three means of encouraging more accurate georeferencing in citizen-science databases are provided.
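The dataset construction and the overprediction figure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout (`county`, `lon`, `lat`), the function names, the way the 50/50 mix is drawn, and the definition of overprediction as the relative difference in the number of suitable grid cells are all assumptions made for illustration.

```python
import random


def centroid_sister(records, county_centroids):
    """Build a sister dataset (hypothetical helper): each record keeps its
    county but takes that county's centroid as its coordinates."""
    sister = []
    for rec in records:
        lon, lat = county_centroids[rec["county"]]
        sister.append({**rec, "lon": lon, "lat": lat})
    return sister


def mixed_dataset(true_records, sister_records, seed=0):
    """Assumed 50/50 mix: a random half of the records keep their true
    coordinates; the rest use the county-centroid version."""
    rng = random.Random(seed)
    indices = list(range(len(true_records)))
    rng.shuffle(indices)
    keep_true = set(indices[: len(indices) // 2])
    return [
        true_records[i] if i in keep_true else sister_records[i]
        for i in range(len(true_records))
    ]


def percent_overprediction(true_suitable_cells, county_suitable_cells):
    """Assumed metric: percent difference in predicted suitable area
    (counted in grid cells) relative to the true-locality model."""
    return 100.0 * (county_suitable_cells - true_suitable_cells) / true_suitable_cells


if __name__ == "__main__":
    # Made-up records and centroids, for illustration only.
    centroids = {"A": (-101.0, 33.5), "B": (-97.7, 30.3)}
    true_recs = [
        {"county": "A", "lon": -101.2, "lat": 33.6},
        {"county": "A", "lon": -100.9, "lat": 33.4},
        {"county": "B", "lon": -97.8, "lat": 30.2},
        {"county": "B", "lon": -97.6, "lat": 30.4},
    ]
    sister = centroid_sister(true_recs, centroids)
    mix = mixed_dataset(true_recs, sister)
    print(len(mix), "records in the mixed dataset")
    print(percent_overprediction(1000, 1150), "% overprediction")
```

Under this definition, a county-centroid model that flags 1,150 cells as suitable where the true-locality model flags 1,000 overpredicts by 15%, matching the scale of the average reported in the abstract.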