Collision between biological process and statistical analysis revealed by mean centring.

David F Westneat, Yimen G Araya-Ajoy, Hassen Allegue, Barbara Class, Niels Jeroen Dingemanse, Ned A Dochtermann, László Zsolt Garamszegi, Julien G A Martin, Shinichi Nakagawa, Denis Réale, Holger Schielzeth
Published in: The Journal of animal ecology (2020)
Animal ecologists often collect hierarchically structured data and analyse these with linear mixed-effects models. Specific complications arise when the effect sizes of covariates vary on multiple levels (e.g. within vs. among subjects). Mean centring of covariates within subjects offers a useful approach in such situations, but is not without problems. A statistical model represents a hypothesis about the underlying biological process. Mean centring within clusters assumes that the lower level responses (e.g. within subjects) depend on the deviation from the subject mean (relative) rather than on the absolute scale of the covariate. This may or may not be biologically realistic. We show that a mismatch between the nature of the generating (i.e. biological) process and the form of the statistical analysis produces major conceptual and operational challenges for empiricists. We explored the consequences of mismatches by simulating data with three response-generating processes differing in the source of correlation between a covariate and the response. These data were then analysed with three different analysis equations. We asked how robustly different analysis equations estimate key parameters of interest and under which circumstances biases arise. Mismatches between generating and analytical equations created several intractable problems for estimating key parameters. The most widely misestimated parameter was the among-subject variance in response. We found that no single analysis equation was robust in estimating all parameters generated by all equations. Importantly, even when response-generating and analysis equations matched mathematically, bias in some parameters arose when sampling across the range of the covariate was limited. Our results have general implications for how we collect and analyse data. They also remind us more generally that conclusions from statistical analysis of data are conditional on a hypothesis, sometimes implicit, for the process(es) that generated the attributes we measure. We discuss strategies for real data analysis in the face of uncertainty about the underlying biological process.
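As a minimal illustration of what mean centring within subjects does, the Python sketch below splits a covariate into a subject mean and a within-subject deviation, then recovers separate within- and among-subject slopes from simulated data. It is not the authors' simulation or any of their three analysis equations. The function names, parameter values, and the use of ordinary least squares in place of a linear mixed-effects model are all assumptions made for brevity. The data are generated so that the response depends on the observed subject mean and the deviation from it, which is the case where the generating process matches the mean-centred analysis.

```python
import random
from collections import defaultdict


def centre_within_subjects(x, subject):
    """Split covariate x into subject means and within-subject deviations."""
    totals, counts = defaultdict(float), defaultdict(int)
    for xi, s in zip(x, subject):
        totals[s] += xi
        counts[s] += 1
    means = {s: totals[s] / counts[s] for s in totals}
    x_mean = [means[s] for s in subject]                   # among-subject part
    x_dev = [xi - means[s] for xi, s in zip(x, subject)]   # within-subject part
    return x_mean, x_dev


def ols_slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_and_fit(n_subjects=200, n_obs=10, b_within=0.5, b_among=2.0, seed=1):
    """Simulate a response driven by relative (mean-centred) covariate values.

    Hypothetical parameter values; no random intercepts are simulated, so plain
    OLS stands in for the mixed model here.
    """
    rng = random.Random(seed)
    x, subject = [], []
    for s in range(n_subjects):
        centre = rng.gauss(0.0, 1.0)          # where this subject sits on the covariate
        for _ in range(n_obs):
            x.append(centre + rng.gauss(0.0, 1.0))
            subject.append(s)
    x_mean, x_dev = centre_within_subjects(x, subject)
    y = [b_among * m + b_within * d + rng.gauss(0.0, 1.0)
         for m, d in zip(x_mean, x_dev)]
    # x_dev sums to zero within each subject, so it is uncorrelated with x_mean
    # and the two simple regression slopes equal the multiple-regression ones.
    return ols_slope(x_dev, y), ols_slope(x_mean, y)


within_hat, among_hat = simulate_and_fit()
print(f"within-subject slope: {within_hat:.2f}, among-subject slope: {among_hat:.2f}")
```

If the response instead depended on the absolute value of the covariate, the same decomposition would still run, but its estimates would answer a different question from the one the generating process poses. That is the kind of mismatch the abstract describes.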
Keyphrases
  • data analysis
  • electronic health record
  • big data
  • machine learning
  • risk factors
  • neural network