On the Misleading Use of <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:msubsup><mml:mi>Q</mml:mi> <mml:mrow><mml:mi>F</mml:mi> <mml:mn>3</mml:mn></mml:mrow> <mml:mn>2</mml:mn></mml:msubsup> </mml:math> for QSAR Model Comparison.
Viviana Consonni, Roberto Todeschini, Davide Ballabio, Francesca Grisoni. Published in: Molecular Informatics (2018)
Quantitative Structure-Activity Relationship (QSAR) models play a central role in medicinal chemistry, toxicology and computer-assisted molecular design, as well as supporting regulatory decisions and the reduction of animal testing. Assessing their predictive ability is therefore an essential step for any prospective application. Many metrics have been proposed to estimate the predictive ability of QSAR models, which has created confusion about how models should be evaluated and properly compared. Recently, we showed that the metric <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:msubsup><mml:mi>Q</mml:mi> <mml:mrow><mml:mi>F</mml:mi> <mml:mn>3</mml:mn></mml:mrow> <mml:mn>2</mml:mn></mml:msubsup> </mml:math> is particularly well suited for comparing the external predictivity of different models developed on the same training dataset. When comparing models developed on different training data, however, this function becomes inadequate, and only dispersion measures such as the root-mean-square error (RMSE) should be used. The aim of this work is to clarify the correct and incorrect uses of <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:msubsup><mml:mi>Q</mml:mi> <mml:mrow><mml:mi>F</mml:mi> <mml:mn>3</mml:mn></mml:mrow> <mml:mn>2</mml:mn></mml:msubsup> </mml:math>, discussing its behavior with respect to the training data distribution and illustrating cases in which <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:msubsup><mml:mi>Q</mml:mi> <mml:mrow><mml:mi>F</mml:mi> <mml:mn>3</mml:mn></mml:mrow> <mml:mn>2</mml:mn></mml:msubsup> </mml:math> estimates may be misleading. We therefore encourage the use of dispersion measures when models trained on different datasets must be compared and evaluated.
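To make the abstract's point concrete, the following is a minimal sketch (not taken from the paper; the function names `q2_f3` and `rmse` are illustrative) of the standard Q²F3 definition, Q²F3 = 1 − (SSE_ext/n_ext)/(TSS_tr/n_tr). Because the external error is normalized by the variance of the *training* responses, Q²F3 values are only comparable across models that share the same training set, whereas RMSE depends only on the test-set predictions:

```python
import numpy as np

def q2_f3(y_train, y_test, y_pred):
    """Q^2_F3 = 1 - (mean squared external error) / (training variance).

    The denominator depends on the training-set response distribution,
    so this metric is only meaningful for comparing models built on the
    same training data.
    """
    y_train, y_test, y_pred = map(np.asarray, (y_train, y_test, y_pred))
    ext_mse = np.mean((y_test - y_pred) ** 2)            # SSE_ext / n_ext
    train_var = np.mean((y_train - y_train.mean()) ** 2)  # TSS_tr / n_tr
    return float(1.0 - ext_mse / train_var)

def rmse(y_test, y_pred):
    """Root-mean-square error: a dispersion measure that does not
    depend on the training data, hence suitable for cross-dataset
    model comparison."""
    y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
```

Note that two models with identical RMSE on the same test set can receive different Q²F3 values simply because their training responses are spread differently, which is the behavior the paper flags as misleading.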