Solubility Challenge Revisited after Ten Years, with Multilab Shake-Flask Data, Using Tight (SD ∼ 0.17 log) and Loose (SD ∼ 0.62 log) Test Sets.
Antonio Llinas, Alex Avdeef. Published in: Journal of Chemical Information and Modeling (2019)
Ten years ago we issued, in conjunction with the Journal of Chemical Information and Modeling, an open prediction challenge to the cheminformatics community. Would they be able to predict the intrinsic solubilities of 32 druglike compounds using only a high-precision set of 100 compounds as a training set? The "Solubility Challenge" was a widely recognized success and spurred many discussions about prediction methods and data quality. Regardless of the obvious limitations of the challenge, the conclusions were somewhat unexpected. Although contestants employed the entire spectrum of approaches then available for predicting aqueous solubility and had an extremely tight data set at their disposal, it was not possible to identify the best methods for predicting aqueous solubility: a variety of methods and combinations all performed equally well (or badly). Several authors have since suggested that it is not the poor quality of the solubility data that limits the accuracy of the predictions, but the deficient methods used. Now, ten years after the original Solubility Challenge, we revisit it and challenge the community to a new test using a much larger database that includes estimates of interlaboratory reproducibility.