Contrastive representation learning of inorganic materials to overcome lack of training datasets.

Published in: Chemical communications (Cambridge, England) (2022)

Data representation defines the feature space in which the data are distributed, and this distribution is one of the key factors determining the prediction accuracy of machine learning (ML). In particular, data representation is crucial for handling small and biased training datasets, which are the main challenge of ML in chemical applications. In this paper, we propose a data-agnostic representation method that automatically and universally generates a vector-shaped, target-specific representation of crystal structures. With the materials representation produced by the proposed method, the prediction capabilities of ML algorithms improved substantially on small training datasets and in transfer learning tasks. Moreover, the prediction accuracies of ML algorithms improved by 28.89-30.87% in extrapolation problems, where the physical properties of materials in unknown material groups are predicted. The source code of the proposed method (EMRL) is publicly available at https://github.com/ngs00/emrl/tree/master/EMRL.

Keyphrases