Label-Free Data Mining of Scientific Literature by Unsupervised Syntactic Distance Analysis.
Baicheng Zhang, Hengyu Xiao, Guilin Ye, Zhaokun Song, Tiantian Han, Edward Sharman, Man Luo, Aoyuan Cheng, Qing Zhu, Haitao Zhao, Guoqing Zhang, Song Wang, Jun Jiang
Published in: The Journal of Physical Chemistry Letters (2023)
Label-free data mining can efficiently feed large amounts of data from the vast scientific literature into artificial intelligence (AI) processing systems. Here, we demonstrate an unsupervised syntactic distance analysis (SDA) approach that is capable of mining chemical substances, functions, properties, and operations without annotation. This SDA approach was evaluated in several areas of research from the physical sciences and achieved performance in information mining comparable to that of supervised learning, as shown by its satisfactory scores of 0.62-0.72, 0.60-0.82, and 0.86-0.95 in precision, recall, and accuracy, respectively. We also showcase how our approach can assist robotic chemists programmed to perform research focused on double-perovskite colloidal nanocrystals, gold colloidal nanocrystals, oxygen evolution reaction catalysts, and enzyme-like catalysts by designing materials, formulations, and synthesis parameters based on data mined from 1.1 million literature references.
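For readers less familiar with the three scores quoted in the abstract, the following is a minimal sketch of how precision, recall, and accuracy are conventionally computed from a confusion matrix. It uses the standard textbook definitions and is not the authors' evaluation code. The `Confusion` class and the counts in the usage example are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary 'is this token a target entity?' decision (hypothetical helper)."""
    tp: int  # entities correctly extracted
    fp: int  # tokens wrongly extracted as entities
    fn: int  # entities that were missed
    tn: int  # non-entities correctly ignored


def precision(c: Confusion) -> float:
    # Fraction of extracted items that are correct.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Confusion) -> float:
    # Fraction of true items that were extracted.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def accuracy(c: Confusion) -> float:
    # Fraction of all decisions that are correct.
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Illustrative counts only (assumed, not reported in the paper).
example = Confusion(tp=7, fp=3, fn=2, tn=88)
print(precision(example), recall(example), accuracy(example))
```

Because true negatives usually dominate in text mining, accuracy can sit well above precision and recall, which matches the pattern of the ranges reported above.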