ReactionDataExtractor: A Tool for Automated Extraction of Information from Chemical Reaction Schemes.
Damian M Wilary, Jacqueline M Cole. Published in: Journal of Chemical Information and Modeling (2021)
Chemical reaction schemes are commonly used for visual encapsulation of chemical information. Figures of reaction schemes contain chemical transformations, the chemical species involved, and the reaction conditions. From a data-mining point of view, they constitute rich sources, densely packed with knowledge. Yet, the challenge of automatically extracting data from them has remained largely untackled. This work presents ReactionDataExtractor, a software tool that can be used for the automatic extraction of information from multistep reaction schemes. Its capabilities include segmentation of reaction steps, of regions containing reaction conditions, and of chemical diagrams, as well as optical character and structure recognition. A combination of rules and unsupervised machine-learning approaches is used, with bespoke algorithms that detect arrows, structures, labels, and conditions. It can be used as a low-maintenance tool for database generation, capable of extracting data from large quantities of images supplied by the user. On assessment using a self-generated evaluation set, the tool achieved precision and recall values between 67% and 91% in the six core areas of data extraction. The ReactionDataExtractor tool is released under the MIT license and is available to download from http://www.reactiondataextractor.org.
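As a reminder of what the reported precision and recall figures measure, the following is a minimal sketch of how the two metrics are computed from detection counts. It is not taken from ReactionDataExtractor itself: the function name `precision_recall` and the counts used are hypothetical and chosen only for illustration.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw detection counts.

    precision = TP / (TP + FP): fraction of extracted items that are correct.
    recall    = TP / (TP + FN): fraction of ground-truth items that were extracted.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when nothing was predicted or annotated.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for a single extraction task (illustrative only,
# not results from the paper).
precision, recall = precision_recall(
    true_positives=80, false_positives=20, false_negatives=10
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In an evaluation of this kind, one such pair of values would be computed separately for each extraction task, which is why the abstract reports a range rather than a single number.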