Machine-Guided Polymer Knowledge Extraction Using Natural Language Processing: The Example of Named Entity Normalization.

Published in: Journal of Chemical Information and Modeling (2021)

A rich body of literature has emerged in recent years on extracting structured information from materials science text with named entity recognition models. Relatively little work has addressed the "normalization" of extracted entities, that is, recognizing that two or more seemingly different entities refer to the same real-world entity. In this work, we address the normalization of polymer named entities; polymers are a class of materials that often have several common names for the same material in addition to the IUPAC name. We trained supervised clustering models using Word2Vec and fastText word embeddings reported in previous work, so that named entities referring to the same polymer fall within the same cluster in the word embedding space. We report the use of parameterized cosine distance functions to cluster and normalize textually derived entities, achieving an F1 score of 0.85. Furthermore, a labeled data set of polymer names was used to train our model and to infer the true total number of unique polymers that are actively reported in the literature. For ∼15,500 polymer named entities extracted from our corpus of 0.5 million papers, we detected 6734 unique clusters (i.e., unique polymers), 632 of which were manually curated to train the normalization model. This work will serve as a critical ingredient in a natural language processing-based pipeline for the automatic and efficient extraction of knowledge from the polymer literature.
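
To make the idea of clustering names with a parameterized cosine distance concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the form of the parameterization or the clustering algorithm, so everything here is an assumption made for illustration: the per-dimension weight vector `w`, the single-linkage grouping by a fixed `threshold`, the function names, and the three-dimensional toy vectors, which are invented and are not real Word2Vec or fastText embeddings.

```python
import math


def weighted_cosine_distance(u, v, w):
    """Cosine distance under non-negative per-dimension weights w.

    ASSUMPTION: this weighting is one possible parameterization, chosen for
    illustration; the paper's actual distance function may differ.
    """
    dot = sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))
    norm_u = math.sqrt(sum(wi * ui * ui for wi, ui in zip(w, u)))
    norm_v = math.sqrt(sum(wi * vi * vi for wi, vi in zip(w, v)))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def cluster_names(embeddings, w, threshold):
    """Group names whose pairwise distance is below `threshold`.

    ASSUMPTION: single-linkage grouping via union-find, used here only as a
    simple stand-in for whatever clustering procedure the paper uses.
    """
    names = list(embeddings)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if weighted_cosine_distance(embeddings[a], embeddings[b], w) < threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Invented toy vectors (hypothetical, for illustration only).
toy_embeddings = {
    "polystyrene": [1.0, 0.1, 0.0],
    "PS": [0.9, 0.2, 0.0],
    "poly(methyl methacrylate)": [0.0, 1.0, 0.1],
    "PMMA": [0.1, 0.9, 0.0],
}
weights = [1.0, 1.0, 1.0]  # uniform weights reduce to ordinary cosine distance
clusters = cluster_names(toy_embeddings, weights, threshold=0.2)
for members in clusters:
    print(members)
```

In a supervised setting such as the one the abstract describes, the weights and the threshold would be fitted against a labeled set of polymer names rather than fixed by hand as they are in this sketch.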

Keyphrases