Hospital-wide natural language processing summarising the health data of 1 million patients.
Daniel M Bean, Zeljko Kraljevic, Anthony Shek, James Teo, Richard J B Dobson. Published in: PLOS Digital Health (2023)
Electronic health records (EHRs) represent a major repository of real-world clinical trajectories, interventions and outcomes. While modern enterprise EHRs try to capture data in structured, standardised formats, a substantial proportion of the information captured in the EHR is still recorded only as unstructured text and can only be transformed into structured codes by manual processes. Recently, Natural Language Processing (NLP) algorithms have reached a level of performance suitable for large-scale, accurate information extraction from clinical text. Here we describe the application of open-source named entity recognition and linkage (NER+L) methods (CogStack, MedCAT) to the entire text content of a large UK hospital trust (King's College Hospital, London). The resulting dataset contains 157M SNOMED concepts generated from 9.5M documents for 1.07M patients over a period of 9 years. We present a summary of prevalence and disease onset as well as a patient embedding that captures major comorbidity patterns at scale. NLP has the potential to transform the health data lifecycle through large-scale automation of a traditionally manual task.
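As a rough illustration of the kind of pipeline the abstract describes, the sketch below runs MedCAT-style NER+L over a few toy notes and then builds a simple patient embedding. It is not the authors' pipeline: the model pack path, the toy documents, and the embedding choice (truncated SVD over a patient-by-concept count matrix) are illustrative assumptions, and the paper's own method for prevalence, onset and embeddings is described in the full text.

```python
# Minimal sketch, assuming a downloaded MedCAT model pack; paths and toy data are placeholders.
from collections import Counter, defaultdict

from medcat.cat import CAT
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Load a pre-trained MedCAT model pack (concept database + vocabulary + NER model).
cat = CAT.load_model_pack("medcat_snomed_modelpack.zip")  # placeholder path

# Toy corpus: (patient_id, free-text document) pairs standing in for EHR notes.
documents = [
    ("patient_1", "Known type 2 diabetes mellitus, admitted with chest pain."),
    ("patient_1", "Echocardiogram shows reduced ejection fraction."),
    ("patient_2", "History of chronic kidney disease stage 3 and hypertension."),
    ("patient_3", "Newly diagnosed depression, ongoing smoking cessation support."),
]

# NER+L step: extract linked concepts (CUIs) from each note and count them per patient.
patient_concepts = defaultdict(Counter)
for patient_id, text in documents:
    for ent in cat.get_entities(text)["entities"].values():
        patient_concepts[patient_id][ent["cui"]] += 1

# Build a sparse patient-by-concept count matrix.
patients = sorted(patient_concepts)
concepts = sorted({cui for counts in patient_concepts.values() for cui in counts})
concept_index = {cui: j for j, cui in enumerate(concepts)}
rows, cols, vals = [], [], []
for i, patient_id in enumerate(patients):
    for cui, count in patient_concepts[patient_id].items():
        rows.append(i)
        cols.append(concept_index[cui])
        vals.append(count)
matrix = csr_matrix((vals, (rows, cols)), shape=(len(patients), len(concepts)))

# One simple way to get a low-dimensional patient embedding: patients with shared
# comorbidity patterns end up close together in the reduced space.
embedding = TruncatedSVD(n_components=2).fit_transform(matrix)
print(dict(zip(patients, embedding.tolist())))
```

In practice a hospital-scale run of this kind would stream documents from a platform such as CogStack rather than an in-memory list, but the per-document extract-and-aggregate structure stays the same.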