Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse.
Xavier Tannier, Perceval Wajsbürt, Alice Calliger, Basile Dura, Alexandre Mouchet, Martin Hilka, Romain Bey
Published in: Methods of Information in Medicine (2024)
Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including the effects of dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.