Tucuxi-BLAST: Enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach.
José Deney Araujo, Juan Carlo Silva, André Guilherme Costa-Martins, Vanderson Sampaio, Daniel Barros de Castro, Robson Francisco de Souza, Jeevan Giddaluru, Pablo Ivan Pereira Ramos, Robespierre Pita, Maurício Lima Barreto, Manoel Barral Netto, Helder Takashi Imoto Nakaya
Published in: PeerJ (2022)
Our method was able to overcome misspellings and typographical errors in administrative databases. On the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 h (127 h) to perform the record linkage (RL), whereas Tucuxi-BLAST took only 23 h, roughly 5.5 times faster. When compared with five existing RL tools on a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed. By repurposing genomic tools, Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
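
To make the idea of a DNA-encoded record concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the encoding scheme, so the alphabet, the three-nucleotide codon table, and the helper names `encode_field` and `identity` are all hypothetical. The position-wise comparison is a simple stand-in for a sequence aligner, used only to show that a single typo changes a small fraction of the encoded sequence.

```python
from itertools import product
import string

# Hypothetical alphabet (an assumption): uppercase letters, digits, and space.
ALPHABET = string.ascii_uppercase + string.digits + " "

# Assumed mapping: each symbol gets a distinct 3-nucleotide codon.
# 4**3 = 64 codons are available, enough for the 37 symbols above.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CHAR_TO_CODON = dict(zip(ALPHABET, CODONS))


def encode_field(text: str) -> str:
    """Encode a text field as a DNA string; unsupported characters are dropped."""
    return "".join(CHAR_TO_CODON[c] for c in text.upper() if c in CHAR_TO_CODON)


def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the longer sequence (no gaps allowed)."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


if __name__ == "__main__":
    clean = encode_field("MARIA SILVA")
    typo = encode_field("MARIA SILVS")  # one substituted character
    print(clean)
    print(typo)
    print(identity(clean, typo))
```

Unlike this position-wise comparison, an aligner such as BLAST also tolerates insertions and deletions, which is what would let an encoded record with a missing or extra character still match its counterpart.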