flowSim: Near Duplicate Detection for flow cytometry data.
Sebastiano Montante, Yixuan Chen, Ryan R Brinkman
Published in: Cytometry. Part A: the journal of the International Society for Analytical Cytology (2023)
The analysis of large amounts of data is important for the development of machine learning (ML) models. flowSim is the first algorithm designed to visualize, detect and remove highly redundant information in flow cytometry (FCM) training sets, with the aim of decreasing the computational time needed for training and improving the performance of ML algorithms by reducing overfitting. flowSim performs near-duplicate image detection (NDD) by combining community detection algorithms with density analysis of the marker expression values. On a dataset of 160 images of bivariate FCM data, flowSim clustering achieved a mean Adjusted Rand Index (ARI) of 0.90 against consensus manual clustering, demonstrating its effectiveness in identifying similar patterns. flowSim selectively discarded near-duplicate files in datasets constructed with known redundancy, and removed 92.6% of FCM images in a dataset of over 500,000 images drawn from public repositories.
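The abstract reports agreement with manual clustering as an Adjusted Rand Index. As an illustration of what that metric measures, here is a minimal pure-Python sketch of the standard ARI formula (pair counting with a chance correction). It is not taken from flowSim; the function name `adjusted_rand_index` and the example labels are assumptions made for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard Adjusted Rand Index between two labelings of the same items.

    Illustrative helper only; not part of flowSim.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Pairs of items placed together in both labelings (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each labeling separately (row and column sums).
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2           # upper bound on agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)


# Hypothetical labels for six images: a manual grouping and an automated one.
manual = [0, 0, 1, 1, 2, 2]
automated = [0, 0, 1, 2, 2, 2]
print(adjusted_rand_index(manual, automated))
```

An ARI of 1 means the two groupings are identical up to a renaming of the clusters, while values near 0 indicate agreement no better than chance. The score can be negative when the groupings agree less than chance would predict.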
Keyphrases
- flow cytometry
- machine learning
- deep learning
- big data
- artificial intelligence
- electronic health record
- loop mediated isothermal amplification
- convolutional neural network
- healthcare
- real time pcr
- label free
- mental health
- rna seq
- single cell
- poor prognosis
- emergency department
- optical coherence tomography
- long non coding rna