The impacts of active and self-supervised learning on efficient annotation of single-cell expression data.
Michael J Geuenich, Dae-Won Gong, Kieran R Campbell
Published in: Nature Communications (2024)
A crucial step in the analysis of single-cell data is annotating cells to cell types and states. While myriad approaches have been proposed, manual labeling of cells to create training datasets remains tedious and time-consuming. In the field of machine learning, active and self-supervised learning methods have been proposed to improve the performance of a classifier while reducing both annotation time and label budget. However, the benefits of such strategies for single-cell annotation have yet to be evaluated in realistic settings. Here, we perform a comprehensive benchmarking of active and self-supervised labeling strategies across a range of single-cell technologies and cell type annotation algorithms. We quantify the benefits of active learning and self-supervised strategies in the presence of cell type imbalance and variable similarity. We introduce adaptive reweighting, a heuristic procedure tailored to single-cell data (including a marker-aware version) that shows competitive performance with existing approaches. In addition, we demonstrate that having prior knowledge of cell type markers improves annotation accuracy. Finally, we summarize our findings into a set of recommendations for those implementing cell type annotation procedures or platforms. An R package implementing the heuristic approaches introduced in this work may be found at https://github.com/camlab-bioml/leader.
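As an illustration of the general active learning idea mentioned in the abstract, the sketch below picks the unlabeled cells whose predicted cell type is most uncertain (highest predictive entropy) as the next ones to annotate. It is a generic uncertainty-sampling example, not the adaptive reweighting heuristic or the API of the authors' R package; the function names `entropy` and `select_cells_to_label` and the toy probabilities are hypothetical.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of each row of predicted class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def select_cells_to_label(probs, labeled_mask, batch_size):
    """Return indices of the unlabeled cells with the highest entropy.

    probs        : (n_cells, n_cell_types) classifier probabilities
    labeled_mask : boolean array, True where a cell already has a label
    batch_size   : number of cells to send to the annotator
    """
    scores = entropy(probs)
    # Already-labeled cells must never be selected again.
    scores = np.where(labeled_mask, -np.inf, scores)
    order = np.argsort(-scores, kind="stable")
    n_unlabeled = int((~labeled_mask).sum())
    return order[:min(batch_size, n_unlabeled)]

# Toy example (made-up values): 4 cells, 3 candidate cell types.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.50, 0.49, 0.01],  # ambiguous between two types
    [0.90, 0.05, 0.05],  # already labeled
])
labeled = np.array([False, False, False, True])
picked = select_cells_to_label(probs, labeled, batch_size=2)
print(picked)
```

In a full loop, the selected cells would be labeled by an expert, added to the training set, and the classifier retrained before the next selection round.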
Keyphrases
- single cell
- RNA-seq
- machine learning
- big data
- induced apoptosis
- high throughput
- artificial intelligence
- electronic health record
- cell cycle arrest
- healthcare
- deep learning
- poor prognosis
- stem cells
- oxidative stress
- signaling pathway
- minimally invasive
- mesenchymal stem cells
- smoking cessation
- cell proliferation
- long non-coding RNA
- cell therapy
- neural network