Integrating single-cell RNA-seq datasets with substantial batch effects.

Karin Hrovatin, Amir Ali Moinfar, Alejandro Tejada Lapuerta, Luke Zappia, Benjamin J Lengerich, Manolis Kellis, Fabian Joachim Theis
Published in: bioRxiv : the preprint server for biology (2023)
Computational methods for integrating scRNA-seq datasets often struggle to harmonize datasets with substantial differences driven by technical or biological variation, such as between different species, organoids and primary tissue, or different scRNA-seq protocols, including single-cell and single-nuclei. Given that many widely adopted and scalable methods are based on conditional variational autoencoders (cVAE), we hypothesize that machine learning interventions to standard cVAEs can help improve batch effect removal while potentially preserving biological variation more effectively. To address this, we assess four strategies applied to commonly used cVAE models: the previously proposed Kullback-Leibler divergence (KL) regularization tuning and adversarial learning, as well as cycle-consistency loss (previously applied to multi-omic integration) and the multimodal variational mixture of posteriors prior (VampPrior) that has not yet been applied to integration. We evaluated performance in three data settings, namely cross-species, organoid-tissue, and cell-nuclei integration. Cycle-consistency and VampPrior improved batch correction while retaining high biological preservation, with their combination further increasing performance. While adversarial learning led to the strongest batch correction, its preservation of within-cell type variation did not match that of VampPrior or cycle-consistency models, and it was also prone to mixing unrelated cell types with different proportions across batches. KL regularization strength tuning had the least favorable performance, as it jointly removed biological and batch variation by reducing the number of effectively used embedding dimensions. Based on our findings, we recommend the adoption of the VampPrior in combination with the cycle-consistency loss for integrating datasets with substantial batch effects.
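
To illustrate why the choice of prior matters for the KL term, the following is a minimal sketch and not the authors' implementation. It assumes the commonly described VampPrior form, a uniform mixture of diagonal Gaussians, and passes the component means and variances in directly; in a VampPrior these would come from running the encoder on learnable pseudo-inputs. All function names here are hypothetical.

```python
import numpy as np


def log_normal_diag(z, mean, var):
    """Log density of a diagonal Gaussian, summed over latent dimensions."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var, axis=-1)


def log_standard_normal_prior(z):
    """log p(z) under the standard normal prior of a plain cVAE."""
    return log_normal_diag(z, np.zeros_like(z), np.ones_like(z))


def log_mixture_prior(z, comp_means, comp_vars):
    """log p(z) = log (1/K) sum_k N(z; mean_k, diag(var_k)).

    z has shape (n, d); comp_means and comp_vars have shape (K, d).
    Assumption: component parameters are given directly instead of being
    produced by an encoder applied to pseudo-inputs.
    """
    log_probs = log_normal_diag(
        z[:, None, :], comp_means[None, :, :], comp_vars[None, :, :]
    )  # shape (n, K)
    # Numerically stable log-sum-exp over the K components.
    m = log_probs.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_probs - m).sum(axis=1))
    return lse - np.log(comp_means.shape[0])


def kl_single_sample(z, q_mean, q_var, log_prior):
    """One-sample estimate of KL(q(z|x) || p(z)): log q(z|x) - log p(z)."""
    return log_normal_diag(z, q_mean, q_var) - log_prior(z)


# Toy posterior centred far from the origin, evaluated at its own mean.
q_mean = np.array([[3.0, 3.0]])
q_var = np.ones((1, 2))
z = q_mean.copy()

# Two-component mixture prior, one component sitting near the posterior.
comp_means = np.array([[0.0, 0.0], [3.0, 3.0]])
comp_vars = np.ones((2, 2))

kl_standard = kl_single_sample(z, q_mean, q_var, log_standard_normal_prior)
kl_mixture = kl_single_sample(
    z, q_mean, q_var, lambda s: log_mixture_prior(s, comp_means, comp_vars)
)
```

In this toy setting the mixture prior penalizes the off-origin posterior less than the standard normal prior does, which is one intuition for how a multimodal prior can leave room for distinct cell populations in the latent space. The KL regularization tuning mentioned in the abstract would correspond to multiplying this term by a weight in the training loss.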