FLIGHTED: Inferring Fitness Landscapes from Noisy High-Throughput Experimental Data.
Vikram Sundar, Boqiang Tu, Lindsey Guan, Kevin M Esvelt
Published in: bioRxiv : the preprint server for biology (2024)
Machine learning (ML) for protein design requires large protein fitness datasets generated by high-throughput experiments for training, fine-tuning, and benchmarking models. However, most models do not account for the experimental noise inherent in these datasets, which harms model performance and changes model rankings in benchmarking studies. Here, we develop FLIGHTED, a Bayesian method for generating fitness landscapes with calibrated errors from noisy high-throughput experimental data. We apply FLIGHTED to single-step selection assays such as phage display and to a novel high-throughput assay, DHARMA, that ties fitness to base editing activity. Our results show that FLIGHTED robustly generates fitness landscapes with accurate errors. We demonstrate that FLIGHTED improves model performance and enables the generation of protein fitness datasets of up to 10^6 variants with DHARMA. FLIGHTED can be used on any high-throughput assay and makes it easy for ML scientists to account for experimental noise when modeling protein fitness.