A systematic analyses of different bioinformatics pipelines for genomic data and its impact on deep learning models for chromatin loop prediction.
Anup Kumar Halder, Abhishek Agarwal, Karolina Jodkowska, Dariusz Plewczyński
Published in: Briefings in functional genomics (2024)
Genomic data analysis has witnessed a surge in complexity and volume, primarily driven by the advent of high-throughput technologies. In particular, studying chromatin loops and structures has become pivotal in understanding gene regulation and genome organization. This systematic investigation explores the realm of specialized bioinformatics pipelines designed specifically for the analysis of chromatin loops and structures. Our investigation incorporates two protein (CTCF and Cohesin) factor-specific loop interaction datasets from six distinct pipelines, amassing a comprehensive collection of 36 diverse datasets. Through a meticulous review of existing literature, we offer a holistic perspective on the methodologies, tools and algorithms underpinning the analysis of this multifaceted genomic feature. We illuminate the vast array of approaches deployed, encompassing pivotal aspects such as data preparation pipeline, preprocessing, statistical features and modelling techniques. Beyond this, we rigorously assess the strengths and limitations inherent in these bioinformatics pipelines, shedding light on the interplay between data quality and the performance of deep learning models, ultimately advancing our comprehension of genomic intricacies.
Keyphrases
- deep learning
- data analysis
- transcription factor
- high throughput
- copy number
- genome wide
- machine learning
- dna damage
- electronic health record
- gene expression
- big data
- artificial intelligence
- high resolution
- convolutional neural network
- palliative care
- dna methylation
- rna seq
- quality improvement
- high density
- tandem mass spectrometry