Identification of Methylation Signatures and Rules for Sarcoma Subtypes by Machine Learning Methods.
Jingxin Ren, XianChao Zhou, Wei Guo, KaiYan Feng, Tao Huang, Yu-Dong Cai
Published in: BioMed Research International (2022)
Sarcoma, the second most common type of solid tumor in children and adolescents, has a wide variety of subtypes that are often not properly diagnosed at an early stage, leading to late metastases and causing serious loss of life and property for patients and their families. It exhibits a high degree of heterogeneity at the cellular, molecular, and epigenetic levels, and DNA methylation has been proposed to play a role in the diagnosis of sarcoma subtypes. This study therefore aimed to find potential biomarkers at the DNA methylation level that distinguish different sarcoma subtypes. A machine learning workflow was designed to analyse sarcoma samples, each of which was represented by a large number of methylation sites. Irrelevant sites were removed using the Boruta method, and the remaining sites, which were related to the target variable, were kept for further analysis. Three feature ranking methods (LASSO, LightGBM, and MCFS) were then used to rank these features, and six classification models were constructed by combining incremental feature selection with two classification algorithms, decision tree (DT) and random forest (RF), giving one model for each pairing of ranking method and classifier. Among these models, the RF model outperformed the DT model under all three feature rankings. The specific expression of genes annotated to the most relevant methylation site features, such as PRKAR1B, INPP5A, and GLI3, has been reported in published studies to be associated with sarcoma. Moreover, the quantitative rules obtained by the decision tree algorithm helped to clarify the essential differences between sarcoma types and to classify sarcoma subtypes, providing a new means of clinical identification and of determining new therapeutic targets.
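The incremental feature selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the function names, the leave-one-out accuracy score, and the nearest-centroid classifier (a dependency-free stand-in for the decision tree and random forest used in the study) are all assumptions made for illustration. The idea shown is only that, given a feature ranking, classifiers are evaluated on the top-k features for growing k and the best-performing k is kept.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (assumption): pick the class whose mean vector is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return min(
        sorted(centroids),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def loo_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy using only the columns listed in feature_idx."""
    sub = [[row[j] for j in feature_idx] for row in X]
    hits = 0
    for i in range(len(sub)):
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, sub[i]) == y[i]
    return hits / len(sub)


def incremental_feature_selection(X, y, ranked_features, step=1):
    """Score the top-k ranked features for k = step, 2*step, ... and keep the best k.

    Ties are broken in favour of the smaller feature subset.
    """
    results = []
    for k in range(step, len(ranked_features) + 1, step):
        results.append((k, loo_accuracy(X, y, ranked_features[:k])))
    best_k, best_acc = max(results, key=lambda r: (r[1], -r[0]))
    return best_k, best_acc, results


# Hypothetical toy data: 6 samples, 3 "methylation sites"; only column 2 separates the classes.
X = [
    [0.5, 0.9, 0.10],
    [0.1, 0.2, 0.15],
    [0.9, 0.5, 0.20],
    [0.4, 0.1, 0.80],
    [0.2, 0.8, 0.85],
    [0.8, 0.4, 0.90],
]
y = ["A", "A", "A", "B", "B", "B"]

# Assumed output of a ranking method (e.g. LASSO, LightGBM, or MCFS): most important first.
ranked = [2, 0, 1]

best_k, best_acc, results = incremental_feature_selection(X, y, ranked)
print(best_k, best_acc)
```

In a setting like the one in the abstract, the same loop would be run once per ranking method and per classifier, which is how three rankings and two algorithms yield six models.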
Keyphrases
- machine learning
- dna methylation
- genome wide
- deep learning
- early stage
- artificial intelligence
- gene expression
- big data
- end stage renal disease
- poor prognosis
- climate change
- chronic kidney disease
- radiation therapy
- high resolution
- long non coding rna
- peritoneal dialysis
- mass spectrometry
- rectal cancer
- single cell
- decision making
- transcription factor
- binding protein
- sentinel lymph node
- rna seq