An improved machine learning pipeline for urinary volatiles disease detection: Diagnosing diabetes.

Andrea S Martinez-Vernon, James A Covington, Ramesh P Arasaradnam, Siavash Esfahani, Nicola O'Connell, Ioannis Kyrou, Richard S Savage
Published in: PloS one (2018)
In this study, we present a new data analysis pipeline for field asymmetric ion mobility spectrometry (FAIMS) data, and demonstrate a number of improvements over previously used methods. We evaluate the effect of a series of candidate operational steps during data processing, such as the use of wavelet transforms, principal component analysis (PCA), and classifier ensembles. We also demonstrate the use of FAIMS data in our pipeline to diagnose diabetes on the basis of a simple urine sample using machine learning classifiers. We present results for data generated from a case-control study of 115 urine samples, collected from 72 type II diabetic patients and 43 healthy volunteers as negative controls. The resulting pipeline combines the steps that gave the best classification model performance. These include a two-dimensional discrete wavelet transform and the Wilcoxon rank-sum test for feature selection. We achieve a best ROC curve AUC of 0.825 (0.747-0.9, 95% CI) for classification of diabetes vs control. This result is robust to changes in the data pipeline and across analysis runs, with AUC > 0.80 achieved in a range of cases. This is a substantial improvement in performance over previously used data processing methods in this area. Our ability to make strong statements about the ability of FAIMS to diagnose diabetes is limited, however, as we found confounding effects from the demographics when including these data in the pipeline. The demographics alone produced a best AUC of 0.87 (0.795-0.94, 95% CI). While combining the demographics with the FAIMS data improved the AUC (0.907; 0.848-0.97, 95% CI), the difference was not significant. Nevertheless, the pipeline itself shows a significant improvement in performance over the more basic methods that have been used with FAIMS data in the past.
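The abstract names the main pipeline steps (a two-dimensional discrete wavelet transform, Wilcoxon rank-sum feature selection, and ROC AUC as the performance measure) but not how they were implemented. The following is a minimal sketch of those three steps in pure Python, not the authors' code. It assumes a single-level Haar wavelet (the abstract does not say which wavelet family or decomposition level was used), a normal approximation for the rank-sum statistic without tie correction of the variance, and hypothetical function names throughout. The demonstration data are synthetic, not FAIMS measurements.

```python
import math
import random


def haar_dwt2(m):
    """Single-level 2D Haar DWT of a matrix with even row and column counts.
    Returns four sub-bands (one approximation, three detail), each half-size."""
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(m), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(m[0]), 2):
            a, b = m[i][j], m[i][j + 1]
            c, d = m[i + 1][j], m[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # local average (approximation)
            r_lh.append((a + b - c - d) / 2)  # difference between the two rows
            r_hl.append((a - b + c - d) / 2)  # difference between the two columns
            r_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll); lh.append(r_lh); hl.append(r_hl); hh.append(r_hh)
    return ll, lh, hl, hh


def wavelet_features(m):
    """Flatten all four sub-bands into one feature vector."""
    return [v for band in haar_dwt2(m) for row in band for v in row]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_z(x, y):
    """Wilcoxon rank-sum z statistic for sample x against sample y
    (normal approximation; variance not corrected for ties)."""
    n1, n2 = len(x), len(y)
    w = sum(average_ranks(list(x) + list(y))[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


def select_features(X, y, k):
    """Return (feature_index, z) for the k features with the largest |z|."""
    zs = []
    for f in range(len(X[0])):
        pos = [row[f] for row, lab in zip(X, y) if lab == 1]
        neg = [row[f] for row, lab in zip(X, y) if lab == 0]
        zs.append((f, rank_sum_z(pos, neg)))
    return sorted(zs, key=lambda fz: -abs(fz[1]))[:k]


def auc(scores, labels):
    """ROC AUC computed from the Mann-Whitney U statistic."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, lab in zip(average_ranks(scores), labels) if lab == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = [], []
    for n in range(40):
        label = n % 2
        m = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(4)]
        m[0][0] += 2.0 * label  # synthetic class signal, not real FAIMS data
        X.append(wavelet_features(m))
        y.append(label)
    top = select_features(X, y, k=3)
    # Toy score: sum of selected features, each oriented by the sign of its z.
    scores = [sum(math.copysign(1, z) * row[f] for f, z in top) for row in X]
    print("selected features:", [f for f, _ in top])
    print("in-sample AUC:", round(auc(scores, y), 3))
```

The toy score at the end stands in for a trained classifier, and the AUC it prints is computed on the same samples used to pick the features, so it is optimistic. A faithful evaluation would repeat the feature selection inside each cross-validation fold and score only held-out samples.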
Keyphrases
  • machine learning
  • electronic health record
  • data analysis
  • big data
  • type 2 diabetes
  • cardiovascular disease
  • artificial intelligence
  • mass spectrometry
  • adipose tissue
  • insulin resistance
  • neural network