A Data Set Comparison Method Using Noise Statistics Applied to VUV Spectrum Match Determinations.

David C Bell, Randall C Boehm, John Feldhausen, Joshua S Heyne
Published in: Analytical chemistry (2022)
It has been demonstrated that a pair of spectra exhibiting a coefficient of determination (R²) as low as 0.976 can originate from the same chemical species in one example, while a different pair of spectra exhibiting an R² up to 0.9997 can originate from different chemical species. The R² between spectra overlays depends on the signal-to-noise ratio, while the residual between any two spectra should look like noise only when the two spectra originate from the same chemical species. Numerical characteristics of the residual between two high-resolution spectra are invaluable toward the definitive elimination of many plausible matches of reference spectra to the sample spectra of analytes eluted from two-dimensional gas chromatography. Additionally, numerical characteristics beyond R² facilitate a logical ranking of all plausible matches, making positive identification of a single-component analyte possible provided a reference spectrum exists for all plausible matches. Specifically, the experimental background noise is shown to follow a Gaussian distribution at all wavelengths, and a method is described to normalize the data such that the numerically adjusted noise distributions are independent of wavelength. The differences between matching spectra are further shown to exhibit numerical characteristics consistent with the background noise's Gaussian distribution, common to all wavelengths. Seven criteria are described for judging the similarity between spectra: R² between the two spectra, R² of a Q-Q plot with one axis being ideal Gaussian quantiles and the other axis being the distribution of the numerically adjusted residual quantiles, the maximum count of consecutive (by wavelength) signs in the residual, and the first four moments of the residuals.
One exemplar application of the methodology is a definitive match of n-undecane, n-dodecane, and n-tridecane sample spectra to their corresponding reference spectra; these species are among the most challenging within the volatility range of jet fuel to differentiate by spectral methods. While this example is a significant stress test of the approach, the general utility of the methodology lies in the subtle math and transparent criteria that unambiguously identify mismatches: the distributions of residuals between mismatching spectra are very clearly not Gaussian and have a high consecutive sign count, even in cases where the R² between the compared spectra is ambiguous.
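The seven criteria listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the synthetic Gaussian-peak spectra, and the noise level are hypothetical. The normalization used here (dividing the residual by the square root of 2 times an assumed known per-wavelength noise standard deviation) is a stand-in for the paper's normalization method, which the abstract does not spell out.

```python
import math
import random
from statistics import NormalDist


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def qq_r_squared(residual):
    """R^2 of a Q-Q plot: ideal Gaussian quantiles vs. sorted residuals."""
    n = len(residual)
    ideal = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return r_squared(ideal, sorted(residual))


def max_sign_run(residual):
    """Longest run of consecutive residuals sharing the same sign."""
    longest = run = prev = 0
    for r in residual:
        sign = (r > 0) - (r < 0)
        if sign != 0 and sign == prev:
            run += 1
        else:
            run = 1 if sign != 0 else 0
        prev = sign
        longest = max(longest, run)
    return longest


def four_moments(residual):
    """Mean, variance, skewness, and (non-excess) kurtosis."""
    n = len(residual)
    mean = sum(residual) / n
    var = sum((r - mean) ** 2 for r in residual) / n
    sd = math.sqrt(var)
    skew = sum(((r - mean) / sd) ** 3 for r in residual) / n
    kurt = sum(((r - mean) / sd) ** 4 for r in residual) / n
    return mean, var, skew, kurt


def compare(sample, reference, sigma):
    """Return the seven similarity criteria for two spectra.

    sigma is an assumed per-wavelength noise standard deviation; the
    sqrt(2) factor assumes both spectra carry independent noise.
    """
    scaled = [(s - r) / (math.sqrt(2) * sg)
              for s, r, sg in zip(sample, reference, sigma)]
    mean, var, skew, kurt = four_moments(scaled)
    return {
        "r2_spectra": r_squared(sample, reference),
        "r2_qq": qq_r_squared(scaled),
        "max_sign_run": max_sign_run(scaled),
        "mean": mean,
        "variance": var,
        "skewness": skew,
        "kurtosis": kurt,
    }


# Synthetic demonstration (hypothetical data, not from the paper).
rng = random.Random(0)


def peak(center, n=200, width=10.0):
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n)]


def noisy(spectrum, sd):
    return [v + rng.gauss(0.0, sd) for v in spectrum]


sigma = [0.005] * 200
match = compare(noisy(peak(100), 0.005), noisy(peak(100), 0.005), sigma)
mismatch = compare(noisy(peak(100), 0.005), noisy(peak(103), 0.005), sigma)
print(match)
print(mismatch)
```

In this toy setup both pairs have an R² between spectra close to 1, but the shifted-peak pair leaves a systematic residual, so its consecutive sign count is much longer and its Q-Q R² and moments drift away from the values expected for unit Gaussian noise (mean 0, variance 1, skewness 0, kurtosis 3).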
Keyphrases
  • density functional theory
  • air pollution
  • high resolution
  • mass spectrometry
  • molecular dynamics
  • machine learning
  • electronic health record
  • radiation therapy
  • solid phase extraction
  • high resolution mass spectrometry