Login / Signup

Unsupervised learning methods for efficient geographic clustering and identification of disease disparities with applications to county-level colorectal cancer incidence in California.

Mallory E McMahonLyubov DoroshenkoJavad RoostaeiHyunsoon ChoMansoor A Haider
Published in: Health care management science (2022)
Many public health policymaking questions involve data subsets representing application-specific attributes and geographic location. We develop and evaluate standard and tailored techniques for clustering via unsupervised learning (UL) algorithms on such amalgamated (dual-domain) data sets. The aim of the associated algorithms is to identify geographically efficient clusters that also maximize the number of statistically significant differences in disease incidence and demographic variables across top clusters. Two standard UL approaches, k means with k++ initialization (k++) and the standard self-organizing map (SSOM), are considered along with a new, tailored version of the SOM (TSOM). The TSOM algorithm involves optimization of a customized objective function with terms promoting individual geographic cluster cohesion while also maximizing the number of differences across clusters, and two hyper-parameters controlling the relative weighting of geographic and attribute subspaces in a non-Euclidean distance measure within the clustering problem. The performance of these three techniques (k++, SSOM, TSOM) is compared and evaluated in the context of a data set for colorectal cancer incidence in the state of California, at the level of individual counties. Clusters are visualized via chloropleth maps and ordered graphs are also used to illustrate disparities in disease incidence among four identity groups. While all three approaches performed well, the TSOM identified the largest number of disease and demographic disparities while also yielding more geographically efficient top clusters. Techniques presented in this study are relevant to applications including the delivery of health care resources and identifying disparities among identity groups, and to questions involving coordination between county- and state-level policymakers.
Keyphrases
  • machine learning
  • public health
  • risk factors
  • healthcare
  • big data
  • electronic health record
  • deep learning
  • affordable care act
  • peripheral blood
  • smoking cessation
  • health insurance