Taxonomy-based data representation for data mining: an example of the magnitude of risk associated with H. pylori infection.
Inese Poļaka, Danute Razuka-Ebela, Jin Young Park, Marcis Leja
Published in: BioData Mining (2021)
Some features and measurements must always be used in data analysis as they are. Describing study subjects with taxonomies in parallel, however, makes it possible to use membership in specific naturally occurring groups, and the impact of those groups on an outcome, in the analysis. This can decrease the risk of overfitting (picking attributes and values specific to the training set without explaining the underlying conditions), improve the accuracy of the models, and improve the privacy protection of study participants by decreasing the amount of specific information that could be used to identify an individual.
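
To make the idea of taxonomy-based representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small hypothetical taxonomy stored as a child-to-parent mapping (`PARENT`) and hypothetical helper names (`ancestors`, `generalize`, `group_features`); the food categories are purely illustrative and are not taken from the study. It shows how specific recorded values could be replaced by the groups they belong to, so that a model sees group membership rather than individual-level detail.

```python
# Hypothetical taxonomy: each specific value maps to its parent group.
# These categories are illustrative assumptions, not data from the study.
PARENT = {
    "rye bread": "grain products",
    "oatmeal": "grain products",
    "grain products": "plant-based food",
    "apple": "fruit",
    "fruit": "plant-based food",
    "smoked sausage": "processed meat",
    "processed meat": "animal-based food",
}


def ancestors(value):
    """Return the chain from a value up to the root of its taxonomy branch."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(value, levels_up):
    """Replace a value with its ancestor `levels_up` steps higher (capped at the root)."""
    chain = ancestors(value)
    return chain[min(levels_up, len(chain) - 1)]


def group_features(items, levels_up=1):
    """Encode a subject's recorded items as counts of taxonomy groups."""
    counts = {}
    for item in items:
        group = generalize(item, levels_up)
        counts[group] = counts.get(group, 0) + 1
    return counts


# One hypothetical study subject described by specific items.
subject = ["rye bread", "oatmeal", "smoked sausage"]
features = group_features(subject, levels_up=1)
coarse_features = group_features(subject, levels_up=2)
```

With `levels_up=1` the subject is described by two groups instead of three specific items; with `levels_up=2` the description becomes coarser still. Choosing the level is a trade-off: higher levels carry less identifying detail and fewer training-set-specific values, but may also discard distinctions that matter for the outcome.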