EvoSplit: An Evolutionary Approach to Split a Multi-Label Data Set into Disjoint Subsets

Francisco Florez-Revuelta

Abstract

This paper presents EvoSplit, a new evolutionary approach for distributing multi-label data sets into disjoint subsets for supervised machine learning. Currently, data set providers split a data set either randomly or using iterative stratification, a method that aims to preserve the label (or label pair) distribution of the original data set in each of the subsets. With the same aim, this paper first introduces a single-objective evolutionary approach that seeks a split maximizing the similarity of either distribution (labels or label pairs), each considered independently. Second, a new multi-objective evolutionary algorithm is presented that maximizes the similarity of both distributions (labels and label pairs) simultaneously. Both approaches are validated on well-known multi-label data sets as well as on large image data sets currently used in computer vision and machine learning applications. EvoSplit improves the splitting of a data set compared with iterative stratification according to several measures: Label Distribution, Label Pair Distribution, Examples Distribution, and the number of folds and fold-label pairs with zero positive examples.
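
The abstract names Label Distribution as one of the evaluation measures but does not define it. As a minimal sketch, assuming the formulation commonly used in the iterative stratification literature (the mean absolute difference, over labels and folds, between each fold's positive-to-negative ratio and that of the whole data set), it could be computed as shown here. The function name, the list-of-lists input layout, and the toy data are illustrative assumptions, not taken from the paper.

```python
def label_distribution(Y, folds):
    """Sketch of a Label Distribution (LD) measure; lower means a better split.

    Y     : list of rows, each a list of 0/1 label indicators (assumed layout).
    folds : list of lists of row indices, one list per disjoint subset.

    Assumption: every label has at least one negative example in the
    whole data set and in every fold, so no ratio divides by zero.
    """
    n_examples = len(Y)
    n_labels = len(Y[0])
    total = 0.0
    for label in range(n_labels):
        # Positive-to-negative ratio of this label in the whole data set.
        pos_all = sum(row[label] for row in Y)
        ratio_all = pos_all / (n_examples - pos_all)
        # Average deviation of the per-fold ratio from the global ratio.
        deviation = 0.0
        for fold in folds:
            pos_fold = sum(Y[i][label] for i in fold)
            ratio_fold = pos_fold / (len(fold) - pos_fold)
            deviation += abs(ratio_fold - ratio_all)
        total += deviation / len(folds)
    return total / n_labels


# Toy data: one label, two positives among six examples.
Y = [[1], [1], [0], [0], [0], [0]]
balanced = [[0, 2, 3], [1, 4, 5]]    # one positive per fold
unbalanced = [[0, 1, 2], [3, 4, 5]]  # both positives in the first fold
print(label_distribution(Y, balanced))
print(label_distribution(Y, unbalanced))
```

Under this assumed definition, a split that reproduces the global label ratio in every fold scores 0, and a search procedure could use the score as a quantity to minimize.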

Keywords

multi-label data sets - supervised learning - machine learning - evolutionary computation - big data applications

Access

PAGES

pp. 0 - 0

ISSUE

Volume: 11 Part: 6 (2021)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY

DOI

https://doi.org/10.3390/app11062823
