k-Means+++: Outliers-Resistant Clustering

Adiel Statman

Liat Rozenberg and Dan Feldman

Abstract

The k-means problem is to compute a set of k centers (points) that minimizes the sum of squared distances to a given set of n points in a metric space. Arguably, the most common algorithm to solve it is k-means++, which is easy to implement and provides a provably small approximation error in time that is linear in n. We generalize k-means++ to support outliers in two senses (simultaneously): (i) nonmetric spaces, e.g., M-estimators, where the distance dist(p, x) between a point p and a center x is replaced by min{dist(p, x), c} for an appropriate constant c that may depend on the scale of the input; (ii) k-means clustering with m ≥ 1 outliers, i.e., where the m farthest points from any given k centers are excluded from the total sum of distances. This is done via a simple reduction to (k + m)-means clustering (with no outliers).
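The two notions of outlier resistance described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it only shows k-means++-style seeding (D² sampling) with dist(p, x) swapped for min{dist(p, x), c}, plus the cost that ignores the m farthest points. Euclidean distance, squaring the truncated distance, and the names `truncated_dist`, `seed_centers`, and `cost_with_outliers` are all assumptions made for illustration.

```python
import math
import random

def truncated_dist(p, x, c):
    """min{dist(p, x), c}; Euclidean dist is an assumption of this sketch."""
    return min(math.dist(p, x), c)

def seed_centers(points, k, c, rng):
    """k-means++-style D^2 sampling with the truncated distance (illustrative only)."""
    centers = [rng.choice(points)]
    # cost[i]: squared truncated distance from points[i] to its nearest center so far
    cost = [truncated_dist(p, centers[0], c) ** 2 for p in points]
    while len(centers) < k:
        if sum(cost) > 0:
            # sample the next center with probability proportional to its current cost
            new = rng.choices(points, weights=cost, k=1)[0]
        else:
            new = rng.choice(points)  # every point already coincides with a center
        centers.append(new)
        cost = [min(old, truncated_dist(p, new, c) ** 2)
                for old, p in zip(cost, points)]
    return centers

def cost_with_outliers(points, centers, m):
    """Sum of squared distances to the nearest center, excluding the m farthest points."""
    sq = sorted(min(math.dist(p, x) ** 2 for x in centers) for p in points)
    return sum(sq[:max(len(sq) - m, 0)])

# Hypothetical toy data: two small groups and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (1000.0, 1000.0)]
centers = seed_centers(points, k=2, c=5.0, rng=random.Random(0))
print(centers, cost_with_outliers(points, centers, m=1))
```

Capping the distance at c limits how much sampling weight a single far-away point can attract during seeding, which is the intuition behind sense (i); the trimmed cost function is a direct transcription of sense (ii).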

Keywords

clustering - approximation - outliers

ISSUE

Volume: 13 Issue: 12 (2020)

SUBJECTS

CIVIL ENGINEERING AND CONSTRUCTION
TECHNOLOGY

DOI

https://doi.org/10.3390/a13120311
