Resumen
The purpose of this work is to study methods for detecting anomalies in the processing of data streams in distributed streams in real time. To do this, the authors carried out a modification of the K-Means algorithm, called K-Means in real time, and carried out a comparative analysis of the effectiveness of the developed algorithm with K-Means from the MLlib library of the Apache Spark framework. The comparison confirmed the effectiveness of the proposed modification. To conduct experiments with the algorithm, a special data array (dataset) was built, which included about 1000 measurements of the Apache Kafka server log metrics with one topic, two providers and a consumer. Anomalous fragments have been added to this set of dates, with a large number of messages in the blink of an eye and/or size. The dataset values have been pre-processed to align the index of metrics and exclude correlations. Results developed by the authors of the K-Means algorithm for solving anomaly search problems, taking into account the detection time of its effectiveness.