Abstract
This work studies methods for predicting time series values in real time when processing streaming data in distributed systems. The authors propose a modification of the autoregressive (AR) model of a given order that adds an inheritance function of the previous values of the time series. Comparative experiments pitting the proposed modification, called Real-Time AR, against classical AR and ARIMA confirmed its effectiveness, which is especially evident when the real time series contains anomalies. The modified algorithm not only allows the calculations to be parallelized, but also allows the model to be reconfigured on the fly in the Apache Spark ecosystem. For the experiments, a dedicated data array was built: a slice of 1000 measurements from the metrics log of an Apache Kafka server with one topic, two producers and one consumer. Anomalous fragments, distinguished by a large number of messages per second and/or a large message size, were artificially added to the array. The values of the array were normalized and shifted by the mean computed over the sample used to pre-train the model. Applied to time series prediction, the proposed algorithm showed that anomalies in the behavior of the observed objects do not significantly distort the predicted values.
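To make the preprocessing and the classical AR baseline mentioned above concrete, here is a minimal Python sketch. It is not the Real-Time AR algorithm: the abstract does not define the inheritance function, so it is left out. Several things here are assumptions made for illustration: the function names (`preprocess`, `fit_ar_yule_walker`, `predict_next`), the choice of z-score scaling as the normalization, and the choice of Yule-Walker estimation via the Levinson-Durbin recursion. The paper may use a different normalization and a different estimator.

```python
from statistics import fmean, pstdev


def preprocess(series, n_train):
    """Shift by the training-sample mean and scale by its standard deviation.

    Assumption: z-score scaling; the abstract only says the values were
    normalized and shifted by the mean of the pre-training sample.
    """
    train = series[:n_train]
    mu, sigma = fmean(train), pstdev(train)
    return [(v - mu) / sigma for v in series], mu, sigma


def autocov(x, lag):
    """Biased sample autocovariance of a zero-mean series at the given lag."""
    n = len(x)
    return sum(x[t] * x[t - lag] for t in range(lag, n)) / n


def fit_ar_yule_walker(x, p):
    """Estimate AR(p) coefficients with the Levinson-Durbin recursion.

    Returns [phi_1, ..., phi_p] for the model
    x[t] = phi_1 * x[t-1] + ... + phi_p * x[t-p] + noise.
    """
    r = [autocov(x, k) for k in range(p + 1)]
    phi = []
    err = r[0]  # prediction error variance of the order-0 model
    for k in range(1, p + 1):
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        kappa = acc / err  # reflection coefficient, equal to phi_{k,k}
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= 1.0 - kappa ** 2
    return phi


def predict_next(x, phi):
    """One-step-ahead forecast from the last len(phi) values of x."""
    return float(sum(phi[j] * x[-1 - j] for j in range(len(phi))))


if __name__ == "__main__":
    import random

    # Synthetic stand-in for a metric such as messages per second
    # (hypothetical data, not the Kafka log used in the paper).
    rng = random.Random(42)
    raw = [100.0, 100.0]
    for _ in range(998):
        raw.append(100.0 + 0.5 * (raw[-1] - 100.0) + rng.gauss(0.0, 5.0))

    z, mu, sigma = preprocess(raw, n_train=800)
    phi = fit_ar_yule_walker(z[:800], p=3)
    forecast = predict_next(z[:800], phi) * sigma + mu  # back to original units
    print("AR coefficients:", phi)
    print("next-value forecast:", forecast)
```

The coefficients are fitted on the pre-training sample only, and the same mean and scale are reused for later values. That mirrors the shift by the training-sample mean described above and keeps the model independent of anomalies that arrive after training.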