Abstract
Choosing data storage formats is one of the most important tasks on a modern big data processing platform. The choice is driven by various performance criteria that depend on the class of stored objects and the application requirements; among the most important of these criteria is the time spent on typical big data processing operations. This paper studies the five most popular big data storage formats (Avro, CSV, JSON, ORC, Parquet), proposes an experimental test bench for assessing their time efficiency, and presents a comparative analysis of the experimentally measured characteristics of these formats. The experiments cover the basic data processing operations, performed with the Apache Spark framework. A format selection algorithm is developed on the basis of the analytic hierarchy process (AHP). The result is a methodology for choosing a format from the available alternatives that combines experimental parameter estimates with the analytic hierarchy process, applied to the task of selecting storage formats whose basic operations are time-efficient for big data in the Apache Hadoop ecosystem under Apache Spark.
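To make the experimental setup concrete, the sketch below shows one way such a timing bench can be built on Apache Spark: the same DataFrame is written and read back in each of the five formats, and elapsed wall-clock time is recorded. The dataset, paths, and choice of timed operations are illustrative assumptions, not the authors' actual bench code.

```python
# A minimal, hypothetical timing bench: write and read the same data
# in each candidate format and record wall-clock time per operation.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-bench").getOrCreate()

# Source dataset; the path and schema handling here are placeholders.
df = spark.read.option("header", "true").csv("/data/source.csv")

results = {}
# Note: the "avro" source requires the external spark-avro package
# (org.apache.spark:spark-avro) to be on the Spark classpath.
for fmt in ["avro", "csv", "json", "orc", "parquet"]:
    path = f"/tmp/bench/{fmt}"

    # Time the write for this format.
    start = time.perf_counter()
    df.write.mode("overwrite").format(fmt).save(path)
    write_s = time.perf_counter() - start

    # Time a full read back; count() forces materialization.
    start = time.perf_counter()
    spark.read.format(fmt).load(path).count()
    read_s = time.perf_counter() - start

    results[fmt] = (write_s, read_s)

for fmt, (w, r) in results.items():
    print(f"{fmt:8s} write={w:.2f}s read={r:.2f}s")
```

In practice each operation would be repeated several times and averaged to smooth out JVM warm-up and caching effects; the single-run loop above is kept deliberately small.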
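The selection step rests on the analytic hierarchy process. As a minimal sketch of that step, assuming a Saaty-style pairwise comparison matrix over the time criteria (the matrix entries below are hypothetical, not the paper's measurements), the criteria weights are the normalized principal eigenvector of the matrix, accompanied by the standard consistency check:

```python
# AHP weight derivation from a pairwise comparison matrix:
# weights = normalized principal eigenvector, plus consistency ratio.
import numpy as np

# Saaty 1-9 scale comparisons for three criteria, e.g. read time vs.
# write time vs. query time (illustrative values only).
A = np.array([
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
])

# Principal eigenvector, normalized to sum to 1, gives the weights.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
w = np.abs(eigvecs[:, k].real)
weights = w / w.sum()

# Consistency ratio; CR < 0.1 is conventionally acceptable.
n = A.shape[0]
ci = (eigvals[k].real - n) / (n - 1)
ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # Saaty's random index
cr = ci / ri

print("weights:", weights.round(3), "CR:", round(cr, 3))
```

Scoring each format against these weighted criteria, using the experimentally measured operation times as the lowest level of the hierarchy, then yields the ranked alternatives the methodology selects from.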