Analyzing Data Properties using Statistical Sampling - Illustrated on Scientific File Formats

Julian Martin Kunkel

Abstract

Understanding the characteristics of data stored in data centers helps computer scientists identify the storage infrastructure best suited to these workloads. For example, knowing which file formats are most relevant makes it possible to optimize for those formats and also helps, during procurement, to define benchmarks that cover them. Existing studies that investigate performance improvements and data reduction techniques such as deduplication and compression operate on a subset of the data. Some of those studies claim that the selected data is representative and extrapolate their results to the scale of the data center. One hurdle to running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the cost of running many such experiments would have to be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest, both in terms of file numbers and in terms of occupied storage space. It is demonstrated that, on our production system, scanning 1% of the files and of the data volume is sufficient to draw conclusions. This speeds up the analysis process and significantly reduces the cost of such studies.
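The two kinds of quantities mentioned in the abstract, shares by file count and shares by occupied storage space, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes simple random sampling for count-based shares and sampling with probability proportional to file size (with replacement) for volume-based shares. The function names, the format names, and the synthetic file population are all hypothetical and serve only as an illustration.

```python
import random
from collections import Counter


def make_synthetic_files(n, rng):
    """Build a hypothetical population of (format, size) records.

    The formats, their probabilities, and their mean sizes are invented
    for illustration; they are not taken from the paper.
    """
    formats = ["netcdf", "grib", "text"]
    probs = [0.2, 0.1, 0.7]
    mean_size = {"netcdf": 500.0, "grib": 200.0, "text": 1.0}
    files = []
    for _ in range(n):
        fmt = rng.choices(formats, weights=probs, k=1)[0]
        size = rng.expovariate(1.0 / mean_size[fmt])
        files.append((fmt, size))
    return files


def exact_share_by_count(files):
    """Exact fraction of files per format (full scan)."""
    counts = Counter(fmt for fmt, _ in files)
    return {fmt: c / len(files) for fmt, c in counts.items()}


def exact_share_by_volume(files):
    """Exact fraction of occupied space per format (full scan)."""
    total = sum(size for _, size in files)
    volume = Counter()
    for fmt, size in files:
        volume[fmt] += size
    return {fmt: v / total for fmt, v in volume.items()}


def share_by_count(files, fraction, rng):
    """Estimate count shares from a simple random sample of files."""
    k = max(1, int(len(files) * fraction))
    sample = rng.sample(files, k)
    counts = Counter(fmt for fmt, _ in sample)
    return {fmt: c / k for fmt, c in counts.items()}


def share_by_volume(files, fraction, rng):
    """Estimate volume shares by drawing files with probability
    proportional to their size (with replacement). The fraction of
    draws that hit a format then estimates its share of the volume."""
    k = max(1, int(len(files) * fraction))
    weights = [size for _, size in files]
    sample = rng.choices(files, weights=weights, k=k)
    counts = Counter(fmt for fmt, _ in sample)
    return {fmt: c / k for fmt, c in counts.items()}
```

For example, calling `share_by_count(files, 0.01, rng)` and `share_by_volume(files, 0.01, rng)` on a population built with `make_synthetic_files` yields estimates that can be compared against the two `exact_*` functions. Note that the weighted draw here still reads every file size to build the weights; in practice that metadata would have to come from a cheap source such as a file system listing, which is an assumption of this sketch.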

PAGES

pp. 34 - 39

ISSUE

Volume: 3 Issue: 3 Part: 0 (2016)

SUBJECTS

ENGINEERING AND CIVIL CONSTRUCTION
TECHNOLOGY

DOI

http://dx.doi.org/10.14529/jsfi160304
