Applied Sciences, Vol. 10, No. 7 (2020)

Log Analysis-Based Resource and Execution Time Improvement in HPC: A Case Study

JunWeon Yoon, TaeYoung Hong, ChanYeol Park, Seo-Young Noh and HeonChang Yu

Abstract

High-performance computing (HPC) uses many distributed computing resources to solve large computational science problems through parallel computation. This approach can reduce overall job execution time and increase the capacity to solve large-scale, complex problems. On a supercomputer, the job scheduler, HPC's flagship tool, is responsible for distributing and managing the resources of the large system. In this paper, we analyze the job scheduler's execution log over a given period and propose an optimization approach to reduce the idle time of jobs. Our experiment found that the main cause of job delays is closely related to waiting for resources. The overall execution time of jobs is significantly delayed because idle resources accumulate while they are held ready for a submitted large-scale job. Backfilling can reduce the inefficiency of these idle resources and help shorten job execution time. We therefore propose a backfilling algorithm that can be applied to the supercomputer. Experimental results show that the overall execution time is reduced.
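The abstract does not spell out the authors' algorithm. As a generic illustration of how backfilling puts otherwise idle nodes to work, here is a minimal sketch of EASY-style backfilling (a well-known variant, not necessarily the one proposed in the paper). It assumes that all jobs are submitted at time 0 and that runtime estimates are exact. The `Job` class, the `schedule` function and the toy workload are hypothetical. The head of the queue receives a reservation at its "shadow time". A later job may start early only if it fits in the currently free nodes and either finishes before the shadow time or uses only nodes the head job will not need.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int    # nodes requested
    runtime: int  # assumption: equals the user's runtime estimate exactly


def schedule(jobs, total_nodes, backfill=True):
    """Simulate a queue of jobs all submitted at t=0; return {name: start time}.

    backfill=False gives strict FCFS; backfill=True adds EASY-style backfilling.
    Hypothetical helper for illustration only.
    """
    if any(j.nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the system has")
    queue = list(jobs)   # waiting jobs, in submission order
    running = []         # min-heap of (end_time, nodes)
    free, now, start = total_nodes, 0, {}

    def launch(job):
        nonlocal free
        start[job.name] = now
        free -= job.nodes
        heapq.heappush(running, (now + job.runtime, job.nodes))

    while queue:
        # FCFS part: start jobs from the head of the queue while they fit.
        while queue and queue[0].nodes <= free:
            launch(queue.pop(0))
        if not queue:
            break
        if backfill:
            head = queue[0]
            # Shadow time: earliest moment enough nodes free up for the head job.
            avail, shadow = free, now
            for end, n in sorted(running):
                avail += n
                if avail >= head.nodes:
                    shadow = end
                    break
            extra = avail - head.nodes  # nodes the head job will not need at shadow time
            waiting = [head]
            for job in queue[1:]:
                ends_in_time = now + job.runtime <= shadow
                if job.nodes <= free and (ends_in_time or job.nodes <= extra):
                    launch(job)  # backfill without delaying the head job
                    if not ends_in_time:
                        extra -= job.nodes
                else:
                    waiting.append(job)
            queue = waiting
        # Advance to the next completion time and release every job ending then.
        now = running[0][0]
        while running and running[0][0] == now:
            free += heapq.heappop(running)[1]
    return start


if __name__ == "__main__":
    jobs = [Job("A", 2, 4), Job("B", 4, 2), Job("C", 2, 3), Job("D", 1, 1)]
    for flag in (False, True):
        start = schedule(jobs, total_nodes=4, backfill=flag)
        makespan = max(start[j.name] + j.runtime for j in jobs)
        print("backfill" if flag else "FCFS", start, "makespan:", makespan)
```

On this made-up four-node workload, FCFS leaves two nodes idle while the 4-node job B waits. The sketch instead starts C at t=0 and D at t=3. B still starts at t=4, so the toy makespan drops from 9 to 6. These numbers illustrate the mechanism only and are not results from the paper.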

Similar articles

Dmitry A. Nikitenko, Vadim V. Voevodin, Sergey A. Zhumatiy, pp. 4-10
It is common knowledge that the ever-growing capabilities of HPC systems are always limited by a number of efficiency-related issues. The reasons can be very different: hardware failures, incorrect job scheduling, peculiarities of algorithm, ch...

Matthijs van Waveren, Ahmed Seif El Nawasany, Nasr Hassanein, David Moon, Niall O'Byrnes, Alain Clo, Karthikeyan Murugan, Antonio Arena, pp. 11-18
The computing environment at the King Abdullah University of Science and Technology (KAUST) is growing in size and complexity. KAUST hosts the tenth-fastest supercomputer in the world (Shaheen II) and several HPC clusters. Researchers can be inhibited by...

Dmitry A. Nikitenko, Sergey A. Zhumatiy, Pavel A. Shvets, pp. 72-79
Effectively mastering an extremely parallel HPC system is impossible without a deep understanding of all internal processes and of the behavior of the whole diversity of components: computing processors and nodes, memory usage, interconnect, storage, whole ...

Suo Guang, pp. 4-21
Fault resilience has become a major issue for HPC systems, particularly from the perspective of future exascale systems, which will consist of millions of CPU cores and other components. MPI-level fault-tolerant constructs, such as ULFM, are being proposed...

Antoni Artigues, Fernando Martin Cucchietti, Carlos Tripiana Montes, David Vicente, Hadrien Calmet, Guillermo Marin, Guillaume Houzeaux, Mariano Vazquez, pp. 4-18
We designed and implemented a parallel visualisation system for the analysis of large-scale, time-dependent, particle-type data. The particular challenge we address is how to analyse a high-performance-computation-style dataset when a visual representati...