ARTICLE
TITLE

NR-MPI: A Non-stop and Fault Resilient MPI Supporting Programmer Defined Data Backup and Restore for E-scale Super Computing Systems

Suo Guang    

Abstract

Fault resilience has become a major issue for HPC systems, particularly in view of future E-scale systems, which will consist of millions of CPU cores and other components. MPI-level fault-tolerance constructs, such as ULFM, are being proposed to support software-level fault tolerance. However, there are few systematic evaluations by application programmers using benchmarks or pseudo-applications. This paper proposes NR-MPI, a Non-stop and Fault Resilient MPI that supports programmer-defined data backup and restore. To help programmers write fault-tolerant programs, NR-MPI provides a set of friendly programming interfaces and a state transition diagram for data backup and restore. This paper focuses on the design, implementation, and evaluation of NR-MPI. Specifically, it emphasizes failure detection in the MPI library, the friendly programming interface extensions for NR-MPI, and examples of fault-tolerant programs based on NR-MPI. Furthermore, to support failure recovery of applications, NR-MPI implements data backup interfaces based on double in-memory checkpoint/restart. We conduct experiments with both the NPB benchmarks and Sweep3D on the TH supercomputer at NSCC-TJ. Experimental results show that NR-MPI-based fault-tolerant programs can recover from failures online without restarting, and that the overhead is small even for applications running on tens of thousands of cores.
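
The abstract names double in-memory checkpoint/restart as the basis of the backup interfaces but does not show the NR-MPI API. The following is a minimal single-process simulation of the general technique, not NR-MPI's implementation: each simulated rank keeps one copy of its checkpoint in its own memory and a second copy in a partner's memory, so the state of a failed rank can be rebuilt from the partner. The names `Rank`, `buddy_of`, `backup`, and `recover`, and the ring pairing rule, are assumptions made for illustration.

```python
import copy


class Rank:
    """One simulated process: live state plus two in-memory checkpoint slots."""

    def __init__(self, rank_id, state):
        self.rank_id = rank_id
        self.state = state
        self.local_ckpt = None  # copy of this rank's own state
        self.buddy_ckpt = None  # copy of the state of the rank it backs up


def buddy_of(rank_id, n_ranks):
    # Hypothetical pairing rule: each rank stores its second copy on the
    # next rank in a ring.
    return (rank_id + 1) % n_ranks


def backup(ranks):
    """Take a double checkpoint: one local copy, one copy on the buddy."""
    n = len(ranks)
    for r in ranks:
        r.local_ckpt = copy.deepcopy(r.state)
        ranks[buddy_of(r.rank_id, n)].buddy_ckpt = copy.deepcopy(r.state)


def recover(ranks, failed_id):
    """Rebuild a failed rank from its buddy and roll survivors back."""
    n = len(ranks)
    buddy = ranks[buddy_of(failed_id, n)]
    # The replacement process starts from the copy held in the buddy's memory.
    ranks[failed_id] = Rank(failed_id, copy.deepcopy(buddy.buddy_ckpt))
    # Survivors return to their own local copy so all ranks are consistent.
    for r in ranks:
        if r.rank_id != failed_id:
            r.state = copy.deepcopy(r.local_ckpt)
    # The failed rank's memory held two checkpoint slots that are now lost;
    # take a fresh double checkpoint to restore full redundancy.
    backup(ranks)
```

In this sketch a single failure is always recoverable because no rank is its own buddy. If a rank and its buddy fail between two checkpoints, both copies of that rank's state are lost.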
