ARTICLE
TITLE

Toward Exascale Resilience: 2014 update

Franck Cappello    
Al Geist    
William Gropp    
Sanjay Kale    
Bill Kramer    
Marc Snir    

Abstract

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict that errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process crashes to corrupted results. The past five years have seen extraordinary technical progress in many domains related to exascale resilience. Several technical options, initially considered inapplicable or unrealistic in the HPC context, have demonstrated surprising successes. Despite this progress, the exascale resilience problem is not solved, and the community still faces the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unstable systems. Since 2009, many workshops, studies, and reports have improved the definition of the resilience problem and provided refined recommendations. Some projections made during the previous decades, and some priorities established from these projections, need to be revised. This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
