
Principal Engineer in the Data Center and AI (DCAI) Group at Intel Corporation, US.
“Fine-grained Automated Failure Management on Extreme-Scale GPU Accelerated Systems”
20 February 2025
Abstract
Failures in leadership-class accelerated HPC and AI systems have become increasingly common, and as these systems continue to scale, the frequency of failures is expected to rise. With hundreds of thousands of field-replaceable parts in such systems, automated failure management is essential. This talk introduces StabilityDB, a failure management automation framework that leverages real-time data analytics to drive failure servicing and maintenance on a per-failure-mode basis. This approach ensures minimal compute node downtime and high overall system availability. We will provide an architectural overview of StabilityDB and present statistical information on the failure characteristics that guide our automation policies. StabilityDB has been deployed on the Aurora supercomputer at Argonne National Laboratory, a system with 63,744 GPUs, and is contributing to its efficient operation.

Bio
Balazs Gerofi is a Principal Engineer in the Data Center and AI (DCAI) Group at Intel Corporation and a Visiting Researcher in the High Performance Artificial Intelligence Systems (HPAIS) Research Team at RIKEN Center for Computational Science in Japan. He is primarily involved in system software research and development for high performance computing, most recently working on the Aurora supercomputer. Before joining Intel, he participated in the design and development of Supercomputer Fugaku, Japan’s latest flagship supercomputer. Balazs earned his M.Sc. and Ph.D. degrees in computer science from the Vrije Universiteit Amsterdam and The University of Tokyo, respectively. He is a member of the IEEE Computer Society and the Association for Computing Machinery (ACM).
