Dr. Balazs Gerofi is an expert in system software and parallel/distributed computing. In particular, Balazs is interested in operating systems (kernel architectures for many-core CPUs, memory management, file systems), HPC (parallel and distributed I/O, resiliency), virtualisation, and fault-tolerant computing (replication, checkpoint-restart, message logging).
Balazs is a Research Scientist at the System Software Research Team, part of the RIKEN Advanced Institute for Computational Science (AICS), Tokyo, Japan.
He is the co-editor of “Operating Systems for Supercomputers and High Performance Computing”, 1st edition, October 2019.
Towards Dynamic Resource Management in Next Generation HPC Environments
Balazs Gerofi – Research Scientist
System Software Research Team
RIKEN Center for Computational Science (RIKEN-CCS) – Tokyo, Japan
Thursday 20 February 2020 – 9:30 am
Workload diversity in high-performance computing (HPC) environments has experienced an explosion in recent years. The increasing prevalence of Big Data processing, in-situ analytics, artificial intelligence (AI) and machine learning (ML) workloads, as well as multi-component workflows is pushing the limits of supercomputing systems that have been primarily designed to serve parallel simulations.
In addition, with the growing complexity of the hardware, there is also a growing interest in multi-tenancy and in a more dynamic, cloud-like execution environment. All these trends bring together a large variety of runtime components that do not cooperate well with each other, which in turn can lead to suboptimal performance.
This talk will enumerate a number of representative workloads that stress the limitations of the traditional HPC center. We will then highlight some of the underlying forces that shape the requirements of next-generation systems and propose a cross-stack coordination layer that aims to resolve these conflicts. Finally, drawing on some of our previous efforts in this space, we will demonstrate the benefits of the overall approach.