Updated 14 February 2025.
Monday 17 February 2025
AI and HPC, or AI ends HPC?
Dan Stanzione. Associate VP for Research. Executive Director, Texas Advanced Computing Center (TACC). US.
Abstract – In this talk, I’ll first cover the plans for the new National Science Foundation (NSF) Leadership Facility in the United States and benchmarks from the early hardware, and then discuss the balance (or lack thereof) between AI and HPC in future system designs.
HPC to Enable NASA Missions
Rupak Biswas. PhD – Director, Exploration Technology Directorate. NASA Ames Research Center. US.
Abstract – To efficiently and effectively accomplish its many missions in aeronautics, Earth and space sciences, and exploration, NASA relies heavily on advanced supercomputing and data analytics. This includes the ability to perform high-fidelity modeling and simulation, process large volumes of observational data, and apply novel AI/ML technologies to enable scientific discovery and engineering achievements.
In this talk, I’ll share examples of how HPC writ large has enabled various NASA missions and its role in some exciting upcoming ones.
Pathways to Actionable ModSim for Computing Infrastructure
Kevin A. Brown – Inaugural Walter Massey Fellow at Argonne National Laboratory, Chicago. US.
Abstract – Large computing facilities have become critical resources across many research areas such as high energy physics, climate science, and artificial intelligence. Modeling and simulation (ModSim) of this infrastructure has become essential to ensuring it is deployed and managed to meet productivity and efficiency goals. However, ModSim methodologies currently lack the maturity to accurately predict the behavior of this infrastructure at scale with the required fidelity. There is a need for more actionable models of these important facilities to support design and operational decision-making.
This talk explores challenges in ModSim for large-scale computing infrastructure. It will discuss limitations of current methodologies – such as discrete event and agent-based modelling – in heterogeneous environments and present novel hybrid modelling strategies for more scalable and accurate simulations. Finally, it will highlight opportunities in the wider research ecosystem for advancing the field of ModSim and co-design of large-scale computing facilities and infrastructure.
The tyranny of integers, binary, and grayscale when building large supercomputing infrastructures
Andrew Jones – Future Capabilities for Supercomputing & AI, Azure Specialized Engineering. Microsoft. UK.
Abstract – Supercomputers are traditionally thought of as the world of floating-point computations – especially “FP64”. However, when planning, designing, building and operating large-scale supercomputing infrastructures, there are many cases where integer, binary, or grayscale aspects become critical. Some of this is due to quantization; for example, the number of GPUs in a rack has to be an integer, potentially leaving unused datacenter space. How many supercomputers: 1, 2 or 20? How many variants can be supported? Attempting to define uptime SLAs means delving into a “grayscale” world of complexity, ambiguity, and lost hopes of a binary “is it up or not?”.
This talk will explore the non-FP64 aspects of supercomputing infrastructures, and share some lessons applicable to broader scientific computing.
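The rack-quantization and SLA points above lend themselves to a quick back-of-the-envelope illustration. The sketch below is not taken from the talk and all of its numbers are hypothetical: it rounds a GPU count up to whole racks (counting the stranded slots) and treats an availability target as a probability rather than a binary answer, under a deliberately simplistic independent-failure assumption.

```python
# Toy illustration of two "non-FP64" planning issues (hypothetical numbers):
# (1) hardware is quantized into integer racks; (2) uptime is a probability,
# not a binary property.
import math

def racks_needed(total_gpus: int, gpus_per_rack: int) -> tuple[int, int]:
    """Racks come in integers: return (rack count, stranded GPU slots)."""
    racks = math.ceil(total_gpus / gpus_per_rack)
    return racks, racks * gpus_per_rack - total_gpus

def availability_with_spares(node_avail: float, nodes: int, spares: int) -> float:
    """Probability that at most `spares` nodes are down at once,
    assuming independent node failures (a strong simplification)."""
    q = 1.0 - node_avail
    return sum(math.comb(nodes, k) * q**k * node_avail**(nodes - k)
               for k in range(spares + 1))

racks, stranded = racks_needed(total_gpus=10_000, gpus_per_rack=72)
print(f"{racks} racks, {stranded} stranded GPU slots")   # 139 racks, 8 stranded slots
print(f"P(at most 2 of 1000 nodes down): {availability_with_spares(0.999, 1000, 2):.3f}")
```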
Infrastructure for Edge to Cloud AI Research
Kate Keahey. Senior Computer Scientist. Argonne National Laboratory (ANL). US.
Abstract – For better or worse, scientific instruments shape a field: to validate hypotheses we need an instrument where we can deploy, capture, measure, and record relevant phenomena. For systems experimentation in computer science – in other words, projects that build and evaluate new operating systems, improve energy efficiency, or investigate performance – this means first and foremost an instrument that supports deep reconfiguration, allowing investigators to change everything from firmware, through the operating system kernel, to higher levels of the software stack. It also means access to a great diversity of state-of-the-art hardware, including the newest types of architectures, memory and storage options, as well as the newest GPUs, FPGAs, and other accelerators that serve as a platform for the developing AI revolution. And most recently, it means supporting scalable computing at the edge and programmable networking, so that investigators can create interesting experiments in the edge-to-cloud continuum. Most importantly though, a scientific instrument needs to be able to evolve as new scientific problems arise and require new features in an experimentation platform, and as new methodological approaches emerge and drive change.
In this talk, I will describe the status and evolution of two scientific instruments: Chameleon (www.chameleoncloud.org) and FLOTO (floto.cs.uchicago.edu). Chameleon is a primarily datacenter-based virtual facility for computer science research. It consists of hundreds of diverse high-end nodes, distributed between core sites at the University of Chicago, the Texas Advanced Computing Center (TACC) – and now also NCAR – as well as a few smaller volunteer sites. The hardware of this exploration platform is bare-metal reconfigurable, allowing investigators to customize firmware, boot from custom kernels, and orchestrate complex distributed deployments connected using programmable networks. More recently, the system has also introduced features that allow investigators to create and program custom distributed deployments using edge devices (CHI@Edge). Over almost ten years of operation, Chameleon has served 10,000+ users working on 1,000+ research and education projects. Collectively, this community has produced 700+ research papers (that we were able to find).
FLOTO is an observational instrument that deploys 1,000 Raspberry Pis nationwide to measure broadband. It grew out of the exploration capability provided by Chameleon’s CHI@Edge to implement a large-scale, distributed observational instrument. By providing a simple deployment mechanism, FLOTO allows investigators to deploy edge devices (Raspberry Pis) easily, and it manages diverse and evolving suites of broadband tests on those deployments. Over the last year or so, FLOTO has been deployed in various communities of interest, over 17 national and local network providers, and across different access technologies, such as fiber, cable, satellite, and fixed wireless. These deployments have generated millions of data points that are publicly available and have been used to support both computer science and policy research, informing research publications and policy decision-making. Its model has been adapted to support investigations other than broadband measurement and expanded to serve the needs of National Discovery Cloud for Climate (NDCC) applications.
Enabling AI-Driven Digital Agriculture: Solutions from the NSF-AI ICICLE Institute
Dhabaleswar K. (DK) Panda – The Ohio State University, US.
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He is serving as the Director of the ICICLE NSF-AI Institute. He has published over 500 papers. The MVAPICH MPI libraries, designed and developed by his research group, are currently being used by more than 3,400 organizations worldwide (in 92 countries). Prof. Panda is a Fellow of ACM and IEEE, a recipient of the 2022 IEEE Charles Babbage Award and the 2024 IEEE TCPP Outstanding Service and Contributions Award.
Abstract – With the world population steadily increasing and the amount of arable land steadily decreasing, a lot of discussion is taking place about growing food with higher yield and positive environmental impact. This leads to an open question of whether AI can help to solve some of these challenges.
In this talk, I will provide an overview of the approach and solutions being worked out in the NSF-AI Institute ICICLE (https://icicle.osu.edu/) around AI-driven digital agriculture. These solutions focus on designing a high-performance edge-to-HPC/cloud middleware stack with a conversational AI interface for digital agriculture, which diverse stakeholders (such as agronomists, farmers, and researchers) can use and/or customize in a plug-and-play manner for the targeted use cases. An overview of a flexible AI pipeline containing multiple components (data collection from drones and IoT devices, high-performance data transfers between edge and HPC/cloud, high-performance training with semi-supervised learning, a model commons to hold diverse trained models for different crops and diseases, and real-time inferencing by drones) will be presented. Many of these components are currently available from the Institute’s website for experimentation and deployment.
Tuesday 18 February 2025
Staying Close to Home with NUMA
Ruud van der Pas – Senior Principal Software Engineer, Oracle Linux Engineering. The Netherlands.
Abstract – In the not so distant past, a Non-Uniform Memory Access (or NUMA for short) memory architecture was only found in very large servers. Tuning applications for NUMA was therefore something only a relatively small group had to worry about.
Today’s situation is drastically different. Even a two-socket server has a NUMA architecture. NUMA has even come down to the socket level, because certain processors have a NUMA memory architecture internally.
The performance cost of neglecting the NUMA architecture ranges from noticeable to dramatic.
This talk starts with a description of NUMA, including its benefits, because there are good reasons to use a NUMA architecture. Next, some common techniques for tuning for NUMA, and the choices to be made, are introduced and explained. We end with quite a compelling performance example.
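One common technique, offered here only as a minimal sketch and not as an example from the talk, is to keep a process and its memory on the same NUMA node. On Linux, pinning the process to one node's CPUs means that pages it first touches are typically allocated on that node's local DRAM under the default local-allocation policy.

```python
# Minimal sketch (not from the talk): pin this process to NUMA node 0's CPUs on
# Linux, so pages it first touches tend to land on node 0's local DRAM.
import os
from pathlib import Path

def cpus_of_node(node: int) -> set[int]:
    """Parse /sys/devices/system/node/nodeN/cpulist, e.g. '0-15,32-47'."""
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text().strip()
    cpus: set[int] = set()
    for part in cpulist.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

os.sched_setaffinity(0, cpus_of_node(0))  # restrict execution to node 0's cores
buffer = bytearray(1 << 30)               # 1 GiB; zero-filling touches the pages locally
```

Command-line tools such as numactl achieve a similar effect and can also bind the memory policy explicitly.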
Collaborative continuous benchmarking for HPC
Olga Pearce, Lawrence Livermore National Laboratory, US.
Abstract – Benchmarking is integral to procurement of HPC systems, communicating HPC center workloads to HPC vendors, and verifying performance of the delivered HPC systems. Currently, HPC benchmarking is manual and challenging at every step, posing a high barrier to entry and hampering reproducibility of the benchmarks across different HPC systems.
In this talk, we describe collaborative continuous benchmarking, which enables functional reproducibility, automation, and community collaboration in HPC benchmarking. We develop a common language to streamline the interactions between HPC centers, vendors, and researchers, further enabling previously unimaginable large-scale improvements to the HPC ecosystem.
We introduce an open source continuous benchmarking repository, Benchpark, for community collaboration. We believe collaborative continuous benchmarking will help overcome the human bottleneck in HPC benchmarking, enabling better evaluation of our systems and enabling a more productive collaboration within the HPC community.
Keywords: HPC, benchmarking, collaborative continuous benchmarking.
Sadram – A new memory addressing protocol
Robert Trout – President and founder of Sadram, Inc. New Zealand.
Abstract – Sadram (symbolically addressable DRAM) is an addressing protocol embedded in DRAM which allows access by symbol as well as by traditional linear addresses.
Such a facility opens up a cornucopia of applications embedded in DRAM, and it promises power savings because of the closeness of the storage medium to the computation – an architecture generically called PIM (processing-in-memory).
Sparsification
David Brebner – CEO, Umajin. New Zealand.
Abstract – This approach to 3D time-series data reduction allows for node counts 10,000 to 1,000,000 times smaller than voxel-based representations. It also allows multiple pre-computed connectivity graphs to be stored, making simulation and analysis of time-series data sets much faster. To further reduce workloads, spatial partitioning can be used to support variable density. By assessing change on a region-by-region basis, each region is re-computed only when certain spatial thresholds are reached.
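As a generic illustration of the region-by-region thresholding idea (this is not Umajin's implementation, and the cell size and threshold below are made up), the sketch buckets 3D points into coarse cells and flags a cell for re-computation only when the data inside it has changed by more than a threshold since the previous timestep.

```python
# Generic sketch of threshold-gated, per-region recomputation (illustrative only).
import numpy as np

def cell_ids(points: np.ndarray, cell_size: float) -> np.ndarray:
    """Map N x 3 points to integer cell coordinates."""
    return np.floor(points / cell_size).astype(np.int64)

def cells_to_recompute(prev: np.ndarray, curr: np.ndarray,
                       cell_size: float, threshold: float) -> set:
    """Return the cells whose mean displacement between timesteps exceeds the threshold."""
    ids = cell_ids(curr, cell_size)
    displacement = np.linalg.norm(curr - prev, axis=1)
    dirty = set()
    for cell in np.unique(ids, axis=0):
        mask = np.all(ids == cell, axis=1)
        if displacement[mask].mean() > threshold:
            dirty.add(tuple(int(v) for v in cell))
    return dirty

rng = np.random.default_rng(1)
prev = rng.uniform(0, 10, size=(5000, 3))
curr = prev + rng.normal(scale=0.02, size=prev.shape)   # small jitter everywhere
moving = np.all(prev < 2.0, axis=1)                     # one corner region really moves
curr[moving] += 1.0
print(sorted(cells_to_recompute(prev, curr, cell_size=2.0, threshold=0.1)))
```

Only the handful of cells touched by the moving region are re-computed; the rest of the domain is left untouched.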
Digital Twin: Graph Formulations for Managing Complexity and Uncertainty.
Karen Willcox, MNZM – Director, Oden Institute for Computational Engineering and Sciences | The University of Texas at Austin | Associate Vice President for Research | Professor of Aerospace Engineering and Engineering Mechanics | W. A. “Tex” Moncrief, Jr. Chair in Simulation-Based Engineering and Sciences | Peter O’Donnell, Jr. Centennial Chair in Computing Systems. US.
Abstract – Our work has proposed graphical models and graph-based methods as fundamental enablers of digital twins. Graph-based representations are well known to be suited for describing complex systems where the connections between entities are as important as the entities themselves. The interconnections within and across data, models, and decisions are central to a digital twin’s value. Not only does a graph emphasize the scalable representation of such interrelationships, it also provides a natural mathematical setting for addressing uncertainty and complexity — arguably the two biggest barriers to scalable deployment and adoption of digital twins. We discuss how recent advances in theory and algorithms for large-scale knowledge graphs and graphical models can be combined in a multi-layered formulation to provide a powerful foundation for achieving scalable digital twins for complex systems. We illustrate our approaches by creating an educational digital twin to support decision-making around transfer student pathways for the state of Texas. The educational digital twin represents the dynamic academic pathways of millions of students across hundreds of demographic and academic dimensions.
Exaba: Secure, Fast and Lean Storage built in Rust for the Exascale Era
Stuart Inglis – CTO/Co-founder of Exaba.io. New Zealand.
Abstract – Exaba delivers a transformative storage platform that is secure, fast, lean, easy to use and run-anywhere. Instead of re-engineering existing stacks, we’ve built every layer from the ground up — ensuring that each component is optimized to meet the rigorous demands of modern enterprise storage.
Developed in Rust: We chose to develop Exaba’s storage platform in Rust, a security-first systems language with C-like performance.
Active-Active Scale-Out Topology: At the heart of our system is a network of active-active, scale-out S3/compute nodes built atop a disaggregated, shared-everything topology. In this architecture, every drive is visible to all nodes, enabling dynamic file placement within immutable zones. Files are encoded using multiple-area Reed Solomon (M.A.R.S.), ensuring data durability and resilience across the entire cluster.
Innovative Eight-Layer S3 Architecture: We’ve developed a dedicated eight-layer model for S3 operations—including mirroring, replication, and caching—that pushes beyond traditional storage paradigms. Our integrated web server and S3-compatible server are built in Rust, offering efficient, low-latency communication for seamless cloud integration and real-time management of billions of files.
Post-Quantum Security: We’ve implemented future-proof encryption that aligns with post-quantum standards for data in transit and at-rest, delivering a robust backbone for our zero-trust model.
Flexible, Run-Anywhere Deployments:
Standalone Server: Our file system–based solution runs on Windows, macOS, Raspberry Pi, RISC‑V, and Linux—ideal for edge environments or pilot deployments.
Linux Scale-Out Clusters: For enterprise and data center applications, our clustered solution supports high transaction volumes, fault tolerance, and seamless scalability.
Containerized Deployments: Docker containers provide rapid testing and streamlined rollouts, perfect for continuous integration.
Convergence of High-Performance & Machine Learning Workflows
Amal Gueroudji – Postdoctoral Appointee at Argonne National Laboratory. US/Algeria.
Abstract – The convergence of High-Performance Computing (HPC) and Machine Learning (ML) workflows offers transformative potential across scientific research, industrial applications, and emerging technologies. However, realizing seamless integration requires overcoming several critical challenges.
Coupling distinct programming models—such as MPI for distributed HPC systems and Python-based frameworks prevalent in ML—introduces significant complexity, necessitating innovative interoperability mechanisms to harmonize these paradigms effectively. Performance characterization in this hybrid ecosystem demands advanced metrics and methodologies capable of capturing the nuances of diverse computational patterns, from dense numerical simulations typical of HPC to the sparse tensor operations of ML, where even established tools fall short. Additionally, efficient data streaming in composable environments presents a persistent bottleneck, as these workflows often rely on real-time data ingestion, transformation, and transfer across heterogeneous architectures.
This presentation explores these challenges and highlights strategies for addressing them, emphasizing the importance of scalable, holistic solutions to achieve composability and efficiency in HPC-ML workflows.
Computer architectures in the era of artificial intelligence – New!
Simon McIntosh-Smith – Professor of High Performance Computing. Director of the Bristol Centre for Supercomputing (BriCS), including Isambard-AI and Isambard 3. University of Bristol, UK.
Abstract – Computer architecture as a field has made great strides since its beginnings in the 1940s and 50s, with a golden era during the 80s and 90s which saw the dawn of RISC instruction sets, deep pipelines, out-of-order execution, branch prediction, and deep memory hierarchies. As Dennard scaling came to an end in the mid-2000s, advances in microarchitecture were largely supplanted by a shift to multi-core, with individual cores advancing at a much slower rate than seen historically. However, in the last five years we have seen a resurgence of innovation in computer architectures, spearheaded by GPUs and GPU-like features in CPUs, in response to the enormous economic opportunity created by the AI market. Tensor cores and memory techniques designed for sparsity are just two of the many innovations inspired by this phenomenon, but even in high-end CPUs the pipeline widths and depths have advanced well beyond what was deemed optimal for more general-purpose markets.
In this talk we’ll review some of the more important innovations, and discuss the implications of AI driving the computer architecture space so forcefully.
Enabling Efficient and Effective Complex HPC and AI Workflows
Larry Kaplan – Senior Distinguished Technologist, Hewlett Packard Enterprise. US.
Abstract – High-end HPC and AI systems are facing several challenges today, two important ones being the increasing use of AI on such systems and concerns about sustainability. Addressing these new challenges is an important part of system design, for both hardware and software. Focusing on software, AI shows up in several ways, as a significant workflow enhancement to traditional simulation workloads and also as a new field of innovation for computation. Sustainability also manifests in several ways, including the desire for reduced energy consumption and the ability to work with the environment. Features such as “free cooling” and “heat reuse” are becoming more desirable.
This talk provides an overview of some of these current challenges along with discussion of potential solutions.
Wednesday 19 February 2025
Small Data – Big Problems: Some Learnings from AI’s Potential Role in Diagnosis and Clinical Decisions
Prof. Alok N. Choudhary – Harold Washington Professor ECE and CS Departments, Northwestern University. Founder, 4Cinsights Inc. US.
Abstract – AI and machine learning can potentially help make important discoveries in health care, particularly in clinical decision support. However, datasets are small, not very clean, and very heterogeneous, making AI/ML approaches challenging to develop and apply. But the problems facing health-care providers (HCPs) are big, as decisions in many cases can result in adverse outcomes. For example, a decision as simple as whether to intubate or extubate a patient from mechanical ventilation can potentially be improved by AI approaches.
This talk will present preliminary lessons from ongoing research on applying AI, and the underlying challenges, to various lung diseases and their potential therapies and treatments.
The Artificial Scientist – Leveraging In-transit Machine Learning for Plasma Simulations
Sunita Chandrasekaran – Associate Professor | Department of Computer & Information Sciences. Co-Director, AI Center of Excellence. University of Delaware. US.
Abstract – With the rapid advancements in the computer architecture space, the migration of legacy applications to new architectures remains a continuous challenge. To effectively navigate this ever-evolving hardware landscape, software and toolchains must evolve in tandem, staying ahead of the curve in terms of architectural innovation. While this synchronization between hardware and software is inherently complex, it is essential for fully harnessing the potential of advanced hardware platforms. In this context, a marriage between HPC and AI is gaining increasing prominence. By effectively orchestrating the workflow of HPC and AI, we can not only accelerate scientific progress but also achieve significant gains in computational efficiency. One promising strategy to further optimize large-scale workflows is to stream simulation data directly into machine learning (ML) frameworks. This approach bypasses traditional file system bottlenecks, allowing for the transformation of data in transit, asynchronously with both the simulation process and model training.
This talk will explore these strategies in detail, demonstrating the synergy between hardware innovation and software adaptation. Using a real-world scientific application as a case study, PIConGPU (particle-in-cell on GPUs), we will showcase how these techniques can be applied at scale to drive both scientific and computational advancements.
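The in-transit pattern itself can be sketched independently of any particular software stack. The schematic below is not the PIConGPU pipeline; it only shows the shape of the idea: a simulation thread streams snapshots into a bounded queue and a consumer trains on them as they arrive, so nothing is written to the file system. A real deployment would use a streaming I/O layer and a GPU ML framework rather than numpy arrays and Python threads.

```python
# Schematic of in-transit coupling (illustrative only, not the actual pipeline):
# a "simulation" thread streams snapshots into a bounded queue and a "training"
# thread consumes them asynchronously, bypassing the file system entirely.
import queue
import threading
import numpy as np

snapshots: queue.Queue = queue.Queue(maxsize=8)   # bounded: back-pressure if ML lags

def simulate(steps: int, shape=(256, 256)) -> None:
    field = np.zeros(shape)
    for step in range(steps):
        field += 0.01 * np.random.standard_normal(shape)  # stand-in for a PIC step
        snapshots.put((step, field.copy()))               # stream, do not write files
    snapshots.put(None)                                   # end-of-stream marker

def train() -> None:
    while (item := snapshots.get()) is not None:
        step, field = item
        loss = float((field ** 2).mean())                 # stand-in for a training step
        if step % 50 == 0:
            print(f"step {step}: surrogate loss {loss:.4f}")

threading.Thread(target=simulate, args=(200,), daemon=True).start()
train()
```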
Building Next-Generation Scientific Workflows for Autonomous Research Facilities
Rafael Ferreira da Silva – Lead, Workflow and Ecosystem Services group. Oak Ridge National Laboratory. US.
Abstract – The landscape of scientific workflows is undergoing a transformative shift as research facilities evolve towards greater autonomy and intelligence. This presentation examines emerging trends and challenges in scientific workflows, with particular emphasis on the convergence of AI/ML with HPC systems, time-sensitive operations, and multi-facility integrations.
Drawing from recent community discussions and Oak Ridge Leadership Computing Facility’s experiences, we explore how modern workflows are adapting to handle unprecedented data volumes, near real-time processing requirements, and complex cross-facility collaborations. Special attention is given to OLCF’s efforts in supporting the Department of Energy’s Integrated Research Infrastructure initiative and the Labs of the Future vision, including the development of standardized protocols for authentication, data movement, and workflow orchestration across distributed research environments.
The presentation highlights key technical challenges and proposed solutions for creating sustainable, AI-enhanced workflow ecosystems that can effectively bridge traditional scientific methods with autonomous laboratory operations, while maintaining essential human oversight and scientific rigor.
A Real Time Bayesian Digital Twin for Tsunami Data Assimilation and Prediction
Omar Ghattas – Oden Institute for Computational Engineering & Sciences. Walker Department of Mechanical Engineering. The University of Texas at Austin. US.
Abstract – Tsunamis generated from megathrust earthquakes are capable of massive destruction. Efforts are underway to instrument subduction zones with seafloor acoustic pressure sensors to provide tsunami early warning.
Our goal is to create a digital twin framework that employs this pressure data, along with a high fidelity forward model given by the 3D coupled acoustic–gravity wave equations, to infer the earthquake-induced spatiotemporal seafloor motion in real time. The Bayesian solution of this inverse problem then provides the boundary forcing to forward propagate the tsunamis toward populated areas along coastlines and issue wave height forecasts with quantified uncertainties for early warning.
However, solution of a single forward problem alone entails severe computational costs stemming from the need to resolve ocean acoustic waves with wavelengths of order 1.5 km in a subduction zone of length ~1000 km and width ~200 km. This can require ~1 hour on a large supercomputer. The Bayesian inverse problem, with billions of uncertain parameters, formally requires hundreds of thousands of such forward and adjoint wave propagations; thus our goal of real time inference appears to be intractable. We propose a novel approach to enable accurate solution of the inverse and prediction problems in a few seconds on a GPU cluster. The key is to exploit the structure of the parameter-to-observable map, namely that it is a shift-invariant operator and its discretization can be recast as a block Toeplitz matrix, permitting FFT diagonalization and fast roofline-optimal multi-GPU implementation.
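The structure-exploiting step can be illustrated in miniature. The sketch below is a one-dimensional, single-level stand-in for the block Toeplitz operators described above, not the talk's implementation: it applies a Toeplitz matrix to a vector in O(n log n) by embedding the matrix in a circulant one and diagonalizing that circulant with the FFT.

```python
# Toeplitz matrix-vector product via circulant embedding and FFT (1D sketch).
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec_fft(c, r, x):
    """Apply the Toeplitz matrix with first column c and first row r to x in O(n log n)."""
    n = len(x)
    # First column of a 2n x 2n circulant that contains the Toeplitz matrix in its
    # top-left block: [c, padding, reversed tail of r].
    col = np.concatenate([c, [0.0], r[:0:-1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

rng = np.random.default_rng(0)
c, r = rng.standard_normal(256), rng.standard_normal(256)
r[0] = c[0]                                   # Toeplitz matrices share the corner entry
x = rng.standard_normal(256)
assert np.allclose(toeplitz_matvec_fft(c, r, x), toeplitz(c, r) @ x)
```

Batched, blockwise versions of the same diagonalization are what make repeated application of the parameter-to-observable map cheap enough to sit inside a real-time Bayesian solve on GPUs.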
We discuss the Bayesian formulation of the inverse problem and real time GPU solution, and demonstrate that tsunami inverse problems with 10^8 parameters can be solved exactly (up to discretization error) in a fraction of a second, thus enabling early warning with high fidelity models.
This work is joint with Sreeram Venkat, Stefan Henneking, and Milinda Fernando.
Navigating the Post-Exascale Computing Era: GPUs, Analog Computing, and AI
Jeffrey Vetter. Section Head, Advanced Computing Systems Research. Oak Ridge National Lab (ORNL). US.
Abstract – DOE has deployed its first three Exascale systems, so now is an appropriate time to think about post-Exascale challenges and opportunities. GPUs were just the beginning of architectural disruption. Focusing on both performance and energy efficiency, we are seeing a wide array of new technologies emerge during this ‘golden age of architectures,’ making the choices of architectures, software, and applications existential. In this talk, I will survey post-Exascale technologies and discuss their implications for both system design and software. As an example, I will delve into analog computing as one alternative to dramatically improve energy efficiency. Meanwhile, the extraordinary disruption offered by AI may provide ways to mitigate these software and algorithmic challenges. Our team is exploring how AI can be used to port software to new architectures and transform legacy programming languages to contemporary programming systems.
Status Update of LLM Training in Japan
Rio Yokota – Professor at Institute of Science Tokyo. Japan.
Abstract – The release of DeepSeek-V3 and R1 has shown that state-of-the-art LLMs are reproducible outside OpenAI, Anthropic, and Google. These are not mechanical tools that benefit everyone equally, but rather intellectual tools that will influence our culture and society. Therefore, it is important for individual countries to be able to control as much of the training pipeline as possible. In this regard, the fact that open models are catching up to closed models gives us hope that training sovereign LLMs is a worthwhile endeavor. In this talk, I will give a status update on the efforts in Japan to train LLMs.
Quantum Computing: Progress Towards Real-World Impact
Russell Stutz. Senior Director of Product Technologies. Quantinuum. US.
Abstract – Consistent with other emergent technologies, the hype of quantum computing has not always been aligned to current realities. However, quantum computing is no longer a distant dream—it’s quickly progressing, and it’s poised to revolutionize industries and economies worldwide. In this talk, we’ll explore the quest for near-term quantum utility, breakthroughs in quantum error correction, and Quantinuum’s path to fully fault-tolerant quantum computers. As a truly new and unique tool, quantum computing will reshape the way we solve complex problems in areas such as chemistry, finance, and others that are not yet identified.
This session will also dive into Quantinuum’s leading approach to building high-performance quantum hardware and how we benchmark it, adding much-needed transparency to quantum computing technology development. The high-fidelity operations and rich feature set of Quantinuum’s trapped-ion quantum computers are driving innovation and paving the way for real-world applications that deliver tangible impact.
At the heart of this transformation is collaboration. The future of quantum computing depends on partnerships that push boundaries, foster creativity, and accelerate progress. Join us as we discuss how these advancements can empower nations, including New Zealand, to become leaders in the emerging quantum era.
Building an Agriculture Data Infrastructure in Arizona
Barney Maccabe. Executive Director, Institute for the Future of Data and Computing. University of Arizona. US.
Abstract – While Arizona is not likely to be one of the first states that comes to mind when thinking about agricultural production in the US, agriculture in Arizona supports over 130,000 jobs and has a $23.3B economic impact, with exports to over 70 countries. The Yuma Valley in southwestern Arizona produces 90% of the leafy greens consumed in North America between November and March. Given the limited water supply, this kind of productivity could only be achieved by embracing technology. Today, this means embracing AI, which in turn means embracing the collection of data. Arizona has several resources, including Biosphere 2 and an outdoor phenotyping facility, that will be essential for collecting the data needed to build new models for agriculture.
Thursday 20 February 2025
“Math vs. CS (Good buds that should really hang out more.)”
Laura Monroe. Mathematician & Computer Scientist. Senior Project Leader. Los Alamos National Laboratory. US.
Abstract – Computer Science has in the past benefited greatly from mathematical theory and was in fact built on it. These days, the disciplines have grown a little apart. In this talk, I will discuss a series of computer science results in network design based on various mathematics papers, some nearly a century old, and will illustrate the interactions between them. Finally, there is a little meta-discussion on how practitioners of the two disciplines interact and how such interaction may be encouraged.
Fine-grained Automated Failure Management on Extreme-Scale GPU Accelerated Systems
Balazs Gerofi – Principal Engineer in the Supercompute Platforms Group – Intel Corporation. US.
Abstract – Failures in leadership-class accelerated HPC and AI systems have become increasingly common, and as these systems continue to scale, the frequency of failures is expected to rise. With hundreds of thousands of field-replaceable parts in such systems, automated failure management is essential. This talk introduces StabilityDB, a failure management automation framework that leverages real-time data analytics to drive failure servicing and maintenance on a per-failure-mode basis. This approach ensures minimal compute node downtimes and high overall system availability. We will provide an architectural overview of StabilityDB and present statistical information on the failure characteristics that guide our automation policies. StabilityDB has been deployed on the Aurora supercomputer at Argonne National Laboratory, a system with 63,744 GPUs, and is contributing to its efficient operation.
Making computers fundamentally more secure – the CHERI approach
Simon Moore – Professor of Computer Engineering at the University of Cambridge Department of Computer Science and Technology (previously the Computer Laboratory) in England.
Abstract – Year on year, memory safety vulnerabilities account for around 70% of all computer security vulnerabilities. The CHERI architecture enhances hardware and software to deterministically mitigate these and other vulnerabilities. In a 14+ year collaboration between the University of Cambridge, SRI International, ARM Ltd and others, an open-source full-stack security solution has been produced, including operating systems, compilers and processors: CHERI for RISC-V (open source) and the ARM Morello multicore 7nm SoC demonstrator (proprietary).
Microsoft Security Response Center undertook a substantial study to see how many of their 2019 vulnerabilities (CVEs) could have been mitigated using CHERI, concluding that two-thirds would have been completely mitigated to the point where a patch was unnecessary. Microsoft has subsequently produced CHERIoT, an open-source CHERI-enhanced RISC-V microcontroller. In 2023 the Five Eyes government agencies (USA, Canada, UK, Australia and New Zealand) issued the report “Shifting the Balance of Cybersecurity Risk: Principles and Approaches for Security-by-Design and -Default”, which recommends CHERI as the secure hardware foundation. The 2024 White House report “Back to the Building Blocks: A Path Toward Secure and Measurable Software” identifies the need for memory safety and commends the CHERI approach.
This talk will present an overview of the technical approach and a summary of some of the many results to date, including an overview of major industry applications and security benefits.
Performance Optimization & Portability: Pathways for the Era of Heterogeneous HPC Systems
Florina M. Ciorba – Associate Professor for HPC | Performance Optimization. University of Basel. Switzerland.
Abstract – Contemporary High Performance Computing (HPC) is increasingly defined by heterogeneity, characterized by a diverse array of devices and a multitude of cores per device within each node. Looking ahead, the HPC landscape is expected to grow even more complex and heterogeneous. Optimizing performance and achieving portability represent critical pathways to achieving the highest computational efficiency, reducing energy consumption, and ensuring that applications remain adaptable and seamlessly portable across rapidly evolving heterogeneous architectures.
This presentation will discuss the challenges and pathways for optimizing application performance and achieving portability in modern heterogeneous HPC systems. Drawing from recent experiences in optimizing legacy applications, developing new simulation frameworks, and integrating data analysis pipelines, we will highlight strategies for leveraging multiple levels of parallelism—both within and across nodes—while maintaining a delicate balance between performance and portability. Key topics include scheduling libraries and autotuning, scalable domain decomposition, and runtime scheduling of workflows that integrate AI, data management, and simulations. The discussion will conclude with pathways for the effective and productive exploitation of parallelism and heterogeneity in next-generation HPC systems.
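Autotuning, one of the key topics above, reduces at its core to a small search loop. The sketch below is a generic illustration rather than an example from the speaker's libraries: it times a toy kernel under each candidate configuration and keeps the fastest, the skeleton that model-based and adaptive autotuners refine.

```python
# Generic autotuning skeleton (illustrative only): measure each candidate
# configuration and keep the fastest.
import time
import numpy as np

def kernel(n: int, block: int) -> None:
    """Toy blocked workload whose runtime depends on the block size."""
    a = np.ones((n, n))
    for i in range(0, n, block):
        a[i:i + block] = np.sqrt(a[i:i + block] * 1.0001)

def autotune(candidates, n=2048, repeats=3):
    results = {}
    for block in candidates:
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(n, block)
            times.append(time.perf_counter() - start)
        results[block] = min(times)            # best-of-N to reduce timing noise
    best = min(results, key=results.get)
    return best, results

best, timings = autotune(candidates=[16, 64, 256, 1024])
print("best block size:", best, {b: f"{t:.4f}s" for b, t in timings.items()})
```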
Embracing AI for Energy-Efficient Data Movement across the Computing Continuum
Tevfik Kosar. Director, Data Intensive Distributed Computing Lab, University at Buffalo (SUNY). US.
Abstract – The digital revolution is transforming scientific disciplines across the board, with our growing reliance on the timely access, analysis, and interpretation of large-scale, heterogeneous, and often uncertain datasets propelling scientific discovery to new realms. This surge in global data access and movement requirements has also made the energy consumption and carbon footprint of data transfers a critical concern, particularly for HPC and cloud data centers, where the environmental impact is becoming increasingly unsustainable. Information and communication technologies are responsible for about 3% of global carbon emissions, which is very close to the share associated with the aviation industry, and this could deteriorate if left unaddressed. Communication networks account for around 43% of total IT power consumption.
In this talk, I will present our efforts on creating the first-ever energy-efficient and carbon-aware data access and sharing cyberinfrastructure (CI) for the wider community. This novel CI, empowered by AI, will allow researchers to access widely distributed, heterogeneous, complex, and dynamic data sources across the computing continuum, from the micro level (e.g., edge devices, sensors, IoT) to the macro level (e.g., data centers, clouds, supercomputers), in an easy, efficient and timely manner while minimizing the energy consumption and carbon emissions of the data transfers.
The Democratization of Co-design
James Ang and Antonino Tumeo – Pacific Northwest National Laboratory. US.
Abstract – The U.S. CHIPS R&D programs are creating an infrastructure for microelectronics prototyping and advanced packaging. This CHIPS infrastructure provides an unprecedented opportunity to create a low-cost, agile capability for prototype hardware design and generation. This talk will provide an overview of a new DOE project, DeCoDe, that integrates capabilities to democratize the hardware-software co-design process. We build on our history of compiler framework optimization to map domain-specialized applications to high-level synthesis tools, and we integrate capabilities from many collaborators to both leverage, and contribute back to, an open hardware technology commons.
Some highlights from this overview include: targeting converged application co-design drivers; leveraging the open chiplet ecosystem to demonstrate heterogeneous computing in which customized prototype designs are concentrated in only a few chiplets; supporting hardware composition with corresponding system software; and using prototype heterogeneous processor hardware to quantify improvements in energy efficiency. Our focus on low-cost and agile hardware R&D defines boundaries on what DeCoDe can and cannot do. We work with open-source hardware design tools to minimize the cost of commercial EDA tools and to benefit from open data sets; we leverage the open chiplet ecosystem to integrate our custom hardware designs with a much larger number of existing computing chiplets; and finally, we stay focused on hardware design for prototypes to support lab-to-fab R&D and to validate ideas that industry might later bring to product.
Quantum Computing – What, How, When?
Christopher Monroe – Gilhuly Family Presidential Distinguished Professor of Electrical and Computer Engineering and Physics at Duke University. US. He is also the Co-Founder and former CEO and Chief Scientist of IonQ, Inc., the first pure-play public quantum computing company. Monroe has pioneered nearly all aspects of trapped ion quantum computers and simulators, from demonstrations of the first quantum gate, monolithic semiconductor-chip ion trap, and photonic interconnects between physically separated qubits; to the design, fabrication, and use of full-stack ion trap quantum computer systems in both university and industrial settings. He is a key architect of the US National Quantum Initiative, a Fellow of the American Physical Society, Optical Society of America, the UK Institute of Physics, the American Association for the Advancement of Science, and is a member of the National Academy of Sciences.
Abstract – Quantum computers exploit the bizarre features of quantum physics — uncertainty, entanglement, and measurement — to perform tasks that are impossible using conventional means. These may include computing and optimizing over ungodly amounts of data; breaking encryption standards; simulating models of chemistry and materials; and communicating via quantum teleportation.
The two challenges of quantum computing are (1) we don’t really have many clear examples of useful applications, and (2) they are notoriously hard to build and scale. Despite these herculean challenges, many important problems known and unknown will never be solved until we have quantum computers.
I will discuss the state-of-the-art in quantum computers, led by an uneasy coalition of scientists and engineers from academia, industry and government.
Friday 21 February 2025
“NIWA’s recent procurement of the largest supercomputer in New Zealand”
Jeff Zais. HPC Platform Architect, Senior Science Advisor, NIWA. New Zealand.
Abstract – The NIWA Generation 3 supercomputer environment is approaching end of life. This talk will review at a high level the process to finalise recommendations for the Generation 4 environment. A summary of key decisions will be presented, in areas such as the archive approach, cloud/hosted, AMD/ARM/Intel, high-performance storage, interconnect, and energy consumption. Detailed discussion on any or all of these topics will be possible during the Friday unconference day.