
Emily Casleton

Scientist at Los Alamos National Laboratory. US.

“Testing and Evaluating Large AI models: Current Trends and Future Work”

Wednesday 18 February 2026

Abstract

Large-scale AI models are now incorporated into many of our workflows, including search, coding, and national security, yet research on defensible methods to probe and test what they can and should do has lagged behind model development. However, Stanford AI experts predict that “The era of AI evangelism is giving way to evaluation.”

In this talk, I will discuss, with examples, three interconnected trends that are reshaping evaluation practice. 1) Bespoke metrics for model assessment and how they can inform the loss function, that is, what you optimize during training. Mean squared error and cross-entropy are the most common loss functions, and accuracy, precision, and recall are the default metrics, but for AI models built for science, bespoke metrics can be more informative. 2) Capability-based benchmark suites. By organizing benchmarks around what a model should do rather than what it knows, the evaluation can better quantify the model’s usefulness to potential users. 3) Design of experiments for benchmark construction. This approach yields smaller but more informative benchmarks and leads to accuracy results that are tied to metadata.

Bio

Dr. Emily Casleton is a statistician in the statistical sciences group at Los Alamos National Laboratory (LANL) in New Mexico, USA. She joined the lab as a postdoc in 2014 after earning her PhD in Statistics from Iowa State University. Since converting to staff in 2015, Emily has routinely collaborated with seismologists, nuclear engineers, physicists, geologists, chemists, and computer scientists on a wide variety of cool data-driven projects. Most recently, her research has focused on bridging the gap between statistics and AI through better evaluation and uncertainty quantification, often with an emphasis on nuclear nonproliferation. She holds a BS in Mathematics and Political Science from Washington & Jefferson College, 2003; an MS in Statistics from West Virginia University, 2006; and a PhD in Statistics from Iowa State University.

MW26 Slides

MW26 Videos

MW26 Q & A