Early risers

Several scientists supported by the Department of Energy’s Office of Advanced Scientific Computing Research (ASCR) were among those who recently received awards in DOE’s new Early Career Research Program. ASCR Discovery talked with them to learn what they will investigate over the course of their five-year awards.

Picture a six-faced cube with eight corner points, blow it up and slap it onto a globe. The six faces are rectangular patches that can be subdivided infinitely. This technique allows Christiane Jablonowski, University of Michigan assistant professor of atmospheric and space sciences and scientific computing, to place a “refinement region” of tight grid-spacing atop a uniform-resolution, cubed-sphere mesh.

“The refined patch lets us, for example, capture and track tropical hurricanes with high accuracy,” she says. The technique is called adaptive mesh refinement (AMR).
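The idea of a refined patch on one cube face can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Jablonowski's dynamical core: the function names, the storm location and radius, and the simple "normalize the cube point" projection are all hypothetical choices made for the example.

```python
import math

def cube_face_to_sphere(a, b):
    """Project a point (a, b) in [-1, 1]^2 on the cube's +x face onto the unit sphere."""
    x, y, z = 1.0, a, b
    norm = math.sqrt(x * x + y * y + z * z)
    return (x / norm, y / norm, z / norm)

def uniform_cells(n):
    """Split one cube face into n x n square cells, each stored as (a0, b0, size)."""
    size = 2.0 / n
    return [(-1.0 + i * size, -1.0 + j * size, size)
            for i in range(n) for j in range(n)]

# Hypothetical feature of interest: a circular "storm" on the face.
STORM_CENTER = (0.5, 0.5)
STORM_RADIUS = 0.25

def overlaps_storm(a0, b0, size):
    """True if the cell touches the storm disk (nearest-point test)."""
    nearest_a = min(max(STORM_CENTER[0], a0), a0 + size)
    nearest_b = min(max(STORM_CENTER[1], b0), b0 + size)
    gap = math.hypot(nearest_a - STORM_CENTER[0], nearest_b - STORM_CENTER[1])
    return gap <= STORM_RADIUS

def refine(cells, needs_refinement, levels):
    """Quarter every flagged cell, repeating `levels` times."""
    for _ in range(levels):
        refined = []
        for a0, b0, size in cells:
            if needs_refinement(a0, b0, size):
                half = size / 2
                refined.extend((a0 + i * half, b0 + j * half, half)
                               for i in (0, 1) for j in (0, 1))
            else:
                refined.append((a0, b0, size))
        cells = refined
    return cells

coarse = uniform_cells(4)
mesh = refine(coarse, overlaps_storm, levels=3)
corners_on_sphere = [cube_face_to_sphere(a0, b0) for a0, b0, _ in mesh]
print(len(coarse), "coarse cells ->", len(mesh), "cells after refinement")
```

Each refinement level halves the cell width only where the storm is, so the rest of the face keeps its coarse spacing.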

AMR is the foundation for a new “dynamical core,” which makes up the heart of every climate and weather model and describes the fluid flow behind wind, temperature, pressure and density, Jablonowski says.

Jablonowski, who heads Michigan’s atmospheric dynamic modeling group, plans to use her award to develop more precise and reliable climate models using AMR techniques. The resulting model is the only one that can be directed to zoom in on features of interest (in this case an isolated, idealized storm system) at the minuscule resolution (for atmospheric science, anyway) of 1 kilometer while maintaining the flexibility to portray other features at a resolution of 100 kilometers.

Many atmospheric science models use latitude-longitude grids for their mesh, but converging meridians at the poles present mathematical problems.
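A quick calculation shows how sharply the meridians converge. The sketch below assumes a spherical Earth with a mean radius of 6,371 kilometers and a hypothetical grid with one-degree longitude spacing; the function name is invented for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def east_west_spacing_km(lat_deg, dlon_deg=1.0):
    """East-west distance covered by dlon_deg of longitude at a given latitude."""
    return EARTH_RADIUS_KM * math.cos(math.radians(lat_deg)) * math.radians(dlon_deg)

for lat in (0, 45, 80, 89, 89.9):
    print(f"{lat:5.1f} deg latitude: {east_west_spacing_km(lat):8.2f} km per degree of longitude")
```

The east-west cell width shrinks with the cosine of latitude, so cells near the poles are far narrower than cells at the equator even though the grid looks uniform in latitude-longitude coordinates.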

“This will be a major test case of the efficacy of our modeling approach,” she says.

The model’s dynamical core will use the power of hundreds of thousands of processors working in parallel. Besides two graduate students, Jablonowski’s collaborators include researchers at DOE’s Lawrence Berkeley National Laboratory and the National Center for Atmospheric Research.

Super-efficient supercomputing

Patrick Chiang

Patrick Chiang, assistant professor of computer science at Oregon State University, is working on an energy-efficient interconnect for microchips in future massively parallel computing systems.

“It turns out that the energy consumed to make a computation is much less than the energy used to move that computed result somewhere else within the system,” Chiang says. “This is the case at every level — within a single integrated microprocessor, connecting multiple chips inside a single server, and connecting multiple servers in a data center. The goal of my research is to tackle innovative ways in silicon to reduce this energy at all of these levels.”

On the chip, where data must be moved to and from hundreds of computational units in a multicore processor, he and his team will take advantage of bandwidth advances that enable better system operations and lower power consumption. Off the chip, over short distances on a motherboard, he and his group hope to reduce the “clock distribution energy” that accompanies data transfers between a computer and its peripheral components and that consumes significant power.

Finally, between separate racks that may be several meters apart in a data center, his group is working on energy-efficient, gigahertz analog-to-digital converters. Encoding data into multiple signals at a given time can significantly improve the off-chip bandwidth, he says.

Finding fault

Greg Bronevetsky

Today’s largest supercomputers incorporate hundreds of thousands of cores. Upcoming systems like Sequoia, under development at Lawrence Livermore National Laboratory (LLNL), will have more than 1 million. And exascale systems are expected to have hundreds of millions of cores, millions of memory chips and hundreds of thousands of disk drives.

“At these scales,” says Greg Bronevetsky, an LLNL computer scientist, “supercomputers become unreliable simply because of the large numbers of components involved, with exascale machines expected to encounter continuous hardware failures.”

Bronevetsky’s work looks at a key aspect of these failures, or hardware faults: their effect on applications.

“The current state of the art is to execute each application of interest thousands of times, each time injecting it with a random fault,” Bronevetsky says. “The result is a profile of the application errors most likely to result from hardware faults and the types of hardware faults most likely to cause each type of application error.”

This procedure is so expensive, he says, that it can be done for only a few applications of high importance.
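A toy version of such a fault-injection campaign might look like the following. It is a sketch under assumptions, not Bronevetsky's tooling: the "application" is a stand-in sum of squares, the fault model is a single random bit flip in one input value, and the three outcome labels are invented for the example.

```python
import math
import random
import struct
from collections import Counter

def flip_bit(value, bit):
    """Flip one bit (0-63) in the IEEE 754 representation of a float."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped

def toy_app(data, fault=None):
    """Stand-in application: sum of squares. fault=(index, bit) corrupts one input."""
    total = 0.0
    for i, x in enumerate(data):
        if fault is not None and fault[0] == i:
            x = flip_bit(x, fault[1])
        total += x * x
    return total

def classify(golden, observed):
    """Label the outcome of one faulty run against the fault-free result."""
    if math.isnan(observed) or math.isinf(observed):
        return "abnormal value"
    if math.isclose(golden, observed, rel_tol=1e-9):
        return "masked"
    return "silent corruption"

def campaign(data, trials, seed=0):
    """Run the application `trials` times, injecting one random bit flip each time."""
    rng = random.Random(seed)
    golden = toy_app(data)
    profile = Counter()
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        profile[classify(golden, toy_app(data, fault))] += 1
    return profile

data = [1.0, 2.5, 3.75, 0.5]
profile = campaign(data, trials=1000)
print(dict(profile))
```

Even this tiny example reruns the whole application once per injected fault, which hints at why the procedure becomes costly for full-scale codes.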

To address this conundrum, Bronevetsky is devising a modular fault analysis approach that breaks the application into its constituent software components, such as linear solvers and physics models. He will then perform fault injections on each component, producing a statistical profile of how faults affect and travel through the component. He plans to connect these component profiles to produce a model of how hardware faults affect the entire application.

Given a distribution of faults, the model can predict the resulting distribution of application errors. Likewise, given a detected application error, the model can provide reverse analysis to come up with a probability distribution of hardware faults that most likely caused the error.

“The modularity of this mechanism will make it possible to assemble models of arbitrary applications out of pre-generated profiles of popular (software) libraries and services,” Bronevetsky says. “This will make large-scale complex systems easier to manage for system administrators and easier to use for computational scientists” — and help LLNL and others build ever larger and more powerful computers.
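One way to picture the modular approach is to treat each component profile as a table of conditional probabilities, chain the tables together, and apply Bayes' rule for the reverse analysis. The sketch below assumes that representation; the component names, fault types, intermediate states and every probability are made-up illustrative values, not measurements.

```python
def propagate(dist, profile):
    """Push a distribution through a profile, where profile[cause][effect] = P(effect | cause)."""
    out = {}
    for cause, p_cause in dist.items():
        for effect, p in profile[cause].items():
            out[effect] = out.get(effect, 0.0) + p_cause * p
    return out

def compose(first, second):
    """Chain two component profiles into a single end-to-end profile."""
    return {cause: propagate(effects, second) for cause, effects in first.items()}

def diagnose(prior, profile, observed):
    """Bayes' rule: P(cause | observed effect). Assumes the effect has nonzero probability."""
    joint = {c: prior[c] * profile[c].get(observed, 0.0) for c in prior}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

# Hypothetical profile of a linear solver: hardware fault -> state of its output.
solver_profile = {
    "memory_bit_flip":   {"ok": 0.70, "perturbed": 0.25, "garbage": 0.05},
    "register_bit_flip": {"ok": 0.40, "perturbed": 0.40, "garbage": 0.20},
}
# Hypothetical profile of a physics model: state of its input -> application outcome.
physics_profile = {
    "ok":        {"correct": 1.0},
    "perturbed": {"correct": 0.6, "wrong_result": 0.4},
    "garbage":   {"wrong_result": 0.2, "crash": 0.8},
}
fault_prior = {"memory_bit_flip": 0.8, "register_bit_flip": 0.2}

end_to_end = compose(solver_profile, physics_profile)
error_dist = propagate(fault_prior, end_to_end)          # forward prediction
likely_faults = diagnose(fault_prior, end_to_end, "crash")  # reverse analysis
print(error_dist)
print(likely_faults)
```

Because the profiles are composed rather than re-measured, swapping in a different solver would only require a new table for that one component.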

Speeding up scientific computing

Michelle Mills Strout

Michelle Mills Strout, assistant professor of computer science at Colorado State University, will focus on models and tools that enable scientists to develop faster, more precise computational models of the physical world.

“Computing has become the third pillar of science along with theory and experimentation,” Strout says. “Some examples include molecular dynamics simulations that track atom movement in proteins over simulated femtoseconds and climate simulations for the whole Earth at tens-of-kilometers resolution.”

Such simulations require the constant evolution of algorithms that model the physical phenomena under study. The simulations also must keep up with rapidly changing implementation details that boost performance on highly parallel computer architectures.

“Currently, the algorithm and implementation specifications are strongly coupled,” Strout says. To correct for the resulting “code obfuscation that makes algorithm and implementation details difficult,” she will program new libraries that allow critical algorithms to “operate on sparse and dense matrices as well as computations that can be expressed as task graphs.”

Diffusion: a theoretical framework

Anil Vullikanti

Diffusion processes can model the spread of disease in social-contact networks, malware in wireless networks and random, or cascading, failures in infrastructure networks such as the electrical power grid.

Anil Vullikanti, assistant professor of computer science and senior research associate at Virginia Polytechnic Institute and State University, wants to develop a theoretical framework to characterize the dynamics of simple diffusion processes on graphs to assess, for example, whether a disease outbreak will spread to a large fraction of nodes.

“Our approach will be to use and develop theories to both formulate and solve problems in various applications,” Vullikanti says. “We will use both theoretical and large-scale simulation-based methods to address the challenges of complex networks.”
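A simulation-based estimate of the "large outbreak" question can be sketched as follows. The spreading rule (each newly infected node gets one chance to infect each neighbor), the ring-with-shortcuts contact network and all parameter values are assumptions chosen for illustration, not Vullikanti's models.

```python
import random

def spread(graph, seeds, p_transmit, rng):
    """Simple cascade: each newly infected node gets one chance to infect each neighbor."""
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph[node]:
                if neighbor not in infected and rng.random() < p_transmit:
                    infected.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return infected

def ring_with_shortcuts(n, shortcuts, rng):
    """Hypothetical contact network: a ring of n nodes plus random long-range links."""
    graph = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for _ in range(shortcuts):
        a, b = rng.sample(range(n), 2)
        graph[a].add(b)
        graph[b].add(a)
    return graph

def large_outbreak_probability(graph, p_transmit, trials, threshold, seed=0):
    """Fraction of random single-seed outbreaks reaching at least `threshold` of all nodes."""
    rng = random.Random(seed)
    nodes = list(graph)
    hits = 0
    for _ in range(trials):
        start = rng.choice(nodes)
        if len(spread(graph, [start], p_transmit, rng)) >= threshold * len(graph):
            hits += 1
    return hits / trials

graph = ring_with_shortcuts(200, 60, random.Random(42))
for p in (0.2, 0.5, 0.8):
    print(p, large_outbreak_probability(graph, p, trials=200, threshold=0.5))
```

The same routine could stand in for malware or cascading failures by changing what a node and an edge represent, which is the kind of shared structure a common framework would aim to capture.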

Network flow dynamics, random failures or disease outbreaks typically have been studied in isolation. In practice, Vullikanti says, dynamic phenomena are all composed of the same basic underlying interactions, involving network packet-flow issues and faults cascading through systems.

“Despite a significant and growing amount of research on complex networks in different applications, the methods that have been developed are fairly ad hoc and application-specific, and the theoretical foundations of this area remain open,” Vullikanti says.

Shifting time into reverse

Kalyan Perumalla

Kalyan Perumalla of Oak Ridge National Laboratory (ORNL) is developing scalable, reversible software that can “turn back the hands of time” to fix processor flaws on parallel computing systems that are essentially forward only.

When glitches happen, systems must start over from scratch or interrupt operations, causing serious and costly delays. Such problems will increase as parallel computing systems grow larger. Reversible software is aimed at relieving these serious issues, says Perumalla, a senior researcher in the modeling and simulation research group in ORNL’s computational sciences and engineering division.

Perumalla’s approach is to enable supercomputers with as many as a million processor cores to self-correct when one or more processors send bad or outdated data to good processors and corrupt the application.

To that end, he is testing his software, called ReveR-SES (for “reversible software execution systems”), on some of the world’s fastest computers, such as ORNL’s Jaguar, a Cray XT5.
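The general idea of running a computation backward, rather than restarting it, can be illustrated with a small sketch. It is not ReveR-SES and makes no claim about how that software works; the class, its method names and the two kinds of operations are hypothetical.

```python
class ReversibleState:
    """Toy state whose updates can be undone step by step."""

    def __init__(self, **values):
        self.values = dict(values)
        self._trail = []  # one entry per forward step, newest last

    def add(self, name, amount):
        """Invertible update: undoing it is a subtraction, so no old value is saved."""
        self.values[name] += amount
        self._trail.append(("add", name, amount))

    def assign(self, name, new_value):
        """Destructive update: the overwritten value must be remembered to undo it."""
        self._trail.append(("assign", name, self.values[name]))
        self.values[name] = new_value

    def mark(self):
        """Return a position in the trail that rollback() can return to."""
        return len(self._trail)

    def rollback(self, mark):
        """Undo forward steps in reverse order until the trail is back at `mark`."""
        while len(self._trail) > mark:
            kind, name, payload = self._trail.pop()
            if kind == "add":
                self.values[name] -= payload
            else:
                self.values[name] = payload

state = ReversibleState(energy=10, steps=0)
safe_point = state.mark()
state.add("energy", 5)     # suppose this used bad data from another processor
state.assign("steps", 7)
state.add("energy", -2)
state.rollback(safe_point)  # run the steps backward instead of restarting
print(state.values)
```

In this sketch, only the destructive assignment needs stored history; the additive updates are reversed by computation alone.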

Understanding through inference

Youssef Marzouk

Youssef Marzouk will study uncertainty quantification and a statistical method called inference — tools relevant to researching new energy conversion technologies and environmentally important practices, such as groundwater remediation and carbon sequestration.

Inference starts with data and attempts to determine the underlying parameters that produced them. Many traditional inferential approaches require millions of repeated simulations over enormous regions of parameter space, says Marzouk, Boeing Assistant Professor of Aeronautics and Astronautics at the Massachusetts Institute of Technology.

He hopes to get around this problem partly by creating surrogate models for data-driven inference and prediction.

“The resulting algorithms will provide a foundation for learning from noisy and incomplete data, a natural mechanism for fusing heterogeneous information sources, and a complete assessment of uncertainty in subsequent predictions,” he says.
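The role of a surrogate can be sketched with a one-parameter example. Everything here is an assumption made for illustration rather than Marzouk's method: the "expensive" model is a cheap stand-in formula, the surrogate is a piecewise-linear interpolant, the observations are made-up numbers, and the posterior is computed on a grid with a uniform prior and Gaussian noise.

```python
import math

def expensive_model(theta):
    """Stand-in for a costly simulation (hypothetical formula)."""
    return theta ** 3 + theta

def build_surrogate(model, lo, hi, n_nodes):
    """Piecewise-linear interpolant built from only n_nodes runs of the model."""
    xs = [lo + (hi - lo) * i / (n_nodes - 1) for i in range(n_nodes)]
    ys = [model(x) for x in xs]

    def surrogate(theta):
        i = min(int((theta - lo) / (hi - lo) * (n_nodes - 1)), n_nodes - 2)
        t = (theta - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])

    return surrogate

def grid_posterior(predict, observations, noise_sd, grid):
    """Posterior on a grid: uniform prior, independent Gaussian measurement noise."""
    log_like = [-sum((y - predict(theta)) ** 2 for y in observations) / (2 * noise_sd ** 2)
                for theta in grid]
    peak = max(log_like)  # subtract the peak to avoid underflow in exp()
    weights = [math.exp(v - peak) for v in log_like]
    total = sum(weights)
    return [w / total for w in weights]

grid = [2.0 * i / 200 for i in range(201)]
observations = [2.90, 3.00, 2.85]  # made-up noisy measurements
surrogate = build_surrogate(expensive_model, 0.0, 2.0, n_nodes=21)

posterior = grid_posterior(surrogate, observations, noise_sd=0.1, grid=grid)
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
print(f"posterior mean of theta using the surrogate: {posterior_mean:.3f}")
```

Here the surrogate needs 21 runs of the model, while evaluating the posterior directly on the same grid would need 201; the gap grows quickly as the number of parameters increases.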

He also will develop inferential methods that fit the uncertainty inherent in a computational model’s form. Such uncertainty arises when the choice between competing models is unclear — for instance, whether the model includes or adequately resolves the right physical processes or chemical pathways.
