For as long as researchers have applied artificial intelligence techniques, they’ve had an insatiable need to store and manage big data. Use of AI – an iterative process in which algorithms train computers to recognize patterns in data – has skyrocketed over the past decade and will continue to contribute even more to science and society while accelerating demand for, and innovation in, storage and memory.
Now, researchers at Pacific Northwest National Laboratory (PNNL) in Richland, Washington, and Micron, a Boise, Idaho, memory and storage semiconductor company, are developing an advanced memory system to support AI for scientific computing. The work, sponsored by the Advanced Scientific Computing Research (ASCR) program in the Department of Energy (DOE), will help assess emerging memory technologies for DOE Office of Science projects that employ artificial intelligence.
“A renaissance has occurred in the last five years,” says James Ang, PNNL’s chief scientist for computing and the lab’s project leader.
GPUs – graphics processing units, a mainstay of gaming – have become a staple in boosting computing performance and have let AI improve, scale and support many commercial applications. “Most of the performance improvements have been on the processor side,” Ang notes. “But recently, we’ve been falling short on performance improvements, and it’s because we’re actually more memory-bound. That bottleneck increases the urgency and priority in memory resource research.”
Micron has been inventing computer memory devices since its founding. For this collaboration, the partners will apply an industry-standard data interface called Compute Express Link (CXL) to connect memory across the various processing units deployed for scientific simulations.
Tony Brewer, Micron’s chief architect of near-data computing, says the collaboration aims to blend old and new memory technologies to boost high-performance computing (HPC) workloads. “We have efforts that look at how we could improve the memory devices themselves and efforts that look at how we can take traditional high-performance memory devices and run applications more efficiently.”
In HPC systems that deploy AI, high-performance but low-capacity memory (typically gigabytes) is tightly coupled to the GPUs, whereas low-performance but high-capacity memory (terabytes) is loosely coupled via the traditional HPC workhorses, central processing units (CPUs). With PNNL, Micron will create proof-of-concept shared GPU and CPU systems and combine them with additional external storage devices in the hundreds-of-terabytes range. Future systems will need rapid access to petabytes of memory – a thousand times more capacity than a single GPU or CPU offers today.
This will create a third level in the memory hierarchy, Brewer explains. “The host would have some local memory, the GPU would have some local memory, but the main capacity memory is accessible to all compute resources across a switch, which would allow scaling of much larger systems.” This unified memory would let researchers using deep-learning algorithms run a simulation while its results simultaneously feed back into the algorithm.
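A minimal sketch of that three-tier idea, in Python, may help make it concrete. The tier names, capacities and spill policy below are illustrative assumptions, not Micron’s or PNNL’s design and not a real CXL programming interface.

```python
# Toy model of the three-tier memory hierarchy described above.
# All names, sizes and the placement policy are illustrative assumptions.

GIB = 1024**3
TIB = 1024**4

class Tier:
    def __init__(self, name, capacity_bytes):
        self.name = name
        self.capacity = capacity_bytes
        self.used = 0

    def try_alloc(self, nbytes):
        """Reserve space in this tier if it fits; return True on success."""
        if self.used + nbytes <= self.capacity:
            self.used += nbytes
            return True
        return False

# Tier 1: GPU-local high-bandwidth memory (gigabytes).
# Tier 2: CPU-local DRAM (terabytes).
# Tier 3: pooled memory behind a switch, reachable by every node.
gpu_local = Tier("GPU HBM", 80 * GIB)
cpu_local = Tier("CPU DRAM", 2 * TIB)
shared_pool = Tier("pooled memory", 100 * TIB)

def place(nbytes):
    """Place an allocation in the fastest tier with room, spilling to the pool."""
    for tier in (gpu_local, cpu_local, shared_pool):
        if tier.try_alloc(nbytes):
            return tier.name
    raise MemoryError("all tiers exhausted")

# A data set too large for GPU or CPU memory lands in the shared pool,
# where any node's simulation or training job could reach it.
print(place(10 * TIB))  # -> "pooled memory"
```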
A centralized memory system also benefits operations because an algorithm or scientific simulation can share data with, say, another program tasked with analyzing those data. Such converged application workflows are typical of DOE’s scientific discovery challenges. Sharing memory and moving data around consumes additional technical resources, says Andrés Márquez, a PNNL senior computer scientist. A centralized memory pool, on the other hand, would help mitigate the problem of over-provisioning memory.
Márquez explains that because AI-aided data-driven science will rapidly drive up demand for memory, an application can’t afford to partition and strand the memory so that it keeps “piling up underutilized at various processing units. Having the capability of reducing that over-provisioning and getting more bang out of your buck by sharing that data across all those devices and different stages of workflow cannot be overemphasized.”
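A rough back-of-envelope sketch can illustrate the over-provisioning point. The node counts and memory figures below are invented for illustration, not PNNL or Micron measurements.

```python
# Hypothetical figures only: compare provisioning each node for its own peak
# demand (stranded headroom) with sizing a shared pool for overlapping use.

peak_demand_tib = [1.5, 0.4, 2.0, 0.3, 0.8, 1.2, 0.5, 0.9]     # per-node peaks
typical_demand_tib = [0.5, 0.2, 0.7, 0.1, 0.3, 0.4, 0.2, 0.3]  # per-node typical use

# Partitioned memory: every node carries capacity for its own worst case.
partitioned_total = sum(peak_demand_tib)

# Pooled memory: assume (optimistically) that peaks rarely coincide, so the
# pool covers typical use plus the single largest peak.
pooled_total = sum(typical_demand_tib) + max(peak_demand_tib)

print(f"partitioned: {partitioned_total:.1f} TiB, pooled: {pooled_total:.1f} TiB")
# -> partitioned: 7.6 TiB, pooled: 4.7 TiB under these assumed numbers
```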
Some of PNNL’s AI algorithms can underperform when memory is difficult or slow to access, Márquez says. In PNNL’s computational chemistry group, for instance, researchers use AI to study water’s molecular dynamics and see how it aggregates and interacts with other compounds. Water is a common solvent in commercial processes, so running simulations to understand how it interacts with a molecule of interest is important. A separate research team at the Richland lab is using AI and neural networks to modernize the power grid’s transmission lines.
Micron’s Brewer is excited not only about developing tools with PNNL but also about their future commercial use – by any company working on large-scale data analysis. “We are looking at algorithms,” he says, “and understanding how we can advance these memory technologies to better meet the needs of those applications.” PNNL’s computational science problems give Micron a way to observe the applications that will stress memory the most. Those findings will help Brewer and colleagues develop products that let industry meet its memory requirements.
Ang, too, expects the project to help artificial intelligence at large, pointing out that the Micron partnership isn’t “just a specialized one-off for DOE or scientific computing. The hope is that we’re going to break new ground and understand how we can support applications with pooled memory in a way that can be communicated to the community through enhancements to the CXL standard.”