Processor cores in a giant supercomputer want data delivery that’s fast and efficient. It’s a difficult goal to achieve in large calculations, like those in chemistry and bioinformatics problems. The size and complexity of such computations can snarl communication between hundreds of processors as each solves its own piece of the puzzle. If communication traffic jams develop, programs can bog down. Making big computations efficient means making data transmission fast and seamless.
To address these communication challenges, researchers have developed Casper, a layer of code that inserts itself into the workflow to reduce delays. Casper is portable, so it should be useful for other high-performance computing (HPC) software that uses a similar communications strategy. It could even facilitate the move toward exascale computing, when machines will be capable of a quintillion calculations per second.
Casper was first developed for NWChem, a Pacific Northwest National Laboratory-designed computational chemistry program that lets researchers model complex chemical processes such as how radioactive waste decays and how ultraviolet radiation causes cancer in skin cells. From its infancy in the mid-1990s, NWChem was designed to run on networked processors, as in an HPC system, using one-sided communication, says Jeff Hammond of Intel Corp.’s Parallel Computing Laboratory.
In one-sided communication, a processor is programmed internally to fetch data from and write data to another processor without that processor’s involvement. Eliminating the need to match communication on two processors greatly simplifies programming in applications where data are accessed irregularly. This strategy also reduces communication overhead, which can burden large-scale simulations. When communication proceeds without a programmer’s instruction, it makes asynchronous progress.
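The idea can be sketched in a few lines of Python. This is a toy model only, using threads in place of processors; it is not MPI and not NWChem code, and the names `Window`, `put`, `get` and `run_demo` are hypothetical, chosen for illustration. The point it shows is that the origin reads and writes the target's exposed memory while the target never posts a matching send or receive.

```python
import threading


class Window:
    """Toy stand-in for memory that one worker exposes to others."""

    def __init__(self, size):
        self._data = [0.0] * size
        self._lock = threading.Lock()

    def put(self, offset, values):
        # The origin writes straight into the target's memory.
        with self._lock:
            self._data[offset:offset + len(values)] = values

    def get(self, offset, count):
        # The origin reads straight from the target's memory.
        with self._lock:
            return self._data[offset:offset + count]


def run_demo():
    window = Window(size=8)            # owned by the "target" worker
    target_busy = threading.Event()
    origin_done = threading.Event()
    result = []

    def target():
        # The target only "computes"; it never calls put, get,
        # or anything that matches the origin's operations.
        target_busy.set()
        origin_done.wait()

    def origin():
        target_busy.wait()
        window.put(2, [1.5, 2.5])          # one-sided write
        result.extend(window.get(1, 4))    # one-sided read
        origin_done.set()

    threads = [threading.Thread(target=target),
               threading.Thread(target=origin)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(run_demo())
```

In a two-sided scheme, by contrast, the target function would need its own receive call for every message the origin sends, which is the matching the paragraph above describes as unnecessary.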
As HPC systems have evolved, Message Passing Interface (MPI) – computer code that facilitates overall communication between parallel processors – has become the lingua franca of supercomputers, allowing researchers to move data via various strategies. “There’s a huge amount of investment in making sure that MPI is fast and correct. MPI just works on these supercomputers,” Hammond says.
MPI is widely used, but some computer scientists have resisted adopting it. The code can be sluggish with programs that use one-sided communication, Hammond says. When a program doesn’t make asynchronous progress, calculations can crawl to a near standstill.
Jim Dinan, Pavan Balaji, and Hammond, all at Argonne National Laboratory at the time, had worked out some of these communication problems between NWChem and earlier MPI versions. But NWChem didn’t work optimally with MPI-3, the newest version.
To solve this problem, Hammond and Balaji envisioned adding another layer of code that would speed up NWChem, much like introducing nitrous oxide in a dragster engine to boost its power. They worked with Min Si, a University of Tokyo doctoral student, to develop the code now called Casper during her 2013 Argonne internship.
To streamline asynchronous communication, Casper lets researchers designate one or more processor cores as “ghost processes,” Si says. They serve as landing points for communication and allow asynchronous progress to take place while other cores perform the computation.
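A rough way to picture the ghost-process idea is the Python sketch below. It is a toy model built on threads and queues, not Casper's implementation and not MPI, and the names `simulate`, `serve_one` and `ghost` are hypothetical. It also assumes, for illustration, that without a ghost the target can answer a remote request only after its computation ends. With a ghost, a dedicated worker answers while the target is still computing.

```python
import queue
import threading


def simulate(use_ghost):
    """Report whether a remote read was answered during the target's compute."""
    memory = {"x": 42}                 # data the target exposes
    requests = queue.Queue()           # incoming one-sided requests
    compute_started = threading.Event()
    compute_may_finish = threading.Event()
    outcome = {}

    def serve_one():
        key, reply = requests.get()
        reply.put(memory[key])

    def target():
        compute_started.set()
        compute_may_finish.wait()      # stands in for a long computation
        if not use_ghost:
            serve_one()                # assumed: served only after compute ends

    def ghost():
        serve_one()                    # dedicated worker, always ready to serve

    def origin():
        compute_started.wait()
        reply = queue.Queue()
        requests.put(("x", reply))
        try:
            value = reply.get(timeout=0.5)   # answered during compute?
            outcome["overlapped"] = True
        except queue.Empty:
            value = None
            outcome["overlapped"] = False
        compute_may_finish.set()             # let the computation end
        if value is None:
            value = reply.get()              # the target answers afterwards
        outcome["value"] = value

    workers = [target, origin] + ([ghost] if use_ghost else [])
    threads = [threading.Thread(target=w) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print("with ghost:   ", simulate(True))
    print("without ghost:", simulate(False))
```

In both runs the origin eventually gets the same value. What differs is whether the request had to wait for the computation to finish, which is the delay the ghost is meant to remove.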
There’s no need to change MPI, Hammond says. Casper is “just this invisible code that sneaks in and takes care of things. Casper is, you know, the friendly ghost that comes in and helps you.” The friendly ghost solution is particularly useful because it works with any MPI version, no matter who built the supercomputer.
Casper also can be turned on as needed in the NWChem workflow, Hammond says. It’s easy to install and use, but it’s not there if you don’t want it.
To demonstrate Casper, the team showed that the code accelerated NWChem calculations of water clusters by 50 percent. That level of improvement on this type of test system suggests that, with Casper in place, NWChem’s one-sided communication makes asynchronous progress.
Improving NWChem is probably just Casper’s first step. “When we were doing this project, we wanted to focus on something useful,” Hammond says – ensuring NWChem made asynchronous progress. Since then, he has collaborated with Georgia Institute of Technology researchers who have used Casper with different software to make asynchronous progress on biochemistry calculations.
One-sided communication is a relatively new strategy, Si says, and one that a growing number of developers are using in software. Hammond says Casper is applicable to other one-sided communication models, such as SHMEM and UPC, found on parallel processors.
The ability to make asynchronous progress will be important in many of the programming models DOE supports as part of its exascale initiative, Hammond adds. Researchers will have to reach that goal by altering MPI, designing new communications libraries, or using an existing model with less support. With Casper, he says, researchers could just use MPI-3 rather than tricking it into making asynchronous progress.