August 2009

Combustion research helped spark cluster revolution

Just four years ago, before the Department of Energy launched its Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, Joe Oefelein and Jackie Chen had to improvise computing power for their data-intensive simulations.

Because the supercomputers they use aren’t located where they work – Sandia National Laboratories’ Combustion Research Facility (CRF) in Livermore, Calif. – they had to rely on offsite facilities such as the National Center for Supercomputing Applications at the University of Illinois. Jaguar, the world’s fastest open-science supercomputer, is at a different national laboratory entirely: Oak Ridge, in Tennessee.

What’s more, these supercomputers were designed to be shared by researchers worldwide, making access a scarce resource that must be scheduled well in advance.

By corralling a group of less-expensive in-house computers to perform the work of a supercomputer, Chen and Oefelein became heroic code warriors among software, hardware and programming aficionados, who saw in their ingenuity a triumph of human thinking in a world of hard drives and silicon chips.

“Combustion Lab Discovers Cost Effective Way To Turn Clusters into More Research,” read a headline above a case study on the effort. “Sandia Blasts Off Blade Cluster,” reported Byte and Switch magazine.

For Linux Magazine, Oefelein wrote, “Sandia’s CRF team and Penguin Computing put their heads together and harnessed the power of several Penguin Altus Opteron servers.”

Using Scyld Software’s Beowulf cluster-management software to control the Linux-based servers, Chen and Oefelein added 5 million processor hours of computing capacity. In one year, they saved CRF and taxpayers $150,000 in overhead costs alone. In subsequent years, they have reinvested the savings in memory, software and other local resources.

They also provided a valuable case study in “cluster computing.” In programming parlance, a cluster is a group of individual computers linked by a network to perform tasks as a single high-performance computer. Clusters typically cost less to build and operate than supercomputers, and cluster-based computations are an excellent way to develop and test a simulation before transferring it to a Jaguar-class system.
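To make the idea concrete, here is a minimal sketch of how a cluster splits one calculation across networked machines. It uses the widely adopted MPI standard via Python’s mpi4py library purely for illustration; the integrand, file name and process count are hypothetical, not the CRF team’s actual code.

    # Minimal sketch: networked machines cooperating on one calculation.
    # Illustrative only; the integrand and problem size are hypothetical.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's ID within the cluster
    size = comm.Get_size()   # total number of cooperating processes

    # Split the integral of f(x) = x**2 over [0, 1] into small slices,
    # assigning slices to processes so each machine works its own piece.
    n = 1000000                        # total midpoint-rule intervals
    h = 1.0 / n
    local_sum = 0.0
    for i in range(rank, n, size):     # interleaved work assignment
        x = (i + 0.5) * h
        local_sum += x * x * h

    # The network combines the partial results into one answer on rank 0.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print("integral = %.6f" % total)  # exact value is 1/3

On a Beowulf-style cluster, a script like this would typically be launched with an MPI runner – for example, mpirun -np 64 python integrate.py – which starts 64 cooperating copies of the program across the cluster’s nodes.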

In the article “Exceptional Installations” for Beowulf.org, a Web site and online user community for devotees of Linux-based cluster computers, Chen and Oefelein provided a three-step program for people considering clusters:

    • Do your homework on how your specific applications will perform on the clusters being evaluated.
    • Reliability is key. Cluster technology isn’t perfect, and depending on which products you choose, you may need in-house cluster-management expertise.
    • Buy in logical increments. Anticipate what you’ll need, especially if you work with uncertain budgets.

By providing “fast and routine access to pertinent research results, cluster computing is playing an increasingly important role in science,” Oefelein says. “Clusters help identify critical issues prior to accessing large-scale computing resources such as those deployed for INCITE, all of which facilitates more efficient use of these systems.”

Citing cost and ease of use, Oefelein calls Sandia’s server clusters part of a “computational hierarchy.” Whereas supercomputers handle the big stuff, “our in-house clusters are essential tools for day-to-day ‘routine’ calculations.”

Oefelein attributes much of his success with INCITE to this intermediate in-house capability. He’s happy to report that DOE is on board, too.

“DOE has recognized the need for a hierarchy of in-house clusters and supercomputers,” Oefelein says. “And unlike the situation four years ago, our DOE sponsors have been able to greatly increase the amount of supercomputer time available to the scientific community over this period.”

Today, hierarchical computing has plenty of adherents. To support computationally intensive research across scientific disciplines, balance between in-house and national computing resources has become a mantra among computational theorists and computer scientists.

“And at CRF,” he says, “we have been extremely successful in establishing such a balance.”