The incredible shrinking data

Tom Peterka submitted his Early Career Research Program proposal to the Department of Energy (DOE) last year with a sensible title: “A Continuous Model of Discrete Scientific Data.” But a Hollywood producer might have preferred, “User, I Want to Shrink the Data.”

Downsizing massive scientific data streams seems a less fantastic voyage than science fiction’s occasional obsession with shrinking human beings, but it’s still quite a challenge. The $2.5 million, five-year early-career award will help Peterka accomplish that goal.

Researchers find more to do with each generation of massive and improved supercomputers. “We find bigger problems to run, and every time we do that, the data become larger,” says Peterka, a computer scientist at the DOE’s Argonne National Laboratory.

His project is addressing these problems by transforming data into a different form that is both smaller and more user-friendly for scientists who need to analyze that information.

“I see a large gap between the data that are computed and the knowledge that we get from them,” Peterka says. “We tend to be data-rich but information-poor. If science is going to advance, then this information that we extract from data must somehow keep up with the data being collected or produced. That, to me, is the fundamental challenge.”

Tom Peterka. Image by Wes Agresta courtesy of Argonne National Laboratory.

Computers have interested Peterka since he was a teenager in the 1980s, at the dawn of the personal computer era. “I’ve never really left the field. A background in math and science, an interest in technology – these are crosscutting areas that carry through all of my work.”

When Peterka entered the field, the problems dealt in gigabytes of data, a single gigabyte already exceeding the capacity of a compact disc. The hurdle now is measured in petabytes – each roughly 1.5 million CDs of data.

Since completing his doctorate in computer science at the University of Illinois at Chicago in 2007, Peterka has focused on scientific data and their processing and analysis. He works with some of DOE’s leading-edge supercomputers, including Mira and Theta at Argonne, Cori at Lawrence Berkeley National Laboratory and Titan at Oak Ridge National Laboratory.

The early-career award is helping Peterka develop a multivariate functional approximation tool that reduces a mass of data at the expense of just a bit of accuracy. He’s designing his new method with the flexibility to operate on a variety of supercomputer architectures, including the next-generation exascale machines whose development DOE is leading.

“We want this method to be available on all of them,” Peterka says, “because computational scientists often will run their projects on more than one machine.”

His new, ultra-efficient way of representing data eliminates the need to revert to the original data points. He compares the process to the compression algorithms used to stream video or open a JPEG image, but with an important difference: those compress data to store the information or transport it to another computer, and the data must be decompressed to their original form and size for viewing. With Peterka’s method, the data need not be decompressed before reuse.

“We have to decide how much error we can tolerate,” he says. “Can we throw away a percent of accuracy? Maybe, maybe not. It all depends on the problem.”
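The idea can be sketched in a few lines of Python. The example below is only an illustration – it stands in for Peterka’s multivariate functional approximation with an ordinary least-squares B-spline fit from SciPy – but it shows the essential trade: a continuous model with far fewer coefficients than the original samples, evaluable at any point without decompression, at the cost of a small, measurable error.

```python
# Illustrative sketch only -- not Peterka's MFA software. A least-squares
# cubic B-spline stands in for a continuous functional model of discrete data.
import numpy as np
from scipy.interpolate import make_lsq_spline

# "Raw" discrete data: 10,000 samples of a smooth synthetic field.
x = np.linspace(0.0, 1.0, 10_000)
y = np.sin(8 * np.pi * x) * np.exp(-2.0 * x)

# Knot vector for a cubic spline with ~60 coefficients (a hypothetical model size).
k = 3
interior = np.linspace(0.0, 1.0, 58)[1:-1]            # 56 interior knots
knots = np.r_[[0.0] * (k + 1), interior, [1.0] * (k + 1)]
model = make_lsq_spline(x, y, knots, k=k)

# The continuous model is evaluated directly at any coordinate --
# no need to reconstruct the original 10,000 samples first.
print("values at arbitrary points:", model([0.137, 0.5, 0.901]))

# Size vs. accuracy: how many coefficients replace the samples, and at what error?
ratio = y.size / model.c.size
max_err = np.max(np.abs(model(x) - y))
print(f"{ratio:.0f}x fewer coefficients, maximum error {max_err:.1e}")
```

In a scheme like this, the error tolerance comes down to how many coefficients the model keeps: fewer coefficients mean a smaller representation but a larger maximum error – exactly the judgment call Peterka describes.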

Peterka’s Argonne Mathematics and Computer Science Division collaborators are Youssef Nashed, assistant computer scientist; Iulian Grindeanu, software engineer; and Vijay Mahadevan, assistant computational scientist. They have already produced some promising early results and submitted them for publication.

The problems – from computational fluid dynamics and astrophysics to climate modeling and weather prediction – are “of global magnitude, or they’re some of the largest problems that we face in our world, and they require the largest resources,” Peterka says. “I’m sure that we can find similarly difficult problems in other domains. We just haven’t worked with them yet.”

The Large Hadron Collider, the Dark Energy Survey and other major experiments and expansive observations generate and accumulate enormous amounts of data. Processing those data has become vital to the discovery process, Peterka says – a fourth pillar of scientific inquiry alongside theory, experiment and computation. “This is what we face today. In many ways, it’s no different from what industry and enterprise face in the big-data world today as well.”

Peterka and his team work on half a dozen or more projects at a given time. Some sport memorable monikers, such as CANGA (Coupling Approaches for Next-Generation Architectures), MAUI (Modeling, Analysis and Ultrafast Imaging) and RAPIDS (Resource and Application Productivity through computation, Information and Data Science). Another project, called Decaf (for decoupled data flows), allows “users to allocate resources and execute custom code – creating a much better product,” Peterka says.

The projects cover a range of topics, but they all fit into three categories: software or middleware solutions; algorithms built on top of that middleware; or applications developed with domain scientists – all approaches necessary for solving the big-data science problem.

Says Peterka, “The takeaway message is that when you build some software component – and the multivariate functional analysis is no different – you want to build something that can work with other tools in the DOE software stack.”

Bill Cannon