Computer Science
July 2021

Data alchemy

A Harvard DOE early-career awardee says scientific big-data management needs a major structural reorganization.

Online shopping and massive scientific computations like this cosmology simulation have something in common: Data must be optimally structured for smooth information transactions. Image courtesy of Dark Sky Simulations Collaboration.

Checking our email, video-meeting on Zoom, using digital wallets to buy groceries, and analyzing vast amounts of scientific data all have a common feature: information transaction. We connect with others via data systems, exchanging, storing and analyzing information as our inboxes refresh or our bills are processed.

Using this information requires deft data management. Zoom-chatting seems effortless because data are stored, compressed, decompressed, extrapolated and analyzed seamlessly in nanoseconds. But as online activities and consumers multiply, it can be difficult and costly to maintain the vast amounts of data generated while improving user experience.

The sciences, where data analysis is at the center of all operations, face similar issues. Astronomers, biologists, computational scientists and others collect massive information troves, relayed automatically from sensors or satellites. Scientific data also must be organized for later analysis. When the systems that store all this information aren’t optimized, processing can slow to a crawl.

Harvard University’s Stratos Idreos.

Researchers across the sciences have recognized that they must use sophisticated tools to manage their data, says Stratos Idreos, an associate professor at Harvard University. Idreos has made it his mission to assist with this challenge and is using an Early Career Research Program Award from the Department of Energy’s Office of Science to achieve it.

Over the past 50 years, Idreos says, data systems have been context-fixed and are therefore inflexible. “Context means the overall environment where data structures run. That is the shape of the data, the kinds of requests over the data and the hardware. All these define the complexity of the problem at hand.”

Today, business and science users constantly generate new contexts for data management. Idreos and his colleagues are developing a new kind of system that relies on mathematics, engineering and machine learning to automatically discover and design the core building blocks of data systems, known as data structures, in a way that’s as near-perfect a fit as possible for the task at hand.

Idreos first started considering the difficulty of moving information while studying distributed data systems and algorithms as an undergraduate at the Technical University of Crete in Greece. Early on, he realized that how data are linked and organized influences how they're sent, stored and then retrieved. Scientists' data needs are often unique, but a customized approach can be infeasible: Building data structures from scratch is technically daunting, even for researchers with significant data-management expertise, and it typically takes several months to design and implement one.

After a few years of studying data management, Idreos felt unsatisfied generating ad hoc solutions. “I wanted to do something more substantial in that area, that would give a general solution to the problem.” That’s what prompted him to think bigger and see if he could automate the entire data-system design process.


First, Idreos strips data systems down to their first principles – common building blocks, akin to an alphabet from which words, sentences and paragraphs can be assembled. “These are the ingredients out of which you synthesize your solutions,” he says. Every problem “can be broken down to a set of fundamental decisions about how data are physically stored in memory and on computer disks, and about how we form algorithms that access these data. For example, the first 16 bits of this data should be in location X and then connected with an in-memory pointer to these other 16 bits of data. It is a low-level set of decisions that make up first principles of how to store data and how to build algorithms for key-value data” – for instance, images stored in a common format, such as JPEG, optimized for the human eye.
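To make such low-level layout decisions concrete, here is a toy sketch in Python (hypothetical, not Idreos' actual system): a miniature key-value store assembled from two primitive choices, whether records are kept physically sorted and whether a hash index is layered on top. Each combination of choices yields a different data structure with a different lookup algorithm.

```python
# Hypothetical sketch: composing a toy key-value store from two
# primitive layout decisions -- physical ordering and an access path.
import bisect

class KVLayout:
    """A toy key-value store built from low-level design primitives."""

    def __init__(self, keep_sorted: bool, use_hash_index: bool):
        self.keep_sorted = keep_sorted        # physical-ordering decision
        self.use_hash_index = use_hash_index  # access-path decision
        self.records = []                     # (key, value) pairs
        self.index = {} if use_hash_index else None

    def put(self, key, value):
        if self.keep_sorted:
            bisect.insort(self.records, (key, value))  # keep sorted order
        else:
            self.records.append((key, value))          # append unsorted
        if self.index is not None:
            self.index[key] = value                    # maintain the index

    def get(self, key):
        if self.index is not None:            # O(1): hash-index lookup
            return self.index.get(key)
        if self.keep_sorted:                  # O(log n): binary search
            i = bisect.bisect_left(self.records, (key,))
            if i < len(self.records) and self.records[i][0] == key:
                return self.records[i][1]
            return None
        for k, v in self.records:             # O(n): full scan
            if k == key:
                return v
        return None
```

Flipping either flag changes how `get` must be implemented, which is the point: a data structure is the product of many such independent low-level decisions, and a real system makes far more of them than this two-flag sketch.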

Even the most complex data systems are composed of simple principles. “If we know those principles, and if we know the rules that govern the way they can be synthesized into complex systems, then we can compare and rank the possible solutions,” Idreos says. This is the design space, which “gives us the whole possible space of data structures that we may invent, even if it hasn’t been invented yet.” Computer scientists have published nearly 5,000 data structure designs. Idreos’ research shows there are more possible data structure designs than stars in the sky.
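A back-of-the-envelope calculation shows why the design space dwarfs the roughly 5,000 published designs. If each level of a data structure independently makes a handful of layout decisions, the combinations multiply. The decision categories and counts below are made up for illustration and are not taken from Idreos' papers:

```python
# Illustrative combinatorics: independent design decisions multiply.
# All category names and counts below are invented for this sketch.
from math import prod

decisions = {
    "physical ordering": 3,  # e.g. unsorted, sorted, hash-partitioned
    "index on top":      4,  # e.g. none, sparse, dense, learned
    "node size":         5,  # a few candidate page/block sizes
    "compression":       3,  # e.g. none, prefix, dictionary
    "pointer layout":    2,  # embedded vs. out-of-line pointers
}

per_level = prod(decisions.values())  # 360 combinations at one level
depth = 5                             # decisions re-made at 5 levels
total = per_level ** depth            # about 6 x 10^12 distinct designs

print(f"{per_level} choices per level -> {total:.1e} designs over {depth} levels")
```

Even these modest, invented numbers yield trillions of candidate structures, consistent with the article's "more designs than stars in the sky" framing and far beyond what anyone could enumerate by hand.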

Even after Idreos and his colleagues distill fundamental principles, they still must figure out which organizational scheme is most useful for a particular context. Building and testing all these data structures one by one would take millions of years. So the team has sought ways to evaluate each possible design’s utility without implementing and trying it. They built a new kind of system, named the Data Structure Alchemist, that assesses each design principle based on the context – the data and hardware where the desired data structure is expected to work. Then the system can synthesize complex design behavior from individual principles and deploy a machine learning-based search algorithm that continuously learns and becomes better at designing data structures. Generally used to help recognize patterns, machine learning can sort through the vast pool of solutions, seeking key features found in existing data management systems.

Idreos says the system “minimizes or even eliminates the human effort needed to design new data structures” and takes into account the data workload, hardware compatibility, a client’s budget restrictions and how quickly data should be searched and retrieved. And the more it’s used to design solutions for different contexts, the more it learns and can recognize patterns for which designs work best under which conditions, improving the method’s speed over time. Once a close-to-optimal solution is found, the system automatically codes the target design to deliver the desired data structure readymade for users or other systems.

Recently, Idreos has focused on three commonplace data-storage models: key-value structures, graphs, and images stored in formats like JPEG. “The overwhelming majority of modern images are only read by machine-learning algorithms and not seen by humans. The Data Structure Alchemist can generate new ways to store and process images based on the specific application – for instance, a self-driving car, which needs to recognize people, other cars, et cetera, automatically.”

Idreos says the approach presents challenges, including ensuring “near-optimal performance for a given context, how to utilize hardware properties across the memory hierarchy to better design data structures so we can make maximum use of hardware,” and how to minimize data movement as workloads shift. Data Structure Alchemist results so far suggest the process “can achieve up to two orders of magnitude better performance than state-of-the-art data structures, and near-optimal hardware utilization.”

With trillions of possible data structures – and with computer scientists studying only several dozen each year – the Data Structure Alchemist could automate and dramatically speed up computer science research, Idreos notes. “It has the potential to accelerate data-driven fields in both industry and sciences, and help students and educators understand the complex space of data structures.”

Eventually, he says, he’d like to devise “holistic reasoning of data structures” that besides searching for one optimal data structure at a time would “additionally reason about the interactions of multiple data structures coexisting in the same complex system.”