May 2013

Big data’s breadth

It’s no wonder there’s a lot of talk about data-driven research, says David Brown, director of Lawrence Berkeley National Laboratory’s Computational Research Division. There’s an abundance of data in all areas of science and technology.

For example, there are the large experimental and observational facilities run by the DOE Office of Science, including the Advanced Photon Source at Argonne National Laboratory, the Spallation Neutron Source at Oak Ridge National Laboratory, the Joint Genome Institute, and Berkeley Lab’s Advanced Light Source (ALS), Molecular Foundry and National Center for Electron Microscopy.

The capacity of data-collection technology has grown at a pace exceeding Moore’s law. At the ALS, for example, researchers once were able to take their data home on a thumb drive. Now it takes multiple hard drives. Soon it will require the kind of data storage capacity typically associated with supercomputing centers.

That means research could stall at some facilities, Brown says, “because the old paradigm of taking the information home and turning it into science on your laptop is breaking down.” Part of the challenge is designing new scientific workflows involving computation at every stage. In the past, scientists thought of analysis as happening only in a project’s final stages. “Now, all sorts of computer hardware can be inserted” to begin analysis earlier, Brown says. An image detector, for instance, might perform computations right in its hardware, reducing the amount of data that must be moved.
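The idea of pushing computation toward the detector can be made concrete with a toy example. The sketch below, in Python, is not any facility’s actual pipeline; the frame size, threshold, and sparse layout are illustrative assumptions. It simply shows how keeping only the pixels that carry signal, right where the data are produced, can shrink what must be moved downstream.

```python
import numpy as np

def reduce_frame(frame, threshold=10):
    """Keep only pixels above a noise threshold, stored sparsely.

    Toy stand-in for near-detector data reduction: instead of shipping
    every raw pixel, transmit only the (row, col, value) triples that
    carry signal. Threshold and layout are illustrative assumptions.
    """
    rows, cols = np.nonzero(frame > threshold)
    values = frame[rows, cols]
    return np.column_stack((rows, cols, values))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulate a mostly empty 2048 x 2048 detector frame with a few bright spots.
    frame = rng.poisson(1.0, size=(2048, 2048))
    spots = rng.integers(0, 2048, size=(200, 2))
    frame[spots[:, 0], spots[:, 1]] += rng.integers(50, 500, size=200)

    reduced = reduce_frame(frame)
    print(f"raw frame: {frame.nbytes / 1e6:.1f} MB, "
          f"reduced: {reduced.nbytes / 1e6:.3f} MB "
          f"({frame.nbytes / reduced.nbytes:.0f}x smaller)")
```

Because most pixels in such a frame are background, the reduced representation is orders of magnitude smaller than the raw image, which is the point of doing the computation before the data leave the instrument.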

Many challenges involve mathematics. Brown points to the Linac Coherent Light Source (LCLS) at the SLAC National Accelerator Laboratory, the first light source that can examine small crystallized proteins, or nanocrystals. A rapid-fire X-ray laser probes a jet of nanocrystals, producing a scattering image for each one, but the orientation of the crystal in each image is unknown. New mathematical approaches are needed to determine those orientations simultaneously across the whole set of images.
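A minimal sketch of the underlying idea follows. It is not the analysis actually used at LCLS, where statistical methods work over many patterns at once; here a single toy one-dimensional “pattern” stands in for a diffraction image, a shift stands in for a rotation, and candidate orientations are scored by a normalized correlation. All names and parameters are illustrative assumptions.

```python
import numpy as np

def candidate_patterns(base, angles):
    """Simulate reference scattering patterns at each candidate orientation.

    A 1-D pattern rolled by an angle-dependent shift stands in for a full
    2-D diffraction image rotated in 3-D; purely illustrative.
    """
    n = base.size
    shifts = (angles / (2 * np.pi) * n).astype(int)
    return np.stack([np.roll(base, s) for s in shifts])

def best_orientation(measured, references, angles):
    """Pick the candidate orientation whose reference best matches the data.

    Scores each candidate with a normalized correlation; real analyses use
    statistical models fit jointly over many noisy patterns.
    """
    refs = references - references.mean(axis=1, keepdims=True)
    meas = measured - measured.mean()
    scores = refs @ meas / (np.linalg.norm(refs, axis=1) * np.linalg.norm(meas))
    return angles[np.argmax(scores)]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 360
    base = np.exp(-0.5 * ((np.arange(n) - 180) / 5.0) ** 2)  # one toy Bragg peak
    angles = np.linspace(0, 2 * np.pi, 120, endpoint=False)

    true_angle = angles[37]
    measured = np.roll(base, int(true_angle / (2 * np.pi) * n))
    measured = rng.poisson(50 * measured + 0.5)  # shot noise on the detector

    refs = candidate_patterns(base, angles)
    estimate = best_orientation(measured, refs, angles)
    print(f"true angle {np.degrees(true_angle):.1f} deg, "
          f"estimated {np.degrees(estimate):.1f} deg")
```

The toy recovers the orientation of a single noisy pattern by exhaustive comparison; the real difficulty at LCLS is that millions of such patterns, each weak and noisy, must have their orientations inferred together, which is what demands new mathematics.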