University of Illinois computer scientist William Gropp is an expert at making code run fast.
When it comes to tools that help programmers optimize code performance in today’s increasingly complex software, there is a “really huge need,” says Gropp, a Gordon Bell Prize-winner for high-performance computer applications.
“Whether for single- or multi-core processors, codes typically provide less – sometimes lots less – performance than they should when just using tools like optimizing compilers,” which convert programs into executable code, says Gropp, deputy director for research at Illinois’ Institute for Advanced Computing Applications and Technologies and the Paul and Cynthia Saylor Professor of Computer Science at the Urbana-Champaign campus. “Many papers demonstrate that you can get significantly better performance if you put in special effort.”
Some of those studies suggest that most scientific applications attain only a small fraction of achievable performance. One hope for enlarging that fraction is Orio (Greek for “speed limit”), a program being developed by Boyana Norris, a computer scientist in the Mathematics and Computer Science Division at the Department of Energy’s Argonne National Laboratory (ANL). Norris also is a senior fellow at the University of Chicago’s Computation Institute.
Norris and Gropp started talking about ways to optimize code in 2005. Gropp knew a programmer could improve code performance by hand – going in and making optimizations manually – but that requires specific codes for specific machines.
“The machines change every few years,” Gropp says, “and the two biggest machines within DOE have very different architectures.”
Manual optimization also is time-consuming and can make code more difficult to read. For instance, a few lines of easy-to-read code could become hundreds of lines of optimized code. By contrast, libraries that optimize specific program tasks can be effective and easy to use, but they only work in limited cases. What programmers really need are general-purpose optimization tools, but few exist.
Even a general-purpose tool, however, cannot magically jump into a program and make it flawlessly fast. So instead of completely automating the process, Gropp sought an approach allowing a programmer to interact with an optimization tool. He worked on a prototype technique that lets a programmer direct the optimization tool, essentially telling it where to improve the code’s efficiency.
As Norris explains: “Bill had been looking at performance tuning of important numerical kernels and was looking at automating this process without requiring one to write a full-blown compiler. To do this, he introduced comments – annotations – in the code that specify in a simplified way what is being computed as well as what transformations should be performed. He had a little prototype implemented that could generate C code” – a programming language originally developed for Unix systems – “for simple computations.”
Norris continued to explore these ideas with Gropp and enlisted Albert Hartono, a graduate student in computer science and engineering at Ohio State University (OSU). With this team, Norris says, “We designed a more extensible system that can process different kinds of annotations and generate many versions of the code after applying different types of optimizations.”
That’s the basis of Orio.
Written in Python (a programming language capable of running on many kinds of systems), Orio lets a programmer insert annotations in C or Fortran code. These annotations – simply written as comments that a C compiler, for example, ignores – show Orio where to go to work. A programmer also can use the annotations to tell Orio a little bit about what the section of code does. Hartono led the programming and Norris led the design work to turn the idea into a tool.
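As a rough illustration of the idea, the sketch below marks a loop with a comment-style annotation. The `@tune` comment syntax and the function name `axpy_annotated` are hypothetical, made up for this example, and are not Orio's actual annotation grammar. The point is only that a C compiler skips the comments, so the program still builds and runs unchanged, while a separate tool could read them.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] += a * x[i] for every i in [0, n). */
void axpy_annotated(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* @tune begin: unroll(factor=4)  (hypothetical annotation, not Orio syntax) */
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
    /* @tune end */
}
```

Calling `axpy_annotated(3, 2.0, x, y)` behaves exactly as it would without the two comments, which is what lets the annotated source stay readable and portable.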
Hartono and Norris developed Orio to provide two key capabilities: source-to-source transformations and automatic performance tuning. For source-to-source transformations, Orio starts with the C code plus the annotations, or performance-tuning tips, the programmer inserts. Orio pulls out the annotated sections and sends them to transformation modules, which develop ways to speed up the sections.
In particular, Orio includes transformation modules for compiler optimizations, including simple loop unrolling, memory-alignment optimization, loop unroll/jamming, loop tiling, loop permutation, scalar replacement, register tiling, loop-bound replacement, array-copy optimization, multicore parallelization and optimizations specific to a particular computer architecture. Then a code generator adds the optimized sections to the original code.
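Simple loop unrolling, the first transformation in that list, can be pictured with a small hand-written sketch. The function names `scale_add_simple` and `scale_add_unrolled4` are assumptions for this example, and the unrolled version is written by hand to show the shape of the transformation, not copied from Orio's output.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward version: one element per loop iteration. */
void scale_add_simple(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}

/* Unrolled by a factor of 4: fewer loop-control steps per element. */
void scale_add_unrolled4(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    size_t i = 0;
    /* Main loop handles four elements at a time. */
    for (; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    /* Cleanup loop handles the 0 to 3 elements left over. */
    for (; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

Even this small case shows why hand optimization hurts readability: the three-line loop grows a second loop and four near-identical statements, and larger unroll factors or combined transformations grow it further.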
“These are actually typical compiler optimizations, but compilers do not necessarily do them because they cannot determine that it is safe to do so,” Norris explains.
Since Orio generates various ways to optimize each annotated section, it also can provide automatic performance tuning. Orio tries the different versions of optimized code to see which runs fastest. The best-performing code gets embedded in the original program.
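The try-and-measure step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Orio's implementation: the names `kernel_fn`, `variant_plain`, `variant_unroll4`, `time_variant` and `pick_fastest` are all hypothetical, and timing uses the standard C `clock()` function for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Every candidate version of the kernel shares this signature. */
typedef void (*kernel_fn)(size_t n, double a, const double *x, double *y);

void variant_plain(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

void variant_unroll4(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++) y[i] += a * x[i];
}

/* Runs one variant `reps` times and returns the elapsed CPU time in seconds. */
double time_variant(kernel_fn f, size_t n, const double *x, double *y, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        f(n, 0.5, x, y);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Times every variant and returns the index of the fastest one. */
size_t pick_fastest(const kernel_fn *variants, size_t count,
                    size_t n, const double *x, double *y, int reps)
{
    assert(count > 0);
    size_t best = 0;
    double best_time = time_variant(variants[0], n, x, y, reps);
    for (size_t v = 1; v < count; v++) {
        double t = time_variant(variants[v], n, x, y, reps);
        if (t < best_time) {
            best_time = t;
            best = v;
        }
    }
    return best;
}
```

Which index wins depends on the machine, the compiler and the problem size, which is the reason for measuring rather than guessing.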
When it comes to finding the fastest code for any annotated section, “you can’t try everything,” Norris says. “The optimization space is gigantic,” so “we have different strategies to narrow the options you need to search.” Specifically, Orio limits how it searches for the best code and how hard it tries to find it.
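One simple way to limit both how a tool searches and how hard it tries is to sample the space at random under a fixed budget of trials. The sketch below assumes that strategy purely for illustration; the article does not say which strategies Orio uses. The `config` type, the `toy_cost` function (a stand-in for "build and time this version") and `random_search` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One point in the optimization space: an unroll factor and a tile size. */
typedef struct {
    int unroll;
    int tile;
} config;

typedef double (*cost_fn)(config c);

/* Hypothetical stand-in for measured run time; lower is better. */
double toy_cost(config c)
{
    double du = (double)(c.unroll - 4);
    double dt = (double)(c.tile - 32);
    return 1.0 + du * du + 0.01 * dt * dt;
}

/* Small linear congruential generator so the search is repeatable. */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Evaluates at most `budget` configurations and returns the best one seen. */
config random_search(cost_fn cost,
                     const int *unrolls, size_t n_unrolls,
                     const int *tiles, size_t n_tiles,
                     size_t budget, uint32_t seed)
{
    assert(n_unrolls > 0 && n_tiles > 0 && budget > 0);

    /* The first trial is the first configuration in the space. */
    config best = { unrolls[0], tiles[0] };
    double best_cost = cost(best);
    uint32_t state = seed;

    for (size_t k = 1; k < budget; k++) {
        config c;
        c.unroll = unrolls[next_random(&state) % n_unrolls];
        c.tile = tiles[next_random(&state) % n_tiles];
        double t = cost(c);
        if (t < best_cost) {
            best_cost = t;
            best = c;
        }
    }
    return best;
}
```

The budget caps the number of test runs no matter how large the space is; the trade-off is that the result is the best version found, not necessarily the best version that exists.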
To further expand Orio’s capabilities, Norris and Hartono made it capable of incorporating other tools. With parallel hardware, for example, Orio can use Pluto – a tool developed at OSU – to automatically make sections of code run in parallel and to keep data as nearby as possible, all making the program run faster. Orio also can make loops run faster by using PrimeTile, which speeds up loops by, in part, keeping needed data accessible.
In one test, Norris and her colleagues used Orio on two systems: an Intel Xeon workstation and a Blue Gene/P supercomputer, both located at ANL. They set Orio and several other optimizers, including ones that use optimization libraries, to work on speeding up code for solving simple algebra problems.
The Orio-optimized code was consistently the fastest, in both sequential and parallel computing. In some cases, the Orio-enhanced code outperformed the others by factors of 4 to 6.
“Orio is able to generate many versions of the implementation,” Norris says, “and our experience shows that the performance achieved is better than manually written implementations, including those that call highly optimized math libraries.”
By testing Orio on two computers, Norris and Hartono showed the software works on different architectures. Beyond that, an optimizer also must find good solutions that work with a specific architecture. “With Orio, the Intel machine gives different optimized code than you get on the Blue Gene/P,” Norris notes. “When you run Orio on different architectures, you are almost guaranteed to get different code because the memory hierarchies are different.”
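Loop tiling is a convenient way to see why the memory hierarchy matters. In the sketch below, a matrix transpose is processed in square blocks whose side length is a parameter. The function name `transpose_tiled` and the choice of a transpose are assumptions for this example; the general expectation is that the block size that runs fastest depends on the cache sizes of the machine, so a tuner may settle on different values on different systems.

```c
#include <assert.h>
#include <stddef.h>

/* Transposes an n-by-n row-major matrix `in` into `out`,
   working through the matrix in tile-by-tile blocks. */
void transpose_tiled(size_t n, size_t tile, const double *in, double *out)
{
    assert(tile > 0);

    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp block edges so partial blocks at the border work. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;

            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    out[j * n + i] = in[i * n + j];
                }
            }
        }
    }
}
```

Every tile size produces the same transposed matrix; only the order of memory accesses, and therefore the speed, changes.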
In addition, Norris and her colleagues have used Orio to work on codes that simulate fusion reactors and particle accelerators. “The core of this is a set of loops that iterate and update the discretized electric and magnetic fields,” Norris says. “We create annotations for those sections and we have achieved some pretty good results with Orio.”
In creating new versions of code for annotated sections, though, one must ask: How do you know that the optimized code produces the same answer as the original code?
“That is a very critical question,” Gropp says. “Typically, when a developer optimizes code, it gets changed. The developer might compare the results of the original and ‘optimized’ code at that time, but then the code could get permanently changed, leaving nothing for future comparisons.”
That is, new code could replace previous code and the programmer would move ahead – assuming that the old and new code would produce the same answer under all circumstances.
“That new code, though, is probably more complicated,” Gropp says, “and the compiler could make a mistake dealing with it or the new code might be ‘buggy’ in a way that is not obvious until some specific feature is used.”
With Orio’s annotations approach, the program keeps the original, straightforward code for comparison. “Orio gives you the possibility to check this simpler, easier-to-understand code that you saved with the Orio-generated code and the compiler code generated by the Orio optimization,” Gropp says.
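Keeping the original code around makes a check like the following possible. This is a minimal sketch, assuming the simple and optimized versions each write their results to an array of doubles; the function name `results_match` and the relative-tolerance test are choices made for this example, not a description of how Orio validates its output. A tolerance is used because optimizations that reorder floating-point arithmetic can change rounding slightly without being wrong.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every optimized result is within rel_tol (relative)
   of the reference result, and 0 otherwise. NaN values never match. */
int results_match(size_t n, const double *ref, const double *opt, double rel_tol)
{
    assert(rel_tol >= 0.0);

    for (size_t i = 0; i < n; i++) {
        double diff = ref[i] - opt[i];
        if (diff < 0.0) diff = -diff;

        double a = ref[i] < 0.0 ? -ref[i] : ref[i];
        double b = opt[i] < 0.0 ? -opt[i] : opt[i];
        double scale = a > b ? a : b;

        /* Written as a negated <= so that a NaN anywhere fails the check. */
        if (!(diff <= rel_tol * scale)) {
            return 0;
        }
    }
    return 1;
}
```

A test harness could run the saved simple loop and the generated loop on the same inputs and call `results_match` on the two outputs each time the code is regenerated, so the comparison is repeatable rather than a one-time check.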
Norris plans to build on Orio’s capabilities.
For some computing challenges, programmers could use higher-level tools. For example, if a computation includes a string of matrix-vector operations – such as multiplying a matrix (data arranged in rows and columns) by a vector (a single column of data) – “it is easier to express that in a higher language, like MATLAB,” says Norris. So she and her colleagues are working on a language that lets a user write in MATLAB, which gets converted to C or Fortran and then optimized with Orio.
Orio also makes it possible for other developers to add capabilities. “We designed it to be extensible,” she says. “It shouldn’t be too difficult for someone to write simple parsers to extend support to a different language. The annotations do not have to be in a fixed syntax, so you should be able to play around with different languages.”
Norris also has her own plans for adding more to Orio. It’s “a very new tool, with a long list of planned features and improvements,” she says. “Some of the ongoing and future work includes making the empirical search more efficient.”
She goes on: “That is, when we test the performance of the generated code, we need to check many different inputs and other parameters, resulting in having to do a lot of runs. This process could be improved by using optimization methods that lead us to the best-performing version without requiring a huge number of test runs.”
No matter how Orio advances, Norris wants to maintain its original purpose: making computing more efficient. “While we want Orio to interoperate with other tools when they are available, we will always maintain a standalone, portable version that does not require a user to install a lot of complicated extra packages before they can use it.”
In that way, Orio will bring more-optimized code to more programmers.