Computer scientists are democratizing artificial intelligence, devising a way to enable virtually anyone to train their own AI models, no matter how big and complex the models may be.

Their open-source AI training framework, AxoNN, is faster than several commercially available approaches and is a finalist for the Gordon Bell Prize of the Association for Computing Machinery. ACM will announce the winner in late November at the international supercomputing meeting, SC24, in Atlanta.

The group, led by the University of Maryland’s Abhinav Bhatele, focuses on the technologies behind OpenAI’s ChatGPT, Meta’s Llama and other so-called large language models, or LLMs. The models enable computers to write text and software and mimic other human abilities. But before an LLM can do its job, it must be trained on large amounts of data representing examples of those tasks.

An emerging challenge is that new LLMs are larger than ever before, and some are too big to fit on a single graphics processing unit (GPU), the workhorse of modern computing. That means the training needs to be done in parallel — broken into chunks that can be processed separately and concurrently on individual GPUs, and then combined.

“That’s where AxoNN comes in,” says Bhatele, associate professor of computer science and director of UMD’s Parallel Software and Systems Group. “AxoNN helps you parallelize your LLM training on lots of GPUs.”

The team has received a Department of Energy Innovative and Novel Computational Impact on Theory and Experiment (INCITE) award of 750,000 node hours across three DOE high-performance machines: Frontier at the Oak Ridge Leadership Computing Facility, and Aurora and Polaris at the Argonne Leadership Computing Facility. Frontier and Aurora are the world’s most powerful computers and the first and second machines to achieve exascale, the ability to perform a million trillion operations per second. Polaris is a leading-edge machine in AI, machine learning and parallelization.

Other frameworks, such as NVIDIA’s Megatron-LM and Microsoft’s DeepSpeed, also parallelize AI training. However, Bhatele notes, the most advanced AI training methods are likely being developed behind closed doors. Corporations can deploy large numbers of people to develop and train proprietary software, feeding huge volumes of sensitive data to dedicated GPU clusters.

Bhatele’s team is democratizing the process, enabling researchers, university students, small business owners and others to train AI programs on their own clusters, desktops or laptops. “We believe AxoNN is one of the few open-source frameworks that scales to large GPU counts and can enable people to do AI training and fine-tuning at a variety of scales,” he says.

Parallelizing AI training poses new challenges. LLMs are a particular kind of neural network, built from many nodes. Each node is programmed to mimic a neuron, and the nodes are arranged in layers. The user feeds batches of training data to an input layer; the LLM processes each batch, advances it through any number of hidden layers and produces results in an output layer.
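
The layered structure can be made concrete with a minimal sketch, written in generic PyTorch rather than drawn from AxoNN’s code; all names and sizes below are illustrative.

```python
# A minimal, illustrative network (not AxoNN code): batches enter an input
# layer, flow through hidden layers and emerge from an output layer.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, n_inputs=16, n_hidden=32, n_outputs=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),   # input layer feeding the first hidden layer
            nn.ReLU(),
            nn.Linear(n_hidden, n_hidden),   # a second hidden layer
            nn.ReLU(),
            nn.Linear(n_hidden, n_outputs),  # output layer
        )

    def forward(self, batch):
        return self.layers(batch)

model = TinyNet()
batch = torch.randn(8, 16)   # a batch of eight training examples
outputs = model(batch)       # one forward pass through every layer
print(outputs.shape)         # torch.Size([8, 4])
```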

One approach, or dimension, for training an LLM across GPUs is data parallelism. The user creates a copy of the full neural network on each of several GPUs, and each GPU processes a separate batch of training data. In each iteration, the trained models are synchronized. “An obvious problem with data parallelism is, if your model does not fit on a single GPU, you are out of luck,” Bhatele says.
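
The idea can be sketched with PyTorch’s off-the-shelf DistributedDataParallel wrapper; this is a generic illustration of data parallelism, not AxoNN’s implementation, and the model, sizes and loss are placeholders.

```python
# A minimal data-parallel sketch (generic PyTorch, not AxoNN): every process
# owns a full model replica, trains on its own batch, and gradients are
# averaged across GPUs on each backward pass.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # one process per GPU, e.g. launched with torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank)                # assumes a single node, for simplicity

    model = torch.nn.Linear(1024, 1024).cuda() # stand-in for a full copy of the network
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters())

    for _ in range(10):
        batch = torch.randn(32, 1024).cuda()   # each rank draws a different batch
        loss = model(batch).square().mean()    # toy loss
        loss.backward()                        # DDP averages gradients across ranks here
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    main()
```

Launched with, say, torchrun --nproc_per_node=4, four GPUs would each hold the whole model, which is exactly why the approach fails once a single copy no longer fits.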

Intra-layer parallelism can train an LLM that’s too big for one GPU. Also known as model parallelism or tensor parallelism, this approach distributes the work of each layer across two or more GPUs. This dimension can meet the LLM’s hefty memory demands but may exact a high communication cost when it pieces together outputs between layers.
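
The splitting itself can be illustrated with ordinary tensors; in a real run each slice would live on a different GPU, and the concatenation at the end is the communication step mentioned above. The sizes here are arbitrary.

```python
# An illustration of intra-layer (tensor) parallelism on one linear layer:
# the weight matrix is split by output feature, each "worker" computes its
# slice, and the slices are reassembled. In practice each slice sits on its
# own GPU and the reassembly is a collective communication call.
import torch

d_in, d_out, batch_size = 1024, 2048, 8
full_weight = torch.randn(d_out, d_in)
x = torch.randn(batch_size, d_in)

w0 = full_weight[: d_out // 2]     # worker 0's shard of the layer
w1 = full_weight[d_out // 2 :]     # worker 1's shard of the layer

y0 = x @ w0.T                      # worker 0 computes half the outputs
y1 = x @ w1.T                      # worker 1 computes the other half
y = torch.cat([y0, y1], dim=1)     # communication: stitch the halves together

assert torch.allclose(y, x @ full_weight.T, atol=1e-5)  # matches the unsplit layer
```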

A third dimension is inter-layer or pipeline parallelism, with one or multiple neural network layers installed on a single GPU. This method reduces idle time by feeding the layer — and the GPU — bite-sized micro-batches of data, which it processes and passes on to the next layer. This approach creates a pipeline in which the second and later GPUs are processing the first micro-batch while the first GPU grinds away on other pieces of the total batch. However, communication needs and some persistent idle time can rack up inefficiencies when the LLM needs many GPUs.
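
A toy schedule makes the micro-batches and the leftover idle time visible; the numbers are arbitrary, and this only prints the schedule rather than running a real pipeline.

```python
# A toy pipeline schedule: with 2 stages (GPUs) and 4 micro-batches, stage 1
# starts on micro-batch 0 as soon as stage 0 finishes it, while stage 0 moves
# on to micro-batch 1. The "idle" slots at the ends are the pipeline bubble.
n_stages, n_microbatches = 2, 4

for step in range(n_stages + n_microbatches - 1):
    slots = []
    for stage in range(n_stages):
        mb = step - stage                      # stage s reaches micro-batch m at step s + m
        if 0 <= mb < n_microbatches:
            slots.append(f"GPU {stage}: micro-batch {mb}")
        else:
            slots.append(f"GPU {stage}: idle")
    print(f"step {step}: " + " | ".join(slots))
```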

More recently, hybrid frameworks such as Megatron-LM and DeepSpeed have become popular by combining two or even all three approaches, Bhatele says. “People have seen that if you use all the dimensions of parallelism, you can scale your training to a much larger number of GPUs.”
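
The arithmetic behind using all the dimensions at once is simple: the GPU counts along each dimension multiply. The numbers below are purely hypothetical.

```python
# Hypothetical hybrid configuration: the three dimensions of parallelism
# multiply to give the total GPU count for a training job.
tensor_parallel_gpus = 8      # GPUs sharing the work inside each layer
pipeline_stages = 4           # groups of layers placed on successive GPUs
data_parallel_replicas = 16   # copies of the (already sharded) model

total_gpus = tensor_parallel_gpus * pipeline_stages * data_parallel_replicas
print(total_gpus)             # 512 GPUs in this made-up example
```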

About four years ago, Bhatele and his graduate students started test-driving available hybrid frameworks and soon spotted ways to improve performance.

Most existing frameworks rely on approaches such as blocking communication and fixed scheduling, which are reliable and relatively easy to implement but whose communication delays can add up to big inefficiencies. A hybrid version of AxoNN, combining intra-layer and data parallelism while avoiding these approaches, outperformed both Megatron-LM and DeepSpeed. “We have found that our framework is currently one of the fastest ones and does much better in terms of performance compared to other existing frameworks,” Bhatele says.
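
The general idea of avoiding blocking communication can be sketched with PyTorch’s asynchronous collectives; this is not AxoNN’s internal scheme, just an illustration of overlapping a gradient reduction with useful computation, and the function and argument names are invented for the example.

```python
# Overlapping communication with computation (generic PyTorch sketch, not
# AxoNN internals). Assumes a process group has already been initialized.
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket, next_input, next_layer):
    # Start reducing one bucket of gradients without waiting for it to finish.
    handle = dist.all_reduce(grad_bucket, async_op=True)

    # Useful work continues while the network transfer is in flight.
    activations = next_layer(next_input)

    # Block only at the moment the averaged gradients are actually needed.
    handle.wait()
    return activations, grad_bucket
```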

In another potential route to improved performance, the group is working to use alternatives to the widely used transformer network (the “T” in “GPT”). INCITE award co-principal investigator Tom Goldstein, UMD’s Volpi-Cupal Endowed Professor of Computer Science, is exploring architectures that use less memory and are more adaptive, such as recurrent neural networks. Bhatele frames the problem: “Could you use such alternatives to transformers to achieve the same tasks, like generating text or captioning images and so on?”

Since the researchers began their allocation in January, they have used Frontier to explore how efficiently AxoNN scales with problem size and GPUs. They are also pushing the framework’s limits. The size of an LLM is measured in its number of parameters, or “the number of numbers in the model,” Bhatele explains. For example, the LLM named GPT-4 has more than 1 trillion parameters. “So can we train a language model with more than a trillion parameters on Frontier? How slow is that or how fast is that, and can we get that to be as efficient as possible?”
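
For a sense of what ‘the number of numbers’ means in code, the parameters of any PyTorch model can be tallied directly; the toy model below is a stand-in, several orders of magnitude smaller than a trillion-parameter LLM.

```python
# Counting parameters, i.e. "the number of numbers in the model," for a toy
# stand-in. A trillion-parameter LLM is this same count, vastly larger.
import torch.nn as nn

toy_model = nn.Sequential(
    nn.Embedding(50_000, 512),   # token embeddings
    nn.Linear(512, 512),         # a single hidden layer
    nn.Linear(512, 50_000),      # output projection over the vocabulary
)
n_parameters = sum(p.numel() for p in toy_model.parameters())
print(f"{n_parameters:,}")       # about 51.5 million for this toy example
```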

Big challenges lie beyond the scope of the current allocation. Bhatele cites the recent mixture-of-experts approach, a mash-up of small expert LLMs into one big, coordinated model, which poses novel challenges for parallelized training.

The team also is looking at how massive LLMs will function after they have been trained. When a user poses questions of such a model, the inference will also have to run on more than one GPU, Bhatele says. “We are starting to investigate how you can use AxoNN for inference, and can you parallelize the inference across multiple GPUs and even multiple nodes?”

Bill Cannon
