How to scale LLM workloads to 20B+ with Amazon SageMaker using Hugging Face and PyTorch FSDP
A couple of weeks ago, Together released an instruction-tuned large language model, fine-tuned for chat from EleutherAI's GPT-NeoX-20B with over 43 million instructions under an Apache-2.0 license! It is the first big open-source alternative to OpenAI's ChatGPT. The model might not perform as well as ChatGPT on your domain-specific prompts and inputs out of the box, but that's something we can improve through fine-tuning. In this post, we are going to learn how to use PyTorch FSDP on Amazon SageMaker with Hugging Face to fine-tune LLMs on a multi-node, multi-GPU setup.
In this tutorial, you will learn how to:
- Setup Environment
- Load and prepare chat dataset
- Fine-tune the GPT model using FSDP on Amazon SageMaker
What is PyTorch Fully Sharded Data Parallel (FSDP)?
PyTorch FSDP (Fully Sharded Data Parallel) is an extension of data parallelism that enables efficient large-scale training of LLMs. With FSDP, each GPU stores only a subset of the model and associated optimizer states and gradients and can optionally offload the sharded model parameters to CPUs. This helps maximize the overlap between network communication and model computation, reducing the memory footprint on GPUs.
[Introducing PyTorch Fully Sharded Data Parallel (FSDP) API]
FSDP optimizations include:
- Transformer Wrapping Policy
- Mixed Precision (bf16)
- Activation Checkpointing (Gradient Checkpointing)
- Full Sharding Strategy
PyTorch FSDP is natively integrated into the Hugging Face Trainer, making it easy to adapt and use. You can learn more about PyTorch FSDP in the Efficient Large-Scale Training with PyTorch FSDP and AWS or Introducing PyTorch Fully Sharded Data Parallel (FSDP) API blog posts.
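As a rough sketch of what this integration looks like through the Trainer (the values here mirror the settings we use later on SageMaker; the layer class to wrap depends on the model architecture and is an assumption for GPT-NeoX):

```python
from transformers import TrainingArguments

# Minimal sketch: enabling FSDP through the Hugging Face Trainer.
training_args = TrainingArguments(
    output_dir="./out",
    bf16=True,                                          # mixed precision (bf16)
    gradient_checkpointing=True,                        # activation checkpointing
    fsdp="full_shard auto_wrap",                        # full sharding + transformer wrapping policy
    fsdp_transformer_layer_cls_to_wrap="GPTNeoXLayer",  # wrap each GPT-NeoX transformer block
)
```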
Now, let's get started! 🚀
Tutorial
In this tutorial, we will fine-tune the new GPT-NeoXT-Chat-Base-20B on the ELI5 dataset to improve the explanation and question-answering skills of the agent. The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers.
GPT-NeoXT-Chat-Base is a 20B open-source LLM, which makes it hard to fine-tune on a single GPU or even a single node with multiple GPUs. We are going to use the Amazon SageMaker managed training platform as our infrastructure backbone to help us create a multi-node cluster and easily run our distributed training. As instances, we will use 2x p4d.24xlarge instances, which come with 8x NVIDIA A100 40GB GPUs each.
Note: You might have to increase and request a quota for those instances.
As the distributed training framework, we will use PyTorch FSDP + the Hugging Face Transformers Trainer, which makes it super easy to distribute our model and data in a fully sharded way across all our nodes and GPUs.
1. Setup Environment
The first step is installing the Hugging Face Libraries, including transformers, datasets, and sagemaker. Running the following cell will install all the required packages.
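A sketch of that install cell; the pinned versions are assumptions and newer releases should also work:

```python
# Install the Hugging Face libraries and the SageMaker Python SDK.
# Version pins are illustrative assumptions.
!pip install "transformers==4.28.1" "datasets[s3]==2.9.0" "sagemaker>=2.150.0" --upgrade --quiet
```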
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
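A minimal sketch of that setup, assuming you run either inside a SageMaker notebook or locally with AWS credentials configured (the fallback role name is an assumption):

```python
import sagemaker
import boto3

sess = sagemaker.Session()
# Bucket used for uploading data, models, and logs.
# SageMaker will automatically create this bucket if it does not exist.
sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Running outside of SageMaker: look up an existing execution role (name is an assumption).
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
```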
2. Load and prepare the dataset
As the base dataset, we will use the ELI5 dataset, but before fine-tuning the model, we need to preprocess the data. We will create a "chat" version of the dataset by adding `<user>` and `<bot>` tokens and an end-of-sequence `<|endoftext|>` token to help the model learn to distinguish consecutive examples. Additionally, we create chunks of `2048` tokens (the model max length) to avoid unnecessary padding and computation.
The first step is to load our dataset from Hugging Face. The dataset contains 272,634 samples for `eli5`. We will downsample the dataset to 25,000 samples to make it more realistic for real-world use cases.
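A sketch of that loading and downsampling step; the `eli5` dataset ID and the `train_eli5` split come from the Hugging Face Hub, and the shuffling seed is an arbitrary assumption:

```python
from datasets import load_dataset

# Load the ELI5 dataset (train split of the ELI5 subreddit) from the Hugging Face Hub.
dataset = load_dataset("eli5", split="train_eli5")

# Downsample to 25,000 examples to keep the example closer to a realistic fine-tuning budget.
dataset = dataset.shuffle(seed=42).select(range(25_000))

print(f"number of samples: {len(dataset)}")
```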
An ELI5 sample can include multiple answers to a “question”. We will select the answer with the highest user score for our explanation.
Note: This dataset would also be a good candidate for reinforcement learning, training the model to generate answers that receive higher user scores. Let me know if you are interested in an example of that.
The next step is to convert our dataset into a chat version. Here we will follow the instructions on the Model card and add the EOS token.
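A minimal sketch of that conversion, using the `<user>`/`<bot>` format described above and picking the highest-scored answer; the helper name and the field access follow the ELI5 schema but should be treated as assumptions:

```python
EOS_TOKEN = "<|endoftext|>"  # end-of-sequence token mentioned above

def to_chat(sample):
    # Pick the answer with the highest user score as the "bot" response.
    best_idx = sample["answers"]["score"].index(max(sample["answers"]["score"]))
    answer = sample["answers"]["text"][best_idx]
    # Build the chat-style sample and terminate it with the EOS token.
    sample["text"] = f"<user>: {sample['title']}\n<bot>: {answer}{EOS_TOKEN}"
    return sample

# Replace the original columns with a single formatted "text" column.
dataset = dataset.map(to_chat, remove_columns=list(dataset.features))
```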
The last step of the data preparation is to tokenize and chunk our dataset. We tokenize our inputs (text) into token IDs, which the model can understand. Additionally, we concatenate our dataset samples into chunks of `2048` tokens to avoid unnecessary padding.
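A sketch of the tokenize-and-chunk step, following the standard causal-language-modeling grouping pattern; the model ID is the checkpoint discussed above and the block size is the model max length:

```python
from itertools import chain
from transformers import AutoTokenizer

model_id = "togethercomputer/GPT-NeoXT-Chat-Base-20B"  # checkpoint we fine-tune; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
block_size = 2048  # model max length discussed above

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate all sequences in the batch, then split them into fixed-size blocks.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = (
    dataset.map(tokenize, batched=True, remove_columns=list(dataset.features))
    .map(group_texts, batched=True)
)
print(f"total number of chunks: {len(lm_dataset)}")
```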
After we have processed the dataset, we are going to use the new FileSystem integration to upload our dataset to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
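A sketch of that upload, assuming the `datasets` FileSystem (s3://) integration and the session bucket created earlier; the key prefix is an arbitrary assumption:

```python
# Save the processed dataset directly to S3 via the datasets FileSystem integration.
# The "processed/eli5" prefix is an arbitrary choice.
training_input_path = f"s3://{sess.default_bucket()}/processed/eli5/train"
lm_dataset.save_to_disk(training_input_path)

print(f"uploaded data to: {training_input_path}")
```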
Our dataset is now prepared, and we can start fine-tuning our model. 🥳
3. Fine-tune the GPT model using FSDP on Amazon SageMaker
As mentioned in the beginning, we will use Amazon SageMaker and PyTorch FSDP to train our model. Amazon SageMaker makes it easy to create a multi-node cluster to train our model in a distributed manner. The `sagemaker` Python SDK recently added support for running training jobs with `torchrun`, which distributes the script across multiple nodes and GPUs.
To use `torchrun` to execute our scripts, we only have to define the `distribution` parameter in our Estimator and set it to `{"torch_distributed": {"enabled": True}}`. This tells SageMaker to launch our training job with `torchrun`.
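In code, this is a single argument, which also appears in the full Estimator further below:

```python
# Passed to the HuggingFace Estimator; tells SageMaker to launch the job via torchrun.
distribution = {"torch_distributed": {"enabled": True}}
```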
To use FSDP with the Hugging Face Trainer, we need to provide our `fsdp` strategy as well as the transformer layer wrap policy. In our example, we will use `full_shard auto_wrap` and `GPTNeoXLayer` as the transformer layer policy. If you run this example and change the model ID, make sure to also adjust the transformer layer policy.
We prepared a run_clm.py script, which implements causal language modeling and accepts our FSDP settings and other hyperparameters.
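A sketch of the hyperparameters we hand to that script; the argument names and values are assumptions and have to match what run_clm.py actually parses:

```python
# Hyperparameters passed to run_clm.py as command-line arguments (names are assumptions).
hyperparameters = {
    "model_id": "togethercomputer/GPT-NeoXT-Chat-Base-20B",  # model to fine-tune
    "dataset_path": "/opt/ml/input/data/training",           # where SageMaker mounts our S3 data
    "epochs": 3,
    "per_device_train_batch_size": 1,
    "lr": 5e-5,
    "bf16": True,                                            # mixed precision
    "gradient_checkpointing": True,                          # activation checkpointing
    "fsdp": '"full_shard auto_wrap"',                        # extra quotes so the space survives the CLI
    "fsdp_transformer_layer_cls_to_wrap": "GPTNeoXLayer",    # transformer layer policy for GPT-NeoX
}
```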
To create a SageMaker training job, we create a `HuggingFace` Estimator and provide all our information. SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then, it starts the training job by launching our script with `torchrun`.
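A sketch of that Estimator, assuming the role and hyperparameters defined above; the source directory and framework versions are assumptions, so check the available Hugging Face DLC versions for your region:

```python
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",          # training script described above
    source_dir="scripts",              # assumed directory containing the script
    instance_type="ml.p4d.24xlarge",   # 8x NVIDIA A100 40GB per instance
    instance_count=2,                  # multi-node: 2 instances = 16 GPUs
    role=role,                         # IAM role created earlier
    base_job_name="gpt-neoxt-chat-fsdp",
    transformers_version="4.26",       # assumed DLC versions; adjust to an available container
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
    distribution={"torch_distributed": {"enabled": True}},  # launch via torchrun, as described above
)
```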
We can now start our training job with the `.fit()` method, passing our S3 path to the training script.
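For example (the `training` channel name has to match the `dataset_path` set in the hyperparameters above):

```python
# "training" becomes /opt/ml/input/data/training inside the container.
data = {"training": training_input_path}

huggingface_estimator.fit(data, wait=True)
```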
The training took 9407 seconds, which is about 2.6 hours. The `ml.p4d.24xlarge` instance we used costs $37.688 per hour, so the total cost for training `GPT-NeoXT-Chat-Base-20B` is (2.6h * $37.688) * 2 instances, which results in about $197. We could reduce the cost by using Spot Instances or Parameter-Efficient Fine-Tuning.
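As a quick sanity check on that estimate:

```python
# Back-of-the-envelope cost check for the numbers quoted above.
seconds = 9407
price_per_hour = 37.688   # ml.p4d.24xlarge on-demand price quoted above
instances = 2

hours = seconds / 3600                      # ~2.61 h
total = hours * price_per_hour * instances  # ~196.9 USD
print(f"{hours:.2f} h -> ${total:.2f}")
```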
Conclusion
In conclusion, Amazon SageMaker and PyTorch FSDP with Hugging Face Transformers provide an efficient way to train LLMs for various applications. With FSDP and its integration into Hugging Face Transformers, we can train LLMs on a multi-node, multi-GPU setup without significant engineering investment.
If you're interested in improving the performance of LLMs or even training your own LLM from scratch, you should consider Amazon SageMaker + Hugging Face Transformers as a good option.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.