Combine Amazon SageMaker and DeepSpeed to fine-tune FLAN-T5 XXL

Published on
13 min read
View Code

FLAN-T5, released with theScaling Instruction-Finetuned Language Modelspaper, is an enhanced version of T5 that has been fine-tuned in a mixture of tasks, or simple words, a better T5 model in any aspect. FLAN-T5 outperforms T5 by double-digit improvements for the same number of parameters. Google has open sourced 5 checkpoints available on Hugging Face ranging from 80M parameter up to 11B parameter.

In a previous blog post, we already learned how to “Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers”. In this blog post, we look into how we can integrate DeepSpeed into Amazon SageMaker to allow any practitioners to train those billion parameter size models with a simple API call. Amazon SageMaker managed training allows you to train large language models without having to manage the underlying infrastructure. You can find more information about Amazon SageMaker in the documentation.

This means we will learn how to fine-tune FLAN-T5 XL & XXL using model parallelism, multiple GPUs, and DeepSpeed ZeRO on Amazon SageMaker.

The blog post is structured as follows:

  1. Process dataset and upload to S3
  2. Prepare training script and deepspeed launcher
  3. Fine-tune FLAN-T5 XXL on Amazon SageMaker

before we start, let’s install the required libraries and make sure we have the correct permissions to access S3.

!pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker --upgrade

If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find here more about it.

import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it not exists
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

1. Process dataset and upload to S3

Similar to the “Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers” we need to prepare a dataset to fine-tune our model. As mentioned in the beginning, we will fine-tune FLAN-T5-XXL on the CNN Dailymail Dataset. The blog post is not going into detail about the dataset generation. If you want to learn the detailed steps check out the previous post.

We define some parameters, which we use throughout the whole example, feel free to adjust it to your needs.

# experiment config
model_id = "google/flan-t5-xxl" # Hugging Face Model Id
dataset_id = "cnn_dailymail" # Hugging Face Dataset Id
dataset_config = "3.0.0" # config/verison of the dataset
save_dataset_path = "data" # local path to save processed dataset
text_column = "article" # column of input text is
summary_column = "highlights" # column of the output text
# custom instruct prompt start
prompt_template = f"Summarize the following news article:\n{{input}}\nSummary:\n"

Compared to the previous example, we are splitting the processing and training into two separate paths. This allows you to run the preprocessing outside of the managed SageMaker Training job. We process (tokenize) the dataset and upload to s3 and pass it into our managed Training job.

from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

# Load dataset from the hub
dataset = load_dataset(dataset_id,name=dataset_config)
# Load tokenizer of FLAN-t5-base
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Train dataset size: 287113
# Test dataset size: 11490

We defined a prompt_template in our config, which we will use to construct an instruct prompt for better performance of our model. Our prompt_template has a “fixed” start and end, and our document is in the middle. This means we need to ensure that the “fixed” template parts + document are not exceeding the max length of the model. Therefore we calculate the max length of our document, which we will later use for padding and truncation

prompt_lenght = len(tokenizer(prompt_template.format(input=""))["input_ids"])
max_sample_length = tokenizer.model_max_length - prompt_lenght
print(f"Prompt length: {prompt_lenght}")
print(f"Max input length: {max_sample_length}")

# Prompt length: 12
# Max input length: 500

We know now that our documents can be “500” tokens long to fit our template_prompt still correctly. In addition to our input, we need to understand better our “target” sequence length meaning and how long are the summarization ins our dataset. Therefore we iterate over the dataset and calculate the max input length (at max 500) and the max target length. (takes a few minutes)

from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
max_source_length = min(max_source_length, max_sample_length)
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded."
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
target_lenghts = [len(x) for x in tokenized_targets["input_ids"]]
# use 95th percentile as max target length
max_target_length = int(np.percentile(target_lenghts, 95))
print(f"Max target length: {max_target_length}")

We now have everything needed to process our dataset.

def preprocess_function(sample, padding="max_length"):
    # created prompted input
    inputs = [prompt_template.format(input=item) for item in sample[text_column]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# process dataset
tokenized_dataset =, batched=True, remove_columns=list(dataset["train"].features))

After we processed the datasets we are going to use the new FileSystem integration to upload our dataset to S3. We are using the sess.default_bucket(), adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/{dataset_id}/train'

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/processed/{dataset_id}/test'

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
print(f"test dataset to: {test_input_path}")

2. Prepare training script and deepspeed launcher

Done! The last step before we start training our is to prepare our training script and deepspeed. We learned in the introduction that we would leverage the DeepSpeed integration with the Hugging Face Trainer. In the previous post we used the deepspeed launcher to start our training on multiple GPUs. As of today Amazon SageMaker does not support the deepspeed launcher. 😒

To overcome this limitation, we need to create a custom launcher The launcher is a simple python script, which we will pass to our training script. The launcher will start the real training script with the correct environment variables and parameters. In addition, we need to create a deepspeed_config.json to configure our training setup. In the “Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers” post we created 4 deepspeed configurations for the experiments we ran, including CPU offloading and mixed precision:

Depending on your setup, you can use those, e.g. if you are running on NVIDIA V100s, you have to use the config without bf16 since V100 are not support bfloat16 types.

When fine-tuning T5 models we cannot use fp16 since it leads to overflow issues, see: #4586, #10830, #10956

We are going to use a p4dn.24xlarge AWS EC2 Instance including 8x NVIDIA A100 40GB. This means we can leverage bf16, which reduces the memory footprint of the model by almost ~2x, which allows us to train without offloading efficiently.

We are going to use the ds_flan_t5_z3_config_bf16.json. If you are irritated by the auto values, check the documentation.

deepspeed_parameters = {
  "deepspeed": "./configs/ds_flan_t5_z3_config_bf16.json", # deepspeed config file
  "training_script": "./scripts/" # real training script, not entrypoint

3. Fine-tune FLAN-T5 XXL on Amazon SageMaker

In addition to our deepspeed_parameters we need to define the training_hyperparameters for our training script. The training_hyperparameters are passed to our training_script as CLI arguments with --key value.

If you want to better understand which batch_size and deepspeed_config can work which hardware setup you can check out the Results & Experiments we ran.

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
  'model_id': model_id,                                # pre-trained model
  'train_dataset_path': '/opt/ml/input/data/training', # path where sagemaker will save training dataset
  'test_dataset_path': '/opt/ml/input/data/test',      # path where sagemaker will save test dataset
  'epochs': 3,                                         # number of training epochs
  'per_device_train_batch_size': 8,                    # batch size for training
  'per_device_eval_batch_size': 8,                     # batch size for evaluation
  'learning_rate': 1e-4,                               # learning rate used during training
  'generation_max_length': max_target_length,          # max length of generated summary

In order to create a sagemaker training job we need an HuggingFace Estimator. The Estimator then creates our Amazon SageMaker training. Amazon SagMaker takes care of starting and managing our ec2 instances, provides the correct huggingface container, uploads the provided scripts and downloads the data from our S3 bucket into the container at /opt/ml/input/data.

import time
# define Training Job Name
job_name = f'huggingface-deepspeed-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = '',  # deepspeed launcher script
    source_dir           = '.',               # directory which includes all the files needed for training
    instance_type        = 'ml.p4d.24xlarge', # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.17',            # the transformers version used in the training job
    pytorch_version      = '1.10',            # the pytorch_version version used in the training job
    py_version           = 'py38',            # the python version used in the training job
    hyperparameters      = {
    },   # the hyperparameter used for running the training job

We created our HuggingFace estimator including the as entry_point and defined our deepspeed config and training_script in the deepspeed_parameters, which we merged with our training_hyperparameters. We can now start our training job, with the .fit() method passing our S3 path to the training script.

# define a data input dictonary with our uploaded s3 uris
data = {
    'training': training_input_path,
    'test': test_input_path

# starting the train job with our uploaded datasets as input, wait=True)

If you want to deploy your model to a SageMaker Endpoint, you can check out the Deploy FLAN-T5 XXL on Amazon SageMaker blog.

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.