Save up to 90% training cost with AWS Spot Instances and Hugging Face Transformers
Photo by Marten Bjork on Unsplash
notebook: sagemaker/05_spot_instances
Amazon EC2 Spot Instances are a way to take advantage of unused EC2 capacity in the AWS cloud. A Spot Instance is an instance that uses spare EC2 capacity that is available for less than the On-Demand price. The hourly price for a Spot Instance is called a Spot price. If you want to learn more about Spot Instances, you should check out the concepts in the documentation. One concept we should nevertheless briefly address here is Spot Instance interruption.
Amazon EC2 terminates, stops, or hibernates your Spot Instance when Amazon EC2 needs the capacity back or the Spot price exceeds the maximum price for your request. Amazon EC2 provides a Spot Instance interruption notice, which gives the instance a two-minute warning before it is interrupted.
Amazon SageMaker and the Hugging Face DLCs make it easy to train transformer models using managed Spot Instances. Managed Spot Training can reduce the cost of training models by up to 90% compared to On-Demand instances.
As we learned, Spot Instances can be interrupted, causing jobs to potentially stop before they are finished. To prevent any loss of model weights or information, Amazon SageMaker offers support for remote S3 checkpointing, where data from a local path is saved to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path.
In this example, we will learn how to use managed Spot Training and S3 checkpointing with Hugging Face Transformers to save up to 90% of the training costs.
We are going to:
- preprocess a dataset in the notebook and upload it to Amazon S3
- configure checkpointing and spot training in the HuggingFace estimator
- run training on a spot instance
NOTE: You can run this demo in SageMaker Studio, on your local machine, or in SageMaker Notebook Instances.
Development Environment and Permissions
Note: we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't have it installed already.
!pip install "sagemaker>=2.77.0" "transformers==4.12.3" "datasets[s3]==1.18.3" s3fs --upgrade
Permissions
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
Preprocessing
We are using the datasets library to download and preprocess the emotion dataset. After preprocessing, the dataset will be uploaded to our sagemaker_session_bucket to be used within our training job. The emotion dataset consists of 16000 training examples, 2000 validation examples, and 2000 testing examples.
from datasets import load_dataset
from transformers import AutoTokenizer

# model_id used for training and preprocessing
model_id = 'distilbert-base-uncased'

# dataset used
dataset_name = 'emotion'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/emotion'

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
After we have processed the datasets, we are going to use the new FileSystem integration to upload our dataset to S3.
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)
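As a side note, inside the training job SageMaker copies each input channel we later pass to .fit to /opt/ml/input/data/<channel> and exposes the path through an SM_CHANNEL_<NAME> environment variable. A training script could therefore read the uploaded datasets back roughly like this (a minimal sketch, assuming the train and test channel names used later in this example):

import os
from datasets import load_from_disk

# sketch: SageMaker exposes each input channel as SM_CHANNEL_<CHANNEL_NAME>
train_dataset = load_from_disk(os.environ["SM_CHANNEL_TRAIN"])
test_dataset = load_from_disk(os.environ["SM_CHANNEL_TEST"])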
Configure checkpointing and spot training in the HuggingFace estimator
After we have uploaded our datasets, we can configure spot training and make sure checkpointing is enabled so we don't lose any progress if interruptions happen.
To configure spot training we need to define max_wait and max_run in the HuggingFace estimator and set use_spot_instances to True.
- max_wait: Duration in seconds until Amazon SageMaker stops the managed spot training job if it has not completed yet
- max_run: Maximum duration in seconds for the training job
max_wait also needs to be greater than max_run, because max_wait covers the time spent waiting for spot capacity (which can take a while when no spot capacity is free) plus the expected duration of the training job.
Example
If you expect your training to take 3600 seconds (1 hour), you can set max_run to 4000 seconds (as a buffer) and max_wait to 7200 seconds to include up to 3200 seconds of waiting time for your spot capacity.
# enables spot training
use_spot_instances=True
# max time including spot start + training time
max_wait=7200
# expected training time
max_run=4000
To enable checkpointing we need to define checkpoint_s3_uri in the HuggingFace estimator. checkpoint_s3_uri is the S3 URI to which the checkpoints are saved. By default, Amazon SageMaker will sync any file written to /opt/ml/checkpoints in the training job to checkpoint_s3_uri.
It is possible to adjust /opt/ml/checkpoints by overriding checkpoint_local_path in the HuggingFace estimator.
# s3 uri where our checkpoints will be uploaded during training
base_job_name = "emotion-checkpointing"

checkpoint_s3_uri = f's3://{sess.default_bucket()}/{base_job_name}/checkpoints'
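If you do override the local path, you would pass checkpoint_local_path to the estimator alongside checkpoint_s3_uri. A minimal sketch (the custom path below is purely illustrative, and the output_dir hyperparameter would need to point to the same directory):

# illustrative only: use a custom local checkpoint directory instead of the default
checkpoint_local_path = '/opt/ml/my-checkpoints'
# passed later as HuggingFace(..., checkpoint_s3_uri=checkpoint_s3_uri, checkpoint_local_path=checkpoint_local_path)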
The next step is to create our HuggingFace estimator, provide our hyperparameters, and add our spot and checkpointing configuration.
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={
    'epochs': 1,                        # number of training epochs
    'train_batch_size': 32,             # batch size for training
    'eval_batch_size': 64,              # batch size for evaluation
    'learning_rate': 3e-5,              # learning rate used during training
    'model_id':model_id,                # pre-trained model id
    'fp16': True,                       # whether to use 16-bit (mixed) precision training
    'output_dir':'/opt/ml/checkpoints'  # make sure files are saved to the checkpoint directory
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',         # fine-tuning script used in the training job
    source_dir           = './scripts',        # directory where the fine-tuning script is stored
    instance_type        = 'ml.p3.2xlarge',    # instance type used for the training job
    instance_count       = 1,                  # the number of instances used for training
    base_job_name        = base_job_name,      # the name of the training job
    role                 = role,               # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.12.3',           # the transformers version used in the training job
    pytorch_version      = '1.9.1',            # the pytorch version used in the training job
    py_version           = 'py38',             # the python version used in the training job
    hyperparameters      = hyperparameters,    # the hyperparameters used for running the training job
    use_spot_instances   = use_spot_instances, # whether to use spot instances or not
    max_wait             = max_wait,           # max time including spot start + training time
    max_run              = max_run,            # max expected training time
    checkpoint_s3_uri    = checkpoint_s3_uri,  # s3 uri where our checkpoints will be uploaded during training
)
When using remote S3 checkpointing you have to make sure that your train.py also supports checkpointing. Transformers and the Trainer offer utilities for this. You only need to add the following snippet to your Trainer training script:
from transformers.trainer_utils import get_last_checkpoint

# check if a checkpoint exists and, if so, continue training from it
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
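To make the context of this snippet clearer, here is a minimal sketch of what such a train.py could look like, wired to the hyperparameters and checkpoint directory defined above. It is only an illustration under those assumptions; the actual script in ./scripts may differ:

import argparse
import logging
import os

from datasets import load_from_disk
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # hyperparameters passed in by the HuggingFace estimator
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--output_dir", type=str, default="/opt/ml/checkpoints")
    # data channels provided by SageMaker
    parser.add_argument("--train_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
    args, _ = parser.parse_known_args()

    # load the datasets we uploaded to S3 (SageMaker copies them into the channels)
    train_dataset = load_from_disk(args.train_dir)
    test_dataset = load_from_disk(args.test_dir)

    # model and training setup
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_id, num_labels=train_dataset.features["labels"].num_classes
    )
    training_args = TrainingArguments(
        output_dir=args.output_dir,  # checkpoints land in /opt/ml/checkpoints and get synced to S3
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        learning_rate=args.learning_rate,
        fp16=True,  # corresponds to the 'fp16' hyperparameter above
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    # resume from the last checkpoint if one was synced back from S3
    last_checkpoint = get_last_checkpoint(args.output_dir)
    if last_checkpoint is not None:
        logger.info("***** continue training *****")
        trainer.train(resume_from_checkpoint=last_checkpoint)
    else:
        trainer.train()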
Run training on a spot instance
The last step of this example is to start our managed Spot Training. To do so, we simply call the .fit method of our estimator and provide our datasets.
# define train data object
data = {
    'train': training_input_path,
    'test': test_input_path
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data)

# Training seconds: 874
# Billable seconds: 262
# Managed Spot Training savings: 70.0%
After the training run has completed successfully, you should see your spot savings in the logs.
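The reported savings follow directly from the ratio of billable to training seconds; using the numbers from the run above:

# relation between the logged seconds and the reported savings (values from the run above)
training_seconds = 874   # time the instance actually ran
billable_seconds = 262   # time you are billed for
savings = (1 - billable_seconds / training_seconds) * 100
print(f"Managed Spot Training savings: {savings:.1f}%")  # -> 70.0%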
Conclusion
We successfully ran a Managed Spot Training job on Amazon SageMaker and saved 70% of the training cost, which is a big margin, especially since we only needed to define three parameters to set it up.
I can highly recommend using Managed Spot Training if you have a grace period between model training and delivery.
If you want to learn more about Hugging Face Transformers on Amazon SageMaker, you can check out our documentation or other examples.
You can find the code here.
Thanks for reading! If you have any questions, feel free to contact me through Github or on the forum. You can also connect with me on Twitter or LinkedIn.