Quick intro: PEFT or Parameter Efficient Fine-tuning
PEFT, or Parameter-Efficient Fine-Tuning, is a new open-source library from Hugging Face that enables efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. PEFT currently includes techniques such as LoRA, Prefix Tuning, P-Tuning, and Prompt Tuning.
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.
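As a sketch of that setup, the snippet below shows how a SageMaker session and execution role could be created when running locally; the role name "sagemaker_execution_role" is a placeholder you would replace with your own.

```python
import sagemaker
import boto3

sess = sagemaker.Session()

# When running inside SageMaker, the execution role can be looked up directly.
# When running locally, fall back to fetching the role by name (placeholder name).
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
```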
2. Load and prepare the dataset
We will use the samsum dataset, a collection of about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English.
To load the samsum dataset, we use the load_dataset() method from the 🤗 Datasets library.
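A minimal sketch of that step, assuming the dataset id samsum on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the samsum dataset from the Hugging Face Hub
dataset = load_dataset("samsum")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
```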
To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out chapter 6 of the Hugging Face Course.
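A minimal sketch of loading the tokenizer; the model id bigscience/bloomz-7b1 is an assumption based on the BLOOMZ 7B model used later in this post:

```python
from transformers import AutoTokenizer

# Model id is an assumption for illustration
model_id = "bigscience/bloomz-7b1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
```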
Before we can start training, we need to preprocess our data. Abstractive summarization is a text-generation task: our model takes a text as input and generates a summary as output. We want to understand how long our inputs and outputs will be so we can batch our data efficiently.
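One possible way to estimate those lengths is sketched below, building on the dataset and tokenizer from the previous snippets; the 85th/90th percentile cut-offs are assumptions for illustration:

```python
import numpy as np

# Tokenize the dialogues to estimate a suitable input (source) length
tokenized_inputs = dataset["train"].map(
    lambda x: tokenizer(x["dialogue"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
max_source_length = int(np.percentile([len(x) for x in tokenized_inputs["input_ids"]], 85))
print(f"Max source length: {max_source_length}")

# Tokenize the summaries to estimate a suitable output (target) length
tokenized_targets = dataset["train"].map(
    lambda x: tokenizer(x["summary"], truncation=True),
    batched=True,
    remove_columns=["dialogue", "summary"],
)
max_target_length = int(np.percentile([len(x) for x in tokenized_targets["input_ids"]], 90))
print(f"Max target length: {max_target_length}")
```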
We defined a prompt_template which we will use to construct an instruct prompt for better performance of our model. Our prompt_template has a “fixed” start and end, and our document sits in the middle. This means we need to ensure that the “fixed” template parts plus the document do not exceed the max length of the model.
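The snippet below sketches what such a template could look like; the exact wording and the 2048-token context length are assumptions:

```python
# Hypothetical instruct prompt template with a "fixed" start and end
prompt_template = "Summarize the chat dialogue:\n{dialogue}\n---\nSummary:\n"

# Count the tokens used by the "fixed" template parts alone
prompt_length = len(tokenizer(prompt_template.format(dialogue=""))["input_ids"])

# BLOOM was trained with a 2048-token context; this constant is an assumption
model_max_length = 2048
max_source_length = min(max_source_length, model_max_length - prompt_length)
print(f"Prompt length: {prompt_length}")
print(f"Max source length after accounting for the template: {max_source_length}")
```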
We preprocess our dataset before training and save it to disk, then upload it to S3. You could also run this step on your local machine or a CPU instance and upload the result to the Hugging Face Hub.
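A sketch of a possible preprocessing function, reusing the prompt_template and lengths from the snippets above; the exact tokenization details may differ from the actual training setup:

```python
def preprocess_function(sample, padding="max_length"):
    # construct the instruct prompt for every dialogue
    inputs = [prompt_template.format(dialogue=item) for item in sample["dialogue"]]

    # tokenize the prompted inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # tokenize the summaries as labels
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # replace padding token ids in the labels with -100 so they are ignored by the loss
    if padding == "max_length":
        labels["input_ids"] = [
            [(token if token != tokenizer.pad_token_id else -100) for token in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")
```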
After we have processed the dataset, we use the new FileSystem integration to upload it to S3. We use sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
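For example (the S3 prefix processed/samsum is an assumption):

```python
# save the processed train split directly to S3 using the datasets FileSystem integration
training_input_path = f"s3://{sess.default_bucket()}/processed/samsum/train"
tokenized_dataset["train"].save_to_disk(training_input_path)

print(f"Uploaded data to: {training_input_path}")
```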
3. Fine-Tune BLOOM with LoRA and bnb int-8 on Amazon SageMaker
In addition to the LoRA technique, we will use bitsandbytes LLM.int8() to quantize our frozen LLM to int8. This allows us to reduce the memory needed for BLOOMZ by about 4x.
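To illustrate the idea, the sketch below shows how a training script could load the model in int8 and wrap it with LoRA using peft; the LoRA hyperparameters and the prepare_model_for_int8_training helper (renamed prepare_model_for_kbit_training in newer peft releases) reflect one possible setup, not necessarily the exact one used in this post:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# load the base model in 8-bit (requires bitsandbytes and accelerate)
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-7b1",
    load_in_8bit=True,
    device_map="auto",
)

# prepare the quantized model for training (e.g. casts layer norms to fp32)
model = prepare_model_for_int8_training(model)

# LoRA configuration; r, alpha, dropout and the target module are assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```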
In order to create a SageMaker training job, we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure used.
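A sketch of what the Estimator definition could look like; the script name, hyperparameters, and DLC versions below are assumptions:

```python
from sagemaker.huggingface import HuggingFace

# hyperparameters passed to the training script; names and values are assumptions
hyperparameters = {
    "model_id": "bigscience/bloomz-7b1",
    "epochs": 3,
    "per_device_train_batch_size": 1,
    "lr": 2e-4,
}

huggingface_estimator = HuggingFace(
    entry_point="run_clm.py",          # training script, name is an assumption
    source_dir="scripts",              # directory that contains the script
    instance_type="ml.g5.2xlarge",     # instance used in this post
    instance_count=1,
    role=role,
    transformers_version="4.26",       # DLC versions are assumptions
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
)
```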
SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running our training script.
We can now start our training job with the .fit() method, passing our S3 path to the training script.
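For example:

```python
# define a data input dictionary with our uploaded S3 path and start the training job
data = {"training": training_input_path}

huggingface_estimator.fit(data, wait=True)
```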
The training took 20,632 seconds, which is about 5.7 hours. The ml.g5.2xlarge instance we used costs $1.515 per hour, so the total cost for training BLOOMZ 7B was about $8.63. We could reduce the cost by using a spot instance, but the training time could increase due to waiting or restarts.
4. Deploy the model to Amazon SageMaker Endpoint
When using PEFT for training, you normally end up with adapter weights. We added the merge_and_unload() method to merge the base model with the adapter, which makes it easier to deploy the model, since we can then use the pipelines feature of the transformers library.
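As a sketch, this is roughly what merging could look like at the end of the training script; the adapter path and output directory are placeholders:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# load the base model and the trained LoRA adapter (placeholder paths)
base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, "path/to/adapter")

# merge the LoRA weights into the base model and drop the adapter wrappers
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")
```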
We can now deploy our model using the deploy() method on our HuggingFace estimator object, passing in our desired number of instances and instance type.
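For example (the inference instance type is an assumption):

```python
# deploy the fine-tuned model to a real-time endpoint
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
)
```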
SageMaker starts the deployment process by creating a SageMaker Endpoint Configuration and a SageMaker Endpoint. The Endpoint Configuration defines the model and the instance type.
Let's test it by using an example from the test split.
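A sketch of sending one test sample to the endpoint, reusing the prompt_template and dataset from earlier; the payload format assumes the default JSON serialization of the Hugging Face predictor:

```python
# pick an example from the test split and send it to the endpoint
sample = dataset["test"][0]
payload = {"inputs": prompt_template.format(dialogue=sample["dialogue"])}

result = predictor.predict(payload)
print(result)
```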
Let's compare it to the reference summary from the test sample.
Finally, we delete the endpoint again.
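For example:

```python
# clean up the model and endpoint to avoid ongoing costs
predictor.delete_model()
predictor.delete_endpoint()
```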
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.