Deep Learning setup made easy with EC2 Remote Runner and Habana Gaudi
Moving from experimentation and preparation in a local environment to managed cloud infrastructure is often so complex that it prevents data scientists from iterating quickly and efficiently on their Deep Learning projects.
A common workflow I see is that a DS/MLE starts a virtual machine in the cloud, SSHs into it, and runs all experiments there. This has at least two downsides:
- Starting those cloud-based instances requires a lot of work and experience (selecting the right environment, CUDA version…), which can lead to a bad developer experience and avoidable mistakes like forgetting to stop an instance.
- The compute resources might not be used efficiently. In Deep Learning you mostly use GPU-backed instances, which can cost up to $40/h. But since not all operations require a GPU, e.g. dataset preparation or tokenization (in NLP), a lot of money can be wasted.
To overcome these downsides, we created a small package called “Remote Runner”.
Remote Runner is an easy, pythonic way to migrate your Python training scripts from a local environment to a powerful cloud-backed instance to efficiently scale your training, save cost & time, and iterate quickly on experiments in a parallel, containerized way.
How does Remote Runner work?
Remote Runner takes care of all of the heavy lifting for you:
- Creating all required cloud resources
- Migrating your script to the remote machine
- Executing your script
- Making sure the instance is terminated afterwards.
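The steps above boil down to a create–upload–execute–terminate lifecycle. As a minimal sketch in plain Python (all function names here are illustrative stand-ins, not the actual Remote Runner API):

```python
def run_remote(script, launch, upload, execute, terminate):
    """Illustrative lifecycle sketch -- not the Remote Runner API.

    `launch`, `upload`, `execute`, and `terminate` are hypothetical
    callables standing in for the real cloud operations.
    """
    instance = launch()                   # create all required cloud resources
    try:
        upload(instance, script)          # migrate the script to the remote machine
        return execute(instance, script)  # run the script and return its result
    finally:
        terminate(instance)               # always terminate, even if the run fails
```

The `try`/`finally` is the important part: termination happens whether the training succeeds or crashes, which is exactly the "forgot to stop an instance" failure mode Remote Runner guards against.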
Let's give it a try. 🚀
Our goal for the example is to fine-tune a Hugging Face Transformer model using the Habana Gaudi-based DL1 instance on AWS to take advantage of the cost performance benefits of Gaudi.
Managed Deep Learning with Habana Gaudi
In this example, you learn how to migrate your training jobs to a Habana Gaudi-based DL1 instance on AWS. Habana Gaudi is a new deep learning training processor for cost-effective, high-performance training promising up to 40% better price-performance than comparable GPUs, which is available on AWS.
We already published a blog post on how to Setup Deep Learning environment for Hugging Face Transformers with Habana Gaudi on AWS, which I recommend you take a look at to understand how much “manual” work was needed to use the DL1 instances on AWS for your workflows.
In the following example, we will first cover the requirements and setup, and then run a text-classification training job on a Habana Gaudi DL1 instance.
1. Requirements & Setup
Before we can start, make sure you have met the following requirements:
- AWS Account with quota for DL1 instance type
- AWS IAM user configured in CLI with permission to create and manage EC2 instances. You can find all permissions needed here.
After all requirements are fulfilled, we can begin by installing Remote Runner and the Hugging Face Hub library. We will use the Hugging Face Hub as the model versioning backend. This allows us to checkpoint, log, and track our metrics during training by simply providing one API token.
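Both can be installed from PyPI. The package name `rm-runner` is taken from the Remote Runner repository; double-check it there in case it has changed:

```shell
pip install rm-runner huggingface_hub
```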
2. Run a text-classification training job on Habana Gaudi DL1
We will use an example from the Remote Runner repository. The example will fine-tune a DistilBERT model on the IMDB dataset.
Note: Check out the Remote Runner repository; it includes several different examples.
First, we need to clone the repository and change the directory into examples.
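Assuming the repository is hosted under the author's GitHub account (the URL below is an assumption; use the link from the post if it differs), that is:

```shell
git clone https://github.com/philschmid/deep-learning-remote-runner.git
cd deep-learning-remote-runner/examples
```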
The `habana_text_classification.py` script uses the new `GaudiTrainer` from optimum-habana to leverage Habana Gaudi and provides an interface identical to the transformers `Trainer`.
Next, we can adjust the hyperparameters in `habana_example.py`.
You can, for example, change the `model_id` to your preferred checkpoint from Hugging Face. The next step is to create our `EC2RemoteRunner` instance. The `EC2RemoteRunner` defines our cloud environment, including the `instance_type`, `region`, and which credentials we want to use. You can also define a custom container which should be used to execute your training, e.g. if you want to include additional dependencies. The default container for Habana is `huggingface/optimum-habana:latest`.
We are going to use the `dl1.24xlarge` instance in `us-east-1`. For `profile`, enter your configured AWS profile.
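Put together, the runner configuration could look like the following sketch. The keyword names are assumed from the Remote Runner examples and the profile name is a placeholder; the snippet only builds the keyword arguments so it stays runnable without AWS credentials:

```python
# Keyword arguments for EC2RemoteRunner; names assumed from the
# Remote Runner examples, the profile name is a placeholder.
runner_kwargs = {
    "instance_type": "dl1.24xlarge",  # Habana Gaudi-based DL1 instance
    "region": "us-east-1",            # region where DL1 is available
    "profile": "my-aws-profile",      # your configured AWS CLI profile
}

# With rm-runner installed and AWS credentials configured:
# from rm_runner import EC2RemoteRunner
# runner = EC2RemoteRunner(**runner_kwargs)
```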
Next, we need to define the “launch” arguments, which are the `command` to execute and the `source_dir` (which will be uploaded via SCP).
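A minimal sketch of assembling those launch arguments follows; the script name comes from the example above, while the hyperparameter flags are illustrative, not the script's full argument list:

```python
# Build the command to execute on the remote instance; the flags are
# illustrative examples, not the script's complete argument list.
hyperparameters = {
    "model_id": "distilbert-base-uncased",
    "dataset_id": "imdb",
}
command = "python3 habana_text_classification.py " + " ".join(
    f"--{key} {value}" for key, value in hyperparameters.items()
)

# The launch itself would then look roughly like:
# runner.launch(command=command, source_dir="examples")
```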
Now we can launch our remote training by executing our python file.
You should see a similar output to the one below.
Remote Runner will log all steps it takes to launch your training as well as all of the outputs during the training to the terminal. At the end of the training, you'll see a summary of your training job with the duration each step took and an estimated cost.
We can see that our training finished successfully, took a total of 594s, and cost $1.71 thanks to the use of Habana Gaudi.
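The cost estimate is essentially the billed run time multiplied by the hourly instance price. A back-of-envelope helper (the hourly rate is whatever your region charges, not a value from Remote Runner):

```python
def estimate_cost(duration_s: float, hourly_rate_usd: float) -> float:
    """Rough cost estimate: run time in seconds billed at an hourly rate."""
    return round(duration_s / 3600 * hourly_rate_usd, 2)
```

Note that the billed duration includes instance start-up and teardown, so it can differ from the pure training time.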
We can now check the results and the model on the Hugging Face Hub. For me it is philschmid/distilbert-imdb-habana-remote-runner.
Conclusion
Remote Runner helps you to easily migrate your training to the cloud and to experiment fast on different instance types.
With the support for custom deep learning chips, it makes it easy to migrate from more expensive and slower instances to faster, more optimized ones, e.g. Habana Gaudi DL1. If you are interested in why you should migrate GPU workloads to Habana Gaudi, check out: Hugging Face Transformers and Habana Gaudi AWS DL1 Instances
If you run into an issue with Remote Runner or have feature requests, don't hesitate to open an issue on GitHub.
Also, make sure to check out the optimum-habana repository.
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.