Creating document embeddings with Hugging Face's Transformers & Amazon SageMaker
Welcome to this getting started guide. We will use the Hugging Face Inference DLCs and the Amazon SageMaker Python SDK to create a real-time inference endpoint running a Sentence Transformers model for document embeddings. Currently, the SageMaker Hugging Face Inference Toolkit supports the pipeline feature from Transformers for zero-code deployment. This means you can run compatible Hugging Face Transformer models without providing pre- & post-processing code. Therefore, we only need to provide the environment variables HF_TASK and HF_MODEL_ID when creating our endpoint, and the Inference Toolkit will take care of the rest. This is a great feature if you are working with existing pipelines.
If you want to run other tasks, such as creating document embeddings, you can provide the pre- and post-processing code yourself via an inference.py script. The Hugging Face Inference Toolkit allows the user to override the default methods of the HuggingFaceHandlerService.
The custom module can override the following methods:
- model_fn(model_dir) overrides the default method for loading a model. The return value model will be used in predict_fn for predictions.
  - model_dir is the path to your unzipped model.tar.gz.
- input_fn(input_data, content_type) overrides the default method for pre-processing. The return value data will be used in predict_fn for predictions. The inputs are:
  - input_data is the raw body of your request.
  - content_type is the content type from the request header.
- predict_fn(processed_data, model) overrides the default method for predictions. The return value predictions will be used in output_fn. The inputs are:
  - model is the value returned from the model_fn method.
  - processed_data is the value returned from the input_fn method.
- output_fn(prediction, accept) overrides the default method for post-processing. The return value result will be the response to your request (e.g. JSON). The inputs are:
  - prediction is the result from predict_fn.
  - accept is the return accept type from the HTTP request, e.g. application/json.
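To make the hand-off between these four hooks concrete, here is a minimal sketch of how they chain together for one request. The toy "model" (it only counts characters) and the handle helper are hypothetical stand-ins for illustration; in a real deployment the toolkit's HuggingFaceHandlerService does the orchestration, and its internals may differ from this simplified call order.

```python
import json


def model_fn(model_dir):
    # Hypothetical toy model: returns the length of each input string.
    # A real script would load weights from model_dir instead.
    return lambda texts: [len(t) for t in texts]


def input_fn(input_data, content_type):
    # Turn the raw request body into a Python object.
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(input_data)


def predict_fn(processed_data, model):
    # Run the model on the pre-processed payload.
    inputs = processed_data["inputs"]
    if isinstance(inputs, str):
        inputs = [inputs]
    return {"lengths": model(inputs)}


def output_fn(prediction, accept):
    # Serialize the prediction for the response.
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)


def handle(body, content_type, accept, model_dir="."):
    # Assumed call order for illustration only, not the toolkit's actual code.
    model = model_fn(model_dir)
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


if __name__ == "__main__":
    print(handle('{"inputs": "abc"}', "application/json", "application/json"))
```

Any hook you leave out of inference.py keeps its default behaviour, which is why the example later in this post only needs to define model_fn and predict_fn.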
In this example, we are going to use Sentence Transformers to create sentence embeddings using a mean pooling layer on the raw representation.
NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.
Development Environment and Permissions
Installation
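%pip install sagemaker --upgrade
import sagemaker
from packaging import version

# compare parsed versions; a plain string comparison would rank "2.100.0" below "2.75.0"
assert version.parse(sagemaker.__version__) >= version.parse("2.75.0")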
Install git and git-lfs
# For notebook instances (Amazon Linux)
!sudo yum update -y
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
!sudo yum install git-lfs git -y
# For other environments (Ubuntu)
!sudo apt-get update -y
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs git -y
Permissions
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker there is no execution role, so look it up by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
Create a custom inference.py script
To use the custom inference script, you need to create an inference.py script. In our example, we are going to override model_fn to load our sentence transformer correctly and predict_fn to apply mean pooling.
We are going to use the sentence-transformers/all-MiniLM-L6-v2 model. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
!mkdir code
%%writefile code/inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Helper: Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def model_fn(model_dir):
    # Load model and tokenizer from the unpacked model.tar.gz (model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    # unpack model and tokenizer
    model, tokenizer = model_and_tokenizer

    # Tokenize sentences
    sentences = data.pop("inputs", data)
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

    # return the embedding of the first input as a dictionary, which is JSON serializable
    return {"vectors": sentence_embeddings[0].tolist()}
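To see what the mean pooling and normalization steps do to the numbers, here is a dependency-free sketch on a tiny hand-made example. It uses plain Python lists instead of tensors, and the function names masked_mean_pool and l2_normalize are illustrative only; the normalization is a simplified variant that does not reproduce every numerical detail of F.normalize.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                summed[i] += value
    # Guard against division by zero, similar in spirit to the clamp in mean_pooling
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Two real tokens plus one padding token whose values must be ignored
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)  # average of the first two rows only
unit = l2_normalize(pooled)              # same direction, length 1
print(pooled, unit)
```

Without the mask, the padding row would dominate the average, which is why the attention mask is passed into the pooling step.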
Create model.tar.gz with inference script and model
To use our inference.py, we need to bundle it into a model.tar.gz archive together with all our model artifacts, e.g. pytorch_model.bin. The inference.py script will be placed into a code/ folder. We will use git and git-lfs to download our model from hf.co/models and then upload it to Amazon S3 so we can use it when creating our SageMaker endpoint.
repository = "sentence-transformers/all-MiniLM-L6-v2"
model_id = repository.split("/")[-1]
s3_location = f"s3://{sess.default_bucket()}/custom_inference/{model_id}/model.tar.gz"
- Download the model from hf.co/models with git clone.
!git lfs install
!git clone https://huggingface.co/$repository
- Copy inference.py into the code/ directory of the model directory.
!cp -r code/ $model_id/code/
- Create a model.tar.gz archive with all the model artifacts and the inference.py script.
%cd $model_id
!tar zcvf model.tar.gz *
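The layout inside the archive matters: the model files and the code/ folder sit at the archive root rather than under a parent directory. As a rough sketch, the same packaging can be done and checked with Python's standard tarfile module. The helper name build_model_archive and the throwaway files below are made up for illustration; a real model directory would contain the cloned model artifacts.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(model_dir, archive_path):
    """Pack the entries of model_dir at the archive root and return the member names."""
    model_dir = Path(model_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for entry in sorted(model_dir.iterdir()):
            if entry.name.startswith("."):
                continue  # skip hidden entries such as .git, as the shell glob * does
            # arcname drops the parent folder so files land at the archive root
            tar.add(str(entry), arcname=entry.name)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()


# Usage with a throwaway directory standing in for the cloned model repository
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "all-MiniLM-L6-v2"
    (model_dir / "code").mkdir(parents=True)
    (model_dir / "code" / "inference.py").write_text("# custom handlers\n")
    (model_dir / "config.json").write_text("{}")
    (model_dir / ".gitattributes").write_text("")
    names = build_model_archive(model_dir, Path(tmp) / "model.tar.gz")
    print(names)
```

Listing the member names before uploading is a cheap way to confirm that code/inference.py is where the toolkit expects it.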
- Upload the model.tar.gz to Amazon S3:
!aws s3 cp model.tar.gz $s3_location
# upload: ./model.tar.gz to s3://sagemaker-us-east-1-558105141721/custom_inference/all-MiniLM-L6-v2/model.tar.gz
Create custom HuggingFaceModel
After we have created and uploaded our model.tar.gz archive to Amazon S3, we can create a custom HuggingFaceModel class. This class will be used to create and deploy our SageMaker endpoint.
from sagemaker.huggingface.model import HuggingFaceModel


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,       # path to your model and script
    role=role,                    # iam role with permissions to create an Endpoint
    transformers_version="4.12",  # transformers version used
    pytorch_version="1.9",        # pytorch version used
    py_version='py38',            # python version used
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge"
)
Request Inference Endpoint using the HuggingFacePredictor
The .deploy() method returns a HuggingFacePredictor object, which can be used to request inference.
data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = predictor.predict(data=data)
print(res)
# {'vectors': [0.005078191868960857, -0.0036594511475414038, .....]}
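Because the inference script L2-normalizes the embeddings, the dot product of two returned vectors equals their cosine similarity. As a small sketch of how the vectors could be used for semantic search, the snippet below ranks documents against a query. The 2-dimensional vectors are hypothetical stand-ins for the 384-dimensional lists the endpoint returns, and rank_by_similarity is an illustrative helper, not part of any SDK.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted best-first; assumes unit-length vectors."""
    scores = [(i, dot(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical unit vectors standing in for res["vectors"] from separate requests
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_by_similarity(query, docs)
print(ranking)
```

For larger collections a vector index would be a better fit than this linear scan; the sketch only shows the scoring idea.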
Delete model and endpoint
To clean up, we can delete the model and endpoint.
predictor.delete_model()
predictor.delete_endpoint()
Conclusion
We managed to provide a custom inference.py script that overrides the default methods for model loading and running inference. This allowed us to use Sentence Transformers models for creating sentence embeddings with minimal code changes.
Custom Inference scripts are an easy and nice way to customize the inference pipeline of the Hugging Face Inference Toolkit when your pipeline is not represented in the pipelines API of Transformers or when you want to add custom logic.
Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.