A benchmark evaluating Llama 2 models of varying sizes across a range of Amazon EC2 instance types and load levels, measuring latency (milliseconds per token) and throughput (tokens per second).
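The two metrics above are derived from the same timed generation run. A minimal sketch (the function names and the numbers are illustrative, not from the benchmark itself):

```python
# Hypothetical helpers deriving the two benchmark metrics from one timed run.

def latency_ms_per_token(total_seconds: float, num_tokens: int) -> float:
    """Average per-token latency in milliseconds."""
    return total_seconds * 1000.0 / num_tokens

def throughput_tokens_per_second(total_seconds: float, num_tokens: int) -> float:
    """Generation throughput in tokens per second."""
    return num_tokens / total_seconds

# Example: 256 tokens generated in 8.0 seconds (illustrative values only)
print(latency_ms_per_token(8.0, 256))          # 31.25 ms/token
print(throughput_tokens_per_second(8.0, 256))  # 32.0 tokens/s
```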
In this example, we show how to fine-tune Falcon 180B using DeepSpeed, Hugging Face Transformers, and LoRA with Flash Attention on a multi-GPU machine.
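Fitting a 180B-parameter model on a multi-GPU machine typically relies on DeepSpeed ZeRO stage 3 to shard parameters, gradients, and optimizer state across GPUs, passed to the Hugging Face `Trainer` via its `deepspeed` argument. A sketch of such a config (the specific values and offload choices here are assumptions, not the exact setup used in the example):

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

The `"auto"` values let Transformers fill in settings from `TrainingArguments`, which keeps the config and the training script from drifting apart.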