Optimize open LLMs using GPTQ and Hugging Face Optimum
The Hugging Face Optimum team collaborated with the AutoGPTQ library to provide a simple API that applies GPTQ quantization to language models. With GPTQ quantization you can quantize open LLMs to 8, 4, 3 or even 2 bits so they run on smaller hardware without a big drop in performance.
In the blog, you will learn how to:
- Setup our development environment
- Prepare quantization dataset
- Load and Quantize Model
- Test performance and inference speed
- Bonus: Run Inference with Text Generation Inference
But before we get started, let's take a quick look at what GPTQ does.
Note: This tutorial was created and run on a g5.2xlarge AWS EC2 Instance, including an NVIDIA A10G GPU.
What is GPTQ?
GPTQ is a post-training quantization method to compress LLMs, like GPT. GPTQ compresses GPT models by reducing the number of bits needed to store each weight in the model, from 32 bits down to just 3-4 bits. This means the model takes up much less memory, so it can run on less hardware, e.g. a single GPU for a 13B Llama 2 model. GPTQ analyzes each layer of the model separately and approximates the weights in a way that preserves the overall accuracy.
The main benefits are:
- Quantizes the weights of the model layer-by-layer to 4 bits instead of 16 bits, which reduces the needed memory by 4x (see the quick calculation after this list).
- Quantization is done gradually to minimize the accuracy loss from quantization.
- Achieves the same latency as the fp16 model, but with 4x less memory usage, and is sometimes faster thanks to custom kernels, e.g. Exllama.
- Quantized weights can be saved to disk for ahead-of-time quantization.
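To make the memory savings concrete, here is a quick back-of-the-envelope calculation. This is only an illustrative sketch; real checkpoints carry some overhead, e.g. quantization metadata and tensors kept in higher precision.
# rough memory footprint of 7B parameters at different precisions
num_params = 7_000_000_000
for bits in (16, 8, 4):
    gb = num_params * bits / 8 / 1024**3
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
# 16-bit weights: ~13.0 GB
# 8-bit weights: ~6.5 GB
# 4-bit weights: ~3.3 GB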
Note: GPTQ quantization only works for text models for now. Furthermore, the quantization process can take a lot of time. You should check the Hugging Face Hub to see if there is not already a GPTQ quantized version of the model you want to use.
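One way to check programmatically (a sketch assuming a recent huggingface_hub client; you can of course also just use the Hub's search UI) is:
from huggingface_hub import HfApi

# search the Hub for existing GPTQ quantizations of the model family you care about
api = HfApi()
for model in api.list_models(search="Llama-2-7B GPTQ", sort="downloads", direction=-1, limit=5):
    print(model.id)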
1. Setup our development environment
Let's start coding, but first, install our dependencies.
!pip install "transformers[sentencepiece]==4.32.1" "optimum==1.12.0" "auto-gptq==0.4.2" "accelerate==0.22.0" "safetensors>=0.3.1" --upgrade
2. Prepare quantization dataset
GPTQ is a post-training quantization method, so we need to prepare a dataset to quantize our model. We can either use a dataset from the Hugging Face Hub or use our own dataset. In this blog, we are going to use the WikiText dataset from the Hugging Face Hub. The dataset is used to quantize the weights to minimize the performance loss. It is recommended to use a quantization dataset with at least 128 samples.
Note: TheBloke, a very active community member, is contributing hundreds of GPTQ weights to the Hugging Face Hub. He mostly uses wikitext as the quantization dataset for general domain models.
If you want to use, e.g., your fine-tuning dataset for quantization, you can provide it as a list instead of the "id"; check out this example or the sketch below.
# Dataset id from Hugging Face
dataset_id = "wikitext2"
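If you instead want to quantize on your own data, a minimal sketch could look like the following. The file name and the "text" column are placeholders for whatever fine-tuning data you have.
from datasets import load_dataset

# hypothetical example: build a list of raw text samples from your own fine-tuning data
own_dataset = load_dataset("json", data_files="my_finetuning_data.json", split="train")
quantization_samples = [sample["text"] for sample in own_dataset.shuffle(seed=42).select(range(128))]
# later: pass quantization_samples as the dataset argument of GPTQQuantizer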
3. Load and Quantize Model
Optimum integrates GPTQ quantization in the optimum.gptq namespace with a GPTQQuantizer. The quantizer takes our dataset (id or list), bits, and model_seqlen as input. For more customization check here.
from optimum.gptq import GPTQQuantizer
# GPTQ quantizer
quantizer = GPTQQuantizer(bits=4, dataset=dataset_id, model_seqlen=4096)
quantizer.quant_method = "gptq"
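If you need more control, the quantizer exposes additional arguments. As a hedged sketch (check the Optimum documentation of your installed version for the exact parameter set), a more customized quantizer could look like this; it is not used in the rest of the tutorial.
# sketch: a more customized quantizer (not used below)
custom_quantizer = GPTQQuantizer(
    bits=4,
    dataset=dataset_id,
    model_seqlen=4096,
    group_size=128,   # quantize weights in groups of 128 columns
    desc_act=False,   # disable activation-order quantization for faster inference kernels
)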
After we have created our quantizer, we can load our model using Transformers. In our example we will quantize a Llama 2 7B, which we trained in my other blog post "Extended Guide: Instruction-tune Llama 2". We are going to load our model in fp16 since GPTQ adopts a mixed int4/fp16 quantization scheme, where weights are quantized as int4 while activations remain in float16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Hugging Face model id
model_id = "philschmid/llama-2-7b-instruction-generator"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False) # bug with fast tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, torch_dtype=torch.float16) # we load the model in fp16 on purpose
After we have loaded our model, we are ready to quantize it. Note: The quantization process can take a lot of time depending on your hardware. For this example, quantizing a 7B model on a single A10G GPU took ~45 minutes.
import os
import json
# quantize the model
quantized_model = quantizer.quantize_model(model, tokenizer)
# save the quantize model to disk
save_folder = "quantized_llama"
quantized_model.save_pretrained(save_folder, safe_serialization=True)
# load fresh, fast tokenizer and save it to disk
AutoTokenizer.from_pretrained(model_id).save_pretrained(save_folder)
# save quantize_config.json for TGI
with open(os.path.join(save_folder, "quantize_config.json"), "w", encoding="utf-8") as f:
    quantizer.disable_exllama = False
    json.dump(quantizer.to_dict(), f, indent=2)
Since the model was partially offloaded during quantization, disable_exllama was set to True to avoid an error. For inference and production loads we want to leverage the exllama kernels, so we need to change the config.json.
with open(os.path.join(save_folder, "config.json"), "r", encoding="utf-8") as f:
    config = json.load(f)
config["quantization_config"]["disable_exllama"] = False
with open(os.path.join(save_folder, "config.json"), "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
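As an optional sanity check (not required for the rest of the tutorial), you can report the on-disk size of the quantized checkpoint:
# optional: report the on-disk size of the quantized checkpoint
size_gb = sum(
    os.path.getsize(os.path.join(save_folder, f)) for f in os.listdir(save_folder)
) / 1024**3
print(f"Quantized checkpoint size on disk: ~{size_gb:.2f} GB")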
4. Test performance and inference speed
Since the latest release of transformers we can load any GPTQ quantized model directly using the AutoModelForCausalLM class. You can either load already quantized models from Hugging Face, e.g. TheBloke/Llama-2-13B-chat-GPTQ, or models you quantized yourself. Since we want to test the results of our quantization, we are going to load our quantized model from disk and compare it to our non-quantized model.
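For example, loading a community-quantized model from the Hub is a one-liner. This is only a sketch; the download is several GB and it is not needed for the rest of this tutorial.
import torch
from transformers import AutoModelForCausalLM

# sketch: load an already quantized GPTQ model straight from the Hub
hub_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-chat-GPTQ", device_map="auto", torch_dtype=torch.float16
)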
First, let's load our non-quantized model and test it on a simple prompt.
import time
# The prompt is based on the fine-tuning from the model: https://www.philschmid.de/instruction-tune-llama-2#4-test-model-and-run-inference
prompt = """### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.
### Input:
Dear [boss name],
I'm writing to request next week, August 1st through August 4th,
off as paid time off.
I have some personal matters to attend to that week that require
me to be out of the office. I wanted to give you as much advance
notice as possible so you can plan accordingly while I am away.
Thank you, [Your name]
### Response:
"""
# helper function to generate text and measure latency
def generate_helper(pipeline, prompt=prompt):
    # warm up
    for i in range(5):
        _ = pipeline("Warm up")
    # measure latency in a simple way
    start = time.time()
    out = pipeline(prompt, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.9)
    end = time.time()
    generated_text = out[0]["generated_text"][len(prompt):]
    latency_per_token_in_ms = ((end - start) / len(pipeline.tokenizer(generated_text)["input_ids"])) * 1000
    # return the generated text and the latency
    return {"text": generated_text, "latency": f"{round(latency_per_token_in_ms, 2)}ms/token"}
We can load the vanilla transformers model and run inference using the pipeline class.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Hugging Face model id
model_id = "philschmid/llama-2-7b-instruction-generator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16) # we load the model in fp16 on purpose
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
Let's create our vanilla baseline.
import torch
vanilla_res = generate_helper(pipe)
print(f"Latency: {vanilla_res['latency']}")
print(f"GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Generated Instruction: {vanilla_res['text']}")
# Latency: 37.49ms/token
# GPU memory: 12.62 GB
# Generated Instruction: Write a request for PTO letter to my boss
# clean up
del pipe
del model
del tokenizer
torch.cuda.empty_cache()
Since we now have our baseline, we can test and validate our GPTQ quantized weights. For this we will use the new gptq integration in the AutoModelForCausalLM class, which lets us load the gptq weights directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# path to gptq weights
model_id = "quantized_llama"
q_tokenizer = AutoTokenizer.from_pretrained(model_id)
q_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
qtq_pipe = pipeline("text-generation", model=q_model, tokenizer=q_tokenizer)
Now, we can test our quantized model on the same prompt as our baseline.
gpq_res = generate_helper(qtq_pipe)
print(f"Latency: {gpq_res['latency']}")
print(f"GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Generated Instruction: {gpq_res['text']}")
# Latency: 36.0ms/token
# GPU memory: 3.83 GB
# Generated Instruction: Write a letter requesting time off
For comparison, the vanilla model needed ~12.6GB of GPU memory and the GPTQ model needed ~3.8GB, with comparable generation quality. In practice that is roughly 3.3x less allocated memory; the theoretical 4x applies to the quantized weights only, since some tensors (and PyTorch's own allocations) remain untouched.
5. Bonus: Run Inference with Text Generation Inference
Text Generation Inference supports GPTQ models for more efficient deployments. We simply need to provide gptq as the QUANTIZE environment variable when starting our container.
model="/home/ubuntu/test-gptq"
num_shard=1
quantize="gptq"
max_input_length=1562
max_total_tokens=4096 # 4096
!docker run --gpus all -ti -p 8080:80 \
-e MODEL_ID=$model \
-e QUANTIZE=$quantize \
-e NUM_SHARD=$num_shard \
-e MAX_INPUT_LENGTH=$max_input_length \
-e MAX_TOTAL_TOKENS=$max_total_tokens \
-v $model:$model \
ghcr.io/huggingface/text-generation-inference:1.0.3
We can invoke our container using curl. _Note: The first request will be slow._
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"### Instruction:\nUse the Input below to create an instruction, which could have been used to generate the input using an LLM.\n\n### Input:\nDear [boss name],\n\nI am writing to request next week, August 1st through August 4th,\noff as paid time off.\n\nI have some personal matters to attend to that week that require\nme to be out of the office. I wanted to give you as much advance\nnotice as possible so you can plan accordingly while I am away.\n\nThank you, [Your name]\n\n### Response:","parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}' \
-H 'Content-Type: application/json'
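Instead of curl, you can also query the endpoint from Python, for example with a plain requests call. This is a minimal sketch assuming the container from above is listening on localhost:8080; the prompt is just a short illustrative example.
import requests

# minimal Python client for the running TGI container
payload = {
    "inputs": "### Instruction:\nSummarize the input below in one sentence.\n\n### Input:\nGPTQ quantizes LLM weights to 4 bits.\n\n### Response:\n",
    "parameters": {"temperature": 0.2, "top_p": 0.95, "max_new_tokens": 256},
}
res = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=120)
print(res.json()["generated_text"])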
With Text Generation Inference we are achieving ~22.94ms latency per token, which is ~1.6x faster than our transformers baseline (37.49ms/token). If you plan to deploy your model in production, I would recommend using Text Generation Inference.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.