Optimize open LLMs using GPTQ and Hugging Face Optimum
The Hugging Face Optimum team collaborated with the AutoGPTQ library to provide a simple API that applies GPTQ quantization to language models. With GPTQ quantization, you can compress open LLMs to 8, 4, 3, or even 2 bits to run them on smaller hardware without a big drop in performance.
In this blog, you will learn how to:
- Set up our development environment
- Prepare quantization dataset
- Load and Quantize Model
- Test performance and inference speed
- Bonus: Run Inference with Text Generation Inference
But before we get started, let's take a quick look at what GPTQ does.
Note: This tutorial was created and run on a g5.2xlarge AWS EC2 instance, which includes an NVIDIA A10G GPU.
What is GPTQ?
GPTQ is a post-training quantization method to compress LLMs, like GPT. GPTQ compresses GPT models by reducing the number of bits needed to store each weight in the model, from 32 bits down to just 3-4 bits. This means the model takes up much less memory, so it can run on less hardware, e.g. a single GPU for a 13B Llama 2 model. GPTQ analyzes each layer of the model separately and approximates its weights in a way that preserves the overall accuracy.
The main benefits are:
- Quantizes the weights of the model layer by layer to 4 bits instead of 16 bits, which reduces the needed memory by ~4x.
- Quantization is done gradually to minimize accuracy loss.
- Achieves the same latency as the fp16 model, but with 4x less memory usage, and is sometimes faster thanks to custom kernels, e.g. exllama.
- Quantized weights can be saved to disk for ahead-of-time quantization.
Note: GPTQ quantization only works for text models for now. Furthermore, the quantization process can take a lot of time. Check the Hugging Face Hub first to see whether a GPTQ-quantized version of the model you want to use already exists.
1. Set up our development environment
Let's start coding, but first, install our dependencies.
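A minimal sketch of the setup, assuming a CUDA-enabled machine like the g5.2xlarge mentioned above; the version pins in the comment are assumptions, so adjust them to the latest compatible releases:

```python
# Install the required libraries first (version pins below are assumptions):
#   pip install "transformers>=4.32.0" "optimum>=1.12.0" "auto-gptq>=0.4.2" accelerate datasets
from importlib.metadata import version

import torch

# Quick sanity check that the environment is ready for GPTQ quantization.
for pkg in ["transformers", "optimum", "auto-gptq", "accelerate", "datasets"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except Exception:
        print(f"{pkg}: not installed")

print(f"CUDA available: {torch.cuda.is_available()}")
```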
2. Prepare quantization dataset
GPTQ is a post-training quantization method, so we need to prepare a dataset to quantize our model. We can either use a dataset from the Hugging Face Hub or our own dataset. In this blog, we are going to use the WikiText dataset from the Hugging Face Hub. The dataset is used to quantize the weights to minimize the performance loss. It is recommended to use a quantization dataset with at least 128 samples.
Note: TheBloke, a very active community member, has contributed hundreds of GPTQ weights to the Hugging Face Hub. He mostly uses wikitext as the quantization dataset for general-domain models.
If you want to use, e.g., your fine-tuning dataset for quantization, you can provide it as a list instead of the "id", as shown in the sketch below.
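As a minimal sketch, here is how a custom quantization dataset could be built as a list of strings; the dataset id, split, text column, and sample count are assumptions, so swap in your own fine-tuning data:

```python
from datasets import load_dataset

# Load a text dataset and turn 128 non-empty samples into a plain list of strings.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
quant_dataset = [s["text"] for s in raw.select(range(1000)) if s["text"].strip()][:128]

print(f"Prepared {len(quant_dataset)} calibration samples")
```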
3. Load and Quantize Model
Optimum integrates GPTQ quantization in the optimum.gptq namespace with a GPTQQuantizer. The quantizer takes our dataset (id or list), bits, and model_seqlen as input. For more customization options, check the Optimum documentation.
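A sketch of creating the quantizer; the concrete values (4 bits, the wikitext2 calibration set, a model_seqlen of 4096) are assumptions based on this post's setup:

```python
from optimum.gptq import GPTQQuantizer

quantizer = GPTQQuantizer(
    bits=4,               # target weight precision
    dataset="wikitext2",  # calibration dataset id, or a list of strings as shown above
    model_seqlen=4096,    # sequence length used during calibration
)
```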
After we have created our quantizer, we can load our model using Transformers. In our example, we will quantize a Llama 2 7B, which we trained in my other blog post "Extended Guide: Instruction-tune Llama 2". We are going to load our model in fp16, since GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16.
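Loading the model in fp16 could look like the following; the model id is a placeholder, so replace it with your own fine-tuned Llama 2 checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder -- use your fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # GPTQ keeps activations in fp16
    low_cpu_mem_usage=True,
)
```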
After we have loaded our model, we are ready to quantize it. Note: The quantization process can take a lot of time depending on your hardware. For this example, quantizing a 7B model on a single A10G GPU took ~45 minutes.
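Continuing from the quantizer and model above, the quantization and saving step could look like this; the output directory name is an assumption:

```python
# Run GPTQ quantization -- expect this to take a while (~45 minutes for 7B on an A10G).
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized weights and tokenizer to disk for reuse.
save_dir = "llama-7b-gptq"
quantizer.save(quantized_model, save_dir)
tokenizer.save_pretrained(save_dir)
```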
Since the model was partially offloaded during quantization, disable_exllama was set to True to avoid an error. For inference and production loads we want to leverage the exllama kernels, so we need to change that setting in the config.json of the saved model.
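One way to flip that flag back in the saved config.json, reusing the save_dir from the step above and assuming the quantization_config layout written by the quantizer:

```python
import json
import os

config_path = os.path.join(save_dir, "config.json")
with open(config_path) as f:
    config = json.load(f)

# Re-enable the exllama kernels for inference.
config["quantization_config"]["disable_exllama"] = False

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```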
4. Test performance and inference speed
Since the latest release of Transformers, we can load any GPTQ-quantized model directly using the AutoModelForCausalLM class. You can either load already quantized models from Hugging Face, e.g. TheBloke/Llama-2-13B-chat-GPTQ, or models you quantized yourself. Since we want to test the results of our quantization here, we are going to load our quantized model from disk and compare it to our non-quantized model.
First, let's load our non-quantized model and test it on a simple prompt.
We can load the vanilla Transformers model and run inference using the pipeline class. Let's create our vanilla baseline.
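A sketch of the baseline, assuming the same placeholder model id as above and a simple example prompt:

```python
import time

import torch
from transformers import pipeline

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder -- use the model you quantized
prompt = "What is the benefit of quantizing large language models?"

vanilla_pipe = pipeline(
    "text-generation", model=model_id, torch_dtype=torch.float16, device_map="auto"
)

start = time.time()
result = vanilla_pipe(prompt, max_new_tokens=100, do_sample=False)
print(result[0]["generated_text"])
print(f"fp16 generation took {time.time() - start:.2f}s")
```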
Now that we have our baseline, we can test and validate our GPTQ-quantized weights. We will use the new GPTQ integration in the AutoModelForCausalLM class, which lets us load the GPTQ weights directly.
Now, we can test our quantized model on the same prompt as our baseline.
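A sketch of loading and running the quantized model, reusing the save directory and prompt assumed earlier:

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

save_dir = "llama-7b-gptq"  # directory assumed in the quantization step above
prompt = "What is the benefit of quantizing large language models?"

q_tokenizer = AutoTokenizer.from_pretrained(save_dir)
q_model = AutoModelForCausalLM.from_pretrained(save_dir, device_map="auto")
q_pipe = pipeline("text-generation", model=q_model, tokenizer=q_tokenizer)

start = time.time()
result = q_pipe(prompt, max_new_tokens=100, do_sample=False)
print(result[0]["generated_text"])
print(f"GPTQ generation took {time.time() - start:.2f}s")
```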
For comparison, the vanilla model needed ~12.6GB of memory while the GPTQ model needed only ~3.8GB, with equal performance. GPTQ allowed us to save ~4x memory (keep in mind that PyTorch also loads its default kernels, which take up some memory).
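If you want to reproduce the memory comparison, one approximate way is to check PyTorch's peak GPU allocation around each generation run:

```python
import torch

# Reset before loading/running a model, then read the peak afterwards.
torch.cuda.reset_peak_memory_stats()
# ... load the model and run generation here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.1f} GB")
```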
5. Bonus: Run Inference with Text Generation Inference
Text Generation Inference supports GPTQ models for more efficient deployments. We simply need to provide gptq as the QUANTIZE environment variable when starting our container.
We can invoke our container using curl. _Note: The first request will be slow._
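As an alternative to curl, here is an equivalent Python request against TGI's generate endpoint, assuming the container is exposed on localhost port 8080:

```python
import requests

payload = {
    "inputs": "What is the benefit of quantizing large language models?",
    "parameters": {"max_new_tokens": 100},
}

# POST to the container's /generate endpoint and print the generated text.
response = requests.post("http://localhost:8080/generate", json=payload)
print(response.json()["generated_text"])
```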
With Text Generation Inference we are achieving ~22.94ms latency per token, which is ~2x faster than Transformers. If you plan to deploy your model in production, I would recommend using Text Generation Inference.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.