Evaluate LLMs using Evaluation Harness and Hugging Face TGI/vLLM
As Large Language Models (LLMs) like OpenAI o1, Meta Llama, and Anthropic Claude continue to become more performant, it's crucial to validate their general performance on core capabilities such as instruction following, reasoning, and mathematical skills using benchmarks like IFEval and GSM8K. While these benchmarks may not perfectly reflect your downstream use case, they provide a valuable general picture of a model's strengths and weaknesses.
However, running these comprehensive evaluations can be time-consuming and computationally intensive, especially with larger models. This is where optimized LLM serving tools like Hugging Face's Text Generation Inference (TGI) and vLLM come into play. Evaluating against such a serving stack also allows us to validate the accuracy and implementation of a model in a production-like environment.
In this blog post, we will learn how to evaluate LLMs hosted with TGI or vLLM behind OpenAI-compatible API endpoints, which can run locally or remotely. We will use the Evaluation Harness to evaluate the Llama 3.1 8B Instruct model on the IFEval and GSM8K benchmarks with Chain of Thought reasoning.
Evaluation Harness
Evaluation Harness is an open-source framework for evaluating language models on a wide range of tasks and benchmarks. It supports various models and provides tools to streamline the evaluation process. It is used for the Hugging Face Open LLM Leaderboard.
Hugging Face Text Generation Inference (TGI)
Text Generation Inference is a scalable, optimized solution for deploying and serving Large Language Models (LLMs). TGI supports popular open-source models like Llama, Mistral, and Gemma.
Evaluate Llama 3.1 8B Instruct on IFEval & GSM8K
Now, let's get started evaluating the Llama 3.1 8B Instruct model on IFEval and GSM8K benchmarks using Chain of Thought reasoning.
Note: This tutorial was run on an AWS g6e.2xlarge instance with 1x NVIDIA L40S GPU.
1. Running the Model with TGI
First, we'll use TGI to serve the Llama 3.1 8B Instruct model. Ensure you have Docker installed and a valid Hugging Face token.
Run Command:
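The exact command depends on your environment. Below is a minimal sketch of launching TGI via Docker, assuming a single GPU and mapping the container to host port 8000 so it matches the endpoint used later in the evaluation; the token limits and model id are assumptions you may need to adjust.

```bash
# Sketch: serve Llama 3.1 8B Instruct with TGI (values are assumptions, adjust to your setup)
docker run --gpus all -ti --shm-size 1g -p 8000:80 \
  -e MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct \
  -e NUM_SHARD=1 \
  -e MAX_INPUT_TOKENS=8000 \
  -e MAX_TOTAL_TOKENS=8192 \
  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
  ghcr.io/huggingface/text-generation-inference:latest
```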
- MODEL_ID: Specifies the model to use.
- NUM_SHARD: Number of shards (GPUs) to use.
- MAX_INPUT_TOKENS: Maximum number of input tokens; you might need to adjust this based on the GPU you use.
- MAX_TOTAL_TOKENS: Maximum number of input and output tokens combined; you might need to adjust this based on the GPU you use.
- HF_TOKEN: Your Hugging Face access token. You need to log in first with `huggingface-cli login`.
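Once the server is up, you can sanity-check the OpenAI-compatible endpoint with a quick request. This is only a sketch; adjust the port and model name to match how you launched the server.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is 25 * 16?"}],
        "max_tokens": 128
      }'
```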
Note: Alternatively, you can use vLLM's OpenAI-compatible API server to serve the model. vLLM is another efficient serving library that supports high-throughput inference.
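As a rough sketch (the exact invocation depends on your vLLM version), serving the same model with vLLM could look like this; the server listens on port 8000 by default, matching the endpoint used below.

```bash
# Sketch: vLLM OpenAI-compatible server for the same model (model id assumed)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000
```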
2. Evaluate LLM with `lm_eval` through OpenAI API Endpoints
Evaluation Harness provides a CLI tool, `lm_eval`, to evaluate models on various tasks and benchmarks. We can run one or more tasks in parallel by passing a comma-separated list of tasks. `lm_eval` supports many different configurations and options to evaluate models. A full list can be found in the documentation.
The most important parameters are `model`, `tasks`, and `model_args`; these tell the harness which model interface (strategy) to use, with which arguments, and on which tasks to evaluate. Since we are hosting our LLM behind an OpenAI-compatible API, we use the `local-chat-completions` model interface, which lets us evaluate models through the OpenAI `messages` API format. This makes it easy to switch between models or run the evaluation against a different host, but it also comes with a limitation: `local-chat-completions` does not support `loglikelihood`, which is required for some tasks. The important CLI flags for our run are:
- `--model`: Specifies the model interface. We'll use `local-chat-completions`.
- `--tasks`: Comma-separated list of tasks to evaluate (e.g., `gsm8k_cot_llama,ifeval`).
- `--model_args`: Additional model arguments like `model`, `base_url`, and `num_concurrent`.
  - `model`: The model identifier, used for the tokenizer and other model-specific configurations.
  - `base_url`: The API endpoint where the model is served, e.g., `http://localhost:8000/v1/chat/completions`.
  - `num_concurrent`: Number of concurrent requests, e.g., `32`.
  - `max_retries`: Number of retries for failed requests, e.g., `3`.
  - `tokenized_requests`: Set to `False` for chat models.
- `--apply_chat_template`: Applies the chat template for formatting prompts.
- `--fewshot_as_multiturn`: Treats few-shot examples as multiple turns (useful for instruct models).
As mentioned in the beginning, we are evaluating the Llama 3.1 8B Instruct model on IFEval and GSM8K benchmarks with Chain of Thought reasoning.
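If you haven't installed the CLI yet, a typical setup might look like the following; the `api` extra and the additional IFEval dependencies are assumptions based on common setups, so check the Evaluation Harness documentation for the exact install instructions.

```bash
# Sketch: install the Evaluation Harness CLI with API-client support;
# langdetect and immutabledict are assumed to be needed by the IFEval task
pip install "lm_eval[api]" langdetect immutabledict
```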
After installing the CLI, we can evaluate the model with the following command:
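A sketch of the full invocation, assuming the model is served at `http://localhost:8000` and identified by the Hugging Face model id `meta-llama/Meta-Llama-3.1-8B-Instruct`:

```bash
# Evaluate IFEval and GSM8K (CoT, Llama prompt variant) against the local OpenAI-compatible endpoint
lm_eval --model local-chat-completions \
  --tasks gsm8k_cot_llama,ifeval \
  --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False \
  --apply_chat_template \
  --fewshot_as_multiturn
```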
Running the evaluation on IFEval and GSM8K with Chain of Thought reasoning took ~10 minutes on an AWS g6e.2xlarge instance with 1x NVIDIA L40S GPU.
3. Comparing Results
After running the evaluation, we can compare the results with what Meta officially reported.
| Task | Meta Reported | Our Result |
|---|---|---|
| IFEval | 0.804 | 0.803 |
| GSM8K | 0.845 | 0.856 |
The results are consistent with Meta's official report, indicating that the model and serving solution perform as expected.
Conclusion
We learned how to efficiently evaluate LLMs on benchmarks like IFEval or GSM8K using OpenAI-compatible endpoints provided by TGI and vLLM, which can run locally or in a remote cloud environment. We confirmed the officially reported results for the Llama 3.1 8B Instruct model on the IFEval and GSM8K benchmarks with Chain of Thought reasoning, allowing us to validate the model's performance and implementation in a production runtime.
By leveraging Evaluation Harness and an optimized serving solution like TGI or vLLM, we can streamline the evaluation process and get accurate results quickly. This helps us iterate faster and validate the production performance of LLMs.
Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.