Evaluate LLMs with Hugging Face Lighteval on Amazon SageMaker
In this SageMaker example, we are going to learn how to evaluate LLMs using Hugging Face lighteval. LightEval supports the evaluation suite used in the Hugging Face Open LLM Leaderboard.
Evaluating LLMs is crucial for understanding their capabilities and limitations, yet it poses significant challenges due to their complex and opaque nature. LightEval facilitates this evaluation process by enabling LLMs to be assessed on academic benchmarks like MMLU or IFEval, providing a structured approach to gauge their performance across diverse tasks.
In detail, you will learn how to:
- Setup Development Environment
- Prepare the evaluation configuration
- Evaluate Zephyr 7B on TruthfulQA on Amazon SageMaker
Let's get started.
1. Setup Development Environment
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
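A minimal setup could look like the sketch below; the fallback role name `sagemaker_execution_role` is an assumption for local environments and should be replaced with the name of your own role:

```python
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if no bucket name is given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # running locally: look up the execution role by name (assumed role name)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
```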
2. Prepare the evaluation configuration
LightEval includes scripts to evaluate LLMs on common benchmarks like MMLU, TruthfulQA, IFEval, and more. Lighteval was inspired by the EleutherAI LM Evaluation Harness, which is used to evaluate models on the Hugging Face Open LLM Leaderboard.
You can find all available benchmarks here.
We are going to use Amazon SageMaker Managed Training to evaluate the model. Therefore, we will leverage the `run_evals_accelerate.py` script available in lighteval. The Hugging Face DLC does not have lighteval installed, which means we need to provide a `requirements.txt` file to install the required dependencies.
First, let's load the `run_evals_accelerate.py` script and create a `requirements.txt` file with the required dependencies.
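A sketch of this step is shown below; the pinned lighteval version and the raw GitHub path to the script are assumptions, so check the lighteval repository for the location of `run_evals_accelerate.py` in the release you use:

```python
import os
import requests

# Assumption: version pin and script location at the repository root of that tag.
LIGHTEVAL_VERSION = "0.2.0"

os.makedirs("scripts", exist_ok=True)

# download the evaluation entry point from the lighteval repository
url = f"https://raw.githubusercontent.com/huggingface/lighteval/v{LIGHTEVAL_VERSION}/run_evals_accelerate.py"
with open("scripts/run_evals_accelerate.py", "w") as f:
    f.write(requests.get(url).text)

# requirements.txt is installed by the Hugging Face DLC before the job starts
with open("scripts/requirements.txt", "w") as f:
    f.write(f"lighteval=={LIGHTEVAL_VERSION}\n")
```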
In lighteval, the evaluation is done by running the `run_evals_accelerate.py` script. The script takes a `task` argument, which is defined as `suite|task|num_few_shot|{0 or 1 to automatically reduce num_few_shot if prompt is too long}`. Alternatively, you can provide a path to a txt file with the tasks you want to evaluate the model on, which is what we are going to do. This makes it easier for you to extend the evaluation to other benchmarks.
We are going to evaluate the model on the TruthfulQA benchmark with 0 few-shot examples. TruthfulQA is a benchmark designed to measure whether a language model generates truthful answers to questions, encompassing 817 questions across 38 categories including health, law, finance, and politics.
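As a sketch, the tasks file could be written like this; the exact task identifier is an assumption based on lighteval's `suite|task|num_few_shot|{0 or 1}` format, so check the list of available benchmarks for the name used in your lighteval version:

```python
# Write the tasks file used for evaluation.
# Assumption: "lighteval|truthfulqa:mc|0|0" is the identifier for TruthfulQA (multiple choice)
# with 0 few-shot examples; verify it against your lighteval version's task list.
with open("scripts/tasks.txt", "w") as f:
    f.write("lighteval|truthfulqa:mc|0|0")
```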
To evaluate a model on all the benchmarks of the Open LLM Leaderboard, you can copy this file.
3. Evaluate Zephyr 7B on TruthfulQA on Amazon SageMaker
In this example, we are going to evaluate HuggingFaceH4/zephyr-7b-beta on the TruthfulQA benchmark, which is part of the Open LLM Leaderboard.
In addition to the `task` argument, we need to define:
- `model_args`: Hugging Face model ID or path, defined as `pretrained=HuggingFaceH4/zephyr-7b-beta`
- `model_dtype`: the model data type, defined as `bfloat16`, `float16`, or `float32`
- `output_dir`: the directory where the evaluation results will be saved, e.g. `/opt/ml/model`
Lighteval can also evaluate PEFT models or use `chat_templates`; you can find more about it here.
We can now start our evaluation job with `.fit()`.
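A sketch of the estimator and the call could look like this; the DLC versions below are assumptions and should be aligned with the lighteval release you pinned in `requirements.txt`:

```python
from sagemaker.huggingface import HuggingFace

# Hugging Face estimator for the evaluation job.
huggingface_estimator = HuggingFace(
    entry_point="run_evals_accelerate.py",  # evaluation script from lighteval
    source_dir="scripts",                   # directory with the script, tasks.txt and requirements.txt
    instance_type="ml.g5.4xlarge",          # instance used in this example
    instance_count=1,
    role=role,                              # IAM role from the setup step
    transformers_version="4.36",            # assumption: DLC version, align with your setup
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters=hyperparameters,
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # cache model weights on the larger /tmp volume
)

# start the evaluation job
huggingface_estimator.fit()
```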
After the evaluation job is finished, we can download the evaluation results from the S3 bucket. Lighteval will save the results and generations in the `output_dir`. The results are saved as JSON and include detailed information about each task and the model's performance. The results are available in the `results` key.
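A sketch for downloading and inspecting the results is shown below; the archive layout and result file names are assumptions that depend on the lighteval version, so we simply search for the JSON file:

```python
import glob
import json
import tarfile

from sagemaker.s3 import S3Downloader

# Download the model.tar.gz produced by the job; it contains everything
# lighteval wrote to output_dir (/opt/ml/model). Local paths are illustrative.
S3Downloader.download(huggingface_estimator.model_data, "evaluation")

with tarfile.open("evaluation/model.tar.gz") as tar:
    tar.extractall("evaluation")

# Assumption: lighteval writes a timestamped results_*.json file; glob for it.
result_file = glob.glob("evaluation/**/results_*.json", recursive=True)[0]
with open(result_file) as f:
    results = json.load(f)["results"]  # per-task metrics, e.g. mc1/mc2 for TruthfulQA

print(results)
```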
In our test, we achieved an `mc1` score of 40.6% and an `mc2` score of 57.47%. The `mc2` score is the one used in the Open LLM Leaderboard. Zephyr 7B achieved an `mc2` score of 57.47% on the TruthfulQA benchmark, which is identical to its score on the Open LLM Leaderboard.
The evaluation on TruthfulQA took 999 seconds. The ml.g5.4xlarge instance we used costs $2.03 per hour for on-demand usage. As a result, the total cost for evaluating Zephyr 7B on TruthfulQA was $0.56.
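For reference, that number is simply the job duration multiplied by the on-demand price:

```python
# Back-of-the-envelope cost check for the numbers above
duration_hours = 999 / 3600           # ≈ 0.2775 h
price_per_hour = 2.03                 # ml.g5.4xlarge on-demand price used above
print(f"${duration_hours * price_per_hour:.2f}")  # ≈ $0.56
```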
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.