LLM Evaluation doesn't need to be complicated
Generative AI and large language models (LLMs) like GPT-4, Llama, and Claude have ushered in a new era of AI-driven applications and use cases. However, evaluating LLMs can feel daunting: with so many complex libraries and methodologies, it is easy to get overwhelmed.
LLM Evaluation doesn't need to be complicated. You don't need complex pipelines, databases or infrastructure components to get started building an effective evaluation pipeline.
A great example of this comes from Discord, which built a chatbot for over 20 million users. Discord focused on implementing evaluations that were quick to implement and easy to run. One clever technique they used was checking whether a message was written in all lowercase to determine if the chatbot was being used casually or in another way.
In this blog post, we will learn how to set up a simplified evaluation workflow for your LLM applications. Inspired by G-EVAL and Self-Rewarding Language Models, we will use an additive score, chain-of-thought (CoT), and form-filling prompt templates with few-shot examples to guide the evaluation. This method aligns well with human judgments and makes the evaluation process understandable, effective, and easy to manage.
As our LLM judge, we will use meta-llama/Meta-Llama-3-70B-Instruct, hosted through the Hugging Face Inference API and accessed with the OpenAI client. You can also use other LLMs.
How to create a good evaluation prompt for LLM as a Judge
When using LLM as a Judge for evaluation, the prompt you use to assess the quality of your model is the most important part. The following recommendations are based on practical experience and insights from recent research, particularly the G-EVAL paper and the Self-Rewarding Language Models paper.
1. Define a Clear Evaluation Metric (Optional: Additive Score)
Start by establishing a clear metric for your evaluation and break it down into specific criteria using, for example, an additive score. This approach enhances consistency and can align better with human judgment using few-shot examples. For example:
- Add 1 point if the answer directly addresses the main topic of the question without straying into unrelated areas.
- Add 1 point if the answer is appropriate for educational use and introduces key concepts for learning coding.
- …
Using a small integer scale (0-5) simplifies the scoring process and reduces variability in the LLM's judgments.
2. Define Chain-of-Thought (CoT) Evaluation Steps
Provide predefined reasoning steps so the LLM applies a step-by-step evaluation process. This leads to a more thoughtful and accurate evaluation. For example:
- Read the question carefully to understand what is being asked.
- Read the answer thoroughly.
- Assess the length of the answer. Is it unnecessarily long or appropriately brief?
- …
3. Include Few-Shot Examples (Optional)
Adding examples of questions, responses, reasoning steps, and their evaluations can help guide the LLM more closely to human preferences and improve its robustness.
4. Define Output Schema
Request the evaluation results in a structured format (e.g., JSON) with fields for each criterion and the total score. This allows you to parse the results and calculate the metrics automatically. It can be further improved by providing few-shot examples.
Here is an example of how this would look if you put it all together.
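A sketch of the assembled prompt, built from the criteria, reasoning steps, and output schema above (the exact wording and placeholder names like `{question}` are illustrative and should be adapted to your use case):

```
You are an expert evaluator. Assess the quality of the answer to the question
below using the following additive criteria.

Evaluation Criteria (Additive Score, 0-5):
1. Add 1 point if the answer directly addresses the main topic of the question
   without straying into unrelated areas.
2. Add 1 point if the answer is appropriate for educational use and introduces
   key concepts for learning coding.
...

Evaluation Steps:
1. Read the question carefully to understand what is being asked.
2. Read the answer thoroughly.
3. Assess the length of the answer. Is it unnecessarily long or appropriately brief?
...

Question: {question}
Answer: {answer}

Respond only with JSON in this format:
{"reasoning": "<your step-by-step reasoning>", "total_score": <0-5>}
```

Few-shot examples of scored question-answer pairs can be appended before the final question to further anchor the judge.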
Use an LLM as a Judge to evaluate a RAG application
Retrieval Augmented Generation (RAG) is one of the most popular use cases for LLMs, but it is also one of the most difficult to evaluate. There are common metrics for RAG, but they might not always fit your use case or may be too generic. Instead, we define a new additive RAG metric on a 3-point scale.
This 3-point additive metric evaluates RAG system responses based on their adherence to the given context, completeness in addressing all key elements, and relevance combined with conciseness.
Note: This is a completely made-up metric for demonstration purposes only. It is important you define the metrics and criteria based on your use case and importance.
To evaluate our model, we need to define the `additive_criteria`, `evaluation_steps`, and `json_schema`.
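As a sketch, the three definitions could look like this; the exact wording of the criteria and steps is illustrative and mirrors the 3-point metric described above:

```python
# Illustrative definitions for the 3-point additive RAG metric described above.
additive_criteria = """1. Context Adherence: Award 1 point if the answer uses only
information from the provided context and does not hallucinate.
2. Completeness: Award 1 point if the answer addresses all key elements of the question.
3. Relevance & Conciseness: Award 1 point if the answer is relevant and concise."""

evaluation_steps = """1. Read the question, context, and answer carefully.
2. Check every claim in the answer against the provided context.
3. Verify that all key elements of the question are addressed.
4. Judge whether the answer stays relevant and avoids unnecessary length.
5. Assign points for each criterion and sum them into a total score."""

json_schema = """{
  "reasoning": "Your step-by-step reasoning here",
  "total_score": 0
}"""
```

These strings are later interpolated into the judge prompt alongside the question, context, and answer.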
To help improve the model's performance, we define three few-shot examples: a 0-score example, a 1-score example, and a 3-score example. You can find them in the dataset repository. For the evaluation data, we will use a synthetic dataset from the **2023_10 NVIDIA SEC Filings.** This dataset includes a question, answer, and context. We are going to evaluate 50 random samples to see how well the answer performs based on our defined metric.
We are going to use the async `AsyncOpenAI` client to score multiple examples in parallel.
Then, we define our `get_eval_score` method.
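A minimal sketch of what `get_eval_score` could look like. The prompt template and the sample field names (`question`, `context`, `answer`) are assumptions based on the dataset described below, and the small `extract_json` helper guards against the judge wrapping extra text around the JSON object:

```python
import json

# Hypothetical prompt template; in practice it would interpolate the
# additive_criteria, evaluation_steps, and json_schema defined earlier.
PROMPT_TEMPLATE = """Evaluate the answer below using the 3-point additive metric.

Question: {question}
Context: {context}
Answer: {answer}

Respond only with JSON: {{"reasoning": "...", "total_score": 0}}"""


def extract_json(text: str) -> dict:
    """Pull the first JSON object out of the judge's reply."""
    return json.loads(text[text.find("{") : text.rfind("}") + 1])


async def get_eval_score(client, sample, model="meta-llama/Meta-Llama-3-70B-Instruct"):
    # Ask the judge model for a structured evaluation of one sample.
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(**sample)}],
        temperature=0,
    )
    sample["eval"] = extract_json(response.choices[0].message.content)
    return sample
```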
The last missing piece is the data. We use the `datasets` library to load our samples.
Let's test an example.
Awesome, it works and looks good now. Let's evaluate all 50 examples and then calculate our average score.
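The parallel fan-out and averaging can be sketched as follows; `score_fn` stands in for an async call that returns a sample's total score (in our pipeline, a thin wrapper around `get_eval_score`), and the semaphore keeps the number of concurrent API requests bounded:

```python
import asyncio


async def evaluate_all(score_fn, samples, concurrency=10):
    # Bound concurrency so we don't overwhelm the inference endpoint.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(sample):
        async with sem:
            return await score_fn(sample)

    scores = await asyncio.gather(*(bounded(s) for s in samples))
    return sum(scores) / len(scores)


# Demo with a dummy scorer instead of a real judge call:
async def dummy_score(sample):
    return sample["total_score"]


avg = asyncio.run(evaluate_all(dummy_score, [{"total_score": 3}, {"total_score": 2}]))
print(avg)  # 2.5
```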
Great. We achieved an average score of 2.78! To better understand why it is only 2.78, let's look at an example that scored poorly and check whether the judgment is correct.
In my test, I got 2 samples with a score of 0. Let's look at the first.
Wow. Our LLM judge correctly identified that the question asked about 2023, but the context only provided information about 2022. Additionally, we see that the completeness and conciseness criteria rely heavily on the context. Depending on your needs, there could be room for improvement in our prompt.
Limitations
An LLM judge can be biased toward preferring LLM-generated text over human-written text. This can be mitigated with good few-shot examples written by human experts.
Another key issue is inconsistency in their evaluations. Your prompt and the predefined steps and criteria are crucial for your results and may not align perfectly with every use case, so adjustments will be needed. The G-EVAL and Self-Rewarding Language Models papers provide more examples of how prompts can be fine-tuned for better alignment.
Moreover, the additive score, while simple and effective, might not work in all scenarios. Sometimes, a simple boolean check (correct/incorrect) might be enough.
Lastly, don't forget your judge's context window. If your prompt exceeds it, the evaluation may fail or degrade.
Conclusion
Remember, this is a starting point. As you apply this template to your own system, you may need to refine and adjust it to your specific needs. Using human-labeled few-shot examples allows you to align your LLM judge with a human expert at almost zero cost.
The key is to start simple, iterate, and refine your approach. And always look at your data. Evaluation is not something you do only once.
I can also recommend reading Hamel's Your AI Product Needs Evals blog post.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.