Evaluate LLMs and RAG: a practical example using Langchain and Hugging Face
The rise of generative AI and LLMs like GPT-4, Llama, or Claude enables a new era of AI-driven applications and use cases. However, evaluating these models remains an open challenge. Academic benchmarks can often no longer be applied to generative models, since the correct or most helpful answer can be formulated in different ways, giving limited insight into real-world performance.
So, how can we evaluate the performance of LLMs if previous methods are no longer valid?
Two main approaches show promising results for evaluating LLMs: leveraging human evaluations and using LLMs themselves as judges.
Human evaluation provides the most natural measure of quality but does not scale well. Crowdsourcing services can be used to collect human assessments on dimensions like relevance, fluency, and harmfulness. However, this process is relatively slow and costly.
Recent research has proposed using LLMs themselves as judges to evaluate other LLMs. This approach, called LLM-as-a-judge, demonstrates that large LLMs like GPT-4 can match human preferences with over 80% agreement when evaluating conversational chatbots.
In this blog post, we look at a hands-on example of how to evaluate LLMs:
- Criteria-based evaluation, such as helpfulness, relevance, or harmfulness
- RAG evaluation, whether our model correctly uses the provided context to answer
- Pairwise comparison and scoring to evaluate and generate AI feedback for RLAIF
We are going to use `meta-llama/Llama-2-70b-chat-hf`, hosted through the Hugging Face Inference API, as the LLM we evaluate, accessing it with the `huggingface_hub` library. As "evaluator" we are going to use GPT-4. You can use any supported LLM of Langchain to evaluate your models. If you stick with GPT-4, make sure the environment variable `OPENAI_API_KEY` is set and valid.
Note: You need a PRO subscription on huggingface.co to use Llama 70B Chat via the API.
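As a rough sketch, the setup could look like the following. The variable names (`client`, `evaluation_llm`) are illustrative, and the exact arguments may differ depending on your `huggingface_hub` and Langchain versions.

```python
import os

from huggingface_hub import InferenceClient
from langchain.chat_models import ChatOpenAI

# The GPT-4 evaluator needs a valid OpenAI key in the environment.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY must be set"

# Client for the model we want to evaluate, served via the Hugging Face Inference API.
# A token with a PRO subscription is required for Llama 2 70B Chat.
client = InferenceClient(
    model="meta-llama/Llama-2-70b-chat-hf",
    token=os.environ.get("HF_TOKEN"),
)

# The "evaluator" LLM used by Langchain's evaluation chains.
evaluation_llm = ChatOpenAI(model="gpt-4", temperature=0)
```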
Criteria-based evaluation
Criteria-based evaluation can be useful when you want to measure an LLM's performance on specific attributes rather than relying on a single metric. It provides fine-grained, interpretable scores on conciseness, helpfulness, harmfulness, or custom criteria definitions. We are going to evaluate the model's output on the following criteria:
- conciseness of the generation
- correctness using an additional reference
- custom criteria: whether it is explained for a 5-year-old
Let's first take a look at what the model generates for the prompt:
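A minimal sketch of the generation step; the exact question is inferred from the evaluation outputs below, and the simple `[INST]` chat formatting is an assumption.

```python
# Question inferred from the evaluation outputs below; the original prompt may differ.
question = "Who is the current president of the United States?"

# Minimal Llama 2 chat formatting (no system prompt) for the Inference API.
llama_prompt = f"[INST] {question} [/INST]"

prediction = client.text_generation(llama_prompt, max_new_tokens=256).strip()
print(prediction)
```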
Looks correct to me! The criteria evaluator returns a dictionary with the following values:
- `score`: Binary integer 0 or 1, where 1 means the output is compliant with the criteria, and 0 otherwise
- `value`: A "Y" or "N" corresponding to the score
- `reasoning`: String "chain of thought reasoning" from the LLM, generated prior to creating the score
If you want to learn more about the criteria-based evaluation, check out the documentation.
Conciseness evaluation
Conciseness is an evaluation criterion that measures whether the submission is concise and to the point.
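A sketch of how the conciseness check could look with Langchain's criteria evaluator, reusing the `evaluation_llm`, `question`, and `prediction` from above:

```python
from langchain.evaluation import load_evaluator

# Criteria evaluator judging only conciseness, using GPT-4 as the judge.
evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluation_llm)

result = evaluator.evaluate_strings(
    prediction=prediction,  # the Llama 2 generation
    input=question,         # the original prompt
)
print(result)
```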
{'reasoning': 'The criterion in question is conciseness. The criterion asks '
'whether the submission is concise and straight to the point. \\n'
'\\n'
'Looking at the submission, it starts with a direct answer to '
'the question: "The current president of the United States is '
'Joe Biden." This is concise and to the point. \\n'
'\\n'
'However, the submission then continues with additional '
'information: "He was inaugurated as the 46th President of the '
'United States on January 20, 2021, and is serving a four-year '
'term that will end on January 20, 2025." While this information '
'is not irrelevant, it is not directly asked for in the input '
'and therefore might be considered as extraneous detail. \\n'
'\\n'
'Therefore, based on the specific criterion of conciseness, the '
'submission could be seen as not being entirely to the point due '
'to the extra information provided after the direct answer.\\n'
'\\n'
'N',
'score': 0,
'value': 'N'}
If I had to assess the reasoning of GPT-4, I would agree with it. The most concise answer would have been "Joe Biden".
Correctness using an additional reference
We can evaluate our generation based on correctness, but that would rely on the internal knowledge of the LLM. This might not be the best approach, since we cannot be sure the LLM actually has the correct knowledge. To be safe, we create our evaluator with `requires_reference=True` so it uses an additional reference to evaluate the correctness of the generation.
As reference we use the following text: "The new and 47th president of the United States is Philipp Schmid." This is obviously wrong, but we want to see if the evaluation LLM values the reference over the internal knowledge.
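A sketch of how this could look; depending on your Langchain version, you may need `load_evaluator("labeled_criteria", ...)` instead of passing `requires_reference=True`.

```python
from langchain.evaluation import load_evaluator

# Correctness check that grades against a provided reference instead of
# relying on the judge's internal knowledge.
evaluator = load_evaluator(
    "criteria",
    criteria="correctness",
    requires_reference=True,
    llm=evaluation_llm,
)

result = evaluator.evaluate_strings(
    prediction=prediction,
    input=question,
    reference="The new and 47th president of the United States is Philipp Schmid.",
)
print(result)
```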
{'reasoning': 'The criterion for assessing the submission is its correctness, '
'accuracy, and factuality. \\n'
'\\n'
'Looking at the submission, the answer provided states that the '
'current president of the United States is Joe Biden, '
'inaugurated as the 46th president on January 20, 2021, serving '
'a term that will end on January 20, 2025. \\n'
'\\n'
'The reference, however, states that the new and 47th president '
'of the United States is Philipp Schmid.\\n'
'\\n'
'There is a discrepancy between the submission and the '
'reference. The submission states that Joe Biden is the current '
'and 46th president, while the reference states that Philipp '
'Schmid is the new and 47th president. \\n'
'\\n'
"Given this discrepancy, it's clear that the submission does not "
'match the reference and may not be correct or factual. \\n'
'\\n'
"However, it's worth noting that the assessment should be based "
'on widely accepted facts and not the reference if the reference '
'is incorrect. According to widely accepted facts, the '
'submission is correct. Joe Biden is the current president of '
'the United States. The reference seems to be incorrect.\\n'
'\\n'
'N',
'score': 0,
'value': 'N'}
Nice! It worked as expected. The LLM evaluated the generation as incorrect based on the reference, saying "There is a discrepancy between the submission and the reference".
Custom criteria: whether it is explained for a 5-year-old
Langchain allows you to define custom criteria to evaluate your generations. In this example, we want to evaluate whether the generation is explained for a 5-year-old. We define the criterion as follows:
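A sketch of a custom criterion; the name `eli5` and the exact wording are illustrative. Custom criteria are passed to Langchain as a `{name: description}` mapping.

```python
from langchain.evaluation import load_evaluator

# Custom criterion defined as a {name: description} mapping.
custom_criterion = {
    "eli5": "Is the output explained in a way that a 5 year old would understand it?"
}

evaluator = load_evaluator("criteria", criteria=custom_criterion, llm=evaluation_llm)

result = evaluator.evaluate_strings(
    prediction=prediction,
    input=question,
)
print(result)
```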
{'reasoning': '1. The criteria stipulates that the output should be explained '
'in a way that a 5-year-old would understand it.\\n'
'2. The submission states that the current president of the '
'United States is Joe Biden, and that he was inaugurated as the '
'46th President of the United States on January 20, 2021.\\n'
'3. The submission also mentions that he is serving a four-year '
'term that will end on January 20, 2025.\\n'
'4. While the submission is factually correct, it uses terms '
'such as "inaugurated" and "four-year term", which a 5-year-old '
'may not understand.\\n'
'5. Therefore, the submission does not meet the criteria of '
'being explained in a way that a 5-year-old would understand.\\n'
'\\n'
'N',
'score': 0,
'value': 'N'}
GPT-4's reasoning that a 5-year-old might not understand what a "four-year term" is makes sense, but one could also argue the opposite. Since this is only an example of how to define custom criteria, we leave it as it is.
Retrieval Augmented Generation (RAG) evaluation
Retrieval Augmented Generation (RAG) is one of the most popular use cases for LLMs, but it is also one of the most difficult to evaluate. We want RAG models to use the provided context to correctly answer a question, write a summary, or generate a response. This is a challenging task for LLMs, and it is difficult to evaluate whether the model is using the context correctly.
Langchain has a handy `ContextQAEvalChain` class that allows you to evaluate your RAG models. It takes a `context` and a `question` as well as a `prediction` and a `reference` to evaluate the correctness of the generation. The evaluator returns a dictionary with the following values:
- `reasoning`: String "chain of thought reasoning" from the LLM, generated prior to creating the score
- `score`: Binary integer 0 or 1, where 1 means the output is correct, and 0 otherwise
- `value`: A "CORRECT" or "INCORRECT" corresponding to the score
Looks good! We can also quickly test how Llama would respond without the context.
As we can see, without the context the generation is incorrect. Now let's see if our evaluator can detect that as well. As reference we will use the raw number, 541,000.
Nice! It worked as expected. The LLM evaluated the generation as correct. Let's now test what happens if we provide a wrong prediction.
Awesome! The evaluator detected that the generation is incorrect.
Alternatively, if you don't have a reference, you can reuse the `criteria` evaluator to evaluate the correctness, using the "question" as input and the "context" as reference.
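A sketch of that variant, reusing the correctness criterion with the retrieved context standing in for the reference:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator(
    "criteria",
    criteria="correctness",
    requires_reference=True,
    llm=evaluation_llm,
)

result = evaluator.evaluate_strings(
    prediction=prediction,
    input=question,
    reference=context,  # the retrieved context acts as the reference
)
print(result)
```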
{'reasoning': 'The criteria is to assess the correctness, accuracy, and '
'factual nature of the submission.\\n'
'\\n'
'The submission states that according to the text, Nuremberg has '
'a population of 541,000 inhabitants.\\n'
'\\n'
'Looking at the reference, it indeed confirms that Nuremberg is '
'the 14th-largest city in Germany with 541,000 inhabitants.\\n'
'\\n'
'Thus, the submission is correct according to the reference '
'text. It accurately cites the number of inhabitants in '
'Nuremberg. It is factual as it is based on the given '
'reference.\\n'
'\\n'
'Therefore, the submission meets the criteria.\\n'
'\\n'
'Y',
'score': 1,
'value': 'Y'}
As we can see GPT-4 correctly reasoned that the generation is correct based on the provided context.
Pairwise comparison and scoring
Pairwise comparison and scoring are methods for evaluating LLMs that ask a judge model to choose between two generations or to score their quality. These methods are useful for evaluating whether a model can generate a better response than another/previous model. They can also be used to generate preference data or AI feedback for RLAIF or DPO.
Let's first look at pairwise comparison. For this, we first create two generations and then ask the LLM to choose between them.
Now, let's use our LLM to select its preferred generation.
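A sketch using Langchain's pairwise string evaluator; `generation_a`, `generation_b`, and `prompt` are placeholders for the two candidate emails generated above and the shared instruction they answer.

```python
from langchain.evaluation import load_evaluator

# Pairwise judge: GPT-4 picks the preferred response.
evaluator = load_evaluator("pairwise_string", llm=evaluation_llm)

result = evaluator.evaluate_string_pairs(
    prediction=generation_a,    # first candidate response
    prediction_b=generation_b,  # second candidate response
    input=prompt,               # the shared instruction given to both
)
print(result)
```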
{'reasoning': 'Both Assistant A and Assistant B provided appropriate and '
"relevant responses to the user's request. Both emails are "
'professional, polite, and adhere to the typical format of a '
"business email. However, Assistant B's response is more "
'detailed, as it includes specifics about what will be discussed '
'during the meeting (Smith project and sales numbers). This '
"added level of detail makes Assistant B's response more helpful "
'and demonstrates a greater depth of thought. Therefore, '
"Assistant B's response is superior. \\n"
'\\n'
'Final verdict: [[B]]',
'score': 0,
'value': 'B'}
The LLM selected the second generation as the preferred one. We could now use this information to generate AI feedback for RLAIF or DPO. Next, we want to look in a bit more detail at our two generations and how they would be scored. Scoring can help us evaluate our generations in a more qualitative way.
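A sketch of how scoring could look with Langchain's `score_string` evaluator, which grades a single response on a 1 to 10 scale; it reuses the same placeholders as the pairwise example.

```python
from langchain.evaluation import load_evaluator

# Scoring judge: grades each generation individually from 1 to 10.
evaluator = load_evaluator("score_string", llm=evaluation_llm)

score_a = evaluator.evaluate_strings(prediction=generation_a, input=prompt)
score_b = evaluator.evaluate_strings(prediction=generation_b, input=prompt)

# Each result contains a "score" plus the evaluator's reasoning.
print(score_a["score"], score_b["score"])
```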
Conclusion
In this post, we looked at practical methods for evaluating large language models using Langchain. Criteria-based evaluation allows us to check models against attributes like conciseness and correctness. We evaluated whether RAG pipelines properly utilize the provided context, and we looked at pairwise comparison and scoring to generate preference judgments between model outputs.
As large language models continue to advance, evaluation remains crucial for tracking progress and mitigating risks. Using LLM-as-a-judge can provide a scalable method to approximate human judgments. However, biases like a preference for verbosity can impact the results. Combining lightweight human evaluations with LLM-as-a-judge can provide both quality and scale.
Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.