Evaluate open LLMs with Vertex AI and Gemini
The Gen AI Evaluation Service in Vertex AI lets us evaluate LLMs or applications using existing evaluation criteria or our own. It supports academic metrics like BLEU and ROUGE, LLM-as-a-Judge with Pointwise and Pairwise metrics, and custom metrics you can define yourself. By default, Gemini 1.5 Pro is used as the LLM judge.
We can use the Gen AI Evaluation Service to evaluate the performance of open and fine-tuned models deployed on Vertex AI Endpoints and compute resources. In this example we will evaluate summaries of news articles generated by meta-llama/Meta-Llama-3.1-8B-Instruct, using a Pointwise metric based on the G-Eval Coherence metric.
We will cover the following topics:
- Setup / Configuration
- Deploy Llama 3.1 8B on Vertex AI
- Evaluate Llama 3.1 8B using different prompts on Coherence
- Interpret the results
- Clean up resources
Setup / Configuration
First, you need to install gcloud, the command-line tool for Google Cloud, on your local machine, following the instructions at Cloud SDK Documentation - Install the gcloud CLI.
Then, you also need to install the google-cloud-aiplatform
Python SDK, which is required to programmatically create the Vertex AI model, register it, create the endpoint, and deploy it on Vertex AI.
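A minimal way to install it from Python is shown below; the evaluation extra and the additional Hugging Face libraries (huggingface_hub, transformers, datasets) are assumptions based on what this example uses later, so adjust them as needed.

```python
import subprocess
import sys

# Install the Vertex AI SDK (with its evaluation extra) and the Hugging Face
# libraries used later in this example; adjust the package list as needed.
subprocess.check_call([
    sys.executable, "-m", "pip", "install", "--upgrade", "--quiet",
    "google-cloud-aiplatform[evaluation]",
    "huggingface_hub",
    "transformers",
    "datasets",
])
```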
For ease of use we define the following environment variables for GCP.
Note 1: Make sure to adapt the project ID to your GCP project.
Note 2: The Gen AI Evaluation Service is not available in all regions. If you want to use it, you need to select a region that supports it. us-central1
is currently supported.
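A minimal sketch of this configuration; the project ID below is a placeholder you should replace with your own.

```python
import os

# Note 1: replace with your own GCP project ID.
os.environ["PROJECT_ID"] = "my-gcp-project"
# Note 2: us-central1 currently supports the Gen AI Evaluation Service.
os.environ["LOCATION"] = "us-central1"

PROJECT_ID = os.environ["PROJECT_ID"]
LOCATION = os.environ["LOCATION"]
```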
Then you need to log in to your GCP account and set the project ID to the one you want to use to register and deploy the models on Vertex AI.
Once you are logged in, you need to enable the necessary service APIs in GCP, such as the Vertex AI API, the Compute Engine API, and Google Container Registry related APIs.
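These steps are performed with the gcloud CLI. The sketch below drives the standard commands from Python using the PROJECT_ID defined above; you can just as well run the same commands directly in a terminal.

```python
import subprocess

# Authenticate the gcloud CLI (this opens a browser window) and point it at
# the project configured above.
subprocess.run(["gcloud", "auth", "login"], check=True)
subprocess.run(["gcloud", "config", "set", "project", PROJECT_ID], check=True)

# Enable the service APIs required for this example.
subprocess.run(
    [
        "gcloud", "services", "enable",
        "aiplatform.googleapis.com",         # Vertex AI API
        "compute.googleapis.com",            # Compute Engine API
        "containerregistry.googleapis.com",  # Container Registry API
    ],
    check=True,
)
```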
Deploy Llama 3.1 8B on Vertex AI
Once everything is set up, we can deploy the Llama 3.1 8B model on Vertex AI. We will use the google-cloud-aiplatform
Python SDK to do so. Since meta-llama/Meta-Llama-3.1-8B-Instruct is a gated model, you need to log in to your Hugging Face Hub account with a read-access token, either a fine-grained token with access to the gated model or one with overall read access to your account. You can find more information on how to generate a read-only access token for the Hugging Face Hub in the instructions at Hugging Face Hub Security Tokens.
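For example, with the huggingface_hub library, which prompts for the token interactively:

```python
from huggingface_hub import login

# Log in with a read-access token that can access the gated
# meta-llama/Meta-Llama-3.1-8B-Instruct repository.
login()
```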
After we are logged in, we can "upload" the model, i.e. register it on Vertex AI. If you want to learn more about the arguments you can pass to the upload
method, check out Deploy Gemma 7B with TGI on Vertex AI.
We will deploy meta-llama/Meta-Llama-3.1-8B-Instruct
on 1x NVIDIA L4 accelerator with 24GB of memory. We set the TGI parameters to allow a maximum of 8000 input tokens, 8192 total tokens, and 8192 batch prefill tokens.
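A sketch of the registration step is shown below. The container URI is a placeholder you need to point at the Hugging Face TGI Deep Learning Container image you want to use, and the TGI environment variable names (e.g. MAX_INPUT_TOKENS vs. MAX_INPUT_LENGTH on older releases) should be checked against the container's documentation.

```python
from google.cloud import aiplatform
from huggingface_hub import get_token

aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Placeholder: set this to the TGI Deep Learning Container image URI to use.
CONTAINER_URI = "<huggingface-text-generation-inference-dlc-uri>"

model = aiplatform.Model.upload(
    display_name="meta-llama--Meta-Llama-3.1-8B-Instruct",
    serving_container_image_uri=CONTAINER_URI,
    serving_container_environment_variables={
        "MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "NUM_SHARD": "1",
        "MAX_INPUT_TOKENS": "8000",          # MAX_INPUT_LENGTH on older TGI versions
        "MAX_TOTAL_TOKENS": "8192",
        "MAX_BATCH_PREFILL_TOKENS": "8192",
        # The token is needed inside the container to download the gated model.
        "HUGGING_FACE_HUB_TOKEN": get_token(),
    },
)
```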
WARNING: The Vertex AI endpoint deployment via the deploy
method may take from 15 to 25 minutes.
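A sketch of the deployment call, assuming a g2-standard-4 machine, which comes with a single NVIDIA L4:

```python
# Deploy the registered model to a dedicated endpoint backed by 1x NVIDIA L4.
# This call blocks and may take 15 to 25 minutes to complete.
endpoint = aiplatform.Endpoint.create(display_name="llama-3-1-8b-instruct-endpoint")

deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
```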
After the model is deployed, we can test our endpoint. We define a helper generate
function that sends requests to the deployed model; we will use it later to collect the outputs for evaluation.
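A sketch of such a helper, assuming the TGI container behind the endpoint accepts the usual {"inputs": ..., "parameters": ...} payload and returns the generated text as the first prediction; we also format requests with the model's chat template.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")


def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # Format the user prompt with the Llama 3.1 chat template.
    formatted_prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # Assumed request schema for the TGI container behind the Vertex AI endpoint.
    payload = {"inputs": formatted_prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    response = deployed_model.predict(instances=[payload])
    return response.predictions[0]
```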
Evaluate Llama 3.1 8B using different prompts on Coherence
We will evaluate the Llama 3.1 8B model using different prompts on Coherence. Coherence measures how well the individual sentences within a summarized news article connect together to form a unified and easily understandable narrative.
We are going to use the new Generative AI Evaluation Service. The Gen AI Evaluation Service can be used to:
- Model selection: Choose the best pre-trained model for your task based on benchmark results and its performance on your specific data.
- Generation settings: Tweak model parameters (like temperature) to optimize output for your needs.
- Prompt engineering: Craft effective prompts and prompt templates to guide the model towards your preferred behavior and responses.
- Improve and safeguard fine-tuning: Fine-tune a model to improve performance for your use case, while avoiding biases or undesirable behaviors.
- RAG optimization: Select the most effective Retrieval Augmented Generation (RAG) architecture to enhance performance for your application.
- Migration: Continuously assess and improve the performance of your AI solution by migrating to newer models when they provide a clear advantage for your specific use case.
In our case, we will use it to evaluate different prompt templates to achieve the most coherent summaries using Llama 3.1 8B Instruct.
We are going to use a reference-free Pointwise metric based on the G-Eval Coherence metric.
The first step is to define our prompt template and create our PointwiseMetric. Vertex AI returns the model's output in the response
field, and our news article will be made available in the text
field.
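A sketch of how this can look; depending on your SDK version the import may live under vertexai.preview.evaluation instead, and the judge prompt below is an abridged G-Eval-style coherence rubric rather than the verbatim original.

```python
from vertexai.evaluation import PointwiseMetric

G_EVAL_COHERENCE_PROMPT = """
You will be given one summary written for a news article.
Your task is to rate the summary on one metric: Coherence (1-5), the collective
quality of all sentences. The summary should be well-structured and well-organized,
building from sentence to sentence into a coherent body of information.

Evaluation steps:
1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and check whether it presents them in a clear, logical order.
3. Assign a coherence score from 1 (lowest) to 5 (highest).

Article:
{text}

Summary:
{response}

Respond only with the score.
"""

g_eval_coherence = PointwiseMetric(
    metric="g-eval-coherence",
    metric_prompt_template=G_EVAL_COHERENCE_PROMPT,
)
```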
We are going to use the argilla/news-summary dataset, which consists of news articles from Reuters. To keep the evaluation fast, we will use a random subset of 15 articles. Feel free to change the dataset and the number of articles to evaluate the model with more data and different topics.
Before we can run the evaluation, we need to convert our dataset into a pandas dataframe.
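A sketch of both steps; the split name and the text column are assumptions about the schema of argilla/news-summary, so verify them against the dataset card.

```python
import pandas as pd
from datasets import load_dataset

# Load a small random subset to keep the evaluation fast.
subset = load_dataset("argilla/news-summary", split="train").shuffle(seed=42).select(range(15))

# The evaluation dataset only needs the article text; model responses are added later.
eval_df = pd.DataFrame({"text": subset["text"]})
```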
Awesome! We are almost ready. Last step is to define our different summarization prompts we want to use for evaluation.
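For illustration, here are three hypothetical prompt templates of increasing specificity; the {text} placeholder is filled with the article before sending the request.

```python
summarization_prompts = {
    "simple": "Summarize the following news article:\n{text}",
    "short": "Summarize the following news article in 2-3 sentences:\n{text}",
    "detailed": (
        "Summarize the following news article. Cover who, what, when, where and why, "
        "and keep the summary under 100 words:\n{text}"
    ),
}
```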
Now we can iterate over our prompts and create different evaluation tasks, use our coherence metric to evaluate the summaries and collect the results.
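A sketch of that loop, assuming the EvalTask API from vertexai.evaluation; the experiment and run names are hypothetical, and each run reuses the generate helper and the coherence metric defined above.

```python
import uuid

from vertexai.evaluation import EvalTask

results = {}
for prompt_name, prompt_template in summarization_prompts.items():
    # Generate one summary per article with the deployed endpoint.
    task_df = eval_df.copy()
    task_df["response"] = [
        generate(prompt_template.format(text=article)) for article in task_df["text"]
    ]

    # One evaluation run per prompt, grouped under a single experiment so the
    # runs can be compared side by side in the Vertex AI console.
    eval_task = EvalTask(
        dataset=task_df,
        metrics=[g_eval_coherence],
        experiment="llama-3-1-8b-instruct-coherence",  # hypothetical experiment name
    )
    results[prompt_name] = eval_task.evaluate(
        experiment_run_name=f"{prompt_name}-{uuid.uuid4().hex[:6]}"
    )
```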
Nice, it looks like in our limited test the "simple" prompt yields the best results. We can inspect and compare the results in the GCP Console at Vertex AI > Model Development > Experiments.
The overview allows us to compare the results across experiments and to inspect the individual evaluations. Here we can see that the standard deviation of the "detailed" prompt is quite high. This could be due to the low sample size, or it could mean the prompt needs further improvement.
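If you prefer to compare the runs programmatically as well, the returned evaluation results expose aggregate scores (the attribute name summary_metrics is assumed here):

```python
for prompt_name, result in results.items():
    # Aggregate scores such as the mean and standard deviation of the
    # coherence metric across the 15 articles.
    print(prompt_name, result.summary_metrics)
```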
You can find more examples of how to use the Gen AI Evaluation Service in the Vertex AI Generative AI documentation, including how to:
- customize the LLM used as a Judge
- use Pairwise metrics to compare different LLMs
- evaluate different prompts more efficiently
Resource clean-up
Finally, you can release the resources you've created as follows, to avoid unnecessary costs:
- deployed_model.undeploy_all to undeploy the model from all the endpoints.
- deployed_model.delete to gracefully delete the endpoint(s) where the model was deployed, after the undeploy_all method.
- model.delete to delete the model from the registry.
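Put together, the clean-up is three calls, assuming deployed_model is the endpoint returned by deploy and model is the registered model:

```python
# Undeploy the model from the endpoint, delete the endpoint, and finally
# delete the registered model to avoid further charges.
deployed_model.undeploy_all()
deployed_model.delete()
model.delete()
```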