Fine-tune classifier with ModernBERT in 2025
Large Language Models (LLMs) have become ubiquitous in 2024. However, smaller, specialized models - particularly for classification tasks - remain critical for building efficient and cost-effective AI systems. One key use case is routing user prompts to the most appropriate LLM or selecting optimal few-shot examples, where fast, accurate classification is essential.
This blog post demonstrates how to fine-tune ModernBERT, a new state-of-the-art encoder model, for classifying user prompts to implement an intelligent LLM router. ModernBERT is a refreshed version of BERT models, with 8192 token context length, significantly better downstream performance, and much faster processing speeds.
You will learn how to:
- Set up the environment and install libraries
- Load and prepare the classification dataset
- Fine-tune & evaluate ModernBERT with the Hugging Face `Trainer`
- Run inference & test model
Quick intro: ModernBERT
ModernBERT is a modernization of BERT maintaining full backward compatibility while delivering dramatic improvements through architectural innovations like rotary positional embeddings (RoPE), alternating attention patterns, and hardware-optimized design. The model comes in two sizes:
- ModernBERT Base (139M parameters)
- ModernBERT Large (395M parameters)
ModernBERT achieves state-of-the-art performance across classification, retrieval and code understanding tasks while being 2-4x faster than previous encoder models. This makes it ideal for high-throughput production applications like LLM routing, where both accuracy and latency are critical.
ModernBERT was trained on 2 trillion tokens of diverse data including web documents, code, and scientific articles - making it much more robust than traditional BERT models trained primarily on Wikipedia. This broader knowledge helps it better understand the nuances of user prompts across different domains.
If you want to learn more about ModernBERT's architecture and training process, check out the official blog.
Now let's get started building our LLM router with ModernBERT!
Note: This tutorial was created and tested on an NVIDIA L4 GPU with 24GB of VRAM.
1. Set up the environment and install libraries
Our first step is to install PyTorch and the Hugging Face libraries, including transformers and datasets.
We will use the Hugging Face Hub as a remote model versioning service. This means we will automatically push our model, logs, and information to the Hub during training. You must register on Hugging Face for this. After you have an account, we will use the `login` util from the `huggingface_hub` package to log into our account and store our token (access key) on disk.
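Here is a minimal sketch of that setup, assuming you run it in a notebook or script; the pip command in the comment is illustrative, and note that ModernBERT requires a recent transformers release.

```python
# Install the required libraries first, e.g.:
#   pip install torch transformers datasets evaluate scikit-learn accelerate tensorboard huggingface_hub
from huggingface_hub import login

# Log in to the Hugging Face Hub and store the token (access key) on disk
# so later git/Hub operations (like pushing checkpoints) work without prompts.
login(token="hf_...", add_to_git_credential=True)  # replace with your own token
```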
2. Load and prepare the classification dataset
In our example we want to fine-tune ModernBERT to act as a router for user prompts. Therefore we need a classification dataset consisting of user prompts and their "difficulty" score. We are going to use the `DevQuasar/llm_router_dataset-synth` dataset, which is a synthetic dataset of ~15,000 user prompts with a difficulty label of "large_llm" (`1`) or "small_llm" (`0`).
We will use the `load_dataset()` method from the 🤗 Datasets library to load the `DevQuasar/llm_router_dataset-synth` dataset.
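A sketch of the loading step, assuming the dataset ships with ready-made train and test splits (as the sizes below suggest):

```python
from datasets import load_dataset

dataset_id = "DevQuasar/llm_router_dataset-synth"

# Load the dataset from the Hugging Face Hub; it comes with train and test splits.
raw_dataset = load_dataset(dataset_id)

print(f"Train dataset size: {len(raw_dataset['train'])}")
print(f"Test dataset size: {len(raw_dataset['test'])}")
```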
Train dataset size: 10003
Test dataset size: 3080
Let's check out an example of the dataset.
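Something like the following prints a single record; the field names in the comment are assumptions about the dataset schema.

```python
# Inspect a single training example to see what the model will be trained on.
print(raw_dataset["train"][0])
# e.g. {'prompt': 'Write a haiku about spring.', 'label': 0}  (illustrative)
```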
To train our model, we need to convert our text prompts to token IDs. This is done by a Tokenizer, which tokenizes the inputs, including converting the tokens to their corresponding IDs in the pre-trained vocabulary. If you want to learn more about this, check out chapter 6 of the Hugging Face Course.
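A sketch of the tokenization step, assuming the prompt text lives in a `prompt` column (adjust to the actual column name); padding is deferred to the data collator at training time.

```python
from transformers import AutoTokenizer

model_id = "answerdotai/ModernBERT-base"

# Load the tokenizer that matches the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the prompts; the column name "prompt" is an assumption about the schema.
def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True)

# Batched map over both splits; the raw text column is dropped after tokenization.
tokenized_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["prompt"])
```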
3. Fine-tune & evaluate ModernBERT with the Hugging Face Trainer
After we have processed our dataset, we can start training our model. We will use the answerdotai/ModernBERT-base model. The first step is to load our model with the `AutoModelForSequenceClassification` class from the Hugging Face Hub. This will initialize the pre-trained ModernBERT weights with a classification head on top. Here we pass the number of classes (2) from our dataset and the label names to have readable outputs for inference.
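A sketch of the model loading step; the label names mirror the two classes of our dataset.

```python
from transformers import AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"

# Readable label names so inference returns "small_llm"/"large_llm" instead of LABEL_0/1.
id2label = {0: "small_llm", 1: "large_llm"}
label2id = {label: idx for idx, label in id2label.items()}

# Pre-trained ModernBERT weights plus a freshly initialized 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
```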
We evaluate our model during training. The `Trainer` supports this by accepting a `compute_metrics` method. We use the `evaluate` library to calculate the F1 metric on our test split.
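A minimal `compute_metrics` sketch built on the `evaluate` library; the `weighted` averaging choice is an assumption.

```python
import numpy as np
import evaluate

# F1 metric from the evaluate library, computed on the test split during training.
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    # The Trainer hands us raw logits and the reference labels.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return f1_metric.compute(predictions=predictions, references=labels, average="weighted")
```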
The last step is to define the hyperparameters (`TrainingArguments`) we use for our training. Here we leverage recently introduced features for fast training times, such as the `torch_compile` option in the `TrainingArguments`.
We also leverage the Hugging Face Hub integration of the `Trainer` to push our checkpoints, logs, and metrics during training into a repository.
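A sketch of the training arguments; the hyperparameter values are illustrative rather than the exact ones behind the numbers reported below.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="modernbert-llm-router",  # also used as the Hub repository name
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    bf16=True,                           # mixed precision (needs a supported GPU)
    optim="adamw_torch_fused",           # fused optimizer for faster steps
    torch_compile=True,                  # compile the model for faster training
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=100,
    report_to="tensorboard",             # requires tensorboard to be installed
    # Hub integration: push checkpoints, logs, and metrics during training.
    push_to_hub=True,
    hub_strategy="every_save",
)
```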
We can start our training by using the `train` method of the `Trainer`.
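Putting it together, a sketch of creating the `Trainer` and kicking off training; `processing_class` is the current name for what older transformers releases called `tokenizer`.

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
    processing_class=tokenizer,  # default data collator then pads batches dynamically
)

# Start training; checkpoints and metrics are pushed to the Hub as configured above.
trainer.train()
```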
Fine-tuning `answerdotai/ModernBERT-base` on ~15,000 synthetic prompts for 5 epochs took 321 seconds, and our best model achieved an F1 score of 0.993. I also ran the training with `bert-base-uncased` to compare the training time and performance. The original BERT achieved an F1 score of 0.99 and took 1048 seconds to train.
Note: ModernBERT and BERT achieve almost the same performance here. This indicates that the dataset is not challenging and could probably be solved with a logistic regression classifier. I ran the same code on the banking77 dataset, a dataset of ~13,000 customer service queries with 77 classes. There, ModernBERT outperformed the original BERT by 3% (F1 score of 0.93 vs. 0.90).
Let's save our final best model and tokenizer to the Hugging Face Hub and create a model card.
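A sketch of that final push using the `Trainer` helpers:

```python
# Save the best checkpoint locally (load_best_model_at_end=True keeps the best one),
# then push the model, tokenizer, and an auto-generated model card to the Hub.
trainer.save_model()
trainer.push_to_hub(commit_message="Add fine-tuned ModernBERT LLM router")
```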
4. Run inference & test model
To wrap up this tutorial, we will run inference on a few examples and test our model. We will use the `pipeline` method from the `transformers` library to run inference on our model.
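A sketch of the inference step; the repository id is a placeholder for wherever you pushed your model, and the printed output is illustrative.

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hub; replace the repo id with your own.
classifier = pipeline(
    "text-classification",
    model="<your-username>/modernbert-llm-router",
    device=0,  # first GPU; drop this argument for CPU inference
)

sample = "How can I calculate the eigenvalues of a 3x3 matrix by hand?"
print(classifier(sample))
# e.g. [{'label': 'large_llm', 'score': 0.98}]  (illustrative output)
```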
Conclusion
In this tutorial, we learned how to fine-tune ModernBERT for an LLM routing classification task. We demonstrated how to leverage the Hugging Face ecosystem to efficiently train and deploy a specialized classifier that can intelligently route user prompts to the most appropriate LLM.
Using modern training optimizations like flash attention, fused optimizers, and mixed precision, we were able to train our model efficiently. Comparing ModernBERT with the original BERT, we reduced training time by approximately 3x (321s vs. 1048s) on our dataset and outperformed the original BERT by 3% on a more challenging dataset. More importantly, ModernBERT was trained on 2 trillion tokens, which are more diverse and up to date than the Wikipedia-based training data of the original BERT.
This example showcases how smaller, specialized models remain valuable in the age of large language models - particularly for high-throughput, latency-sensitive tasks like LLM routing. By using ModernBERT's improved architecture and broader training data, we can build more robust and efficient classification systems.