AI Agent Benchmark Compendium

October 15, 2025 · 15 minute read

This post provides a high-level overview of more than 50 modern benchmarks, grouped into four key categories: (1) Function Calling & Tool Use, (2) General Assistant & Reasoning, (3) Coding & Software Engineering, and (4) Computer Interaction (GUI & Web).

The compendium is also available as a separate GitHub repository. I would love to keep it up to date and extend it as new benchmarks come out, so please open PRs or issues.

Function Calling & Tool Use

BFCL (Berkeley Function Calling Leaderboard)

BFCL is a comprehensive benchmark designed to evaluate the function calling (also known as tool use) capabilities of Large Language Models (LLMs) in a wide range of real-world settings. It assesses models across various scenarios, including serial (simple), parallel, and multi-turn interactions, and evaluates agentic capabilities such as reasoning in stateful multi-step environments, memory, web search, and format sensitivity.

Links: Paper | GitHub | Leaderboard | Dataset
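
To make those scenario types concrete, here is a minimal, hypothetical sketch of what a "simple" versus a "parallel" function-calling test item could look like; the field names and the get_weather function are illustrative and not BFCL's actual data format. The model receives the user question plus the JSON schemas of the available functions and must emit the corresponding call(s).

```python
# Hypothetical test items in the spirit of BFCL's "simple" and "parallel" categories
# (illustrative only; not the benchmark's real schema).
get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Simple: one question, one expected call.
simple_item = {
    "question": "What's the weather in Berlin?",
    "functions": [get_weather_schema],
    "expected_calls": [{"name": "get_weather", "arguments": {"city": "Berlin"}}],
}

# Parallel: one question that should produce several independent calls in a single turn.
parallel_item = {
    "question": "Compare the current weather in Berlin and Tokyo.",
    "functions": [get_weather_schema],
    "expected_calls": [
        {"name": "get_weather", "arguments": {"city": "Berlin"}},
        {"name": "get_weather", "arguments": {"city": "Tokyo"}},
    ],
}
```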

ToolBench

A massive-scale benchmark designed for evaluating and facilitating large language models in mastering over 16,000 real-world RESTful APIs. It functions as an instruction-tuning dataset for tool use, which was automatically generated using ChatGPT to enhance the general tool-use capabilities of large language models.

Links: Paper | GitHub | Leaderboard | Dataset

ComplexFuncBench

A benchmark specifically designed for the evaluation of complex function calling in LLMs. It addresses challenging scenarios across five key aspects: multi-step function calls within a single turn, function calls involving user-provided constraints, parameter value reasoning, calls with long parameter values, and calls requiring a 128k long-context length.

Links: Paper | GitHub | Dataset

τ-Bench (Tau-Bench)

A conversational benchmark designed to test AI agents in dynamic, open-ended real-world scenarios. It specifically evaluates an agent's ability to interact with simulated human users and programmatic APIs while strictly adhering to domain-specific policies and maintaining consistent behavior, with domains in e-commerce and airline reservations.

Links: Paper | GitHub | Leaderboard
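
As a rough illustration of the setup, the agent sits in a loop between a simulated user and the domain's programmatic tools, and any state-changing action is checked against the domain policy. The sketch below is hypothetical (agent, simulated_user, tools, and policy_allows are placeholder objects), not τ-bench's actual harness.

```python
# Hypothetical sketch of a tau-bench-style episode; every name here is a placeholder.
def run_episode(agent, simulated_user, tools, policy_allows, max_turns=20):
    """Run one conversation; flag an immediate failure on a policy-violating tool call."""
    history = []
    user_msg = simulated_user.first_message()
    for _ in range(max_turns):
        action = agent.act(history, user_msg)  # either a tool call or a user-facing reply
        if action["type"] == "tool_call":
            # Domain policy check, e.g. "no refund before the customer's identity is verified".
            if not policy_allows(action, history):
                return {"success": False, "reason": "policy_violation", "history": history}
            observation = tools[action["name"]](**action["arguments"])
            history.append(("tool", action, observation))
        else:
            history.append(("assistant", action["text"]))
            user_msg = simulated_user.respond(action["text"])
            if simulated_user.is_done():
                break
    # The real benchmark scores success by comparing the final environment state
    # (e.g. database contents) against an annotated goal, not just by termination.
    return {"success": simulated_user.is_done(), "history": history}
```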

Composio Function Calling Benchmark

Tests the ability of LLMs to correctly call functions based on given prompts. It comprises 50 function calling problems, each designed to be solved using one of eight provided function schemas inspired by real-world API structures from ClickUp's integration endpoints.

Links: GitHub

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Evaluates an agent's ability to plan step-by-step API calls, retrieve relevant APIs, and correctly execute API calls to meet human needs based on understanding real-world API documentation. It features over 2,200 dialogues utilizing thousands of APIs.

Links: Paper | GitHub

HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

A novel benchmark designed to evaluate the function-calling capabilities of LLMs in realistic, multi-turn human-agent interactions, particularly simulating mobile assistant use cases. It tests models under challenging circumstances like imperfect instructions and shifts in user intent.

Links: Paper | GitHub | Dataset

DPAB-α

The Dria Pythonic Agent Benchmark evaluates the function-calling capabilities of LLMs. It specifically compares the performance of models using Pythonic function calling versus traditional JSON-based methods across 100 problems.

Links: Blog
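
To illustrate the distinction being compared, the same intent can be expressed as a JSON tool call or as executable Pythonic code. This is a hedged sketch; the search_restaurants and book_table tools and the exact formats are made up, not taken from DPAB-α.

```python
import json

# The same intent, "book a table for two in Berlin at 7 pm", expressed two ways.
# Both tool names below are invented for illustration.

# 1) Traditional JSON-based function calling: the model emits a structured object
#    that the harness parses and dispatches.
json_style_call = json.dumps({
    "name": "book_table",
    "arguments": {"city": "Berlin", "party_size": 2, "time": "19:00"},
})

# 2) Pythonic function calling: the model emits code that calls the tools directly,
#    which naturally allows intermediate variables and composition of several calls.
pythonic_style_call = '''
restaurants = search_restaurants(city="Berlin", party_size=2)
book_table(restaurant_id=restaurants[0]["id"], party_size=2, time="19:00")
'''
```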

NFCL (Nexus Function Calling Leaderboard)

A benchmark designed to evaluate the proficiency of LLMs in single-turn function calling tasks. It assesses various complexities, including simple, parallel, and nested function calls, where the output of one function serves as an input for another.

Links: GitHub | Leaderboard
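
A nested call here means the output of one function becomes an argument of another, in contrast to parallel calls that are independent of each other. A hypothetical example with invented function names:

```python
# Hypothetical illustration of nested vs. parallel single-turn calls
# (get_user_id and get_recent_orders are invented tool names).

# Nested: the inner call's result feeds the outer call.
# Query: "Show the last 3 orders for alice@example.com"
expected_nested_call = (
    'get_recent_orders(user_id=get_user_id(email="alice@example.com"), limit=3)'
)

# Parallel: two independent calls emitted in the same turn.
# Query: "Look up the user IDs for alice@example.com and bob@example.com"
expected_parallel_calls = [
    'get_user_id(email="alice@example.com")',
    'get_user_id(email="bob@example.com")',
]
```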

xLAM: A Family of Large Action Models for Function Calling and AI Agent Systems

A series of large action models (LAMs) developed by Salesforce AI Research, specifically optimized for function calling and AI agent tasks. These models are designed to enhance the generalizability and performance of AI agents across diverse environments.

Links: Paper | GitHub

ToolACE: A Framework for Generating High-Quality Tool-Learning Data for LLMs

An automatic agentic pipeline meticulously designed to generate accurate, complex, and diverse tool-learning data, specifically tailored to enhance the function-calling capabilities of LLMs.

Links: Paper

LiveMCPBench

A comprehensive benchmark designed to evaluate the ability of LLM agents to navigate and effectively utilize a large-scale Model Context Protocol (MCP) toolset in real-world scenarios, overcoming limitations of single-server environments.

Links: Paper | GitHub | Leaderboard | Dataset
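
For context, MCP tools are exposed over JSON-RPC: an agent first lists the tools a server offers and then invokes one by name. The payloads below are a minimal sketch; the method names follow my reading of the MCP specification, and the search_flights tool and its arguments are invented.

```python
import json

# Minimal MCP-style JSON-RPC payloads (method names per the MCP spec as I understand it;
# the concrete tool and arguments are invented for illustration).
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_flights",  # hypothetical tool exposed by some MCP server
        "arguments": {"origin": "BER", "destination": "SFO", "date": "2025-10-20"},
    },
}

print(json.dumps(call_tool_request, indent=2))
```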

MCP-Universe

A comprehensive framework and benchmark for developing, testing, and evaluating AI agents and LLMs through direct interaction with real-world Model Context Protocol (MCP) servers, rather than relying on simulations, covering domains like financial analysis and browser automation.

Links: Paper | GitHub | Leaderboard


General Assistant & Reasoning

GAIA (General AI Assistants)

A landmark benchmark designed to evaluate General AI Assistants, posing real-world questions that are conceptually simple for humans but significantly challenging for most advanced AI systems. It requires AI models to demonstrate a combination of fundamental abilities, including reasoning, multi-modality handling, web browsing, and proficient tool use.

Links: Paper | Leaderboard | Dataset

AgentBench: A Comprehensive Benchmark for Evaluating LLMs as Agents

A multi-dimensional, evolving benchmark designed to thoroughly assess the reasoning and decision-making capabilities of LLMs when functioning as autonomous agents. It encompasses eight distinct environments, including Operating System, Database, and Web Shopping.

Links: Paper | GitHub | Leaderboard | Dataset

AssistantBench

A challenging benchmark designed to evaluate the ability of web agents to automatically solve realistic and time-consuming tasks. It comprises 214 tasks that require navigating the open web, spanning multiple domains, and interacting with over 525 pages from 258 different websites.

Links: Paper | GitHub | Leaderboard | Dataset

LiveBench: A Challenging, Contamination-Free Benchmark for LLMs

A challenging, contamination-free benchmark for LLMs that regularly releases new questions from recent information sources to ensure models are tested on novel problems rather than memorized answers.

Links: Paper | GitHub | Leaderboard | Dataset

Humanity's Last Exam (HLE)

A highly challenging, multi-modal benchmark with 2,500 expert-level academic questions across a broad range of disciplines, designed to test models at the absolute frontier of human knowledge and require genuine reasoning capabilities rather than simple factual recall.

Links: Paper | GitHub | Leaderboard | Dataset

FORTRESS: Frontier Risk Evaluation for National Security and Public Safety

A benchmark designed to evaluate the robustness of LLM safeguards against potential misuse relevant to national security and public safety, using expert-crafted adversarial prompts across domains like CBRNE and political violence.

Links: Paper | Leaderboard

The MASK Benchmark: Disentangling Honesty from Accuracy in AI Systems

Aims to evaluate the honesty of LLMs by disentangling it from factual accuracy. The benchmark measures whether models will knowingly contradict their established beliefs when subjected to pressure to lie.

Links: Paper | GitHub | Leaderboard

SimpleQA

A factuality benchmark designed to evaluate the ability of LLMs to answer short, fact-seeking questions that have a single, indisputable answer. It aims to measure how well models "know what they know" and to identify hallucinations, i.e. factually incorrect outputs.

Links: Paper | GitHub | Dataset
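
The grading idea, roughly: each answer is judged correct, incorrect, or not attempted (the paper uses an LLM grader for this), and keeping "incorrect" separate from "not attempted" is what lets the benchmark distinguish hallucination from honest abstention. A small aggregation sketch, with hard-coded grades standing in for real grader output:

```python
from collections import Counter

# Toy per-question grades standing in for the LLM grader's output.
grades = ["correct", "incorrect", "not_attempted", "correct", "incorrect"]

counts = Counter(grades)
n = len(grades)
attempted = counts["correct"] + counts["incorrect"]

overall_accuracy = counts["correct"] / n
# Accuracy among attempted answers: a model that abstains when unsure gives up
# overall accuracy but is not penalized here for hallucinated guesses.
accuracy_given_attempted = counts["correct"] / attempted if attempted else 0.0

print(f"accuracy={overall_accuracy:.2f}, "
      f"accuracy_given_attempted={accuracy_given_attempted:.2f}")
```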

SimpleQA Verified

A 1,000-prompt benchmark designed for evaluating the short-form factuality of LLMs, developed to address limitations in the original SimpleQA benchmark, such as noisy labels and topical biases, through a rigorous filtering process.

Links: Paper | Leaderboard | Dataset

FACTS Grounding

Evaluates LLMs' ability to generate long-form responses that are factually accurate and strictly "grounded" in provided context documents, thereby mitigating hallucination. Tasks require models to generate responses based exclusively on documents up to 32,000 tokens long.

Links: Paper | GitHub | Leaderboard | Dataset

Galileo Agent Leaderboard v2

Provides comprehensive performance metrics for LLM agents across business domains.

Links: Paper | GitHub


Coding & Software Engineering

SWE-bench: Evaluating AI in Real-World Software Engineering

A benchmark for evaluating LLMs and AI agents on their ability to resolve real-world software engineering issues. It comprises 2,294 problems sourced from GitHub issues across 12 popular Python repositories. The task is to generate a patch that resolves the issue.

Links: Paper | GitHub | Leaderboard | Dataset
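
In practice a submission is a set of predicted patches, one per issue instance, which the evaluation harness applies to the repository and checks against its tests. A hedged sketch of a prediction record is below; the field names reflect my understanding of the harness's expected format, and the diff content is invented rather than a real fix.

```python
import json

# One prediction record in the style expected by the SWE-bench evaluation harness
# (field names are an assumption; verify against the official docs). The diff is
# a placeholder, not an actual fix for the referenced instance.
prediction = {
    "instance_id": "astropy__astropy-12907",
    "model_name_or_path": "my-agent-v1",
    "model_patch": """\
diff --git a/astropy/utils/example.py b/astropy/utils/example.py
--- a/astropy/utils/example.py
+++ b/astropy/utils/example.py
@@ -1,2 +1,2 @@
 def answer():
-    return None
+    return 42
""",
}

# Predictions are typically collected as one JSON object per line.
with open("predictions.jsonl", "w") as f:
    f.write(json.dumps(prediction) + "\n")
```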

SWE-bench Verified

SWE-bench Verified is a human-validated subset of the original SWE-bench dataset, containing 500 samples that assess the capability of AI models to resolve real-world software engineering issues. To improve the reliability of evaluation, SWE-bench Verified was created in collaboration with OpenAI and involved professional software developers who screened each sample to ensure well-specified issue descriptions and appropriate unit tests.

Links: Blog | Leaderboard

SWE-Bench Pro

A benchmark for evaluating LLMs and AI agents on their ability to resolve real-world software engineering issues. It comprises 1,865 problems sourced from 41 diverse professional repositories; the task is to generate a patch that resolves the issue. A hidden test set adds 276 private tasks.

Links: Paper | GitHub | Leaderboard | Dataset

LiveCodeBench

A holistic and contamination-free benchmark for evaluating LLMs for code-related tasks. It continuously collects new problems from competitive programming platforms and assesses capabilities like self-repair, code execution, and test output prediction.

Links: Paper | GitHub | Leaderboard

SWE-PolyBench: A Multi-Language Benchmark for AI Coding Agents

A multi-language benchmark designed to evaluate AI coding agents across diverse programming tasks and languages. It contains over 2,000 curated issues from 21 real-world repositories, covering Java, JavaScript, TypeScript, and Python.

Links: Paper | GitHub | Leaderboard

Aider's "AI-Assisted Code" Benchmarks

A set of practical evaluations designed to measure how effectively LLMs can edit, refactor, and contribute to an existing codebase. These benchmarks include a code editing benchmark, a challenging refactoring benchmark, and a polyglot benchmark.

Links: GitHub | Leaderboard

Aider Polyglot Benchmark

Evaluates coding and self-correction abilities of LLMs by testing them on 225 challenging Exercism coding exercises across multiple languages, including C++, Go, Java, JavaScript, Python, and Rust.

Links: GitHub | Leaderboard


Computer Interaction (GUI & Web)

WebArena: A Realistic Web Environment for Building Autonomous Agents

A standalone, self-hostable web environment for building autonomous agents. WebArena creates websites from four popular categories with functionality and data mimicking their real-world equivalents, and introduces a benchmark of tasks that require interpreting high-level natural language commands as concrete web interactions.

Links: Paper | GitHub
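
To give a feel for the task format, an episode pairs a high-level intent with low-level browser actions that the agent emits against element IDs from the page's accessibility tree. The action grammar below is a paraphrase from memory, and the element IDs and URLs are invented, so treat this as a sketch rather than WebArena's exact specification.

```python
# Hypothetical WebArena-style episode (paraphrased action grammar, invented IDs/URLs).
task = {
    "intent": "Find the cheapest blue office chair and add it to the cart.",
    "start_url": "http://localhost:7770/",  # the environment's self-hosted shopping site
}

# Each action references an element ID from the current page's accessibility tree.
agent_trajectory = [
    "type [172] [blue office chair] [1]",  # type into the search box, then press Enter
    "click [305]",                         # sort results by price
    "click [412]",                         # open the cheapest matching product
    "click [97]",                          # click "Add to Cart"
    "stop [Added the cheapest blue office chair to the cart]",
]
```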

VisualWebArena

A benchmark designed to assess the performance of multimodal agents on realistic, visually grounded web tasks. It extends WebArena with 910 new, diverse, and complex tasks that require agents to accurately process image-text inputs and execute actions on websites.

Links: Paper | GitHub

Web Bench: A Benchmark for AI Browser Agents

A benchmark designed to evaluate the performance of AI browser agents. It differentiates agent capabilities on information retrieval (READ) tasks from state-changing (WRITE) tasks across 452 live websites, encompassing 5,750 tasks.

Links: Paper | GitHub | Dataset

WebVoyager

A foundational benchmark designed for evaluating Large Multimodal Models (LMMs) and web agents on end-to-end, real-world navigation tasks across a diverse set of popular, live websites, integrating both textual (HTML) and visual (screenshots) information.

Links: Paper | GitHub | Leaderboard

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

A simple yet challenging benchmark to measure the ability of agents to browse the web. It consists of 1,266 questions that demand persistent navigation of the internet to find hard-to-find, entangled information.

Links: Paper | GitHub

Mind2Web

A comprehensive benchmark for developing and evaluating generalist web agents. The original dataset includes over 2,000 open-ended tasks collected from 137 real-world websites, with variants for evaluating performance on live websites.

Links: Paper | GitHub | Leaderboard | Dataset

WebGames Benchmark

A comprehensive benchmark suite to evaluate general-purpose web-browsing AI agents, featuring over 50 interactive challenges crafted to be straightforward for humans but challenging for AI. It operates in a self-contained, hermetic testing environment.

Links: Paper | GitHub | Leaderboard | Dataset

ST-WebAgentBench

A benchmarking platform specifically designed to evaluate the safety and trustworthiness of autonomous web agents in realistic enterprise contexts, where policy compliance and safety are paramount.

Links: Paper | GitHub

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

A first-of-its-kind scalable, real computer environment for benchmarking multimodal agents on open-ended tasks within genuine operating systems, including Windows, macOS, and Ubuntu, featuring 369 real-world tasks.

Links: Paper | GitHub

OSUniverse

A benchmark for evaluating advanced GUI-navigation AI agents on complex, multimodal, desktop-oriented tasks. It features 160 tasks across five levels of complexity and nine categories, designed to be easy for humans but challenging for AI.

Links: Paper | GitHub

ScreenSuite Benchmark

A comprehensive suite of 13 benchmarks for evaluating Graphical User Interface (GUI) agents, focusing on the Vision Language Models (VLMs) that power them. It uses a vision-only evaluation stack without relying on accessibility trees or DOM information.

Links: GitHub

WorkArena++ Benchmark: Enhanced Evaluation for AI in Enterprise Workflows

A novel benchmark to rigorously evaluate AI agents in performing complex, realistic enterprise workflows. It expands upon the original WorkArena benchmark with 682 tasks that mimic the intricate operations of knowledge workers on the ServiceNow platform.

Links: Paper

AndroidWorld Benchmark

A dynamic benchmarking environment for autonomous agents that control mobile devices. It operates on a live Android emulator and features 116 hand-crafted tasks across 20 real-world Android applications, with millions of unique task variations.

Links: Paper | GitHub

WorldGUI

A comprehensive Graphical User Interface (GUI) benchmark designed to evaluate AI agents across ten widely used desktop and web applications (e.g., PowerPoint, VSCode). It features 315 tasks with diverse initial states to simulate authentic human-computer interactions.

Links: Paper | GitHub

macOSWorld

The first comprehensive, multilingual, and interactive benchmark to evaluate GUI agents operating within the macOS environment. It features 202 multilingual tasks across 30 applications, with instructions and interfaces in five languages.

Links: Paper | GitHub

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation

A pioneering benchmark to evaluate an LLM agent's ability to automate complex office workflows across multiple applications, such as Word, Excel, and email. It assesses long-horizon planning and proficiency in switching between applications.

Links: Paper | GitHub

EEBD (Emergence Enterprise Benchmark Dataset)

Evaluates AI agents in realistic enterprise scenarios that require them to go beyond simple browser interaction, intelligently selecting tools like APIs and combining web UI interaction with API calls.

Links: Paper | GitHub | Dataset

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

A benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. It features long, compositional tasks in a simulated 3D environment.

Links: Paper | GitHub | Leaderboard

EmbodiedBench

A comprehensive benchmark designed to evaluate Multi-modal Large Language Models (MLLMs) as embodied agents. It spans diverse tasks in navigation, manipulation, and high-level planning across four simulated environments.

Links: Paper | GitHub


Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.