benchmark · Chapter 1 · ICLR 2024 · 2023

AgentBench: Evaluating LLMs as Agents

Xiao Liu (Tsinghua), Hao Yu (Tsinghua), Hanchen Zhang (Tsinghua)

Abstract

We present AgentBench, a multi-dimensional evolving benchmark for evaluating LLMs as agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital games, and web browsing.

Eigenvector Insight: Zone III / PASF-PADE Analysis (not part of the original paper)
Eigenvector Research — Marco van Hurne
How this paper contributes to solving the Zone III problem (PASF-PADE)

AgentBench provides the first systematic comparison of LLMs as agents across realistic environments. The database and OS environments are directly relevant to enterprise deployments, making this a key reference for Zone III capability assessment.

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. AgentBench addresses one of the resulting gaps: it offers a standardized way to measure how well a model performs as an agent across multiple environments, which Zone III capability assessment depends on.

Key Contributions

  • 8-environment agent benchmark
  • Standardized evaluation protocol
  • Cross-model comparison framework
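
The contributions above can be illustrated with a minimal sketch of a cross-model comparison: every agent runs on the same tasks in every environment and is scored with the same metric. This is not AgentBench's actual code or API. The names `Task`, `success_rate`, and `compare`, the toy agents and tasks, and the exact-match success metric are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type for illustration: an agent maps a task prompt to an answer.
Agent = Callable[[str], str]


@dataclass
class Task:
    prompt: str
    expected: str


def success_rate(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks whose answer exactly matches the expected string."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if agent(t.prompt).strip() == t.expected)
    return passed / len(tasks)


def compare(
    agents: Dict[str, Agent], envs: Dict[str, List[Task]]
) -> Dict[str, Dict[str, float]]:
    """Score every agent on every environment with the same tasks and metric."""
    return {
        name: {env: success_rate(agent, tasks) for env, tasks in envs.items()}
        for name, agent in agents.items()
    }


# Toy environments, made up for this sketch (not real benchmark tasks).
envs = {
    "os": [Task("echo hi", "hi"), Task("echo bye", "bye")],
    "db": [Task("SELECT 1", "1")],
}


def echo_agent(prompt: str) -> str:
    # Toy agent: only understands "echo <text>" prompts.
    return prompt[5:] if prompt.startswith("echo ") else ""


def constant_agent(prompt: str) -> str:
    # Toy agent: always answers "1".
    return "1"


results = compare({"echo": echo_agent, "constant": constant_agent}, envs)
for model, scores in results.items():
    print(model, scores)
```

Holding the task set and the metric fixed across agents is what makes the per-environment scores comparable. A real harness would replace the toy callables with model-backed agents and multi-step, interactive environments.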

Topics

agent evaluation, benchmarks, multi-environment, LLM agents