AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang

Abstract

We present AgentBench, a multi-dimensional evolving benchmark to evaluate LLMs as agents in various environments including operating systems, databases, knowledge graphs, digital games, and web browsing.

Eigenvector Insight: Zone III / PASF-PADE Analysis (not part of the original paper)

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

AgentBench is the reference benchmark for enterprise agent evaluation. Before you deploy a Zone III agent, you need to know how it performs across diverse environments, not just the one you designed it for. The multi-environment design is particularly important because Zone III workflows routinely cross system boundaries (database, API, file system, web). An agent that excels in one environment but fails in another is a production liability.
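The point about uneven performance can be illustrated with a minimal sketch. It assumes each environment yields a list of pass/fail episode outcomes and that a deployment check looks at the weakest environment rather than the average. The function names, the sample data, and the threshold are hypothetical, and this is not AgentBench's own scoring method.

```python
from typing import Dict, List, Tuple


def success_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of successful episodes per environment (empty runs are skipped)."""
    return {env: sum(runs) / len(runs) for env, runs in results.items() if runs}


def weakest_environment(rates: Dict[str, float]) -> Tuple[str, float]:
    """Return the (environment, rate) pair with the lowest success rate."""
    return min(rates.items(), key=lambda kv: kv[1])


def environments_below(results: Dict[str, List[bool]], threshold: float) -> List[str]:
    """Names of environments whose success rate is below the threshold, sorted by name."""
    rates = success_rates(results)
    return [env for env, rate in sorted(rates.items()) if rate < threshold]


# Hypothetical episode outcomes, for illustration only.
sample = {
    "os": [True, True, False, True],
    "db": [True, True, True, True],
    "web": [False, False, True, False],
}

rates = success_rates(sample)
worst = weakest_environment(rates)
blocked = environments_below(sample, threshold=0.5)
```

Gating on the weakest environment rather than the mean reflects the concern above: a high average can hide one environment where the agent mostly fails.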

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. This paper addresses one of the structural gaps that make Zone III deployments unsafe without explicit architectural intervention: knowing, before deployment, how an agent performs across diverse environments.

Topics

agent benchmarks, evaluation, LLM agents, multi-environment

Relevance Scores

Long-Horizon Score: 82

Enterprise Score: 78

Completeness: 88

Paper Info

Year: 2023

Chapter: 7

Authors: 3

Zone III Analysis

Frameworks

PADE AEGIS