AgentBench: Evaluating LLMs as Agents

Xiao Liu (Tsinghua), Hao Yu (Tsinghua), Hanchen Zhang (Tsinghua)

Abstract

We present AgentBench, a multi-dimensional evolving benchmark for evaluating LLMs as agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital games, and web browsing.

Eigenvector Insight: Zone III / PASF-PADE Analysis

Not part of the original paper

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

AgentBench provides the first systematic comparison of LLMs as agents across realistic environments. The database and OS environments are directly relevant to enterprise deployments, making this a key reference for Zone III capability assessment.

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. AgentBench addresses the evaluation side of this gap: it measures how LLMs behave as agents in interactive environments, which is a prerequisite for judging whether a model is fit for such deployments.
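The cascading-failure point can be made concrete with a small calculation. The sketch below assumes every step succeeds independently with the same probability, which is a simplification introduced here for illustration and not a claim from the paper; the function name is hypothetical.

```python
def workflow_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that all steps succeed, ASSUMING independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Under independence, end-to-end success is the product of per-step successes.
    return per_step_success ** steps


if __name__ == "__main__":
    for steps in (10, 100, 300):
        p = workflow_success_probability(0.99, steps)
        print(f"{steps:>3} steps at 99% per-step reliability: {p:.1%} end-to-end")
```

Under this assumption, a step that is 99% reliable yields roughly a 37% chance of completing a 100-step workflow without any failure, which is why long-horizon behavior needs to be measured directly.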

Key Contributions

→ 8-environment agent benchmark
→ Standardized evaluation protocol
→ Cross-model comparison framework
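The three contributions above can be pictured as one loop: every model is run on the same tasks in every environment and scored the same way. The following is a minimal sketch of that idea, not AgentBench's actual API. The names `Task`, `evaluate`, and `compare` are hypothetical, the tasks are toy single-turn examples, and exact-match success rate is an assumed stand-in for the benchmark's own interactive tasks and metrics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent type: maps a task prompt to an answer string.
Agent = Callable[[str], str]


@dataclass
class Task:
    prompt: str
    expected: str


def evaluate(agent: Agent, environments: Dict[str, List[Task]]) -> Dict[str, float]:
    """Return the exact-match success rate of one agent per environment."""
    scores: Dict[str, float] = {}
    for env_name, tasks in environments.items():
        if not tasks:
            continue  # skip empty environments rather than divide by zero
        solved = sum(1 for t in tasks if agent(t.prompt).strip() == t.expected)
        scores[env_name] = solved / len(tasks)
    return scores


def compare(agents: Dict[str, Agent],
            environments: Dict[str, List[Task]]) -> Dict[str, Dict[str, float]]:
    """Apply the same protocol to every agent so the scores are comparable."""
    return {name: evaluate(agent, environments) for name, agent in agents.items()}


# Toy data, invented for illustration only.
ENVIRONMENTS = {
    "os": [Task("echo hi", "hi")],
    "db": [Task("SELECT 1", "1"), Task("SELECT 2", "2")],
}
AGENTS = {
    "last_word": lambda prompt: prompt.split()[-1],  # returns the last token
    "constant": lambda prompt: "1",                  # always answers "1"
}

if __name__ == "__main__":
    for agent_name, scores in compare(AGENTS, ENVIRONMENTS).items():
        print(agent_name, scores)
```

Keeping the task set and the scoring function fixed across agents is what makes the per-environment numbers comparable between models.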

Topics

agent evaluation, benchmarks, multi-environment, LLM agents

Relevance Scores

Long-Horizon Score: 86

Enterprise Score: 80

Completeness: 82

Paper Info

Year: 2023

Venue: ICLR 2024

Type: benchmark

Chapter: Ch. 1

Authors: 3

Zone III Analysis

Frameworks

PASF, AEGIS