AgentBench: Evaluating LLMs as Agents

Xiao Liu (Tsinghua), Hao Yu (Tsinghua), Hanchen Zhang (Tsinghua)

Abstract

We present AgentBench, a multi-dimensional, evolving benchmark for evaluating LLMs as agents across 8 distinct environments, including operating systems, databases, knowledge graphs, digital games, and web browsing.

Key Contributions

→ 8-environment agent benchmark
→ Standardized evaluation protocol
→ Cross-model comparison framework

Topics

agent evaluation, benchmarks, multi-environment, LLM agents

Relevance Scores

Long-Horizon Score: 86

Enterprise Score: 80

Completeness: 82

Paper Info

Year: 2023

Venue: ICLR 2024

Type: benchmark

Chapter: Ch. 1

Authors: 3

Frameworks

PASF AEGIS