benchmark · Chapter 1 · ICLR 2024 · 2023

AgentBench: Evaluating LLMs as Agents

Xiao Liu (Tsinghua), Hao Yu (Tsinghua), Hanchen Zhang (Tsinghua)

Abstract

We present AgentBench, a multi-dimensional evolving benchmark for evaluating LLMs as agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital games, and web browsing.

Key Contributions

  • 8-environment agent benchmark
  • Standardized evaluation protocol
  • Cross-model comparison framework
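To make the idea of a standardized protocol with cross-model comparison concrete, here is a minimal sketch in Python. It is not AgentBench's actual code or API: `ToyEnvironment`, `evaluate`, and the two toy agents are hypothetical names invented for illustration, and the binary success score is an assumption. The sketch only shows the general shape of running every model through the same multi-turn loop in every environment and collecting a per-model, per-environment score table.

```python
from typing import Callable, Dict, List

# Hypothetical type: an agent maps an observation string to an action string.
Agent = Callable[[str], str]


class ToyEnvironment:
    """Stand-in environment (not from AgentBench): the agent succeeds
    if it replies with the target string within max_turns turns."""

    def __init__(self, name: str, target: str, max_turns: int = 3):
        self.name = name
        self.target = target
        self.max_turns = max_turns

    def run(self, agent: Agent) -> float:
        # Same multi-turn loop for every agent: observe, act, get feedback.
        observation = f"[{self.name}] say '{self.target}'"
        for _ in range(self.max_turns):
            action = agent(observation)
            if action == self.target:
                return 1.0  # assumed binary success score
            observation = f"[{self.name}] wrong answer, try again"
        return 0.0


def evaluate(agents: Dict[str, Agent],
             environments: List[ToyEnvironment]) -> Dict[str, Dict[str, float]]:
    """Run every agent in every environment and return a score table."""
    return {
        model: {env.name: env.run(agent) for env in environments}
        for model, agent in agents.items()
    }


# Two toy "models": one always answers "ls", one reads the quoted target.
agents = {
    "echo-ls": lambda obs: "ls",
    "parser": lambda obs: obs.split("'")[1] if "'" in obs else "",
}
environments = [ToyEnvironment("os", "ls"), ToyEnvironment("db", "SELECT 1")]
scores = evaluate(agents, environments)
```

Because every agent goes through the identical loop and scoring rule, the resulting table can be compared row by row across models. A real benchmark would replace the toy environments with task-specific ones and the lambdas with LLM calls.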

Topics

agent evaluation, benchmarks, multi-environment, LLM agents