benchmark · Chapter 1 · ICLR 2024 · 2023
AgentBench: Evaluating LLMs as Agents
Xiao Liu (Tsinghua), Hao Yu (Tsinghua), Hanchen Zhang (Tsinghua)
Abstract
We present AgentBench, a multi-dimensional, evolving benchmark for evaluating LLMs as agents across 8 distinct environments, including operating systems, databases, knowledge graphs, digital games, and web browsing.
Key Contributions
- 8-environment agent benchmark
- Standardized evaluation protocol
- Cross-model comparison framework
Topics
agent evaluation, benchmarks, multi-environment, LLM agents
Relevance Scores
Long-Horizon Score: 86
Enterprise Score: 80
Completeness: 82
Paper Info
Year: 2023
Venue: ICLR 2024
Type: benchmark
Chapter: Ch. 1
Authors: 3
Related Papers
ReAct: Synergizing Reasoning and Acting in Language Mod…
2023 · Ch.1
Reflexion: Language Agents with Verbal Reinforcement Le…
2023 · Ch.1
Tree of Thoughts: Deliberate Problem Solving with Large…
2023 · Ch.1
Toolformer: Language Models Can Teach Themselves to Use…
2023 · Ch.1