evaluation · Chapter 6 · arXiv · 2024
Towards Autonomous AI Agents: A Framework for Evaluating Long-Horizon Task Completion
Yao Fu (Edinburgh), Hao Peng (Edinburgh)
Abstract
We present a framework for evaluating AI agents on long-horizon tasks that require sustained reasoning, planning, and execution over extended periods. We identify key failure modes and propose metrics for measuring autonomous task completion.
Key Contributions
- Long-horizon task evaluation framework
- Failure mode taxonomy
- Autonomous completion metrics
Topics
long-horizon evaluation, autonomous agents, task completion, failure modes
Relevance Scores
Long-Horizon Score: 96
Enterprise Score: 88
Completeness: 86
Paper Info
Year: 2024
Venue: arXiv
Type: evaluation
Chapter: Ch. 6
Authors: 2
Zone III Analysis
Related Papers
ReAct: Synergizing Reasoning and Acting in Language Mod…
2023 · Ch.1
Reflexion: Language Agents with Verbal Reinforcement Le…
2023 · Ch.1
Tree of Thoughts: Deliberate Problem Solving with Large…
2023 · Ch.1
Toolformer: Language Models Can Teach Themselves to Use…
2023 · Ch.1