benchmark · Chapter 6 · arXiv · 2024
OdysseyBench: Evaluating Long-Horizon Reasoning in Language Agents
Research Team (Stanford)
Abstract
We introduce OdysseyBench, a benchmark specifically designed to evaluate language agents on tasks requiring hundreds of sequential reasoning steps. OdysseyBench reveals non-linear performance degradation with task length.
Key Contributions
- Long-horizon reasoning benchmark
- Hundreds-of-steps evaluation
- Non-linear degradation analysis
Topics
long-horizon evaluation · benchmarks · reasoning · multi-step
Relevance Scores
Long-Horizon Score: 97
Enterprise Score: 86
Completeness: 84
Paper Info
Year: 2024
Venue: arXiv
Type: benchmark
Chapter: Ch. 6
Authors: 1
Zone III Analysis
Related Papers
ReAct: Synergizing Reasoning and Acting in Language Mod…
2023 · Ch.1
Reflexion: Language Agents with Verbal Reinforcement Le…
2023 · Ch.1
Tree of Thoughts: Deliberate Problem Solving with Large…
2023 · Ch.1
Toolformer: Language Models Can Teach Themselves to Use…
2023 · Ch.1