OdysseyBench: Evaluating Long-Horizon Reasoning in Language Agents

Research Team (Stanford)

Abstract

We introduce OdysseyBench, a benchmark specifically designed to evaluate language agents on tasks requiring hundreds of sequential reasoning steps. OdysseyBench reveals non-linear performance degradation with task length.

Eigenvector Insight: Zone III / PASF-PADE Analysis

Not part of the original paper

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

OdysseyBench is the most relevant benchmark for Zone III evaluation. Its central finding, that performance degrades non-linearly rather than linearly with task length, is critical for enterprise planning: performance on 10-step tasks cannot be extrapolated to 100-step tasks.
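One way to see why short-task results do not carry over is a toy compounding-error model. The sketch below is illustrative only and is not taken from the paper: it assumes each step succeeds independently with a fixed probability, and the function name `chain_success` and the example per-step rates are hypothetical.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` steps succeed.

    Assumption (not from the paper): steps are independent and share
    the same success probability, so the chance of a clean run is p ** n.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Two hypothetical agents that look similar on short tasks.
    for p in (0.99, 0.999):
        short = chain_success(p, 10)
        long = chain_success(p, 100)
        print(f"per-step {p}: 10 steps -> {short:.3f}, 100 steps -> {long:.3f}")
```

Under these assumptions, an agent that completes roughly 90% of 10-step tasks completes well under half of 100-step tasks, so a small gap between agents on short tasks can become a large gap on long ones. Real agents need not follow this model; it only shows that the curve is not a straight line.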

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. OdysseyBench addresses the evaluation side of this gap by measuring how agents perform on tasks requiring hundreds of sequential reasoning steps, which is a precondition for judging whether a Zone III deployment is safe.

Key Contributions

→ Long-horizon reasoning benchmark
→ Hundreds-of-steps evaluation
→ Non-linear degradation analysis

Topics

long-horizon evaluation, benchmarks, reasoning, multi-step

Relevance Scores

Long-Horizon Score: 97

Enterprise Score: 86

Completeness: 84

Paper Info

Year: 2024

Venue: arXiv

Type: benchmark

Chapter: Ch. 6

Authors: 1

Zone III Analysis

Frameworks

PASF, AEGIS