Towards Autonomous AI Agents: A Framework for Evaluating Long-Horizon Task Completion

Yao Fu (Edinburgh), Hao Peng (Edinburgh)

Abstract

We present a framework for evaluating AI agents on long-horizon tasks that require sustained reasoning, planning, and execution over extended periods. We identify key failure modes and propose metrics for measuring autonomous task completion.

Eigenvector Insight: Zone III / PASF-PADE Analysis

(Not part of the original paper)

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

This paper directly addresses the Zone III evaluation problem. The failure mode taxonomy — context overflow, goal drift, tool failure cascades, and semantic drift — maps precisely onto what Eigenvector has observed in enterprise deployments.
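To make the four failure modes named above concrete, here is a minimal Python sketch of how episodes could be labeled and tallied. It is not the paper's implementation: `FailureMode`, `EpisodeResult`, `summarize`, the field names, and the sample data are all hypothetical, and the simple completion rate shown is an assumed stand-in for whatever completion metrics the paper actually defines.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class FailureMode(Enum):
    """The four failure modes listed in the summary (labels only)."""
    CONTEXT_OVERFLOW = "context overflow"
    GOAL_DRIFT = "goal drift"
    TOOL_FAILURE_CASCADE = "tool failure cascade"
    SEMANTIC_DRIFT = "semantic drift"


@dataclass
class EpisodeResult:
    """Hypothetical record of one long-horizon task attempt."""
    task_id: str
    steps_taken: int
    failure: Optional[FailureMode] = None  # None means the task completed


def summarize(results: List[EpisodeResult]) -> Dict[str, object]:
    """Return an assumed completion rate plus a per-mode failure count."""
    total = len(results)
    completed = sum(1 for r in results if r.failure is None)
    counts = Counter(r.failure for r in results if r.failure is not None)
    return {
        "completion_rate": completed / total if total else 0.0,
        "failures": {mode.value: counts.get(mode, 0) for mode in FailureMode},
    }


# Illustrative data only; not taken from the paper.
episodes = [
    EpisodeResult("t1", steps_taken=40),
    EpisodeResult("t2", steps_taken=120, failure=FailureMode.GOAL_DRIFT),
    EpisodeResult("t3", steps_taken=75),
    EpisodeResult("t4", steps_taken=210, failure=FailureMode.TOOL_FAILURE_CASCADE),
]
summary = summarize(episodes)
print(summary["completion_rate"])
print(summary["failures"])
```

Keeping the failure label on each episode record, rather than only an aggregate count, lets the same data be regrouped later, for example by step count or task type.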

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows — the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. This paper addresses one or more of the structural gaps that make Zone III deployments unsafe without explicit architectural intervention.

Key Contributions

→ Long-horizon task evaluation framework
→ Failure mode taxonomy
→ Autonomous completion metrics

Topics

long-horizon evaluation, autonomous agents, task completion, failure modes

Relevance Scores

Long-Horizon Score: 96

Enterprise Score: 88

Completeness: 86

Paper Info

Year: 2024

Venue: arXiv

Type: evaluation

Chapter: Ch. 6

Authors: 2

Zone III Analysis

Frameworks

PASF AEGIS