Evaluating Language-Model Agents on Realistic Autonomous Tasks

Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du

Abstract

We evaluate language model agents on 12 realistic autonomous tasks requiring multi-step reasoning and real-world tool use, finding that current models succeed on only a small fraction of tasks.

Eigenvector Warning: Zone III / PASF-PADE Analysis

Not part of the original paper

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

This evaluation is a sobering reality check for Zone III advocates. Current models succeed on only a small fraction of realistic autonomous tasks. The failure modes are instructive: agents lose track of the goal, make irreversible mistakes, and fail to recognize when they are stuck. These are not edge cases — they are the norm. Zone III architecture must treat failure as the default state and success as the exception to be engineered.

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows — the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. This paper addresses one or more of the structural gaps that make Zone III deployments unsafe without explicit architectural intervention.

Topics

autonomous tasks, evaluation, real-world, multi-step reasoning

Relevance Scores

Long-Horizon Score: 87

Enterprise Score: 83

Completeness: 84

Paper Info

Year: 2023

Venue

Type

Chapter: Ch. 7

Authors: 3

Zone III Analysis

Frameworks

AEGIS, PADE