Understanding and Bridging the Planner-Coder Gap: A Systematic Study on the Robustness of Multi-Agent Systems for Code Generation

Zongyi Lyu, Songqiang Chen, Zhenlan Ji

Abstract

Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks. Despite their prosperous development, the fundamental mechanisms underlying their robustness remain poorly understood, raising critical concerns for real-world deployment. This paper conducts a systematic empirical study to uncover the internal robustness flaws of MASs using a mutation-based methodology. Through comprehensive failure analysis, we discover a fundamental cause underlying these robustness issues: the planner-coder gap, which accounts for 75.3% of failures. Based on this formulated information transformation process, we propose a repairing method that mitigates information loss through multi-prompt generation and introduces a monitor agent to bridge the planner-coder gap. Evaluation shows that our repairing method effectively enhances the robustness of MASs by solving 40.0%--88.9% of identified failures. Our work uncovers critical robustness flaws in MASs and provides effective mitigation strategies, contributing essential insights for developing more reliable MASs for code generation.

Eigenvector Insight — Zone III / PASF-PADE AnalysisNot part of the original paper

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

This paper directly addresses one of the core structural challenges in Zone III deployments. The research on Multi-agent systems, Code generation, Robustness provides evidence-based foundations that enterprise architects cannot ignore when designing long-horizon autonomous workflows. The findings challenge the assumption that a base language model — however capable — can handle the complexity of durable, governed, multi-step execution without explicit architectural intervention. For Zone III practitioners, this paper belongs in the required reading list.

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows — the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. This paper addresses one or more of the structural gaps that make Zone III deployments unsafe without explicit architectural intervention.

Topics

Multi-agent systemsCode generationRobustnessPlanner-coder gapSoftware engineering

Relevance Scores

Long-Horizon Score85

Enterprise Score80

Completeness75

Paper Info

Year2026

Venue

Type

ChapterCh. 3

Authors3

Zone III Analysis

Frameworks

PADE OCG