100 Papers on Zone III Agentic AI
Curated research on long-horizon autonomous workflows, runtime governance, and enterprise AI architecture.
Agent Drift: Semantic Degradation in Long-Running Autonomous Systems
We characterize the phenomenon of agent drift — the gradual degradation of semantic coherence in long-running autonomous AI systems. We identify three primary drift mechanisms: context contamination, goal displacement, a…
Agentic Process Automation: Transforming Enterprise Workflows with AI Agents
We present the PASF/PADE framework for agentic process automation in enterprise environments. The framework provides a systematic approach to identifying, designing, and deploying AI agents for long-horizon enterprise wo…
The State of AI Agents 2025: Enterprise Adoption, Challenges, and Roadmap
We present a comprehensive survey of enterprise AI agent adoption in 2025, covering deployment patterns, failure rates, governance practices, and the roadmap to Zone III autonomous operations. Based on interviews with 15…
Durable Execution: Temporal and Cadence-Aware Workflow Orchestration
Durable execution is a programming model that ensures workflows survive failures, restarts, and infrastructure changes. Temporal provides durable execution primitives for building reliable long-running workflows.…
Agentification Factory: A Methodology for Systematic Enterprise Agent Development
We present the Agentification Factory methodology for systematically developing enterprise AI agents at scale. The factory model provides repeatable processes for agent design, testing, deployment, and governance.…
LangGraph: Building Stateful, Multi-Actor Applications with LLMs
LangGraph is a library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. It extends LangChain with the ability to coordinate multiple chains (or actors) across mul…
Towards Autonomous AI Agents: A Framework for Evaluating Long-Horizon Task Completion
We present a framework for evaluating AI agents on long-horizon tasks that require sustained reasoning, planning, and execution over extended periods. We identify key failure modes and propose metrics for measuring auton…
Patternomics: A Framework for Enterprise AI Pattern Recognition and Reuse
We present Patternomics, a framework for identifying, cataloguing, and reusing successful patterns in enterprise AI deployments. Patternomics provides a systematic approach to building institutional knowledge about what …
MemGPT: Towards LLMs as Operating Systems
We present MemGPT, a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window. MemGPT knows when to push critical information to a …
Agent Safety: A Framework for Governing Autonomous AI Systems
We present a comprehensive framework for governing autonomous AI systems, covering technical safety measures, organizational governance, and regulatory compliance. The framework addresses the unique challenges of long-ru…
OdysseyBench: Evaluating Long-Horizon Reasoning in Language Agents
We introduce OdysseyBench, a benchmark specifically designed to evaluate language agents on tasks requiring hundreds of sequential reasoning steps. OdysseyBench reveals non-linear performance degradation with task length…
Towards Long-Horizon Planning with LLMs: A Survey
We survey the landscape of long-horizon planning with LLMs, covering task decomposition, subgoal generation, plan verification, and execution monitoring. We identify key challenges and promising research directions.…
Towards Reliable AI Agents: A Framework for Systematic Failure Analysis
We present a systematic framework for analyzing failures in AI agent systems, covering failure mode identification, root cause analysis, and mitigation strategy development. The framework is validated on 500+ real agent …
Evaluating Human-AI Collaboration in Long-Horizon Enterprise Tasks
We study human-AI collaboration patterns in long-horizon enterprise tasks, identifying optimal handoff points, oversight mechanisms, and collaboration protocols for Zone III workflows.…
The Roundtrip Value Governance Framework for Enterprise AI
We present the Roundtrip Value Governance (RVG) framework, which ensures that every agentic workflow creates measurable value that can be traced back to the original business objective. RVG provides a closed-loop account…
Model Context Protocol (MCP): A Standard for AI-Tool Integration
We introduce the Model Context Protocol (MCP), an open standard that enables seamless integration between AI assistants and external data sources, tools, and services. MCP provides a universal protocol for AI-tool commun…
We present a method for improving tool-calling agents at inference time through a reviewer agent that evaluates tool calls before execution. The reviewer provides feedback that allows the primary agent to correct its too…
Agentic AI in Financial Services: Opportunities and Governance Challenges
The Financial Stability Board examines the use of agentic AI in financial services, identifying opportunities for efficiency gains and systemic risks from autonomous agent deployment.…
Autonomous Agents in Healthcare: Safety and Governance Requirements
We examine the safety and governance requirements for deploying autonomous AI agents in healthcare settings, covering clinical decision support, administrative automation, and patient-facing applications.…
Tokenomics of Enterprise AI: Economic Models for Agentic Workflows
We present a framework for understanding and optimizing the economic model of enterprise AI deployments, covering token costs, compute allocation, value attribution, and ROI measurement for agentic workflows.…
Towards Autonomous AI: Challenges and Opportunities for Enterprise Deployment
Gartner examines the enterprise readiness for autonomous AI deployment, identifying the key technical, organizational, and governance challenges that must be addressed before Zone III workflows can be safely deployed at …
LAGRANGE: Latent Action Grounding for Robust Agentic Navigation in Graph Environments
We present LAGRANGE, a framework for grounding agent actions in structured knowledge graphs to improve reliability and interpretability of long-horizon agent navigation tasks.…
Toward a Science of Autonomous Systems: Reliability Engineering for AI
We apply systems safety engineering principles to autonomous AI systems, arguing that AI reliability requires the same rigorous engineering discipline as safety-critical systems in aerospace and nuclear industries.…
Agentic AI Workflows: The Next Wave of Enterprise Automation
Andrew Ng describes four key agentic design patterns — reflection, tool use, planning, and multi-agent collaboration — and argues that agentic workflows represent the next major wave of AI capability improvement.…
Evidence Factory: Automated Research Synthesis for Enterprise AI
We present the Evidence Factory, a systematic approach to building and maintaining the research evidence base for enterprise AI decisions. The Evidence Factory automates research collection, synthesis, and relevance asse…
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
We present AutoGen, a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow …
Hallucination is Inevitable: An Innate Limitation of Large Language Models
We prove that hallucination is an innate limitation of LLMs, not a fixable bug. We show that any LLM that can answer all questions must hallucinate on some inputs, establishing fundamental limits on factual reliability.…
We survey methods for extending the context length of language models, covering positional encoding extensions, efficient attention mechanisms, and memory-augmented architectures.…
GAIA: A Benchmark for General AI Assistants
We introduce GAIA, a benchmark for general AI assistants that tests real-world capabilities requiring multi-step reasoning, tool use, and information synthesis. GAIA questions require an average of 5.4 steps to solve.…
Responsible AI Practices for Enterprise Deployments
Microsoft's comprehensive framework for responsible AI deployment in enterprise settings, covering fairness, reliability, privacy, security, inclusiveness, transparency, and accountability.…
OpenAgents: An Open Platform for Language Agents in the Wild
We present OpenAgents, an open platform for deploying and evaluating language agents in real-world settings. OpenAgents includes three agents: a data analysis agent, a plugin agent, and a web agent.…
EU AI Act: Implications for Enterprise AI Systems
The EU AI Act establishes a comprehensive regulatory framework for artificial intelligence in the European Union, with specific requirements for high-risk AI systems including autonomous agents in enterprise settings.…
Towards Efficient and Reliable LLM Service: A Real-World Case Study
We present a real-world case study of deploying LLM services at scale, covering reliability challenges, latency optimization, and cost management. We identify key engineering lessons for production LLM deployments.…
Scalable Oversight: Supervising AI Systems That Exceed Human Capabilities
We discuss the challenge of providing oversight to AI systems that may exceed human capabilities in some domains. We propose scalable oversight as a research agenda for maintaining meaningful human control.…
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
We introduce SWE-bench, a benchmark for evaluating LLMs on real-world software engineering tasks. SWE-bench contains 2,294 GitHub issues and pull requests from 12 popular Python repositories.…
The Economic Potential of Generative AI: The Next Productivity Frontier
McKinsey estimates that generative AI could add $2.6 to $4.4 trillion annually to the global economy. The report identifies the highest-value use cases and the organizational changes required to capture this value.…
Lost in the Middle: How Language Models Use Long Contexts
We analyze how language models use long contexts and find that performance degrades when relevant information is in the middle of the context. Models are better at using information at the beginning or end.…
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
We present MetaGPT, a meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences.…
Towards Robust Long-Horizon Task Planning for Embodied Agents
We study the challenges of long-horizon task planning for embodied agents and propose methods for improving robustness to execution failures and environmental changes.…
PEARL: Prompting Large Language Models to Plan and Execute Actions for Long-Horizon Tasks
We present PEARL, a prompting framework for long-horizon task planning and execution. PEARL decomposes tasks into action plans, executes them step by step, and self-evaluates progress.…
ReAct: Synergizing Reasoning and Acting in Language Models
We explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner. ReAct allows LLMs to interact with external tools to retrieve additional information, leading to more relia…
WebArena: A Realistic Web Environment for Building Autonomous Agents
We present WebArena, a standalone, self-hostable web environment for building autonomous agents. WebArena includes realistic web applications with functional tools, user interfaces, and data.…
Executable Code Actions Elicit Better LLM Agents
We propose CodeAct, an agent design that uses executable Python code as the action space instead of structured JSON actions. CodeAct agents can dynamically create and execute code to interact with environments.…
Auditing Large Language Models: A Three-Layer Approach
We propose a three-layer approach to auditing large language models: governance audits, model audits, and application audits. The framework provides a systematic methodology for enterprise AI accountability.…
Measuring and Reducing LLM Hallucination without Gold Standard Answers
We present methods for measuring and reducing LLM hallucinations without requiring gold standard reference answers. Our approach uses consistency checking and uncertainty estimation.…
We introduce the Needle-in-a-Haystack evaluation, which tests whether LLMs can retrieve specific information from long documents. The evaluation reveals significant performance degradation with context length.…
Gorilla: Large Language Model Connected with Massive APIs
We introduce Gorilla, a finetuned LLM that can write accurate API calls. Gorilla is trained on 1,600+ API documentation pages and outperforms GPT-4 on writing API calls.…
Towards Verifiable AI: Formal Methods for Language Model Verification
We survey the application of formal verification methods to language models and AI agents. We identify key challenges and promising approaches for providing formal guarantees about agent behavior.…
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
We present RAG, a general-purpose fine-tuning recipe that combines parametric memory with non-parametric memory for knowledge-intensive NLP tasks. RAG retrieves relevant documents and conditions generation on them.…
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
We present Self-RAG, a framework that trains LLMs to retrieve, generate, and critique their own outputs. Self-RAG adaptively retrieves passages and generates reflective tokens to improve output quality.…
Toolchain*: Efficient Action Space Navigation in Large Language Model Agents
We present Toolchain*, a planning algorithm that efficiently navigates the action space of tool-using agents. Toolchain* uses A* search to find optimal tool sequences for complex tasks.…
Faithful Reasoning Using Large Language Models
We present a method for faithful reasoning with LLMs that produces verifiable reasoning chains. The approach separates reasoning into selection and inference steps, enabling verification of each step.…
AgentBench: Evaluating LLMs as Agents
We present AgentBench, a multi-dimensional evolving benchmark for evaluating LLMs as agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital games, and web browsing.…
Agents: An Open-source Framework for Autonomous Language Agents
We present Agents, an open-source library for building autonomous language agents. Agents supports long-short term memory, tool use, web navigation, multi-agent communication, and human-agent interaction.…
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
We present LATS, a general framework for language agent search that combines Monte Carlo Tree Search with LLM-based agents. LATS uses LLM-generated heuristics to guide tree search over agent trajectories.…
We survey the rapidly evolving landscape of AI agent frameworks, analyzing their architectural choices, capabilities, and limitations. We identify key trends and open challenges in agent framework design.…
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
We present AgentCoder, a multi-agent framework for code generation that uses separate agents for programming, test design, and test execution. The iterative testing loop significantly improves code quality.…
Efficient Tool Use with Chain-of-Abstraction Reasoning
We introduce Chain-of-Abstraction (CoA), a method for efficient tool use that separates reasoning from tool execution. CoA generates abstract reasoning chains first, then fills in tool calls, reducing redundant API calls…
We present ChatDev, a virtual chat-powered software company with multi-agent collaboration. ChatDev uses role-playing to simulate different software development roles.…
Adaptive Retrieval-Augmented Generation for Conversational Systems
We present FLARE, an active retrieval augmented generation method that adaptively decides when and what to retrieve during generation. FLARE uses upcoming sentence prediction to trigger retrieval.…
We present a comprehensive survey of LLM-based autonomous agents, covering construction, application, and evaluation. We analyze 150+ papers to identify key trends and open challenges.…
Constitutional AI: Harmlessness from AI Feedback
We present Constitutional AI, a method for training AI systems to be helpful, harmless, and honest using a set of principles (a "constitution") and AI feedback rather than human feedback.…
Mixture-of-Agents Enhances Large Language Model Capabilities
We propose Mixture-of-Agents (MoA), a methodology that leverages the collective strengths of multiple LLMs to improve overall performance. MoA uses multiple LLMs as proposers and aggregators in a layered architecture.…
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
We present Mamba, a selective state space model that achieves linear-time sequence modeling. Mamba selectively propagates or forgets information based on content, enabling efficient long-sequence processing.…
Cognitive Architectures for Language Agents
We draw on the rich history of cognitive science and symbolic AI to propose CoALA, a conceptual framework for language agents. CoALA organizes memory, action, and decision-making into a coherent architecture.…
We survey the history and current state of neuro-symbolic AI, identifying three waves of development. The third wave combines deep learning with symbolic reasoning for robust, interpretable AI systems.…
Improving Factuality and Reasoning in Language Models through Multiagent Debate
We present a method for improving factuality and reasoning in LLMs through multi-agent debate. Multiple agents propose and debate answers, with the final answer emerging from the debate process.…
Toolformer: Language Models Can Teach Themselves to Use Tools
We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.…
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
We present CRITIC, a framework that allows LLMs to validate and progressively amend their own outputs with the assistance of external tools. CRITIC uses tool feedback to identify and correct errors.…
We investigate the effectiveness of process reward models (PRMs) for improving mathematical reasoning. PRMs provide step-level feedback rather than outcome-level feedback, enabling more precise alignment of reasoning cha…
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models
We present DEPS, a method for interactive planning with LLMs that uses a describe-explain-plan-select loop. DEPS enables agents to interactively refine plans based on execution feedback.…
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
We present HuggingGPT, a system that uses ChatGPT as a controller to manage and invoke expert models in HuggingFace. By leveraging the rich model repository, HuggingGPT can tackle complex AI tasks.…
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
We introduce semantic uncertainty, an entropy-based uncertainty measure for free-form natural language generation. Semantic uncertainty accounts for the fact that many different sentences can express the same meaning.…
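The idea of measuring entropy over meanings rather than over surface strings can be illustrated with a minimal sketch. It assumes sampled answers arrive with probabilities and that a caller-supplied `meaning_key` function groups equivalent answers; the `normalize` helper below is a hypothetical stand-in for a real semantic-equivalence check and is not the method used in the paper.

```python
import math
from collections import defaultdict

def semantic_entropy(samples, meaning_key):
    """Entropy over meaning clusters.

    samples: list of (text, probability) pairs for sampled answers.
    meaning_key: maps a text to a cluster label (assumed, caller-supplied).
    """
    cluster_mass = defaultdict(float)
    for text, prob in samples:
        # Answers with the same meaning pool their probability mass.
        cluster_mass[meaning_key(text)] += prob
    total = sum(cluster_mass.values())
    entropy = 0.0
    for mass in cluster_mass.values():
        p = mass / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy

def normalize(text):
    # Hypothetical equivalence check: case and trailing punctuation only.
    return text.lower().strip(" .")

samples = [("Paris", 0.5), ("paris.", 0.3), ("Lyon", 0.2)]
naive = semantic_entropy(samples, lambda t: t)   # every string is its own cluster
pooled = semantic_entropy(samples, normalize)    # "Paris" and "paris." share a cluster
print(naive, pooled)
```

Pooling the two spellings of the same answer lowers the entropy, which reflects that the model is less uncertain about the meaning than the raw strings suggest.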
We study how to optimally scale test-time compute for LLMs. We find that the optimal allocation of test-time compute depends on the difficulty of the problem and the capabilities of the model.…
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
We present CAMEL, a communicative agents framework that uses role-playing to facilitate autonomous cooperation between agents. CAMEL enables agents to collaborate on complex tasks through structured conversation.…
RLHF: Training Language Models to Follow Instructions with Human Feedback
We present InstructGPT, trained using reinforcement learning from human feedback to follow instructions. RLHF significantly improves alignment with human intent compared to supervised fine-tuning alone.…
Sparks of Artificial General Intelligence: Early experiments with GPT-4
We investigate an early version of GPT-4 and argue that it exhibits sparks of AGI. We demonstrate GPT-4's capabilities across diverse domains and analyze its limitations.…
Interpretable Machine Learning: A Guide for Making Black Box Models Explainable
A comprehensive guide to interpretable machine learning, covering LIME, SHAP, and other methods for explaining black box model predictions. Essential reference for enterprise AI transparency.…
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
We introduce Tree of Thoughts (ToT), a framework that generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as inte…
LLM-as-a-Judge: Large Language Models as Evaluators
We explore using strong LLMs as judges to evaluate the quality of responses from chatbots. We introduce MT-bench, a challenging multi-turn question set, and Chatbot Arena, a benchmark platform.…
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimisation
We present Retroformer, a framework for improving language agents through retrospective policy gradient optimization. Retroformer learns from past trajectories to improve future performance without manual reward engineer…
Knowledge Graph Completion with Pretrained Multimodal Transformer for Downstream Tasks
We investigate using pretrained multimodal transformers for knowledge graph completion and downstream reasoning tasks. The approach combines structured knowledge with neural representations.…
Reflexion: Language Agents with Verbal Reinforcement Learning
We present Reflexion, a framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Reflexion converts binary or scalar feedback from the environment into verbal feedback in t…
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning.…
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
We propose least-to-most prompting, a technique that decomposes complex problems into simpler subproblems and solves them sequentially, with each solution building on previous ones.…
Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning
We study tool-augmented mathematical reasoning and propose methods for improving reliability of tool use in computation-intensive tasks. We introduce TORA, a corpus of tool-integrated reasoning examples.…
Voyager: An Open-Ended Embodied Agent with Large Language Models
We present Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention.…
Self-Refine: Iterative Refinement with Self-Feedback
We introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The same LLM generates the output, provides feedback, and refines the output.…
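The generate, feedback, refine cycle described above can be sketched as a short loop. The three callables are hypothetical stand-ins for calls to one and the same LLM; the toy implementations exist only so the sketch runs, and the stopping rule (feedback returns `None`) is an assumption made for illustration.

```python
def self_refine(task, generate, feedback, refine, max_iters=3):
    """Iteratively improve a draft until feedback has nothing left to say."""
    draft = generate(task)
    for _ in range(max_iters):
        critique = feedback(task, draft)
        if critique is None:  # assumed stop signal: no remaining issues
            break
        draft = refine(task, draft, critique)
    return draft

# Toy stand-ins for the three roles (in Self-Refine, one LLM plays all of them).
def toy_generate(task):
    return "hello world"

def toy_feedback(task, draft):
    if not draft[0].isupper():
        return "capitalize the first letter"
    if not draft.endswith("."):
        return "end with a period"
    return None

def toy_refine(task, draft, critique):
    if critique.startswith("capitalize"):
        return draft[0].upper() + draft[1:]
    if critique.startswith("end with"):
        return draft + "."
    return draft

print(self_refine("write a greeting", toy_generate, toy_feedback, toy_refine))
```

Capping the loop with `max_iters` keeps cost bounded when the feedback step never declares the draft finished.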
We discuss emergent abilities of large language models — abilities that are not present in smaller models but appear in larger models. We survey 137 emergent abilities across 8 models.…
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
We present AgentVerse, a multi-agent framework that enables dynamic team formation and collaboration. AgentVerse supports expert recruitment, collaborative decision-making, and action execution.…
AutoGPT is an experimental open-source application showcasing the capabilities of the GPT-4 language model. This program, driven by GPT-4, chains together LLM "thoughts" to autonomously achieve whatever goal you set.…
Mixtral of Experts
We introduce Mixtral 8x7B, a Sparse Mixture of Experts language model. Mixtral uses a router to select 2 of 8 expert FFN layers per token, achieving strong performance with reduced inference cost.…
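A minimal sketch of top-2 routing may help make the "2 of 8 experts per token" idea concrete. It assumes scalar toy experts and hand-written router logits (both hypothetical), and it normalizes weights with a softmax over the two selected logits; it is an illustration of the routing pattern, not Mixtral's implementation.

```python
import math

def top2_route(router_logits, experts, token):
    """Send a token to the two highest-scoring experts and mix their outputs."""
    # Indices of the two largest router logits.
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the selected logits only, so the two weights sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Only the two selected experts are evaluated; the other six are skipped.
    output = sum(w * experts[i](token) for w, i in zip(weights, top2))
    return output, top2, weights

# Hypothetical toy experts: expert k multiplies its input by (k + 1).
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

output, chosen, weights = top2_route(logits, experts, 1.0)
print(chosen, weights, output)
```

Because only two expert functions run per token, compute per token stays close to that of a much smaller dense model even though all eight experts hold parameters.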
Generative Agents: Interactive Simulacra of Human Behavior
We introduce generative agents, computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists create new paintings; and friends throw birthday…
We present a tree-of-thought approach guided by LLM-generated heuristics. The approach uses LLM guidance to prune the search tree, making deliberate reasoning more efficient.…
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
We introduce Direct Preference Optimization (DPO), a stable, performant, and computationally lightweight alternative to RLHF. DPO directly optimizes for human preferences without explicit reward modeling.…
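For reference, the DPO objective can be written as a single loss over preference pairs. The notation here is an assumption introduced for this note and does not appear in the summary above: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} a frozen reference policy, (x, y_w, y_l) a prompt with preferred and dispreferred responses drawn from a dataset \mathcal{D}, \sigma the logistic function, and \beta a temperature-like coefficient.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

The log-ratio terms play the role of an implicit reward, which is why no separate reward model needs to be fitted.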
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
We introduce Promptbreeder, a self-referential system that evolves task prompts and mutation-prompts using LLMs. Promptbreeder automatically discovers better prompts through evolutionary search.…
Automatic Prompt Optimization with "Gradient Descent" and Beam Search
We present a method for automatic prompt optimization using textual "gradient descent" and beam search. The method iteratively improves prompts by analyzing errors and generating improved versions.…
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
We present GITM, a novel LLM-based agent system that leverages text-based knowledge and memory to handle complex tasks in open-world environments.…
Attention Is All You Need
We propose the Transformer, a model architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The Transformer achieves state-of-the-art results on machine translation tasks.…
Society of Mind: A Framework for Collaborative AI Agents
Minsky's foundational theory that intelligence emerges from the interaction of many simple agents. Each agent is unintelligent alone, but complex behavior emerges from their cooperation and competition.…