305 Papers on Zone III Agentic AI
Curated research on long-horizon autonomous workflows, runtime governance, and enterprise AI architecture.
The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied …
Architecting Trust in Artificial Epistemic Agents
Large language models are increasingly acting as epistemic agents, influencing our knowledge environment and decision-making. This paper argues that the impact of these AI agents on knowledge creation and synthesis neces…
Evaluating Agentic Artificial Intelligence: A Comprehensive Survey of Metrics, Benchmarks, and Methodologies
This survey presents a structured and comprehensive analysis of evaluation methodologies for Agentic AI, introducing an eleven-dimensional taxonomy. It systematically examines benchmarks, frameworks, and evaluation tools…
The Rise of Agentic AI: Synthesis of Current Knowledge and Future Research Agenda
This study synthesizes current knowledge on Agentic AI (AAI) by examining its conceptual foundations, practical applications, and emerging research directions. It distinguishes AAI from traditional automation and generat…
AI Agent Systems: Architectures, Applications, and Evaluation
This survey synthesizes the emerging landscape of AI agent architectures, covering deliberation, reasoning, planning, control, tool calling, and environment interaction. It organizes prior work into a unified taxonomy sp…
Log analysis is necessary for credible evaluation of AI agents
This paper argues that log analysis is crucial for credible evaluation of AI agents, as outcome-based benchmarks alone can be misleading. It presents a taxonomy of threats to credible evaluation, including inflated/defla…
An empirical study of testing practices in open source AI agent frameworks and agentic applications
This empirical study investigates the testing practices employed in open-source AI agent frameworks and agentic applications. It analyzes a dataset of 107 agent framework repositories to identify common testing strategie…
KNOWPLAN: Knowledge-Driven AI Agents for Smart Degree Pathway Planning
Recent advances in large language models (LLMs) provide powerful capabilities for knowledge-driven course planning. However, building reliable, constraint-aware study planners from publicly available course webpages rema…
Digital Intelligent World: From Data-Driven AI to Knowledge-Enabled Intelligent Agents
Although the latest artificial intelligence technologies can greatly improve work efficiency by automatically generating feasible solutions in the digital world (DW), they are incapable of discovering or creating new kno…
Knowledge Graph Representations for LLM-Based Policy Compliance Reasoning
The risks posed by AI features are increasing as they are rapidly integrated into software applications. In this paper, we present an agentic framework that constructs knowledge graphs (KGs) from AI policy documents and …
Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval
Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. This survey provides a concise, structured overview …
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap. Re…
GraphAgents: Knowledge Graph-Guided Agentic AI for Cross-Domain Materials Design
Large Language Models (LLMs) promise to accelerate discovery by reasoning across the expanding scientific landscape. We introduce a multi-agent framework guided by large-scale knowledge graphs to find sustainable substit…
Agentic AI, Context Engineering and Knowledge Graphs: Current Approaches, Challenges and Opportunities
With the recent advancements in Large Language Models (LLMs) and Agentic AI, Context Engineering (CE) has emerged as a novel research area. Knowledge Graphs (KGs) offer a promising approach to integrate diverse contextua…
Anatomical Review of "Toward Efficient Agents: A Survey of Memory, Tool Learning, and Planning" Vol-I
This paper provides an anatomical review of efficient large language model (LLM)-based agents, focusing on memory, tool learning, and planning mechanisms. It systematizes diverse approaches from 2023 to 2025, highlightin…
GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation
Agentic AI workflows often suffer from high error rates in multi-step processes, hindering their reliability in critical applications. This paper introduces GraphFlow, an architecture for formally verifiable visual workf…
Agentic AI workflows offer significant potential for automation, but their scalability and computational cost remain critical challenges. This paper presents AAFLOW, a framework of scalable patterns designed to optimize …
Agentic AI enables large language models (LLMs) to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often suffer from inefficiencies due to suboptimal tool utilization …
Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents
Artificial Intelligence is moving from models that only generate text to Agentic AI, where systems behave as autonomous entities that can perceive, reason, plan, and act. This paper investigates architectures and propose…
Reducing Cost of LLM Agents with Trajectory Reduction
Multi-turn agent systems based on Large Language Models (LLMs) face significant efficiency concerns due to the high computational cost of input tokens from ever-growing trajectories. This paper introduces AgentDiet, an i…
The 2026 Deloitte AI report highlights that while agentic AI usage is poised to rise sharply, oversight is lagging, with only one in five companies having a mature governance model for autonomous AI agents. The report co…
Agentic AI and Autonomous Decision-Making: A Review of Human-in-the-Loop Frameworks, Oversight Mechanisms, and Trust Calibration
The rapid proliferation of agentic artificial intelligence (AI) systems, which are autonomous agents capable of perceiving, reasoning, planning, and executing multi-step tasks with minimal human intervention, presents fo…
AI systems that use reasoning to autonomously pursue goals through interaction with external environments and tools—referred to hereafter as “AI agents” or “agentic AI”—promise transformative benefits for productivity an…
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
LLM-based agents struggle with long-horizon planning due to losing track of goals and sparse rewards in RL fine-tuning. This paper proposes a subgoal-driven framework with an agent that leverages proprietary models for o…
Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents
LLM-based agents often fail to sustain coherent behavior over long planning horizons due to a mismatch between step-wise reasoning and long-horizon planning. This paper argues that locally optimal choices lead to myopic …
LLM-KGPlan: Long-Horizon Task Planning via Knowledge-Guided Reasoning
Long-horizon task planning for robotic autonomy often results in logically inconsistent or physically infeasible plans from LLM-based agents. This paper proposes LLM-KGPlan, a systematic framework combining rule-guided C…
Agentic AI: a comprehensive survey of architectures, applications, and future directions
Agentic AI represents a transformative shift in artificial intelligence, but its rapid advancement has led to a fragmented understanding, often conflating modern neural systems with outdated symbolic models—a practice kn…
From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions
Large Language Models (LLMs) are increasingly deployed as agentic systems that plan, memorize, and act in open-world environments. This shift brings new security problems: failures are no longer only unsafe text generati…
The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey
AI agents that combine large language models with non-AI system components are rapidly emerging in real-world applications, offering unprecedented automation and flexibility. However, this unprecedented flexibility intro…
From threat to trust: assessing security risks of agentic AI systems
Agentic artificial intelligence (AI) systems are expected to have transformative impacts across sectors, including critical areas like finance and healthcare. Their architectural complexity, autonomous decision-making ab…
Threats and vulnerabilities in artificial intelligence and agentic AI models
Adversarial robustness in artificial intelligence is commonly defined in terms of input-level perturbations applied to static models. This study reconceptualises adversarial vulnerability for artificial and agentic AI sy…
Confidence Introspection: A Self-reflection Method for Reliable and Helpful Large Language Models
Large Language Models (LLMs) suffer from factual hallucinations, meaning the LLMs confidently provide responses that are inconsistent with reality. Previous studies explored fine-tuning-based verbalized confidence calibr…
Large Language Models (LLMs) have demonstrated remarkable proficiency in natural language processing applications, encompassing question answering, text generation, and reasoning capabilities. However, their metacognitiv…
Mirroring the Mind: Distilling Human-Like Metacognitive Strategies into Large Language Models
Large Language Models (LLMs) often produce incorrect answers on multi-hop question answering even when the reasoning trace already contains a correct intermediate conclusion. We attribute this gap to weak self-regulation…
Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection
While Chain-of-Thought (CoT) prompting advances LLM reasoning, challenges persist in consistency, accuracy, and self-correction, especially for complex or ethically sensitive tasks. Existing single-dimensional reflection…
What Challenges Do Developers Face in AI Agent Systems? An Empirical Study on Stack Overflow & GitHub Issues
AI Agents have rapidly gained prominence in both research and industry as systems that extend large language models with planning, tool use, memory, and goal-directed action. Despite this progress, the development and ma…
Cyber Warfare and AI Agents: Strengthening National Security Against Advanced Persistent Threats (APTs)
This study addresses the application of Artificial Intelligence (AI) agents to enhance national security against Advanced Persistent Threats (APTs) in cyber warfare. With the increasing sophistication of cyber threats, p…
Agentic AI for autonomous preventive maintenance policy governance: a multi-agent framework for dynamic industrial environments
Agentic Artificial Intelligence (Agentic AI) is emerging as a practical paradigm for coordinating autonomous decision workflows in industrial asset management. This paper proposes an event-driven multi-agent architecture…
Society Agent: A Hierarchical Multi-Agent Architecture with Autonomous Persistent and Ephemeral Agents and Persistent Evolving Knowledge
Traditional AI coding assistants operate as single agents responding to immediate user requests, lacking persistence, organizational structure, and the ability to coordinate complex, long-running tasks. Existing multi-ag…
Large Language Model-Based Data Science Agents: A Comprehensive Survey of Architectures, Workflow Automation, and Emerging Research Challenges
The rapid advancement of large language models (LLMs) has catalyzed a transition from conventional, manually orchestrated data science workflows toward autonomous analytical systems capable of iterative reasoning, tool i…
Agent Benchmarks Fail Public Sector Requirements
This paper argues that existing benchmarks for LLM agents fail to meet the stringent legal, procedural, and structural requirements of the public sector. It defines criteria for public sector-relevant benchmarks, includi…
A hallucination detection and mitigation framework for faithful text summarization using LLMs
Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including text summarization. However, a significant challenge in deploying LLMs for critical applicati…
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands ge…
Understanding and Bridging the Planner-Coder Gap: A Systematic Study on the Robustness of Multi-Agent Systems for Code Generation
Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks. Despite their prosperous development, the fundamental mechani…
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illusion of slowing progress, as even marginal gains in single-step accuracy …
Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation
The increasing complexity of user demands necessitates automation frameworks that can reliably translate open-ended instructions into robust, multi-step workflows. Current monolithic agent architectures often struggle wi…
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
Large Language Models (LLMs) have advanced artificial intelligence by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respon…
SoK: Agentic Retrieval-Augmented Generation (RAG): Taxonomy, Architectures, Evaluation, and Research Directions
Retrieval-Augmented Generation (RAG) systems are increasingly evolving into agentic architectures where large language models autonomously coordinate multi-step reasoning, dynamic memory management, and iterative retriev…
Contextual Memory Virtualisation: DAG-Based State Management and Structurally Lossless Trimming for LLM Agents
This paper proposes Contextual Memory Virtualisation (CMV) to abstract away strict physical token limits of the LLM context window. It uses DAG-based state management and structurally lossless trimming to manage memory s…
This paper introduces M4, a meta-memory framework for LLMs to evaluate its effectiveness in memory management. It features active memory management to prevent unbounded context growth, bounded by finite memory chains.…
Is Agentic RAG worth it? An experimental comparison of RAG approaches
Retrieval-Augmented Generation (RAG) systems are usually defined by the combination of a generator and a retrieval component that extracts textual context from a knowledge base to answer user queries. However, such basic…
Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
This work presents ReMemR1, which integrates memory retrieval into the memory update process, enabling the agent to selectively call back historical memories for non-linear reasoning. It also proposes a multi-level reward…
Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents
This paper compares a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks. It evaluates both architectures on accuracy and cumulative API cost, showi…
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework [Experiment, Analysis & Benchmark]
Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks. This paper summarizes a unified framework that incorporates all existing agent memory methods and extensive…
Global artificial intelligence governance research in the digital and intelligent era: advances, trends and countermeasures
In the era of the intelligent science paradigm, AI governance has emerged as a focal point of global attention. This study systematically reviews the current status of global AI governance research, which contributes to …
Bridging the governance gap: adapting technology pioneer AI governance frameworks for financial services institutions
Financial services institutions are deploying sophisticated GenAI and agentic AI systems within governance frameworks designed for traditional statistical models, creating a widening gap between AI risk and institutional…
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Advancements in model algorithms, the growth of foundational models, and access to high-quality datasets have propelled the evolution of Artificial Intelligence Generated Content (AIGC). Despite its notable successes, AI…
AI-Driven Autonomous Enterprises and the Future of Work: Impact, Ethics, and Value Creation by 2026
Artificial intelligence (AI) systems are moving from assistive tools into enterprise workflows where software agents retrieve information, call tools, draft artifacts, execute bounded actions, and ask for human intervent…
Artificial Intelligence and new regulatory tools for the development of European enterprises
This study analyzes recent actions by European regulators to support enterprise digitalization within the Single Market, focusing on small businesses. It highlights how Information and Communication Technologies (ICT), e…
The Challenges of Balancing AI Compliance and Technological Innovations in Critical Sectors: A Systematic Literature Review
The rapid integration of artificial intelligence (AI) into critical infrastructure including healthcare, finance, energy, and defense, offers transformative benefits but also conflicts with evolving regulatory and govern…
Responsible AI governance in Enterprise Systems: A Risk and Compliance Framework
Artificial Intelligence deployment in enterprise environments has accelerated dramatically, yet governance frameworks struggle to keep pace with the technology's rapid evolution. This research develops a comprehensive ri…
Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention
This paper investigates the effectiveness of LLM critic models in improving agent reliability. It demonstrates that while LLM critics can have high offline accuracy, their interventions at deployment time can lead to sev…
AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
AI agents often fail in ways that are difficult to localize due to probabilistic, long-horizon, multi-agent executions and noisy tool outputs. This paper addresses this by manually annotating failed agent runs and releas…
This paper frames LLM context window management as a constrained resource allocation problem. It proposes a declarative approach where sources contribute to the context proportionally, compressing large memory entries mo…
From RAG to Agentic RAG for Faithful Islamic Question Answering
LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably …
This paper explores conditional memory access to extend LLM context length while managing KV footprint. It evaluates long-context performance by increasing the Rotary Position Embedding base.…
ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
This paper presents ARKV, an adaptive and resource-efficient KV cache management framework for LLM inference under limited memory budgets. It aims to reduce memory usage and maintain high throughput for large context win…
Agentic AI deployment in infrastructure-limited environments: Observability gaps, failure modes, and AI governance primitives
This paper discusses the application of agentic Artificial Intelligence (AI) systems to infrastructure-constrained environments, focusing on observability gaps, failure modes, and AI governance primitives. The study meas…
This paper describes AI agents as stochastic dynamical systems and frames learning to reason as transductive inference. It establishes three main results, including a possible failure mode of naive scaling where models b…
Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes
This paper empirically characterizes faults in agentic AI systems by analyzing 13,602 closed issues and merged pull requests from 40 open-source repositories. It derives taxonomies of fault types, symptoms, and root caus…
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
This paper presents the 2025 AI Agent Index, documenting information regarding the origins, design, capabilities, ecosystem, and safety features of 30 state-of-the-art AI agents. It illuminates broader trends in the deve…
We present an initial investigation into Agentic Retrieval-Augmented Generation (RAG) for Ukrainian, conducted within the UNLP 2026 Shared Task on Multi-Domain Document Understanding. Our system combines two-stage retrie…
This paper examines the legal implications of AI agents under existing and proposed EU law, particularly in light of the AI Act. It analyzes how current regulations apply to autonomous AI systems and identifies potential…
Advancing healthcare AI governance through a comprehensive maturity model based on systematic review
Artificial Intelligence (AI) deployment in healthcare is accelerating, yet governance frameworks remain fragmented and often assume extensive resources. Through a systematic review of 35 frameworks for AI implementation …
AI agents are increasingly deployed to execute important tasks, but many still fail in practice despite rising accuracy scores. This paper proposes twelve concrete metrics that decompose agent reliability along four key …
Agentic Process Automation: Transforming Enterprise Workflows with AI Agents
We present the PASF/PADE framework for agentic process automation in enterprise environments. The framework provides a systematic approach to identifying, designing, and deploying AI agents for long-horizon enterprise wo…
The State of AI Agents 2025: Enterprise Adoption, Challenges, and Roadmap
We present a comprehensive survey of enterprise AI agent adoption in 2025, covering deployment patterns, failure rates, governance practices, and the roadmap to Zone III autonomous operations. Based on interviews with 15…
Agentification Factory: A Methodology for Systematic Enterprise Agent Development
We present the Agentification Factory methodology for systematically developing enterprise AI agents at scale. The factory model provides repeatable processes for agent design, testing, deployment, and governance.…
Patternomics: A Framework for Enterprise AI Pattern Recognition and Reuse
We present Patternomics, a framework for identifying, cataloguing, and reusing successful patterns in enterprise AI deployments. Patternomics provides a systematic approach to building institutional knowledge about what …
The Roundtrip Value Governance Framework for Enterprise AI
We present the Roundtrip Value Governance (RVG) framework, which ensures that every agentic workflow creates measurable value that can be traced back to the original business objective. RVG provides a closed-loop account…
We present a method for improving tool-calling agents at inference time through a reviewer agent that evaluates tool calls before execution. The reviewer provides feedback that allows the primary agent to correct its too…
Tokenomics of Enterprise AI: Economic Models for Agentic Workflows
We present a framework for understanding and optimizing the economic model of enterprise AI deployments, covering token costs, compute allocation, value attribution, and ROI measurement for agentic workflows.…
Evidence Factory: Automated Research Synthesis for Enterprise AI
We present the Evidence Factory, a systematic approach to building and maintaining the research evidence base for enterprise AI decisions. The Evidence Factory automates research collection, synthesis, and relevance asse…
MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To h…
Transforming cybersecurity with agentic AI to combat emerging cyber threats
This paper investigates the transformative potential of agentic AI in cybersecurity, specifically addressing how it can enhance practices in response to emerging threats. It aims to explore how agentic AI can transform c…
A Characterization Study of Bugs in LLM Agent Workflow Orchestration Frameworks
Large Language Models (LLMs) have rapidly gained popularity, transforming research and industry. To support their adoption, LLM agent workflow orchestration frameworks (hereinafter referred to as LLM agent frameworks) li…
Sallma: A software architecture for LLM-based multi-agent systems
The integration of Large Language Models (LLMs) into multi-agent systems has opened new avenues for automating complex workflows. However, designing robust software architectures for these systems remains a challenge. In…
Self-Organizing Agent Network for LLM-based Workflow Automation
Recent multi-agent frameworks built upon large language models (LLMs) have demonstrated remarkable capabilities in complex task planning. However, in real-world enterprise environments, business workflows are typically c…
Difficulty-Aware Agent Orchestration in LLM-Powered Workflows
Large Language Model (LLM)-based agentic systems have shown strong capabilities across various tasks. However, existing multi-agent frameworks often rely on static or task-level workflows, which either over-process simpl…
Control Industrial Automation System with Large Language Model Agents
Traditional industrial automation systems require specialized expertise to operate and complex reprogramming to adapt to new processes. Large language models offer the intelligence to make them more flexible and easier t…
From Code Generation to AI Collaboration: The Role of Multi-Agent Systems in Software Engineering
The integration of multi-agent systems (MAS) in software engineering is revolutionizing the way developers interact with artificial intelligence, shifting from simple code generation to sophisticated AI-driven collaborat…
Agents in software engineering: Survey, landscape, and vision
In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many stud…
Codesim: Multi-agent code generation and problem solving through simulation-driven planning and debugging
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback …
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic…
Blueprint2Code: a multi-agent pipeline for reliable code generation via blueprint planning and repair
Automated programming has become a powerful tool for solving real-world problems. Code generation, in particular, plays a key role in improving developer productivity and reducing the entry barrier to software developmen…
Mitigating LLM Hallucinations Using a Multi-Agent Framework
Large Language Models (LLMs) have shown impressive capabilities in generating human-like text, but they often suffer from "hallucinations," producing factually incorrect or nonsensical information. This issue severely li…
Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Large Language Models (LLMs) are increasingly deployed in real-world applications, yet their propensity for generating factually incorrect or inconsistent information, known as hallucination, remains a significant concer…
Large Language Models Hallucination: A Comprehensive Survey
Large language models (LLMs) have transformed natural language processing, achieving remarkable performance across diverse tasks. However, their impressive fluency often comes at the cost of producing false or fabricated…
DocAgent: A Multi-Agent System for Automated Code Documentation Generation
High-quality code documentation is crucial for software development especially in the era of AI. However, generating it automatically using Large Language Models (LLMs) remains challenging, as existing approaches often p…
ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework
Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code…
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
Large language models (LLMs) struggle with complex, multi-step, long-horizon tasks. This paper proposes Plan-and-Act, a framework that incorporates explicit planning into LLM-based agents and enhances plan generation thr…
OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Existing benchmarks for LLM agents often focus on atomic tasks, failing to capture the complexities of real-world, long-horizon workflows. This paper introduces OdysseyBench, a comprehensive benchmark for evaluating LLM …
Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems
This paper proposes an end-to-end Agent Assessment Framework for evaluating agentic AI systems beyond traditional task completion metrics. It addresses the challenges posed by the non-deterministic nature of LLM agents a…
This survey provides a comprehensive overview of the evaluation methodologies for LLM-based agents. It categorizes existing approaches, discusses common challenges, and highlights key metrics used to assess agent perform…
Preface of NORA 2025: First International Workshop on KNOwledge GRaphs & Agentic Systems Interplay
Agents have experienced significant growth in recent years, largely due to the rapid technological advancements of Large Language Models (LLMs). Although these agents benefit from LLMs’ advanced generation proficiency, t…
KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel…
Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents
The ReAct capability in LLMs forms the foundation of modern agentic systems, but smaller models struggle with complex reasoning tasks. This paper introduces Pre-Act, a novel approach that enhances agent performance by cr…
Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents
LLM-based multi-agent systems often incur high latency for complex tasks requiring multiple iterative reasoning cycles. This paper proposes M1-Parallel, a framework that concurrently runs multiple multi-agent teams to un…
Enhancement of long-horizon task planning via active and passive modification in large language models
This study proposes a method for generating complex and long-horizon off-line task plans using large language models (LLMs). It addresses the limitation of simple planning results by enabling the LLM to actively collect …
Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities
AI agents have experienced a paradigm shift, from early dominance by reinforcement learning (RL) to the rise of agents powered by large language models (LLMs), and now further advancing towards a synergistic fusion of RL…
UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale s…
SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents
Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle c…
Agentic generative AI for context-aware outlier removal and historical cost optimization in construction
This paper explores the application of agentic generative AI for optimizing historical cost data in the construction industry. It focuses on developing context-aware outlier removal techniques to improve the accuracy of …
Demystifying LLM-Based Software Engineering Agents
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and …
A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
This paper introduces a structured methodology for designing, developing, and deploying production-grade agentic AI workflows. It addresses challenges related to reliability, observability, maintainability, and alignment…
LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead
Integrating Large Language Models (LLMs) into autonomous agents marks a significant shift in the research landscape by offering cognitive abilities that are competitive with human planning and reasoning. This article exp…
Small Language Models are the Future of Agentic AI
Large language models (LLMs) are often praised for exhibiting near-human performance on a wide range of tasks. However, the rise of agentic AI systems is ushering in applications where language models perform specialized…
This blog post discusses the critical need for robust AI governance frameworks as agentic AI rapidly advances. It emphasizes that governance should go beyond mere compliance to define how organizations design, deploy, an…
Agentic System Oversight and Human-AI Collaboration: Governance for Autonomous AI
The emergence of agentic AI systems, autonomous entities capable of reasoning, acting, and collaborating, marks a significant shift in the evolution of artificial intelligence. These systems promise transformative benefit…
Agentic Business Process Management: Practitioner Perspectives on Agent Governance in Business Processes
With the rise of generative AI, industry interest in software agents is growing. Given the stochastic nature of generative AI-based agents, their effective and safe deployment in organizations requires robust governance,…
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models
Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can…
The Emerging Agentic Enterprise: How Leaders Must Navigate a New Age of AI
The 2025 Artificial Intelligence and Business Strategy report, from MIT Sloan Management Review and Boston Consulting Group, looks at how organizations that are adopting agentic AI are gaining advantage while facing four…
How “AI Safety” is Leveraged Against Regulatory Oversight
This paper critically examines how the discourse around 'AI safety' is sometimes used to resist or undermine effective regulatory oversight of AI systems. It argues that an overemphasis on speculative, long-term risks ca…
Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Collaboration for Oversight
This paper proposes a novel framework for 'superalignment' that extends beyond traditional weak-to-strong alignment by incorporating human-AI collaboration for robust oversight. It argues that as AI systems become more c…
Agentic Generative AI and National Security: Policy Recommendations for US Military Competitiveness
This paper presents a comprehensive analysis of agentic generative artificial intelligence (AI) frameworks and their integration into modern military systems. We examine the architectural foundations, development pipelines, and…
OWASP GenAI Security Project Releases Top 10 Risks and Mitigations for Agentic AI Security
The OWASP GenAI Security Project released the OWASP Top 10 for Agentic Applications, a key resource to help organizations identify and mitigate the unique risks posed by autonomous AI agents. This framework, developed wi…
Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three c…
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous L…
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged mult…
AI-Native Memory: The Emergence of Persistent, Context-Aware “Second Me” Agents
The concept of AI-native memory is revolutionizing how AI agents interact and learn, moving beyond stateless interactions to create persistent, context-aware 'Second Me' agents. This article explores the architectural sh…
Evaluation and Benchmarking of LLM Agents: A Survey
The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agen…
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, memory, and autonomy, are emerging as powerful, flexible platforms for automation. Their ability to autonomously execute tas…
This study aims to analyze AI Risk Management Frameworks (AI RMFs), exploring their role in promoting the safe, accountable, and transparent adoption of AI technologies within economic systems. The first part of the rese…
Introspective Machines: Are LLMs Better at Self-Reflection Than Humans?
This article challenges conventional boundaries between human and artificial cognition by examining introspective capabilities in large language models (LLMs). Although humans have traditionally been considered unique in…
Artificial Intelligence Governance and Regulation; The impact of the EU AI Act, 2024 on Innovation, Accountability, and Global Compliance in a Digital Age
With the widespread usage and application of Artificial Intelligence, the need to ensure responsible use of AI systems has become a subject of debate. This paper examines the comprehensive approach adopted by the EU AI A…
The year 2024 witnessed accelerated global AI governance advancements, marked by strengthened multilateral frameworks and proliferating national regulatory initiatives. This acceleration underscores an unprecedented need…
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection
We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, …
This paper proposes a novel agentic memory system for LLM agents that dynamically organizes memories using the Zettelkasten method. It creates interconnected knowledge networks through dynamic indexing and linking, allow…
Self-Verifying Reflection Helps Transformers with CoT Reasoning
Advanced large language models (LLMs) frequently reflect in reasoning chain-of-thoughts (CoTs), where they self-verify the correctness of current solutions and explore alternatives. However, given recent findings that LL…
Advancing multi-step mathematical reasoning in large language models through multi-layered self-reflection with auto-prompting
Large Language Models (LLMs) have shown impressive capabilities in various natural language processing tasks, but complex multi-step mathematical reasoning remains a significant challenge. This paper introduces a novel a…
The emergence of AI agents powered by large language models (LLMs) marks a pivotal shift toward the Agentic Web, a new phase of the internet defined by autonomous, goal-driven interactions. In this paradigm, agents inter…
Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
AI agents interacting with each other directly and across internet platforms and physical environments create security challenges beyond traditional cybersecurity and AI safety frameworks. This paper introduces multi-age…
A Research Landscape of Agentic AI and Large Language Models: Applications, Challenges and Future Directions
Agentic AI and Large Language Models (LLMs) are transforming how language is understood and generated while reshaping decision-making, automation, and research practices. LLMs provide underlying reasoning capabilities, a…
Self-reflection enhances large language models towards substantial academic response
Crafting response letters to reviewers’ comments is a time-consuming yet critical part of academic peer review. The inexperience of researchers can hinder the publication of their work, exacerbating the Matthew effect in…
Efficient Low Rank Attention for Long-Context Inference in Large Language Models
This paper introduces Low Rank Query and Key attention (LRQK), a two-stage framework that decomposes full-precision query and key matrices into compact rank factors to reduce GPU memory costs for long-context inference. …
Cognitive Reasoning in Translation: Evaluating Chain-of-Thought, Explaining, Metacognition, and Critique in Humans and General-Purpose vs. Advanced-Reasoning Large Language Models
This chapter explores the cognitive reasoning processes involved in translation, comparing human capabilities with those of large language models (LLMs). We evaluate the effectiveness of various reasoning paradigms, incl…
This report serves as an accessible guide to the emerging field of AI agent governance. Agents - AI systems that can autonomously achieve goals in the world, with little to no explicit human instruction about how to do s…
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
This paper presents insights from a large-scale public competition designed to evaluate the security challenges in AI agent deployment. The competition involved 1.8 million prompt injection attacks and revealed 60,000 su…
Agent Drift: Semantic Degradation in Long-Running Autonomous Systems
We characterize the phenomenon of agent drift — the gradual degradation of semantic coherence in long-running autonomous AI systems. We identify three primary drift mechanisms: context contamination, goal displacement, a…
LangGraph: Building Stateful, Multi-Actor Applications with LLMs
LangGraph is a library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. It extends LangChain with the ability to coordinate multiple chains (or actors) across mul…
Towards Autonomous AI Agents: A Framework for Evaluating Long-Horizon Task Completion
We present a framework for evaluating AI agents on long-horizon tasks that require sustained reasoning, planning, and execution over extended periods. We identify key failure modes and propose metrics for measuring auton…
OdysseyBench: Evaluating Long-Horizon Reasoning in Language Agents
We introduce OdysseyBench, a benchmark specifically designed to evaluate language agents on tasks requiring hundreds of sequential reasoning steps. OdysseyBench reveals non-linear performance degradation with task length…
Towards Long-Horizon Planning with LLMs: A Survey
We survey the landscape of long-horizon planning with LLMs, covering task decomposition, subgoal generation, plan verification, and execution monitoring. We identify key challenges and promising research directions.…
Towards Reliable AI Agents: A Framework for Systematic Failure Analysis
We present a systematic framework for analyzing failures in AI agent systems, covering failure mode identification, root cause analysis, and mitigation strategy development. The framework is validated on 500+ real agent …
Agent Safety: A Framework for Governing Autonomous AI Systems
We present a comprehensive framework for governing autonomous AI systems, covering technical safety measures, organizational governance, and regulatory compliance. The framework addresses the unique challenges of long-ru…
Agentic AI in the Enterprise: Patterns, Pitfalls, and Best Practices
A practitioner's guide to deploying agentic AI in enterprise environments, covering common patterns, failure modes, and architectural best practices from real-world deployments.…
Evaluating Human-AI Collaboration in Long-Horizon Enterprise Tasks
We study human-AI collaboration patterns in long-horizon enterprise tasks, identifying optimal handoff points, oversight mechanisms, and collaboration protocols for Zone III workflows.…
Model Context Protocol (MCP): A Standard for AI-Tool Integration
We introduce the Model Context Protocol (MCP), an open standard that enables seamless integration between AI assistants and external data sources, tools, and services. MCP provides a universal protocol for AI-tool commun…
Autonomous Agents in Healthcare: Safety and Governance Requirements
We examine the safety and governance requirements for deploying autonomous AI agents in healthcare settings, covering clinical decision support, administrative automation, and patient-facing applications.…
Agentic AI in Financial Services: Opportunities and Governance Challenges
The Financial Stability Board examines the use of agentic AI in financial services, identifying opportunities for efficiency gains and systemic risks from autonomous agent deployment.…
RAISE: Implementing Memory to Enable Backtracking and Interrupts in LLM Agents
RAISE implements a dual-component memory system enabling LLM agents to backtrack to previous states and handle interrupts, improving reliability in long-horizon tasks.…
Towards Autonomous AI: Challenges and Opportunities for Enterprise Deployment
Gartner examines the enterprise readiness for autonomous AI deployment, identifying the key technical, organizational, and governance challenges that must be addressed before Zone III workflows can be safely deployed at …
LAGRANGE: Latent Action Grounding for Robust Agentic Navigation in Graph Environments
We present LAGRANGE, a framework for grounding agent actions in structured knowledge graphs to improve reliability and interpretability of long-horizon agent navigation tasks.…
LLM Agents can Autonomously Exploit One-day Vulnerabilities
We demonstrate that LLM agents can autonomously exploit one-day vulnerabilities in real systems, raising significant security concerns for autonomous AI deployment.…
Agentic AI Workflows: The Next Wave of Enterprise Automation
Andrew Ng describes four key agentic design patterns — reflection, tool use, planning, and multi-agent collaboration — and argues that agentic workflows represent the next major wave of AI capability improvement.…
We survey methods for extending the context length of language models, covering positional encoding extensions, efficient attention mechanisms, and memory-augmented architectures.…
Hallucination is Inevitable: An Innate Limitation of Large Language Models
We prove that hallucination is an innate limitation of LLMs, not a fixable bug. We show that any LLM that can answer all questions must hallucinate on some inputs, establishing fundamental limits on factual reliability.…
Responsible AI Practices for Enterprise Deployments
Microsoft's comprehensive framework for responsible AI deployment in enterprise settings, covering fairness, reliability, privacy, security, inclusiveness, transparency, and accountability.…
EU AI Act: Implications for Enterprise AI Systems
The EU AI Act establishes a comprehensive regulatory framework for artificial intelligence in the European Union, with specific requirements for high-risk AI systems including autonomous agents in enterprise settings.…
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
We introduce SWE-bench, a benchmark for evaluating LLMs on real-world software engineering tasks. SWE-bench contains 2,294 GitHub issues and pull requests from 12 popular Python repositories.…
Towards Robust Long-Horizon Task Planning for Embodied Agents
We study the challenges of long-horizon task planning for embodied agents and propose methods for improving robustness to execution failures and environmental changes.…
We propose CodeAct, an agent design that uses executable Python code as the action space instead of structured JSON actions. CodeAct agents can dynamically create and execute code to interact with environments.…
Towards Verifiable AI: Formal Methods for Language Model Verification
We survey the application of formal verification methods to language models and AI agents. We identify key challenges and promising approaches for providing formal guarantees about agent behavior.…
Agentic RAG: Turning RAG Systems into Agents
We present Self-RAG, a framework that trains LLMs to retrieve, generate, and critique their own outputs. Self-RAG adaptively retrieves passages and generates reflective tokens to improve output quality.…
Towards Efficient and Reliable LLM Serving: A Real-World Case Study
We present a real-world case study of LLM serving infrastructure, analyzing latency, throughput, and reliability challenges in production deployments at scale.…
We survey the rapidly evolving landscape of AI agent frameworks, analyzing their architectural choices, capabilities, and limitations. We identify key trends and open challenges in agent framework design.…
Detecting hallucinations in large language models using semantic entropy
Large language models (LLMs) have revolutionized natural language processing, but their tendency to "hallucinate"—generating factually incorrect or nonsensical information—remains a significant challenge. Current methods…
Hallucination is Inevitable: An Innate Limitation of Large Language Models
Hallucination has been widely recognized to be a significant drawback for large language models (LLMs). There have been many works that attempt to reduce the extent of hallucination. These efforts have mostly been empiri…
Large Language Models Can Accomplish Business Process Management Tasks
Business Process Management (BPM) aims to improve organizational activities and their outcomes by managing the underlying processes. To achieve this, it is often necessary to consider information from various sources, in…
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
As Large Language Models (LLMs) continue to advance in their ability to write human-like text, a key challenge remains around their tendency to hallucinate, generating content that appears factual but is ungrounded. This …
From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
With the rise of large language models (LLMs), researchers are increasingly exploring their applications in various vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including…
From Intelligent Automation to Agentic AI: Engineering the Next Generation of Enterprise Systems
Businesses are shifting away from intelligent automation and toward agentic AI: systems of autonomous, goal-oriented software agents that can perceive, reason, act, and learn end to end. In this paper, a concep…
A Survey on Human-AI Collaboration with Large Foundation Models
This paper presents a comprehensive survey on human-AI collaboration with large foundation models (LFMs). It categorizes existing research into three main types: human-in-the-loop, AI-in-the-loop, and human-AI co-creatio…
A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models
The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: t…
Efficient Tool Use with Chain-of-Abstraction Reasoning
We introduce Chain-of-Abstraction (CoA), a method for efficient tool use that separates reasoning from tool execution. CoA generates abstract reasoning chains first, then fills in tool calls, reducing redundant API calls…
Mixture-of-Agents Enhances Large Language Model Capabilities
We propose Mixture-of-Agents (MoA), a methodology that leverages the collective strengths of multiple LLMs to improve overall performance. MoA uses multiple LLMs as proposers and aggregators in a layered architecture.…
Towards Automated Evaluation of LLM-based Multi-turn Dialogue Systems
We propose an automated evaluation framework for multi-turn dialogue systems that assesses coherence, consistency, and task completion across extended conversations.…
We study how to optimally scale test-time compute for LLMs. We find that the optimal allocation depends on the difficulty of the problem and the capabilities of the model, and that scaling inference compute can be more efficient than scaling training compute for many tasks.…
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation
We study whether LLMs can be trusted as evaluators, finding systematic biases including position bias, verbosity bias, and self-enhancement bias that affect evaluation reliability.…
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
We present a platform and benchmark for long-horizon vision-language navigation, requiring agents to navigate complex environments over extended time horizons with minimal guidance.…
Towards Robust Multi-Modal Reasoning via Model Selection
We propose a model selection framework for multi-modal reasoning that dynamically selects the most appropriate model based on task characteristics, improving robustness and efficiency.…
We introduce Mixtral 8x7B, a Sparse Mixture of Experts language model. Mixtral uses a router to select 2 of 8 expert FFN layers per token, achieving strong performance with reduced inference cost.…
Advances in Multi-agent Reinforcement Learning: Persistent Autonomy and Robot Learning Lab Report 2024
Multi-Agent Reinforcement Learning (MARL) approaches have emerged as popular solutions to address the general challenges of cooperation in multi-agent environments, where the success of achieving shared or individual goa…
Evaluating Human-AI Collaboration: A Review and Methodological Framework
This paper provides a comprehensive review of existing methodologies for evaluating human-AI collaboration and proposes a new methodological framework. It identifies key dimensions for assessment, including task performa…
Durable Execution in Temporal: Building Reliable Long-Running Workflows
Temporal's durable execution model ensures that workflows continue executing despite failures, providing automatic retry, state persistence, and exactly-once semantics for long-running business processes.…
Durable Execution: Temporal and Cadence-Aware Workflow Orchestration
Durable execution is a programming model that ensures workflows survive failures, restarts, and infrastructure changes. Temporal provides durable execution primitives for building reliable long-running workflows.…
MemGPT: Towards LLMs as Operating Systems
We present MemGPT, a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window. MemGPT knows when to push critical information to a …
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
MetaGPT encodes software development SOPs into multi-agent collaboration, enabling agents to produce structured outputs (PRDs, architecture designs, code) through role-based communication.…
Toward a Science of Autonomous Systems: Reliability Engineering for AI
We apply systems safety engineering principles to autonomous AI systems, arguing that AI reliability requires the same rigorous engineering discipline as safety-critical systems in aerospace and nuclear industries.…
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
AutoGen is a framework for building LLM applications through multi-agent conversations, enabling flexible agent interaction patterns and human-in-the-loop capabilities.…
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
We present AutoGen, a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow …
GAIA: A Benchmark for General AI Assistants
We introduce GAIA, a benchmark for general AI assistants that tests real-world capabilities requiring multi-step reasoning, tool use, and information synthesis. GAIA questions require an average of 5.4 steps to solve.…
WebArena: A Realistic Web Environment for Building Autonomous Agents
WebArena is a standalone, self-hostable web environment for building autonomous agents. It includes four websites mimicking real-world applications and a benchmark of 812 long-horizon tasks.…
OpenAgents: An Open Platform for Language Agents in the Wild
We present OpenAgents, an open platform for deploying and evaluating language agents in real-world settings. OpenAgents includes three agents: a data analysis agent, a plugin agent, and a web agent.…
Towards Efficient and Reliable LLM Service: A Real-World Case Study
We present a real-world case study of deploying LLM services at scale, covering reliability challenges, latency optimization, and cost management. We identify key engineering lessons for production LLM deployments.…
Lost in the Middle: How Language Models Use Long Contexts
We analyze how language models use long contexts and find that performance degrades when relevant information is in the middle of the context. Models are better at using information at the beginning or end.…
ChatDev is a multi-agent software development framework where specialized agents (CEO, CTO, programmer, tester) collaborate through natural language communication to complete software projects.…
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
SWE-bench is a benchmark for evaluating LLMs on real-world software engineering tasks from GitHub, requiring understanding of large codebases and generating patches that resolve issues.…
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
We introduce ToolLLM, a general tool-use framework enabling LLMs to master 16000+ real-world APIs. We collect ToolBench, an instruction-tuning dataset for tool use, and train ToolLLaMA.…
Scalable Oversight: Supervising AI Systems That Exceed Human Capabilities
We discuss the challenge of providing oversight to AI systems that may exceed human capabilities in some domains. We propose scalable oversight as a research agenda for maintaining meaningful human control.…
The Economic Potential of Generative AI: The Next Productivity Frontier
McKinsey estimates that generative AI could add $2.6 to $4.4 trillion annually to the global economy. The report identifies the highest-value use cases and the organizational changes required to capture this value.…
PEARL: Prompting Large Language Models to Plan and Execute Actions for Long-Horizon Tasks
We present PEARL, a prompting framework for long-horizon task planning and execution. PEARL decomposes tasks into action plans, executes them step by step, and self-evaluates progress.…
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Tree of Thoughts (ToT) enables LLMs to explore multiple reasoning paths, evaluate intermediate steps, and backtrack when necessary, enabling deliberate problem solving for complex tasks.…
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
We present MetaGPT, a meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences.…
WebArena: A Realistic Web Environment for Building Autonomous Agents
We present WebArena, a standalone, self-hostable web environment for building autonomous agents. WebArena includes realistic web applications with functional tools, user interfaces, and data.…
ReAct: Synergizing Reasoning and Acting in Language Models
We explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner. ReAct allows LLMs to interact with external tools to retrieve additional information, leading to more relia…
Evaluating Language-Model Agents on Realistic Autonomous Tasks
We evaluate language model agents on 12 realistic autonomous tasks requiring multi-step reasoning and real-world tool use, finding that current models succeed on only a small fraction of tasks.…
Needle In A Haystack: Evaluating Long-Context Retrieval
We introduce the Needle-in-a-Haystack evaluation, which tests whether LLMs can retrieve specific information from long documents. The evaluation reveals significant performance degradation with context length.…
Measuring and Reducing LLM Hallucination without Gold Standard Answers
We present methods for measuring and reducing LLM hallucinations without requiring gold standard reference answers. Our approach uses consistency checking and uncertainty estimation.…
Auditing Large Language Models: A Three-Layer Approach
We propose a three-layer approach to auditing large language models: governance audits, model audits, and application audits. The framework provides a systematic methodology for enterprise AI accountability.…
Gorilla: Large Language Model Connected with Massive APIs
We introduce Gorilla, a finetuned LLM that can write accurate API calls. Gorilla is trained on 1,600+ API documentation pages and outperforms GPT-4 on writing API calls.…
Toolchain*: Efficient Action Space Navigation in Large Language Model Agents
We present Toolchain*, a planning algorithm that efficiently navigates the action space of tool-using agents. Toolchain* uses A* search to find optimal tool sequences for complex tasks.…
Mind2Web is the first dataset for developing and evaluating generalist agents for the web, with over 2,000 open-ended tasks collected from 137 websites spanning 31 domains.…
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
CRITIC enables LLMs to self-correct by interacting with external tools to verify and critique their outputs, improving accuracy on code generation, mathematical reasoning, and question answering.…
We present LATS, a general framework for language agent search that combines Monte Carlo Tree Search with LLM-based agents. LATS uses LLM-generated heuristics to guide tree search over agent trajectories.…
We propose a unifying framework for language agents drawing on cognitive science, organizing agents around memory, action, and decision-making components.…
We present AgentBench, a multi-dimensional evolving benchmark for evaluating LLMs as agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital games, and web browsing.…
Agents: An Open-source Framework for Autonomous Language Agents
We present Agents, an open-source library for building autonomous language agents. Agents supports long-short term memory, tool use, web navigation, multi-agent communication, and human-agent interaction.…
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
We present AgentCoder, a multi-agent framework for code generation that uses separate agents for programming, test design, and test execution. The iterative testing loop significantly improves code quality.…
Generative Agents: Interactive Simulacra of Human Behavior
Generative agents are computational agents that simulate believable human behavior, using a memory stream, reflection, and planning to produce coherent long-term behavior in a simulated environment.…
Although Large Language Models (LLMs) have demonstrated remarkable code-generation ability, they still struggle with complex tasks. In real-world software development, humans usually tackle complex tasks through collabor…
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP), fueling a paradigm shift in information acquisition. Nevertheless, LLMs are prone to hallucination…
LLM-State: Open World State Representation for Long-horizon Task Planning with Large Language Model
This work addresses long-horizon task planning with LLMs in open-world household environments, where existing methods fail to explicitly track key objects and attributes. The paper proposes an open state representation t…
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generat…
Hallucination Detection for Generative Large Language Models by Bayesian Sequential Estimation
Large Language Models (LLMs) have made remarkable advancements in the field of natural language generation. However, the propensity of LLMs to generate inaccurate or non-factual content, termed "hallucinations", remains …
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Large Language Models (LLMs) are prone to generating factual inaccuracies, a phenomenon known as hallucination. Detecting these hallucinations without access to external knowledge or ground truth is a challenging problem…
Agents: An Open-source Framework for Autonomous Language Agents
Agents is an open-source framework for building autonomous language agents with long-short term memory, tool use, web navigation, multi-agent communication, and human-agent interaction.…
LATS unifies reasoning, acting, and planning in language agents using Monte Carlo Tree Search, enabling deliberate exploration and exploitation of the action space.…
We present ChatDev, a virtual chat-powered software company with multi-agent collaboration. ChatDev uses role-playing to simulate different software development roles.…
Adaptive Retrieval-Augmented Generation for Conversational Systems
We present FLARE, an active retrieval augmented generation method that adaptively decides when and what to retrieve during generation. FLARE uses upcoming sentence prediction to trigger retrieval.…
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager is the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention.…
A Survey on Large Language Model based Autonomous Agents
We present a comprehensive survey of LLM-based autonomous agents, covering construction, application, and evaluation. We analyze 150+ papers to identify key trends and open challenges.…
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
We present Mamba, a selective state space model that achieves linear-time sequence modeling. Mamba selectively propagates or forgets information based on content, enabling efficient long-sequence processing.…
We study the faithfulness of chain-of-thought reasoning, finding that LLMs often produce reasoning chains that are plausible but not causally connected to their final answers.…
OpenAgents: An Open Platform for Language Agents in the Wild
OpenAgents is an open platform for using and hosting language agents in the wild, featuring three agents: a data analysis agent, a plugins agent, and a web agent.…
Cognitive Architectures for Language Agents
We draw on the rich history of cognitive science and symbolic AI to propose CoALA, a conceptual framework for language agents. CoALA organizes memory, action, and decision-making into a coherent architecture.…
Improving Factuality and Reasoning in Language Models through Multiagent Debate
We present a method for improving factuality and reasoning in LLMs through multi-agent debate. Multiple agents propose and debate answers, with the final answer emerging from the debate process.…
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
HuggingGPT uses ChatGPT as a controller to manage and organize expert models from HuggingFace to solve AI tasks, connecting task planning, model selection, task execution, and response generation.…
Toolformer: Language Models Can Teach Themselves to Use Tools
Toolformer trains language models to decide which tools to call, when to call them, and how to incorporate the results, enabling self-supervised tool use without human annotations.…
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models
We present DEPS, a method for interactive planning with LLMs that uses a describe-explain-plan-select loop. DEPS enables agents to interactively refine plans based on execution feedback.…
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents
LLM-Planner uses LLMs for few-shot grounded planning in embodied agents, dynamically replanning based on environmental feedback to complete long-horizon tasks.…
We investigate the effectiveness of process reward models (PRMs) for improving mathematical reasoning. PRMs provide step-level feedback rather than outcome-level feedback, enabling more precise alignment of reasoning cha…
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Multiagent debate improves factuality and reasoning by having multiple LLM instances propose and debate answers, converging on more accurate solutions through iterative refinement.…
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
We present CRITIC, a framework that allows LLMs to validate and progressively amend their own outputs with the assistance of external tools. CRITIC uses tool feedback to identify and correct errors.…
Toolformer: Language Models Can Teach Themselves to Use Tools
We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into the future token prediction.…
LATM (LLMs As Tool Makers) enables LLMs to create their own reusable tools for problem-solving, separating tool creation from tool use with a dispatcher that routes tasks to appropriate tools.…
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
We present HuggingGPT, a system that uses ChatGPT as a controller to manage and invoke expert models in HuggingFace. By leveraging the rich model repository, HuggingGPT can tackle complex AI tasks.…
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models
DEPS (Describe, Explain, Plan and Select) is an interactive planning approach for embodied agents that decomposes complex tasks into manageable subtasks with explicit explanation steps.…
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
We present CAMEL, a communicative agents framework that uses role-playing to facilitate autonomous cooperation between agents. CAMEL enables agents to collaborate on complex tasks through structured conversation.…
We present AgentBench, a multi-dimensional evolving benchmark to evaluate LLMs as agents in various environments including operating systems, databases, knowledge graphs, digital games, and web browsing.…
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
We introduce semantic uncertainty, an entropy-based uncertainty measure for free-form natural language generation. Semantic uncertainty accounts for the fact that many different sentences can express the same meaning.…
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a finetuned LLaMA-based model that surpasses GPT-4 on writing API calls. We introduce APIBench, a comprehensive dataset of HuggingFace, TorchHub, and TensorHub APIs.…
Sparks of Artificial General Intelligence: Early experiments with GPT-4
We investigate an early version of GPT-4 and argue that it exhibits sparks of AGI. We demonstrate GPT-4's capabilities across diverse domains and analyze its limitations.…
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
We introduce Tree of Thoughts (ToT), a framework that generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as inte…
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization
We present Retroformer, a framework for improving language agents through retrospective policy gradient optimization. Retroformer learns from past trajectories to improve future performance without manual reward engineer…
LLM-as-a-Judge: Large Language Models as Evaluators
We explore using strong LLMs as judges to evaluate the quality of responses from chatbots. We introduce MT-bench, a challenging multi-turn question set, and Chatbot Arena, a benchmark platform.…
Knowledge Graph Completion with Pretrained Multimodal Transformer for Downstream Tasks
We investigate using pretrained multimodal transformers for knowledge graph completion and downstream reasoning tasks. The approach combines structured knowledge with neural representations.…
Reflexion: Language Agents with Verbal Reinforcement Learning
We present Reflexion, a framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Reflexion converts binary or scalar feedback from the environment into verbal feedback in t…
Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning
We study tool-augmented mathematical reasoning and propose methods for improving reliability of tool use in computation-intensive tasks. We introduce CARP, a dataset of computation-intensive algebra problems annotated with intermediate steps.…
Voyager: An Open-Ended Embodied Agent with Large Language Models
We present Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention.…
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing communicative agent framework to enable autonomous cooperation among agents. Using inception prompting, it guides agents to complete tasks while maintaining conversational coherence.…
We introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The same LLM generates the output, provides feedback, and refines the output.…
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors
We present AgentVerse, a multi-agent framework that enables dynamic team formation and collaboration. AgentVerse supports expert recruitment, collaborative decision-making, and action execution.…
AutoGPT is an experimental open-source application showcasing the capabilities of the GPT-4 language model. This program, driven by GPT-4, chains together LLM "thoughts" to autonomously achieve whatever goal you set.…
Sparks of Artificial General Intelligence: Early experiments with GPT-4
We present an early evaluation of GPT-4 across diverse domains, arguing it exhibits sparks of AGI through its performance on novel tasks requiring reasoning, creativity, and multi-step problem solving.…
Generative Agents: Interactive Simulacra of Human Behavior
We introduce generative agents, computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists create new paintings; and friends throw birthday…
We present a tree-of-thought approach guided by LLM-generated heuristics. The approach uses LLM guidance to prune the search tree, making deliberate reasoning more efficient.…
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
We introduce Promptbreeder, a self-referential system that evolves task prompts and mutation-prompts using LLMs. Promptbreeder automatically discovers better prompts through evolutionary search.…
Automatic Prompt Optimization with "Gradient Descent" and Beam Search
We present a method for automatic prompt optimization using textual "gradient descent" and beam search. The method iteratively improves prompts by analyzing errors and generating improved versions.…
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
We introduce Direct Preference Optimization (DPO), a stable, performant, and computationally lightweight alternative to RLHF. DPO directly optimizes for human preferences without explicit reward modeling.…
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
We present GITM, a novel LLM-based agent system that leverages text-based knowledge and memory to handle complex tasks in open-world environments.…
AgentSims: An Open-Source Sandbox for Large Language Model Evaluation
AgentSims is an open-source sandbox for evaluating LLM capabilities through task-based assessment in a simulated town environment, enabling evaluation of planning, memory, and social interaction.…
Optimising Human-AI Collaboration by Learning Convincing Explanations
This paper explores how to optimize human-AI collaboration by enabling AI systems to learn and generate convincing explanations for their decisions. It proposes a framework where AI agents are trained not only to perform…
Faithful Reasoning Using Large Language Models
We present a method for faithful reasoning with LLMs that produces verifiable reasoning chains. The approach separates reasoning into selection and inference steps, enabling verification of each step.…
Constitutional AI: Harmlessness from AI Feedback
We present Constitutional AI, a method for training AI systems to be helpful, harmless, and honest using a set of principles (a "constitution") and AI feedback rather than human feedback.…
We survey the history and current state of neuro-symbolic AI, identifying three waves of development. The third wave combines deep learning with symbolic reasoning for robust, interpretable AI systems.…
Constitutional AI: Harmlessness from AI Feedback
Constitutional AI trains harmless AI assistants through a set of principles (a constitution) that guide the model to self-critique and revise its outputs.…
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Least-to-most prompting decomposes complex problems into simpler subproblems and solves them sequentially, enabling generalization to harder problems than seen in demonstrations.…
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting enables LLMs to solve complex reasoning tasks by generating intermediate reasoning steps, dramatically improving performance on arithmetic, commonsense, and symbolic reasoning.…
RLHF: Training Language Models to Follow Instructions with Human Feedback
We present InstructGPT, trained using reinforcement learning from human feedback to follow instructions. RLHF significantly improves alignment with human intent compared to supervised fine-tuning alone.…
Interpretable Machine Learning: A Guide for Making Black Box Models Explainable
A comprehensive guide to interpretable machine learning, covering LIME, SHAP, and other methods for explaining black box model predictions. Essential reference for enterprise AI transparency.…
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
We propose least-to-most prompting, a technique that decomposes complex problems into simpler subproblems and solves them sequentially, with each solution building on previous ones.…
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning.…
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Self-consistency samples multiple reasoning paths and selects the most consistent answer, improving chain-of-thought reasoning without additional training.…
We discuss emergent abilities of large language models — abilities that are not present in smaller models but appear in larger models. We survey 137 emergent abilities across 8 models.…
We survey emergent abilities of large language models — capabilities not present in smaller models that appear unpredictably at scale — and discuss their implications for AI development.…
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
We present RAG, a general-purpose fine-tuning recipe that combines parametric memory with non-parametric memory for knowledge-intensive NLP tasks. RAG retrieves relevant documents and conditions generation on them.…
We propose the Transformer, a model architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The Transformer achieves state-of-the-art results on machine translation tasks.…
Minsky's foundational theory holds that intelligence emerges from the interaction of many simple agents. Each agent is unintelligent alone, but complex behavior emerges from their cooperation and competition.…