Chapter 7 · 2026
Log analysis is necessary for credible evaluation of AI agents
Peter Kirgis, Sayash Kapoor, Stephan Rabanser
Abstract
This paper argues that log analysis is necessary for credible evaluation of AI agents, because outcome-based benchmarks alone can be misleading. It presents a taxonomy of threats to credible evaluation, including inflated or deflated scores and concealed dangerous actions. The authors propose guiding principles for log analysis, illustrate them with a case study, and offer recommendations for benchmark creators, model developers, and evaluators.
Topics
AI agents, evaluation, log analysis, benchmarking, credibility
Relevance Scores
Long-Horizon Score: 85
Enterprise Score: 80
Completeness: 75
Paper Info
Year: 2026
Chapter: Ch. 7
Authors: 3
Zone III Analysis
Related Papers
Attention Is All You Need
2017 · Ch.7
Emergent Abilities of Large Language Models
2022 · Ch.7
Evidence Factory: Automated Research Synthesis for Ente…
2025 · Ch.7
Needle In A Haystack: Evaluating Long-Context Retrieval
2023 · Ch.2