Chapter 2 · 2025
OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Weixuan Wang, Dongge Han, Daniel Madrigal Diaz
Abstract
Existing benchmarks for LLM agents often focus on atomic tasks and fail to capture the complexity of real-world, long-horizon workflows. This paper introduces OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications. It also proposes HomerAgents, a multi-agent framework for automated benchmark generation, and shows that OdysseyBench effectively challenges state-of-the-art LLM agents.
Topics
LLM agents, long-horizon tasks, benchmarks, office applications, multi-agent systems
Relevance Scores
Long-Horizon Score: 85
Enterprise Score: 80
Completeness: 75
Paper Info
Year: 2025
Venue
Type
Chapter: Ch. 2
Authors: 3
Zone III Analysis