Chapter 7 · 2026
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
Abstract
We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability–reliability gap. ResearchGym provides infrastructure for systematic evaluation and analysis of autonomous agents on closed-loop research.
Topics
AI Agents, Language Models, Evaluation, Benchmarks
Relevance Scores
Long-Horizon Score: 85
Enterprise Score: 80
Completeness: 75
Paper Info
Year: 2026
Venue
Type
Chapter: Ch. 7
Authors: 3
Zone III Analysis
Related Papers
Attention Is All You Need
2017 · Ch.7
Emergent Abilities of Large Language Models
2022 · Ch.7
Evidence Factory: Automated Research Synthesis for Ente…
2025 · Ch.7