evaluation · Chapter 5 · NeurIPS 2023

LLM-as-a-Judge: Large Language Models as Evaluators

Lianmin Zheng (UC Berkeley), Wei-Lin Chiang (UC Berkeley)

Abstract

We explore using strong LLMs as judges to evaluate the quality of responses from chatbots. We introduce MT-bench, a challenging multi-turn question set, and Chatbot Arena, a benchmark platform.

Eigenvector Insight: Zone III / PASF-PADE Analysis
Not part of the original paper
Eigenvector Research — Marco van Hurne
How this paper contributes to solving the Zone III problem (PASF-PADE)

LLM-as-judge is a practical approach to the agent evaluation problem. For enterprise deployments, automated quality assessment of agent outputs, without human review of every step, requires exactly this pattern: a strong LLM that scores each output in place of a human reviewer.
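The pattern described above can be sketched as a small gate that asks a judge model for a score and escalates low-scoring outputs to a human. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt wording, the `[[score]]` output convention as used here, the 1 to 10 scale, the threshold of 7, and the names `build_judge_prompt`, `parse_score`, and `needs_human_review` are all hypothetical. The `judge` argument stands in for whatever call sends a prompt to a strong LLM and returns its text reply.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the wording is illustrative, not taken from the paper.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's response to the task "
    "on a scale of 1 to 10, then give the score as [[score]].\n\n"
    "[Task]\n{task}\n\n[Response]\n{response}\n"
)


def build_judge_prompt(task: str, response: str) -> str:
    """Fill the template with the task and the agent output to be judged."""
    return JUDGE_TEMPLATE.format(task=task, response=response)


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last [[n]] in the judge's reply, or None if absent or out of range."""
    matches = re.findall(r"\[\[(\d+)\]\]", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None


def needs_human_review(
    task: str,
    response: str,
    judge: Callable[[str], str],  # assumed: sends a prompt to an LLM, returns its text
    threshold: int = 7,           # assumed cut-off; tune per deployment
) -> bool:
    """Escalate when the judge gives a low score or no parseable score at all."""
    score = parse_score(judge(build_judge_prompt(task, response)))
    return score is None or score < threshold


if __name__ == "__main__":
    # Stub judge so the sketch runs without any model access.
    def fake_judge(prompt: str) -> str:
        return "The response is accurate and complete. [[8]]"

    print(needs_human_review("Summarize the ticket.", "The user reports...", fake_judge))
```

Treating an unparseable reply as a reason to escalate is a deliberate choice in this sketch: a judge that fails to produce a score should not silently pass an output through.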

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. The structural gap this paper addresses is evaluation: without an automated judge, the quality of intermediate agent outputs goes unchecked unless a human reviews every step.

Key Contributions

  • LLM-as-judge methodology
  • MT-bench multi-turn benchmark
  • Chatbot Arena evaluation platform

Topics

  • LLM evaluation
  • automated judging
  • multi-turn evaluation
  • benchmarks