LLM-as-a-Judge: Large Language Models as Evaluators

Lianmin Zheng (UC Berkeley), Wei-Lin Chiang (UC Berkeley)

Abstract

We explore using strong LLMs as judges to evaluate the quality of responses from chatbots. We introduce MT-bench, a challenging multi-turn question set, and Chatbot Arena, a benchmark platform.

Eigenvector Insight: Zone III / PASF-PADE Analysis

Not part of the original paper

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

LLM-as-judge offers a practical approach to the agent evaluation problem. Enterprise deployments need automated quality assessment of agent outputs, without human review of every step, and this pattern provides it.
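A minimal sketch of that pattern, assuming a hypothetical judge callable that returns a 1-10 score for each agent step. The names `StepResult`, `triage_steps`, `judge_fn`, and `toy_judge`, and the threshold of 7, are illustrative and do not come from the paper. Only steps scored below the threshold are escalated to a human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step_id: int
    task: str
    output: str


def triage_steps(
    steps: List[StepResult],
    judge_fn: Callable[[str, str], int],
    threshold: int = 7,
) -> List[StepResult]:
    """Return the steps whose judge score falls below `threshold`.

    `judge_fn(task, output)` stands in for a call to a strong LLM that
    grades the output. It is hypothetical, not an API from the paper.
    """
    review_queue = []
    for step in steps:
        score = judge_fn(step.task, step.output)
        if score < threshold:
            review_queue.append(step)
    return review_queue


def toy_judge(task: str, output: str) -> int:
    # Toy stand-in for an LLM judge: an empty output scores 1, anything else 8.
    return 8 if output.strip() else 1


steps = [
    StepResult(1, "Summarize the ticket", "Customer reports a login failure."),
    StepResult(2, "Draft a reply", ""),
]
flagged = triage_steps(steps, toy_judge)
print([s.step_id for s in flagged])
```

In a real deployment `toy_judge` would be replaced by a prompt to a judge model. The triage logic stays the same, so human attention goes only to the flagged steps.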

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. The structural gap this paper addresses is evaluation: judging the quality of outputs automatically rather than through human review of every step.
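To make the cascade risk concrete, here is a rough illustration under a simplifying assumption that is not from the paper: every step succeeds independently with the same probability. The function name and the numbers are illustrative only.

```python
def workflow_success_probability(per_step_success: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


# Illustrative values: 99% per-step reliability over workflows of growing length.
for n in (10, 100, 200):
    print(n, round(workflow_success_probability(0.99, n), 3))
```

Under this assumption, a 200-step workflow with 99% per-step reliability finishes without any failure only about 13% of the time, which is why per-step quality checks matter more as workflows get longer.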

Key Contributions

→ LLM-as-judge methodology
→ MT-bench multi-turn benchmark
→ Chatbot Arena evaluation platform

Topics

LLM evaluation, automated judging, multi-turn evaluation, benchmarks

Relevance Scores

Long-Horizon Score: 75

Enterprise Score: 84

Completeness: 83

Paper Info

Year: 2023

Venue: NeurIPS 2023

Type: evaluation

Chapter: Ch. 5

Authors: 2

Zone III Analysis

Frameworks

AEGIS PASF