Chapter 6 · 2024

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation

Yuxuan Liu, Tianchi Yang, Shaohan Huang

Abstract

We study whether LLMs can be trusted as evaluators, finding systematic biases including position bias, verbosity bias, and self-enhancement bias that affect evaluation reliability.

Eigenvector Warning: Zone III / PASF-PADE Analysis
Not part of the original paper
Eigenvector Research — Marco van Hurne
How this paper contributes to solving the Zone III problem (PASF-PADE)

LLM-as-judge is a common pattern in Zone III quality assurance — using one model to evaluate another's outputs. This paper reveals that LLM judges have systematic biases that can corrupt evaluation. Zone III quality assurance cannot rely solely on LLM evaluation; it needs diverse evaluation mechanisms including rule-based checks, human spot-checks, and statistical monitoring.
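As an illustration of the statistical monitoring mentioned above, one simple probe for position bias is to show a pairwise judge the same two answers in both orders and count how often it picks the same underlying answer. The sketch below is a minimal example under stated assumptions, not the paper's method: the `Judge` signature, the `"A"`/`"B"` verdict convention, and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# takes (prompt, answer_in_slot_A, answer_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Return the fraction of items where the judge prefers the same
    underlying answer regardless of which slot it is shown in."""
    if not items:
        raise ValueError("items must not be empty")
    consistent = 0
    for prompt, first, second in items:
        verdict_original = judge(prompt, first, second)  # `first` shown in slot A
        verdict_swapped = judge(prompt, second, first)   # `first` shown in slot B
        # Map slot verdicts back to the underlying answers.
        winner_original = first if verdict_original == "A" else second
        winner_swapped = second if verdict_swapped == "A" else first
        if winner_original == winner_swapped:
            consistent += 1
    return consistent / len(items)


# Toy judges for demonstration only.
def always_first(prompt: str, a: str, b: str) -> str:
    """A maximally position-biased judge: always prefers slot A."""
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    """A position-independent (but verbosity-biased) judge: prefers the longer answer."""
    return "A" if len(a) >= len(b) else "B"


sample_items = [
    ("What is 2 + 2?", "4", "The answer is 4."),
    ("Name a prime number.", "Seven is prime.", "7"),
]

biased_score = position_consistency(always_first, sample_items)
length_score = position_consistency(longer_wins, sample_items)
```

A low consistency score flags position bias, but a high score does not certify the judge: `longer_wins` is perfectly consistent under swapping while still being verbosity-biased, which is why a check like this would complement rule-based checks and human spot-checks rather than replace them.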

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. This paper addresses one of the structural gaps that make Zone III deployments unsafe without explicit architectural intervention: the reliability of LLM-based evaluation.

Topics

LLM evaluation, bias, meta-evaluation, reliability