Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation

Yuxuan Liu, Tianchi Yang, Shaohan Huang

Abstract

We study whether LLMs can be trusted as evaluators, finding systematic biases including position bias, verbosity bias, and self-enhancement bias that affect evaluation reliability.

Eigenvector Warning: Zone III / PASF-PADE Analysis

Not part of the original paper

Eigenvector Research — Marco van Hurne

How this paper contributes to solving the Zone III problem (PASF-PADE)

LLM-as-judge, using one model to evaluate another's outputs, is a common pattern in Zone III quality assurance. This paper reveals that LLM judges have systematic biases (position bias, verbosity bias, and self-enhancement bias) that can corrupt evaluation. Zone III quality assurance therefore cannot rely solely on LLM evaluation; it needs diverse evaluation mechanisms, including rule-based checks, human spot-checks, and statistical monitoring.
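One simple statistical check for position bias is to present each answer pair to the judge twice, with the order swapped, and measure how often the verdict stays with the same underlying answer. The sketch below is an illustration under stated assumptions, not the paper's method: the `Judge` signature, the `"A"`/`"B"` verdict labels, and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# takes (prompt, answer_in_slot_a, answer_in_slot_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Return the fraction of items whose verdict survives swapping the answer order."""
    if not items:
        raise ValueError("items must be non-empty")
    consistent = 0
    for prompt, first, second in items:
        forward = judge(prompt, first, second)    # `first` shown in slot A
        backward = judge(prompt, second, first)   # same pair, order swapped
        # A position-independent judge picks the same underlying answer both
        # times, so the slot label must flip between the two runs.
        if (forward, backward) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / len(items)


def always_first(prompt: str, a: str, b: str) -> str:
    """Toy judge with maximal position bias: always prefers slot A."""
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    """Toy judge with verbosity bias but no position bias (for unequal lengths)."""
    return "A" if len(a) >= len(b) else "B"


items = [
    ("q1", "short", "a much longer answer"),
    ("q2", "detailed reply here", "ok"),
]
bias_score = position_consistency(always_first, items)
verbose_score = position_consistency(longer_wins, items)
```

The verbosity-biased toy judge passes this check perfectly, which illustrates why a single mechanism is not enough: order-swap consistency only detects position bias, so other biases need their own checks, such as rule-based tests or human spot-checks.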

Why AI is not sufficient for Zone III without this

Zone III refers to high-complexity, high-risk, long-running agentic workflows: the class of enterprise AI deployments where a single failure can cascade across hundreds of steps. Standard AI models, trained to predict the next token, are not inherently designed for durable, governed, multi-step execution. This paper addresses one of the structural gaps that make Zone III deployments unsafe without explicit architectural intervention: the reliability of LLM-based evaluation.

Topics

LLM evaluation, bias, meta-evaluation, reliability

Relevance Scores

Long-Horizon Score: 78

Enterprise Score: 82

Completeness: 83

Paper Info

Year: 2024

Venue

Type

Chapter: Ch. 6

Authors: 3

Zone III Analysis

Frameworks

AEGIS