evaluation · Chapter 5 · NeurIPS 2023 · 2023
LLM-as-a-Judge: Large Language Models as Evaluators
Lianmin Zheng (UC Berkeley), Wei-Lin Chiang (UC Berkeley)
Abstract
We explore using strong LLMs as judges to evaluate the quality of responses from chatbots. We introduce MT-bench, a challenging multi-turn question set, and Chatbot Arena, a benchmark platform.
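As a rough illustration of the LLM-as-judge idea described above, the following minimal sketch formats a multi-turn exchange between two chatbots and asks a judge model for a pairwise verdict. It is not the paper's prompt or code: the names `Judge`, `Turn`, `build_pairwise_prompt`, `parse_verdict`, `judge_pair`, and `stub_judge`, as well as the prompt wording and the A/B/tie verdict format, are assumptions made for illustration. The judge is passed in as a plain callable, so any real model client could be substituted.

```python
from typing import Callable, List, Tuple

# Hypothetical types for this sketch (not from the paper).
Judge = Callable[[str], str]   # prompt in, raw judge reply out
Turn = Tuple[str, str, str]    # (user question, assistant A reply, assistant B reply)


def build_pairwise_prompt(turns: List[Turn]) -> str:
    """Format a multi-turn exchange so the judge sees both assistants per turn."""
    lines = [
        "Compare the two assistants' answers to the user questions below.",
        "Reply with exactly one of: A, B, tie.",
        "",
    ]
    for i, (question, reply_a, reply_b) in enumerate(turns, start=1):
        lines.append(f"[Turn {i}] User: {question}")
        lines.append(f"[Turn {i}] Assistant A: {reply_a}")
        lines.append(f"[Turn {i}] Assistant B: {reply_b}")
    return "\n".join(lines)


def parse_verdict(reply: str) -> str:
    """Map the raw judge reply to 'A', 'B', 'tie', or 'invalid'."""
    cleaned = reply.strip().lower()
    if cleaned == "a":
        return "A"
    if cleaned == "b":
        return "B"
    if cleaned == "tie":
        return "tie"
    # Anything else is flagged rather than guessed at.
    return "invalid"


def judge_pair(judge: Judge, turns: List[Turn]) -> str:
    """Ask the judge for a verdict on one multi-turn comparison."""
    return parse_verdict(judge(build_pairwise_prompt(turns)))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline; always answers 'A'."""
    return "A"


if __name__ == "__main__":
    example = [
        ("What is 2 + 2?", "4", "5"),
        ("Now double it.", "8", "10"),
    ]
    print(judge_pair(stub_judge, example))
```

Keeping the judge behind a callable separates prompt construction and verdict parsing from the model call, which makes both easy to test without network access.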
Key Contributions
- LLM-as-judge methodology
- MT-bench multi-turn benchmark
- Chatbot Arena evaluation platform
Topics
LLM evaluation, automated judging, multi-turn evaluation, benchmarks
Relevance Scores
Long-Horizon Score: 75
Enterprise Score: 84
Completeness: 83
Paper Info
Year: 2023
Venue: NeurIPS 2023
Type: evaluation
Chapter: Ch. 5
Authors: 2
Zone III Analysis
Related Papers
ReAct: Synergizing Reasoning and Acting in Language Mod…
2023 · Ch.1
Reflexion: Language Agents with Verbal Reinforcement Le…
2023 · Ch.1
Tree of Thoughts: Deliberate Problem Solving with Large…
2023 · Ch.1
Toolformer: Language Models Can Teach Themselves to Use…
2023 · Ch.1