LLM-as-Judge Has a Reliability Problem in Production

The headline number everyone quotes for LLM-as-Judge is 80%: GPT-4 agrees with human evaluators roughly 80% of the time, the same rate at which human annotators agree with each other. That figure comes from Lianmin Zheng and colleagues’ 2023 MT-Bench study, built on about 3,000 expert votes, and it made …