[CS.AI] Disruptive Method: Subset Selection for Evaluating LLM Reliability

Abstract

LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters, and measuring that alignment itself requires costly human annotations.

In this work, we develop Metric Match, a method for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the reliability metric computed on the subset matches the one computed on the full population, with both computed against acquired synthetic labels.
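The selection step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes Pearson correlation as the reliability metric and uses a plain random search over candidate subsets, and the names `pearson`, `select_matching_subset`, `judge_scores`, and `synthetic_labels` are hypothetical. The idea is only to show what "matching the population metric on synthetic labels" could look like in practice.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant input; treated as 0 in this sketch
    return cov / (var_x * var_y) ** 0.5


def select_matching_subset(judge_scores, synthetic_labels, k, n_trials=2000, seed=0):
    """Pick k indices whose judge-vs-synthetic correlation is closest to the
    full-population value. Random search is an assumption made for this
    illustration, not the method from the paper."""
    rng = random.Random(seed)
    target = pearson(judge_scores, synthetic_labels)  # population metric
    indices = range(len(judge_scores))
    best_subset, best_gap = None, float("inf")
    for _ in range(n_trials):
        candidate = rng.sample(indices, k)
        sub_judge = [judge_scores[i] for i in candidate]
        sub_synth = [synthetic_labels[i] for i in candidate]
        gap = abs(pearson(sub_judge, sub_synth) - target)
        if gap < best_gap:
            best_subset, best_gap = candidate, gap
    return sorted(best_subset), target, best_gap


if __name__ == "__main__":
    # Toy data: synthetic labels are noisy copies of the judge scores.
    data_rng = random.Random(42)
    judge_scores = [data_rng.random() for _ in range(500)]
    synthetic_labels = [s + data_rng.gauss(0, 0.3) for s in judge_scores]
    subset, target, gap = select_matching_subset(judge_scores, synthetic_labels, k=50)
    print(f"population correlation: {target:.3f}, subset gap: {gap:.5f}")
    # Only the samples at `subset` would then be sent to human annotators.
```

In this sketch, human labels are never consulted during selection; only the chosen indices would be annotated, and the judge's reliability would then be estimated from those annotations.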

We empirically show that Metric Match achieves a win rate of 0.838 against random subset selection across four correlation metrics and 15 datasets, decreases average estimation error by 18.7%, and reduces annotation needs by 32.5%. We provide a cost model and highlight a medical case study in which our method saves $1,041.67 in expert annotation costs compared to random selection.
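To make the link between annotation reduction and cost concrete, here is a minimal sketch of a linear cost model. The abstract does not spell out the paper's cost model, so the formula, the function name `annotation_savings`, and all input values below are assumptions chosen for illustration; they do not reproduce the $1,041.67 figure.

```python
def annotation_savings(n_baseline, reduction_fraction, cost_per_label,
                       annotators_per_sample=1):
    """Estimated savings under a simple linear cost model (an assumption):
    every label costs the same, and the method needs a fixed fraction
    fewer annotated samples than the baseline."""
    if not 0.0 <= reduction_fraction <= 1.0:
        raise ValueError("reduction_fraction must be between 0 and 1")
    # Number of samples that no longer need human annotation.
    samples_saved = round(n_baseline * reduction_fraction)
    return samples_saved * cost_per_label * annotators_per_sample


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 baseline samples, $2.00 per label,
    # and the 32.5% reduction reported in the abstract.
    saved = annotation_savings(1000, 0.325, 2.0)
    print(f"estimated savings: ${saved:.2f}")
```

Under this model, savings scale linearly with the per-label price, which is consistent with the abstract singling out expert annotation, such as the medical case study, as the setting where the reduction matters most.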

Furthermore, we shift the task from reliability estimation to reliability classification, that is, deciding whether a given judge is above a deployment threshold, and show that Metric Match again outperforms random selection.

All project code is publicly available, and we additionally provide an installable package for ease of use.

Blogger's Review: The Metric Match method proposed in this paper reduces how much human annotation is needed to validate LLM judges, lowering costs significantly. Its practical value is particularly evident in the medical case study, which demonstrates substantial economic benefits. The innovation and applicability of the methodology provide a fresh perspective on LLM evaluation.
