How to Build LLM-as-a-Judge Evaluations
LLM-as-a-judge works because a capable model, given a clear rubric and the relevant context, agrees with human raters often enough to be useful, at a fraction of the cost and time. It fails when the rubric is vague, when the judge's biases go uncorrected, or when nobody ever checked whether it agrees with people. The steps below produce a judge you can defend.
A judge is only as good as the rubric it applies. Vague instructions like "rate the helpfulness from 1 to 5" produce inconsistent, uninterpretable scores. Write explicit criteria for each score level or each dimension. For a faithfulness judge: "Score 1 if any claim in the answer is not supported by the provided context; score 0 only if every claim is directly supported." Note the polarity: under this rubric a 1 flags a failure, so say so wherever the score is reported. For a helpfulness rubric, define what distinguishes a 5 from a 3 in concrete terms: completeness, correctness, and directness of the response. The discipline of writing the rubric often surfaces that your team did not actually agree on what good means, which is valuable on its own.
There are two modes. Direct scoring asks the judge to assign an absolute score to a single output against the rubric, which you need when tracking a metric over time or gating against a threshold. Pairwise comparison asks the judge which of two outputs is better, and it is usually more reliable than direct scoring because relative judgments tend to be easier and more consistent than absolute ones. Use pairwise comparison when choosing between two prompts or models, and use direct scoring when you need a number to trend. Many teams use pairwise for development decisions and a calibrated direct score for production monitoring.
The judge prompt assembles the rubric, the original input, the output being judged, and the relevant context (for faithfulness, the retrieved documents). Require the judge to reason before deciding: ask it to identify each claim and check it, or to note the strengths and weaknesses of each candidate, and only then emit a verdict. This chain-of-reasoning step measurably improves agreement with humans and gives you an explanation you can inspect when a score looks wrong. Require a structured output so the verdict is machine-readable.
You are evaluating whether an ANSWER is faithful to the CONTEXT.
Rules:
- A claim is "supported" only if it follows directly from the CONTEXT.
- Score 0 if every claim is supported, 1 if any claim is unsupported.
CONTEXT:
{context}
ANSWER:
{answer}
Respond as JSON:
{"claims": [{"claim": "...", "supported": true|false}],
"verdict": 0|1,
"reason": "one sentence"}
Judges have documented biases you must counter. Position bias: in pairwise comparison, judges tend to favor an answer because of where it appears, often the one presented first, so evaluate each pair in both orders and only count a win if the judge prefers the same answer regardless of position. Length bias: judges tend to prefer longer answers, so include a rubric instruction that length is not a virtue and verbosity without added substance is a fault. Self-preference bias: a judge may favor outputs that resemble its own style, so where possible use a different model family for the judge than for the system under test. Constrain the output format tightly so the judge spends its capacity on the judgment, not on prose.
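The both-orders check for position bias can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Judge` callable, its "first"/"second" return convention, and the two toy judges are all hypothetical stand-ins for a real model call.

```python
from typing import Callable, Optional

# Hypothetical judge interface: receives the prompt and two answers in display
# order and returns "first" or "second". A real judge would call a model here.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> Optional[str]:
    """Return "a" or "b" only if the judge picks the same answer in both orders."""
    forward = judge(prompt, a, b)   # a is shown first
    backward = judge(prompt, b, a)  # b is shown first
    pick_forward = "a" if forward == "first" else "b"
    pick_backward = "b" if backward == "first" else "a"
    # Disagreement across orders means position drove the choice: no win counted.
    return pick_forward if pick_forward == pick_backward else None


# Toy judges, for illustration only.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def content_judge(prompt: str, first: str, second: str) -> str:
    return "first" if "Paris" in first else "second"  # decides on content


print(debiased_preference(always_first, "Capital of France?", "Paris", "Lyon"))   # None
print(debiased_preference(content_judge, "Capital of France?", "Paris", "Lyon"))  # a
```

Each pair costs two judge calls this way, and pairs that return None can be reported as ties rather than silently dropped.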
This is the step teams skip and the step that makes a judge trustworthy. Take a sample of 50 to 100 outputs at minimum, have humans label them against the same rubric, and measure how well the judge agrees with the humans using the raw agreement rate or a chance-corrected agreement statistic like Cohen's kappa. If agreement is high, the judge is a valid stand-in for human evaluation on this task. If it is low, the rubric is ambiguous or the task is too subtle for the judge, and you fix the rubric or fall back to human review. A judge that has never been compared to humans provides false confidence, not measurement.
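As a rough sketch of the agreement check, the snippet below computes the raw agreement rate and Cohen's kappa in plain Python. The function names and the ten toy labels are made up for illustration; a real check would use the 50 to 100 human-labeled outputs described above.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Fraction of items on which the judge and the human gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters only ever use one label
    return (p_o - p_e) / (1 - p_e)


# Toy labels, illustrative only (0 = faithful, 1 = unsupported claim found).
human_labels = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
judge_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

print(agreement_rate(human_labels, judge_labels))          # 0.8
print(round(cohens_kappa(human_labels, judge_labels), 3))  # 0.583
```

The gap between 0.8 and 0.583 is the reason to report kappa alongside raw agreement: when one label dominates, two raters agree often by chance alone.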
Judge reliability is not permanent. The judge model can change when the provider updates it, the task distribution can drift, and a rubric that worked on early data can become ambiguous on new cases. Re-run the human-agreement check periodically, and always re-validate after changing the judge model or model version. Treat the judge as a measurement instrument that needs recalibration, not a fixed oracle.
A trustworthy LLM judge needs an explicit rubric, structured reasoning before its verdict, active control for position and length bias, and validation against human labels. Without the human-agreement check, the judge is an unmeasured opinion.
When a Judge Is the Wrong Tool
LLM-as-a-judge is powerful but not universal, and knowing when not to use it saves money and avoids false confidence. When a deterministic check can answer the question, use it: validating JSON, checking for a required citation, confirming a number falls in range, or matching a regex are all faster, cheaper, and more reliable as assertions than as judge calls. When a definitive ground truth exists, reference-based scoring beats a judge, because comparing against a known answer is more objective than asking a model for an opinion. Reserve the judge for the genuinely open-ended quality dimensions (helpfulness, faithfulness, tone), where no assertion or reference can capture the judgment, and even then back it with the human-agreement check.
The judge is also the wrong tool when the task requires domain expertise the judge model does not have. A judge scoring medical, legal, or specialized technical correctness can be confidently wrong in ways that are hard to detect without an expert, so for these high-stakes domains the judge should be treated as a first-pass filter that routes uncertain cases to human experts rather than as the final word. The general rule is to use the cheapest evaluator that can reliably answer the question: assertions first, references where they exist, judges for open-ended quality, and humans for the cases the judge cannot be trusted on. This layering keeps cost down and accuracy up, and it prevents the common failure of running expensive judge calls on questions a one-line assertion could have settled.
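The "cheapest evaluator first" rule might look like the sketch below, where deterministic assertions run before any judge call. Everything specific here is an assumption for illustration: the output is taken to be a JSON object with an `answer` field, citations are assumed to look like `[1]`, and `judge` is a hypothetical callable that returns True when the output is acceptable.

```python
import json
import re
from typing import Callable, Optional

CITATION = re.compile(r"\[\d+\]")  # assumed citation style, e.g. "[1]"


def cheap_checks(output: str) -> Optional[str]:
    """Return a failure reason, or None if every deterministic check passes."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return "missing 'answer' string"
    if not CITATION.search(payload["answer"]):
        return "no citation"
    return None


def evaluate(output: str, judge: Callable[[str], bool]) -> dict:
    """Run assertions first; spend a judge call only when they all pass."""
    reason = cheap_checks(output)
    if reason is not None:
        return {"passed": False, "stage": "assertion", "reason": reason}
    return {"passed": judge(output), "stage": "judge", "reason": "judge verdict"}


# Stub judge for illustration; a real one would call a model.
def stub_judge(output: str) -> bool:
    return True


print(evaluate("not json", stub_judge)["reason"])                 # invalid JSON
print(evaluate('{"answer": "Paris [1]"}', stub_judge)["stage"])   # judge
```

Recording the stage that produced each result also makes it easy to see how many judge calls the assertions saved.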
Where Judges Fit in the Evaluation Stack
LLM-as-a-judge sits in the middle of the evaluation pyramid. Below it, cheap deterministic assertions catch format and policy bugs on every output. The judge handles the open-ended quality dimensions that assertions cannot. Above it, human review handles the cases where the judge reports low confidence or disagrees with itself across runs, and provides the labels that keep the judge calibrated. This layering gets you most of the rigor of human evaluation at a fraction of the cost, which is the entire reason the technique matters. Note that judging for evaluation is distinct from using a model to rank retrieval results at query time, covered in LLM-as-a-judge for retrieval ranking, though the bias-control lessons carry over.
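One way to route on self-disagreement is to run the judge several times per output and escalate whenever the runs do not agree. The sketch below assumes the verdicts have already been collected; the `route` function and its `min_agreement` threshold are hypothetical, and the right threshold depends on the task.

```python
from collections import Counter
from typing import Sequence


def route(verdicts: Sequence[int], min_agreement: float = 1.0) -> dict:
    """Accept the majority verdict only if enough of the runs agree on it."""
    if not verdicts:
        raise ValueError("need at least one judge verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    agreement = count / len(verdicts)
    if agreement >= min_agreement:
        return {"decision": "accept_judge", "verdict": label, "agreement": agreement}
    # The judge disagrees with itself: send the case to human review.
    return {"decision": "human_review", "verdict": None, "agreement": agreement}


print(route([0, 0, 0])["decision"])  # accept_judge
print(route([0, 1, 0])["decision"])  # human_review
```

Cases sent to human review this way can double as fresh labels for the next human-agreement check.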