How to Use LLM-as-a-Judge for Relevance Scoring

LLM-as-a-judge uses a large language model to evaluate whether each retrieval candidate actually answers the user's query. Instead of relying solely on embedding similarity or metadata scores, you send the query and each candidate to an LLM that assesses relevance, specificity, and answer quality. This catches nuances that mathematical scoring misses, but it adds significant latency and cost.

Before You Start

LLM-as-a-judge is the most expensive reranking approach. Each candidate evaluation consumes LLM tokens (typically 200 to 500 tokens per evaluation), and evaluating 10 candidates per query means 2,000 to 5,000 tokens per retrieval. At typical API pricing, this adds $0.005 to $0.02 per query depending on the model. For high-volume applications, this cost may be prohibitive. Consider cross-encoder reranking or cognitive scoring first, and reserve LLM-as-a-judge for applications where the highest possible accuracy justifies the expense.
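To keep this math concrete, a rough per-query estimate can be derived from the candidate count, the tokens per evaluation, and your model's pricing. The helper below is a sketch; the token counts and the price in the example comment are illustrative placeholders, so substitute your actual usage and current API rates.

def estimate_judge_cost(num_candidates: int,
                        tokens_per_eval: int,
                        price_per_1k_tokens: float) -> float:
    # total input tokens across all candidate evaluations, priced per 1K tokens
    return num_candidates * tokens_per_eval * price_per_1k_tokens / 1000

# Example: 10 candidates at 350 tokens each and a hypothetical $0.003 per 1K tokens
# estimate_judge_cost(10, 350, 0.003) == 0.0105  # about one cent per query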

You also need to accept 500 milliseconds to 2 seconds of added latency per query, even with parallel evaluation. If your application requires sub-100-millisecond retrieval, this approach is not suitable as the primary reranking method, though it can work as an offline evaluation tool.

Step-by-Step Implementation

Step 1: Design the evaluation prompt.
The prompt asks the LLM to score how well a candidate answers a query. Be specific about the scoring scale and criteria. A 1 to 5 scale works well: 1 means completely irrelevant, 3 means somewhat related but does not answer the question, 5 means directly answers the question with accurate information. Include instructions to return only the numeric score to simplify parsing.
JUDGE_PROMPT = """Score how well the following document answers the query. Query: {query} Document: {document} Score on a scale of 1-5: 1 = Completely irrelevant to the query 2 = Same topic but does not answer the question 3 = Partially answers the question 4 = Mostly answers the question with minor gaps 5 = Directly and completely answers the question Return only the numeric score, nothing else."""
Step 2: Choose the judge model.
Larger models produce better judgments but cost more. Claude Haiku or GPT-4o-mini work well as judges for most use cases, providing reasonable quality at lower cost than full-size models. For high-stakes applications where every ranking decision matters, use Claude Sonnet or GPT-4o. Avoid the smallest models (like GPT-3.5-turbo) because their relevance judgments tend to be inconsistent and poorly calibrated.
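One way to make that choice explicit is a small configuration map keyed by how critical the application is. The sketch below reuses the Haiku identifier from the code in Step 3; the tier names and the larger-model identifier are placeholder assumptions, not fixed values.

JUDGE_MODELS = {
    "default": "claude-haiku-4-5-20251001",  # low-cost judge used in the examples below
    "high_stakes": "claude-sonnet-latest",   # placeholder: substitute your larger model's ID
}

def pick_judge_model(high_stakes: bool = False) -> str:
    # use the larger model only when ranking mistakes are costly
    return JUDGE_MODELS["high_stakes" if high_stakes else "default"]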
Step 3: Implement batch evaluation.
Send all candidate evaluations in parallel rather than sequentially. Most LLM APIs support concurrent requests, and evaluating 10 candidates sequentially takes 10 times longer than evaluating them in parallel. Use asyncio, threading, or the API's batch endpoint to parallelize the calls.
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def evaluate_candidate(query: str, document: str) -> float:
    prompt = JUDGE_PROMPT.format(query=query, document=document)
    response = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        # parse the judge's reply and clamp it to the 1-5 scale
        score = float(response.content[0].text.strip())
        return min(max(score, 1.0), 5.0)
    except (ValueError, IndexError):
        return 3.0  # default to neutral on parse failure

async def judge_rerank(query: str, candidates: list, top_k: int = 5) -> list:
    # evaluate all candidates concurrently rather than one at a time
    tasks = [evaluate_candidate(query, c['content']) for c in candidates]
    scores = await asyncio.gather(*tasks)
    for candidate, score in zip(candidates, scores):
        candidate['judge_score'] = score
    # highest judge score first
    candidates.sort(key=lambda x: x['judge_score'], reverse=True)
    return candidates[:top_k]
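Assuming candidates is a list of dicts that each carry a 'content' field, as the function above expects, a synchronous caller can run the reranker like this (the query and documents are made-up examples):

candidates = [
    {"content": "To rotate an API key, open Settings, revoke the old key, and generate a new one."},
    {"content": "Our pricing tiers are described on the billing page."},
]
top_results = asyncio.run(judge_rerank("How do I rotate an API key?", candidates, top_k=1))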
Step 4: Parse and normalize scores.
LLMs do not always return clean numeric responses. Sometimes they add explanations, use decimal values, or return the score in a sentence. Build robust parsing that extracts the first number from the response, clamps it to your scale range, and falls back to a neutral default when parsing fails. Track the parse failure rate; if it exceeds 5 percent, revise your prompt to be more explicit about the expected output format.
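One possible implementation of that parsing step, as a sketch: extract the first number with a regular expression, clamp it to the 1 to 5 range, and fall back to a neutral default. The parse_judge_score helper is illustrative and not part of the code above; you could swap it in for the inline float() call in evaluate_candidate and count how often the fallback fires to track the parse failure rate.

import re

def parse_judge_score(raw: str, default: float = 3.0) -> float:
    # pull the first integer or decimal out of the response text
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return default  # no number found: neutral fallback
    return min(max(float(match.group()), 1.0), 5.0)  # clamp to the 1-5 scale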
Step 5: Add fallback handling.
LLM APIs can time out, rate limit, or return errors. Your reranking function needs graceful fallback behavior: if the LLM evaluation fails for a candidate, fall back to the vector similarity score. If the entire batch fails (API outage), skip reranking and return the vector-sorted results. Never let a reranking failure block the retrieval response entirely.
async def judge_rerank_with_fallback(query: str, candidates: list, top_k: int = 5) -> list:
    try:
        return await asyncio.wait_for(
            judge_rerank(query, candidates, top_k),
            timeout=3.0,  # 3 second timeout for all evaluations
        )
    except Exception:  # covers asyncio.TimeoutError, rate limits, and API errors
        # fall back to vector similarity ordering
        candidates.sort(key=lambda x: x['similarity'], reverse=True)
        return candidates[:top_k]
Step 6: Monitor cost and calibrate.
Track the token usage and dollar cost per query to ensure the LLM-as-a-judge approach stays within budget. Also calibrate the LLM's judgments against human evaluations. Prepare 50 to 100 query-document pairs with human relevance scores and compare them to the LLM's scores. If the LLM consistently over-scores or under-scores certain types of content, adjust your prompt or add calibration rules. A well-calibrated judge should agree with human judgments on at least 80 percent of cases within one point on a 5-point scale.
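A minimal sketch of that calibration check, assuming you have collected pairs of human and LLM scores for the same query-document pairs; the example data in the comments is made up.

def within_one_point_agreement(pairs: list[tuple[float, float]]) -> float:
    # fraction of cases where the LLM score falls within one point of the human score
    agreed = sum(1 for human, llm in pairs if abs(human - llm) <= 1.0)
    return agreed / len(pairs)

# pairs = [(5, 4), (2, 2), (3, 5)]  # (human score, LLM score)
# within_one_point_agreement(pairs) -> 0.67, below the 80 percent target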

When LLM-as-a-Judge Makes Sense

LLM-based reranking works best for low-volume, high-stakes applications. Internal knowledge management systems where a wrong answer could lead to a bad business decision, medical information retrieval where accuracy is critical, and legal research where missing a relevant case could be malpractice are all good candidates. For these applications, the $0.01 to $0.02 per query is negligible compared to the cost of a wrong answer.

For high-volume applications like customer support chatbots or developer assistants, LLM-as-a-judge is usually too expensive and too slow. Cognitive scoring provides a better cost-performance trade-off for these use cases, adding multi-dimensional ranking at under 40 milliseconds and zero incremental API cost per query.

Combining LLM Judgment with Cognitive Scoring

The most powerful approach uses both. Cognitive scoring handles the factors that an LLM cannot see from the text alone: recency, access frequency, entity connections, and corroboration history. LLM judgment handles the factors that cognitive scoring cannot capture: whether the document truly answers the question, whether the information is specific enough, and whether the tone and level of detail match what the user needs. A weighted blend of both scores produces rankings that are both semantically precise and temporally relevant.
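One straightforward way to blend the two signals is a weighted sum of normalized scores. The weights and field names below are illustrative assumptions: the judge score is the 1 to 5 value from the reranker above, and the cognitive score is assumed to already be normalized to the 0 to 1 range.

def blended_score(candidate: dict, judge_weight: float = 0.6) -> float:
    # rescale the 1-5 judge score to 0-1, then blend it with the cognitive score
    judge = (candidate['judge_score'] - 1.0) / 4.0
    cognitive = candidate['cognitive_score']
    return judge_weight * judge + (1.0 - judge_weight) * cognitive

Sorting candidates by blended_score instead of judge_score keeps the rest of the reranking code unchanged.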

Start with cognitive scoring, which runs on every query at no additional cost. Add LLM judgment when you need the highest possible accuracy.
