
How to Build Confidence Scoring for RAG Answers

Confidence scoring tells your RAG system when to answer, when to flag uncertainty, and when to decline entirely. Without it, the system returns every answer with the same apparent certainty, regardless of whether the retrieval found strong evidence or nothing relevant. Building a confidence score from retrieval quality, answer groundedness, and source agreement lets you route high-confidence answers directly to users, medium-confidence answers to human review, and low-confidence answers to a graceful "I do not know" response.

Why Confidence Matters More Than Accuracy

A RAG system that is 80% accurate and always confident is less useful than one that is 80% accurate and knows which 20% it is unsure about. The first system is wrong 20% of the time with no warning. The second system correctly flags its uncertain cases, so users know when to trust the answer and when to verify. In practice, this means the second system's effective accuracy (answers the user can trust) is much higher because the uncertain 20% gets routed to verification rather than presented as fact.

Confidence scoring also enables tiered responses. High confidence (above 0.85): return the answer directly. Medium confidence (0.6 to 0.85): return the answer with a "this may not be complete" caveat and references to source documents. Low confidence (below 0.6): decline to answer and suggest where the user might find the information. This tiered approach dramatically improves user trust because the system's stated confidence aligns with its actual accuracy.
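
In code, this routing policy is just a three-way branch on the composite score. A minimal sketch (route_by_confidence is an illustrative helper; the signals that produce the confidence value are built step by step below):

def route_by_confidence(answer, confidence, sources):
    # Thresholds from the tiers above; calibrate them as described in Step 5
    if confidence > 0.85:
        return {"answer": answer, "tier": "high"}
    if confidence >= 0.6:
        return {"answer": answer, "tier": "medium",
                "caveat": "This answer may not be complete.",
                "sources": sources}
    return {"answer": None, "tier": "low",
            "message": "I could not find a reliable answer. Try the sources below.",
            "sources": sources}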

Step-by-Step Implementation

Step 1: Score retrieval quality.
The first confidence signal comes from the retrieval step itself. If the top retrieved chunks have high similarity scores and are tightly clustered (multiple chunks from the same topic), retrieval is likely good. If the top scores are low and the results span unrelated topics, retrieval probably failed. Capture three metrics: the top similarity score, the score gap between the first and fifth result (a large gap means the top result is much more relevant than the rest), and the topic concentration (how many of the top results share the same source document or topic).
def score_retrieval_quality(results):
    if not results:
        return 0.0
    scores = [r.score for r in results]
    top_score = scores[0]
    # Score gap: high gap means one clear winner
    score_gap = scores[0] - scores[min(4, len(scores) - 1)]
    # Topic concentration: same source = focused retrieval
    sources = [r.metadata.get("source") for r in results[:5]]
    unique_sources = len(set(s for s in sources if s))
    concentration = 1.0 - (unique_sources - 1) / max(4, len(sources))
    # Weighted combination
    quality = (
        top_score * 0.5
        + min(score_gap * 2, 1.0) * 0.25
        + concentration * 0.25
    )
    return min(1.0, quality)
Step 2: Score answer groundedness.
After generation, check what percentage of claims in the answer are supported by the retrieved context. Use an LLM evaluator that takes each claim and the retrieved chunks, and determines whether the claim is directly supported, partially supported, or unsupported. The groundedness score is the ratio of supported claims to total claims.
import json

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

GROUNDEDNESS_PROMPT = """For each claim in the answer, determine if it is SUPPORTED, PARTIAL, or UNSUPPORTED by the context.

Answer: {answer}

Context: {context}

Return JSON: [{"claim": "...", "status": "SUPPORTED|PARTIAL|UNSUPPORTED"}]"""

def score_groundedness(answer, chunks):
    context = "\n\n".join(c.text for c in chunks)
    # Use .replace rather than .format: the prompt's JSON example contains braces
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=2000,
        messages=[{"role": "user",
                   "content": GROUNDEDNESS_PROMPT
                   .replace("{answer}", answer)
                   .replace("{context}", context)}],
    )
    claims = json.loads(response.content[0].text)
    if not claims:
        return 0.5
    supported = sum(1 for c in claims if c["status"] == "SUPPORTED")
    partial = sum(1 for c in claims if c["status"] == "PARTIAL")
    return (supported + partial * 0.5) / len(claims)
Step 3: Score source agreement.
When multiple retrieved chunks address the same question, check whether they agree. Consistent answers across sources increase confidence. Contradictory answers decrease it. This signal is particularly valuable for factual questions where a specific answer is either right or wrong (version numbers, configuration values, dates, prices).
AGREEMENT_PROMPT = """Do these passages agree or disagree on the answer to the question?

Question: {query}

Passages: {passages}

Return JSON: {"agreement": "agree|partial|disagree", "details": "brief explanation"}"""

def score_source_agreement(query, chunks):
    if len(chunks) < 2:
        return 0.5  # Not enough sources to assess agreement
    passages = "\n\n---\n\n".join(
        f"[{i+1}] {c.text}" for i, c in enumerate(chunks[:5]))
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": AGREEMENT_PROMPT
                   .replace("{query}", query)
                   .replace("{passages}", passages)}],
    )
    result = json.loads(response.content[0].text)
    agreement_scores = {"agree": 1.0, "partial": 0.6, "disagree": 0.2}
    return agreement_scores.get(result["agreement"], 0.5)
Step 4: Combine into a composite score.
Weight the three signals into a single confidence score. Retrieval quality is the foundation: if retrieval failed, the other signals are unreliable. Groundedness catches hallucination. Source agreement catches contradictions. Cap the weighted combination at just above the retrieval quality score, so that weak retrieval limits the overall confidence no matter how strong the other signals look.
def composite_confidence(retrieval_quality, groundedness, agreement):
    # Weighted combination
    weighted = (
        retrieval_quality * 0.4
        + groundedness * 0.35
        + agreement * 0.25
    )
    # Cap at retrieval quality: bad retrieval = low confidence regardless
    return min(weighted, retrieval_quality + 0.1)

def rag_with_confidence(query, retriever, threshold_high=0.85, threshold_low=0.6):
    # rerank() and generate_grounded() are assumed pipeline helpers defined elsewhere
    chunks = retriever.search(query, top_k=10)
    reranked = rerank(query, chunks, top_k=5)
    rq = score_retrieval_quality(reranked)
    if rq < 0.3:
        return {"answer": None, "confidence": rq,
                "message": "Could not find relevant information."}
    answer = generate_grounded(query, reranked)
    gd = score_groundedness(answer, reranked)
    ag = score_source_agreement(query, reranked)
    confidence = composite_confidence(rq, gd, ag)
    if confidence >= threshold_high:
        return {"answer": answer, "confidence": confidence, "tier": "high"}
    elif confidence >= threshold_low:
        return {"answer": answer, "confidence": confidence, "tier": "medium",
                "caveat": "This answer may be incomplete."}
    else:
        return {"answer": None, "confidence": confidence, "tier": "low",
                "message": "Insufficient evidence to answer reliably."}
Step 5: Calibrate thresholds on real queries.
The default thresholds (0.85 high, 0.6 low) are starting points. Calibrate them on a labeled query set where you know the correct answers. Run your pipeline on 200 to 500 queries, compute the confidence score for each, and plot accuracy at each confidence level. The high threshold should be set where accuracy exceeds 95%. The low threshold should be set where accuracy drops below 70%. These values vary by application, so calibration on your specific data is essential.
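
A minimal calibration sketch, assuming you have already run the pipeline on the labeled set and collected each query's confidence score alongside a 0/1 correctness label (numpy is used for the binning; the bucket count and print format are illustrative):

import numpy as np

def accuracy_by_confidence_bin(confidences, correct, bins=10):
    """Bucket labeled results by confidence and report accuracy per bucket."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    # Bin index per query; a score of exactly 1.0 lands in the top bucket
    idx = np.minimum((conf * bins).astype(int), bins - 1)
    for b in range(bins):
        mask = idx == b
        if mask.any():
            lo, hi = b / bins, (b + 1) / bins
            print(f"{lo:.2f}-{hi:.2f}  n={mask.sum():4d}  "
                  f"accuracy={corr[mask].mean():.2f}")

Reading the resulting table, set threshold_high at the lowest bucket edge where accuracy exceeds 95% and stays there, and threshold_low at the highest bucket edge where accuracy has fallen below 70%.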

Cost Optimization

Full confidence scoring adds 2 to 3 LLM calls per query (the groundedness check, the agreement check, and optionally a separate claim-extraction step). To manage cost, only run the full pipeline when retrieval quality is ambiguous (0.4 to 0.8). Very high retrieval quality (above 0.8) is likely correct and can skip the detailed checks. Very low retrieval quality (below 0.4) is likely wrong and can be declined immediately. This selective approach cuts average confidence-scoring cost by 50 to 70%.
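
A sketch of that gating wired into the Step 4 pipeline (the 0.4 and 0.8 cutoffs are the starting points named above and should be calibrated the same way as the tier thresholds; rerank() and generate_grounded() are the same assumed helpers as in Step 4):

def rag_with_selective_scoring(query, retriever,
                               skip_above=0.8, decline_below=0.4):
    chunks = retriever.search(query, top_k=10)
    reranked = rerank(query, chunks, top_k=5)
    rq = score_retrieval_quality(reranked)
    if rq < decline_below:
        # Likely wrong: decline without spending any LLM calls on checks
        return {"answer": None, "confidence": rq,
                "message": "Could not find relevant information."}
    answer = generate_grounded(query, reranked)
    if rq > skip_above:
        # Likely correct: trust retrieval quality, skip the detailed checks
        return {"answer": answer, "confidence": rq, "tier": "high"}
    # Ambiguous zone: run the full groundedness and agreement checks
    gd = score_groundedness(answer, reranked)
    ag = score_source_agreement(query, reranked)
    confidence = composite_confidence(rq, gd, ag)
    tier = ("high" if confidence >= 0.85
            else "medium" if confidence >= 0.6 else "low")
    return {"answer": answer if tier != "low" else None,
            "confidence": confidence, "tier": tier}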

Adaptive Recall builds confidence scoring into the memory system itself. Every memory carries a confidence score that reflects corroboration, contradiction status, and historical retrieval accuracy. When memories are recalled, their confidence scores feed directly into the retrieval ranking. The LLM receives memories that have already been assessed for reliability, which means the generation step starts from a higher-quality context and needs less post-hoc verification.
