How to Detect Hallucinations in LLM Output
Before You Start
Hallucination detection is a post-generation step that runs after the LLM produces its response but before the response reaches the user. You need the generated response, the source context that was provided to the LLM (retrieved documents, memories, or knowledge base entries), and optionally access to a natural language inference model for entailment checking. The detection pipeline adds latency, so you will need to decide whether to run it on every response or selectively on high-risk queries.
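In code terms, the inputs the pipeline needs can be collected into a small structure like the one below; the class and field names are illustrative, not a required schema.

from dataclasses import dataclass

@dataclass
class DetectionInput:
    response: str        # the LLM's generated answer to check
    query: str           # the user's original question
    source_docs: list    # retrieved documents, memories, or knowledge base entries
    use_nli: bool = True  # whether to run the optional entailment check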
The fundamental challenge is that hallucinations are designed to sound correct. The model does not flag its own fabrications, and the text itself reads identically whether it is grounded in fact or entirely made up. Detection requires external verification against a source of truth, which means you need a source of truth to verify against. In RAG systems, the retrieved documents serve this role. In memory-grounded systems, the retrieved memories serve this role. Without any reference material, automated detection is limited to statistical techniques like self-consistency, which are useful but less reliable than source-based verification.
Step-by-Step Detection Pipeline
The first step is to decompose the LLM's response into discrete claims that can be verified independently. A single paragraph might contain five or six distinct factual assertions mixed with opinions, synthesis, and filler text. You need to isolate the factual claims because those are what can be verified. Use an LLM to extract claims, with a prompt like: "List every factual assertion in this text as a separate, self-contained statement. Exclude opinions, recommendations, and hedged statements." The output is a list of atomic claims, each of which can be checked against the source material.
# Claim extraction prompt
EXTRACT_PROMPT = """List every factual claim in the following
text. Each claim should be a single, self-contained assertion
that could be true or false. Exclude opinions, questions,
and recommendations.
Text: {response}
Claims (one per line):"""
claims = llm.generate(EXTRACT_PROMPT.format(response=response))
claim_list = [c.strip() for c in claims.strip().split("\n") if c.strip()]  # drop blank lines

Generate the response multiple times (3 to 5 generations) using different temperature values or random seeds. Extract claims from each generation and compare them. Claims that appear consistently across all generations are more likely to be grounded in real knowledge. Claims that change between generations, where a name differs, a number shifts, or a date moves, are likely hallucinations that the model is not confident about. Self-consistency is particularly effective for detecting fabricated numbers, dates, and proper nouns, which are the most common hallucination types. The downside is that it requires multiple LLM calls, increasing latency and cost by 3x to 5x for checked responses.
def check_self_consistency(query, context, n_samples=3):
    responses = []
    for i in range(n_samples):
        resp = llm.generate(
            system=context,
            user=query,
            temperature=0.7,
            seed=i * 1000
        )
        responses.append(extract_claims(resp))
    stable_claims = set.intersection(*[set(r) for r in responses])
    unstable_claims = set.union(*[set(r) for r in responses]) - stable_claims
    return stable_claims, unstable_claims

For each extracted claim, search the source documents or retrieved memories for supporting evidence. This can be as simple as checking whether key terms from the claim appear in the source material, or as sophisticated as computing semantic similarity between the claim and every passage in the source context. Claims with high similarity to a source passage are likely grounded. Claims with no matching passage in any source are potential hallucinations. This step catches extrinsic hallucinations, where the model adds information that was not in the provided context.
def verify_against_sources(claim, source_docs, threshold=0.75):
    claim_embedding = embed(claim)
    best_match = 0.0
    best_source = None
    for doc in source_docs:
        for passage in doc.passages:
            similarity = cosine_sim(claim_embedding, embed(passage))
            if similarity > best_match:
                best_match = similarity
                best_source = passage
    return {
        "claim": claim,
        "supported": best_match >= threshold,
        "similarity": best_match,
        "source": best_source
    }

Semantic similarity catches obvious mismatches but misses subtle hallucinations where the claim uses words that appear in the source but makes a different assertion. Natural language inference (NLI) models are trained specifically to determine whether a hypothesis (the claim) is entailed by, contradicted by, or neutral relative to a premise (the source passage). Run each claim through an NLI model with the best-matching source passage as the premise. Claims classified as "entailed" are well-grounded. Claims classified as "contradicted" are definitely hallucinations. Claims classified as "neutral" are the ambiguous middle ground that may require human review.
from transformers import pipeline

nli = pipeline("text-classification",
               model="cross-encoder/nli-deberta-v3-large")

def check_entailment(claim, source_passage):
    # Pass the premise and hypothesis as a text/text_pair dict so the
    # tokenizer encodes them as a proper sentence pair
    result = nli([{"text": source_passage, "text_pair": claim}])
    # Top label is one of: entailment, contradiction, neutral
    return result[0]["label"], result[0]["score"]

Each detection technique produces a different signal. Self-consistency gives you a stability score (how many times the claim appeared identically across generations). Source matching gives you a similarity score. Entailment gives you a classification with confidence. Combine these into a single reliability score per claim using a weighted formula. Claims that are self-consistent, source-matched, and entailed score high. Claims that are unstable, unmatched, or contradicted score low. Set a threshold below which claims are flagged for review, softened with hedging language, or removed from the response entirely.
def score_claim(claim, consistency, source_match, entailment):
    # consistency: {"stable": set, "unstable": set} from the self-consistency check
    # source_match: the dict returned by verify_against_sources
    # entailment: {"label": str, "score": float} from the NLI check
    weights = {
        "consistency": 0.3,
        "source_match": 0.4,
        "entailment": 0.3
    }
    consistency_score = 1.0 if claim in consistency["stable"] else 0.2
    source_score = source_match["similarity"]
    entailment_score = (
        1.0 if entailment["label"] == "entailment"
        else 0.0 if entailment["label"] == "contradiction"
        else 0.4
    )
    final = (
        weights["consistency"] * consistency_score +
        weights["source_match"] * source_score +
        weights["entailment"] * entailment_score
    )
    return final
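These pieces can be wired together into a single pass over a response. The sketch below is illustrative rather than a fixed implementation: it assumes the helper functions shown above, the same hypothetical extract_claims helper and llm client, and an arbitrary 0.5 flagging threshold.

def detect_hallucinations(query, context, response, source_docs,
                          flag_threshold=0.5):
    # Self-consistency is checked once per response, not once per claim
    stable, unstable = check_self_consistency(query, context)
    consistency = {"stable": stable, "unstable": unstable}

    flagged = []
    for claim in extract_claims(response):
        source_match = verify_against_sources(claim, source_docs)
        # If no source passage matched, run NLI against an empty premise
        label, score = check_entailment(claim, source_match["source"] or "")
        entailment = {"label": label, "score": score}

        reliability = score_claim(claim, consistency, source_match, entailment)
        if reliability < flag_threshold:
            flagged.append({"claim": claim, "score": reliability})
    return flagged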
What to Do with Detected Hallucinations

Once you have identified potentially hallucinated claims, you have several options depending on your application's risk tolerance. The most aggressive approach removes flagged claims from the response entirely, presenting only the well-grounded portions. This is appropriate for high-stakes applications (medical, legal, financial) where a fabricated claim is worse than a gap in the response. The moderate approach replaces flagged claims with hedged versions: "The project may use PostgreSQL" instead of "The project uses PostgreSQL." This preserves the information while signaling uncertainty. The lightest approach adds visual indicators (footnotes, confidence badges, or tooltip warnings) to flagged claims without changing the text, letting the user decide how much to trust each statement.
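As a rough sketch of how those three policies can hang off the per-claim scores: the policy names, the 0.5 threshold, the "[unverified]" marker, and the extra llm.generate call for hedged rewrites are all assumptions, and the string replacement only works when extracted claims are verbatim spans of the response.

def apply_policy(response, scored_claims, policy="hedge", threshold=0.5):
    for item in scored_claims:
        if item["score"] >= threshold:
            continue  # well-grounded claim, leave it alone
        if policy == "remove":
            # Strict: drop the flagged claim entirely
            response = response.replace(item["claim"], "")
        elif policy == "hedge":
            # Moderate: rewrite the claim with uncertainty language
            hedged = llm.generate(
                user=f"Rewrite this as an uncertain statement: {item['claim']}"
            )
            response = response.replace(item["claim"], hedged)
        else:
            # Light: annotate the claim and let the reader judge
            response = response.replace(
                item["claim"], f"{item['claim']} [unverified]"
            )
    return response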
For systems with persistent memory, detected hallucinations should also trigger a feedback loop. When a claim is flagged as unsupported, the system can store a negative observation: "The model claimed X, but this was not supported by available context." This observation helps the system avoid the same fabrication in future interactions, because the memory of the failed claim becomes part of the grounding context. Over time, this creates a "known unknowns" layer where the system explicitly remembers topics where it has previously hallucinated and applies extra caution.
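A minimal sketch of that feedback loop, assuming a hypothetical memory object with a store method; the observation text and metadata fields are illustrative.

def record_unsupported_claim(memory, claim, score):
    # Store a negative observation so future retrievals surface the caution
    memory.store(
        text=f"Previously claimed without support: {claim}",
        metadata={
            "type": "unsupported_claim",
            "reliability_score": score,
        },
    )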
Performance Considerations
The full detection pipeline (claim extraction, self-consistency, source matching, entailment) adds significant latency, typically 2 to 5 seconds on top of the initial generation time. For real-time applications, you will want to run detection selectively rather than on every response. Good triggers for full detection include: queries about specific facts (dates, names, numbers), queries where the retrieval step returned few or low-similarity results (suggesting the model may lack grounding material), and queries in high-risk domains. For low-risk queries (creative tasks, open-ended brainstorming), lightweight detection or no detection at all is a reasonable trade-off.
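A small trigger function can gate the expensive checks. In the sketch below, the keyword pattern, the 0.6 similarity cutoff, and the assumption that retrieval results carry a similarity attribute are all illustrative defaults, not fixed rules.

import re

HIGH_RISK_DOMAINS = ("medical", "legal", "financial")

def should_run_full_detection(query, retrieval_results, domain=None):
    # Queries about specific facts (dates, names, numbers) are the riskiest
    asks_for_facts = bool(re.search(r"\b(when|how many|who|what year|which)\b",
                                    query.lower()))
    # Few or low-similarity retrieval hits suggest the model lacks grounding
    weak_grounding = (
        len(retrieval_results) < 2
        or max((r.similarity for r in retrieval_results), default=0.0) < 0.6
    )
    high_risk = domain in HIGH_RISK_DOMAINS
    return asks_for_facts or weak_grounding or high_risk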
Build AI that checks its own work. Adaptive Recall provides confidence-scored memories and source attribution that power reliable hallucination detection at every layer.
Get Started Free