How to Build a Confidence Scoring Pipeline
Why Confidence Matters for Retrieval
Without confidence scoring, every memory has equal authority. A user mentioning "I think we use PostgreSQL 14" in passing gets the same retrieval weight as a deployment runbook that documents PostgreSQL 15 with connection details, migration history, and version-specific configuration. When both memories match a query about database configuration, the system has no way to prefer the authoritative source over the casual remark.
Confidence scoring solves this by tracking evidence. The deployment runbook gets corroborated every time someone references it, accesses the database details, or stores a new memory that aligns with its claims. The casual remark either gets corroborated (someone else confirms PostgreSQL 14) or contradicted (the runbook says 15). Over time, the confidence scores diverge, and retrieval naturally surfaces the well-established answer.
Step-by-Step Implementation
Use a 0-to-10 scale with a default starting value of 5.0 for newly stored memories. Define threshold values that trigger specific behaviors: memories above 8.0 are protected from decay (they represent well-established knowledge), memories below 2.0 are candidates for archival or deletion (they have been contradicted or never corroborated), and memories at the default 5.0 are unverified observations that have not yet accumulated evidence in either direction.
CONFIDENCE_DEFAULT = 5.0
CONFIDENCE_MIN = 0.0
CONFIDENCE_MAX = 10.0
CONFIDENCE_PROTECTED = 8.0 # resist decay above this
CONFIDENCE_ARCHIVE = 2.0 # candidate for removal below this
CORROBORATION_BOOST = 0.5 # per corroborating source
CONTRADICTION_PENALTY = 1.5 # per contradicting source
MIN_CORROBORATIONS = 3 # required for high confidence

When a new memory is stored, compare it against existing memories that share entities or topic overlap. Use semantic similarity to find memories making similar claims. If the new memory supports an existing claim (similarity above a threshold and no contradictory signals), increment the corroboration count on the existing memory and boost its confidence score.
def detect_corroboration(new_memory, existing_memories, threshold=0.85):
    corroborated = []
    new_emb = new_memory['embedding']
    new_entities = set(new_memory['entities'])
    for mem in existing_memories:
        # must share at least one entity
        shared = new_entities.intersection(set(mem['entities']))
        if not shared:
            continue
        sim = cosine_similarity(new_emb, mem['embedding'])
        if sim >= threshold:
            corroborated.append(mem['id'])
    return corroborated

The entity overlap check prevents false corroboration between unrelated memories that happen to use similar vocabulary. Two memories must be about the same entities and make similar claims to count as corroborating each other.
Contradictions are harder to detect than corroboration because they require understanding that two statements conflict rather than simply differing. A practical approach uses entity overlap plus semantic analysis: if two memories share entities but make claims that an LLM judges as contradictory, flag them. For systems that cannot afford LLM calls on every store operation, use heuristics like detecting negation words, different numerical values for the same metric, or different version numbers for the same software.
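For that cheap-heuristic path, here is a minimal sketch that flags mismatched numbers or version strings and asymmetric negation between two memory texts. The function name, the regex, and the assumption that memories carry raw text are illustrative, not a prescribed implementation; the entity-plus-similarity detector that follows catches conflicts these string checks miss.

```python
import re

def heuristic_conflict(text_a, text_b):
    # Extract version-like numeric tokens (e.g. "14", "15.2"); if both
    # texts mention numbers but share none, they may state different
    # values for the same fact.
    nums_a = set(re.findall(r'\b\d+(?:\.\d+)*\b', text_a))
    nums_b = set(re.findall(r'\b\d+(?:\.\d+)*\b', text_b))
    if nums_a and nums_b and nums_a.isdisjoint(nums_b):
        return True
    # Negation asymmetry: one text negates a claim, the other does not.
    negations = {'not', 'no', 'never', "don't", "doesn't", 'without'}
    neg_a = any(w in text_a.lower().split() for w in negations)
    neg_b = any(w in text_b.lower().split() for w in negations)
    return neg_a != neg_b
```

These checks produce false positives (two unrelated numbers trip the version test), so treat a heuristic hit as a flag for review rather than an automatic penalty.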
def detect_contradictions(new_memory, existing_memories,
                          entity_overlap_min=2):
    candidates = []
    new_entities = set(new_memory['entities'])
    for mem in existing_memories:
        shared = new_entities.intersection(set(mem['entities']))
        if len(shared) < entity_overlap_min:
            continue
        # high entity overlap but moderate text similarity
        # suggests same topic, different claims
        sim = cosine_similarity(new_memory['embedding'], mem['embedding'])
        if 0.4 <= sim <= 0.75:
            candidates.append({
                'memory_id': mem['id'],
                'shared_entities': list(shared),
                'similarity': sim
            })
    return candidates

Apply confidence adjustments using bounded arithmetic. Each corroborating source adds a fixed boost (typically 0.5 points), and each contradiction applies a penalty (typically 1.5 points, larger than the boost because false information is more damaging than missing information). Clamp the result to the 0-10 range.
def update_confidence(memory, corroborations=0, contradictions=0):
    current = memory.get('confidence', CONFIDENCE_DEFAULT)
    adjustment = (corroborations * CORROBORATION_BOOST -
                  contradictions * CONTRADICTION_PENALTY)
    new_confidence = current + adjustment
    new_confidence = max(CONFIDENCE_MIN, min(CONFIDENCE_MAX, new_confidence))
    memory['confidence'] = new_confidence
    memory['corroboration_count'] = memory.get('corroboration_count', 0) + corroborations
    memory['contradiction_count'] = memory.get('contradiction_count', 0) + contradictions
    return new_confidence

Do not allow a memory to reach protected status (above 8.0 confidence) until it has been corroborated by at least three independent sources. This evidence gate prevents a single repeated observation from being treated as established fact. The corroboration count must come from distinct memory storage events, not from the same source restating the same claim.
def apply_evidence_gate(memory):
    if memory['confidence'] > CONFIDENCE_PROTECTED:
        if memory.get('corroboration_count', 0) < MIN_CORROBORATIONS:
            memory['confidence'] = CONFIDENCE_PROTECTED
    return memory['confidence']

Use the confidence score as a multiplier on the combined retrieval score. Normalize confidence to a range that does not completely suppress low-confidence memories (they might still be the only relevant result) but meaningfully favors high-confidence ones. A linear mapping from confidence 0-10 to a multiplier of 0.5-1.0 works well: even a zero-confidence memory retains half its retrieval score, while a fully corroborated memory gets the full score.
Running Confidence Updates
Confidence scoring can run synchronously (on every store operation) or asynchronously (in periodic consolidation batches). Synchronous updates detect corroboration and contradictions immediately but add latency to store operations. Asynchronous updates batch the analysis into periodic runs, which keeps store operations fast but delays confidence adjustments.
Adaptive Recall uses a hybrid approach. Basic corroboration detection (entity overlap plus similarity threshold) runs synchronously on store, adding minimal latency. Deep contradiction analysis and cross-reference validation run asynchronously through the reflect tool, which performs comprehensive consolidation on a configurable schedule.
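The hybrid pattern can be sketched as a thin scheduler: a cheap synchronous check on store, with expensive analysis deferred to a queue that a periodic batch job drains. The queue, function names, and injected callbacks here are assumptions for illustration, not Adaptive Recall's actual API.

```python
from collections import deque

pending_deep_analysis = deque()  # memory ids awaiting the batch pass

def on_store(memory, store, detect_corroboration, update_confidence):
    # Synchronous path: cheap corroboration check on every store.
    hits = detect_corroboration(memory, list(store.values()))
    for mem_id in hits:
        update_confidence(store[mem_id], corroborations=1)
    # Defer expensive contradiction analysis to the batch pass.
    pending_deep_analysis.append(memory['id'])
    store[memory['id']] = memory

def consolidation_pass(store, deep_analyze):
    # Asynchronous path: drain the queue on a schedule.
    while pending_deep_analysis:
        mem_id = pending_deep_analysis.popleft()
        if mem_id in store:
            deep_analyze(store[mem_id], store)
```

The design keeps store latency bounded by the cost of one similarity scan, while contradiction detection and cross-reference validation can take as long as they need off the hot path.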
Adaptive Recall runs evidence-gated confidence scoring automatically. Every memory is corroborated, validated, and scored without manual intervention.