Why Self-Improvement Needs Verifiable Outcomes
The Proxy Metric Trap
Every metric is a proxy for something harder to measure. Click-through rate is a proxy for relevance. User satisfaction ratings are a proxy for value delivered. Time spent on page is a proxy for engagement. These proxies are useful approximations, but when a system optimizes directly for them, the approximation breaks down.
Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. A self-improving retrieval system that optimizes for click-through rate will learn to surface attention-grabbing results rather than genuinely useful ones. A system that optimizes for user satisfaction ratings will learn to produce confidently worded responses that sound authoritative, regardless of accuracy, because users tend to rate confident responses higher than hedged ones. A system that optimizes for engagement time will learn to provide partial answers that keep users in the conversation longer rather than complete answers that resolve their need quickly.
This is not a theoretical concern. Recommendation systems across the internet have been optimizing for engagement proxies for years, producing the well-documented effects of filter bubbles, outrage amplification, and content quality degradation. The same dynamic applies to any self-improving system: if the improvement signal rewards behavior that correlates with but is not identical to actual quality, the system will eventually find and exploit the gap between the proxy and the real target.
What Makes an Outcome Verifiable
A verifiable outcome has three properties. First, it is observable: you can determine the outcome through measurement or inspection rather than inference. Second, it is objective: two independent evaluators would agree on whether the outcome occurred. Third, it is attributable: you can trace the outcome back to a specific system action (a retrieval, a suggestion, a stored memory) rather than just to the system in general.
For memory and retrieval systems, verifiable outcomes include factual accuracy (does the retrieved information match authoritative sources), task completion (was the user's goal achieved), and prediction correctness (did the system's claim about the future turn out to be true). These are harder to measure than engagement proxies, but they directly reflect whether the system is actually being useful.
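To make the three properties concrete, here is a minimal sketch of an outcome record. The type and field names are illustrative assumptions, not part of any particular system's API:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class OutcomeKind(Enum):
    FACTUAL_ACCURACY = "factual_accuracy"    # memory matches an authoritative source
    TASK_COMPLETION = "task_completion"      # the user's goal was achieved
    PREDICTION = "prediction"                # a claim about the future came true

@dataclass(frozen=True)
class VerifiedOutcome:
    kind: OutcomeKind
    passed: bool              # observable: determined by measurement, not inference
    evidence_ref: str         # objective: independent evaluators can re-check this source
    memory_id: str            # attributable: the specific stored memory involved
    action_id: str            # attributable: the specific retrieval or suggestion
    observed_at: datetime
```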
Factual accuracy is the most straightforward to verify. If the system retrieves a memory that says "the API supports batch requests up to 100 items," you can check the actual API documentation. If the memory is correct, the retrieval contributed to a verifiable good outcome. If the memory is wrong, the retrieval contributed to a verifiable bad outcome. The verification can often be automated by maintaining a set of authoritative sources and periodically checking stored memories against them.
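A minimal sketch of that automation, assuming a hypothetical `fetch_source` loader for authoritative documents and a pluggable `claim_matches` comparison (an exact-match check, a regex, or an LLM judge, depending on the domain):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Memory:
    memory_id: str
    claim: str        # e.g. "the API supports batch requests up to 100 items"
    source_key: str   # which authoritative source can confirm or refute the claim

def verify_against_sources(
    memories: list[Memory],
    fetch_source: Callable[[str], str],         # hypothetical: loads authoritative docs
    claim_matches: Callable[[str, str], bool],  # hypothetical: compares claim to source
) -> dict[str, bool]:
    """Check each stored memory against its authoritative source on a schedule."""
    results: dict[str, bool] = {}
    for memory in memories:
        source_text = fetch_source(memory.source_key)
        results[memory.memory_id] = claim_matches(memory.claim, source_text)
    return results
```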
Task completion is harder to verify but more meaningful as a measure of real value. Did the user resolve their support ticket after the system retrieved relevant troubleshooting steps? Did the developer successfully deploy after the system recalled the deployment procedure? Task completion captures the full value chain from retrieval to outcome, not just whether the individual memory was accurate. The challenge is attribution: many factors beyond the AI's retrieval contribute to whether a task succeeds, and isolating the AI's contribution requires careful experimental design.
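One standard design for that isolation is a randomized holdout: withhold retrieval from a small fraction of sessions and compare task completion rates between the two arms. A sketch with illustrative names and a placeholder 10% holdout rate:

```python
import random

def assign_arm(session_id: str, holdout_rate: float = 0.1) -> str:
    """Deterministically assign a session to the retrieval or holdout arm."""
    rng = random.Random(session_id)  # seeding by ID keeps assignment stable
    return "holdout" if rng.random() < holdout_rate else "retrieval"

def completion_lift(completed: dict[str, list[bool]]) -> float:
    """Difference in completion rate between arms estimates retrieval's contribution."""
    def rate(arm: str) -> float:
        return sum(completed[arm]) / len(completed[arm])
    return rate("retrieval") - rate("holdout")
```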
Verifiable vs Subjective Signals
Not all quality dimensions are verifiable. Whether a response was phrased clearly, whether the level of detail was appropriate, whether the system understood the user's intent: these are subjective judgments on which reasonable people disagree. Verifiable outcomes cannot replace subjective feedback entirely; they complement it.
The key insight is that verifiable outcomes should gate confidence, while subjective feedback should influence ranking. A memory that is verifiably correct earns the right to exist at a high confidence level. Among memories that are all verifiably correct, subjective feedback determines which ones rank higher for specific query types. This separation ensures that the system never becomes highly confident in wrong information (because confidence requires verification) while still learning user preferences for how to present correct information.
In practice, this means running two parallel feedback loops. The verification loop checks factual accuracy on a schedule and adjusts confidence scores based on verified correctness. The preference loop collects user feedback and adjusts retrieval rankings based on helpfulness. The verification loop has veto power: a memory that fails verification has its confidence reduced regardless of how many users rated it positively. The preference loop operates within the bounds set by the verification loop: among verified memories, user preferences guide ranking.
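A minimal sketch of the two loops and the gate between them; the threshold values are placeholders, not tuned recommendations:

```python
from dataclasses import dataclass

CONFIDENCE_GATE = 0.6   # placeholder: minimum confidence to be retrievable at all
FAILED_CEILING = 0.3    # placeholder: confidence cap after a failed verification

@dataclass
class Memory:
    memory_id: str
    confidence: float         # set only by the verification loop
    helpfulness: float = 0.0  # accumulated subjective feedback

def apply_verification(memory: Memory, passed: bool) -> None:
    """Verification loop: the only path to high confidence, with veto power."""
    if passed:
        memory.confidence = min(1.0, memory.confidence + 0.1)
    else:
        # Veto: failed verification caps confidence regardless of user ratings.
        memory.confidence = min(memory.confidence, FAILED_CEILING)

def apply_preference(memory: Memory, rating: float) -> None:
    """Preference loop: adjusts the ranking signal, never confidence."""
    memory.helpfulness += rating

def rank(candidates: list[Memory]) -> list[Memory]:
    """Subjective feedback orders results only among gate-passing memories."""
    verified = [m for m in candidates if m.confidence >= CONFIDENCE_GATE]
    return sorted(verified, key=lambda m: m.helpfulness, reverse=True)
```

Keeping confidence and helpfulness as separate fields makes the veto structural: preference feedback simply has no code path that raises confidence.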
Designing Verification for Your Domain
The specific verification mechanisms depend on your domain. For a customer support memory system, verification might involve checking stored product specifications against the current product database, verifying troubleshooting procedures against the engineering team's runbooks, and confirming that policy information matches the current policy documents. For a coding assistant, verification might involve running stored code examples to confirm they compile, checking API references against current documentation, and validating that stored patterns match the project's current architecture.
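One way to organize domain checks is a registry mapping memory categories to verifier functions. The sketch below uses Python's built-in `compile` as a weak stand-in for "confirm it compiles" (it checks syntax only, not behavior); the commented-out verifiers are hypothetical:

```python
from typing import Callable

def verify_python_snippet(code: str) -> bool:
    """Does a stored code example still parse? (Syntax check only.)"""
    try:
        compile(code, "<stored-memory>", "exec")
        return True
    except SyntaxError:
        return False

# Map each memory category to the check that applies to it.
VERIFIERS: dict[str, Callable[[str], bool]] = {
    "code_example": verify_python_snippet,
    # "product_spec": check_against_product_db,   # hypothetical
    # "policy": check_against_policy_docs,        # hypothetical
}

def verify(category: str, content: str) -> bool | None:
    """Return the check result, or None if no verifier covers this category."""
    checker = VERIFIERS.get(category)
    return checker(content) if checker else None
```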
Verification does not need to be exhaustive. Even partial verification provides significant value. If you can verify 20% of your stored memories against authoritative sources, those verified memories form a trusted core that anchors the rest of the knowledge base. Unverified memories remain at moderate confidence and rely on corroboration and user feedback for scoring. The verified memories provide a ground truth baseline that prevents the overall system from drifting too far from reality.
Automated verification is ideal but not always feasible. Where automated checks are possible (API documentation, database schemas, configuration values), run them on a schedule and update confidence scores based on results. Where automated checks are not possible, implement periodic human review of a random sample of memories. Even reviewing 50 memories per week provides a calibration signal that catches systematic errors before they propagate.
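Drawing that weekly sample is easy to automate. A sketch, assuming memories are identified by string IDs and using the 50-per-week figure from above:

```python
import random

REVIEW_SAMPLE_SIZE = 50  # per week, per the calibration target above

def weekly_review_sample(memory_ids: list[str], seed: int | None = None) -> list[str]:
    """Pick a random sample of memories to queue for human review."""
    rng = random.Random(seed)
    return rng.sample(memory_ids, min(REVIEW_SAMPLE_SIZE, len(memory_ids)))
```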
The Accountability Chain
Verifiable outcomes create accountability at every layer of the system. When a retrieved memory leads to a bad outcome, the verification trail shows exactly what happened: which memory was retrieved, what confidence it had, which evidence supported it, and when it was last verified. This accountability chain is essential for debugging, for regulatory compliance, and for building trust with users who need to understand why the system gave a particular answer.
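A sketch of what one link in that chain might record at retrieval time; all field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class RetrievalAuditRecord:
    """One link in the accountability chain, written when a memory is retrieved."""
    query: str
    memory_id: str
    confidence_at_retrieval: float
    evidence_refs: tuple[str, ...]     # sources that supported the memory
    last_verified_at: datetime | None  # None if the memory was never verified
    retrieved_at: datetime
```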
Without verifiable outcomes, the accountability chain is broken. You can see that the system retrieved a memory and that the user was dissatisfied, but you cannot determine whether the problem was that the memory was wrong (a knowledge issue), the memory was correct but irrelevant (a retrieval issue), or the memory was correct and relevant but the user misunderstood it (a presentation issue). Verifiable outcomes let you distinguish between these failure modes and apply the appropriate fix to each.
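Once verification results exist, the triage is mechanical. A sketch:

```python
def classify_failure(memory_correct: bool, memory_relevant: bool) -> str:
    """Triage a bad outcome using verification results from the audit trail."""
    if not memory_correct:
        return "knowledge issue: correct or retire the memory"
    if not memory_relevant:
        return "retrieval issue: adjust ranking or query understanding"
    return "presentation issue: improve how the answer is framed"
```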
Adaptive Recall builds verifiable outcomes into its learning pipeline. Evidence-gated confidence updates require corroboration, and the consolidation process checks memory accuracy against source evidence before promoting confidence.