The Three Conditions for Safe AI Self-Improvement
Why All Three Conditions Are Necessary
Each condition prevents a distinct class of failure. Verifiable outcomes prevent the system from optimizing for the wrong thing. Bounded updates prevent the system from changing too fast. The audit trail prevents the system from changing in ways that cannot be understood or undone. A system with only two of the three conditions will eventually fail in exactly the way the missing condition was supposed to prevent.
A system with verifiable outcomes and an audit trail but unbounded updates can confirm that each individual update is correct and trace every change, yet a single adversarial interaction or a single noisy signal can dramatically reshape its behavior. Each outcome is verifiably correct in isolation, but the magnitude of the change it triggers is disproportionate.
A system with bounded updates and an audit trail but no verifiable outcomes will make small, traceable changes that are individually harmless, but those changes may be pushing the system in the wrong direction. After thousands of small, well-logged steps in the wrong direction, the system has drifted far from useful behavior, and the audit trail documents exactly how it got there without having prevented it.
A system with verifiable outcomes and bounded updates but no audit trail makes correct, measured improvements, but when something eventually goes wrong (and it will), nobody can determine what changed, when, or why. Debugging becomes impossible, regulatory compliance fails, and the team cannot learn from the incident to prevent future occurrences.
Condition 1: Verifiable Outcomes
Verifiable outcomes are measurable signals that confirm whether the system's behavior produced a good or a bad result. The key word is "verifiable": the measurement can be checked against an independent ground truth rather than relying on the system's own assessment of its performance.
For memory systems, verifiable outcomes include: factual accuracy (does the stored information match authoritative sources?), retrieval relevance (did the retrieved memory address the user's actual need?), task completion (did the user achieve their goal after interacting with the system?), and prediction accuracy (did claims the system made turn out to be correct?). Not all of these are equally easy to measure, and not all queries produce verifiable outcomes. What matters is that a meaningful fraction of the system's operations produce outcomes that can be checked.
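As one concrete illustration, here is a minimal sketch of how such outcomes might be represented. The type names and fields are hypothetical, not taken from any particular system:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class OutcomeType(Enum):
    FACTUAL_ACCURACY = auto()     # stored fact checked against an authoritative source
    RETRIEVAL_RELEVANCE = auto()  # retrieved memory addressed the user's actual need
    TASK_COMPLETION = auto()      # user achieved their goal after the interaction
    PREDICTION_ACCURACY = auto()  # a claim the system made turned out to be correct

@dataclass
class VerifiedOutcome:
    memory_id: str
    outcome_type: OutcomeType
    success: bool             # did the check pass against independent ground truth?
    evidence: Optional[str]   # pointer to the source or signal used for verification
```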
Without verifiable outcomes, the system has no way to distinguish between changes that make it better and changes that make it worse. It will optimize for whatever proxy metric is available, whether click-through rate, engagement time, or average confidence score, and these proxies will diverge from actual quality over time. Verifiable outcomes anchor the learning signal to reality.
Practical implementation does not require verifying every outcome. Even if only 15 to 20% of interactions produce verifiable results, that is enough to provide a calibration signal. The verified fraction establishes a ground truth that the system can extrapolate from for the unverified majority. Memories that consistently contribute to verified good outcomes earn high confidence, which in turn calibrates how they are weighted even in interactions where no verification is possible.
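A minimal sketch of that idea, assuming a simple per-interaction record (the field and function names are illustrative): the calibration signal is computed from the verified subset alone and then used as a prior for the unverified majority.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interaction:
    memory_id: str
    verified: bool            # did this interaction produce a checkable outcome?
    success: Optional[bool]   # result of the check, when one was possible

def calibration_rate(interactions: list[Interaction]) -> Optional[float]:
    """Ground-truth success rate estimated from the verified subset only."""
    verified = [i for i in interactions if i.verified]   # e.g. ~15-20% of interactions
    if not verified:
        return None           # no ground truth yet; do not extrapolate
    return sum(1 for i in verified if i.success) / len(verified)
```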
Condition 2: Bounded Updates
Bounded updates ensure that no single interaction, feedback signal, or consolidation event can change the system's behavior by more than a predetermined amount. The bound operates at multiple levels: individual memory confidence changes are clamped to a maximum delta (typically 0.2 to 0.5 on a 10-point scale), per-session total confidence changes are capped, and global scoring parameter adjustments are rate-limited.
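A minimal sketch of that clamping logic, assuming confidence lives on the 10-point scale mentioned above; the constants and function name are illustrative, not a prescribed implementation:

```python
MAX_DELTA_PER_UPDATE = 0.3    # per-memory clamp, on the same 10-point scale as confidence
MAX_DELTA_PER_SESSION = 1.0   # cap on total confidence change attributable to one session

def bounded_update(confidence: float, proposed_delta: float,
                   session_budget: float) -> tuple[float, float]:
    """Apply a confidence change clamped at both the per-update and per-session level.

    Returns the new confidence and the remaining session budget."""
    # Clamp the individual update.
    delta = max(-MAX_DELTA_PER_UPDATE, min(MAX_DELTA_PER_UPDATE, proposed_delta))
    # Respect whatever budget the session has left.
    if abs(delta) > session_budget:
        delta = session_budget if delta > 0 else -session_budget
    new_confidence = min(10.0, max(0.0, confidence + delta))
    return new_confidence, session_budget - abs(delta)
```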
Bounded updates serve two purposes. First, they provide stability: the system's behavior changes gradually and predictably rather than in sudden jumps. A user who interacts with the system today and returns tomorrow encounters a system that behaves almost identically, even though learning has occurred in between. Second, they provide security: an attacker who feeds the system false information can only shift confidence by the bounded amount, which is typically too small to meaningfully affect retrieval rankings for well-established knowledge.
The bound must be set based on the system's total knowledge volume. A system with 1,000 memories can tolerate larger per-memory updates because each memory is a larger fraction of the total. A system with 1,000,000 memories should use smaller per-memory updates because even a small percentage of large changes could collectively shift the system's behavior significantly. A good heuristic is to set the per-interaction bound so that the maximum possible drift over a day (bound multiplied by maximum interactions per day) is less than 10% of the system's total confidence budget.
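The heuristic can be written out directly. The function name and the worked numbers below are illustrative only; they simply show the arithmetic, not recommended values:

```python
def max_per_interaction_bound(num_memories: int,
                              avg_confidence: float,
                              max_interactions_per_day: int,
                              max_daily_drift: float = 0.10) -> float:
    """Upper limit on the per-interaction bound: worst-case daily drift
    (bound * max interactions per day) must stay below max_daily_drift
    of the total confidence budget (sum of confidence across all memories)."""
    total_budget = num_memories * avg_confidence
    return max_daily_drift * total_budget / max_interactions_per_day

# Illustrative numbers: 1,000 memories at an average confidence of 5.0
# and up to 1,000 interactions per day gives an upper limit of 0.5.
print(max_per_interaction_bound(1_000, 5.0, 1_000))   # 0.5
```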
Bounded updates also enable practical rollback. If every change is small, rolling back one day's worth of changes is a minor adjustment. If changes are unbounded, rolling back a single large change might require understanding its cascading effects on dependent memories and knowledge graph edges, making rollback complex and risky.
Condition 3: Audit Trail
The audit trail records every learning event with enough context to understand what happened, why, and how to reverse it. Each entry should include: a timestamp, the type of change (confidence update, consolidation merge, decay application, graph edge modification), the before and after values, the triggering event or data source, and the evidence or logic that justified the change.
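A minimal sketch of such an entry and an append-only write; the field names are illustrative, and any real schema would depend on the system:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class AuditEvent:
    timestamp: float    # when the change was applied
    change_type: str    # e.g. "confidence_update", "consolidation_merge",
                        # "decay", "graph_edge_modification"
    target_id: str      # memory or edge affected
    before: float       # value before the change
    after: float        # value after the change
    trigger: str        # the event or data source that caused the change
    evidence: str       # the evidence or logic that justified the change

def log_event(event: AuditEvent, path: str = "audit.jsonl") -> None:
    """Append the event to a JSON-lines log; the learning system only ever appends."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")
```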
The audit trail serves three distinct audiences. Engineers use it for debugging: when the system behaves unexpectedly, the audit trail shows which changes led to the current state. Product teams use it for monitoring: aggregated audit trail data reveals trends in learning velocity, confidence distribution shifts, and knowledge coverage changes. Compliance teams use it for regulatory requirements: industries that mandate explainable AI decisions (healthcare, finance, legal) require the ability to trace any AI output back through the decision chain to the underlying knowledge and the evidence that supported it.
The audit trail must be immutable. If the learning system can modify or delete its own audit entries, the trail becomes untrustworthy. Store audit events in an append-only log or an external system that the learning system can write to but not modify. This ensures that even if the learning system malfunctions, the audit trail provides an accurate record of what happened.
Storage requirements for audit trails are modest. A typical learning event generates 200 to 500 bytes of audit data. A system processing 10,000 learning events per day generates roughly 2 to 5 MB of audit data per day, or about 1 to 2 GB per year. This is well within the capacity of any logging infrastructure and small enough to retain for years without cost concerns.
Implementing All Three Together
The three conditions interact to create a learning system that is greater than the sum of its parts. Verifiable outcomes tell the bounded update mechanism which direction to push confidence scores. Bounded updates ensure that each push is small enough to be reversible. The audit trail records each push so that patterns can be analyzed and problems can be traced.
A typical learning cycle works as follows. A retrieval event occurs and generates feedback. The feedback is checked against verifiable outcomes where possible: was the retrieved information factually correct, and did the user complete their task? The feedback is translated into a bounded confidence adjustment: the delta is computed from the feedback strength and clamped to the maximum allowed change. The adjustment is applied and logged to the audit trail with full context: what the feedback was, what verification was performed, what the old and new confidence values were, and what triggered the event. Later, the audit trail is analyzed to verify that learning trends are positive and to detect any patterns that suggest problems.
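The cycle can be sketched in a few lines. Everything here, including the type names, the 0.3 bound, and the verifier hook, is illustrative rather than a description of any particular implementation:

```python
import time
from dataclasses import dataclass

@dataclass
class Memory:
    id: str
    confidence: float   # on the 10-point scale used above

@dataclass
class Feedback:
    strength: float     # signed signal, positive = good outcome
    source: str         # what produced the feedback (user action, external check, ...)

def learning_step(memory: Memory, feedback: Feedback, verifier, audit_log: list,
                  max_delta: float = 0.3) -> None:
    """One illustrative cycle: verify the outcome, bound the update, apply it, log it."""
    # 1. Check the feedback against ground truth where a verifier is available.
    verified = verifier(feedback) if verifier is not None else None
    # 2. Translate feedback strength into a bounded confidence adjustment.
    delta = max(-max_delta, min(max_delta, feedback.strength))
    # 3. Apply the adjustment within the confidence scale.
    before = memory.confidence
    memory.confidence = min(10.0, max(0.0, before + delta))
    # 4. Record full context so the change can be analyzed or reversed later.
    audit_log.append({
        "timestamp": time.time(),
        "change_type": "confidence_update",
        "target_id": memory.id,
        "before": before,
        "after": memory.confidence,
        "trigger": feedback.source,
        "evidence": {"verified": verified, "strength": feedback.strength},
    })
```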
The conditions also create a natural accountability framework. When something goes wrong, you can ask: Was the outcome measurement correct? (Condition 1.) Was the update bounded appropriately? (Condition 2.) Can we trace exactly what happened? (Condition 3.) If all three answers are yes and the system still produced a bad result, the issue is in the learning logic itself, which is a localized debugging problem rather than a systemic trust failure.
Adaptive Recall implements all three conditions natively. Evidence-gated confidence updates provide verifiable outcomes, bounded update rules prevent dramatic changes, and the full audit trail is accessible through the status tool.