How to Add RLHF to a Non-LLM AI System
Before You Start
You need a system that produces outputs users can evaluate (search results, memory retrievals, recommendations) and a way to collect feedback without adding significant friction to the user experience. The core challenge of RLHF in non-LLM systems is feedback collection: LLM training can use dedicated human raters, but production retrieval systems need to learn from real users who are trying to accomplish tasks, not evaluate results.
Step-by-Step Implementation
The feedback mechanism must be lightweight enough that users actually use it. A simple thumbs up/thumbs down on each interaction is the minimum viable interface. For retrieval systems, a more useful pattern is to let users flag specific results as helpful or unhelpful, which provides result-level rather than interaction-level feedback.
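As a rough sketch of that distinction (the event fields and helper names here are illustrative, not a prescribed schema), result-level feedback simply attaches the signal to a specific result rather than to the whole interaction:

import time

def log_interaction_feedback(interaction_id, thumbs_up):
    # Interaction-level: one signal for the whole response
    return {
        "interaction_id": interaction_id,
        "signal": "thumbs_up" if thumbs_up else "thumbs_down",
        "timestamp": time.time(),
    }

def log_result_feedback(interaction_id, result_id, helpful):
    # Result-level: one signal per retrieved item, which is far
    # more informative when training a ranking policy
    return {
        "interaction_id": interaction_id,
        "result_id": result_id,
        "signal": "helpful" if helpful else "unhelpful",
        "timestamp": time.time(),
    }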
For memory systems where the user does not directly see the retrieved memories (they are injected into the model's prompt), feedback must come from indirect signals: did the model's response satisfy the user? Was the response accurate based on the injected context? Did the user need to correct information that came from a memory? These indirect signals are noisier than direct result feedback but are the only option when the user interacts with the model rather than with the memory system directly.
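One sketch of turning those indirect signals into weak per-memory scores (the signal names and weights below are assumptions you would calibrate for your own system, not measured values):

def implicit_memory_feedback(injected_memory_ids, user_corrected,
                             response_accepted):
    # Convert session-level outcomes into weak per-memory signals.
    # A user correction is strong negative evidence; an accepted
    # response is weak positive evidence spread across every
    # memory that was injected into the prompt.
    score = 0.0
    if user_corrected:
        score -= 1.0
    if response_accepted:
        score += 0.3
    return {memory_id: score for memory_id in injected_memory_ids}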
Pairwise comparisons are the gold standard for RLHF because humans are better at comparing two options than rating a single option on an absolute scale. Present two alternative ranking orders for the same query and record which one the user prefers. This produces training data in the form (query, ranking_A, ranking_B, preference).
import time

def collect_comparison(query, user_id):
    # Generate two different rankings
    ranking_a = retrieve_with_params(query, user_id,
                                     params=config_a)
    ranking_b = retrieve_with_params(query, user_id,
                                     params=config_b)
    # Serve one, track which would have been served by each
    # strategy, and observe which results the user engages with
    served = ranking_a  # Serve A this time
    event = log_comparison_event(
        query=query,
        ranking_a=ranking_a,
        ranking_b=ranking_b,
        served="a"
    )
    return served, event

def record_preference(event_id, preferred_ranking):
    store_preference({
        "event_id": event_id,
        "preferred": preferred_ranking,
        "timestamp": time.time()
    })

Build a model that predicts human preferences from query and result features. The model learns which ranking characteristics humans prefer: do they value recency over similarity? Do they prefer shorter, more specific memories over longer, more general ones? The trained model can then score any ranking for any query, even ones not seen during training.
from sklearn.linear_model import LogisticRegression
import numpy as np

class PreferenceModel:
    def __init__(self):
        self.model = LogisticRegression()
        self.fitted = False

    def extract_features(self, ranking):
        return np.array([
            np.mean([r["similarity"] for r in ranking]),
            np.mean([r["recency_score"] for r in ranking]),
            np.mean([r["confidence"] for r in ranking]),
            np.std([r["similarity"] for r in ranking]),
            len(ranking)
        ])

    def train(self, comparisons):
        X = []
        y = []
        for comp in comparisons:
            feat_a = self.extract_features(comp["ranking_a"])
            feat_b = self.extract_features(comp["ranking_b"])
            # Feature difference encodes preference direction
            X.append(feat_a - feat_b)
            y.append(1 if comp["preferred"] == "a" else 0)
        self.model.fit(np.array(X), np.array(y))
        self.fitted = True

    def predict_preference(self, ranking):
        if not self.fitted:
            return 0.5
        features = self.extract_features(ranking)
        return self.model.predict_proba(
            features.reshape(1, -1)
        )[0][1]

Feed the preference model's predictions into the ranking policy as reward signals. Instead of optimizing for click-through or dwell time (proxy metrics), optimize for the predicted human preference score, a learned metric that reflects what humans actually value.
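A minimal sketch of that wiring, reusing retrieve_with_params from earlier and assuming a set of candidate parameterizations exists elsewhere in your system: score each candidate ranking with the preference model and serve the highest-scoring one.

def rank_with_preference_reward(query, user_id, preference_model,
                                candidate_configs):
    # Generate candidate rankings under different parameterizations,
    # then use the predicted human preference as the reward signal
    # rather than a proxy metric like click-through.
    best_ranking, best_score = None, float("-inf")
    for params in candidate_configs:
        ranking = retrieve_with_params(query, user_id, params=params)
        score = preference_model.predict_preference(ranking)
        if score > best_score:
            best_ranking, best_score = ranking, score
    return best_ranking, best_score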
In practice, only a small fraction of interactions produce explicit feedback. Most users never click thumbs up or thumbs down. Address this by combining explicit feedback (high weight, sparse) with implicit signals (lower weight, abundant). The preference model trained on explicit comparisons can predict preferences for interactions without explicit feedback, extending the reach of the human signal.
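A sketch of one way to blend the two signal types (the weights are illustrative placeholders, not tuned values):

EXPLICIT_WEIGHT = 1.0   # sparse but trustworthy
IMPLICIT_WEIGHT = 0.2   # abundant but noisy

def combined_reward(explicit_signal, implicit_signal, predicted_preference):
    # explicit_signal: +1.0 / -1.0 / None (thumbs up, thumbs down, no feedback)
    # implicit_signal: engagement-derived score, roughly in [-1, 1]
    # predicted_preference: preference model output in [0, 1]
    if explicit_signal is not None:
        return EXPLICIT_WEIGHT * explicit_signal
    # With no explicit feedback, lean on implicit engagement plus the
    # preference model's estimate of what a human would have chosen.
    return IMPLICIT_WEIGHT * implicit_signal + 0.5 * (predicted_preference - 0.5)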
Reserve 10-20% of comparison data as a holdout set that is not used for training. Periodically evaluate the preference model on this holdout to verify that its predictions generalize. If holdout accuracy degrades, the model is overfitting to the training comparisons and needs retraining with a larger, more diverse dataset.
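A minimal sketch of that holdout check, assuming comparisons are stored as the (query, ranking_A, ranking_B, preference) records collected earlier:

import random

def evaluate_on_holdout(comparisons, holdout_fraction=0.2):
    # Split off a holdout set that never touches training
    comparisons = list(comparisons)
    random.shuffle(comparisons)
    split = int(len(comparisons) * (1 - holdout_fraction))
    train_set, holdout = comparisons[:split], comparisons[split:]

    model = PreferenceModel()
    model.train(train_set)

    # Check whether the model still picks the ranking humans preferred
    correct = 0
    for comp in holdout:
        score_a = model.predict_preference(comp["ranking_a"])
        score_b = model.predict_preference(comp["ranking_b"])
        predicted = "a" if score_a >= score_b else "b"
        correct += int(predicted == comp["preferred"])
    return correct / len(holdout)  # holdout accuracy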
Evidence-Gated RLHF
Adaptive Recall applies an evidence-gating principle to its implicit RLHF. Instead of updating rankings based on any single piece of feedback, the system requires consistent evidence across multiple interactions before adjusting activation levels significantly. A memory that receives positive implicit feedback (being retrieved and used successfully) across five independent sessions gains confidence that a single positive interaction would not provide. This gating prevents the system from overreacting to noise while still learning from genuine patterns.
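A sketch of that gating logic, not Adaptive Recall's implementation: the five-session threshold comes from the description above, while the step size and bookkeeping structure are assumptions.

from collections import defaultdict

REQUIRED_SESSIONS = 5    # distinct sessions of consistent evidence
ACTIVATION_STEP = 0.1    # illustrative adjustment size

positive_sessions = defaultdict(set)

def record_positive_use(memory_id, session_id, activations):
    # Accumulate evidence; only adjust activation once the same memory
    # has been used successfully in enough independent sessions.
    positive_sessions[memory_id].add(session_id)
    if len(positive_sessions[memory_id]) >= REQUIRED_SESSIONS:
        activations[memory_id] = activations.get(memory_id, 0.0) + ACTIVATION_STEP
        positive_sessions[memory_id].clear()  # reset the evidence window
    return activations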
Build a retrieval system that learns from human behavior automatically. Adaptive Recall's cognitive scoring captures implicit preferences through access patterns.