How to Add RLHF to a Non-LLM AI System
Before You Start
You need a system that produces outputs users can evaluate (search results, memory retrievals, recommendations) and a way to collect feedback without adding significant friction to the user experience. The core challenge of RLHF in non-LLM systems is feedback collection: LLM training can use dedicated human raters, but production retrieval systems need to learn from real users who are trying to accomplish tasks, not evaluate results.
Step-by-Step Implementation
The feedback mechanism must be lightweight enough that users actually use it. A simple thumbs up/thumbs down on each interaction is the minimum viable interface. For retrieval systems, a more useful pattern is to let users flag specific results as helpful or unhelpful, which provides result-level rather than interaction-level feedback.
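As a rough sketch of that distinction (the event fields and helper names here are illustrative, not a prescribed schema), result-level feedback simply attaches the signal to a specific result rather than to the whole interaction:

import time

def log_interaction_feedback(interaction_id, thumbs_up):
    # Interaction-level: one signal for the whole response
    return {
        "interaction_id": interaction_id,
        "signal": "thumbs_up" if thumbs_up else "thumbs_down",
        "timestamp": time.time(),
    }

def log_result_feedback(interaction_id, result_id, helpful):
    # Result-level: one signal per retrieved item, which is far
    # more informative when training a ranking policy
    return {
        "interaction_id": interaction_id,
        "result_id": result_id,
        "signal": "helpful" if helpful else "unhelpful",
        "timestamp": time.time(),
    }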
For memory systems where the user does not directly see the retrieved memories (they are injected into the model's prompt), feedback must come from indirect signals: did the model's response satisfy the user? Was the response accurate based on the injected context? Did the user need to correct information that came from a memory? These indirect signals are noisier than direct result feedback but are the only option when the user interacts with the model rather than with the memory system directly.
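One sketch of turning those indirect signals into weak per-memory scores (the signal names and weights below are assumptions you would calibrate for your own system, not measured values):

def implicit_memory_feedback(injected_memory_ids, user_corrected,
                             response_accepted):
    # Convert session-level outcomes into weak per-memory signals.
    # A user correction is strong negative evidence; an accepted
    # response is weak positive evidence spread across every
    # memory that was injected into the prompt.
    score = 0.0
    if user_corrected:
        score -= 1.0
    if response_accepted:
        score += 0.3
    return {memory_id: score for memory_id in injected_memory_ids}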
Pairwise comparisons are the gold standard for RLHF because humans are better at comparing two options than rating a single option on an absolute scale. Present two alternative ranking orders for the same query and record which one the user prefers. This produces training data in the form (query, ranking_A, ranking_B, preference).
import time

def collect_comparison(query, user_id):
    # Generate two different rankings
    ranking_a = retrieve_with_params(query, user_id,
                                     params=config_a)
    ranking_b = retrieve_with_params(query, user_id,
                                     params=config_b)
    # Serve one, track which would have been served by each
    # strategy, and observe which results the user engages with
    served = ranking_a  # Serve A this time
    event = log_comparison_event(
        query=query,
        ranking_a=ranking_a,
        ranking_b=ranking_b,
        served="a"
    )
    return served, event

def record_preference(event_id, preferred_ranking):
    store_preference({
        "event_id": event_id,
        "preferred": preferred_ranking,
        "timestamp": time.time()
    })

Build a model that predicts human preferences from query and result features. The model learns which ranking characteristics humans prefer: do they value recency over similarity? Do they prefer shorter, more specific memories over longer, more general ones? The trained model can then score any ranking for any query, even ones not seen during training.
from sklearn.linear_model import LogisticRegression
import numpy as np

class PreferenceModel:
    def __init__(self):
        self.model = LogisticRegression()
        self.fitted = False

    def extract_features(self, ranking):
        return np.array([
            np.mean([r["similarity"] for r in ranking]),
            np.mean([r["recency_score"] for r in ranking]),
            np.mean([r["confidence"] for r in ranking]),
            np.std([r["similarity"] for r in ranking]),
            len(ranking)
        ])

    def train(self, comparisons):
        X = []
        y = []
        for comp in comparisons:
            feat_a = self.extract_features(comp["ranking_a"])
            feat_b = self.extract_features(comp["ranking_b"])
            # Feature difference encodes preference direction
            X.append(feat_a - feat_b)
            y.append(1 if comp["preferred"] == "a" else 0)
        self.model.fit(np.array(X), np.array(y))
        self.fitted = True

    def predict_preference(self, ranking):
        if not self.fitted:
            return 0.5
        features = self.extract_features(ranking)
        return self.model.predict_proba(
            features.reshape(1, -1)
        )[0][1]

Feed the preference model's predictions into the ranking policy as reward signals. Instead of optimizing for click-through or dwell time (proxy metrics), optimize for the predicted human preference score, a learned metric that reflects what humans actually value.
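A minimal sketch of that wiring, reusing retrieve_with_params from earlier and assuming a set of candidate parameterizations exists elsewhere in your system: score each candidate ranking with the preference model and serve the highest-scoring one.

def rank_with_preference_reward(query, user_id, preference_model,
                                candidate_configs):
    # Generate candidate rankings under different parameterizations,
    # then use the predicted human preference as the reward signal
    # rather than a proxy metric like click-through.
    best_ranking, best_score = None, float("-inf")
    for params in candidate_configs:
        ranking = retrieve_with_params(query, user_id, params=params)
        score = preference_model.predict_preference(ranking)
        if score > best_score:
            best_ranking, best_score = ranking, score
    return best_ranking, best_score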
In practice, only a small fraction of interactions produce explicit feedback. Most users never click thumbs up or thumbs down. Address this by combining explicit feedback (high weight, sparse) with implicit signals (lower weight, abundant). The preference model trained on explicit comparisons can predict preferences for interactions without explicit feedback, extending the reach of the human signal.
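A sketch of one way to blend the two signal types (the weights are illustrative placeholders, not tuned values):

EXPLICIT_WEIGHT = 1.0   # sparse but trustworthy
IMPLICIT_WEIGHT = 0.2   # abundant but noisy

def combined_reward(explicit_signal, implicit_signal, predicted_preference):
    # explicit_signal: +1.0 / -1.0 / None (thumbs up, thumbs down, no feedback)
    # implicit_signal: engagement-derived score, roughly in [-1, 1]
    # predicted_preference: preference model output in [0, 1]
    if explicit_signal is not None:
        return EXPLICIT_WEIGHT * explicit_signal
    # With no explicit feedback, lean on implicit engagement plus the
    # preference model's estimate of what a human would have chosen.
    return IMPLICIT_WEIGHT * implicit_signal + 0.5 * (predicted_preference - 0.5)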
Reserve 10-20% of comparison data as a holdout set that is not used for training. Periodically evaluate the preference model on this holdout to verify that its predictions generalize. If holdout accuracy degrades, the model is overfitting to the training comparisons and needs retraining with a larger, more diverse dataset.
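A minimal sketch of that holdout check, assuming comparisons are stored as the (query, ranking_A, ranking_B, preference) records collected earlier:

import random

def evaluate_on_holdout(comparisons, holdout_fraction=0.2):
    # Split off a holdout set that never touches training
    comparisons = list(comparisons)
    random.shuffle(comparisons)
    split = int(len(comparisons) * (1 - holdout_fraction))
    train_set, holdout = comparisons[:split], comparisons[split:]

    model = PreferenceModel()
    model.train(train_set)

    # Check whether the model still picks the ranking humans preferred
    correct = 0
    for comp in holdout:
        score_a = model.predict_preference(comp["ranking_a"])
        score_b = model.predict_preference(comp["ranking_b"])
        predicted = "a" if score_a >= score_b else "b"
        correct += int(predicted == comp["preferred"])
    return correct / len(holdout)  # holdout accuracy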
Evidence-Gated RLHF
Adaptive Recall applies an evidence-gating principle to its implicit RLHF. Instead of updating rankings based on any single piece of feedback, the system requires consistent evidence across multiple interactions before adjusting activation levels significantly. A memory that receives positive implicit feedback (being retrieved and used successfully) across five independent sessions gains confidence that a single positive interaction would not provide. This gating prevents the system from overreacting to noise while still learning from genuine patterns.
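A sketch of that gating logic, not Adaptive Recall's implementation: the five-session threshold comes from the description above, while the step size and bookkeeping structure are assumptions.

from collections import defaultdict

REQUIRED_SESSIONS = 5    # distinct sessions of consistent evidence
ACTIVATION_STEP = 0.1    # illustrative adjustment size

positive_sessions = defaultdict(set)

def record_positive_use(memory_id, session_id, activations):
    # Accumulate evidence; only adjust activation once the same memory
    # has been used successfully in enough independent sessions.
    positive_sessions[memory_id].add(session_id)
    if len(positive_sessions[memory_id]) >= REQUIRED_SESSIONS:
        activations[memory_id] = activations.get(memory_id, 0.0) + ACTIVATION_STEP
        positive_sessions[memory_id].clear()  # reset the evidence window
    return activations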
Build a retrieval system that learns from human behavior automatically. Adaptive Recall's cognitive scoring captures implicit preferences through access patterns.