How to Implement Reinforcement Learning from Feedback
Before You Start
This is not RLHF in the model-training sense. You are not fine-tuning an LLM with human preference data. You are applying reinforcement learning concepts to the retrieval and ranking layer that sits around your LLM. The LLM's weights stay frozen. What changes is which memories the system retrieves, how it ranks them, and how much confidence it places in each piece of knowledge. This distinction matters because it means you do not need GPU infrastructure, training pipelines, or ML engineering expertise. You need a feedback collection mechanism, a reward function, and a policy update loop that adjusts retrieval parameters.
You need an existing retrieval system that produces ranked results, a mechanism for collecting feedback (explicit or implicit), and a way to attribute outcomes to specific retrieval decisions. If you have not built feedback collection yet, start with the feedback loop guide first.
Step-by-Step Implementation
Map your retrieval system to the standard RL framework. The state is the combination of the user's query, the conversation context, and the current state of the memory store. The action space is the set of possible retrieval decisions: which memories to retrieve, how many to return, in what order, and with what confidence annotations. The reward signal comes from feedback: positive reward when the retrieval contributes to a good outcome, negative reward when it does not, and zero reward when no feedback is available. The policy is the retrieval ranking function, including all the scoring parameters (recency weight, relevance weight, confidence weight, spreading activation weight) that determine which memories surface for a given query.
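The code later in this guide reads fields like retrieved_memories, scoring_strategy, and timestamp off a retrieval event. One way to make the mapping concrete is a small episode record like the sketch below; the field names mirror the ones used in the later code, but the exact shape is an assumption to adapt to your system:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class RetrievalEvent:
    query: str                  # state: the user's query
    context: list               # state: recent conversation turns
    retrieved_memories: list    # action: memories surfaced, in ranked order
    scoring_strategy: dict      # action: the scoring weights used to rank them
    timestamp: datetime = field(default_factory=datetime.now)
    reward: float = 0.0         # assigned later, once feedback arrives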
The reward function translates raw feedback into a scalar value that the system can optimize. A well-designed reward function combines multiple signals with appropriate weighting. Explicit positive feedback (thumbs up, "this was helpful") contributes +1.0. Explicit negative feedback (thumbs down, "this is wrong") contributes -1.0. Implicit positive signals (user acted on the information, conversation proceeded productively) contribute +0.3 to +0.5. Implicit negative signals (user ignored the retrieval, asked the same question differently) contribute -0.1 to -0.3. No feedback contributes 0.0. The reward is attributed to the specific memories that were retrieved and used in the response. Discount the reward by recency: feedback received within 5 minutes of the retrieval gets full weight, while feedback received hours later gets reduced weight because the connection to the specific retrieval becomes less certain.
def compute_reward(retrieval_event, feedback_signals):
    # Translate raw feedback into a scalar reward clamped to [-1.0, 1.0].
    reward = 0.0
    for signal in feedback_signals:
        elapsed = (signal.timestamp - retrieval_event.timestamp).total_seconds()
        # Full weight within 5 minutes; after that, decay linearly over the
        # next hour, floored at 0.1 so late feedback still carries some signal.
        if elapsed <= 300:
            recency_discount = 1.0
        else:
            recency_discount = max(0.1, 1.0 - (elapsed - 300) / 3600)
        if signal.type == "explicit_positive":
            reward += 1.0 * recency_discount
        elif signal.type == "explicit_negative":
            reward += -1.0 * recency_discount
        elif signal.type == "implicit_positive":
            reward += 0.4 * recency_discount
        elif signal.type == "implicit_negative":
            reward += -0.2 * recency_discount
    # Clamp so a burst of signals cannot produce an unbounded reward.
    return max(-1.0, min(1.0, reward))

A retrieval system that always returns the highest-ranked memories will never discover whether lower-ranked memories might be better for certain queries. This is the explore-exploit trade-off. Implement epsilon-greedy exploration: with probability epsilon (start at 0.1), replace one of the retrieved memories with a random alternative from the candidate set. Over time, reduce epsilon as the system accumulates more data and the policy becomes more reliable. An alternative is Thompson sampling, where each memory's score is drawn from a distribution based on its confidence and feedback history rather than from a point estimate. Memories with high uncertainty (few feedback data points) have wider distributions and are therefore more likely to be sampled, which provides natural exploration. Thompson sampling is more sophisticated than epsilon-greedy, but it also produces more consistent results because it concentrates exploration on the memories where the system has the most to learn. Both strategies are sketched below.
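A minimal sketch of both strategies, assuming each candidate memory exposes a relevance score and running counts of positive and negative feedback (hypothetical fields; adapt the names to your memory store):

import random

def retrieve_epsilon_greedy(candidates, k=5, epsilon=0.1):
    # Exploit: take the top-k by score. Explore: with probability epsilon,
    # swap one slot for a random lower-ranked candidate.
    ranked = sorted(candidates, key=lambda m: m.score, reverse=True)
    results = ranked[:k]
    if len(ranked) > k and random.random() < epsilon:
        results[random.randrange(k)] = random.choice(ranked[k:])
    return results

def retrieve_thompson(candidates, k=5):
    # Sample each memory's score from a Beta posterior over its feedback
    # history; memories with few data points get wide distributions and
    # land in the top-k more often than their point estimate would suggest.
    def sampled_score(m):
        return random.betavariate(1 + m.positive_feedback, 1 + m.negative_feedback)
    return sorted(candidates, key=sampled_score, reverse=True)[:k]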
After each retrieval episode receives its reward, update the scoring parameters that produced that retrieval. The simplest approach is to adjust the confidence scores of the retrieved memories in the direction indicated by the reward, using a learning rate that controls the step size. Memories that received positive rewards get a small confidence boost; memories that received negative rewards get a small confidence reduction. The learning rate should be small (0.01 to 0.05 of the reward magnitude) to ensure bounded updates. Additionally, track which scoring parameter combinations (recency weight, relevance weight, graph traversal depth) led to the best rewards and gradually adjust the global scoring weights toward the configurations that perform best. This is a form of contextual bandits: the system learns not just which memories are good, but which scoring strategies work best for different types of queries.
def update_policy(retrieval_event, reward, learning_rate=0.03):
    for memory_ref in retrieval_event.retrieved_memories:
        memory = get_memory(memory_ref.memory_id)
        # Scale the step by relevance so the memories that contributed most
        # to the response receive the largest adjustment.
        delta = reward * learning_rate * memory_ref.relevance_score
        # Keep confidence on its bounded 1.0-10.0 scale.
        memory.confidence = max(1.0, min(10.0, memory.confidence + delta))
    # Update global scoring weights based on strategy performance.
    strategy = retrieval_event.scoring_strategy
    strategy_stats = get_strategy_stats(strategy)
    strategy_stats.update(reward)
    # Only adjust global weights once a strategy has enough samples.
    if strategy_stats.sample_count > 50:
        adjust_scoring_weights(strategy, strategy_stats.mean_reward)

Most retrieval events generate little or no feedback. Experience replay stores past retrieval episodes with their rewards and periodically replays them through the policy update mechanism. This improves sample efficiency: instead of learning from each episode once, the system learns from it multiple times. Implement a replay buffer that stores the most recent 10,000 episodes with non-zero rewards. During each update cycle, sample a batch of 32 to 64 episodes from the replay buffer and apply the policy update to each. Prioritize episodes with larger absolute rewards (both positive and negative) because they carry more information. Deprioritize very old episodes because the memory store may have changed significantly since then, making their reward signals less relevant. A sketch of such a buffer follows.
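A minimal replay buffer sketch, reusing the update_policy function above (ReplayBuffer and replay_update are illustrative names, not a library API):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        # deque with maxlen silently evicts the oldest episodes, which also
        # deprioritizes stale reward signals as the memory store evolves.
        self.episodes = deque(maxlen=capacity)

    def add(self, retrieval_event, reward):
        if reward != 0.0:  # only store episodes that carry information
            self.episodes.append((retrieval_event, reward))

    def sample(self, batch_size=32):
        # Weight by |reward| so strongly positive or negative episodes are
        # replayed more often than weak implicit signals.
        episodes = list(self.episodes)
        if not episodes:
            return []
        weights = [abs(reward) for _, reward in episodes]
        return random.choices(episodes, weights=weights,
                              k=min(batch_size, len(episodes)))

def replay_update(buffer, batch_size=32):
    for event, reward in buffer.sample(batch_size):
        update_policy(event, reward)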
Run the RL system for at least two weeks before evaluating its impact. Compare retrieval quality metrics (precision, recall, mean reciprocal rank) between the RL-enabled period and the preceding baseline period. If the metrics improved, gradually increase the learning rate and reduce the exploration rate. If the metrics stayed flat, the reward function may not be capturing the right signals; review individual episodes to check whether the reward assignments match human judgment. If the metrics degraded, reduce the learning rate, increase exploration, and review whether the reward function has a systematic bias (for example, if it rewards engagement rather than accuracy).
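For the ranking comparison, mean reciprocal rank is straightforward to compute if each episode records the rank of the first retrieved memory that turned out to be useful (first_useful_rank is an assumed field here, set to None when nothing retrieved was useful):

def mean_reciprocal_rank(episodes):
    # Average of 1/rank across episodes; episodes where no retrieved
    # memory proved useful contribute a reciprocal of 0.
    reciprocals = [1.0 / ep.first_useful_rank if ep.first_useful_rank else 0.0
                   for ep in episodes]
    return sum(reciprocals) / len(reciprocals) if reciprocals else 0.0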
Practical Considerations
Sparse rewards. In most production systems, fewer than 10% of retrievals receive explicit feedback. The system must learn primarily from implicit signals and from the minority of episodes that do receive feedback. Experience replay helps by reusing the episodes that have feedback, and Thompson sampling helps by concentrating exploration where uncertainty is highest. If feedback is extremely sparse (fewer than 1%), consider adding an active learning component that explicitly asks users for feedback on uncertain retrievals.
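One sketch of such an active-learning trigger: prompt only when the retrieval itself looks uncertain, so prompts stay rare and land where they are most informative (the thresholds below are assumptions to tune against your own score distribution):

def should_request_feedback(retrieval_event, min_top_score=0.7, min_spread=0.1):
    scores = [m.relevance_score for m in retrieval_event.retrieved_memories]
    if not scores:
        return False
    # A weak best match or a flat score distribution both signal that the
    # ranker could not distinguish candidates, which is exactly where
    # explicit feedback is most valuable.
    return max(scores) < min_top_score or (max(scores) - min(scores)) < min_spread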
Delayed rewards. Some outcomes take hours or days to materialize. Keep the reward attribution window open for a configurable period and process late-arriving rewards when they arrive. The recency discount in the reward function handles the reduced certainty of delayed attribution.
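A sketch of late attribution under these assumptions (find_retrieval_event is a hypothetical lookup from a feedback signal back to its originating episode):

def process_late_feedback(signal, attribution_window_hours=48):
    event = find_retrieval_event(signal.retrieval_id)
    if event is None:
        return  # window closed or episode evicted; drop the signal
    age_hours = (signal.timestamp - event.timestamp).total_seconds() / 3600
    if age_hours <= attribution_window_hours:
        # compute_reward's recency discount already down-weights old signals.
        update_policy(event, compute_reward(event, [signal]))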
Non-stationary environments. User behavior and information relevance change over time. The RL system should use a sliding window for computing statistics (mean reward per strategy, confidence adjustments) rather than accumulating lifetime statistics. A 90-day window balances the need for sufficient data with the need to adapt to changes.
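A sliding-window variant of the strategy statistics used in update_policy might look like this sketch (WindowedStats is an illustrative name; it exposes the same update, sample_count, and mean_reward interface assumed earlier):

import time
from collections import deque

class WindowedStats:
    def __init__(self, window_days=90):
        self.window_seconds = window_days * 86400
        self.samples = deque()  # (timestamp, reward) pairs, oldest first

    def update(self, reward, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, reward))
        # Evict samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    @property
    def sample_count(self):
        return len(self.samples)

    @property
    def mean_reward(self):
        if not self.samples:
            return 0.0
        return sum(reward for _, reward in self.samples) / len(self.samples)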
Adaptive Recall applies reinforcement learning principles to memory retrieval automatically. Confidence scores evolve based on outcomes, and the cognitive scoring model adapts to your usage patterns over time.
Get Started Free