How to Design Reward Functions for Memory Systems
Before You Start
You need to understand what "good retrieval" means in your specific application. For a customer support memory system, good retrieval means the agent had the context needed to resolve the issue. For a coding assistant, good retrieval means the model referenced the right project conventions. For a research tool, good retrieval means the user found the information they were looking for. Define your quality objective before designing the reward function, because the function must point at the right goal.
Step-by-Step Design
List every observable behavior that might indicate whether retrieved memories were useful. Different application types provide different signals.
For memory-augmented LLM applications, the richest signals are: whether the model referenced injected memories in its response (content usage), whether the user asked a follow-up question on the same topic (possible dissatisfaction), whether the user explicitly corrected the model (clear negative signal), whether the task was completed (ultimate positive signal), and whether the user gave explicit feedback (thumbs up/down, ratings).
For search and retrieval APIs, signals include: click-through on results, dwell time on clicked results, scroll depth past the first result, query reformulation after viewing results, and session-level task completion.
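To make the inventory concrete, it helps to record each signal as a typed event tied back to the retrieval that produced it. The sketch below is illustrative only: the Signal names mirror the lists above, and SignalEvent is a hypothetical structure, not a required schema.

    from dataclasses import dataclass
    from enum import Enum

    class Signal(Enum):
        # Memory-augmented LLM signals
        MEMORY_REFERENCED = "memory_referenced"
        FOLLOW_UP_QUESTION = "follow_up_question"
        USER_CORRECTION = "user_correction"
        TASK_COMPLETED = "task_completed"
        EXPLICIT_FEEDBACK = "explicit_feedback"
        # Search and retrieval API signals
        CLICK_THROUGH = "click_through"
        DWELL_TIME = "dwell_time"
        SCROLL_DEPTH = "scroll_depth"
        QUERY_REFORMULATED = "query_reformulated"

    @dataclass
    class SignalEvent:
        retrieval_id: str  # Links the signal to the retrieval being graded
        signal: Signal
        value: float       # 1.0 for binary events; a magnitude (e.g., seconds) otherwise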
Not all signals are equally reliable indicators of retrieval quality. Classify each signal as strong, moderate, or weak based on how well it correlates with actual user satisfaction.
Strong signals: Task completion, explicit user corrections, explicit ratings. These directly indicate whether the retrieval helped the user achieve their goal.
Moderate signals: Content usage in the model's response, session duration, follow-up question patterns. These correlate with quality but have confounding factors. A model might not reference a memory even though it influenced the response. A long session might mean engagement or frustration.
Weak signals: Click-through rate, scroll behavior, time on page. These are easy to measure but weakly correlated with satisfaction. A user might click on an irrelevant result out of curiosity, or might get the answer from a snippet without clicking at all.
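If you track signals programmatically, the classification can live in a small lookup table. A sketch using plain strings so it stands alone; the tier assignments follow the paragraphs above.

    # Reliability tiers per the classification above.
    SIGNAL_RELIABILITY = {
        "task_completed": "strong",
        "user_correction": "strong",
        "explicit_rating": "strong",
        "memory_referenced": "moderate",
        "session_duration": "moderate",
        "follow_up_pattern": "moderate",
        "click_through": "weak",
        "scroll_depth": "weak",
        "time_on_page": "weak",
    }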
Combine multiple signals with weights that reflect their reliability. Strong signals should have higher weights than weak ones. The composite approach is more robust than any single signal because it averages out noise from individual measurements.
class RewardFunction:
    def __init__(self):
        # Weights reflect signal reliability
        self.weights = {
            "task_completed": 3.0,      # Strong
            "explicit_positive": 2.0,   # Strong
            "memory_referenced": 1.0,   # Moderate
            "no_reformulation": 0.5,    # Moderate
            "session_continued": 0.3,   # Weak positive
        }
        self.penalties = {
            "explicit_negative": -2.5,   # Strong
            "user_correction": -2.0,     # Strong
            "query_reformulated": -0.8,  # Moderate
            "memory_ignored": -0.3,      # Weak
        }

    def compute(self, signals):
        # Sum each observed signal, weighted by its reliability,
        # into a single scalar reward for the retrieval event.
        reward = 0.0
        for signal_type, value in signals.items():
            if signal_type in self.weights:
                reward += self.weights[signal_type] * value
            elif signal_type in self.penalties:
                reward += self.penalties[signal_type] * value
        return reward

Negative signals are as important as positive ones. Without penalties for bad outcomes, the system has no incentive to avoid returning irrelevant memories. A memory that was served but ignored should receive a small penalty. A memory that caused the user to correct the model should receive a larger penalty. A memory that contributed to an incorrect answer should receive the largest penalty.
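For example, using the class above on a session where the task completed and the memory was referenced, but the user also reformulated once (the signal values here are made up for illustration):

    rf = RewardFunction()
    session_signals = {
        "task_completed": 1.0,      # Task finished: strong positive
        "memory_referenced": 1.0,   # Model used the injected memory
        "query_reformulated": 1.0,  # ...but the user had to rephrase once
    }
    print(rf.compute(session_signals))  # 3.0 + 1.0 - 0.8 = 3.2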
The key design decision is the ratio between positive and negative reward magnitudes. If positive rewards are much larger than penalties, the system learns to cast a wide net (returning many results hoping some are useful). If penalties are much larger than positive rewards, the system becomes overly conservative (returning very few results to minimize risk). A balanced ratio encourages the system to return a focused set of high-confidence results.
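A quick way to keep that ratio honest is to compare total positive and negative reward mass whenever the weights change. With the values above, positives sum to 6.8 and penalty magnitudes to 5.6, a ratio of about 1.2. The bounds in this sketch are an illustrative choice, not a fixed recommendation:

    rf = RewardFunction()
    pos_mass = sum(rf.weights.values())     # 6.8 with the weights above
    neg_mass = -sum(rf.penalties.values())  # 5.6 with the penalties above
    ratio = pos_mass / neg_mass             # ~1.21: mildly optimistic, not reckless
    assert 0.5 <= ratio <= 2.0, f"Unbalanced reward magnitudes: {ratio:.2f}"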
Reward hacking occurs when the system finds ways to maximize the reward signal without actually improving quality. For example, if the reward heavily weights "memory referenced in response," the system might learn to return popular, frequently mentioned facts that are easy for the model to reference, even when they are not actually relevant to the query.
Test for reward hacking by comparing the reward trend with a human evaluation of quality. Pull a random sample of 50 retrieval events per week, have a human rate the quality of the served results on a 1-5 scale, and compare this human rating with the computed reward. If the computed reward is increasing but the human rating is flat or declining, the reward function is being hacked.
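One way to automate the comparison is to fit a trend line to each weekly series and flag divergence. A sketch assuming you already log weekly mean computed rewards and weekly mean human ratings; the 0.05 slope threshold is arbitrary:

    import numpy as np

    def detect_reward_hacking(weekly_rewards, weekly_human_ratings, threshold=0.05):
        """Flag runs where computed reward trends up while human ratings do not."""
        weeks = np.arange(len(weekly_rewards))
        reward_slope = np.polyfit(weeks, weekly_rewards, 1)[0]
        human_slope = np.polyfit(weeks, weekly_human_ratings, 1)[0]
        # Reward climbing while human-rated quality stays flat or falls
        # is the signature of a hacked reward function.
        return reward_slope > threshold and human_slope <= 0.0

    # Example: reward rises for eight weeks while human ratings sag.
    rewards = [1.1, 1.3, 1.4, 1.7, 1.9, 2.2, 2.4, 2.6]
    ratings = [3.8, 3.7, 3.8, 3.6, 3.5, 3.5, 3.4, 3.3]
    print(detect_reward_hacking(rewards, ratings))  # True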
Once the reward function is deployed, tune the weights based on observed correlations. If the "task completed" signal turns out to be noisy (many tasks are completed without using memory at all), reduce its weight. If the "memory referenced" signal turns out to be highly predictive of human-rated quality, increase its weight. Recalibrate quarterly as usage patterns change.
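Recalibration can be as simple as a least-squares fit of observed signal values against the human ratings, then nudging the weights toward the fitted coefficients. A sketch under that assumption; the 0.5 blend factor is illustrative:

    import numpy as np

    def refit_weights(signal_matrix, human_ratings, current_weights, blend=0.5):
        """Blend current weights toward least-squares coefficients.

        signal_matrix: (n_events, n_signals) array of observed signal values.
        human_ratings: (n_events,) array of 1-5 human quality ratings.
        current_weights: (n_signals,) array in the same column order.
        """
        coeffs, *_ = np.linalg.lstsq(signal_matrix, human_ratings, rcond=None)
        # Move only part of the way toward the data-driven weights,
        # so one noisy quarter cannot whipsaw the reward function.
        return (1 - blend) * current_weights + blend * coeffs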
ACT-R as a Natural Reward
Adaptive Recall avoids the complexity of explicit reward engineering by using ACT-R activation dynamics as an implicit reward function. Every retrieval event naturally updates the activation levels of the memories involved. A memory that gets retrieved gains base-level activation (the retrieval itself is the reward). A memory that is retrieved frequently gains cumulative activation. A memory that is never retrieved loses activation through decay. The activation equations, grounded in decades of cognitive science validation, serve as the reward function without requiring manual signal weighting or calibration.
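For reference, the textbook ACT-R base-level learning equation sets a memory's activation to B = ln(sum over past retrievals j of t_j^-d), where t_j is the time since retrieval j and d is the decay rate (commonly 0.5). The sketch below implements that standard equation; it is not Adaptive Recall's internal code.

    import math

    def base_level_activation(retrieval_ages, decay=0.5):
        """Standard ACT-R base-level learning: B = ln(sum(t_j ** -d)).

        retrieval_ages: times (e.g., hours, all > 0) since each past retrieval.
        Frequent and recent retrievals raise activation; disuse decays it.
        """
        return math.log(sum(t ** -decay for t in retrieval_ages))

    # A memory retrieved 1, 5, and 24 hours ago outranks one
    # retrieved only once, 24 hours ago.
    print(base_level_activation([1.0, 5.0, 24.0]))  # ~0.50
    print(base_level_activation([24.0]))            # ~-1.59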
This approach sidesteps the reward hacking problem because the "reward" (activation gain) is directly tied to the fundamental action (being retrieved) rather than to downstream behavioral signals that may be noisy or gameable. The system naturally promotes useful memories and demotes unused ones through the same mathematical framework that models human memory.
Skip reward engineering. Adaptive Recall's cognitive scoring provides natural learning dynamics that improve retrieval without manual reward design.
Get Started Free