Multi-Armed Bandits for Retrieval Explained
The Bandit Problem
Imagine a row of slot machines, each with a different, unknown payout rate. You have a fixed budget of pulls. You want to maximize your total winnings. If you only play the machine that has paid out best so far (pure exploitation), you might miss a better machine you never tried. If you spread your pulls evenly across all machines (pure exploration), you waste pulls on machines you already know are bad. The optimal strategy balances exploration (trying uncertain machines) with exploitation (playing the best known machine).
In retrieval, each "machine" is a ranking strategy or parameter configuration. Each "pull" is a query served to a user. The "payout" is user satisfaction with the results. The system needs to discover which ranking strategy works best while also delivering good results to users during the discovery process.
Three Common Algorithms
Epsilon-greedy is the simplest approach. With probability 1-epsilon (typically 0.9 to 0.95), serve results using the best-known ranking strategy. With probability epsilon (0.05 to 0.10), select a strategy uniformly at random. This guarantees exploration at a fixed rate. The drawback is that the exploration rate never decreases: the system keeps exploring at the same rate even after the best strategy is well established.
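A minimal sketch in Python, assuming each served query yields a scalar reward in [0, 1] (for example a click or satisfaction signal); the class and strategy bookkeeping here are illustrative, not a specific library's API:

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy selection over ranking strategies (illustrative sketch)."""

    def __init__(self, strategies, epsilon=0.05):
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.counts = {s: 0 for s in self.strategies}    # pulls per strategy
        self.values = {s: 0.0 for s in self.strategies}  # running mean reward

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best mean.
        if random.random() < self.epsilon:
            return random.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.values[s])

    def update(self, strategy, reward):
        # Incremental mean update: v <- v + (r - v) / n
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.values[strategy] += (reward - self.values[strategy]) / n
```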
Upper Confidence Bound (UCB) selects the strategy with the highest upper bound on estimated quality. Strategies that have been tried fewer times have wider confidence intervals, so they get selected more often for exploration. As data accumulates, confidence intervals narrow, and the algorithm naturally shifts from exploration to exploitation. UCB provides a principled exploration schedule without a fixed epsilon parameter.
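One common formulation is UCB1, sketched below under the same reward assumption. The sqrt(2 ln t / n) bonus term is what gives rarely-tried strategies a wide interval and frequently-tried ones a narrow one:

```python
import math

class UCB1:
    """UCB1: pick the strategy with the highest optimistic estimate
    mean + sqrt(2 * ln(total pulls) / pulls of that strategy)."""

    def __init__(self, strategies):
        self.strategies = list(strategies)
        self.counts = {s: 0 for s in self.strategies}
        self.values = {s: 0.0 for s in self.strategies}
        self.total = 0

    def select(self):
        # Play each strategy once before applying the UCB formula.
        for s in self.strategies:
            if self.counts[s] == 0:
                return s

        def ucb(s):
            bonus = math.sqrt(2 * math.log(self.total) / self.counts[s])
            return self.values[s] + bonus

        return max(self.strategies, key=ucb)

    def update(self, strategy, reward):
        self.total += 1
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.values[strategy] += (reward - self.values[strategy]) / n
```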
Thompson sampling maintains a probability distribution (typically Beta) over each strategy's quality and samples from it to select the next strategy. Strategies with uncertain quality can produce high samples, driving exploration. Strategies with well-known quality produce predictable samples centered on the true value, driving exploitation. Thompson sampling adapts its exploration rate automatically, and with discounted or windowed posterior updates it can also track non-stationary environments where strategy quality changes over time.
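A Beta-Bernoulli sketch, assuming binary satisfied/not-satisfied feedback; the optional decay parameter is one common way to discount old evidence for non-stationary settings:

```python
import random

class ThompsonSampling:
    """Beta-Bernoulli Thompson sampling: keep a Beta(alpha, beta) posterior
    per strategy and play the strategy with the highest posterior draw."""

    def __init__(self, strategies):
        # Beta(1, 1) is a uniform prior over each strategy's success rate.
        self.params = {s: [1.0, 1.0] for s in strategies}

    def select(self):
        samples = {s: random.betavariate(a, b)
                   for s, (a, b) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, strategy, reward, decay=1.0):
        # reward is 1 (satisfied) or 0 (not). A decay < 1 down-weights
        # old evidence so the posterior can track drifting quality.
        a, b = self.params[strategy]
        self.params[strategy] = [decay * a + reward,
                                 decay * b + (1 - reward)]
```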
Contextual Bandits
Standard bandits assume the same strategy is optimal for all queries. In practice, different query types benefit from different ranking strategies. Factual queries ("what database does the project use?") might benefit from confidence-heavy ranking. Exploratory queries ("what else is relevant?") might benefit from diversity-heavy ranking. Contextual bandits learn separate strategy preferences for different contexts, such as query type, user experience level, or time of day.
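A simple way to sketch this is one Beta-Bernoulli sampler per discrete context; the context labels and strategy names below are illustrative assumptions, not a prescribed taxonomy:

```python
import random

class ContextualThompson:
    """One Thompson sampler per discrete context (e.g. query type)."""

    def __init__(self, strategies):
        self.strategies = list(strategies)
        self.params = {}  # (context, strategy) -> [alpha, beta]

    def select(self, context):
        samples = {}
        for s in self.strategies:
            a, b = self.params.setdefault((context, s), [1.0, 1.0])
            samples[s] = random.betavariate(a, b)
        return max(samples, key=samples.get)

    def update(self, context, strategy, reward):
        a, b = self.params[(context, strategy)]
        self.params[(context, strategy)] = [a + reward, b + (1 - reward)]

# Usage: route each query through its context.
bandit = ContextualThompson(["confidence_heavy", "diversity_heavy"])
arm = bandit.select(context="factual")
bandit.update(context="factual", strategy=arm, reward=1)
```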
Contextual bandits are more powerful but require more data to learn because they partition the feedback signal across multiple contexts. Start with standard bandits and add context dimensions only when you have enough traffic to learn context-specific preferences reliably.
Bandits in Memory Systems
For memory retrieval, the "arms" might represent different retrieval configurations: how many memories to retrieve, how much weight to give recency versus similarity, whether to include graph-traversal results, and what confidence threshold to apply. The bandit discovers the optimal configuration for your specific application and user base without requiring manual tuning.
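One way to set this up is to enumerate a small grid of configurations and treat each combination as an arm; the parameter names and values here are hypothetical, not a specific system's settings:

```python
from itertools import product

# Hypothetical retrieval parameters; each combination is one bandit arm.
TOP_K = [5, 10, 20]
RECENCY_WEIGHT = [0.2, 0.5, 0.8]
USE_GRAPH = [True, False]

arms = [
    {"top_k": k, "recency_weight": w, "use_graph": g}
    for k, w, g in product(TOP_K, RECENCY_WEIGHT, USE_GRAPH)
]
# 18 arms here. A bandit over this grid replaces manual tuning, but the
# arm count grows multiplicatively, so keep the grid coarse at first.
```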
Adaptive Recall achieves similar adaptive behavior through ACT-R's spreading activation mechanism, which naturally adjusts the influence of different retrieval signals based on the entity context of each query. This provides context-dependent ranking without explicit contextual bandit machinery, because the graph structure encodes the contextual relationships that drive different ranking preferences.
Get adaptive ranking without building bandit infrastructure. Adaptive Recall's cognitive scoring adjusts to your usage patterns automatically.
Get Started Free