
How to Add Reranking to Your RAG Pipeline

Adding reranking to a RAG pipeline means inserting a scoring step between vector retrieval and LLM generation. Instead of passing the top vector-similarity results directly to the language model, you score each candidate on additional quality factors and reorder them so the best results appear first. This typically improves answer accuracy by 15 to 25 percent because the LLM receives better context.

Before You Start

This guide assumes you have a working RAG pipeline that retrieves documents from a vector store and passes them to an LLM for answer generation. You should be able to modify the retrieval step to return more candidates than you currently use and insert a post-retrieval scoring function before the generation step. If you are building a new RAG system from scratch, consider starting with the two-stage retrieval guide instead.

You should also have a way to evaluate retrieval quality. At minimum, prepare 20 to 50 test queries where you know which documents or memories contain the correct answer. This test set lets you measure whether reranking actually improves your results, because not every pipeline benefits equally.
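One lightweight way to structure this test set is a list of query and expected-document-ID pairs. The format below is illustrative, not required; use whatever identifiers your vector store already assigns:

test_queries = [
    {"query": "What is the refund window for annual plans?", "relevant_ids": ["policy-042"]},
    {"query": "Who approved the Q3 budget increase?", "relevant_ids": ["meeting-118", "email-553"]},
    # ... 20 to 50 entries covering your common query patterns
]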

Step-by-Step Implementation

Step 1: Identify your retrieval bottleneck.
RAG failures come from two places: the retrieval stage (the wrong documents are retrieved) or the generation stage (the right documents are retrieved but the LLM generates a bad answer). Reranking only helps with retrieval failures. Run your test queries and check whether the correct document appears anywhere in the top 20 vector results. If it does but is ranked too low (position 5 or worse), reranking can promote it. If it is not in the top 20 at all, the problem lies in your embedding model or chunking strategy, not in ranking.
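A quick way to run this check is to count, across your test queries, how often a relevant document appears in the top 20 and how often it is buried at position 5 or worse. The sketch below assumes the test-set format shown earlier and a vector_store.query call that returns candidates carrying an 'id' field; adjust the field names to match your store:

def diagnose_retrieval(test_queries, vector_store, embed, top_k=20):
    found, buried = 0, 0
    for case in test_queries:
        candidates = vector_store.query(embed(case["query"]), top_k=top_k)
        positions = [i for i, c in enumerate(candidates) if c["id"] in case["relevant_ids"]]
        if positions:
            found += 1
            if positions[0] >= 4:  # best relevant hit sits at position 5 or worse
                buried += 1
    print(f"relevant document in top {top_k}: {found}/{len(test_queries)}")
    print(f"found but buried at position 5+: {buried} (these are the queries reranking can fix)")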
Step 2: Choose a reranking approach.
Three main options exist, each with different trade-offs. Cross-encoder models (like BGE-reranker or Cohere Rerank) score query-document pairs using a transformer and add 50 to 200 milliseconds of latency. LLM-as-a-judge uses a language model to evaluate relevance and costs more but handles nuanced relevance judgments. Cognitive scoring uses precomputed metadata (recency, frequency, entity connections, confidence) and adds under 40 milliseconds. Choose based on your latency budget and what factors matter for your use case.
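If you choose the cross-encoder route, libraries such as sentence-transformers wrap the scoring step in a single call. A minimal sketch, assuming the BAAI/bge-reranker-base model and candidates that carry a 'text' field:

from sentence_transformers import CrossEncoder

# Load a cross-encoder reranking model (weights download on first use).
reranker = CrossEncoder("BAAI/bge-reranker-base")

def cross_encoder_rerank(query, candidates, top_k=5):
    # Score each (query, document) pair jointly with the transformer.
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]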
Step 3: Adjust your candidate retrieval count.
When using only vector similarity, you might retrieve 5 or 10 candidates. With a reranker, increase this to 20 to 50 candidates. The reranker needs a larger pool to work with because its job is to find the best results among a broader set. Retrieving more candidates from the vector store is cheap (milliseconds of additional latency), and it gives the reranker a better chance of finding the right answer even if vector similarity ranks it poorly.
# Before: retrieve top 5 by vector similarity
results = vector_store.query(query_embedding, top_k=5)

# After: retrieve top 30, then rerank to top 5
candidates = vector_store.query(query_embedding, top_k=30)
results = rerank(query, candidates, top_k=5)
Step 4: Implement the reranking function.
The reranking function takes the query and a list of candidates, scores each candidate, sorts by the new score, and returns the top results. The scoring logic depends on your chosen approach. For cognitive scoring, you combine vector similarity with base-level activation, spreading activation, and confidence weighting. For cross-encoder reranking, you run each query-candidate pair through the model and use the output score.
def rerank(query, candidates, top_k=5):
    # Score every candidate, sort by the new score, and keep the best top_k.
    scored = []
    for candidate in candidates:
        score = compute_rerank_score(query, candidate)
        scored.append((candidate, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [item[0] for item in scored[:top_k]]

def compute_rerank_score(query, candidate):
    # Vector similarity carried over from the retrieval stage.
    sim = candidate['similarity_score']
    # Base-level activation from how recently and frequently the item was accessed.
    activation = base_level_activation(candidate['access_times'])
    # Stored confidence on a 0-10 scale, normalized to 0-1.
    confidence = candidate.get('confidence', 5.0) / 10.0
    # Weighted combination of the three signals.
    return (0.4 * sim) + (0.35 * sigmoid(activation)) + (0.25 * confidence)
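The cognitive scoring version above calls two helpers that are not defined here. A minimal sketch of plausible implementations, assuming an ACT-R-style base-level activation (the log of a power-law decay summed over past access timestamps) and a standard logistic squash; the decay constant and time units are assumptions to tune against your own data:

import math
import time

def sigmoid(x):
    # Standard logistic function: maps activation into the 0-1 range.
    return 1.0 / (1.0 + math.exp(-x))

def base_level_activation(access_times, decay=0.5):
    # ACT-R-style base-level activation: ln(sum of age ** -decay), where age is
    # the time since each past access in hours (the hour unit is an assumption).
    now = time.time()
    ages = [max((now - t) / 3600.0, 1e-3) for t in access_times]
    return math.log(sum(age ** -decay for age in ages)) if ages else 0.0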
Step 5: Set a score threshold.
Not all candidates deserve to be passed to the LLM. After reranking, filter out any candidate with a combined score below a threshold. This prevents low-quality, tangentially relevant results from diluting the context. Start with a threshold that keeps 3 to 5 results for most queries and adjust based on your test set performance. Too high a threshold means the LLM sometimes gets no context; too low means it gets noise.
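The threshold is easy to add on top of the rerank function from Step 4. The 0.55 cutoff below is only an illustrative starting point; tune it against your test set:

def rerank_with_threshold(query, candidates, top_k=5, min_score=0.55):
    scored = [(c, compute_rerank_score(query, c)) for c in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    # Drop anything below the cutoff, then cap the result count at top_k.
    kept = [candidate for candidate, score in scored if score >= min_score]
    return kept[:top_k]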
Step 6: Measure and tune.
Run your test queries through the pipeline with and without reranking. Measure recall at k (how often the correct answer appears in the top k results) and mean reciprocal rank (the average position of the correct answer). If reranking improves both metrics, you have a working configuration. If one metric improves while the other degrades, adjust your score weights to balance the trade-off. Increase the vector similarity weight if relevant results are being pushed too low; increase the activation or confidence weight if stale or unreliable results are ranking too high.
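Both metrics take only a few lines to compute. The sketch below assumes the earlier test-set format and a run_pipeline(query) function that returns ranked results carrying an 'id' field:

def evaluate(test_queries, run_pipeline, k=5):
    hits, reciprocal_ranks = 0, []
    for case in test_queries:
        results = run_pipeline(case["query"])
        # 1-based position of the first relevant result, or None if it never appears.
        rank = next((i + 1 for i, r in enumerate(results) if r["id"] in case["relevant_ids"]), None)
        if rank is not None and rank <= k:
            hits += 1
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    recall_at_k = hits / len(test_queries)
    mrr = sum(reciprocal_ranks) / len(test_queries)
    return recall_at_k, mrr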

Common Pitfalls

The most common mistake is reranking too few candidates. If you only retrieve the top 5 from vector search and then rerank them, the reranker can only reorder those 5. The correct answer might have been at position 8 or 15 in the vector results, and you never gave the reranker a chance to find it. Always retrieve at least 3 to 5 times as many candidates as you plan to return after reranking.

Another common issue is ignoring the latency impact. Cross-encoder reranking adds 50 to 200 milliseconds per query, and LLM-as-a-judge can add 500 milliseconds or more. If your application has a tight latency budget, cognitive scoring (under 40 milliseconds) or a lightweight cross-encoder might be better than a large reranking model. Measure the end-to-end latency, not just the accuracy improvement.
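When you compare configurations, time the full request rather than the reranker in isolation. A simple sketch, assuming an answer_query(query) function that runs retrieval, reranking, and generation end to end:

import time

def average_latency_ms(queries, answer_query):
    # Wall-clock time for the whole pipeline, averaged over the test queries.
    start = time.perf_counter()
    for q in queries:
        answer_query(q)
    return (time.perf_counter() - start) * 1000 / len(queries)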

Finally, do not assume reranking always helps. For curated, static knowledge bases with a few hundred well-organized documents, vector similarity alone often produces good enough rankings. Reranking provides the most benefit when the document store is large, dynamic, and contains overlapping or contradictory information, exactly the conditions where cognitive scoring shines.

Skip building your own reranking pipeline. Adaptive Recall applies cognitive scoring to every retrieval call automatically.

Get Started Free