How to Implement Cross-Encoder Reranking
Before You Start
Cross-encoder reranking requires either local inference hardware (a GPU, or a fast CPU for smaller models) or access to a hosted reranking API. If you are running locally, you need Python with the sentence-transformers library. If you are using a hosted service, Cohere Rerank and similar APIs provide cross-encoder scoring without local infrastructure. Plan for 50 to 200 milliseconds of added latency per query when scoring 20 to 30 candidates.
You also need an existing vector retrieval stage that returns candidates. Cross-encoders are too slow to score against a full document store, so they operate on a pre-filtered candidate set from vector search. This guide assumes you have that stage working already.
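The examples in this guide assume candidates arrive as dictionaries with a content field holding the document text. Here is a minimal sketch of that assumed shape; the field names are illustrative, not a required schema:

```python
# Hypothetical candidate format assumed by the examples in this guide.
# Adjust the field names to match what your vector store actually returns.
candidates = [
    {"id": "doc-42", "content": "Full text of the first retrieved chunk...", "vector_score": 0.83},
    {"id": "doc-17", "content": "Full text of the second retrieved chunk...", "vector_score": 0.79},
]
```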
Step-by-Step Implementation
Step 1: Choose a Reranking Model
The model choice depends on your accuracy needs and deployment constraints. BGE-reranker-v2 from BAAI is a strong open-source option that runs well on GPU and achieves competitive accuracy on standard benchmarks. MS MARCO fine-tuned models (like cross-encoder/ms-marco-MiniLM-L-12-v2) are smaller and faster but slightly less accurate. Cohere Rerank is a hosted API that requires no local infrastructure and handles scaling automatically. For most applications, BGE-reranker-v2 provides the best balance of accuracy and speed.
Step 2: Set Up the Scoring Function
For local inference, install sentence-transformers and load the cross-encoder model. The first inference call downloads the model weights (typically 400 MB to 1.5 GB); subsequent calls use the cached model. For hosted inference, set up your API key and client library. Either way, wrap the model in a scoring function that accepts a query and a document and returns a relevance score between 0 and 1.
```python
# Local cross-encoder with sentence-transformers
from sentence_transformers import CrossEncoder

# Loads (and on first use, downloads) the reranker weights
model = CrossEncoder('BAAI/bge-reranker-v2-m3', max_length=512)

def cross_encode_score(query: str, document: str) -> float:
    # predict() scores (query, document) pairs; one score per pair
    score = model.predict([(query, document)])[0]
    return float(score)
```
```python
# Hosted cross-encoder with Cohere
import cohere

co = cohere.Client(api_key="YOUR_API_KEY")

def cross_encode_batch(query: str, documents: list) -> list:
    # rerank() scores every document against the query in one request
    response = co.rerank(
        query=query,
        documents=documents,
        model="rerank-english-v3.0",
        top_n=len(documents)
    )
    # Each result carries the original document index and a relevance score
    return [(r.index, r.relevance_score) for r in response.results]
```

Step 3: Build the Reranking Function
Create a function that takes the query and a list of candidates, scores each pair, and returns the candidates with their cross-encoder scores attached. The cross-encoder processes the query and document text together, so you need to pass the actual text content, not just embeddings.
```python
def rerank_with_cross_encoder(query: str, candidates: list,
                              top_k: int = 5) -> list:
    # Build (query, document text) pairs for every candidate
    pairs = [(query, c['content']) for c in candidates]
    # Score all pairs in one batched inference call
    scores = model.predict(pairs)
    for candidate, score in zip(candidates, scores):
        candidate['cross_encoder_score'] = float(score)
    # Highest cross-encoder score first; keep the top k
    candidates.sort(key=lambda x: x['cross_encoder_score'], reverse=True)
    return candidates[:top_k]
```

Always pass all query-candidate pairs to the model in a single batch rather than making individual calls. Batch inference amortizes model loading and GPU kernel launch overhead across all pairs. For 20 candidates, batch inference is typically 5 to 10 times faster than sequential inference. If you are using a hosted API, check whether it supports batch requests and use them.
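To see the batching effect directly, the sketch below times sequential scoring against a single batched call. It assumes the model and pairs from the snippets above; batch_size is a standard CrossEncoder.predict parameter you can tune to your hardware.

```python
import time

# Sequential scoring: one forward pass per pair (for comparison only)
start = time.perf_counter()
sequential_scores = [model.predict([pair])[0] for pair in pairs]
print(f"sequential: {time.perf_counter() - start:.3f}s")

# Batched scoring: all pairs in one call, chunked internally by batch_size
start = time.perf_counter()
batched_scores = model.predict(pairs, batch_size=32)
print(f"batched:    {time.perf_counter() - start:.3f}s")
```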
Step 4: Integrate Into the Retrieval Pipeline
Insert the cross-encoder reranking step between vector retrieval and result delivery. The vector store returns candidates ranked by embedding similarity. The cross-encoder reranks those candidates by semantic relevance. Optionally, combine the cross-encoder score with other factors (recency, confidence) for a final multi-dimensional ranking.
```python
async def retrieve_and_rerank(query: str, final_k: int = 5):
    # Stage 1: vector similarity search over the full store
    candidates = await vector_store.query(query, top_k=30)
    # Stage 2: cross-encoder reranking of the candidate set
    reranked = rerank_with_cross_encoder(query, candidates, top_k=final_k)
    return reranked
```

Step 5: Tune for Latency
If cross-encoder latency exceeds your budget, you have several options. Reduce the candidate count from 30 to 15 or 20, which linearly reduces inference time. Use a smaller model: MiniLM-based cross-encoders run 3 to 5 times faster than large models. Quantize the model to INT8 for a roughly 2x speedup with minimal accuracy loss. Move inference to GPU if you are running on CPU. For the tightest latency budgets, consider using cognitive scoring instead, which achieves different (but complementary) quality improvements at under 40 milliseconds.
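For the INT8 option, one common approach is PyTorch dynamic quantization of the cross-encoder's linear layers. A minimal sketch, assuming CPU inference and that the wrapped Hugging Face model is exposed as model.model (as in current sentence-transformers versions):

```python
import torch

# Quantize the linear layers of the underlying transformer to INT8.
# This targets CPU inference; on GPU, FP16 is the more common optimization.
model.model = torch.quantization.quantize_dynamic(
    model.model, {torch.nn.Linear}, dtype=torch.qint8
)

# Scoring works exactly as before, at roughly 2x CPU throughput
score = cross_encode_score("what is reranking?", "Reranking reorders candidates by relevance.")
```

Measure accuracy on a held-out query set after quantizing; the loss is usually small but not zero.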
Cross-Encoders vs Cognitive Scoring
Cross-encoders and cognitive scoring are complementary approaches that improve ranking in different ways. Cross-encoders improve semantic relevance scoring by processing query and document together with cross-attention. They are better at determining whether a document truly answers a question rather than just discussing the same topic. Cognitive scoring improves ranking by adding temporal, relational, and reliability dimensions that cross-encoders cannot see because those factors are not in the text.
The best results come from combining both: use vector search for candidate retrieval, cross-encoder for semantic reranking, and cognitive scoring for multi-factor final ranking. Adaptive Recall includes cognitive scoring natively and can be paired with cross-encoder models in a three-stage pipeline for maximum accuracy.
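As an illustration of that three-stage shape, here is a hedged sketch of a final ranking pass that blends the cross-encoder score with temporal and reliability signals. The recency_score and reliability_score helpers and the weights are hypothetical placeholders, not Adaptive Recall's API; substitute the cognitive signals your system actually exposes.

```python
def combined_rank(query: str, candidates: list, top_k: int = 5) -> list:
    # Stage 2: attach cross-encoder semantic scores to every candidate
    scored = rerank_with_cross_encoder(query, candidates, top_k=len(candidates))
    # Stage 3: blend in factors the cross-encoder cannot see.
    # recency_score() and reliability_score() are hypothetical stand-ins.
    for c in scored:
        c['final_score'] = (
            0.6 * c['cross_encoder_score']
            + 0.25 * recency_score(c)      # e.g. decay on document age
            + 0.15 * reliability_score(c)  # e.g. source confidence
        )
    scored.sort(key=lambda x: x['final_score'], reverse=True)
    return scored[:top_k]
```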
Model Size and Speed Trade-offs
| Model | Parameters | Latency (20 candidates) | Relative Accuracy |
|---|---|---|---|
| MiniLM-L-6 | 22M | 15-30ms (GPU) | Baseline |
| MiniLM-L-12 | 33M | 25-50ms (GPU) | +3-5% |
| BGE-reranker-v2 | 560M | 80-150ms (GPU) | +8-12% |
| Cohere Rerank v3 | Hosted | 100-200ms (API) | +10-15% |
Combine cross-encoder precision with cognitive scoring intelligence. Adaptive Recall provides the cognitive layer, and you bring the cross-encoder of your choice.
Try It Free