How to Implement Cross-Encoder Reranking
Before You Start
Cross-encoder reranking requires either local inference hardware (a GPU, or a fast CPU for smaller models) or access to a hosted reranking API. If you are running locally, you need Python with the sentence-transformers library. If you are using a hosted service, Cohere Rerank and similar APIs provide cross-encoder scoring without local infrastructure. Plan for 50 to 200 milliseconds of added latency per query when scoring 20 to 30 candidates.
You also need an existing vector retrieval stage that returns candidates. Cross-encoders are too slow to score against a full document store, so they operate on a pre-filtered candidate set from vector search. This guide assumes you have that stage working already.
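The examples in this guide assume candidates arrive as dictionaries with a content field holding the document text. Here is a minimal sketch of that assumed shape; the field names are illustrative, not a required schema:

```python
# Hypothetical candidate format assumed by the examples in this guide.
# Adjust the field names to match what your vector store actually returns.
candidates = [
    {"id": "doc-42", "content": "Full text of the first retrieved chunk...", "vector_score": 0.83},
    {"id": "doc-17", "content": "Full text of the second retrieved chunk...", "vector_score": 0.79},
]
```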
Step-by-Step Implementation
Step 1: Choose a Reranking Model
The model choice depends on your accuracy needs and deployment constraints. BGE-reranker-v2 from BAAI is a strong open-source option that runs well on GPU and achieves competitive accuracy on standard benchmarks. MS MARCO fine-tuned models (like cross-encoder/ms-marco-MiniLM-L-12-v2) are smaller and faster but slightly less accurate. Cohere Rerank is a hosted API that requires no local infrastructure and handles scaling automatically. For most applications, BGE-reranker-v2 provides the best balance of accuracy and speed.
Step 2: Set Up the Scoring Function
For local inference, install sentence-transformers and load the cross-encoder model. The first inference call downloads the model weights (typically 400 MB to 1.5 GB); subsequent calls use the cached model. For hosted inference, set up your API key and client library. Either way, wrap the model in a scoring function that accepts a query and a document and returns a relevance score between 0 and 1.
```python
# Local cross-encoder with sentence-transformers
from sentence_transformers import CrossEncoder

# Loads (and on first use, downloads) the reranker weights
model = CrossEncoder('BAAI/bge-reranker-v2-m3', max_length=512)

def cross_encode_score(query: str, document: str) -> float:
    # predict() scores (query, document) pairs; one score per pair
    score = model.predict([(query, document)])[0]
    return float(score)
```
```python
# Hosted cross-encoder with Cohere
import cohere

co = cohere.Client(api_key="YOUR_API_KEY")

def cross_encode_batch(query: str, documents: list) -> list:
    # rerank() scores every document against the query in one request
    response = co.rerank(
        query=query,
        documents=documents,
        model="rerank-english-v3.0",
        top_n=len(documents)
    )
    # Each result carries the original document index and a relevance score
    return [(r.index, r.relevance_score) for r in response.results]
```

Step 3: Build the Reranking Function
Create a function that takes the query and a list of candidates, scores each pair, and returns the candidates with their cross-encoder scores attached. The cross-encoder processes the query and document text together, so you need to pass the actual text content, not just embeddings.
```python
def rerank_with_cross_encoder(query: str, candidates: list,
                              top_k: int = 5) -> list:
    # Build (query, document text) pairs for every candidate
    pairs = [(query, c['content']) for c in candidates]
    # Score all pairs in one batched inference call
    scores = model.predict(pairs)
    for candidate, score in zip(candidates, scores):
        candidate['cross_encoder_score'] = float(score)
    # Highest cross-encoder score first; keep the top k
    candidates.sort(key=lambda x: x['cross_encoder_score'], reverse=True)
    return candidates[:top_k]
```

Always pass all query-candidate pairs to the model in a single batch rather than making individual calls. Batch inference amortizes model loading and GPU kernel launch overhead across all pairs. For 20 candidates, batch inference is typically 5 to 10 times faster than sequential inference. If you are using a hosted API, check whether it supports batch requests and use them.
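To see the batching effect directly, the sketch below times sequential scoring against a single batched call. It assumes the model and pairs from the snippets above; batch_size is a standard CrossEncoder.predict parameter you can tune to your hardware.

```python
import time

# Sequential scoring: one forward pass per pair (for comparison only)
start = time.perf_counter()
sequential_scores = [model.predict([pair])[0] for pair in pairs]
print(f"sequential: {time.perf_counter() - start:.3f}s")

# Batched scoring: all pairs in one call, chunked internally by batch_size
start = time.perf_counter()
batched_scores = model.predict(pairs, batch_size=32)
print(f"batched:    {time.perf_counter() - start:.3f}s")
```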
Step 4: Integrate Into the Retrieval Pipeline
Insert the cross-encoder reranking step between vector retrieval and result delivery. The vector store returns candidates ranked by embedding similarity. The cross-encoder reranks those candidates by semantic relevance. Optionally, combine the cross-encoder score with other factors (recency, confidence) for a final multi-dimensional ranking.
```python
async def retrieve_and_rerank(query: str, final_k: int = 5):
    # Stage 1: vector similarity search over the full store
    candidates = await vector_store.query(query, top_k=30)
    # Stage 2: cross-encoder reranking of the candidate set
    reranked = rerank_with_cross_encoder(query, candidates, top_k=final_k)
    return reranked
```

Step 5: Tune for Latency
If cross-encoder latency exceeds your budget, you have several options. Reduce the candidate count from 30 to 15 or 20, which linearly reduces inference time. Use a smaller model: MiniLM-based cross-encoders run 3 to 5 times faster than large models. Quantize the model to INT8 for a roughly 2x speedup with minimal accuracy loss. Move inference to GPU if you are running on CPU. For the tightest latency budgets, consider using cognitive scoring instead, which achieves different (but complementary) quality improvements at under 40 milliseconds.
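For the INT8 option, one common approach is PyTorch dynamic quantization of the cross-encoder's linear layers. A minimal sketch, assuming CPU inference and that the wrapped Hugging Face model is exposed as model.model (as in current sentence-transformers versions):

```python
import torch

# Quantize the linear layers of the underlying transformer to INT8.
# This targets CPU inference; on GPU, FP16 is the more common optimization.
model.model = torch.quantization.quantize_dynamic(
    model.model, {torch.nn.Linear}, dtype=torch.qint8
)

# Scoring works exactly as before, at roughly 2x CPU throughput
score = cross_encode_score("what is reranking?", "Reranking reorders candidates by relevance.")
```

Measure accuracy on a held-out query set after quantizing; the loss is usually small but not zero.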
Cross-Encoders vs Cognitive Scoring
Cross-encoders and cognitive scoring are complementary approaches that improve ranking in different ways. Cross-encoders improve semantic relevance scoring by processing query and document together with cross-attention. They are better at determining whether a document truly answers a question rather than just discussing the same topic. Cognitive scoring improves ranking by adding temporal, relational, and reliability dimensions that cross-encoders cannot see because those factors are not in the text.
The best results come from combining both: use vector search for candidate retrieval, cross-encoder for semantic reranking, and cognitive scoring for multi-factor final ranking. Adaptive Recall includes cognitive scoring natively and can be paired with cross-encoder models in a three-stage pipeline for maximum accuracy.
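As an illustration of that three-stage shape, here is a hedged sketch of a final ranking pass that blends the cross-encoder score with temporal and reliability signals. The recency_score and reliability_score helpers and the weights are hypothetical placeholders, not Adaptive Recall's API; substitute the cognitive signals your system actually exposes.

```python
def combined_rank(query: str, candidates: list, top_k: int = 5) -> list:
    # Stage 2: attach cross-encoder semantic scores to every candidate
    scored = rerank_with_cross_encoder(query, candidates, top_k=len(candidates))
    # Stage 3: blend in factors the cross-encoder cannot see.
    # recency_score() and reliability_score() are hypothetical stand-ins.
    for c in scored:
        c['final_score'] = (
            0.6 * c['cross_encoder_score']
            + 0.25 * recency_score(c)      # e.g. decay on document age
            + 0.15 * reliability_score(c)  # e.g. source confidence
        )
    scored.sort(key=lambda x: x['final_score'], reverse=True)
    return scored[:top_k]
```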
Model Size and Speed Trade-offs
| Model | Parameters | Latency (20 candidates) | Relative Accuracy |
|---|---|---|---|
| MiniLM-L-6 | 22M | 15-30ms (GPU) | Baseline |
| MiniLM-L-12 | 33M | 25-50ms (GPU) | +3-5% |
| BGE-reranker-v2 | 560M | 80-150ms (GPU) | +8-12% |
| Cohere Rerank v3 | Hosted | 100-200ms (API) | +10-15% |
Combine cross-encoder precision with cognitive scoring intelligence. Adaptive Recall provides the cognitive layer, and you bring the cross-encoder of your choice.
Try It Free