What Is the Best Open-Source Reranking Model?
The Top Contenders
BGE-reranker-v2-m3 (BAAI)
BGE-reranker-v2-m3 is a 560M parameter cross-encoder trained by the Beijing Academy of Artificial Intelligence. It supports multiple languages (English, Chinese, Japanese, Korean, and more) and achieves top scores on the MTEB reranking benchmarks. Its accuracy is within 1 to 2 percentage points of Cohere Rerank v3 on most English evaluation sets, making it the strongest open-source option for applications where accuracy is the priority.
The trade-off is size and speed. At 560M parameters, it requires a GPU with at least 2 GB of VRAM and takes 80 to 150 milliseconds to score 20 candidates. CPU inference is possible but slow (500ms or more for 20 candidates). For applications with GPU infrastructure and a latency budget of 100 to 200ms for reranking, this is the recommended choice.
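A minimal usage sketch with the FlagEmbedding library, which publishes this model; the query and passages are illustrative placeholders:

```python
from FlagEmbedding import FlagReranker

# Load the 560M cross-encoder; fp16 roughly halves VRAM use on GPU.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "How do I rotate API keys safely?"  # illustrative
candidates = [
    "Issue a new key, migrate clients, then revoke the old key.",
    "Our office is open Monday through Friday.",
]

# compute_score takes [query, passage] pairs and returns one
# relevance score per pair (higher = more relevant).
scores = reranker.compute_score([[query, c] for c in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```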
MS MARCO MiniLM Cross-Encoders
The cross-encoder/ms-marco-MiniLM family includes variants with 6 and 12 transformer layers. The L-6 variant (22M parameters) scores 20 candidates in 15 to 30ms on GPU, making it one of the fastest cross-encoders available. The L-12 variant (33M parameters) is slightly more accurate but takes 25 to 50ms. Both are trained on the MS MARCO passage-ranking dataset, which makes them well suited for English question answering and information retrieval tasks.
These models are the best choice when you need reranking with minimal latency overhead. The accuracy gap compared to BGE-reranker-v2 is 5 to 8 percentage points on NDCG@10, which is significant but acceptable for many applications, especially when combined with cognitive scoring that adds value through non-semantic dimensions.
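Usage through sentence-transformers' CrossEncoder is a few lines; a sketch with placeholder inputs:

```python
from sentence_transformers import CrossEncoder

# L-6 variant; swap in cross-encoder/ms-marco-MiniLM-L-12-v2 for the larger model.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "what causes inflation"  # illustrative
docs = [
    "Inflation rises when demand outpaces supply or the money supply grows.",
    "The weather today is sunny with light winds.",
]

# predict returns one relevance score per (query, doc) pair.
scores = model.predict([(query, d) for d in docs])
top = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
```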
GTE-reranker (Alibaba)
GTE-reranker from Alibaba is a newer entry that competes with BGE-reranker on accuracy benchmarks. It uses a modified transformer architecture optimized for long document pairs (up to 8192 tokens), which makes it particularly suitable for reranking long passages or full documents rather than short chunks. Accuracy is comparable to BGE-reranker-v2 on English tasks and slightly better on some multilingual benchmarks.
The model is available in multiple sizes, from a compact 137M parameter version to a full 560M version. The compact version offers a good middle ground between MiniLM speed and BGE accuracy.
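A sketch using the plain transformers API, following the usage pattern on the GTE reranker model cards; the model ID below is one published checkpoint and may differ from the size you choose, and trust_remote_code is needed for its custom architecture:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; pick the GTE reranker size you need.
model_id = "Alibaba-NLP/gte-multilingual-reranker-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
)
model.eval()

pairs = [
    ["what is late interaction", "Late interaction scores query and document tokens separately."],
    ["what is late interaction", "Unrelated text about office hours."],
]
with torch.no_grad():
    # max_length can be raised toward 8192 for long-document pairs.
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    scores = model(**inputs).logits.view(-1).float()  # one score per pair
```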
ColBERTv2 and RAGatouille
ColBERTv2 is a late-interaction model that works differently from traditional cross-encoders. Instead of processing query-document pairs jointly, it encodes query and document into per-token embeddings independently, then scores the pair by summing, for each query token, its maximum similarity to any document token (the MaxSim operation). Because document representations can be precomputed, ColBERT is suitable as both a retriever and a reranker.
The RAGatouille library provides a convenient Python interface for ColBERT models. Accuracy is between MiniLM and BGE-reranker, with the advantage of faster reranking (10 to 30ms for precomputed documents) and the ability to serve as a standalone retriever without a separate vector database.
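A sketch of reranking through RAGatouille; the checkpoint name is the standard ColBERTv2 release on Hugging Face, and the query and documents are placeholders:

```python
from ragatouille import RAGPretrainedModel

reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# rerank scores the candidates against the query and returns them
# ordered by late-interaction (MaxSim) score.
results = reranker.rerank(
    query="How do I rotate API keys safely?",  # illustrative
    documents=[
        "Issue a new key, migrate clients, then revoke the old key.",
        "Our office is open Monday through Friday.",
    ],
    k=2,
)
```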
Comparison Table
| Model | Parameters | Latency (20 docs, GPU) | NDCG@10 vs. MiniLM-L-6 | Best For |
|---|---|---|---|---|
| MiniLM-L-6 | 22M | 15-30ms | Baseline | Low-latency reranking |
| MiniLM-L-12 | 33M | 25-50ms | +3-5% | Balanced speed/accuracy |
| GTE-reranker (compact) | 137M | 40-80ms | +5-8% | Long documents |
| BGE-reranker-v2 | 560M | 80-150ms | +8-12% | Maximum accuracy |
| ColBERTv2 | 110M | 10-30ms* | +5-9% | Dual retriever/reranker |

*With precomputed document embeddings.
How to Choose
For most applications, start with MiniLM-L-6 (fastest, simplest to deploy, good enough accuracy for many use cases). If accuracy on your test set is not sufficient, upgrade to BGE-reranker-v2. If you need multilingual support, BGE-reranker-v2-m3 is the clear choice. If you are already using ColBERT for retrieval, use it for reranking as well to avoid maintaining two model pipelines.
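The same guidance, condensed into an illustrative helper; the thresholds and model IDs simply mirror the advice above, not a benchmarked decision rule:

```python
def pick_reranker(multilingual: bool, latency_budget_ms: int,
                  already_using_colbert: bool) -> str:
    """Illustrative decision helper mirroring the guidance above."""
    if already_using_colbert:
        return "colbert-ir/colbertv2.0"  # one pipeline for retrieval + reranking
    if multilingual:
        return "BAAI/bge-reranker-v2-m3"  # clear multilingual choice
    if latency_budget_ms < 50:
        return "cross-encoder/ms-marco-MiniLM-L-6-v2"  # fastest, simplest
    return "BAAI/bge-reranker-v2-m3"  # upgrade path for maximum accuracy
```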
Regardless of which model you choose, consider layering cognitive scoring on top of the cross-encoder results. The cross-encoder improves semantic precision, while cognitive scoring adds recency, confidence, and entity awareness. The combination addresses both types of retrieval errors: semantically imprecise results (fixed by cross-encoder) and stale or unreliable results (fixed by cognitive scoring).
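A hedged sketch of that layering; the weights, decay constant, and signal names below are assumptions for illustration, not part of any of these libraries:

```python
def combined_score(semantic: float, age_days: float,
                   confidence: float, entity_overlap: float) -> float:
    """Blend cross-encoder relevance with non-semantic signals.

    All inputs are assumed pre-normalized to [0, 1]; the weights are
    illustrative starting points, not tuned values.
    """
    recency = 1.0 / (1.0 + age_days / 30.0)  # ~half weight after a month
    return (0.6 * semantic           # semantic precision from the cross-encoder
            + 0.2 * recency          # penalize stale results
            + 0.1 * confidence       # source reliability
            + 0.1 * entity_overlap)  # entity awareness
```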
Open Source vs Hosted
Hosted reranking APIs (Cohere Rerank, Jina Reranker) offer higher accuracy and zero deployment overhead at the cost of per-query pricing and API latency (100 to 200ms network round trip). Open-source models require GPU infrastructure and deployment effort but have zero per-query cost and lower latency (no network round trip). For high-volume applications (over 10,000 queries per day), open-source models typically pay for themselves within a month through API cost savings.
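A back-of-the-envelope break-even check; both prices below are placeholder assumptions, not quotes from any provider:

```python
# Placeholder rates; substitute your provider's actual pricing.
queries_per_day = 10_000
hosted_cost_per_1k = 1.00    # assumed $/1k rerank calls
gpu_cost_per_month = 250.00  # assumed small GPU instance, $/month

hosted_monthly = queries_per_day * 30 / 1_000 * hosted_cost_per_1k  # $300
print(f"hosted ${hosted_monthly:.0f}/mo vs self-hosted ${gpu_cost_per_month:.0f}/mo")
```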
Add cognitive scoring on top of any reranker. Adaptive Recall provides the recency, confidence, and entity layers that complement open-source cross-encoder precision.
Try It Free