What Is the Best Open-Source Reranking Model?
The Top Contenders
BGE-reranker-v2-m3 (BAAI)
BGE-reranker-v2-m3 is a 560M parameter cross-encoder trained by the Beijing Academy of Artificial Intelligence. It supports multiple languages (English, Chinese, Japanese, Korean, and more) and achieves top scores on the MTEB reranking benchmarks. Its accuracy is within 1 to 2 percentage points of Cohere Rerank v3 on most English evaluation sets, making it the strongest open-source option for applications where accuracy is the priority.
The trade-off is size and speed. At 560M parameters, it requires a GPU with at least 2 GB of VRAM and takes 80 to 150 milliseconds to score 20 candidates. CPU inference is possible but slow (500ms or more for 20 candidates). For applications with GPU infrastructure and a latency budget of 100 to 200ms for reranking, this is the recommended choice.
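A minimal usage sketch with the FlagEmbedding library, which publishes this model; the query and passages are illustrative placeholders:

```python
from FlagEmbedding import FlagReranker

# Load the 560M cross-encoder; fp16 roughly halves VRAM use on GPU.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "How do I rotate API keys safely?"  # illustrative
candidates = [
    "Issue a new key, migrate clients, then revoke the old key.",
    "Our office is open Monday through Friday.",
]

# compute_score takes [query, passage] pairs and returns one
# relevance score per pair (higher = more relevant).
scores = reranker.compute_score([[query, c] for c in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```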
MS MARCO MiniLM Cross-Encoders
The cross-encoder/ms-marco-MiniLM family includes variants with 6 and 12 transformer layers. The L-6 variant (22M parameters) scores 20 candidates in 15 to 30ms on GPU, making it one of the fastest cross-encoders available. The L-12 variant (33M parameters) is slightly more accurate but takes 25 to 50ms. Both are trained on the MS MARCO passage-ranking dataset, which makes them well suited for English question answering and information retrieval tasks.
These models are the best choice when you need reranking with minimal latency overhead. The accuracy gap compared to BGE-reranker-v2 is 5 to 8 percentage points on NDCG@10, which is significant but acceptable for many applications, especially when combined with cognitive scoring that adds value through non-semantic dimensions.
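Usage through sentence-transformers' CrossEncoder is a few lines; a sketch with placeholder inputs:

```python
from sentence_transformers import CrossEncoder

# L-6 variant; swap in cross-encoder/ms-marco-MiniLM-L-12-v2 for the larger model.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "what causes inflation"  # illustrative
docs = [
    "Inflation rises when demand outpaces supply or the money supply grows.",
    "The weather today is sunny with light winds.",
]

# predict returns one relevance score per (query, doc) pair.
scores = model.predict([(query, d) for d in docs])
top = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
```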
GTE-reranker (Alibaba)
GTE-reranker from Alibaba is a newer entry that competes with BGE-reranker on accuracy benchmarks. It uses a modified transformer architecture optimized for long document pairs (up to 8192 tokens), which makes it particularly suitable for reranking long passages or full documents rather than short chunks. Accuracy is comparable to BGE-reranker-v2 on English tasks and slightly better on some multilingual benchmarks.
The model is available in multiple sizes, from a compact 137M parameter version to a full 560M version. The compact version offers a good middle ground between MiniLM speed and BGE accuracy.
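A sketch using the plain transformers API, following the usage pattern on the GTE reranker model cards; the model ID below is one published checkpoint and may differ from the size you choose, and trust_remote_code is needed for its custom architecture:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; pick the GTE reranker size you need.
model_id = "Alibaba-NLP/gte-multilingual-reranker-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
)
model.eval()

pairs = [
    ["what is late interaction", "Late interaction scores query and document tokens separately."],
    ["what is late interaction", "Unrelated text about office hours."],
]
with torch.no_grad():
    # max_length can be raised toward 8192 for long-document pairs.
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    scores = model(**inputs).logits.view(-1).float()  # one score per pair
```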
ColBERTv2 and RAGatouille
ColBERTv2 is a late-interaction model that works differently from traditional cross-encoders. Instead of processing query-document pairs jointly, it encodes query and document into per-token embeddings independently, then scores the pair by summing, for each query token, its maximum similarity to any document token (the MaxSim operation). Because document representations can be precomputed, ColBERT is suitable as both a retriever and a reranker.
The RAGatouille library provides a convenient Python interface for ColBERT models. Accuracy is between MiniLM and BGE-reranker, with the advantage of faster reranking (10 to 30ms for precomputed documents) and the ability to serve as a standalone retriever without a separate vector database.
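A sketch of reranking through RAGatouille; the checkpoint name is the standard ColBERTv2 release on Hugging Face, and the query and documents are placeholders:

```python
from ragatouille import RAGPretrainedModel

reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# rerank scores the candidates against the query and returns them
# ordered by late-interaction (MaxSim) score.
results = reranker.rerank(
    query="How do I rotate API keys safely?",  # illustrative
    documents=[
        "Issue a new key, migrate clients, then revoke the old key.",
        "Our office is open Monday through Friday.",
    ],
    k=2,
)
```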
Comparison Table
| Model | Parameters | Latency (20 docs, GPU) | NDCG@10 vs. MiniLM-L-6 | Best For |
|---|---|---|---|---|
| MiniLM-L-6 | 22M | 15-30ms | Baseline | Low-latency reranking |
| MiniLM-L-12 | 33M | 25-50ms | +3-5% | Balanced speed/accuracy |
| GTE-reranker (compact) | 137M | 40-80ms | +5-8% | Long documents |
| BGE-reranker-v2 | 560M | 80-150ms | +8-12% | Maximum accuracy |
| ColBERTv2 | 110M | 10-30ms* | +5-9% | Dual retriever/reranker |

*With precomputed document embeddings.
How to Choose
For most applications, start with MiniLM-L-6 (fastest, simplest to deploy, good enough accuracy for many use cases). If accuracy on your test set is not sufficient, upgrade to BGE-reranker-v2. If you need multilingual support, BGE-reranker-v2-m3 is the clear choice. If you are already using ColBERT for retrieval, use it for reranking as well to avoid maintaining two model pipelines.
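The same guidance, condensed into an illustrative helper; the thresholds and model IDs simply mirror the advice above, not a benchmarked decision rule:

```python
def pick_reranker(multilingual: bool, latency_budget_ms: int,
                  already_using_colbert: bool) -> str:
    """Illustrative decision helper mirroring the guidance above."""
    if already_using_colbert:
        return "colbert-ir/colbertv2.0"  # one pipeline for retrieval + reranking
    if multilingual:
        return "BAAI/bge-reranker-v2-m3"  # clear multilingual choice
    if latency_budget_ms < 50:
        return "cross-encoder/ms-marco-MiniLM-L-6-v2"  # fastest, simplest
    return "BAAI/bge-reranker-v2-m3"  # upgrade path for maximum accuracy
```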
Regardless of which model you choose, consider layering cognitive scoring on top of the cross-encoder results. The cross-encoder improves semantic precision, while cognitive scoring adds recency, confidence, and entity awareness. The combination addresses both types of retrieval errors: semantically imprecise results (fixed by cross-encoder) and stale or unreliable results (fixed by cognitive scoring).
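A hedged sketch of that layering; the weights, decay constant, and signal names below are assumptions for illustration, not part of any of these libraries:

```python
def combined_score(semantic: float, age_days: float,
                   confidence: float, entity_overlap: float) -> float:
    """Blend cross-encoder relevance with non-semantic signals.

    All inputs are assumed pre-normalized to [0, 1]; the weights are
    illustrative starting points, not tuned values.
    """
    recency = 1.0 / (1.0 + age_days / 30.0)  # ~half weight after a month
    return (0.6 * semantic           # semantic precision from the cross-encoder
            + 0.2 * recency          # penalize stale results
            + 0.1 * confidence       # source reliability
            + 0.1 * entity_overlap)  # entity awareness
```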
Open Source vs Hosted
Hosted reranking APIs (Cohere Rerank, Jina Reranker) offer higher accuracy and zero deployment overhead at the cost of per-query pricing and API latency (100 to 200ms network round trip). Open-source models require GPU infrastructure and deployment effort but have zero per-query cost and lower latency (no network round trip). For high-volume applications (over 10,000 queries per day), open-source models typically pay for themselves within a month through API cost savings.
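A back-of-the-envelope break-even check; both prices below are placeholder assumptions, not quotes from any provider:

```python
# Placeholder rates; substitute your provider's actual pricing.
queries_per_day = 10_000
hosted_cost_per_1k = 1.00    # assumed $/1k rerank calls
gpu_cost_per_month = 250.00  # assumed small GPU instance, $/month

hosted_monthly = queries_per_day * 30 / 1_000 * hosted_cost_per_1k  # $300
print(f"hosted ${hosted_monthly:.0f}/mo vs self-hosted ${gpu_cost_per_month:.0f}/mo")
```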
Add cognitive scoring on top of any reranker. Adaptive Recall provides the recency, confidence, and entity layers that complement open-source cross-encoder precision.
Try It Free