
Bi-Encoders vs Cross-Encoders vs ColBERT Explained

Bi-encoders, cross-encoders, and ColBERT represent three different architectures for scoring the relevance between a query and a document. Bi-encoders are fast because they precompute document representations, making them ideal for initial retrieval. Cross-encoders are accurate because they process query and document together, making them ideal for reranking. ColBERT sits between the two, offering near cross-encoder accuracy at closer to bi-encoder speed through a late-interaction design.

How Bi-Encoders Work

A bi-encoder encodes the query and the document independently, usually with the same encoder network (shared weights), producing a fixed-length vector for each. The relevance score is the cosine similarity (or dot product) between these two vectors. The critical property is independence: the document vector is computed once and stored in an index, and only the query vector needs to be computed at search time.
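
A minimal sketch of this flow using the open-source sentence-transformers library (the model name and example texts are illustrative choices, not ones prescribed by any particular system):

```python
# Minimal bi-encoder sketch with sentence-transformers. The model name
# is one common open-source choice, picked only for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Documents are encoded once, offline, and stored in a vector index.
docs = [
    "Users need the admin role to access the dashboard endpoint.",
    "Error codes in HTTP range from 400 to 599.",
]
doc_vectors = model.encode(docs, normalize_embeddings=True)

# At search time only the query is encoded; with normalized vectors,
# the dot product equals cosine similarity.
query_vector = model.encode("how do I fix error 403", normalize_embeddings=True)
scores = util.dot_score(query_vector, doc_vectors)
print(scores)  # one relevance score per document
```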

This independence makes bi-encoders extremely fast. Searching a million documents takes milliseconds because the search is a nearest-neighbor lookup in the precomputed vector index, not a million model inference passes. Approximate nearest neighbor algorithms (HNSW, IVF) make this even faster at the cost of occasionally missing a relevant result.
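
To make the lookup concrete, here is a sketch of building an HNSW index with the FAISS library; the collection size, dimensions, and parameters are arbitrary stand-ins:

```python
# Approximate nearest-neighbor search with FAISS. All sizes and
# parameters here are stand-ins for illustration only.
import numpy as np
import faiss

dim = 768
doc_vectors = np.random.rand(10_000, dim).astype("float32")  # placeholder embeddings

index = faiss.IndexHNSWFlat(dim, 32)  # HNSW graph, 32 links per node
index.add(doc_vectors)                # built once, offline

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 10)  # a graph walk, not 10,000 model passes
```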

The trade-off is that the query and document never interact inside the model. The bi-encoder must compress all the information about the document into a single fixed-length vector (typically 768 or 1536 dimensions), and the relevance judgment is a simple geometric comparison between two points. This means the model cannot capture fine-grained interactions between specific query terms and specific document passages. It works well for broad topical matching but struggles with nuanced relevance judgments.

Common bi-encoder models include OpenAI's text-embedding-3, Voyage AI's voyage-3, Cohere's embed-v4, and the open-source BGE and GTE families. These models are the backbone of virtually every vector search system in production today.

How Cross-Encoders Work

A cross-encoder takes the query and document as a single concatenated input and processes them together through a transformer. The model can attend to both texts simultaneously, allowing it to capture interactions between specific query terms and specific document passages. The output is a single relevance score rather than separate vectors.

This joint processing produces significantly more accurate relevance judgments than bi-encoders. A cross-encoder can determine that "how do I fix error 403" is better answered by "users need the admin role to access the dashboard endpoint" (which explains the cause of a 403) than by "error codes in HTTP range from 400 to 599" (which is topically related but does not answer the question). The bi-encoder might rank these similarly because both are about HTTP errors, but the cross-encoder sees the interaction between "fix" in the query and "need the admin role" in the first document.
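
A sketch of that comparison using the sentence-transformers CrossEncoder wrapper with a checkpoint from the MS MARCO family mentioned below (exact scores will vary by model):

```python
# Cross-encoder sketch: each query-document pair is scored jointly
# in a single forward pass.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I fix error 403"
pairs = [
    (query, "users need the admin role to access the dashboard endpoint"),
    (query, "error codes in HTTP range from 400 to 599"),
]
scores = model.predict(pairs)  # one inference pass per pair
print(scores)  # the first pair should score higher
```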

The cost is speed. Because the query and document must be processed together, cross-encoder scores cannot be precomputed. Every query requires a separate inference pass for each candidate document. Scoring 20 candidates takes 20 inference passes, which typically costs 50 to 200 milliseconds on GPU. This makes cross-encoders impractical for initial retrieval against large collections but excellent for reranking a small candidate set.

Popular cross-encoder models include BGE-reranker-v2 from BAAI, the MS MARCO fine-tuned family (cross-encoder/ms-marco-MiniLM), and Cohere Rerank. These are purpose-built for the reranking task and are trained on query-document pairs with human relevance judgments.

How ColBERT Works

ColBERT (Contextualized Late Interaction over BERT) is a late-interaction model that splits the difference between bi-encoders and cross-encoders. Instead of compressing the document into a single vector (bi-encoder) or processing query and document together (cross-encoder), ColBERT produces per-token representations for both query and document independently. At search time, it computes a maximum similarity score between each query token and all document tokens, then sums these maximum similarities.
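
The scoring rule itself is compact. A toy version of the MaxSim computation, with random arrays standing in for real token embeddings:

```python
# Toy MaxSim: for each query token, take the maximum similarity over
# all document tokens, then sum. Embeddings are random stand-ins.
import numpy as np

num_q, num_d, dim = 8, 300, 128   # query tokens, document tokens, vector size
q = np.random.randn(num_q, dim)
d = np.random.randn(num_d, dim)

# Normalize so dot products are cosine similarities.
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)

sim = q @ d.T                  # (num_q, num_d) token-to-token similarities
score = sim.max(axis=1).sum()  # best document match per query token, summed
```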

This design gives ColBERT several advantages. Like bi-encoders, it can precompute document representations (per-token vectors), making search faster than cross-encoders. Like cross-encoders, it captures token-level interactions between query and document, making it more accurate than single-vector bi-encoders. In the earlier 403 example, per-token matching can align "fix" in the query with "need the admin role" in the document, recovering the term-level relevance relationship that a single-vector comparison misses.

The trade-off is storage and index complexity. Instead of one vector per document, ColBERT stores one vector per token, which increases the index size by a factor of 50 to 200. This makes ColBERT more expensive to deploy and requires specialized index structures (like the PLAID engine) to achieve fast search. However, for applications that need accuracy between bi-encoder and cross-encoder at speeds closer to bi-encoder, ColBERT is a compelling option.
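
As a rough worked example (the document length and precision are assumptions for illustration): a single 768-dimensional float32 vector is about 3 KB, while a 300-token document stored as uncompressed 128-dimensional float32 token vectors takes 300 × 128 × 4 bytes, roughly 154 KB, about 50 times larger. Longer documents push toward the top of the 50-200x range, while compression techniques pull it back down.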

ColBERT v2 and the newer ColPali (which extends the approach to multimodal documents) represent the current state of this architecture family. Open-source implementations are available through the RAGatouille library and the Stanford ColBERT project.
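
As a sketch of what this looks like through RAGatouille (the method names follow that library's documented interface at the time of writing; verify against the current docs):

```python
# Indexing and searching with ColBERTv2 via RAGatouille. Treat this
# as a sketch of the documented interface, not a definitive recipe.
from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

docs = [
    "Users need the admin role to access the dashboard endpoint.",
    "Error codes in HTTP range from 400 to 599.",
]
rag.index(collection=docs, index_name="support_docs")  # per-token vectors, built once

results = rag.search(query="how do I fix error 403", k=2)
```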

Comparison Table

| Property | Bi-Encoder | Cross-Encoder | ColBERT |
|---|---|---|---|
| Query-document interaction | None (independent encoding) | Full (joint processing) | Late (per-token matching) |
| Precompute documents | Yes (one vector per doc) | No | Yes (one vector per token) |
| Search speed (1M docs) | 5-20 ms | Impractical | 50-200 ms |
| Rerank speed (20 candidates) | N/A (used for retrieval) | 50-200 ms (GPU) | 10-30 ms |
| Accuracy (NDCG@10) | Baseline | +10-15% over bi-encoder | +7-12% over bi-encoder |
| Index size per document | 3-6 KB | N/A | 50-200 KB |
| Best role | First-stage retrieval | Second-stage reranking | Either (with trade-offs) |

When to Use Each

Use bi-encoders for the first retrieval stage in virtually all applications. They provide the speed and scalability needed to search across large collections, and every major vector database is optimized for this use case. The embedding models from OpenAI, Cohere, Voyage, and the open-source community are mature, well-documented, and easy to deploy.

Use cross-encoders for second-stage reranking when accuracy matters more than latency. Customer support systems, legal research, medical information retrieval, and other domains where a wrong answer has consequences benefit from cross-encoder reranking. The 50 to 200 milliseconds of added latency is acceptable for most interactive applications.

Use ColBERT when you need accuracy close to cross-encoders but with faster search over larger collections. ColBERT works well as a single-stage retriever for medium-sized collections (up to a few million documents) where cross-encoder reranking is too slow and bi-encoder accuracy is not sufficient. The higher storage cost is justified when accuracy is critical and the collection size is manageable.

Where Cognitive Scoring Fits

Cognitive scoring is not a replacement for any of these encoder architectures. It operates on a completely different dimension: temporal, relational, and reliability factors that none of the encoders can capture from text alone. You use a bi-encoder for candidate retrieval, optionally a cross-encoder for semantic reranking, and cognitive scoring for multi-factor ranking that accounts for recency, access patterns, entity connections, and confidence. Each layer adds value that the others cannot provide.

In Adaptive Recall, the default pipeline uses bi-encoder retrieval followed by cognitive scoring. For applications that need maximum accuracy, you can insert a cross-encoder reranking step between them, creating a three-stage pipeline: bi-encoder recall, cross-encoder relevance, cognitive scoring for final ranking.
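
A compressed sketch of that three-stage shape is below. The cognitive_score function, its weights, and the candidate fields are hypothetical placeholders for illustration, not Adaptive Recall's actual API:

```python
# Three-stage pipeline sketch: bi-encoder recall -> cross-encoder
# relevance -> cognitive scoring. cognitive_score and its weights are
# hypothetical illustrations, not a real API.
import math
import time

def cognitive_score(candidate: dict, semantic_score: float) -> float:
    # Illustrative blend of semantic relevance with temporal and
    # reliability signals; a real system would tune these factors.
    age_days = (time.time() - candidate["created_at"]) / 86_400
    recency = math.exp(-age_days / 30)  # exponential decay, 30-day time constant
    return 0.6 * semantic_score + 0.25 * recency + 0.15 * candidate["confidence"]

def rank(query, bi_encoder, cross_encoder, index, k=20):
    candidates = index.search(bi_encoder.encode(query), k)   # stage 1: fast recall
    pairs = [(query, c["text"]) for c in candidates]
    relevance = cross_encoder.predict(pairs)                 # stage 2: joint relevance
    rescored = [(cognitive_score(c, r), c) for c, r in zip(candidates, relevance)]
    rescored.sort(key=lambda pair: pair[0], reverse=True)    # stage 3: final ranking
    return [c for _, c in rescored]
```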

Layer cognitive scoring on top of any encoder architecture. Adaptive Recall adds recency, confidence, and entity awareness to your existing retrieval stack.
