
How to Choose the Right Embedding Model

Choosing an embedding model means balancing retrieval quality, cost, latency, and domain fit for your specific application. The right model depends on what language your content is in, how technical or domain-specific it is, how many documents you need to embed, and whether you can self-host or need a managed API. This guide walks through the evaluation process with concrete criteria at each step.

Why the Model Choice Matters

The embedding model is the foundation of your entire vector search system. Every other component (the vector database, the indexing strategy, the reranking layer) builds on top of the embeddings the model produces. A model that embeds your content poorly creates a ceiling on retrieval quality that no downstream component can overcome. Switching models later is expensive because it requires re-embedding your entire corpus, so getting the choice right upfront saves significant engineering time.

The difference between a good and a mediocre embedding model is typically 10 to 20 percentage points of recall on retrieval benchmarks. On your specific data, the gap may be smaller or larger depending on how well the model's training data matches your domain. A model trained heavily on web text performs well on general knowledge bases but may underperform on specialized content like medical records, legal contracts, or source code.

Step-by-Step Selection Process

Step 1: Define your retrieval requirements.
Before comparing models, establish what your application needs. What language is the content in? If it is English-only, most models work well. If it is multilingual, you need a model trained on multiple languages (Cohere embed-v4, multilingual-e5-large). What domain is the content in? Code, legal, medical, and financial content each have specialized models that outperform general-purpose ones, while general documentation is served well by mainstream models. How many documents will you embed? The answer determines whether API costs are manageable or whether self-hosting becomes necessary. What latency do your queries need? Real-time applications need models that embed a query in under 50 milliseconds.
Step 2: Shortlist models by benchmark performance.
Use the MTEB leaderboard to identify the top-performing models for your use case. Focus on the retrieval subtasks (BEIR benchmark) rather than the overall MTEB score, because retrieval performance is what matters for search. For code retrieval, check CodeSearchNet scores. For multilingual content, check MIRACL scores. Narrow your list to 3 to 5 candidates that rank well on the subtasks most relevant to your application.
# Top embedding models by MTEB retrieval scores (as of early 2026):
#
# API-based:
#   OpenAI text-embedding-3-large   (3072 dims, strong general purpose)
#   OpenAI text-embedding-3-small   (1536 dims, good cost/quality ratio)
#   Cohere embed-v4                 (1024 dims, best multilingual)
#   Voyage voyage-3                 (1024 dims, strong on technical content)
#   Voyage voyage-code-3            (1024 dims, best for code)
#
# Open source (self-hosted):
#   BGE-large-en-v1.5               (1024 dims, strong English)
#   E5-large-v2                     (1024 dims, instruction-tuned)
#   GTE-large                       (1024 dims, good all-around)
#   NV-Embed-v2                     (4096 dims, top MTEB scores)
#   nomic-embed-text-v1.5           (768 dims, good value)
Step 3: Evaluate cost and latency.
For API-based models, calculate the cost to embed your full corpus and the ongoing cost for query embeddings plus new document ingestion. OpenAI's text-embedding-3-small costs roughly $0.02 per million tokens, so embedding 100,000 documents averaging 500 tokens each costs about $1. The large model costs roughly $0.13 per million tokens. For self-hosted models, estimate the GPU cost: a single A10G instance can embed roughly 1,000 documents per second with batching, and costs $1 to $2 per hour on major clouds. Measure query embedding latency for your shortlisted models because this adds directly to your search response time.
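To make the arithmetic concrete, a back-of-the-envelope calculation like the one below is usually enough. It is a minimal sketch using the approximate prices quoted above; check your provider's current pricing before budgeting.

# Rough embedding cost estimate in USD, using the approximate prices above.
def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int,
                       price_per_million_tokens: float) -> float:
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# 100,000 docs x 500 tokens at ~$0.02/M tokens (text-embedding-3-small) -> ~$1
print(embedding_cost_usd(100_000, 500, 0.02))   # 1.0
# Same corpus at ~$0.13/M tokens (text-embedding-3-large) -> ~$6.50
print(embedding_cost_usd(100_000, 500, 0.13))   # 6.5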
Step 4: Test on your own data.
Select 500 to 1,000 representative documents from your corpus and 50 to 100 real or realistic queries with known relevant results. Embed everything with each candidate model, run the queries, and measure recall at k (the fraction of known relevant documents that appear in the top k results). This is the most important step because benchmark performance does not always predict performance on your specific data. A model that ranks third on MTEB may rank first on your data if your content matches its training distribution.
import numpy as np
from typing import List

def recall_at_k(query_embedding: np.ndarray,
                doc_embeddings: np.ndarray,
                relevant_ids: List[int],
                k: int = 10) -> float:
    # Dot product equals cosine similarity when the embeddings are
    # L2-normalized; normalize them first if your model does not.
    similarities = np.dot(doc_embeddings, query_embedding)
    top_k_ids = np.argsort(similarities)[-k:][::-1]
    hits = len(set(top_k_ids) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Run for each model, average across all queries
# Model with highest mean recall@10 on YOUR data wins
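The function above scores a single query; the sketch below shows one way the full comparison could be wired up. It reuses recall_at_k and numpy from the snippet above, and the embed_fns callables are placeholders for whatever client code you use to call each candidate model, not any specific provider's API.

# Hypothetical harness: embed_fns maps a model name to a callable that turns
# a list of strings into an (n, dims) array of L2-normalized vectors.
def compare_models(embed_fns, documents, queries, relevant_ids_per_query, k=10):
    results = {}
    for name, embed in embed_fns.items():
        doc_vecs = embed(documents)
        query_vecs = embed(queries)
        recalls = [
            recall_at_k(q_vec, doc_vecs, relevant_ids_per_query[i], k)
            for i, q_vec in enumerate(query_vecs)
        ]
        results[name] = float(np.mean(recalls))
    return results   # e.g. {"model-a": 0.81, "model-b": 0.74}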
Step 5: Check dimension and storage impact.
Calculate total storage for your expected corpus. Each vector consumes (dimensions x 4) bytes in float32. A million vectors at 1536 dimensions: 1M x 1536 x 4 bytes, roughly 5.7 GiB before indexing. At 3072 dimensions: roughly 11.4 GiB. Add 2 to 4 times overhead for the HNSW index. If storage is a constraint, consider models with fewer dimensions, models that support Matryoshka embeddings (where you can truncate to fewer dimensions with controlled quality loss), or scalar/product quantization.
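The same arithmetic as a small helper, if you want to plug in your own corpus size and dimensions. This covers raw float32 storage only; index overhead comes on top.

# Raw vector storage in float32, before index overhead.
def vector_storage_gib(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dims * bytes_per_value / (1024 ** 3)

print(vector_storage_gib(1_000_000, 1536))   # ~5.7 GiB
print(vector_storage_gib(1_000_000, 3072))   # ~11.4 GiB
# Add the HNSW index on top: the rule of thumb above is 2 to 4 times overhead.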
Step 6: Commit and plan for migration.
Select the model that gives the best recall on your data at acceptable cost and latency. Embed your full corpus. Document the exact model name and version (not just "OpenAI embeddings" but "text-embedding-3-large, dimensions=1536"). Different model versions produce incompatible vector spaces, so if you ever need to re-embed, you need to know exactly which model produced the current vectors. If your embedding model provider deprecates or changes the model, you will need to re-embed everything.
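One lightweight way to keep this traceable is to store a metadata record next to the index. The field names below are only an illustration, not a required schema.

# Hypothetical metadata record stored alongside the vector index so that a
# future re-embedding or migration knows exactly what produced the vectors.
EMBEDDING_METADATA = {
    "provider": "openai",
    "model": "text-embedding-3-large",
    "dimensions": 1536,
    "normalized": True,
    "embedded_at": "2026-02-01",
    "corpus_version": "docs-2026-02",
}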
Matryoshka embeddings: Some models (OpenAI's text-embedding-3 family, nomic-embed) support Matryoshka representation learning, which means you can truncate vectors to fewer dimensions after embedding. A 3072-dimensional vector can be truncated to 1536 or 768 dimensions with a small, predictable quality loss. This lets you start with high dimensions and compress later if storage becomes a constraint, without re-embedding.
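As a sketch of what truncation looks like in practice: keep the leading dimensions and re-normalize, assuming the model was trained with Matryoshka representation learning. (With OpenAI's text-embedding-3 models you can also request fewer dimensions directly via the API's dimensions parameter.)

import numpy as np

# Truncate a Matryoshka embedding to its leading dimensions, then re-normalize
# so dot-product / cosine similarity remains meaningful.
def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    truncated = vec[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

full = np.random.rand(3072).astype(np.float32)   # stand-in for a real 3072-dim embedding
compact = truncate_embedding(full, 768)          # 768-dim version of the same vector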

When to Switch Models

Re-embedding is expensive, so switching models should be driven by measurable retrieval quality problems, not by a new model appearing on the MTEB leaderboard. If your recall metrics are meeting requirements, your current model is fine. Switch when retrieval quality degrades on a specific content type, when a domain-specific model becomes available for your niche, or when cost savings from a newer model justify the re-embedding effort.

Adaptive Recall handles embedding as part of the memory storage pipeline, so you do not manage model selection or re-embedding directly. The system uses embeddings as one of four retrieval signals (alongside cognitive activation, knowledge graph traversal, and confidence scoring), which means the overall retrieval quality is less dependent on any single embedding model's performance. Even when similarity scores are mediocre, the other scoring signals compensate.

Stop worrying about embedding model selection. Adaptive Recall combines vector similarity with cognitive scoring and graph traversal, so retrieval quality does not depend on a single model.

Try It Free