
Vector Search and Embeddings for AI Applications

Vector search converts text into numerical representations called embeddings and finds the most similar vectors to answer a query. It is the retrieval backbone of every RAG pipeline, every semantic search engine, and most AI memory systems built today. Understanding how embeddings work, which distance metrics to use, how to chunk documents, and where vector search falls short is essential for building AI applications that retrieve relevant information reliably.

What Vector Search Is and How It Works

Vector search is a retrieval technique that finds information by mathematical similarity rather than keyword matching. Traditional search engines build inverted indexes that map words to documents, so a query for "database connection pooling" finds documents containing those exact words. Vector search converts both the query and all stored documents into numerical vectors (arrays of floating-point numbers) and finds the stored vectors that are geometrically closest to the query vector. Two pieces of text that discuss the same concept but use entirely different words will produce similar vectors, which is why vector search is also called semantic search.

The process works in two phases. During indexing, each document or chunk of text is passed through an embedding model (a neural network trained specifically for this purpose) that outputs a fixed-length vector, typically between 384 and 3072 dimensions. This vector captures the semantic meaning of the text as a point in high-dimensional space. The vector is stored alongside a reference to the original text in a vector database or index. During querying, the search query is passed through the same embedding model to produce a query vector, and the database finds the stored vectors closest to the query vector using a distance metric like cosine similarity. The original text associated with the closest vectors is returned as the search results.
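To make the two phases concrete, here is a minimal sketch in Python. The embed() function is a stand-in for whatever embedding model you actually call (an API or a local model); everything else is plain NumPy with a brute-force comparison loop.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in only: a real system calls an embedding model (API or local) here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

# Indexing phase: embed each chunk and store the vector next to the original text.
corpus = [
    "Troubleshooting authentication failures",
    "Configuring database connection pools",
]
index = [(embed(chunk), chunk) for chunk in corpus]

# Query phase: embed the query with the SAME model, then rank by cosine similarity.
def search(query: str, top_k: int = 5) -> list[tuple[float, str]]:
    q = embed(query)
    scored = []
    for vec, text in index:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        scored.append((sim, text))
    return sorted(scored, reverse=True)[:top_k]

print(search("fix broken login"))
```

Production systems replace the brute-force loop with an approximate nearest neighbor index such as HNSW, covered later in this article, but the data flow is the same.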

The quality of vector search depends entirely on the embedding model. A good embedding model places semantically similar text close together and semantically different text far apart in the vector space. The model is trained on massive text corpora where it learns that "how to configure database connection pools" and "setting up DB connection management" discuss the same topic and should produce nearby vectors, while "how to configure a swimming pool" should be far away despite sharing the word "pool." This learned semantic understanding is what makes vector search powerful for information retrieval.

Vector search has two fundamental strengths that keyword search lacks. First, it handles vocabulary mismatch. A user who searches for "fix broken login" will find a document titled "Troubleshooting authentication failures" because the embedding model learned that these phrases are semantically equivalent. Second, it handles conceptual queries. A search for "ways to make the API faster" will surface documents about caching, connection pooling, query optimization, and load balancing because the model understands these are all methods of improving API performance, even though none of them contain the phrase "make the API faster."

Understanding Embeddings

An embedding is a fixed-length array of floating-point numbers that represents the semantic meaning of a piece of text. A text embedding model takes a string of any length and outputs a vector of a specific dimensionality, typically 384, 768, 1024, 1536, or 3072 dimensions depending on the model. Each dimension captures some aspect of meaning, though individual dimensions are not directly interpretable by humans. The vector as a whole encodes what the text is about, how it relates to other concepts, and its semantic context.

Embedding models are trained using contrastive learning on pairs or triplets of text. The model learns to produce similar vectors for text pairs that are semantically related (a question and its answer, a paragraph and its summary, two paraphrases of the same idea) and different vectors for unrelated pairs. After training on millions of such pairs, the model generalizes to produce meaningful vectors for text it has never seen. The quality of an embedding model is measured by how well this generalization works on retrieval benchmarks like MTEB (Massive Text Embedding Benchmark), which tests the model across dozens of tasks including search, classification, and clustering.

Dimensionality represents a trade-off between expressiveness and efficiency. A 384-dimensional vector can distinguish between broad topics but may conflate subtle differences within a topic. A 3072-dimensional vector captures finer distinctions but requires 8 times more storage and takes longer to compare. For most retrieval applications, 768 to 1536 dimensions provide the best balance. At 1536 dimensions (the output size of OpenAI's text-embedding-3-small and several other popular models), each vector occupies about 6 KB of storage in float32 format, which means a million vectors require roughly 6 GB of storage before indexing overhead.
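A quick back-of-the-envelope check of those numbers, assuming float32 storage and ignoring index overhead:

```python
dims = 1536
bytes_per_vector = dims * 4                    # float32 uses 4 bytes per dimension
print(bytes_per_vector)                        # 6144 bytes, about 6 KB per vector
print(bytes_per_vector * 1_000_000 / 1e9)      # about 6.1 GB for a million vectors
```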

Not all text embeds equally well. Short, topical phrases like "database connection pooling" produce consistent, high-quality embeddings because the meaning is concentrated and unambiguous. Long documents with multiple topics produce embeddings that average across all topics, diluting the signal for any individual topic. This is why chunking, splitting documents into smaller pieces before embedding, is critical for retrieval quality. A paragraph about connection pooling embedded on its own will be found by a query about connection pooling, but the same paragraph buried in a 5,000-word document about the entire infrastructure produces a diluted embedding that may not match the query strongly enough to appear in the top results.

Distance Metrics: Cosine, Dot Product, Euclidean

Distance metrics determine how similarity between two vectors is calculated. The three most common metrics in vector search are cosine similarity, dot product, and Euclidean distance. Each answers the question "how similar are these two vectors" in a slightly different way, and the right choice depends on how your embedding model was trained.

Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have a cosine similarity of 1.0, regardless of how long they are. Two vectors pointing in opposite directions have a cosine similarity of -1.0. Two perpendicular vectors have a cosine similarity of 0. Because cosine similarity only considers direction, it is robust to differences in text length. A short query and a long document can have high cosine similarity if they discuss the same topic, even though the long document's embedding has a larger magnitude. Most embedding models are trained to produce unit-normalized vectors (magnitude 1.0), in which case cosine similarity and dot product produce identical rankings.

Dot product multiplies corresponding dimensions and sums the results. Unlike cosine similarity, it is sensitive to both direction and magnitude. Two vectors that point in the same direction and are both long will have a higher dot product than two that point in the same direction but are shorter. Some embedding models use this property intentionally: they produce longer vectors for more informative or higher-quality content, so the dot product naturally ranks high-quality matches above low-quality matches that happen to be topically similar. If your embedding model produces non-normalized vectors and the documentation says to use dot product, follow that guidance.

Euclidean distance (L2 distance) measures the straight-line distance between two points in the vector space. Smaller distances mean more similar vectors. It considers both direction and magnitude, making it the most geometrically intuitive metric. In practice, for normalized vectors, ranking by Euclidean distance produces the same order as ranking by cosine similarity because the mathematical relationship between the two is monotonic when vectors have unit length. Euclidean distance is slightly faster to compute because it avoids the normalization step, which is why some databases default to it.
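All three metrics are simple to express directly in NumPy. This sketch also shows why the choice matters little for unit-length vectors: cosine similarity and dot product return the same value, and Euclidean distance preserves the same ranking.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle only: magnitude is divided out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Direction and magnitude both contribute.
    return float(np.dot(a, b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance; smaller means more similar.
    return float(np.linalg.norm(a - b))

# For unit-length vectors the metrics agree on ranking:
a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])                        # both vectors have magnitude 1.0
print(cosine_similarity(a, b), dot_product(a, b))   # identical: 0.6
print(euclidean_distance(a, b))                     # sqrt(2 - 2 * 0.6) ≈ 0.894
```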

For almost all retrieval applications using modern embedding models, cosine similarity is the right default. Most popular models (OpenAI's text-embedding-3 family, Cohere's embed-v4, Voyage's voyage-3) produce normalized vectors, so cosine similarity and dot product give identical results. If you are using a model that explicitly produces non-normalized vectors with intentional magnitude variation, use dot product. If you are doing spatial or geometric operations where absolute position matters rather than direction, use Euclidean distance.

Choosing an Embedding Model

The embedding model is the single most important component in a vector search system. A better model improves retrieval quality more than any amount of prompt engineering, reranking, or infrastructure optimization. The landscape of embedding models has expanded rapidly, with competitive options from OpenAI, Cohere, Voyage AI, Google, and the open-source community.

OpenAI's text-embedding-3-large (3072 dimensions) and text-embedding-3-small (1536 dimensions) are the most widely deployed commercial embedding models. They offer strong general-purpose performance across English and multilingual text, with the large model ranking among the top performers on MTEB benchmarks. Pricing is straightforward per-token, making cost estimation simple. The main consideration is vendor lock-in: once you embed your corpus with an OpenAI model, switching to a different model requires re-embedding everything because different models produce incompatible vector spaces.

Cohere's embed-v4 offers competitive MTEB scores with built-in support for different input types (search_document, search_query, classification, clustering). This type distinction means the model produces slightly different embeddings depending on whether the text is a document being indexed or a query being searched, which can improve retrieval precision. Cohere also offers multilingual models that handle over 100 languages in a single model, making them a strong choice for applications serving international users.

Voyage AI specializes in domain-specific embeddings. Their voyage-3 model performs well on general benchmarks, but their real differentiation is voyage-code-3 for code retrieval and voyage-law-2 for legal documents. If your application is domain-specific and a Voyage model exists for that domain, it will likely outperform general-purpose models by a meaningful margin because it was trained on domain-specific data that captures the vocabulary and reasoning patterns of that field.

Open-source models like BGE-large, E5-large, and GTE-large provide strong performance without API costs. They run on your own infrastructure, which eliminates per-token charges but requires GPU resources for inference. For applications that embed large corpora (millions of documents) or have high query volumes, self-hosted open-source models can be significantly cheaper than API-based models. The trade-off is operational complexity: you manage the GPU instances, model serving, and scaling yourself.

When choosing a model, prioritize retrieval benchmarks over general MTEB scores. MTEB aggregates across many tasks, but for search applications, the retrieval-specific subtasks (BEIR, MIRACL) matter most. Test your top two or three candidates on a sample of your actual data and queries. A model that ranks first on benchmarks may not rank first on your specific domain because benchmark datasets rarely match production query distributions.

Chunking Strategies for Retrieval

Chunking is the process of splitting documents into smaller pieces before embedding. The goal is to create chunks that are small enough to produce focused embeddings but large enough to contain sufficient context for a useful answer. Chunk size directly impacts retrieval quality: too large and the embedding averages across multiple topics, too small and the chunk lacks enough context to be useful when returned as a search result.

Fixed-size chunking splits text at regular token intervals (for example, every 512 tokens) with optional overlap between consecutive chunks. This is the simplest approach and works reasonably well for homogeneous content like documentation or articles where topic transitions are gradual. A typical configuration is 512-token chunks with 50-token overlap. The overlap ensures that information spanning a chunk boundary appears in both chunks, reducing the risk of splitting a critical piece of information across two chunks where neither chunk captures the full meaning.
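A minimal fixed-size chunker might look like the following sketch. It uses whitespace-separated words as a stand-in for real tokenizer tokens; a production version would count tokens with the same tokenizer your embedding model uses.

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Sketch: whitespace "tokens" stand in for real tokenizer tokens.
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```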

Semantic chunking uses natural boundaries in the text to determine where to split. Paragraph breaks, section headings, list boundaries, and topic transitions all represent meaningful splitting points. This produces chunks of varying size but better semantic coherence because each chunk covers a complete thought or topic. The implementation is more complex: you need to identify structural elements in the text and handle edge cases where a section is too long for a single chunk or too short to justify its own embedding.

Recursive chunking combines both approaches. Start by splitting on major structural boundaries (sections, headings). If any resulting chunk exceeds a maximum size (for example, 1,000 tokens), recursively split it at paragraph boundaries. If a paragraph still exceeds the maximum, split it at sentence boundaries. This produces chunks that respect document structure when possible and fall back to structural splitting only when necessary.
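A sketch of the recursive strategy, again counting whitespace-separated words as a stand-in for tokens. Production splitters typically also merge adjacent small pieces back up toward the size limit, which this sketch omits for brevity.

```python
def chunk_recursive(text: str, max_tokens: int = 1000,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ")) -> list[str]:
    # Split on the coarsest separator first, recurse on pieces that are still
    # too large, and fall back to a hard split when no separators remain.
    if len(text.split()) <= max_tokens:
        return [text]
    if not separators:
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    head, *rest = separators
    pieces = [p for p in text.split(head) if p.strip()]
    if len(pieces) <= 1:                        # separator not present, try the next one
        return chunk_recursive(text, max_tokens, tuple(rest))
    chunks = []
    for piece in pieces:
        chunks.extend(chunk_recursive(piece, max_tokens, tuple(rest)))
    return chunks
```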

The optimal chunk size depends on your query patterns. If users ask short, specific questions ("how do I configure the timeout"), smaller chunks (200 to 400 tokens) work well because they match the specificity of the query. If users ask broad questions ("explain the authentication architecture"), larger chunks (800 to 1,200 tokens) work better because they contain enough context to provide a comprehensive answer. Most production systems use 400 to 600 tokens as a default and tune from there based on retrieval metrics.

Parent-child chunking is an advanced technique that indexes small chunks for precise matching but returns the parent chunk (the larger section containing the match) as context. For example, split a document into 200-token chunks for embedding and indexing, but when a chunk matches a query, return the surrounding 800-token parent context. This gives you the embedding precision of small chunks with the contextual richness of large chunks.
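A sketch of the parent-child pattern, reusing the embed() and chunk_fixed() helpers from the earlier sketches. The blank-line section split is a simplistic stand-in for a real document parser.

```python
child_vectors: list[np.ndarray] = []
child_parent: list[str] = []        # child_parent[i] is the parent section of child i

def index_document(document: str) -> None:
    # Simplistic stand-in: treat blank-line-separated blocks as parent "sections".
    for parent in (s for s in document.split("\n\n") if s.strip()):
        for child in chunk_fixed(parent, chunk_size=200, overlap=20):
            child_vectors.append(embed(child))   # index the small, focused child chunk
            child_parent.append(parent)          # but remember its enclosing section

def retrieve(query: str, top_k: int = 5) -> list[str]:
    q = embed(query)
    sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in child_vectors]
    best = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:top_k]
    # Return the larger parent context rather than the matched child chunk.
    return [child_parent[i] for i in best]
```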

Hybrid Search: Combining Vectors with Keywords

Hybrid search runs both vector (semantic) search and keyword (lexical) search against the same corpus and combines the results. This addresses the primary weakness of vector search alone: it struggles with exact-match queries, specific identifiers, and terminology that the embedding model has not seen frequently in training data. Keyword search excels at these cases because it matches tokens directly.

The performance difference is measurable. Studies on information retrieval benchmarks consistently show that hybrid search achieves 88 to 91% recall at top-10, compared to 75 to 80% for vector search alone and 65 to 72% for keyword search alone. The improvement comes from complementary failure modes: the queries that vector search misses (exact terms, rare identifiers) are exactly the queries that keyword search handles well, and vice versa.

BM25 is the standard keyword search algorithm used in hybrid systems. It ranks documents by how frequently the query terms appear (term frequency) adjusted by how rare those terms are across the corpus (inverse document frequency) and normalized by document length. BM25 has been the baseline for information retrieval since 1994 and remains competitive because it is fast, well-understood, and handles exact-match queries reliably.
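For reference, the standard BM25 scoring formula, with the usual parameter ranges of k1 between 1.2 and 2.0 and b around 0.75:

$$\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

Here f(q_i, D) is the count of query term q_i in document D, |D| is the document length in tokens, and avgdl is the average document length across the corpus.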

Combining results from two different search systems requires a fusion strategy. The most common approach is Reciprocal Rank Fusion (RRF), which scores each result based on its rank position in each system's results. A document that appears at position 2 in vector search and position 5 in keyword search receives a higher combined score than a document at position 1 in only one system. RRF is effective because it does not require normalizing scores across different systems (vector similarity scores and BM25 scores are on different scales); it only requires rank positions.
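RRF is only a few lines of code. A sketch, assuming each input is a list of document IDs already ordered by that system's ranking:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # k = 60 is the constant commonly used from the original RRF paper;
    # it damps the advantage of the very top-ranked positions.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Document B is ranked 2nd by vector search and 5th by keyword search;
# document A is ranked 1st by vector search only. B wins the fused ranking.
vector_results = ["A", "B", "C"]
keyword_results = ["D", "E", "F", "G", "B"]
print(reciprocal_rank_fusion([vector_results, keyword_results]))
```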

An alternative to RRF is weighted score combination, where you normalize both score distributions to [0, 1] and then combine them with weights. For example, 0.7 times the vector score plus 0.3 times the BM25 score. This gives you direct control over the balance between semantic and lexical matching but requires careful score normalization and tuning of the weights for your specific data and query distribution.
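A corresponding sketch of weighted score combination, using min-max normalization and an alpha of 0.7 toward the vector score:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    # Normalize raw scores to [0, 1] so the two systems are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0                     # avoid division by zero
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_fusion(vector_scores: dict[str, float],
                    bm25_scores: dict[str, float],
                    alpha: float = 0.7) -> list[str]:
    v, b = minmax(vector_scores), minmax(bm25_scores)
    docs = set(v) | set(b)
    combined = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(combined, key=combined.get, reverse=True)
```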

Most modern vector databases support hybrid search natively. Weaviate has built-in BM25 plus vector search with automatic fusion. Qdrant supports hybrid search through its query API. pgvector can be combined with PostgreSQL's full-text search using ts_rank in the same SQL query. If your database does not support hybrid search natively, you can run both searches separately and merge results in your application layer using RRF.

Vector Database Landscape

Vector databases are purpose-built for storing, indexing, and querying high-dimensional vectors at scale. The market has grown from a few specialized options to dozens of choices, ranging from dedicated vector databases to vector extensions for existing databases.

Pinecone is a fully managed vector database that requires zero infrastructure management. You create an index, upsert vectors, and query. Scaling is automatic. This simplicity makes it the default choice for teams that want to build a retrieval system without managing database infrastructure. The trade-off is cost: Pinecone's pricing scales with vector count and query volume, and at large scale it becomes significantly more expensive than self-managed alternatives. It also does not support hybrid search natively, requiring a separate keyword search system.

Qdrant is an open-source vector database written in Rust with a focus on performance. It supports hybrid search (dense vectors plus sparse vectors for keyword matching), filtering, and payload storage alongside vectors. You can run it self-hosted or use their managed cloud offering. Qdrant consistently benchmarks among the fastest options for high-throughput query workloads, making it a strong choice for production systems with demanding latency requirements.

Weaviate is an open-source vector database with a module ecosystem that includes embedding model integrations, hybrid search, and generative search (vector search plus LLM generation in one query). Its hybrid search implementation is particularly mature, combining BM25 and vector search with automatic fusion. The trade-off is operational complexity: Weaviate has more moving parts than simpler alternatives, which means more configuration and monitoring.

pgvector is a PostgreSQL extension that adds vector storage and similarity search to an existing PostgreSQL database. If you already run PostgreSQL, pgvector avoids adding a separate database to your stack. It handles millions of vectors with HNSW indexing and supports exact or approximate nearest neighbor search. The main limitation is that pgvector is a single-node solution: it scales vertically (bigger machine) but not horizontally (more machines). For most applications under 10 million vectors, this is not a constraint. PostgreSQL's full-text search combines naturally with pgvector for hybrid search in a single SQL query.

ChromaDB is an open-source embedding database designed for developer simplicity. It stores documents, embeddings, and metadata together and includes a built-in embedding function that can generate embeddings automatically. ChromaDB is popular for prototyping and small-scale applications because it requires minimal setup. For production workloads at scale, most teams migrate to one of the more performance-focused options listed above.

Where Vector Search Falls Short

Vector search has systematic failure modes that matter for production retrieval systems. Understanding these limitations is essential for knowing when to augment vector search with other retrieval techniques like keyword search, reranking, or knowledge graph traversal.

Exact-match failure is the most common issue. When a user searches for "ERR_CONNECTION_REFUSED" or "deployment version 3.2.1" or "JIRA-4521," vector search often returns poor results because embedding models treat these as opaque strings rather than meaningful identifiers. The model has not seen enough training examples to learn that "ERR_CONNECTION_REFUSED" is a specific, important error code. Keyword search handles these queries trivially, which is the strongest argument for hybrid search.

Negation and absence queries confuse embeddings. "Which services do NOT use Redis" and "which services use Redis" produce nearly identical embeddings because the semantic content is similar (both are about services and Redis). The embedding does not encode the logical difference between "uses" and "does not use." Knowledge graph queries handle this naturally because they can check for the absence of a relationship.

Multi-hop reasoning is beyond the scope of single-query vector search. If answering a question requires following a chain of relationships ("what database does the service that handles payments use"), vector search only finds documents semantically similar to the surface-level query. It cannot follow the chain from "payments" to "checkout service" to "PostgreSQL" unless a single document happens to mention all three. Knowledge graphs and agentic retrieval (decomposing the query into sub-queries) address this limitation.

Temporal reasoning is weak in standard embeddings. "What changed last week" and "what changed six months ago" produce similar embeddings because the model treats temporal references as minor modifiers rather than fundamental query constraints. Metadata filtering (filtering by timestamp before or after vector search) is the standard solution.

The "lost in the middle" problem affects how retrieved context is used by LLMs. Research shows that LLMs pay less attention to information in the middle of their context window compared to information at the beginning and end. When vector search returns 10 relevant chunks, the most important information may not be at the top if the ranking is based purely on similarity scores. Reranking with a cross-encoder model helps by re-ordering results based on deeper semantic understanding.

Optimization and Scaling

Vector search performance depends on three factors: index type, quantization, and hardware. Each offers trade-offs between speed, accuracy, and cost.

HNSW (Hierarchical Navigable Small World) is the dominant index type for approximate nearest neighbor search. It builds a multi-layer graph where each node connects to its nearest neighbors, and search navigates this graph from top to bottom. HNSW provides 95 to 99% recall at sub-millisecond latency for millions of vectors. The trade-off is memory: the HNSW index must fit in RAM for optimal performance, and the index itself is roughly 2 to 4 times the size of the raw vectors. For a million 1536-dimensional vectors (6 GB raw), the HNSW index may require 12 to 24 GB of RAM.

IVF (Inverted File Index) partitions vectors into clusters and searches only the nearest clusters at query time. It uses less memory than HNSW because the full index does not need to fit in RAM, but query latency is higher and recall is lower (typically 90 to 95%). IVF is the better choice when you have more vectors than available RAM and can tolerate slightly lower recall.

Quantization reduces vector storage size by representing each dimension with fewer bits. Product quantization (PQ) compresses 1536 float32 dimensions (6,144 bytes) into roughly 384 to 768 bytes, an 8 to 16 times reduction. Scalar quantization is simpler, converting float32 to int8, giving a 4 times reduction with minimal recall loss. Binary quantization is the most aggressive, reducing each dimension to a single bit (192 bytes for 1536 dimensions) but with 5 to 10% recall degradation. The right quantization level depends on your accuracy requirements and cost constraints.
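A sketch of scalar quantization to int8, the simplest of the three approaches, using an assumed per-vector scale factor:

```python
import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    # Map each float32 value to an int8 using a per-vector scale factor.
    scale = float(np.abs(vec).max()) / 127.0
    if scale == 0.0:
        scale = 1.0
    return np.round(vec / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

vec = np.random.default_rng(0).standard_normal(1536).astype(np.float32)
q, scale = quantize_int8(vec)
print(vec.nbytes, q.nbytes)   # 6144 bytes vs 1536 bytes: the 4x reduction from float32 to int8
```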

Hardware choices shape the cost profile. CPU-based search handles up to roughly 5 million vectors at reasonable latency on modern server hardware. GPU-accelerated search (using NVIDIA's RAFT or FAISS GPU) pushes throughput 10 to 50 times higher, enabling sub-millisecond searches across billions of vectors. For most applications under 10 million vectors, CPU-based search on appropriately sized instances is the cost-effective choice. GPU acceleration becomes cost-justified when query volumes exceed thousands per second or vector counts exceed tens of millions.


Adaptive Recall combines vector search with cognitive scoring and knowledge graph traversal, so your retrieval is semantic, structural, and time-aware. Stop relying on similarity alone.
