
How to Chunk Documents for Better Retrieval

Chunking splits documents into smaller pieces before embedding so that each piece produces a focused vector that matches specific queries well. The choice of chunking strategy and chunk size directly impacts retrieval quality: too large, and the embedding is diluted across multiple topics; too small, and individual chunks lack the context to be useful as retrieved results. This guide covers the main chunking approaches with working code and guidance on which to use when.

Why Chunking Matters

Embedding models produce a single fixed-length vector for each input text. When that input is a 5,000-word document covering authentication, database setup, deployment, and monitoring, the resulting vector is an average of all those topics. A query about "database connection limits" produces a vector that partially matches the document vector, but the match is weak because the document vector also represents three other unrelated topics. If that same database section were embedded as a separate chunk, the query vector would match it much more strongly because the chunk vector focuses entirely on the relevant topic.
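The dilution effect is easy to demonstrate with synthetic vectors. The sketch below is illustrative only: random unit vectors stand in for real topic embeddings, and the whole-document vector is modeled as their average. A query vector close to one topic scores much higher against that topic alone than against the averaged document.

import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Stand-ins for embeddings of four unrelated topics covered by one document
auth, database, deploy, monitor = (unit(rng.normal(size=256)) for _ in range(4))

doc_vector = unit(auth + database + deploy + monitor)   # whole-document embedding (average of topics)
query = unit(database + 0.1 * rng.normal(size=256))     # query about the database topic

print(float(query @ database))    # strong match against the focused chunk vector
print(float(query @ doc_vector))  # noticeably weaker match against the averaged document vector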

Chunking quality has a larger impact on retrieval accuracy than most other pipeline decisions. A study by LlamaIndex found that switching from 1,024-token chunks to 256-token chunks improved recall@5 by 12% on their benchmark dataset. At the other extreme, chunks that are too small (under 100 tokens) performed worse because they gave the embedding model too little context to produce meaningful vectors and too little content to be useful when included in an LLM prompt.

Step-by-Step Implementation

Step 1: Analyze your content structure.
Before choosing a chunking strategy, examine representative documents from your corpus. Are they structured with clear headings and sections (documentation, articles)? Are they unstructured running text (transcripts, emails, chat logs)? Do they have natural short units (FAQ entries, support tickets)? The answer determines which chunking approach will work best. Structured content benefits from semantic chunking at section boundaries. Unstructured content typically needs fixed-size or recursive chunking.
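One quick way to make this call programmatically is a rough heuristic pass over sample documents. The sketch below is a suggestion, not a fixed recipe: the markdown-style heading pattern and the thresholds are assumptions you would tune for your own corpus.

import re

def classify_structure(doc: str) -> str:
    # Rough heuristic: headings suggest structured content,
    # many short standalone blocks suggest FAQ/ticket-like units.
    headings = len(re.findall(r'^#{1,6}\s|^[A-Z][^\n]{0,60}\n[=-]{3,}', doc, re.MULTILINE))
    blocks = [b for b in re.split(r'\n\s*\n', doc) if b.strip()]
    avg_block_words = sum(len(b.split()) for b in blocks) / max(len(blocks), 1)
    if headings >= 3:
        return "structured"     # semantic chunking at section boundaries
    if avg_block_words < 60:
        return "short-units"    # index each unit as its own chunk
    return "unstructured"       # fixed-size or recursive chunking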
Step 2: Choose a chunking strategy.
There are four main approaches. Fixed-size chunking splits text at regular token intervals. Semantic chunking splits at natural boundaries like paragraphs and sections. Recursive chunking tries progressively smaller boundaries until chunks meet a target size. Parent-child chunking indexes small chunks for precision but returns larger parent contexts for completeness.
# Strategy 1: Fixed-size chunking
# Best for: uniform, unstructured text (transcripts, logs)
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 400, overlap: int = 50,
                      encoding_name: str = "cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        # Step forward by less than a full chunk so consecutive chunks overlap
        start += chunk_size - overlap
    return chunks
# Strategy 2: Semantic chunking (paragraph/section boundaries)
# Best for: structured documents with headings
import re

def semantic_chunks(text: str, max_tokens: int = 600, min_tokens: int = 100):
    # Split on double newlines (paragraph boundaries)
    paragraphs = re.split(r'\n\s*\n', text)
    chunks = []
    current_chunk = []
    current_size = 0
    for para in paragraphs:
        para_tokens = len(para.split()) * 1.3  # rough token estimate
        if current_size + para_tokens > max_tokens and current_size >= min_tokens:
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = [para]
            current_size = para_tokens
        else:
            current_chunk.append(para)
            current_size += para_tokens
    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks
# Strategy 3: Recursive chunking
# Best for: mixed content with varying section sizes
def recursive_chunks(text: str, max_tokens: int = 500, separators=None):
    if separators is None:
        separators = ['\n## ', '\n### ', '\n\n', '\n', '. ', ' ']
    token_count = len(text.split()) * 1.3
    if token_count <= max_tokens:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            current = parts[0]
            for part in parts[1:]:
                candidate = current + sep + part
                if len(candidate.split()) * 1.3 > max_tokens:
                    if current.strip():
                        chunks.extend(recursive_chunks(current, max_tokens, separators))
                    current = part
                else:
                    current = candidate
            if current.strip():
                chunks.extend(recursive_chunks(current, max_tokens, separators))
            return chunks
    # Last resort: hard split
    words = text.split()
    mid = len(words) // 2
    return (recursive_chunks(' '.join(words[:mid]), max_tokens, separators) +
            recursive_chunks(' '.join(words[mid:]), max_tokens, separators))
# Strategy 4: Parent-child chunking
# Best for: when you need precise matching with rich context
def parent_child_chunks(text: str, parent_size: int = 1000, child_size: int = 200,
                        overlap: int = 30):
    parents = fixed_size_chunks(text, chunk_size=parent_size, overlap=0)
    all_children = []
    for parent_idx, parent in enumerate(parents):
        children = fixed_size_chunks(parent, chunk_size=child_size, overlap=overlap)
        for child in children:
            all_children.append({
                "text": child,
                "parent_idx": parent_idx,
                "parent_text": parent
            })
    return all_children

# At query time: search against child embeddings,
# but return parent_text as context to the LLM
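At query time the child records are what you embed and search; the parents are what you hand to the LLM. A minimal sketch of that lookup, where embed and search_child_index are placeholders for your own embedding call and vector store query over the child records above:

def retrieve_with_parents(query: str, top_k: int = 5):
    # Placeholders: embed() and search_child_index() stand in for your embedding
    # model and vector store; each hit is assumed to carry the child record fields.
    hits = search_child_index(embed(query), top_k=top_k)
    seen, contexts = set(), []
    for hit in hits:
        if hit["parent_idx"] not in seen:   # deduplicate children from the same parent
            seen.add(hit["parent_idx"])
            contexts.append(hit["parent_text"])
    return contexts  # pass these larger parent chunks to the LLM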
Step 3: Set chunk size parameters.
The right chunk size depends on your query patterns. For specific, targeted queries ("what is the connection pool timeout"), smaller chunks (200 to 400 tokens) produce more precise matches. For broad, explanatory queries ("explain the authentication architecture"), larger chunks (600 to 1,000 tokens) provide more complete answers. Start with 400 tokens as a default and adjust based on retrieval evaluation. Most production systems settle between 300 and 600 tokens after tuning.
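When tuning, it helps to measure actual token counts rather than the rough word-count estimate used in the chunkers above. A small sketch, assuming tiktoken's cl100k_base encoding; the file path is a placeholder for one of your own documents:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    return len(enc.encode(text))

# Placeholder path: substitute a representative document from your corpus
document_text = open("docs/infrastructure-guide.md").read()

# Compare how two candidate sizes play out on a real document
for size in (300, 600):
    chunks = semantic_chunks(document_text, max_tokens=size)
    sizes = [token_len(c) for c in chunks]
    print(size, len(chunks), round(sum(sizes) / len(sizes)))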
Step 4: Add overlap between chunks.
Overlap ensures that information near chunk boundaries is captured in at least one chunk. Without overlap, a sentence spanning two chunks may not fully appear in either one, and neither chunk's embedding captures its meaning. An overlap of 10 to 15% of the chunk size (40 to 60 tokens for a 400-token chunk) handles most boundary cases without significantly increasing storage. Larger overlaps waste storage without improving retrieval.
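You can see the effect directly with the fixed-size chunker above: with overlap, the tail of one chunk reappears at the head of the next, so a sentence cut by the boundary survives intact in at least one of the two. An illustrative check on synthetic text:

sample = " ".join(f"Sentence {i} ends here." for i in range(2000))
chunks = fixed_size_chunks(sample, chunk_size=400, overlap=50)

# The last ~50 tokens of chunk 0 are repeated at the start of chunk 1,
# so a sentence straddling the 400-token boundary appears whole in chunk 1.
print(chunks[0][-80:])
print(chunks[1][:80])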
Step 5: Enrich chunks with metadata.
Attach metadata to each chunk that enables filtering and provides context when the chunk is retrieved. At minimum, include the source document identifier, the section heading (if available), and the chunk position within the document. This metadata supports pre-filtering (search only within a specific document or section) and provides context to the LLM (the chunk came from section "Database Configuration" of the infrastructure guide).
def enrich_chunk(chunk_text: str, source_doc: str, section: str, position: int) -> dict:
    return {
        "text": chunk_text,
        "metadata": {
            "source": source_doc,
            "section": section,
            "position": position,
            # Rough estimate; swap in a tokenizer count if exact sizes matter
            "token_count": int(len(chunk_text.split()) * 1.3)
        }
    }
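Putting the pieces together for one document might look like the following sketch. The section guess (first short line of the chunk) and the 400-token target are assumptions; adapt both to your own corpus.

def prepare_document(doc_text: str, doc_id: str) -> list[dict]:
    records = []
    for position, chunk in enumerate(semantic_chunks(doc_text, max_tokens=400)):
        # Naive section guess: treat a short first line as a heading
        lines = chunk.strip().splitlines()
        first_line = lines[0] if lines else ""
        section = first_line if 0 < len(first_line) < 80 else "unknown"
        records.append(enrich_chunk(chunk, source_doc=doc_id,
                                    section=section, position=position))
    return records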
Step 6: Evaluate and iterate.
Test different chunk sizes on a set of queries with known relevant answers. Measure recall@k: what fraction of known relevant chunks appear in the top k results. If recall is low on specific queries, chunks may be too large (topic dilution). If recall is low on broad queries, chunks may be too small (insufficient context). Iterate until recall stabilizes, then lock in the chunk size for your production pipeline.
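A minimal evaluation harness can be as simple as the sketch below. Here eval_set is a hand-labeled list of (query, relevant chunk ids) pairs and search is a placeholder for your own retriever; re-run it for each candidate chunk size.

def recall_at_k(eval_set, search, k: int = 5) -> float:
    # eval_set: list of (query, set_of_relevant_chunk_ids) pairs labeled by hand
    # search: function returning the top-k chunk ids for a query (your retriever)
    hits = 0
    total = 0
    for query, relevant_ids in eval_set:
        retrieved = set(search(query, k))
        hits += len(retrieved & relevant_ids)
        total += len(relevant_ids)
    return hits / total if total else 0.0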

Common Mistakes

Embedding entire documents without chunking is the most common mistake. Even with large embedding models that accept long inputs, the resulting vector is a semantic average that matches no specific query well. Always chunk, even if your documents are relatively short (under 1,000 tokens).

Splitting mid-sentence creates chunks where the beginning or end is semantically incomplete. Always split at sentence boundaries at minimum, and prefer paragraph or section boundaries when available.
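A minimal sketch of a sentence-safe variant; the regex splitter is a naive assumption, and a proper sentence tokenizer (for example from nltk or spaCy) is more robust on abbreviations and decimals.

import re

def split_sentences(text: str) -> list[str]:
    # Simple rule-based splitter on terminal punctuation followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def sentence_safe_chunks(text: str, max_tokens: int = 400) -> list[str]:
    chunks, current, size = [], [], 0
    for sent in split_sentences(text):
        sent_tokens = len(sent.split()) * 1.3  # rough token estimate
        if current and size + sent_tokens > max_tokens:
            chunks.append(' '.join(current))
            current, size = [], 0
        current.append(sent)
        size += sent_tokens
    if current:
        chunks.append(' '.join(current))
    return chunks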

Ignoring chunk overlap at boundaries causes information loss. Sentences and concepts that span chunk boundaries are partially captured in each chunk, reducing the embedding quality of both. Even 30 to 50 tokens of overlap significantly reduces this problem.

Adaptive Recall handles chunking, embedding, and retrieval as a managed pipeline. Store memories in natural language and the system handles the rest.
