
How to Build Agentic RAG with Query Decomposition

Agentic RAG uses an LLM agent to decompose complex queries into sub-questions, retrieve for each independently, evaluate whether the results are sufficient, and synthesize a final answer from multiple retrieval passes. On complex, multi-part questions this approach improves accuracy by 25 to 40% over single-pass RAG, because each sub-question targets a specific piece of information instead of relying on one broad query to surface everything.

When You Need Agentic RAG

Single-pass RAG works well for simple, direct questions: "what is our refund policy," "how do I deploy to staging," "what is the default timeout." The query maps to a single topic, the relevant document uses similar vocabulary, and the top-k chunks contain the answer. No orchestration needed.

Agentic RAG becomes necessary when queries are complex, meaning they require information from multiple sources, involve comparisons, ask for analysis, or reference entities that are related through chains of connections. "Compare the authentication approaches of our three main services" requires three separate retrievals, one per service. "What downstream systems are affected if Redis goes down" requires finding what depends on Redis, then finding what depends on those dependencies. A single retrieval query for either of these questions returns fragments rather than the complete answer.

Step-by-Step Implementation

Step 1: Classify query complexity.
Not every query needs the agentic pipeline. Simple queries should go through the fast path (single retrieval, single generation) to avoid unnecessary latency and cost. Use an LLM classifier or rule-based heuristics to route queries. Queries with conjunctions ("and," "but," "compared to"), multiple entities, or multi-step reasoning ("if X then what happens to Y") are candidates for the agentic path. Queries with a single topic and no comparison or causation are fast-path candidates.
from anthropic import Anthropic

client = Anthropic()

CLASSIFY_PROMPT = """Classify this query as SIMPLE or COMPLEX.

SIMPLE: Single topic, direct lookup, one piece of information needed.
COMPLEX: Multiple topics, comparison, multi-step reasoning, or requires combining information from different sources.

Query: {query}

Respond with just SIMPLE or COMPLEX."""

def classify_query(query):
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.replace("{query}", query)}]
    )
    return response.content[0].text.strip()
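If you prefer the rule-based option, a keyword pre-filter can route obviously complex queries without an LLM call. The sketch below is illustrative: the COMPLEX_MARKERS list, looks_complex, and route are assumptions, not part of any library, and the markers should be tuned to your own query log.

# Sketch of a rule-based pre-filter; marker list and function names are illustrative.
COMPLEX_MARKERS = ("compare", "versus", " vs ", " and ", " but ", "if ", "affected", "depends")

def looks_complex(query: str) -> bool:
    q = query.lower()
    # Conjunctions, comparisons, and causal phrasing suggest the agentic path.
    return any(marker in q for marker in COMPLEX_MARKERS)

def route(query: str) -> str:
    # Cheap heuristic first; fall back to the LLM classifier for everything else.
    return "COMPLEX" if looks_complex(query) else classify_query(query)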
Step 2: Decompose complex queries.
Use an LLM to break the complex query into independent sub-questions. Each sub-question should be answerable from a single retrieval pass. The decomposition prompt should instruct the model to produce self-contained sub-questions (each sub-question should make sense on its own without context from the other sub-questions) and to cover all parts of the original question.
import json

DECOMPOSE_PROMPT = """Break this complex question into independent sub-questions.

Each sub-question should:
- Be answerable from a single document or data source
- Be self-contained (make sense without the other sub-questions)
- Cover one specific piece of information needed for the full answer

Return as a JSON array of strings.

Question: {query}"""

def decompose_query(query):
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1000,
        messages=[{"role": "user", "content": DECOMPOSE_PROMPT.replace("{query}", query)}]
    )
    # The prompt asks for a JSON array of strings, so parse the response directly.
    return json.loads(response.content[0].text)

For example, "Compare the authentication approaches of our three main services and which one is most secure" decomposes into: "What authentication approach does Service A use?", "What authentication approach does Service B use?", "What authentication approach does Service C use?", "What are the security characteristics of each authentication approach?" Each sub-question targets a specific piece of information that can be retrieved independently.
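In code, that call might look like the snippet below. The commented output is illustrative and simply mirrors the example above; the exact wording of the sub-questions depends on the model and will vary run to run.

sub_questions = decompose_query(
    "Compare the authentication approaches of our three main services "
    "and which one is most secure"
)
# Illustrative output (wording varies run to run):
# ["What authentication approach does Service A use?",
#  "What authentication approach does Service B use?",
#  "What authentication approach does Service C use?",
#  "What are the security characteristics of each authentication approach?"]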

Step 3: Retrieve for each sub-question.
Run each sub-question through your retrieval pipeline independently. This is the key insight of agentic RAG: instead of one broad query that returns a mixed bag of partially relevant chunks, each sub-question targets exactly the information it needs. For sub-questions about specific entities, a knowledge graph lookup may be more effective than vector search. For conceptual sub-questions, vector search with reranking works well. The agent can route each sub-question to the most appropriate retrieval path.
def retrieve_for_subquestions(sub_questions, retriever):
    results = {}
    for sq in sub_questions:
        chunks = retriever.search(sq, top_k=5)
        results[sq] = {
            "chunks": chunks,
            "confidence": max(c.score for c in chunks) if chunks else 0.0
        }
    return results
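If you want the agent to route each sub-question to a different retrieval path, a simple version looks like the sketch below. This is an assumption-heavy sketch: graph_retriever, entity_index, and the .lookup() / .search() interfaces stand in for whatever knowledge graph and vector store your stack actually exposes.

def retrieve_with_routing(sub_questions, vector_retriever, graph_retriever, entity_index):
    # Sketch only: graph_retriever, entity_index, and their methods are assumptions.
    results = {}
    for sq in sub_questions:
        entities = [e for e in entity_index if e.lower() in sq.lower()]
        if entities:
            # Entity-specific sub-question: traverse the knowledge graph.
            chunks = graph_retriever.lookup(entities, hops=1)
        else:
            # Conceptual sub-question: vector search (with reranking upstream).
            chunks = vector_retriever.search(sq, top_k=5)
        results[sq] = {
            "chunks": chunks,
            "confidence": max((c.score for c in chunks), default=0.0)
        }
    return results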
Step 4: Evaluate retrieval sufficiency.
After retrieving for each sub-question, check whether the results are sufficient to answer it. If the top chunk score is below a confidence threshold, or if an LLM evaluator determines that the retrieved chunks do not contain the needed information, generate a follow-up query with different phrasing and retrieve again. This iterative retrieval is what makes agentic RAG significantly more accurate than single-pass retrieval.
EVAL_PROMPT = """Given this question and the retrieved context, can the question be answered from this context?

Question: {question}

Context: {context}

Respond with:
- YES if the context contains enough information
- NO with a suggested rephrased query if the context is insufficient"""

def evaluate_and_retry(sq, chunks, retriever, max_retries=2):
    for attempt in range(max_retries):
        context = "\n".join(c.text for c in chunks)
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=200,
            messages=[{"role": "user", "content": EVAL_PROMPT.replace("{question}", sq)
                                                             .replace("{context}", context)}]
        )
        result = response.content[0].text
        if result.startswith("YES"):
            return chunks
        rephrased = result.split(":", 1)[-1].strip()
        chunks = retriever.search(rephrased, top_k=5)
    return chunks
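The confidence score from Step 3 can also act as a cheap first check before spending an LLM call on evaluation. The sketch below assumes the result dictionaries produced by retrieve_for_subquestions; the check_sufficiency name and the 0.5 threshold are placeholders to tune against your own score distribution.

def check_sufficiency(sq, retrieval, retriever, score_threshold=0.5):
    # Sketch: treat a high top-chunk score as sufficient and skip the LLM
    # evaluator; only low-confidence retrievals pay for the evaluate-and-retry loop.
    if retrieval["confidence"] >= score_threshold:
        return retrieval["chunks"]
    return evaluate_and_retry(sq, retrieval["chunks"], retriever)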
Step 5: Synthesize the final answer.
Combine the retrieved chunks from all sub-questions into a single context and generate the final answer. The synthesis prompt should instruct the LLM to address each part of the original question, cite specific sources, and acknowledge when a sub-question could not be fully answered. Include both the original question and the sub-questions in the prompt so the LLM understands the reasoning structure.
SYNTHESIZE_PROMPT = """Answer the original question by combining information from the research below. Cite specific sources. If any part cannot be answered from the available information, say so explicitly.

Original question: {original_query}

Research results:
{sub_results}"""

def synthesize(original_query, sub_results):
    formatted = ""
    for sq, data in sub_results.items():
        context = "\n".join(c.text for c in data["chunks"][:3])
        formatted += f"\n## {sq}\n{context}\n"
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        messages=[{"role": "user", "content": SYNTHESIZE_PROMPT
                  .replace("{original_query}", original_query)
                  .replace("{sub_results}", formatted)}]
    )
    return response.content[0].text
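Putting the steps together, a top-level entry point might look like the sketch below. The answer function and the fast-path handling are illustrative; only the functions defined in Steps 1 through 5 and a retriever with a .search() method are assumed.

def answer(query, retriever):
    # End-to-end sketch wiring the steps together.
    if classify_query(query) == "SIMPLE":
        # Fast path: single retrieval, single generation.
        chunks = retriever.search(query, top_k=5)
        return synthesize(query, {query: {"chunks": chunks}})
    sub_questions = decompose_query(query)
    sub_results = retrieve_for_subquestions(sub_questions, retriever)
    for sq, data in sub_results.items():
        # Re-retrieve with rephrased queries where the first pass fell short.
        data["chunks"] = evaluate_and_retry(sq, data["chunks"], retriever)
    return synthesize(query, sub_results)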

Managing Latency and Cost

Agentic RAG uses more LLM calls than single-pass RAG: one for classification, one for decomposition, one per sub-question for evaluation, and one for synthesis. With 4 sub-questions and one retry, that is 8 LLM calls compared to 1 for naive RAG. The cost difference is proportional, roughly 8x per query.

Three strategies manage this. First, route only complex queries to the agentic pipeline: if 80% of queries are simple, the blended cost is roughly 2.4x overall (0.8 × 1 + 0.2 × 8 = 2.4). Second, use a cheaper model (Haiku) for classification and evaluation, reserving the more capable model for decomposition and synthesis. Third, run sub-question retrievals and evaluations in parallel rather than sequentially, which reduces wall-clock latency even though total token usage stays the same.
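As a sketch of the third strategy, each sub-question's retrieval and evaluation can be fanned out with a thread pool; the function name and worker count below are illustrative.

from concurrent.futures import ThreadPoolExecutor

def retrieve_and_evaluate_parallel(sub_questions, retriever, max_workers=4):
    # Each sub-question is independent, so retrieval plus evaluation can run
    # concurrently. Threads are sufficient because the work is network-bound
    # (vector search and LLM calls), not CPU-bound.
    def handle(sq):
        chunks = retriever.search(sq, top_k=5)
        return sq, {"chunks": evaluate_and_retry(sq, chunks, retriever)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(handle, sub_questions))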

Adaptive Recall provides many of the benefits of agentic RAG without the multi-call overhead. Cognitive scoring handles recency and importance weighting. The knowledge graph handles entity relationship traversal. Spreading activation handles the discovery of indirectly related information. These mechanisms run as part of a single recall operation rather than requiring multiple LLM orchestration calls.

Get agentic retrieval quality without building the orchestration. Adaptive Recall's cognitive scoring and graph traversal find what simple retrieval misses, in a single call.
