Agentic RAG Explained: From Simple to Advanced
The Agentic RAG Spectrum
Agentic RAG is not a single architecture. It is a spectrum of increasing autonomy in the retrieval process. Each level adds capability at the cost of latency and complexity. Understanding where on this spectrum your application should sit is the most important design decision.
Level 0: Standard RAG (Not Agentic)
Query comes in, gets embedded, top-k chunks retrieved, passed to LLM, answer generated. No decision-making in the retrieval process. The query goes through the same pipeline regardless of complexity, ambiguity, or topic. This is the baseline that tutorials teach.
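The Level 0 pipeline can be sketched in a few lines. The bag-of-words "embedding" and in-memory corpus below are toy stand-ins for a real embedding model and vector store; the point is that there is no decision-making anywhere in the flow.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Placeholder: a real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(n * b[t] for t, n in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Level 0: one embedding, one top-k search, same path for every query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "API latency is often caused by unindexed database queries.",
    "Our refund policy allows returns within 30 days.",
    "Connection pooling reduces API response time under load.",
]
top = retrieve("why is my API slow", chunks, k=2)
```

The retrieved chunks would then be stuffed into the LLM prompt verbatim, whether or not they actually answer the question.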
Level 1: Query Rewriting
An LLM rewrites the user's query before retrieval to improve the match between query vocabulary and document vocabulary. The user asks "why is my API slow" and the rewriter produces "API latency causes," "high response time troubleshooting," and "performance optimization for API endpoints." Each rewritten query is searched, and the results are merged. This is the simplest agentic behavior because it requires only one additional LLM call and does not change the retrieval pipeline, just the input to it.
Query rewriting addresses vocabulary mismatch, the most common retrieval failure. It adds 200 to 500 milliseconds of latency (the rewrite LLM call) and typically improves recall by 10 to 20%. Most production RAG systems should implement at least this level.
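A minimal sketch of the rewrite-and-merge step. The `rewrite_query` stub returns the canned paraphrases from the example above; in practice it is a single LLM call, and `search_fn` is whatever retriever you already have.

```python
def rewrite_query(query):
    # Stand-in for the rewrite LLM call, using the paraphrases from the text.
    canned = {
        "why is my API slow": [
            "API latency causes",
            "high response time troubleshooting",
            "performance optimization for API endpoints",
        ]
    }
    return canned.get(query, [])

def search_with_rewrites(query, search_fn):
    # Search the original query plus each rewrite; merge with deduplication.
    seen, merged = set(), []
    for q in [query] + rewrite_query(query):
        for doc in search_fn(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

index = {
    "why is my API slow": ["doc-faq"],
    "API latency causes": ["doc-latency", "doc-faq"],
    "high response time troubleshooting": ["doc-tuning"],
    "performance optimization for API endpoints": ["doc-endpoints"],
}
results = search_with_rewrites("why is my API slow", lambda q: index.get(q, []))
```

Note that the retrieval pipeline itself is untouched: only the set of queries fed into it changes.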
Level 2: Routing
An LLM classifies the query and routes it to the most appropriate retrieval strategy or data source. A factual lookup goes to the knowledge base. A code question goes to the repository index. A recent event question goes to the real-time data feed. A comparison question triggers parallel searches across multiple sources. The router adds one LLM call for classification and enables the system to use the best retrieval approach for each query type rather than forcing everything through the same pipeline.
Routing is particularly valuable when your application has heterogeneous data sources: documentation in a vector store, structured data in a SQL database, entity information in a knowledge graph, and real-time data in an API. Without routing, you either miss sources or search all of them for every query (wasteful and noisy).
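A routing sketch under illustrative assumptions: the keyword rules in `classify` stand in for the LLM classifier, and the backend names in `ROUTES` are hypothetical.

```python
ROUTES = {
    "factual": "knowledge_base",
    "code": "repo_index",
    "recent": "realtime_feed",
}

def classify(query):
    # Stand-in for the LLM classification call; crude keyword rules only.
    q = query.lower()
    if any(w in q for w in ("function", "stack trace", "bug")):
        return "code"
    if any(w in q for w in ("today", "latest", "this week")):
        return "recent"
    return "factual"

def route(query):
    # One classification, then dispatch to the matching retrieval backend.
    return ROUTES[classify(query)]
```

A comparison query could map to a fourth label that fans out to several backends in parallel rather than a single one.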
Level 3: Query Decomposition
An LLM breaks complex queries into sub-questions, retrieves for each independently, and synthesizes the results. "Compare the pricing and performance of Services A, B, and C" becomes three pricing lookups and three performance lookups, each targeted at a specific piece of information. The decomposition agent then combines the six results into a coherent comparison.
This level addresses the fragmentation failure where the answer to a complex question is spread across multiple documents and no single retrieval finds it all. The cost is 3 to 10 LLM calls per query (decomposition, per-sub-question evaluation, synthesis) and a proportional latency increase. Use this level when complex, multi-part questions are a significant portion of your query traffic.
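The comparison example above can be sketched as a cross-product of aspects and subjects. A real decomposer is an LLM call that produces sub-questions freely; the cross-product here just reproduces the six targeted lookups from the example, and the synthesis step is elided.

```python
from itertools import product

def decompose(aspects, subjects):
    # An LLM would generate these; the cross-product mirrors the example:
    # 2 aspects x 3 services = 6 targeted sub-questions.
    return [f"{subject} {aspect}" for subject, aspect in product(subjects, aspects)]

def answer_complex(aspects, subjects, search_fn):
    # Retrieve per sub-question; a real system hands `evidence` to an LLM
    # synthesis step that writes the final comparison.
    sub_questions = decompose(aspects, subjects)
    return {q: search_fn(q) for q in sub_questions}

evidence = answer_complex(
    ["pricing", "performance"],
    ["Service A", "Service B", "Service C"],
    lambda q: [f"doc about {q}"],
)
```

Each sub-question retrieves independently, so no single retrieval has to cover the whole comparison.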
Level 4: Iterative Retrieval with Self-Evaluation
After each retrieval step, the agent evaluates whether it has enough information to answer the question. If not, it generates follow-up queries targeting the specific gaps. This creates a retrieval loop: retrieve, evaluate, identify gaps, retrieve more, evaluate again, until the agent is satisfied or reaches a maximum iteration count.
Self-evaluation prevents the agent from answering with incomplete information, a common problem at lower levels where the system generates from whatever it finds regardless of completeness. The trade-off is unpredictable latency: the loop might terminate after one iteration or after five, depending on how much information is available. Set a maximum iteration count (typically 3 to 5) and a time budget to keep latency bounded.
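The retrieve-evaluate loop, with the iteration bound, can be sketched as follows. The evaluator here is a stub that is satisfied once two documents are in hand; in a real system it is an LLM call that judges completeness and names the specific gaps.

```python
def iterative_retrieve(query, search_fn, evaluate_fn, max_iters=3):
    # Retrieve-evaluate loop. evaluate_fn inspects everything gathered so
    # far and returns (enough, follow_up_queries); max_iters bounds latency.
    gathered, queries = [], [query]
    for _ in range(max_iters):
        for q in queries:
            gathered.extend(search_fn(q))
        enough, queries = evaluate_fn(query, gathered)
        if enough or not queries:
            break
    return gathered

docs = iterative_retrieve(
    "database migration steps",
    lambda q: [f"doc:{q}"],
    # Stub evaluator: satisfied at two documents, otherwise asks for
    # prerequisites. A real evaluator is an LLM judging completeness.
    lambda q, gathered: (len(gathered) >= 2, [f"{q} prerequisites"]),
)
```

A production version would also enforce the time budget mentioned above, breaking out of the loop when a wall-clock deadline passes.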
Level 5: Autonomous Research Agent
The agent has access to multiple tools (vector search, keyword search, knowledge graph queries, SQL queries, web search, API calls) and autonomously decides which tools to use, in what order, and how to combine the results. It can generate hypotheses, test them against the data, revise its approach, and produce a comprehensive research report with citations and confidence assessments.
This level is appropriate for research, analysis, and investigation tasks where thoroughness matters more than speed. A single query might take 30 seconds to 5 minutes as the agent explores the knowledge space. It is not appropriate for real-time conversational applications where users expect sub-second responses.
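The core of a Level 5 agent is a tool-selection loop. This sketch uses a scripted planner so it runs standalone; in a real agent, `plan_fn` is an LLM that decides each step from the findings accumulated so far, and the tool names and functions here are illustrative.

```python
def research(question, tools, plan_fn, max_steps=10):
    # Agent loop: the planner sees the question plus findings so far, then
    # either picks the next (tool, argument) pair or declares itself done.
    findings = []
    for _ in range(max_steps):
        tool, arg = plan_fn(question, findings)
        if tool == "done":
            break
        findings.append((tool, tools[tool](arg)))
    return findings

# Scripted planner for illustration; a real planner is an LLM call per step.
script = iter([
    ("vector_search", "Service A pricing"),
    ("web_search", "Service A pricing 2024"),
    ("done", None),
])
tools = {
    "vector_search": lambda q: f"kb hit for {q}",
    "web_search": lambda q: f"web hit for {q}",
}
findings = research("How is Service A priced?", tools, lambda q, f: next(script))
```

The `max_steps` cap is what keeps an exploratory agent from running indefinitely; the minutes-long latencies come from dozens of such steps, each an LLM call plus a tool call.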
Choosing the Right Level
The decision depends on three factors: query complexity distribution, latency requirements, and accuracy expectations.
Mostly simple queries, low latency required: Level 1 (query rewriting). Adds minimal latency, catches vocabulary mismatch. Customer support bots, FAQ systems, documentation search.
Mixed query types, moderate latency acceptable: Level 2 (routing) plus Level 3 (decomposition) for complex queries. Route simple queries to the fast path and complex queries to the agentic path. Most production applications land here.
Complex queries dominant, accuracy critical: Level 4 (iterative retrieval). Medical, legal, financial applications where an incomplete answer is worse than a slow answer.
Research and analysis tasks: Level 5 (autonomous agent). Due diligence, competitive analysis, compliance review. Latency is measured in minutes, not milliseconds.
The Alternative to Building Agents
Agentic RAG solves retrieval problems by adding LLM reasoning at query time. An alternative approach solves the same problems at the storage and indexing level, so retrieval works better without per-query agent calls.
Cognitive scoring (base-level activation, spreading activation) provides the ranking improvements that reranking agents provide, but computed from pre-existing scores rather than per-query LLM calls. Knowledge graph traversal provides the multi-hop reasoning that decomposition agents provide, but through pre-indexed entity relationships rather than iterative retrieval. Memory consolidation provides the freshness and accuracy that self-evaluation agents check for, but maintained continuously rather than assessed at query time.
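To make "base-level activation" concrete: the standard formulation from the ACT-R cognitive architecture scores a memory from its access history alone, so the ranking signal is precomputed rather than produced by a per-query LLM call. This is an illustration of the general technique, not Adaptive Recall's exact formula.

```python
from math import log

def base_level_activation(access_ages_hours, decay=0.5):
    # ACT-R base-level activation: B = ln(sum over accesses of t_j^-d),
    # where t_j is the time since access j and d is the decay rate.
    # Frequently and recently accessed memories score higher, and the
    # score needs only access timestamps, not an LLM call at query time.
    return log(sum(t ** -decay for t in access_ages_hours))

recently_used = base_level_activation([1, 5, 24])  # three recent accesses
stale_memory = base_level_activation([500])        # one access, weeks ago
```

At recall time this score is combined with relevance (and, in spreading-activation schemes, with activation flowing in from related entities) to rank candidates without invoking a reranking agent.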
Adaptive Recall takes this approach. By scoring memories with cognitive science models, connecting them through a knowledge graph, and maintaining them through a lifecycle process, it achieves retrieval quality comparable to Level 3 or Level 4 agentic RAG in a single recall operation. The computational work happens at store time (entity extraction, relationship indexing, confidence scoring) and consolidation time (merging, freshness checking, contradiction detection) rather than at query time, resulting in both high accuracy and low latency.
Get agentic-quality retrieval without the agent. Adaptive Recall's cognitive scoring and graph traversal deliver multi-step retrieval quality in a single call.
Get Started Free