How to Handle Tool Errors and Retries in AI
Before You Start
You need a working tool execution loop from the function calling guide. You should also have production logs or test scenarios that generate tool failures: network timeouts, API rate limits, authentication errors, invalid parameters, and service unavailability. Testing error handling against real failure modes is essential because theoretical error handling often misses the cases that actually occur.
Step-by-Step Implementation
When a tool fails, the error message you return to the model determines whether it can recover or whether it produces a useless response. Raw error codes like "500 Internal Server Error" or "ECONNREFUSED" tell the model nothing actionable. Instead, format errors as natural language explanations that describe what happened, why, and what the model's options are.
```python
# Bad: raw error, model cannot reason about it
{"error": "ConnectionError: ECONNREFUSED 10.0.1.42:5432"}

# Good: natural language with context and options
{
  "error": "The order database is temporarily unreachable. This is likely a transient network issue. You can tell the user you are unable to look up their order right now and suggest they try again in a few minutes, or offer to help with something else."
}

# Good: parameter error with correction guidance
{
  "error": "Invalid customer ID format. Expected a UUID like '550e8400-e29b-41d4-a716-446655440000' but received 'john_smith'. If the user provided a name instead of an ID, use search_customers to find their ID first."
}
```
The key insight is that the model uses the error message to decide what to do next, just like it uses a successful result. A clear, actionable error message enables the model to try a different approach, ask the user for clarification, or explain the situation accurately. A cryptic error message forces the model to hallucinate an explanation, which is almost always worse.
Many tool failures are transient: network timeouts, rate limits (HTTP 429), temporary service unavailability (HTTP 503). These failures resolve themselves within seconds or minutes. The execution layer should retry transient failures automatically with exponential backoff, so the model never sees them unless they persist beyond the retry budget.
```python
import time

# HTTPError (with a .status_code attribute) is assumed to be raised by the
# underlying tool call; substitute your HTTP client's exception type.
TRANSIENT_ERRORS = {429, 500, 502, 503, 504}
MAX_RETRIES = 3
BASE_DELAY = 0.5  # seconds

def execute_with_retry(func, arguments):
    for attempt in range(MAX_RETRIES + 1):
        try:
            return func(**arguments)
        except HTTPError as e:
            if e.status_code in TRANSIENT_ERRORS and attempt < MAX_RETRIES:
                # Exponential backoff: 0.5s, 1s, 2s
                time.sleep(BASE_DELAY * (2 ** attempt))
            else:
                raise
```
Keep the retry budget short: 3 retries with exponential backoff (0.5s, 1s, 2s) adds at most 3.5 seconds to a tool call. Users tolerate this latency for a successful result. They do not tolerate 30 seconds of retries that ultimately fail anyway. Set retry limits based on how long the user will wait, not on how long the service might take to recover.
For tools that are essential to the agent's core functionality, define fallback behavior that activates when the primary path fails. A knowledge base search that times out might fall back to a cached index with slightly stale data. An inventory check that fails might return the last known stock levels with a staleness warning. A real-time pricing tool that is down might return the last cached price with a note that it may have changed.
```python
def get_product_info(product_id):
    try:
        return product_api.get(product_id, timeout=5)
    except (TimeoutError, ServiceUnavailableError):
        cached = product_cache.get(product_id)
        if cached:
            cached["_note"] = f"Cached data from {cached['_cached_at']}. Live data temporarily unavailable."
            return cached
        return {"error": "Product information is temporarily unavailable and no cached data exists. Ask the user to try again in a few minutes."}
```
Fallback design requires a judgment call about what "degraded but useful" means for each tool. For read-only tools (lookups, searches), serving slightly stale data is almost always better than returning an error. For write tools (creating records, processing transactions), there is no safe fallback because partial or stale writes can cause real problems. Write tool failures should be surfaced to the model with clear explanations, not silently degraded.
A circuit breaker tracks failure rates for each tool and stops calling a tool entirely when its failure rate exceeds a threshold. This prevents the agent from repeatedly hammering a failing service (which can worsen the outage) and eliminates the latency of waiting for timeouts on calls that will certainly fail. When the circuit breaker is open, the tool immediately returns a cached or fallback response without attempting the network call.
```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are being blocked."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = 0
        self.state = "closed"  # closed = normal, open = blocking

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # allow one trial call
            else:
                raise CircuitOpenError("Service is temporarily unavailable")
        try:
            result = func(*args, **kwargs)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
            raise
```
The execution layer catches CircuitOpenError and substitutes the cached or fallback response described above, so the model never waits on a call that is certain to fail.

When a tool fails midway through a chain, the worst response is to discard all progress and show a generic error. The best response is to communicate what succeeded, what failed, and what the user's options are. Include the chain progress in the error context so the model can summarize partial results accurately.
Build a chain state tracker that records each step's result. When a step fails, package the entire chain state (completed steps with results, the failed step with its error, remaining planned steps) and include it in the error message to the model. The model can then generate a response like: "I found your order #A6789 and confirmed it is eligible for a return, but the return processing system is currently unavailable. Would you like me to try again in a few minutes, or would you prefer to call support directly?"
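One minimal sketch of such a tracker, assuming step functions that raise on failure. The names ChainTracker, ChainStepError, and the example step names are illustrative assumptions, not part of the guide:

```python
class ChainStepError(Exception):
    """Carries the full chain state so the model can report partial progress."""

    def __init__(self, chain_state):
        self.chain_state = chain_state
        super().__init__(f"Step '{chain_state['failed_step']['name']}' failed")


class ChainTracker:
    """Records each step of a tool chain so failures can be reported with context."""

    def __init__(self, planned_steps):
        self.planned = list(planned_steps)   # step names, in order
        self.completed = []                  # (step_name, result) pairs

    def run(self, step_name, func, *args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception as e:
            # Package completed steps, the failed step, and remaining steps.
            remaining = self.planned[self.planned.index(step_name) + 1:]
            raise ChainStepError({
                "completed_steps": self.completed,
                "failed_step": {"name": step_name, "error": str(e)},
                "remaining_steps": remaining,
            }) from e
        self.completed.append((step_name, result))
        return result
```

On failure, chain_state can be serialized into the tool error message, giving the model everything it needs to summarize what succeeded before offering the user next steps.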
Error Categories and Recommended Actions
Different error types warrant different handling strategies:

- Authentication errors (401, 403): do not retry; they indicate a configuration problem, not a transient failure. Surface them clearly to the model with a suggestion to inform the user that the operation requires different permissions.
- Validation errors (400): the model passed bad parameters. Return the validation details so the model can correct the parameters and retry.
- Rate limits (429): retry with the delay specified in the Retry-After header if present, or with exponential backoff.
- Server errors (500, 502, 503, 504): retry with backoff because they are usually transient.
- Timeouts: retry once (the service may have been briefly slow), then surface as unavailability.
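These rules can be sketched as a small dispatch helper. The recommended_action function and its return values are illustrative, and the Retry-After parsing assumes the header carries a delay in seconds (the header may also carry an HTTP-date, which this sketch ignores):

```python
def recommended_action(status_code, headers=None, attempt=0):
    """Map an HTTP error to a handling strategy per the rules above.

    Returns ("retry", delay_seconds) or ("surface", guidance_for_model).
    """
    headers = headers or {}
    if status_code in (401, 403):
        # Configuration problem, not transient: never retry.
        return ("surface", "Operation requires different permissions; inform the user.")
    if status_code == 400:
        # Model passed bad parameters: return details so it can correct them.
        return ("surface", "Invalid parameters; include validation details for correction.")
    if status_code == 429:
        # Honor Retry-After when present, otherwise fall back to exponential backoff.
        retry_after = headers.get("Retry-After")
        delay = float(retry_after) if retry_after else 0.5 * (2 ** attempt)
        return ("retry", delay)
    if status_code in (500, 502, 503, 504):
        # Usually transient: retry with exponential backoff.
        return ("retry", 0.5 * (2 ** attempt))
    return ("surface", f"Unhandled error {status_code}; report the tool as unavailable.")
```

The caller feeds the "retry" delay into its sleep loop and turns "surface" guidance into a natural-language error message for the model, following the formatting pattern shown earlier.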
Give your agent error handling that improves with experience. Adaptive Recall stores tool failure patterns so your agent learns which errors are transient and which require alternative approaches.
Try It Free