How to Implement Checkpoint Recovery for Agents
Why Conversation Replay Does Not Work
The most intuitive recovery approach is to save the full conversation history and replay it on restart. This fails for three reasons. First, LLM responses are non-deterministic: replaying the same conversation may produce different tool calls and different reasoning, sending execution down a divergent path. Second, tool calls in the replayed conversation are unreliable either way: re-executing them may return different results because the system state has changed since the original run, while feeding back cached outputs means the agent decides based on stale data. Third, the conversation accumulates noise: intermediate reasoning steps, retries, and dead ends all persist in the conversation and consume context window space on recovery.
Checkpoint recovery avoids these problems by storing facts rather than conversation. The checkpoint says "Step 3 completed, the database has 47 tables, the largest is users at 2.3M rows." On recovery, the agent receives these facts directly and continues from step 4 without re-running steps 1 through 3 or trying to reconstruct why it made the decisions it made during those steps.
Step-by-Step Implementation
Checkpoint granularity determines how much work you repeat on recovery. Fine-grained checkpointing (after every tool call) minimizes repeated work but adds latency to every operation. Coarse-grained checkpointing (after each plan step) has lower overhead but means you may repeat the last partial step on recovery. For most agents, checkpointing at plan-step boundaries is the right trade-off: each step typically takes 1 to 5 minutes, so the worst-case repeated work is one step.
The checkpoint should be a self-contained document that another instance of the same agent can use to continue execution. Serialize the goal, the full plan, the list of completed step indices, the result of each completed step, key decisions made (with brief reasoning), facts discovered during execution, and a version identifier for the checkpoint format. Write this to a database that survives the failure that killed the agent.
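As a concrete illustration, a checkpoint document covering the fields above might look like the following. The field names and sample values here are assumptions for illustration, not a fixed schema; adapt them to your agent's planner.

```python
# Illustrative checkpoint document. Field names are assumptions,
# not a required schema.
checkpoint_state = {
    "version": 2,
    "goal": "Audit the production database schema",
    "plan": [
        {"name": "List all tables"},
        {"name": "Measure table sizes"},
        {"name": "Flag unindexed foreign keys"},
        {"name": "Write summary report"},
    ],
    "completed": [0, 1, 2],  # indices of finished plan steps
    "results": {
        "0": "47 tables found",
        "1": "largest is users at 2.3M rows",
        "2": "3 unindexed foreign keys flagged",
    },
    "decisions": ["Skipped archived tables: out of audit scope"],
    "discoveries": ["users table has 2.3M rows"],
}
```

Everything a fresh agent instance needs to continue is in this one document; nothing depends on the original conversation surviving.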
```python
import hashlib
import json
from datetime import datetime, timezone


class CorruptCheckpointError(Exception):
    """Raised when a checkpoint fails its integrity check."""


class CheckpointManager:
    def __init__(self, storage_backend):
        self.storage = storage_backend

    def write(self, task_id, state):
        """Write a checkpoint with integrity verification."""
        payload = json.dumps(state, sort_keys=True)
        checksum = hashlib.sha256(payload.encode()).hexdigest()[:16]
        self.storage.put(
            key=f"checkpoint:{task_id}",
            value={
                "state": payload,
                "checksum": checksum,
                "version": 2,
                "written_at": datetime.now(timezone.utc).isoformat(),
            },
        )

    def read(self, task_id):
        """Read and verify a checkpoint."""
        record = self.storage.get(f"checkpoint:{task_id}")
        if not record:
            return None
        payload = record["state"]
        expected = hashlib.sha256(payload.encode()).hexdigest()[:16]
        if expected != record["checksum"]:
            raise CorruptCheckpointError(
                f"Checksum mismatch for {task_id}"
            )
        return json.loads(payload)

    def delete(self, task_id):
        """Clean up after task completion."""
        self.storage.delete(f"checkpoint:{task_id}")
```

When the agent starts, query the checkpoint store for any tasks assigned to this agent that have a checkpoint but are not marked as complete. These represent interrupted executions that need to be resumed. If multiple incomplete tasks exist, prioritize by the checkpoint's age (most recent first) or by the task's priority level.
```python
def agent_startup(agent_id, checkpoint_mgr):
    """Check for interrupted tasks on startup."""
    # find_incomplete returns checkpoint records for this agent's
    # tasks that were never marked complete.
    incomplete = checkpoint_mgr.find_incomplete(agent_id)
    if not incomplete:
        return None  # No recovery needed
    # Resume the most recent interrupted task
    latest = max(incomplete, key=lambda c: c["written_at"])
    state = checkpoint_mgr.read(latest["task_id"])
    completed_count = len(state["completed"])
    total_steps = len(state["plan"])
    print(f"Recovering task {latest['task_id']}: "
          f"{completed_count}/{total_steps} steps completed")
    return state
```

Build a new LLM prompt that summarizes the checkpoint state in a way that allows the agent to continue naturally. Include the goal, the plan, summaries of completed steps and their results, key discoveries, and an explicit instruction to continue from the next incomplete step. This prompt should read like a briefing document for a new team member taking over a task, because that is essentially what is happening: a fresh LLM instance is picking up where the previous one left off.
```python
def build_recovery_prompt(state):
    """Build an LLM prompt from checkpoint state."""
    steps_summary = []
    for i, step in enumerate(state["plan"]):
        if i in state["completed"]:
            result = state["results"].get(
                str(i), "completed, no result recorded"
            )
            steps_summary.append(
                f"  [{i + 1}] {step['name']} - DONE: {result}"
            )
        else:
            steps_summary.append(
                f"  [{i + 1}] {step['name']} - PENDING"
            )
    discoveries = "\n".join(
        f"  - {d}" for d in state.get("discoveries", [])
    )
    next_step = len(state["completed"]) + 1
    steps_text = "\n".join(steps_summary)
    return f"""You are resuming a task that was interrupted.
Do not repeat completed steps. Continue from step {next_step}.

GOAL: {state['goal']}

PLAN STATUS:
{steps_text}

FACTS DISCOVERED SO FAR:
{discoveries}

Continue executing from step {next_step}. Use the results
from completed steps as context. If any completed step's
result seems outdated, verify it before relying on it."""
```

If the agent was interrupted during a step that has external side effects (creating a resource, sending a message, modifying a database), check whether that operation completed before retrying. The safest approach is to use idempotency keys for all external operations, so retrying the same operation with the same key is a no-op if it already succeeded.
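One way to sketch idempotency keys is to derive a stable key from the task and step, so a retry after recovery produces the same key as the original attempt. The executor and key-store interfaces below are illustrative assumptions; in production the key store would be a durable database, not the in-memory stand-in shown here.

```python
class InMemoryKeyStore:
    """Stand-in for a durable key store; illustrative only."""

    def __init__(self):
        self._done = {}

    def seen(self, key):
        return key in self._done

    def result(self, key):
        return self._done[key]

    def mark(self, key, result):
        self._done[key] = result


class IdempotentExecutor:
    """Wraps external side effects so retrying a completed
    operation is a no-op that returns the original result."""

    def __init__(self, key_store):
        self.key_store = key_store

    def run(self, task_id, step_index, operation):
        # Stable key: the retry after recovery derives the same
        # key as the original attempt.
        key = f"{task_id}:step:{step_index}"
        if self.key_store.seen(key):
            return self.key_store.result(key)  # already done
        result = operation()
        self.key_store.mark(key, result)
        return result
```

With this wrapper, re-running an interrupted step is safe: if the side effect already happened, the stored result is returned and the operation is not repeated.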
After a task completes successfully, delete the checkpoint or let it expire via TTL. Before deletion, extract any key findings and store them in long-term memory so they benefit future tasks. Stale checkpoints that accumulate in the store waste storage and can cause confusion if the checkpoint detection logic finds them during a future startup.
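The cleanup path can be sketched as a single function: promote discoveries, then delete. The `memory.remember()` interface is an assumed stand-in for whatever long-term memory store your agent uses.

```python
def finalize_task(task_id, checkpoint_mgr, memory):
    """On successful completion: promote discoveries to long-term
    memory, then remove the checkpoint so stale state cannot
    confuse a future startup. `memory` is any object exposing a
    remember() method -- an assumed interface, not a real library."""
    state = checkpoint_mgr.read(task_id)
    if state:
        for fact in state.get("discoveries", []):
            memory.remember(fact)
    checkpoint_mgr.delete(task_id)
```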
Edge Cases to Handle
Checkpoint too old. If the agent finds a checkpoint from days ago, the system state may have changed significantly. Add a maximum checkpoint age threshold. Checkpoints older than the threshold should trigger a re-evaluation of the plan rather than blind resumption, since the assumptions that informed the original plan may no longer hold.
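The age check is a few lines; the 24-hour threshold below is an illustrative value to tune per deployment, and the `"resume"`/`"replan"` return values are hypothetical labels for the two recovery modes.

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold; tune per deployment.
MAX_CHECKPOINT_AGE = timedelta(hours=24)

def recovery_mode(written_at_iso, now=None):
    """Return 'resume' for a fresh checkpoint, 'replan' for one
    old enough that its assumptions need re-evaluation."""
    now = now or datetime.now(timezone.utc)
    written = datetime.fromisoformat(written_at_iso)
    if written.tzinfo is None:
        written = written.replace(tzinfo=timezone.utc)
    age = now - written
    return "resume" if age <= MAX_CHECKPOINT_AGE else "replan"
```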
Plan is no longer valid. The original plan may reference resources, services, or conditions that no longer exist. Before resuming, validate that the remaining steps are still feasible. If the goal has changed or the environment has shifted, re-plan from the current state rather than continuing the old plan.
Multiple recovery attempts. If the agent fails repeatedly on the same step, the checkpoint will cause it to retry that step indefinitely. Add a retry counter per step and escalate (skip the step with a failure note, alert a human, or abort the task) after a configurable number of retries.
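The retry counter can live inside the checkpoint itself, so it survives restarts along with the rest of the state. A minimal sketch, assuming a `retries` field is added to the checkpoint schema and a limit of three attempts:

```python
MAX_STEP_RETRIES = 3  # illustrative limit

def record_step_failure(state, step_index):
    """Increment the per-step retry counter stored in the
    checkpoint state and report whether the step should be
    escalated instead of retried again. The 'retries' field is
    an assumed extension of the checkpoint schema."""
    retries = state.setdefault("retries", {})
    key = str(step_index)
    retries[key] = retries.get(key, 0) + 1
    return retries[key] >= MAX_STEP_RETRIES  # True -> escalate
```

Because the counter is checkpointed, a crash-restart loop on one step still converges to escalation rather than retrying forever.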
Build agents that recover gracefully. Adaptive Recall provides persistent memory that your checkpoint system can write to and read from, so recovered agents have both task state and institutional knowledge.
Try It Free