
How to Persist Agent State Across Restarts

Persisting agent state means writing a checkpoint that captures the agent's current goal, completed steps, intermediate results, and key decisions to durable storage at regular intervals. When the agent restarts after a crash, deployment, or timeout, it reads the checkpoint and resumes from where it left off rather than starting the task over. This is critical for any agent that runs tasks lasting more than a few minutes, because without persistence, a restart means repeating all work from scratch.

What State Needs to Persist

An LLM-based agent's "state" is a mix of structured data and unstructured context. Not all of it needs to persist, and persisting the wrong parts (like the full conversation history) creates fragile checkpoints that are expensive to store and difficult to resume from. The key is separating factual state from ephemeral state.

Factual state is the information that the agent needs to continue working: the original goal, the plan it generated, which steps have been completed, the results of each completed step, decisions it made and why, and any facts it discovered about the environment. This state is structured, relatively compact, and can be serialized to JSON or a database without loss.

Ephemeral state is the information that is useful during active execution but not needed for recovery: the current conversation context (which can be reconstructed from factual state), temporary variables used in intermediate computations, and cached tool outputs that can be re-fetched. Persisting ephemeral state wastes storage and creates brittle checkpoints that fail when the conversation format changes or cached data expires.
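As a concrete (and hypothetical) example, the factual state for a report-summarization task might serialize to a JSON document like this, while the prompt text, scratch variables, and cached tool outputs are deliberately left out. Field names here are illustrative; Step 1 below formalizes the structure.

{
  "task_id": "task-123",
  "goal": "Summarize Q3 sales reports from the shared drive",
  "plan": [
    {"name": "list_reports", "status": "done"},
    {"name": "summarize_each", "status": "in_progress"},
    {"name": "compile_digest", "status": "pending"}
  ],
  "results": {"0": "Found 14 reports"},
  "decisions": ["Skipped drafts folder: files are incomplete"],
  "discoveries": ["Reports use fiscal quarters, not calendar quarters"]
}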

Step-by-Step Implementation

Step 1: Identify what state to persist.
Audit your agent's execution loop and list every piece of information it uses to make decisions. Classify each as factual (needed for recovery) or ephemeral (can be reconstructed). For most agents, the factual state fits in a JSON document under 10KB: the goal, the plan (a list of steps with status), results per step, key decisions, and discovered facts. Everything else is either reconstructable or not worth persisting.
from datetime import datetime


class AgentCheckpoint:
    def __init__(self, agent_id, task_id):
        self.agent_id = agent_id
        self.task_id = task_id
        self.goal = ""
        self.plan = []         # list of step dicts
        self.completed = []    # indices of completed steps
        self.results = {}      # step_index -> result
        self.decisions = []    # key decisions with reasoning
        self.discoveries = []  # facts learned during execution
        self.created_at = datetime.utcnow().isoformat()
        self.updated_at = self.created_at

    def to_dict(self):
        return {
            "agent_id": self.agent_id,
            "task_id": self.task_id,
            "goal": self.goal,
            "plan": self.plan,
            "completed": self.completed,
            "results": self.results,
            "decisions": self.decisions,
            "discoveries": self.discoveries,
            "created_at": self.created_at,
            "updated_at": datetime.utcnow().isoformat(),
        }

    @classmethod
    def from_dict(cls, data):
        cp = cls(data["agent_id"], data["task_id"])
        cp.goal = data["goal"]
        cp.plan = data["plan"]
        cp.completed = data["completed"]
        cp.results = data["results"]
        cp.decisions = data["decisions"]
        cp.discoveries = data["discoveries"]
        cp.created_at = data["created_at"]
        cp.updated_at = data["updated_at"]
        return cp
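As a quick sanity check, the class round-trips cleanly through serialization. The values below are illustrative:

cp = AgentCheckpoint("agent-01", "task-123")
cp.goal = "Summarize Q3 sales reports"
cp.plan = [{"name": "list_reports", "status": "pending"}]

# Serialize and reconstruct; the restored checkpoint matches the original
restored = AgentCheckpoint.from_dict(cp.to_dict())
assert restored.goal == cp.goal
assert restored.plan == cp.plan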
Step 2: Choose a persistence backend.
For agents running on a single machine, a local JSON file or SQLite database works. For distributed agents or agents that may run on different machines after a restart, use a shared database (PostgreSQL, DynamoDB) or a memory API. The key requirement is durability: the checkpoint must survive the same failure that killed the agent. Writing to the local filesystem of a container that gets destroyed on restart does not count as durable persistence.
import json
from datetime import datetime

import boto3

# DynamoDB persistence for distributed agents
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("agent-checkpoints")


def save_checkpoint(checkpoint):
    table.put_item(Item={
        "task_id": checkpoint.task_id,
        "agent_id": checkpoint.agent_id,
        "state": json.dumps(checkpoint.to_dict()),
        # Expire stale checkpoints after 7 days
        "ttl": int(datetime.utcnow().timestamp()) + 86400 * 7,
    })


def load_checkpoint(task_id):
    response = table.get_item(Key={"task_id": task_id})
    if "Item" in response:
        data = json.loads(response["Item"]["state"])
        return AgentCheckpoint.from_dict(data)
    return None
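For the single-machine case, the same save/load interface can sit on top of SQLite. A minimal sketch, assuming the AgentCheckpoint class from Step 1; the table name and database path are arbitrary:

import json
import sqlite3

DB_PATH = "checkpoints.db"


def _connect():
    # Create the table on first use so save and load both work cold
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(task_id TEXT PRIMARY KEY, state TEXT)"
    )
    return conn


def save_checkpoint_sqlite(checkpoint):
    conn = _connect()
    # INSERT OR REPLACE keyed on task_id, so repeated saves overwrite the row
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (task_id, state) VALUES (?, ?)",
        (checkpoint.task_id, json.dumps(checkpoint.to_dict())),
    )
    conn.commit()
    conn.close()


def load_checkpoint_sqlite(task_id):
    conn = _connect()
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE task_id = ?", (task_id,)
    ).fetchone()
    conn.close()
    return AgentCheckpoint.from_dict(json.loads(row[0])) if row else None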
Step 3: Implement checkpoint writes.
Write checkpoints at three points: after each step completes successfully, before any operation that might fail (external API calls, database mutations), and periodically during long-running steps (every 60 seconds of continuous work). The cost of writing a 10KB JSON document to a database is negligible compared to the cost of repeating hours of agent work. Err on the side of checkpointing too often rather than too rarely.
def run_agent_with_checkpoints(task_id, goal):
    # Check for existing checkpoint
    checkpoint = load_checkpoint(task_id)
    if checkpoint:
        print(f"Resuming from step "
              f"{len(checkpoint.completed)}"
              f"/{len(checkpoint.plan)}")
        start_step = len(checkpoint.completed)
    else:
        checkpoint = AgentCheckpoint("agent-01", task_id)
        checkpoint.goal = goal
        checkpoint.plan = generate_plan(goal)
        save_checkpoint(checkpoint)
        start_step = 0

    for i in range(start_step, len(checkpoint.plan)):
        step = checkpoint.plan[i]
        # Execute the step
        result = execute_step(step, checkpoint)
        # Record completion and checkpoint
        checkpoint.completed.append(i)
        checkpoint.results[str(i)] = result
        save_checkpoint(checkpoint)

    return compile_final_result(checkpoint)
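The loop above covers the first trigger (save after each step) but not the third: periodic saves inside a long-running step. One way to sketch that, using a monotonic clock so wall-clock adjustments don't skew the interval. The class name is ours, not a library API:

import time


class PeriodicCheckpointer:
    """Saves the checkpoint at most once per interval; call maybe_save()
    from inside a long-running step's inner loop."""

    def __init__(self, checkpoint, interval_seconds=60):
        self.checkpoint = checkpoint
        self.interval = interval_seconds
        self.last_save = time.monotonic()

    def maybe_save(self):
        # Cheap to call on every iteration; only writes when the interval elapses
        if time.monotonic() - self.last_save >= self.interval:
            save_checkpoint(self.checkpoint)
            self.last_save = time.monotonic()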
Step 4: Build the recovery path.
On startup, the agent checks for an existing checkpoint for its current task. If one exists, it reconstructs a fresh LLM context from the checkpoint data: "You are resuming a task. Here is the goal, the plan, what has been completed so far, and the results of each completed step. Continue from step N." This is more reliable than replaying the original conversation because it gives the LLM a clean, factual summary rather than a potentially confusing replay of old reasoning.
def build_resume_context(checkpoint):
    """Build a fresh LLM context from checkpoint state."""
    completed_summary = ""
    for i in checkpoint.completed:
        step = checkpoint.plan[i]
        result = checkpoint.results.get(str(i), "no result")
        completed_summary += (
            f"Step {i+1} ({step['name']}): "
            f"Completed. Result: {result}\n"
        )
    discoveries = "\n".join(
        f"- {d}" for d in checkpoint.discoveries
    )
    return f"""You are resuming a task after an interruption.

Goal: {checkpoint.goal}

Plan ({len(checkpoint.plan)} steps total):
{format_plan(checkpoint.plan)}

Completed so far:
{completed_summary}
Key discoveries:
{discoveries}

Continue from step {len(checkpoint.completed) + 1}."""
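format_plan is left undefined above; one plausible sketch, assuming each plan entry is a dict with a name and an optional status:

def format_plan(plan):
    # One numbered line per step, with its current status
    return "\n".join(
        f"{i + 1}. {step['name']} [{step.get('status', 'pending')}]"
        for i, step in enumerate(plan)
    )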
Step 5: Handle partial completion.
If the agent crashes mid-step, the step may be partially completed. An API call may have been sent but the response not recorded. A database write may have succeeded but the checkpoint not updated. Design steps to be idempotent where possible: calling them twice with the same input produces the same result. For non-idempotent operations (sending an email, creating a resource), add a pre-check that verifies whether the operation already completed before retrying.
def execute_step_idempotent(step, checkpoint):
    """Execute a step with idempotency checks."""
    # Derive the key once so the pre-check and the execution agree
    step_key = f"step_{step['index']}"
    idempotency_key = f"{checkpoint.task_id}:{step_key}"

    # Check if this operation already completed before retrying
    existing = check_operation_status(idempotency_key)
    if existing:
        return existing

    # Execute with idempotency key for external calls
    return execute_with_key(step, idempotency_key=idempotency_key)
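check_operation_status and execute_with_key are placeholders; what they look like depends on your backend. A sketch of the pre-check against a hypothetical agent-operations DynamoDB table, assuming execute_with_key records each successful external call there under its idempotency key:

def check_operation_status(idempotency_key):
    # Hypothetical table recording completed external operations
    ops_table = dynamodb.Table("agent-operations")
    response = ops_table.get_item(Key={"idempotency_key": idempotency_key})
    if "Item" in response:
        # Operation already ran; return its recorded result instead of retrying
        return json.loads(response["Item"]["result"])
    return None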

Combining Checkpoints with Long-Term Memory

Checkpoints and long-term memory serve different purposes but complement each other. Checkpoints capture task-specific state that is needed to resume a particular execution. Long-term memory captures knowledge that is useful across tasks and sessions. When a task completes, the agent should write its key findings and outcomes to long-term memory and then delete the checkpoint (or let it expire via TTL).
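In code, the hand-off at task completion might look like this sketch. memory_client stands in for whatever long-term memory interface you use (its store method is hypothetical), and table is the DynamoDB checkpoint table from Step 2:

def finalize_task(checkpoint, memory_client):
    # Promote durable knowledge to long-term memory...
    for discovery in checkpoint.discoveries:
        memory_client.store(
            discovery, metadata={"task_id": checkpoint.task_id}
        )
    # ...then drop the task-specific checkpoint
    table.delete_item(Key={"task_id": checkpoint.task_id})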

Adaptive Recall serves as both the long-term memory and a recovery context source. When an agent resumes, it recalls memories relevant to the current task, which gives it context from all previous sessions, not just the interrupted one. Combined with task-level checkpointing for step-by-step recovery, this provides a complete persistence strategy: checkpoints for intra-task continuity, Adaptive Recall for cross-task knowledge accumulation.

Build agents that survive restarts and accumulate knowledge. Adaptive Recall provides the persistent memory layer that your checkpointing strategy writes to.
