Memory and State

← Back to Overview: Agentic AI

Why Agents Need Memory

A single-turn LLM call has no memory. Every call starts fresh. This is fine for answering a question, but an agent working on a multi-step task needs to remember what it has done, what it has learned, and what remains.

Without memory, an agent: - Cannot refer to the result of a tool call it made 10 steps ago - Cannot resume after a crash or restart - Cannot build on knowledge accumulated from previous tasks - Will repeat work it has already done

Memory is what enables agents to operate over extended horizons. But memory is not one thing — there are four distinct types, each serving a different purpose and operating at a different timescale.

The Four Memory Types

┌──────────────────────────────────────────────────────────────────┐
│  TIMESCALE: task session                                          │
│                                                                   │
│  1. In-Context Memory (short-term)                                │
│     The LLM's active context window. Everything the model        │
│     can "see" right now. Limited. Lost when session ends.        │
│                                                                   │
│  2. Working Memory (task state)                                   │
│     Structured state outside the context window. Task ledger,   │
│     partial results, assignments. Persisted to a database.       │
│     Survives agent restarts.                                     │
├──────────────────────────────────────────────────────────────────┤
│  TIMESCALE: cross-task, persistent                                │
│                                                                   │
│  3. Episodic Memory (event log)                                   │
│     Sequential log of everything that happened. Append-only.    │
│     Used for debugging, auditing, and training data.             │
│                                                                   │
│  4. Semantic Memory (long-term knowledge)                         │
│     Vector store of accumulated knowledge and preferences.       │
│     Queried semantically when relevant to the current task.      │
└──────────────────────────────────────────────────────────────────┘

Each type is accessed at a different moment, stored differently, and fails in different ways when it goes wrong.

1. In-Context Memory (Short-Term)

What it is: The content of the LLM's active context window. Everything the model "sees" in a given API call.

What goes into it: - System prompt (role, capabilities, constraints) - Conversation history (user messages and assistant responses) - Tool call results from this session - Current task goal and relevant context

Limits: Every model has a context window limit measured in tokens. As of 2024:

Model	Context Window
Claude 3.5 Sonnet	200k tokens
GPT-4o	128k tokens
Gemini 1.5 Pro	1M tokens
Gemini 2.0 Flash	1M tokens

While these numbers seem large, a 10-step agent task can easily accumulate 50–100k tokens if tool results include long documents.

What Happens When Context Fills Up

When context approaches the limit: 1. The model starts to lose attention to early content — it focuses on recent tokens 2. The original goal, stated at the beginning, gets "pushed out" of effective attention 3. Reasoning quality degrades — the agent starts contradicting its earlier work 4. The API will return an error if you exceed the hard limit

This degradation before hitting the hard limit is called context drift — and it's much harder to detect than a hard error.

Managing In-Context Memory

Strategy 1: Sliding window Keep only the most recent N messages in context. Drop older messages.

def truncate_to_recent(messages: list[dict], max_messages: int = 20) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    # Always keep system prompt + most recent messages
    recent = non_system[-max_messages:] if len(non_system) > max_messages else non_system
    return system + recent

Risk: Drops early context. If the user mentioned an important constraint 30 messages ago, the agent may forget it.

Strategy 2: Hierarchical summarization Summarize groups of older messages while keeping recent messages verbatim.

def compress_history(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    if len(messages) <= keep_recent + 2:  # +2 for system + current
        return messages

    system_msgs = [m for m in messages if m["role"] == "system"]
    convo_msgs = [m for m in messages if m["role"] != "system"]

    # Summarize old messages
    to_summarize = convo_msgs[:-keep_recent]
    recent = convo_msgs[-keep_recent:]

    summary_prompt = f"""
    Summarize the following conversation history into a compact representation.
    Preserve: key decisions made, important facts learned, actions taken.
    Omit: pleasantries, repeated information, irrelevant details.

    History: {json.dumps(to_summarize)}
    """
    summary = llm.invoke(summary_prompt).content

    summary_msg = {"role": "user", "content": f"[Summary of earlier conversation]: {summary}"}
    return system_msgs + [summary_msg] + recent

Strategy 3: Token budget management Allocate explicit token budgets to each context section and enforce them.

class ContextBudget:
    system_prompt: int = 2_000
    task_ledger: int = 1_000
    long_term_memories: int = 5_000
    recent_history: int = 50_000
    tool_results: int = 30_000

    @property
    def total(self) -> int:
        return sum([self.system_prompt, self.task_ledger, 
                   self.long_term_memories, self.recent_history, self.tool_results])

def build_context(state: AgentState, budget: ContextBudget) -> list[dict]:
    messages = []

    # Always: system prompt
    messages.append({"role": "system", "content": truncate(state.system_prompt, budget.system_prompt)})

    # Always: task ledger (compact, critical)
    messages.append({"role": "user", "content": truncate(state.task_ledger.to_markdown(), budget.task_ledger)})

    # Semantic memory results (most relevant to current step)
    if state.retrieved_memories:
        messages.append({"role": "user", "content": truncate(format_memories(state.retrieved_memories), budget.long_term_memories)})

    # Recent history + tool results
    recent = truncate_by_tokens(state.conversation_history, budget.recent_history + budget.tool_results)
    messages.extend(recent)

    return messages

2. Working Memory (Task State)

What it is: Structured data maintained by the orchestration layer, outside the LLM's context window. It persists the state of the current task across steps and agent restarts.

What goes into it:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class TaskState:
    task_id: str
    goal: str
    status: str                  # "running", "paused", "done", "failed"

    # Task structure
    subtasks: list["Subtask"]    # planned subtasks with statuses

    # Results
    intermediate_outputs: dict[str, Any]  # keyed by subtask ID
    final_output: str | None

    # Execution metadata
    current_step: int
    agent_assignments: dict[str, str]  # subtask_id → agent_id

    # Failure tracking
    errors: list[str]
    retry_counts: dict[str, int]  # subtask_id → attempts

    # Cost tracking
    total_tokens_used: int
    total_cost_usd: float

    # Timestamps
    created_at: str
    updated_at: str
    completed_at: str | None

@dataclass
class Subtask:
    id: str
    description: str
    status: str                  # "pending", "in_progress", "done", "failed", "skipped"
    dependencies: list[str]      # IDs of subtasks that must complete first
    assigned_agent: str | None
    output: Any | None
    error: str | None

The Task Ledger

The task ledger is the most important component of working memory. It is a structured record of what the agent set out to do and how far it has gotten. Unlike conversation history, it is compact and never summarized away.

@dataclass
class TaskLedger:
    goal: str
    subtasks: list[Subtask]

    def to_markdown(self) -> str:
        lines = [f"**Goal:** {self.goal}\n", "**Subtasks:**"]
        for task in self.subtasks:
            icon = {"pending": "⬜", "in_progress": "🔄", "done": "✅", "failed": "❌"}.get(task.status, "?")
            lines.append(f"- {icon} [{task.id}] {task.description}")
            if task.output:
                lines.append(f"  → Result: {str(task.output)[:100]}...")
            if task.error:
                lines.append(f"  → Error: {task.error}")
        return "\n".join(lines)

    def next_executable(self) -> list[Subtask]:
        """Return subtasks that have all dependencies satisfied."""
        done_ids = {t.id for t in self.subtasks if t.status == "done"}
        return [
            t for t in self.subtasks
            if t.status == "pending"
            and all(dep in done_ids for dep in t.dependencies)
        ]

    def is_complete(self) -> bool:
        return all(t.status in ("done", "skipped") for t in self.subtasks)

    def has_failed(self) -> bool:
        return any(t.status == "failed" for t in self.subtasks)

Checkpointing: Surviving Failures

Checkpointing serializes the full task state to persistent storage at defined intervals. If the system crashes, restarts, or the agent process dies, it can resume from the last checkpoint.

import json
import redis

class TaskCheckpointer:
    def __init__(self, redis_client: redis.Redis, ttl_seconds: int = 86400):
        self.redis = redis_client
        self.ttl = ttl_seconds

    def save(self, state: TaskState):
        key = f"task:{state.task_id}:checkpoint"
        self.redis.setex(key, self.ttl, json.dumps(state.__dict__))

    def load(self, task_id: str) -> TaskState | None:
        key = f"task:{task_id}:checkpoint"
        data = self.redis.get(key)
        if not data:
            return None
        return TaskState(**json.loads(data))

    def delete(self, task_id: str):
        self.redis.delete(f"task:{task_id}:checkpoint")

LangGraph has built-in checkpointing via its MemorySaver and SqliteSaver backends — every state transition is automatically persisted.

from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string("checkpoints.db")
graph = workflow.compile(checkpointer=memory)

# Resume from a previous checkpoint
config = {"configurable": {"thread_id": "task-001"}}
result = graph.invoke(inputs, config)  # continues from where it left off

State Machines

Some agents benefit from modeling their execution as an explicit state machine — where the current "state" determines what actions are valid and what transitions are possible.

from enum import Enum

class AgentState(Enum):
    PLANNING = "planning"
    RESEARCHING = "researching"
    WRITING = "writing"
    REVIEWING = "reviewing"
    AWAITING_HITL = "awaiting_hitl"
    DONE = "done"
    FAILED = "failed"

VALID_TRANSITIONS = {
    AgentState.PLANNING: [AgentState.RESEARCHING, AgentState.FAILED],
    AgentState.RESEARCHING: [AgentState.WRITING, AgentState.FAILED],
    AgentState.WRITING: [AgentState.REVIEWING, AgentState.AWAITING_HITL],
    AgentState.REVIEWING: [AgentState.DONE, AgentState.WRITING, AgentState.FAILED],
    AgentState.AWAITING_HITL: [AgentState.WRITING, AgentState.FAILED],
}

def transition(current: AgentState, next_state: AgentState) -> AgentState:
    if next_state not in VALID_TRANSITIONS.get(current, []):
        raise ValueError(f"Invalid transition: {current} → {next_state}")
    return next_state

State machines make reasoning about the system much easier — you can see at a glance what states are possible, what transitions are valid, and what could cause the system to get stuck.

3. Episodic Memory (Event Log)

What it is: An append-only sequential record of everything the agent did. Every LLM call, tool call, agent assignment, HITL decision, error, and state transition is logged.

What it is NOT: Episodic memory is not used for retrieval during the current task. It is written to during execution and read after execution — for debugging, auditing, cost analysis, and generating training data.

from dataclasses import dataclass, field
from datetime import datetime
import json

@dataclass
class AgentEvent:
    event_id: str
    task_id: str
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    event_type: str = ""  # "llm_call", "tool_call", "state_transition", "hitl", "error"

    # LLM call fields
    model: str | None = None
    input_tokens: int | None = None
    output_tokens: int | None = None
    cost_usd: float | None = None
    latency_ms: int | None = None

    # Tool call fields
    tool_name: str | None = None
    tool_args: dict | None = None
    tool_result: dict | None = None
    tool_success: bool | None = None

    # State transition fields
    from_state: str | None = None
    to_state: str | None = None

    # HITL fields
    action_proposed: str | None = None
    human_decision: str | None = None  # "approved", "rejected", "timeout"

    # Error fields
    error_type: str | None = None
    error_message: str | None = None
    stack_trace: str | None = None

class EventLog:
    def __init__(self, storage):
        self.storage = storage

    def append(self, event: AgentEvent):
        self.storage.append(event.task_id, event)

    def get_trace(self, task_id: str) -> list[AgentEvent]:
        return self.storage.get_all(task_id)

    def get_cost_summary(self, task_id: str) -> dict:
        events = self.get_trace(task_id)
        llm_events = [e for e in events if e.event_type == "llm_call"]
        return {
            "total_cost_usd": sum(e.cost_usd or 0 for e in llm_events),
            "total_input_tokens": sum(e.input_tokens or 0 for e in llm_events),
            "total_output_tokens": sum(e.output_tokens or 0 for e in llm_events),
            "llm_calls": len(llm_events),
        }

Using Episodic Memory for Debugging

When something goes wrong, episodic memory is the crime scene. The debugging workflow:

Load the full trace for the failed task
Find the first step where the state diverged from expected
Inspect tool call arguments at that step
Check whether the tool result was correctly used in the next step
Replay from the checkpoint before that step

def find_first_wrong_step(trace: list[AgentEvent]) -> AgentEvent | None:
    """Find the first tool call that returned an error or unexpected result."""
    for event in trace:
        if event.event_type == "tool_call" and not event.tool_success:
            return event
        if event.event_type == "error":
            return event
    return None

Using Episodic Memory as Training Data

High-quality traces where the agent succeeded can be used as few-shot examples or for fine-tuning:

def extract_training_examples(
    traces: list[list[AgentEvent]], 
    min_success_score: float = 0.9
) -> list[dict]:
    examples = []
    for trace in traces:
        # Only use successful traces
        if compute_success_score(trace) < min_success_score:
            continue

        # Extract the trajectory as a training example
        examples.append({
            "task": trace[0].task_id,
            "trajectory": [
                {"tool": e.tool_name, "args": e.tool_args, "result": e.tool_result}
                for e in trace if e.event_type == "tool_call"
            ],
            "final_answer": extract_final_answer(trace)
        })

    return examples

4. Semantic Memory (Long-Term Knowledge)

What it is: A vector store of knowledge and preferences accumulated across many tasks and sessions. When a new task starts, relevant memories are retrieved and included in the context.

What goes into it: - Successful research results from previous tasks - User preferences and style guides - Domain knowledge (product specs, company policies, technical facts) - Templates and successful past outputs - Learned calibrations ("user prefers bullet points over prose")

What does NOT go into it: - Raw conversation history (too much, too noisy) - Episodic events (those go in the event log) - Task state (that's working memory)

Storing to Semantic Memory

Not every output should be stored. Use selective storage — only store information that is likely to be useful in future tasks.

from langchain.vectorstores import Chroma
from langchain.embeddings import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
vector_store = Chroma(embedding_function=embeddings, persist_directory="./agent_memory")

@dataclass
class Memory:
    content: str
    memory_type: str           # "fact", "preference", "template", "procedure"
    source_task_id: str
    tags: list[str]
    importance: float          # 0.0 to 1.0 — used for pruning
    created_at: str
    last_accessed: str

def store_memory(content: str, memory_type: str, tags: list[str], importance: float):
    memory = Memory(
        content=content,
        memory_type=memory_type,
        source_task_id=current_task_id(),
        tags=tags,
        importance=importance,
        created_at=datetime.utcnow().isoformat(),
        last_accessed=datetime.utcnow().isoformat()
    )
    vector_store.add_texts(
        texts=[memory.content],
        metadatas=[memory.__dict__]
    )

Retrieving from Semantic Memory

At the start of each task (or each step where new context is needed), retrieve relevant memories using semantic search.

def retrieve_relevant_memories(
    query: str, 
    k: int = 5, 
    memory_types: list[str] | None = None
) -> list[Memory]:
    filter_dict = {}
    if memory_types:
        filter_dict["memory_type"] = {"$in": memory_types}

    results = vector_store.similarity_search_with_score(
        query, k=k, filter=filter_dict
    )

    # Only return memories above a similarity threshold
    return [
        Memory(**doc.metadata)
        for doc, score in results
        if score > 0.75
    ]

# Usage at task start
relevant_memories = retrieve_relevant_memories(
    query=task_goal,
    k=5,
    memory_types=["fact", "procedure"]
)

if relevant_memories:
    context_addition = "Relevant knowledge from previous tasks:\n" + "\n".join(
        f"- {m.content}" for m in relevant_memories
    )

Memory Pollution

Memory pollution occurs when incorrect, outdated, or low-quality information accumulates in long-term memory and starts degrading agent performance. This is a serious production concern for long-running systems.

Causes: - Storing intermediate results without verifying their correctness - Not updating memories when facts change - Retrieving a low-relevance memory and treating it as authoritative

Prevention:

# 1. Quality gate before storage
def store_if_verified(content: str, evidence: str, importance: float):
    if importance < 0.5:
        return  # don't store low-importance information

    # Verify before storing
    verification = llm.invoke(f"""
    Fact to store: {content}
    Evidence: {evidence}

    Is this fact accurate and worth storing for future reference?
    Reply YES or NO with brief reasoning.
    """)

    if verification.content.startswith("YES"):
        store_memory(content, "fact", [], importance)

# 2. TTL on memories
def prune_expired_memories(max_age_days: int = 90):
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    # Remove memories not accessed since cutoff and with low importance
    vector_store.delete(
        where={"$and": [
            {"last_accessed": {"$lt": cutoff.isoformat()}},
            {"importance": {"$lt": 0.7}}
        ]}
    )

# 3. Memory correction
def correct_memory(old_content: str, new_content: str, reason: str):
    # Find and delete the old memory
    results = vector_store.similarity_search(old_content, k=1)
    if results and results[0].page_content.strip() == old_content.strip():
        vector_store.delete([results[0].metadata["id"]])

    # Store the corrected version
    store_memory(new_content, "fact", [], importance=0.9)

Memory Access Pattern: Putting It Together

A well-designed agent uses all four memory types at the right moments:

TASK START:
  → Query semantic memory (long-term) for relevant knowledge
  → Load working memory (task state) if resuming
  → Initialize in-context memory (system prompt + task ledger)
  → Start appending to episodic memory (event log)

EACH STEP:
  → Read from in-context memory (LLM sees full context)
  → Read/write working memory (update task ledger, partial results)
  → Append to episodic memory (log the step)
  → Optionally query semantic memory for specific knowledge needs

TASK END:
  → Extract valuable findings → write to semantic memory
  → Finalize working memory (mark task done)
  → Archive episodic memory trace
  → Clear in-context memory (session ends)

class AgentMemoryManager:
    def __init__(self, vector_store, checkpointer, event_log):
        self.vector_store = vector_store
        self.checkpointer = checkpointer
        self.event_log = event_log

    def initialize_task(self, task_id: str, goal: str) -> TaskState:
        # Check for existing checkpoint (resume case)
        existing = self.checkpointer.load(task_id)
        if existing:
            return existing

        # Retrieve relevant prior knowledge
        memories = retrieve_relevant_memories(goal, k=5)

        # Create fresh state with retrieved context
        state = TaskState(
            task_id=task_id,
            goal=goal,
            prior_knowledge=memories,
            status="running"
        )
        self.checkpointer.save(state)
        return state

    def after_step(self, state: TaskState, event: AgentEvent):
        self.checkpointer.save(state)   # update working memory
        self.event_log.append(event)    # append to episodic memory

    def finalize_task(self, state: TaskState, final_output: str):
        # Extract and store valuable findings for future tasks
        if state.status == "done" and final_output:
            self._extract_and_store_knowledge(state.goal, final_output)

        state.status = "done"
        state.final_output = final_output
        self.checkpointer.save(state)

    def _extract_and_store_knowledge(self, goal: str, output: str):
        extraction_prompt = f"""
        Task: {goal}
        Output: {output}

        Extract 3-5 facts or findings from this output that would be useful in future, similar tasks.
        Format each as a single sentence. Be specific and factual.
        """
        findings = llm.invoke(extraction_prompt).content.split("\n")
        for finding in findings:
            if finding.strip():
                store_memory(finding.strip(), "fact", [], importance=0.7)

Study Notes

Four memory types, four purposes. In-context = what the LLM sees right now. Working = task state that survives restarts. Episodic = audit trail for debugging. Semantic = long-term knowledge across sessions. Using the wrong type for the job is a common source of bugs (e.g., trying to use in-context memory for cross-session knowledge).
The task ledger is the agent's spine. It's the one component that should always be in context, never summarized away, and always up to date. An agent without a task ledger is an agent that can forget what it was doing.
Checkpointing costs very little; not having it costs everything. A crashed task that must restart from scratch wastes all the API calls made before the crash. Checkpoint to Redis or SQLite after every significant step.
Semantic memory requires discipline. Storing everything poisons it. Store only verified, important facts with an explicit importance score. TTL and pruning are not optional for long-running production systems.
Context window management is an engineering task, not an LLM task. The LLM does not manage its own context. The framework must actively monitor token counts and apply compression strategies before the limit is hit, not after.

Q&A Review Bank

Q1: What are the four memory types and what is the primary purpose of each? [Easy] A: In-Context Memory is the LLM's active context window — everything it can "see" in a given API call; its primary purpose is providing the model with immediate, task-relevant context. Working Memory is structured state maintained outside the context window by the orchestration layer (task ledger, partial results, assignments); its purpose is preserving task progress across steps and agent restarts. Episodic Memory is an append-only event log of everything the agent did (tool calls, LLM calls, errors, HITL decisions); its purpose is debugging, auditing, and training data generation. Semantic Memory is a vector store of accumulated knowledge across sessions (facts, preferences, past research); its purpose is giving agents access to relevant knowledge from previous tasks without repeating work.

Q2: What is context drift and what is the most reliable prevention? [Medium] A: Context drift is the gradual degradation of agent reasoning quality as the context window fills — the agent starts losing attention to early content (including the original goal), its reasoning becomes inconsistent, and it may repeat work or contradict earlier decisions. The most reliable prevention is the task ledger: a compact, structured record of the goal and remaining subtasks that is always present at the top of the context, never summarized away, and updated after every step. Even if all historical messages are summarized, the task ledger ensures the agent always knows exactly what it set out to do and what remains.

Q3: Why is checkpointing critical for production agentic systems? [Medium] A: A multi-step agent task may involve 10–50 LLM calls and many tool calls. Without checkpointing, any failure — network timeout, process crash, server restart, budget exceeded — causes the entire task to fail and requires restarting from scratch, wasting all computation and API costs already incurred. With checkpointing (serializing task state to persistent storage after every step), a resumed task picks up from the last successful step. This is especially important for long-running tasks where restarting from scratch is expensive, tasks that involve HITL gates (the human may have already approved an action), and systems that need to handle failures gracefully rather than crashing.

Q4: What is memory pollution and what three mechanisms prevent it? [Hard] A: Memory pollution is the accumulation of incorrect, outdated, or low-quality information in semantic (long-term) memory, which then degrades agent performance in future tasks when retrieved as "authoritative knowledge." Three prevention mechanisms: (1) Quality gates — verify before storing; use an LLM to confirm a fact is accurate and worth preserving; skip low-importance information (importance < threshold). (2) TTL and pruning — set a maximum age for memories; automatically delete memories that haven't been accessed recently AND have low importance scores; facts that are regularly relevant will be re-accessed and survive pruning. (3) Memory correction — when a stored fact is discovered to be wrong, find and delete it from the vector store and replace it with the corrected version; this requires explicit correction logic, not just a new write (which would create a contradiction).

Q5: When should you retrieve from semantic memory vs read from working memory? [Medium] A: Retrieve from semantic memory (vector search) when you need knowledge that was accumulated across previous tasks — facts from past research, user preferences, domain knowledge, successful templates. It requires a similarity search and should happen at task initialization and at specific steps where the agent needs external knowledge. Read from working memory (structured state) when you need the current task's state — the task ledger, partial results, agent assignments, error log. Working memory is always available as structured data (no search required) and is read/written continuously throughout the task. The distinction: semantic memory is "what do I know from past experience?"; working memory is "what is the current state of this specific task?"

Q6: Describe a complete memory access pattern for a multi-step agent task. [Hard] A: At task start: query semantic memory (vector search) for knowledge relevant to the goal; load working memory (task state from checkpoint) if resuming; initialize in-context memory (system prompt + task ledger + retrieved semantic memories); start appending to episodic memory. Each step: LLM reads full in-context memory; after the step, update working memory (mark subtask progress, append tool results to intermediate outputs); append the step as an event to episodic memory; optionally query semantic memory for step-specific knowledge needs. At task end: extract high-quality findings from the final output and write them to semantic memory for future use; finalize working memory (mark task done, store final output); archive the episodic trace; discard in-context memory (session ends). This full lifecycle ensures the agent benefits from past experience (semantic), recovers from failures (working + checkpointing), and produces debuggable traces (episodic).