Working memory via todos

TL;DR

Working memory via todos gives agents a simple, explicit todo list tool that tracks task progress, preventing the agent from losing its place during multi-step operations that span dozens of tool calls.
The todo list acts as structured working memory: a small scratchpad the agent reads at each step to orient itself, separate from the sprawling conversation history.
The tool interface is intentionally minimal: TodoWrite(items) creates or replaces the list, TodoRead() retrieves it. Status values: not-started, in-progress, completed.
Analysis of 88 real Claude conversation sessions showed that TodoWrite usage correlated directly with smoother sessions and fewer "forgotten subtask" failures (Balic, 2025).
The todo list is for the agent, not the user. It's a cognitive scaffold that makes internal state explicit and inspectable.
Limitation: agents sometimes forget to update the list, create overly granular items, or treat the todo as a plan rather than a tracker. Prompt engineering helps, but doesn't eliminate these failure modes.

The Problem It Solves

Your AI coding agent is refactoring a payment module. The task involves: updating the data model, modifying three service files, adjusting the API layer, updating tests, and running the test suite. The agent starts strong, updates the data model, and modifies the first service file. Then it calls a tool to read the second service file, gets back 200 lines of code, and begins working on it.

By the time it finishes the second service file, the conversation history has grown to 15,000 tokens of tool calls, code diffs, and intermediate reasoning. The agent starts modifying the API layer, then pauses, confused: "Wait, did I already update the PaymentService.processRefund method, or was that just the data model?" It re-reads a file it already modified, makes a redundant change, and silently skips the third service file because it lost track.

I've seen this failure mode repeatedly with agents performing tasks that involve more than five or six sequential tool calls. The context window fills with operational details (file contents, diff outputs, tool responses), and the original task plan gets buried. The agent has no mechanism to step back and ask "where am I in this process?" It's like a surgeon operating without a checklist: each individual step might go fine, but the sequence breaks.

The problem gets worse as the task gets longer. A 3-step task rarely fails this way because the plan fits comfortably in recent context. A 10-step task almost always degrades without explicit tracking. And it fails silently: the agent doesn't announce "I lost track." It just starts making confused decisions and the user only notices when the output is wrong.

The root cause is that LLMs don't have persistent working memory. Everything they "know" comes from the context window. When that window fills with tool outputs and intermediate steps, the high-level plan drowns in low-level details. The agent can see the trees (individual tool calls) but loses the forest (the overall task structure).

Context window ≠ working memory

The context window holds everything the agent has seen, but working memory is a much smaller, structured representation of "what I'm doing right now." Human working memory holds only a handful of items (7±2 in Miller's classic estimate) and is distinct from long-term recall. Agents need the same separation. A 128K-token context window is not working memory. It's a filing cabinet. The agent needs a notepad.

What Is It?

Working memory via todos gives the agent a lightweight, structured todo list tool that it reads and updates at each step of a multi-step task. The agent writes the plan as todo items at the start, marks items in-progress as it works on them, and marks them completed when done. At any point, a TodoRead() call returns the current state of the plan.
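To make this concrete, here is a rough sketch of what the list might look like partway through the payment-module refactor from the opening example. The item ids, titles, and the summarize helper are illustrative assumptions of mine, not output from any real tool. Only the three status values come from the interface described in this article.

```python
# Hypothetical todo list for the payment-module refactor.
# Ids and titles are made up for illustration.
plan = [
    {"id": "1", "title": "Update the data model", "status": "completed"},
    {"id": "2", "title": "Modify service file 1", "status": "completed"},
    {"id": "3", "title": "Modify service file 2", "status": "in-progress"},
    {"id": "4", "title": "Modify service file 3", "status": "not-started"},
    {"id": "5", "title": "Adjust the API layer", "status": "not-started"},
    {"id": "6", "title": "Update tests", "status": "not-started"},
    {"id": "7", "title": "Run the test suite", "status": "not-started"},
]


def summarize(items):
    """Render the list as one short line per item, plus a progress count."""
    marks = {"completed": "[x]", "in-progress": "[>]", "not-started": "[ ]"}
    lines = [f"{marks[i['status']]} {i['id']}. {i['title']}" for i in items]
    done = sum(1 for i in items if i["status"] == "completed")
    lines.append(f"{done}/{len(items)} completed")
    return "\n".join(lines)


print(summarize(plan))
```

The whole state fits in eight short lines, which is the point: the agent can re-read it at every step without wading through the conversation history.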

Think of it like a pilot's preflight checklist. The pilot doesn't rely on memory for the 30-step startup sequence. They have a physical checklist, mark each item as they complete it, and always know exactly where they are in the process. If someone interrupts them, they glance at the checklist and resume from the right spot. The todo list gives the agent the same capability.

The key insight: the todo list is not for the user. It's for the agent. Users can inspect it (which is a nice side benefit for debugging), but its primary purpose is to give the agent an explicit, structured representation of its own progress that doesn't get lost in the noise of conversation history.

How It Works

The tool interface: intentionally minimal

The todo tool is deliberately simple. Two operations, three status values, and a flat list structure. Complexity is the enemy here because the agent needs to use this tool fluently without spending reasoning tokens on the tool's own mechanics.

TodoWrite(items) creates or replaces the entire todo list. Each item has an id, a title (the task description), and a status (not-started, in-progress, or completed). The agent calls this at the start to create the plan, and again whenever it needs to update a status.

TodoRead() returns the current todo list with all items and their statuses. The agent calls this before starting each new step to orient itself. This is the "glance at the checklist" moment.

The constraint that matters: only one item should be in-progress at a time. This prevents the agent from context-switching between tasks (a common failure mode where the agent starts three things and finishes none). Sequential focus produces better results than parallel attempts.
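One way to enforce the single-focus constraint is to check it whenever the list is written. The sketch below is an assumption about how a runtime could do this. The function name validate_single_focus is hypothetical, and the article's own interface does not say whether violations are rejected or merely discouraged through prompting.

```python
def validate_single_focus(items):
    """Raise ValueError if more than one item is marked in-progress.

    `items` is a list of dicts with at least "id" and "status" keys.
    Returns the list unchanged when the constraint holds.
    """
    in_progress = [i["id"] for i in items if i.get("status") == "in-progress"]
    if len(in_progress) > 1:
        raise ValueError(
            f"only one item may be in-progress at a time, got: {in_progress}"
        )
    return items
```

Returning an error from the tool call gives the agent immediate feedback, so it can resubmit a corrected list instead of drifting into parallel work.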

# Simplified todo tool interface
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    """The three allowed states of a todo item."""
    NOT_STARTED = "not-started"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TodoItem:
    id: str
    title: str  # short task description
    status: Status = Status.NOT_STARTED

class TodoTool:
    def __init__(self):
        # Flat, in-memory list; there is no nesting and no persistence.
        self._items: list[TodoItem] = []

    def write(self, items: list[dict]) -> str:
        """TodoWrite: replace the entire list with the given items."""
        # Status(...) raises ValueError for an unknown status string.
        self._items = [
            TodoItem(id=i["id"], title=i["title"],
                     status=Status(i.get("status", "not-started")))
            for i in items
        ]
        return f"Todo list updated: {len(self._items)} items"

    def read(self) -> list[dict]:
        """TodoRead: return every item as a plain dict."""
        return [
            {"id": i.id, "title": i.title, "status": i.status.value}
            for i in self._items
        ]
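For the agent to use these two operations, the runtime has to route the model's tool calls to them. Here is a minimal sketch of that routing. The JSON shape of a tool call (a "name" plus an "arguments" object) is an assumption for illustration, and the real envelope varies by provider. The names dispatch, HANDLERS, and _store are hypothetical as well.

```python
import json

# In-memory storage for one session (assumption: a real runtime
# might keep one list per conversation).
_store = {"items": []}


def todo_write(items):
    """Replace the whole list, mirroring TodoWrite(items)."""
    _store["items"] = [dict(i) for i in items]
    return f"Todo list updated: {len(items)} items"


def todo_read():
    """Return a copy of the current list, mirroring TodoRead()."""
    return [dict(i) for i in _store["items"]]


# Map tool names to handlers that take the call's arguments dict.
HANDLERS = {
    "TodoWrite": lambda args: todo_write(args["items"]),
    "TodoRead": lambda args: todo_read(),
}


def dispatch(tool_call_json):
    """Route one tool call (hypothetical JSON envelope) to its handler."""
    call = json.loads(tool_call_json)
    handler = HANDLERS.get(call["name"])
    if handler is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps({"result": handler(call.get("arguments", {}))})
```

The result string goes back into the conversation as the tool response, so a TodoRead result is what the agent actually sees during its "glance at the checklist" moment.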

The execution loop: read, work, update

The mechanism is a simple loop that the agent follows at each step. The loop structure is the core teaching point because it's what transforms an ad-hoc sequence of tool calls into a disciplined process.

Step 1: Read the todo list. Before doing anything, the agent calls TodoRead() to see the current state. This is the orientation step. The agent sees which items are completed, which one is in-progress, and which are remaining.

Step 2: Identify the next action. Based on the todo list, the agent picks up the in-progress item (if one exists) or starts the next not-started item. This prevents skipping tasks or repeating completed ones.

Step 3: Do the work. The agent performs the actual task: read a file, make an edit, call an API, run a command. This is the standard agent loop. The todo list doesn't change how the agent works. It changes how the agent knows where it is.

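Step 4: Update the todo list. After completing the step, the agent calls TodoWrite(items) with the full list, marking the current item as completed and optionally setting the next item to in-progress. Because TodoWrite replaces the entire list, every item has to be included in the call, not just the one that changed. This records progress explicitly.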

Step 5: Loop back to Step 1. The agent reads the updated list and continues. If all items are completed, the task is done.
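The five steps above can be simulated in a few lines. In a real agent the model itself decides when to call the tools, so treat this as a sketch of the control flow rather than how any particular agent is implemented. TodoStore, pick_next, run_loop, and the do_work callback are all hypothetical names.

```python
class TodoStore:
    """Tiny stand-in for the todo tool: whole-list write, copy-on-read."""

    def __init__(self):
        self._items = []

    def write(self, items):
        self._items = [dict(i) for i in items]

    def read(self):
        return [dict(i) for i in self._items]


def pick_next(items):
    """Step 2: resume the in-progress item, else the first not-started one."""
    for status in ("in-progress", "not-started"):
        for item in items:
            if item["status"] == status:
                return item
    return None  # everything is completed


def set_status(items, item_id, status):
    """Return a full copy of the list with one item's status changed."""
    return [{**i, "status": status} if i["id"] == item_id else i for i in items]


def run_loop(store, do_work, max_steps=100):
    """Run read -> pick -> work -> update until all items are completed."""
    order = []
    for _ in range(max_steps):
        items = store.read()                      # Step 1: read the list
        current = pick_next(items)                # Step 2: identify next action
        if current is None:
            return order                          # all done
        store.write(set_status(items, current["id"], "in-progress"))
        do_work(current)                          # Step 3: do the work
        store.write(set_status(store.read(), current["id"], "completed"))  # Step 4
        order.append(current["id"])               # Step 5: loop back
    raise RuntimeError("step budget exhausted before the list was completed")
```

Note that an interrupted run resumes correctly: if the stored list already has an in-progress item, pick_next returns it first, so nothing is skipped or repeated.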

[Figure: Agent execution loop with todo-based working memory tracking. Stages: TodoRead, Pick Next Task, Execute Step, TodoWrite, All Done?]
