Stateful agents with LangGraph
Learn how LangGraph models agent state as a typed graph, how conditional edges enable complex branching workflows, and how persistent checkpointing lets agents survive crashes and support human approval gates.
TL;DR
- Stateless chains lose all intermediate work on crash. LangGraph persists state as a typed object after every node, enabling resume-from-checkpoint on any failure.
- State is a Python TypedDict that flows through the graph. Every node reads the full state and returns only the fields it updates. The graph merges updates using reducer functions.
- Conditional edges are Python functions routing to different nodes based on the current state. This replaces spaghetti if-else logic with explicit, testable routing functions.
- Attach SQLite, PostgreSQL, or Redis checkpointers to persist state after each node. Resume a crashed run by passing the same thread_id.
- interrupt_before pauses the graph before a specified node and freezes state in the checkpointer until a human resumes it. This is the production pattern for approval gates on irreversible actions.
- LangGraph deployments at Klarna, LinkedIn, and Uber handle customer service, research automation, and code review workflows in production.
The problem it solves
You build a 12-step research pipeline: fetch documents, extract entities, cluster themes, generate summaries, cross-reference sources, and produce a final report. The pipeline runs correctly on the first 10 jobs. On job 11, a network timeout kills the process at step 7. All 6 completed steps are gone, and the job restarts from the beginning.
With a stateless chain, there is no other option. State exists only in memory. When the process dies, the state dies with it. Every restart costs the full run time and compute of the failed job.
I have watched teams burn entire sprints rewriting pipeline state management by hand: serializing intermediate results to JSON files, loading them on restart, writing corrupt-state recovery logic. It works, but it is brittle infrastructure that the team maintains forever.
LangGraph makes state a first-class concept. Every node reads from and writes to a typed state object. The checkpointer persists that state to a database after every node completes. When the pipeline crashes at step 7, you restart with the same thread_id. LangGraph loads the last checkpoint, written after step 6, and resumes at step 7 without redoing the six completed steps.
What is it?
LangGraph is a Python library for building stateful, graph-structured agent workflows. You define a graph where nodes are Python functions and edges connect them. A TypedDict state object flows through the graph, read and updated by each node. A checkpointer persists the state after every node to a database.
Think of a relay race. Each runner (node) receives the baton (state), runs their leg, and hands off. The baton accumulates context from every leg. If a runner stumbles (node fails), the race does not restart from the starting line. You resume from the last successful handoff. LangGraph is the relay race coordinator.
How it works
State as a TypedDict
Every LangGraph graph defines a State class. It is a TypedDict (or Pydantic model) that holds everything the workflow needs to track: inputs, intermediate outputs, control flags, accumulated results, and the message history.
# LangGraph state definition for a document research pipeline
from typing import TypedDict, Annotated

from langgraph.graph.message import add_messages


class ResearchState(TypedDict):
    # Input
    query: str
    # Accumulated retrieval results
    retrieved_docs: list[str]
    # Generated output
    draft_answer: str
    # Control flags
    needs_verification: bool
    verified: bool
    # Final output
    final_answer: str
    # Message history (add_messages reducer appends, not replaces)
    messages: Annotated[list, add_messages]
The Annotated[list, add_messages] annotation is a reducer. When a node returns {"messages": [new_message]}, LangGraph calls add_messages(existing_list, [new_message]) rather than replacing the list. Reducers are how you accumulate data across nodes without managing merge logic manually.
Every node receives the full current state as a Python dict and returns only the fields it changes. LangGraph merges the returned dict into the state automatically.
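To make the merge step concrete, here is a minimal, dependency-free sketch of how a partial update could be folded into the state. It is not LangGraph's implementation. DemoState and apply_update are hypothetical names introduced for illustration, and operator.add stands in for add_messages as the reducer. Fields without a reducer are overwritten, while fields with one are combined with the existing value.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class DemoState(TypedDict):
    query: str                                      # no reducer: overwritten
    retrieved_docs: Annotated[list, operator.add]   # reducer: concatenated


def apply_update(schema: type, state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state, field by field."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] keeps its extra arguments in __metadata__
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata and callable(metadata[0]) else None
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)
        else:
            merged[key] = value
    return merged


state = {"query": "what is a checkpointer?", "retrieved_docs": ["doc A"]}
state = apply_update(DemoState, state, {"retrieved_docs": ["doc B"]})
state = apply_update(DemoState, state, {"query": "rephrased query"})
print(state)
```

The second call replaces query outright, while the first appends to retrieved_docs. That difference is exactly what the Annotated reducer controls.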
Node functions
A node is a Python function that takes the current state and returns a dict of updates. Nodes can call LLMs, execute tools, run database queries, send API requests, or do pure Python computation. Any function that reads state and returns a partial update is a valid node.
# LangGraph node functions
# Assumes vector_store, llm, and verifier_llm are initialized elsewhere.

def retrieve_docs(state: ResearchState) -> dict:
    """Retrieve relevant documents from vector store."""
    docs = vector_store.similarity_search(state["query"], k=5)
    return {"retrieved_docs": [doc.page_content for doc in docs]}


def generate_answer(state: ResearchState) -> dict:
    """Generate a draft answer using retrieved context."""
    context = "\n\n".join(state["retrieved_docs"])
    response = llm.invoke(
        f"Context:\n{context}\n\nQuery: {state['query']}\n\nAnswer:"
    )
    return {
        "draft_answer": response.content,
        # Flag long answers (more than 200 words) for verification
        "needs_verification": len(response.content.split()) > 200,
    }


def verify_answer(state: ResearchState) -> dict:
    """Cross-check the draft answer against source documents."""
    verdict = verifier_llm.invoke(
        f"Claim: {state['draft_answer']}\nSources: {state['retrieved_docs']}\n"
        "Is this claim fully supported by the sources? Answer yes or no."
    )
    # Parse the verdict once and reuse it for both fields
    is_supported = verdict.content.strip().lower().startswith("yes")
    return {
        "verified": is_supported,
        "final_answer": state["draft_answer"] if is_supported else "",
    }
Nodes should do one thing. Combining retrieval and generation into a single node makes the graph harder to test, harder to checkpoint at the right granularity, and impossible to branch conditionally between those two operations.
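One practical payoff of single-purpose nodes is that each can be tested in isolation with a stub. The sketch below exercises the retrieve_docs node from above against a fake vector store. FakeDoc and FakeVectorStore are hypothetical stand-ins invented for this example, not part of LangGraph or any vector store library.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    page_content: str


class FakeVectorStore:
    """Hypothetical stub that mimics the similarity_search call the node uses."""

    def similarity_search(self, query: str, k: int = 5) -> list:
        return [FakeDoc(f"{query} result {i}") for i in range(k)]


vector_store = FakeVectorStore()


def retrieve_docs(state: dict) -> dict:
    """Same node as in the article, reading the module-level vector_store."""
    docs = vector_store.similarity_search(state["query"], k=5)
    return {"retrieved_docs": [doc.page_content for doc in docs]}


update = retrieve_docs({"query": "langgraph"})

# The node returns only the field it owns, as a partial update
assert list(update) == ["retrieved_docs"]
assert len(update["retrieved_docs"]) == 5
```

Because the node touches a single field, the test needs no LLM, no graph, and no checkpointer.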
Conditional edges
Conditional edges route the graph to different nodes based on the current state. The routing function is a plain Python function: it receives the state and returns the name of the next node (or END to terminate the graph).
from langgraph.graph import StateGraph, END

workflow = StateGraph(ResearchState)

# Register nodes
workflow.add_node("retrieve_docs", retrieve_docs)
workflow.add_node("generate_answer", generate_answer)
workflow.add_node("verify_answer", verify_answer)

# Set entry point
workflow.set_entry_point("retrieve_docs")

# Unconditional edge: retrieve always leads to generation
workflow.add_edge("retrieve_docs", "generate_answer")

# Conditional edge: generation routes based on needs_verification flag
def route_after_generation(state: ResearchState) -> str:
    if state["needs_verification"]:
        return "verify_answer"
    return END

workflow.add_conditional_edges("generate_answer", route_after_generation)
workflow.add_edge("verify_answer", END)
The routing function is pure Python operating on state. It has no side effects, is easy to unit test, and makes the branching logic explicit and readable.
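As a rough illustration of that testability, the routing function can be exercised with plain dicts and no graph at all. To keep this sketch dependency-free, END is defined locally as the string "__end__", which is an assumption about the value langgraph.graph.END holds. In a real test suite you would import the constant instead.

```python
END = "__end__"  # local stand-in for langgraph.graph.END (assumed value)


def route_after_generation(state: dict) -> str:
    """Same routing logic as above: verify flagged answers, otherwise finish."""
    if state["needs_verification"]:
        return "verify_answer"
    return END


def test_flagged_answer_goes_to_verification():
    assert route_after_generation({"needs_verification": True}) == "verify_answer"


def test_unflagged_answer_ends_the_graph():
    assert route_after_generation({"needs_verification": False}) == END


# Run directly here; a test runner such as pytest would discover these by name
test_flagged_answer_goes_to_verification()
test_unflagged_answer_ends_the_graph()
```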
Persistence and checkpointing
To persist state across crashes and sessions, attach a checkpointer when compiling the graph. Checkpointer implementations are available for SQLite (development), PostgreSQL (multi-instance production), and Redis (high-throughput, short-lived workflows), each installed as a separate package alongside the core library.
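To show what a checkpointer does mechanically, here is a simplified simulation built only on the standard library's sqlite3 module. It is not LangGraph's checkpointer API. run_pipeline, the checkpoints table, and the two toy steps are hypothetical names made up for this sketch. The idea is the same, though: save state under a thread_id after each step, and on restart pick up at the first step that has not completed.

```python
import json
import sqlite3


def run_pipeline(conn, thread_id, steps, initial_state):
    """Run steps in order, checkpointing after each; resume if a checkpoint exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
    )
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?",
        (thread_id,),
    ).fetchone()
    if row:
        next_step, state = row[0], json.loads(row[1])
    else:
        next_step, state = 0, dict(initial_state)

    for index in range(next_step, len(steps)):
        state = {**state, **steps[index](state)}  # merge the partial update
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, index + 1, json.dumps(state)),
        )
        conn.commit()
    return state


calls = []
network = {"up": False}


def fetch(state):
    calls.append("fetch")
    return {"docs": ["a", "b"]}


def summarize(state):
    calls.append("summarize")
    if not network["up"]:
        raise TimeoutError("simulated network timeout")
    return {"summary": f"{len(state['docs'])} docs summarized"}


conn = sqlite3.connect(":memory:")
steps = [fetch, summarize]

try:
    run_pipeline(conn, "job-11", steps, {"query": "q"})
except TimeoutError:
    pass  # the run dies during summarize; fetch is already checkpointed

network["up"] = True
result = run_pipeline(conn, "job-11", steps, {"query": "q"})  # same thread_id
print(calls)
print(result)
```

On the second call fetch is not executed again. Only the failed step reruns, starting from the state saved after the last successful one.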