PII tokenization
Replace PII in user input with reversible tokens before sending to the LLM, then de-tokenize the response, keeping real data off cloud model APIs.
TL;DR
- Before the LLM call, scan input for PII (names, emails, phone numbers, SSNs) and replace each with a token (PERSON_1, EMAIL_2, PHONE_3).
- A session-scoped token map holds the real values; after the LLM responds, de-tokenize to restore them in the output.
- The LLM sees only placeholders for the PII the scanner catches, which makes it feasible to route GDPR/HIPAA-sensitive workflows through cloud LLM APIs.
- Token assignment must be deterministic within a session: the same email always maps to the same token, so multi-turn reasoning stays coherent.
- Limitation: PII the scanner misses still reaches the model. Scanner recall (how much of the PII it actually catches) determines your actual privacy posture.
The Problem It Solves
Your AI-powered customer support agent needs to process a refund. The user says: "Hi, I'm John Smith, my email is john.smith@acme.com and I'd like a refund on order #12345." Your agent sends this entire message, including the real name and real email, to GPT-4o or Claude to generate a response.
That raw PII now sits in a third-party cloud provider's infrastructure. It's in their logs, potentially in their training pipeline (depending on your data processing agreement), and definitely outside your compliance boundary. If you're in healthcare, finance, or any GDPR-regulated market, this is a violation waiting to happen.
"Just anonymize the data" is the obvious answer. But anonymization destroys the information the agent needs to do its job. You can't send a refund email if you've permanently erased the email address. PII tokenization solves this by replacing real values with reversible placeholders that the LLM can reason about, without ever seeing the actual data.
What Is It?
PII tokenization is an interception pattern that replaces personally identifiable information with reversible placeholder tokens before LLM processing, then restores the original values in the model's output before acting on it. Think of it as a translation layer: real names and emails go in, coded aliases come out, the LLM works with the aliases, and on the way back out, the aliases get swapped back to real values.
The analogy is a courtroom sketch artist. The artist draws the scene accurately, capturing expressions, gestures, and positions, but replaces real faces with stylized representations. The sketch conveys all the information needed for the story without revealing actual identities. PII tokenization does the same thing for text: it preserves the semantic structure while masking the identifying details.
How It Works
The three-phase pipeline
Every PII tokenization system follows the same core pipeline: scan, replace, restore. The scanner identifies PII in the input text. The tokenizer replaces each PII instance with a deterministic placeholder. After LLM processing, the de-tokenizer swaps tokens back to real values before tool execution or user-facing output.
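The scan, replace, restore flow can be sketched in a few functions. This is a minimal illustration, not a production design: the email-only regex, the function names (`scan`, `tokenize`, `detokenize`), and the hard-coded model reply are all assumptions made for the example.

```python
import re

# Assumption: a deliberately simple email pattern stands in for a real scanner.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan(text):
    """Phase 1: return (start, end, entity_type) spans for detected PII."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def tokenize(text, spans, token_map):
    """Phase 2: replace each span with a placeholder, reusing known tokens."""
    out, cursor = [], 0
    for start, end, kind in sorted(spans):
        value = text[start:end]
        if value not in token_map:
            # One shared counter across all entity types keeps the sketch short.
            token_map[value] = f"{kind}_{len(token_map) + 1}"
        out.append(text[cursor:start])
        out.append(token_map[value])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def detokenize(text, token_map):
    """Phase 3: swap placeholders back to the real values."""
    if not token_map:
        return text
    reverse = {token: value for value, token in token_map.items()}
    # Longest tokens first so EMAIL_10 is not partially matched as EMAIL_1.
    alternatives = sorted((re.escape(t) for t in reverse), key=len, reverse=True)
    pattern = re.compile("|".join(alternatives))
    return pattern.sub(lambda m: reverse[m.group(0)], text)


token_map = {}
user_input = "My email is john.smith@acme.com, please refund order 12345."
safe_prompt = tokenize(user_input, scan(user_input), token_map)
print(safe_prompt)  # My email is EMAIL_1, please refund order 12345.

# Pretend model output; a real call would send safe_prompt to the LLM API.
llm_reply = "I'll send the refund confirmation to EMAIL_1."
print(detokenize(llm_reply, token_map))
# I'll send the refund confirmation to john.smith@acme.com.
```

Only `safe_prompt` would leave your infrastructure; `token_map` stays local and is the only place the real address lives.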
I've seen teams try to skip the scanning phase and just regex-replace anything that looks like an email or phone number. That works for about a week before you hit names, addresses, and context-dependent PII that simple patterns can't catch.
Scanner types: regex, NER, and hybrid
The scanner is the most critical component. It determines what PII gets caught and what slips through. Three approaches exist, each with distinct accuracy and latency profiles.
Regex-based scanning uses pattern matching for structured PII: email addresses, phone numbers, SSNs, credit card numbers, IP addresses. Fast (under 5ms for typical messages), deterministic, and easy to test. The weakness is obvious: regex can't detect names, addresses, or context-dependent PII like "my birthday is March 15th."
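As a rough sketch of what a regex scanner looks like, and where it stops working, the snippet below uses simplified US-centric patterns. The patterns and the `regex_scan` name are assumptions for illustration; real deployments need broader formats plus validation such as checksum tests for card numbers.

```python
import re

# Assumption: simplified patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def regex_scan(text):
    """Return (start, end, entity_type, matched_text) tuples sorted by position."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append((m.start(), m.end(), kind, m.group(0)))
    return sorted(findings)


sample = (
    "Reach me at jane@example.org or 555-867-5309. "
    "SSN 123-45-6789. My birthday is March 15th."
)
findings = regex_scan(sample)
for start, end, kind, value in findings:
    print(kind, value)
```

The email, phone number, and SSN are caught, but "March 15th" is not: nothing about the string is structurally distinctive, which is the gap NER is meant to cover.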
NER-based scanning uses a named entity recognition model (spaCy, Presidio's ML backend, AWS Comprehend, Google DLP) to identify entities like PERSON, LOCATION, ORGANIZATION. It handles unstructured PII that regex misses. The cost is latency (50-200ms per call) and the possibility of false positives, like flagging "Amazon" as a LOCATION when it's a company name in your context.
Hybrid scanning runs regex first for structured patterns, then NER for everything else. This is what Microsoft Presidio does internally, and it's the right default for production systems. Regex handles the easy cases instantly; NER catches the semantic entities regex can't see.
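One way to combine the two passes is to let regex spans win and drop any NER span that overlaps them. The sketch below is an assumption about how such a merge could look, not Presidio's actual implementation; `toy_ner` is a hypothetical stand-in for a real model and simply looks up a fixed list of strings.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def regex_pass(text: str) -> List[Span]:
    """Structured PII: cheap, deterministic pattern matches."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def toy_ner(text: str) -> List[Span]:
    """Hypothetical stand-in for an NER model; a real one needs no lookup list."""
    known = [("John Smith", "PERSON"), ("acme", "ORGANIZATION")]
    spans = []
    for phrase, kind in known:
        for m in re.finditer(re.escape(phrase), text):
            spans.append((m.start(), m.end(), kind))
    return spans


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open [start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def hybrid_scan(text: str, ner: Callable[[str], List[Span]] = toy_ner) -> List[Span]:
    """Run regex first, then keep only NER spans that do not overlap a regex hit."""
    structured = regex_pass(text)
    semantic = [s for s in ner(text) if not any(overlaps(s, r) for r in structured)]
    return sorted(structured + semantic)


text = "Hi, I'm John Smith, my email is john.smith@acme.com"
result = [(text[s:e], kind) for s, e, kind in hybrid_scan(text)]
print(result)
# [('John Smith', 'PERSON'), ('john.smith@acme.com', 'EMAIL')]
```

The stand-in flags "acme" inside the email address, but that span overlaps the regex match and is discarded, so the address is replaced once as a single EMAIL rather than being split into pieces.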
Session-scoped token maps
Token assignment must be consistent within a conversation. If the user mentions "john@acme.com" in message 1 and refers to "that email" in message 3, the LLM needs to know both references point to EMAIL_1. If every message generates a fresh token map, the LLM loses track of entity relationships across turns.
The token map is a bidirectional dictionary: real_value → token for tokenization, token → real_value for de-tokenization. It's scoped to a session (typically a conversation or request chain) and destroyed when the session ends.
I've debugged a system where tokens were request-scoped instead of session-scoped. The agent kept "forgetting" which customer it was helping mid-conversation because EMAIL_1 in turn 1 and EMAIL_1 in turn 3 referred to different addresses. Switching to session-scoped maps fixed it immediately.
Token format conventions: