Tokenization

TL;DR

Tokenization splits text into subword units and maps each to an integer ID from a fixed vocabulary. LLMs never see raw text.
BPE (Byte Pair Encoding) is the dominant algorithm: start with bytes, iteratively merge the most frequent adjacent pairs until you hit the target vocabulary size (~100K for GPT-4, ~128K for Llama 3).
1 token is roughly 0.75 English words. Non-Latin scripts like Japanese, Chinese, and Arabic cost 2-4x more tokens per word.
Everything in the LLM world is metered in tokens: pricing, context windows, latency. The fastest way to cut LLM costs is to count your tokens.
Token boundaries explain why models struggle with character-level tasks. The model sees ['str', 'awb', 'erry'], not the letters in "strawberry."

You type "cache invalidation is hard" into ChatGPT. The model needs to process those words. But neural networks only work with numbers, not strings. Somehow, that sentence has to become an array of integers before the transformer touches it.

The naive approach: assign one number per word. "Cache" = 1, "invalidation" = 2, and so on. This breaks immediately. Your vocabulary explodes when you add inflections ("invalidate", "invalidated", "invalidating"), compound words ("database", "timestamp"), and multiple languages. A 500K-word vocabulary means 500K embedding rows, most of which the model rarely sees during training.

The other extreme: one number per character. Now your vocabulary is tiny (a few hundred entries), but sequences become extremely long. Every sentence becomes hundreds of tokens, and the model must learn spelling, morphology, and meaning all from single characters. Training becomes slow and attention cost (which scales quadratically with sequence length) becomes brutal.

Subword tokenization solves both problems. Common words like "the" and "cache" stay as single tokens. Rare words split into meaningful pieces: "tokenization" becomes ['token', 'ization']. The vocabulary stays manageable (32K to 128K entries), sequences stay short, and the model can handle words it has never seen before by composing known pieces.

What is it?

Tokenization is the process of converting raw text into a sequence of integer IDs from a fixed vocabulary. Each integer maps to a "token," which is a subword unit that might be a full word, a word fragment, a punctuation mark, or a special control character.

Think of it like a phrasebook. When you travel to a foreign country, you don't translate letter by letter (too slow) or memorize entire sentences (too many). You learn common phrases and word parts, then combine them. The tokenizer is the phrasebook: it knows thousands of common text fragments and their IDs.

The tokenizer runs once before the model sees any input. It produces a flat array of integers. Decoding (turning model output back into text) reverses the mapping. The vocabulary is frozen at training time and cannot change after deployment.

How it works

BPE: the algorithm that powers modern tokenizers

Byte Pair Encoding (BPE) is the algorithm behind GPT-4's tokenizer (cl100k_base), Llama 3, and most production LLMs. The core idea is simple: start with individual bytes and iteratively merge the most frequent adjacent pairs.

Here is a concrete walkthrough. Suppose your training corpus contains the word "low" 5 times, "lower" 2 times, and "newest" 6 times.

Step 1: Start with characters. The initial vocabulary is every unique byte: l, o, w, e, r, n, s, t. Each word is a sequence of characters.

Step 2: Count adjacent pairs. The pair (e, s) appears 6 times (in "newest"). The pair (l, o) appears 7 times (in "low" and "lower"). Merge (l, o) into a new token lo.

Step 3: Repeat. Now count pairs again with the merged vocabulary. The pair (lo, w) appears 7 times. Merge it into low. Continue until you hit your target vocabulary size.

After thousands of merges, common words like "the," "function," and "return" become single tokens. Rare words still decompose into known pieces. I've found that walking through these three steps in an interview immediately shows you understand the mechanism, not just the name.

Token vocabulary and special tokens

The vocabulary is not just word fragments. It includes special tokens that control model behavior. These are invisible to end users but critical to how the model processes input.

Token	Purpose	Used By
`[BOS]` / `<s>`	Beginning of sequence, signals the start of input	Llama, T5
`[EOS]` / `</s>`	End of sequence, tells the model to stop generating	Most models
`[PAD]`	Padding for batch processing (fills shorter sequences to uniform length)	All models in batched inference
`[CLS]`	Classification token, pooled for sentence-level tasks	BERT, RoBERTa
`[SEP]`	Separator between segments (e.g., question and passage)	BERT
`[MASK]`	Masked token for fill-in-the-blank pretraining	BERT, RoBERTa
`<\|im_start\|>`	Chat turn delimiter (system, user, assistant)	GPT-4, ChatML format
`<\|endoftext\|>`	End of document marker	GPT family

Chat-based models add turn-delimiter tokens that wrap each message in a conversation. When you send a system prompt, a user message, and expect an assistant response, the tokenizer inserts delimiters like <|im_start|>system, <|im_start|>user, and <|im_start|>assistant around each segment. These tokens eat into your context window but are invisible in the API response.

Special tokens in user input

If your application accepts freeform user text, strip or escape control tokens like <|endoftext|> before tokenizing. These can silently truncate the model's view of the input or corrupt the conversation structure. Always sanitize at the boundary.

Tokenization across languages and code

Here is the thing most people miss: tokenization is not language-neutral. Vocabularies built from English-dominant training corpora tokenize English cheaply and everything else expensively. This is not a minor difference.

The sentence "The capital of France is Paris" tokenizes to roughly 7 tokens in GPT-4. The equivalent in Japanese, "フランスの首都はパリです," tokenizes to about 14 tokens. Same semantic content, double the cost. For Arabic and Hindi, the penalty is often 2-3x. For Chinese, each character can become 2-3 tokens.

Code tokenization has its own patterns. Common keywords like function, return, and import are usually single tokens because they appear frequently in training data. But domain-specific identifiers like calculateShippingCost might split into ['calculate', 'Shipping', 'Cost']. Whitespace is often encoded as part of the token (a leading space before "the" is a different token than "the" at the start of a line).

I've seen teams build multilingual products and estimate costs purely from English test conversations. In production, their Japanese and Arabic users drove token consumption 2.5x higher than budget. Always measure with your actual target languages.

The tiktoken library: counting tokens in practice

Before sending any request to an LLM API, count your tokens. OpenAI provides tiktoken for this. Here is a practical example:

# Token counting with tiktoken (OpenAI's tokenizer library)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

english = "The capital of France is Paris"
japanese = "フランスの首都はパリです"

en_tokens = enc.encode(english)
jp_tokens = enc.encode(japanese)

print(f"English: {len(en_tokens)} tokens → {en_tokens}")
print(f"Japanese: {len(jp_tokens)} tokens → {jp_tokens}")
print(f"Ratio: {len(jp_tokens) / len(en_tokens):.1f}x more tokens")

# Decode individual tokens to see what the model "sees"
for t in en_tokens:
    print(f"  Token {t}: '{enc.decode([t])}'")

For Hugging Face models (Llama, Mistral), use the transformers tokenizer instead. The key rule: always count tokens with the tokenizer that matches your model. Mismatching tokenizers gives wrong counts.

Context window budgeting

Every LLM has a fixed context window measured in tokens. GPT-4o supports 128K tokens. Claude 3.5 supports 200K. But "supports 128K tokens" does not mean "use 128K tokens." Latency and cost scale with token count, and attention quality degrades on very long contexts.

My recommendation: budget your context window explicitly. Here is how a typical RAG application breaks down:

I've seen teams blow their entire context window budget on a verbose system prompt that could have been half the length. A 2,000-token system prompt on a 4K context window leaves almost no room for retrieved documents or conversation history. The fix is simple: run your system prompt through the tokenizer, measure it, and compress ruthlessly.

For your interview: say "I'd budget the context window into four buckets: system prompt, conversation history, retrieved context, and output reservation. I'd measure each in tokens and set hard limits." This shows you think about LLM constraints as engineering problems, not magic.

Key variants

Four subword tokenization algorithms dominate the field. BPE is the most common, but understanding the alternatives helps you evaluate model choices.

Algorithm	How It Works	Used By	Best For	Key Tradeoff
BPE (Byte Pair Encoding)	Iteratively merge most frequent byte pairs	GPT-4, Llama 3, Mistral	General-purpose, code-heavy workloads	Large vocabularies needed for multilingual support
WordPiece	Like BPE, but merges based on likelihood gain, not raw frequency	BERT, DistilBERT, Electra	Classification tasks, sentence embeddings	Tied to BERT-era models, less common in generative LLMs
SentencePiece	Treats input as raw byte stream (no pre-tokenization on whitespace)	T5, Llama (original), mBART	Multilingual models, language-agnostic pipelines	Slightly less efficient for English-only workloads
Unigram	Starts with a large vocabulary, prunes tokens that least impact likelihood	XLNet, Albert, SentencePiece (optional mode)	Research models, probabilistic tokenization	Less deterministic; same text can tokenize differently

The practical takeaway: if you are using GPT-4 or Llama 3, you are using BPE. If you are working with BERT-family models for embeddings or classification, you are using WordPiece. Know which one your model uses, because the tokenizer and model must match exactly.

When to use / when to avoid

When to care about tokenization

When estimating LLM API costs. Token count drives pricing directly. Multiply word count by 1.35 for English, by 2.5-4x for non-Latin scripts.
When building a RAG system. Chunk documents at token boundaries, not word or sentence boundaries, to avoid exceeding the context window mid-chunk.
When your product serves multilingual users. Measure token costs per language. Japanese and Arabic conversations can cost 2-3x more than English.
When debugging unexpected model behavior. If the model misspells a word or fails at character-level tasks, check the tokenization. The answer is usually "the model doesn't see what you think it sees."
When designing system prompts. Every token in your system prompt is repeated on every API call. A 2,000-token system prompt at 100K daily requests = 200M tokens/day on prompts alone.

When tokenization is not your problem

When using a hosted API with a default tokenizer. You cannot change the tokenizer for GPT-4 or Claude. Just count and budget.
When building a simple chatbot with short conversations. If your conversations are under 1K tokens total, context window budgeting is overkill.
When the model already handles your language well. If you are building an English-only product on GPT-4, the token efficiency is already optimized.

Real-world examples

OpenAI's billing model. GPT-4o charges $2.50 per million input tokens and $10 per million output tokens. A customer service bot with a 2,000-token system prompt processing 100,000 requests per day spends 200 million tokens per day on system prompts alone. At $2.50/M, that is $500/day just for system prompts. Prompt compression techniques like LLMLingua can cut this by 40-60%.

Multilingual fintech startup. A team I've seen building a Japanese support bot discovered their per-conversation cost was 3x higher than projected. Japanese text tokenizes at 2-3 tokens per character in GPT-4's vocabulary. They switched their embedding and retrieval layer to a Japanese-optimized model (reducing token count for the retrieval step) and compressed their system prompt from 1,800 tokens to 600. Total cost dropped by 55%.

GitHub Copilot's context engineering. The Copilot system prompt includes recent editor context, nearby file contents, and type information. Token-aware context selection (filling the context window with the highest-value tokens rather than just the N nearest lines) is a core engineering investment. Every token matters when your context window is the bottleneck between "helpful suggestion" and "hallucinated garbage."

Anthropic's Claude tokenizer. Claude uses a vocabulary of ~100K tokens optimized for both English and code. Anthropic published that their tokenizer achieves roughly 15% better compression on code compared to GPT-4's tokenizer, which means the same code costs fewer tokens. When comparing model costs across providers, always measure with each provider's tokenizer.

Limitations and tradeoffs

Limitation	Impact	Mitigation
Fixed vocabulary at training time	Cannot adapt to new slang, domain jargon, or emerging terms after training	Fine-tune on domain data (vocabulary stays fixed, but embeddings adapt)
Language imbalance in vocabulary	Non-English text costs 2-4x more tokens per word	Use models with multilingual-optimized tokenizers; compress prompts
Token boundaries obscure character structure	Arithmetic, spelling, and URL parsing become harder for the model	Use tool calls for character-level tasks; don't rely on the model for counting
Prompt injection via token tricks	Crafted strings can exploit tokenizer quirks to bypass filters	Sanitize control tokens; treat tokenization as a security boundary
Vocabulary size vs. embedding table size	Larger vocabulary = more parameters in the embedding layer	Balance: 32K is too small for multilingual, 256K wastes parameters

The fundamental tension: vocabulary size vs. sequence length. A larger vocabulary means more words become single tokens (shorter sequences, faster inference), but the embedding table grows and rare tokens get undertrained. A smaller vocabulary means longer sequences (slower, more expensive) but every token is well-trained. Modern models settle around 100K-128K as the sweet spot.

How this shows up in interviews

When to bring it up

Mention tokenization proactively when the interviewer discusses:

LLM-powered feature design (chatbots, search, copilots)
Cost estimation for AI products
Multilingual or global product requirements
Context window management in RAG systems
Why a model fails at specific tasks (character counting, arithmetic)

What depth is expected

Junior/mid-level: Know that tokenization exists, that 1 token is roughly 0.75 English words, and that context windows are measured in tokens.
Senior: Explain BPE step by step, discuss the multilingual token tax, and budget a context window for a RAG application.
Staff/principal: Discuss vocabulary size tradeoffs, tokenizer-model coupling, the security implications of special tokens, and how tokenization affects model capabilities (character-level reasoning, code generation efficiency).

Interview Q&A

Interviewer Asks	Strong Answer
"What is tokenization and why does it matter?"	"Tokenization maps text to integer IDs using a subword vocabulary. It determines pricing, context limits, and which tasks the model can handle well."
"How does BPE work?"	"Start with individual bytes, count the most frequent adjacent pair across the corpus, merge it into a new token, repeat until you reach vocabulary size."
"Why do non-English languages cost more?"	"Vocabularies are built from English-heavy corpora. Japanese characters split into 2-3 tokens each because they rarely appear as merged pairs."
"How would you budget a context window?"	"Four buckets: system prompt, conversation history, retrieved context, and output reservation. Measure each in tokens with the model's own tokenizer."
"Why do models fail at counting letters?"	"The model sees tokens, not characters. 'Strawberry' splits into subword chunks, so counting 'r' requires reasoning about character content of each token."

Common interview mistakes

Mistake	Why It's Wrong	Say This Instead
"1 token = 1 word"	Tokens are subword units. 1 English token is about 0.75 words. The ratio varies dramatically across languages.	"1 token is roughly 0.75 English words, and non-Latin scripts cost 2-4x more tokens per word."
"GPT-4 can handle 128K tokens, so context length isn't a problem"	Latency scales with token count, attention degrades on long contexts, and cost scales linearly. 128K is a ceiling, not a target.	"128K is the max, but I'd budget explicitly and keep context as short as possible for cost and quality."
"Tokenization is just preprocessing"	Tokenization determines what the model can and cannot do. Token boundaries affect spelling, arithmetic, and multilingual performance.	"Tokenization shapes model capabilities. Token boundaries explain why models struggle with character-level tasks."
Forgetting to mention special tokens	Special tokens like `[BOS]`, `[EOS]`, and chat delimiters consume context window space and can be exploited via prompt injection.	"I'd also account for special tokens and chat delimiters in my context budget, and sanitize user input for control tokens."
Assuming all tokenizers are the same	Different models use different tokenizers. The same text produces different token counts across providers.	"I'd measure token count with each provider's specific tokenizer before comparing costs."

Test your understanding

Quick recap

Tokenization converts raw text into integer IDs from a fixed subword vocabulary, and it is the first step in every LLM inference.
BPE builds the vocabulary by iteratively merging the most frequent adjacent byte pairs until reaching the target size (100K-128K for modern models).
1 English token is roughly 0.75 words. Non-Latin scripts cost 2-4x more tokens per word due to English-heavy training corpora.
Special tokens ([BOS], [EOS], chat delimiters) consume context window space and must be sanitized in user input to prevent injection.
Context window budgeting (system prompt + history + retrieved context + output reservation) is an engineering discipline, not an afterthought.
Token boundaries explain character-level failures: the model sees subword chunks, not individual letters, which is why "how many r's in strawberry" was hard for older models.
Always measure token counts with the model's own tokenizer before estimating costs. Different providers produce different counts for the same text.

Large language models - Tokenization is the first step in every LLM pipeline. Understanding it makes LLM architecture and behavior much clearer.
Embeddings - After tokenization, each token is mapped to an embedding vector. Token quality directly affects embedding quality.
Context engineering - Context window budgeting depends entirely on token counts. Every technique in context engineering is constrained by tokenization.
Transformer architecture - The transformer processes token sequences. Attention cost scales quadratically with sequence length, which tokenization directly controls.