Multimodal models
Learn how multimodal models process images, audio, and video alongside text, what CLIP-based architectures look like, and how to use vision LLMs effectively in production systems.
TL;DR
- Multimodal models accept multiple input types (image, audio, video) alongside text. The most impactful category today is vision-language models that combine images with text.
- The standard architecture uses a vision encoder (ViT or SigLIP) to convert an image into patch embeddings, a projection layer to map those embeddings into the LLM's token space, then a standard transformer decoder for generation.
- CLIP (OpenAI, 2021) established contrastive learning as the go-to technique for aligning image and text representations in a shared embedding space.
- Images are expensive: a single image costs roughly 85 to 1,600 tokens depending on the API and resolution setting, so token budgets must account for this.
- GPT-4o, Claude 3.7 Sonnet, and Gemini 2.0 are the production leaders, each with different strengths in OCR, spatial reasoning, and long-context multimodal understanding.
- Multimodal models hallucinate on fine-grained visual details: object counting, spatial relationships, and small text OCR are all weaker than they appear in demos.
The problem it solves
Most of the world's information isn't text. Enterprise documents are PDFs full of charts and tables. Product catalogs are images. Medical records include scans. Accessibility tooling needs to describe images to screen readers. A model that only processes text can't work with any of these directly.
Before multimodal models, the standard approach was a brittle pipeline: run a specialized vision model (object detection, OCR) to produce text, then pass that text to an LLM. Information degrades at every handoff. The OCR misreads a number in a chart. The object detector misses context. The LLM confidently reasons from corrupted input.
I've watched teams build these serial pipelines and spend more time debugging the handoffs between models than building the actual product. The vision model outputs "revenue: $2.3M" when the chart actually shows $23M, and the downstream LLM generates a confident but wrong financial summary.
The fundamental improvement: multimodal models see the image directly. No lossy intermediate extraction. The model reads the chart, recognizes the axes, and extracts the number from pixels. One model, one step, fewer failure modes.
What is it?
A multimodal model is a neural network that can process and reason across multiple data types (modalities) within a single architecture. The most common combination today is vision plus language: the model accepts images and text as input and generates text as output.
Think of it like a bilingual person who can read both English and Japanese. A text-only model is monolingual. It can only process information in one "language" (text tokens). A multimodal model is bilingual: it can read images (visual tokens) and text (language tokens) and reason about both together, translating freely between them.
The key insight is that images and text can be represented in the same mathematical space. Once you convert an image into a sequence of vectors that live alongside text token embeddings, the transformer's attention mechanism handles the rest. It attends to both visual and textual tokens, learning which parts of the image are relevant to the text query.
How it works
Vision encoders: turning images into tokens
A vision encoder converts an image into a sequence of embedding vectors, analogous to how a tokenizer and embedding layer convert text into token embeddings. The dominant architecture is the Vision Transformer (ViT), introduced by Dosovitskiy et al. in 2020.
ViT works by splitting the image into fixed-size patches (typically 14x14 or 16x16 pixels). Each patch is flattened and linearly projected into an embedding vector. A 224x224 image with 14x14 patches produces a 16x16 grid, or 256 patch tokens. A 512x512 image with 16x16 patches produces a 32x32 grid, or 1,024 patch tokens.
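The patch arithmetic is easy to check by hand. The sketch below is a minimal illustration, not any library's API: `patch_token_count` is a hypothetical helper, and it assumes the image has already been resized so that both sides divide evenly by the patch size. It also ignores the extra class token that some ViT variants prepend.

```python
def patch_token_count(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder produces for one image."""
    if height % patch_size or width % patch_size:
        # Assumption: real pipelines resize or pad first; this sketch just refuses.
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


print(patch_token_count(224, 224, 14))  # 16 x 16 grid
print(patch_token_count(336, 336, 14))  # 24 x 24 grid
print(patch_token_count(512, 512, 16))  # 32 x 32 grid
```

Token count grows with the square of the image side, which is why resolution settings have such a large effect on cost.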
SigLIP (Google, 2023) is the newer alternative used in Gemini and PaliGemma. It replaces CLIP's softmax-based contrastive loss with a sigmoid loss that scales better to large batch sizes and doesn't require the global normalization that limits CLIP's training efficiency.
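To make the sigmoid idea concrete, here is a minimal pure-Python sketch of a pairwise sigmoid loss in the style of SigLIP. Every image-text pair in the batch is scored as an independent binary decision (match or no match), so no softmax over the whole batch is needed. The function names and toy inputs are illustrative assumptions. The default temperature of 10 and bias of -10 follow the initial values described in the SigLIP paper, where both are learned during training.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of per-pair binary losses, averaged over the batch size."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            # -log(sigmoid(label * logit)) is the same as softplus(-label * logit)
            total += _softplus(-label * logit)
    return total / len(images)
```

Because each pair contributes its own term, the loss can be computed in chunks across devices without gathering a full similarity matrix for normalization.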
I've found that the choice of vision encoder matters less than most teams think. ViT-L/14 and SigLIP-SO400M produce similar quality for most production tasks. The projection layer and LLM backbone matter more.
The projection layer: bridging two worlds
The vision encoder produces embeddings in its own dimensional space (e.g., 1,024 dimensions for ViT-L). The LLM expects embeddings in its space (e.g., 4,096 dimensions for Llama). The projection layer bridges this gap.
The simplest approach is a linear projection: a learned matrix that maps from vision dimensions to LLM dimensions. The original LLaVA (Liu et al., 2023) used exactly this, and its follow-up LLaVA-1.5 showed that a simple two-layer MLP works remarkably well as a projection layer. More complex cross-attention projectors (like Flamingo's Perceiver Resampler) can compress the visual token count but add architectural complexity.
# Simplified projection layer (LLaVA-style)
import torch
import torch.nn as nn


class VisionProjection(nn.Module):
    """Two-layer MLP mapping vision-encoder embeddings into the LLM's embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_embeddings: torch.Tensor) -> torch.Tensor:
        # vision_embeddings: [batch, num_patches, vision_dim]
        # output: [batch, num_patches, llm_dim]
        return self.proj(vision_embeddings)
After projection, visual tokens are concatenated with text tokens to form a single sequence. The LLM processes this unified sequence with standard self-attention. No separate cross-attention module is needed in the simplest architectures.
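As a rough illustration of that concatenation step, the sketch below splices visual tokens into a text sequence at a placeholder position. The `IMAGE_PLACEHOLDER` marker and the `splice_visual_tokens` helper are hypothetical names for this example, and plain Python strings stand in for the embedding vectors a real model would use.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where the image belongs


def splice_visual_tokens(text_tokens, visual_tokens):
    """Replace each image placeholder with the full run of visual tokens."""
    merged = []
    for token in text_tokens:
        if token == IMAGE_PLACEHOLDER:
            merged.extend(visual_tokens)
        else:
            merged.append(token)
    return merged


prompt = ["Describe", IMAGE_PLACEHOLDER, "in", "one", "sentence"]
patches = [f"patch_{i}" for i in range(256)]  # stand-ins for projected patch embeddings
sequence = splice_visual_tokens(prompt, patches)
print(len(sequence))  # 4 text tokens plus 256 visual tokens
```

The merged sequence is what the decoder attends over, so every visual token counts against the context window just like a text token.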
CLIP and contrastive learning
CLIP (Contrastive Language-Image Pre-training, OpenAI, 2021) is the architecture that made modern vision-language models possible. It trains an image encoder and a text encoder simultaneously so that matching image-text pairs produce similar embeddings.
The training process uses contrastive learning on 400M image-text pairs scraped from the internet. For each batch, CLIP maximizes the cosine similarity between matching pairs (image of a dog, text "a photo of a dog") and minimizes it between non-matching pairs (image of a dog, text "a photo of a car").
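The objective described above can be sketched as a symmetric cross-entropy over a similarity matrix. This is a minimal pure-Python illustration under stated assumptions, not CLIP's actual training code: the function names are made up for the example, the embeddings are toy lists, and the temperature is fixed at 0.07 (CLIP's reported initial value) rather than learned.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy_diagonal(logits):
    """Mean of -log softmax(row)[i], treating entry i of row i as the correct class."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss for one batch."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, sharpened by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))
```

Minimizing this loss pushes the diagonal of the similarity matrix (matching pairs) up and the off-diagonal entries (non-matching pairs) down, in both the image-to-text and text-to-image directions.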
After training, CLIP's shared embedding space enables zero-shot classification: encode an image, encode candidate text labels, pick the label with the highest cosine similarity. No task-specific training is needed. This is the foundation for image search, content moderation, and the CLIP-style vision encoders used inside LLaVA and other vision LLMs.
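The zero-shot recipe reduces to a nearest-neighbor lookup over label embeddings. The sketch below assumes the embeddings have already been produced by an image encoder and a text encoder. The three-dimensional vectors are invented toy values, and `zero_shot_classify` is a hypothetical helper, not a CLIP library function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the best-matching label and the similarity score for every label."""
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy vectors standing in for real encoder outputs (assumed values).
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a car": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

Adding a new class only requires encoding one more text label, which is what makes this approach attractive for search and moderation workloads where the label set changes often.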
Fusion strategies: early, late, and cross-attention
How and when visual and textual information combine defines the model's architecture family.