Mixture of experts

TL;DR

Mixture of Experts (MoE) replaces the dense feed-forward network in each transformer block with N parallel expert networks and a lightweight router that selects the top-K experts per token.
Typically only 2 of 8 (or 2 of 16) experts activate per token, so a 47B-parameter model like Mixtral 8x7B activates only ~13B parameters per token. Its per-token compute is roughly that of a 13B dense model, while its knowledge capacity is that of a much larger one.
The router is a learned linear layer that produces a probability distribution over experts, from which the top-K are selected. An auxiliary load-balancing loss prevents "expert collapse", where all tokens route to the same expert.
MoE trades memory (all experts must be in VRAM) for throughput (fewer FLOPs per token). Serving requires more memory but fewer compute resources.
GPT-4 is widely reported to use a MoE architecture. Mixtral 8x7B, DeepSeek-V3, and Grok-1 are confirmed open MoE models.

The problem it solves

GPT-3 has 175 billion parameters. Scaling a dense model like it 10x, to 1.75 trillion parameters, in pursuit of more capability would also require roughly 10x more FLOPs per token, 10x more memory, and roughly 10x longer inference time on the same hardware. The cost per request becomes prohibitive and real-time serving becomes impractical.

The deeper problem is that dense models are indiscriminate. Every single token, regardless of complexity, passes through the full model. The query "what is 2+2?" activates the same 175B parameters as a query requiring deep multi-step reasoning about quantum mechanics. There is no mechanism to route simple queries to a lightweight path and complex ones to a heavier path.

This is the problem of conditional computation: can a model have more total capacity (parameters) without paying the full cost on every single input? The answer is Mixture of Experts.

What is it?

Mixture of Experts is a neural network architecture where each transformer block replaces its single dense Feed-Forward Network (FFN) with N expert FFNs and a lightweight router that sends each token to only the top-K of those experts.

Think of a hospital with 47 specialists. When a patient arrives, a triage nurse examines their symptoms and routes them to the 2 most relevant specialists, say a cardiologist and a neurologist. The patient does not see all 47 doctors. The hospital has more total expertise than any single generalist could hold, but each visit costs only 2 appointments instead of 47. The triage nurse is the router, each specialist is an expert, and the total medical staff is the parameter count. The cost of the visit is the compute per token.

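In Mixtral 8x7B, the "8x7B" means 8 experts per layer, each sized like a 7B model's FFN layer. Total parameters: roughly 47B rather than 8 x 7B = 56B, because only the FFNs are replicated; the attention layers, embeddings, and layer norms are shared. Active parameters per token: roughly 13B, a figure that already includes the 2 selected experts per layer and the non-MoE layers that are always active. The speed of a 13B model. The knowledge of a 47B model.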
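The gap between total and active parameters can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the hyperparameters are assumptions taken from Mixtral's publicly released configuration (they do not appear in this article), the function name `count_params` is hypothetical, and layer norms are ignored because they contribute well under a million parameters.

```python
# Assumed Mixtral 8x7B hyperparameters (from its public config, not from this article).
D_MODEL = 4096
D_FF = 14336
N_LAYERS = 32
N_EXPERTS = 8
TOP_K = 2
VOCAB = 32000
N_HEADS = 32
N_KV_HEADS = 8   # grouped-query attention: fewer key/value heads than query heads
HEAD_DIM = 128


def count_params(experts_counted: int) -> int:
    """Parameter count when `experts_counted` experts per layer are included."""
    expert = 3 * D_MODEL * D_FF  # gated FFN: gate, up, and down projections
    attention = (
        2 * D_MODEL * N_HEADS * HEAD_DIM       # query and output projections
        + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM  # key and value projections
    )
    router = D_MODEL * N_EXPERTS
    per_layer = experts_counted * expert + attention + router
    embeddings = 2 * VOCAB * D_MODEL  # input embedding plus output head
    return N_LAYERS * per_layer + embeddings  # layer norms omitted (negligible)


total = count_params(N_EXPERTS)  # every expert must be held in memory
active = count_params(TOP_K)     # only the selected experts run per token
print(f"total  = {total / 1e9:.1f}B")   # total  = 46.7B
print(f"active = {active / 1e9:.1f}B")  # active = 12.9B
```

Under these assumptions the experts account for about 45B of the roughly 47B total, which is why activating 2 of 8 cuts per-token compute so sharply while leaving the memory footprint unchanged.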

How it works

The MoE layer

In a standard transformer block, the FFN takes a hidden state h (say, 4096-dimensional), projects it up to d_ff = 16384 dimensions, applies a non-linearity (ReLU or SiLU), and projects back to 4096. This is the "expansion and compression" step, roughly where factual knowledge gets stored.

In an MoE block, this single FFN is replaced with N expert FFNs plus one router. Each expert has the same architecture as the original FFN but its own independently learned weights. Only K experts run per token. The final output is the weighted sum of the selected experts' outputs: output = sum(expert_i(h) * gate_weight_i) for the K selected experts. The gate_weight_i values come from the router and sum to 1.
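The weighted sum can be made concrete with a toy layer. This is a minimal sketch, not a production implementation: the dimensions are tiny, the weights are random, ReLU stands in for whichever non-linearity a real model uses, and the selected experts and gate weights are passed in by hand rather than computed by a router. The names `make_expert` and `moe_output` are hypothetical.

```python
import random

random.seed(0)
D_MODEL, D_FF, N_EXPERTS = 4, 8, 4  # toy sizes, far smaller than a real model


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Row vector times matrix; returns a list with len(mat[0]) entries."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def make_expert():
    """One expert FFN: up-projection, ReLU, down-projection, with its own weights."""
    w_up, w_down = rand_matrix(D_MODEL, D_FF), rand_matrix(D_FF, D_MODEL)

    def expert(h):
        hidden = [max(0.0, x) for x in matvec(h, w_up)]
        return matvec(hidden, w_down)

    return expert


experts = [make_expert() for _ in range(N_EXPERTS)]


def moe_output(h, selected, gate_weights):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * D_MODEL
    for idx, weight in zip(selected, gate_weights):
        for j, y in enumerate(experts[idx](h)):
            out[j] += weight * y
    return out


h = [0.1, -0.2, 0.3, 0.4]
y = moe_output(h, selected=[1, 3], gate_weights=[0.7, 0.3])
print(y)  # a D_MODEL-sized vector, same shape as the input hidden state
```

Only experts 1 and 3 are evaluated here, so the compute cost is that of two FFNs even though four sets of expert weights exist.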

The gating and routing mechanism

The router is a simple linear layer: a weight matrix W_gate with shape (d_model, N_experts). For each token, it computes scores = softmax(h @ W_gate), then selects the indices of the top-K scores. The gate weights used in the final sum are re-normalized among the K selected experts so they sum to 1.
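The routing step described above can be sketched for a single token as follows. The hidden state, the router matrix, and the function name `route` are hypothetical values chosen for illustration; real routers operate on batches of tokens with learned weights.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(h, w_gate, k):
    """Return (expert indices, renormalized gate weights) for one token.

    h has d_model entries; w_gate has shape (d_model, n_experts).
    """
    n_experts = len(w_gate[0])
    logits = [sum(h[d] * w_gate[d][e] for d in range(len(h))) for e in range(n_experts)]
    probs = softmax(logits)
    # Indices of the k highest-probability experts.
    top = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:k]
    # Renormalize so the k selected gate weights sum to 1.
    mass = sum(probs[e] for e in top)
    return top, [probs[e] / mass for e in top]


# Hypothetical 3-dimensional hidden state and a 3x4 router matrix (4 experts).
h = [1.0, 0.0, -1.0]
w_gate = [
    [2.0, 0.0, 1.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.0, 1.0, -0.5, 0.0],
]
indices, weights = route(h, w_gate, k=2)
print(indices, [round(w, 3) for w in weights])  # [0, 2] [0.622, 0.378]
```

The router logits here are [2, -1, 1.5, -1], so experts 0 and 2 win, and their renormalized weights depend only on the difference between their two logits.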

The router is jointly trained with the rest of the model via backpropagation. Because top-K selection is discrete (not differentiable), gradients flow only through the selected experts, not the rejected ones. This creates a structural training risk called expert collapse.

Expert collapse happens when the router learns to always prefer 1 or 2 experts. Those experts receive all the gradient signal and improve rapidly. The remaining experts receive no gradient, stop improving, and become permanently useless. The model degrades to essentially a dense model with wasted parameters.

The standard fix is an auxiliary load balancing loss added to the training objective:

L_balance = alpha * sum_i (f_i * P_i)

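Where f_i is the fraction of tokens routed to expert i in a batch (a hard count), and P_i is the mean softmax probability assigned to expert i across the batch (a soft differentiable signal). The sum is smallest when routing is uniform, and it grows when an expert that already receives many tokens (large f_i) is also assigned high probability (large P_i). Because f_i is a count and carries no gradient, the gradient flows through P_i and pushes the router's probability away from overloaded experts. Some formulations, such as the Switch Transformer's, also multiply the sum by N so that the loss stays comparable as the number of experts changes. alpha is a hyperparameter, typically 0.01 to 0.1, that controls how hard the balance is enforced. Setting it too high hurts model quality; too low leads to collapse.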
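To see how the loss separates balanced from collapsed routing, here is a small sketch that evaluates the formula on two hand-made batches. It assumes top-1 routing so that f_i is exactly the fraction of tokens sent to expert i; the function name `load_balance_loss` and the example probabilities are hypothetical.

```python
def load_balance_loss(assignments, probs, n_experts, alpha=0.01):
    """alpha * sum_i f_i * P_i for one batch (top-1 routing assumed).

    assignments: the expert index chosen for each token.
    probs: the router's softmax distribution over all experts, per token.
    """
    n_tokens = len(assignments)
    # f_i: hard fraction of tokens routed to expert i.
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # P_i: mean router probability for expert i across the batch.
    p = [sum(tok[i] for tok in probs) / n_tokens for i in range(n_experts)]
    return alpha * sum(fi * pi for fi, pi in zip(f, p))


N = 4
# Balanced: each of 4 tokens goes to a different expert and the router is uniform.
balanced = load_balance_loss([0, 1, 2, 3], [[0.25] * N] * 4, N)
# Collapsed: every token goes to expert 0 with high probability.
collapsed = load_balance_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4, N)
print(round(balanced, 4), round(collapsed, 4))  # 0.0025 0.0097
```

In this toy case the collapsed batch is penalized almost four times as heavily as the balanced one: the balanced sum is 1/N, while the collapsed sum approaches 1.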

Training MoE models

Training MoE models introduces one structural challenge not present in dense training: tokens from the same batch need to be processed by different experts, which may live on different devices. This is solved through expert parallelism.

In expert parallelism, each expert is assigned to a different GPU (or group of GPUs). The forward pass at each MoE layer requires two all-to-all communication operations, dispatching tokens to their assigned expert's GPU and gathering results back. On fast interconnects (NVLink, InfiniBand), this overhead is roughly 10-15%. On slow interconnects (TCP/IP across racks), it can be much worse.
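The dispatch-and-gather pattern can be illustrated without any real networking. The following single-process sketch only simulates the bookkeeping: it groups tokens by destination expert, runs a stand-in function per expert, and restores the original token order. In a real system each bucket would be sent to another device by an all-to-all collective; the names `dispatch`, `combine`, and `expert_fns`, and the routing decisions themselves, are hypothetical.

```python
from collections import defaultdict


def dispatch(tokens, expert_ids):
    """Group tokens by destination expert, remembering original positions."""
    buckets = defaultdict(list)
    for position, (token, expert) in enumerate(zip(tokens, expert_ids)):
        buckets[expert].append((position, token))
    return buckets


def combine(buckets, expert_fns, n_tokens):
    """Run each expert on its bucket and restore the original token order."""
    out = [None] * n_tokens
    for expert, items in buckets.items():
        for position, token in items:
            out[position] = expert_fns[expert](token)
    return out


# Stand-in experts: each "device" just tags the token it processed.
expert_fns = {e: (lambda tok, e=e: f"{tok}@gpu{e}") for e in range(4)}
tokens = ["the", "cat", "sat", "down"]
expert_ids = [2, 0, 2, 1]  # top-1 routing decisions, assumed given

buckets = dispatch(tokens, expert_ids)
print(dict(buckets))
# {2: [(0, 'the'), (2, 'sat')], 0: [(1, 'cat')], 1: [(3, 'down')]}
print(combine(buckets, expert_fns, len(tokens)))
# ['the@gpu2', 'cat@gpu0', 'sat@gpu2', 'down@gpu1']
```

Note that expert 3 receives nothing and expert 2 receives two tokens. Uneven bucket sizes like this are one reason load balancing matters for throughput as well as for model quality.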
