How Discord delivers real-time voice chat
How Discord uses WebRTC with a selective forwarding unit, Opus codec encoding, jitter buffers, and voice server routing to deliver low-latency group voice chat to millions of concurrent users.
The Problem Statement
Interviewer: "You are on a Discord server with 15 people in a voice channel. Everyone can hear everyone else in near-real time. How does that actually work? Walk me through what happens from the moment I speak into my microphone to the moment my voice comes out of someone else's speaker."
This question tests three things. First, whether you understand how real-time audio differs from HTTP request/response (latency budget measured in tens of milliseconds, not seconds). Second, whether you know why peer-to-peer breaks down for groups bigger than two or three people. Third, whether you understand how codec and network stack interact when packets arrive out of order or drop entirely.
Most candidates hand-wave with "WebRTC" and stop there. The strong answer explains the topology choice (SFU, not full mesh), the codec (Opus, not MP3), and what happens when the network misbehaves.
Clarifying the Scenario
You: "Before I start, a few quick questions. When you say near-real time, are we targeting sub-200ms end-to-end latency, or is gaming-quality sub-100ms the bar?"
Interviewer: "Gaming-quality. Discord's target is under 100ms for most users."
You: "Got it. And are we talking about a group call architecture, or could I describe a two-person DM call? The topology choice is completely different."
Interviewer: "Group call. Assume the channel has 5 to 50 people."
You: "Perfect. And should I cover video as well, or just voice?"
Interviewer: "Just voice. Keep it focused."
You: "OK. I will structure this in three parts. First, what happens on the sending side: microphone capture, encoding, and packetization. Second, the server architecture that routes audio between all participants. Third, the receiving side: how we reconstruct a clean audio signal even when packets are lost or arrive late."
My Approach
I break this into four layers. Each layer solves a specific problem:
- Capture and encoding: Turn raw microphone samples into packets small enough to ship over UDP in under 10ms.
- Transport and routing: Get those packets from sender to all receivers via a server topology that scales beyond two people.
- Reconstruction: Reassemble packets at the receiver into a smooth audio stream despite jitter and packet loss.
- Failure modes: What the system does when the network drops 10% of packets, or when a participant's connection is constrained.
The key insight I always lead with: real-time audio is a fundamentally different problem from file transfer. You cannot buffer-and-retry. A voice packet that arrives 200ms late is not late, it is useless. Every design decision flows from that constraint.
The Architecture
Here is how a voice channel call works end to end. The SFU (Selective Forwarding Unit) sits at the center of every group call.
Let me walk through this step by step.
Step 1: Capture and preprocessing. Your microphone captures Pulse Code Modulation (PCM) audio at 48kHz. That is 48,000 samples per second, each 16 bits. Before any compression, WebRTC's built-in Acoustic Echo Cancellation (AEC) removes your speaker output from your microphone input. Voice Activity Detection (VAD) checks every 20ms frame: if there is silence, the frame is discarded. This silence suppression reduces transmission by 40-60% for typical conversation.
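To make the VAD decision concrete, here is a minimal sketch. The real WebRTC VAD is a statistical per-band classifier, not a bare energy gate; the threshold and RMS approach below are illustrative assumptions that capture only the transmit-or-suppress decision per 20ms frame.

```python
import math

FRAME_SAMPLES = 960  # 20 ms at 48 kHz

def frame_rms(frame):
    """Root-mean-square energy of one 20 ms frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=500.0):
    """Crude VAD: treat the frame as speech if its energy clears a threshold.

    Illustrative only: WebRTC's real VAD is model-based, but the outcome
    is the same binary decision: transmit the frame or drop it.
    """
    return frame_rms(frame) >= threshold

# A silent frame is suppressed; a loud frame is transmitted.
silence = [0] * FRAME_SAMPLES
speech = [4000 if i % 2 else -4000 for i in range(FRAME_SAMPLES)]
```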
Step 2: Opus encoding. The Opus encoder takes each 20ms PCM frame (960 samples) and compresses it. At Discord's typical bitrate of around 64-96kbps, a 20ms frame becomes roughly 160-240 bytes. Opus is adaptive: it drops to 6kbps for constrained paths or scales up to 510kbps for music. This adaptability is why Discord uses Opus even for Go Live screen sharing with audio.
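The frame-size arithmetic is worth being able to do on a whiteboard. A sketch of the relationship between bitrate and encoded payload size per frame:

```python
def frame_bytes(bitrate_bps, frame_ms=20):
    """Encoded payload size of one frame at a given bitrate:
    bits per second * frame duration, converted to bytes."""
    return bitrate_bps * frame_ms // 1000 // 8

# 64 kbps -> 160 bytes per 20 ms frame; 96 kbps -> 240 bytes.
# At Opus's 6 kbps floor a frame is only 15 bytes.
```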
Step 3: RTP packetization. Each Opus frame is wrapped in a Real-time Transport Protocol (RTP) packet with a sequence number and timestamp. Sequence numbers let the receiver detect out-of-order delivery. The packet travels as Secure RTP (SRTP) over UDP, encrypted with DTLS.
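The fixed RTP header is only 12 bytes, and knowing its layout makes the sequence-number and timestamp discussion concrete. A sketch of packing it (payload type 111 is a common dynamic mapping for Opus in WebRTC SDP offers, not a fixed standard value; at 48kHz the timestamp advances by 960 per 20ms frame):

```python
import struct

RTP_VERSION = 2

def pack_rtp_header(seq, timestamp, ssrc, payload_type=111):
    """Build the fixed 12-byte RTP header (RFC 3550)."""
    byte0 = RTP_VERSION << 6      # V=2, no padding, no extension, CC=0
    byte1 = payload_type & 0x7F   # marker bit clear
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

hdr = pack_rtp_header(seq=1, timestamp=960, ssrc=0xDEADBEEF)
```

The receiver uses the sequence number to detect reordering and loss, the timestamp to align playout, and the SSRC to attribute the stream to a speaker.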
Step 4: SFU routing. The packet hits Discord's voice server. The SFU receives your SRTP stream and forwards it to every other participant in the channel. The SFU does NOT decode or re-encode. It reads the RTP header (specifically the Synchronization Source identifier, or SSRC) to identify who is speaking, then forwards the raw encrypted packet to every subscriber. Forwarding is O(1) work per packet regardless of bitrate, which is what makes the SFU scale.
Step 5: Reconstruction at the receiver. The listener's device receives multiple RTP streams (one per active speaker). A jitter buffer absorbs arrival-time variation by holding packets briefly (typically 20-80ms) and releasing them at a steady rate. Opus's built-in Packet Loss Concealment (PLC) fills any gaps. The mixer sums all streams into a single PCM output buffer that goes to the speaker.
Why 20ms frames?
20ms is the standard Opus frame size for voice. Shorter frames (10ms) reduce latency but increase packet overhead. Longer frames (40ms, 60ms) reduce overhead but increase latency and make packet loss more audible. 20ms is the sweet spot for interactive voice.
The SFU vs Full Mesh vs MCU Topology
This is the most important architectural decision in group voice. The topology you choose determines server cost, latency, and scalability ceiling. I have seen candidates get this completely wrong by assuming Discord uses peer-to-peer, which breaks down the moment you have more than two or three people.
Full Mesh P2P. Every participant connects directly to every other participant. For N people, each person uploads N-1 copies of their audio stream. With 10 people at 64kbps each, upload bandwidth per user is 576kbps. On a residential connection with 5-10 Mbps upload, this is manageable for 10 people but NAT traversal failure rates increase dramatically with connection count. Discord tried P2P for small calls early on and moved away from it entirely.
MCU (Multipoint Control Unit). A server decodes all incoming streams, mixes them into a single audio track, and re-encodes for each participant. Server CPU cost is enormous: O(N) decode operations plus O(N) encode operations per 20ms frame. A 50-person call requires 50 decode cycles plus 50 encode cycles every 20ms. MCU approaches exist in older enterprise video systems like Cisco WebEx. Economically unviable at Discord's scale of millions of concurrent voice users.
SFU (Selective Forwarding Unit). The server receives encoded packets and forwards them without decoding. CPU cost per server is dominated by network I/O and memory copy operations, not compute. Each participant receives separate streams from each speaker and mixes client-side. The tradeoff: each listener downloads one stream per active speaker. In a 20-person call with 5 active speakers, a listener downloads 5 streams at 64kbps each, totaling 320kbps download. Manageable on modern connections.
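The bandwidth numbers for the three topologies above can be checked with a few lines of arithmetic:

```python
def per_user_upload_kbps(n, bitrate_kbps, topology):
    """Upload bandwidth one participant needs under each topology."""
    if topology == "mesh":
        return (n - 1) * bitrate_kbps    # one copy per peer
    return bitrate_kbps                  # SFU/MCU: one copy to the server

def per_user_download_kbps(active_speakers, bitrate_kbps, topology):
    """Download bandwidth one listener needs."""
    if topology == "mcu":
        return bitrate_kbps              # server pre-mixes to one stream
    return active_speakers * bitrate_kbps  # one stream per active speaker

# Mesh with 10 people at 64 kbps: 576 kbps upload per user.
# SFU listener with 5 active speakers: 320 kbps download.
```

The numbers make the tradeoff explicit: mesh shifts cost to every user's uplink, MCU shifts it to server CPU, and the SFU splits the difference by spending cheap server bandwidth instead of either.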
Jitter Buffer and Packet Loss Recovery
Voice packets travel over UDP, which provides no ordering, delivery, or congestion guarantees. Packets arrive out of order. Packets get dropped. The jitter buffer and the codec's PLC are what prevent this from being audible.
The key insight here: you cannot wait forever for a late packet. At some point you have to play something, even if the packet never arrives.
Jitter buffer mechanics. The jitter buffer holds incoming RTP frames and releases them at a fixed playout rate. It absorbs arrival jitter (variation in inter-packet arrival time) by adding adjustable delay. Discord's adaptive jitter buffer starts at around 40ms and adjusts based on recent observed jitter. If the network is smooth, the buffer shrinks (lower latency). If jitter spikes, it grows (better protection).
Adaptive playout delay. The buffer tracks the 95th-percentile inter-arrival jitter over the last 2 seconds and sets the playout delay accordingly. A user on a LAN might have a 20ms buffer, while a user on congested cellular might have 80ms. The tradeoff is explicit: more buffer depth means more delay but fewer audible gaps.
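The percentile-based sizing described above can be sketched in a few lines. The exact window, percentile, and clamp bounds are illustrative assumptions; production implementations such as WebRTC's NetEq use more elaborate estimators.

```python
def target_playout_delay_ms(jitter_samples_ms, floor_ms=20, ceil_ms=80):
    """Size the jitter buffer from the 95th-percentile recent
    inter-arrival jitter, clamped to a sane range."""
    ordered = sorted(jitter_samples_ms)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return max(floor_ms, min(ceil_ms, p95))

# A smooth LAN path collapses to the floor; a spiky cellular path
# saturates at the ceiling.
```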
Packet Loss Concealment. When the jitter buffer detects a gap (missing sequence number after the playout deadline), it passes a "missing frame" signal to the Opus decoder. Opus PLC generates a synthetic continuation of the previous frame by extrapolating from the spectral content of the last received frame. For packet loss up to about 20%, Opus PLC is largely imperceptible to listeners for speech content.
Late packets are discarded, not played
If a packet arrives after the jitter buffer has already released its playout slot, it is discarded on arrival. In real-time audio, a late packet is worse than no packet because playing it would disrupt the smooth playout clock. Opus PLC handles the gap; the belated arrival gets thrown away.
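Putting the mechanics together, here is a minimal jitter buffer sketch (illustrative, not Discord's code): it reorders by sequence number, signals the decoder to run PLC when a frame is missing at its playout deadline, and rejects packets whose slot has already passed.

```python
class JitterBuffer:
    """Minimal jitter buffer: reorders by sequence number, discards
    late arrivals, and signals a gap (for PLC) on missing frames."""

    def __init__(self):
        self.pending = {}   # seq -> encoded frame
        self.next_seq = 0   # next sequence number due for playout

    def push(self, seq, frame):
        if seq < self.next_seq:
            return False    # late: its playout slot has already passed
        self.pending[seq] = frame
        return True

    def pop(self):
        """Called once per 20 ms playout tick. Returns the frame, or
        None to tell the Opus decoder to run loss concealment."""
        frame = self.pending.pop(self.next_seq, None)
        self.next_seq += 1
        return frame

buf = JitterBuffer()
buf.push(1, b"f1")          # seq 0 is delayed in the network
first = buf.pop()           # seq 0 missing at its deadline -> PLC
second = buf.pop()          # seq 1 plays in its slot
late = buf.push(0, b"f0")   # seq 0 finally arrives: discarded
```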
Voice Server Routing and Geographic Affinity
Discord has dozens of voice server regions worldwide. When you join a voice channel, you connect to the geographically closest voice server region assigned to that channel. RTT between client and voice server is the single largest contributor to end-to-end latency, so getting this placement right matters enormously.
Region assignment. When a voice channel is first joined, Discord's gateway assigns it to a voice server region. The region is stored with the channel state. All participants, regardless of their own location, connect to the channel's assigned voice server. A user in Tokyo connecting to a US-East-assigned channel will experience higher latency than a user in New York.
Why pin all participants to one server? Because all participants must connect to the same SFU for the call to function. The SFU is the central routing point. If participants connected to different regional servers, you would need server-to-server forwarding, adding latency and complexity. For calls where all participants are in the same region, the assigned server is optimal. For globally distributed calls, the server placement is a deliberate compromise.
Manual region override. Discord lets server admins manually set the voice region for a channel. I have seen gaming guilds with members across the US and Europe manually set the region to US-East because their most latency-sensitive members are North American competitive gamers who accept the tradeoff on behalf of EU members.
Failover. If a voice server goes down mid-call, the Discord gateway detects the disconnection via heartbeat failure and assigns the call to a new voice server in the same or fallback region. Clients reconnect with a brief interruption of typically 2-5 seconds. Discord also monitors voice server health and pre-emptively migrates channels away from overloaded servers.
Mention region assignment in your interview
Noting that the SFU is geographically distributed and that region assignment is a deliberate design choice (all participants share one SFU, not per-user servers) demonstrates that you understand the routing tradeoffs of distributed real-time systems. Most candidates skip this and describe a single global server.
The Tricky Parts
- Synchronizing multiple streams. A 15-person call means the receiving client handles up to 15 incoming RTP streams simultaneously, each with its own jitter buffer and sequence counter. Mixing them correctly requires timestamp alignment: if Alice and Bob are both talking, their streams must be mixed with the correct time offset. Discord uses RTP timestamps (based on the sampling clock) for this synchronization, so that words spoken at the same moment are mixed and played back at the same moment.
- Active speaker detection at scale. In a 50-person channel, forwarding all 50 streams to every participant would be 2,500 forwarding operations per 20ms frame. Instead, the SFU uses VAD to detect active speakers and forwards only streams where voice activity is detected. Typically 1-3 speakers are active at any moment. The SFU also sends an active speaker event over the signaling channel so clients update their UI.
- Congestion on the sender's uplink. If a participant's upload bandwidth is constrained, their Opus encoder receives RTCP feedback from the SFU indicating elevated packet loss. The encoder adaptation logic reduces bitrate. At very low bitrates below 16kbps, Opus switches from its MDCT-based mode (suited for music) to SILK (a speech codec optimized for voice intelligibility at low bitrates), preserving comprehensibility even on very poor paths.
- NAT traversal for outbound UDP. SRTP runs over UDP, typically to port 443 or 3478 on Discord's voice servers. Most firewalls allow outbound UDP. For enterprise networks that block UDP entirely, Discord falls back to WebRTC over TCP using TURN relay servers, which adds latency but maintains connectivity where UDP is blocked.
- Channel state persistence. Voice channels exist independently of whether anyone is in them. The channel's region assignment, name, and permissions persist in Discord's main database. When a user joins, the gateway looks up the channel's assigned voice server URL and sends a VOICE_SERVER_UPDATE event to the client over the persistent WebSocket signaling connection.
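The client-side mixing step mentioned under stream synchronization is conceptually just sample-wise addition of the aligned streams, with clamping so simultaneous loud speakers do not wrap the 16-bit range. A sketch:

```python
def mix_frames(frames, limit=32767):
    """Sum aligned 16-bit PCM frames from several speakers into one
    output frame, clamping each sample to the int16 range."""
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        mixed.append(max(-limit - 1, min(limit, total)))
    return mixed

alice = [10000, -10000, 30000]
bob = [10000, -10000, 30000]
out = mix_frames([alice, bob])  # third sample clamps at 32767
```

Real mixers do this per 20ms tick across every active speaker's jitter-buffer output; the clamp is the crudest form of limiting, and production mixers typically apply gentler gain control instead.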
What Most People Get Wrong
| Mistake | What they say | Why it is wrong | What to say instead |
|---|---|---|---|
| Wrong topology | "Discord uses peer-to-peer like a phone call" | P2P upload bandwidth exhausts at 4-5 people on typical connections | "SFU: all audio routes through a central server per region that forwards packets without decoding" |
| Wrong codec | "Discord uses MP3 or AAC" | MP3/AAC algorithmic delay is 200-400ms, not suitable for real-time interactive audio | "Opus codec, 20ms frames, adaptive 6-510kbps, built-in PLC and FEC" |
| Ignoring jitter buffer | "Packets arrive and play immediately" | UDP delivers packets out of order with variable delay; without a buffer, audio would glitch constantly | "Jitter buffer holds frames 20-80ms and releases them at a steady 50Hz rate" |
| Ignoring packet loss | "The network is reliable enough" | Real Internet paths see 0.5-5% packet loss under load; at 20ms frames, even 5% loss is audible without PLC | "Opus built-in PLC synthesizes continuation for missing frames, imperceptible under ~20% loss" |
| Single latency bucket | "The only latency is network transit" | Encode delay (2-3ms) plus jitter buffer (20-80ms) plus decode (2-3ms) dominate on local networks | "Four components: encode delay, network transit, jitter buffer, decode delay, totaling 34-136ms" |
How I Would Communicate This in an Interview
Here is how I would actually say this, aiming for about 90 seconds:
"Discord voice uses an SFU architecture, not peer-to-peer. When you speak, your microphone captures PCM audio at 48kHz. WebRTC's built-in echo cancellation and noise suppression clean the signal, then the Opus codec compresses each 20ms chunk to about 160-240 bytes at 64-96kbps. These frames are packetized as SRTP over UDP and sent to the nearest Discord voice server for the channel's assigned region.
The voice server is an SFU: it receives your encoded stream and forwards the raw encrypted packets to every other participant. It never decodes or re-encodes. Server CPU cost is dominated by network I/O, not compute, which is what makes it scale to millions of concurrent users.
On the receiving side, each listener maintains a jitter buffer that holds incoming packets briefly (20-80ms adaptively) and releases them at a steady rate. This smooths out UDP's variable arrival times. If a packet is lost, Opus's built-in packet loss concealment synthesizes a continuation that is imperceptible for loss rates under about 20%.
The latency budget is: encode 2-3ms, network transit 10-50ms, jitter buffer 20-80ms, decode 2-3ms. Total end-to-end is typically 60-120ms. Discord targets under 100ms, which requires using the closest available voice server region."
Lead with the topology choice
The single most differentiating thing you can say is "SFU, not peer-to-peer, because P2P upload bandwidth exhausts at 4-5 participants." Say that in your first sentence. It immediately signals you understand the scaling constraint driving the whole design.
Interview Cheat Sheet
- SFU in Discord: All audio routes through a server that forwards RTP packets without decoding. Server CPU cost is O(N) memcopy, not O(N) decode/encode.
- Why not P2P: At N participants, each user uploads N-1 streams. Upload bandwidth exhausts at ~5 users on typical broadband connections.
- Why not MCU: Server decodes all streams, mixes, re-encodes N output streams per 20ms frame. CPU cost is O(N) decode + O(N) encode. Economically unviable.
- Opus codec: 20ms frames, 6-510kbps adaptive, built-in PLC for loss concealment, optional FEC for proactive redundancy. IETF-standardized for low-latency voice.
- SRTP over UDP: Encrypted real-time transport with sequence numbers (ordering), timestamps (synchronization), and SSRC identifiers (stream identity per speaker).
- Jitter buffer: Adaptive playout delay 20-80ms that absorbs packet arrival variation. Smaller on good connections, larger on congested paths.
- PLC: Opus decoder synthesizes audio for missing frames. Imperceptible up to approximately 20% packet loss for voice content.
- VAD silence suppression: Silent frames are not transmitted, reducing upload bandwidth by 40-60% for typical conversational speech patterns.
- Region assignment: Voice channel is pinned to one geographic SFU region. All participants connect to that same server. Chosen at channel creation.
- Latency budget: Encode 2-3ms + network 10-50ms + jitter buffer 20-80ms + decode 2-3ms = 34-136ms total. Discord targets sub-100ms for typical users.
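The latency budget is simple addition, but it is worth rehearsing the best and worst cases so the 34-136ms range is automatic in an interview:

```python
def latency_budget_ms(encode, network, jitter_buffer, decode):
    """End-to-end one-way latency as the sum of the four components."""
    return encode + network + jitter_buffer + decode

best = latency_budget_ms(encode=2, network=10, jitter_buffer=20, decode=2)
worst = latency_budget_ms(encode=3, network=50, jitter_buffer=80, decode=3)
# best = 34 ms, worst = 136 ms
```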
Quick Recap
- Discord voice uses an SFU: a central server per region that forwards encoded RTP packets without decoding, making server CPU cost O(N) memcopy rather than O(N) decode/encode cycles.
- Opus codec encodes 20ms PCM frames at 6-510kbps, adapting bitrate based on RTCP feedback from the SFU reporting packet loss and jitter.
- Audio travels as SRTP over UDP with sequence numbers for reordering, timestamps for synchronization, and SSRC identifiers to distinguish individual speakers.
- The adaptive jitter buffer at the receiver absorbs network arrival jitter by holding frames 20-80ms and releasing them at steady 50Hz playout rate.
- Opus PLC synthesizes audio for missing frames, making packet loss up to 20% largely imperceptible for speech; Opus FEC proactively embeds redundant frames.
- Voice channels are pinned to one geographic voice server region and all participants connect to that same SFU regardless of individual location.
- VAD silence suppression drops silent frames, reducing upload bandwidth 40-60% and preventing the SFU from forwarding inactive streams at scale.
- End-to-end latency budget: encode 2-3ms, network transit 10-50ms, jitter buffer 20-80ms, decode 2-3ms, targeting under 100ms total for typical users.
Related Concepts
- How WebSockets Work: Discord's signaling layer (voice server assignment, channel join/leave events) uses a persistent WebSocket connection. Voice audio itself uses SRTP over UDP, not WebSockets.
- How HTTP/3 and QUIC Work: QUIC solves many of the same UDP reliability problems as the RTP/RTCP stack, using different mechanisms. Understanding QUIC deepens intuition for why real-time audio built its own UDP-based protocol stack.
- How Zoom Video Works: Zoom uses a similar SFU architecture for audio. Adding video introduces bitrate adaptation and simulcast (sending multiple resolution streams simultaneously) as additional complexity.