TCP vs UDP
When to choose TCP vs UDP: reliability trade-offs, when UDP wins in gaming, video, and DNS, what QUIC does differently, and how to add reliability at the application layer when UDP is the right choice.
TL;DR
| Dimension | Choose TCP | Choose UDP |
|---|---|---|
| Delivery | Must deliver every byte, in order | Can tolerate lost or reordered packets |
| Latency | Can tolerate handshake + retransmission delays | Need lowest possible latency (real-time audio, gaming) |
| Multiplexing | Single stream per connection is acceptable | Need independent streams without head-of-line blocking |
| Connection overhead | Long-lived connections amortize setup cost | One-shot request/response (DNS) or connectionless multicast |
| Protocol complexity | Want the OS to handle reliability | Willing to build application-layer reliability |
Default answer: TCP. Most application traffic (HTTP APIs, database connections, file transfers, chat) uses TCP because reliable ordered delivery is the baseline expectation. UDP wins specifically when late data is worse than missing data, or when TCP's head-of-line blocking is the bottleneck.
The Framing
In 2018, a game studio shipped a competitive multiplayer shooter using TCP for all player position updates. At 60 frames per second, each position update is stale 16ms after it's sent. When packet loss hit 1% on a player's connection, TCP's retransmission logic kicked in: every packet behind the lost one queued up waiting for the retransmit, adding 50-200ms of jitter. Players saw enemies teleporting across the screen.
They rewrote the networking layer on UDP with application-layer redundancy (send the last 3 positions in every packet) and the teleportation problem disappeared. Late data was simply ignored because the next update was already arriving.
This is the core of the TCP vs. UDP decision. TCP guarantees that every byte arrives, in order, at any cost in latency. UDP guarantees nothing, but gives you the freedom to decide what "reliability" means for your specific use case.
The question is never "which protocol is better?" The question is: when a packet is lost, is it better to wait for a retransmission or move on without it?
How Each Works
TCP: Reliable Ordered Byte Stream
TCP establishes a connection with a three-way handshake (SYN, SYN-ACK, ACK), which costs one round trip before any data flows. After that, it provides a reliable, ordered byte stream with flow control and congestion control.
Key mechanisms that add latency:
- Three-way handshake: 1 RTT before data flows. On a 100ms RTT link, that's 100ms of pure overhead. TLS adds another 1-2 RTTs on top.
- Retransmission: Lost packets are retransmitted after duplicate ACKs trigger fast retransmit, or after a retransmission timeout (commonly a 200ms minimum, with exponential backoff). Every byte behind the lost packet waits.
- Congestion control: TCP starts slow (slow start) and probes for available bandwidth. A new connection on a 1 Gbps link begins with a congestion window of roughly 10 segments (~14 KB per round trip) and ramps up over multiple RTTs, regardless of link capacity. This is correct behavior for the network but painful for short-lived transfers.
- Head-of-line blocking: TCP delivers bytes in order. If byte 1000 is lost, bytes 1001-5000 sit in a buffer waiting for the retransmit, even if they're for completely independent application-level requests.
For your interview: TCP's head-of-line blocking is the reason HTTP/2 over TCP doesn't fully deliver on its multiplexing promise, and it's why HTTP/3 moved to QUIC (UDP-based).
UDP: Minimal Datagram Delivery
UDP sends individual datagrams with an 8-byte header (source port, destination port, length, checksum). No handshake, no connection state, no retransmission, no ordering, no flow control.
```text
UDP Header (8 bytes):

+-------------------+-------------------+
| Source Port (16b) | Dest Port (16b)   |
+-------------------+-------------------+
| Length (16b)      | Checksum (16b)    |
+-------------------+-------------------+
| Payload...                            |
+---------------------------------------+

TCP Header (20+ bytes):

+-------------------+-------------------+
| Source Port (16b) | Dest Port (16b)   |
+-------------------+-------------------+
| Sequence Number (32b)                 |
+---------------------------------------+
| Acknowledgment Number (32b)           |
+---------------------------------------+
| Data Offset / Flags / Window (32b)    |
+-------------------+-------------------+
| Checksum (16b)    | Urgent Ptr (16b)  |
+-------------------+-------------------+
| Options (0-40 bytes)                  |
+---------------------------------------+
| Payload...                            |
+---------------------------------------+
```
The simplicity is the feature. Each datagram is independent. The network delivers them in any order (or not at all), and the application decides what to do about it. I've seen teams build surprisingly sophisticated reliability on top of UDP when TCP's one-size-fits-all approach didn't fit their needs.
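The asymmetry is easy to see on a loopback socket: a UDP socket can send its very first datagram with no prior exchange, while a TCP socket must complete the three-way handshake inside `connect()` before any payload moves. A minimal sketch (ports and payloads here are arbitrary):

```python
import socket

# UDP: the first byte on the wire is payload -- no handshake.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))                 # OS picks a free port
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"first byte is payload", receiver.getsockname())
data, _ = receiver.recvfrom(2048)

# TCP: SYN, SYN-ACK, ACK all happen inside connect()/accept().
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())            # 1 RTT before data can flow
conn, _ = server.accept()
client.sendall(b"payload after handshake")
tcp_data = conn.recv(2048)

print(data, tcp_data)
for s in (receiver, sender, client, conn, server):
    s.close()
```

On loopback the handshake is nearly free; on a 100ms WAN link, that `connect()` call alone costs 100ms.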
TCP's Congestion Control (Why Throughput Ramps Slowly)
TCP doesn't just guarantee delivery. It also regulates how fast it sends data to avoid overwhelming the network. This is critical to understand because it explains why TCP feels slow on new connections.
Slow start: A new TCP connection doesn't know the network's capacity. It starts with a small congestion window (typically 10 segments, ~14 KB) and doubles it every round trip. On a 100ms RTT link, it takes ~500ms to ramp up to 1 MB/s throughput, even if the network can handle 100 MB/s. This is why short-lived HTTP requests on new connections feel sluggish.
AIMD (Additive Increase, Multiplicative Decrease): After slow start, TCP increases the congestion window by one segment per RTT (additive increase). When packet loss is detected, it halves the window (multiplicative decrease). This sawtooth pattern means TCP never fully utilizes available bandwidth; it oscillates between probing for more and backing off.
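The slow-start and AIMD behavior above can be sketched as a toy simulation. This is an illustration of the shape only (window in segments, one step per RTT); real stacks like Cubic and BBR differ substantially in detail:

```python
# Toy model of TCP's congestion window. Assumptions: slow start doubles
# cwnd up to ssthresh, congestion avoidance adds one segment per RTT,
# and a loss event halves the window.

def simulate(rtts, loss_at=frozenset(), cwnd=10, ssthresh=64):
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if rtt in loss_at:
            ssthresh = max(cwnd // 2, 2)        # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)      # slow start: exponential
        else:
            cwnd += 1                           # additive increase
    return history

# Exponential ramp, linear probing, then halving after a loss at RTT 8.
print(simulate(12, loss_at={8}))
```

The output traces the sawtooth: a fast exponential ramp, a slow linear climb, and a sharp drop on loss, which is exactly why long-lived pooled connections outperform fresh ones.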
Why this matters for system design: If your service creates many short-lived TCP connections, each one pays the slow start penalty. This is why connection pooling (HTTP keep-alive, gRPC persistent connections, database connection pools) is so important. A single long-lived connection has already ramped its congestion window and sends at full speed. For throughput-sensitive applications like file uploads or streaming, slow start on a new connection can add seconds of latency before reaching full speed.
Modern variants like BBR (developed by Google) replace the loss-based approach with a model-based algorithm that estimates available bandwidth without waiting for packet loss. BBR achieves near-optimal throughput on high-bandwidth, high-latency links where traditional TCP (Cubic, Reno) underperforms. Google deployed BBR across their infrastructure and measured 2-25x throughput improvement on long-haul connections.
For your interview: if someone asks why a file download starts slow and speeds up, this is the answer. Slow start ramps the congestion window. The download reaches full speed only after several RTTs of probing. Mention connection pooling as the standard mitigation for short-lived connections.
Head-to-Head Comparison
| Dimension | TCP | UDP | Verdict |
|---|---|---|---|
| Reliability | Guaranteed delivery, retransmission built in | Best-effort, application handles loss | TCP for correctness, UDP for speed |
| Ordering | Strict byte-order delivery | No ordering guarantees | TCP when order matters globally |
| Connection setup | 1 RTT handshake (+ TLS overhead) | None, first byte is payload | UDP for one-shot or latency-critical |
| Head-of-line blocking | Yes, lost packets stall entire stream | No, each datagram independent | UDP for multiplexed streams |
| Header overhead | 20+ bytes per segment | 8 bytes per datagram | UDP (2.5x smaller header) |
| Flow control | Built-in window-based flow control | None, sender can overwhelm receiver | TCP for cooperative senders |
| Congestion control | Slow start + AIMD, network-friendly | None, application must self-regulate | TCP is a better network citizen |
| Multicast | Not supported (point-to-point only) | Supported (one-to-many) | UDP for broadcast/multicast |
| Connection state | Kernel tracks per-connection state | Stateless, no kernel resources per "connection" | UDP for massive fan-out |
| NAT traversal | Straightforward with established connections | Requires hole-punching techniques | TCP simpler, UDP needs STUN/TURN |
The fundamental tension is reliability vs. latency. TCP pays for reliability with connection overhead, retransmission delays, and head-of-line blocking. UDP pays for speed with zero built-in reliability and the burden of application-layer error handling.
One detail worth noting: TCP maintains significant per-connection state in the kernel. Each connection tracks sequence numbers, acknowledgment numbers, congestion window, retransmission timers, and receive buffers. On a server handling 1M concurrent connections, TCP state consumes gigabytes of kernel memory. A busy web server with 100K concurrent keep-alive connections uses ~6-8 GB just for TCP socket buffers.
UDP is stateless at the transport layer. No per-connection kernel resources, no buffers, no timers. This is why protocols that need massive fan-out (DNS resolvers handling millions of queries per second, game servers with 100K+ concurrent players, telemetry collectors) prefer UDP even when they implement their own reliability on top.
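The fan-out point is concrete at the socket level: one UDP socket serves any number of peers, because each `recvfrom()` simply hands back a datagram plus its sender's address. A loopback sketch with three simulated clients:

```python
import socket

# One UDP socket, no accept(), no per-connection file descriptors,
# no kernel buffers per peer.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server_addr = server.getsockname()

# Three independent "clients", each just fires a datagram.
clients = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(3)]
for i, c in enumerate(clients):
    c.sendto(f"ping from client {i}".encode(), server_addr)

# The server replies using the address attached to each datagram.
seen = set()
for _ in clients:
    data, addr = server.recvfrom(1024)
    seen.add(data)
    server.sendto(b"pong", addr)

replies = [c.recvfrom(1024)[0] for c in clients]
print(sorted(seen), replies)

for s in clients + [server]:
    s.close()
```

A TCP version of the same exchange would need a listening socket plus one accepted socket (and kernel state) per client.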
When TCP Wins
TCP is the right default for the vast majority of application traffic. If you're not sure, use TCP.
API calls and web traffic. HTTP/1.1 and HTTP/2 run on TCP. Every REST call, GraphQL query, and gRPC request uses TCP. The handshake cost is amortized over connection reuse (HTTP keep-alive), and reliable delivery is non-negotiable for request/response semantics. My recommendation: unless you're building real-time media or gaming infrastructure, TCP is your protocol.
File transfers. Losing a single byte in a file download corrupts the file. TCP's retransmission guarantees integrity. FTP, SCP, and HTTP downloads all use TCP for this reason.
Database connections. Queries must not be silently dropped or reordered. A lost SQL statement would corrupt application state. Every database protocol (MySQL, PostgreSQL, MongoDB wire protocols) runs on TCP.
Chat and messaging. User-to-user messages need delivery guarantees. If a message is lost in transit, the user never sees it. TCP's reliability is essential here. WebSocket (which runs over TCP) is the standard for real-time chat.
Email (SMTP/IMAP). Email delivery is inherently a reliable-transfer problem. Losing an email in transit is unacceptable.
Interview tip: mention connection reuse
When justifying TCP for web services, mention that the handshake cost is a one-time overhead. HTTP keep-alive and connection pooling mean the 1-RTT setup is amortized across thousands of requests. The per-request overhead of TCP is effectively just the header bytes and congestion window state.
When UDP Wins
UDP wins when late data is worse than missing data, or when TCP's head-of-line blocking defeats the purpose of multiplexing.
Real-time video and audio (WebRTC, Zoom, streaming). A late audio frame is useless. If a TCP packet is retransmitted after 200ms, all subsequent audio frames wait behind it, causing jitter that's far worse than a brief audio gap. I've debugged VoIP quality issues that traced directly to TCP retransmissions adding 200ms spikes. UDP with application-layer jitter buffers and packet loss concealment handles this gracefully.
Online multiplayer games. Position updates at 60Hz mean each frame is 16ms old before the next arrives. Retransmitting a stale position update delivers wrong information late. Games send redundant state in every packet (last 3 positions) so the receiver can interpolate even when packets are lost.
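The "last 3 positions in every packet" trick can be sketched with a made-up wire format (this is illustrative, not any real game protocol): a 16-bit tick number followed by three (x, y) float pairs, newest first.

```python
import struct

FMT = "!H6f"   # tick number + three (x, y) pairs, network byte order

def encode(tick, last3):
    return struct.pack(FMT, tick, *[c for p in last3 for c in p])

def decode(packet):
    tick, *flat = struct.unpack(FMT, packet)
    return tick, [(flat[i], flat[i + 1]) for i in range(0, 6, 2)]

history = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]
packets = [encode(t, [history[t], history[t - 1], history[t - 2]])
           for t in range(2, len(history))]

# The packet for tick 2 is lost in transit; tick 3's packet still carries
# tick 2's position as redundancy, so the receiver never waits for a
# retransmit -- it interpolates from what it has.
arrived = packets[1:]
tick, positions = decode(arrived[-1])
print(tick, positions)   # 3 [(3.0, 1.5), (2.0, 1.0), (1.0, 0.5)]
```

The cost is a few extra bytes per packet; the benefit is that a single loss never leaves a gap in the receiver's state.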
DNS queries. A DNS query and response each fit in a single UDP datagram (typically under 512 bytes). The round-trip is faster than a TCP handshake + request + response. If the query fails, the resolver retries. No connection state needed for a one-shot lookup. DNS uses TCP as a fallback for large responses (DNSSEC, zone transfers over 512 bytes).
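To see why a single datagram is enough, here is a hand-rolled DNS A-record query following the RFC 1035 message format. The whole question is a few dozen bytes (actually sending it to a resolver on port 53 is left out; this just builds the packet):

```python
import struct

def build_query(name, txid=0x1234):
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question: length-prefixed labels, root byte, QTYPE=A(1), QCLASS=IN(1)
    labels = b"".join(
        struct.pack("!B", len(part)) + part.encode()
        for part in name.split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)

query = build_query("example.com")
print(len(query))   # 29 bytes -- comfortably inside one datagram
```

A 29-byte query and a response well under 512 bytes fit a single round trip, with no connection to set up or tear down.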
Multicast and broadcast. UDP supports one-to-many delivery. Streaming the same live video feed to 100,000 viewers can use IP multicast over UDP. TCP would require 100,000 separate connections from the server.
Service discovery and heartbeats. Lightweight health checks and service announcements use UDP because the overhead of establishing TCP connections to every peer would be prohibitive in large clusters.
The Nuance: QUIC Changes Everything
The TCP-vs-UDP debate shifted fundamentally when QUIC arrived. QUIC is a transport protocol built on UDP that provides TCP-like reliability without TCP's worst problems.
What QUIC solves:
- Head-of-line blocking: QUIC multiplexes independent streams over UDP. Packet loss on Stream 1 only stalls Stream 1. Streams 2 and 3 continue unblocked. This is the key advantage over HTTP/2 over TCP, where one lost packet stalls every multiplexed stream.
- Connection setup latency: QUIC combines the transport handshake and TLS handshake into a single round trip (1-RTT). For repeat connections, 0-RTT is possible, sending application data in the very first packet.
- Connection migration: TCP connections are identified by the 4-tuple (source IP, source port, dest IP, dest port). When a mobile user switches from WiFi to cellular, the IP changes and the TCP connection dies. QUIC uses a connection ID, so the connection survives IP changes without re-handshaking.
Here's what the handshake difference looks like in practice. A first-time TCP+TLS 1.3 connection takes two round trips before any application data flows (one for the TCP handshake, one for TLS; TLS 1.2 adds a third). QUIC combines transport and encryption into a single round trip, and on repeat connections, data can flow immediately.
On a 100ms RTT mobile connection, the difference is stark: TCP+TLS takes 200ms before any data flows. QUIC takes 100ms on first connect and 0ms on repeat visits. For mobile users switching between WiFi and cellular (which changes their IP address), QUIC's connection ID means the connection survives without re-handshaking. TCP connections die on IP change and must re-establish from scratch.
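The arithmetic behind those numbers, assuming TLS 1.3 (TLS 1.2 adds one more round trip) and ignoring server processing time:

```python
# Round trips that must complete before the first byte of application data.
RTTS_BEFORE_DATA = {
    "tcp+tls1.3": 2,   # TCP handshake (1 RTT) + TLS handshake (1 RTT)
    "quic": 1,         # transport and crypto combined into one round trip
    "quic-0rtt": 0,    # resumed connection: data rides the first flight
}

def ttfb_ms(rtt_ms, transport):
    return rtt_ms * RTTS_BEFORE_DATA[transport]

for t in RTTS_BEFORE_DATA:
    print(f"{t}: {ttfb_ms(100, t)} ms before first byte")
```

At datacenter RTTs (sub-millisecond) the difference vanishes, which is why QUIC's handshake win matters mostly for mobile and long-haul links.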
What QUIC doesn't solve: QUIC still runs congestion control (its own implementation, not TCP's). It still has per-stream head-of-line blocking (packet loss on stream 1 stalls stream 1, just not streams 2 and 3). And it adds CPU overhead because all encryption and reliability mechanisms run in userspace rather than the kernel. QUIC is not "fast UDP." It's TCP done better, implemented in userspace over UDP to bypass middlebox ossification.
HTTP/3 adoption: HTTP/3 (which mandates QUIC) is supported by all major browsers and is served by Google, Cloudflare, Meta, Fastly, and most modern CDNs. Mobile users see the biggest improvement because 0-RTT resumption and connection migration directly address the two biggest pain points of mobile networking: high-latency cellular handshakes and IP changes during network transitions. For server-to-server communication within a datacenter, QUIC's benefits are minimal because RTTs are sub-millisecond and connections are long-lived.
Gotcha: QUIC is not raw UDP
Don't describe QUIC as "just using UDP for speed." QUIC provides reliable, ordered, per-stream delivery with congestion control and encryption. It uses UDP as a substrate to avoid OS kernel restrictions on deploying new transport protocols. The choice of UDP is about deployability (middleboxes don't block UDP port 443), not about avoiding reliability.
Real-World Examples
Zoom / WebRTC: Uses UDP for all real-time audio and video streams, processing upwards of 300 million daily meeting participants. When packet loss occurs, the codec applies Forward Error Correction (FEC) to reconstruct lost data from redundant packets, or conceals the gap by repeating the last good frame. Zoom's media servers send ~50 packets per second per stream for video. At 3% packet loss, that's 1-2 lost packets per second, which FEC handles transparently. Zoom falls back to TCP only when UDP is blocked by corporate firewalls, and their own quality metrics show a 30-40% increase in latency variance and a measurable drop in Mean Opinion Score (MOS) when forced to TCP.
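The simplest form of the FEC idea: send one XOR parity packet per group of equal-size payloads, and any single loss in the group can be rebuilt from what arrived. Production codecs use far stronger schemes (Reed-Solomon, fountain codes) that recover multiple losses; this sketch only shows the principle.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

group = [b"frame-01", b"frame-02", b"frame-03"]   # equal-size payloads
fec = reduce(xor_bytes, group)                    # parity packet

# "frame-02" is lost in transit. XOR of the survivors plus the parity
# packet reconstructs it with zero retransmission delay.
recovered = reduce(xor_bytes, [group[0], group[2], fec])
print(recovered)   # b'frame-02'
```

The trade-off is bandwidth for latency: the parity packet adds ~33% overhead here, but recovery requires no round trip at all.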
Discord: Voice channels use UDP with the Opus audio codec, handling millions of concurrent voice connections. Discord's engineering team measured a 30-50ms reduction in average audio latency after switching from TCP to UDP for voice, plus elimination of the "robot voice" artifact caused by TCP retransmission cascades during packet loss spikes on congested home networks. Their custom UDP protocol includes a lightweight sequence number for jitter buffer management but intentionally omits retransmission. A dropped audio frame (20ms of speech) is imperceptible; a 200ms TCP retransmission stall is very noticeable.
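A jitter buffer like the one described can be sketched in a few lines (a simplified model, not Discord's implementation): datagrams carry a sequence number, the receiver holds a small reorder window, and a frame that never arrives is declared lost and concealed rather than waited for.

```python
import heapq

class JitterBuffer:
    def __init__(self, depth=3):
        self.depth = depth       # frames to hold before giving up on a gap
        self.heap = []           # min-heap ordered by sequence number
        self.next_seq = 0

    def push(self, seq, frame):
        heapq.heappush(self.heap, (seq, frame))

    def pop(self):
        """Next frame to play, or None for a gap (lost or still in flight)."""
        if self.heap and self.heap[0][0] == self.next_seq:
            self.next_seq += 1
            return heapq.heappop(self.heap)[1]
        if len(self.heap) >= self.depth:
            # The expected frame never arrived and the buffer is full:
            # declare it lost, conceal the gap, and keep the stream moving.
            self.next_seq += 1
        return None

# Frames arrive out of order, and frame 2 is lost entirely.
buf = JitterBuffer()
for seq, frame in [(1, "B"), (0, "A"), (3, "D"), (4, "E"), (5, "F")]:
    buf.push(seq, frame)

played = [buf.pop() for _ in range(6)]
print(played)   # ['A', 'B', None, 'D', 'E', 'F']
```

The `None` is the 20ms gap the codec conceals; the stream never stalls, which is the whole point of skipping retransmission.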
Cloudflare / HTTP/3: Cloudflare serves over 25% of all web traffic and reports that HTTP/3 (QUIC) now handles a significant and growing share of requests. Their measurements show HTTP/3 reduces time-to-first-byte by 12.4% compared to HTTP/2 over TCP on high-latency mobile connections. The improvement comes from 0-RTT connection resumption (saving a full RTT on repeat visits) and elimination of cross-stream head-of-line blocking. On lossy mobile networks (2-5% packet loss), the improvement is even more dramatic because QUIC avoids the cascade where one lost packet stalls all multiplexed streams.
Google / YouTube: Google pioneered QUIC and has been running it in production since 2013. As of recent measurements, QUIC handles over 30% of YouTube's global video traffic and a significant portion of Google Search and Gmail traffic. Google's published data shows QUIC reduced search latency by 3.6% and YouTube rebuffer rate by 15-18% compared to TCP. The rebuffer improvement comes from QUIC's ability to continue streaming on one stream while another recovers from packet loss, plus 0-RTT resumption that eliminates the cold-start penalty when users resume a video.
How This Shows Up in Interviews
This trade-off appears in system design interviews when discussing real-time features, protocol choices, or networking layers.
What they're testing: Whether you understand why certain protocols choose TCP or UDP, not just which ones do. The follow-up is usually about head-of-line blocking or QUIC.
Depth expected at senior level:
- Explain the three-way handshake and its latency cost
- Describe head-of-line blocking and why it matters for HTTP/2
- Know that QUIC solves HOL blocking by multiplexing independent streams on UDP
- Understand when application-layer reliability over UDP is appropriate
- Explain connection migration for mobile users
| Interviewer asks | Strong answer |
|---|---|
| "Why UDP for your video streaming service?" | "TCP retransmissions add 200ms+ latency spikes. For real-time video, a dropped frame is invisible (codec conceals it), but a 200ms stall causes visible jitter. UDP lets us skip stale frames and keep the stream flowing." |
| "How do you handle packet loss on UDP?" | "Forward Error Correction sends redundant data so the receiver can reconstruct lost packets without retransmission. For gaming, we send the last 3 position states in every packet so the client can interpolate even with 5% loss." |
| "Why is HTTP/3 built on UDP?" | "To eliminate TCP's cross-stream head-of-line blocking. HTTP/2 multiplexes streams on one TCP connection, so one lost packet stalls all streams. QUIC gives each stream independent reliability. It uses UDP as a substrate because deploying a new transport protocol through middleboxes is impractical." |
| "Would you use TCP or UDP for DNS?" | "UDP for standard queries (fits in one datagram, no handshake overhead). TCP for zone transfers and DNSSEC responses that exceed 512 bytes. Most DNS traffic is UDP because the query/response pattern doesn't benefit from connection state." |
Quick Recap
- TCP provides reliable, ordered byte-stream delivery at the cost of handshake latency, retransmission delays, and head-of-line blocking. UDP delivers independent datagrams with no guarantees and near-zero overhead. Choose based on whether late data is preferable to missing data.
- Real-time video, audio, and games use UDP because retransmitting stale frames is worse than dropping them. Application-layer techniques (FEC, interpolation, jitter buffers) handle loss better than TCP's retransmit-and-wait approach.
- TCP is correct for APIs, databases, file transfers, and any workload requiring delivery guarantees. Connection reuse and pooling amortize the handshake cost to near zero for long-lived services.
- HTTP/2 over TCP suffers from cross-stream head-of-line blocking: one lost packet stalls all multiplexed streams. QUIC (HTTP/3) solves this by multiplexing independent streams over UDP with per-stream reliability.
- QUIC is replacing TCP for web-facing traffic because it combines UDP's lack of cross-stream HOL blocking with per-stream reliability, faster handshakes (0-RTT for repeat connections), and connection migration for mobile users.
- In interviews, demonstrate understanding of why each protocol makes a specific trade-off. Don't just say "UDP is faster." Explain head-of-line blocking, quantify the handshake cost, and know when QUIC is the real answer.
Related Trade-offs
- Networking fundamentals for the OSI model and how TCP/UDP fit into the transport layer
- How TCP congestion control works for slow start, AIMD, and BBR internals
- Sync vs. async for the application-layer equivalent of this blocking vs. non-blocking tension
- Polling vs. webhooks vs. SSE for real-time communication patterns built on top of TCP
- Push vs. pull for the data delivery model that sits above the transport protocol choice