TCP congestion control: slow start, AIMD, CUBIC, and BBR
How TCP prevents network collapse using congestion control algorithms. Slow start, congestion avoidance, AIMD, and why modern algorithms CUBIC and BBR have replaced the original Reno algorithm.
The problem
Your microservice sends a burst of 500 HTTP requests to a downstream service over a single TCP connection. The network link between them handles 100 Mbps, but a shared router along the path has a 64KB buffer. Every sender on the network pushes data at maximum rate. The router buffer fills, packets get dropped, senders retransmit, the retransmissions fill the buffer again, and now 80% of all traffic on the link is retransmissions. Useful throughput drops to near zero.
This is congestion collapse. It actually happened on the internet in October 1986. Throughput between Lawrence Berkeley Lab and UC Berkeley dropped roughly a thousandfold, from 32 kbps to 40 bps. The network had bandwidth, but every sender was shouting over every other sender.
Van Jacobson designed TCP congestion control in 1988. The core insight: TCP senders must reduce their sending rate when the network signals congestion (via packet loss or delay). Without this, every sender acts selfishly and the network becomes unusable for everyone.
What it is
TCP congestion control is a set of algorithms built into every TCP implementation that dynamically adjust the sending rate based on network feedback. The sender maintains a congestion window (cwnd) that limits how much unacknowledged data can be in flight at any time.
Think of it like cars merging onto a highway. If every car enters at full speed without checking traffic, the highway jams. Congestion control is the metering light at the on-ramp: it controls how fast new data enters the network based on how congested the road already is.
How it works
TCP's sending rate is governed by two windows:
effective window = min(rwnd, cwnd)
- rwnd (receiver window): flow control, how much the receiver's buffer can hold
- cwnd (congestion window): congestion control, how much the network path can handle
The sender can have at most min(rwnd, cwnd) unacknowledged bytes in flight at any moment. The receiver controls rwnd. The congestion control algorithm controls cwnd.
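The check the sender performs can be sketched in a few lines. This is an illustration, not a real stack; the function name and window sizes are made up for the example:

```python
def can_send(bytes_in_flight: int, rwnd: int, cwnd: int) -> bool:
    """True if the sender may put another segment on the wire."""
    effective_window = min(rwnd, cwnd)
    return bytes_in_flight < effective_window

# A large rwnd doesn't help when cwnd is the bottleneck (fresh connection):
print(can_send(bytes_in_flight=30_000, rwnd=65_535, cwnd=14_600))  # False: cwnd-limited
print(can_send(bytes_in_flight=10_000, rwnd=65_535, cwnd=14_600))  # True
```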
The congestion control state machine cycles through phases:
function on_ack_received(ack):
    if cwnd < ssthresh:
        // Slow Start: exponential growth
        cwnd += MSS
    else:
        // Congestion Avoidance: linear growth
        cwnd += MSS * (MSS / cwnd)

function on_loss_detected(type):
    if type == TIMEOUT:
        ssthresh = cwnd / 2
        cwnd = 1 * MSS  // reset to minimum
    if type == TRIPLE_DUPLICATE_ACK:
        ssthresh = cwnd / 2
        cwnd = ssthresh + 3 * MSS  // fast recovery
Every TCP connection starts in slow start (exponential growth), transitions to congestion avoidance (linear growth) when cwnd reaches ssthresh, and cuts cwnd when loss is detected. The three phases, slow start, congestion avoidance, and loss recovery, repeat for the lifetime of the connection.
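The same state machine as a runnable Python sketch. The connection dict and constants are illustrative simplifications, not a real TCP implementation:

```python
MSS = 1460  # bytes per segment (typical Ethernet MSS)

def make_conn(init_cwnd=10 * MSS, ssthresh=float("inf")):
    """A fresh connection: 10-segment initial window, no threshold yet."""
    return {"cwnd": init_cwnd, "ssthresh": ssthresh}

def on_ack_received(conn):
    if conn["cwnd"] < conn["ssthresh"]:
        conn["cwnd"] += MSS                       # slow start: exponential growth
    else:
        conn["cwnd"] += MSS * MSS / conn["cwnd"]  # congestion avoidance: ~1 MSS per RTT

def on_loss_detected(conn, kind):
    conn["ssthresh"] = conn["cwnd"] / 2           # remember half the loss point
    if kind == "timeout":
        conn["cwnd"] = 1 * MSS                    # severe: restart slow start
    elif kind == "triple_dup_ack":
        conn["cwnd"] = conn["ssthresh"] + 3 * MSS  # fast recovery

conn = make_conn()
on_loss_detected(conn, "triple_dup_ack")
print(conn["cwnd"])  # 11680.0: half of 14600, plus 3 segments
```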
Slow start
When a new TCP connection opens, cwnd starts small (typically 10 MSS, about 14KB, per RFC 6928); after a timeout it restarts from 1 MSS. On each ACK received, cwnd increases by 1 MSS. Since one RTT produces ACKs for all in-flight segments, this doubles cwnd every round-trip:
- Round 1: cwnd = 10, send 10 segments, 10 ACKs arrive, cwnd = 20
- Round 2: cwnd = 20, send 20, 20 ACKs, cwnd = 40
- Round 3: cwnd = 40, send 40, 40 ACKs, cwnd = 80
The name "slow start" is misleading. It is slower than immediately blasting at full rate, but exponential growth reaches link capacity within a few RTTs. On a 100 Mbps link with 50ms RTT (a bandwidth-delay product of ~625KB, or about 428 segments), slow start fills the pipe in about six round trips starting from 10 MSS.
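A back-of-envelope calculation for this link, assuming a 1460-byte MSS:

```python
MSS = 1460
bdp_bytes = (100e6 / 8) * 0.050   # 100 Mbps x 50 ms = 625,000 bytes in flight
bdp_segments = bdp_bytes / MSS    # ~428 segments fill the pipe

rounds, cwnd = 0, 10              # IW10: 10-segment initial window
while cwnd < bdp_segments:
    cwnd *= 2                     # slow start doubles cwnd every RTT
    rounds += 1
print(rounds)  # 6 round trips to reach the BDP
```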
Slow start ends when:
- cwnd reaches ssthresh (slow start threshold), which triggers a switch to congestion avoidance
- Packet loss is detected via timeout, which resets cwnd to 1 MSS and halves ssthresh
- Three duplicate ACKs arrive, which triggers fast retransmit and fast recovery
Interview tip: slow start cold penalty
Slow start is why HTTP/1.1 was slow for small responses over new connections. A fresh TCP connection starts at ~14KB cwnd. A 200KB page takes several RTTs to fully transfer. Connection reuse (HTTP keep-alive) and HTTP/2 multiplexing exist specifically to amortize this cold-start penalty.
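A rough sketch of that cold-start cost: counting RTTs for a lossless transfer of a 200KB page from a fresh 10-segment window (idealized, ignoring the handshake and ACK pacing within a round):

```python
MSS = 1460
page = 200 * 1024                 # 200 KB response
cwnd, sent, rtts = 10 * MSS, 0, 0
while sent < page:
    sent += cwnd                  # send a full window each round trip
    cwnd *= 2                     # slow start doubles the window per RTT
    rtts += 1
print(rtts)  # 4 RTTs before the last byte is sent
```

On a warm, pooled connection with a fully grown cwnd, the same page could go out in a single round trip.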
Congestion avoidance (AIMD)
Once cwnd reaches ssthresh, TCP switches from exponential to linear growth. Instead of adding 1 MSS per ACK, it adds a fraction: MSS * (MSS / cwnd) per ACK, which works out to roughly 1 MSS increase per full RTT.
AIMD stands for Additive Increase, Multiplicative Decrease:
- Additive Increase: grow cwnd by 1 MSS per RTT (probe for more bandwidth slowly)
- Multiplicative Decrease: on packet loss, cut cwnd in half (back off sharply)
No loss: cwnd += 1 MSS per RTT (gentle linear increase)
Loss detected: cwnd = cwnd / 2 (sharp multiplicative decrease)
This creates a sawtooth pattern: cwnd slowly rises, hits a loss, drops by half, rises again. The sawtooth is intentional. AIMD is mathematically proven to converge to fair sharing: if two AIMD flows share a bottleneck link, they converge to equal bandwidth over time regardless of starting conditions.
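A toy simulation shows the convergence. The model here (both flows see loss in the same round when the shared bottleneck overflows) is a deliberate simplification, but it captures why the difference between flows shrinks: additive increase preserves the gap, multiplicative decrease halves it:

```python
capacity = 100                    # bottleneck capacity, in segments
a, b = 90.0, 10.0                 # two flows, wildly unfair starting windows
for _ in range(1000):             # ~1000 RTTs
    a += 1.0                      # additive increase: +1 segment per RTT each
    b += 1.0
    if a + b > capacity:          # bottleneck overflows: both flows see loss
        a /= 2                    # multiplicative decrease halves the gap
        b /= 2
print(round(a - b, 6))  # 0.0: the flows have converged to an even split
```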
The problem with AIMD is efficiency on high-bandwidth, high-latency links. A 10 Gbps link with 100ms RTT has a bandwidth-delay product of 125MB. After a loss event, AIMD halves cwnd to 62.5MB and needs thousands of RTTs to climb back. I've seen this waste 30-40% of available bandwidth on trans-oceanic links.
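The arithmetic behind those numbers, assuming a 1460-byte MSS:

```python
MSS = 1460
bdp = (10e9 / 8) * 0.100            # 10 Gbps x 100 ms = 125 MB in flight
cwnd_after_loss = bdp / 2           # multiplicative decrease: 62.5 MB
deficit_segments = (bdp - cwnd_after_loss) / MSS
recovery_rtts = deficit_segments    # additive increase: +1 MSS per RTT
print(int(recovery_rtts))                  # 42808 RTTs to climb back
print(round(recovery_rtts * 0.100 / 60))   # ~71 minutes at 100 ms per RTT
```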
Fast retransmit and fast recovery
TCP Reno introduced two optimizations that avoid the full cwnd reset on every loss event.
Fast Retransmit: When the sender receives 3 duplicate ACKs for the same sequence number, it retransmits the missing segment immediately without waiting for the retransmission timeout (RTO). Three duplicate ACKs mean the network is still delivering packets (just one was lost), so the situation is less severe than a timeout.
Fast Recovery: After fast retransmit, instead of resetting cwnd to 1 MSS (as with a timeout), the sender sets ssthresh = cwnd / 2 and cwnd = ssthresh + 3 * MSS. It then continues in congestion avoidance mode. The connection never re-enters slow start.
function on_triple_duplicate_ack(ack):
    // Fast Retransmit: the dup ACKs' ack number is the first missing byte
    retransmit(ack.ack_number)
    // Fast Recovery
    ssthresh = cwnd / 2
    cwnd = ssthresh + 3 * MSS  // +3 for the 3 dup ACKs
    // Stay in congestion avoidance (no slow start reset)
    // Each additional dup ACK: cwnd += 1 MSS
    // When new ACK arrives: cwnd = ssthresh (deflate)
Fast recovery is why modern TCP handles isolated packet losses gracefully. Only timeouts (which indicate severe congestion or path failure) cause the full cwnd reset. I think of duplicate ACKs as "the network is congested but still working" and timeouts as "the network is broken."
Fast recovery in TCP Reno handles only one lost packet per RTT well. If multiple packets are lost in the same window, Reno often falls back to a timeout anyway. TCP NewReno and SACK (Selective Acknowledgement) fix this by tracking exactly which segments are missing.
CUBIC
CUBIC replaced Reno as the Linux default in 2006 and is now the most widely deployed congestion control algorithm on the internet. It modifies the congestion avoidance curve to recover bandwidth faster on high-BDP (bandwidth-delay product) links.
Instead of linear AIMD growth, CUBIC uses a cubic polynomial relative to time since the last congestion event:
W(t) = C * (t - K)^3 + W_max

where:
    W_max = cwnd at which the last loss occurred
    K = cubic_root(W_max * (1 - beta) / C), the time to climb back to W_max
    C = scaling constant (0.4 by default)
    beta = multiplicative decrease factor (0.7 in CUBIC vs 0.5 in Reno)
    t = time since last congestion event
The cubic shape is the key insight. Near W_max (the last known good cwnd), the curve is nearly flat, probing slowly to avoid triggering another loss at the same point. Far from W_max (either below or above), the curve steepens, growing aggressively to find new capacity.
CUBIC also reduces cwnd by only 30% on loss (beta = 0.7) compared to Reno's 50%. This means faster recovery after congestion events. On a 10 Gbps trans-Pacific link with 150ms RTT, CUBIC reaches full utilization in seconds where Reno would take minutes.
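The curve is easy to sanity-check with the RFC 8312 defaults. Units here are segments and seconds; a real implementation handles scaling more carefully:

```python
C, beta = 0.4, 0.7                             # RFC 8312 defaults

def cubic_w(t, w_max):
    """CUBIC window at time t since the last loss, in segments."""
    K = (w_max * (1 - beta) / C) ** (1 / 3)    # time to return to w_max
    return C * (t - K) ** 3 + w_max

w_max = 1000.0                                 # cwnd at the last loss event
print(round(cubic_w(0, w_max)))                # 700: starts at beta * W_max
K = (w_max * (1 - beta) / C) ** (1 / 3)
print(round(K, 2))                             # 9.09 seconds to plateau at W_max
print(round(cubic_w(K, w_max)))                # 1000: flat near the old loss point
```

Evaluating near t = K shows the "cautious plateau": the window barely moves around the last-known-good point, then steepens on either side.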
BBR (Bottleneck Bandwidth and Round-trip propagation time)
BBR, developed by Google in 2016, represents a fundamental shift in congestion control philosophy. Loss-based algorithms (Reno, CUBIC) treat packet loss as the congestion signal: when a packet is dropped, back off. BBR instead builds a model of the network path and sends at the rate the path can actually handle.
BBR continuously estimates two quantities:
- BtlBw (Bottleneck Bandwidth): the maximum delivery rate observed over a recent time window
- RTprop (Round-trip propagation time): the minimum RTT observed, representing the path's physical propagation delay without queuing
optimal_sending_rate = BtlBw
optimal_cwnd = BtlBw × RTprop // bandwidth-delay product
BBR cycles through four states. STARTUP is like slow start: it doubles the sending rate each RTT to quickly find BtlBw. DRAIN reduces the sending rate to clear any queue buildup from the startup phase. PROBE_BW is the steady state where BBR cycles its pacing gain between 1.25, 0.75, and 1.0 to periodically probe for more bandwidth. PROBE_RTT fires every 10 seconds: BBR reduces cwnd to 4 packets to drain queues and measure the true propagation delay.
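A sketch of what the model computes for an assumed path (100 Mbps bottleneck, 50ms min RTT). The constants are BBR v1's published values; this is an illustration, not Google's implementation:

```python
btlbw = 100e6 / 8          # estimated bottleneck bandwidth, bytes/s
rtprop = 0.050             # estimated minimum RTT, seconds

bdp = btlbw * rtprop       # bytes that "fit" in the pipe with no queue
cwnd_gain = 2.0            # BBR v1 caps in-flight data at ~2x BDP
cwnd = cwnd_gain * bdp

# PROBE_BW cycles the pacing gain: probe up, drain, then cruise at 1.0.
probe_bw_gains = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
pacing_rates = [g * btlbw for g in probe_bw_gains]
print(int(bdp), int(cwnd))  # 625000 1250000
```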
Why BBR matters for system design: On high-latency links (satellite, cross-continental), CUBIC halves cwnd on every loss and takes many RTTs to recover. BBR keeps sending at the bottleneck rate regardless of isolated losses. Google reported 2-25% throughput improvement on YouTube after deploying BBR. In my experience, BBR makes the biggest difference on lossy wireless links where random packet loss is not congestion.
Limitation: BBR can be unfair to CUBIC flows on shared links. BBR continues sending near bottleneck rate while CUBIC backs off on loss, so BBR captures a disproportionate share of bandwidth. BBRv2 addresses some of these fairness concerns.
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| Linux kernel | CUBIC is the default congestion control since 2006. BBR is available via sysctl net.ipv4.tcp_congestion_control=bbr. | Google runs BBR on all its edge servers. Most cloud VMs ship with CUBIC. Switching to BBR on high-latency paths often yields 10-30% throughput gains with no application changes. |
| Google (YouTube, Cloud) | BBR deployed across all Google edge infrastructure since 2016. | Reported 2-25% throughput improvement on YouTube. BBR is critical for Google's global backbone where cross-continent RTTs exceed 100ms. |
| Cloudflare CDN | Uses BBR on edge servers serving content to end users. | Improved throughput for users on lossy mobile and satellite links where random loss is not congestion. Cloudflare reported reduced retransmission rates. |
| AWS / Azure | Default is CUBIC. Some services use BBR internally. | EC2 instances default to CUBIC. AWS Global Accelerator uses custom congestion control optimized for their backbone. |
| QUIC (HTTP/3) | Implements congestion control at the application layer (not kernel). Typically uses CUBIC or BBR. | QUIC can switch congestion control algorithms per-connection without kernel changes. Google's QUIC implementation defaults to BBR. |
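On Linux, the active algorithm can be inspected and switched at runtime with sysctl (root required; BBR is commonly paired with the fq qdisc, which provides the packet pacing it relies on, especially on older kernels):

```shell
# Show the current and available congestion control algorithms
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control

# Switch to BBR, with fq as the default packet scheduler for pacing
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
```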
Interview tip: congestion control and system design
When discussing latency-sensitive APIs in an interview, mention that new TCP connections pay a slow-start penalty (starting at ~14KB cwnd). This is why connection pooling, HTTP/2 multiplexing, and gRPC long-lived connections matter. Saying "we reuse connections to avoid slow start" shows you understand the transport layer, not just the application layer.
Limitations and when NOT to use it
- Loss-based algorithms (CUBIC) underperform on lossy wireless links. Random packet loss from radio interference is not congestion, but CUBIC treats it as such and halves cwnd unnecessarily. BBR handles this better but is not universally deployed.
- Slow start penalizes short-lived connections. A TCP connection that transfers 50KB and closes never leaves slow start. For request-response patterns with small payloads, the slow start phase dominates total transfer time. This is why HTTP/2 and connection pooling exist.
- BBR fairness problems are real. On a shared bottleneck, BBR flows take bandwidth from CUBIC flows because BBR ignores loss signals that CUBIC respects. Running both algorithms on the same link creates unfair bandwidth distribution.
- TCP congestion control cannot solve application-level backpressure. If your application produces data faster than the consumer reads it, TCP flow control (rwnd) handles receiver-side backpressure, but congestion control only addresses network-path congestion. You still need application-level rate limiting.
- Bufferbloat undermines loss-based algorithms. If routers have excessively large buffers, packets queue for hundreds of milliseconds before being dropped. CUBIC sees no loss signal and keeps increasing cwnd, causing high latency without the loss signal it needs to back off. BBR and AQM (Active Queue Management) address this.
- Head-of-line blocking at the TCP layer persists. Even with HTTP/2 multiplexing, a single lost TCP segment blocks all streams on that connection until retransmitted. This is a fundamental TCP limitation that QUIC/HTTP/3 solves by making stream delivery independent: a lost packet delays only the streams whose data it carried.
Interview cheat sheet
- When asked about TCP performance: State that every new TCP connection starts in slow start with cwnd of ~14KB. It takes several RTTs of exponential growth to reach link capacity. This is the transport-layer reason for connection pooling and HTTP/2.
- When comparing Reno vs CUBIC vs BBR: Reno uses linear AIMD (halve on loss, +1 per RTT). CUBIC uses a cubic function centered on the last-known-good cwnd. BBR models the bottleneck bandwidth and minimum RTT instead of reacting to loss. Each generation solves the previous one's weakness on high-BDP links.
- When discussing fairness: AIMD provably converges to fair bandwidth sharing between competing flows. CUBIC maintains this property. BBR breaks fairness with CUBIC flows because it ignores loss signals, which is an active area of research (BBRv2).
- When asked about distributed system timeouts: TCP's retransmission timeout (RTO) starts at 1 second and adapts to the measured RTT (Linux enforces a 200ms floor). After a timeout, cwnd resets to 1 MSS and slow start restarts. This is why a single network hiccup can cause a latency spike followed by slow recovery.
- When asked about CDN performance: CDNs place servers close to users to minimize RTT. Lower RTT means slow start completes faster (fewer round trips to fill the pipe), AIMD recovers faster (cwnd grows by 1 MSS per RTT), and BBR measures a smaller bandwidth-delay product.
- When discussing gRPC or HTTP/2: One key advantage is amortizing TCP slow start across many requests on a single long-lived connection. Multiplexing avoids opening new connections (and new slow starts) for each request.
- When asked "what happens when a packet is lost": Distinguish between 3 duplicate ACKs (fast retransmit, cwnd halved, fast recovery) and timeout (cwnd reset to 1 MSS, full slow start restart). The first is minor, the second is catastrophic for throughput.
- When discussing bufferbloat or latency: Loss-based congestion control fills router buffers before detecting congestion. This causes latency spikes even when throughput is fine. BBR and Active Queue Management (CoDel, PIE) address this by signaling congestion before buffers are full.
Quick recap
- TCP congestion control prevents network collapse by dynamically adjusting the sending rate in response to network feedback, using the congestion window (cwnd) to limit in-flight data.
- Slow start grows cwnd exponentially (doubling per RTT) from an initial value of ~14KB until reaching ssthresh or detecting loss, probing network capacity quickly on new connections.
- AIMD (congestion avoidance) grows cwnd linearly (+1 MSS per RTT) and halves it on loss, creating a sawtooth pattern that provably converges to fair bandwidth sharing between competing flows.
- Fast retransmit detects loss via 3 duplicate ACKs and retransmits immediately, while fast recovery avoids resetting cwnd to 1 MSS, keeping the connection in congestion avoidance instead of restarting slow start.
- CUBIC uses a cubic polynomial centered on the last-known-good cwnd to grow aggressively far from the loss point and cautiously near it, making it far more efficient than Reno on high-bandwidth, high-latency links.
- BBR models the bottleneck bandwidth and minimum RTT to set the sending rate directly, avoiding the loss-based penalties of CUBIC on lossy wireless links and high-latency paths, though it introduces fairness concerns with CUBIC flows.
Related concepts
- HTTP keep-alive - Connection reuse amortizes TCP slow start across multiple requests, eliminating the per-request cold start penalty.
- Connection pooling - Connection pools maintain warm TCP connections with fully-grown cwnd, avoiding slow start entirely for pooled requests.
- Backpressure - TCP flow control (rwnd) is one form of backpressure, but application-level backpressure is needed when congestion control alone cannot prevent overload.
- Zero-copy I/O - Reduces CPU overhead when sending large payloads over TCP, allowing the congestion control algorithm to fully utilize available bandwidth.