TCP congestion control: slow start, AIMD, CUBIC, and BBR
How TCP prevents network collapse using congestion control algorithms. Slow start, congestion avoidance, AIMD, and why modern algorithms CUBIC and BBR have replaced the original Reno algorithm.
The problem
Your microservice sends a burst of 500 HTTP requests to a downstream service over a single TCP connection. The network link between them handles 100 Mbps, but a shared router along the path has a 64KB buffer. Every sender on the network pushes data at maximum rate. The router buffer fills, packets get dropped, senders retransmit, the retransmissions fill the buffer again, and now 80% of all traffic on the link is retransmissions. Useful throughput drops to near zero.
This is congestion collapse. It actually happened on the internet in October 1986. Throughput between Lawrence Berkeley Lab and UC Berkeley dropped by a factor of 100. The network had bandwidth, but every sender was shouting over every other sender.
Van Jacobson designed TCP congestion control in 1988. The core insight: TCP senders must reduce their sending rate when the network signals congestion (via packet loss or delay). Without this, every sender acts selfishly and the network becomes unusable for everyone.
What it is
TCP congestion control is a set of algorithms built into every TCP implementation that dynamically adjust the sending rate based on network feedback. The sender maintains a congestion window (cwnd) that limits how much unacknowledged data can be in flight at any time.
Think of it like cars merging onto a highway. If every car enters at full speed without checking traffic, the highway jams. Congestion control is the metering light at the on-ramp: it controls how fast new data enters the network based on how congested the road already is.
How it works
TCP's sending rate is governed by two windows:
effective window = min(rwnd, cwnd)
- rwnd (receiver window): flow control, how much the receiver's buffer can hold
- cwnd (congestion window): congestion control, how much the network path can handle
The sender can have at most min(rwnd, cwnd) unacknowledged bytes in flight at any moment. The receiver controls rwnd. The congestion control algorithm controls cwnd.
The congestion control state machine cycles through phases:
function on_ack_received(ack):
if cwnd < ssthresh:
// Slow Start: exponential growth
cwnd += MSS
else:
// Congestion Avoidance: linear growth
cwnd += MSS * (MSS / cwnd)
function on_loss_detected(type):
if type == TIMEOUT:
ssthresh = cwnd / 2
cwnd = 1 * MSS // reset to minimum
if type == TRIPLE_DUPLICATE_ACK:
ssthresh = cwnd / 2
cwnd = ssthresh + 3 * MSS // fast recovery
Every TCP connection starts in slow start (exponential growth), transitions to congestion avoidance (linear growth) when cwnd reaches ssthresh, and cuts cwnd when loss is detected. The three phases, slow start, congestion avoidance, and loss recovery, repeat for the lifetime of the connection.
Slow start
When a new TCP connection opens (or after a timeout), cwnd starts small (typically 10 MSS, about 14KB on most systems). On each ACK received, cwnd increases by 1 MSS. Since one RTT produces ACKs for all in-flight segments, this doubles cwnd every round-trip:
- Round 1: cwnd = 10, send 10 segments, 10 ACKs arrive, cwnd = 20
- Round 2: cwnd = 20, send 20, 20 ACKs, cwnd = 40
- Round 3: cwnd = 40, send 40, 40 ACKs, cwnd = 80
The name "slow start" is misleading. It is slower than immediately blasting at full rate, but exponential growth reaches link capacity within a few RTTs. On a 100 Mbps link with 50ms RTT, slow start fills the pipe in roughly 4 round trips starting from 10 MSS.
Slow start ends when:
- cwnd reaches ssthresh (slow start threshold), which triggers a switch to congestion avoidance
- Packet loss is detected via timeout, which resets cwnd to 1 MSS and halves ssthresh
- Three duplicate ACKs arrive, which triggers fast retransmit and fast recovery
Interview tip: slow start cold penalty
Slow start is why HTTP/1.1 was slow for small responses over new connections. A fresh TCP connection starts at ~14KB cwnd. A 200KB page takes several RTTs to fully transfer. Connection reuse (HTTP keep-alive) and HTTP/2 multiplexing exist specifically to amortize this cold-start penalty.
Congestion avoidance (AIMD)
Once cwnd reaches ssthresh, TCP switches from exponential to linear growth. Instead of adding 1 MSS per ACK, it adds a fraction: MSS * (MSS / cwnd) per ACK, which works out to roughly 1 MSS increase per full RTT.
AIMD stands for Additive Increase, Multiplicative Decrease:
- Additive Increase: grow cwnd by 1 MSS per RTT (probe for more bandwidth slowly)
- Multiplicative Decrease: on packet loss, cut cwnd in half (back off sharply)
No loss: cwnd += 1 MSS per RTT (gentle linear increase)
Loss detected: cwnd = cwnd / 2 (sharp multiplicative decrease)
This creates a sawtooth pattern: cwnd slowly rises, hits a loss, drops by half, rises again. The sawtooth is intentional. AIMD is mathematically proven to converge to fair sharing: if two AIMD flows share a bottleneck link, they converge to equal bandwidth over time regardless of starting conditions.
The problem with AIMD is efficiency on high-bandwidth, high-latency links. A 10 Gbps link with 100ms RTT has a bandwidth-delay product of 125MB. After a loss event, AIMD halves cwnd to 62.5MB and needs thousands of RTTs to climb back. I've seen this waste 30-40% of available bandwidth on trans-oceanic links.
Fast retransmit and fast recovery
TCP Reno introduced two optimizations that avoid the full cwnd reset on every loss event.
Fast Retransmit: When the sender receives 3 duplicate ACKs for the same sequence number, it retransmits the missing segment immediately without waiting for the retransmission timeout (RTO). Three duplicate ACKs mean the network is still delivering packets (just one was lost), so the situation is less severe than a timeout.
Fast Recovery: After fast retransmit, instead of resetting cwnd to 1 MSS (as with a timeout), the sender sets ssthresh = cwnd / 2 and cwnd = ssthresh + 3 * MSS. It then continues in congestion avoidance mode. The connection never re-enters slow start.
function on_triple_duplicate_ack(ack):
// Fast Retransmit
retransmit(ack.sequence_number + 1)
// Fast Recovery
ssthresh = cwnd / 2
cwnd = ssthresh + 3 * MSS // +3 for the 3 dup ACKs
// Stay in congestion avoidance (no slow start reset)
// Each additional dup ACK: cwnd += 1 MSS
// When new ACK arrives: cwnd = ssthresh (deflate)
Fast recovery is why modern TCP handles isolated packet losses gracefully. Only timeouts (which indicate severe congestion or path failure) cause the full cwnd reset. I think of duplicate ACKs as "the network is congested but still working" and timeouts as "the network is broken."
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.