DNS internals
Learn how DNS resolves a hostname end to end, what recursive vs. iterative resolution means, and why TTL tuning during deployments is as important as the deployment itself.
The problem
You finish a zero-downtime deployment. The new application servers are running, the load balancer is routing traffic to them, and your old EC2 instances are shut down. Five minutes later, 30% of users are still hitting the old, terminated IP address. Their requests are timing out, and you are getting paged from half your regions.
The deployment worked. The DNS did not. Specifically, your DNS TTL was set to 86,400 (24 hours). Resolvers cached the old IP and will not check for an update until their cached record expires. The users hitting the old IP are using resolvers that cached the old record this morning.
DNS propagation delay is not magic or randomness. It is arithmetic: every resolver may hold your record until the TTL it received runs out, and resolvers all fetched the record at different times throughout the day. Understanding DNS end to end turns a mysterious deployment failure into a predictable problem you can prevent by lowering your TTL before a planned IP change.
What DNS is
The Domain Name System (DNS) is a globally distributed, hierarchically delegated database that maps human-readable names (like api.example.com) to machine-usable values (like 93.184.216.34). It is not a single server. It is a tree of authorities, each responsible for a portion of the namespace, coordinated through delegations.
Think of it like a nested directory of phone books. There is a global directory (root) that tells you which regional directory handles .com. The .com directory tells you which business directory handles example.com. The example.com directory tells you the actual phone number for api. No single book contains everything; each book tells you who to ask next.
The DNS namespace is a tree. Delegation flows downward from root to TLD to zone: . (root) -> com -> example.com -> api.example.com
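To make the tree concrete, here is a minimal Python sketch that lists the suffixes a resolver could be referred through for a given name. The function name `delegation_chain` is made up for illustration, and the sketch assumes every label boundary is a possible delegation point; in a real deployment, several labels can live inside a single zone.

```python
def delegation_chain(hostname: str) -> list[str]:
    """Return the name suffixes from the root down to the full name."""
    labels = hostname.rstrip(".").split(".")
    chain = ["."]  # the root zone
    # Walk the labels right to left: TLD first, full name last.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("api.example.com"))
# ['.', 'com.', 'example.com.', 'api.example.com.']
```

The trailing dot on each entry is the fully qualified form: it marks the root explicitly, which is where every lookup conceptually begins.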
How DNS resolution works
Resolving api.example.com from a fresh cache (no information cached anywhere) walks a chain of four servers: the recursive resolver, a root nameserver, the .com TLD nameserver, and the authoritative nameserver. The steps below show recursive resolution, where the recursive resolver does all the work on behalf of the client.
Step by step:
- Your application calls getaddrinfo("api.example.com"). The OS stub resolver checks its local cache. Cache miss.
- The stub resolver forwards the query to the configured recursive resolver (your ISP's resolver, or 8.8.8.8, or a private DNS server).
- The recursive resolver checks its cache. Cache miss. It must start from the top.
- The recursive resolver asks a root nameserver. Root servers do not know the answer but know which nameservers are authoritative for .com. Returns a referral.
- The recursive resolver asks the .com TLD nameserver. It knows which nameservers are authoritative for example.com. Another referral.
- The recursive resolver asks ns1.example.com (the authoritative nameserver). This server has the actual record. Returns the answer with a TTL.
- The recursive resolver caches the answer for the TTL duration and returns it to the stub resolver, which caches it and returns it to the application.
The entire chain for a cold cache typically takes 30-150 ms. Subsequent queries hit the recursive resolver's cache and return in under 1 ms.
// Pseudocode: recursive resolver algorithm
function resolve(name, type):
    cached = cache.get(name, type)
    if cached and not expired: return cached

    // Start from the closest zone we already have nameservers for
    best_known_ns = find_closest_cached_nameserver(name)
    // e.g. for api.example.com, we might have .com NS cached already

    while true:
        response = query(best_known_ns, name, type)

        if response.is_answer:
            cache.store(name, type, response.answer, ttl=response.ttl)
            return response.answer

        if response.is_referral:
            // Follow the referral: ask the next nameserver in the chain
            best_known_ns = response.referral_ns
            continue

        if response.is_nxdomain:
            // Negative answers are cached too, with their own TTL
            cache.store(name, type, NXDOMAIN, ttl=response.negative_ttl)
            return NXDOMAIN
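The pseudocode above can be exercised with a toy Python version. This is a sketch under loud assumptions: the `SERVERS` table, the server labels ("root", "tld-com"), and the `query` helper are invented in-memory stand-ins, not real DNS traffic. To stay short it always starts from the root instead of the closest cached nameserver, and it skips negative caching.

```python
# Toy in-memory stand-ins for nameservers (hypothetical data, not real DNS).
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"example.com.": "ns1.example.com"}},
    "ns1.example.com": {"answers": {"api.example.com.": ("93.184.216.34", 300)}},
}
MAX_HOPS = 10  # guard against referral loops


def query(server, name):
    """Ask one toy server; return (kind, value, ttl)."""
    zone_data = SERVERS[server]
    if name in zone_data.get("answers", {}):
        ip, ttl = zone_data["answers"][name]
        return "answer", ip, ttl
    for zone, next_server in zone_data.get("referrals", {}).items():
        if name == zone or name.endswith("." + zone):
            return "referral", next_server, None
    return "nxdomain", None, None


def resolve(name, cache, now):
    """Return (ip_or_None, list_of_servers_asked)."""
    cached = cache.get(name)
    if cached is not None and cached[1] > now:
        return cached[0], ["cache"]
    server, path = "root", []
    for _ in range(MAX_HOPS):
        path.append(server)
        kind, value, ttl = query(server, name)
        if kind == "answer":
            cache[name] = (value, now + ttl)  # remember until now + TTL
            return value, path
        if kind == "referral":
            server = value  # follow the referral one level down
            continue
        return None, path  # nxdomain
    raise RuntimeError("too many referrals")


cache = {}
print(resolve("api.example.com.", cache, now=0))    # walks root, TLD, authoritative
print(resolve("api.example.com.", cache, now=100))  # served from cache
```

The second call never leaves the cache, which is the behavior that makes warm lookups fast and stale records sticky.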
TTL and caching at every layer
TTL (Time To Live) is the number of seconds a resolver is allowed to cache a DNS record. Once the TTL expires, the resolver must re-query the authoritative nameserver for a fresh copy.
Every layer in the chain caches independently, and the TTL countdown starts from when each resolver fetched the record, not from when you published it.
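The point that each cache runs its own countdown can be sketched in a few lines of Python. The `TTLCache` class and the injectable `clock` are illustrative assumptions, not how any particular resolver is implemented, and the IP address is a documentation placeholder.

```python
import time


class TTLCache:
    """Minimal cache where each entry expires TTL seconds after it was stored."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def store(self, name, value, ttl):
        # The countdown starts at fetch time, not at publish time.
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: caller must re-query
            return None
        return value


# Two resolvers fetch the same record 200 seconds apart (fake clock).
now = [0]
resolver_a = TTLCache(clock=lambda: now[0])
resolver_b = TTLCache(clock=lambda: now[0])

resolver_a.store("api.example.com", "203.0.113.10", ttl=300)
now[0] = 200
resolver_b.store("api.example.com", "203.0.113.10", ttl=300)

now[0] = 310
print(resolver_a.get("api.example.com"))  # None: fetched at t=0, expired at t=300
print(resolver_b.get("api.example.com"))  # 203.0.113.10: fetched at t=200, valid until t=500
```

At t=310 the two resolvers disagree about whether the record is still valid, even though both saw the same TTL.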
| Layer | What it caches | Typical TTL | Notes |
|---|---|---|---|
| OS stub resolver | Query results | Typically 0-30 s | Many systems re-query on every process restart |
| Recursive resolver (ISP) | Full answers and referrals | As published in DNS | May enforce a minimum TTL floor (often 60 s) |
| Recursive resolver (public: 8.8.8.8) | Full answers | Honors TTL exactly | Google and Cloudflare honor low TTLs; many ISP resolvers enforce floors |
| Browser | A record results | Varies (10 s - 60 s) | Chrome and Firefox have their own DNS cache |
| Application | Results from getaddrinfo | Application-controlled | Many HTTP clients cache results for the connection lifetime |
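This is why DNS propagation is gradual rather than instant: every resolver refreshes independently when its cached copy expires. A TTL of 300 s means resolvers that honor the published TTL will have the new record within 5 minutes of the change; resolvers that enforce a higher TTL floor, browsers, and applications that cache for the connection lifetime can lag beyond that. A TTL of 86,400 s means some resolvers may serve the old record for up to 24 hours.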
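The arithmetic can be written down as a rough model. The sketch below assumes every resolver already had the record cached, that their fetch times were spread uniformly over one TTL window, and that all of them honor the TTL exactly. Real traffic will not match this precisely, and `stale_fraction` is a name chosen for illustration.

```python
def stale_fraction(ttl_s: float, elapsed_s: float) -> float:
    """Estimated share of resolvers still serving the old record.

    Assumes fetch times are uniformly spread over one TTL window.
    """
    if ttl_s <= 0:
        return 0.0
    return max(0.0, 1.0 - elapsed_s / ttl_s)


# Five minutes after the change:
print(stale_fraction(ttl_s=86_400, elapsed_s=300))  # about 0.9965
print(stale_fraction(ttl_s=300, elapsed_s=300))     # 0.0
```

Under these assumptions, five minutes after a change with a 24-hour TTL nearly every caching resolver still holds the old address, while with a 300 s TTL none do.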
Lowering TTL must happen before the planned IP change, not during it. If your TTL is 86,400 when you make the change, resolvers that cached the record two hours ago will hold it for another 22 hours regardless of your new TTL. Lower TTL to 300 s at least 24-48 hours before any planned IP rotation, wait one full old-TTL period, then make the change. This is one of the most common deployment mistakes I see in production post-mortems.
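The schedule described above reduces to two date calculations. The following Python sketch uses hypothetical function names and example timestamps, and it assumes resolvers honor the published TTL (no floors, no application-level caching).

```python
from datetime import datetime, timedelta


def earliest_safe_change(ttl_lowered_at: datetime, old_ttl_s: int) -> datetime:
    """After this moment, no TTL-honoring resolver still holds a copy with the old TTL."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_s)


def worst_case_stale_until(change_at: datetime, ttl_in_effect_s: int) -> datetime:
    """Latest moment a TTL-honoring resolver can still serve the old IP."""
    return change_at + timedelta(seconds=ttl_in_effect_s)


lowered = datetime(2024, 1, 1, 9, 0)  # example: TTL lowered from 86,400 s to 300 s
change = earliest_safe_change(lowered, old_ttl_s=86_400)
print(change)                                               # 2024-01-02 09:00:00
print(worst_case_stale_until(change, ttl_in_effect_s=300))  # 2024-01-02 09:05:00

# Skipping the wait: changing the IP while the 86,400 s TTL is still cached.
print(worst_case_stale_until(lowered, ttl_in_effect_s=86_400))  # 2024-01-02 09:00:00
```

Waiting one full old-TTL period shrinks the worst-case stale window from 24 hours to 5 minutes in this example.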
DNS record types
DNS is more than just A records. Each record type serves a specific purpose: