DNS internals
Learn how DNS resolves a hostname end to end, what recursive vs. iterative resolution means, and why TTL tuning during deployments is as important as the deployment itself.
The problem
You finish a zero-downtime deployment. The new application servers are running. The load balancer is updated and routing traffic to them. Your old EC2 instances are shut down. Five minutes later, 30% of users are still hitting the old, terminated IP address. Their requests are timing out. You get paged from half your regions.
The deployment worked. The DNS did not. Specifically, your DNS TTL was set to 86,400 seconds (24 hours). Resolvers cached the old IP and will not check for an update until their cached record expires. The users hitting the old IP are using resolvers that cached the old record this morning.
DNS propagation delay is not magic or randomness. It is arithmetic: every resolver holds your record for exactly its TTL, and they all refreshed at different times throughout the day. Understanding DNS end to end turns a mysterious deployment failure into a predictable problem you can prevent by lowering your TTL before a planned IP change.
What DNS is
The Domain Name System (DNS) is a globally distributed, hierarchically delegated database that maps human-readable names (like api.example.com) to machine-usable values (like 93.184.216.34). It is not a single server. It is a tree of authorities, each responsible for a portion of the namespace, coordinated through delegations.
Think of it like a nested directory of phone books. There is a global directory (root) that tells you which regional directory handles .com. The .com directory tells you which business directory handles example.com. The example.com directory tells you the actual phone number for api. No single book contains everything; each book tells you who to ask next.
The DNS namespace is a tree: delegation flows downward from the root, to the TLD, to your zone's authoritative nameservers, and finally to the individual record (root -> .com -> example.com -> api.example.com).
How DNS resolution works
Resolving api.example.com from a fresh cache (no information cached anywhere) walks a full chain of referrals. The sequence below shows recursive resolution, where the recursive resolver does all the work on behalf of the client.
Step by step:
- Your application calls getaddrinfo("api.example.com"). The OS stub resolver checks its local cache. Cache miss.
- The stub resolver forwards the query to the configured recursive resolver (your ISP's resolver, 8.8.8.8, or a private DNS server).
- The recursive resolver checks its cache. Cache miss. It must start from the top.
- The recursive resolver asks a root nameserver. Root servers do not know the answer but know which nameservers are authoritative for .com. Returns a referral.
- The recursive resolver asks the .com TLD nameserver. It knows which nameservers are authoritative for example.com. Another referral.
- The recursive resolver asks ns1.example.com (the authoritative nameserver). This server has the actual record. Returns the answer with a TTL.
- The recursive resolver caches the answer for the TTL duration and returns it to the stub resolver, which caches it and returns it to the application.
The entire chain for a cold cache typically takes 30-150 ms. Subsequent queries hit the recursive resolver's cache and return in under 1 ms.
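The first hop of that chain is visible from any program. Below is a minimal sketch using Python's standard socket module; the helper name lookup_ipv4 is illustrative. It exercises step 1: ask the OS stub resolver (and, behind it, the configured recursive resolver) for a hostname's IPv4 addresses.

```python
import socket

def lookup_ipv4(hostname):
    """Resolve a hostname to its IPv4 addresses via the OS stub resolver.

    This is the same getaddrinfo call an application makes; everything
    after it (recursive resolver, root, TLD, authoritative) is hidden
    behind this one function.
    """
    results = socket.getaddrinfo(hostname, None,
                                 family=socket.AF_INET,
                                 type=socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (ip, port) tuple.
    return sorted({sockaddr[0] for *_, sockaddr in results})

# localhost avoids a network round trip; a real hostname would
# trigger the full chain described above on a cold cache.
print(lookup_ipv4("localhost"))
```

Note that getaddrinfo hides all caching layers: it gives no access to the record's remaining TTL, which is one reason long-lived applications can keep using stale IPs.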
```
// Pseudocode: recursive resolver algorithm
function resolve(name, type):
    cached = cache.get(name, type)
    if cached and not expired: return cached

    // Start from the bottom of what we know
    // e.g. for api.example.com, we might have the .com NS cached already
    best_known_ns = find_closest_cached_nameserver(name)

    while true:
        response = query(best_known_ns, name, type)
        if response.is_answer:
            cache.store(name, type, response.answer, ttl=response.ttl)
            return response.answer
        if response.is_referral:
            // Follow the referral: ask the next nameserver in the chain
            best_known_ns = response.referral_ns
            continue
        if response.is_nxdomain:
            cache.store(name, NXDOMAIN, ttl=response.negative_ttl)
            return NXDOMAIN
```
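To make the control flow concrete, here is a runnable Python sketch of the same loop, with an in-memory table of mock nameserver responses standing in for real network queries. The server names, the IP, and the TTLs are illustrative, not real infrastructure.

```python
import time

# Mock responses keyed by (server, name). Each value is one of:
#   ("answer", ip, ttl)   -- authoritative answer
#   ("referral", server)  -- delegation to the next nameserver
NAMESERVERS = {
    ("root", "api.example.com"): ("referral", "tld-com"),
    ("tld-com", "api.example.com"): ("referral", "ns1.example.com"),
    ("ns1.example.com", "api.example.com"): ("answer", "93.184.216.34", 300),
}

query_count = 0  # counts simulated network round trips

def query(server, name):
    global query_count
    query_count += 1
    return NAMESERVERS.get((server, name), ("nxdomain", 900))

cache = {}  # name -> (value, expires_at)

def resolve(name, now=None):
    now = time.time() if now is None else now
    if name in cache:
        value, expires_at = cache[name]
        if now < expires_at:        # cache hit: zero network queries
            return value
    server = "root"                 # cold cache: start at the root
    while True:
        response = query(server, name)
        if response[0] == "answer":
            _, ip, ttl = response
            cache[name] = (ip, now + ttl)
            return ip
        if response[0] == "referral":
            server = response[1]    # follow the delegation downward
        elif response[0] == "nxdomain":
            _, neg_ttl = response
            cache[name] = (None, now + neg_ttl)  # negative caching
            return None

print(resolve("api.example.com"))   # walks root -> TLD -> authoritative
```

The first call costs three queries (root, TLD, authoritative); a second call within the 300 s TTL is served from the cache with no queries at all, which is exactly the cold-versus-warm latency gap described above.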
TTL and caching at every layer
TTL (Time To Live) is the number of seconds a resolver is allowed to cache a DNS record. Once the TTL expires, the resolver must re-query the authoritative nameserver for a fresh copy.
Every layer in the chain caches independently, and the TTL countdown starts from when each resolver fetched the record, not from when you published it.
| Layer | What it caches | Typical TTL | Notes |
|---|---|---|---|
| OS stub resolver | Query results | Typically 0-30 s | Many systems re-query on every process restart |
| Recursive resolver (ISP) | Full answers and referrals | As published in DNS | May enforce a minimum TTL floor (often 60 s) |
| Recursive resolver (public: 8.8.8.8) | Full answers | Honors TTL exactly | Google and Cloudflare honor low TTLs; many ISP resolvers enforce floors |
| Browser | A record results | Varies (10 s - 60 s) | Chrome and Firefox have their own DNS cache |
| Application | Results from getaddrinfo | Application-controlled | Many HTTP clients cache results for the connection lifetime |
This is why DNS propagation is gradual rather than instant: every resolver refreshes independently when its cached copy expires. A TTL of 300 s means all resolvers will have the new record within 5 minutes of the change. A TTL of 86,400 s means some resolvers may serve the old record for up to 24 hours.
Lowering TTL must happen before the planned IP change, not during it. If your TTL is 86,400 when you make the change, resolvers that cached the record two hours ago will hold it for another 22 hours regardless of your new TTL. Lower TTL to 300 s at least 24-48 hours before any planned IP rotation, wait one full old-TTL period, then make the change. This is one of the most common deployment mistakes I see in production post-mortems.
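The propagation arithmetic is simple enough to check directly. A small sketch using the article's numbers (the function name is illustrative):

```python
# A resolver that cached the record age_s seconds before your change
# keeps serving the old IP for (ttl_s - age_s) more seconds.
def stale_seconds_remaining(ttl_s, age_s):
    return max(ttl_s - age_s, 0)

# Worst case: the resolver refreshed an instant before the change,
# so it holds the old IP for one full TTL: here, 24 hours.
assert stale_seconds_remaining(86_400, 0) == 86_400

# A resolver that cached 22 hours ago is stale for 2 more hours,
# regardless of any lower TTL you published after it cached.
assert stale_seconds_remaining(86_400, 22 * 3600) == 7_200

# With TTL lowered to 300 s one old-TTL period in advance,
# the worst-case window shrinks to 5 minutes.
assert stale_seconds_remaining(300, 0) == 300
```

This is why the order of operations matters: the stale window is governed by the TTL resolvers saw when they cached, never by the TTL you publish at change time.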
DNS record types
DNS is more than just A records. Each record type serves a specific purpose:
| Record type | Purpose | Example |
|---|---|---|
| A | Maps hostname to IPv4 address | api.example.com -> 93.184.216.34 |
| AAAA | Maps hostname to IPv6 address | api.example.com -> 2606:2800:220:1::93c8:d823 |
| CNAME | Alias: maps hostname to another hostname | www.example.com -> example.com |
| MX | Mail exchanger: where to deliver email | example.com -> mail1.example.com (priority 10) |
| TXT | Arbitrary text: used for SPF, DKIM, verification tokens | example.com -> "v=spf1 include:_spf.google.com ~all" |
| NS | Delegations: which nameservers are authoritative | example.com -> ns1.example.com, ns2.example.com |
| SOA | Start of Authority: zone metadata and negative TTL | Zone serial, refresh intervals, negative caching TTL |
CNAMEs have an important constraint: they cannot be used at the zone apex (the root of a domain). You cannot set example.com itself to a CNAME. This is why many DNS providers offer ALIAS or ANAME records as a vendor extension that behaves like a CNAME at the apex but resolves to an A record at the edge.
Negative caching
When a DNS query returns NXDOMAIN (name does not exist) or NODATA (name exists but no records of the requested type), resolvers cache that negative result too. The TTL for negative caching comes from the SOA record's minimum field.
This matters for incident response: if you accidentally deleted a DNS record and a resolver received an NXDOMAIN response, it will cache that negative result for the negative TTL (often 300-900 s). Even after you restore the record, affected resolvers continue returning NXDOMAIN until their negative cache entry expires. There is no way to force a resolver to clear its cache from the outside.
Negative cache TTL is controlled by the SOA record's minimum field, not your A record's TTL. If your A record TTL is 60 s but your SOA minimum is 900 s, a briefly deleted record stays broken for 15 minutes even after it is restored. Audit your SOA negative TTL separately from your A record TTLs, especially in on-call runbooks for accidental deletion scenarios.
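A minimal Python simulation makes the failure mode concrete; the Resolver class, zone contents, and timestamps are illustrative, with an explicit clock so the timeline is deterministic. Restoring the record does nothing for a resolver that already cached the NXDOMAIN:

```python
class Resolver:
    """Toy resolver cache that stores both positive and negative answers."""

    def __init__(self):
        self.cache = {}  # name -> (ip_or_None, expires_at)

    def answer(self, name, zone, now):
        if name in self.cache:
            value, expires_at = self.cache[name]
            if now < expires_at:
                return value  # may be a cached NXDOMAIN (None)
        if name in zone:
            ip, ttl = zone[name]
            self.cache[name] = (ip, now + ttl)
            return ip
        # NXDOMAIN: cached for the SOA minimum, NOT the A record's TTL.
        self.cache[name] = (None, now + zone["SOA_MINIMUM"])
        return None

zone = {"SOA_MINIMUM": 900}                    # negative TTL: 15 minutes
r = Resolver()

# t=0: the A record has been accidentally deleted; resolver caches NXDOMAIN.
assert r.answer("api.example.com", zone, now=0) is None

# t=60: the record is restored with a 60 s TTL...
zone["api.example.com"] = ("93.184.216.34", 60)

# t=600: ...but the resolver still serves its cached NXDOMAIN.
assert r.answer("api.example.com", zone, now=600) is None

# t=901: only after the 900 s negative TTL expires does resolution recover.
assert r.answer("api.example.com", zone, now=901) == "93.184.216.34"
```

Note the asymmetry: the restored record's 60 s TTL is irrelevant to recovery time; only the SOA minimum that was in effect when the NXDOMAIN was cached matters.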
Anycast for root and TLD nameservers
There are 13 root nameserver addresses (a.root-servers.net through m.root-servers.net). But each address is served by hundreds of physical servers worldwide using anycast routing. Anycast means the same IP address is announced from multiple locations; routers send queries to whichever server is topologically closest.
This is why DNS root servers can handle billions of queries per day: each query goes to the nearest physical replica, not to 13 centralized machines.
Production usage
| System | Usage | Notable behavior |
|---|---|---|
| AWS Route 53 | Authoritative DNS for AWS-hosted services | Health checks can update DNS records automatically on failure; propagation is constrained by TTL |
| Cloudflare DNS | Both authoritative and recursive (1.1.1.1) | Honors very low TTLs (down to 1 s); provides DDoS protection via anycast |
| Consul | Internal service discovery DNS | Default TTL is 0 (no caching); intended for dynamic microservice environments with frequently changing IPs |
| CoreDNS | Kubernetes internal DNS | Resolves service names to ClusterIP; every pod uses it as its default resolver |
| nscd / dnsmasq | Host-level caching daemon | Adds a local cache to reduce upstream resolver load; can cause stale results if not tuned |
Limitations and when NOT to use it
- DNS is not a real-time failover mechanism unless TTLs are very short. With a 300 s TTL, any DNS-based failover takes up to 5 minutes to propagate to all resolvers. For sub-minute failover requirements, route at the load balancer or via health-check-based record updates, not TTL expiry.
- Lowering TTL must happen before the planned change, not during. If you lower TTL right when you are changing an IP, resolvers that cached the record 1 minute ago with the old high TTL will still hold it for the full original duration. Lower TTL at least one full old-TTL interval before any IP-changing deployment.
- Resolvers may not honor TTLs exactly. ISP resolvers often enforce a minimum TTL floor (30-60 s). Public resolvers (8.8.8.8, 1.1.1.1) generally honor low TTLs. You cannot guarantee sub-60-second propagation across all resolvers in the wild.
- CNAME chaining adds latency. Each CNAME hop may require an additional DNS lookup. Deeply nested CNAMEs (CNAME to CNAME to CNAME) add round-trip time on cache misses.
- Negative caching locks out deleted records. An accidentally deleted record causes NXDOMAIN responses that are cached for the negative TTL. Restoring the record does not help until those caches expire.
- Split-horizon DNS adds operational complexity. Running one DNS view for internal traffic and another for external traffic (split-horizon) requires careful zone management. Misconfiguration directs internal services to external IPs or vice versa.
Interview cheat sheet
- When asked why users still see old servers after a deployment: DNS TTL. Resolvers cache the old record until its TTL expires. If TTL was 86,400 s when you made the change, some resolvers will serve the old IP for up to 24 hours. The fix is to lower TTL before deployments, not after.
- When asked about DNS propagation time: It equals the TTL of the record at the time resolvers last cached it. There is no magic delay. Lower TTLs give faster propagation. An industry best practice is to lower TTL to 60-300 s at least one full old-TTL period before any IP-changing deployment.
- When asked how DNS resolution works: Stub resolver to recursive resolver, then recursive resolver queries root, TLD, authoritative in sequence. Each level returns either an answer or a referral. The recursive resolver caches answers at each level.
- When asked about Consul or CoreDNS in microservices: Internal service discovery DNS with very short TTLs (0-5 s) allows services to discover each other dynamically as IPs change with container scheduling. Standard DNS is too slow for this without near-zero TTLs.
- When asked about CNAME vs A record: A record resolves directly to an IP. CNAME points to another name which then resolves to an IP, requiring an extra lookup on cache miss. CNAMEs cannot be used at the zone apex (naked domain).
- When asked about the 13 root nameservers: There are 13 root nameserver addresses but hundreds of physical machines behind them via anycast routing. Each query goes to the nearest replica; they are not 13 centralized servers.
- When asked about negative caching: NXDOMAIN responses are cached just like positive records. The TTL comes from the SOA record's minimum field. A deleted record causes NXDOMAIN caching; restoring it does not instantly fix resolution for resolvers that already cached the NXDOMAIN.
- When asked about DNS-based load balancing: DNS can return multiple A records (round-robin) or health-check-weighted records (Route 53). It is not a real-time balancer since TTL delays visibility of changes. Use it for coarse-grained routing, not fine-grained traffic management.
Quick recap
- DNS is a globally distributed hierarchical database; resolving a name walks from the root through TLD to authoritative nameservers, with each resolver caching results for the record's TTL.
- TTL determines propagation time: the window for a change to reach all resolvers equals the TTL that was in place when they last cached the record, not the TTL you set at the moment of the change.
- To minimize DNS propagation delay for a planned IP change, lower the TTL at least one full old-TTL period before the change, then make the change.
- NXDOMAIN responses are cached as negative entries with a TTL from the SOA record; re-adding a deleted record does not immediately fix resolution for resolvers that cached the NXDOMAIN.
- CNAMEs cannot be used at the zone apex; use provider-specific ALIAS/ANAME records instead to get dynamic-IP behavior on the naked domain.
- For dynamic microservices requiring sub-second service discovery, DNS with any meaningful TTL is too slow; use a purpose-built service registry (Consul, Kubernetes Services) instead.
Related concepts
- Networking — DNS is the entry point to every networked request; understanding TCP, UDP, and anycast routing puts DNS resolution latency and reliability in context.
- CDN — CDNs rely heavily on DNS-based geo-routing and short TTLs to direct users to the nearest edge node; DNS propagation delays directly affect CDN failover behavior.
- Load balancing — DNS round-robin is the simplest form of load balancing, but it lacks health checking and consistent hashing; understanding its limitations explains why application-layer load balancers exist.