Azure core services for system design interviews
A practitioner guide to Azure services that matter for system design: Cosmos DB, Azure Functions, Event Hubs, Service Bus, AKS, Blob Storage, and their AWS equivalents.
Azure is the second largest cloud provider and the dominant choice in enterprises running Microsoft workloads. In system design interviews, Azure shows up less frequently than AWS, but when it does, interviewers expect you to know the services deeply, not just recite names.
This guide covers every Azure service that matters for system design, organized by category. For each service I explain what problem it solves, how it maps to AWS and GCP equivalents, the production gotchas I have seen firsthand, and the specific talking points that impress interviewers.
How to use this guide
Do not memorize every service. Read the categories that match the system you are designing, understand the tradeoffs, and know when to reach for each tool. Interviewers care about your reasoning, not your ability to list SKUs.
1. Compute
Azure offers five main compute options. The right choice depends on your control requirements, scaling model, and how much operational overhead you want to absorb.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Virtual Machines | EC2 | Compute Engine | Full control, legacy apps |
| Azure Functions | Lambda | Cloud Functions | Event-driven, short-lived |
| Azure Container Apps | ECS on Fargate | Cloud Run | Containerized microservices |
| AKS | EKS | GKE | Complex orchestration needs |
| App Service | Elastic Beanstalk | App Engine | Web apps with minimal ops |
Virtual Machines
VMs are the escape hatch. When nothing else fits, when you need a specific OS kernel, a GPU, or a legacy application that cannot be containerized, you use VMs.
Key talking points:
- VM Scale Sets (VMSS) are the correct answer for auto-scaling VMs. Never propose standalone VMs in an interview.
- Availability Sets spread VMs across fault domains (physical racks) and update domains (rolling restart groups). This is how you survive hardware failures.
- Azure Spot VMs give you up to 90% discount but can be evicted with 30 seconds notice. Use them for batch processing, not serving traffic.
- Ultra Disks provide up to 160,000 IOPS and 2,000 MB/s throughput per disk. Use them for SAP HANA or SQL Server workloads.
The temp disk trap
Every Azure VM has a temporary disk (D: drive on Windows, typically /dev/sdb on Linux). It is fast local SSD storage. It is also ephemeral. I have seen teams store application data on the temp disk and lose everything during a routine maintenance event. The data does not survive VM reallocation.
Production gotcha: Azure VMs have a concept called "Planned Maintenance." Microsoft periodically updates the underlying host infrastructure. If you are not in an Availability Set or using Availability Zones, your single VM will reboot with only 15 minutes of notice. I have seen production outages from teams that ran a single VM without redundancy.
Azure Functions
Azure Functions is the serverless compute service. You write a function, define a trigger, and Azure handles scaling, infrastructure, and execution. The real differentiator from AWS Lambda is Durable Functions.
Key talking points:
- Consumption plan: Pay per execution. Scales to zero. Cold starts of 1-3 seconds for .NET, 3-10 seconds for Java. Execution timeout defaults to 5 minutes and can be raised to 10 minutes; Premium removes the cap.
- Premium plan: Pre-warmed instances eliminate cold starts. VNet integration. Unlimited execution duration. This is what you use in production.
- Durable Functions: This is Azure's killer feature. Orchestrate long-running workflows with code, not YAML. Supports fan-out/fan-in, human interaction patterns, and eternal monitoring loops.
- Input/output bindings let you declaratively connect to 20+ Azure services without writing any SDK code.
Durable Functions are the interview differentiator
When an interviewer asks "how would you orchestrate a multi-step workflow," most candidates say "use a queue and state machine." The better answer on Azure is Durable Functions. You write the orchestration as a single function that calls activity functions. The runtime handles checkpointing, replay, and failure recovery automatically.
Durable Functions patterns worth knowing:
- Fan-out/fan-in: Dispatch N parallel tasks, wait for all to complete, aggregate results. Example: process 1000 images in parallel and compile a report.
- Human interaction: Start a workflow, wait for human approval (with timeout), then continue. Example: expense approval that escalates after 48 hours.
- Monitor pattern: Periodically poll an external system until a condition is met. Example: check if a deployment is healthy every 30 seconds for 10 minutes.
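The fan-out/fan-in pattern above can be sketched with plain asyncio. This is a conceptual stand-in, not the Durable Functions API: in a real orchestrator you would call activity functions through the durable context (and get checkpointing and replay for free), while here `process_image` is a hypothetical activity and the parallelism is simulated locally.

```python
import asyncio

async def process_image(image_id: int) -> dict:
    # Stand-in for an activity function; a real Durable Functions
    # orchestrator would invoke this via the durable context instead.
    await asyncio.sleep(0)  # simulate I/O work
    return {"image_id": image_id, "status": "processed"}

async def orchestrate(image_ids: list[int]) -> dict:
    # Fan-out: dispatch all activities in parallel.
    tasks = [process_image(i) for i in image_ids]
    # Fan-in: wait for every activity to complete.
    results = await asyncio.gather(*tasks)
    # Aggregate: compile the final report.
    return {"processed": len(results), "ids": [r["image_id"] for r in results]}

report = asyncio.run(orchestrate(list(range(10))))
print(report["processed"])  # 10
```

The value of the real runtime is everything this sketch omits: if the process dies mid-orchestration, Durable Functions replays from the last checkpoint instead of starting over.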
Production gotcha: Azure Functions on the Consumption plan share a pool of workers. During traffic spikes, new instances take 1-10 seconds to cold start. I have seen APIs return 504 timeouts because the function app could not scale fast enough during Black Friday traffic. The fix is the Premium plan with pre-warmed instances (minimum 1 instance always ready).
Azure Container Apps
Container Apps is Azure's answer to "I want to run containers without managing Kubernetes." It is built on top of Kubernetes and Envoy, but you never touch kubectl.
Key talking points:
- Scale to zero (and from zero). You pay nothing when idle. Scale based on HTTP requests, queue depth, CPU, or any KEDA-supported scaler.
- Built-in traffic splitting for blue/green and canary deployments via revisions.
- Dapr integration provides service-to-service invocation, pub/sub, and state management without vendor lock-in.
- No cluster management, no node pools, no kubectl. This is the right default for 80% of containerized workloads.
Container Apps vs AKS decision
Use Container Apps when you need simple container hosting with auto-scaling. Use AKS when you need custom operators, specific Kubernetes features (StatefulSets, DaemonSets), or multi-cluster federation. Container Apps covers 80% of use cases with 20% of the operational complexity.
AKS (Azure Kubernetes Service)
AKS is managed Kubernetes. Azure manages the control plane (free), you manage and pay for the worker nodes.
Key talking points:
- Virtual Nodes: Burst overflow to Azure Container Instances. When your node pool is full, pods run on ACI with no VM provisioning delay. This is Azure's unique scaling story.
- KEDA (Kubernetes Event-Driven Autoscaling): Scale pods based on event sources (queue length, Event Hub lag, Prometheus metrics). This is the correct answer for event-driven workloads on AKS.
- Entra ID integration: Kubernetes RBAC backed by Microsoft Entra ID (formerly Azure Active Directory). Pod Identity (now Workload Identity) eliminates the need for secrets in pods.
- The control plane is free. You pay only for worker nodes. This makes AKS cheaper than EKS ($0.10/hour for the EKS control plane adds up).
Production gotcha: AKS upgrades are the #1 source of production incidents I have seen. Kubernetes minor version upgrades (1.27 to 1.28) can break workloads due to API deprecations. Always test upgrades in a staging cluster first. Use the AKS maintenance window feature to schedule upgrades during low-traffic periods.
Do not run stateful workloads on Kubernetes unless you really need to
I have seen teams run Postgres and Elasticsearch on AKS to "save costs." The operational overhead of managing persistent volumes, backup strategies, and node drains for stateful pods far exceeds the cost of managed services like Azure Database for PostgreSQL or Azure Cognitive Search. Use managed services for stateful workloads. Use AKS for stateless services.
App Service
App Service is the simplest way to host web applications on Azure. Deploy your code or container, get a URL.
Key talking points:
- Supports .NET, Java, Node.js, Python, PHP, and custom containers.
- Built-in autoscaling, custom domains, TLS certificates, and deployment slots.
- Deployment slots enable zero-downtime deployments by swapping pre-warmed staging slots into production.
- App Service Environment (ASE) provides network isolation for compliance-sensitive workloads, but costs $1,000+/month minimum.
Deployment slots are free on Standard tier and above
Use them for every production deployment. Deploy to the staging slot, verify it works, then swap. The swap is atomic and instant. If something goes wrong, swap back. This is the simplest zero-downtime deployment strategy on any cloud.
When to mention in interviews: App Service is the right answer for "deploy a web API with minimal operational overhead." It is not the right answer for event-driven architectures, complex microservices, or workloads that need to scale to zero.
2. Storage
Azure storage is built on a single foundation called Azure Storage Accounts. Understanding storage accounts is essential because Blob, Files, Queue, and Table storage all live inside them.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Blob Storage | S3 | Cloud Storage | Unstructured data, media, backups |
| Azure Files | EFS | Filestore | Shared file systems, lift-and-shift |
| Managed Disks | EBS | Persistent Disk | VM-attached block storage |
| Data Lake Storage Gen2 | S3 + Lake Formation | Cloud Storage | Analytics, big data |
Blob Storage
Blob Storage is Azure's object store. It stores unstructured data: images, videos, documents, backups, logs. It is the foundation for most Azure architectures.
Key talking points:
- Four access tiers with dramatically different costs. Hot tier is $0.018/GB/month. Archive is $0.00099/GB/month (roughly 18x cheaper) but takes 1-15 hours to read.
- Lifecycle management policies automatically move blobs between tiers based on last access time or creation date. This is the answer to "how do you manage storage costs."
- Blob Storage supports up to 5 PB per storage account (raised from the older 500 TB limit) with 20,000 requests per second per account.
- Shared Access Signatures (SAS tokens) provide time-limited, scoped access to specific blobs without sharing account keys.
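Lifecycle management policies are defined as JSON rules on the storage account. Here is a sketch of one rule that tiers blobs under a logs/ prefix to Cool at 30 days, Archive at 90, and deletes at 365; the field names follow the lifecycle policy schema as I recall it, so verify against the current API version before deploying.

```python
import json

# One lifecycle rule: Cool at 30 days, Archive at 90, delete at 365.
# Field names assumed from the lifecycle policy JSON schema; verify them.
policy = {
    "rules": [{
        "name": "tier-down-logs",
        "enabled": True,
        "type": "Lifecycle",
        "definition": {
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": 30},
                    "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                    "delete": {"daysAfterModificationGreaterThan": 365},
                }
            },
        },
    }]
}

payload = json.dumps(policy)  # submit via CLI, ARM/Bicep, or the SDK
```

One rule like this is usually the entire answer to "how do you manage storage costs" in an interview.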
The storage account IOPS limit is the hidden bottleneck
A single storage account supports 20,000 requests per second. If you have a high-throughput application writing millions of small blobs, you will hit this limit. The fix is to shard across multiple storage accounts or use premium storage accounts (which support 100,000+ IOPS).
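Sharding across accounts is a one-liner: hash the blob name to pick an account deterministically. A minimal sketch (the account names are hypothetical; note that changing the account count later means moving data, which is why consistent hashing is worth considering up front):

```python
import hashlib

def pick_storage_account(blob_name: str, accounts: list[str]) -> str:
    """Deterministically shard blobs across storage accounts so no single
    account absorbs all the request traffic."""
    digest = hashlib.sha256(blob_name.encode()).digest()
    return accounts[int.from_bytes(digest[:4], "big") % len(accounts)]

accounts = [f"mydata{i:02d}" for i in range(4)]  # hypothetical account names

# The same blob always maps to the same account; load spreads evenly overall.
assert pick_storage_account("user/42/avatar.png", accounts) == \
       pick_storage_account("user/42/avatar.png", accounts)
```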
Production gotcha: Archive tier rehydration is not instant. I have seen disaster recovery plans that assumed archived backups could be restored immediately. Rehydrating a 1 TB blob from Archive tier can take up to 15 hours. If your RTO is less than 15 hours, keep a copy in Cool tier.
Azure Files
Azure Files provides fully managed SMB and NFS file shares in the cloud. It is the easiest way to replace on-premises file servers.
Key talking points:
- SMB 3.0 and NFS 4.1 protocols. Mount directly from Windows, Linux, and macOS.
- Azure File Sync replicates cloud file shares to on-premises Windows Servers with cloud tiering (frequently accessed files stay local, cold files are tiered to the cloud). This is the hybrid cloud story for file storage.
- Premium tier uses SSD storage and delivers consistent low-latency performance for I/O-intensive workloads. Supports up to 100,000 IOPS.
- Snapshots provide point-in-time recovery at the share level. You can take 200 snapshots per share.
- Identity-based authentication with Entra ID eliminates the need for storage account keys.
Azure Files as AKS persistent volumes
Azure Files supports ReadWriteMany (RWX) access mode in Kubernetes, meaning multiple pods can mount and write to the same file share simultaneously. Azure Managed Disks only support ReadWriteOnce (RWO). Use Azure Files when multiple pods need shared storage.
When to use in interviews: Azure Files is the answer when the interviewer says "we have an existing application that reads from a shared filesystem" or "we need to migrate file shares to the cloud."
Managed Disks
Managed Disks are block storage volumes for Azure VMs. Azure handles replication, encryption, and availability.
Key talking points:
- Four performance tiers: Standard HDD, Standard SSD, Premium SSD v2, Ultra Disk.
- Premium SSD v2 lets you independently configure IOPS and throughput without choosing a disk size tier. This is a game-changer for right-sizing costs.
- Ultra Disks deliver up to 160,000 IOPS and 2,000 MB/s. Use them for demanding database workloads (SAP HANA, SQL Server, Oracle).
- Disk snapshots are incremental, copying only changed blocks. A 1 TB disk with 10 GB of changes produces a 10 GB snapshot.
- Shared Disks allow multiple VMs to attach the same disk simultaneously. Required for Windows Server Failover Clusters and Oracle RAC.
- Server-Side Encryption (SSE) with platform-managed or customer-managed keys is enabled by default on all disks.
Disk bursting saves money on spiky workloads
Standard SSD and smaller Premium SSD disks support credit-based bursting. The disk accumulates IO credits during low-usage periods and spends them during bursts. A P20 disk (512 GB) baselines at 2,300 IOPS but can burst to 3,500 IOPS for up to 30 minutes. This eliminates the need to over-provision for peak loads.
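The credit mechanics can be illustrated with a toy simulation: the disk banks credits while demand sits below baseline and spends them to serve bursts, capped at the burst ceiling. This is a simplified model of the mechanism, not Azure's actual accounting.

```python
def simulate_bursting(demand_iops: list[int], baseline: int = 2300,
                      burst: int = 3500,
                      max_credits: int = 2_160_000) -> tuple[list[int], int]:
    """Toy model of credit-based disk bursting in 1-second steps.
    max_credits approximates (3500 - 2300) IOPS * 30 minutes of burst.
    Real accounting differs; this only illustrates the mechanism."""
    credits, served = max_credits, []
    for demand in demand_iops:
        if demand <= baseline:
            # Below baseline: bank unused IOPS as credits (up to the cap).
            credits = min(max_credits, credits + (baseline - demand))
            served.append(demand)
        else:
            # Above baseline: spend credits, bounded by the burst ceiling.
            spend = min(demand - baseline, burst - baseline, credits)
            credits -= spend
            served.append(baseline + spend)
    return served, credits

# A quiet period banks credits, then a spike is served above baseline.
served, remaining = simulate_bursting([100] * 10 + [3500] * 5)
print(max(served))  # 3500
```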
Data Lake Storage Gen2
Data Lake Storage Gen2 is Blob Storage with a hierarchical namespace bolted on. It adds directory-level operations and POSIX-compatible access control, making it suitable for big data analytics.
Key talking points:
- Same durability and availability as Blob Storage (99.999999999% durability, 11 nines).
- Hierarchical namespace enables file-system-like operations (rename directory in O(1) vs O(n) in flat Blob Storage).
- POSIX-style ACLs at the directory and file level, layered on top of Azure RBAC and attribute-based access control (ABAC).
- Native integration with Azure Databricks, Synapse Analytics, and HDInsight.
Data Lake zones are the interview talking point
Mention the Raw/Curated/Enriched zone pattern when discussing data lake architectures. Raw holds unmodified ingested data. Curated holds cleaned, schema-applied data. Enriched holds aggregated, business-ready datasets. This shows you understand data governance, not just storage.
3. Databases
Databases are the most important category in system design interviews. Azure offers five database services that cover relational, document, key-value, and graph workloads. Cosmos DB is the star of this section.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Azure SQL Database | RDS for SQL Server / Aurora | Cloud SQL | Relational, OLTP |
| Cosmos DB | DynamoDB (partial) | Firestore / Spanner (partial) | Global, multi-model, tunable consistency |
| Azure Cache for Redis | ElastiCache for Redis | Memorystore | Caching, sessions, leaderboards |
| Azure Database for PostgreSQL | RDS for PostgreSQL / Aurora PostgreSQL | Cloud SQL for PostgreSQL / AlloyDB | Open-source relational |
| Table Storage | DynamoDB (basic) | Bigtable (basic) | Simple key-value, low cost |
Azure SQL Database
Azure SQL is SQL Server as a managed service. It supports the full T-SQL surface area, making it the easiest migration path for SQL Server workloads.
Key talking points:
- Hyperscale tier supports databases up to 100 TB with near-instant backups (regardless of database size) and up to 4 read replicas. This is Azure's answer to Aurora.
- Elastic Pools share resources across multiple databases. This is the correct answer for multi-tenant SaaS architectures where each tenant gets their own database.
- Auto-failover groups provide automatic DNS-based failover across regions with RPO < 5 seconds. The application connects to a listener endpoint and does not need to change connection strings during failover.
- Serverless compute tier automatically pauses the database after inactivity and resumes on first connection. Saves costs for dev/test environments.
DTU vs vCore pricing: know the difference
DTU (Database Transaction Units) bundles CPU, memory, and IO into a single unit. Simple but opaque. vCore lets you independently choose compute and storage. In interviews, recommend vCore for production workloads because it gives predictable performance. Use DTU only for simple workloads where you do not need to tune individual resources.
Production gotcha: Azure SQL has connection limits that vary by tier. A Basic tier database supports only 30 concurrent connections. I have seen applications crash because developers tested against a Basic tier database that was accidentally promoted to production. Always use connection pooling and monitor the sessions_count metric.
Cosmos DB
Cosmos DB is Azure's globally distributed, multi-model database. It is the most important Azure service to understand for system design interviews because it introduces concepts that do not exist in AWS or GCP.
The 5 consistency levels (you MUST know these):
This is the single most asked interview question about Cosmos DB. No other database exposes five tunable consistency levels.
- Strong: Every read gets the most recent committed write. Equivalent to reading from a single-region database. Highest latency because writes must replicate to a quorum of replicas before acknowledging.
- Bounded Staleness: Reads lag behind writes by at most K versions or T seconds. You configure both bounds. Good for leaderboards or stock tickers where "slightly stale" is acceptable.
- Session (default): Within a single client session, you always read your own writes. Different sessions may see different data. This is the correct default for 90% of applications.
- Consistent Prefix: You never see out-of-order writes. If writes happen as A, B, C, you may see A or A,B or A,B,C but never A,C or B,A. Good for event streams.
- Eventual: No ordering guarantees. Lowest latency, highest throughput. Good for like counts, view counts, and analytics.
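Consistent Prefix is the level candidates most often garble, so it is worth being precise: a reader may lag, but what it sees must be a prefix of the global write order. A small checker makes the rule concrete:

```python
def is_consistent_prefix(observed: list[str], write_order: list[str]) -> bool:
    """True if the observed reads are possible under Consistent Prefix:
    the reader sees some prefix of the global write order, never a gap
    or a reordering."""
    return observed == write_order[:len(observed)]

writes = ["A", "B", "C"]
assert is_consistent_prefix(["A", "B"], writes)      # valid: a prefix
assert not is_consistent_prefix(["A", "C"], writes)  # invalid: B skipped
assert not is_consistent_prefix(["B", "A"], writes)  # invalid: reordered
```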
Partition key selection is the #1 Cosmos DB design decision
I have seen teams blow their Cosmos DB budget by 20x because of bad partition keys. A bad partition key creates hot partitions where all requests hit a single physical partition (limited to 10,000 RU/s). A good partition key distributes requests evenly. For e-commerce, use userId. For IoT, use deviceId. For multi-tenant, use tenantId. Never use a timestamp as a partition key.
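The difference is easy to demonstrate. Cosmos DB hash-partitions on the partition key; the sketch below uses a local hash as a stand-in for that internal partitioning and counts how requests land on 8 hypothetical physical partitions under a good key versus a coarse timestamp key.

```python
import hashlib
from collections import Counter

def partition_of(key: str, partitions: int = 8) -> int:
    # Local stand-in for Cosmos DB's internal hash partitioning.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % partitions

# Good key: userId spreads 10,000 requests across all partitions.
good = Counter(partition_of(f"user-{i}") for i in range(10_000))
# Bad key: a coarse timestamp sends a whole minute of traffic to one partition.
bad = Counter(partition_of("2024-01-01T12:00") for _ in range(10_000))

print(len(good), len(bad))  # 8 1
```

The "bad" counter is a hot partition in miniature: all 10,000 requests queue behind a single 10,000 RU/s physical partition while the other seven sit idle.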
RU-based pricing model:
- Cosmos DB charges in Request Units (RUs). One RU is the cost of reading a 1 KB document by its ID.
- A simple point read: 1 RU. A write: 5 RUs. A complex query scanning 100 documents: 50-500 RUs.
- Provisioned throughput: You reserve RUs per second (minimum 400 RU/s at ~$23/month). If you exceed your provisioned RUs, requests get throttled (HTTP 429).
- Autoscale: Automatically scales between 10% and 100% of your max RU/s. You are billed for the highest RU/s reached in each hour, at roughly a 50% rate premium over standard provisioned throughput.
- Serverless: Pay per RU consumed. Good for dev/test and spiky workloads under 5,000 RU/s.
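The provisioned-throughput math is worth being able to do on a whiteboard. A minimal estimator, assuming the common list price of about $0.008 per 100 RU/s per hour (the rate varies by region and feature set, so treat it as an assumption):

```python
def monthly_cost_provisioned(ru_per_sec: int,
                             price_per_100ru_hour: float = 0.008,
                             hours_per_month: int = 730) -> float:
    """Estimate the monthly cost of provisioned Cosmos DB throughput.
    The $0.008 per 100 RU/s per hour rate is an assumed list price."""
    return ru_per_sec / 100 * price_per_100ru_hour * hours_per_month

print(round(monthly_cost_provisioned(400), 2))  # 23.36
```

Which is where the "minimum 400 RU/s at ~$23/month" figure above comes from.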
Production gotcha: Cosmos DB multi-region writes require a conflict resolution policy. The default is Last Writer Wins (LWW) based on a _ts timestamp. This means if two regions write to the same document simultaneously, the "later" write wins and the "earlier" write is silently discarded. I have seen data loss from this. If your application cannot tolerate lost writes, use a custom conflict resolution stored procedure or design for single-region writes with multi-region reads.
Azure Cache for Redis
Azure Cache for Redis is a fully managed Redis service. It provides in-memory caching, session storage, pub/sub messaging, and data structure operations.
Key talking points:
- Sub-millisecond latency for reads and writes. A Premium tier cache with clustering supports up to 1.2 million requests per second.
- Enterprise tier includes Redis modules: RediSearch (full-text search), RedisJSON (native JSON operations), and RedisTimeSeries. This is unique to Azure. AWS ElastiCache does not offer Redis modules.
- Active geo-replication (Enterprise tier) replicates data across Azure regions with conflict-free resolution.
- RBAC with Azure AD authentication eliminates the need for Redis passwords.
Redis is not a database
I have seen teams use Azure Cache for Redis as their primary data store. Redis is volatile by default. Even with persistence enabled (RDB snapshots or AOF), Redis is not designed for durability. Use it as a cache layer in front of a database, not as a replacement.
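The correct shape is cache-aside: Redis fronts the database, every entry carries a TTL, and a cache wipe costs you latency, not data. A minimal sketch with in-process dicts standing in for Redis and the database (the key names and TTL are illustrative):

```python
import time

db = {"user:1": {"name": "Ada"}}           # stand-in for the system of record
cache: dict[str, tuple[float, dict]] = {}  # stand-in for Redis
TTL_SECONDS = 300

def get_user(key: str) -> dict:
    """Cache-aside read: serve from cache if fresh, otherwise read the
    database and repopulate. Losing the cache loses nothing durable."""
    entry = cache.get(key)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]                      # cache hit
    value = db[key]                          # cache miss: hit the database
    cache[key] = (time.monotonic(), value)   # repopulate with a timestamp
    return value

assert get_user("user:1")["name"] == "Ada"   # first call misses, loads from db
assert "user:1" in cache                     # subsequent calls hit the cache
```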
Azure Database for PostgreSQL
Azure offers three deployment options for PostgreSQL: Single Server (legacy), Flexible Server (recommended), and Cosmos DB for PostgreSQL (formerly Hyperscale/Citus, for distributed PostgreSQL).
Key talking points:
- Flexible Server is the recommended option. It supports PostgreSQL 13-16, burstable compute (scale down to save costs), and same-zone or zone-redundant high availability.
- Cosmos DB for PostgreSQL (Citus extension) distributes tables across multiple nodes. Good for multi-tenant SaaS and real-time analytics on large datasets.
- Intelligent Performance features: automatic tuning recommendations, query performance insights, and wait statistics.
- Read replicas (up to 5) for read-heavy workloads.
Cosmos DB for PostgreSQL is the distributed PostgreSQL answer
When an interviewer asks "how would you scale PostgreSQL horizontally," Cosmos DB for PostgreSQL (the Citus extension) is the Azure answer. It shards tables by a distribution column (e.g., tenant_id) and distributes data across worker nodes. You write standard PostgreSQL queries and Citus handles the distribution.
Table Storage
Table Storage is a simple, cheap key-value store inside Azure Storage Accounts. It supports structured NoSQL data.
Key talking points:
- Extremely cheap: $0.045/GB/month for storage, $0.00036 per 10,000 transactions.
- Schema-less. Each entity (row) can have different properties.
- Partition key + Row key as the composite primary key.
- Limited querying: no secondary indexes, no joins, no complex filters. Point lookups only.
When to use in interviews: Table Storage is the answer when you need a cheap, simple key-value store and can design your access patterns around partition key + row key lookups. For anything more complex, use Cosmos DB.
4. Messaging and Streaming
Messaging is the backbone of distributed systems. Azure offers four messaging services, each designed for different patterns. The most common interview mistake is conflating them.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Service Bus | SQS + SNS | Pub/Sub | Enterprise messaging, ordered delivery |
| Event Hubs | Kinesis | Pub/Sub (streaming mode) | High-throughput streaming, telemetry |
| Event Grid | EventBridge | Eventarc | Reactive event routing |
| Queue Storage | SQS (basic) | Cloud Tasks | Simple, cheap queueing |
Service Bus
Service Bus is Azure's enterprise message broker. It provides queues (point-to-point) and topics (pub/sub) with advanced features like sessions, dead-lettering, and duplicate detection.
Key talking points:
- Sessions enable ordered processing by session ID. Messages with the same session ID are processed in FIFO order by a single consumer. This is how you guarantee order for per-customer or per-device message streams.
- Dead Letter Queue (DLQ) captures messages that fail processing after N attempts. Every queue and subscription has its own DLQ. Monitor this; a growing DLQ means your consumers are unhealthy.
- Duplicate detection within a configurable time window (default 10 minutes, up to 7 days). The broker silently drops messages with a previously seen message ID. This deduplicates on the send side; your consumers still need to be idempotent.
- Transactions: Send a message and complete another message in a single atomic operation. This enables the transactional outbox pattern.
- Premium tier supports messages up to 100 MB and guarantees predictable latency.
Sessions are the Service Bus killer feature
When an interviewer asks "how do you guarantee ordered processing in a distributed system," Service Bus sessions are the answer. Each session ID maps to exactly one consumer at a time. Messages within a session arrive in FIFO order. You get ordering guarantees without sacrificing parallelism across different sessions.
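A toy model of the session mechanics: each session ID gets pinned to exactly one consumer, and messages within a session keep arrival order, while different sessions process in parallel. The worker and session names are illustrative, and the round-robin assignment is a simplification of the broker's actual session-acceptance behavior.

```python
from collections import defaultdict

def assign_sessions(messages: list[tuple[str, int]],
                    consumers: list[str]) -> dict[str, list[tuple[str, int]]]:
    """Toy model of Service Bus sessions: pin each session ID to one
    consumer and preserve FIFO order within each session."""
    lanes: dict[str, list[tuple[str, int]]] = defaultdict(list)
    owner: dict[str, str] = {}
    for session_id, body in messages:
        # First message of an unowned session claims a consumer
        # (round-robin here; the real broker assigns on session accept).
        if session_id not in owner:
            owner[session_id] = consumers[len(owner) % len(consumers)]
        lanes[owner[session_id]].append((session_id, body))
    return dict(lanes)

msgs = [("dev-A", 1), ("dev-B", 1), ("dev-A", 2), ("dev-B", 2)]
lanes = assign_sessions(msgs, ["worker-1", "worker-2"])
# dev-A's messages land on one worker, in order; dev-B's on the other.
print(lanes["worker-1"])  # [('dev-A', 1), ('dev-A', 2)]
```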
Production gotcha: Service Bus has a lock duration on messages (default 30 seconds, max 5 minutes). If your consumer takes longer than the lock duration to process a message, the lock expires and the message becomes visible again. Another consumer picks it up and you get duplicate processing. I have seen this cause double-charging in payment systems. Always set the lock duration to match your worst-case processing time, and call RenewMessageLock() for long-running operations.
Event Hubs
Event Hubs is Azure's high-throughput streaming platform. It ingests millions of events per second with sub-second latency. It is the Azure equivalent of Apache Kafka, and it even supports the Kafka wire protocol.
Key talking points:
- Throughput Units (TU): Each TU provides 1 MB/s ingress and 2 MB/s egress. Standard tier supports up to 40 TUs (40 MB/s ingress). Premium and Dedicated tiers support much higher throughput.
- Partitions: Data is distributed across partitions using a partition key. More partitions enable more consumer parallelism (1 consumer per partition per consumer group). Choose partition count at creation time; it cannot be reduced later.
- Consumer Groups: Each consumer group gets its own view of the event stream. One group can process real-time analytics while another archives to Data Lake. Up to 20 consumer groups (Standard) or 100 (Premium).
- Capture: Automatically saves all events to Blob Storage or Data Lake in Avro format. No code required. This is the simplest way to build an event archive.
- Kafka compatibility: Event Hubs supports the Kafka wire protocol on Standard tier and above. Existing Kafka clients connect by changing the connection string. No code changes.
Event Hubs vs Service Bus: know when to use which
Event Hubs is for streaming (high throughput, append-only log, consumer manages position). Service Bus is for messaging (individual message acknowledgment, dead lettering, sessions). If you need to process 100K+ events/second, use Event Hubs. If you need guaranteed delivery of individual business transactions with ordering, use Service Bus.
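Capacity sizing follows directly from the TU limits above: 1 TU covers 1 MB/s of ingress or 1,000 events/s, whichever is hit first. A back-of-envelope calculator (the per-TU figures are the Standard-tier numbers quoted above; treat them as assumptions to verify):

```python
import math

def size_event_hub(events_per_sec: int, avg_event_kb: float,
                   mb_per_tu: int = 1, events_per_tu: int = 1000) -> dict:
    """Back-of-envelope Event Hubs sizing: TUs needed is the max of the
    bandwidth-bound and the event-rate-bound requirement."""
    mb_per_sec = events_per_sec * avg_event_kb / 1024
    tus = max(math.ceil(mb_per_sec / mb_per_tu),
              math.ceil(events_per_sec / events_per_tu))
    return {"ingress_mb_s": round(mb_per_sec, 1), "throughput_units": tus}

print(size_event_hub(50_000, avg_event_kb=1))
# {'ingress_mb_s': 48.8, 'throughput_units': 50}
```

Fifty TUs exceeds the 40-TU Standard ceiling, which is exactly the point where you reach for Premium or Dedicated. Partition count is sized separately: at least as many partitions as the consumer parallelism you want.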
Production gotcha: Event Hubs retention is 1-7 days on Standard and up to 90 days on Premium and Dedicated. After the retention period, events are deleted. I have seen teams assume events would be available forever and lose data. Enable Capture to archive events to Data Lake before they expire.
Event Grid
Event Grid is a serverless event routing service. It connects event sources to event handlers using a push-based model.
Key talking points:
- Event Grid is push-based. Events are delivered to handlers within seconds of occurrence. No polling required.
- Built-in integration with 20+ Azure services as event sources. Blob Storage, Resource Groups, Azure subscriptions, IoT Hub, and more.
- Event filtering by event type, subject prefix/suffix, and advanced filters on data fields. Only deliver events your handler cares about.
- At-least-once delivery with retry (exponential backoff, up to 24 hours). Dead-lettering to Blob Storage for events that cannot be delivered.
- Cost: $0.60 per million events. First 100,000 events per month are free.
Event Grid is the glue service
Event Grid is not a replacement for Service Bus or Event Hubs. It is the reactive glue that connects Azure services. "When a blob is uploaded, trigger an Azure Function" is an Event Grid pattern. "Process 1 million events per second" is an Event Hubs pattern. "Guarantee ordered delivery of payment messages" is a Service Bus pattern.
Queue Storage
Queue Storage is the simplest, cheapest messaging option on Azure. It lives inside an Azure Storage Account.
Key talking points:
- Maximum message size: 64 KB. Maximum queue size: 500 TB. Default time-to-live: 7 days (configurable, including never expiring).
- No ordering guarantees, no duplicate detection, no dead-letter queue. Simple FIFO-ish behavior.
- Costs: $0.004 per 10,000 operations. Dramatically cheaper than Service Bus.
- Use when you need basic task distribution and do not need the advanced features of Service Bus.
When to use in interviews: Queue Storage is the right answer for fire-and-forget background tasks where ordering does not matter and occasional duplicates are acceptable. Image resizing, email sending, log processing.
5. Networking and Content Delivery
Networking is where Azure interviews get tricky. Azure has overlapping services for load balancing and content delivery, and understanding which one to use is a common interview question.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Azure Front Door | CloudFront + Global Accelerator | Cloud CDN + Cloud Armor | Global HTTP load balancing + CDN + WAF |
| Application Gateway | ALB | HTTP(S) Load Balancer | Regional Layer 7 load balancing + WAF |
| Azure DNS | Route 53 | Cloud DNS | DNS hosting |
| Traffic Manager | Route 53 (DNS routing) | Cloud DNS (routing) | DNS-based global traffic distribution |
| VNet | VPC | VPC | Network isolation |
Azure Front Door
Front Door is Azure's global load balancer, CDN, and Web Application Firewall (WAF) in a single service. It operates at the edge across 150+ Microsoft Points of Presence worldwide.
Key talking points:
- Combines CDN, global load balancing, and WAF in one service. On AWS, you need CloudFront + Global Accelerator + WAF separately.
- TLS termination at the edge with managed certificates. Supports HTTP/2 and HTTP/3.
- URL-based routing: send /api/* to AKS and /static/* to Blob Storage.
- Private Link integration: connect to backends over Azure's private network, not the public internet.
- Built-in DDoS protection at Layer 7. WAF rules protect against OWASP Top 10 attacks.
Front Door vs Application Gateway vs Traffic Manager: the confusion
This is the #1 Azure networking interview question. Here is the simple answer:
- Front Door: Global HTTP load balancer + CDN + WAF. Use for global applications.
- Application Gateway: Regional HTTP load balancer + WAF. Use for single-region applications that need path-based routing or WAF.
- Traffic Manager: DNS-based global load balancer. No data path (it only returns DNS responses). Use for non-HTTP protocols or when you need DNS-level failover.
- Azure Load Balancer: Layer 4 (TCP/UDP). Regional only. Use for non-HTTP traffic.
Application Gateway
Application Gateway is Azure's regional Layer 7 load balancer. It sits inside a VNet and provides path-based routing, SSL termination, and WAF.
Key talking points:
- Path-based routing: /api/* to backend pool A, /images/* to backend pool B.
- WAF v2 with OWASP 3.2 Core Rule Set. Can block SQL injection, XSS, and other L7 attacks.
- Autoscaling (v2 SKU): automatically scales based on traffic. No manual capacity planning.
- WebSocket and HTTP/2 support.
- Integration with AKS via the Application Gateway Ingress Controller (AGIC).
- Costs start at ~$175/month for the gateway-hours plus data processed.
Application Gateway Ingress Controller (AGIC) for AKS
Instead of deploying nginx-ingress inside your AKS cluster, use AGIC. It configures the Application Gateway based on Kubernetes Ingress resources. You get Azure-managed WAF, TLS termination, and autoscaling without running an ingress controller pod inside the cluster.
Azure DNS
Azure DNS hosts your DNS zones on Azure's global network. It uses Anycast networking, so DNS queries are answered by the nearest available server.
Key talking points:
- 100% uptime SLA (the highest of any Azure service).
- Supports Private DNS Zones for name resolution within VNets.
- Alias records point directly to Azure resources (Front Door, Traffic Manager, Public IP). No dangling DNS when resources are deleted.
- DNSSEC is supported for domain validation.
Traffic Manager
Traffic Manager is a DNS-based global traffic router. It does not proxy traffic; it returns DNS responses that direct clients to the best endpoint.
Key talking points:
- Routing methods: Priority (failover), Weighted (A/B testing), Performance (latency-based), Geographic (compliance), MultiValue (return multiple IPs), Subnet (route by client IP range).
- Health checks: TCP, HTTP, or HTTPS probes to endpoints. Unhealthy endpoints are removed from DNS responses.
- TTL affects failover time. Lower TTL = faster failover but more DNS queries.
- Use for non-HTTP protocols (TCP, UDP) where Front Door does not apply.
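The TTL point is worth quantifying. A back-of-envelope sketch of worst-case DNS failover time: detection takes several consecutive failed probes, and cached DNS answers keep pointing clients at the dead endpoint until the TTL expires. The probe interval, tolerated-failure count, and TTL below are illustrative assumptions, not Azure defaults.

```python
def worst_case_failover_seconds(probe_interval: int,
                                tolerated_failures: int,
                                ttl: int) -> int:
    """Detection time (consecutive failed probes before the endpoint is
    marked degraded) plus the time cached DNS answers remain valid."""
    detection = probe_interval * (tolerated_failures + 1)
    return detection + ttl

# Example: 30s probes, 3 tolerated failures, 60s TTL
print(worst_case_failover_seconds(30, 3, 60))  # -> 180
```

Lowering the TTL shrinks the second term but increases query volume against your DNS zone, which is exactly the tradeoff in the bullet above.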
Virtual Network (VNet)
VNets provide network isolation for Azure resources. Every Azure resource that needs network connectivity should be inside a VNet.
Key talking points:
- Private Endpoints expose Azure PaaS services (SQL, Cosmos DB, Storage) on a private IP inside your VNet. Traffic never leaves the Azure backbone. This is the correct answer for "how do you secure access to your database."
- Network Security Groups (NSG) are stateful firewalls at the subnet and NIC level. Deny all inbound by default, allow only necessary traffic.
- VNet Peering connects two VNets with low-latency, high-bandwidth Microsoft backbone connectivity. Works across regions (global peering).
- VPN Gateway connects your VNet to on-premises networks via IPsec/IKE VPN. ExpressRoute provides a dedicated private connection (not over the internet) for higher bandwidth and reliability.
- Service Endpoints route traffic from a VNet to Azure PaaS services over the Azure backbone (older approach). Private Endpoints are the modern replacement.
Always use Private Endpoints for databases
I have seen production databases accessible from the public internet because the team "planned to add Private Endpoints later." Later never comes. Start with Private Endpoints from day one. The Azure portal makes it easy: create the resource, enable Private Endpoint, deny public access. Three clicks.
6. Security and Identity
Azure has the strongest identity story of any cloud provider, rooted in Entra ID (formerly Azure Active Directory). In system design interviews, security questions on Azure almost always involve Managed Identity and zero-trust architecture.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Entra ID (Azure AD) | IAM + Cognito | Cloud Identity + IAP | Identity and access management |
| Key Vault | Secrets Manager + KMS | Secret Manager + Cloud KMS | Secrets, keys, certificates |
| Azure Firewall | Network Firewall | Cloud Firewall | Network-level threat protection |
| DDoS Protection | Shield | Cloud Armor | Volumetric DDoS mitigation |
| Managed Identity | IAM Roles (for EC2) | Service Accounts | Credential-free authentication |
Entra ID (Azure AD)
Entra ID is Azure's cloud-based identity and access management service. It does far more than AWS IAM. It handles user authentication, application authorization, conditional access policies, and B2B/B2C identity federation.
Key talking points:
- Managed Identity is the single most important security feature on Azure. It assigns an identity to Azure resources (VMs, Functions, AKS pods) so they can authenticate to other Azure services without any credentials in code. No passwords, no connection strings, no secret rotation. The identity is managed by Azure.
- Conditional Access: Rules that evaluate every authentication request based on user, device, location, and risk. "If the user is accessing from an unknown location, require MFA and block if the device is not managed." This is zero trust in practice.
- B2B vs B2C: B2B lets external partners access your Azure resources using their own organization's identity. B2C creates a customer-facing identity system with social logins (Google, Facebook), custom branding, and self-service sign-up.
- Entra ID issues OAuth 2.0 tokens that work with every Azure service and thousands of SaaS applications.
Managed Identity eliminates the #1 security risk
The most common production security incident I have seen is leaked credentials: connection strings in config files, API keys in environment variables, secrets in code repositories. Managed Identity eliminates this entire category of risk. Any Azure service that supports Managed Identity should use it. No exceptions.
Key Vault
Key Vault stores secrets, encryption keys, and TLS certificates. It provides centralized secret management with access policies and audit logging.
Key talking points:
- HSM-backed keys (Premium tier) for regulatory compliance. Keys never leave the HSM boundary.
- Soft delete and purge protection prevent accidental deletion. Deleted secrets are recoverable for 7-90 days.
- Secret versioning: every update creates a new version. Applications can pin to a specific version or always get the latest.
- Certificate auto-renewal with integrated Certificate Authorities (DigiCert, GlobalSign).
- Access policies or RBAC control who (and what) can read, write, or manage secrets.
Key Vault has rate limits
Key Vault allows 4,000 GET requests per 10 seconds per vault (for secrets). If your application reads secrets on every request instead of caching them, you will get throttled. Cache secrets in memory with a reasonable TTL (5-15 minutes) and refresh them periodically.
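A minimal sketch of that caching pattern. `fetch` stands in for a real Key Vault client call (e.g. a `SecretClient.get_secret` wrapper); the name and wiring are assumptions for illustration.

```python
import time

class CachedSecretReader:
    """In-memory TTL cache in front of a secret store, so hot request
    paths do not hit Key Vault's per-vault throttling limits."""

    def __init__(self, fetch, ttl_seconds=600):
        self._fetch = fetch          # callable: name -> secret value
        self._ttl = ttl_seconds
        self._cache = {}             # name -> (value, fetched_at)

    def get(self, name):
        hit = self._cache.get(name)
        if hit and time.monotonic() - hit[1] < self._ttl:
            return hit[0]            # fresh cached value, no vault call
        value = self._fetch(name)    # cache miss or expired: refresh
        self._cache[name] = (value, time.monotonic())
        return value
```

Pick a TTL in the 5-15 minute range suggested above; on rotation, the worst case is serving the old secret for one TTL, which is usually acceptable when both versions remain valid during rollover.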
Production gotcha: I have seen teams create a single Key Vault for all environments (dev, staging, production). This is a security anti-pattern. A developer with access to "the Key Vault" can read production secrets. Create separate Key Vaults per environment with different access policies.
Azure Firewall
Azure Firewall is a managed, cloud-based network security service that protects Azure VNet resources. It provides L3-L7 filtering with built-in threat intelligence.
Key talking points:
- Stateful firewall with built-in high availability and auto-scaling. No capacity planning required.
- Application rules (L7): filter outbound traffic by FQDN. "Allow traffic to `*.github.com` but block everything else."
- Network rules (L3-L4): filter by source IP, destination IP, port, and protocol.
- Threat Intelligence: automatically block traffic to/from known malicious IP addresses and domains (powered by Microsoft's threat intelligence feed).
- Azure Firewall Premium adds TLS inspection (decrypt, inspect, re-encrypt), IDPS (Intrusion Detection and Prevention with 67,000+ signatures), and URL filtering.
- Cost: ~$900/month base + data processing charges. This is expensive. Use NSGs for basic filtering and reserve Azure Firewall for compliance-sensitive workloads.
Azure Firewall is expensive for simple use cases
At ~$900/month base cost, Azure Firewall is overkill for simple inbound/outbound filtering. Use Network Security Groups (free) for basic port and IP filtering. Use Azure Firewall when you need L7 filtering, threat intelligence, TLS inspection, or centralized logging for compliance.
DDoS Protection
Azure DDoS Protection defends against volumetric, protocol, and application-layer DDoS attacks.
Key talking points:
- Basic tier is free and automatically enabled for all Azure resources. It protects against common L3/L4 attacks.
- Standard tier adds adaptive tuning (learns your traffic patterns), attack analytics, and cost protection (credits for resource scaling during an attack). Costs $2,944/month per DDoS plan (covers up to 100 public IPs).
- Integrates with Azure Monitor for real-time attack telemetry and alerts.
- Combine with Front Door WAF for L7 protection and DDoS Protection Standard for L3/L4 protection.
DDoS cost protection is unique to Azure
If a DDoS attack causes your resources to auto-scale (more VMs, higher bandwidth), Microsoft credits you for the cost of the scale-out. This is a unique Azure benefit. AWS Shield Advanced has similar cost protection but at a higher price point ($3,000/month).
7. Observability
Observability is the difference between "the system is slow" and "the database query on line 47 is scanning 2M rows because the index was dropped last Tuesday." Azure Monitor is the umbrella service that ties everything together.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Azure Monitor | CloudWatch | Cloud Monitoring | Metrics, alerts, dashboards |
| Application Insights | X-Ray + CloudWatch Logs | Cloud Trace + Error Reporting | APM, distributed tracing |
| Log Analytics | CloudWatch Logs Insights | Cloud Logging | Log aggregation, KQL queries |
| Microsoft Sentinel | GuardDuty + SecurityHub | Chronicle | SIEM, security analytics |
Azure Monitor
Azure Monitor is the central observability platform. It collects metrics and logs from every Azure resource and provides alerting, dashboards, and automated responses.
Key talking points:
- All Azure resources emit platform metrics automatically. No agent installation required for basic CPU, memory, disk, and network metrics.
- KQL (Kusto Query Language) is the query language for Log Analytics. It is powerful and worth learning. A single KQL query can correlate logs from multiple services, compute percentiles, and render time-series charts.
- Alert rules can trigger on metric thresholds, log query results, or activity log events. Action groups define what happens when an alert fires: email, SMS, webhook, Azure Function, Logic App, or ITSM ticket.
- Autoscale uses Monitor metrics to automatically scale VMs, App Service, and other resources. Scale out when CPU > 70%, scale in when CPU < 30%.
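The autoscale rule in the last bullet can be sketched as a threshold function with a dead band; the thresholds and instance bounds below are illustrative assumptions, not Azure defaults.

```python
def autoscale_decision(cpu_percent, current_instances,
                       scale_out_at=70, scale_in_at=30,
                       min_instances=2, max_instances=10):
    """Threshold autoscaling with a dead band: the gap between the
    scale-out (70%) and scale-in (30%) thresholds prevents flapping
    when CPU hovers near a single cutoff."""
    if cpu_percent > scale_out_at and current_instances < max_instances:
        return current_instances + 1
    if cpu_percent < scale_in_at and current_instances > min_instances:
        return current_instances - 1
    return current_instances

print(autoscale_decision(85, 3))  # -> 4
```

The dead band (anything between 30% and 70% changes nothing) is the detail interviewers look for: symmetric thresholds cause oscillation as scaling in immediately pushes per-instance CPU back over the scale-out line.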
Application Insights
Application Insights is Azure's Application Performance Monitoring (APM) service. It is the Azure equivalent of Datadog APM or New Relic.
Key talking points:
- Auto-instrumentation: Add the Application Insights SDK (or enable it in App Service settings) and it automatically captures HTTP requests, database queries, Redis calls, and external HTTP dependencies. No code changes for basic telemetry.
- Application Map: Visualizes dependencies between services and highlights failing or slow components. This is the first thing I open during an incident.
- Live Metrics: Real-time stream of requests, exceptions, and performance counters. Useful during deployments to spot regressions immediately.
- Smart Detection: ML-based anomaly detection that alerts on unusual failure rates, response time degradation, and dependency issues.
- Distributed tracing: Correlates requests across multiple services using W3C Trace Context. A single operation ID traces a request from the client through API Gateway, microservices, queues, and databases.
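The correlation in the last bullet travels in the W3C Trace Context `traceparent` header, which has four dash-separated fields. A minimal parser sketch (validation here is deliberately simplified relative to the full spec):

```python
def parse_traceparent(header: str):
    """Split a W3C Trace Context `traceparent` header into its fields:
    version, trace-id (32 hex chars, constant for the whole operation),
    parent span id (16 hex chars), and trace flags."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "sampled": flags == "01"}

example = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
print(parse_traceparent(example)["trace_id"])
```

The trace-id stays the same across every hop while each service mints a new span id, which is how Application Insights stitches one end-to-end timeline from telemetry emitted by independent services.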
Application Insights sampling is the cost control lever
At high traffic volumes, Application Insights can become expensive. Adaptive sampling automatically reduces telemetry volume while preserving statistically accurate metrics. Set an initial target of 5 items per second per server. You get representative data at a fraction of the cost.
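For intuition, head sampling makes one deterministic keep/drop decision per operation, so every item in a trace gets the same fate and traces stay intact. A minimal sketch of that idea (the SDK's adaptive sampling additionally tunes the percentage toward the target rate; this shows the mechanism, not its implementation):

```python
import hashlib

def keep_item(operation_id: str, sampling_percentage: float) -> bool:
    """Deterministic head sampling: hash the operation id into [0, 100)
    and keep the item if it falls under the sampling percentage. All
    telemetry sharing an operation id gets the same decision."""
    digest = hashlib.sha256(operation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64 * 100
    return bucket < sampling_percentage
```

Because the decision is a pure function of the operation id, every service in a distributed trace independently reaches the same verdict without coordination.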
Production gotcha: Application Insights defaults to 90-day retention. After 90 days, telemetry is deleted. I have seen teams try to investigate an incident from 4 months ago and find the data is gone. For long-term retention, increase the retention on the backing Log Analytics workspace (up to 730 days) or export telemetry to a Storage Account.
Log Analytics
Log Analytics is the centralized log aggregation and query service powered by KQL. Every Azure service can send its logs to a Log Analytics workspace.
Key talking points:
- KQL is extremely powerful. A single query like `requests | where duration > 1000 | summarize count() by bin(timestamp, 5m) | render timechart` gives you a time-series chart of slow requests.
- Workspace-based pricing: pay per GB ingested. First 5 GB/day are free. After that, ~$2.76/GB.
- Data retention: 31 days free, up to 730 days at additional cost.
- Log Analytics workspaces are the foundation for alerts, workbooks, dashboards, and Sentinel (SIEM).
Microsoft Sentinel
Sentinel is Azure's cloud-native SIEM (Security Information and Event Management) and SOAR (Security Orchestration, Automated Response) service.
Key talking points:
- Built on top of Log Analytics. Ingests security logs from Azure, Microsoft 365, on-premises, and third-party sources.
- Analytics rules detect threats using KQL queries, ML models, and Microsoft's threat intelligence.
- Playbooks (built on Logic Apps) automate incident response. "When a brute-force attack is detected, block the IP in the firewall and create a ServiceNow ticket."
- UEBA (User and Entity Behavior Analytics) detects anomalous user behavior by baselining normal patterns.
- Cost: Sentinel charges per GB ingested into its Log Analytics workspace. Enterprise deployments can cost $10K-$100K+/month.
Sentinel is overkill for most system design interviews
Unless the interview specifically asks about security monitoring or threat detection, do not bring up Sentinel. It is an enterprise security product. Mention it briefly when discussing defense-in-depth, but focus your time on Application Insights and Azure Monitor for observability.
8. AI and ML Services
AI services are increasingly relevant in system design interviews, especially at companies building LLM-powered products. Azure has a strong AI portfolio, particularly Azure OpenAI Service, which provides exclusive access to OpenAI models with enterprise features.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Azure OpenAI Service | Bedrock (Claude, Titan) | Vertex AI (Gemini) | GPT-4, embeddings, enterprise AI |
| Azure Machine Learning | SageMaker | Vertex AI | Custom model training and deployment |
| Azure AI Services (Cognitive) | Rekognition, Comprehend, Polly | Vision AI, Natural Language | Pre-built AI APIs |
| Azure AI Search | OpenSearch Service / Kendra | Vertex AI Search | Semantic search, RAG |
Azure OpenAI Service
Azure OpenAI Service gives you access to OpenAI models (GPT-4o, GPT-4, GPT-3.5-Turbo, DALL-E, Whisper, text-embedding-ada-002) through Azure's infrastructure with enterprise security, compliance, and networking.
Key talking points:
- Your data is not used to train models. This is the #1 enterprise concern. Azure OpenAI provides a contractual guarantee that your prompts and completions are not used to train, retrain, or improve OpenAI models.
- Provisioned Throughput Units (PTU): Reserve dedicated compute capacity measured in tokens per minute. Eliminates noisy-neighbor throttling. Essential for production workloads with predictable traffic.
- Content filtering: Built-in content safety that blocks harmful content at the API level. Configurable severity thresholds for hate, violence, self-harm, and sexual content.
- Private Endpoints: Connect to Azure OpenAI over your private VNet. Prompts never traverse the public internet.
- GPT-4o is the current flagship model with multimodal capabilities (text + image input). Text-embedding-3-large is the best embedding model for RAG applications.
Token limits and rate limiting are the production bottleneck
Azure OpenAI has per-deployment rate limits measured in tokens per minute (TPM) and requests per minute (RPM). Standard deployments share capacity and can be throttled during peak usage. I have seen production systems return 429 errors during traffic spikes. Use Provisioned Throughput for predictable workloads and implement exponential backoff with retry for standard deployments.
Production gotcha: Azure OpenAI endpoints are region-specific. If you deploy to East US and that region experiences capacity constraints, you cannot burst to another region without creating a new deployment. My recommendation is to deploy to at least two regions and implement client-side failover. The Azure OpenAI SDK does not handle multi-region failover automatically.
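A minimal sketch of the retry-with-backoff advice above. `ThrottledError` is a stand-in for the SDK's HTTP 429 exception, and the retry counts and delays are illustrative assumptions.

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for the SDK's rate-limit (HTTP 429) exception."""

def call_with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a throttled call with exponential backoff plus jitter:
    sleep base_delay * 2^attempt between attempts, re-raising the
    error once max_retries is exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ThrottledError:
            if attempt == max_retries:
                raise
            # jitter spreads out retries from many clients
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))
```

For the multi-region concern in the gotcha above, wrap this one level higher: when retries against the primary deployment are exhausted, repeat the call against a second region's deployment before surfacing the error.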
Azure Machine Learning
Azure ML is the platform for building, training, and deploying custom machine learning models. It supports the full ML lifecycle from data preparation to model monitoring.
Key talking points:
- Managed compute: Training clusters that auto-scale based on job queue depth. Spot instances for training at 60-80% discount.
- MLflow integration: Track experiments, register models, and deploy to endpoints using the open-source MLflow protocol. No vendor lock-in.
- Managed Endpoints: Deploy models as REST APIs with auto-scaling, blue/green deployment, and A/B traffic splitting.
- Responsible AI dashboard: Built-in fairness assessment, model interpretability, and error analysis.
- Cost: You pay for compute (training clusters + inference endpoints) and storage. The Azure ML workspace itself is free.
Azure AI Services (Cognitive Services)
Azure AI Services provides pre-built AI capabilities as REST APIs. No ML expertise required. These are the building blocks for AI-powered features in any application.
Key talking points:
- Vision: Image classification, object detection, OCR (including handwritten text), face detection, and custom image classifiers with just 5-10 training images.
- Language: Sentiment analysis, key phrase extraction, named entity recognition, text summarization (extractive and abstractive), and translation across 100+ languages.
- Speech: Real-time speech-to-text (streaming), batch transcription, text-to-speech with 400+ neural voices, and real-time speech translation.
- Decision: Time-series anomaly detection, content moderation (text, image, video), and reinforcement learning-based personalization.
- All APIs support Private Endpoints and customer-managed encryption keys.
- Containers: Run AI Services in your own infrastructure (on-premises or AKS) for data residency requirements. Same API, same model, different hosting.
AI Services containers for regulated industries
Healthcare and financial services often cannot send data to cloud APIs. Azure AI Services containers let you run the same models on-premises or in your own AKS cluster. The data never leaves your network. You still need an Azure subscription for billing, but the inference happens locally.
Azure AI Search
Azure AI Search (formerly Cognitive Search) provides full-text search, vector search, and hybrid search (combining both). It is the backbone of RAG (Retrieval-Augmented Generation) architectures on Azure.
Key talking points:
- Hybrid search combines keyword search (BM25) and vector search (HNSW) using Reciprocal Rank Fusion. This outperforms either method alone for RAG applications.
- Semantic ranker uses a cross-encoder model to re-rank results based on semantic relevance. It dramatically improves result quality for the top-K results sent to the LLM.
- Integrated vectorization: AI Search can automatically generate embeddings during indexing using Azure OpenAI, eliminating the need for a separate embedding pipeline.
- AI enrichment skills: Extract text from PDFs (OCR), detect language, extract entities, generate summaries, and create embeddings during indexing. No custom code needed.
- Supports up to 1,000 indexes per service and billions of documents per index.
AI Search is the RAG answer on Azure
When an interviewer asks "how would you build a chatbot that answers questions about company documents," the Azure answer is: Azure AI Search for retrieval + Azure OpenAI for generation. Upload documents, let AI Search index and embed them, then use hybrid search + semantic ranker to find relevant chunks. Send those chunks as context to GPT-4o. This is the canonical RAG pattern.
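For intuition, Reciprocal Rank Fusion fits in a few lines. The scoring formula and the constant k=60 follow the commonly cited RRF formulation; AI Search's internal parameters may differ, so treat this as an illustration of the technique.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    across the ranked lists it appears in (rank is 1-based). Documents
    ranked well by multiple retrievers rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["doc_a", "doc_b", "doc_c"]   # BM25 order
vector  = ["doc_b", "doc_d", "doc_a"]   # vector-similarity order
print(rrf_fuse([keyword, vector]))      # -> ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note that `doc_b` wins despite never ranking first in the keyword list: agreement between the two retrievers outweighs a single top placement, which is why hybrid search beats either method alone.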
9. Data and Analytics
Azure's analytics stack is built around Azure Synapse Analytics, a unified platform that combines data warehousing, big data processing, and data integration. Understanding when to use Synapse vs Databricks is a common interview topic.
| Azure Service | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| Azure Synapse Analytics | Redshift + Glue + EMR | BigQuery + Dataflow | Unified analytics platform |
| Azure Data Factory | Glue | Dataflow / Cloud Composer | ETL/ELT pipelines |
| HDInsight | EMR | Dataproc | Open-source Hadoop/Spark |
| Azure Databricks | EMR + Glue | Dataproc + BigQuery | Collaborative Spark, ML |
| Stream Analytics | Kinesis Analytics | Dataflow (streaming) | Real-time stream processing |
Azure Synapse Analytics
Synapse is Azure's unified analytics service. It combines serverless SQL, dedicated SQL pools (formerly SQL Data Warehouse), Apache Spark pools, and data integration pipelines in a single workspace.
Key talking points:
- Serverless SQL pool lets you query Parquet, CSV, and JSON files in Data Lake Storage using T-SQL. No provisioning, no data loading. Pay only for the data scanned (~$5/TB). This is the cheapest way to explore data in a lake.
- Dedicated SQL pool is a Massively Parallel Processing (MPP) data warehouse. Data is sharded into 60 distributions using hash, round-robin, or replicated distribution, and those distributions are spread across the compute nodes (the node count scales with the service level). Provisioned in Data Warehouse Units (DWUs) from 100 to 30,000.
- Spark pools run Apache Spark notebooks for data engineering, ML, and complex transformations. Auto-scale cores based on workload. Integration with Delta Lake for ACID transactions on the data lake.
- Synapse Link: Zero-ETL replication from Cosmos DB, SQL Server, or Dataverse into Synapse for analytical queries. No impact on the operational database.
Synapse Link for Cosmos DB is the zero-ETL answer
When an interviewer asks "how do you run analytics on your operational database without impacting production," Synapse Link is the answer. It automatically replicates Cosmos DB data to an analytical column store. You query the analytical store with serverless SQL or Spark. The operational store sees zero impact. No ETL pipeline to build or maintain.
Production gotcha: Dedicated SQL pool costs accumulate even when idle. A DW1000c costs ~$12/hour ($8,600/month). I have seen teams leave dedicated pools running 24/7 for dashboards that are viewed only during business hours. Use auto-pause (pauses after N minutes of inactivity) or scheduled scaling to reduce costs.
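The hash-distribution mechanics behind the dedicated pool can be illustrated with a toy sketch. The real engine's hash function is internal; the point is the mechanism, and in particular why a low-cardinality distribution key causes skew.

```python
from collections import Counter
import hashlib

def distribution_for(key: str, num_distributions: int = 60) -> int:
    """Illustrative stand-in for hash distribution: a stable hash of
    the distribution column value, modulo the 60 distributions."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % num_distributions

# Every row with the same key value lands on the same distribution,
# so a skewed key (here, mostly "US") creates one hot distribution
# while the other 57+ sit idle.
rows = ["US"] * 900 + ["DE"] * 50 + ["JP"] * 50
load = Counter(distribution_for(k) for k in rows)
print(len(load), max(load.values()))
```

This is the standard interview follow-up: pick a high-cardinality, evenly distributed column (order ID, not country code) as the hash key, or query performance degrades to the speed of the hottest distribution.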
Azure Data Factory
Data Factory is Azure's managed ETL/ELT service. It orchestrates data movement and transformation at scale using a visual pipeline designer.
Key talking points:
- Integration Runtimes: Azure IR (cloud-based, auto-managed), Self-hosted IR (for on-premises and private network data sources), and Azure-SSIS IR (run SSIS packages in the cloud).
- Copy Activity moves data between 90+ sources and destinations. It handles schema mapping, data type conversion, and fault tolerance automatically.
- Mapping Data Flows provide a visual drag-and-drop interface for data transformations. Under the hood, they compile to Spark jobs.
- Tumbling window triggers process data in fixed time intervals with dependency chaining. "Process today's data only after yesterday's data is complete."
- ADF is the same engine as Synapse Pipelines. If you already have a Synapse workspace, use Synapse Pipelines instead of creating a separate ADF instance.
Self-hosted Integration Runtime is the on-premises bridge
If the interview involves data migration from on-premises, mention the Self-hosted IR. It is a lightweight agent that runs on a Windows VM in the customer's network and securely moves data to Azure without opening inbound firewall ports. All communication is outbound HTTPS.
HDInsight
HDInsight is Azure's managed open-source analytics service. It provides fully managed clusters for Hadoop, Spark, Hive, Kafka, HBase, and Interactive Query.
Key talking points:
- Fully managed clusters with auto-scaling and enterprise security (Entra ID, network isolation, encryption).
- Supports Apache Spark 3.x, Hadoop 3.x, Kafka 2.x, and HBase.
- Good for teams with existing Hadoop/Spark expertise who want a managed service without changing their code.
- Being gradually superseded by Azure Databricks for most Spark workloads and Event Hubs for Kafka workloads.
Azure Databricks
Databricks is a jointly developed service between Microsoft and Databricks. It provides a collaborative Spark platform optimized for data engineering, data science, and ML.
Key talking points:
- Photon engine: A C++ native vectorized query engine that accelerates Spark SQL workloads by up to 12x. This is the performance differentiator over vanilla Spark.
- Unity Catalog: Centralized governance for data and AI assets across workspaces. Fine-grained access control at the table, column, and row level.
- Delta Lake: Open-source ACID transaction layer for data lakes. Supports time travel (query data as of a timestamp), schema evolution, and MERGE operations.
- Databricks SQL: Serverless SQL warehouses for BI queries on Delta Lake tables. Directly competes with Synapse serverless SQL.
- Cost: Databricks charges DBU (Databricks Units) on top of Azure VM costs. Can be 2-3x more expensive than raw Spark on HDInsight, but the productivity gains usually justify the cost.
Stream Analytics
Stream Analytics is a real-time analytics service that processes streaming data from Event Hubs, IoT Hub, and Blob Storage using a SQL-like query language.
Key talking points:
- SQL-like query language with windowing functions (Tumbling, Sliding, Hopping, Session windows). "Count events per device in 5-minute tumbling windows."
- Built-in anomaly detection: detect spikes, dips, and slow trends in real-time streams.
- Reference data joins: enrich streaming events with static data from Blob Storage or SQL Database.
- Exactly-once processing guarantees for outputs to SQL, Cosmos DB, and Data Lake.
- Scales to millions of events per second with a single query.
Stream Analytics vs Databricks Structured Streaming
Use Stream Analytics for simple stream processing with SQL-like queries (aggregations, joins, anomaly detection). Use Databricks Structured Streaming for complex transformations, ML inference on streams, or when you need the full power of Spark. Stream Analytics is simpler and cheaper for straightforward use cases.
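The tumbling window from the talking points above can be sketched in plain Python to show the mechanics; the event shape and window size are illustrative assumptions.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=300):
    """Count events per device in fixed, non-overlapping (tumbling)
    windows, mirroring 'count events per device in 5-minute tumbling
    windows'. Each event is (epoch_seconds, device_id)."""
    counts = defaultdict(int)
    for ts, device in events:
        window_start = ts - ts % window_seconds   # floor to the window boundary
        counts[(window_start, device)] += 1
    return dict(counts)

events = [(0, "dev1"), (10, "dev1"), (299, "dev2"), (300, "dev1")]
print(tumbling_counts(events))
# -> {(0, 'dev1'): 2, (0, 'dev2'): 1, (300, 'dev1'): 1}
```

The defining property of tumbling windows is that each event belongs to exactly one window; sliding and hopping windows relax that, which is the distinction worth naming in an interview.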
10. Infrastructure as Code
Infrastructure as Code (IaC) defines and provisions Azure resources using declarative templates or code. In interviews, mentioning IaC shows operational maturity.
| Azure Tool | AWS Equivalent | GCP Equivalent | Best For |
|---|---|---|---|
| ARM Templates | CloudFormation | Deployment Manager | Native Azure IaC (JSON) |
| Bicep | CloudFormation | N/A | Native Azure IaC (cleaner syntax) |
| Terraform | Terraform (same) | Terraform (same) | Multi-cloud, team familiarity |
ARM Templates
Azure Resource Manager (ARM) templates are JSON files that define Azure resources declaratively. They are the native IaC format for Azure.
Key talking points:
- JSON-based declarative syntax. Verbose but complete. Every Azure resource supports ARM templates.
- Deployment modes: Incremental (add/update resources, leave others) and Complete (delete resources not in template).
- Template specs allow versioned, reusable templates stored in Azure.
- What-if operation shows changes before deployment.
- The JSON syntax is verbose and hard to read. Bicep is the modern replacement.
Bicep
Bicep is a domain-specific language (DSL) that compiles to ARM templates. It provides cleaner syntax, type safety, and IntelliSense support in VS Code.
Key talking points:
- Bicep is a transparent abstraction over ARM templates. Anything ARM can do, Bicep can do with cleaner syntax.
- Modules enable reusable components. "Deploy a standard AKS cluster" becomes a one-line module call.
- Native VS Code extension with IntelliSense, validation, and go-to-definition.
- Supports deployment stacks for managing resources as a group with deny assignments (prevent accidental deletion).
Bicep is the right answer for Azure-only shops
If the interview is Azure-focused, recommend Bicep. It has the best Azure support, the fastest feature parity with new Azure services, and the cleanest syntax. It compiles to ARM, so there is zero abstraction penalty. I would choose Bicep over Terraform for Azure-only infrastructure because Bicep supports new Azure features on day one, while the Terraform Azure provider often lags by weeks or months.
Terraform on Azure
Terraform (by HashiCorp) uses the AzureRM provider to manage Azure resources. It is the most popular multi-cloud IaC tool.
Key talking points:
- HCL (HashiCorp Configuration Language) syntax. Declarative with state management.
- State file tracks the current state of deployed resources. Must be stored securely (Azure Storage Account backend with state locking).
- Plan/Apply workflow: `terraform plan` shows changes, `terraform apply` executes them.
- The AzureRM provider covers most Azure resources, but new features appear days to weeks after Azure releases them.
- Best choice for multi-cloud environments or teams already using Terraform for AWS/GCP.
Production gotcha: Terraform state file corruption is a real risk. I have seen teams lose their state file and have to import 200+ resources manually. Always store state in a remote backend (Azure Storage Account with versioning enabled) and enable state locking to prevent concurrent modifications.
Azure Decision Flowchart
When designing a system on Azure, use this mental model to select services: map each requirement to a service category (compute, storage, database, messaging, edge), then choose the specific service based on the tradeoffs above. This is not exhaustive, but it covers the 80% case. In an interview, walking through a decision tree like this shows structured thinking.
The interview pattern
When asked to design a system on Azure, start with requirements, map each requirement to a service category, then pick the specific service using the tradeoffs discussed above. This shows you are not just memorizing services but reasoning about tradeoffs.
Quick Reference: Azure vs AWS vs GCP
| Category | Azure | AWS | GCP |
|---|---|---|---|
| Serverless compute | Azure Functions | Lambda | Cloud Functions |
| Container orchestration | AKS | EKS | GKE |
| Serverless containers | Container Apps | Fargate | Cloud Run |
| Object storage | Blob Storage | S3 | Cloud Storage |
| Relational DB | Azure SQL | Aurora / RDS | Cloud SQL / AlloyDB |
| NoSQL (document) | Cosmos DB | DynamoDB | Firestore |
| Global DB | Cosmos DB | DynamoDB Global Tables | Spanner |
| In-memory cache | Azure Cache for Redis | ElastiCache | Memorystore |
| Message queue | Service Bus | SQS | Pub/Sub |
| Event streaming | Event Hubs | Kinesis | Pub/Sub |
| Event routing | Event Grid | EventBridge | Eventarc |
| CDN + global LB | Front Door | CloudFront + Global Accelerator | Cloud CDN |
| DNS | Azure DNS | Route 53 | Cloud DNS |
| Identity | Entra ID | IAM + Cognito | Cloud Identity |
| Secrets management | Key Vault | Secrets Manager | Secret Manager |
| Monitoring | Azure Monitor | CloudWatch | Cloud Monitoring |
| APM | Application Insights | X-Ray | Cloud Trace |
| SIEM | Sentinel | GuardDuty | Chronicle |
| LLM API | Azure OpenAI | Bedrock | Vertex AI |
| Search | AI Search | OpenSearch / Kendra | Vertex AI Search |
| Data warehouse | Synapse | Redshift | BigQuery |
| ETL | Data Factory | Glue | Dataflow |
| Spark platform | Databricks | EMR / Databricks | Dataproc / Databricks |
| Stream processing | Stream Analytics | Kinesis Analytics | Dataflow |
| IaC (native) | Bicep | CloudFormation | Deployment Manager |