GCP core services for system design interviews
A practitioner guide to GCP services that matter for system design: Spanner, BigQuery, Cloud Run, Pub/Sub, Cloud Storage, Firestore, and their AWS and Azure equivalents.
Why GCP Comes Up in System Design Interviews
Google Cloud Platform sits on the same infrastructure that powers Google Search, YouTube, and Gmail. That heritage gives GCP a handful of services that have no true equivalent anywhere else: Spanner (globally consistent relational database), BigQuery (serverless petabyte-scale analytics), and Pub/Sub (global messaging with no regional boundaries). If your interviewer is a Google engineer, or the company runs on GCP, you need to know these services cold.
I find that GCP appears in interviews less frequently than AWS, but when it does, interviewers expect depth. They want to hear about TrueTime, about Dremel's column-oriented execution, and about how Cloud Run bridges the gap between serverless functions and full Kubernetes. This guide covers exactly what you need.
The structure is simple. I walk through every service category, give you the one-liner, the AWS/Azure mapping, the architecture diagram, and the production gotchas that separate senior engineers from everyone else.
How to use this guide
Read the categories relevant to your interview prep first. Every service section includes the AWS/Azure equivalent so you can map concepts you already know. The Bad/Better/Best expandables show you how interviewers evaluate your answers.
1. Compute
GCP's compute story starts with Borg, the internal cluster manager that inspired Kubernetes. That lineage shows up everywhere: GKE is the most mature managed Kubernetes offering, Cloud Run is the smoothest serverless container experience, and even Compute Engine benefits from Google's live migration technology.
Compute Engine
What it solves: Virtual machines on Google's infrastructure, with live migration that moves running VMs between hosts without downtime.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Compute Engine | EC2 | Azure Virtual Machines |
Key talking points:
- Live migration is the differentiator. Google moves your running VM to another host for maintenance without rebooting. AWS and Azure do not offer this for most instance types.
- Custom machine types let you pick exact vCPU and memory ratios instead of choosing from fixed sizes. I use this when a workload needs 16 vCPUs but only 24 GB RAM.
- Preemptible VMs (now called Spot VMs) cost 60-91% less but can be terminated with 30 seconds notice. Good for batch processing, bad for serving traffic.
- Sole-tenant nodes give you an entire physical server. Required for licensing compliance (Oracle, SQL Server) and regulated workloads.
Interview tip: live migration
When comparing GCP to AWS, mention live migration. It is the single most impressive infrastructure feature GCP has. Say: "Compute Engine can move running VMs between physical hosts with zero downtime, which means I get near-zero maintenance windows."
Production gotchas:
I have seen teams get burned by persistent disk IOPS limits. A pd-standard disk gives you 0.75 read IOPS per GB and 1.5 write IOPS per GB. A 100 GB standard disk gives you only 75 read IOPS. If you need serious I/O, use pd-ssd (30 IOPS/GB) or pd-balanced (6 IOPS/GB). The mistake I see most often is teams provisioning small standard disks and wondering why their database is slow.
Network egress costs are the silent budget killer. GCP charges $0.08-0.12/GB for inter-region traffic and $0.08-0.23/GB for internet egress. A service sending 10 TB/month to the internet costs $800-2,300 just in network fees.
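The egress figures above are easy to sanity-check with quick arithmetic. A rough sketch only; real bills depend on destination, volume tiers, and discounts:

```python
# Back-of-envelope GCP network egress cost, using the per-GB rates quoted
# above. Treat the rates as illustrative, not a pricing reference.
def monthly_egress_cost(tb_per_month: float, rate_per_gb: float) -> float:
    """Cost in USD for a month of egress at a flat per-GB rate."""
    return tb_per_month * 1024 * rate_per_gb

# 10 TB/month to the internet at the low and high ends of the quoted range:
low = monthly_egress_cost(10, 0.08)   # ~$819/month
high = monthly_egress_cost(10, 0.23)  # ~$2,355/month
```

The spread alone is a useful interview point: the same workload can triple in network cost depending on destination.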
Cloud Run
What it solves: Serverless containers. You push a container image, Cloud Run handles scaling from zero to thousands of instances, including all the infrastructure.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Run | AWS Fargate + API Gateway | Azure Container Apps |
This is my favorite GCP compute service. Cloud Run sits in the sweet spot between Lambda (too limited: 15-minute timeout, no container support until recently) and Kubernetes (too complex for most workloads). You get the simplicity of serverless with the flexibility of containers.
Key talking points:
- Concurrency model is what separates Cloud Run from Lambda. A single Cloud Run instance handles up to 1000 concurrent requests (default 80). Lambda runs one request per instance. This means Cloud Run is dramatically more efficient for I/O-bound workloads.
- Min instances eliminate cold starts. Set min-instances: 2 and you always have warm containers ready. Costs more, but for latency-sensitive APIs this is non-negotiable.
- Traffic splitting lets you do canary deployments natively. Send 5% of traffic to a new revision, watch metrics, then promote. No service mesh required.
- Startup CPU boost gives your container extra CPU during startup to reduce cold start time. Free and enabled by default.
- Cloud Run jobs handle batch workloads. Run a container to completion without exposing an HTTP endpoint.
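The concurrency point above can be made concrete with a Little's-law sketch: instances needed is roughly arrival rate times latency divided by per-instance concurrency. This is a rough capacity estimate, not Cloud Run's actual autoscaler algorithm:

```python
import math

# Little's-law sketch of why per-instance concurrency matters for
# I/O-bound services: instances ≈ RPS × latency ÷ concurrency.
def instances_needed(rps: float, latency_s: float, concurrency: int) -> int:
    in_flight = rps * latency_s               # L = λW: requests in flight
    return math.ceil(in_flight / concurrency)

instances_needed(1000, 0.25, 80)  # 4 instances at the default concurrency of 80
instances_needed(1000, 0.25, 1)   # 250 instances in a one-request-per-instance model
```

Same traffic, two orders of magnitude fewer instances: that is the efficiency argument for the Cloud Run concurrency model over Lambda's.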
The cold start reality
Cloud Run cold starts range from 300ms to 2s depending on your container size and language runtime. JVM-based services are the worst offenders. My recommendation: use min instances for anything in the request path, use GraalVM native images for Java services, and keep your container images under 200 MB.
Production gotchas:
The request timeout catches people. Cloud Run enforces a per-request timeout (default 5 minutes, configurable up to 60 minutes) and kills any HTTP request that exceeds it. For truly long-running work, send the job to Pub/Sub and process it asynchronously, or use Cloud Run jobs.
I have seen teams hit the 8 GB memory limit per instance and not realize Cloud Run now supports up to 32 GB with the second-gen execution environment. If you are running memory-intensive workloads, switch to gen2.
Google Kubernetes Engine (GKE)
What it solves: Managed Kubernetes. GKE was the first managed Kubernetes service (launched 2015) and remains the most tightly integrated with upstream Kubernetes.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| GKE Standard | Amazon EKS | Azure Kubernetes Service (AKS) |
| GKE Autopilot | EKS with Fargate | AKS with Virtual Nodes |
Key talking points:
- Autopilot vs Standard: Autopilot is the right default. Google manages the nodes, you only define pods. You pay per pod resource request, not per node. Standard gives you node-level control but you own the ops burden.
- Release channels: Rapid, Regular, and Stable. Use Regular for production. Rapid gets new features first but has more risk.
- Multi-cluster mesh: GKE supports Anthos Service Mesh for cross-cluster communication. Use it when you need multi-region active-active Kubernetes.
- Workload Identity: Maps Kubernetes service accounts to GCP IAM service accounts. Never use node-level service accounts in production.
Autopilot bin-packing
GKE Autopilot charges you for pod resource requests, not actual usage. If your pod requests 2 CPU and 4 GB RAM but only uses 0.5 CPU and 1 GB, you still pay for 2 CPU and 4 GB. Right-size your resource requests aggressively. Use Vertical Pod Autoscaler (VPA) in recommendation mode to find the right numbers.
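A toy calculation makes the request-vs-usage gap concrete. The per-unit prices below are invented placeholders, not Autopilot's real rates; only the shape of the math matters:

```python
# Illustrative gap between what Autopilot bills (pod resource *requests*)
# and what the pod actually uses. cpu_price/gib_price are made-up
# placeholder rates, not real GCP pricing.
def autopilot_hourly_cost(cpu: float, gib: float,
                          cpu_price: float = 0.05, gib_price: float = 0.005) -> float:
    return cpu * cpu_price + gib * gib_price

requested = autopilot_hourly_cost(2.0, 4.0)  # billed: the pod's requests
actual = autopilot_hourly_cost(0.5, 1.0)     # what usage-based billing would cost
waste_pct = (requested - actual) / requested * 100  # share of spend buying idle capacity
```

With the example numbers from the callout, three quarters of the bill is paying for capacity the pod never touches, which is why right-sizing requests matters more on Autopilot than on Standard.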
Production gotchas:
The biggest GKE failure mode I see is teams running Standard clusters without understanding node auto-provisioning. They create a single node pool with e2-standard-4 instances, deploy a mix of workloads, and wonder why bin-packing is terrible. Use multiple node pools with different machine types, or just switch to Autopilot.
GKE upgrades can break workloads if you use deprecated APIs. Google auto-upgrades control planes (you cannot opt out on Autopilot), so keep your Kubernetes manifests on current API versions. I have seen teams lose a weekend debugging a broken Deployment because they used extensions/v1beta1 and the upgrade removed it.
Cloud Functions
What it solves: Event-driven serverless functions. Write a function, attach a trigger (HTTP, Pub/Sub, Cloud Storage event, Firestore change), and Google handles everything else.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Functions (2nd gen) | AWS Lambda | Azure Functions |
Key talking points:
- 2nd gen is built on Cloud Run. This means you get concurrency (up to 1000 requests per instance), longer timeouts (up to 60 minutes), and larger instances (up to 16 GB RAM, 4 vCPUs). Always use 2nd gen for new functions.
- Event-driven triggers via Eventarc give you a unified eventing model. Any GCP service that produces audit logs can trigger a function.
- Min instances work the same as Cloud Run. Set them for latency-sensitive functions.
Cloud Functions vs Cloud Run
Cloud Functions 2nd gen is literally Cloud Run under the hood. The difference is developer experience: Cloud Functions gives you a simpler deployment model (just push code, no Dockerfile). I prefer Cloud Run when I need full control over the container, and Cloud Functions when I want the fastest path from code to production for simple event handlers.
Production gotchas:
The 1st gen to 2nd gen migration is not automatic and the APIs are different. I have seen teams stuck on 1st gen functions with 256 MB memory limits and single-concurrency, not realizing that 2nd gen solves both problems.
Cold starts on 1st gen are brutal (2-5 seconds for Python/Node). 2nd gen is better (500ms-1.5s) because it reuses Cloud Run's infrastructure. Use min instances for anything user-facing.
App Engine
What it solves: The original Platform-as-a-Service on GCP. Push code, App Engine handles deployment, scaling, and load balancing.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| App Engine Standard | AWS Elastic Beanstalk | Azure App Service |
| App Engine Flexible | AWS Elastic Beanstalk (Docker) | Azure App Service (Custom Container) |
Key talking points:
- Standard environment supports Python, Java, Node.js, Go, PHP, and Ruby with automatic scaling to zero. Startup times are fast (sub-second for Python/Go).
- Flexible environment runs custom Docker containers on managed VMs. It does not scale to zero.
- App Engine was revolutionary in 2008 but Cloud Run has largely replaced it for new projects.
App Engine in interviews
If an interviewer asks about App Engine, acknowledge its historical importance but recommend Cloud Run for new projects. Say: "App Engine was groundbreaking for PaaS, but Cloud Run gives me container portability, better pricing, and the same scaling model. I would only use App Engine for existing applications that are already deployed there."
My honest take: do not start new projects on App Engine. Cloud Run is better in every dimension except one: App Engine's cron service is slightly simpler to configure. But Cloud Scheduler + Cloud Run achieves the same thing.
2. Storage
GCP's storage services are straightforward. Cloud Storage is the object store (like S3), Persistent Disk is the block store (like EBS), and Filestore is the managed NFS (like EFS). The differentiator is Cloud Storage's multi-regional option and its tight integration with BigQuery.
Cloud Storage (GCS)
What it solves: Object storage for any amount of unstructured data. Files, images, backups, data lake, static website hosting.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Storage | Amazon S3 | Azure Blob Storage |
Key talking points:
- Storage classes with automatic lifecycle transitions. Set a policy: move to Nearline after 30 days, Coldline after 90, Archive after 365. This alone can cut storage costs by 70%.
- Multi-regional buckets replicate data across regions automatically. Use the us or eu multi-region for data that needs high availability without you managing replication.
- Uniform bucket-level access replaces legacy per-object ACLs. Always use this for new buckets. It simplifies IAM dramatically.
- BigQuery external tables can query Cloud Storage directly without loading data. This is powerful for ad-hoc analysis of raw files.
Signed URLs for secure access
Never make buckets public for file sharing. Use signed URLs that expire after a set time. Generate them server-side and hand them to clients. This gives you fine-grained access control without exposing your bucket to the internet.
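As a sketch of the idea only (not the real GCS V4 signing protocol, which the client libraries implement for you via methods like generate_signed_url), an expiring signed URL is just an HMAC over the path and expiry:

```python
import hashlib
import hmac

# Conceptual sketch of an expiring signed URL: the server signs
# (path, expiry) with a secret, so clients can use the URL but cannot
# forge or extend it. Real GCS signed URLs use V4 signing; this is
# just the underlying idea.
SECRET = b"server-side-secret"  # hypothetical key, never shipped to clients

def sign_url(path: str, expires_at: int) -> str:
    msg = f"{path}?expires={expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={sig}"

def verify_url(url: str, now: int) -> bool:
    path_q, sig = url.rsplit("&sig=", 1)
    expected = hmac.new(SECRET, path_q.encode(), hashlib.sha256).hexdigest()
    expires_at = int(path_q.rsplit("expires=", 1)[1])
    # Reject both tampered signatures and expired URLs.
    return hmac.compare_digest(sig, expected) and now < expires_at
```

The expiry lives inside the signed message, so a client cannot extend its own access by editing the query string.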
Production gotchas:
The 5 TB single object limit rarely matters, but request-rate hotspotting does. Cloud Storage scales a bucket's request capacity automatically, but it does so across the key range: if all your objects start with 2026/04/15/, the sequential names concentrate load on one range and you will hit throttling at scale before the ramp-up catches up. Use random or hashed prefixes to distribute load across the key space.
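One common mitigation is a short hash prefix on object names. A minimal sketch, assuming the first few hex characters of a hash are enough to scatter keys:

```python
import hashlib

# Scatter lexicographically adjacent object names (e.g. date-prefixed
# log files) across the bucket's key range by prepending a short,
# deterministic hash of the name.
def spread_key(name: str) -> str:
    prefix = hashlib.md5(name.encode()).hexdigest()[:4]
    return f"{prefix}/{name}"

spread_key("2026/04/15/a.log")  # e.g. "3f2a/2026/04/15/a.log" (prefix varies by name)
```

The prefix is derived from the name itself, so lookups stay deterministic: you can always recompute the full key from the logical name.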
I have seen teams accidentally enable requester-pays on a public bucket and then wonder why downloads started failing. Requester-pays means the downloader pays for egress and operations, but only if they authenticate; unauthenticated requests to a requester-pays bucket just fail, so it is not a drop-in fix for public file sharing.
Object versioning is off by default. Turn it on for any bucket that holds data you cannot regenerate. It has saved me from accidental deletions more than once.
Persistent Disk
What it solves: Block storage volumes that attach to Compute Engine VMs and GKE nodes. Think of them as virtual hard drives.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Persistent Disk | Amazon EBS | Azure Managed Disks |
Key talking points:
- Four types: pd-standard (HDD, cheap), pd-balanced (SSD, good default), pd-ssd (high IOPS SSD), and pd-extreme (highest IOPS, for databases).
- Regional persistent disks replicate synchronously across two zones. Use this for databases where you need zone-level redundancy.
- Snapshots are incremental and stored in Cloud Storage. Schedule daily snapshots for disaster recovery.
- Disks can be resized online without downtime. You can also change disk type (e.g., from pd-balanced to pd-ssd) without detaching.
IOPS scaling
Persistent Disk IOPS scales with disk size. A 100 GB pd-ssd gives 3,000 IOPS. A 1 TB pd-ssd gives 30,000 IOPS. If you need more IOPS, provision a larger disk even if you do not need the space. This is the same model as AWS EBS gp3, except gp3 lets you provision IOPS independently.
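Encoding the per-GB rates from this guide as a tiny calculator (per-VM and per-disk IOPS caps are not modeled here):

```python
# Read-IOPS-per-GB rates quoted in this guide; Persistent Disk IOPS
# scales linearly with provisioned size, up to caps not modeled here.
IOPS_PER_GB = {"pd-standard": 0.75, "pd-balanced": 6, "pd-ssd": 30}

def read_iops(disk_type: str, size_gb: int) -> float:
    return IOPS_PER_GB[disk_type] * size_gb

read_iops("pd-ssd", 100)       # 3,000 IOPS: matches the figure above
read_iops("pd-ssd", 1000)      # 30,000 IOPS for a 1 TB disk
read_iops("pd-standard", 100)  # only 75 read IOPS: the slow-database trap
```

This is why "provision a bigger disk than you need" is a legitimate performance tactic on GCP, unlike EBS gp3 where IOPS is a separate dial.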
Filestore
What it solves: Managed NFS file shares for workloads that need a shared file system across multiple VMs.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Filestore | Amazon EFS | Azure Files (NFS) |
Key talking points:
- Minimum 1 TB capacity for Basic tier, 2.5 TB for Enterprise tier.
- Enterprise tier provides regional availability (multi-zone replication) with an SLA of 99.99%.
- Common use cases: content management systems, legacy applications that need a shared filesystem, and GKE workloads with ReadWriteMany persistent volumes.
My recommendation: avoid Filestore unless you truly need POSIX-compliant shared file access. Cloud Storage is cheaper and more scalable for most use cases. Filestore is for when you have software that literally requires a filesystem mounted at /mnt/data.
3. Databases
This is where GCP shines brightest. Cloud Spanner is the only globally consistent relational database available as a managed service anywhere. Bigtable is the original wide-column store that inspired HBase and Cassandra. Firestore gives you real-time document synchronization. The database lineup is GCP's strongest hand in system design interviews.
Cloud SQL
What it solves: Managed MySQL, PostgreSQL, and SQL Server. The "just give me a relational database" option.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud SQL | Amazon RDS | Azure SQL Database |
Key talking points:
- Cloud SQL Auth Proxy is the recommended way to connect. It handles TLS, IAM authentication, and connection management. Never whitelist IPs directly.
- High availability uses regional persistent disk and automatic failover. Failover takes 60-120 seconds, during which the database is unavailable for writes.
- Read replicas can be cross-region for global read distribution. But they are asynchronous, so expect replication lag of 100ms to seconds.
- Max instance size: 96 vCPUs, 624 GB RAM. If you need more, you have outgrown Cloud SQL and should look at AlloyDB or Spanner.
Connection limits
Cloud SQL has hard connection limits based on instance size. A db-custom-4-16384 instance supports ~4,000 connections. If you have 200 Cloud Run instances each opening 20 connections, you have already hit the limit. Use the Cloud SQL Auth Proxy with connection pooling, or run pgBouncer as a sidecar.
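A quick budget check using the figures from the example above. The 80% headroom factor is an arbitrary choice to leave room for admin and replication connections:

```python
# Connection-budget check: serverless autoscaling multiplies pool sizes
# fast, so cap instance counts against the database's connection limit.
def max_safe_instances(conn_limit: int, pool_size: int,
                       headroom: float = 0.8) -> int:
    """App instances you can run before exhausting the DB's connections,
    reserving (1 - headroom) of the limit for admin and replication."""
    return int(conn_limit * headroom) // pool_size

max_safe_instances(4000, 20)  # 160 instances, not the 200 from the example
```

Running this arithmetic before setting Cloud Run's max-instances is cheaper than discovering the limit in production.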
Production gotchas:
The mistake I see most often is teams using the Cloud SQL Admin API to connect instead of the Auth Proxy. The Admin API generates an ephemeral SSL certificate, but it does not pool connections or handle reconnection. The Auth Proxy does both.
Storage auto-increase is on by default, which is good. But it only grows; it never shrinks. I have seen teams with a 500 GB disk after a data migration spike, paying for storage they no longer need. You have to manually export data, recreate the instance, and import.
For your interview: "I would use Cloud SQL for straightforward relational workloads where the data fits in a single region and my write throughput is under 10,000 QPS. Beyond that, I would look at AlloyDB or Spanner."
Cloud Spanner
What it solves: The world's only horizontally scalable, globally consistent relational database. It gives you SQL, ACID transactions, and schemas that span the planet with strong consistency everywhere.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Spanner | Amazon Aurora (closest, but not global) | Azure Cosmos DB (multi-model, eventual consistency default) |
This is GCP's crown jewel. No other cloud provider offers anything truly equivalent. Aurora is regional. Cosmos DB defaults to eventual consistency and is not relational. Spanner gives you globally distributed, strongly consistent, relational, SQL-compatible transactions.
Key talking points:
- TrueTime is the magic behind Spanner's consistency. Google uses GPS receivers and atomic clocks to give every server a global clock with bounded uncertainty (typically under 7ms). This lets Spanner assign globally ordered timestamps to transactions without a single coordinator.
- External consistency means that if transaction T1 commits before T2 starts, T1's timestamp is guaranteed to be less than T2's. This is stronger than serializable isolation.
- Processing units (PU) are the billing model. 1 node = 1000 PU. Each node handles ~10,000 reads/sec or 2,000 writes/sec. Minimum cost: ~$0.90/hr for a single-region instance (1 node).
- Interleaved tables are unique to Spanner. You physically co-locate parent and child rows for fast joins. Use this for one-to-many relationships that are always queried together.
Spanner cost reality
Spanner is expensive. A 3-node multi-region instance costs roughly $6,500/month. I have seen teams underestimate the cost by 10x because they priced a single processing unit and forgot you need at minimum 3 nodes for a multi-region config. Always calculate your Spanner bill before committing.
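Back-of-envelope math for the figures above. The multi-region per-node rate below is an assumed placeholder chosen to reproduce the ~$6,500/month estimate, not a quoted Google price, and storage and egress are extra:

```python
HOURS_PER_MONTH = 730

# Rough monthly Spanner compute bill. The $0.90/hr single-region node
# rate comes from the text; the $3/hr multi-region rate is an assumed
# illustrative figure, not official pricing.
def spanner_monthly(nodes: int, node_hourly_rate: float) -> float:
    return nodes * node_hourly_rate * HOURS_PER_MONTH

spanner_monthly(1, 0.90)  # ~$657/month: the single-region floor
spanner_monthly(3, 3.00)  # ~$6,570/month: the 3-node multi-region figure
```

Doing this multiplication out loud in an interview is exactly the "calculate your Spanner bill before committing" habit the callout describes.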
When to use Spanner vs Cloud SQL:
Use Cloud SQL when your data fits in one region and your write throughput is under 10K QPS. Use Spanner when you need global consistency, horizontal write scaling beyond what a single machine can handle, or when your SLA requires 99.999% availability (Spanner's multi-region SLA).
The honest answer? Most applications do not need Spanner. But when you do need it (global financial systems, multi-region inventory, gaming leaderboards with strong consistency), nothing else comes close.
Production gotchas:
Hot-spotting is Spanner's most common failure mode. If your primary key is a monotonically increasing integer, all writes go to the same split server. Use UUIDs, or bit-reverse sequential IDs. Spanner's documentation explicitly warns against this, but I still see it in production.
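A minimal sketch of the bit-reversal trick for sequential IDs (one of the two options named above; random UUIDs achieve the same scattering):

```python
# Bit-reverse a sequential 64-bit ID so consecutive inserts land on
# different key ranges (Spanner splits) instead of one hot tail split.
def bit_reverse_64(n: int) -> int:
    result = 0
    for _ in range(64):
        result = (result << 1) | (n & 1)
        n >>= 1
    return result

# Consecutive IDs 1, 2, 3 map to keys far apart in the 64-bit key space,
# and the mapping is its own inverse, so the original ID is recoverable.
```

The involution property matters in practice: you can store the reversed key and still recover the original sequence number without a lookup table.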
Spanner does not support all PostgreSQL features even in PostgreSQL interface mode. Window functions, CTEs, and most DDL work, but some data types and extensions are missing. Test your queries on Spanner's emulator before committing.
Firestore
What it solves: Serverless document database with real-time synchronization. Firestore is the production version of Firebase's real-time database, rebuilt for scale.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Firestore | Amazon DynamoDB (closest) | Azure Cosmos DB (document mode) |
Key talking points:
- Real-time listeners push changes to connected clients within milliseconds. No polling required. This is the killer feature for chat apps, collaborative tools, and dashboards.
- Security rules let mobile and web clients query the database directly (no backend needed for simple apps). Rules are written in a custom DSL that checks authentication and data validation.
- 1 sustained write per second per document is the practical limit. If your chat room document gets 100 messages/second, you need to shard it into subcollections. This catches many teams off guard.
- Native mode vs Datastore mode: Native mode is Firestore with all features. Datastore mode is backward-compatible with the legacy Datastore API. New projects should always use Native mode.
The 1 write/sec/document limit
This limit is about sustained write rates, not occasional bursts. Push a document past roughly 1 write per second for a sustained period and Firestore starts returning contention errors and elevated latency. For counters, leaderboards, or any high-write-rate field, use distributed counters (split the count across 10-50 shards) or move the hot path to Memorystore.
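The distributed-counter pattern can be sketched with a plain dict standing in for the shard documents. In Firestore, each shard would be a document in a subcollection, and the read would fetch and sum them:

```python
import random

NUM_SHARDS = 10  # 10-50 shards is the range suggested above

# Spread increments across N shard entries so no single document
# absorbs the full write rate; reads sum all shards.
def increment(shards: dict, amount: int = 1) -> None:
    shard_id = random.randrange(NUM_SHARDS)  # pick a shard at random
    shards[shard_id] = shards.get(shard_id, 0) + amount

def read_total(shards: dict) -> int:
    return sum(shards.values())
```

The trade-off is explicit: writes scale by roughly the shard count, while reads now cost one fetch per shard instead of one.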
Production gotchas:
Firestore queries require indexes, and composite indexes must be created explicitly. If you add a new query pattern in code and forget to create the index, the query fails at runtime. Firestore's error message helpfully includes a direct link to create the missing index, but this has caused production outages for teams that did not test their queries in staging.
Firestore pricing is per-operation, not per-compute. At high read volumes (millions of reads/day), Firestore can become more expensive than Cloud SQL. I have helped teams migrate from Firestore to Cloud SQL after their monthly bill hit $5,000 for what was essentially a relational query pattern wedged into a document model.
Cloud Bigtable
What it solves: Petabyte-scale, low-latency NoSQL database for time-series, IoT, and analytics workloads. Bigtable is the original paper that inspired HBase and Cassandra.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Bigtable | Amazon DynamoDB (for key-value) / Amazon Keyspaces | Azure Cosmos DB (Cassandra API) |
Key talking points:
- Separated storage and compute. Bigtable nodes handle processing, Colossus handles storage. You can scale nodes (compute) independently of data size.
- Single-row transactions only. No multi-row transactions. If you need them, use Spanner.
- Row key design is everything. A bad row key causes hot-spotting. Use a composite key like userID#reversedTimestamp for time-series data so recent events scatter across nodes while scans for a single user's history remain contiguous.
- Minimum 3 nodes in production. Each node handles roughly 10,000 rows/second for reads and writes. At $0.65/hr per node, a 3-node cluster costs ~$1,400/month.
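A sketch of the userID#reversedTimestamp scheme from the bullet above. The timestamp ceiling below is an assumed constant for illustration, not a Bigtable requirement:

```python
MAX_TS = 10**13  # assumed ceiling for millisecond timestamps (illustrative)

# Subtracting the timestamp from a fixed maximum makes newer events sort
# lexicographically first within a user's contiguous key range, while the
# userID prefix keeps each user's history together for range scans.
def row_key(user_id: str, ts_millis: int) -> str:
    reversed_ts = MAX_TS - ts_millis
    return f"{user_id}#{reversed_ts:013d}"  # zero-pad so string order == numeric order

# A newer event sorts before an older one for the same user:
row_key("user42", 1_700_000_000_000) < row_key("user42", 1_600_000_000_000)
```

The zero-padding is the easy-to-miss detail: without fixed-width timestamps, lexicographic ordering diverges from numeric ordering and range scans return events out of order.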
When Bigtable beats DynamoDB
Bigtable excels at sequential-key scans (time-series range queries). DynamoDB's scan operation is expensive and slow. If your workload is "give me all events for device X between midnight and 6 AM," Bigtable's row-range scan is an order of magnitude faster than DynamoDB's query.
This is the database for IoT time-series, financial tick data, or any workload that ingests billions of rows and needs sub-10ms point lookups. If you are designing a system with time-series data at scale in a GCP interview, reach for Bigtable.
Memorystore
What it solves: Managed Redis and Memcached. In-memory caching and session storage.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Memorystore for Redis | Amazon ElastiCache for Redis | Azure Cache for Redis |
| Memorystore for Memcached | Amazon ElastiCache for Memcached | Azure Cache for Memcached |
Key talking points:
- Redis Cluster mode supports up to 300 GB of data across shards. Use it for caching layers that need more than a single node's memory.
- Automatic failover for Redis instances with replicas. Failover takes 10-30 seconds.
- Memorystore connects via Private Service Access, meaning it lives inside your VPC. No public IP, no internet exposure.
- Redis 7.x support includes Redis Functions and ACLs.
Memorystore vs self-managed Redis
Memorystore is worth the premium over self-hosted Redis on Compute Engine. It handles patching, failover, monitoring, and backups. The only reason to self-host is if you need Redis modules (RediSearch, RedisJSON) which Memorystore does not support.
For interviews, I keep it simple: "For caching and session storage, I would use Memorystore for Redis. It gives me sub-millisecond reads, automatic failover, and I do not have to manage the infrastructure."
AlloyDB
What it solves: Google's PostgreSQL-compatible database that separates compute and storage for independent scaling. Think of it as Google's answer to Amazon Aurora.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| AlloyDB | Amazon Aurora PostgreSQL | Azure Cosmos DB for PostgreSQL |
Key talking points:
- 4x faster than standard PostgreSQL for transactional workloads and 100x faster for analytical queries (Google's benchmarks, grain of salt advised).
- Columnar engine processes analytical queries without a separate data warehouse. It automatically creates columnar copies of frequently queried tables.
- AI integration with Vertex AI for vector embeddings directly in the database. Use SELECT embedding(column) to generate embeddings.
- Storage scales automatically up to 128 TB. No pre-provisioning needed.
AlloyDB is relatively new (GA in 2022) and I would mention it in interviews to show you track GCP's evolution. Say: "For demanding PostgreSQL workloads, I would evaluate AlloyDB, which separates compute and storage like Aurora and includes a columnar engine for analytics."
4. Messaging and Streaming
Asynchronous communication is the backbone of any distributed system. GCP's Pub/Sub is the standout here: it is global by default, which is a massive differentiator from AWS SQS (regional) and Azure Service Bus (regional). Dataflow handles stream and batch processing on the same unified model.
Pub/Sub
What it solves: Global, serverless messaging for event-driven architectures. Decouple producers from consumers with at-least-once (or exactly-once) delivery.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Pub/Sub | Amazon SNS + SQS | Azure Service Bus |
Pub/Sub is global by default. When you publish a message, it is replicated across multiple zones and regions before acknowledgement. This is fundamentally different from SQS, which is a single-region service. In a system design interview, this matters when you are designing globally distributed event-driven systems.
Key talking points:
- Global by default. Messages are replicated across regions before the publish call returns. No SQS-style "create queue in us-east-1" regionality. This is the single most important differentiator from AWS.
- Exactly-once delivery is available (opt-in per subscription). This eliminates the need for idempotency in many consumers. It works by deduplicating based on message ID within the acknowledgement deadline window.
- Ordering keys guarantee messages with the same key are delivered in publish order. Use the order ID or user ID as the ordering key. Without ordering keys, message ordering is best-effort.
- Dead letter topics catch messages that fail processing after N attempts. Always configure these. I have seen teams lose thousands of events because failed messages expired from the subscription with no dead letter configured.
- Push vs Pull subscriptions: Push sends messages to an HTTP endpoint (Cloud Run, Cloud Functions). Pull lets your application poll for messages. Use Push for serverless consumers, Pull when you need flow control.
Pub/Sub for cross-service communication
My default pattern for microservices on GCP: every service publishes domain events to a Pub/Sub topic. Downstream services subscribe independently. This gives you loose coupling, replay capability (via seek), and natural fan-out without point-to-point HTTP calls.
Message size and throughput limits
Maximum message size is 10 MB. Maximum throughput per topic is effectively unlimited (Google scales it), but each subscription has a 10,000 messages/second acknowledge rate by default. Request a quota increase if you need more.
Production gotchas:
The acknowledge deadline is critical and misunderstood. If your consumer does not acknowledge a message within the deadline (default 10 seconds, max 600 seconds), Pub/Sub redelivers it. I have seen services process every message twice because the processing time exceeded the acknowledge deadline. Either extend the deadline with modAckDeadline or set a longer initial deadline.
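The deadline interaction reduces to a simple predicate. A toy model of the behavior described above, not the client library's actual lease-management logic:

```python
# A message is redelivered when processing outlasts the ack deadline and
# the consumer never extends the lease (modAckDeadline).
def will_redeliver(processing_secs: float, ack_deadline_secs: float,
                   extends_deadline: bool) -> bool:
    if extends_deadline:
        return False  # consumer keeps extending the deadline while it works
    return processing_secs > ack_deadline_secs

will_redeliver(30, 10, extends_deadline=False)  # True: every message processed twice
will_redeliver(30, 10, extends_deadline=True)   # False: lease extended during work
```

The official client libraries extend leases automatically in the background, which is one more reason to prefer them over raw HTTP pulls.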
Pub/Sub's "exactly-once" mode requires the client library to detect and discard redelivered messages. If you are using a raw HTTP client instead of the official client library, you do not get exactly-once semantics and must handle deduplication yourself.
Dataflow
What it solves: Unified stream and batch processing on Apache Beam. Write once, run as a streaming pipeline or a batch job.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Dataflow | Amazon Kinesis Data Analytics / AWS Glue | Azure Stream Analytics |
Key talking points:
- Apache Beam SDK means your pipeline code is portable. Write it once and run on Dataflow (GCP), Spark (on-prem), or Flink (AWS). In practice, most teams target Dataflow directly.
- Streaming mode processes data with sub-second latency. Uses windowing (fixed, sliding, session windows) and triggers to control when results are emitted.
- Autoscaling workers adjusts the number of processing VMs based on backlog. A pipeline can scale from 1 worker to 1,000 workers automatically.
- Streaming Engine offloads shuffle and state management from worker VMs to a Google-managed backend. This reduces worker cost and improves reliability.
Dataflow vs Dataproc
Dataflow is serverless: you submit a pipeline and Google manages the infrastructure. Dataproc is managed Spark/Hadoop: you create a cluster and submit Spark jobs. Use Dataflow for new pipelines (especially streaming). Use Dataproc when you have existing Spark code or need the full Spark ecosystem (MLlib, SparkSQL).
Production gotchas:
Dataflow streaming jobs run continuously and cost money 24/7. A 5-worker streaming pipeline costs roughly $1,500/month. I have seen teams forget about a test pipeline running in dev for months, racking up thousands in charges. Always set up billing alerts and tag your Dataflow jobs.
Late data handling is where complexity lives. Dataflow's watermark tells you "all data up to this timestamp has arrived." But late data can still arrive after the watermark passes. Configure allowed lateness and accumulation mode carefully or you will either lose data or count it twice.
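Fixed windows and allowed lateness reduce to arithmetic on event time versus watermark. A sketch of the Beam semantics described above, not Beam's actual API:

```python
# Assign an element to the fixed window containing its event time, and
# drop it only when the watermark has passed window_end + allowed_lateness.
def window_for(event_ts: int, window_size: int) -> tuple:
    start = (event_ts // window_size) * window_size
    return (start, start + window_size)

def is_droppably_late(event_ts: int, watermark: int,
                      window_size: int, allowed_lateness: int) -> bool:
    _, end = window_for(event_ts, window_size)
    return watermark > end + allowed_lateness

window_for(65, 60)  # (60, 120): the one-minute window containing t=65
```

With allowed_lateness=0 the element at t=65 is dropped once the watermark passes 120; raising allowed lateness keeps the window open longer, at the cost of holding more state.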
Eventarc
What it solves: Event routing from GCP services and third-party sources to Cloud Run, Cloud Functions, and Workflows.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Eventarc | Amazon EventBridge | Azure Event Grid |
Key talking points:
- Unified event router. Eventarc captures events from 130+ GCP sources (Cloud Audit Logs, direct events from Pub/Sub, Cloud Storage, Firestore) and routes them to serverless targets.
- CloudEvents format standardizes event payloads. Every event follows the CloudEvents spec.
- Use Eventarc when you want to trigger Cloud Run from a GCS upload or a Firestore write without manually setting up Pub/Sub topics.
Eventarc simplifies event-driven architecture
Instead of manually creating Pub/Sub topics and subscriptions for every GCP-to-service integration, use Eventarc. One gcloud eventarc triggers create command does what used to take 5 commands and a custom IAM setup.
My honest take: Eventarc is a convenience layer over Pub/Sub and Cloud Audit Logs. For simple integrations ("trigger a function when a file is uploaded"), it is the fastest path. For complex event routing with fan-out, dead letters, and ordering, use Pub/Sub directly.
5. Networking and Content Delivery
GCP's networking is built on Google's private backbone, which spans the globe. This gives GCP a genuine latency advantage for global services: traffic enters Google's network at the nearest edge PoP and travels over Google's fiber (not the public internet) to the destination region.
Cloud Load Balancing
What it solves: Global and regional load balancing across Compute Engine, GKE, Cloud Run, and Cloud Storage backends.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Load Balancing | Elastic Load Balancing (ALB/NLB) | Azure Load Balancer + Front Door |
GCP's load balancer is a single global resource (for HTTP/S). There is no concept of "deploy an ALB in us-east-1." You create one load balancer, add backends across multiple regions, and GCP routes traffic to the nearest healthy backend.
Key talking points:
- Global anycast IP means one IP address for the entire world. Users in Tokyo and New York connect to the same IP, and Google's network routes them to the nearest backend.
- SSL termination at the edge happens at 200+ PoPs worldwide. Your users get fast TLS handshakes regardless of where your backend runs.
- Health checks are continuous and automatic. If a region goes down, traffic fails over in seconds (not minutes).
- URL maps route traffic by path: `/api/*` goes to Cloud Run, `/static/*` goes to Cloud Storage.
One LB, multiple backends
In interviews, say: "With GCP's Global HTTP(S) Load Balancer, I get a single anycast IP that routes users to the nearest healthy backend. SSL terminates at the edge, and Cloud CDN caches static content. I do not need separate load balancers per region."
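The URL-map behavior above boils down to longest-prefix matching. A toy sketch (backend names are invented for illustration):

```python
# Toy URL map: path-prefix routing where the longest matching prefix wins.
URL_MAP = {
    "/api/": "cloud-run-backend",     # dynamic API traffic
    "/static/": "gcs-bucket-backend",  # static assets from Cloud Storage
    "/": "default-backend",            # catch-all
}

def route(path: str) -> str:
    """Return the backend for the longest prefix that matches the path."""
    best = max((p for p in URL_MAP if path.startswith(p)), key=len)
    return URL_MAP[best]
```

So `route("/api/users")` resolves to the Cloud Run backend while `route("/about")` falls through to the default — the same mental model the real URL map applies before the request ever reaches a backend.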
Cloud CDN
What it solves: Content delivery network that caches responses at Google's edge locations. Integrated with Cloud Load Balancing.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud CDN | Amazon CloudFront | Azure CDN / Front Door |
Key talking points:
- Enable Cloud CDN with one checkbox on your load balancer backend. No separate service to configure.
- Cache modes: USE_ORIGIN_HEADERS (respect Cache-Control), CACHE_ALL_STATIC (auto-cache images, JS, CSS), FORCE_CACHE_ALL (cache everything for the specified TTL).
- Signed URLs and signed cookies for private content delivery.
- Cloud CDN shares Google's edge network, meaning content is served from the same 200+ PoPs as YouTube and Google Search.
I would combine Cloud CDN with Cloud Storage for static assets (images, videos, JS/CSS bundles) and Cloud Run for dynamic API responses with Cache-Control headers.
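To make the three cache modes concrete, here is a simplified decision function. This is a sketch of my mental model, not Cloud CDN's actual edge logic, which considers many more headers, status codes, and content types:

```python
def should_cache(mode: str, content_type: str, cache_control: str) -> bool:
    """Toy decision mirroring Cloud CDN's three cache modes (simplified)."""
    STATIC = {"image/png", "image/jpeg", "text/css", "application/javascript"}
    if mode == "FORCE_CACHE_ALL":
        # Cache everything, regardless of origin headers.
        return True
    if mode == "CACHE_ALL_STATIC":
        # Auto-cache recognized static types, plus anything marked public.
        return content_type in STATIC or "public" in cache_control
    if mode == "USE_ORIGIN_HEADERS":
        # Only cache when the origin explicitly opts in.
        return "public" in cache_control and "max-age" in cache_control
    return False
```

The practical takeaway: `USE_ORIGIN_HEADERS` puts your application in control, `CACHE_ALL_STATIC` is a sensible default for mixed workloads, and `FORCE_CACHE_ALL` is only safe behind paths that never serve per-user content.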
Cloud DNS
What it solves: Managed authoritative DNS with global anycast.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud DNS | Amazon Route 53 | Azure DNS |
Key talking points:
- 100% availability SLA, the highest possible uptime commitment (Amazon Route 53 offers the same 100% SLA). Remember that an SLA is a contractual credit commitment, not a literal guarantee the service never goes down.
- Supports DNSSEC for response authentication.
- Routing policies: geolocation, weighted round-robin, and failover. Use geolocation to route users to their nearest region.
- Private DNS zones for internal VPC resolution.
Cloud Armor
What it solves: Web Application Firewall (WAF) and DDoS protection at the edge.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Armor | AWS WAF + Shield | Azure WAF + DDoS Protection |
Key talking points:
- Google-scale DDoS absorption. Cloud Armor sits on the same infrastructure that protects Google.com. It can absorb multi-Tbps attacks.
- Preconfigured WAF rules for the OWASP Top 10 (SQL injection, XSS, RFI/LFI). Enable them with one rule.
- Adaptive Protection uses ML to detect anomalous traffic patterns and automatically suggests blocking rules.
- Rate limiting by IP, by header, or by geographic region. Essential for API protection.
Cloud Armor pricing
Cloud Armor Standard tier is free for basic DDoS protection. The Managed Protection Plus tier (with adaptive protection and advanced WAF rules) costs $3,000/month base plus per-request fees. For most applications, the Standard tier with custom WAF rules is sufficient.
Virtual Private Cloud (VPC)
What it solves: Software-defined networking for isolating and connecting GCP resources.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| VPC | Amazon VPC | Azure Virtual Network |
Key talking points:
- Global VPC. A single VPC spans all regions. You do not create per-region VPCs like in AWS. Subnets are regional, but the VPC is global. This simplifies multi-region networking dramatically.
- Shared VPC lets multiple projects use the same VPC, managed centrally. Use this in organizations with multiple teams.
- VPC Service Controls create a security perimeter around GCP resources, preventing data exfiltration. Even if credentials are compromised, data cannot leave the perimeter.
- Private Google Access lets VMs without external IPs reach Google APIs (BigQuery, Cloud Storage, etc.) over Google's internal network.
GCP's global VPC vs AWS's regional VPC
This is a common interview comparison. GCP's VPC is global: one VPC, subnets per region, automatic routing between them. AWS's VPC is regional: you need VPC peering or Transit Gateway to connect across regions. GCP's model is simpler for multi-region deployments.
For your interview: "I would use a single global VPC with subnets in each region where I have compute resources. GCP handles inter-region routing automatically, so I do not need the equivalent of AWS Transit Gateway."
6. Security and Identity
GCP's security model centers on IAM with fine-grained roles, and the infrastructure assumes zero trust by default. Every API call is authenticated and authorized. Every internal RPC is encrypted. This is not optional; it is the default.
Cloud IAM
What it solves: Identity and Access Management. Controls who (identity) can do what (role) on which resource.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud IAM | AWS IAM | Azure RBAC + Entra ID |
Key talking points:
- Resource hierarchy (Organization → Folders → Projects → Resources) with policy inheritance. A role granted at the Organization level applies to every resource in every project. Use this hierarchy for centralized governance.
- Predefined roles cover most use cases. `roles/storage.objectViewer` gives read-only access to GCS objects. Do not give `roles/storage.admin` when you only need read access.
- Service accounts are for machine identities. Every GKE pod, Cloud Run service, and Compute Engine VM should have its own service account with the minimum required permissions.
- Workload Identity Federation lets external systems (AWS, GitHub Actions, Azure) authenticate to GCP without service account keys. This replaced the dangerous practice of downloading JSON key files.
Never download service account keys
Service account key files are the #1 source of GCP credential leaks. They are long-lived, not rotated automatically, and end up in Git repositories, CI/CD pipelines, and Slack channels. Use Workload Identity Federation for external systems and the metadata server for GCP-hosted workloads. If you must use keys, rotate them every 90 days and monitor with Cloud Audit Logs.
Production gotchas:
The most common IAM mistake is granting roles/editor or roles/owner at the project level. These roles include thousands of permissions. A compromised service with the Editor role can read every database, delete every resource, and modify every IAM policy in the project. Use predefined roles at the narrowest scope possible.
IAM policy evaluation is not instant. Changes propagate in 60 seconds typically, but I have seen it take up to 5 minutes. Do not build health checks or CI pipelines that depend on IAM changes being immediate.
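If a script must act immediately after an IAM change, poll for the permission instead of assuming it is live. A hedged sketch, where the `check` callable is whatever API call exercises the new permission (names are mine, not a GCP SDK API):

```python
import time

def wait_for_iam(check, timeout=300.0, base_delay=2.0, sleep=time.sleep):
    """Poll `check()` with exponential backoff until it returns True.
    IAM changes usually propagate in ~60s but can take several minutes."""
    delay, waited = base_delay, 0.0
    while waited < timeout:
        if check():
            return True
        sleep(delay)
        waited += delay
        delay = min(delay * 2, 30.0)  # cap the backoff
    return False
```

This pattern belongs in any CI pipeline that grants a role and then immediately uses it, such as Terraform creating a service account that the next step deploys with.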
Identity Platform
What it solves: Customer-facing authentication. Sign-in with email/password, phone, Google, Facebook, SAML, OIDC, and more.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Identity Platform | Amazon Cognito | Azure AD B2C |
Key talking points:
- Built on Firebase Auth but with enterprise features: multi-tenancy, SAML, blocking functions, and SLA.
- Multi-tenancy lets you create isolated authentication namespaces for each customer. Essential for B2B SaaS applications.
- Blocking functions run before or after authentication events. Use them to check a user against a deny list, add custom claims, or trigger provisioning workflows.
- Supports up to 10 billion authentications per month.
Identity Platform vs Firebase Auth
Firebase Auth is free for most features but has lower limits and no SLA. Identity Platform is the paid, enterprise version with multi-tenancy, blocking functions, and a 99.95% SLA. For production applications, use Identity Platform.
Cloud KMS
What it solves: Managed encryption key lifecycle: create, rotate, and use encryption keys without managing HSMs.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud KMS | AWS KMS | Azure Key Vault |
Key talking points:
- Envelope encryption: KMS encrypts a data encryption key (DEK), you use the DEK to encrypt your data. This is the pattern used by every GCP service that offers customer-managed encryption keys (CMEK).
- Automatic key rotation on a user-defined schedule (minimum 1 day, recommended 90 days). Old key versions stay available for decryption.
- HSM-backed keys (Cloud HSM) for FIPS 140-2 Level 3 compliance. Required for PCI DSS and some government workloads.
- External Key Manager (EKM) lets you use keys stored in an external HSM. The encryption key never enters Google infrastructure.
CMEK everywhere
For regulated industries, enable Customer-Managed Encryption Keys on every service: Cloud Storage, BigQuery, Cloud SQL, Spanner, Pub/Sub. This gives you key rotation control and the ability to revoke access to all data by disabling the key.
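The envelope pattern itself is simple. The toy below uses a SHA-256 counter keystream purely to keep the example stdlib-only — this is NOT real cryptography; production systems use AES-GCM via Cloud KMS or Tink, and the KEK never leaves KMS:

```python
import os
import hashlib

def xor_stream(key: bytes, data: bytes) -> bytes:
    """TOY cipher (SHA-256 counter keystream). Illustrates the pattern only."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)

def encrypt_envelope(kek: bytes, plaintext: bytes):
    dek = os.urandom(32)                     # 1. generate a data encryption key
    ciphertext = xor_stream(dek, plaintext)  # 2. encrypt the data with the DEK
    wrapped_dek = xor_stream(kek, dek)       # 3. wrap the DEK with the KEK (KMS's job)
    return wrapped_dek, ciphertext           # store both alongside the data

def decrypt_envelope(kek: bytes, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    dek = xor_stream(kek, wrapped_dek)       # unwrap the DEK (a KMS Decrypt call)
    return xor_stream(dek, ciphertext)
```

The design point: KMS only ever encrypts and decrypts the tiny DEK, so bulk data never transits the key service, and revoking the KEK instantly makes every wrapped DEK (and therefore all the data) unreadable.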
Secret Manager
What it solves: Securely store and manage API keys, passwords, certificates, and other sensitive strings.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Secret Manager | AWS Secrets Manager | Azure Key Vault (secrets) |
Key talking points:
- Versioned secrets with automatic replication across regions. Each secret can have multiple versions; roll back by pointing to an older version.
- IAM-based access control. Grant `roles/secretmanager.secretAccessor` on a specific secret to a specific service account. Fine-grained.
- Automatic rotation is not built-in (unlike AWS Secrets Manager). You need to implement rotation logic yourself using Cloud Functions or a Pub/Sub-triggered workflow.
- Secrets are encrypted at rest with Google-managed keys or your own CMEK.
Secret Manager does not rotate secrets
Unlike AWS Secrets Manager, GCP's Secret Manager does not have built-in rotation. You must build the rotation logic yourself. Create a Cloud Scheduler job that triggers a Cloud Function to generate a new secret, update the dependent service, and add a new secret version. This is a common interview catch.
My approach: store every secret (database passwords, API keys, TLS certs) in Secret Manager. Reference them in Cloud Run as mounted volumes or environment variables. Never hardcode secrets in code, config files, or environment variables in your CI/CD pipeline.
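Secret Manager's versioning model — append-only versions, with "latest" as a movable pointer — is worth internalizing, because rotation is just "add a version" and rollback is just "read an older one." A toy sketch (class and method names are invented, not the real client library):

```python
class SecretStore:
    """Toy versioned secret store mirroring Secret Manager semantics:
    versions are append-only and 1-indexed; 'latest' resolves to the newest."""
    def __init__(self):
        self.secrets = {}  # name -> list of version values

    def add_version(self, name: str, value: str) -> int:
        self.secrets.setdefault(name, []).append(value)
        return len(self.secrets[name])  # the new version number

    def access(self, name: str, version="latest") -> str:
        versions = self.secrets[name]
        idx = len(versions) if version == "latest" else int(version)
        return versions[idx - 1]
```

A rotation job in this model is: generate a new credential, `add_version`, update the dependent service, then disable the old version once nothing reads it — which maps directly onto the Cloud Scheduler + Cloud Function flow described above.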
7. Observability
You cannot operate what you cannot observe. GCP's observability stack (Cloud Operations Suite, formerly Stackdriver) provides monitoring, logging, and tracing as integrated, serverless services. The key advantage over rolling your own ELK/Prometheus/Jaeger stack: zero infrastructure to manage.
Cloud Monitoring
What it solves: Metrics collection, dashboards, and alerting for GCP resources and custom applications.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Monitoring | Amazon CloudWatch | Azure Monitor |
Key talking points:
- SLO monitoring is built-in. Define your SLI (latency, availability), set an SLO (99.9%), and Cloud Monitoring tracks your error budget. Alert when the budget burns too fast. This is Google's SRE approach baked into the product.
- Managed Prometheus lets you use Prometheus metrics and PromQL queries without running a Prometheus server. GCP scrapes your targets and stores metrics for 24 months.
- Monitoring Query Language (MQL) is more powerful than PromQL for GCP metrics but has a steeper learning curve. Use it for complex aggregations.
- Uptime checks probe your endpoints from multiple global locations every 1-15 minutes.
SLO-based alerting
Instead of alerting on arbitrary thresholds ("CPU > 80%"), define SLOs and alert on burn rate. "My API's latency SLO is p99 < 500ms. Alert me when I am burning error budget at 10x the normal rate." This dramatically reduces alert noise and focuses on what actually matters to users.
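Burn rate is just error rate divided by error budget. A minimal sketch of the arithmetic:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Budget burn as a multiple of the sustainable rate.
    slo=0.999 gives a 0.1% error budget; a 1% error rate burns it 10x too fast."""
    budget = 1.0 - slo
    return error_rate / budget

def budget_exhausted_in_days(rate: float, window_days: float = 30.0) -> float:
    """How long the budget lasts at a given burn rate."""
    return window_days / rate
```

At a 99.9% SLO, a 1% error rate is a 10x burn: a 30-day error budget gone in 3 days — exactly the kind of condition worth paging on, while a 1.1x burn can wait for business hours.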
Cloud Logging
What it solves: Centralized log management. Automatic collection from all GCP services, plus custom application logs.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Logging | Amazon CloudWatch Logs | Azure Monitor Logs |
Key talking points:
- Automatic collection from every GCP service. Cloud Run stdout, GKE container logs, Compute Engine system logs, load balancer access logs. No agent configuration needed for most services.
- Log-based metrics let you create custom metrics from log entries. Count the number of 500 errors per minute without writing code.
- Log Router controls where logs go: Cloud Logging buckets (default), BigQuery (for analysis), Cloud Storage (for archival), or Pub/Sub (for streaming to external systems).
- Log Analytics powered by BigQuery lets you write SQL queries against your logs. This is far more powerful than CloudWatch Logs Insights.
Logging costs add up
Cloud Logging charges $0.50/GB for ingestion beyond the free 50 GB/month allotment. A busy GKE cluster can generate 100+ GB/month of logs. Use the Log Router to exclude verbose debug logs from ingestion, or route them to Cloud Storage at $0.02/GB/month for cold analysis.
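The cost math is worth having at your fingertips. Using the rates quoted above (verify against current pricing before relying on them):

```python
def monthly_logging_cost(ingest_gb: float, free_gb: float = 50.0,
                         price_per_gb: float = 0.50) -> float:
    """Cloud Logging ingestion cost: $0.50/GB past the 50 GB free tier."""
    return max(0.0, ingest_gb - free_gb) * price_per_gb

def archive_cost(gb: float, price_per_gb_month: float = 0.02) -> float:
    """Cold archival in Cloud Storage at ~$0.02/GB/month."""
    return gb * price_per_gb_month
```

A 100 GB/month cluster costs $25/month to ingest but only $2/month to archive in Cloud Storage — which is why routing verbose logs away from ingestion pays off so quickly.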
Production gotchas:
The default log retention is 30 days for _Default bucket and 400 days for _Required bucket. If you need longer retention for compliance, create a custom log bucket with the retention period you need.
I have seen teams routing all logs to BigQuery for analysis and then getting surprised by a $2,000/month BigQuery storage bill. Route only the logs you actually query to BigQuery. Send everything else to Cloud Storage.
Cloud Trace
What it solves: Distributed tracing for understanding request latency across services.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Trace | AWS X-Ray | Azure Application Insights |
Key talking points:
- Automatic trace collection for Cloud Run, App Engine, and Cloud Functions. No SDK required for basic tracing.
- Compatible with OpenTelemetry for custom instrumentation.
- Latency analysis shows p50, p95, and p99 latency distributions. Drill into individual traces to find the slow span.
- Integrates with Cloud Logging so you can jump from a trace to the corresponding log entries.
For your interview: "I would use Cloud Trace with OpenTelemetry instrumentation to trace requests across my microservices. Combined with Cloud Logging, I can go from 'this request was slow' to 'this specific database query took 3 seconds' in a few clicks."
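Conceptually, a trace is a tree of timed spans. Here is a toy single-threaded context-manager sketch of that idea — real code would use the OpenTelemetry SDK, which handles context propagation across threads and services:

```python
import time
from contextlib import contextmanager

SPANS = []  # collected (name, parent, duration_ms) tuples

@contextmanager
def span(name, _stack=[]):
    """Toy span: records its parent (the enclosing span) and its duration.
    Module-level mutable-default stack: single-threaded illustration only."""
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        SPANS.append((name, parent, (time.perf_counter() - start) * 1000))

with span("handle_request"):
    with span("db_query"):
        pass  # inner spans close (and are recorded) before their parents
```

The parent links are what let a trace viewer reconstruct the tree and show you that the slow parent span spends all its time inside one child.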
Error Reporting
What it solves: Automatically groups and displays errors from Cloud Functions, App Engine, Cloud Run, GKE, and Compute Engine.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Error Reporting | Third-party (Sentry, Datadog) | Azure Application Insights |
Key talking points:
- Automatically parses stack traces and groups identical errors. No configuration needed for supported runtimes.
- Shows error frequency, first/last occurrence, and affected services.
- Integrates with Cloud Logging to show the full log context around each error.
This is one of those "free" GCP features that many teams do not know about. It is not as feature-rich as Sentry, but for basic error tracking on GCP services, it works out of the box.
8. AI/ML Services
GCP has the strongest AI/ML platform of any cloud provider. Google invented the Transformer (the "T" in GPT), runs one of the largest AI research organizations in the world, and builds custom TPUs for model training. Vertex AI is the unified platform that brings all of this together.
Vertex AI
What it solves: End-to-end machine learning platform: train, tune, deploy, and manage ML models at scale.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Vertex AI | Amazon SageMaker | Azure Machine Learning |
Key talking points:
- AutoML trains production-quality models with zero ML expertise. Upload labeled data, pick a model type (image classification, NLP, tabular), and Vertex AI handles architecture search, hyperparameter tuning, and deployment.
- TPU access is GCP's hardware advantage. TPU v5e costs ~$1.20/hr and provides 2x performance per dollar compared to A100 GPUs for many model architectures. No other cloud provider offers TPUs.
- Feature Store manages online (low-latency serving) and offline (batch training) features in one system. This solves the training-serving skew problem.
- Model Monitoring detects feature drift and prediction skew in production. It compares incoming feature distributions against training data and alerts when they diverge.
- Vertex AI Pipelines are based on Kubeflow Pipelines. Define your ML workflow as a DAG: data prep β training β evaluation β deployment. Reproducible and auditable.
TPUs in interviews
If the interview involves training large models, mention TPUs. Say: "I would use TPU v5e pods on Vertex AI for training because they offer the best performance-per-dollar for transformer architectures. Google designed TPUs specifically for matrix operations that dominate ML workloads."
Production gotchas:
Vertex AI endpoints charge for prediction compute even when idle. A single n1-standard-4 endpoint node costs ~$150/month even with zero traffic. Online endpoints keep at least one replica warm, so undeploy models you are not actively serving, or use batch predictions when real-time is not required.
I have seen teams train models on Vertex AI with the default machine type and wonder why training takes 12 hours. Always specify the right accelerator. For fine-tuning a transformer: use at least 1 A100 GPU. For training from scratch: use a TPU v5e pod.
Gemini API
What it solves: Access to Google's most capable foundation models (Gemini family) for text generation, multimodal understanding, code generation, and reasoning.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Gemini API (via Vertex AI) | Amazon Bedrock (Claude, Titan) | Azure OpenAI Service |
Key talking points:
- Gemini 2.5 Pro has a 1M+ token context window, one of the largest of any production model. Use it for analyzing entire codebases, long documents, or multi-turn conversations with extensive history.
- Multimodal natively. Gemini processes text, images, video, and audio in a single model. No separate vision or audio APIs needed.
- Grounding connects model outputs to Google Search or your own data. This reduces hallucinations by anchoring responses to real facts.
- Two access paths: Gemini API (Google AI Studio, simple API key auth) for prototyping, and Vertex AI Gemini (enterprise features, IAM auth, VPC Service Controls) for production.
Gemini vs OpenAI in GCP interviews
If the interviewer asks about LLM integration, frame it as: "On GCP, I would use the Gemini API via Vertex AI for enterprise features like VPC Service Controls, data residency, and IAM-based access. The 1M token context window is particularly useful for document analysis and RAG pipelines."
Production gotchas:
Rate limits and quotas vary by model and region. Gemini 2.5 Pro has lower throughput than smaller models. For high-throughput applications, use provisioned throughput (buy committed capacity) rather than relying on on-demand quotas.
Latency for long-context requests (100K+ tokens) can reach 10-30 seconds. Design your application with streaming responses to keep the UI responsive while the model generates.
Document AI
What it solves: Extract structured data from documents (PDFs, scanned images, forms, invoices) using pre-trained and custom ML models.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Document AI | Amazon Textract | Azure Document Intelligence |
Key talking points:
- Pre-trained processors for common document types: invoices, receipts, W-2 forms, driver's licenses, bank statements, passports. No training required.
- Custom Document Extractor lets you train on your own document layouts with as few as 10 labeled examples.
- Supports Human-in-the-Loop review for low-confidence extractions. Integrates with a review UI.
- Batch processing for high-volume document ingestion.
For your interview: "For document processing at scale, I would use Document AI with a pre-trained processor for standard documents and a custom extractor for proprietary formats. The Human-in-the-Loop feature handles edge cases where the model is uncertain."
Vision AI and Natural Language AI
What it solves: Pre-trained APIs for image analysis (object detection, OCR, face detection) and text analysis (sentiment, entity extraction, classification).
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Cloud Vision AI | Amazon Rekognition | Azure Computer Vision |
| Cloud Natural Language AI | Amazon Comprehend | Azure Language Service |
Key talking points:
- Vision AI provides label detection, OCR, face detection, explicit content detection, and product search. It processes images via a REST API with no model training.
- Natural Language AI performs sentiment analysis, entity extraction, content classification, and syntax analysis on text.
- Both services are being superseded by Gemini's multimodal capabilities for new applications. Use the dedicated APIs when you need low-latency, high-throughput processing of a specific task.
Pre-trained APIs vs Gemini
For simple tasks (detect objects in an image, extract entities from text), the pre-trained Vision and Language APIs are faster and cheaper than Gemini. For complex reasoning (describe what is happening in a video, summarize a conversation), use Gemini. The pre-trained APIs have sub-200ms latency; Gemini takes 1-5 seconds.
9. Data and Analytics
GCP's analytics story is anchored by BigQuery, which is arguably the best serverless data warehouse on any cloud. Dataproc provides managed Spark for teams with existing Spark workloads, and Looker is the BI layer.
BigQuery
What it solves: Serverless, petabyte-scale SQL data warehouse. No infrastructure to manage, no indexes to tune, no clusters to resize.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| BigQuery | Amazon Redshift Serverless | Azure Synapse Analytics |
BigQuery is built on Dremel, Google's internal query engine that powers Google's own analytics. It separates compute and storage completely: your data sits in Colossus (Google's distributed file system), and queries run on a pool of shared compute capacity (slots).
Key talking points:
- Two pricing models: On-demand ($6.25/TB scanned) is the default and good for ad-hoc queries. Flat-rate slots ($0.04/slot-hour) are better for consistent, heavy workloads. In my experience, teams that run more than 5 TB of queries per month should evaluate slots.
- Streaming inserts add rows to BigQuery in real-time. The gotcha: streamed data lands in a "streaming buffer" that is not immediately available for DML operations (UPDATE, DELETE). It takes up to 90 minutes for streamed rows to become DML-eligible.
- Materialized views pre-compute aggregations and BigQuery automatically refreshes them when base tables change. Use these for dashboard queries that run the same aggregation repeatedly.
- BI Engine caches query results in memory for sub-second response times on Looker dashboards. Allocates 1-250 GB of RAM per project.
- BigQuery ML lets you train ML models (linear regression, boosted trees, deep neural networks, LLMs) using SQL. No Python, no data export, no separate training infrastructure.
The streaming buffer gotcha
Data inserted via the streaming API lands in a buffer that is queryable but not modifiable. You cannot UPDATE or DELETE streamed rows until they flush to columnar storage (up to 90 minutes). If you need to modify recently ingested data, use the BigQuery Storage Write API with "committed" mode instead of the legacy streaming insert API.
Partition and cluster everything
Always partition BigQuery tables by date or timestamp column. Always cluster by your most common filter columns. A query on a 10 TB table partitioned by date and clustered by customer_id might scan only 50 GB instead of the full 10 TB. This cuts your on-demand bill by 99.5%.
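The savings from partition pruning fall straight out of the on-demand price. Using the numbers above (rate from the text; verify current pricing):

```python
def on_demand_cost(tb_scanned: float, price_per_tb: float = 6.25) -> float:
    """BigQuery on-demand cost for a single query at $6.25/TB scanned."""
    return tb_scanned * price_per_tb

full_scan = on_demand_cost(10.0)  # unpartitioned 10 TB table: $62.50 per query
pruned = on_demand_cost(0.05)     # one date partition + clustering: ~50 GB scanned
```

Run that query hourly on a dashboard and the difference is roughly $45,000/month versus a few hundred dollars — partitioning is a one-line DDL change with a four-orders-of-magnitude payoff at scale.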
Production gotchas:
The biggest BigQuery cost mistake is SELECT * on a multi-TB table. BigQuery's columnar storage means it scans only the columns a query references, and SELECT * references all of them. A SELECT column_a, column_b on a 10 TB table might scan only 200 GB. Always select only the columns you need, and rely on cached results (`--use_cache=true`, the default) for repeated queries.
Slot contention in flat-rate pricing is a real problem. If multiple teams share a slot reservation and one team runs a massive query, other teams' queries queue. Use slot assignments or reservations per team to prevent noisy neighbor issues.
Dataproc
What it solves: Managed Apache Spark, Hadoop, Presto, and Flink clusters.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Dataproc | Amazon EMR | Azure HDInsight |
Key talking points:
- Clusters spin up in under 90 seconds. Use ephemeral clusters: create for a job, run the job, delete the cluster. Do not leave clusters running 24/7.
- Dataproc Serverless lets you submit Spark jobs without creating a cluster at all. Google manages the compute.
- Autoscaling adds and removes worker nodes based on YARN metrics.
- Store data in Cloud Storage (not HDFS) so it persists after the cluster is deleted. Use the `gs://` connector.
Dataproc vs Dataflow
Dataproc is for existing Spark/Hadoop workloads. Dataflow is for new pipeline development. If you are starting fresh, use Dataflow (Apache Beam). If you have existing Spark code, use Dataproc. Do not rewrite working Spark jobs just to use Dataflow.
Looker
What it solves: Business intelligence and data visualization platform. Semantic modeling layer over your data warehouse.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Looker | Amazon QuickSight | Power BI |
Key talking points:
- LookML is Looker's modeling language. It defines dimensions, measures, and relationships in code, creating a reusable semantic layer. Analysts write explores against LookML, not raw SQL.
- Embedded analytics lets you embed Looker dashboards in your application. White-label BI without building your own charting.
- Looker Studio (formerly Data Studio) is the free, lighter alternative for basic dashboards and reports.
Data Fusion
What it solves: Visual ETL/ELT pipeline builder. Drag-and-drop data integration built on CDAP (an open-source data integration framework).
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Data Fusion | AWS Glue Studio | Azure Data Factory |
Key talking points:
- Visual pipeline designer for teams that prefer GUI-based ETL over writing code.
- Pre-built connectors for databases, SaaS applications, and file systems.
- Under the hood, Data Fusion compiles visual pipelines to Dataproc Spark jobs.
My take: use Data Fusion when your data engineering team prefers visual tools. For code-first teams, use Dataflow or directly write Spark jobs on Dataproc. Data Fusion adds a layer of abstraction that can make debugging harder for complex pipelines.
10. Infrastructure as Code
Managing GCP resources manually through the Console does not scale. You need infrastructure as code (IaC) for reproducibility, version control, and team collaboration. GCP supports its own Deployment Manager plus the dominant third-party tools.
Deployment Manager
What it solves: Google's native IaC tool. Define GCP resources in YAML/Jinja2/Python templates.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Deployment Manager | AWS CloudFormation | Azure Resource Manager (ARM) |
Key talking points:
- YAML-based templates with Jinja2 or Python for dynamic configuration.
- Supports all GCP resources through the API discovery document.
- Previews show what changes will be made before applying.
Deployment Manager is effectively deprecated
Google stopped investing in Deployment Manager years ago. The last major feature update was 2020. Google now recommends Terraform for new projects and is investing in Config Connector (Kubernetes-native IaC) as the GCP-native alternative. Do not start new projects on Deployment Manager.
My honest recommendation: skip Deployment Manager entirely. Use Terraform or Pulumi.
Terraform on GCP
What it solves: Multi-cloud IaC from HashiCorp. The industry standard for cloud infrastructure automation.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Terraform (Google provider) | Terraform (AWS provider) | Terraform (AzureRM provider) |
Key talking points:
- Google Cloud provider is actively maintained and covers virtually every GCP service.
- Remote state in GCS with state locking via Cloud Storage's generation-based locking. Store state in a separate project from the resources it manages.
- Terraform modules for common patterns: `terraform-google-modules` provides battle-tested modules for VPC, GKE, Cloud SQL, and more.
- Cloud Build integration for CI/CD: run `terraform plan` on pull requests, `terraform apply` on merge to main.
State management best practice
Store Terraform state in a GCS bucket with versioning enabled and a separate GCP project for the state bucket. This prevents accidental state deletion and ensures you can recover from state corruption. Never store state locally or in the same project as your resources.
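A minimal backend block implementing this looks as follows. The bucket name and prefix are placeholders; the bucket should live in a dedicated state project with object versioning enabled:

```hcl
terraform {
  backend "gcs" {
    bucket = "my-org-terraform-state"  # placeholder: dedicated project, versioning on
    prefix = "prod/network"            # one prefix per environment/stack
  }
}
```

Keeping one prefix per stack also keeps state files small, which makes `terraform plan` faster and limits the blast radius of any state corruption.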
Pulumi on GCP
What it solves: Infrastructure as code using real programming languages (TypeScript, Python, Go, C#) instead of HCL or YAML.
| GCP Service | AWS Equivalent | Azure Equivalent |
|---|---|---|
| Pulumi (GCP provider) | Pulumi (AWS provider) | Pulumi (Azure provider) |
Key talking points:
- Write infrastructure code in TypeScript, Python, Go, or C# instead of learning HCL.
- Full IDE support: autocomplete, type checking, and refactoring.
- Component resources let you create reusable infrastructure abstractions using the same patterns as application code (classes, functions, packages).
- Pulumi Cloud provides state management, secrets, and policies.
My take: Pulumi is excellent for teams that already have strong TypeScript or Python skills and dislike HCL. Terraform has a much larger community and more examples. For interviews, default to mentioning Terraform since it is the industry standard, but mention Pulumi as an alternative if the interviewer asks about other IaC tools.
GCP Services Quick Reference
Here is the complete mapping for rapid interview reference:
| Category | GCP Service | AWS Equivalent | Azure Equivalent | When to Use |
|---|---|---|---|---|
| Compute | Compute Engine | EC2 | Virtual Machines | Custom VMs, GPU workloads |
| | Cloud Run | Fargate | Container Apps | Stateless HTTP services |
| | GKE | EKS | AKS | Container orchestration |
| | Cloud Functions | Lambda | Azure Functions | Event-driven functions |
| | App Engine | Elastic Beanstalk | App Service | Legacy PaaS (avoid for new) |
| Storage | Cloud Storage | S3 | Blob Storage | Object storage, data lake |
| | Persistent Disk | EBS | Managed Disks | Block storage for VMs |
| | Filestore | EFS | Azure Files | Shared NFS |
| Database | Cloud SQL | RDS | SQL Database | Managed relational DB |
| | Cloud Spanner | Aurora Global | Cosmos DB | Global consistent relational |
| | Firestore | DynamoDB | Cosmos DB | Document DB, real-time sync |
| | Bigtable | DynamoDB/Keyspaces | Cosmos DB (Cassandra) | Time-series, IoT |
| | Memorystore | ElastiCache | Cache for Redis | In-memory caching |
| | AlloyDB | Aurora PostgreSQL | Cosmos DB for PG | High-perf PostgreSQL |
| Messaging | Pub/Sub | SNS+SQS | Service Bus | Global async messaging |
| | Dataflow | Kinesis Analytics | Stream Analytics | Stream/batch processing |
| | Eventarc | EventBridge | Event Grid | Event routing |
| Networking | Cloud LB | ELB (ALB/NLB) | Load Balancer+Front Door | Global load balancing |
| | Cloud CDN | CloudFront | Azure CDN | Content caching |
| | Cloud DNS | Route 53 | Azure DNS | DNS with 100% SLA |
| | Cloud Armor | WAF+Shield | WAF+DDoS Protection | DDoS, WAF |
| | VPC | VPC | Virtual Network | Network isolation |
| Security | Cloud IAM | IAM | RBAC+Entra ID | Access control |
| | Identity Platform | Cognito | AD B2C | Customer auth |
| | Cloud KMS | KMS | Key Vault | Encryption keys |
| | Secret Manager | Secrets Manager | Key Vault (secrets) | Secret storage |
| Observability | Cloud Monitoring | CloudWatch | Azure Monitor | Metrics, alerting |
| | Cloud Logging | CloudWatch Logs | Monitor Logs | Log management |
| | Cloud Trace | X-Ray | App Insights | Distributed tracing |
| AI/ML | Vertex AI | SageMaker | Azure ML | ML platform |
| | Gemini API | Bedrock | Azure OpenAI | Foundation models |
| | Document AI | Textract | Document Intelligence | Document processing |
| Analytics | BigQuery | Redshift | Synapse Analytics | Data warehouse |
| | Dataproc | EMR | HDInsight | Managed Spark/Hadoop |
| | Looker | QuickSight | Power BI | Business intelligence |
| IaC | Terraform | Terraform | Terraform | Infrastructure as code |
GCP's Unique Differentiators (Cheat Sheet)
When an interviewer asks "Why GCP over AWS/Azure?", these are your talking points:
- Spanner is the only globally consistent relational database available as a managed service. Period.
- BigQuery is the most mature serverless data warehouse. Separation of compute and storage, no cluster management, petabyte-scale.
- Global networking on Google's private backbone. Load balancers with global anycast, VPCs that span all regions, traffic that rides Google's fiber.
- Kubernetes heritage. GKE is the most mature and tightly integrated managed Kubernetes offering. Google built Kubernetes.
- TPUs for ML training. Custom silicon optimized for transformer workloads; Google claims roughly 2x performance-per-dollar versus comparable GPUs.
- Pub/Sub is global by default. AWS SQS/SNS are regional. For global event-driven architectures, Pub/Sub is simpler.
- Live migration moves running VMs between hosts without downtime, a capability no other major cloud offers as broadly for routine maintenance.
- SRE tools built-in. SLO monitoring, error budgets, and managed Prometheus in Cloud Monitoring come from Google's SRE culture.
The five-second GCP pitch
"GCP gives me globally consistent databases, the best serverless analytics, and the network backbone that runs Google Search. I would choose GCP when my requirements include global consistency (Spanner), petabyte-scale analytics (BigQuery), or when my team has Kubernetes expertise (GKE)."