System Design Interview Questions for Mid-Level Engineers

System design interviews test your ability to think at scale — to architect distributed systems that are reliable, performant, and maintainable under real-world constraints. Unlike coding interviews, there is rarely one correct answer. Instead, interviewers evaluate how you break down ambiguous problems, make trade-offs, and communicate your reasoning. For mid-level engineers, the expectation is that you can move beyond theoretical knowledge and speak to practical decisions: why you would choose one database over another, how you would handle a sudden traffic spike, or what happens when a downstream service goes down. The questions below cover the core topics you are most likely to encounter, grouped by theme to help you study systematically.

Fundamentals: Scalability, Availability, and CAP Theorem

What is the difference between horizontal and vertical scaling, and when would you choose each?

Vertical scaling means adding more resources (CPU, RAM, disk) to a single machine, while horizontal scaling means adding more machines to distribute the load. Vertical scaling is simpler and requires no application changes, but hits hardware limits quickly and creates a single point of failure. Horizontal scaling is preferred for large-scale systems because it offers near-unlimited capacity and better fault tolerance, though it introduces complexity around data consistency and session management.

What does “availability” mean in the context of distributed systems, and how is it typically measured?

Availability is the percentage of time a system is operational and able to serve requests. It is commonly expressed as “nines” — 99.9% availability means roughly 8.7 hours of downtime per year, while 99.99% means about 52 minutes. High availability is achieved through redundancy, health checks, automatic failover, and eliminating single points of failure across every layer of the stack.

Explain the CAP theorem. What does it mean in practice for system design?

CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency (every read returns the most recent write), Availability (every request receives a response), and Partition Tolerance (the system continues operating despite network partitions). Since network partitions are unavoidable in real distributed systems, the practical choice is between CP (consistency over availability, e.g., HBase, ZooKeeper) and AP (availability over consistency, e.g., Cassandra, DynamoDB). Understanding this trade-off helps you select the right datastore for a given use case.

What is eventual consistency, and when is it an acceptable trade-off?

Eventual consistency means that, given no new updates, all replicas of a piece of data will converge to the same value over time — but reads may return stale data in the interim. It is acceptable when user experience tolerates slight staleness, such as a social media “like” count, product recommendation data, or DNS propagation. It is not acceptable for financial transactions, inventory reservation, or any scenario where acting on stale data causes harm.

What is a single point of failure (SPOF), and how do you design systems to avoid it?

A SPOF is any component whose failure brings down the entire system. Common SPOFs include a single database server, a single load balancer, or a single availability zone. You eliminate them through redundancy: running multiple instances of critical services, using active-active or active-passive failover configurations, deploying across multiple availability zones, and using managed services that abstract away hardware failures.

What is the difference between latency and throughput, and how do they relate to system design?

Latency is the time it takes to complete a single request (e.g., 50ms per query), while throughput is the number of requests a system can handle per unit of time (e.g., 10,000 requests per second). They are often in tension: optimizations that improve throughput (e.g., batching) can increase latency for individual requests. System design requires you to identify which matters more for your use case — real-time chat prioritizes low latency, while batch data pipelines prioritize high throughput.

Database Design: SQL vs NoSQL, Sharding, and Replication

When would you choose a relational database over a NoSQL database?

Choose a relational database (PostgreSQL, MySQL) when your data has a well-defined schema with complex relationships, you need ACID transactions, or your queries involve multi-table joins. NoSQL databases (Cassandra, MongoDB, DynamoDB) are better suited for high write throughput, flexible or evolving schemas, hierarchical data, or when you need to scale horizontally across many nodes without the overhead of joins. For most mid-level applications, a relational database is the safer default unless you have a clear reason to deviate.

What is database sharding, and what problems does it introduce?

Sharding is the practice of horizontally partitioning a database by splitting rows across multiple independent database instances (shards), each responsible for a subset of the data. It solves write throughput and storage limitations that arise when a single machine cannot handle the load. However, it introduces significant complexity: cross-shard queries and joins become expensive or impossible, rebalancing shards when data grows unevenly (hotspots) is difficult, and distributed transactions across shards require careful coordination.

What are the common sharding strategies, and what are the trade-offs of each?

The three primary strategies are range-based sharding (e.g., users with IDs 1–1M on shard 1), hash-based sharding (apply a hash function to the shard key to distribute evenly), and directory-based sharding (a lookup table maps each key to its shard). Range-based sharding enables efficient range queries but can create hotspots if data is not uniformly distributed. Hash-based sharding distributes load evenly but makes range queries difficult. Directory-based sharding is flexible but the lookup table itself becomes a bottleneck and SPOF.

Explain database replication. What is the difference between synchronous and asynchronous replication?

Replication involves copying data from a primary (leader) database to one or more replicas (followers) to improve read throughput and fault tolerance. In synchronous replication, a write is only acknowledged to the client after it has been committed to the primary and at least one replica — this guarantees no data loss on failover but increases write latency. In asynchronous replication, the primary acknowledges the write immediately and replicates in the background — this is faster but risks losing recently committed writes if the primary fails before replication completes.

What is the N+1 query problem, and how do you solve it?

The N+1 problem occurs when an application executes one query to fetch a list of N records, then issues N additional queries to fetch related data for each record individually — often caused by lazy loading in ORMs. For example, fetching 100 posts and then querying the author for each post results in 101 queries. The solution is to use eager loading (SQL JOINs or ORM include/eager_load directives) to fetch all required data in a single query, or to use a DataLoader pattern to batch and deduplicate requests.

What is a read replica, and when would you use one?

A read replica is a copy of your primary database that is kept in sync (usually asynchronously) and only serves read queries. You introduce read replicas when your application is read-heavy and the primary database becomes a bottleneck — for instance, an analytics dashboard running expensive reporting queries should not compete with transactional writes on the primary. Because replicas may lag behind the primary, they are unsuitable for reads that must reflect the most recent write, such as reading back a record immediately after creating it.

Caching Strategies: Redis, CDN, and Cache Invalidation

What are the most common caching strategies, and when should you use each?

Cache-aside (lazy loading) loads data into the cache only on a cache miss — the application checks the cache, and if absent, queries the database and populates the cache. It is simple and only caches what is actually requested, but the first request for any item is always slow. Write-through caches every write to both the cache and the database simultaneously, ensuring the cache is always warm but adding write latency. Write-behind (write-back) caches the write immediately and persists to the database asynchronously, offering the lowest write latency at the risk of data loss. Read-through is similar to cache-aside but the cache itself handles fetching from the database on a miss.

What is cache invalidation, and why is it considered one of the hardest problems in computer science?

Cache invalidation is the process of removing or updating cached data when the underlying source of truth changes, so that stale data is not served. It is difficult because you must decide when a cache entry is “too old,” handle concurrent writes that may race with cache reads, and propagate invalidations across multiple cache nodes in a distributed system. Common strategies include TTL (time-to-live) expiry, event-driven invalidation triggered by writes, and versioned cache keys that effectively invalidate by changing the key rather than deleting entries.

What is a CDN, and which types of content benefit most from CDN caching?

A Content Delivery Network (CDN) is a geographically distributed network of edge servers that cache and serve content from locations physically close to end users, reducing latency and offloading traffic from origin servers. Static assets — images, videos, CSS, JavaScript bundles, and downloadable files — benefit most because they are cacheable for long periods. Dynamic or user-specific content (e.g., personalized API responses) benefits less, though modern CDNs support edge computing to handle some dynamic logic at the edge.

How does Redis differ from Memcached, and when would you choose one over the other?

Both are in-memory key-value stores used for caching, but Redis supports a richer set of data structures (strings, lists, sets, sorted sets, hashes, streams), optional persistence to disk, Pub/Sub messaging, and Lua scripting. Memcached is simpler, multi-threaded out of the box, and can be slightly faster for simple get/set operations with very large datasets. Choose Redis when you need anything beyond simple string caching — rate limiting with sorted sets, session storage, leaderboards, or a message broker. Choose Memcached only if you have a pure caching use case and want to minimize operational complexity.

What is a cache stampede (thundering herd), and how do you prevent it?

A cache stampede occurs when a popular cache entry expires and many concurrent requests simultaneously find a cache miss, all rushing to query the database at once — overwhelming it with redundant work. Prevention strategies include probabilistic early expiry (proactively refresh the cache slightly before expiry to avoid the cliff edge), mutex/locking (only one request rebuilds the cache while others wait or serve slightly stale data), and background refresh (a separate process continuously refreshes hot cache entries before they expire).

Load Balancing and Rate Limiting

What are the common load balancing algorithms, and what are the trade-offs?

Round-robin distributes requests sequentially across servers — simple and effective when all servers are homogeneous. Least connections routes to the server with the fewest active connections, better suited when request durations vary widely. IP hash routes a given client’s IP to the same server consistently, useful for stateful sessions but can cause uneven distribution. Weighted variants of the above allow you to send proportionally more traffic to more powerful servers. For most stateless microservices, round-robin or least connections is sufficient.

What is the difference between Layer 4 and Layer 7 load balancing?

Layer 4 load balancers operate at the transport layer (TCP/UDP) and route traffic based on IP address and port without inspecting the content of packets — they are fast and low-overhead. Layer 7 load balancers operate at the application layer (HTTP/HTTPS) and can make routing decisions based on URL paths, headers, cookies, or request body content — enabling features like SSL termination, sticky sessions, A/B routing, and path-based routing to different backend services. Most modern API gateways and cloud load balancers (AWS ALB, GCP HTTPS LB) operate at Layer 7.

How does rate limiting work, and what are the common algorithms used to implement it?

Rate limiting restricts how many requests a client can make in a given time window to protect services from abuse and ensure fair usage. Common algorithms include: fixed window counter (reset a counter every N seconds — simple but allows bursts at window boundaries), sliding window log (track exact timestamps of recent requests — accurate but memory-intensive), sliding window counter (approximate the sliding window using two fixed-window counts — a good balance), token bucket (a bucket fills with tokens at a fixed rate; each request consumes a token — allows controlled bursting), and leaky bucket (requests drain from a queue at a fixed rate — smooths out bursts). Token bucket is widely used because it naturally handles burst traffic while enforcing an average rate.

Where should rate limiting logic be enforced in a distributed system?

Rate limiting can be enforced at the API gateway (centralized, easiest to manage), at each individual service (more granular but harder to coordinate), or at the load balancer. In distributed environments, per-service in-process rate limiting (e.g., an in-memory token bucket) does not account for traffic across multiple instances — a shared data store like Redis is needed to maintain a global counter. A common pattern is a lightweight Redis-backed rate limiter at the API gateway that applies per-user or per-IP limits before requests reach backend services.

Message Queues and Event-Driven Architecture

What is the purpose of a message queue in a distributed system?

A message queue decouples producers (services that generate work) from consumers (services that process work), enabling asynchronous communication. This improves resilience — if a consumer is slow or temporarily down, messages accumulate in the queue rather than being lost or causing backpressure on the producer. It also enables load leveling: a burst of 10,000 events from the producer can be processed at a steady rate by the consumer without overwhelming it. Examples include sending email notifications, processing image uploads, or triggering downstream workflows after an order is placed.

What is the difference between a message queue and a message stream (e.g., RabbitMQ vs. Kafka)?

Traditional message queues (RabbitMQ, SQS) are designed for task distribution: each message is consumed by exactly one consumer and deleted after acknowledgment. Message streams (Kafka, Kinesis) are append-only logs where messages are retained for a configurable period and multiple independent consumer groups can read the same stream at their own pace. Kafka is better suited for event sourcing, audit logs, and fan-out to multiple consumers; RabbitMQ is better for job queues where each task should be processed once and discarded.

What is the difference between at-most-once, at-least-once, and exactly-once delivery semantics?

At-most-once delivery means messages may be lost but will never be processed twice — acceptable for use cases where occasional loss is tolerable, like metrics or telemetry. At-least-once delivery means messages will always be delivered but may be processed more than once if a consumer crashes after processing but before acknowledging — your consumer must be idempotent to handle this safely. Exactly-once delivery guarantees no loss and no duplication, but is the most expensive to implement and often requires distributed transactions or idempotency keys; Kafka supports it within the broker but end-to-end exactly-once across heterogeneous systems remains very difficult.

What is the fan-out pattern in event-driven architecture?

Fan-out means a single event published to a topic is delivered to multiple independent subscribers, each processing it in their own way. For example, when a user completes a purchase, an “order placed” event might fan out to a fulfillment service, a notification service, an analytics pipeline, and a loyalty points service simultaneously. This decouples services so that adding a new consumer does not require modifying the publisher. AWS SNS combined with SQS (SNS fan-out to multiple SQS queues) is a common pattern for implementing this reliably.

API Design: REST, GraphQL, and Pagination

What are the key principles of RESTful API design?

REST (Representational State Transfer) organizes APIs around resources identified by URLs, uses standard HTTP verbs (GET, POST, PUT, PATCH, DELETE) to express intent, and is stateless — each request contains all the information needed to process it. Good REST APIs use meaningful resource naming (nouns, not verbs: /users/123/orders not /getOrdersForUser), return appropriate HTTP status codes, and version their endpoints to allow evolution without breaking clients. Consistency in naming, error response format, and authentication mechanism across all endpoints is critical for developer experience.

What are the advantages and disadvantages of GraphQL compared to REST?

GraphQL allows clients to request exactly the fields they need in a single query, eliminating over-fetching (REST returns more data than needed) and under-fetching (REST requires multiple round trips to assemble a view). It is particularly powerful for complex, interconnected data models and for mobile clients where bandwidth matters. However, GraphQL introduces complexity: caching is harder (no natural URL-based cache keys), N+1 query problems at the resolver level are common (requiring DataLoader), and the flexibility that empowers clients also makes it harder to reason about server-side performance. It is best suited for public APIs consumed by varied clients, not internal service-to-service communication.

What are the common pagination strategies, and when would you use each?

Offset-based pagination (?page=3&limit=20) is simple to implement but degrades in performance on large datasets (the database must scan and skip N rows) and produces inconsistent results if items are inserted or deleted mid-page. Cursor-based pagination uses an opaque pointer to the last seen item (e.g., ?after=cursor_xyz), enabling efficient indexed queries and stable pages even as data changes — it is the standard for feeds and timelines but does not support jumping to arbitrary pages. Keyset pagination is similar to cursor-based but uses actual column values (e.g., WHERE created_at < '2024-01-15' LIMIT 20), offering excellent database performance for append-heavy datasets.

How do you design an API to handle versioning without breaking existing clients?

Common strategies include URL path versioning (/v1/users, /v2/users), request header versioning (API-Version: 2), and content negotiation via the Accept header. URL versioning is the most explicit and easy to test in a browser, making it the most widely adopted. The key principle is to never make breaking changes (removing fields, changing types, altering required parameters) without incrementing the version and maintaining the old version for a deprecation window. Additive changes (new optional fields, new endpoints) are generally safe to make without a version bump.

System Design Case Study: URL Shortener

How would you design a URL shortening service like bit.ly at scale?

At a high level, the service accepts a long URL and returns a short code (e.g., bit.ly/abc123), then redirects users from the short URL to the original. Key design decisions: generate short codes using a base62 encoding of a unique auto-incrementing ID (or a hash of the URL), store mappings in a fast key-value store like DynamoDB or Redis for O(1) lookups, and use a 301 (permanent) redirect to offload repeat lookups to the browser cache — or a 302 (temporary) redirect if you need to track every click. For write throughput, a single globally incrementing ID requires a coordination service; a common alternative is to pre-allocate ranges of IDs to application servers to eliminate the bottleneck.

How would you handle custom short codes and collision detection in a URL shortener?

Custom short codes (user-defined aliases) require a uniqueness check before insertion — a simple database unique constraint or a Redis SET operation (which is atomic) handles this. For auto-generated codes, using a hash (e.g., MD5) of the URL and taking the first 6 characters will occasionally collide; the standard approach is to append a counter suffix on collision and retry, or to use a dedicated ID generation service (like Twitter’s Snowflake) that guarantees uniqueness without hashing. Bloom filters can quickly reject definitely-not-stored codes before hitting the database, reducing unnecessary lookups.

System Design Case Study: Chat Application

What protocols and architectural patterns are commonly used to build a real-time chat system?

WebSockets are the standard protocol for real-time bidirectional chat because they maintain a persistent TCP connection, eliminating the overhead of HTTP handshakes for every message. Server-Sent Events (SSE) are a simpler alternative for unidirectional push (server to client) but do not support client-to-server messaging. Long polling is a fallback for environments where WebSockets are blocked. Architecturally, a connection service manages WebSocket sessions, a messaging service handles message persistence, and a Pub/Sub system (Redis Pub/Sub, Kafka) fans out messages to all connection servers that have subscribers in a given conversation.

How would you design message storage and retrieval for a chat application?

Messages should be stored in a database optimized for time-series append writes and range reads by conversation. Cassandra is a popular choice because it supports an efficient data model where the partition key is the conversation ID and the clustering key is a timestamp or message ID, allowing fast retrieval of the most recent N messages. For message history (infinite scroll), cursor-based pagination on the message ID or timestamp is standard. Hot conversations with many recent messages can be cached in Redis to serve the initial load without hitting the database.

How do you handle online presence and “last seen” indicators at scale?

Online presence is typically tracked by recording a heartbeat timestamp whenever a client sends any message or a dedicated ping, storing it in Redis with a short TTL. If the TTL expires without a new heartbeat, the user is considered offline. For “last seen,” persist the most recent heartbeat to a database on disconnect or periodically. At scale, broadcasting presence changes to all of a user’s contacts is expensive — a common optimization is to only push presence updates to contacts who are currently online and have the relevant conversation open, rather than all contacts globally.

System Design Case Study: News Feed

What are the two main approaches to generating a social media news feed, and when would you use each?

The push (fan-out on write) approach pre-computes each user’s feed at write time: when Alice posts, her post is immediately written into the feed cache of all her followers. Reads are extremely fast (the feed is pre-assembled), but writes are expensive for users with millions of followers. The pull (fan-out on read) approach assembles the feed at read time by fetching and merging recent posts from all accounts a user follows — simpler and no write amplification, but slower reads for users following many accounts. Most production systems at scale (Twitter, Facebook) use a hybrid: fan-out on write for normal users, fan-out on read for celebrity accounts (high-follower users), with the results merged at read time.

How would you rank and personalize a news feed beyond simple reverse chronological order?

Personalization requires a ranking model that scores each candidate post based on signals like recency, user interaction history (likes, comments, shares), relationship strength between the user and the author, content type preference, and post engagement velocity. In a simplified architecture, a candidate generation stage retrieves the top N recent posts from the pre-computed feed, and a lightweight ranking service re-scores and re-orders them before returning the result. True ML-based ranking requires an offline model training pipeline, a feature store for fast feature retrieval, and an A/B testing framework to evaluate ranking changes safely.

System Design Case Study: File Storage System

How would you design a distributed file storage system like Dropbox or Google Drive?

At a high level, the system consists of a metadata service (stores file names, folder structure, permissions, version history — typically in a relational database), a block storage service (stores actual file data, typically in object storage like S3), and a sync client that detects local changes and uploads them. Large files are chunked into fixed-size blocks; only changed blocks are uploaded on edit, enabling delta sync. Blocks are content-addressed (identified by their hash) so identical blocks across different files are deduplicated automatically. A separate notification service (using WebSockets or long polling) tells other clients when a file has changed so they can pull the latest version.

How do you handle concurrent edits and conflict resolution in a file sync system?

Most file sync systems (Dropbox, Google Drive) use last-write-wins for binary files: if two clients edit the same file offline and both sync, one version becomes the canonical version and the other is saved as a conflict copy for the user to resolve manually. Google Docs handles text documents differently — it uses Operational Transformation (OT) or Conflict-free Replicated Data Types (CRDTs) to merge concurrent edits at the character level, enabling real-time collaborative editing without conflicts. For a system design interview, acknowledging the trade-off between simplicity (last-write-wins) and user experience (CRDT-based merging) demonstrates solid understanding.

Monitoring and Observability

What are the three pillars of observability, and what does each one tell you?

The three pillars are metrics, logs, and traces. Metrics are numerical measurements aggregated over time (e.g., request rate, error rate, latency percentiles, CPU usage) — they tell you that something is wrong. Logs are timestamped records of discrete events (e.g., an error message with a stack trace) — they tell you what happened. Distributed traces track a request as it flows through multiple services, recording timing and causality at each hop — they tell you where time is being spent and which service caused a failure. Effective observability requires all three working together; metrics surface the symptom, traces narrow the scope, and logs provide the specific evidence.

What are the key metrics you should monitor for a web API service?

The Google SRE “Four Golden Signals” provide a practical starting point: Latency (how long requests take, ideally tracked as percentiles — p50, p95, p99 — not just averages), Traffic (request volume, showing normal load and demand patterns), Errors (rate of failed requests, broken down by error type and endpoint), and Saturation (how full key resources are — CPU, memory, database connection pool, queue depth). Beyond these, monitor dependency health (downstream service latency and error rates), and set up alerting that pages on symptoms (elevated error rate, SLA breach) rather than causes (high CPU) to reduce alert fatigue.

What is a service-level objective (SLO), and how does it differ from an SLA?

A Service Level Objective (SLO) is an internal target for a specific reliability metric, such as “99.9% of API requests should complete in under 200ms measured over a rolling 30-day window.” It is the engineering team’s commitment to itself about the quality of service it aims to deliver. A Service Level Agreement (SLA) is an external contractual commitment to customers, usually with financial penalties for breach. SLOs should be set more conservatively than SLAs — if your SLA promises 99.9% availability, your internal SLO might target 99.95% so you have an error budget buffer before the contractual threshold is breached. SLIs (Service Level Indicators) are the actual measured metrics used to evaluate whether SLOs are being met.

What is distributed tracing, and how does it help debug microservice architectures?

Distributed tracing assigns a unique trace ID to each incoming request and propagates it through every service call in the request’s path via HTTP headers or message metadata. Each service records a “span” — the time it spent processing that trace — and reports it to a centralized tracing backend (Jaeger, Zipkin, AWS X-Ray). When debugging a slow or failing request, you can pull up its trace and see a waterfall diagram of every service call, their durations, and any errors, making it immediately clear whether the bottleneck is in your own service, a database call, or a downstream dependency. Without tracing, diagnosing latency in a system with 10+ services is essentially guesswork.

Tips for Approaching System Design Questions

Clarify requirements before drawing boxes.

Spend the first 3–5 minutes asking targeted questions: What is the expected read-to-write ratio? How many daily active users? What are the latency requirements? What consistency guarantees are required? Interviewers reward candidates who expose hidden constraints early rather than designing for a problem that was not actually asked.

Start with a simple, correct design before optimizing.

Propose the simplest architecture that satisfies the requirements, then identify its bottlenecks out loud. This demonstrates that you understand trade-offs rather than reflexively reaching for Kafka and sharded Cassandra on every problem. A single PostgreSQL instance with read replicas is the right answer for many systems at moderate scale.

Make trade-offs explicit and justify them.

Do not just say “I would use Redis here.” Say “I would add a Redis cache here to reduce read latency on the most frequently accessed user profiles, accepting the added operational complexity and the risk of serving slightly stale data for up to 60 seconds.” Interviewers are evaluating your judgment and communication as much as your technical knowledge.

Drive toward numbers.

Back-of-the-envelope estimation grounds your design in reality. If the system needs to handle 100,000 writes per second, a single relational database cannot keep up — and saying so with a rough calculation is far more compelling than a vague claim. Estimating storage requirements, bandwidth, and QPS before choosing components shows engineering maturity.

Practice explaining out loud, not just thinking on paper.

System design interviews are collaborative conversations. Narrate your thinking as you draw — “I’m starting with the client and working inward,” “I’m putting a cache here because reads will outnumber writes 100 to 1,” “This is a potential bottleneck I’ll want to revisit.” Interviewers who cannot follow your reasoning cannot advocate for you, regardless of how strong your design is.