DevOps Engineer Interview Questions: CI/CD, Docker, and Kubernetes

Whether you’re preparing for your first DevOps role or leveling up to a senior position, technical interviews in this field are demanding and wide-ranging. Interviewers expect you to demonstrate hands-on experience with CI/CD pipelines, containerization, orchestration, infrastructure automation, and observability. This guide covers 45 carefully curated interview questions — with practical, real-world answers — spanning the core technologies and concepts every DevOps engineer needs to master.

CI/CD Pipelines

Q1. What is a CI/CD pipeline, and what are its key stages?

A CI/CD pipeline automates the process of integrating code changes, running tests, and deploying software. Key stages typically include source checkout, build, unit/integration testing, static analysis, artifact packaging, staging deployment, acceptance testing, and production deployment. The goal is to move code from commit to production reliably and repeatedly. A well-designed pipeline acts as the quality gate that catches defects before they reach end users.

Q2. What is the difference between Continuous Integration, Continuous Delivery, and Continuous Deployment?

Continuous Integration (CI) focuses on automatically building and testing every code commit to detect integration issues early. Continuous Delivery (CD) extends CI by ensuring the artifact is always in a deployable state, but a human approval step gates production releases. Continuous Deployment goes one step further — every commit that passes all automated tests is deployed to production without manual intervention. The choice between Delivery and Deployment depends on a team’s risk tolerance and compliance requirements.

Q3. How do you decide between Jenkins, GitHub Actions, and GitLab CI for a new project?

Jenkins is highly customizable and has a vast plugin ecosystem, making it a strong choice for complex, on-premises, or legacy environments, but you have to run and maintain the controller and build agents yourself. GitHub Actions is the natural choice for projects already hosted on GitHub: it offers native integration, a rich marketplace of actions, and no build infrastructure to manage when you use GitHub-hosted runners. GitLab CI is ideal when your team uses GitLab for source control, as pipelines are defined in the same repository and deeply integrated with merge requests, security scanning, and the GitLab container registry. Evaluate based on your existing toolchain, team expertise, and whether self-hosted runners are a requirement.

Q4. What is a blue-green deployment and when would you use it?

A blue-green deployment maintains two identical production environments — blue (current live) and green (new version) — and switches traffic from blue to green when the new version is validated. This approach enables zero-downtime deployments and instant rollback by simply redirecting the load balancer back to the blue environment. It is best suited for stateless applications where database schema changes can be handled independently. The main trade-off is the cost of maintaining two full production environments simultaneously.

Q5. Explain canary releases and how they reduce deployment risk.

A canary release gradually shifts a small percentage of production traffic (e.g., 5%) to the new version while the majority of users remain on the stable version. Teams monitor error rates, latency, and business metrics on the canary slice before progressively increasing its traffic share. If anomalies appear, traffic is immediately shifted back to the old version, limiting the blast radius. Tools like Argo Rollouts, Flagger, and AWS CodeDeploy natively support weighted traffic routing for canary strategies.
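
The promote-or-roll-back loop described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how Argo Rollouts, Flagger, or CodeDeploy implement it; the step schedule, the `max_error_rate` threshold, and the `observe_error_rate` callback are all assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical schedule: percent of traffic sent to the canary at each step.
CANARY_STEPS = (5, 25, 50, 100)


def run_canary(
    observe_error_rate: Callable[[int], float],
    max_error_rate: float = 0.01,
    steps: Sequence[int] = CANARY_STEPS,
) -> int:
    """Walk through the traffic steps and return the final canary weight.

    observe_error_rate(weight) stands in for querying a metrics system
    while the canary receives `weight` percent of traffic.
    Returns the last step's weight if every step looked healthy,
    or 0 if the canary was rolled back.
    """
    for weight in steps:
        if observe_error_rate(weight) > max_error_rate:
            # Anomaly detected: send all traffic back to the stable version.
            return 0
    return steps[-1]
```

A real rollout would also watch latency and business metrics and pause between steps; the point here is that each increase in traffic is gated on an observed health signal.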

Q6. What rollback strategies do you use in a CI/CD pipeline?

Common rollback strategies include redeploying the previous versioned artifact from your artifact registry, running kubectl rollout undo to revert a Kubernetes Deployment to its previous revision (ReplicaSet), and using feature flags to toggle off a broken feature without a full deployment. For blue-green deployments, rollback is as simple as re-pointing the load balancer to the old environment. The key principle is that rollback must be fast, automated, and tested: a rollback procedure that has never been exercised should not be trusted in a real incident.

Q7. How do you manage build artifacts in a CI/CD pipeline?

Build artifacts (compiled binaries, Docker images, Helm charts, npm packages) should be stored in a dedicated artifact registry such as JFrog Artifactory, Nexus, GitHub Packages, or a cloud-native registry like AWS ECR or Google Artifact Registry. Every artifact must be immutably versioned (semantic version or Git SHA) so that any pipeline stage can retrieve the exact build that was tested. Retention policies and vulnerability scanning (e.g., Trivy, Snyk) should be applied at the registry level. Artifacts should never be rebuilt between pipeline stages; promoting the same artifact through every environment is what guarantees consistency.

Q8. What is a Jenkinsfile and how does pipeline-as-code work?

A Jenkinsfile is a text file, stored in source control alongside application code, that defines a Jenkins pipeline using either Declarative or Scripted DSL syntax. Pipeline-as-code means the build process is version-controlled, code-reviewed, and can be reproduced exactly from any branch or tag. The Declarative syntax is preferred for readability and built-in error handling, while Scripted syntax offers more programmatic flexibility. Having the pipeline definition in the repo eliminates the “it worked on my machine” problem and makes pipeline changes auditable.

Q9. How do you handle secrets in a CI/CD pipeline?

Secrets such as API keys, database credentials, and certificates should never be hardcoded in pipeline configuration files or source code. CI/CD platforms offer native secrets stores (GitHub Actions Secrets, GitLab CI Variables) that inject values as environment variables at runtime. For more advanced use cases, integrating with HashiCorp Vault or AWS Secrets Manager allows dynamic, short-lived credentials to be fetched during the pipeline run. Regularly rotate secrets and audit access logs to maintain a least-privilege posture.

Docker and Containerization

Q10. What is the fundamental difference between a container and a virtual machine?

Virtual machines emulate full hardware and run a complete guest OS on top of a hypervisor, resulting in significant overhead in terms of memory, disk, and boot time. Containers share the host kernel and isolate processes using Linux namespaces and cgroups, making them lightweight, fast to start, and more resource-efficient. A VM provides stronger isolation since each has its own OS, which matters for multi-tenant or compliance-sensitive environments. Containers excel in microservices architectures where density, startup speed, and portability across environments are priorities.

Q11. What are Dockerfile best practices you follow in production?

Use minimal base images (e.g., Alpine or distroless) to reduce the attack surface and image size. Order Dockerfile instructions from least to most frequently changing (copy dependency manifests and install packages before copying application code) to maximize layer cache reuse. Run processes as a non-root user by adding a USER instruction. Keep secrets out of image layers: build args and ENV values are recorded in the image metadata and history, so pass build-time secrets with BuildKit secret mounts (RUN --mount=type=secret) and inject runtime secrets when the container starts. Combine RUN commands with && and clean up package manager caches in the same layer to keep image layers small.

Q12. What is a multi-stage Docker build and why is it valuable?

A multi-stage build uses multiple FROM instructions in a single Dockerfile, where each stage can produce intermediate artifacts that are selectively copied into the final image. This allows you to use a full build environment (e.g., a JDK or Node.js toolchain) in early stages and produce a minimal runtime image in the final stage, without shipping build tools, source code, or test dependencies to production. The result is dramatically smaller images, reduced attack surface, and faster image pulls. For a Go application, for example, the final stage can be a scratch image containing only the compiled binary.

Q13. How does Docker networking work, and what are the main network drivers?

Docker provides several network drivers: bridge (default for standalone containers on a single host), host (container shares the host’s network namespace), overlay (multi-host networking, used by Docker Swarm), and macvlan (assigns a MAC address to the container, making it appear as a physical device on the network). Containers on the same user-defined bridge network can communicate using container names as DNS hostnames, which is fundamental to Docker Compose service discovery. Kubernetes does not use Docker’s network drivers; Pod networking is handled by a CNI plugin (e.g., Calico, Cilium).

Q14. Explain Docker volumes and when you use them versus bind mounts.

Docker volumes are managed by the Docker daemon, stored in a Docker-controlled directory on the host (typically /var/lib/docker/volumes), and are portable, easy to back up, and shareable between containers. Bind mounts map a specific host path directly into the container, giving the container direct access to the host filesystem. Volumes are preferred in production for persistent data (databases, file uploads) because they are decoupled from host directory structure and can be managed with Docker CLI commands. Bind mounts are most useful in development to mount source code into a container for live reloading without rebuilding the image.

Q15. What is Docker Compose and what problem does it solve?

Docker Compose is a tool for defining and running multi-container applications using a YAML file (compose.yaml, or the older docker-compose.yml) that specifies services, networks, volumes, and environment variables. It solves the problem of orchestrating multiple dependent containers (for example, a web app, a Redis cache, and a PostgreSQL database) with a single docker compose up command. Compose creates the networks, attaches the volumes, and starts services in dependency order based on depends_on; note that start order alone does not mean a dependency is ready to accept connections unless you pair depends_on with a health check condition. While not suitable for production-scale orchestration (that is Kubernetes’ domain), it is invaluable for local development and integration testing environments.

Q16. How do image layers work in Docker, and why does layer ordering matter?

Instructions that modify the filesystem (RUN, COPY, ADD) each create a new read-only image layer stacked on top of the base image named in FROM; other instructions (such as ENV or CMD) only record metadata. When Docker builds an image, it checks its layer cache: if the instruction and its inputs (for COPY and ADD, the checksums of the copied files) have not changed since the last build, the cached layer is reused, skipping that step. If a layer changes, all subsequent layers are invalidated and rebuilt. This is why frequently changing instructions (like COPY . .) should come last: placing stable, slow-changing steps (like package installation) first maximizes cache hits and dramatically reduces build times.
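
To see why ordering matters, here is a toy Python model of the cache invalidation rule. It is a simplification, not Docker's actual cache implementation: each string stands for an instruction plus a fingerprint of its inputs (the `#src-v1` style suffixes are made up to represent file checksums).

```python
from typing import List


def layers_to_rebuild(previous: List[str], current: List[str]) -> List[str]:
    """Return the instructions that must be re-executed on the next build.

    The first instruction that differs from the previous build misses the
    cache, and every instruction after it is rebuilt as well.
    """
    for index, instruction in enumerate(current):
        if index >= len(previous) or previous[index] != instruction:
            return current[index:]
    return []


# Dependencies installed before the source is copied: a code change
# only invalidates the final COPY.
good_old = ["FROM python:3.12-slim", "COPY requirements.txt . #deps-v1",
            "RUN pip install -r requirements.txt", "COPY . . #src-v1"]
good_new = good_old[:3] + ["COPY . . #src-v2"]

# Source copied first: the same code change also forces a reinstall.
bad_old = ["FROM python:3.12-slim", "COPY . . #src-v1",
           "RUN pip install -r requirements.txt"]
bad_new = ["FROM python:3.12-slim", "COPY . . #src-v2",
           "RUN pip install -r requirements.txt"]

print(layers_to_rebuild(good_old, good_new))  # one layer rebuilt
print(layers_to_rebuild(bad_old, bad_new))    # two layers rebuilt
```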

Q17. What are key Docker security concerns in a production environment?

Running containers as root is one of the most common and dangerous misconfigurations — always specify a non-root USER in your Dockerfile. Limit container capabilities using --cap-drop ALL and only add back the specific capabilities required. Scan images for known CVEs using tools like Trivy, Snyk, or AWS Inspector before pushing to production. Use read-only root filesystems (--read-only), avoid privileged mode, enforce image signing with Docker Content Trust or Sigstore, and keep base images updated with a regular automated patching pipeline.

Q18. What is the difference between CMD and ENTRYPOINT in a Dockerfile?

ENTRYPOINT defines the executable that always runs when the container starts and is not easily overridden — it sets the container’s primary process. CMD provides default arguments to the ENTRYPOINT or, if no ENTRYPOINT is set, acts as the default command itself; CMD can be overridden at docker run time by appending a command. The common pattern is to use ENTRYPOINT for the fixed binary (e.g., ["python"]) and CMD for default arguments (e.g., ["app.py"]), allowing users to pass different arguments without knowing the full path. Using the exec form (["executable", "arg"]) rather than shell form is preferred because it avoids spawning an extra shell process and ensures signals are passed directly to the process.
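
The way exec-form ENTRYPOINT and CMD combine can be modeled with a small Python function. This is a sketch of the rule described above, not Docker code; `effective_command` and its parameters are names invented for the example, and `run_args` stands for whatever is appended after the image name in docker run.

```python
from typing import List, Optional


def effective_command(
    entrypoint: Optional[List[str]],
    cmd: Optional[List[str]],
    run_args: Optional[List[str]] = None,
) -> List[str]:
    """Return the argv the container would start with (exec form only).

    Arguments passed at run time replace CMD; ENTRYPOINT, if set,
    always stays at the front.
    """
    arguments = run_args if run_args else (cmd or [])
    return (entrypoint or []) + list(arguments)


# ENTRYPOINT ["python"] with CMD ["app.py"]
print(effective_command(["python"], ["app.py"]))
print(effective_command(["python"], ["app.py"], ["worker.py", "--verbose"]))
# No ENTRYPOINT: CMD is the whole command, and run-time args replace it entirely.
print(effective_command(None, ["python", "app.py"], ["sh"]))
```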

Kubernetes

Q19. What is the difference between a Pod, a Deployment, and a ReplicaSet in Kubernetes?

A Pod is the smallest deployable unit in Kubernetes, encapsulating one or more containers that share network and storage. A ReplicaSet ensures a specified number of identical Pod replicas are running at any given time, providing basic self-healing by replacing failed Pods. A Deployment is a higher-level abstraction that manages ReplicaSets, enabling declarative updates, rollback history, and rolling update strategies. In practice, you almost always create Deployments rather than ReplicaSets directly, because Deployments provide the update and rollback lifecycle management that ReplicaSets alone do not.

Q20. How do Kubernetes Services work, and what are the different service types?

A Kubernetes Service provides a stable virtual IP (ClusterIP) and DNS name that load-balances traffic to a dynamic set of Pods selected by label selectors, decoupling consumers from Pod lifecycle churn. ClusterIP is only reachable within the cluster. NodePort exposes the service on a static port on every node’s IP, making it accessible externally but rarely used in production. LoadBalancer provisions a cloud provider’s external load balancer. ExternalName maps the service to an external DNS name. In production, Ingress resources (managed by an Ingress Controller like Nginx or Traefik) are preferred over LoadBalancer services for HTTP/HTTPS routing.

Q21. What is an Ingress resource and how does an Ingress Controller work?

An Ingress resource is a Kubernetes API object that defines HTTP and HTTPS routing rules, mapping hostnames and URL paths to backend Services. An Ingress Controller (e.g., Nginx Ingress, Traefik, or the AWS Load Balancer Controller, formerly the AWS ALB Ingress Controller) is a running component in the cluster that watches the Kubernetes API for Ingress resources and configures the underlying load balancer or reverse proxy accordingly. Without an Ingress Controller, Ingress resources have no effect. Ingress enables TLS termination, name-based virtual hosting, and path-based routing from a single external IP, making it far more cost-effective than provisioning a LoadBalancer service per application.

Q22. What are ConfigMaps and Secrets, and how do they differ?

ConfigMaps store non-sensitive configuration data as key-value pairs and can be consumed by Pods as environment variables, command-line arguments, or mounted files. Secrets store sensitive data (passwords, tokens, certificates) similarly, but the values are base64-encoded (not encrypted by default) and Kubernetes applies additional access controls to them. In production, you should enable etcd encryption at rest for Secrets and integrate with an external secrets manager (e.g., HashiCorp Vault with the Vault Agent Injector, or External Secrets Operator) to avoid storing sensitive values directly in the cluster. Never commit raw Secret manifests to source control.
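
A quick way to convince yourself (or an interviewer) that base64 is an encoding and not encryption is to reverse it with nothing but the standard library. The helper names and the sample value below are made up for the illustration.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears under `data:` in a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Recover the plaintext; no key is needed, which is the whole point."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_secret_value("s3cr3t")
print(encoded)
print(decode_secret_value(encoded))
```

Anyone who can read the Secret object can do the same, which is why access control, encryption at rest, and external secret managers matter.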

Q23. How does Horizontal Pod Autoscaling (HPA) work in Kubernetes?

The HPA controller periodically queries metrics (default interval: 15 seconds) from the Metrics Server (or custom/external metrics adapters) and adjusts the replica count of a Deployment, ReplicaSet, or StatefulSet to keep a target metric (e.g., 70% CPU utilization) close to its target value. The HPA calculates the desired replica count using the formula: desiredReplicas = ceil(currentReplicas * (currentMetric / desiredMetric)). For utilization targets, Pods must have accurate resource requests defined, because utilization is computed as a percentage of the requested CPU or memory. HPA can also scale on custom or external metrics such as HTTP request rate, and KEDA (Kubernetes Event-Driven Autoscaling) builds on HPA to scale on event sources for more sophisticated workloads.
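
The formula is easy to work through by hand in an interview; the Python sketch below does the same arithmetic. It covers only the core calculation plus clamping to the min/max replica bounds. The real controller applies additional logic (such as a tolerance band and stabilization windows) that is deliberately left out, and the function and parameter names are assumptions for the example.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """desiredReplicas = ceil(currentReplicas * (currentMetric / desiredMetric)),
    clamped to the configured minimum and maximum."""
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% CPU with a 70% target: 4 * 1.2857... = 5.14, rounded up to 6.
print(desired_replicas(4, 90, 70))
# 4 replicas at 35% CPU with a 70% target: 4 * 0.5 = 2.
print(desired_replicas(4, 35, 70))
```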

Q24. What is RBAC in Kubernetes and how do you implement least-privilege access?

Role-Based Access Control (RBAC) in Kubernetes controls who (user, group, or service account) can perform which actions (verbs: get, list, create, delete) on which resources (pods, secrets, deployments) within which scope (namespace via Role/RoleBinding, or cluster-wide via ClusterRole/ClusterRoleBinding). Least-privilege means granting only the permissions required for a specific workload — for example, a CI/CD service account might only need create and update on Deployments in a single namespace, not cluster-admin. Audit access regularly with tools like kubectl-who-can or Polaris. Avoid using the default service account in namespaces, as it often has broader permissions than intended.

Q25. What is Helm and what problems does it solve?

Helm is the package manager for Kubernetes, allowing you to define, install, upgrade, and version Kubernetes applications as “charts” — collections of templated YAML manifests with a values file for customization. It solves the problem of managing large numbers of related Kubernetes resources as a single unit, supporting parameterization across environments (dev, staging, prod) with different values files. Helm’s release history allows atomic upgrades and rollbacks with helm rollback. In production, Helm is commonly used alongside GitOps tools like Argo CD or Flux, which reconcile Helm releases declared in Git repositories.

Q26. How do rolling updates work in Kubernetes, and how do you control the rollout pace?

A Kubernetes rolling update replaces old Pods with new ones incrementally, governed by two Deployment strategy parameters: maxUnavailable (how many Pods can be unavailable during the update) and maxSurge (how many extra Pods can be created above the desired count). By setting maxUnavailable: 0 and maxSurge: 1, you ensure zero downtime — a new Pod must be Ready before an old one is terminated. You can monitor progress with kubectl rollout status and pause/resume with kubectl rollout pause/resume. Readiness probes are critical here — Kubernetes will not route traffic to a new Pod until its readiness probe passes, preventing premature traffic shifts to unhealthy instances.
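
The interaction between maxSurge and maxUnavailable can be simulated with a toy Python loop. This is a simplified model, not the Deployment controller's code: it assumes every new Pod passes its readiness probe immediately and only accepts absolute numbers (not percentages). The function name is invented for the example.

```python
from typing import List, Tuple


def simulate_rollout(desired: int, max_surge: int, max_unavailable: int) -> List[Tuple[int, int]]:
    """Return (old_pods, new_pods) after each step of a simplified rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = desired, 0
    history = [(old, new)]
    while old > 0 or new < desired:
        # Scale up: the total Pod count may not exceed desired + maxSurge.
        new += min(desired - new, desired + max_surge - (old + new))
        # Scale down: available Pods may not drop below desired - maxUnavailable.
        old -= min(old, (old + new) - (desired - max_unavailable))
        history.append((old, new))
    return history


# maxSurge: 1, maxUnavailable: 0 -> one Pod is swapped at a time, never below 3 available.
print(simulate_rollout(3, 1, 0))
```

With maxUnavailable set to 0, every recorded step still has at least `desired` Pods, which is the zero-downtime property described above.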

Q27. What are liveness and readiness probes, and why are they both necessary?

A readiness probe tells Kubernetes when a Pod is ready to receive traffic — the Pod is removed from the Service’s endpoint list when it fails, preventing requests from being routed to an initializing or overloaded container. A liveness probe detects when a container has entered a broken, unrecoverable state (e.g., a deadlock) and triggers a container restart. Both are necessary because a container can be alive (liveness passes) but not ready (e.g., still warming up a cache), or it can be ready but later become stuck (liveness must detect and restart it). A third probe, the startup probe, is used for slow-starting applications to prevent liveness probes from killing the container before it has had time to initialize.

Q28. What is the purpose of Kubernetes namespaces and when should you use multiple namespaces?

Namespaces provide a logical partitioning mechanism within a single Kubernetes cluster, enabling resource isolation, access control (RBAC scoped to a namespace), and resource quota enforcement between teams or environments. Common patterns include separate namespaces for each environment (dev, staging, prod), per-team namespaces, or per-application namespaces for large organizations. However, namespaces do not provide strong security isolation — for true multi-tenancy with hard isolation, separate clusters or virtual clusters (vcluster) are more appropriate. NetworkPolicies can enforce namespace-level network segmentation to prevent unauthorized cross-namespace communication.

Infrastructure as Code

Q29. What is Terraform and how does it differ from Ansible?

Terraform is a declarative Infrastructure as Code tool that provisions and manages cloud and on-premises infrastructure (VMs, networks, databases, DNS records) by defining the desired end state in HCL configuration files. Ansible is a configuration management and orchestration tool that runs playbooks, ordered lists of tasks, to configure existing systems: installing packages, managing services, deploying applications. The key distinction is that Terraform excels at provisioning infrastructure resources (the “what exists”), while Ansible is better suited to configuring what’s running on those resources (the “what is installed and how it’s configured”). They are complementary: Terraform to provision, Ansible (or cloud-init) to configure.

Q30. What is Terraform state, and what are best practices for managing it?

Terraform state is a JSON file (terraform.tfstate) that maps your configuration to the real-world resources it manages, tracking resource IDs and metadata so Terraform can determine what needs to be created, updated, or destroyed on subsequent applies. Storing state locally is dangerous in a team environment — always use a remote backend (S3 with DynamoDB locking, Terraform Cloud, Azure Blob Storage) to enable collaboration and prevent concurrent state corruption. Enable state file encryption at rest and restrict access via IAM policies. Use workspaces or separate state files per environment to isolate blast radius. Never manually edit state files — use terraform state mv and terraform import for state manipulation.

Q31. Explain Terraform’s plan and apply workflow and why it matters.

terraform plan generates an execution plan showing exactly what Terraform will create, modify, or destroy to reach the desired state declared in your configuration — it makes no changes to real infrastructure. terraform apply executes the approved plan against the actual infrastructure. This two-step workflow is critical for safety: it prevents surprise changes and allows code review of infrastructure modifications before they happen. In CI/CD pipelines, it is common practice to run terraform plan on pull requests (posting the diff as a comment) and only run terraform apply on merge to the main branch, often requiring manual approval for production environments.

Q32. What are Ansible playbooks and how does idempotency work in Ansible?

An Ansible playbook is a YAML file that defines a series of tasks to be executed against a group of hosts defined in an inventory file. Idempotency means running the same playbook multiple times produces the same end state without unintended side effects; for example, the apt module installs a package only if it is not already present, rather than reinstalling it every time. Most built-in Ansible modules are idempotent by design, but shell and command tasks are not: they need guards such as the creates argument, a when conditional, or changed_when to behave idempotently. Idempotency is essential for reliable automation, because playbooks should be safe to re-run at any time for configuration drift correction.
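
The check-then-change pattern behind idempotent modules can be illustrated in plain Python. This is not Ansible code; `ensure_line` is a hypothetical helper that mimics the idea of "make the state true and report whether anything changed", and the file name and setting are invented.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


def demo() -> tuple:
    """Run the same operation twice against a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "app.conf"
        first = ensure_line(target, "max_connections=100")
        second = ensure_line(target, "max_connections=100")
        return first, second, target.read_text()


print(demo())
```

The first call reports a change and the second does not, and the file contains the line exactly once. A naive "append this line" shell command would add a duplicate on every run.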

Monitoring, Observability, and Logging

Q33. What is Prometheus and how does it collect metrics?

Prometheus is an open-source time-series metrics system that uses a pull model — it scrapes HTTP endpoints (typically /metrics) on configured targets at a defined interval and stores metric samples locally in its time-series database. Applications expose metrics in Prometheus exposition format via client libraries (Go, Java, Python, etc.), and infrastructure components like Kubernetes nodes are scraped via exporters (Node Exporter, kube-state-metrics). The pull model provides a natural health check: if Prometheus can’t scrape a target, the target is likely down. Alertmanager handles routing, deduplication, and notification of alerts defined by Prometheus alerting rules.
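
To make the exposition format concrete, here is a hand-rolled Python sketch that renders a counter the way a /metrics endpoint would return it. In practice you would use an official client library rather than formatting text yourself; the `render_counter` helper and the sample metric are assumptions for the example.

```python
from typing import Dict, List, Tuple


def render_counter(name: str, help_text: str, samples: List[Tuple[Dict[str, str], float]]) -> str:
    """Render one counter metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "post", "code": "200"}, 1027), ({"method": "post", "code": "400"}, 3)],
))
```

Prometheus scrapes this plain text over HTTP at each interval and stores every line as a sample in the time series identified by the metric name and label set.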

Q34. How does Grafana complement Prometheus, and what are dashboards used for?

Grafana is a visualization and analytics platform that queries data sources (including Prometheus, Loki, Elasticsearch, and others) and renders time-series graphs, heatmaps, tables, and alert panels into configurable dashboards. While Prometheus provides the query language (PromQL) and data storage, Grafana provides the human-readable operational visibility layer that teams use for real-time monitoring, capacity planning, and incident investigation. Dashboards should be version-controlled as JSON (stored in Git) and provisioned automatically, through Grafana’s file-based provisioning or its HTTP API, rather than configured manually, ensuring consistency across environments. Grafana can also manage alerts and route them via its unified alerting system.

Q35. What is the ELK stack and what role does each component play?

The ELK stack consists of Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed search and analytics engine that indexes and stores log data, providing fast full-text search and aggregations. Logstash is a data processing pipeline that ingests logs from various sources, parses and transforms them (e.g., parsing Apache log formats, enriching with GeoIP), and ships them to Elasticsearch. Kibana is the visualization layer — it queries Elasticsearch and provides dashboards, log search interfaces, and alerting. In modern setups, Filebeat (a lightweight log shipper) often replaces Logstash for collection, and the stack is sometimes referred to as the Elastic Stack or EFK (Elasticsearch, Fluentd, Kibana) when Fluentd is used instead.

Q36. What is distributed tracing and why is it important in a microservices environment?

Distributed tracing tracks a single request as it propagates across multiple microservices, capturing timing data for each service hop as “spans” within a “trace.” This is critical in microservices architectures where a single user action might trigger calls to 10+ services — logs alone cannot tell you which service in the chain caused a 2-second latency spike. Tools like Jaeger, Zipkin, and OpenTelemetry provide distributed tracing by propagating trace context headers (e.g., traceparent) between services. OpenTelemetry has become the de facto vendor-neutral standard for instrumentation, allowing teams to switch backend tracing systems without re-instrumenting their code.
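
Context propagation is the part candidates are most often asked to explain, so here is a minimal Python sketch of what a service does with an incoming traceparent header: keep the trace ID, mint a new span ID, and forward the result downstream. It follows the W3C Trace Context layout (version, trace ID, parent span ID, flags) but skips most validation; in real code an OpenTelemetry SDK handles this for you, and the function names are invented.

```python
import re
import secrets
from typing import Dict

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Dict[str, str]:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

Because every hop reuses the same trace ID, the tracing backend can stitch the spans from all services into a single trace.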

General DevOps Principles and SRE

Q37. What is Site Reliability Engineering (SRE) and how does it differ from traditional DevOps?

SRE is Google’s approach to applying software engineering principles to operations problems, treating infrastructure management, reliability, and on-call work as software challenges to be solved with code. While DevOps is a cultural philosophy that emphasizes collaboration between development and operations teams, SRE provides specific prescriptive practices: error budgets, SLOs, blameless postmortems, and toil reduction targets. A key SRE principle is that operational work (on-call, incident response, tickets, manual tasks) should consume no more than 50% of an SRE team’s time; the rest is devoted to engineering work, such as automation, that eliminates that operational burden. SRE and DevOps are complementary: SRE can be thought of as a specific implementation of DevOps principles.

Q38. What are SLOs, SLAs, and SLIs, and how are they related?

A Service Level Indicator (SLI) is a quantitative measurement of service behavior, for example, the percentage of HTTP requests that return a 2xx response within 500ms. A Service Level Objective (SLO) is the target range for an SLI, for example, “99.9% of requests must succeed within 500ms over a 30-day window.” A Service Level Agreement (SLA) is a formal contract with customers that specifies consequences (refunds, penalties) if the agreed service levels are not met; SLA targets are usually looser than the internal SLOs. The error budget is derived from the SLO: a 99.9% availability SLO allows 0.1% of the window to be unavailable, which is about 43.2 minutes over a 30-day window (roughly 43.8 minutes in an average calendar month). When the error budget is consumed, teams focus on reliability over feature development until it recovers.
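
Error budget arithmetic comes up often enough that it is worth being able to do on a whiteboard. The Python sketch below shows the calculation; note that the exact figure depends on the window length (30 days gives 43.2 minutes at 99.9%, while an average calendar month of about 30.44 days gives about 43.8). The function names are made up for the example.

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: float = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


print(round(error_budget_minutes(0.999), 2))    # 30-day window at 99.9%
print(round(error_budget_minutes(0.9999), 2))   # 30-day window at 99.99%
print(round(budget_remaining(0.999, 21.6), 2))  # half the budget already used
```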

Q39. How do you approach incident management and what is a blameless postmortem?

Incident management involves detecting an incident (via alerts or user reports), declaring severity, assembling an incident response team, mitigating impact, resolving the root cause, and conducting a postmortem review. A blameless postmortem is a post-incident analysis document that focuses on systemic and process failures rather than individual mistakes — the premise is that people make mistakes within systems, and the goal is to improve those systems so the same class of failure cannot recur. A good postmortem includes a timeline, root cause analysis, contributing factors, impact quantification, and concrete action items with owners and due dates. Postmortems should be shared openly within the organization to build institutional knowledge.

Q40. What is GitOps and how does it differ from traditional CI/CD?

GitOps is an operational model where Git is the single source of truth for both application and infrastructure desired state. Instead of CI pipelines pushing changes to environments imperatively, a GitOps agent (e.g., Argo CD, Flux) running in the cluster continuously reconciles the live cluster state with the state declared in a Git repository and automatically corrects any drift. The difference from traditional CI/CD is the directionality: traditional pipelines push changes out, while GitOps agents pull desired state in. This means the cluster always reflects what is in Git, every change is auditable via Git history, and rollback is as simple as reverting a Git commit. GitOps also improves security by removing the need for CI pipelines to have direct kubectl access to production clusters.
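
The reconciliation loop at the heart of GitOps boils down to diffing desired state against live state and acting on the difference. The Python sketch below shows that diff step only. It is a conceptual model, not how Argo CD or Flux are implemented; resources are reduced to a name and a spec string, and `plan_reconcile` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def plan_reconcile(desired: Dict[str, str], live: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the actions needed to make the live state match what Git declares."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))  # drift or a new commit
    for name in sorted(live):
        if name not in desired:
            actions.append(("delete", name))  # pruning: not declared in Git
    return actions


desired = {"web": "image:v2", "worker": "image:v1"}       # what Git declares
live = {"web": "image:v1", "debug-pod": "busybox"}        # what the cluster runs
print(plan_reconcile(desired, live))
```

An agent runs this comparison continuously, so a manual change in the cluster shows up as drift on the next pass and is corrected, and a Git revert simply changes the desired side of the diff.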

Q41. What is the concept of “toil” in SRE and why is reducing it important?

Toil is manual, repetitive, automatable operational work that scales linearly with service growth and provides no lasting engineering value — examples include manually restarting crashed services, running scripts to add capacity, or processing recurring ticket requests. SRE teams target keeping toil below 50% of their working time; excessive toil crowds out engineering work and leads to burnout and turnover. Reducing toil requires identifying recurring manual tasks and building automation to eliminate them. Paradoxically, some toil is acceptable and even useful for building system understanding, but it must be tracked and continuously driven down over time.

Q42. What is chaos engineering and how is it practiced?

Chaos engineering is the practice of intentionally injecting failures into a production or production-like system to discover weaknesses before they cause unplanned outages. Pioneered by Netflix with the Chaos Monkey tool, it involves forming a hypothesis (“the system will degrade gracefully if one availability zone fails”), running a controlled experiment (terminating instances in one AZ), observing real system behavior, and fixing gaps between the hypothesis and reality. Modern tools include Gremlin, Chaos Mesh (for Kubernetes), and AWS Fault Injection Simulator. The key is to run experiments with a defined “blast radius” — limiting the scope of the failure injection — and always have a kill switch to stop the experiment immediately if real customer impact exceeds acceptable thresholds.

Q43. What is the twelve-factor app methodology and how does it influence DevOps practices?

The Twelve-Factor App is a methodology for building software-as-a-service applications that are portable, scalable, and maintainable in cloud environments. Key factors directly relevant to DevOps include: storing config in environment variables (factor III), treating backing services as attached resources (factor IV), exporting logs as event streams to stdout rather than managing log files (factor XI), and strict separation of build, release, and run stages (factor V). These principles make applications inherently container-friendly and CI/CD-compatible. Teams adopting twelve-factor methodology find that their applications integrate more naturally with Kubernetes, Docker, and modern observability tooling.

Q44. How do you implement network policies in Kubernetes to enforce zero-trust networking?

Kubernetes NetworkPolicy resources define ingress and egress rules for Pods based on pod selectors, namespace selectors, and IP blocks. By default, all Pod-to-Pod communication is allowed within a cluster; a zero-trust posture starts with a default-deny policy in each namespace that blocks all ingress and egress, then adds specific allow rules only for required communication paths. For example, a database Pod should only accept ingress from Pods in the same namespace with a specific label (e.g., role: api). NetworkPolicies require a CNI plugin that enforces them: Calico, Cilium, and Weave Net support NetworkPolicy enforcement, while Flannel on its own does not. Cilium’s eBPF-based implementation also supports Layer 7 (HTTP/gRPC) policy enforcement for more granular control.
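
The default-deny-plus-allow-list semantics can be modeled in a few lines of Python. This is a simplified sketch covering ingress within a single namespace and label selectors only (no namespace selectors, IP blocks, or ports); the dictionary layout and function names are assumptions, not the NetworkPolicy API.

```python
from typing import Dict, List

Labels = Dict[str, str]


def selector_matches(selector: Labels, labels: Labels) -> bool:
    """An empty selector matches every Pod, as in Kubernetes."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(source: Labels, destination: Labels, policies: List[dict]) -> bool:
    """Decide whether traffic from `source` Pod labels may reach `destination`."""
    selecting = [p for p in policies if selector_matches(p["pod_selector"], destination)]
    if not selecting:
        return True  # no policy selects the Pod, so all ingress is allowed
    # Policies are additive: traffic passes if any selecting policy allows it.
    return any(
        selector_matches(allowed, source)
        for policy in selecting
        for allowed in policy["allow_from"]
    )


default_deny = {"pod_selector": {}, "allow_from": []}
allow_api_to_db = {"pod_selector": {"app": "db"}, "allow_from": [{"role": "api"}]}
policies = [default_deny, allow_api_to_db]

print(ingress_allowed({"role": "api"}, {"app": "db"}, policies))       # allowed
print(ingress_allowed({"role": "frontend"}, {"app": "db"}, policies))  # denied
```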

Q45. What is your approach to Kubernetes resource requests and limits, and why do they matter?

Resource requests tell the Kubernetes scheduler how much CPU and memory a Pod requires, influencing which node it is placed on. Resource limits cap the maximum CPU and memory the container can consume — a container exceeding its memory limit is OOMKilled, and one exceeding its CPU limit is throttled. Setting requests without limits risks a “noisy neighbor” problem where one Pod starves others on the same node. Setting limits too low causes unnecessary throttling and OOMKills. Best practice is to profile application resource usage under realistic load, set requests to the typical (P50) usage, set memory limits equal to requests (since memory is not compressible), and set CPU limits conservatively or use a Vertical Pod Autoscaler to manage them dynamically. Always define both requests and limits in production workloads — pods with no requests cannot be effectively scheduled.

Interview Preparation Tips

Mastering DevOps interviews requires more than memorizing definitions. Here are the most effective preparation strategies used by engineers who land senior roles:

Build and break things hands-on. Spin up a local Kubernetes cluster with kind or minikube, build CI/CD pipelines in GitHub Actions or GitLab CI for a personal project, and write Terraform to provision real cloud infrastructure. Interviewers can immediately tell the difference between candidates who have read documentation and those who have debugged a CrashLoopBackOff at 2am.

Practice explaining the “why” not just the “what.” For every concept — blue-green deployments, HPA, Terraform state — be ready to articulate the problem it solves, when you would and would not use it, and what trade-offs it introduces. Senior DevOps roles require architectural judgment, not just tool knowledge.

Prepare war stories. Have two or three concrete incidents you have worked on where you can describe the symptom, your debugging process, the root cause, and the remediation steps. The STAR (Situation, Task, Action, Result) format works well here. Incident experience is one of the most differentiating signals in DevOps interviews.

Stay current with the ecosystem. The DevOps toolchain evolves rapidly. Follow CNCF project updates, read the Kubernetes release notes for each minor version, and keep an eye on emerging tools like Cilium, OpenTelemetry, Argo Workflows, and Crossplane. Demonstrating awareness of current ecosystem trends signals that you are actively engaged with the community, not just maintaining legacy pipelines.

Know your numbers. Be comfortable discussing SLO targets, error budget calculations, deployment frequency metrics (from the DORA report), and capacity planning math. DevOps is increasingly a data-driven discipline, and interviewers at mature organizations will expect quantitative thinking alongside operational experience.
