
A Kubernetes Engineer owns the cluster itself. Unlike DevOps Engineers who treat Kubernetes as one tool among many, or Platform Engineers who build developer platforms on top of K8s, a Kubernetes Engineer specialises in cluster operations, container orchestration, and the Kubernetes ecosystem. This means deep expertise in CNI plugins, service meshes, etcd management, RBAC, Helm, operators, and production troubleshooting.

These interview questions focus on real-world cluster scenarios: how you'd debug a pending pod, design a multi-tenant network policy, scale etcd under load, or implement a custom storage provisioner. The questions separate engineers who can recite YAML syntax from those who understand the control plane, scheduler, kubelet lifecycle, and how to operate Kubernetes reliably at scale.

Behavioural Questions

Tell us about a time a Kubernetes cluster went down in production. What was the root cause, and how did you fix it?
Describe a situation where you had to debug a complex multi-pod networking issue. What tools and methods did you use?
Walk us through a time you had to upgrade a Kubernetes cluster with zero downtime. What was your strategy, and what went wrong?

Tell us about a time you had to teach a developer team about RBAC and network policies. How did you explain it?
Describe a situation where you disagreed with a security or infrastructure decision. How did you handle it?
Walk us through a time you had to balance cluster stability with allowing teams to innovate. What trade-offs did you make?

Tell us about a time you had to learn a new Kubernetes feature or tool quickly. How did you approach it?
Describe a situation where you mentored a junior engineer on Kubernetes best practices. What did you teach them?
Walk us through a time you solved a problem using a non-obvious Kubernetes feature or pattern. Why did you choose that approach?

Core Kubernetes Concepts & Architecture

Explain the Kubernetes control plane. What are the key components, and how do they work together to create a pod?
Sample Answer Guidance: The control plane consists of the API server (the only component that talks to etcd), etcd (the state store), the scheduler (assigns pods to nodes), the controller manager (reconciles state), and, on cloud providers, the cloud controller manager. When you create a Deployment, the API server stores it in etcd. The Deployment controller watches Deployments and creates ReplicaSets; the ReplicaSet controller creates Pods. The scheduler watches for unscheduled pods and binds each one to a node. The kubelet, a node component rather than part of the control plane, watches for pods bound to its node and starts their containers.
What is the role of etcd in Kubernetes? Why is it critical, and how would you back it up?
Sample Answer Guidance: etcd is the single source of truth for all cluster state: namespaces, deployments, secrets, configmaps. It uses Raft consensus for fault tolerance. Without etcd, the cluster loses all state. Back it up by snapshotting etcd regularly (e.g., `etcdctl snapshot save`) and storing snapshots off-cluster. Also monitor etcd latency and disk usage; slow etcd causes cascading delays in the entire control plane.
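The fault-tolerance point can be made concrete with quorum arithmetic. This is a minimal sketch assuming a standard Raft majority quorum; the function names are illustrative and are not part of etcd or `etcdctl`.

```python
def quorum(members):
    """Smallest majority of a Raft cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

Under this assumption a 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why odd cluster sizes are the usual recommendation.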
How does the scheduler decide which node to place a pod on? What are predicates and priorities?
Sample Answer Guidance: The scheduler uses a two-phase process. Filtering (historically called predicates) removes nodes that cannot run the pod (e.g., insufficient unrequested CPU/memory, untolerated taints, non-matching node selectors). Scoring (historically called priorities) ranks the remaining nodes (e.g., least-requested resources, pod affinity). The scheduler binds the pod to the highest-scoring node. In current Kubernetes versions both phases are implemented as plugins in the scheduling framework, so custom rules are written as filter or score plugins. Understanding this matters for debugging Pending pods.
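The two phases can be sketched in a few lines. This is a toy model, not the real scheduler: the `Node` and `Pod` classes, the single node-selector check, and the "most free CPU left" scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int    # unrequested CPU in millicores
    free_mem_mi: int   # unrequested memory in MiB
    labels: dict = field(default_factory=dict)


@dataclass
class Pod:
    cpu_m: int         # requested CPU in millicores
    mem_mi: int        # requested memory in MiB
    node_selector: dict = field(default_factory=dict)


def feasible(node, pod):
    """Filter phase: reject nodes that cannot run the pod at all."""
    fits = node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi
    selected = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    return fits and selected


def score(node, pod):
    """Score phase (toy rule): prefer the node left with the most free CPU."""
    return node.free_cpu_m - pod.cpu_m


def schedule(nodes, pod):
    """Return the chosen node name, or None if the pod would stay Pending."""
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, pod)).name


if __name__ == "__main__":
    cluster = [
        Node("a", 500, 1024, {"disk": "ssd"}),
        Node("b", 2000, 4096, {"disk": "hdd"}),
        Node("c", 1500, 4096, {"disk": "ssd"}),
    ]
    print(schedule(cluster, Pod(1000, 512, {"disk": "ssd"})))
```

A `None` result corresponds to the situation a candidate has to debug: no node passed filtering, so the pod stays Pending until requests, selectors, or capacity change.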
What is a watch in Kubernetes? How does it differ from polling, and why is it important?
Sample Answer Guidance: A watch is a long-lived HTTP connection where the API server streams state changes to clients (e.g., controllers, kubelet, kubectl). Instead of polling (repeatedly asking 'is the pod running?'), watch is event-driven: the client receives notifications when the pod changes. This reduces API server load and makes reconciliation near-instantaneous. Controllers rely on watches to react to changes efficiently.
Explain the difference between a Deployment, StatefulSet, and DaemonSet. When would you use each?
Sample Answer Guidance: A Deployment manages stateless replicas with rolling updates; the ReplicaSet controller ensures the desired number of pods. StatefulSet manages stateful services: each pod gets a stable identity (e.g., postgres-0, postgres-1), stable storage (PersistentVolumes), and ordered updates. DaemonSet ensures one pod on every eligible node (e.g., logging agents, monitoring). Use Deployment for most apps, StatefulSet for databases or services requiring persistence, DaemonSet for cluster-wide utilities.
What is a controller manager, and how does it differ from the scheduler? Name two controllers you know.
Sample Answer Guidance: The controller manager runs a suite of controllers, each watching a specific resource type and reconciling its state. The Deployment controller watches Deployments and creates/updates ReplicaSets. The ReplicaSet controller watches ReplicaSets and creates/updates Pods. The scheduler is separate: it assigns Pending pods to nodes. Controllers enforce desired state; the scheduler assigns pods to nodes. Other controllers include StatefulSet, DaemonSet, and Job controllers.
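The reconciliation idea behind every controller can be illustrated with a single function. This is a simplified sketch of a ReplicaSet-style loop; the `reconcile` function and its action tuples are hypothetical and do not mirror the real controller's code.

```python
def reconcile(desired_replicas, running_pods):
    """Compare desired and actual state; return the actions that close the gap."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: request the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: delete the surplus (this sketch picks the first ones).
        return [("delete", name) for name in running_pods[:-diff]]
    return []  # already converged, nothing to do


if __name__ == "__main__":
    print(reconcile(3, ["web-a"]))
    print(reconcile(1, ["web-a", "web-b", "web-c"]))
```

The function is idempotent: calling it again once the actions have been applied returns an empty list, which is the property that lets controllers be re-run safely on every watch event.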

What interviewers look for: Strong candidates explain the control plane (API server, etcd, scheduler, controller manager) as an interconnected system, understand watch semantics and reconciliation loops, and explain how kubelet drives pod lifecycle. They mention specific components (e.g., 'the scheduler uses predicates and priorities'). Weak candidates describe Kubernetes as 'a system that runs containers' or focus only on kubectl commands. They cannot explain how a Deployment becomes a running pod, or confuse the scheduler with the controller manager.

Networking, Storage & Security

Explain the Kubernetes networking model. How does a packet travel from one pod to another?
Sample Answer Guidance: The Kubernetes networking model requires three things: pod-to-pod communication (handled by CNI plugins like Calico or Flannel), pod-to-service routing (handled by kube-proxy and iptables/IPVS), and external-to-pod ingress. A packet from pod-A to pod-B: pod-A sends to pod-B's IP, the CNI plugin (e.g., Calico) routes it across the network, the kernel delivers it to pod-B. The CNI plugin creates virtual network interfaces and manages IP address allocation per pod.
What is a Service? Describe ClusterIP, NodePort, and LoadBalancer services and when to use each.
Sample Answer Guidance: A Service is a stable DNS name and IP for a set of pods. ClusterIP (default) creates an internal IP accessible only within the cluster; used for internal communication. NodePort exposes the Service on the same port on every node (by default from the 30000-32767 range); used for bare-metal or testing. LoadBalancer provisions an external IP (cloud-specific); used for production ingress. Services use label selectors to match pods, and kube-proxy programs iptables/IPVS rules to route traffic to them.
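How a label selector turns into a set of backends can be sketched as follows. The data shapes (plain dicts with `ip`, `labels`, and `ready` keys) are assumptions for illustration; the real mechanism is the endpoints controller populating EndpointSlices.

```python
def matches(selector, labels):
    """True when every selector key/value pair is present in the pod labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def endpoints(selector, pods):
    """Return the IPs of ready pods whose labels match the Service selector."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return []
    return [p["ip"] for p in pods if p["ready"] and matches(selector, p["labels"])]


if __name__ == "__main__":
    pods = [
        {"ip": "10.0.0.1", "labels": {"app": "web"}, "ready": True},
        {"ip": "10.0.0.2", "labels": {"app": "web"}, "ready": False},
        {"ip": "10.0.0.3", "labels": {"app": "db"}, "ready": True},
    ]
    print(endpoints({"app": "web"}, pods))
```

The pod that is not ready is left out, which is how a failing readiness probe removes a pod from traffic without restarting it.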
How do you implement multi-tenancy and network isolation in Kubernetes? What tools would you use?
Sample Answer Guidance: Use Namespaces for logical isolation and NetworkPolicy for traffic control. NetworkPolicy is a firewall: it defines which pods can communicate with which (by selector and ports). Combine with admission controllers to prevent privileged containers, enforce namespace quotas, and restrict image registries. For stronger isolation, use service mesh (Istio, Linkerd) for mTLS and fine-grained traffic policies. Always default to deny-all and allow explicitly.
What is a PersistentVolume (PV) and PersistentVolumeClaim (PVC)? How do you provision storage dynamically?
Sample Answer Guidance: A PV is a cluster-level storage resource; a PVC is a namespaced request for storage that pods reference by name. A StorageClass defines how PVs are provisioned (e.g., AWS EBS, GCP Persistent Disks). When a PVC that references a StorageClass is created, the provisioner creates the underlying storage and a matching PV, and the PVC is bound to it. Prefer StorageClasses with dynamic provisioning over static PVs for scalability. Consider reclaim policy (Delete vs. Retain) and snapshot backups.
Explain RBAC in Kubernetes. What are Roles, RoleBindings, and how do you design permissions for multi-tenant clusters?
Sample Answer Guidance: RBAC controls who can do what. A Role is a set of permissions (e.g., 'get, list, watch pods'). A RoleBinding grants a Role to a user, group, or ServiceAccount. RoleBinding and Role are namespace-scoped; ClusterRole and ClusterRoleBinding are cluster-scoped. For multi-tenancy: create a namespace per tenant, grant each tenant's ServiceAccount a Role with permissions to only their namespace. Audit RBAC regularly with `kubectl auth can-i`.
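The deny-by-default nature of RBAC can be shown with a small evaluator. This is a sketch only: the `Rule` and `RoleBinding` classes are simplified stand-ins for the real API objects (the binding here carries the rules of the Role it references), and the subject string is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    verbs: frozenset
    resources: frozenset


@dataclass(frozen=True)
class RoleBinding:
    namespace: str
    subject: str
    rules: tuple  # rules of the Role this binding refers to


def can_i(bindings, subject, verb, resource, namespace):
    """Return True only if some binding in the namespace grants the action."""
    for binding in bindings:
        if binding.subject != subject or binding.namespace != namespace:
            continue
        for rule in binding.rules:
            if verb in rule.verbs and resource in rule.resources:
                return True
    return False  # no matching rule means no access


if __name__ == "__main__":
    read_pods = Rule(frozenset({"get", "list", "watch"}), frozenset({"pods"}))
    bindings = [RoleBinding("tenant-a", "sa:tenant-a:app", (read_pods,))]
    print(can_i(bindings, "sa:tenant-a:app", "list", "pods", "tenant-a"))
    print(can_i(bindings, "sa:tenant-a:app", "list", "pods", "tenant-b"))
```

The second call is denied because the binding is scoped to `tenant-a`, which is the property the namespace-per-tenant design relies on.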
What is a network policy, and how would you design one to allow only specific pod-to-pod traffic?
Sample Answer Guidance: A NetworkPolicy acts as a pod-level firewall. It uses label selectors to choose the pods it applies to and the peers those pods may talk to. Policies are allow-lists: there are no deny rules, but once a pod is selected by any policy, traffic in that direction is rejected unless some policy allows it. Example: apply a default deny-all policy (`kubectl apply -f deny-all.yaml`), then add a policy that allows traffic from 'frontend' pods to 'backend' pods on port 8080. NetworkPolicy is enforced only if the CNI plugin supports it (e.g., Calico, Cilium). List the policies in effect with `kubectl get networkpolicies`, verify behaviour by testing connectivity from a throwaway pod, and monitor denied traffic.
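The allow-list semantics are easy to get wrong, so a small model helps. This sketch covers ingress within one namespace only; the dict layout (`pod_selector`, `ingress`, `from`, `ports`) is a simplification of the real NetworkPolicy spec, and an empty selector is treated as "matches everything".

```python
def matches(selector, labels):
    """True when every selector key/value pair is present in the labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(policies, src_labels, dst_labels, port):
    """Decide whether traffic from a source pod may reach a destination pod."""
    selecting = [p for p in policies if matches(p["pod_selector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the destination, so it is not isolated
    for policy in selecting:
        for rule in policy["ingress"]:
            if matches(rule["from"], src_labels) and port in rule["ports"]:
                return True  # policies are additive: one allow is enough
    return False


if __name__ == "__main__":
    deny_all = {"pod_selector": {}, "ingress": []}
    allow_frontend = {
        "pod_selector": {"app": "backend"},
        "ingress": [{"from": {"app": "frontend"}, "ports": [8080]}],
    }
    policies = [deny_all, allow_frontend]
    print(ingress_allowed(policies, {"app": "frontend"}, {"app": "backend"}, 8080))
    print(ingress_allowed(policies, {"app": "batch"}, {"app": "backend"}, 8080))
```

Note that the deny-all entry contains no deny rule; it simply selects every pod and allows nothing, so anything not explicitly allowed by another policy is dropped.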
What is a ServiceAccount, and why is it important for pod security?
Sample Answer Guidance: A ServiceAccount is a Kubernetes identity for pods. Every pod runs as a ServiceAccount (the namespace's `default` one if none is specified), and its token is mounted into the pod's filesystem (as a short-lived projected token in current Kubernetes versions). Pods use this identity to authenticate to the API server and other services. Create a least-privilege ServiceAccount for each deployment and bind specific Roles to it. Disable automounting (`automountServiceAccountToken: false`) if a pod doesn't need API access. This is essential for adhering to the principle of least privilege.

What interviewers look for: Strong candidates explain CNI as a plugin system for pod networking, understand service types (ClusterIP, NodePort, LoadBalancer) and DNS, design network policies by principle of least privilege, and discuss persistent storage (PVC, StorageClass, provisioners). They mention real tools: Calico, Flannel, Cilium. Weak candidates conflate services with ingresses, cannot explain how traffic reaches a pod, or think network policies are optional. They may not understand PersistentVolume vs. PersistentVolumeClaim.

Operations, Troubleshooting & Scaling

A pod is stuck in Pending. Walk through your debugging steps.
Sample Answer Guidance: First, `kubectl describe pod <pod>` shows Events: usually a FailedScheduling event with the reason (e.g., insufficient CPU). The scheduler works from resource requests, not live usage, so check the 'Allocated resources' section of `kubectl describe node <node>`; `kubectl top nodes` only shows current usage. If capacity is available, check taints (`kubectl describe nodes`): the pod's tolerations may not match. Check node selectors and affinity rules. Verify the PVC is Bound if the pod uses storage. If the pod is still stuck, check the scheduler logs (e.g., `kubectl logs -n kube-system <scheduler-pod>` on clusters that run the scheduler as a pod) and increase their verbosity if needed.
A container keeps restarting (CrashLoopBackOff). How would you diagnose the issue?
Sample Answer Guidance: `kubectl logs <pod>` shows stdout/stderr of the current container instance; `kubectl logs <pod> --previous` shows the logs of the previous, crashed instance, which is usually where the application error is. `kubectl describe pod` shows the restart count, the last termination reason, and the exit code (e.g., exit code 1, or OOMKilled when the container exceeded its memory limit). If the container stays up long enough, run `kubectl exec -it <pod> -- sh` to inspect its runtime state. Review the image tag: is it correct? Check the liveness and startup probe definitions: a failing probe makes the kubelet restart the container, whereas a failing readiness probe only removes the pod from Service endpoints.
Explain kubelet and its role in the pod lifecycle. What does it log?
Sample Answer Guidance: The kubelet runs on every node and manages pod lifecycle: it watches the API server for pods bound to its node, pulls images, starts containers via the container runtime, monitors health, and reports status back to the API server. It only acts once the scheduler has bound a pod to its node; a pod that is still unscheduled never reaches the kubelet. Kubelet logs (on the node, usually via `journalctl -u kubelet`, or `/var/log/kubelet.log` on some distributions) show image pull errors, container runtime errors, and eviction decisions. A slow kubelet means pods take time to start.
What are node taints and tolerations? Give an example of when you'd use them.
Sample Answer Guidance: A taint marks a node as 'unsuitable' for certain pods (e.g., `node-role.kubernetes.io/control-plane:NoSchedule`, or `node-role.kubernetes.io/master:NoSchedule` on older clusters). A toleration allows a pod to be scheduled despite a matching taint. Example: a node with a GPU taint (`gpu=true:NoSchedule`) will reject normal pods but accept pods with a matching toleration. Use taints to dedicate nodes to specific workloads (GPUs, databases), keep user pods off control plane nodes, or support node maintenance alongside cordon and drain. Apply taints via `kubectl taint nodes`.
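The matching rules between taints and tolerations can be sketched as follows. This is a simplified model using plain dicts; it covers the `Equal` and `Exists` operators and treats only `NoSchedule` and `NoExecute` as blocking, ignoring details such as `tolerationSeconds`.

```python
def tolerates(toleration, taint):
    """True when a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        return not key or key == taint["key"]  # an empty key matches every key
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(taints, tolerations):
    """True when every blocking taint on the node is tolerated by the pod."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


if __name__ == "__main__":
    gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
    gpu_toleration = {"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
    print(schedulable([gpu_taint], []))
    print(schedulable([gpu_taint], [gpu_toleration]))
```

A toleration only permits scheduling onto the tainted node; it does not attract the pod there. Pinning a workload to GPU nodes still needs a node selector or node affinity.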
How would you upgrade a live Kubernetes cluster with multiple nodes and persistent workloads? What is a PDB?
Sample Answer Guidance: Use rolling node upgrades with PodDisruptionBudgets (PDB). A PDB ensures a minimum number of pod replicas are available during voluntary disruptions (e.g., node maintenance). Example: a PDB with minAvailable: 1 ensures at least one replica stays running. Upgrade the control plane before the worker nodes, and back up etcd before touching control plane nodes. For each worker: cordon the node (stop scheduling new pods), drain it (evict existing pods to other nodes, respecting PDBs), upgrade the kubelet and OS, and uncordon. Monitor cluster health throughout. Test in a staging cluster first.
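The arithmetic a drain has to respect is small enough to sketch. This assumes a PDB expressed as an absolute `minAvailable` count; the function names are illustrative and not part of any Kubernetes client.

```python
def disruptions_allowed(healthy, min_available):
    """How many pods may be evicted right now without violating the budget."""
    return max(0, healthy - min_available)


def can_evict(healthy, min_available):
    """True when evicting one more pod keeps at least min_available healthy."""
    return disruptions_allowed(healthy, min_available) >= 1


if __name__ == "__main__":
    print(can_evict(healthy=3, min_available=2))
    print(can_evict(healthy=1, min_available=1))
```

The second case is a common upgrade blocker: a single-replica workload with minAvailable: 1 has zero allowed disruptions, so a drain will wait until the budget or the replica count is changed.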
What is node pressure (disk pressure, memory pressure) and how does kubelet respond?
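Sample Answer Guidance: Node pressure means the node is running low on an incompressible resource: memory, disk, or process IDs (the MemoryPressure, DiskPressure, and PIDPressure node conditions). CPU is compressible, so CPU contention causes throttling rather than eviction. When the kubelet detects pressure, it evicts pods, ranking them by whether their usage exceeds their requests, then by pod priority, then by how far usage exceeds requests. In practice BestEffort pods and Burstable pods running above their requests go first, and Guaranteed pods last. Evicted pods that belong to a controller are replaced, and the scheduler places the replacements on other nodes. Disk pressure is common: the kubelet may evict pods to free space on the root filesystem. Monitor node conditions in the Conditions section of `kubectl describe node <node>`; `kubectl get nodes` only shows Ready/NotReady. Tune the kubelet's eviction thresholds to prevent eviction thrashing.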
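Because eviction order depends on QoS class, it helps to know how a class is derived from requests and limits. This is a sketch of the documented rules using plain dicts for container resources; it compares quantities as raw strings, so it assumes requests and limits are written in the same notation.

```python
def qos_class(containers):
    """Derive the QoS class from each container's requests and limits."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is not set defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{}]))
    print(qos_class([{"requests": {"cpu": "250m"}}]))
    print(qos_class([{"limits": {"cpu": "500m", "memory": "1Gi"}}]))
```

Setting requests equal to limits for both CPU and memory on every container is what moves a critical workload into the Guaranteed class.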
How would you scale a Kubernetes cluster to handle 10x more workloads? What are the bottlenecks?
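Sample Answer Guidance: Scaling requires scaling every layer. Worker nodes: add nodes via the cluster autoscaler. Control plane: run more API server replicas behind a load balancer; extra scheduler and controller-manager replicas add availability rather than throughput, because only the elected leader is active. etcd: keep an odd-sized cluster (typically 3 or 5 members), since adding members improves fault tolerance but not write throughput; give it fast disks, monitor disk I/O and latency, and avoid stretching a single etcd cluster across regions because it is latency-sensitive. Monitor API server response times, scheduler latency, and kubelet CPU. Use resource quotas and limits to prevent noisy neighbours. Consider multiple clusters or node pools for isolation. Test at scale in a staging environment; 1000+ nodes require different tuning than 10 nodes.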

What interviewers look for: Strong candidates debug methodically using kubectl logs, describe, events, and top; understand kubelet, controller-manager, and API server logs. They explain causes of common issues (pending pods, crashloops, OOMKilled, node pressure) and know how to scale etcd, API server, and worker nodes. They discuss upgrade strategies, backup/restore, and monitoring. Weak candidates jump to 'restart the pod' without investigating. They don't know kubectl debugging tools, cannot read logs, or don't understand node pressure, taints, or resource quotas.

Common Mistakes in Kubernetes Engineer Interviews

Confusing services with ingresses, or thinking a Service is just a 'load balancer'.

Services provide stable DNS and IP for pod groups; Ingress is a gateway for HTTP(S) traffic. A Service uses kube-proxy and iptables; Ingress requires an Ingress controller. Misunderstanding this shows shallow networking knowledge. How to fix: Learn the three-layer model: pods (ephemeral IPs), services (stable DNS and VIP routing traffic via iptables/IPVS), and ingress (HTTP load balancer). Explain the difference with a worked example.

Saying 'just restart the pod' without investigating root cause, or not checking logs and events.

Kubernetes is about reconciliation, not manual restarts. A pod in CrashLoopBackOff will keep crashing unless the root cause (bug, bad config, missing secret) is fixed. Restarting wastes time and looks unprofessional. How to fix: Always start with kubectl describe, logs, and events. Methodically rule out configuration, resources, taints, and affinity issues. Show your debugging process.

Not understanding that 'ready' and 'running' are different pod states, or not explaining readiness/liveness probes.

A pod can be Running but not Ready (e.g., its readiness probe is failing). Readiness probes gate traffic; liveness probes restart unhealthy containers. Confusing these shows you've never debugged a crashing app in Kubernetes. How to fix: Explain that phase and readiness are separate: the phase moves Pending → Running, while Ready is a condition that can flip back and forth while the pod stays Running. Discuss startup, readiness, and liveness probes, and when to use each. Know that readiness failures don't restart the pod.

Assuming all network issues are 'CNI problems' without checking kube-proxy, iptables, or firewalls.

Network debugging in Kubernetes is layered: CNI (pod-to-pod routing), kube-proxy (service load balancing), iptables/IPVS (netfilter rules), and OS firewalls. Jumping to 'replace the CNI' is amateur. How to fix: Learn the layers. Use tcpdump, netstat, and iptables to trace packets. Test with busybox pods. Understand that kube-proxy is often the culprit, not CNI.

How We Evaluate Kubernetes Engineer Answers

Explains the control plane as a system: API server, etcd, scheduler, controller manager, kubelet and how they interact

Understands pod lifecycle and reconciliation: why Kubernetes uses controllers and watch semantics, not polling

Can debug methodically using kubectl tools: describe, logs, events, exec; knows where to find kubelet and control plane logs

Understands networking: service routing, CNI plugins, network policies, DNS, and can explain packet flow

Explains RBAC and service accounts; applies principle of least privilege

Can design multi-tenant clusters with namespace isolation and network policies

Understands persistent storage: PV, PVC, StorageClass, dynamic provisioning, and reclaim policies

Knows node concepts: taints, tolerations, cordoning, draining, QoS classes, kubelet eviction

Can plan and execute cluster upgrades safely with PodDisruptionBudgets

Identifies scaling bottlenecks: etcd latency, API server throughput, kubelet CPU, network bandwidth

Discusses monitoring and observability: metrics for control plane, nodes, and workloads

Shows incident response experience: root cause analysis, graceful degradation, on-call mindset

Kubernetes Engineer FAQ

What is the difference between a Kubernetes Engineer and a DevOps Engineer?

A DevOps Engineer treats Kubernetes as one tool among many (cloud, CI/CD, monitoring, databases). A Kubernetes Engineer specialises in Kubernetes itself: cluster operations, networking, storage, RBAC, Helm, operators. A DevOps Engineer asks 'how do I deploy my app?'; a Kubernetes Engineer asks 'how does the cluster work?'. Kubernetes Engineers own the cluster; DevOps owns the deployment pipeline.

What is the difference between a Kubernetes Engineer and a Platform Engineer?

A Kubernetes Engineer owns the Kubernetes cluster layer: node management, networking, storage, security, upgrades. A Platform Engineer builds developer platforms on top of Kubernetes: self-service APIs, CI/CD integration, observability, templating. Platform Engineers may use Helm, operators, or Kustomize to manage deployments. Kubernetes Engineers ensure the cluster is fast, secure, and reliable.

What skills should a Kubernetes Engineer have?

Deep Kubernetes knowledge (control plane, scheduler, kubelet, networking, storage, RBAC), Linux systems (cgroups, namespaces, networking), container runtimes (containerd, CRI-O), scripting (Bash, Python), monitoring (Prometheus, logs), and incident response experience. Knowledge of specific tools (Helm, Operators, service meshes, CNI plugins) is a plus. Most importantly: a debugging mindset and production scars.

Is certification (CKA or CKAD) worth it for a Kubernetes Engineer role?

CKA (Certified Kubernetes Administrator) is the most relevant: it tests cluster operations, troubleshooting, and hands-on kubectl skills. CKAD is more application-focused. Certifications show you can use kubectl and know cluster basics, but they don't replace real production experience. Many hiring managers value a portfolio (open-source contributions, talks, incident post-mortems) over certs. Consider certification if you're new to Kubernetes.

What are the most common Kubernetes failures you should know about?

Pending pods (scheduler, resources, taints, affinity), CrashLoopBackOff (bad code, missing secrets), OOMKilled (memory limits too low), node NotReady (kubelet crash, network issue), API server overload (slow etcd), and storage issues (PVC stuck in Pending). Master these debugging scenarios and you'll handle 80% of production incidents. Also know how to recover from etcd corruption and manage node maintenance.

How do I transition from DevOps to Kubernetes Engineer?

Start by understanding the control plane deeply: work through Kelsey Hightower's Kubernetes the Hard Way, understand etcd and Raft consensus, and trace a Deployment to a running pod. Practice debugging with kubectl and reading logs. Run a cluster in a homelab or cloud (free tier) and break things intentionally. Read the Kubernetes source code for critical components (scheduler, kubelet). Join Kubernetes Slack communities and help others debug.

What is a Kubernetes Operator, and when would you write one?

An Operator is a controller that extends Kubernetes to manage stateful applications. It uses Custom Resource Definitions (CRDs) to define resources (e.g., Database, Cache) and reconciles them to the desired state. Write an Operator when kubectl and Helm can't express your application's lifecycle (e.g., database backups, rolling updates, failover). Operators are complex; prefer Helm first unless you need application-aware orchestration.

How do you handle secrets securely in Kubernetes?

Secrets in etcd are base64-encoded, not encrypted by default (dangerous!). Enable encryption at rest in the API server (EncryptionConfiguration). Use external secret systems such as Vault or AWS Secrets Manager. Use RBAC to restrict who can read secrets. Use short-lived credentials (bound ServiceAccount tokens, assumed cloud IAM roles). Audit who accesses secrets. Never commit plaintext secrets to git; use tools like git-crypt or sealed secrets for safe versioning.
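The point that base64 is an encoding rather than encryption takes only a few lines to demonstrate. The secret value below is made up for illustration.

```python
import base64


def encode_secret(value):
    """Encode a string the way Secret data values are stored: plain base64."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_secret(encoded):
    """Reverse the encoding. No key is required."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    stored = encode_secret("s3cr3t-password")
    print(stored)
    print(decode_secret(stored))
```

Anyone who can read the Secret object, or the etcd data behind it, can recover the value this way, which is why encryption at rest and tight RBAC on secrets both matter.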

Kubernetes Engineer Interview Questions & Answers

Kubernetes Engineer Interview Process

Screening (20 mins)

Technical Deep Dive (45 mins)

Troubleshooting Scenario (15 mins)

Operational Readiness (10 mins)
