A Terraform Engineer owns the Infrastructure as Code (IaC) platform and standards. Unlike Cloud Engineers, who use Terraform as one tool among many, Terraform Engineers specialise in HCL, modules, state management, policy as code, and CI/CD pipelines for infrastructure. The role differs from DevOps Engineers (who focus on broader deployment and CI/CD) and Platform Engineers (who build self-service platforms on IaC foundations). These 40+ questions test whether candidates can architect, maintain, and secure Terraform at scale.

This guide covers behavioural questions (team collaboration, incident management), core technical areas (HCL and modules, multi-environment deployment, policy and testing), and real-world scenarios. Use these to assess candidates' understanding of state locking, remote backends, workspaces, providers, Terragrunt, Sentinel policy as code, and infrastructure testing.

Behavioural Questions: Team, Communication & Incident Management

Team Collaboration & Knowledge Sharing

Describe a time you had to review a Terraform pull request from a junior engineer. How did you provide feedback without blocking their learning?
Tell me about a situation where your team didn't follow IaC standards and resources were created manually. How did you address it?
Walk me through how you documented a complex Terraform module so that other engineers could use and maintain it without asking you for help.

Incident Management & Troubleshooting

Describe a time your Terraform apply failed mid-deployment and left infrastructure in an inconsistent state. How did you diagnose and recover?
Tell me about an incident where a state file became corrupted or lost. What happened and how did you prevent it from happening again?
Walk me through a situation where you had to roll back a Terraform change in production. How did you do it safely?

Architectural Decision-Making & Trade-offs

Tell me about a time you chose between using one large Terraform module vs. multiple smaller, composed modules. What trade-offs did you consider?
Describe a situation where you had to decide between remote state and local state, or between S3 and Terraform Cloud. How did you evaluate options?
Walk me through a conversation where you convinced your team to invest time in a centralised Terraform module library. What was the business case?

HCL, Modules & State Management

What's the difference between 'terraform plan', 'terraform apply', and 'terraform refresh'? Why would you run refresh before plan in a team environment?
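Sample Answer Guidance: Plan shows intended changes without modifying anything. Apply executes changes. Refresh updates state to match real infrastructure without changing that infrastructure. Plan and apply already refresh by default, so a separate refresh step is rarely needed; the standalone terraform refresh command is deprecated in favour of terraform apply -refresh-only, which shows detected drift for review before writing it to state. In teams, a refresh-only run is useful for spotting out-of-band changes before planning new work.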
Explain Terraform state. Why is it critical, what problems arise from local state, and how does remote state solve those problems?
Sample Answer Guidance: State is Terraform's record of the real infrastructure it manages, mapping each resource in configuration to an actual object. Local state is unsafe for teams: it lives on one machine, has no shared locking, and leaks secrets if committed to version control. Remote state (S3, Terraform Cloud, Consul) adds locking, versioning, and access control. For example, an S3 backend with DynamoDB locking prevents concurrent applies from corrupting state.
How would you structure a reusable module in Terraform? What goes in variables.tf, main.tf, outputs.tf, and why?
Sample Answer Guidance: variables.tf declares module inputs with descriptions and defaults. main.tf contains resource definitions. outputs.tf exposes values for downstream modules. This separation makes modules composable and testable. Each file has a clear responsibility, aiding readability and maintenance.
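As a rough illustration of that layout, a small module might be split as below. The module (a hypothetical modules/s3_bucket), its variable names, and its output are assumptions made for the example; the three files are shown in one block, separated by comments.

```hcl
# variables.tf: module inputs, each with a type and description
variable "bucket_name" {
  type        = string
  description = "Globally unique name for the bucket."
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the bucket."
  default     = {}
}

# main.tf: the resources the module manages
resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
  tags   = var.tags
}

# outputs.tf: values exposed to callers and downstream modules
output "bucket_arn" {
  description = "ARN of the bucket, for use in IAM policies."
  value       = aws_s3_bucket.this.arn
}
```

A caller would reference the output as module.<name>.bucket_arn, which is what makes the module composable.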
What is state locking, and why is it essential? How do you implement it with S3 and DynamoDB?
Sample Answer Guidance: State locking prevents two operations from writing the same state at once, which can corrupt it. With the S3 backend, a DynamoDB table holds the lock record. Terraform acquires the lock before any operation that writes state; if a second run starts, it fails with a lock error unless -lock-timeout tells it to wait and retry. This ensures only one run modifies a given state at a time.
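A minimal sketch of an S3 backend with DynamoDB locking is shown below. The bucket, key, region, and table names are placeholders, and both the bucket and the table are assumed to exist already.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"          # hypothetical bucket name
    key            = "network/terraform.tfstate" # path of this state within the bucket
    region         = "eu-west-1"
    encrypt        = true                        # server-side encryption of the state object
    dynamodb_table = "example-tf-locks"          # table needs a string partition key named "LockID"
  }
}
```

Recent Terraform releases also offer S3-native locking through the backend's use_lockfile setting, so it is worth checking the backend documentation for the version in use before standardising on DynamoDB.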
Describe the terraform import command. When would you use it, and what are its limitations?
Sample Answer Guidance: terraform import brings existing infrastructure under Terraform management without recreating it. Use it when resources exist but aren't in code yet. Limitations: the CLI command only updates state, not the configuration, so you still write the matching resource block yourself (from Terraform 1.5, import blocks with terraform plan -generate-config-out can draft it). It imports one resource address at a time, so related objects such as individual security group rules must each be imported separately.
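For illustration, the declarative form of an import (Terraform 1.5 and later) might look like the sketch below. The resource address and bucket name are hypothetical.

```hcl
# Declarative import: the next apply records the existing bucket in state.
import {
  to = aws_s3_bucket.logs
  id = "example-logs-bucket" # for aws_s3_bucket, the import ID is the bucket name
}

# The matching resource block must exist (or be generated) for the import to succeed.
resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket"
}
```

The equivalent one-off CLI call would be terraform import aws_s3_bucket.logs example-logs-bucket. The block form has the advantage that the import shows up in plan output and goes through code review.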
How do you handle sensitive data (API keys, passwords, database credentials) in Terraform? What's the best practice?
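Sample Answer Guidance: Use the 'sensitive' flag on variables and outputs to keep values out of plan and terminal output. It does not encrypt anything: the values are still written to state in plain text, so restrict access to state and encrypt the backend. Store secrets in HashiCorp Vault, AWS Secrets Manager, or Terraform Cloud's variable store. Never commit secrets to version control; keep state files and any .tfvars files containing secrets out of Git via .gitignore.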
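A small sketch of the sensitive flag on an input and an output follows. The variable name, host, and connection string format are assumptions for the example.

```hcl
variable "db_password" {
  type        = string
  description = "Database password; supply via TF_VAR_db_password or a secrets manager, never a committed file."
  sensitive   = true # redacted as (sensitive value) in plan and apply output
}

output "db_connection_string" {
  description = "Connection string for the application."
  value       = "postgres://app:${var.db_password}@db.example.internal:5432/app"
  sensitive   = true # required here, because the value is derived from a sensitive variable
}
```

Both values still end up in the state file unredacted, which is why the backend's encryption and access controls matter as much as the flag itself.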
What's the difference between a 'count' and 'for_each' in Terraform? When would you choose one over the other?
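Sample Answer Guidance: count addresses instances by numeric index; for_each addresses them by map key or set element, which is more stable. Use for_each when instances have a natural identity: removing one element destroys only that instance, whereas removing an element from the middle of a count list shifts every later index and forces those resources to be changed or recreated. count is simpler for fixed repetition or for toggling a single resource on and off. for_each is safer for dynamic collections.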
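The addressing difference can be sketched with the same list driving both forms. The bucket names and prefixes below are placeholders.

```hcl
variable "bucket_names" {
  type    = list(string)
  default = ["alpha", "beta", "gamma"]
}

# count: addresses are positional, e.g. aws_s3_bucket.by_index[1].
# Removing "alpha" shifts "beta" to index 0, so every later instance is planned for change.
resource "aws_s3_bucket" "by_index" {
  count  = length(var.bucket_names)
  bucket = "example-count-${var.bucket_names[count.index]}"
}

# for_each: addresses are keyed, e.g. aws_s3_bucket.by_key["beta"].
# Removing "alpha" destroys only aws_s3_bucket.by_key["alpha"].
resource "aws_s3_bucket" "by_key" {
  for_each = toset(var.bucket_names)
  bucket   = "example-each-${each.key}"
}
```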

What interviewers look for: Strong answers show deep understanding of state as the source of truth, mention team safety (locking, remote backends), and can explain why local state is dangerous. Weak answers treat state as a nice-to-have or don't understand concurrency risks. Excellent candidates discuss state migrations, recovery from corruption, and secrets management.

Providers, Workspaces & Multi-Environment Deployment

What is a Terraform provider? Name three providers you've used in depth and describe a scenario where you'd chain them together.
Sample Answer Guidance: A provider is a plugin that manages resources on a platform (AWS, Azure, GCP, Kubernetes). I've used AWS (compute, networking), Kubernetes (cluster resources), and Helm (Kubernetes packages). Chain them: provision EKS with AWS provider, then deploy ingress controllers via Kubernetes provider and Helm charts via Helm provider.
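One way the AWS-to-Kubernetes hand-off could look is sketched below. It assumes an EKS cluster with the hypothetical name example-cluster already exists; the namespace is likewise illustrative.

```hcl
provider "aws" {
  region = "eu-west-1"
}

# Look up the existing cluster and a short-lived auth token through the AWS provider.
data "aws_eks_cluster" "this" {
  name = "example-cluster"
}

data "aws_eks_cluster_auth" "this" {
  name = data.aws_eks_cluster.this.name
}

# The Kubernetes provider is configured from values the AWS provider returned.
provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}

resource "kubernetes_namespace" "ingress" {
  metadata {
    name = "ingress"
  }
}
```

Configuring a provider from resources created in the same apply can be fragile, so many teams keep the cluster and its workloads in separate configurations and connect them with data sources like these.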
Explain Terraform workspaces. How would you use them to manage dev, staging, and prod? What are their limitations?
Sample Answer Guidance: Workspaces are separate state files that share one configuration and one backend. Use terraform workspace select prod to switch environments. Limitation: they're fragile for team use, because it is easy to run a command in the wrong workspace by accident. Better approach: separate Terraform code directories per environment, or shared modules with a different .tfvars file per environment.
Design a multi-environment Terraform structure (dev, staging, prod). How would you avoid code duplication and ensure consistency?
Sample Answer Guidance: Use a shared module library for common resources (VPC, RDS, security groups). Create env-specific directories (environments/dev, environments/prod), each with a main.tf that calls the modules and with locals or .tfvars files that set per-environment values. This centralises logic while allowing per-environment customisation without repeating resource blocks.
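A sketch of one environment directory under that structure follows. The directory layout, module path, and the module's input names (environment, cidr_block) are assumptions for the example.

```hcl
# environments/dev/main.tf
module "network" {
  source = "../../modules/network" # shared module, identical for every environment

  environment = "dev"
  cidr_block  = var.cidr_block
}

# environments/dev/variables.tf
variable "cidr_block" {
  type        = string
  description = "CIDR range for this environment's VPC."
}

# environments/dev/terraform.tfvars would then contain only the values that differ:
# cidr_block = "10.10.0.0/16"
```

environments/prod would hold the same three files with its own values and its own backend key, so each environment gets a separate state file.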
How do you manage Terraform versions across a team? Why is version consistency important?
Sample Answer Guidance: Use required_version in the terraform block to constrain which Terraform CLI versions may run the configuration. Commit .terraform.lock.hcl to version control to lock provider versions. Run a single pinned Terraform version in CI/CD. Different versions can behave differently at plan and apply, and newer versions can write state that older ones cannot read, risking failed runs or unexpected resource changes.
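A minimal version-pinning block might look like this. The specific version numbers are illustrative, not recommendations.

```hcl
terraform {
  required_version = "~> 1.9.0" # allows 1.9.x patch releases only

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release; .terraform.lock.hcl records the exact one selected
    }
  }
}
```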
Explain the terraform backend configuration. What are the trade-offs between S3, Terraform Cloud, and Consul?
Sample Answer Guidance: S3 is cheap but requires manual DynamoDB setup for locking and lacks team features. Terraform Cloud adds locking, remote runs, cost estimation, and policy as code (Sentinel), but costs extra and introduces vendor lock-in. Consul is self-hosted and offers state storage, locking, and HA. Choose based on team size, compliance, and budget.
How would you implement a 'promote' workflow where infrastructure changes move from dev → staging → prod safely?
Sample Answer Guidance: Use separate state files per environment and a promotion pipeline: approve merge to dev branch (auto-apply), merge to staging branch (manual approval + plan review, auto-apply), merge to prod (security review + manual apply). Implement via CI/CD with branch protection rules and require approval for prod changes.
What's a provider plugin cache, and how would you use it in CI/CD to speed up Terraform init?
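Sample Answer Guidance: A plugin cache stores downloaded provider binaries in a local directory (set via TF_PLUGIN_CACHE_DIR or plugin_cache_dir in the CLI configuration file). In CI/CD, persist this directory across builds so terraform init reuses providers instead of re-downloading them, which cuts init time and bandwidth. The cache directory is not guaranteed to be safe for concurrent terraform init runs, so give parallel jobs separate caches or serialise init rather than sharing one directory across agents.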
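The cache can also be enabled through the CLI configuration file, which uses HCL syntax. The path below is the conventional location and is only a suggestion; the directory must already exist, because Terraform does not create it.

```hcl
# ~/.terraformrc (terraform.rc on Windows)
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
```

In a CI job, setting the TF_PLUGIN_CACHE_DIR environment variable to a directory that the CI system persists between builds has the same effect.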

What interviewers look for: Strong answers distinguish workspaces (fragile for teams) from code/variable-based multi-environment strategies (robust). Candidates should discuss provider versioning, backend trade-offs, and locking mechanisms. Weak answers propose workspaces as the primary multi-environment solution—a red flag. Excellent candidates explain promote workflows and CI/CD safety gates.

CI/CD for Infrastructure, Testing & Policy as Code

Design a CI/CD pipeline for Terraform. What steps would you include between pushing code and apply?
Sample Answer Guidance: Pipeline: terraform init → fmt/lint (tflint) → validate → plan (saved to artifact) → human approval → apply of the saved plan. Earlier steps catch formatting, syntax, and lint errors before plan, and policy checks can then run against the plan output. Plan review ensures safety, and applying the saved plan file guarantees that what was reviewed is what runs. Never auto-apply to prod. Use branch protection (require PR reviews) before merge to main.
What is Terratest? How would you write a test for a Terraform module that provisions an RDS database?
Sample Answer Guidance: Terratest is a Go library for testing infrastructure. Test an RDS module by deploying it with terraform.InitAndApply(), querying AWS to verify the DB exists with the correct settings (encrypted, multi-AZ), and cleaning up with terraform.Destroy(), usually deferred so it runs even when an assertion fails. This verifies the module produces the expected infrastructure without manual testing.
Explain policy as code in Terraform. Compare Sentinel and OPA. When would you enforce that all resources must have cost tags?
Sample Answer Guidance: Policy as code (Sentinel, OPA) blocks non-compliant Terraform plans. Sentinel is Terraform Cloud native; OPA is open-source and agnostic. Enforce cost tags via policy rule: every resource must have 'cost_centre' tag. This prevents untagged resources in prod, ensuring cost allocation and governance.
How would you implement cost estimation in a Terraform CI/CD pipeline? Why is it important?
Sample Answer Guidance: Terraform Cloud includes cost estimation: it shows monthly cost delta for each plan. Alternatively, use Infracost (open-source) in CI to comment cost changes on PRs. This prevents surprise expenses—engineers see 'RDS instance added: +$300/month' before approving, reducing billing incidents.
Describe your approach to Terraform linting and formatting. What tools would you use and why?
Sample Answer Guidance: Use terraform fmt to standardise code format (run terraform fmt -check in CI to fail on unformatted code). Use tflint for linting (unused declarations, deprecated syntax, and provider-specific checks such as invalid instance types via its rulesets). Use checkov for security scans (overly permissive security groups, unencrypted resources). Run all three in CI before plan to catch issues early and enforce consistency.
How would you safely enable destroy operations in CI/CD without risking accidental resource deletion?
Sample Answer Guidance: Never auto-approve destroy. Require manual step: terraform destroy only after explicit approval from senior engineer or team lead. Log all destroy operations. Use -target to destroy specific resources if needed, not blanket destroy. Protect critical resources via lifecycle { prevent_destroy = true } to block accidental deletion.
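As a sketch of the lifecycle guard on a critical resource, assuming a small PostgreSQL instance whose names and sizes are placeholders:

```hcl
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "primary" {
  identifier          = "example-primary"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  password            = var.db_password
  deletion_protection = true # provider-side guard, enforced by the RDS API

  lifecycle {
    prevent_destroy = true # Terraform-side guard: any plan that would destroy this resource errors
  }
}
```

prevent_destroy only applies while the resource block is still in the configuration; deleting the block removes the guard along with it, which is why the approval gate in CI remains necessary.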

What interviewers look for: Strong answers include full CI/CD workflows with multiple safety gates (lint, validate, plan review, approval). They understand testing (Terratest), cost awareness (Infracost, Terraform Cloud), and policy enforcement (Sentinel, OPA). Weak answers skip testing or treat CI/CD as just running terraform apply. Excellent candidates discuss cost estimation, security scanning (checkov), and disaster recovery.

Common Mistakes

Treating Terraform state as disposable or ignorable

Local state without shared locking allows concurrent applies that corrupt state and can cause outages. Teams lose sync between code and reality. How to fix: Always use remote state with locking (S3+DynamoDB, Terraform Cloud, Consul). Never edit state by hand; use the terraform state subcommands when it must change. Document state recovery procedures and test them.

Using workspaces as the primary multi-environment strategy

It is easy to apply to the wrong workspace by accident and affect prod. Per-environment changes are hard to code review and version control, which makes workspaces a poor fit for teams. How to fix: Use separate Terraform code directories or variable files per environment. Workspaces are useful only for temporary, isolated testing. Enforce environment separation via CI/CD branch protection.

Writing monolithic modules that do too much

Modules become rigid, hard to reuse, and difficult to test. Small changes force large rebuilds. Difficult to apply different settings per environment. How to fix: Follow single responsibility: one module = one logical component (e.g., VPC module, RDS module, security group module). Compose modules for complex resources. Add variables for common customisations.

Skipping testing and relying on manual terraform apply before production

Infrastructure breaks in prod due to undetected issues. Policy violations (unencrypted resources, missing tags) slip through. Cost surprises appear post-deployment. How to fix: Implement Terratest for module verification, tflint for linting, checkov for security, and Infracost for cost estimation. Run static checks in CI before plan, and cost and policy checks against the plan output. Require peer review of plans.

Evaluation Criteria

Deep understanding of Terraform state: how it works, why remote state with locking is critical for teams, and how to recover from corruption

Module design and composition: ability to write reusable, testable modules and explain trade-offs between monolithic and modular approaches

Multi-environment strategy: how to safely deploy across dev, staging, prod without duplicating code, using separate state files or workspaces appropriately

CI/CD pipeline design: ability to architect a safe Terraform pipeline with linting, validation, planning, approval gates, and policy enforcement

Provider knowledge: experience with at least 2–3 major providers (AWS, Azure, GCP, Kubernetes, Helm) and ability to chain them in real scenarios

Policy as code (Sentinel, OPA): understanding of how to enforce governance (cost tags, security rules) without blocking legitimate deploys

Cost awareness: knowledge of cost estimation tools (Terraform Cloud, Infracost) and how to prevent surprise expenses

Incident response: ability to handle state corruption, rollbacks, and concurrent apply failures with clear recovery procedures

Security best practices: how to manage secrets, avoid hard-coded credentials, use encrypted backends, and enforce secure defaults

Testing and validation: experience with Terratest, tflint, checkov, and ability to explain how to test infrastructure code effectively

Terraform Engineer FAQ

What's the best way to manage Terraform secrets in a CI/CD pipeline?

Never commit secrets to version control. Use environment variables injected via CI/CD secrets manager (GitHub Secrets, GitLab CI/CD Variables). Store sensitive values in HashiCorp Vault, AWS Secrets Manager, or Terraform Cloud variable store. Mark variables as 'sensitive' in Terraform to prevent logging. Encrypt backend state at rest.

How do you handle a situation where a Terraform state file becomes corrupted?

First, pull a recent backup of the state file (if available from S3 versioning or Terraform Cloud snapshots). If no backup exists, use terraform import to re-import critical resources into a fresh state. Validate each resource was imported correctly before discarding the corrupted state. Consider implementing automated backups and regular state validation checks.

What's the difference between Terraform Cloud and Terraform Enterprise?

Terraform Cloud (now branded HCP Terraform) is HashiCorp's SaaS platform, offering remote state, locking, policy as code (Sentinel), cost estimation, and team management. Terraform Enterprise is the self-hosted distribution of the same platform, aimed at organisations that need on-premises or private-network deployment and full control over data and audit logs. Choose Cloud for quick setup; Enterprise for compliance-heavy environments requiring full control.

How would you implement role-based access control (RBAC) for Terraform in a multi-team environment?

Use Terraform Cloud/Enterprise teams and workspace-level permissions: grant each team 'read', 'plan', 'write', or 'admin' access per workspace. Separate state files per team or project. Authenticate pipelines with short-lived OIDC tokens rather than hardcoded or long-lived credentials. Use IAM policies to limit cloud provider access. Audit all Terraform operations via CloudTrail or similar.

What's Terragrunt, and when would you use it instead of plain Terraform?

Terragrunt is a thin wrapper around Terraform that reduces code duplication in multi-environment setups. It automates terraform init, adds dependency management between modules, and manages remote state configurations. Use it when managing many similar environments with minor customisations. Without Terragrunt, you'd repeat the same backend and variable configuration across directories.
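A rough sketch of how Terragrunt removes that repetition follows. The file layout, bucket and table names, and module inputs are assumptions for the example; two files are shown in one block, separated by comments.

```hcl
# root.hcl at the repository root: backend settings defined once
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "example-tf-state"
    key            = "${path_relative_to_include()}/terraform.tfstate" # one state file per directory
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "example-tf-locks"
  }
}

# environments/dev/network/terragrunt.hcl: inherits the backend, sets only what differs
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "../../../modules/network"
}

inputs = {
  environment = "dev"
  cidr_block  = "10.10.0.0/16"
}
```

Each additional environment or component then needs only its own short terragrunt.hcl, while the backend configuration stays in one place.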

How do you prevent accidental terraform destroy of critical resources in production?

Use lifecycle { prevent_destroy = true } on critical resources (databases, load balancers); note that it only protects a resource while its block remains in the configuration. Require manual approval gates in CI/CD before any destroy operation. Use separate IAM roles for prod that deny delete actions unless explicitly approved. Log all destroy operations. Consider enabling versioning and S3 Object Lock (aws_s3_bucket_object_lock_configuration) on state buckets to prevent accidental state deletion.

What's the relationship between Terraform and Helm, and when would you use both together?

Terraform (AWS provider) provisions EKS clusters; Helm provider deploys Kubernetes packages (Prometheus, Nginx Ingress) on that cluster. Use both when building complete Kubernetes platforms: Terraform manages infrastructure (nodes, networking); Helm manages application deployments. Separate concerns: Terraform engineers manage cluster, platform engineers manage app deployments.

How would you set up a disaster recovery process for Terraform-managed infrastructure?

Enable S3 versioning and cross-region replication for state files. Maintain automated backups of state (e.g., daily snapshots). Document and regularly test terraform import procedures to rebuild state if lost. Keep infrastructure code in Git with full history. Run terraform plan regularly (for example, daily) to detect drift between state and real infrastructure. Review manual changes with a refresh-only plan, then either codify them in configuration or revert them.
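The versioning part of that setup could be sketched as below, assuming AWS provider v4 or later and a placeholder bucket name. Cross-region replication is omitted to keep the example short.

```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-tf-state" # hypothetical name

  lifecycle {
    prevent_destroy = true # the state bucket itself should never be destroyed by a plan
  }
}

# Keep every prior version of the state object so an earlier one can be restored.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

The state bucket is usually managed from a small bootstrap configuration of its own, since it must exist before any backend can point at it.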
