Observability Engineer Interview Questions & Answers
Observability Engineers build the monitoring and observability platforms that enable SREs, DevOps engineers, and developers to understand system behaviour at scale. Unlike Site Reliability Engineers who own reliability outcomes, Observability Engineers own the tooling—metrics, logs, traces, dashboards, alerting, and SLO frameworks. This role differs fundamentally from DevOps Engineers (who focus on CI/CD and deployment) and Platform Engineers (who build broader developer platforms). If you're interviewing for an observability role, you'll need to demonstrate expertise in instrumentation standards, metrics design, log aggregation, distributed tracing, and alert fatigue reduction. See our guides on Site Reliability Engineer, DevOps Engineer, and Platform Engineer roles for comparison.
Our interview questions focus on technical depth in observability practices, platform architecture decisions, and real-world scenarios you'll encounter building observability systems for hundreds or thousands of microservices. These questions are used by leading tech companies and will help you prepare for technical rounds with senior engineers and architects.
Most observability engineering interviews follow this structure:
These questions test your understanding of observability signals, cardinality management, and instrumentation patterns.
These questions explore alert design, SLO-driven alerting, error budgets, and on-call practices.
These questions examine platform design, scale, cost optimisation, and real-world tooling decisions.
Get real-time feedback on your answers to observability engineer questions. Record yourself on camera with timed responses, receive AI-powered evaluation based on clarity, depth, and tool knowledge.
Monitoring tells you *that* something is wrong; observability lets you ask *why*. Monitoring is reactive (alerts on known thresholds); observability is proactive (exploring unknown unknowns via logs, traces, and metrics). Good observability enables you to ask arbitrary questions about system state without pre-defining dashboards. If you say 'observability = dashboards', you'll lose points. Emphasise the difference and explain the three pillars: metrics, logs, traces.
High cardinality labels and excessive sampling defeat observability platforms at scale. Candidates who say 'instrument everything' without discussing cardinality management, sampling strategies, or cost implications miss the core challenge. Real-world platforms must balance visibility against storage cost and query performance. Mention cardinality budgets, sampling percentages, relabelling rules, and retention policies. Show you understand scaling constraints.
Many candidates default to threshold-based alerting ('alert if CPU > 80%') without understanding SLOs or error budgets. Modern observability is SLO-driven: define what 'good' means (99.9% availability), measure the error budget, and alert when you're burning it too fast. Mention multi-burn-rate alerting (Google's approach), why it's better than static thresholds, and how it reduces alert fatigue. This differentiates senior from junior candidates.
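The multi-burn-rate idea above can be sketched in a few lines. This is an illustrative model (function names, the 14.4 threshold for a 30-day budget, and the window pairing are the commonly cited conventions from Google's SRE material, not a specific product's API):

```python
# Hedged sketch of multi-window, multi-burn-rate alerting.
# Window sizes and the paging threshold are illustrative assumptions.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget burns relative to its sustainable rate.

    A burn rate of 1.0 exhausts the budget exactly at the end of the SLO
    window; 14.4 exhausts a 30-day budget in roughly two days.
    """
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(short_window_ratio: float, long_window_ratio: float,
                slo_target: float = 0.999) -> bool:
    # Page only if BOTH a fast window (e.g. 5m) and a slow window (e.g. 1h)
    # burn hot - this filters brief spikes and cuts alert fatigue.
    threshold = 14.4  # burns a 30-day budget in ~2 days
    return (burn_rate(short_window_ratio, slo_target) >= threshold and
            burn_rate(long_window_ratio, slo_target) >= threshold)

print(should_page(0.02, 0.02))    # sustained 2% errors vs 0.1% budget -> pages
print(should_page(0.02, 0.0005))  # spike only in the short window -> stays quiet
```

Contrast this with a static 'CPU > 80%' alert: the burn-rate version only pages when users are actually losing budget, at a rate that threatens the SLO.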
Observability is expensive at scale. Candidates who design metrics platforms without discussing storage, compression, retention tiers, or sampling lose credibility. Mention realistic numbers: 1M metric samples/sec at typical compression is ~100GB/day, and spans are far heavier per event (typically hundreds of bytes each), so even sampled trace pipelines can reach terabytes per day. Discuss cost-reduction strategies: cardinality limits, sampling, retention policies, tiering (hot/cold storage). Show you've thought about ROI: how much observability is worth the cost?
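A back-of-envelope cost model is easy to carry into an interview. The bytes-per-sample and $/GB figures below are assumptions (roughly in line with the numbers quoted here); tune them to your stack:

```python
# Hedged back-of-envelope storage-cost model for a metrics pipeline.
# bytes_per_sample (~1.3B is a commonly cited compressed figure) and
# usd_per_gb_month are illustrative assumptions.

def monthly_storage_cost(samples_per_sec: float,
                         bytes_per_sample: float = 1.3,
                         retention_days: int = 30,
                         usd_per_gb_month: float = 0.02):
    """Return (GB retained, USD/month) for a steady ingest rate."""
    gb = samples_per_sec * bytes_per_sample * 86_400 * retention_days / 1e9
    return gb, gb * usd_per_gb_month

gb, cost = monthly_storage_cost(1_000_000)
print(f"{gb:,.0f} GB retained, ~${cost:,.0f}/month")  # ~3,400 GB, ~$67/month
```

Being able to re-derive '1M samples/sec ≈ 100GB/day ≈ 3TB/month' on a whiteboard signals that you treat cost as a first-class design constraint.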
Demonstrates deep understanding of metrics design and cardinality management
Explains SLO-driven alerting and error budgets with concrete examples
Understands trade-offs between observability depth and cost at scale
Can design end-to-end observability platforms (metrics, logs, traces) for production systems
Familiar with modern tools: Prometheus, Grafana, Loki, Jaeger, Tempo, OpenTelemetry
Explains structured logging, trace sampling, and instrumentation best practices
Shows experience reducing alert fatigue and improving on-call experience
Discusses multi-region and compliance challenges in observability
Demonstrates incident response and post-incident learning mindset
Advocates for observability standards across teams and platforms
Observability Engineers build the *tools and platforms*—metrics systems, dashboards, alerting frameworks, trace ingestion pipelines. SREs *use* these tools to manage reliability outcomes. SREs own availability targets and incidents; Observability Engineers own the monitoring/observability infrastructure. Many companies combine these roles, but they're distinct: Observability is about the platform, SRE is about the outcome.
Yes. Understand Prometheus (metrics), Grafana (visualisation), Loki (logs), Jaeger or Tempo (traces), and OpenTelemetry (instrumentation standard). You don't need to be an expert in all of them, but interviewers expect familiarity. More important: understand *why* these tools exist and their architectural trade-offs. Know Thanos (long-term metrics), Cortex (distributed Prometheus), and PagerDuty (alerting). Reading documentation and building a small observability stack locally is valuable.
Start simple: 'Cardinality is the number of unique time series. If I add a label with 10k unique values (e.g., user IDs), cardinality explodes 10k times, overwhelming storage and query performance.' Give a concrete example: 'In Prometheus, if you have 100 services with 10 metrics each, that's 1k series. Add a pod_id label with 1k pods, you get 1M series.' Then discuss prevention: relabelling rules, label whitelists, avoiding PII-like labels.
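The arithmetic above is worth being able to reproduce on the spot. A minimal sketch, mirroring the same numbers (the cardinality-budget guard is an illustrative pattern, not a real Prometheus API):

```python
# Hedged sketch of cardinality explosion; numbers mirror the example above.

def series_count(services: int = 100, metrics_per_service: int = 10,
                 label_values: int = 1) -> int:
    """Active time series once a label with N distinct values is added."""
    return services * metrics_per_service * label_values

print(series_count())                   # 100 services x 10 metrics = 1,000
print(series_count(label_values=1000))  # add pod_id with 1k pods -> 1,000,000

# A simple guard: reject a label that would blow the cardinality budget.
# The budget value and helper name are illustrative assumptions.
CARDINALITY_BUDGET = 100_000

def check_label(label: str, distinct_values: int, base_series: int) -> int:
    projected = base_series * distinct_values
    if projected > CARDINALITY_BUDGET:
        raise ValueError(f"label {label!r} would create {projected:,} series")
    return projected
```

In practice the same guard shows up as Prometheus relabelling rules or label allow-lists enforced at scrape time rather than application code.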
Alert fatigue is the top operational challenge in observability. If engineers are paged constantly and 90% of alerts are false positives, they stop responding. Mention SLO-driven alerting to filter noise. Discuss alert aggregation—group related alerts. Ask: 'How many pages does your on-call team get per week? What's the false positive rate?' Show you'd improve this. Candidates who don't mention alert fatigue seem inexperienced.
Be specific: '1M metrics/second at typical compression rates = ~100GB/day = ~3TB/month. Cloud storage costs ~$0.02/GB, so ~$60/month for metrics storage.' Discuss trade-offs: 'We could increase sampling to reduce costs, but that reduces observability depth.' Mention cost-reduction strategies: cardinality limits, retention tiers (keep errors 30 days, info 7 days), data sampling, compression. Show ROI thinking: 'Is $500/month for full observability worth catching production incidents 1 hour faster?'
OpenTelemetry is a standard for instrumenting applications to emit metrics, logs, and traces without vendor lock-in. It's important because it decouples instrumentation from backends: you instrument once with OpenTelemetry, then send data to any backend (Prometheus, Jaeger, Datadog, etc.). Reduces vendor switching costs and promotes standardisation. Mention OpenTelemetry Collector for data pipeline and semantic conventions for consistent naming.
For high-variability systems, static thresholds fail. Use anomaly detection (e.g., Prometheus Anomaly Detector) to baseline normal behaviour, alert on deviation. Use percentile-based alerting: alert if p99 latency exceeds historical p99 + 2σ. Implement SLO-driven alerting with wide error budgets if necessary (e.g., 95% SLO). Use PagerDuty escalations: warn on-call after 5 mins if anomaly detected, page after 10 mins if still anomalous. Test alerts in staging with real traffic patterns.
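The 'historical p99 + 2σ' rule can be sketched with the standard library. This is a simplified model (a real system would compute baselines over rolling windows in the TSDB, and the window sizes here are assumptions):

```python
# Hedged sketch of percentile-plus-deviation alerting for noisy systems.
import statistics

def p99(samples: list[float]) -> float:
    """Nearest-rank p99 of a sample window."""
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]

def is_anomalous(history_ms: list[float], current_p99_ms: float) -> bool:
    # Alert only when current p99 exceeds the historical p99 by more
    # than two standard deviations of the baseline window.
    baseline = p99(history_ms)
    sigma = statistics.pstdev(history_ms)
    return current_p99_ms > baseline + 2 * sigma

history = [100 + (i % 10) for i in range(1000)]  # steady 100-109ms latencies
print(is_anomalous(history, 105))  # within normal variation -> False
print(is_anomalous(history, 300))  # clear deviation -> True
```

A static 'alert if latency > 200ms' rule would either fire constantly on a naturally spiky service or never fire on one whose baseline is already high; deviation from the service's own history adapts to both.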
Use language-agnostic standards: OpenTelemetry SDKs exist for all major languages. Define semantic conventions (e.g., 'http.method', 'db.query') so metrics are consistent across languages. Use auto-instrumentation agents (e.g., OpenTelemetry Java agent) to avoid per-service coding. For logs, enforce structured JSON with mandatory fields (trace_id, service_name) regardless of language. For metrics, expose Prometheus format—most libraries support it. Test consistency: run the same traffic through different language services, verify metrics align.
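The structured-logging requirement above is simple to demonstrate. A minimal stdlib sketch, assuming illustrative field names (`ts`, `trace_id`, `service_name`) in the spirit of the conventions described; a real deployment would route this through the language's logging framework:

```python
# Hedged sketch: one JSON log line with mandatory correlation fields,
# identical in shape regardless of the emitting language.
import json
import time

MANDATORY = ("trace_id", "service_name")

def log_event(message: str, trace_id: str, service_name: str,
              level: str = "info", **fields) -> dict:
    record = {
        "ts": time.time(),
        "level": level,
        "message": message,
        "trace_id": trace_id,          # lets the log backend join logs to traces
        "service_name": service_name,
        **fields,                      # free-form structured context
    }
    print(json.dumps(record))          # one JSON object per line
    return record

rec = log_event("payment failed", trace_id="abc123",
                service_name="checkout", amount_cents=4999)
```

Because every service emits the same shape, a single query like `trace_id="abc123"` pulls the full cross-service story of one request, whichever languages produced the lines.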
Simulate a real observability engineering interview. Record yourself on camera, receive AI-powered feedback on technical depth, tool knowledge, and communication clarity. Practise across all difficulty levels.
Start a Mock Interview → Takes less than 15 minutes.