
Computer Vision Engineer Interview Questions & Answers (2026 Guide)

Computer vision interviews test far more than CNN theory. Expect questions on detection architectures, segmentation trade-offs, evaluation metrics like mAP and IoU, and the production realities of deploying vision models — latency budgets, edge constraints, and domain shift. This guide covers the full scope with answer frameworks and sample responses for the questions that actually determine hiring decisions.

Detection & segmentation architecture questions
Metrics, evaluation, and error analysis
Edge deployment & production optimization
Domain-specific: AV, medical, manufacturing


Last updated: February 2026

Computer vision engineering sits at the intersection of deep learning and real-world perception. Unlike a general deep learning engineer who might work across any neural network application, a CV engineer specializes in making machines see — and the interview questions reflect that specialization.

Expect to explain why your detector uses FPN instead of a single-scale feature map, how you'd reduce false positives in a manufacturing inspection line without hurting recall, and what happens to your mAP when you change the IoU threshold from 0.5 to 0.75. Interviewers also care about the full pipeline: data collection and labeling strategy, augmentation, training, post-processing (NMS and its variants), and deployment under latency and memory constraints.

This guide is organized by interview topic area: detection and segmentation architectures first, then metrics and evaluation, training and data strategy, production deployment, and domain-specific variations that change what interviewers prioritize.

What Computer Vision Engineers Do in 2026

The computer vision engineer role has broadened significantly. Early CV engineers were primarily researchers implementing papers. Today, the role is deeply production-oriented, with responsibilities spanning the entire pipeline from data to deployment.

Detection and segmentation system design — selecting and adapting architectures for specific visual tasks. This means understanding not just which model to use but why: anchor-based vs anchor-free, one-stage vs two-stage, and how the choice affects speed, accuracy, and deployment complexity.

Data pipeline and labeling strategy — designing annotation workflows, managing labeling quality, handling class imbalance, and building augmentation pipelines. In production CV, data quality often matters more than model architecture.

Evaluation and error analysis — going beyond a single mAP number to understand where and why a model fails. This includes per-class analysis, failure mode categorization, and connecting model errors to business impact.

Production optimization — deploying vision models under real-world constraints: latency requirements, memory limits, and throughput demands. This involves quantization, pruning, distillation, and serving infrastructure design.

Domain adaptation and robustness — handling the gap between training data and production conditions. Different cameras, lighting changes, weather, motion blur, and distribution shift over time. This is often the hardest part of production CV.

CV Engineer vs Deep Learning Engineer vs ML Engineer

The computer vision engineer role overlaps heavily with deep learning engineer and ML engineer, and many companies use the titles loosely. The comparison below clarifies where the interview focus differs — use it to prioritize what to prepare.

Dimension | Computer Vision Engineer | Deep Learning Engineer | ML Engineer
Core focus | Visual perception — detection, segmentation, tracking, and vision-specific pipelines | Neural network architecture design, training optimization, and model internals across any domain | End-to-end ML pipelines — data processing, model selection (classical + deep), feature engineering, serving
Typical interview questions | Compare Faster R-CNN vs YOLO vs FCOS, explain NMS, calculate mAP at different IoU thresholds | Derive backprop, explain why LayerNorm beats BatchNorm in transformers, design a training pipeline | Design a feature store, compare XGBoost vs neural net for tabular data, build an inference pipeline
Math expectations | Moderate — projective geometry, convolution math, IoU computation, precision-recall curves | High — linear algebra, calculus, probability, information theory on whiteboard | Moderate — statistics, probability, some linear algebra
Data focus | Image/video annotation quality, augmentation strategy, domain shift between cameras and environments | Dataset construction for training, preprocessing, data loading efficiency | Feature engineering, data drift, pipeline reliability, data quality at scale
Production concerns | Inference latency for video, NMS overhead, camera calibration, edge deployment (TensorRT, ONNX) | Training efficiency, GPU utilization, quantization for general deployment | Pipeline reliability, A/B testing, feature freshness, serving infrastructure
Domain knowledge | High — camera optics, sensor characteristics, lighting, specific verticals (AV, medical, manufacturing) | Low to moderate — domain-agnostic architecture expertise | Low to moderate — domain varies by company

Detection & Segmentation Questions

Detection and segmentation are the bread and butter of CV interviews. Interviewers expect you to know the major architecture families, understand why each design decision was made, and reason about trade-offs in the context of a specific deployment scenario.

Compare one-stage and two-stage object detectors. When would you choose each?
Why They Ask It

This is the most fundamental architectural question in object detection. Your answer reveals whether you understand the design trade-offs or just know model names.

What They Evaluate
  • Understanding of the speed-accuracy trade-off
  • Knowledge of specific architectures
  • Ability to match architecture to deployment constraints
Answer Framework

Two-stage detectors (Faster R-CNN family) generate region proposals first, then classify and refine — higher accuracy, especially for small objects and crowded scenes, but slower. One-stage detectors (YOLO, SSD, RetinaNet, FCOS) predict directly from feature maps — faster, but historically less accurate on hard cases. Key innovations: focal loss solved class imbalance for one-stage detectors, and anchor-free approaches eliminated hand-designed anchors. Choose two-stage when accuracy on hard cases matters more than speed. Choose one-stage when latency matters.

Sample Answer

Two-stage detectors like Faster R-CNN use a region proposal network to generate candidate boxes, then a second stage classifies each proposal and refines the bounding box. This two-pass approach gives the model two chances to get it right, which helps with small objects and cluttered scenes. The cost is speed — the second stage runs per-proposal. One-stage detectors predict class and location directly from feature maps. The historical weakness was accuracy: the massive imbalance between background and foreground locations overwhelmed training. RetinaNet's focal loss fixed this by downweighting easy negatives. Today there's also the anchor-based vs anchor-free split. Anchor-free detectors (FCOS, CenterNet) predict center points and distances to box edges, eliminating hyperparameter tuning for anchor ratios and scales. In practice, I'd choose a one-stage anchor-free detector as the default for most production systems because of speed and simpler configuration. I'd switch to two-stage for datasets with dense small objects, heavy occlusion, or where accuracy on hard cases is the primary metric.

Explain Feature Pyramid Networks (FPN). Why are they important for detection?
Why They Ask It

Multi-scale feature handling is critical for detecting objects at different sizes. FPN is the standard approach and understanding it reveals architectural depth.

What They Evaluate
  • Understanding of multi-scale features
  • Knowledge of information flow through a feature pyramid
  • Awareness of why naive approaches fail
Answer Framework

The problem: objects appear at vastly different scales. Early CNN layers have high resolution but weak semantics; deep layers have strong semantics but low resolution. FPN adds a top-down pathway with lateral connections that merge high-resolution low-level features with low-resolution high-level features at each scale. The result is feature maps at multiple resolutions, all with strong semantic information. FPN is standard in nearly all modern detectors and also used in instance segmentation (Mask R-CNN).

What is Non-Maximum Suppression (NMS)? What are its failure modes, and what alternatives exist?
Why They Ask It

NMS is a critical post-processing step that every detection pipeline uses. Its failure modes cause real production issues.

What They Evaluate
  • Understanding of the NMS algorithm
  • Awareness of when it fails
  • Knowledge of practical alternatives
Answer Framework

NMS removes duplicates: sort by confidence, keep the top box, suppress all boxes with IoU above threshold, repeat. Failure modes: (1) crowded scenes — suppresses valid overlapping detections; (2) hard IoU threshold cutoff; (3) sequential nature makes it a latency bottleneck on GPU. Alternatives: Soft-NMS (decays confidence instead of hard suppression), Weighted Box Fusion (merges via weighted averages), and DETR (removes NMS entirely with set prediction and Hungarian matching).

Sample Answer

NMS works by sorting detections by confidence, taking the highest-scoring box, removing all other boxes that overlap it above an IoU threshold, then repeating. It has three failure modes that matter in production. First, crowded scenes: if two people stand close together and their boxes exceed the IoU threshold, NMS suppresses one. Soft-NMS addresses this by decaying confidence of overlapping boxes instead of removing them. Second, the IoU threshold is a hard binary decision — no threshold is right for all scenes. Third, NMS is sequential and can't be parallelized efficiently on GPU, making it a latency bottleneck in real-time systems. Weighted Box Fusion merges overlapping boxes into a single weighted-average box for better localization. DETR sidesteps NMS entirely by framing detection as set prediction with Hungarian matching — no duplicate removal needed.

Compare semantic segmentation, instance segmentation, and panoptic segmentation. What architectures are used for each?
Why They Ask It

Segmentation has multiple formulations with different architectures. Confusing them is a red flag.

What They Evaluate
  • Clear understanding of what each task requires
  • Knowledge of the architecture landscape
  • Ability to match approach to requirements
Answer Framework

Semantic: classify every pixel into a category (no instance distinction). Architectures: FCN, U-Net, DeepLab. Instance: detect individual objects and produce a mask for each. Architectures: Mask R-CNN, YOLACT, SOLOv2. Panoptic: combines both — segments 'stuff' semantically and 'things' by instance. Architectures: Panoptic FPN, MaskFormer, Mask2Former (unified transformer-based approach handling all three).

How does YOLO's architecture differ across versions, and what are the key improvements?
Why They Ask It

YOLO is the most commonly deployed detection family. Understanding its evolution shows whether you follow the field.

What They Evaluate
  • Knowledge of YOLO's design evolution
  • Understanding of which improvements mattered most
  • Ability to reason about architectural decisions
Answer Framework

YOLOv1: single-pass grid-based, fast but weak on small objects. v2/v3: anchor boxes, multi-scale prediction, batch norm — improved small object detection. v4/v5: bag-of-freebies (mosaic augmentation, CutMix) and bag-of-specials (CSP backbone, PANet neck). Modern variants (v8, RT-DETR): anchor-free heads, decoupled classification/regression, transformer necks. Key insight: most improvements came from training recipe changes rather than fundamental architecture changes.

Explain how multi-object tracking works. Compare SORT, DeepSORT, and ByteTrack.
Why They Ask It

Tracking is essential for video-based CV. Many CV roles involve video processing.

What They Evaluate
  • Understanding of tracking-by-detection paradigm
  • Knowledge of specific algorithms and trade-offs
  • Awareness of production tracking challenges
Answer Framework

Tracking-by-detection: run detector per frame, associate detections across frames. SORT: Kalman filter for motion prediction, Hungarian algorithm for IoU-based assignment. Fast but fails with occlusion. DeepSORT: adds Re-ID appearance embedding for visual similarity when IoU matching fails. More robust but slower. ByteTrack: uses low-confidence detections too — matches high-confidence first, then remaining tracks with low-confidence detections, recovering partially occluded objects. Production challenges: ID switches, Re-ID cost at high object counts, ego-motion compensation.

You need to build a system that detects small defects on manufactured parts. What architecture and training considerations would you prioritize?
Why They Ask It

Tests whether you can apply architectural knowledge to a real scenario with specific constraints.

What They Evaluate
  • Ability to reason about a concrete problem
  • Knowledge of small object detection techniques
  • Practical judgment about data and deployment
Answer Framework

Small object detection considerations: (1) high-resolution input — don't downsample aggressively, or use tiling/sliding window; (2) FPN with strong low-level feature paths; (3) anchor sizes tuned to defect size distribution; (4) augmentation that preserves small object visibility; (5) strict IoU thresholds since localization matters; (6) extreme class imbalance — focal loss or hard example mining; (7) consider two-stage approach with high-resolution second stage.

Metrics & Evaluation Questions

Metrics questions separate candidates who genuinely evaluate their models from those who just report a single number. Interviewers want to see that you understand what metrics actually measure, when they mislead, and how to connect model performance to business outcomes.

Explain mAP. What's the difference between AP50, AP75, and COCO mAP?
Why They Ask It

mAP is the standard detection metric, but many candidates can't explain what it actually computes. This is a litmus test for CV competency.

What They Evaluate
  • Precise understanding of the AP computation
  • Knowledge of how IoU thresholds affect the metric
  • Ability to interpret results practically
Answer Framework

AP for a single class: sort detections by confidence, compute precision and recall at each threshold, compute area under the PR curve. mAP averages across all classes. AP50 uses IoU ≥ 0.5 (lenient), AP75 uses IoU ≥ 0.75 (strict localization). COCO mAP averages across IoU thresholds from 0.5 to 0.95. Also mention AP_small, AP_medium, AP_large — these reveal size-specific weaknesses.

Sample Answer

mAP starts with per-class Average Precision. For each class, rank all detections by confidence, then walk down the list computing precision and recall. A detection is a true positive if it has IoU above the threshold with an unmatched ground truth box. The PR curve is summarized as area under the curve — that's the AP for one class. mAP is the mean across all classes. The IoU threshold matters enormously. AP50 only requires 50% overlap — most modern detectors score well here. AP75 requires 75% overlap, testing localization precision. A model might score 60 AP50 but only 35 AP75, meaning it finds objects but doesn't box them tightly. COCO mAP averages across ten IoU thresholds from 0.5 to 0.95. What I find most useful in practice is the size-based breakdown: AP_small, AP_medium, AP_large. This almost always reveals weakness on small objects. If I'm presenting results, I always show the size breakdown rather than just the headline mAP.
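The AP computation described above can be sketched in a few lines of NumPy. This is a minimal, illustrative version assuming detections have already been matched to ground truth at a fixed IoU threshold (the `is_tp` flags and the function name are ours, not a standard API):

```python
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """AP for one class: area under the precision-recall curve.

    confidences: detection scores; is_tp: 1 if the detection matched an
    unmatched ground-truth box at the chosen IoU threshold; num_gt: total
    ground-truth boxes for this class.
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision monotonically non-increasing (all-points interpolation),
    # then integrate the PR curve step by step.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate([[0.0], recall])
    precision = np.concatenate([[precision[0]], precision])
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```

mAP is then the mean of this value over classes, and COCO mAP additionally averages it over IoU thresholds from 0.5 to 0.95.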

How do you evaluate a segmentation model? What metrics do you use, and what do they miss?
Why They Ask It

Segmentation metrics are less standardized than detection metrics. Choosing the wrong one can mask serious failures.

What They Evaluate
  • Knowledge of segmentation-specific metrics
  • Understanding of their limitations
  • Practical evaluation judgment
Answer Framework

IoU per class, then mIoU — standard for semantic segmentation. Dice coefficient — common in medical imaging. Pixel accuracy — misleading with class imbalance. Boundary F1 — evaluates edge quality. Panoptic Quality (PQ) for panoptic segmentation. What they miss: pixel metrics don't capture topology, and they weight all pixels equally — a mistake on a critical boundary may matter more than one in a region center.

Your detection model has high overall mAP but performs poorly on a specific class. How do you diagnose and fix it?
Why They Ask It

Aggregate metrics hide per-class failures. This tests real error analysis methodology.

What They Evaluate
  • Error analysis methodology
  • Ability to connect metrics to root causes
  • Practical problem-solving
Answer Framework

Diagnosis: (1) per-class AP, (2) confusion matrix — confused with a specific class? (3) false negatives — small, occluded, unusual poses? (4) training data — underrepresented or noisy annotations? (5) visualize predictions. Fixes depend on cause: class imbalance → oversampling/focal loss; annotation quality → re-label; hard examples → targeted augmentation or hard example mining; confusion with similar class → more discriminative features.

What is model calibration, and why does it matter for production CV systems?
Why They Ask It

Confidence calibration is critical for production decisions but often overlooked.

What They Evaluate
  • Understanding of what calibration means
  • Knowledge of how to measure and fix it
  • Awareness of production impact
Answer Framework

Calibration means predicted confidence matches actual accuracy — 90% confident should be correct ~90% of the time. Most neural networks are overconfident. Production impact: confidence thresholds drive decisions (flag for human review, auto-approve). Measuring: Expected Calibration Error (ECE), reliability diagrams. Fixing: temperature scaling (simple, effective), Platt scaling, histogram binning. Temperature scaling is the standard first approach.
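Both ideas fit in a few lines. Below is a minimal sketch (function names are ours) of ECE over equal-width confidence bins, plus temperature scaling applied to logits — dividing by a learned T > 1 flattens overconfident predictions without changing the argmax:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: per-bin |confidence - accuracy| gap, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def softmax_with_temperature(logits, T):
    """Temperature scaling: softer probabilities for T > 1, same argmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

In practice T is fit on a held-out validation set by minimizing negative log-likelihood, then frozen for inference.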

When is precision more important than recall, and vice versa? Give examples from different CV domains.
Why They Ask It

Understanding business context of metrics shows you can connect performance to real-world impact.

What They Evaluate
  • Ability to reason about metric trade-offs in context
  • Domain awareness
  • Practical judgment
Answer Framework

Precision matters when false positives are costly: manufacturing defect detection (stopping production line), content moderation (blocking legitimate content), AV non-critical alerts (eroding trust). Recall matters when false negatives are costly: medical screening (missing a tumor), security surveillance (missing a threat), AV pedestrian detection (safety failure). Many systems need different operating points for different contexts.

Training & Data Questions

In production computer vision, data strategy often determines model quality more than architecture choice. These questions test whether you understand the full data pipeline — from collection and annotation through augmentation and training.

What data augmentation techniques do you use for object detection, and which ones can hurt performance?
Why They Ask It

Augmentation is critical for CV generalization, but naive application can introduce problems.

What They Evaluate
  • Knowledge of vision-specific augmentations
  • Understanding of which are safe for detection vs classification
  • Practical augmentation strategy experience
Answer Framework

Safe and effective: horizontal flip, random crop (with box adjustment), color jitter, mosaic augmentation, CutMix/MixUp. Potentially harmful: aggressive rotation (if objects have canonical orientation), extreme aspect ratio changes, augmentations that cut objects without adjusting labels, heavy blur (destroys small object features). Key principle: any spatial augmentation must also transform bounding box annotations. Any augmentation that removes an object must remove its annotation.

How do you handle class imbalance in an object detection dataset?
Why They Ask It

Class imbalance is the norm in CV, not the exception.

What They Evaluate
  • Knowledge of techniques at multiple levels
  • Practical judgment about which to apply
  • Understanding of foreground-background imbalance
Answer Framework

Data level: oversample rare classes, copy-paste augmentation, targeted data collection. Loss level: focal loss, class-weighted CE, OHEM. Architecture level: balanced feature sampling across FPN levels. Evaluation: report per-class AP, not just mAP. Practical note: foreground-background imbalance is often a bigger problem than inter-class imbalance, and focal loss addresses both.
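To make the focal loss point concrete, here is a minimal NumPy sketch of the binary (foreground/background) form from RetinaNet — the `(1 - p_t)^gamma` factor is what downweights easy negatives (function name and defaults are illustrative):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss. p: predicted foreground probability,
    y: 1 for foreground, 0 for background."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)          # prob of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma -> near zero for confident, correct predictions,
    # so the abundant easy background locations contribute almost nothing.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))
```

With gamma = 0 this reduces to class-weighted cross-entropy; gamma = 2 is the standard setting.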

How do you detect and handle domain shift between your training data and production environment?
Why They Ask It

Domain shift is the #1 cause of CV model failures in production.

What They Evaluate
  • Awareness of vision-specific domain shift sources
  • Knowledge of detection and mitigation techniques
  • Practical deployment experience
Answer Framework

Common sources: different cameras, lighting conditions, environment changes, concept drift. Detection: monitor confidence distributions, track per-class performance on periodic audits, compare feature distributions. Mitigation: domain randomization during training, style transfer for synthetic-to-real gaps, test-time augmentation, periodic retraining with production data, and maintaining diverse training sets covering deployment conditions.

What are the most common forms of data leakage in computer vision, and how do you prevent them?
Why They Ask It

Data leakage in CV is subtler than in tabular ML.

What They Evaluate
  • Awareness of vision-specific leakage sources
  • Systematic prevention methodology
  • Practical experience with video and image datasets
Answer Framework

Common leakage: (1) near-duplicate video frames split across train/val — fix: split by video/sequence; (2) same object in multiple images in both splits — fix: split by object ID; (3) metadata leakage (EXIF, filenames correlating with labels) — fix: strip metadata; (4) augmentation before splitting — fix: always split first; (5) temporal leakage in sequential tasks. Prevention: define split strategy before looking at data, validate with independently collected test set, run leakage audit with trivially simple models.
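The group-split fix for near-duplicate frames can be sketched with a stable hash, so assignment is deterministic and every frame of a video lands in the same split (names here are ours, not a library API):

```python
import hashlib

def split_by_group(frame_ids, group_of, val_fraction=0.2):
    """Assign every frame of the same video/group to one split, so
    near-duplicate frames never straddle train and val."""
    train, val = [], []
    for fid in frame_ids:
        group = group_of(fid)
        # Stable hash of the group id mapped into [0, 1); Python's built-in
        # hash() is salted per process, so use hashlib instead.
        h = int(hashlib.md5(group.encode()).hexdigest(), 16) / 16 ** 32
        (val if h < val_fraction else train).append(fid)
    return train, val
```

The same pattern works for splitting by patient ID in medical imaging or by camera/site in surveillance datasets.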

Describe your approach to building a labeling pipeline for a new CV project.
Why They Ask It

Labeling strategy is a core CV engineering responsibility.

What They Evaluate
  • Understanding of the full annotation lifecycle
  • Quality assurance methods
  • Cost-efficiency thinking
Answer Framework

Start with clear guidelines (visual examples of edge cases). Pilot round to test guidelines and measure inter-annotator agreement. QA: multi-annotator overlap on subset, consensus resolution, automated checks. Active learning: label small set, train initial model, use uncertainty to select next labeling batch. Key metric: inter-annotator agreement — if two annotators can't agree, neither can your model.

Production & Deployment Questions

Production CV is where many candidates fall short. Training a model that works on a benchmark is one thing — deploying it to run at 30fps on an edge device while handling real-world variation is another.

How would you optimize a detection model to run in real-time on an edge device?
Why They Ask It

Edge deployment is one of the most common CV production requirements.

What They Evaluate
  • Knowledge of model compression techniques
  • Understanding of edge-specific constraints
  • Ability to reason about the latency budget
Answer Framework

Profile where latency lives (backbone, neck, head, NMS). Optimization layers: (1) lightweight backbone (MobileNet, EfficientNet) or purpose-built edge detector (YOLOv8-nano); (2) INT8 quantization via TensorRT — 2-3x speedup with minimal loss; (3) export format (TensorRT for NVIDIA Jetson, TFLite for mobile); (4) input resolution reduction — computation is quadratic in resolution; (5) NMS optimization — limit max detections, batched NMS. Measure end-to-end including preprocessing and postprocessing.

Sample Answer

First I'd profile the full pipeline, not just the model forward pass. On edge devices, preprocessing — resizing and normalizing on CPU — can take longer than model inference on GPU. For the model: architecture selection is key. A purpose-built small model usually outperforms a compressed large model at the same latency. Next, INT8 quantization via TensorRT typically gives 2-3x speedup with less than 1 mAP drop. I'd calibrate using representative data from the production environment. Input resolution is the biggest single lever — going from 640 to 416 reduces computation by roughly 2.4x. For NMS: cap maximum detections, increase confidence threshold to filter early, and use TensorRT's built-in NMS plugin. Finally, if processing a video stream, batch multiple frames for better GPU utilization at the cost of slightly higher per-frame latency.

What are the main latency bottlenecks in a video processing CV pipeline, and how do you address each?
Why They Ask It

Video processing has different constraints than single-image inference. This tests systems thinking.

What They Evaluate
  • Understanding of the full video processing pipeline
  • Knowledge of where bottlenecks occur
  • Practical optimization experience
Answer Framework

Bottlenecks by stage: (1) video decoding — use hardware-accelerated NVDEC; (2) preprocessing — resize/normalize on GPU not CPU; (3) model inference — batch frames, use TensorRT; (4) postprocessing — NMS, tracking, business logic; (5) I/O — writing results, downstream services. Minimize CPU-GPU memory transfers by keeping the full pipeline on GPU. For tracking, association cost scales with number of tracked objects.

How do you handle camera calibration and its impact on model performance?
Why They Ask It

Camera calibration is a production reality invisible in benchmark datasets.

What They Evaluate
  • Understanding of camera geometry
  • Awareness of how calibration affects model inputs
  • Practical deployment knowledge
Answer Framework

Calibration covers intrinsic parameters (focal length, principal point, distortion) and extrinsic parameters (position, orientation). Impact: lens distortion bends straight lines, which degrades detection, especially near image edges; different focal lengths change apparent object size; 3D reasoning requires accurate calibration. Approach: undistort images before the model (OpenCV), include calibration metadata in the pipeline, recalibrate when cameras change, and train with augmentations simulating calibration variation.

Compare TensorRT, ONNX Runtime, and TFLite for deploying vision models.
Why They Ask It

Choosing the right inference runtime directly affects latency and deployment flexibility.

What They Evaluate
  • Knowledge of the deployment toolchain
  • Understanding of hardware-runtime pairings
  • Practical deployment experience
Answer Framework

TensorRT: NVIDIA's optimizer — best raw performance on NVIDIA GPUs (data center and Jetson). Vendor-locked but gives biggest speedup via kernel fusion and INT8 calibration. ONNX Runtime: cross-platform, good default for portability or multi-hardware. Performance is good but usually doesn't match TensorRT on NVIDIA. TFLite: optimized for mobile (Android/iOS) and microcontrollers, best for on-device inference. Typical pipeline: develop in PyTorch → export to ONNX → compile to TensorRT for NVIDIA targets or keep ONNX Runtime for cloud, convert to TFLite for mobile.

How CV Interviews Differ by Domain

Computer vision interviews shift significantly depending on the industry vertical. While the fundamentals are shared, the specific questions, constraints, and domain knowledge vary enough that you should tailor your preparation.

Autonomous Vehicles

AV interviews focus on multi-sensor perception: camera, LiDAR, and radar fusion. Expect questions on 3D object detection (PointPillars, CenterPoint), bird's-eye view representations, and multi-object tracking with motion prediction. Real-time constraints are strict — perception must run at sensor frame rate with bounded worst-case latency. Safety is paramount: failure mode analysis, sensor redundancy, edge cases like unusual road users or adverse weather. ODD (Operational Design Domain) and functional safety concepts may come up.

Medical Imaging

Medical CV interviews emphasize sensitivity and specificity, calibration, and regulatory awareness. Metrics shift from mAP to Dice coefficient, sensitivity/specificity, and AUC-ROC. Dataset bias is a major concern — models trained on one hospital's data may fail at another. Expect questions on small datasets (transfer learning, self-supervised pretraining), 3D volumetric processing (CT, MRI), and clinical validation. Regulatory frameworks (FDA clearance for SaMD) are relevant context.

Manufacturing & Quality Inspection

These interviews focus on anomaly detection (detecting defects never seen before), extreme class imbalance (99.9%+ good parts), and false positive cost (stopping a production line for a false alarm is expensive). Expect questions on one-class classification, few-shot learning for new defect types, and handling camera or lighting changes in deployment. Throughput matters — inspection systems may process hundreds of parts per minute.

Retail & E-Commerce

Expect questions on fine-grained visual recognition (distinguishing similar products), image search and retrieval (embedding space design, similarity metrics), and OCR/document understanding. Scale matters — catalogs may have millions of products, and the system needs to handle user-uploaded images of varying quality.

Coding Questions You'll Actually Get

CV coding interviews are different from standard software engineering interviews. Instead of LeetCode-style problems, expect implementation questions that test whether you can translate CV concepts into working code. These typically involve NumPy or PyTorch.

Implement IoU (Intersection over Union) between two bounding boxes.
Why They Ask It

IoU is the most fundamental computation in object detection. If you can't implement it from scratch, interviewers question whether you understand the metrics you report.

What They Evaluate
  • Ability to translate a geometric concept into code
  • Attention to edge cases (non-overlapping boxes, zero-area)
  • Comfort with NumPy or PyTorch tensor operations
Answer Framework

Boxes as (x1, y1, x2, y2). Intersection: max of x1s, max of y1s, min of x2s, min of y2s. Clamp to zero if no overlap. Intersection area = max(0, x_right - x_left) × max(0, y_bottom - y_top). Union = area_box1 + area_box2 - intersection. IoU = intersection / union. Handle zero union. For batch computation, vectorize with NumPy broadcasting.
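The steps above translate directly into a scalar implementation — a sketch of what an interviewer typically expects as a first pass before vectorizing:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle: max of the top-left corners,
    # min of the bottom-right corners.
    x_left = max(box_a[0], box_b[0])
    y_top = max(box_a[1], box_b[1])
    x_right = min(box_a[2], box_b[2])
    y_bottom = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap.
    inter = max(0.0, x_right - x_left) * max(0.0, y_bottom - y_top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A common follow-up is the batched version: broadcast an (N, 4) array against an (M, 4) array in NumPy to get an (N, M) IoU matrix.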

Implement Non-Maximum Suppression from scratch.
Why They Ask It

NMS is in every detection pipeline. Implementing it tests whether you understand the algorithm you describe conceptually.

What They Evaluate
  • Algorithmic implementation skills
  • Understanding of greedy suppression logic
  • Ability to handle sorted-iteration pattern
Answer Framework

Input: boxes (N×4), scores (N), IoU threshold. Sort indices by score descending. While indices remain: take top-scoring index, add to keep list, compute IoU between that box and all remaining, remove indices where IoU exceeds threshold. Return kept indices. Vectorize the inner IoU computation. For Soft-NMS, decay scores instead of removing.
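A minimal NumPy version of that loop, with the inner IoU computation vectorized, might look like this:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: (N, 4) as x1, y1, x2, y2. Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)  # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Vectorized IoU between the kept box and all remaining boxes.
        xx1 = np.maximum(x1[i], x1[rest])
        yy1 = np.maximum(y1[i], y1[rest])
        xx2 = np.minimum(x2[i], x2[rest])
        yy2 = np.minimum(y2[i], y2[rest])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_thresh]  # suppress heavy overlaps
    return keep
```

For the Soft-NMS follow-up, replace the suppression line with a score decay (e.g. multiply overlapping scores by `1 - overlap` or a Gaussian of the overlap) and re-sort instead of removing.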

Write a custom data augmentation pipeline for object detection that correctly transforms both images and bounding boxes.
Why They Ask It

Tests whether you understand the critical constraint: spatial transforms must be applied to both the image and annotations.

What They Evaluate
  • Understanding of coordinate transforms
  • Awareness of edge cases (boxes out of bounds, clipped, too small)
  • Practical coding skills
Answer Framework

Implement: horizontal flip (flip image, transform x-coordinates: new_x1 = width - old_x2), random crop (adjust box coordinates relative to crop origin, remove boxes that fall outside), resize (scale coordinates proportionally). Validate resulting boxes are still valid (positive area, within bounds). Libraries like Albumentations handle this, but understand the mechanics.
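As a sketch of the simplest case, here is the horizontal flip with the box transform described above (function name illustrative; crop and resize follow the same pattern of transforming coordinates alongside pixels):

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an (H, W, C) image and its (N, 4) boxes in (x1, y1, x2, y2).

    x-coordinates mirror around the image width, and x1/x2 swap roles so each
    box stays in (left, right) order: new_x1 = W - old_x2, new_x2 = W - old_x1.
    """
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()       # reverse the column axis
    new_boxes = boxes.astype(float).copy()
    new_boxes[:, 0] = w - boxes[:, 2]
    new_boxes[:, 2] = w - boxes[:, 0]
    return flipped, new_boxes
```

Note the x1/x2 swap: flipping `x1` alone would produce a box whose left edge is to the right of its right edge, which is exactly the kind of invalid-box bug this question is designed to surface.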

Given predicted and ground truth segmentation masks, compute mean IoU across classes.
Why They Ask It

mIoU computation tests per-class evaluation understanding and efficient tensor operations.

What They Evaluate
  • Understanding of per-class metric computation
  • Ability to work with multi-class masks
  • Comfort with vectorized operations
Answer Framework

Input: predicted mask (H×W), ground truth mask (H×W), number of classes. Per class: intersection (pixels where both pred and gt equal that class), union (pixels where either equals that class), IoU = intersection / union (exclude zero-union classes). Vectorize: build confusion matrix (N×N), derive IoU per class from diagonal and row/column sums.
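The confusion-matrix route described above might be sketched like this (function name illustrative; assumes class labels are integers in [0, num_classes)):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU from (H, W) predicted and ground-truth integer class masks."""
    # Confusion matrix: rows = ground-truth class, cols = predicted class.
    # Encode each (gt, pred) pixel pair as a single index, then histogram.
    idx = gt.reshape(-1) * num_classes + pred.reshape(-1)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)

    inter = np.diag(conf).astype(float)        # pixels where pred == gt == c
    union = conf.sum(0) + conf.sum(1) - inter  # predicted-as-c + gt-as-c - overlap
    valid = union > 0                          # skip classes absent from both masks
    return (inter[valid] / union[valid]).mean()
```

Excluding zero-union classes (rather than scoring them 0 or 1) matters: a class that never appears in either mask should not move the mean.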

Behavioral Questions

CV roles require collaboration with data labeling teams, product managers, and hardware/infrastructure engineers. Behavioral questions test whether you can navigate these cross-functional relationships and make pragmatic decisions.

Tell me about a time a model performed well in testing but failed in production. What happened and how did you fix it?
Why They Ask It

The train-production gap is the defining challenge of applied CV.

What They Evaluate
  • Real deployment experience
  • Systematic debugging methodology
  • Understanding of domain shift and its causes
Answer Framework

STAR format. Describe the gap (what metric dropped). Root cause analysis — domain shift, data quality, or pipeline bug? Walk through the fix and what you changed in your process to prevent recurrence. Strongest answers show you changed your evaluation methodology, not just the model.

How do you decide when your training data is good enough to start training?
Why They Ask It

Data collection can be an infinite time sink. This tests pragmatic judgment.

What They Evaluate
  • Data quality intuition
  • Awareness of diminishing returns
  • Ability to balance data investment with modeling time
Answer Framework

Minimum viable dataset: enough examples per class to learn basic visual features. Check inter-annotator agreement — fix guidelines before collecting more. Train a baseline early — its error analysis tells you what's missing. Iterate: collect more data targeted at failure modes rather than uniformly expanding. Goal is a data flywheel where each training round informs the next collection round.

Describe a situation where you disagreed with a product manager about model requirements. How did you resolve it?
Why They Ask It

CV engineers frequently face unrealistic accuracy or latency requirements.

What They Evaluate
  • Ability to translate technical constraints into business terms
  • Negotiation skills
  • Pragmatic problem-solving
Answer Framework

Choose an example about a real constraint (e.g., PM wanted 99% accuracy but labeling noise ceiling was 95%, or PM wanted real-time but target device lacked GPU). Show how you quantified the trade-off, presented data, and found a compromise — tiered system, different operating point, or phased approach.

Practice These Questions with AI Feedback

Reading frameworks is a start — but CV interviews reward the ability to reason through architecture trade-offs and deployment constraints under pressure. Our AI simulator generates role-specific questions, times your responses, and scores both technical depth and communication clarity.

Start Free Practice Interview →

Tailored to computer vision engineer roles. No credit card required.

Frequently Asked Questions

What's the most important thing to study for a computer vision interview?

If you only have limited prep time, focus on object detection architectures and evaluation metrics. Be able to explain the difference between one-stage and two-stage detectors, walk through how mAP is computed, and describe your approach to error analysis when a model underperforms. These topics come up in nearly every CV interview regardless of the specific domain. After that, prioritize whatever matches the company's domain — if they do autonomous driving, study 3D detection and sensor fusion; if they do medical imaging, study segmentation metrics and calibration.

Do I need to know classical computer vision (OpenCV, SIFT, etc.)?

It depends on the role. Most modern CV engineer positions are deep-learning-first, so you'll spend most interview time on neural architectures, training, and deployment. However, classical CV concepts still appear in production: camera calibration uses traditional geometric methods, image preprocessing is still relevant, and some edge deployment scenarios use classical features because they're faster. Companies working with 3D vision, robotics, or augmented reality are more likely to test classical CV. If the job description mentions OpenCV, stereo vision, or SLAM, prepare for it. Otherwise, focus on deep learning approaches.

How important is paper reading for CV interviews?

For senior roles, very important. You should be able to discuss recent papers — not recite every detail, but explain the key idea, why it matters, and its limitations. For mid-level roles, know the foundational papers in detection (R-CNN family, YOLO, DETR), segmentation (U-Net, DeepLab, Mask R-CNN), and domain-specific papers relevant to the company. The most common mistake is knowing what a model does but not why it was designed that way.

Should I prepare coding questions for a CV interview?

Yes, but they're usually different from standard software engineering coding interviews. Expect implementation questions like computing IoU between bounding boxes, implementing NMS, or writing a data augmentation pipeline. Some companies also include standard algorithm questions, especially at larger tech companies. At CV-focused companies and startups, coding tends to be more domain-specific. Be comfortable with PyTorch and NumPy — you may need to implement a custom loss function, write a training loop, or manipulate tensors.

What's the career path for a computer vision engineer?

CV engineers typically progress from implementing existing architectures to designing systems end-to-end and leading technical strategy. The senior path branches into technical lead (owning the vision system for a product), research engineering (bridging research and production), or management (leading a CV team). Domain expertise becomes increasingly valuable — a CV engineer with deep autonomous driving or medical imaging experience is more specialized and harder to replace than a generalist. Some CV engineers transition to broader ML or AI engineering roles, and the skills transfer well.

How do computer vision interviews differ from general deep learning interviews?

CV interviews are domain-specialized. A deep learning engineer interview might ask you to derive backpropagation or explain transformers in the abstract. A CV interview asks you to apply that knowledge to visual problems: why FPN matters for multi-scale detection, how NMS affects mAP, what happens when the camera changes. CV interviews also test domain knowledge that general DL interviews skip — camera calibration, annotation pipeline design, and domain-specific metrics like COCO mAP or Dice coefficient. Preparation should be roughly 60% vision-specific and 40% general deep learning fundamentals.

Ready to Prepare for Your Computer Vision Engineer Interview?

Upload your resume and the job description. Our AI generates targeted questions based on the specific role — covering detection architectures, segmentation, evaluation metrics, edge deployment, and domain-specific scenarios. Practice with timed responses, camera on, and detailed scoring on both technical accuracy and explanation clarity.

Start Free Practice Interview →

Personalized computer vision engineer interview prep. No credit card required.