
Computer Vision Engineer Interview Questions & Answers (2026 Guide)

Computer vision interviews test far more than CNN theory. Expect questions on detection architectures, segmentation trade-offs, evaluation metrics like mAP and IoU, and the production realities of deploying vision models — latency budgets, edge constraints, and domain shift. This guide covers the full scope with answer frameworks and sample responses for the questions that actually determine hiring decisions.

Detection & segmentation architecture questions
Metrics, evaluation, and error analysis
Edge deployment & production optimization
Domain-specific: AV, medical, manufacturing


Last updated: February 2026

Computer vision engineering sits at the intersection of deep learning and real-world perception. Unlike a general deep learning engineer who might work across any neural network application, a CV engineer specializes in making machines see — and the interview questions reflect that specialization.

Expect to explain why your detector uses FPN instead of a single-scale feature map, how you'd reduce false positives in a manufacturing inspection line without hurting recall, and what happens to your mAP when you change the IoU threshold from 0.5 to 0.75. Interviewers also care about the full pipeline: data collection and labeling strategy, augmentation, training, post-processing (NMS and its variants), and deployment under latency and memory constraints.

This guide is organized by interview topic area: detection and segmentation architectures first, then metrics and evaluation, training and data strategy, production deployment, and domain-specific variations that change what interviewers prioritize.

What Computer Vision Engineers Do in 2026

The computer vision engineer role has broadened significantly. Early CV engineers were primarily researchers implementing papers. Today, the role is deeply production-oriented, with responsibilities spanning the entire pipeline from data to deployment.

Detection and segmentation system design — selecting and adapting architectures for specific visual tasks. This means understanding not just which model to use but why: anchor-based vs anchor-free, one-stage vs two-stage, and how the choice affects speed, accuracy, and deployment complexity.

Data pipeline and labeling strategy — designing annotation workflows, managing labeling quality, handling class imbalance, and building augmentation pipelines. In production CV, data quality often matters more than model architecture.

Evaluation and error analysis — going beyond a single mAP number to understand where and why a model fails. This includes per-class analysis, failure mode categorization, and connecting model errors to business impact.

Production optimization — deploying vision models under real-world constraints: latency requirements, memory limits, and throughput demands. This involves quantization, pruning, distillation, and serving infrastructure design.

Domain adaptation and robustness — handling the gap between training data and production conditions. Different cameras, lighting changes, weather, motion blur, and distribution shift over time. This is often the hardest part of production CV.

CV Engineer vs Deep Learning Engineer vs ML Engineer

The computer vision engineer role overlaps heavily with deep learning engineer and ML engineer, and many companies use the titles loosely. The comparison below clarifies where the interview focus differs — use it to prioritize what to prepare.

Dimension | Computer Vision Engineer | Deep Learning Engineer | ML Engineer
Core focus | Visual perception — detection, segmentation, tracking, and vision-specific pipelines | Neural network architecture design, training optimization, and model internals across any domain | End-to-end ML pipelines — data processing, model selection (classical + deep), feature engineering, serving
Typical interview questions | Compare Faster R-CNN vs YOLO vs FCOS, explain NMS, calculate mAP at different IoU thresholds | Derive backprop, explain why LayerNorm beats BatchNorm in transformers, design a training pipeline | Design a feature store, compare XGBoost vs neural net for tabular data, build an inference pipeline
Math expectations | Moderate — projective geometry, convolution math, IoU computation, precision-recall curves | High — linear algebra, calculus, probability, information theory on whiteboard | Moderate — statistics, probability, some linear algebra
Data focus | Image/video annotation quality, augmentation strategy, domain shift between cameras and environments | Dataset construction for training, preprocessing, data loading efficiency | Feature engineering, data drift, pipeline reliability, data quality at scale
Production concerns | Inference latency for video, NMS overhead, camera calibration, edge deployment (TensorRT, ONNX) | Training efficiency, GPU utilization, quantization for general deployment | Pipeline reliability, A/B testing, feature freshness, serving infrastructure
Domain knowledge | High — camera optics, sensor characteristics, lighting, specific verticals (AV, medical, manufacturing) | Low to moderate — domain-agnostic architecture expertise | Low to moderate — domain varies by company

Detection & Segmentation Questions

Detection and segmentation are the bread and butter of CV interviews. Interviewers expect you to know the major architecture families, understand why each design decision was made, and reason about trade-offs in the context of a specific deployment scenario.

Compare one-stage and two-stage object detectors. When would you choose each?
Why They Ask It

This is the most fundamental architectural question in object detection. Your answer reveals whether you understand the design trade-offs or just know model names.

What They Evaluate
  • Understanding of the speed-accuracy trade-off
  • Knowledge of specific architectures
  • Ability to match architecture to deployment constraints
Answer Framework

Two-stage detectors (Faster R-CNN family) generate region proposals first, then classify and refine — higher accuracy, especially for small objects and crowded scenes, but slower. One-stage detectors (YOLO, SSD, RetinaNet, FCOS) predict directly from feature maps — faster, but historically less accurate on hard cases. Key innovations: focal loss solved class imbalance for one-stage detectors, and anchor-free approaches eliminated hand-designed anchors. Choose two-stage when accuracy on hard cases matters more than speed. Choose one-stage when latency matters.

Sample Answer

Two-stage detectors like Faster R-CNN use a region proposal network to generate candidate boxes, then a second stage classifies each proposal and refines the bounding box. This two-pass approach gives the model two chances to get it right, which helps with small objects and cluttered scenes. The cost is speed — the second stage runs per-proposal. One-stage detectors predict class and location directly from feature maps. The historical weakness was accuracy: the massive imbalance between background and foreground locations overwhelmed training. RetinaNet's focal loss fixed this by downweighting easy negatives. Today there's also the anchor-based vs anchor-free split. Anchor-free detectors (FCOS, CenterNet) predict center points and distances to box edges, eliminating hyperparameter tuning for anchor ratios and scales. In practice, I'd choose a one-stage anchor-free detector as the default for most production systems because of speed and simpler configuration. I'd switch to two-stage for datasets with dense small objects, heavy occlusion, or where accuracy on hard cases is the primary metric.

Explain Feature Pyramid Networks (FPN). Why are they important for detection?
Why They Ask It

Multi-scale feature handling is critical for detecting objects at different sizes. FPN is the standard approach and understanding it reveals architectural depth.

What They Evaluate
  • Understanding of multi-scale features
  • Knowledge of information flow through a feature pyramid
  • Awareness of why naive approaches fail
Answer Framework

The problem: objects appear at vastly different scales. Early CNN layers have high resolution but weak semantics; deep layers have strong semantics but low resolution. FPN adds a top-down pathway with lateral connections that merge high-resolution low-level features with low-resolution high-level features at each scale. The result is feature maps at multiple resolutions, all with strong semantic information. FPN is standard in nearly all modern detectors and also used in instance segmentation (Mask R-CNN).

What is Non-Maximum Suppression (NMS)? What are its failure modes, and what alternatives exist?
Why They Ask It

NMS is a critical post-processing step that every detection pipeline uses. Its failure modes cause real production issues.

What They Evaluate
  • Understanding of the NMS algorithm
  • Awareness of when it fails
  • Knowledge of practical alternatives
Answer Framework

NMS removes duplicates: sort by confidence, keep the top box, suppress all boxes with IoU above threshold, repeat. Failure modes: (1) crowded scenes — suppresses valid overlapping detections; (2) hard IoU threshold cutoff; (3) sequential nature makes it a latency bottleneck on GPU. Alternatives: Soft-NMS (decays confidence instead of hard suppression), Weighted Box Fusion (merges via weighted averages), and DETR (removes NMS entirely with set prediction and Hungarian matching).

Sample Answer

NMS works by sorting detections by confidence, taking the highest-scoring box, removing all other boxes that overlap it above an IoU threshold, then repeating. It has three failure modes that matter in production. First, crowded scenes: if two people stand close together and their boxes exceed the IoU threshold, NMS suppresses one. Soft-NMS addresses this by decaying confidence of overlapping boxes instead of removing them. Second, the IoU threshold is a hard binary decision — no threshold is right for all scenes. Third, NMS is sequential and can't be parallelized efficiently on GPU, making it a latency bottleneck in real-time systems. Weighted Box Fusion merges overlapping boxes into a single weighted-average box for better localization. DETR sidesteps NMS entirely by framing detection as set prediction with Hungarian matching — no duplicate removal needed.

Compare semantic segmentation, instance segmentation, and panoptic segmentation. What architectures are used for each?
Why They Ask It

Segmentation has multiple formulations with different architectures. Confusing them is a red flag.

What They Evaluate
  • Clear understanding of what each task requires
  • Knowledge of the architecture landscape
  • Ability to match approach to requirements
Answer Framework

Semantic: classify every pixel into a category (no instance distinction). Architectures: FCN, U-Net, DeepLab. Instance: detect individual objects and produce a mask for each. Architectures: Mask R-CNN, YOLACT, SOLOv2. Panoptic: combines both — segments 'stuff' semantically and 'things' by instance. Architectures: Panoptic FPN, MaskFormer, Mask2Former (unified transformer-based approach handling all three).

How does YOLO's architecture differ across versions, and what are the key improvements?
Why They Ask It

YOLO is the most commonly deployed detection family. Understanding its evolution shows whether you follow the field.

What They Evaluate
  • Knowledge of YOLO's design evolution
  • Understanding of which improvements mattered most
  • Ability to reason about architectural decisions
Answer Framework

YOLOv1: single-pass grid-based, fast but weak on small objects. v2/v3: anchor boxes, multi-scale prediction, batch norm — improved small object detection. v4/v5: bag-of-freebies (mosaic augmentation, CutMix) and bag-of-specials (CSP backbone, PANet neck). Modern variants (v8, RT-DETR): anchor-free heads, decoupled classification/regression, transformer necks. Key insight: most improvements came from training recipe changes rather than fundamental architecture changes.

Explain how multi-object tracking works. Compare SORT, DeepSORT, and ByteTrack.
Why They Ask It

Tracking is essential for video-based CV. Many CV roles involve video processing.

What They Evaluate
  • Understanding of tracking-by-detection paradigm
  • Knowledge of specific algorithms and trade-offs
  • Awareness of production tracking challenges
Answer Framework

Tracking-by-detection: run detector per frame, associate detections across frames. SORT: Kalman filter for motion prediction, Hungarian algorithm for IoU-based assignment. Fast but fails with occlusion. DeepSORT: adds Re-ID appearance embedding for visual similarity when IoU matching fails. More robust but slower. ByteTrack: uses low-confidence detections too — matches high-confidence first, then remaining tracks with low-confidence detections, recovering partially occluded objects. Production challenges: ID switches, Re-ID cost at high object counts, ego-motion compensation.

You need to build a system that detects small defects on manufactured parts. What architecture and training considerations would you prioritize?
Why They Ask It

Tests whether you can apply architectural knowledge to a real scenario with specific constraints.

What They Evaluate
  • Ability to reason about a concrete problem
  • Knowledge of small object detection techniques
  • Practical judgment about data and deployment
Answer Framework

Small object detection considerations: (1) high-resolution input — don't downsample aggressively, or use tiling/sliding window; (2) FPN with strong low-level feature paths; (3) anchor sizes tuned to defect size distribution; (4) augmentation that preserves small object visibility; (5) strict IoU thresholds since localization matters; (6) extreme class imbalance — focal loss or hard example mining; (7) consider two-stage approach with high-resolution second stage.

Metrics & Evaluation Questions

Metrics questions separate candidates who genuinely evaluate their models from those who just report a single number. Interviewers want to see that you understand what metrics actually measure, when they mislead, and how to connect model performance to business outcomes.

Explain mAP. What's the difference between AP50, AP75, and COCO mAP?
Why They Ask It

mAP is the standard detection metric, but many candidates can't explain what it actually computes. This is a litmus test for CV competency.

What They Evaluate
  • Precise understanding of the AP computation
  • Knowledge of how IoU thresholds affect the metric
  • Ability to interpret results practically
Answer Framework

AP for a single class: sort detections by confidence, compute precision and recall at each threshold, compute area under the PR curve. mAP averages across all classes. AP50 uses IoU ≥ 0.5 (lenient), AP75 uses IoU ≥ 0.75 (strict localization). COCO mAP averages across IoU thresholds from 0.5 to 0.95. Also mention AP_small, AP_medium, AP_large — these reveal size-specific weaknesses.

Sample Answer

mAP starts with per-class Average Precision. For each class, rank all detections by confidence, then walk down the list computing precision and recall. A detection is a true positive if it has IoU above the threshold with an unmatched ground truth box. The PR curve is summarized as area under the curve — that's the AP for one class. mAP is the mean across all classes. The IoU threshold matters enormously. AP50 only requires 50% overlap — most modern detectors score well here. AP75 requires 75% overlap, testing localization precision. A model might score 60 AP50 but only 35 AP75, meaning it finds objects but doesn't box them tightly. COCO mAP averages across ten IoU thresholds from 0.5 to 0.95. What I find most useful in practice is the size-based breakdown: AP_small, AP_medium, AP_large. This almost always reveals weakness on small objects. If I'm presenting results, I always show the size breakdown rather than just the headline mAP.
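The AP computation described above can be sketched in a few lines of NumPy. This is a minimal, illustrative version assuming detections have already been matched to ground truth at a fixed IoU threshold (the `is_tp` flags and the function name are ours, not a standard API):

```python
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """AP for one class: area under the precision-recall curve.

    confidences: detection scores; is_tp: 1 if the detection matched an
    unmatched ground-truth box at the chosen IoU threshold; num_gt: total
    ground-truth boxes for this class.
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision monotonically non-increasing (all-points interpolation),
    # then integrate the PR curve step by step.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate([[0.0], recall])
    precision = np.concatenate([[precision[0]], precision])
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```

mAP is then the mean of this value over classes, and COCO mAP additionally averages it over IoU thresholds from 0.5 to 0.95.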

How do you evaluate a segmentation model? What metrics do you use, and what do they miss?
Why They Ask It

Segmentation metrics are less standardized than detection metrics. Choosing the wrong one can mask serious failures.

What They Evaluate
  • Knowledge of segmentation-specific metrics
  • Understanding of their limitations
  • Practical evaluation judgment
Answer Framework

IoU per class, then mIoU — standard for semantic segmentation. Dice coefficient — common in medical imaging. Pixel accuracy — misleading with class imbalance. Boundary F1 — evaluates edge quality. Panoptic Quality (PQ) for panoptic segmentation. What they miss: pixel metrics don't capture topology, and they weight all pixels equally — a mistake on a critical boundary may matter more than one in a region center.

Your detection model has high overall mAP but performs poorly on a specific class. How do you diagnose and fix it?
Why They Ask It

Aggregate metrics hide per-class failures. This tests real error analysis methodology.

What They Evaluate
  • Error analysis methodology
  • Ability to connect metrics to root causes
  • Practical problem-solving
Answer Framework

Diagnosis: (1) per-class AP, (2) confusion matrix — confused with a specific class? (3) false negatives — small, occluded, unusual poses? (4) training data — underrepresented or noisy annotations? (5) visualize predictions. Fixes depend on cause: class imbalance → oversampling/focal loss; annotation quality → re-label; hard examples → targeted augmentation or hard example mining; confusion with similar class → more discriminative features.

What is model calibration, and why does it matter for production CV systems?
Why They Ask It

Confidence calibration is critical for production decisions but often overlooked.

What They Evaluate
  • Understanding of what calibration means
  • Knowledge of how to measure and fix it
  • Awareness of production impact
Answer Framework

Calibration means predicted confidence matches actual accuracy — 90% confident should be correct ~90% of the time. Most neural networks are overconfident. Production impact: confidence thresholds drive decisions (flag for human review, auto-approve). Measuring: Expected Calibration Error (ECE), reliability diagrams. Fixing: temperature scaling (simple, effective), Platt scaling, histogram binning. Temperature scaling is the standard first approach.
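Both ideas fit in a few lines. Below is a minimal sketch (function names are ours) of ECE over equal-width confidence bins, plus temperature scaling applied to logits — dividing by a learned T > 1 flattens overconfident predictions without changing the argmax:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: per-bin |confidence - accuracy| gap, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def softmax_with_temperature(logits, T):
    """Temperature scaling: softer probabilities for T > 1, same argmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

In practice T is fit on a held-out validation set by minimizing negative log-likelihood, then frozen for inference.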

When is precision more important than recall, and vice versa? Give examples from different CV domains.
Why They Ask It

Understanding business context of metrics shows you can connect performance to real-world impact.

What They Evaluate
  • Ability to reason about metric trade-offs in context
  • Domain awareness
  • Practical judgment
Answer Framework

Precision matters when false positives are costly: manufacturing defect detection (stopping production line), content moderation (blocking legitimate content), AV non-critical alerts (eroding trust). Recall matters when false negatives are costly: medical screening (missing a tumor), security surveillance (missing a threat), AV pedestrian detection (safety failure). Many systems need different operating points for different contexts.

Training & Data Questions

In production computer vision, data strategy often determines model quality more than architecture choice. These questions test whether you understand the full data pipeline — from collection and annotation through augmentation and training.

What data augmentation techniques do you use for object detection, and which ones can hurt performance?
Why They Ask It

Augmentation is critical for CV generalization, but naive application can introduce problems.

What They Evaluate
  • Knowledge of vision-specific augmentations
  • Understanding of which are safe for detection vs classification
  • Practical augmentation strategy experience
Answer Framework

Safe and effective: horizontal flip, random crop (with box adjustment), color jitter, mosaic augmentation, CutMix/MixUp. Potentially harmful: aggressive rotation (if objects have canonical orientation), extreme aspect ratio changes, augmentations that cut objects without adjusting labels, heavy blur (destroys small object features). Key principle: any spatial augmentation must also transform bounding box annotations. Any augmentation that removes an object must remove its annotation.

How do you handle class imbalance in an object detection dataset?
Why They Ask It

Class imbalance is the norm in CV, not the exception.

What They Evaluate
  • Knowledge of techniques at multiple levels
  • Practical judgment about which to apply
  • Understanding of foreground-background imbalance
Answer Framework

Data level: oversample rare classes, copy-paste augmentation, targeted data collection. Loss level: focal loss, class-weighted CE, OHEM. Architecture level: balanced feature sampling across FPN levels. Evaluation: report per-class AP, not just mAP. Practical note: foreground-background imbalance is often a bigger problem than inter-class imbalance, and focal loss addresses both.
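To make the focal loss point concrete, here is a minimal NumPy sketch of the binary (foreground/background) form from RetinaNet — the `(1 - p_t)^gamma` factor is what downweights easy negatives (function name and defaults are illustrative):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss. p: predicted foreground probability,
    y: 1 for foreground, 0 for background."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)          # prob of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma -> near zero for confident, correct predictions,
    # so the abundant easy background locations contribute almost nothing.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))
```

With gamma = 0 this reduces to class-weighted cross-entropy; gamma = 2 is the standard setting.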

How do you detect and handle domain shift between your training data and production environment?
Why They Ask It

Domain shift is the #1 cause of CV model failures in production.

What They Evaluate
  • Awareness of vision-specific domain shift sources
  • Knowledge of detection and mitigation techniques
  • Practical deployment experience
Answer Framework

Common sources: different cameras, lighting conditions, environment changes, concept drift. Detection: monitor confidence distributions, track per-class performance on periodic audits, compare feature distributions. Mitigation: domain randomization during training, style transfer for synthetic-to-real gaps, test-time augmentation, periodic retraining with production data, and maintaining diverse training sets covering deployment conditions.

What are the most common forms of data leakage in computer vision, and how do you prevent them?
Why They Ask It

Data leakage in CV is subtler than in tabular ML.

What They Evaluate
  • Awareness of vision-specific leakage sources
  • Systematic prevention methodology
  • Practical experience with video and image datasets
Answer Framework

Common leakage: (1) near-duplicate video frames split across train/val — fix: split by video/sequence; (2) same object in multiple images in both splits — fix: split by object ID; (3) metadata leakage (EXIF, filenames correlating with labels) — fix: strip metadata; (4) augmentation before splitting — fix: always split first; (5) temporal leakage in sequential tasks. Prevention: define split strategy before looking at data, validate with independently collected test set, run leakage audit with trivially simple models.
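The group-split fix for near-duplicate frames can be sketched with a stable hash, so assignment is deterministic and every frame of a video lands in the same split (names here are ours, not a library API):

```python
import hashlib

def split_by_group(frame_ids, group_of, val_fraction=0.2):
    """Assign every frame of the same video/group to one split, so
    near-duplicate frames never straddle train and val."""
    train, val = [], []
    for fid in frame_ids:
        group = group_of(fid)
        # Stable hash of the group id mapped into [0, 1); Python's built-in
        # hash() is salted per process, so use hashlib instead.
        h = int(hashlib.md5(group.encode()).hexdigest(), 16) / 16 ** 32
        (val if h < val_fraction else train).append(fid)
    return train, val
```

The same pattern works for splitting by patient ID in medical imaging or by camera/site in surveillance datasets.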

Describe your approach to building a labeling pipeline for a new CV project.
Why They Ask It

Labeling strategy is a core CV engineering responsibility.

What They Evaluate
  • Understanding of the full annotation lifecycle
  • Quality assurance methods
  • Cost-efficiency thinking
Answer Framework

Start with clear guidelines (visual examples of edge cases). Pilot round to test guidelines and measure inter-annotator agreement. QA: multi-annotator overlap on subset, consensus resolution, automated checks. Active learning: label small set, train initial model, use uncertainty to select next labeling batch. Key metric: inter-annotator agreement — if two annotators can't agree, neither can your model.

Production & Deployment Questions

Production CV is where many candidates fall short. Training a model that works on a benchmark is one thing — deploying it to run at 30fps on an edge device while handling real-world variation is another.

How would you optimize a detection model to run in real-time on an edge device?
Why They Ask It

Edge deployment is one of the most common CV production requirements.

What They Evaluate
  • Knowledge of model compression techniques
  • Understanding of edge-specific constraints
  • Ability to reason about the latency budget
Answer Framework

Profile where latency lives (backbone, neck, head, NMS). Optimization layers: (1) lightweight backbone (MobileNet, EfficientNet) or purpose-built edge detector (YOLOv8-nano); (2) INT8 quantization via TensorRT — 2-3x speedup with minimal loss; (3) export format (TensorRT for NVIDIA Jetson, TFLite for mobile); (4) input resolution reduction — computation is quadratic in resolution; (5) NMS optimization — limit max detections, batched NMS. Measure end-to-end including preprocessing and postprocessing.

Sample Answer

First I'd profile the full pipeline, not just the model forward pass. On edge devices, preprocessing — resizing and normalizing on CPU — can take longer than model inference on GPU. For the model: architecture selection is key. A purpose-built small model usually outperforms a compressed large model at the same latency. Next, INT8 quantization via TensorRT typically gives 2-3x speedup with less than 1 mAP drop. I'd calibrate using representative data from the production environment. Input resolution is the biggest single lever — going from 640 to 416 reduces computation by roughly 2.4x. For NMS: cap maximum detections, increase confidence threshold to filter early, and use TensorRT's built-in NMS plugin. Finally, if processing a video stream, batch multiple frames for better GPU utilization at the cost of slightly higher per-frame latency.

What are the main latency bottlenecks in a video processing CV pipeline, and how do you address each?
Why They Ask It

Video processing has different constraints than single-image inference. This tests systems thinking.

What They Evaluate
  • Understanding of the full video processing pipeline
  • Knowledge of where bottlenecks occur
  • Practical optimization experience
Answer Framework

Bottlenecks by stage: (1) video decoding — use hardware-accelerated NVDEC; (2) preprocessing — resize/normalize on GPU not CPU; (3) model inference — batch frames, use TensorRT; (4) postprocessing — NMS, tracking, business logic; (5) I/O — writing results, downstream services. Minimize CPU-GPU memory transfers by keeping the full pipeline on GPU. For tracking, association cost scales with number of tracked objects.

How do you handle camera calibration and its impact on model performance?
Why They Ask It

Camera calibration is a production reality invisible in benchmark datasets.

What They Evaluate
  • Understanding of camera geometry
  • Awareness of how calibration affects model inputs
  • Practical deployment knowledge
Answer Framework

Calibration covers intrinsic parameters (focal length, principal point, distortion) and extrinsic parameters (position, orientation). Impact: lens distortion bends straight lines, which degrades detection, especially near image edges; different focal lengths change apparent object size; 3D reasoning requires accurate calibration. Approach: undistort images before the model (OpenCV), include calibration metadata in the pipeline, recalibrate when cameras change, and train with augmentations simulating calibration variation.

Compare TensorRT, ONNX Runtime, and TFLite for deploying vision models.
Why They Ask It

Choosing the right inference runtime directly affects latency and deployment flexibility.

What They Evaluate
  • Knowledge of the deployment toolchain
  • Understanding of hardware-runtime pairings
  • Practical deployment experience
Answer Framework

TensorRT: NVIDIA's optimizer — best raw performance on NVIDIA GPUs (data center and Jetson). Vendor-locked but gives biggest speedup via kernel fusion and INT8 calibration. ONNX Runtime: cross-platform, good default for portability or multi-hardware. Performance is good but usually doesn't match TensorRT on NVIDIA. TFLite: optimized for mobile (Android/iOS) and microcontrollers, best for on-device inference. Typical pipeline: develop in PyTorch → export to ONNX → compile to TensorRT for NVIDIA targets or keep ONNX Runtime for cloud, convert to TFLite for mobile.

How CV Interviews Differ by Domain

Computer vision interviews shift significantly depending on the industry vertical. While the fundamentals are shared, the specific questions, constraints, and domain knowledge vary enough that you should tailor your preparation.

Autonomous Vehicles

AV interviews focus on multi-sensor perception: camera, LiDAR, and radar fusion. Expect questions on 3D object detection (PointPillars, CenterPoint), bird's-eye view representations, and multi-object tracking with motion prediction. Real-time constraints are strict — perception must run at sensor frame rate with bounded worst-case latency. Safety is paramount: failure mode analysis, sensor redundancy, edge cases like unusual road users or adverse weather. ODD (Operational Design Domain) and functional safety concepts may come up.

Medical Imaging

Medical CV interviews emphasize sensitivity and specificity, calibration, and regulatory awareness. Metrics shift from mAP to Dice coefficient, sensitivity/specificity, and AUC-ROC. Dataset bias is a major concern — models trained on one hospital's data may fail at another. Expect questions on small datasets (transfer learning, self-supervised pretraining), 3D volumetric processing (CT, MRI), and clinical validation. Regulatory frameworks (FDA clearance for SaMD) are relevant context.

Manufacturing & Quality Inspection

These interviews focus on anomaly detection (detecting defects never seen before), extreme class imbalance (99.9%+ good parts), and false positive cost (stopping a production line for a false alarm is expensive). Expect questions on one-class classification, few-shot learning for new defect types, and handling camera or lighting changes in deployment. Throughput matters — inspection systems may process hundreds of parts per minute.

Retail & E-Commerce

Expect questions on fine-grained visual recognition (distinguishing similar products), image search and retrieval (embedding space design, similarity metrics), and OCR/document understanding. Scale matters — catalogs may have millions of products, and the system needs to handle user-uploaded images of varying quality.

Coding Questions You'll Actually Get

CV coding interviews are different from standard software engineering interviews. Instead of LeetCode-style problems, expect implementation questions that test whether you can translate CV concepts into working code. These typically involve NumPy or PyTorch.

Implement IoU (Intersection over Union) between two bounding boxes.
Why They Ask It

IoU is the most fundamental computation in object detection. If you can't implement it from scratch, interviewers question whether you understand the metrics you report.

What They Evaluate
  • Ability to translate a geometric concept into code
  • Attention to edge cases (non-overlapping boxes, zero-area)
  • Comfort with NumPy or PyTorch tensor operations
Answer Framework

Boxes as (x1, y1, x2, y2). Intersection: max of x1s, max of y1s, min of x2s, min of y2s. Clamp to zero if no overlap. Intersection area = max(0, x_right - x_left) × max(0, y_bottom - y_top). Union = area_box1 + area_box2 - intersection. IoU = intersection / union. Handle zero union. For batch computation, vectorize with NumPy broadcasting.
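The steps above translate directly into a scalar implementation — a sketch of what an interviewer typically expects as a first pass before vectorizing:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle: max of the top-left corners,
    # min of the bottom-right corners.
    x_left = max(box_a[0], box_b[0])
    y_top = max(box_a[1], box_b[1])
    x_right = min(box_a[2], box_b[2])
    y_bottom = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap.
    inter = max(0.0, x_right - x_left) * max(0.0, y_bottom - y_top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A common follow-up is the batched version: broadcast an (N, 4) array against an (M, 4) array in NumPy to get an (N, M) IoU matrix.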

Implement Non-Maximum Suppression from scratch.
Why They Ask It

NMS is in every detection pipeline. Implementing it tests whether you understand the algorithm you describe conceptually.

What They Evaluate
  • Algorithmic implementation skills
  • Understanding of greedy suppression logic
  • Ability to handle sorted-iteration pattern
Answer Framework

Input: boxes (N×4), scores (N), IoU threshold. Sort indices by score descending. While indices remain: take top-scoring index, add to keep list, compute IoU between that box and all remaining, remove indices where IoU exceeds threshold. Return kept indices. Vectorize the inner IoU computation. For Soft-NMS, decay scores instead of removing.
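A minimal NumPy version of that loop, with the inner IoU computation vectorized, might look like this:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: (N, 4) as x1, y1, x2, y2. Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)  # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Vectorized IoU between the kept box and all remaining boxes.
        xx1 = np.maximum(x1[i], x1[rest])
        yy1 = np.maximum(y1[i], y1[rest])
        xx2 = np.minimum(x2[i], x2[rest])
        yy2 = np.minimum(y2[i], y2[rest])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        overlap = inter / (areas[i] + areas[rest] - inter)
        order = rest[overlap <= iou_thresh]  # suppress heavy overlaps
    return keep
```

For the Soft-NMS follow-up, replace the suppression line with a score decay (e.g. multiply overlapping scores by `1 - overlap` or a Gaussian of the overlap) and re-sort instead of removing.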

Write a custom data augmentation pipeline for object detection that correctly transforms both images and bounding boxes.
Why They Ask It

Tests whether you understand the critical constraint: spatial transforms must be applied to both the image and annotations.

What They Evaluate
  • Understanding of coordinate transforms
  • Awareness of edge cases (boxes out of bounds, clipped, too small)
  • Practical coding skills
Answer Framework

Implement: horizontal flip (flip image, transform x-coordinates: new_x1 = width - old_x2), random crop (adjust box coordinates relative to crop origin, remove boxes that fall outside), resize (scale coordinates proportionally). Validate resulting boxes are still valid (positive area, within bounds). Libraries like Albumentations handle this, but understand the mechanics.
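As a sketch of the simplest case, here is the horizontal flip with the box transform described above (function name illustrative; crop and resize follow the same pattern of transforming coordinates alongside pixels):

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an (H, W, C) image and its (N, 4) boxes in (x1, y1, x2, y2).

    x-coordinates mirror around the image width, and x1/x2 swap roles so each
    box stays in (left, right) order: new_x1 = W - old_x2, new_x2 = W - old_x1.
    """
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()       # reverse the column axis
    new_boxes = boxes.astype(float).copy()
    new_boxes[:, 0] = w - boxes[:, 2]
    new_boxes[:, 2] = w - boxes[:, 0]
    return flipped, new_boxes
```

Note the x1/x2 swap: flipping `x1` alone would produce a box whose left edge is to the right of its right edge, which is exactly the kind of invalid-box bug this question is designed to surface.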

Given predicted and ground truth segmentation masks, compute mean IoU across classes.
Why They Ask It

mIoU computation tests per-class evaluation understanding and efficient tensor operations.

What They Evaluate
  • Understanding of per-class metric computation
  • Ability to work with multi-class masks
  • Comfort with vectorized operations
Answer Framework

Input: predicted mask (H×W), ground truth mask (H×W), number of classes. Per class: intersection (pixels where both pred and gt equal that class), union (pixels where either equals that class), IoU = intersection / union (exclude zero-union classes). Vectorize: build confusion matrix (N×N), derive IoU per class from diagonal and row/column sums.
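The confusion-matrix route described above might be sketched like this (function name illustrative; assumes class labels are integers in [0, num_classes)):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU from (H, W) predicted and ground-truth integer class masks."""
    # Confusion matrix: rows = ground-truth class, cols = predicted class.
    # Encode each (gt, pred) pixel pair as a single index, then histogram.
    idx = gt.reshape(-1) * num_classes + pred.reshape(-1)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)

    inter = np.diag(conf).astype(float)        # pixels where pred == gt == c
    union = conf.sum(0) + conf.sum(1) - inter  # predicted-as-c + gt-as-c - overlap
    valid = union > 0                          # skip classes absent from both masks
    return (inter[valid] / union[valid]).mean()
```

Excluding zero-union classes (rather than scoring them 0 or 1) matters: a class that never appears in either mask should not move the mean.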

Behavioral Questions

CV roles require collaboration with data labeling teams, product managers, and hardware/infrastructure engineers. Behavioral questions test whether you can navigate these cross-functional relationships and make pragmatic decisions.

Tell me about a time a model performed well in testing but failed in production. What happened and how did you fix it?
Why They Ask It

The train-production gap is the defining challenge of applied CV.

What They Evaluate
  • Real deployment experience
  • Systematic debugging methodology
  • Understanding of domain shift and its causes
Answer Framework

STAR format. Describe the gap (what metric dropped). Root cause analysis — domain shift, data quality, or pipeline bug? Walk through the fix and what you changed in your process to prevent recurrence. Strongest answers show you changed your evaluation methodology, not just the model.

How do you decide when your training data is good enough to start training?
Why They Ask It

Data collection can be an infinite time sink. This tests pragmatic judgment.

What They Evaluate
  • Data quality intuition
  • Awareness of diminishing returns
  • Ability to balance data investment with modeling time
Answer Framework

Minimum viable dataset: enough examples per class to learn basic visual features. Check inter-annotator agreement — fix guidelines before collecting more. Train a baseline early — its error analysis tells you what's missing. Iterate: collect more data targeted at failure modes rather than uniformly expanding. Goal is a data flywheel where each training round informs the next collection round.

Describe a situation where you disagreed with a product manager about model requirements. How did you resolve it?
Why They Ask It

CV engineers frequently face unrealistic accuracy or latency requirements.

What They Evaluate
  • Ability to translate technical constraints into business terms
  • Negotiation skills
  • Pragmatic problem-solving
Answer Framework

Choose an example about a real constraint (e.g., PM wanted 99% accuracy but labeling noise ceiling was 95%, or PM wanted real-time but target device lacked GPU). Show how you quantified the trade-off, presented data, and found a compromise — tiered system, different operating point, or phased approach.

Practice These Questions with AI Feedback

Reading frameworks is a start — but CV interviews reward the ability to reason through architecture trade-offs and deployment constraints under pressure. Our AI simulator generates role-specific questions, times your responses, and scores both technical depth and communication clarity.

Start Free Practice Interview →

Tailored to computer vision engineer roles. No credit card required.

Frequently Asked Questions

What's the most important thing to study for a computer vision interview?

If you only have limited prep time, focus on object detection architectures and evaluation metrics. Be able to explain the difference between one-stage and two-stage detectors, walk through how mAP is computed, and describe your approach to error analysis when a model underperforms. These topics come up in nearly every CV interview regardless of the specific domain. After that, prioritize whatever matches the company's domain — if they do autonomous driving, study 3D detection and sensor fusion; if they do medical imaging, study segmentation metrics and calibration.

Do I need to know classical computer vision (OpenCV, SIFT, etc.)?

It depends on the role. Most modern CV engineer positions are deep-learning-first, so you'll spend most interview time on neural architectures, training, and deployment. However, classical CV concepts still appear in production: camera calibration uses traditional geometric methods, image preprocessing is still relevant, and some edge deployment scenarios use classical features because they're faster. Companies working with 3D vision, robotics, or augmented reality are more likely to test classical CV. If the job description mentions OpenCV, stereo vision, or SLAM, prepare for it. Otherwise, focus on deep learning approaches.

How important is paper reading for CV interviews?

For senior roles, very important. You should be able to discuss recent papers — not recite every detail, but explain the key idea, why it matters, and its limitations. For mid-level roles, know the foundational papers in detection (R-CNN family, YOLO, DETR), segmentation (U-Net, DeepLab, Mask R-CNN), and domain-specific papers relevant to the company. The most common mistake is knowing what a model does but not why it was designed that way.

Should I prepare coding questions for a CV interview?

Yes, but they're usually different from standard software engineering coding interviews. Expect implementation questions like computing IoU between bounding boxes, implementing NMS, or writing a data augmentation pipeline. Some companies also include standard algorithm questions, especially at larger tech companies. At CV-focused companies and startups, coding tends to be more domain-specific. Be comfortable with PyTorch and NumPy — you may need to implement a custom loss function, write a training loop, or manipulate tensors.

What's the career path for a computer vision engineer?

CV engineers typically progress from implementing existing architectures to designing systems end-to-end and leading technical strategy. The senior path branches into technical lead (owning the vision system for a product), research engineering (bridging research and production), or management (leading a CV team). Domain expertise becomes increasingly valuable — a CV engineer with deep autonomous driving or medical imaging experience is more specialized and harder to replace than a generalist. Some CV engineers transition to broader ML or AI engineering roles, and the skills transfer well.

How do computer vision interviews differ from general deep learning interviews?

CV interviews are domain-specialized. A deep learning engineer interview might ask you to derive backpropagation or explain transformers in the abstract. A CV interview asks you to apply that knowledge to visual problems: why FPN matters for multi-scale detection, how NMS affects mAP, what happens when the camera changes. CV interviews also test domain knowledge that general DL interviews skip — camera calibration, annotation pipeline design, and domain-specific metrics like COCO mAP or Dice coefficient. Preparation should be roughly 60% vision-specific and 40% general deep learning fundamentals.

Ready to Prepare for Your Computer Vision Engineer Interview?

Upload your resume and the job description. Our AI generates targeted questions based on the specific role — covering detection architectures, segmentation, evaluation metrics, edge deployment, and domain-specific scenarios. Practice with timed responses, camera on, and detailed scoring on both technical accuracy and explanation clarity.

Start Free Practice Interview →

Personalized computer vision engineer interview prep. No credit card required.