Computer Vision in Production: Beyond the Demo
The Benchmark Trap
Computer vision demos are seductive. You find a pretrained model, run it on some test images, achieve 90%+ accuracy, and feel like you're almost done.
Then you deploy to production, and everything breaks.
The images your users upload are blurry, poorly lit, taken from unexpected angles, and contain objects your training set never saw. The model that was 95% accurate on your test set is 60% accurate in production. Your users are frustrated. Your confidence in the model is gone.
This is the benchmark trap. Here's how to avoid it.
Data Collection and Labeling Strategy
The most important work in a CV project happens before you write any model code.
Collect data from your actual deployment environment: If your model will run on mobile phone cameras in poor lighting, don't train it on studio-quality images. The distribution of your training data must match the distribution of your production data.
Define your labeling ontology carefully: What exactly is a "defect"? What counts as "person detected"? Ambiguous labeling instructions produce inconsistent labels, which produce models that learn to be inconsistently accurate.
Build a labeling pipeline, not a one-time labeling effort: Your data will grow. New edge cases will emerge. Design a repeatable process from the start. Tools: Label Studio (open source), Scale AI (managed), Labelbox (enterprise).
Set aside a hard test set immediately: Reserve 10-20% of your data as a test set that no one touches until evaluation. Do this before labeling, before training, and before you get attached to any particular model's performance.
Distribution Shift: The Silent Killer
Distribution shift occurs when the data your model sees in production is systematically different from the data the model was trained on. It's the most common cause of production CV failures.
Common sources of distribution shift:
- Lighting: Models trained on daytime images fail at night, or vice versa
- Camera: Different camera sensors, focal lengths, and resolutions produce different image statistics
- Geography and demographics: A model trained primarily on images from one geographic region may underperform in others
- Seasonal and temporal drift: A model trained in summer may degrade in winter; a model trained in 2023 may degrade by 2025
Mitigation strategies:
- Diverse training data: Deliberately collect data across the variation axes that matter for your use case
- Data augmentation: Simulate variation in training (brightness, contrast, rotation, blur) to make the model more robust (see the sketch after this list)
- Continuous monitoring: Track model performance metrics in production, not just at evaluation time. When performance degrades, investigate the inputs it's degrading on.
- Model retraining pipeline: Build the infrastructure to retrain models on new data. You will need to use it.
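To make the augmentation point concrete, here's a minimal sketch using torchvision. The specific transforms and parameter ranges are illustrative assumptions, not a recipe; tune them to the variation you actually see in production.

```python
from torchvision import transforms

# Augmentations that simulate common production variation. Apply only
# at training time; evaluation should see clean, unaugmented images.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4),       # lighting variation
    transforms.RandomRotation(degrees=15),                      # camera angle variation
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # focus / motion blur
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```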
Inference Latency at Scale
The accuracy a model achieves on a GPU server may not be achievable in your production environment within your latency constraints.
Questions to answer before building:
- What is the maximum acceptable latency for a prediction? (User-facing: typically <500ms. Background processing: seconds or minutes may be fine.)
- What hardware is available at inference time?
- How many concurrent predictions need to be served?
Latency optimization techniques:
- Model quantization: Reducing weight precision from float32 to int8 typically reduces inference time by 2-4x with minimal accuracy loss (see the sketch after this list)
- Model pruning: Removing redundant weights reduces model size and inference time
- Batching: Process multiple images together rather than one at a time for higher throughput
- TensorRT: NVIDIA's optimization library can dramatically reduce inference time on GPU
- ONNX: Convert models to ONNX format for hardware-agnostic optimization
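As a rough illustration of the quantization and ONNX points above, here's a sketch that exports a PyTorch model to ONNX and applies dynamic int8 quantization with ONNX Runtime. The model, filenames, and input shape are placeholders; for conv-heavy models, static quantization with calibration data is often the better fit.

```python
import torch
import torchvision.models as models
from onnxruntime.quantization import quantize_dynamic, QuantType

model = models.resnet18(weights=None).eval()   # stand-in for your model
dummy = torch.randn(1, 3, 224, 224)            # match your real input shape

# Export to ONNX for hardware-agnostic optimization.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Dynamic quantization: weights stored as int8, activations quantized
# at runtime. Re-measure accuracy on your held-out test set afterwards.
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)
```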
Edge Deployment vs Cloud
Should your model run on the edge (device) or in the cloud?
Edge wins when:
- Latency must be under 100ms (a network round-trip alone can exceed that budget)
- Privacy requirements prevent sending images to cloud servers
- Connectivity is unreliable (factory floors, vehicles, remote locations)
- Volume is high enough that per-inference cloud costs become prohibitive
Cloud wins when:
- The model is too large or complex for edge hardware
- Edge hardware is too constrained to update models frequently
- The use case allows for higher latency
- You need to use the latest model without hardware updates
For edge deployment: ONNX Runtime, TensorFlow Lite, and PyTorch Mobile are the standard frameworks. Apple Core ML for iOS. MediaPipe for mobile and embedded.
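With ONNX Runtime, inference is a few lines. This sketch reuses the quantized model file from the earlier example; the filename, input name, and shape are assumptions.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.int8.onnx",
                               providers=["CPUExecutionProvider"])

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # preprocessed image
(logits,) = session.run(None, {"input": image})
print(logits.shape)
```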
Confidence Thresholding
Never ship a CV model that returns predictions without confidence scores. And never accept all predictions equally.
Set a confidence threshold below which the model returns "uncertain" rather than a classification. What to do with uncertain predictions depends on your use case:
- Route to human review
- Request a better image (different angle, lighting)
- Fall back to a different system
- Simply not act on the prediction
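Here's a minimal sketch of that routing logic; the threshold value is a placeholder you'd tune on validation data, possibly per class.

```python
import numpy as np

THRESHOLD = 0.85  # placeholder: tune on validation data

def route_prediction(logits: np.ndarray) -> dict:
    """Return a label when confident, otherwise flag as uncertain."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    confidence = float(probs.max())
    if confidence >= THRESHOLD:
        return {"label": int(probs.argmax()), "confidence": confidence}
    # Below threshold: route to human review, request a better image, etc.
    return {"label": "uncertain", "confidence": confidence}
```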
The right threshold is product-specific. A medical imaging system should use a high threshold (automate fewer cases rather than risk acting on wrong predictions). A recommendation system might accept lower-confidence predictions.
Calibration matters: a model that says "90% confident" should be right 90% of the time, not 70% of the time. Calibration is routinely ignored and routinely causes production problems.
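One simple way to check calibration is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's average confidence to its actual accuracy. A minimal numpy sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by bin size. confidences: max softmax probability per
    prediction; correct: 1 if the prediction was right, else 0."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

An ECE near zero means confidence scores track accuracy; a large value means the "90% confident" number can't be trusted at face value.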
Human-in-the-Loop Fallback
For high-stakes CV applications (medical, legal, financial), build explicit human review into the system design from the start.
The pattern: the model makes a prediction with a confidence score. High-confidence predictions are automated. Low-confidence predictions are queued for human review. Human decisions on low-confidence cases become training data for the next model version.
This creates a virtuous cycle: human review improves the model, which reduces the fraction of cases needing human review, which reduces cost over time.
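In code, the pattern might look like the sketch below. The queue and storage here are stand-ins for whatever task queue and labeled-data store you actually use.

```python
review_queue = []    # stand-in for a real review queue (e.g. a task queue)
training_data = []   # stand-in for a labeled-data store

def handle(image_id, prediction, confidence, threshold=0.9):
    """Automate high-confidence predictions; defer the rest to humans."""
    if confidence >= threshold:
        return prediction                              # automated path
    review_queue.append((image_id, prediction, confidence))
    return None                                        # awaiting human review

def record_human_decision(image_id, human_label):
    """Human decisions become training data for the next model version."""
    training_data.append((image_id, human_label))
```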
Model Drift Monitoring
CV models that were accurate at deployment can become inaccurate over time as the world changes. Monitor for drift:
- Prediction distribution shift: If the model is suddenly predicting class A much more or less than before, investigate
- Performance on labeled incoming data: Set up a process to label a random sample of incoming data and measure model accuracy continuously
- User feedback signals: If users can flag incorrect predictions, use those signals
Set alerts on your monitoring metrics. You want to know about drift before your users tell you about it.
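As one example of such an alert, here's a sketch that flags prediction-distribution shift using the population stability index (PSI). The baseline counts come from a reference window (e.g. the week after deployment), and the 0.2 threshold is a common rule of thumb, not a universal constant.

```python
import numpy as np

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population stability index between two class-count distributions."""
    p = np.asarray(baseline_counts, dtype=float)
    q = np.asarray(current_counts, dtype=float)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum((q - p) * np.log(q / p)))

baseline = [900, 80, 20]   # per-class prediction counts at deployment
current = [600, 350, 50]   # per-class counts in the latest window

if psi(baseline, current) > 0.2:  # rule-of-thumb alert threshold
    print("ALERT: prediction distribution shift; investigate recent inputs")
```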
The teams that build production CV systems that work long-term are the ones who treat model deployment as the beginning of the work, not the end.