Computer Vision in Production: Beyond the Demo
The Benchmark Trap
Computer vision demos are seductive. You find a pretrained model, run it on some test images, achieve 90%+ accuracy, and feel like you're almost done.
Then you deploy to production, and everything breaks.
The images your users upload are blurry, poorly lit, taken from unexpected angles, and contain objects your training set never saw. The model that was 95% accurate on your test set is 60% accurate in production. Your users are frustrated. Your confidence in the model is gone.
This is the benchmark trap. Here's how to avoid it.
Data Collection and Labeling Strategy
The most important work in a CV project happens before you write any model code.
Collect data from your actual deployment environment: If your model will run on mobile phone cameras in poor lighting, don't train it on studio-quality images. The distribution of your training data must match the distribution of your production data.
Define your labeling ontology carefully: What exactly is a "defect"? What counts as "person detected"? Ambiguous labeling instructions produce inconsistent labels, which produce models that learn to be inconsistently accurate.
Build a labeling pipeline, not a one-time labeling effort: Your data will grow. New edge cases will emerge. Design a repeatable process from the start. Tools: Label Studio (open source), Scale AI (managed), Labelbox (enterprise).
Set aside a hard test set immediately: Reserve 10-20% of your data as a test set that no one touches until evaluation. Do this before labeling, before training, and before you get attached to any particular model's performance.
Distribution Shift: The Silent Killer
Distribution shift occurs when the data your model sees in production is systematically different from the data the model was trained on. It's the most common cause of production CV failures.
Common sources of distribution shift:
- Lighting: Models trained on daytime images fail at night, or vice versa
- Camera: Different camera sensors, focal lengths, and resolutions produce different image statistics
- Geography and demographics: A model trained primarily on images from one geographic region may underperform in others
- Seasonal and temporal drift: A model trained in summer may degrade in winter; a model trained in 2023 may degrade by 2025
Mitigation strategies:
- Diverse training data: Deliberately collect data across the variation axes that matter for your use case
- Data augmentation: Simulate variation in training (brightness, contrast, rotation, blur) to make the model more robust (see the sketch after this list)
- Continuous monitoring: Track model performance metrics in production, not just at evaluation time. When performance degrades, investigate the inputs it's degrading on.
- Model retraining pipeline: Build the infrastructure to retrain models on new data. You will need to use it.
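To make the augmentation point concrete, here's a minimal sketch using torchvision. The specific transforms and parameter ranges are illustrative assumptions, not a recipe; tune them to the variation you actually see in production.

```python
from torchvision import transforms

# Augmentations that simulate common production variation. Apply only
# at training time; evaluation should see clean, unaugmented images.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4),       # lighting variation
    transforms.RandomRotation(degrees=15),                      # camera angle variation
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # focus / motion blur
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```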
Inference Latency at Scale
The accuracy a model achieves on a GPU server may not be achievable in your production environment within your latency constraints.
Questions to answer before building:
- What is the maximum acceptable latency for a prediction? (User-facing: typically <500ms. Background processing: seconds or minutes may be fine.)
- What hardware is available at inference time?
- How many concurrent predictions need to be served?
Latency optimization techniques:
- Model quantization: Reducing weight precision from float32 to int8 typically reduces inference time by 2-4x with minimal accuracy loss (see the sketch after this list)
- Model pruning: Removing redundant weights reduces model size and inference time
- Batching: Process multiple images together rather than one at a time for higher throughput
- TensorRT: NVIDIA's optimization library can dramatically reduce inference time on GPU
- ONNX: Convert models to ONNX format for hardware-agnostic optimization
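As a rough illustration of the quantization and ONNX points above, here's a sketch that exports a PyTorch model to ONNX and applies dynamic int8 quantization with ONNX Runtime. The model, filenames, and input shape are placeholders; for conv-heavy models, static quantization with calibration data is often the better fit.

```python
import torch
import torchvision.models as models
from onnxruntime.quantization import quantize_dynamic, QuantType

model = models.resnet18(weights=None).eval()   # stand-in for your model
dummy = torch.randn(1, 3, 224, 224)            # match your real input shape

# Export to ONNX for hardware-agnostic optimization.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Dynamic quantization: weights stored as int8, activations quantized
# at runtime. Re-measure accuracy on your held-out test set afterwards.
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)
```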
Edge Deployment vs Cloud
Should your model run on the edge (device) or in the cloud?
Edge wins when:
- Latency must be under 100ms (a network round-trip alone can exceed that budget)
- Privacy requirements prevent sending images to cloud servers
- Connectivity is unreliable (factory floors, vehicles, remote locations)
- Volume is high enough that per-inference cloud costs become prohibitive
Cloud wins when:
- The model is too large or complex for edge hardware
- Edge hardware is too constrained to update models frequently
- The use case allows for higher latency
- You need to use the latest model without hardware updates
For edge deployment: ONNX Runtime, TensorFlow Lite, and PyTorch Mobile are the standard frameworks. Apple Core ML for iOS. MediaPipe for mobile and embedded.
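With ONNX Runtime, inference is a few lines. This sketch reuses the quantized model file from the earlier example; the filename, input name, and shape are assumptions.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.int8.onnx",
                               providers=["CPUExecutionProvider"])

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # preprocessed image
(logits,) = session.run(None, {"input": image})
print(logits.shape)
```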
Confidence Thresholding
Never ship a CV model that returns predictions without confidence scores. And never accept all predictions equally.
Set a confidence threshold below which the model returns "uncertain" rather than a classification. What to do with uncertain predictions depends on your use case:
- Route to human review
- Request a better image (different angle, lighting)
- Fall back to a different system
- Simply not act on the prediction
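Here's a minimal sketch of that routing logic; the threshold value is a placeholder you'd tune on validation data, possibly per class.

```python
import numpy as np

THRESHOLD = 0.85  # placeholder: tune on validation data

def route_prediction(logits: np.ndarray) -> dict:
    """Return a label when confident, otherwise flag as uncertain."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    confidence = float(probs.max())
    if confidence >= THRESHOLD:
        return {"label": int(probs.argmax()), "confidence": confidence}
    # Below threshold: route to human review, request a better image, etc.
    return {"label": "uncertain", "confidence": confidence}
```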
The right threshold is product-specific. A medical imaging system should use a high threshold (automate fewer cases rather than risk acting on wrong predictions). A recommendation system might accept lower-confidence predictions.
Calibration matters: a model that says "90% confident" should be right 90% of the time, not 70% of the time. Calibration is routinely ignored and routinely causes production problems.
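One simple way to check calibration is expected calibration error (ECE): bucket predictions by confidence and compare each bucket's average confidence to its actual accuracy. A minimal numpy sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by bin size. confidences: max softmax probability per
    prediction; correct: 1 if the prediction was right, else 0."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

An ECE near zero means confidence scores track accuracy; a large value means the "90% confident" number can't be trusted at face value.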
Human-in-the-Loop Fallback
For high-stakes CV applications (medical, legal, financial), build explicit human review into the system design from the start.
The pattern: the model makes a prediction with a confidence score. High-confidence predictions are automated. Low-confidence predictions are queued for human review. Human decisions on low-confidence cases become training data for the next model version.
This creates a virtuous cycle: human review improves the model, which reduces the fraction of cases needing human review, which reduces cost over time.
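In code, the pattern might look like the sketch below. The queue and storage here are stand-ins for whatever task queue and labeled-data store you actually use.

```python
review_queue = []    # stand-in for a real review queue (e.g. a task queue)
training_data = []   # stand-in for a labeled-data store

def handle(image_id, prediction, confidence, threshold=0.9):
    """Automate high-confidence predictions; defer the rest to humans."""
    if confidence >= threshold:
        return prediction                              # automated path
    review_queue.append((image_id, prediction, confidence))
    return None                                        # awaiting human review

def record_human_decision(image_id, human_label):
    """Human decisions become training data for the next model version."""
    training_data.append((image_id, human_label))
```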
Model Drift Monitoring
CV models that were accurate at deployment can become inaccurate over time as the world changes. Monitor for drift:
- Prediction distribution shift: If the model is suddenly predicting class A much more or less than before, investigate
- Performance on labeled incoming data: Set up a process to label a random sample of incoming data and measure model accuracy continuously
- User feedback signals: If users can flag incorrect predictions, use those signals
Set alerts on your monitoring metrics. You want to know about drift before your users tell you about it.
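As one example of such an alert, here's a sketch that flags prediction-distribution shift using the population stability index (PSI). The baseline counts come from a reference window (e.g. the week after deployment), and the 0.2 threshold is a common rule of thumb, not a universal constant.

```python
import numpy as np

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population stability index between two class-count distributions."""
    p = np.asarray(baseline_counts, dtype=float)
    q = np.asarray(current_counts, dtype=float)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum((q - p) * np.log(q / p)))

baseline = [900, 80, 20]   # per-class prediction counts at deployment
current = [600, 350, 50]   # per-class counts in the latest window

if psi(baseline, current) > 0.2:  # rule-of-thumb alert threshold
    print("ALERT: prediction distribution shift; investigate recent inputs")
```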
The teams that build production CV systems that work long-term are the ones who treat model deployment as the beginning of the work, not the end.