Automated quality inspection in manufacturing using Pub/Sub and ML classification

We implemented an automated quality inspection system for our manufacturing line using Pub/Sub streaming ingestion, Dataflow processing, and ML-based defect classification. Previously, manual inspection was our bottleneck: inspectors could only check 10% of units, so defects slipped through to customers and quality problems surfaced too late to correct.

Our solution: vision cameras capture an image of each manufactured unit and publish it to Pub/Sub as a base64-encoded message with metadata. A Dataflow pipeline processes the stream, calls a Vertex AI AutoML Vision model trained on defect images, classifies each unit as pass/fail/specific-defect-type, and writes results to BigQuery. A real-time quality dashboard displays metrics and triggers alerts when defect rates spike.

Implementation snippet:

from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(endpoint_name=ENDPOINT_ID)
# AutoML image endpoints take the base64 string directly in "content".
result = endpoint.predict(instances=[{"content": message["image_data"]}]).predictions[0]
defect_class = result["displayNames"][0]  # top predicted class
confidence = result["confidences"][0]

Results: 100% inspection coverage (up from 10%), defect detection rate improved from 75% to 96%, customer complaints reduced by 80%. The system processes 500 units/hour with sub-10-second classification latency. Total implementation took 8 weeks including model training. Happy to discuss technical details of Pub/Sub streaming ingestion, ML-based defect classification, or the real-time quality dashboard.

We trained on 15,000 labeled images covering 8 defect categories plus a ‘pass’ class. For new defect types, the model outputs low confidence scores, which trigger manual review. We capture these edge cases, label them, and retrain monthly. We started at 80% accuracy and are now at 96% after 6 months of continuous improvement. The key is building the feedback loop into your workflow from day one.

How do you handle false positives? If the ML model incorrectly flags good units as defective, that could cause significant waste. Do you have a secondary verification step or confidence threshold tuning?

I’ll provide comprehensive implementation details covering Pub/Sub streaming ingestion, ML-based defect classification, and the real-time quality dashboard.

Pub/Sub Streaming Ingestion Architecture:

  1. Image Capture and Publishing:

    • Industrial cameras: Basler ace 2 cameras (1920x1080, 60fps capability)
    • Capture trigger: PLC signal when unit reaches inspection station
    • Image processing: Compress to JPEG (quality 85), resize to 1024x768 for ML
    • Message format: JSON with base64-encoded image + metadata (unit_id, timestamp, line_id, batch_id)
    • Publishing: Python client on edge gateway publishes to Pub/Sub topic `quality-inspection-images`
    • Throughput: 500 images/hour (8.3 images/min), peaks at 600 images/hour
  2. Pub/Sub Topic Configuration:

    • Topic: quality-inspection-images (receives all inspection images)
    • Message retention: 7 days (allows replay for troubleshooting)
    • Message ordering: Enabled by line_id (maintains sequence per production line)
    • Schema: Defined JSON schema for validation (enforces required fields)
  3. Message Structure:

# Pseudocode - Published message structure:
{
  "unit_id": "UNIT-2024-08-001234",
  "timestamp": "2024-08-22T10:15:32.123Z",
  "line_id": "LINE-A",
  "batch_id": "BATCH-2024-W34-001",
  "image_data": "<base64-encoded JPEG>",
  "image_size": 327680,  # bytes
  "camera_id": "CAM-A-01",
  "metadata": {
    "product_type": "WIDGET-X",
    "shift": "morning",
    "operator_id": "OPR-123"
  }
}
# Average message size: 350KB (within 10MB Pub/Sub limit)
  4. Edge Gateway Implementation:

    • Hardware: Raspberry Pi 4 with industrial I/O hat
    • Software: Python 3.9 with google-cloud-pubsub library
    • Buffering: Local queue for network outages (stores up to 1000 images)
    • Retry logic: Exponential backoff for publish failures
    • Monitoring: Heartbeat messages every 60 seconds to detect gateway issues
  5. Network Considerations:

    • Bandwidth: 1 Mbps sustained, 5 Mbps peak (well within factory network capacity)
    • Latency: Average 50ms from edge to Pub/Sub (factory to Cloud)
    • Reliability: 99.9% publish success rate with retry logic
    • Fallback: Local storage buffer for extended outages (rare)
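The edge gateway's buffered publish path can be sketched in Python. This is an illustrative sketch, not our exact production code: the `BufferedPublisher` class, `build_message` helper, and `publish_fn` wrapper are names we've made up here, with the local queue standing in for the outage buffer described above.

```python
import base64
import json
import queue

def build_message(unit_id, line_id, batch_id, camera_id, jpeg_bytes):
    """Assemble the JSON payload published for each inspected unit."""
    return json.dumps({
        "unit_id": unit_id,
        "line_id": line_id,
        "batch_id": batch_id,
        "camera_id": camera_id,
        "image_data": base64.b64encode(jpeg_bytes).decode("ascii"),
        "image_size": len(jpeg_bytes),
    }).encode("utf-8")

class BufferedPublisher:
    """Queue up to `maxsize` messages locally so brief outages drop nothing."""

    def __init__(self, publish_fn, maxsize=1000):
        self.publish_fn = publish_fn  # e.g. a wrapper around PublisherClient.publish
        self.buffer = queue.Queue(maxsize=maxsize)

    def publish(self, payload, ordering_key):
        try:
            self.publish_fn(payload, ordering_key=ordering_key)
        except Exception:
            # Network hiccup: hold the message for the periodic flush loop.
            self.buffer.put_nowait((payload, ordering_key))

    def flush(self):
        while not self.buffer.empty():
            payload, key = self.buffer.get_nowait()
            self.publish_fn(payload, ordering_key=key)
```

In production the `ordering_key` would be the `line_id`, matching the per-line message ordering configured on the topic.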

ML-Based Defect Classification:

  1. Model Development Process:

    • Data Collection (4 weeks):

      • Captured 20,000 images over 2 months of production
      • Labeled by experienced quality inspectors
      • Classes: Pass (60%), Scratch (15%), Dent (10%), Discoloration (8%), Crack (5%), Other (2%)
      • Balanced dataset using augmentation (rotation, brightness, contrast)
    • Model Training (2 weeks):

      • Used Vertex AI AutoML Vision (classification)
      • Training budget: 24 node-hours
      • Data split: 80% train, 10% validation, 10% test
      • Optimization: Maximize F1 score (balance precision/recall)
    • Model Performance:

      • Overall accuracy: 96.2%
      • Precision by class: Pass (98%), Scratch (94%), Dent (92%), Discoloration (90%), Crack (95%)
      • Recall by class: Pass (97%), Scratch (93%), Dent (91%), Discoloration (89%), Crack (94%)
      • Confusion matrix: Primary confusion between Scratch and Discoloration (visually similar)
  2. Vertex AI Endpoint Configuration:

    • Model: AutoML Vision classification model
    • Machine type: n1-standard-4 (4 vCPU, 15GB RAM)
    • Replicas: 2-6 with autoscaling (based on request volume)
    • Prediction latency: P50=180ms, P95=320ms per image
    • Throughput: 600 predictions/hour per replica (10 predictions/min)
    • Cost: ~$0.15 per 1000 predictions
  3. Dataflow Processing Pipeline:

# Pseudocode - Complete Dataflow pipeline:
1. Read from Pub/Sub topic 'quality-inspection-images'
2. Parse JSON message and validate schema
3. Decode base64 image data to bytes
4. Call Vertex AI prediction endpoint with image bytes
5. Parse prediction response: defect_class, confidence_score, all_predictions
6. Apply business logic based on confidence thresholds:
   - confidence > 95%: Auto-classify (straight-through)
   - confidence 80-95%: Flag for secondary inspection
   - confidence < 80%: Route to manual review queue
7. Enrich result with metadata (unit_id, timestamp, line_id, batch_id)
8. Write classification result to BigQuery table
9. Publish alert to 'quality-alerts' topic if defect detected
10. Update real-time metrics in Cloud Monitoring
# Pipeline processes 500 units/hour with <10s end-to-end latency
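The per-message logic in steps 2-7 above can be sketched in plain Python. In the real Dataflow job each function would live inside a Beam DoFn; the `classify_image` parameter stands in for the Vertex AI endpoint call, and the function names are ours, for illustration only.

```python
import base64
import json

def route(confidence):
    """Step 6: confidence-threshold routing."""
    if confidence > 0.95:
        return "auto"
    if confidence >= 0.80:
        return "secondary_inspection"
    return "manual_review"

def process_message(message_bytes, classify_image):
    """Steps 2-7 for a single Pub/Sub message."""
    record = json.loads(message_bytes.decode("utf-8"))  # step 2: parse
    image = base64.b64decode(record["image_data"])      # step 3: decode
    defect_class, confidence = classify_image(image)    # step 4: model call
    return {                                            # steps 5-7: enrich
        "unit_id": record["unit_id"],
        "line_id": record["line_id"],
        "batch_id": record["batch_id"],
        "defect_class": defect_class,
        "confidence": confidence,
        "routing": route(confidence),
    }
```

The returned dict is the row shape written to BigQuery in step 8; a defect in the result would additionally trigger the `quality-alerts` publish in step 9.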
  4. Confidence-Based Routing Strategy:

    • High Confidence (>95%): 85% of predictions

      • Auto-classify and continue production
      • No manual intervention required
      • Defects: Automatic rejection and routing to rework
    • Medium Confidence (80-95%): 12% of predictions

      • Flag unit for secondary manual inspection
      • Inspector reviews image and ML prediction
      • Final decision recorded for model retraining
    • Low Confidence (<80%): 3% of predictions

      • Always manual review (likely edge case or new defect type)
      • Capture for future training data
      • Investigate root cause (lighting, angle, new defect pattern)
  5. Defect Classification Output:

    • Primary class: Most likely defect type or ‘Pass’
    • Confidence score: 0-100% probability
    • All predictions: Top 3 classes with probabilities
    • Bounding boxes: Defect locations (if model trained for object detection)
    • Processing time: Timestamp of classification
    • Model version: For tracking performance over time
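AutoML image classification responses return parallel `displayNames`/`confidences` lists. A small helper (the name `top_predictions` is ours, illustrative) can rank them into the top-3 output described above:

```python
def top_predictions(display_names, confidences, k=3):
    """Pair the parallel displayNames/confidences lists from the AutoML
    response and return the top-k (class, confidence) pairs, best first."""
    ranked = sorted(zip(display_names, confidences), key=lambda p: p[1], reverse=True)
    return ranked[:k]
```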

Real-Time Quality Dashboard:

  1. Dashboard Architecture:

    • Frontend: React web app hosted on Cloud Run
    • Backend: Node.js API connecting to BigQuery and Pub/Sub
    • Database: BigQuery for historical data, Firestore for real-time state
    • Updates: WebSocket connection for real-time metrics (sub-second updates)
    • Authentication: Cloud Identity-Aware Proxy (IAP)
  2. Key Metrics Displayed:

    • Production Metrics:

      • Units inspected: Current hour, shift, day
      • Inspection rate: Units per minute (target: 8-10)
      • Pass rate: Percentage of units passing inspection
      • Defect rate: Percentage by defect type
    • Quality Metrics:

      • Defects by category: Real-time breakdown (Scratch, Dent, etc.)
      • Defect trends: Hourly, daily, weekly charts
      • First-pass yield: Percentage of units passing without rework
      • Cost of quality: Estimated cost of defects and rework
    • ML Model Metrics:

      • Average confidence score: Indicator of model certainty
      • Confidence distribution: Histogram showing prediction confidence
      • Manual review rate: Percentage requiring human inspection
      • Model accuracy: Validated against inspector feedback
  3. Alert System:

    • Defect Rate Alerts:

      • Trigger: Defect rate exceeds 5% in any 15-minute window
      • Action: Notify production supervisor via Slack and email
      • Escalation: Page quality manager if rate exceeds 10%
    • Model Confidence Alerts:

      • Trigger: Average confidence drops below 85% for 30 minutes
      • Action: Alert ML team to investigate (possible process change or model drift)
    • System Health Alerts:

      • Trigger: Pipeline lag exceeds 60 seconds or error rate > 1%
      • Action: Notify DevOps team for immediate investigation
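The defect-rate alert can be sketched as a sliding 15-minute window check against the 5%/10% thresholds above. The class and method names are illustrative, not taken from our production code:

```python
from collections import deque

class DefectRateMonitor:
    """Track (timestamp, is_defect) events over a sliding window and map the
    defect rate onto the alert levels described above."""

    def __init__(self, window_seconds=900, warn_rate=0.05, page_rate=0.10):
        self.window = window_seconds
        self.warn_rate = warn_rate
        self.page_rate = page_rate
        self.events = deque()  # (timestamp, is_defect)

    def record(self, timestamp, is_defect):
        self.events.append((timestamp, is_defect))
        # Drop events that fell out of the 15-minute window.
        while self.events and self.events[0][0] < timestamp - self.window:
            self.events.popleft()
        return self.alert_level()

    def alert_level(self):
        if not self.events:
            return "ok"
        rate = sum(d for _, d in self.events) / len(self.events)
        if rate > self.page_rate:
            return "page_quality_manager"
        if rate > self.warn_rate:
            return "notify_supervisor"
        return "ok"
```

In the live system the returned level would fan out to Slack/email (supervisor) or paging (quality manager) rather than being inspected directly.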
  4. Dashboard Panels:

    • Live Inspection Feed: Shows last 20 inspected units with thumbnails and results
    • Defect Gallery: Images of recent defects grouped by type
    • Production Line Status: Real-time status of each line (running, stopped, issue)
    • Batch Quality Summary: Quality metrics per production batch
    • Inspector Feedback: Queue of medium/low confidence predictions for review
  5. Integration with Manufacturing Systems:

    • ERP integration: Push quality data to SAP for inventory and cost tracking
    • PLC integration: Signal to reject defective units automatically
    • Traceability: Link inspection results to unit serial numbers for warranty claims
    • Reporting: Automated daily/weekly quality reports for management

Implementation Results and Impact:

  1. Before Automation (Manual Inspection):

    • Inspection coverage: 10% of units (sampling)
    • Defect detection rate: 75% (some defects missed)
    • Inspector throughput: 50 units/hour per inspector
    • Labor cost: 4 inspectors x $50K/year = $200K/year
    • Customer complaints: 15-20 per month
    • Rework cost: ~$100K/year
  2. After Automation (ML-Based Inspection):

    • Inspection coverage: 100% of units
    • Defect detection rate: 96% (significant improvement)
    • System throughput: 500 units/hour (10x improvement)
    • Labor cost: 1 inspector for manual review queue = $50K/year
    • Customer complaints: 3-4 per month (80% reduction)
    • Rework cost: ~$30K/year (70% reduction)
    • System cost: ~$2K/month ($24K/year for Cloud infrastructure)
    • Net savings: $200K - $50K - $24K + $70K (rework) = $196K/year
    • ROI: 8x first year, ongoing savings
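The net-savings line checks out arithmetically; here it is as a quick calculation using the figures quoted above:

```python
# Figures from the before/after comparison above.
labor_before = 200_000            # 4 inspectors x $50K
labor_after = 50_000              # 1 inspector for the review queue
cloud_cost = 24_000               # ~$2K/month infrastructure
rework_savings = 100_000 - 30_000 # rework cost before vs after

net_savings = (labor_before - labor_after) - cloud_cost + rework_savings
print(net_savings)  # 196000, i.e. ~$196K/year
```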
  3. Quality Improvements:

    • First-pass yield: 92% → 96% (fewer defects reaching customers)
    • Cycle time: Reduced by 15% (faster inspection, less rework)
    • Scrap rate: 3% → 1.5% (earlier defect detection)
    • Customer satisfaction: NPS improved from 42 to 68

Lessons Learned and Best Practices:

  1. Start with high-quality training data: Invest time in labeling accuracy
  2. Implement confidence thresholds: Don’t trust low-confidence predictions
  3. Build feedback loop early: Capture manual reviews for continuous improvement
  4. Monitor model drift: Manufacturing processes change, model must adapt
  5. Plan for edge cases: New defect types will emerge, have manual review process
  6. Optimize image size: Balance quality vs bandwidth/storage costs
  7. Test lighting conditions: Consistent lighting critical for vision models
  8. Involve quality team: They know defects best, essential for training data
  9. Deploy gradually: Pilot on one line, validate, then scale to all lines
  10. Document everything: Model versions, training data, threshold tuning decisions

Future Enhancements:

  • Object detection model: Identify specific defect locations for robotic rework
  • Predictive maintenance: Correlate defect patterns with equipment health
  • Root cause analysis: ML to identify upstream process issues causing defects
  • Multi-camera setup: 360° inspection for complete coverage
  • Edge deployment: Run inference on gateway for sub-second latency

This automated quality inspection system demonstrates the power of combining Pub/Sub streaming ingestion, ML-based defect classification, and real-time dashboards to transform manufacturing quality control from manual sampling to automated 100% inspection with significant cost savings and quality improvements.