Happy to share the complete implementation details and results:
VLM Fine-Tuning Approach:
We started with GPT-4 Vision for proof-of-concept because it required no training and achieved 78% accuracy out-of-the-box. However, the API costs were prohibitive at scale (800-1,200 invoices daily × roughly $0.30 per invoice = $8,000-10,000/month just for extraction).
We pivoted to fine-tuning an open-source Donut (Document Understanding Transformer) model. The fine-tuning process:
- Collected 2,000 invoice samples from our top 50 suppliers
- Manually annotated all fields (vendor, invoice number, dates, line items, amounts)
- Split data: 1,600 training, 200 validation, 200 test
- Fine-tuned Donut model for 20 epochs on our annotated dataset
- Achieved 96% field-level accuracy on test set
The fine-tuning took 3 days on 4×A100 GPUs (cloud compute cost: $800). The production model runs on a single GPU server (inference time: 2-3 seconds per invoice). This reduced our per-invoice processing cost from $0.30 to $0.002 - a 150x cost reduction.
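Donut doesn't consume JSON directly - it's trained to emit a flat tag sequence that maps one-to-one onto the target JSON, so each annotation has to be converted into that sequence before fine-tuning. A minimal sketch of that conversion (simplified from the scheme the Donut codebase uses; the tag names mirror our schema fields but the example values are illustrative):

```python
# Convert an annotated invoice dict into a Donut-style target sequence:
# each field becomes <s_field>value</s_field>, list entries are joined
# with a separator token. Simplified sketch, not our exact production code.
def json2token(obj) -> str:
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        return "<sep/>".join(json2token(item) for item in obj)
    return str(obj)

annotation = {
    "invoiceNumber": "INV-1001",
    "vendorName": "Acme Corp",
    "lineItems": [
        {"description": "Widget", "quantity": 10, "unitPrice": 2.5},
        {"description": "Bolt", "quantity": 100, "unitPrice": 0.1},
    ],
}
target = json2token(annotation)
# -> "<s_invoiceNumber>INV-1001</s_invoiceNumber><s_vendorName>Acme Corp</s_vendorName>..."
```

At inference time the same mapping is inverted: the model's tag sequence is parsed back into the JSON structure below.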
JSON Schema Adherence:
Our universal JSON schema follows this structure:
- invoiceNumber (required)
- vendorId (required)
- vendorName (required)
- invoiceDate (required, ISO 8601)
- dueDate (optional, ISO 8601)
- purchaseOrderNumber (optional)
- currency (required, ISO 4217 code)
- subtotal (required)
- taxAmount (optional)
- totalAmount (required)
- lineItems (array, optional):
  - itemNumber
  - description
  - quantity
  - unitPrice
  - lineTotal
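For concreteness, a payload conforming to the schema might look like this (all values invented for illustration):

```python
import json

# Illustrative invoice payload matching the universal schema.
invoice = {
    "invoiceNumber": "INV-2024-0042",
    "vendorId": "V-1234",
    "vendorName": "Acme Industrial Supply",
    "invoiceDate": "2024-03-15",        # required, ISO 8601
    "dueDate": "2024-04-14",            # optional
    "purchaseOrderNumber": "PO-98765",  # optional
    "currency": "USD",                  # required, ISO 4217
    "subtotal": 1250.00,
    "taxAmount": 100.00,                # optional
    "totalAmount": 1350.00,
    "lineItems": [                      # optional
        {"itemNumber": "A-100", "description": "Hex bolts (box)",
         "quantity": 50, "unitPrice": 20.00, "lineTotal": 1000.00},
        {"itemNumber": "A-200", "description": "Washers (box)",
         "quantity": 25, "unitPrice": 10.00, "lineTotal": 250.00},
    ],
}
payload = json.dumps(invoice, indent=2)  # what the pipeline actually passes around
```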
The VLM outputs raw JSON, which passes through a validation pipeline:
- JSON structure validation (all required fields present)
- Data type validation (numbers are numeric, dates are valid)
- Business rule validation (line totals sum to invoice total within $0.50 tolerance)
- Cross-reference validation (vendor ID exists in Blue Yonder)
Invoices that pass all validations (92% of total) proceed to automatic API posting. Failed validations (8%) go to human review with extracted data pre-filled.
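The first three validation stages can be sketched in a few lines; the cross-reference step is left out because it needs a live Blue Yonder lookup. One assumption: we read the $0.50 rule as line totals vs. the tax-exclusive total.

```python
from datetime import date

REQUIRED = {"invoiceNumber", "vendorId", "vendorName", "invoiceDate",
            "currency", "subtotal", "totalAmount"}

def validate(inv: dict) -> list[str]:
    """Return validation errors; an empty list means the invoice can auto-post."""
    errors = []
    # 1. Structure: all required fields present
    missing = REQUIRED - inv.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    # 2. Types: amounts numeric, dates valid ISO 8601
    for field in ("subtotal", "totalAmount"):
        if not isinstance(inv[field], (int, float)):
            errors.append(f"{field} is not numeric")
    try:
        date.fromisoformat(inv["invoiceDate"])
    except (TypeError, ValueError):
        errors.append("invoiceDate is not valid ISO 8601")
    # 3. Business rule: line totals must sum to the invoice total
    #    (net of tax, per our reading) within the $0.50 tolerance
    items = inv.get("lineItems") or []
    if items and not errors:
        expected = inv["totalAmount"] - inv.get("taxAmount", 0)
        line_sum = sum(i["lineTotal"] for i in items)
        if abs(line_sum - expected) > 0.50:
            errors.append(f"line items sum to {line_sum}, expected ~{expected}")
    return errors
```

The fourth stage (vendor ID exists in Blue Yonder) runs after this, and only invoices returning an empty error list go straight to API posting.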
Invoice Field Extraction Process:
The VLM processes each invoice page as an image:
- Convert PDF to images (300 DPI)
- Feed images to fine-tuned Donut model
- Model outputs JSON with extracted fields and confidence scores
- For multi-page invoices, merge extracted data from all pages
- Apply confidence thresholds (fields with <85% confidence flagged for review)
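The merge-and-threshold steps look roughly like this, assuming the model returns per-field (value, confidence) pairs; the 0.85 cutoff comes from the list above, everything else is illustrative:

```python
CONF_THRESHOLD = 0.85  # fields below this go to human review

def merge_pages(pages: list[dict]) -> tuple[dict, list[str]]:
    """Merge per-page extractions: keep the highest-confidence value per
    field, and flag any field that stays below the review threshold."""
    merged: dict[str, tuple[object, float]] = {}
    for page in pages:
        for field, (value, conf) in page.items():
            if field not in merged or conf > merged[field][1]:
                merged[field] = (value, conf)
    flagged = [f for f, (_, c) in merged.items() if c < CONF_THRESHOLD]
    return {f: v for f, (v, _) in merged.items()}, flagged

pages = [
    {"invoiceNumber": ("INV-7", 0.99), "totalAmount": (120.0, 0.70)},
    {"totalAmount": (125.0, 0.91), "dueDate": ("2024-05-01", 0.80)},
]
fields, needs_review = merge_pages(pages)
# fields["totalAmount"] == 125.0; "dueDate" lands in needs_review
```

Line items need concatenation across pages rather than max-confidence selection; this sketch covers only the scalar fields.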
Key challenge: Handling invoice variations. Our suppliers use 50+ different invoice templates. Fine-tuning on diverse examples taught the model to generalize across formats. For completely new formats (new suppliers), accuracy drops to 85-90% initially, then improves as we add examples to the training set.
API Integration with Blue Yonder:
Our reconciliation service handles the Blue Yonder integration:
- Vendor lookup: Match extracted vendor name/ID to Blue Yonder supplier records using fuzzy matching
- PO validation: If PO number present, verify it exists and is in ‘open’ status via Blue Yonder’s purchase order API
- Line item matching: Match invoice line items to PO line items by part number
- Discrepancy detection: Flag quantity, price, or item mismatches for three-way matching
- Invoice creation: For matching invoices, POST to Blue Yonder’s invoice API endpoint
The API integration uses Blue Yonder’s RESTful supply-planning APIs with OAuth 2.0 authentication. We batch invoice creation requests (up to 50 invoices per API call) to minimize API overhead.
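The batching boils down to chunking the pending queue before POSTing; a sketch with a placeholder endpoint (the URL and token handling here are stand-ins, not Blue Yonder's actual API surface):

```python
import json
import urllib.request

BATCH_SIZE = 50  # per-call invoice limit in our integration

def chunk(invoices: list[dict], size: int = BATCH_SIZE) -> list[list[dict]]:
    """Split the pending-invoice queue into batches of at most `size`."""
    return [invoices[i:i + size] for i in range(0, len(invoices), size)]

def post_batch(batch: list[dict], url: str, token: str) -> None:
    """POST one batch with an OAuth 2.0 bearer token. URL is a placeholder;
    retry/error handling omitted."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"invoices": batch}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST")
    urllib.request.urlopen(req)

batches = chunk([{"invoiceNumber": f"INV-{i}"} for i in range(120)])
# 120 pending invoices -> 3 batches: 50 + 50 + 20
```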
Edge cases we handle:
- Partial deliveries: One PO, multiple invoices - we track received quantities and validate against remaining open PO balance
- Multiple POs per invoice: Supplier combines multiple orders on one invoice - we split the invoice data and create separate records in Blue Yonder
- No PO match: For non-PO invoices (maintenance, services), we create invoice records with manual approval workflow
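The multiple-POs-per-invoice split amounts to grouping line items by their PO reference; a sketch that assumes each extracted line carries a poNumber field (field name illustrative; tax and subtotal proration omitted):

```python
from collections import defaultdict

def split_by_po(invoice: dict) -> list[dict]:
    """Split one invoice into per-PO records so each can be posted to
    Blue Yonder against its own purchase order."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for item in invoice["lineItems"]:
        groups[item["poNumber"]].append(item)
    records = []
    for po, items in groups.items():
        records.append({
            **{k: v for k, v in invoice.items() if k != "lineItems"},
            "purchaseOrderNumber": po,
            "lineItems": items,
            "totalAmount": round(sum(i["lineTotal"] for i in items), 2),
        })
    return records

inv = {"invoiceNumber": "INV-9", "vendorId": "V-1",
       "lineItems": [
           {"poNumber": "PO-1", "lineTotal": 100.0},
           {"poNumber": "PO-2", "lineTotal": 40.0},
           {"poNumber": "PO-1", "lineTotal": 60.0},
       ]}
records = split_by_po(inv)
# two records: PO-1 totaling 160.0 and PO-2 totaling 40.0
```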
Infrastructure and Performance:
Production architecture:
- Invoice ingestion service (Python/FastAPI)
- VLM inference server (1× GPU instance, NVIDIA T4)
- Validation and reconciliation service (Python)
- Blue Yonder API integration layer
- Human review queue (React web app)
- PostgreSQL database for audit trail
Processing performance:
- Average invoice processing time: 28 seconds end-to-end
  - VLM inference: 2-3 seconds
  - Validation and reconciliation: 5-8 seconds
  - Blue Yonder API call: 15-20 seconds
- Throughput: 120 invoices/hour (single GPU)
We process invoices in batches throughout the day as they arrive via email or supplier portals.
ROI and Business Impact:
Cost savings:
- Eliminated 4.5 FTEs of manual effort (15 people × 30% of their time) = $270,000/year in labor costs
- Reduced data entry errors from 2.3% to 0.4% = $85,000/year in error correction costs
- Faster invoice processing improved early payment discounts capture = $45,000/year
- Total annual savings: $400,000
Investment:
- Initial development (3 engineers × 4 months): $180,000
- Infrastructure (GPU server, cloud services): $24,000/year
- Ongoing maintenance (0.5 FTE): $60,000/year
- Total first-year cost: $264,000
Payback period: 8 months
Year 2+ ROI: 375% annually
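The payback and ROI figures fall straight out of the numbers above:

```python
# Worked from the cost/savings figures quoted in this post.
annual_savings = 270_000 + 85_000 + 45_000         # = 400,000/year
first_year_cost = 180_000 + 24_000 + 60_000        # = 264,000

payback_months = first_year_cost / (annual_savings / 12)   # ~7.9 -> ~8 months

ongoing_cost = 24_000 + 60_000                     # infra + 0.5 FTE = 84,000/year
roi = (annual_savings - ongoing_cost) / ongoing_cost       # ~3.76 -> ~375%/year
```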
Quality improvements:
- Invoice processing time: 3-4 minutes → 28 seconds (86% reduction)
- Data entry accuracy: 97.7% → 99.6%
- Invoices processed within 24 hours: 65% → 98%
- Procurement team satisfaction: Eliminated tedious data entry, allowing focus on supplier relationship management and exception handling
Lessons Learned:
- Fine-tuning is essential for production accuracy - generic VLMs aren’t sufficient for domain-specific documents
- Start with high-volume, standard formats to maximize ROI, then expand to edge cases
- Human-in-the-loop for edge cases (8%) maintains quality while automating the majority (92%)
- Feedback loop for continuous improvement - quarterly retraining with corrected examples improves accuracy
- API integration complexity often exceeds ML complexity - plan accordingly
This implementation has been transformative for our procurement operations, freeing our team to focus on strategic supplier relationships rather than manual data entry.