Genealogy tracking: batch data lineage approach versus real-time event streaming

We’re designing a genealogy tracking system for our pharmaceutical manufacturing line using FactoryTalk MES 10.0. The regulatory team needs complete traceability from raw materials through finished goods, with the ability to perform forward/backward trace within 4 hours of a quality incident.

I’m evaluating two architectural approaches and would love to hear experiences from others who’ve implemented genealogy tracking at scale:

Approach A: Batch ETL
Run scheduled ETL jobs every 15 minutes to extract genealogy events from shop floor systems, transform them into lineage relationships, and load them into a central traceability database. Simpler to implement and maintain.

Approach B: Event Streaming
Capture every material movement, assembly operation, and component consumption as a real-time event via Kafka or a similar message queue. Build the lineage graph incrementally as events arrive. More complex but potentially lower latency.

Our volumes: 200 production orders/day, 50,000 component transactions/day, 12 production lines. What design decisions did you make and why?

For pharmaceutical compliance, I’d lean toward event streaming despite the complexity. The 4-hour trace requirement seems generous now, but when you have a contamination event, you want answers in minutes, not hours. We implemented Kafka-based streaming and can now trace any batch in under 5 minutes. The real-time visibility also helps with in-process quality decisions - if a component fails inspection, we can immediately identify all WIP that consumed it.
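That last step - identifying all WIP that consumed a failed component - boils down to a reachability query over the consumption graph. A minimal sketch in Python (the lot IDs and the (consumed, produced) event format are made up for illustration):

```python
from collections import defaultdict, deque

def forward_trace(consumptions, failed_lot):
    """Return every downstream lot that directly or transitively
    consumed `failed_lot`. `consumptions` is a list of
    (consumed_lot, produced_lot) pairs from genealogy events."""
    children = defaultdict(list)
    for consumed, produced in consumptions:
        children[consumed].append(produced)
    affected, queue = set(), deque([failed_lot])
    while queue:                      # breadth-first walk downstream
        lot = queue.popleft()
        for nxt in children[lot]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected

# Hypothetical events: RM-001 went into two WIP lots; WIP-10 went into FG-77.
events = [("RM-001", "WIP-10"), ("RM-001", "WIP-11"), ("WIP-10", "FG-77")]
```

In a streaming design this same walk runs against the incrementally built graph, which is what makes sub-5-minute traces plausible.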

The choice really depends on your message queue infrastructure maturity. If you already have Kafka deployed for other systems, event streaming is the way to go. But if you’d be standing up messaging infrastructure just for genealogy, that’s significant operational overhead. Consider also that batch ETL gives you natural checkpoint recovery - if a job fails, you just re-run it. With streaming, you need to handle exactly-once delivery semantics, consumer group rebalancing, offset management, etc. That said, streaming enables real-time dashboards and proactive quality monitoring that batch can’t match.

We went with batch ETL for similar volumes and it works fine for our compliance needs. The 15-minute latency is acceptable because our quality holds are typically 2-4 hours anyway before materials get released. The operational simplicity is worth it - our small IT team can manage the ETL jobs without specialized streaming expertise. We did optimize the ETL queries to run in under 3 minutes so we have a reliable 15-minute cycle.

Don’t forget about compliance reporting requirements. For FDA audits, we need to produce genealogy reports going back 3 years. Batch ETL naturally creates a stable, queryable historical record. With streaming, you need to think about long-term storage and how you’ll query the event log for historical traces. We use batch ETL specifically because it creates clean dimensional models that our quality analysts can query directly without needing streaming platform expertise.

After implementing both approaches across different sites, here’s my analysis of the key decision factors:

Batch ETL Scheduling
Best fit when:

  • Traceability latency tolerance is >30 minutes (your 4-hour requirement fits comfortably)
  • IT team has strong SQL/database skills but limited streaming platform experience
  • Volumes are predictable and ETL jobs can complete within schedule window
  • You need simple disaster recovery (just re-run failed jobs)
  • Historical reporting is primary use case

Implementation considerations:

  • Schedule ETL at 15-minute intervals during production hours, hourly overnight
  • Design incremental loads using change data capture (CDC) to minimize processing time
  • Build ETL resilience: retry logic, error handling, data quality checks
  • Create materialized views for common trace queries to meet 4-hour SLA
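The CDC bullet above can be approximated with a high-watermark pattern when true change tables aren’t available. A minimal sketch using in-memory SQLite databases (the table and column names are hypothetical stand-ins for your shop floor source and traceability warehouse):

```python
import sqlite3

def incremental_load(src, dst):
    """Copy only genealogy events newer than the last loaded
    event_id - a high-watermark stand-in for real CDC."""
    last = dst.execute(
        "SELECT COALESCE(MAX(event_id), 0) FROM lineage").fetchone()[0]
    rows = src.execute(
        "SELECT event_id, consumed_lot, produced_lot FROM genealogy_events "
        "WHERE event_id > ?", (last,)).fetchall()
    dst.executemany("INSERT INTO lineage VALUES (?, ?, ?)", rows)
    dst.commit()
    return len(rows)

# Demo setup with two pending events in the source system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE genealogy_events "
            "(event_id INT, consumed_lot TEXT, produced_lot TEXT)")
src.executemany("INSERT INTO genealogy_events VALUES (?, ?, ?)",
                [(1, "RM-001", "WIP-10"), (2, "WIP-10", "FG-77")])
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE lineage "
            "(event_id INT, consumed_lot TEXT, produced_lot TEXT)")
```

Re-running the job after a failure is safe because the watermark makes each load pick up exactly where the last successful one stopped.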

Event Streaming Architecture
Best fit when:

  • You need real-time visibility into production lineage (sub-minute latency)
  • Proactive quality monitoring is important (detect issues before batches complete)
  • Site already has message queue infrastructure and expertise
  • You want to enable real-time dashboards and alerts
  • Volume spikes are unpredictable and would break batch ETL windows

Implementation considerations:

  • Use Kafka or RabbitMQ with persistent message storage
  • Design the event schema carefully up front (changes are hard once producers and consumers are deployed)
  • Implement consumer idempotency to handle duplicate events
  • Set up monitoring for consumer lag and message queue depth
  • Plan for long-term event storage (S3 + Parquet for cost-effective historical queries)
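The idempotency point can be as simple as deduplicating on a stable event ID before applying the event to the lineage graph. A sketch (the event fields are assumptions, and the seen-set would live in a persistent store in production, not in memory):

```python
class IdempotentConsumer:
    """Deduplicate genealogy events by event_id so at-least-once
    delivery does not double-count component consumption."""

    def __init__(self):
        self.seen = set()      # in production: a persistent dedup store
        self.lineage = []      # (consumed_lot, produced_lot) pairs

    def handle(self, event):
        eid = event["event_id"]
        if eid in self.seen:   # duplicate delivery: skip without side effects
            return False
        self.seen.add(eid)
        self.lineage.append((event["consumed_lot"], event["produced_lot"]))
        return True
```

With this in place, replaying a topic after a consumer crash is harmless, which takes most of the sting out of the offset-management concerns mentioned earlier.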

Message Queue Configuration
For your 50,000 transactions/day:

  • Kafka topic partitions: 6-8 (allows parallel consumers)
  • Retention: 7 days minimum (supports replay for error recovery)
  • Replication factor: 3 (ensures no data loss)
  • Consumer group: 3-4 instances (handles peak load with redundancy)
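Under those sizing assumptions, topic provisioning might look like this with the kafka-python admin client. Treat it as a configuration sketch: the broker address and topic name are placeholders, and it needs a running broker to execute.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Broker address and topic name are assumptions for illustration.
admin = KafkaAdminClient(bootstrap_servers="broker:9092")
admin.create_topics([NewTopic(
    name="genealogy-events",
    num_partitions=8,                 # parallel consumers per the sizing above
    replication_factor=3,             # survives a single-broker failure
    topic_configs={
        "retention.ms": str(7 * 24 * 3600 * 1000),  # 7-day replay window
    },
)])
```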

Compliance Reporting Requirements
Regardless of approach:

  • Maintain immutable audit trail of all lineage data changes
  • Implement data retention policies (typically 3-7 years for pharma)
  • Design report templates for common regulatory queries (forward trace, backward trace, lot genealogy)
  • Validate trace accuracy with known-good test scenarios quarterly
  • Document your lineage calculation methodology for audit purposes
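Quarterly validation against known-good scenarios is much easier when the trace itself is a small, testable function. A backward-trace (lot genealogy) sketch over (consumed, produced) pairs, with invented lot IDs:

```python
from collections import defaultdict

def backward_trace(consumptions, finished_lot):
    """Return every upstream lot that went into `finished_lot`,
    directly or transitively - the lot genealogy report."""
    parents = defaultdict(list)
    for consumed, produced in consumptions:
        parents[produced].append(consumed)
    ancestry, stack = set(), [finished_lot]
    while stack:                      # depth-first walk upstream
        lot = stack.pop()
        for src in parents[lot]:
            if src not in ancestry:
                ancestry.add(src)
                stack.append(src)
    return ancestry

# Known-good scenario: FG-77 was built from WIP-10, which consumed two raw lots.
events = [("RM-001", "WIP-10"), ("RM-002", "WIP-10"), ("WIP-10", "FG-77")]
```

A handful of scenarios like this, run each quarter against the production trace logic, gives you documented evidence of trace accuracy for auditors.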

Traceability Latency Tolerance
Your 4-hour requirement is actually quite relaxed:

  • Batch ETL at 15-minute intervals gives a worst-case data latency of roughly 15 minutes plus job runtime (well within the requirement)
  • This suggests batch approach is sufficient unless you have other real-time needs
  • Consider if quality holds, in-process decisions, or production optimization would benefit from real-time lineage

My Recommendation for Your Scenario
Start with batch ETL because:

  1. Your latency requirement (4 hours) is easily met with 15-min ETL cycles
  2. Operational simplicity reduces risk for initial deployment
  3. Your volumes (50K transactions/day) are manageable in batch windows
  4. You avoid the complexity and cost of standing up new streaming infrastructure
  5. Compliance reporting is naturally supported with dimensional warehouse model

Design the ETL with these future-proofing elements:

  • Use CDC patterns so you could later stream those same change events
  • Structure your lineage data model to support both batch loads and event appends
  • Build abstraction layer so consuming applications don’t depend on ETL implementation
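The abstraction-layer bullet might look like this: consuming applications code against a small interface, and the batch (or, later, streaming) implementation sits behind it. Names are illustrative:

```python
from abc import ABC, abstractmethod

class LineageStore(ABC):
    """Consuming apps depend on this interface, not on whether
    lineage arrived via batch ETL or event streaming."""

    @abstractmethod
    def consumers_of(self, lot):
        """Return the set of lots that directly consumed `lot`."""

class WarehouseLineageStore(LineageStore):
    """Batch-ETL-backed implementation; an in-memory list stands in
    for a warehouse query here."""

    def __init__(self, rows):
        self.rows = rows  # (consumed_lot, produced_lot) pairs

    def consumers_of(self, lot):
        return {produced for consumed, produced in self.rows
                if consumed == lot}
```

Swapping in a streaming-backed implementation later then requires no changes in the trace applications themselves.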

This gives you a working system quickly while keeping the door open to add event streaming later if business needs evolve (e.g., real-time quality monitoring, predictive analytics).

Good points on the operational complexity trade-offs. Our site doesn’t currently have Kafka, so we’d be building that capability from scratch. What about hybrid approaches? Could we do batch ETL for the core lineage data but stream critical quality events? Or does that create consistency problems between the two data paths?

Hybrid can work but you need careful design to avoid data inconsistencies. We run batch ETL for historical lineage (older than 24 hours) and event streaming for recent activity. The key is having a clear boundary and ensuring your queries know which data source to use based on timestamp. For example, real-time trace queries hit the streaming data store for the last 24 hours, then fall back to the batch-loaded warehouse for older history. This gives you fast recent traces without the cost of streaming everything forever.
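The timestamp-based routing described above can be captured in one small function; the 24-hour boundary matches what we use, and the store names are placeholders:

```python
from datetime import datetime, timedelta, timezone

BOUNDARY = timedelta(hours=24)  # streaming store covers the last 24 h

def pick_source(event_time, now=None):
    """Route a trace query: streaming store for recent activity,
    batch-loaded warehouse for older history."""
    now = now or datetime.now(timezone.utc)
    return "streaming" if now - event_time <= BOUNDARY else "warehouse"
```

Keeping this boundary logic in one place (rather than scattered across queries) is what prevents the two data paths from silently disagreeing.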