Genealogy tracking module fails to maintain complete audit trail in cloud deployment

We’re running AVEVA MES 2023.1 in Azure cloud and experiencing critical issues with genealogy tracking audit trails. Our pharmaceutical production requires complete traceability, but we’re seeing gaps in the genealogy records during distributed transactions across microservices.

The event sourcing pattern we implemented doesn’t handle distributed transactions reliably: when material movements span multiple services, some genealogy events aren’t captured. We’ve also noticed that the graph database queries for genealogy validation time out under heavy load.

Current error we’re seeing:

GenealogyEventStore: Transaction rollback detected
Event sequence gap: Events 1247-1251 missing
Audit trail signing failed: Incomplete chain

This is blocking our FDA compliance validation. Has anyone successfully implemented robust genealogy tracking with event sourcing in a cloud-native architecture? Particularly interested in how to ensure atomic audit trail signing across distributed services.

Thanks for the responses. We’re using adjacency lists in Neo4j for the graph database. The Saga pattern sounds promising - could you elaborate on how you handle rollback scenarios? If a genealogy event fails midway through a saga, how do you ensure the audit trail remains consistent? Our current approach tries to maintain strong consistency, but maybe that’s the wrong model for cloud deployment.

The audit trail signing failure is your biggest risk for compliance. In our GMP environment, we implemented a two-phase commit for genealogy events: first write to the event store, then sign the audit trail only after confirming all related events are persisted. The graph database timeout issue might be related to how you’re modeling the genealogy relationships - are you using adjacency lists or nested sets?
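A rough Python sketch of that write-then-sign ordering — the `EventStore` class and the hash-based `sign_audit_trail` are stand-ins I made up for illustration, not AVEVA or Azure APIs; real signing would use your HSM-backed keys:

```python
import hashlib
import json

class EventStore:
    """Hypothetical in-memory stand-in for the real event store."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def all_persisted(self, event_ids):
        stored = {e["id"] for e in self.events}
        return all(eid in stored for eid in event_ids)

def sign_audit_trail(events):
    # Illustrative "signature": hash of the canonical event sequence.
    payload = json.dumps(events, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def commit_genealogy(store, events):
    # Phase 1: persist every related event first.
    for ev in events:
        store.append(ev)
    # Phase 2: sign only after confirming the full set is persisted.
    if not store.all_persisted([ev["id"] for ev in events]):
        raise RuntimeError("incomplete event set; refusing to sign")
    return sign_audit_trail(store.events)
```

The key point is the ordering: the signature is never produced while any related event is still in flight.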

Building on the previous suggestions, let me share our complete solution, which passed an FDA audit:

Event Sourcing with Guaranteed Ordering: We use Azure Event Hubs with partition keys derived from material batch IDs. This ensures all events for a genealogy chain hit the same partition, maintaining strict ordering.
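To illustrate, here’s a minimal simulation of that routing in Python. The `partition_key` helper and the event shape are assumptions for the sketch; in the real system the key would be handed to the Event Hubs producer client, which does the routing:

```python
from collections import defaultdict

def partition_key(event: dict) -> str:
    # Derive the key from the material batch ID so every event in
    # one genealogy chain lands on the same partition, preserving
    # its relative order.
    return event["material_batch_id"]

def route(events):
    # Simulates per-partition ordering: events are grouped by key
    # and keep their original sequence within each group.
    partitions = defaultdict(list)
    for ev in events:
        partitions[partition_key(ev)].append(ev)
    return partitions
```

Ordering is only guaranteed *within* a partition, which is exactly why the key must cover the whole genealogy chain, not, say, the production line.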

Distributed Transaction Handling: Instead of two-phase commit, we implemented the Saga pattern with compensating transactions. Each genealogy event is written as an immutable append-only record. If a downstream service fails, we write a compensating event rather than rolling back:

GenealogyEvent.Type = MATERIAL_CONSUMED
GenealogyEvent.Type = MATERIAL_CONSUMED_COMPENSATED
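A minimal Python sketch of that saga flow — the step names are from the example above, and a plain list stands in for the append-only event log:

```python
def run_saga(log, steps):
    """Run (event_name, action) steps. Completed steps are never
    rolled back; on failure, compensating events are appended to
    the immutable log instead."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            # Compensate completed steps in reverse order.
            for prior in reversed(completed):
                log.append(prior + "_COMPENSATED")
            return False
        log.append(name)
        completed.append(name)
    return True
```

Because nothing is ever deleted, the log records both the original movement and its compensation, which is what keeps the audit trail complete.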

Graph Database Optimization: For Neo4j timeouts, we switched from real-time graph queries to a materialized view pattern. The event stream updates a pre-computed genealogy graph asynchronously. Critical queries hit this optimized view, while the full event log remains the source of truth.
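A toy version of that projector in Python — the event fields (`input_batch`, `output_batch`) are assumptions for the sketch, and a dict stands in for the graph store:

```python
def apply_event(view: dict, event: dict) -> None:
    # Incrementally fold one event into the derived genealogy map
    # (output batch -> list of input batches it consumed).
    if event["type"] == "MATERIAL_CONSUMED":
        view.setdefault(event["output_batch"], []).append(event["input_batch"])

def rebuild_view(event_log):
    # The view is disposable: because the event log is the source
    # of truth, it can always be rebuilt from scratch by replay.
    view = {}
    for ev in event_log:
        apply_event(view, ev)
    return view
```

In production the same `apply_event` logic would run asynchronously off the stream, so queries never pay the cost of walking the raw log.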

Audit Trail Signing: This was our biggest challenge. We implemented a hierarchical signing approach:

  1. Individual events are signed immediately upon persistence to Event Hubs
  2. Every 5 minutes, a batch signature covers all events in that window
  3. Daily, a root signature covers all batch signatures

This creates a Merkle tree structure that’s verifiable but doesn’t require waiting for distributed transactions to complete.
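The three-level scheme can be sketched like this in Python; plain SHA-256 hashes stand in for the real Key Vault signatures:

```python
import hashlib

def _h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def sign_event(event: str) -> str:
    # Level 1: per-event signature, issued immediately on persistence.
    return _h(event)

def batch_signature(event_sigs) -> str:
    # Level 2: covers all event signatures in one time window.
    return _h("".join(event_sigs))

def root_signature(batch_sigs) -> str:
    # Level 3: daily root over all batch signatures.
    return _h("".join(batch_sigs))
```

Any change to a single event changes its signature, the batch signature above it, and the root, so tampering anywhere in the window is detectable from the root alone.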

Genealogy Validation: We run continuous validation jobs that replay the event stream and verify:

  • No sequence gaps (using event hub sequence numbers)
  • All material movements balance (input = output + waste)
  • Signature chains are unbroken
  • Graph database state matches event log
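The first two checks are simple to sketch in Python:

```python
def find_gaps(seq_numbers):
    """Return (first_missing, last_missing) ranges in a set of
    event hub sequence numbers."""
    ordered = sorted(seq_numbers)
    return [(a + 1, b - 1) for a, b in zip(ordered, ordered[1:]) if b - a > 1]

def is_balanced(inputs, outputs, waste, tol=1e-9):
    # Material balance: input = output + waste, within tolerance.
    return abs(sum(inputs) - (sum(outputs) + sum(waste))) <= tol
```

Run against the sequence numbers from the original error, `find_gaps` would report exactly the 1247-1251 range the signing step complained about.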

For your specific error, the “Transaction rollback detected” suggests you’re trying to use database transactions across services. Replace this with:

  • Write events to Event Hubs (durable, ordered)
  • Use change feed processors to update downstream systems
  • Implement idempotency so replays don’t create duplicates
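A minimal idempotent consumer sketch in Python — the event `id` field is an assumption; with Event Hubs you could equally key on partition plus sequence number:

```python
class IdempotentProjector:
    """Applies each event at most once, so a change-feed replay
    never creates duplicate genealogy records downstream."""
    def __init__(self):
        self._seen = set()
        self.applied = []

    def apply(self, event: dict) -> bool:
        if event["id"] in self._seen:
            return False  # duplicate from a replay; skip silently
        self._seen.add(event["id"])
        self.applied.append(event)
        return True
```

In a real deployment the seen-keys set would live in durable storage alongside the projection, so the dedup survives restarts.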

Key Implementation Details:

  • Event Hubs retention: 7 days minimum for replay capability
  • Cosmos DB for materialized genealogy views (strong consistency)
  • Azure Functions for change feed processing (automatic retry/scaling)
  • Key Vault for signing keys with HSM backing (compliance requirement)

This architecture handles our 50,000+ genealogy events per day across 12 production lines with zero audit trail gaps. The eventual consistency model works because the event log is the single source of truth - all other views are derived and can be rebuilt. For FDA compliance, we provide the complete event log with cryptographic proof of integrity, which satisfies the audit requirements even though downstream systems update asynchronously.

The graph database performance improved 10x after moving to materialized views - complex genealogy queries now return in under 100ms versus 10+ seconds before.

For FDA compliance, you can’t afford eventual consistency in genealogy records. However, you can use compensating transactions within a saga to maintain audit integrity. Each genealogy event should be immutable once written - if you need to ‘undo’ something, write a compensating event rather than deleting the original. This preserves the complete audit trail. Also, consider using Kafka for event streaming instead of direct service-to-service calls - it gives you better durability guarantees and natural event ordering per partition.

The event sequence gaps you’re seeing (1247-1251 missing) suggest a problem with your event store’s consistency model. Are you using Azure Cosmos DB or a different backing store? For genealogy tracking, I’d recommend:

  • Use Cosmos DB with strong consistency for the event store
  • Implement idempotency keys to handle retries
  • Use change feed for downstream consumers rather than synchronous calls

This prevents the distributed transaction problem while maintaining auditability.