Building on the previous suggestions, let me share our complete solution that passed an FDA audit:
Event Sourcing with Guaranteed Ordering: We use Azure Event Hubs with partition keys derived from material batch IDs. Event Hubs only guarantees ordering within a partition, so this ensures all events for a genealogy chain hit the same partition, maintaining strict ordering.
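To illustrate why a stable partition key preserves per-batch ordering, here is a toy model in Python. It is only a sketch: Event Hubs applies its own hash on the service side, so `partition_for`, `route`, and `PARTITION_COUNT` are made-up stand-ins, not the real algorithm or SDK calls.

```python
import hashlib
from collections import defaultdict

PARTITION_COUNT = 8  # assumed value for the demo


def partition_for(partition_key: str, partitions: int = PARTITION_COUNT) -> int:
    """Toy stand-in for the service-side hash: the same key always maps to the same partition."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def route(events):
    """Group (batch_id, payload) pairs by partition, preserving arrival order within each."""
    partitions = defaultdict(list)
    for batch_id, payload in events:
        partitions[partition_for(batch_id)].append((batch_id, payload))
    return partitions


events = [
    ("BATCH-001", "received"),
    ("BATCH-002", "received"),
    ("BATCH-001", "mixed"),
    ("BATCH-001", "packed"),
]
routed = route(events)

# All BATCH-001 events sit in one partition, in the order they were sent.
batch_001 = [p for b, p in routed[partition_for("BATCH-001")] if b == "BATCH-001"]
```

Events for different batches may interleave within a partition, but the relative order for any single key is kept.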
Distributed Transaction Handling: Instead of two-phase commit, we implemented the Saga pattern with compensating transactions. Each genealogy event is written as an immutable append-only record. If a downstream service fails, we write a compensating event rather than rolling back:
GenealogyEvent.Type = MATERIAL_CONSUMED
GenealogyEvent.Type = MATERIAL_CONSUMED_COMPENSATED
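A minimal sketch of that compensation idea, assuming an in-memory log. The `GenealogyEvent` fields, `EventLog`, and the sample values are hypothetical, chosen only to show that the original record is never modified or deleted:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: an event cannot be mutated after creation
class GenealogyEvent:
    event_id: str
    type: str
    batch_id: str
    quantity_kg: float
    compensates: Optional[str] = None  # id of the event being undone, if any


class EventLog:
    """Append-only log: there is deliberately no update or delete method."""

    def __init__(self) -> None:
        self._events: List[GenealogyEvent] = []

    def append(self, event: GenealogyEvent) -> None:
        self._events.append(event)

    def compensate(self, original: GenealogyEvent, new_id: str) -> GenealogyEvent:
        """Record a compensating event instead of rolling back the original."""
        comp = GenealogyEvent(
            new_id,
            "MATERIAL_CONSUMED_COMPENSATED",
            original.batch_id,
            original.quantity_kg,
            compensates=original.event_id,
        )
        self.append(comp)
        return comp

    def net_consumed(self, batch_id: str) -> float:
        """Consumption minus compensations for one batch."""
        total = 0.0
        for e in self._events:
            if e.batch_id != batch_id:
                continue
            if e.type == "MATERIAL_CONSUMED":
                total += e.quantity_kg
            elif e.type == "MATERIAL_CONSUMED_COMPENSATED":
                total -= e.quantity_kg
        return total

    def __len__(self) -> int:
        return len(self._events)


log = EventLog()
consumed = GenealogyEvent("evt-1", "MATERIAL_CONSUMED", "BATCH-001", 25.0)
log.append(consumed)
log.compensate(consumed, "evt-2")  # downstream step failed, so undo by appending
```

After the compensation the log holds both records, and the net consumption for the batch is back to zero, so the audit trail shows what happened and what was undone.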
Graph Database Optimization: For Neo4j timeouts, we switched from real-time graph queries to a materialized view pattern. The event stream updates a pre-computed genealogy graph asynchronously. Critical queries hit this optimized view, while the full event log remains the source of truth.
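Roughly, the materialized view is a projection that folds each event into a query-friendly structure. The sketch below uses a plain in-memory adjacency map rather than Neo4j, and the event field names (`input_batch`, `output_batch`) are assumptions for illustration:

```python
from collections import defaultdict


class GenealogyView:
    """Derived view: output batch -> set of direct input batches."""

    def __init__(self) -> None:
        self.parents = defaultdict(set)

    def apply(self, event: dict) -> None:
        """Fold one event into the view."""
        if event["type"] == "MATERIAL_CONSUMED":
            self.parents[event["output_batch"]].add(event["input_batch"])
        elif event["type"] == "MATERIAL_CONSUMED_COMPENSATED":
            self.parents[event["output_batch"]].discard(event["input_batch"])

    def ancestors(self, batch: str) -> set:
        """All upstream batches, found by walking the pre-built parent map."""
        seen = set()
        stack = [batch]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def rebuild(events) -> GenealogyView:
    """The view is disposable: replaying the log from the start recreates it."""
    view = GenealogyView()
    for event in events:
        view.apply(event)
    return view


event_log = [
    {"type": "MATERIAL_CONSUMED", "input_batch": "RAW-1", "output_batch": "MIX-1"},
    {"type": "MATERIAL_CONSUMED", "input_batch": "RAW-2", "output_batch": "MIX-1"},
    {"type": "MATERIAL_CONSUMED", "input_batch": "MIX-1", "output_batch": "FIN-1"},
]
view = rebuild(event_log)
upstream = view.ancestors("FIN-1")
```

Because `rebuild` only needs the event log, a corrupted or stale view can be thrown away and regenerated.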
Audit Trail Signing: This was our biggest challenge. We implemented a hierarchical signing approach:
- Individual events are signed immediately upon persistence to Event Hubs
- Every 5 minutes, a batch signature covers all events in that window
- Daily, a root signature covers all batch signatures
This creates a Merkle tree structure that’s verifiable but doesn’t require waiting for distributed transactions to complete.
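The three signing levels can be sketched as follows. This is an illustration only: it uses HMAC-SHA256 with a hard-coded demo key as a stand-in for signatures made with HSM-backed keys, and the function names and event fields are invented for the example.

```python
import copy
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # assumption: real keys would live in an HSM


def sign(data: bytes, key: bytes = SIGNING_KEY) -> str:
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def sign_event(event: dict) -> str:
    """Level 1: sign a canonical JSON encoding of a single event."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return sign(b"event:" + canonical.encode("utf-8"))


def sign_batch(event_signatures) -> str:
    """Level 2: one signature over every event signature in a time window."""
    return sign(b"batch:" + "".join(event_signatures).encode("ascii"))


def sign_root(batch_signatures) -> str:
    """Level 3: one signature over all batch signatures for the day."""
    return sign(b"root:" + "".join(batch_signatures).encode("ascii"))


def compute_root(windows) -> str:
    """Recompute the whole hierarchy from raw events grouped by window."""
    batch_sigs = [sign_batch([sign_event(e) for e in window]) for window in windows]
    return sign_root(batch_sigs)


def verify_root(windows, expected_root: str) -> bool:
    return hmac.compare_digest(compute_root(windows), expected_root)


windows = [
    [{"event_id": "evt-1", "quantity_kg": 25.0}, {"event_id": "evt-2", "quantity_kg": 10.0}],
    [{"event_id": "evt-3", "quantity_kg": 5.0}],
]
root = compute_root(windows)

# Changing any single event invalidates the daily root signature.
tampered = copy.deepcopy(windows)
tampered[0][0]["quantity_kg"] = 999.0
```

Verifying the untouched `windows` against `root` succeeds, while `tampered` fails. The `event:`, `batch:`, and `root:` prefixes keep a signature from one level from being replayed at another.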
Genealogy Validation: We run continuous validation jobs that replay the event stream and verify:
- No sequence gaps (using Event Hubs per-partition sequence numbers)
- All material movements balance (input = output + waste)
- Signature chains are unbroken
- Graph database state matches event log
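Two of these checks are easy to sketch. The helpers below are hypothetical, assume that sequence numbers within one partition are expected to be contiguous, and use an arbitrary tolerance for the mass balance:

```python
def find_sequence_gaps(sequence_numbers):
    """Return (first_missing, last_missing) ranges for one partition's sequence numbers."""
    gaps = []
    ordered = sorted(sequence_numbers)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


def balances(input_kg: float, output_kg: float, waste_kg: float,
             tolerance_kg: float = 0.001) -> bool:
    """Check input = output + waste within a small tolerance for float rounding."""
    return abs(input_kg - (output_kg + waste_kg)) <= tolerance_kg


gaps = find_sequence_gaps([0, 1, 2, 5, 6, 9])
ok = balances(100.0, 97.5, 2.5)
bad = balances(100.0, 90.0, 2.5)
```

Here `gaps` reports the missing ranges 3 to 4 and 7 to 8, `ok` is true, and `bad` is false because 7.5 kg is unaccounted for.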
For your specific error, the “Transaction rollback detected” message suggests you’re trying to use database transactions across services. Replace this with:
- Write events to Event Hubs (durable, ordered)
- Use change feed processors to update downstream systems
- Implement idempotency so replays don’t create duplicates
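The idempotency step is the one people most often skip, so here is a minimal sketch of it. `IdempotentProjector` and the event fields are illustrative names, and the in-memory set is an assumption: a real consumer would persist the processed IDs together with the view update.

```python
class IdempotentProjector:
    """Applies each event at most once, keyed by its event_id."""

    def __init__(self) -> None:
        self.processed_ids = set()
        self.consumed_kg = {}

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.processed_ids:
            return False
        batch = event["batch_id"]
        self.consumed_kg[batch] = self.consumed_kg.get(batch, 0.0) + event["quantity_kg"]
        self.processed_ids.add(event["event_id"])
        return True


stream = [
    {"event_id": "evt-1", "batch_id": "BATCH-001", "quantity_kg": 25.0},
    {"event_id": "evt-2", "batch_id": "BATCH-001", "quantity_kg": 10.0},
]

projector = IdempotentProjector()
first_pass = [projector.handle(e) for e in stream]
replay = [projector.handle(e) for e in stream]  # simulated retry or replay
```

The replay is a no-op, so the total for `BATCH-001` stays at 35.0 however many times the stream is redelivered.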
Key Implementation Details:
- Event Hubs retention: 7 days minimum for replay capability
- Cosmos DB for materialized genealogy views (strong consistency)
- Azure Functions for change feed processing (automatic retry/scaling)
- Key Vault for signing keys with HSM backing (compliance requirement)
This architecture handles our 50,000+ genealogy events per day across 12 production lines with zero audit trail gaps. The eventual consistency model works because the event log is the single source of truth: all other views are derived and can be rebuilt. For FDA compliance, we provide the complete event log with cryptographic proof of integrity, which satisfies the audit requirements even though downstream systems update asynchronously.
The graph database performance improved 10x after moving to materialized views - complex genealogy queries now return in under 100ms versus 10+ seconds before.