We’re piloting a handful of AI use cases in our ALM environment, mostly around automated change impact assessments and requirement classification, and ran straight into a regulatory puzzle. Our legal team is pushing back hard on data retention. GDPR’s storage-limitation principle says we must delete personal data once the processing purpose ends, but the EU AI Act requires providers of high-risk systems to keep technical documentation, including information about the training data, for 10 years after the system is placed on the market. Some of our training datasets include names and email addresses pulled from change requests and approval workflows.
We tried just deleting everything after training, but then our auditors said we’d have no way to defend the model’s behavior if a regulator or customer ever asked. We tried keeping everything for 10 years, and our DPO said that’s a clear GDPR violation and we’d be exposed if anyone exercised their right to erasure.
Has anyone worked through this? How are you structuring data governance and audit trails so you can satisfy both the deletion requirement and the retention requirement without ending up in a compliance hole?
We hit the exact same wall last year. The solution that worked for us is architectural separation. You keep raw personal data only as long as absolutely necessary, typically just for the duration of training. Once training is done, you apply irreversible anonymization before deleting the originals. The anonymized datasets can be retained for the full 10-year period because, once nobody can re-identify individuals from them, they’re no longer personal data under GDPR. Your audit trail and technical documentation should reference the anonymized data and aggregated statistics, not the raw personal records. We use one-way tokenization (keyed hashing, with the key destroyed alongside the raw data) plus differential privacy on aggregate statistics, so the anonymization genuinely can’t be reversed. Our external auditors accepted this approach after we walked them through the architecture and showed we could still reconstruct model behavior without exposing individual identities.
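To make the tokenization step concrete, here’s a minimal Python sketch. Everything in it is illustrative: the field names (`requester_email`, `approver_email`, `category`), the record shape, and the hard-coded key are hypothetical, and in practice the key would live in a secrets manager and be destroyed together with the raw data so the tokens become one-way.

```python
import hashlib
import hmac

# Hypothetical signing key. In production this lives in an HSM/secret manager
# and is destroyed along with the raw data, making tokens non-reversible.
SECRET_KEY = b"rotate-and-destroy-with-raw-data"

def tokenize(value: str) -> str:
    """Derive a stable, keyed token from a personal identifier.

    Same input -> same token, so records can still be joined for model
    validation, but without the key there is no way back to the identity.
    """
    return hmac.new(SECRET_KEY, value.lower().encode(), hashlib.sha256).hexdigest()[:16]

def anonymize_change_request(record: dict) -> dict:
    """Strip direct identifiers from a change-request row, keeping audit-relevant fields."""
    return {
        "request_id": record["request_id"],
        "requester_token": tokenize(record["requester_email"]),
        "approver_token": tokenize(record["approver_email"]),
        "category": record["category"],
        "approved": record["approved"],
    }

raw = {
    "request_id": "CR-1042",
    "requester_email": "jane.doe@example.com",
    "approver_email": "bob@example.com",
    "category": "config-change",
    "approved": True,
}
archived = anonymize_change_request(raw)
assert "requester_email" not in archived  # no direct identifier survives
```

Note that the stable tokens are what let you later answer “did the same person approve their own change?” during bias or process audits, without keeping the names around.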
Just to add—your documentation and metadata layer is critical here. We maintain separate stores: one for personal data with strict deletion schedules, one for anonymized archives, and one for technical records and audit logs. The audit logs capture who approved what, what data quality checks passed, and what model version was used, but they don’t contain personally identifiable information. When auditors come in, they get full traceability without us handing over names and email addresses.
From an audit perspective, the key is being able to show that your anonymization was done correctly and that you documented the lineage. We ask for evidence that the anonymized dataset was derived from the original using a validated method, that the original was deleted on schedule, and that the anonymized version can still support model validation and bias testing. If you can demonstrate that, most auditors will accept it.
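The lineage evidence described above can be captured in a machine-checkable manifest. This is a minimal Python sketch under assumed conventions: the manifest keys, the validation-report ID, and the check logic are all hypothetical, but they mirror the three things auditors ask for, namely a validated method, a recorded deletion date for the raw source, and a checksum tying the documentation to the archive.

```python
import hashlib

# Hypothetical lineage manifest kept with the technical documentation.
manifest = {
    "source_dataset": "change_requests_2024Q1",
    "anonymization_method": "hmac-sha256-tokenization+k-anonymity(k=5)",
    "method_validation_report": "VAL-2024-017",   # evidence the method was validated
    "raw_deleted_at": "2024-04-02T09:00:00Z",     # evidence of on-schedule deletion
    "archive_sha256": None,                        # filled in when the archive is sealed
}

archive_bytes = b'{"requester_token": "a1b2c3", "category": "config-change"}'
manifest["archive_sha256"] = hashlib.sha256(archive_bytes).hexdigest()

def verify_lineage(manifest: dict, archive: bytes) -> list:
    """Return audit findings; an empty list means the evidence holds up."""
    findings = []
    if hashlib.sha256(archive).hexdigest() != manifest.get("archive_sha256"):
        findings.append("archive checksum does not match manifest")
    if not manifest.get("raw_deleted_at"):
        findings.append("no recorded deletion date for the raw source")
    if not manifest.get("method_validation_report"):
        findings.append("anonymization method was never validated")
    return findings

assert verify_lineage(manifest, archive_bytes) == []
```

Running a check like this on every sealed archive means the auditor conversation becomes a review of findings rather than a hunt through deleted data.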