We’re trying to scale process mining across our org after a successful pilot in order-to-cash, but we’re hitting a wall on event log quality. Our initial analysis looks great until we dig deeper and find missing case IDs, timestamp inconsistencies across different systems, and duplicate records that inflate activity counts. The worst part is zero timestamps: we’ve got events recorded as 1970 (the Unix epoch, i.e. a literal zero value) or 2100, which makes case durations look like decades.
We’re pulling data from ERP, a couple of CRM instances, and some legacy systems that don’t talk to each other well. Each system uses its own ID scheme, so tracking a single process instance end-to-end is proving really difficult. We’re spending more time cleaning data than actually analyzing processes, and I’m worried we’re going to lose executive support if we can’t show value faster.
How are others tackling this? Are you building dedicated data prep pipelines, or is there a governance approach that’s worked? Also curious if anyone has found ways to automate the detection of these quality issues before they distort the analysis.
Governance saved us here. We set up a data quality framework with documented standards for timestamps (everything gets converted to UTC), activity naming conventions, and case ID formats before data even hits the process mining tool. We also run automated profiling scripts that compare event log characteristics against known operational metrics—if case counts or durations are way off, we know something’s wrong before analysis starts. It’s boring work but absolutely necessary.
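For anyone wanting to try the profiling approach, here's a minimal sketch of what such a check might look like. The function name, tolerance, and baseline parameters are all hypothetical, not from any particular tool; the idea is just to compare case counts and median durations against known operational numbers before the log reaches the mining tool.

```python
from datetime import datetime, timedelta, timezone
from statistics import median

def profile_event_log(events, expected_cases, expected_median_hours, tolerance=0.2):
    """Hypothetical sanity check run before analysis.

    events: list of dicts with 'case_id' and a timezone-aware 'timestamp'.
    Returns a list of warning strings; an empty list means the log
    roughly matches the operational baselines.
    """
    # Group event timestamps by case so we can derive per-case durations.
    cases = {}
    for e in events:
        cases.setdefault(e["case_id"], []).append(e["timestamp"])

    warnings = []

    # Compare the number of cases against the known operational count.
    case_count = len(cases)
    if abs(case_count - expected_cases) / expected_cases > tolerance:
        warnings.append(
            f"case count {case_count} deviates from baseline {expected_cases}"
        )

    # Compare the median case duration (first event to last) against baseline.
    durations_h = [
        (max(ts) - min(ts)).total_seconds() / 3600 for ts in cases.values()
    ]
    med = median(durations_h)
    if abs(med - expected_median_hours) / expected_median_hours > tolerance:
        warnings.append(
            f"median duration {med:.1f}h deviates from baseline {expected_median_hours}h"
        )

    return warnings
```

A zero-timestamp placeholder would blow up the duration check immediately, which is exactly the point: the log fails loudly before anyone draws a process map from it.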
The cross-system case ID problem is brutal. We built a mapping table that links order IDs, requisition IDs, and invoice IDs so we can stitch together the full process. It’s manual work upfront but pays off when you can actually see end-to-end flow. The key was getting IT and the business units to agree on which identifier is the “golden” case ID for each process type. Without that alignment, you’re just guessing.
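To make the mapping-table idea concrete, here's one possible shape for it, sketched in Python. The table keys, ID formats, and function name are invented for illustration; the important design choice is keeping unmapped events visible for review instead of silently dropping them.

```python
# Hypothetical mapping table: (source system, system-local ID) -> golden case ID.
# In practice this would be maintained in a database, not a literal.
GOLDEN_ID_MAP = {
    ("erp", "ORD-1001"): "CASE-1001",
    ("crm", "REQ-77"): "CASE-1001",
    ("legacy", "INV-5"): "CASE-1001",
}

def stitch_case_ids(events, id_map):
    """Relabel each event with the agreed 'golden' case ID.

    events: list of dicts with 'system' and 'local_id' keys.
    Returns (stitched, unmapped): events that resolved to a golden ID,
    and events whose identifier has no mapping yet.
    """
    stitched, unmapped = [], []
    for e in events:
        key = (e["system"], e["local_id"])
        if key in id_map:
            # Copy the event and attach the unified case ID.
            stitched.append({**e, "case_id": id_map[key]})
        else:
            # Surface gaps in the mapping table rather than losing events.
            unmapped.append(e)
    return stitched, unmapped
```

The size of the `unmapped` pile is itself a useful quality metric: if it grows, a source system has started emitting identifiers the mapping table doesn't know about.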
We had almost the exact same issues pulling from SAP. The zero timestamp problem was killing us—turns out migration scripts from an upgrade years ago left placeholder dates all over the place. We ended up writing validation rules that flag any timestamp outside a reasonable range (say, 2015 to present) and then decide case-by-case whether to remove just those events or the whole case. If it’s a small percentage, we drop the cases. If it’s widespread, we remove only the bad events and keep the rest of the process instance intact.
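A rough sketch of that remove-events-vs-drop-cases rule, assuming the log is already grouped by case (function name and the 5% threshold are my own placeholders, not from the poster's actual scripts):

```python
from datetime import datetime, timezone

# Example bounds; pick whatever range is plausible for your process.
MIN_TS = datetime(2015, 1, 1, tzinfo=timezone.utc)
MAX_TS = datetime(2030, 1, 1, tzinfo=timezone.utc)

def clean_timestamps(cases, min_ts, max_ts, case_drop_threshold=0.05):
    """Handle out-of-range timestamps (e.g. 1970/2100 placeholders).

    cases: dict mapping case_id -> list of event dicts with 'timestamp'.
    If only a small share of cases is affected, drop those cases whole;
    if the problem is widespread, keep the cases and strip only the
    out-of-range events so the rest of each instance survives.
    """
    # Identify cases containing at least one out-of-range timestamp.
    bad_cases = {
        cid for cid, evs in cases.items()
        if any(not (min_ts <= e["timestamp"] <= max_ts) for e in evs)
    }

    if len(bad_cases) / len(cases) <= case_drop_threshold:
        # Small percentage affected: drop those cases entirely.
        return {cid: evs for cid, evs in cases.items() if cid not in bad_cases}

    # Widespread: remove only the bad events, keep every case.
    return {
        cid: [e for e in evs if min_ts <= e["timestamp"] <= max_ts]
        for cid, evs in cases.items()
    }
```

Either way, it's worth logging what was removed, since a sudden spike in flagged timestamps usually points back at a specific source system or migration job.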