We’ve been trying to get process mining off the ground for our order-to-cash workflows, but we keep hitting walls with the event logs themselves. Our data team pulled activity tables from the ERP, but when the process mining tool loaded them, case durations showed up as decades instead of days. It turns out we’ve got placeholder timestamps like 1970-01-01 or 2100-12-31 scattered throughout, probably from old migrations or incomplete records.
Beyond the timestamp mess, we’re also finding that some orders have multiple IDs depending on which system touched them—CRM uses one identifier, procurement uses another, and finance has its own. So what should be a single process instance ends up looking like three separate cases. We’ve also got duplicate records where the same activity appears twice with slightly different timestamps, which is throwing off our bottleneck analysis.
I know we need to clean this up before we can get reliable insights, but I’m not sure what to prioritize. Should we start with the timestamp problems, the case ID mapping, or the deduplication? And how do other teams handle this kind of data prep—are you doing it manually, or is there tooling that helps automate the cleanup? Would love to hear what’s worked (or hasn’t) for folks who’ve been through this.
Make sure you document everything you’re doing. We built a data dictionary that defines every field in our event logs—what it means, where it comes from, what transformations we apply, and what quality rules we enforce. When someone questions the analysis six months later, you need to be able to explain exactly how the data was prepared. Also, if you’re in a regulated industry, you’ll need that documentation for compliance. Data governance isn’t glamorous, but it’s what keeps these initiatives from falling apart when the original team moves on.
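To make that concrete, here’s roughly what one entry in a dictionary like ours looks like when you keep it as machine-readable data instead of a wiki page, so the quality rules can later drive automated checks. Every field name and rule string below is made up for illustration, not from any standard:

```python
# Hypothetical data dictionary entry for one event log field.
# Field names, sources, and rule strings are illustrative only.
DATA_DICTIONARY = {
    "order_approved_at": {
        "meaning": "Timestamp when the order passed financial approval",
        "source": "ERP table FIN_APPROVALS, column APPROVED_TS",
        "transformations": ["converted to UTC", "truncated to the second"],
        "quality_rules": [
            "not null",
            "within valid date range",
            ">= order_created_at",
        ],
    },
}

# Anyone questioning the analysis later can look up a field directly.
entry = DATA_DICTIONARY["order_approved_at"]
```

Keeping it in version control alongside the pipeline code also gives you the audit trail regulators ask for.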
We had to deal with epoch-zero timestamps (those 1970-01-01 placeholders) too. What helped us was profiling the event log first—running statistical checks to see how many cases were affected, which activities had the problem, and whether there was a pattern (e.g., always the same activity or always from a specific system). Once we knew the scope, we could decide on remediation. In our case, about 5% of cases had zero timestamps, so we just excluded those cases entirely rather than trying to impute values or keep partial data. If your percentage is higher, you might need a more sophisticated approach.
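A profiling pass like that fits in a few lines of plain Python. The tuple layout and the placeholder set here are assumptions for the sketch, not your actual schema:

```python
from collections import Counter
from datetime import datetime

# Toy event log: (case_id, activity, timestamp, source_system) tuples.
events = [
    ("C1", "Create Order", datetime(2024, 3, 1, 9, 0), "ERP"),
    ("C1", "Approve",      datetime(1970, 1, 1),       "CRM"),  # placeholder
    ("C2", "Create Order", datetime(2024, 3, 2, 10, 0), "ERP"),
    ("C3", "Ship",         datetime(1970, 1, 1),       "CRM"),  # placeholder
]

PLACEHOLDERS = {datetime(1970, 1, 1), datetime(2100, 12, 31)}

bad = [e for e in events if e[2] in PLACEHOLDERS]
affected = {e[0] for e in bad}            # which cases are hit
by_activity = Counter(e[1] for e in bad)  # always the same activity?
by_system = Counter(e[3] for e in bad)    # always the same system?

pct = 100 * len(affected) / len({e[0] for e in events})
print(f"{pct:.1f}% of cases affected; {by_activity}; {by_system}")
```

A scope check like this is what tells you whether a blunt "exclude the whole case" fix is safe or whether you need something more careful.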
Timestamps first, in my experience. If your temporal ordering is wrong, everything downstream gets unreliable—activities appear in the wrong sequence, and your process model ends up incorrect. We had a similar issue where migration artifacts left us with 1900-01-01 dates. We wrote a validation script to flag any timestamp outside a reasonable range (say, past five years to next year) and then decided case-by-case whether to remove those events or the entire case. If only a few cases are affected, just drop them. If it’s widespread, you might need to remove only the bad events and keep the rest of the case intact.
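A minimal version of that validation script, assuming events arrive as (case_id, activity, timestamp) tuples and that the window bounds are configurable:

```python
from datetime import datetime, timedelta

def split_by_range(events, now, past_years=5, future_years=1):
    """Partition events by whether the timestamp falls in a sane window."""
    lo = now - timedelta(days=365 * past_years)
    hi = now + timedelta(days=365 * future_years)
    ok, flagged = [], []
    for e in events:
        (ok if lo <= e[2] <= hi else flagged).append(e)
    return ok, flagged

now = datetime(2024, 6, 1)
events = [
    ("C1", "Create Order", datetime(2024, 3, 1)),
    ("C1", "Approve",      datetime(1900, 1, 1)),  # migration artifact
]
ok, flagged = split_by_range(events, now)

# Then decide case-by-case: if few cases are affected, drop the whole
# case; if it's widespread, drop only the flagged events instead.
bad_cases = {e[0] for e in flagged}
```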
One thing that’s helped us scale is treating event log preparation as a proper data pipeline rather than a one-time cleanup. We extract raw data into a staging area, run validation and transformation logic, and only promote clean data to the process mining layer. That way, when source systems change or new data quality issues appear, we can catch them early and fix them in the pipeline rather than polluting the analysis. We also version the pipeline so we can reproduce historical analyses if needed. It’s more upfront investment, but it pays off when you’re running this continuously.
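As a rough sketch of that stage-validate-promote shape (the function names and version constant are ours for illustration, not from any process mining tool):

```python
from datetime import datetime

PIPELINE_VERSION = "1.4.0"  # bumped on every logic change so old analyses can be re-run

def stage(raw_rows):
    """Staging: parse raw export rows into typed records; keep everything."""
    return [(case, act, datetime.fromisoformat(ts)) for case, act, ts in raw_rows]

def validate(staged, lo, hi):
    """Apply quality rules; return (clean, rejected) so rejects stay inspectable."""
    clean = [e for e in staged if lo <= e[2] <= hi]
    rejected = [e for e in staged if not (lo <= e[2] <= hi)]
    return clean, rejected

def promote(clean):
    """Only validated events reach the mining layer, stamped with the version."""
    return {"pipeline_version": PIPELINE_VERSION, "events": clean}

raw = [("C1", "Create Order", "2024-03-01T09:00:00"),
       ("C1", "Approve",      "1970-01-01T00:00:00")]  # caught in validation
clean, rejected = validate(stage(raw), datetime(2019, 1, 1), datetime(2025, 12, 31))
out = promote(clean)
```

The version stamp is the part that pays for itself: it ties every published analysis to the exact cleanup logic that produced it.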
We automate a lot of this with an ETL pipeline that runs nightly. For duplicates, we hash the combination of case ID, activity name, and timestamp (rounded to the nearest minute) and drop any exact matches. For timestamps, we convert everything to UTC during extraction and flag any values outside a configurable valid range. For case IDs, we maintain a reference table in our data warehouse that maps cross-system identifiers. It’s not perfect, but it catches most problems before they hit the process mining tool. Initial setup took a few weeks, but now it’s mostly hands-off.
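Concretely, the dedup key we compute looks roughly like this. The rounding helper and the in-memory ID map are simplified stand-ins for the warehouse lookup, and the UTC conversion happens upstream at extraction, so it's omitted here:

```python
import hashlib
from datetime import datetime, timedelta

def round_to_minute(ts):
    """Round a timestamp to the nearest minute (30s or more rounds up)."""
    if ts.second >= 30:
        ts += timedelta(minutes=1)
    return ts.replace(second=0, microsecond=0)

# Stand-in for the warehouse reference table mapping cross-system IDs.
ID_MAP = {"CRM-7": "ORD-7", "PROC-7": "ORD-7"}

def dedup(events, id_map):
    """Drop events whose (canonical case, activity, rounded minute) hash repeats."""
    seen, kept = set(), []
    for case_id, activity, ts in events:
        canon = id_map.get(case_id, case_id)  # fall back to the raw ID
        key = hashlib.sha256(
            f"{canon}|{activity}|{round_to_minute(ts).isoformat()}".encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append((canon, activity, ts))
    return kept

events = [
    ("CRM-7",  "Approve", datetime(2024, 3, 1, 12, 0, 10)),
    ("PROC-7", "Approve", datetime(2024, 3, 1, 12, 0, 20)),  # same minute: duplicate
    ("ORD-7",  "Ship",    datetime(2024, 3, 2, 8, 30)),
]
kept = dedup(events, ID_MAP)  # the two Approve events collapse into one
```

Mapping IDs before hashing is what makes this catch the cross-system duplicates, not just byte-identical rows.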