Here’s the comprehensive solution covering all three critical areas:
CSV Data Type Standardization:
The process mining import wizard requires strict adherence to specific data types for each column. Your CSV must meet these exact specifications:
- Timestamp: ISO 8601 format (yyyy-MM-dd HH:mm:ss) or Excel serial date format
- Case ID: Consistent data type throughout (either all numeric or all alphanumeric string)
- Activity: String, maximum 255 characters
- Resource: String, can be null
- Additional attributes: Must maintain consistent types per column
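A quick way to verify a file against these specifications before touching the wizard is a small pandas check. This is a sketch, assuming the column names used in the scripts below (`CaseID`, `Activity`, `Timestamp`):

```python
import pandas as pd

def check_event_log(df: pd.DataFrame) -> list[str]:
    """Return a list of spec violations for the columns described above."""
    problems = []
    # Case ID: one consistent type throughout the column
    if df['CaseID'].map(type).nunique() > 1:
        problems.append('CaseID mixes numeric and string values')
    # Activity: string, maximum 255 characters
    if df['Activity'].astype(str).str.len().max() > 255:
        problems.append('Activity exceeds 255 characters')
    # Timestamp: every value must parse to a valid datetime
    if pd.to_datetime(df['Timestamp'], errors='coerce').isna().any():
        problems.append('Timestamp contains unparseable values')
    return problems
```

An empty list means the three core columns meet the specifications; anything else tells you which rule to fix first.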
For immediate fixing of your existing 500K-row CSV, use this Python script:
```python
import pandas as pd

df = pd.read_csv('events.csv')
df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')
df['CaseID'] = df['CaseID'].astype(str)
df.to_csv('events_cleaned.csv', index=False)
```
This handles mixed timestamp formats by attempting automatic conversion and standardizes all case IDs to strings. The `errors='coerce'` parameter converts invalid timestamps to NaT (Not a Time), which you can then filter out or fix manually.
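For example, after coercion you can inspect the rows that failed to parse before discarding them (illustrated here with a tiny inline DataFrame rather than your real file):

```python
import pandas as pd

df = pd.DataFrame({
    'CaseID': ['1001', '1002', '1003'],
    'Timestamp': ['2024-03-01 09:15:00', 'bad value', '2024-03-01 10:30:00'],
})
df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')

# Keep the failures aside for manual review, then drop them from the export
failed = df[df['Timestamp'].isna()]
df = df.dropna(subset=['Timestamp'])
```

Reviewing `failed` before dropping rows tells you whether the bad values are random noise or a systematic format your parser should handle.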
For more complex scenarios with multiple date formats in the same column, use:
```python
import pandas as pd
from dateutil.parser import parse

# Guard against nulls: parse(str(NaN)) would raise a ValueError
df['Timestamp'] = df['Timestamp'].apply(
    lambda x: parse(str(x), dayfirst=False) if pd.notna(x) else pd.NaT
)
```
Import Wizard Validation:
The process mining import wizard performs several validation checks that you need to prepare for:
- Column mapping validation: Ensure your CSV headers exactly match expected names (Case ID, Activity, Timestamp, Resource)
- Mandatory fields check: Case ID, Activity, and Timestamp are required - no nulls allowed
- Data type consistency: Every row must conform to the declared type
- Duplicate detection: The wizard flags but doesn’t reject duplicates
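These checks can be mirrored locally so failures surface before upload. A minimal sketch, assuming the wizard's expected header names (`Case ID`, `Activity`, `Timestamp`):

```python
import pandas as pd

def preflight(df: pd.DataFrame) -> dict:
    """Count the issues the import wizard's validation would flag."""
    return {
        'null_case_ids': int(df['Case ID'].isna().sum()),
        'null_activities': int(df['Activity'].isna().sum()),
        'null_timestamps': int(df['Timestamp'].isna().sum()),
        'duplicate_rows': int(df.duplicated().sum()),
    }
```

All-zero counts for the three mandatory columns mean the nulls check will pass; a nonzero duplicate count is worth reviewing even though the wizard only flags it.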
To pass validation on the first attempt:
- Remove or fill any null values in mandatory columns
- Ensure timestamps are within a reasonable range (values like year 1900 or 2099 usually indicate Excel serial-date or placeholder errors)
- Verify case IDs don’t contain special characters that might be interpreted as delimiters
- Check for hidden characters or encoding issues (use UTF-8 encoding)
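The checklist above can be automated; here is a sketch where the year bounds and the delimiter character class are assumptions you should adjust to your data:

```python
import pandas as pd

def checklist_issues(df: pd.DataFrame) -> list[str]:
    """Flag implausible timestamps and delimiter characters in case IDs."""
    issues = []
    ts = pd.to_datetime(df['Timestamp'], errors='coerce')
    # Years far in the past or future usually indicate serial-date errors
    if ((ts.dt.year < 2000) | (ts.dt.year > 2030)).any():
        issues.append('timestamps outside plausible range')
    # Commas, semicolons, tabs, or quotes inside IDs can shift columns on import
    if df['CaseID'].astype(str).str.contains(r'[,;\t"]').any():
        issues.append('case IDs contain delimiter characters')
    return issues
```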
Before importing, test with a 1000-row sample of your data. The wizard will show validation errors immediately, allowing you to fix issues before processing the full 500K rows.
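Producing that sample is a one-liner with pandas; wrapped in a helper here so the row count is easy to confirm (file names are placeholders):

```python
import pandas as pd

def write_sample(src: str, dst: str, n: int = 1000) -> int:
    """Copy the first n rows of src to dst for a trial import; returns rows written."""
    sample = pd.read_csv(src, nrows=n)
    sample.to_csv(dst, index=False)
    return len(sample)
```

`nrows` stops reading after n rows, so this stays fast even on the full 500K-row file.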
Pre-Processing Scripts:
For ongoing process mining operations, implement an automated data preparation pipeline:
Option 1 - Power Platform Dataflow (recommended for recurring imports):
- Create a new dataflow in Power Platform
- Add your CSV source (can connect to SharePoint, OneDrive, or Azure Blob)
- Apply these transformation steps:
- Remove rows with null Case IDs or Timestamps
- Use ‘Change Type’ to enforce consistent data types
- Add custom column for timestamp standardization: `DateTime.From(Text.From([Timestamp]), "en-US")`
- Add custom column for case ID formatting (if you want fixed-length IDs): `Text.PadStart(Text.From([CaseID]), 10, "0")`
- Enable dataflow refresh on a schedule
- Configure process mining to read from the dataflow output instead of raw CSV
Option 2 - Python Pre-Processing Script (for one-time or external processing):
Create a comprehensive cleaning script that handles all edge cases:
```python
import pandas as pd
import numpy as np
from dateutil import parser

# Load with explicit encoding (utf-8-sig also strips a BOM if present)
df = pd.read_csv('raw_events.csv', encoding='utf-8-sig')

# Standardize timestamps
def clean_timestamp(ts):
    try:
        return parser.parse(str(ts))
    except (ValueError, TypeError, OverflowError):
        return np.nan

df['Timestamp'] = df['Timestamp'].apply(clean_timestamp)

# Remove invalid timestamps
df = df.dropna(subset=['Timestamp'])

# Standardize case IDs
df['CaseID'] = df['CaseID'].astype(str).str.strip()

# Clean activity names
df['Activity'] = df['Activity'].str.strip().str[:255]

# Export in process mining format
df.to_csv('events_processed.csv', index=False, date_format='%Y-%m-%d %H:%M:%S')
```
This script handles:
- Multiple timestamp formats through intelligent parsing
- Null value removal
- String trimming and length limits
- Consistent case ID formatting
- Proper CSV encoding for import
For your 500K-row file, this Python approach typically completes in under a minute on a modern workstation. Once you have the cleaned CSV, the process mining import wizard should complete without validation errors.
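If memory becomes a constraint on larger exports, the same cleaning steps can be applied in chunks. A sketch, where the chunk size and file names are assumptions:

```python
import pandas as pd

def clean_in_chunks(src: str, dst: str, chunksize: int = 100_000) -> int:
    """Stream-clean a large CSV without loading it all into memory."""
    total = 0
    for i, chunk in enumerate(pd.read_csv(src, chunksize=chunksize,
                                          encoding='utf-8-sig')):
        chunk['Timestamp'] = pd.to_datetime(chunk['Timestamp'], errors='coerce')
        chunk = chunk.dropna(subset=['Timestamp'])
        chunk['CaseID'] = chunk['CaseID'].astype(str).str.strip()
        # Write the header only for the first chunk, then append
        chunk.to_csv(dst, index=False, mode='w' if i == 0 else 'a',
                     header=(i == 0), date_format='%Y-%m-%d %H:%M:%S')
        total += len(chunk)
    return total
```

The trade-off is that per-chunk parsing loses whole-file context, so this works best once you already know the timestamp formats present in the export.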
For future imports, I recommend the dataflow approach because it creates a reusable, scheduled pipeline that automatically cleans incoming data. You can also add data quality monitoring to track how many rows fail validation over time, helping you identify upstream data quality issues in your ERP system.