Process mining import fails for large CSV files with data type issues

I’m trying to import large CSV event logs (500K+ rows) into Power Platform process mining, but the import wizard keeps failing with data type mismatch errors. The CSV files contain process event data extracted from our legacy ERP system with timestamp, case ID, activity, and resource columns.

The import wizard validation fails on certain rows where timestamp formats are inconsistent or numeric case IDs are mixed with alphanumeric ones:


Row 45823: Invalid timestamp format
Expected: yyyy-MM-dd HH:mm:ss
Received: dd/MM/yyyy HH:mm:ss

This blocks our process discovery initiative. We need proper CSV data type standardization and potentially pre-processing scripts to clean the data before import. The import wizard validation doesn’t provide enough flexibility for our real-world data inconsistencies.

For a quick fix on existing files, use Excel Power Query. Load the CSV, add a custom column with this formula for timestamps: DateTime.From(Text.From([Timestamp]), "en-GB") (pick a culture matching your source format; since the problem rows are dd/MM/yyyy, a day-first culture like "en-GB" avoids misparsing). For case IDs, use Text.From([CaseID]) to force everything to string. Then export back to CSV. Takes about 10 minutes for 500K rows.

The dataflow approach sounds good for ongoing imports. But for this initial load, I need to get the data in quickly. What’s the fastest way to fix the timestamp and case ID issues in the existing CSV?

Here’s the comprehensive solution covering all three critical areas:

CSV Data Type Standardization: The process mining import wizard requires strict adherence to specific data types for each column. Your CSV must meet these exact specifications:

  • Timestamp: ISO 8601 format (yyyy-MM-dd HH:mm:ss) or Excel serial date format
  • Case ID: Consistent data type throughout (either all numeric or all alphanumeric string)
  • Activity: String, maximum 255 characters
  • Resource: String, can be null
  • Additional attributes: Must maintain consistent types per column

For immediate fixing of your existing 500K-row CSV, use this Python script:

import pandas as pd

df = pd.read_csv('events.csv')
# format='mixed' (pandas >= 2.0) parses each value individually, so
# yyyy-MM-dd and dd/MM/yyyy rows can coexist; dayfirst=True covers the
# ambiguous day-first rows, and errors='coerce' turns failures into NaT
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='mixed',
                                 dayfirst=True, errors='coerce')
df['CaseID'] = df['CaseID'].astype(str)
df.to_csv('events_cleaned.csv', index=False)

This handles mixed timestamp formats by attempting per-value conversion and standardizes all case IDs to strings. The errors='coerce' parameter converts invalid timestamps to NaT (Not a Time), which you can then filter out or fix manually.
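To see what "filter out" looks like in practice, here is a minimal sketch (the two-row in-memory sample is made up, standing in for your real file) that splits the coerced NaT rows into a separate frame for review:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real 500K-row CSV
raw = io.StringIO(
    "CaseID,Activity,Timestamp\n"
    "1001,Create Order,2023-05-01 09:15:00\n"
    "1002,Approve Order,not-a-date\n"
)
df = pd.read_csv(raw)
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")

# Rows that failed conversion are NaT: set them aside for manual review
bad_rows = df[df["Timestamp"].isna()]
df = df[df["Timestamp"].notna()]
```

Saving bad_rows to a separate CSV gives you an audit trail of exactly which source rows were dropped.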

For more complex scenarios with multiple date formats in the same column, use dateutil's parser. Set dayfirst=True, because your non-ISO rows are dd/MM/yyyy (ISO strings parse correctly either way):

from dateutil.parser import parse
df['Timestamp'] = df['Timestamp'].apply(lambda x: parse(str(x), dayfirst=True))
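One caution worth demonstrating (with made-up dates): dates where the day could also be a month are ambiguous, and the dayfirst setting decides how dateutil reads them. Since the failing rows in your export are dd/MM/yyyy, dayfirst=True is the safer choice:

```python
from dateutil.parser import parse

# ISO 8601 is unambiguous regardless of the dayfirst setting
iso = parse("2023-05-01 09:15:00")

# "05/06/2023" is ambiguous: dayfirst=True reads it as 5 June,
# matching the dd/MM/yyyy convention in the ERP export
dmy = parse("05/06/2023 09:15:00", dayfirst=True)
mdy = parse("05/06/2023 09:15:00", dayfirst=False)
```

A wrong dayfirst setting will not raise an error on ambiguous dates; it silently swaps day and month, which is much harder to catch after import.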

Import Wizard Validation: The process mining import wizard performs several validation checks that you need to prepare for:

  1. Column mapping validation: Ensure your CSV headers exactly match expected names (Case ID, Activity, Timestamp, Resource)
  2. Mandatory fields check: Case ID, Activity, and Timestamp are required; no nulls allowed
  3. Data type consistency: Every row must conform to the declared type
  4. Duplicate detection: The wizard flags but doesn’t reject duplicates

To pass validation on the first attempt:

  • Remove or fill any null values in mandatory columns
  • Ensure timestamps fall within a plausible range (placeholder dates like 1900-01-01 or far-future years such as 2099 usually signal conversion errors)
  • Verify case IDs don’t contain special characters that might be interpreted as delimiters
  • Check for hidden characters or encoding issues (use UTF-8 encoding)
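These checks can be automated before you ever open the wizard. A minimal pre-flight sketch (the column names and rules below are assumptions based on the list above; adjust them to your actual schema):

```python
import io
import pandas as pd

REQUIRED = ["Case ID", "Activity", "Timestamp"]  # assumed wizard column names

def preflight(df):
    """Collect problems that would trip wizard validation."""
    problems = []
    for col in REQUIRED:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].isna().any():
            problems.append(f"null values in mandatory column: {col}")
    return problems

# Made-up sample: the second row has an empty Activity
sample = pd.read_csv(io.StringIO(
    "Case ID,Activity,Timestamp\n"
    "A-1,Create,2023-05-01 09:15:00\n"
    "A-2,,2023-05-01 09:20:00\n"
))
issues = preflight(sample)
```

Running this over the full file before upload turns a slow trial-and-error loop with the wizard into a single local check.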

Before importing, test with a 1000-row sample of your data. The wizard will show validation errors immediately, allowing you to fix issues before processing the full 500K rows.
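Extracting that sample is a one-liner once the file is loaded; a sketch with synthetic data standing in for the cleaned file:

```python
import pandas as pd

# Synthetic frame standing in for events_cleaned.csv
df = pd.DataFrame({
    "CaseID": [str(i) for i in range(5000)],
    "Activity": ["Review Order"] * 5000,
    "Timestamp": pd.date_range("2023-01-01", periods=5000, freq="min"),
})

# The first 1,000 rows are usually enough to surface mapping and type errors
sample = df.head(1000)
sample.to_csv("sample_1000.csv", index=False, date_format="%Y-%m-%d %H:%M:%S")
```

If your data quality varies over time, a random sample (df.sample(1000)) may surface more format variants than the first 1,000 rows.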

Pre-Processing Scripts: For ongoing process mining operations, implement an automated data preparation pipeline:

Option 1 - Power Platform Dataflow (recommended for recurring imports):

  1. Create a new dataflow in Power Platform
  2. Add your CSV source (can connect to SharePoint, OneDrive, or Azure Blob)
  3. Apply these transformation steps:
    • Remove rows with null Case IDs or Timestamps
    • Use ‘Change Type’ to enforce consistent data types
    • Add custom column for timestamp standardization: DateTime.From(Text.From([Timestamp]), "en-GB") (choose a culture that matches the source format; day-first data such as dd/MM/yyyy needs a day-first culture like "en-GB")
    • Add custom column for case ID formatting: Text.PadStart(Text.From([CaseID]), 10, "0") (if you want fixed-length IDs)
  4. Enable dataflow refresh on a schedule
  5. Configure process mining to read from the dataflow output instead of raw CSV

Option 2 - Python Pre-Processing Script (for one-time or external processing):

Create a comprehensive cleaning script that handles all edge cases:

import pandas as pd
import numpy as np
from dateutil import parser

# Load with explicit encoding
df = pd.read_csv('raw_events.csv', encoding='utf-8-sig')

# Standardize timestamps
def clean_timestamp(ts):
    try:
        # dayfirst=True matches the dd/MM/yyyy rows in the source data
        return parser.parse(str(ts), dayfirst=True)
    except (ValueError, OverflowError, TypeError):
        return np.nan

df['Timestamp'] = df['Timestamp'].apply(clean_timestamp)

# Remove invalid timestamps
df = df.dropna(subset=['Timestamp'])

# Standardize case IDs
df['CaseID'] = df['CaseID'].astype(str).str.strip()

# Clean activity names
df['Activity'] = df['Activity'].str.strip().str[:255]

# Export in process mining format
df.to_csv('events_processed.csv', index=False, date_format='%Y-%m-%d %H:%M:%S')

This script handles:

  • Multiple timestamp formats through intelligent parsing
  • Null value removal
  • String trimming and length limits
  • Consistent case ID formatting
  • Proper CSV encoding for import

For your 500K-row file, this Python approach processes in 30-60 seconds. Once you have the cleaned CSV, the process mining import wizard should complete without validation errors.
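If memory becomes a constraint as the log grows, the same cleaning can run in streaming chunks. A sketch under those assumptions (the file names are illustrative, and the tiny chunk size is only for the demo; in practice use something like chunksize=100_000):

```python
import pandas as pd

# Stand-in input so the sketch is self-contained; in practice this is the
# raw ERP export with hundreds of thousands of rows
pd.DataFrame({
    "CaseID": [101, 102, 103],
    "Activity": ["Create", "Approve", "Ship"],
    "Timestamp": ["2023-05-01 09:15:00", "not-a-date", "2023-05-01 10:00:00"],
}).to_csv("raw_events.csv", index=False)

def clean_chunk(chunk):
    chunk["Timestamp"] = pd.to_datetime(chunk["Timestamp"], errors="coerce")
    chunk["CaseID"] = chunk["CaseID"].astype(str).str.strip()
    return chunk.dropna(subset=["Timestamp"])

# Stream the file and append each cleaned chunk to the output; only one
# chunk is ever held in memory at a time
first = True
for chunk in pd.read_csv("raw_events.csv", chunksize=2):  # toy size for the demo
    clean_chunk(chunk).to_csv(
        "events_processed.csv",
        mode="w" if first else "a",
        header=first,
        index=False,
        date_format="%Y-%m-%d %H:%M:%S",
    )
    first = False
```

The header=first flag ensures the column names are written exactly once, so the output stays a valid single-header CSV for the import wizard.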

For future imports, I recommend the dataflow approach because it creates a reusable, scheduled pipeline that automatically cleans incoming data. You can also add data quality monitoring to track how many rows fail validation over time, helping you identify upstream data quality issues in your ERP system.
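The monitoring idea can start as simply as appending one summary row per import run; a sketch (the log file name and its fields are made up for illustration):

```python
import os
import pandas as pd

def log_quality(df, run_id, log_path="dq_log.csv"):
    """Append one data-quality summary row per import run."""
    parsed = pd.to_datetime(df["Timestamp"], errors="coerce")
    entry = pd.DataFrame([{
        "run_id": run_id,
        "total_rows": len(df),
        "invalid_timestamps": int(parsed.isna().sum()),
    }])
    entry.to_csv(log_path, mode="a",
                 header=not os.path.exists(log_path), index=False)
    return entry

# Made-up mini-batch with two unparseable timestamps
batch = pd.DataFrame(
    {"Timestamp": ["2023-05-01 09:15:00", "not-a-date", "also-bad"]}
)
entry = log_quality(batch, "run-001", log_path="dq_log_demo.csv")
```

Plotting invalid_timestamps over run_id gives you an early-warning signal when an upstream ERP change starts degrading the export.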

While external processing works, you can also use Power Platform dataflows to create a cleaning pipeline. Build a dataflow that ingests the raw CSV, applies transformations for timestamp standardization and case ID formatting, then outputs a cleaned version that feeds directly into process mining. This keeps everything within the Power Platform ecosystem and makes it repeatable. The dataflow can handle the 500K rows efficiently if you enable incremental refresh. Add data validation steps to flag problematic rows without failing the entire import - route them to an error table for manual review.

That makes sense. Do you have a recommended approach for the pre-processing? Should I use Power Query within Power Platform or handle it externally before upload?

Large CSV files with inconsistent data are a common problem. You’ll need to pre-process the CSV before importing. Use Python or Power Query to standardize timestamp formats and ensure case IDs are consistently formatted. The process mining import wizard expects strict data types and won’t handle mixed formats.