We’re running Athena queries against our ERP application logs stored in S3, partitioned by date. After our nightly Glue ETL job completes, Athena consistently returns ‘HIVE_PARTITION_SCHEMA_MISMATCH’ errors when querying recent data.
Our Glue crawler runs daily at 2 AM to update the Data Catalog, but queries against yesterday’s partition fail until we manually run MSCK REPAIR TABLE. The ETL workflow writes new partitions in format year=2025/month=03/day=14/ but the catalog doesn’t reflect them.
Sample failing query:
SELECT user_id, action, timestamp
FROM erp_logs
WHERE year=2025 AND month=3 AND day=13;
Error: Partition not found in Data Catalog. This blocks our morning analytics reports. How can we ensure the Glue Data Catalog updates automatically after ETL jobs complete, and integrate partition repair into our workflow without manual intervention?
The HIVE_PARTITION_SCHEMA_MISMATCH error suggests your partition columns might have type inconsistencies. When Glue ETL writes new partitions, are the data types matching the existing catalog schema? I’d recommend checking if your month and day columns are being written as integers versus strings. Run SHOW PARTITIONS erp_logs to see what’s actually registered versus what exists in S3.
For automatic partition updates, consider using AWS Lambda triggered by S3 PUT events. When your ETL writes new partition data, Lambda can execute ALTER TABLE ADD PARTITION statements directly. This eliminates the crawler delay entirely and gives you real-time partition availability. You’d need to parse the S3 key to extract partition values and construct the DDL, but it’s more reliable than scheduled crawlers for time-sensitive analytics.
Lambda works but adds complexity. A simpler approach: configure your Glue ETL job to use enableUpdateCatalog and partitionKeys parameters. This makes Glue automatically update the Data Catalog as it writes partitions. In your job script, set these DynamicFrame write options and Glue handles catalog updates synchronously. Much cleaner than post-processing with crawlers or Lambda functions, and it’s native Glue functionality designed exactly for this scenario.