Glue crawler fails to catalog Parquet files after S3 bucket migration for analytics data lake

After migrating analytics data to a new S3 bucket with KMS encryption, our Glue crawler fails to catalog the Parquet files. The crawler runs but doesn’t create any tables in the Data Catalog. Logs show: “Insufficient permissions to access S3 path s3://new-analytics-bucket/data/”. The Glue crawler IAM role has AWSGlueServiceRole attached, and the S3 bucket policy allows the Glue service principal. I’m confused because the old bucket setup worked fine.

Crawler configuration:

Data store: S3
Include path: s3://new-analytics-bucket/data/
Exclude patterns: _temporary/**, .spark/**

The new bucket has KMS encryption enabled with a customer-managed key, which the old bucket didn’t have. Could this be related to KMS permissions?

The SecureTransport condition shouldn’t block Glue; it uses HTTPS by default. Your issue is likely that the bucket policy doesn’t explicitly allow the Glue role. Even with the service principal allowed, you need to add the specific IAM role ARN to the bucket policy’s Principal section. Also verify the KMS key policy allows the Glue role to use the key for decryption.

Yes, KMS is definitely your issue. The Glue crawler IAM role needs explicit kms:Decrypt permission for your customer-managed key. The AWSGlueServiceRole managed policy doesn’t include KMS permissions. Add an inline policy to the crawler role granting kms:Decrypt and kms:DescribeKey for the specific KMS key ARN used by your bucket.
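As a sketch of what that inline policy could look like, here it is built in Python and attached via `aws iam put-role-policy` (the key ARN, role name, and policy name below are placeholders, not your real resources):

```python
import json

# Placeholders: substitute your real key ARN and crawler role name.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/YOUR-KEY-ID"
ROLE_NAME = "AWSGlueServiceRole-CrawlerName"

inline_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:DescribeKey"],
        "Resource": KEY_ARN,
    }],
}

# Attach it with the AWS CLI (boto3's iam.put_role_policy does the same):
command = (
    f"aws iam put-role-policy --role-name {ROLE_NAME} "
    "--policy-name GlueCrawlerKmsAccess "
    f"--policy-document '{json.dumps(inline_policy)}'"
)
print(command)
```

Scoping the Resource to the specific key ARN (rather than `*`) keeps the role limited to the one key the bucket actually uses.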

Also check your S3 bucket policy. If it has condition statements requiring specific encryption headers or VPC endpoints, the Glue service might not satisfy those conditions. I’ve seen bucket policies that deny access unless requests include x-amz-server-side-encryption headers, which breaks Glue crawlers. Review the bucket policy for any Deny statements with conditions.

Don’t forget about the Glue Data Catalog encryption settings. If your Data Catalog is encrypted, the crawler needs permissions for that KMS key too, not just the S3 bucket’s key. Check if you have catalog encryption enabled in Glue settings and ensure the role has access to both KMS keys if they’re different.
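A quick way to check is to compare the catalog's key against the bucket's. The response shape below follows Glue's GetDataCatalogEncryptionSettings API; the key ARNs are made-up placeholders, and the boto3 call is commented out since it needs credentials:

```python
# In a real session: settings = boto3.client("glue").get_data_catalog_encryption_settings()
settings = {  # sample response shape (GetDataCatalogEncryptionSettings)
    "DataCatalogEncryptionSettings": {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/catalog-key",
        }
    }
}

at_rest = settings["DataCatalogEncryptionSettings"]["EncryptionAtRest"]
if at_rest.get("CatalogEncryptionMode") == "SSE-KMS":
    catalog_key = at_rest.get("SseAwsKmsKeyId")
    bucket_key = "arn:aws:kms:us-east-1:123456789012:key/bucket-key"  # placeholder
    if catalog_key != bucket_key:
        print("Catalog and bucket use different keys; the role needs both:")
        print(" ", catalog_key)
        print(" ", bucket_key)
```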

You need to fix all three permission layers systematically:

Glue Crawler IAM Role: The AWSGlueServiceRole managed policy isn’t sufficient here: it includes no KMS permissions, and its S3 access only covers aws-glue-* paths. Create a custom policy attached to your crawler role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::new-analytics-bucket",
        "arn:aws:s3:::new-analytics-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:region:account:key/YOUR-KEY-ID"
    }
  ]
}

The kms:GenerateDataKey permission is often overlooked but necessary for Glue to write metadata.

S3 Bucket Policy: If the bucket is in a different account, the bucket policy must explicitly allow the Glue crawler role, not just the service principal (for a same-account bucket, the role’s IAM permissions are enough unless a Deny statement blocks them). Add this statement:

{
  "Sid": "AllowGlueCrawler",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::ACCOUNT:role/AWSGlueServiceRole-CrawlerName"
  },
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::new-analytics-bucket",
    "arn:aws:s3:::new-analytics-bucket/*"
  ]
}

If you have Deny statements in the bucket policy, ensure they don’t conflict. A common issue is having a Deny for non-SSL requests that accidentally blocks the Glue service.
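As an illustrative (not exhaustive) check, this sketch scans a bucket policy for Deny statements whose condition keys commonly trip up crawlers, such as encryption-header or VPC-endpoint requirements; the sample policy is fabricated for demonstration:

```python
# Condition keys that frequently block Glue crawlers when used in a Deny.
SUSPECT_KEYS = {"s3:x-amz-server-side-encryption", "aws:sourcevpce", "aws:sourcevpc"}

def suspicious_denies(bucket_policy: dict) -> list:
    """Return (Sid, condition key) pairs for Deny statements worth reviewing."""
    hits = []
    for stmt in bucket_policy.get("Statement", []):
        if stmt.get("Effect") != "Deny":
            continue
        for _operator, key_values in stmt.get("Condition", {}).items():
            for key in key_values:
                if key.lower() in SUSPECT_KEYS:
                    hits.append((stmt.get("Sid", "<no Sid>"), key))
    return hits

# Example: a policy that denies uploads lacking an SSE-KMS header.
policy = {
    "Statement": [{
        "Sid": "DenyUnencrypted",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::new-analytics-bucket/*",
        "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}},
    }]
}
print(suspicious_denies(policy))
```

Note that aws:SecureTransport is deliberately absent from the suspect list: Glue uses HTTPS, so an SSL-only Deny normally doesn't affect it.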

KMS Key Policy: This is the most commonly missed piece. Your KMS key policy must grant the Glue role permission to use the key. Add this statement to the key policy:

{
  "Sid": "Allow Glue to use the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::ACCOUNT:role/AWSGlueServiceRole-CrawlerName"
  },
  "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
  "Resource": "*"
}

Without this, even if the IAM role has kms:Decrypt permissions, the key policy will deny access.
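The evaluation logic can be sketched as a toy model (ignoring grants and cross-account cases): the key policy must either allow the role directly, or delegate to IAM by allowing the account root; only in the latter case does the role’s own kms:Decrypt permission take effect.

```python
def kms_access(iam_allows: bool, key_policy_allows_role: bool,
               key_policy_allows_account_root: bool) -> bool:
    """Simplified same-account KMS authorization: key policy allows the role
    directly, OR it delegates to IAM (account-root allow) and IAM allows."""
    return key_policy_allows_role or (key_policy_allows_account_root and iam_allows)

# IAM grants kms:Decrypt, but the key policy has neither statement: denied.
assert kms_access(True, False, False) is False
# Default key policy ("Enable IAM User Permissions" on the account root)
# plus the role's inline kms:Decrypt policy: allowed.
assert kms_access(True, False, True) is True
```

This is why a customer-managed key created with a locked-down policy behaves differently from one created with the console default, even when the IAM role is identical.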

Verification Steps:

  1. Test S3 access: Use the AWS CLI with the crawler role credentials to list bucket contents: `aws s3 ls s3://new-analytics-bucket/data/ --profile crawler-role`
  2. Test KMS access: Try to download (and therefore decrypt) a sample file using the role: `aws s3 cp s3://new-analytics-bucket/data/sample.parquet - --profile crawler-role`
  3. Check CloudTrail: Look for AccessDenied events from the Glue service to see which exact permission is failing
  4. Enable Glue crawler CloudWatch logs: In the crawler settings, enable logging to see detailed error messages
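For step 3, a small sketch like this can pull the AccessDenied events out of a CloudTrail export (for example, the output of `aws cloudtrail lookup-events`); the sample event below is fabricated to show the shape:

```python
def access_denied_events(events: list) -> list:
    """Extract (source, action, message) for events that failed with AccessDenied."""
    return [
        (e.get("eventSource"), e.get("eventName"), e.get("errorMessage"))
        for e in events
        if e.get("errorCode") == "AccessDenied"
    ]

sample = [{  # fabricated example of a CloudTrail event record
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "errorCode": "AccessDenied",
    "errorMessage": "... is not authorized to perform: kms:Decrypt ...",
}]
for source, action, message in access_denied_events(sample):
    print(source, action, message)
```

The eventSource field tells you immediately whether the failure is on the S3 side or the KMS side, which narrows down which of the three policies to fix.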

Additional Considerations:

  • If your Parquet files were written by Spark or other tools, ensure they’re using the correct KMS key for encryption
  • Verify the crawler’s exclude patterns aren’t too broad; _temporary/** should be fine, but double-check
  • For large datasets, crawls can run long; crawlers don’t expose a DPU setting the way Glue jobs do, so use a narrower include path or the crawler’s S3 sample-size option to keep runs manageable
  • If you have nested partitions (year/month/day structure), ensure the crawler is configured to detect partition keys automatically
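On the first bullet, you can spot-check which key each Parquet file was actually encrypted with. The response fields follow S3’s HeadObject API; the ARNs below are placeholders, and the boto3 call is commented out since it needs credentials:

```python
# In a real session: head = boto3.client("s3").head_object(Bucket=..., Key=...)
head = {  # sample HeadObject response fields for a KMS-encrypted object
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:123456789012:key/some-other-key",
}

expected_key = "arn:aws:kms:us-east-1:123456789012:key/bucket-key"  # placeholder

if head.get("ServerSideEncryption") == "aws:kms" and head.get("SSEKMSKeyId") != expected_key:
    print("Object encrypted with a different key:", head["SSEKMSKeyId"])
```

Spark writers can be pinned to a different key than the bucket default, so finding a mismatch here would explain why the crawler role's single-key KMS grant isn't enough.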

After making these changes, test the crawler on a small subset first (use a more specific include path like s3://new-analytics-bucket/data/year=2025/month=01/) before running it on the entire dataset.

I added kms:Decrypt to the crawler role for the KMS key ARN, but the crawler still fails with the same error. The bucket policy has a condition requiring aws:SecureTransport=true. Could that be blocking Glue? Do I need to modify the bucket policy to explicitly allow the Glue service principal?