Glue crawler fails to catalog Parquet files after S3 bucket migration for analytics data lake

After migrating analytics data to a new S3 bucket with KMS encryption, our Glue crawler fails to catalog the Parquet files. The crawler runs but doesn’t create any tables in the Data Catalog. Logs show: “Insufficient permissions to access S3 path s3://new-analytics-bucket/data/”. The Glue crawler IAM role has AWSGlueServiceRole attached, and the S3 bucket policy allows the Glue service principal. I’m confused because the old bucket setup worked fine.

Crawler configuration:

Data store: S3
Include path: s3://new-analytics-bucket/data/
Exclude patterns: _temporary/**, .spark/**

The new bucket has KMS encryption enabled with a customer-managed key, which the old bucket didn’t have. Could this be related to KMS permissions?

The SecureTransport condition shouldn’t block Glue; it uses HTTPS by default. Your issue is likely that the bucket policy doesn’t explicitly allow the Glue role. Even with the service principal allowed, you need to add the specific IAM role ARN to the bucket policy’s Principal section. Also verify the KMS key policy allows the Glue role to use the key for decryption.

Yes, KMS is definitely your issue. The Glue crawler IAM role needs explicit kms:Decrypt permission for your customer-managed key. The AWSGlueServiceRole managed policy doesn’t include KMS permissions. Add an inline policy to the crawler role granting kms:Decrypt and kms:DescribeKey for the specific KMS key ARN used by your bucket.
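As a sketch of what that inline policy could look like, here it is built in Python and attached via `aws iam put-role-policy` (the key ARN, role name, and policy name below are placeholders, not your real resources):

```python
import json

# Placeholders: substitute your real key ARN and crawler role name.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/YOUR-KEY-ID"
ROLE_NAME = "AWSGlueServiceRole-CrawlerName"

inline_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["kms:Decrypt", "kms:DescribeKey"],
        "Resource": KEY_ARN,
    }],
}

# Attach it with the AWS CLI (boto3's iam.put_role_policy does the same):
command = (
    f"aws iam put-role-policy --role-name {ROLE_NAME} "
    "--policy-name GlueCrawlerKmsAccess "
    f"--policy-document '{json.dumps(inline_policy)}'"
)
print(command)
```

Scoping the Resource to the specific key ARN (rather than `*`) keeps the role limited to the one key the bucket actually uses.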

Also check your S3 bucket policy. If it has condition statements requiring specific encryption headers or VPC endpoints, the Glue service might not satisfy those conditions. I’ve seen bucket policies that deny access unless requests include x-amz-server-side-encryption headers, which breaks Glue crawlers. Review the bucket policy for any Deny statements with conditions.

Don’t forget about the Glue Data Catalog encryption settings. If your Data Catalog is encrypted, the crawler needs permissions for that KMS key too, not just the S3 bucket’s key. Check if you have catalog encryption enabled in Glue settings and ensure the role has access to both KMS keys if they’re different.
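A quick way to check is to compare the catalog's key against the bucket's. The response shape below follows Glue's GetDataCatalogEncryptionSettings API; the key ARNs are made-up placeholders, and the boto3 call is commented out since it needs credentials:

```python
# In a real session: settings = boto3.client("glue").get_data_catalog_encryption_settings()
settings = {  # sample response shape (GetDataCatalogEncryptionSettings)
    "DataCatalogEncryptionSettings": {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/catalog-key",
        }
    }
}

at_rest = settings["DataCatalogEncryptionSettings"]["EncryptionAtRest"]
if at_rest.get("CatalogEncryptionMode") == "SSE-KMS":
    catalog_key = at_rest.get("SseAwsKmsKeyId")
    bucket_key = "arn:aws:kms:us-east-1:123456789012:key/bucket-key"  # placeholder
    if catalog_key != bucket_key:
        print("Catalog and bucket use different keys; the role needs both:")
        print(" ", catalog_key)
        print(" ", bucket_key)
```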

You need to fix all three permission layers systematically:

Glue Crawler IAM Role: The AWSGlueServiceRole managed policy isn’t sufficient here: it includes no KMS permissions, and its S3 access only covers aws-glue-* paths. Create a custom policy attached to your crawler role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::new-analytics-bucket",
        "arn:aws:s3:::new-analytics-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:region:account:key/YOUR-KEY-ID"
    }
  ]
}

The kms:GenerateDataKey permission is often overlooked but necessary for Glue to write metadata.

S3 Bucket Policy: If the bucket is in a different account, the bucket policy must explicitly allow the Glue crawler role, not just the service principal (for a same-account bucket, the role’s IAM permissions are enough unless a Deny statement blocks them). Add this statement:

{
  "Sid": "AllowGlueCrawler",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::ACCOUNT:role/AWSGlueServiceRole-CrawlerName"
  },
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::new-analytics-bucket",
    "arn:aws:s3:::new-analytics-bucket/*"
  ]
}

If you have Deny statements in the bucket policy, ensure they don’t conflict. A common issue is having a Deny for non-SSL requests that accidentally blocks the Glue service.
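As an illustrative (not exhaustive) check, this sketch scans a bucket policy for Deny statements whose condition keys commonly trip up crawlers, such as encryption-header or VPC-endpoint requirements; the sample policy is fabricated for demonstration:

```python
# Condition keys that frequently block Glue crawlers when used in a Deny.
SUSPECT_KEYS = {"s3:x-amz-server-side-encryption", "aws:sourcevpce", "aws:sourcevpc"}

def suspicious_denies(bucket_policy: dict) -> list:
    """Return (Sid, condition key) pairs for Deny statements worth reviewing."""
    hits = []
    for stmt in bucket_policy.get("Statement", []):
        if stmt.get("Effect") != "Deny":
            continue
        for _operator, key_values in stmt.get("Condition", {}).items():
            for key in key_values:
                if key.lower() in SUSPECT_KEYS:
                    hits.append((stmt.get("Sid", "<no Sid>"), key))
    return hits

# Example: a policy that denies uploads lacking an SSE-KMS header.
policy = {
    "Statement": [{
        "Sid": "DenyUnencrypted",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::new-analytics-bucket/*",
        "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}},
    }]
}
print(suspicious_denies(policy))
```

Note that aws:SecureTransport is deliberately absent from the suspect list: Glue uses HTTPS, so an SSL-only Deny normally doesn't affect it.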

KMS Key Policy: This is the most commonly missed piece. Your KMS key policy must grant the Glue role permission to use the key. Add this statement to the key policy:

{
  "Sid": "Allow Glue to use the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::ACCOUNT:role/AWSGlueServiceRole-CrawlerName"
  },
  "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
  "Resource": "*"
}

Without this, even if the IAM role has kms:Decrypt permissions, the key policy will deny access.
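The evaluation logic can be sketched as a toy model (ignoring grants and cross-account cases): the key policy must either allow the role directly, or delegate to IAM by allowing the account root; only in the latter case does the role’s own kms:Decrypt permission take effect.

```python
def kms_access(iam_allows: bool, key_policy_allows_role: bool,
               key_policy_allows_account_root: bool) -> bool:
    """Simplified same-account KMS authorization: key policy allows the role
    directly, OR it delegates to IAM (account-root allow) and IAM allows."""
    return key_policy_allows_role or (key_policy_allows_account_root and iam_allows)

# IAM grants kms:Decrypt, but the key policy has neither statement: denied.
assert kms_access(True, False, False) is False
# Default key policy ("Enable IAM User Permissions" on the account root)
# plus the role's inline kms:Decrypt policy: allowed.
assert kms_access(True, False, True) is True
```

This is why a customer-managed key created with a locked-down policy behaves differently from one created with the console default, even when the IAM role is identical.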

Verification Steps:

  1. Test S3 access: Use the AWS CLI with the crawler role credentials to list bucket contents: `aws s3 ls s3://new-analytics-bucket/data/ --profile crawler-role`
  2. Test KMS access: Try to download (and therefore decrypt) a sample file using the role: `aws s3 cp s3://new-analytics-bucket/data/sample.parquet - --profile crawler-role`
  3. Check CloudTrail: Look for AccessDenied events from the Glue service to see which exact permission is failing
  4. Enable Glue crawler CloudWatch logs: In the crawler settings, enable logging to see detailed error messages
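For step 3, a small sketch like this can pull the AccessDenied events out of a CloudTrail export (for example, the output of `aws cloudtrail lookup-events`); the sample event below is fabricated to show the shape:

```python
def access_denied_events(events: list) -> list:
    """Extract (source, action, message) for events that failed with AccessDenied."""
    return [
        (e.get("eventSource"), e.get("eventName"), e.get("errorMessage"))
        for e in events
        if e.get("errorCode") == "AccessDenied"
    ]

sample = [{  # fabricated example of a CloudTrail event record
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "errorCode": "AccessDenied",
    "errorMessage": "... is not authorized to perform: kms:Decrypt ...",
}]
for source, action, message in access_denied_events(sample):
    print(source, action, message)
```

The eventSource field tells you immediately whether the failure is on the S3 side or the KMS side, which narrows down which of the three policies to fix.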

Additional Considerations:

  • If your Parquet files were written by Spark or other tools, ensure they’re using the correct KMS key for encryption
  • Verify the crawler’s exclude patterns aren’t too broad; _temporary/** should be fine, but double-check
  • For large datasets, crawls can run long; crawlers don’t expose a DPU setting the way Glue jobs do, so use a narrower include path or the crawler’s S3 sample-size option to keep runs manageable
  • If you have nested partitions (year/month/day structure), ensure the crawler is configured to detect partition keys automatically
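On the first bullet, you can spot-check which key each Parquet file was actually encrypted with. The response fields follow S3’s HeadObject API; the ARNs below are placeholders, and the boto3 call is commented out since it needs credentials:

```python
# In a real session: head = boto3.client("s3").head_object(Bucket=..., Key=...)
head = {  # sample HeadObject response fields for a KMS-encrypted object
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:123456789012:key/some-other-key",
}

expected_key = "arn:aws:kms:us-east-1:123456789012:key/bucket-key"  # placeholder

if head.get("ServerSideEncryption") == "aws:kms" and head.get("SSEKMSKeyId") != expected_key:
    print("Object encrypted with a different key:", head["SSEKMSKeyId"])
```

Spark writers can be pinned to a different key than the bucket default, so finding a mismatch here would explain why the crawler role's single-key KMS grant isn't enough.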

After making these changes, test the crawler on a small subset first (use a more specific include path like s3://new-analytics-bucket/data/year=2025/month=01/) before running it on the entire dataset.

I added kms:Decrypt to the crawler role for the KMS key ARN, but the crawler still fails with the same error. The bucket policy has a condition requiring aws:SecureTransport=true. Could that be blocking Glue? Do I need to modify the bucket policy to explicitly allow the Glue service principal?