Lambda function times out when performing batch write operations to DynamoDB with large payloads

Our Lambda function processes S3 event notifications and writes records to DynamoDB using batch_write_item. It works fine for small files but times out (3 minute limit) when processing larger datasets. The function receives a list of items from S3, transforms them, and attempts to write 500-800 records in batches of 25.

Current implementation:

import boto3

table = boto3.resource('dynamodb').Table('your-table-name')

with table.batch_writer() as batch:
    for item in items:
        batch.put_item(Item=transform_item(item))

We’ve tried increasing memory from 512MB to 3GB which improved performance slightly but still hitting timeouts. The payload size for each item is around 2-3KB. Should we be chunking the data differently or is there a better approach for handling large batch operations in Lambda?

The issue is likely unprocessed items accumulating. DynamoDB's batch_write_item has throughput limits and returns unprocessed items when you exceed provisioned capacity. The boto3 batch_writer retries those automatically, but the retries take time. Check your DynamoDB table's WCU (write capacity units) - your writes may be getting throttled. Also, are you processing everything in one Lambda invocation?
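To confirm throttling, you can read the table's WriteThrottleEvents metric from CloudWatch. A minimal sketch (the table name is a placeholder, and this assumes standard AWS credentials are configured):

```python
from datetime import datetime, timedelta, timezone

def throttle_query(table_name, minutes=60):
    # GetMetricStatistics parameters for the table's WriteThrottleEvents metric.
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "WriteThrottleEvents",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Sum"],
    }

def write_throttle_events(table_name, minutes=60):
    # Sum throttled write events over the lookback window.
    import boto3  # deferred so throttle_query is usable without AWS access
    cloudwatch = boto3.client("cloudwatch")
    resp = cloudwatch.get_metric_statistics(**throttle_query(table_name, minutes))
    return sum(dp["Sum"] for dp in resp["Datapoints"])
```

A non-zero sum over the window your Lambda ran in confirms the throttling theory.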

For a complete solution, here's a proven pattern that addresses the three key issues: Lambda timeout constraints, DynamoDB batch write efficiency, and payload size management.

Architecture Pattern:

  1. Producer Lambda (S3 trigger) - Reads file, creates batches of 10-25 items, sends to SQS
  2. SQS Queue - Buffers work items; set the visibility timeout to at least the consumer Lambda's timeout (AWS recommends six times the function timeout for SQS event sources)
  3. Consumer Lambda (SQS trigger) - Processes batches and writes to DynamoDB

Producer Lambda optimization:

import boto3
import json

sqs = boto3.client('sqs')
QUEUE_URL = 'your-queue-url'
BATCH_SIZE = 25  # items per SQS message, not per DynamoDB write

# Chunk the transformed items and enqueue one message per chunk.
for i in range(0, len(items), BATCH_SIZE):
    batch = items[i:i + BATCH_SIZE]
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(batch))
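If producer throughput matters, sqs.send_message_batch can enqueue up to 10 messages per API call instead of one per call. A sketch along the same lines (the queue URL is a placeholder, as above):

```python
import json

def chunk(seq, size):
    # Yield successive fixed-size slices of a list.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def send_batches(items, queue_url, batch_size=25):
    # Pack items into SQS messages, then send the messages ten at a time
    # (send_message_batch accepts at most 10 entries per call).
    import boto3  # deferred so the chunk helper has no AWS dependency
    sqs = boto3.client("sqs")
    messages = [json.dumps(c) for c in chunk(items, batch_size)]
    for group in chunk(messages, 10):
        entries = [{"Id": str(i), "MessageBody": body}
                   for i, body in enumerate(group)]
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

For 800 items this drops the producer from 32 send_message calls to 4 send_message_batch calls.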

Consumer Lambda configuration:

  • Set batch size to 10 (SQS trigger receives up to 10 messages)
  • Reserved concurrency: 50-100 (controls parallel executions)
  • Timeout: 60 seconds (processes smaller chunks quickly)

DynamoDB write optimization:

def write_batch(items):
    # overwrite_by_pkeys de-duplicates items that share a partition key
    # within one batch; batch_writer retries unprocessed items for us.
    with table.batch_writer(overwrite_by_pkeys=['id']) as batch:
        for item in items:
            batch.put_item(Item=item)
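Putting it together, the consumer handler might look like the sketch below. It parses the SQS event, writes each message's items with the write_batch helper above, and reports per-message failures so SQS redelivers only the failed messages (this assumes the event source mapping has ReportBatchItemFailures enabled):

```python
import json

def parse_records(event):
    # Each SQS record body is one JSON-encoded batch of items.
    return [(r["messageId"], json.loads(r["body"])) for r in event["Records"]]

def handler(event, context):
    failures = []
    for message_id, items in parse_records(event):
        try:
            write_batch(items)  # the batch_writer helper above
        except Exception:
            # Report only this message; the rest of the batch is not retried.
            failures.append({"itemIdentifier": message_id})
    return {"batchItemFailures": failures}
```

Without ReportBatchItemFailures, one bad message would force SQS to redeliver all 10 messages in the poll batch.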

Key improvements this provides:

  1. Lambda Timeout Resolution: Each consumer invocation receives at most 10 SQS messages; at 25 items per message that is at most 250 writes, completing in 10-30 seconds instead of approaching the 3-minute limit.

  2. Payload Size Management: SQS messages stay under 256KB limit. If individual items are 2-3KB, batches of 25 keep messages around 75KB. For larger items, reduce batch size to 10-15.

  3. DynamoDB Batch Write Efficiency: The consumer Lambda's batch_writer handles retries automatically, and if a whole batch fails, the SQS visibility timeout ensures the message is redelivered. Set the DynamoDB table to on-demand, or provision WCU ≈ items per second × item size rounded up to the nearest KB (one WCU covers one write of up to 1KB per second).
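The capacity arithmetic in point 3 can be sketched as a one-liner:

```python
import math

def estimate_wcu(item_size_kb, items_per_second):
    # Each write consumes ceil(item_size / 1KB) write capacity units.
    return math.ceil(item_size_kb) * items_per_second
```

For the 2-3KB items in the question at, say, 100 writes per second, estimate_wcu(3, 100) gives 300 WCU.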

Additional Optimizations:

  • Enable SQS dead-letter queue for failed messages after 3 retries
  • Use Lambda reserved concurrency to prevent overwhelming DynamoDB
  • Add CloudWatch alarms on SQS ApproximateAgeOfOldestMessage (alert if queue backs up)
  • Consider DynamoDB PartiQL batch statements if you need conditional writes
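The dead-letter queue in the first bullet is wired up with a RedrivePolicy on the source queue. A sketch, assuming the DLQ already exists (the queue URL and ARN are placeholders):

```python
import json

def redrive_policy(dlq_arn, max_receives=3):
    # After max_receives failed deliveries, SQS moves the message to the DLQ.
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receives),
    })

def attach_dlq(queue_url, dlq_arn):
    import boto3  # deferred so redrive_policy has no AWS dependency
    sqs = boto3.client("sqs")
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": redrive_policy(dlq_arn)},
    )
```

With maxReceiveCount of 3, a batch that keeps failing lands in the DLQ for inspection instead of retrying forever.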

This pattern scales to millions of items while keeping individual Lambda executions fast and reliable. The SQS buffer absorbs traffic spikes and DynamoDB auto-scaling adapts to the sustained write rate.

On-demand mode doesn't eliminate throttling: a newly created on-demand table can serve up to roughly 4,000 writes per second, and traffic that more than doubles your previous peak can be throttled briefly while the table adapts. Even with auto-scaling there's an adaptation period. For your use case, I'd recommend using SQS as a buffer: have the first Lambda read S3 and send messages to SQS in batches, then use a second Lambda with an SQS trigger to process smaller chunks. This spreads the load and prevents timeouts since each invocation handles fewer items. You could also use Step Functions to orchestrate parallel Lambda executions if you need faster processing. The key is breaking the monolithic batch operation into manageable units that complete well under the timeout limit.

Yes, a single invocation processes the entire file. The table uses on-demand billing, so WCU shouldn't be the bottleneck, but I do see throttling metrics in CloudWatch. Would splitting into multiple Lambda invocations help? How would you recommend structuring that?