Your load job is stuck in pending state due to a combination of factors related to how BigQuery handles large-scale parallel imports. Let me address each aspect systematically.
BigQuery Load Job Status: The PENDING state indicates BigQuery is performing pre-load validation and resource allocation. For a job loading 2000 CSV files totaling 500GB, this validation phase can legitimately take 1-3 hours. BigQuery needs to:
- Sample all files to validate schema consistency
- Estimate resource requirements for the load operation
- Schedule appropriate worker slots
- Perform format validation on CSV structure
With autodetect enabled on 2000 files, BigQuery must read portions of each file to infer the schema. This is your primary bottleneck.
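You can confirm what the job is doing while it sits in PENDING: the job's status object reports its current state and any errors surfaced so far. `JOB_ID` here is a placeholder for the id printed when you submitted the load.

```shell
# Inspect the pending job; the "status" section shows state and any errors
# (replace JOB_ID with the id printed when the load was submitted)
bq show --format=prettyjson -j JOB_ID
```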
Project Quota Limits: While you mentioned sufficient quota, verify these specific limits that affect load jobs:
bq show --project_id=your-project
Check these quotas specifically:
- Maximum bytes per load job (default 15TB, but can be lower)
- Concurrent load jobs per table (4 by default)
- Daily load jobs per table (1000 per day)
- On-demand query bytes per day (affects overall project resource allocation)
If your project has consumed significant quota earlier in the day, new jobs can sit in the PENDING state even when the load-specific limits above aren't exhausted, because the scheduler prioritizes based on overall project resource consumption.
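To see whether other work in the project is ahead of you in the queue, list recent jobs across all users; a backlog of PENDING jobs from other pipelines explains why a fresh load won't start:

```shell
# List recent jobs from every user in the project, with their states
bq ls -j -a --max_results=50
```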
Parallel File Loading: Loading 2000 files in a single job isn't inherently problematic, but the way you're doing it is inefficient. The wildcard pattern `gs://bucket/data/*.csv` with autodetect forces BigQuery to sequentially examine every file during validation. Optimize your load job:
# Supply the schema up front instead of using --autodetect
# (quote the wildcard so your shell doesn't try to expand it locally)
bq load --source_format=CSV \
  --schema=field1:STRING,field2:INTEGER,field3:TIMESTAMP \
  --max_bad_records=100 \
  project:dataset.table \
  'gs://bucket/data/*.csv'
Providing an explicit schema eliminates the validation bottleneck. BigQuery can begin loading immediately rather than spending hours inferring and validating schema across thousands of files.
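If the table is wide, the inline schema string gets unwieldy; `bq load` also accepts a JSON schema file via `--schema=/path/to/file`. A sketch using the same three placeholder fields as above:

```shell
# Write the schema to a JSON file (easier to review and version-control)
cat > /tmp/schema.json <<'EOF'
[
  {"name": "field1", "type": "STRING",    "mode": "NULLABLE"},
  {"name": "field2", "type": "INTEGER",   "mode": "NULLABLE"},
  {"name": "field3", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
EOF
bq load --source_format=CSV \
  --schema=/tmp/schema.json \
  --max_bad_records=100 \
  project:dataset.table \
  'gs://bucket/data/*.csv'
```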
For very large imports, use this parallel loading strategy instead:
- Split files into batches of 200-300 files each
- Submit multiple load jobs targeting the same table (BigQuery handles concurrency)
- Use explicit schema to skip validation phase
- Monitor jobs with `bq ls -j --max_results=50`
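The batching steps above can be sketched in shell. The bucket, table, and schema are the placeholders from the question, and `echo` only prints each command; drop it to actually submit the jobs (a single load job accepts a comma-separated list of URIs):

```shell
# Build the file list (normally: gsutil ls 'gs://bucket/data/*.csv');
# a synthetic 2000-file list stands in here so the batching is visible
rm -f /tmp/batch_*
printf 'gs://bucket/data/part-%04d.csv\n' $(seq 0 1999) > /tmp/all_files.txt
split -l 250 /tmp/all_files.txt /tmp/batch_   # 8 batches of 250 files each

for batch in /tmp/batch_*; do
  uris=$(paste -sd, "$batch")                 # comma-separated URI list
  echo bq load --source_format=CSV \
    --schema=field1:STRING,field2:INTEGER,field3:TIMESTAMP \
    --max_bad_records=100 \
    project:dataset.table \
    "$uris"
done
```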
This approach typically reduces total load time by 60-70% compared to single-job loads with autodetect. The key insight is that BigQuery’s parallel loading works best when you give it explicit instructions rather than forcing it to discover and validate everything automatically.
If your job has been pending for over 4 hours, cancel it with `bq cancel JOB_ID` and resubmit with an explicit schema. That alone will likely resolve your issue.