Your load job is stuck in pending state due to a combination of factors related to how BigQuery handles large-scale parallel imports. Let me address each aspect systematically.
BigQuery Load Job Status: The PENDING state indicates BigQuery is performing pre-load validation and resource allocation. For a job loading 2000 CSV files totaling 500GB, this validation phase can legitimately take 1-3 hours. BigQuery needs to:
- Sample all files to validate schema consistency
- Estimate resource requirements for the load operation
- Schedule appropriate worker slots
- Perform format validation on CSV structure
With autodetect enabled on 2000 files, BigQuery must read portions of each file to infer the schema. This is your primary bottleneck.
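You can confirm what the job is doing while it sits in PENDING: the job's status object reports its current state and any errors surfaced so far. `JOB_ID` here is a placeholder for the id printed when you submitted the load.

```shell
# Inspect the pending job; the "status" section shows state and any errors
# (replace JOB_ID with the id printed when the load was submitted)
bq show --format=prettyjson -j JOB_ID
```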
Project Quota Limits: While you mentioned sufficient quota, verify these specific limits that affect load jobs:
bq show --project_id=your-project
Check these quotas specifically:
- Maximum bytes per load job (default 15TB, but can be lower)
- Concurrent load jobs per table (4 by default)
- Daily load jobs per table (1000 per day)
- On-demand query bytes per day (affects overall project resource allocation)
If your project has consumed significant quota earlier in the day, new jobs can sit in the PENDING state even when the load-specific limits above aren't exhausted, because the scheduler prioritizes based on overall project resource consumption.
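To see whether other work in the project is ahead of you in the queue, list recent jobs across all users; a backlog of PENDING jobs from other pipelines explains why a fresh load won't start:

```shell
# List recent jobs from every user in the project, with their states
bq ls -j -a --max_results=50
```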
Parallel File Loading: Loading 2000 files in a single job isn't inherently problematic, but the way you're doing it is inefficient. The wildcard pattern `gs://bucket/data/*.csv` with autodetect forces BigQuery to sequentially examine every file during validation. Optimize your load job:
# Supply the schema up front instead of using --autodetect
# (quote the wildcard so your shell doesn't try to expand it locally)
bq load --source_format=CSV \
  --schema=field1:STRING,field2:INTEGER,field3:TIMESTAMP \
  --max_bad_records=100 \
  project:dataset.table \
  'gs://bucket/data/*.csv'
Providing an explicit schema eliminates the validation bottleneck. BigQuery can begin loading immediately rather than spending hours inferring and validating schema across thousands of files.
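If the table is wide, the inline schema string gets unwieldy; `bq load` also accepts a JSON schema file via `--schema=/path/to/file`. A sketch using the same three placeholder fields as above:

```shell
# Write the schema to a JSON file (easier to review and version-control)
cat > /tmp/schema.json <<'EOF'
[
  {"name": "field1", "type": "STRING",    "mode": "NULLABLE"},
  {"name": "field2", "type": "INTEGER",   "mode": "NULLABLE"},
  {"name": "field3", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
EOF
bq load --source_format=CSV \
  --schema=/tmp/schema.json \
  --max_bad_records=100 \
  project:dataset.table \
  'gs://bucket/data/*.csv'
```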
For very large imports, use this parallel loading strategy instead:
- Split files into batches of 200-300 files each
- Submit multiple load jobs targeting the same table (BigQuery handles concurrency)
- Use explicit schema to skip validation phase
- Monitor jobs with `bq ls -j --max_results=50`
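The batching steps above can be sketched in shell. The bucket, table, and schema are the placeholders from the question, and `echo` only prints each command; drop it to actually submit the jobs (a single load job accepts a comma-separated list of URIs):

```shell
# Build the file list (normally: gsutil ls 'gs://bucket/data/*.csv');
# a synthetic 2000-file list stands in here so the batching is visible
rm -f /tmp/batch_*
printf 'gs://bucket/data/part-%04d.csv\n' $(seq 0 1999) > /tmp/all_files.txt
split -l 250 /tmp/all_files.txt /tmp/batch_   # 8 batches of 250 files each

for batch in /tmp/batch_*; do
  uris=$(paste -sd, "$batch")                 # comma-separated URI list
  echo bq load --source_format=CSV \
    --schema=field1:STRING,field2:INTEGER,field3:TIMESTAMP \
    --max_bad_records=100 \
    project:dataset.table \
    "$uris"
done
```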
This approach typically reduces total load time by 60-70% compared to single-job loads with autodetect. The key insight is that BigQuery’s parallel loading works best when you give it explicit instructions rather than forcing it to discover and validate everything automatically.
If your job has been pending for over 4 hours, cancel it with `bq cancel JOB_ID` and resubmit with an explicit schema. That alone will likely resolve your issue.