BigQuery streaming insert fails for IoT telemetry due to schema mismatch in data-storage pipeline

Our IoT telemetry pipeline streaming into BigQuery is failing with schema mismatch errors, causing data loss in our analytics. Devices send sensor readings via Pub/Sub to Dataflow, which then streams into BigQuery tables.

Error from Dataflow logs:


Invalid schema update. Field temperature changed type from FLOAT to STRING
Table: iot_telemetry.sensor_readings
Rejected rows: 3,847 in last hour

The issue started after we deployed new firmware to a subset of devices that now sends temperature values with unit suffixes (e.g., “72.5F” instead of 72.5). BigQuery schema evolution doesn’t handle this type change automatically, and we’re losing critical telemetry data. We need both historical float data and new string format data for analytics. How can we handle schema evolution for IoT telemetry ingestion without data loss while maintaining Dataflow pipeline integration?

Add new columns rather than changing types. Keep temperature as FLOAT for backward compatibility, add temperature_raw as STRING for the original device value, and temperature_unit as STRING. Your Dataflow pipeline can populate all three fields: for old devices sending bare floats, populate temperature and default temperature_unit to ‘F’; for new devices, parse temperature_raw into temperature and temperature_unit. This keeps existing queries working while preserving all telemetry data.
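A minimal sketch of that population logic (the function name and the 'F' default are assumptions; your Dataflow transform would call something like this per record):

```python
import re

def populate_fields(value):
    """Sketch of the three-column population described above.

    `value` is the device's temperature field: a bare float from old
    firmware, or a string like "72.5F" from the new firmware.
    """
    if isinstance(value, (int, float)):
        # Old devices: keep the float, default the unit (assumed 'F').
        return {"temperature": float(value),
                "temperature_raw": str(value),
                "temperature_unit": "F"}
    # New devices: split "72.5F" into numeric value and unit suffix.
    match = re.fullmatch(r"(-?\d+\.?\d*)([A-Z]+)", value.strip())
    if match is None:
        raise ValueError(f"unparseable temperature: {value!r}")
    return {"temperature": float(match.group(1)),
            "temperature_raw": value,
            "temperature_unit": match.group(2)}
```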

Let me provide a comprehensive solution covering BigQuery schema evolution, IoT telemetry ingestion, and Dataflow pipeline integration.

BigQuery Schema Evolution Strategy:

The key is to design your schema for forward compatibility from the start. For your immediate issue:

  1. Add New Columns (don’t modify existing ones):

    • Keep temperature as FLOAT64 (existing column)
    • Add temperature_raw as STRING (original device value)
    • Add temperature_unit as STRING (extracted unit: F, C, K)
    • Add schema_version as INTEGER (track data format versions)
  2. Update Table Schema:

ALTER TABLE iot_telemetry.sensor_readings
ADD COLUMN temperature_raw STRING,
ADD COLUMN temperature_unit STRING,
ADD COLUMN schema_version INT64;

This preserves existing data and queries while supporting new formats.

IoT Telemetry Ingestion Transformation:

Modify your Dataflow pipeline to normalize data at ingestion:

# Pseudocode - Temperature normalization logic:
1. Read raw telemetry message from Pub/Sub
2. Check if temperature field contains unit suffix (regex: \d+\.?\d*[A-Z])
3. If numeric only: temperature=value, temperature_raw=str(value), unit='F'
4. If with unit: parse value and unit, convert to float, store both
5. Set schema_version based on device firmware version
6. Write transformed record to BigQuery with all fields populated
# Handle parsing errors with dead-letter pattern
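A runnable sketch of that logic in plain Python (field names and the schema_version values are assumptions; in the real pipeline this would live inside a Beam DoFn, with the Pub/Sub read and BigQuery write around it):

```python
import re

# Optional sign, a number, optional whitespace, optional unit letters.
UNIT_PATTERN = re.compile(r"^(-?\d+\.?\d*)\s*([A-Z]*)$")

def transform_reading(message):
    """Normalize one telemetry message into a BigQuery-ready row.

    Returns (row, None) on success, or (None, error) so the caller can
    route the failure to a dead-letter topic instead of dropping it.
    """
    raw = str(message.get("temperature", "")).strip()
    match = UNIT_PATTERN.match(raw)
    if match is None:
        return None, f"unparseable temperature: {raw!r}"
    row = dict(message)
    row["temperature"] = float(match.group(1))
    row["temperature_raw"] = raw
    row["temperature_unit"] = match.group(2) or "F"   # assumed default unit
    row["schema_version"] = 2 if match.group(2) else 1
    return row, None
```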

Dataflow Pipeline Integration:

Implement a robust transformation pipeline:

  1. Parsing DoFn:

    • Use regex to detect value format: `^([0-9.]+)([A-Z]*)$`
    • Extract numeric value and optional unit
    • Handle edge cases: negative values, scientific notation, missing data
  2. Error Handling:

    • Wrap parsing in try-catch blocks
    • Route unparseable records to dead-letter Pub/Sub topic
    • Log schema version and device ID for troubleshooting
    • Send alerts when error rate exceeds threshold (e.g., 1%)
  3. Dead-Letter Table:

    • Create iot_telemetry.failed_inserts table
    • Store original payload, error message, timestamp, device_id
    • Set up daily job to replay fixed records
  4. Unit Conversion:

    • Standardize all temperatures to a base unit (e.g., Celsius)
    • Store original unit for audit trail
    • Add temperature_normalized column for analytics
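For item 4, a conversion helper might look like this (Celsius assumed as the base unit; the F/C/K codes match the temperature_unit values above):

```python
def to_celsius(value, unit):
    """Convert a reading to the assumed base unit (Celsius) for the
    temperature_normalized column. Unknown units raise, so the caller
    can dead-letter the record instead of storing a bad number."""
    conversions = {
        "C": lambda v: v,
        "F": lambda v: (v - 32.0) * 5.0 / 9.0,
        "K": lambda v: v - 273.15,
    }
    if unit not in conversions:
        raise ValueError(f"unknown temperature unit: {unit!r}")
    return conversions[unit](value)
```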

Implementation Steps:

  1. Update BigQuery Schema (zero downtime):

    • Add new columns as nullable
    • Existing queries continue working
    • New queries can use additional fields
  2. Deploy Updated Dataflow Pipeline:

    • Test transformation logic with sample data
    • Deploy with canary pattern (10% traffic first)
    • Monitor error rates and latency
  3. Backfill Historical Data (optional):

    • Run batch job to populate new columns for old records
    • Set `temperature_raw = CAST(temperature AS STRING)`
    • Set temperature_unit = 'F' (or your default)
    • Set schema_version = 1 for legacy data
  4. Update Analytics Queries:

    • Use temperature column for numeric operations (maintains compatibility)
    • Join with unit column when displaying values
    • Filter by schema_version if needed for specific analyses
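The optional backfill in step 3 can be a single statement (assuming the columns were added as shown earlier and a NULL schema_version marks untouched legacy rows):

```sql
UPDATE iot_telemetry.sensor_readings
SET temperature_raw = CAST(temperature AS STRING),
    temperature_unit = 'F',
    schema_version = 1
WHERE schema_version IS NULL;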

Best Practices for Schema Evolution:

  • Version Your Schemas: Include schema_version in every record
  • Additive Changes Only: Never remove or change existing columns
  • Default Values: Use nullable columns or provide defaults for backward compatibility
  • Documentation: Maintain schema changelog in BigQuery table descriptions
  • Validation: Implement schema validation at ingestion time
  • Monitoring: Alert on unexpected schema versions or high rejection rates
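As one way to implement the validation point, a field/type check before the BigQuery write (the required fields and types here are assumptions about your schema):

```python
REQUIRED_FIELDS = {
    "device_id": str,
    "temperature": float,
    "temperature_unit": str,
    "schema_version": int,
}

def validate_row(row):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in row or row[field] is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(row[field]).__name__}")
    return errors
```

Rows with a non-empty error list go to the dead-letter path rather than the main table.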

Dataflow Pipeline Optimization:

  • Enable autoscaling: `--autoscalingAlgorithm=THROUGHPUT_BASED`
  • Set appropriate worker machine types based on transformation complexity
  • Use streaming inserts with batching: `--numStreamingKeys=10000`
  • Configure BigQuery write disposition: `WRITE_APPEND`
  • Set up Cloud Monitoring (formerly Stackdriver) for pipeline health
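For a Java Beam pipeline, those options are passed at launch roughly like this (the main class and project names are placeholders):

```shell
mvn compile exec:java -Dexec.mainClass=com.example.TelemetryPipeline \
  -Dexec.args="--runner=DataflowRunner --project=my-project \
    --region=us-central1 --streaming=true \
    --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=10"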

Testing Strategy:

  1. Create test table with new schema
  2. Send sample payloads from both old and new firmware
  3. Verify all fields populate correctly
  4. Check analytics queries return expected results
  5. Test dead-letter routing with intentionally malformed data
  6. Validate performance under production-like load
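Steps 2–4 can start as a plain unit test against the parsing logic before any Dataflow deployment. The inline `parse_temperature` here is a stand-in for whatever your DoFn actually uses:

```python
import re
import unittest

def parse_temperature(raw):
    """Stand-in for the pipeline's parsing logic: returns (value, unit)."""
    match = re.fullmatch(r"(-?\d+\.?\d*)([A-Z]*)", str(raw).strip())
    if match is None:
        raise ValueError(f"unparseable: {raw!r}")
    return float(match.group(1)), match.group(2) or "F"

class ParseTemperatureTest(unittest.TestCase):
    def test_old_firmware_bare_float(self):
        self.assertEqual(parse_temperature(72.5), (72.5, "F"))

    def test_new_firmware_unit_suffix(self):
        self.assertEqual(parse_temperature("72.5F"), (72.5, "F"))
        self.assertEqual(parse_temperature("22.1C"), (22.1, "C"))

    def test_malformed_goes_to_dead_letter(self):
        with self.assertRaises(ValueError):
            parse_temperature("N/A")
```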

This approach gives you flexible IoT telemetry ingestion that handles schema evolution gracefully while maintaining data integrity and query compatibility. The Dataflow pipeline integration ensures reliable transformation and error handling for all device firmware versions.

Consider using BigQuery’s JSON type for flexible schema evolution if your telemetry structure varies frequently. Store the raw device payload as JSON, then use SQL JSON functions to extract fields. This gives you flexibility for IoT telemetry ingestion as device schemas evolve, though it’s less efficient for high-volume analytics queries. For your case with temperature specifically, the multi-column approach mentioned earlier is better.
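For illustration, with a raw JSON payload column the extraction might look like this (table, column, and JSON field names are assumed):

```sql
SELECT
  device_id,
  SAFE_CAST(JSON_VALUE(payload, '$.temperature') AS FLOAT64) AS temperature,
  JSON_VALUE(payload, '$.unit') AS temperature_unit
FROM iot_telemetry.raw_payloads;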

That makes sense for new data, but what about the schema itself? Do I need to create a new table version or can I add columns to the existing table? We have months of historical data and downstream dashboards that query the current table structure.