BigQuery streaming insert fails for IoT telemetry due to schema mismatch in data-storage pipeline

Our IoT telemetry pipeline streaming into BigQuery is failing with schema mismatch errors, causing data loss in our analytics. Devices send sensor readings via Pub/Sub to Dataflow, which then streams into BigQuery tables.

Error from Dataflow logs:


Invalid schema update. Field temperature changed type from FLOAT to STRING
Table: iot_telemetry.sensor_readings
Rejected rows: 3,847 in last hour

The issue started after we deployed new firmware to a subset of devices that now sends temperature values with unit suffixes (e.g., “72.5F” instead of 72.5). BigQuery schema evolution doesn’t handle this type change automatically, and we’re losing critical telemetry data. We need both historical float data and new string format data for analytics. How can we handle schema evolution for IoT telemetry ingestion without data loss while maintaining Dataflow pipeline integration?

Add new columns rather than changing types. Keep temperature as FLOAT for backward compatibility, add temperature_raw as STRING for the original device value, and temperature_unit as STRING. Your Dataflow pipeline can populate all three fields: for old devices sending bare floats, populate temperature and default temperature_unit to ‘F’; for new devices, parse temperature_raw into temperature and temperature_unit. This keeps existing queries working while preserving all telemetry data.
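A minimal sketch of that population logic (the function name and the 'F' default are assumptions; your Dataflow transform would call something like this per record):

```python
import re

def populate_fields(value):
    """Sketch of the three-column population described above.

    `value` is the device's temperature field: a bare float from old
    firmware, or a string like "72.5F" from the new firmware.
    """
    if isinstance(value, (int, float)):
        # Old devices: keep the float, default the unit (assumed 'F').
        return {"temperature": float(value),
                "temperature_raw": str(value),
                "temperature_unit": "F"}
    # New devices: split "72.5F" into numeric value and unit suffix.
    match = re.fullmatch(r"(-?\d+\.?\d*)([A-Z]+)", value.strip())
    if match is None:
        raise ValueError(f"unparseable temperature: {value!r}")
    return {"temperature": float(match.group(1)),
            "temperature_raw": value,
            "temperature_unit": match.group(2)}
```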

Let me provide a comprehensive solution covering BigQuery schema evolution, IoT telemetry ingestion, and Dataflow pipeline integration.

BigQuery Schema Evolution Strategy:

The key is to design your schema for forward compatibility from the start. For your immediate issue:

  1. Add New Columns (don’t modify existing ones):

    • Keep temperature as FLOAT64 (existing column)
    • Add temperature_raw as STRING (original device value)
    • Add temperature_unit as STRING (extracted unit: F, C, K)
    • Add schema_version as INTEGER (track data format versions)
  2. Update Table Schema:

ALTER TABLE iot_telemetry.sensor_readings
ADD COLUMN temperature_raw STRING,
ADD COLUMN temperature_unit STRING,
ADD COLUMN schema_version INT64;

This preserves existing data and queries while supporting new formats.

IoT Telemetry Ingestion Transformation:

Modify your Dataflow pipeline to normalize data at ingestion:

# Pseudocode - Temperature normalization logic:
1. Read raw telemetry message from Pub/Sub
2. Check if temperature field contains unit suffix (regex: \d+\.?\d*[A-Z])
3. If numeric only: temperature=value, temperature_raw=str(value), unit='F'
4. If with unit: parse value and unit, convert to float, store both
5. Set schema_version based on device firmware version
6. Write transformed record to BigQuery with all fields populated
# Handle parsing errors with dead-letter pattern
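A runnable sketch of that logic in plain Python (field names and the schema_version values are assumptions; in the real pipeline this would live inside a Beam DoFn, with the Pub/Sub read and BigQuery write around it):

```python
import re

# Optional sign, a number, optional whitespace, optional unit letters.
UNIT_PATTERN = re.compile(r"^(-?\d+\.?\d*)\s*([A-Z]*)$")

def transform_reading(message):
    """Normalize one telemetry message into a BigQuery-ready row.

    Returns (row, None) on success, or (None, error) so the caller can
    route the failure to a dead-letter topic instead of dropping it.
    """
    raw = str(message.get("temperature", "")).strip()
    match = UNIT_PATTERN.match(raw)
    if match is None:
        return None, f"unparseable temperature: {raw!r}"
    row = dict(message)
    row["temperature"] = float(match.group(1))
    row["temperature_raw"] = raw
    row["temperature_unit"] = match.group(2) or "F"   # assumed default unit
    row["schema_version"] = 2 if match.group(2) else 1
    return row, None
```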

Dataflow Pipeline Integration:

Implement a robust transformation pipeline:

  1. Parsing DoFn:

    • Use regex to detect value format: `^([0-9.]+)([A-Z]*)$`
    • Extract numeric value and optional unit
    • Handle edge cases: negative values, scientific notation, missing data
  2. Error Handling:

    • Wrap parsing in try-catch blocks
    • Route unparseable records to dead-letter Pub/Sub topic
    • Log schema version and device ID for troubleshooting
    • Send alerts when error rate exceeds threshold (e.g., 1%)
  3. Dead-Letter Table:

    • Create iot_telemetry.failed_inserts table
    • Store original payload, error message, timestamp, device_id
    • Set up daily job to replay fixed records
  4. Unit Conversion:

    • Standardize all temperatures to a base unit (e.g., Celsius)
    • Store original unit for audit trail
    • Add temperature_normalized column for analytics
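For item 4, a conversion helper might look like this (Celsius assumed as the base unit; the F/C/K codes match the temperature_unit values above):

```python
def to_celsius(value, unit):
    """Convert a reading to the assumed base unit (Celsius) for the
    temperature_normalized column. Unknown units raise, so the caller
    can dead-letter the record instead of storing a bad number."""
    conversions = {
        "C": lambda v: v,
        "F": lambda v: (v - 32.0) * 5.0 / 9.0,
        "K": lambda v: v - 273.15,
    }
    if unit not in conversions:
        raise ValueError(f"unknown temperature unit: {unit!r}")
    return conversions[unit](value)
```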

Implementation Steps:

  1. Update BigQuery Schema (zero downtime):

    • Add new columns as nullable
    • Existing queries continue working
    • New queries can use additional fields
  2. Deploy Updated Dataflow Pipeline:

    • Test transformation logic with sample data
    • Deploy with canary pattern (10% traffic first)
    • Monitor error rates and latency
  3. Backfill Historical Data (optional):

    • Run batch job to populate new columns for old records
    • Set `temperature_raw = CAST(temperature AS STRING)`
    • Set temperature_unit = 'F' (or your default)
    • Set schema_version = 1 for legacy data
  4. Update Analytics Queries:

    • Use temperature column for numeric operations (maintains compatibility)
    • Join with unit column when displaying values
    • Filter by schema_version if needed for specific analyses
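The optional backfill in step 3 can be a single statement (assuming the columns were added as shown earlier and a NULL schema_version marks untouched legacy rows):

```sql
UPDATE iot_telemetry.sensor_readings
SET temperature_raw = CAST(temperature AS STRING),
    temperature_unit = 'F',
    schema_version = 1
WHERE schema_version IS NULL;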

Best Practices for Schema Evolution:

  • Version Your Schemas: Include schema_version in every record
  • Additive Changes Only: Never remove or change existing columns
  • Default Values: Use nullable columns or provide defaults for backward compatibility
  • Documentation: Maintain schema changelog in BigQuery table descriptions
  • Validation: Implement schema validation at ingestion time
  • Monitoring: Alert on unexpected schema versions or high rejection rates
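As one way to implement the validation point, a field/type check before the BigQuery write (the required fields and types here are assumptions about your schema):

```python
REQUIRED_FIELDS = {
    "device_id": str,
    "temperature": float,
    "temperature_unit": str,
    "schema_version": int,
}

def validate_row(row):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in row or row[field] is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(row[field]).__name__}")
    return errors
```

Rows with a non-empty error list go to the dead-letter path rather than the main table.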

Dataflow Pipeline Optimization:

  • Enable autoscaling: `--autoscalingAlgorithm=THROUGHPUT_BASED`
  • Set appropriate worker machine types based on transformation complexity
  • Use streaming inserts with batching: `--numStreamingKeys=10000`
  • Configure BigQuery write disposition: `WRITE_APPEND`
  • Set up Cloud Monitoring (formerly Stackdriver) for pipeline health
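For a Java Beam pipeline, those options are passed at launch roughly like this (the main class and project names are placeholders):

```shell
mvn compile exec:java -Dexec.mainClass=com.example.TelemetryPipeline \
  -Dexec.args="--runner=DataflowRunner --project=my-project \
    --region=us-central1 --streaming=true \
    --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=10"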

Testing Strategy:

  1. Create test table with new schema
  2. Send sample payloads from both old and new firmware
  3. Verify all fields populate correctly
  4. Check analytics queries return expected results
  5. Test dead-letter routing with intentionally malformed data
  6. Validate performance under production-like load
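Steps 2–4 can start as a plain unit test against the parsing logic before any Dataflow deployment. The inline `parse_temperature` here is a stand-in for whatever your DoFn actually uses:

```python
import re
import unittest

def parse_temperature(raw):
    """Stand-in for the pipeline's parsing logic: returns (value, unit)."""
    match = re.fullmatch(r"(-?\d+\.?\d*)([A-Z]*)", str(raw).strip())
    if match is None:
        raise ValueError(f"unparseable: {raw!r}")
    return float(match.group(1)), match.group(2) or "F"

class ParseTemperatureTest(unittest.TestCase):
    def test_old_firmware_bare_float(self):
        self.assertEqual(parse_temperature(72.5), (72.5, "F"))

    def test_new_firmware_unit_suffix(self):
        self.assertEqual(parse_temperature("72.5F"), (72.5, "F"))
        self.assertEqual(parse_temperature("22.1C"), (22.1, "C"))

    def test_malformed_goes_to_dead_letter(self):
        with self.assertRaises(ValueError):
            parse_temperature("N/A")
```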

This approach gives you flexible IoT telemetry ingestion that handles schema evolution gracefully while maintaining data integrity and query compatibility. The Dataflow pipeline integration ensures reliable transformation and error handling for all device firmware versions.

Consider using BigQuery’s JSON type for flexible schema evolution if your telemetry structure varies frequently. Store the raw device payload as JSON, then use SQL JSON functions to extract fields. This gives you flexibility for IoT telemetry ingestion as device schemas evolve, though it’s less efficient for high-volume analytics queries. For your case with temperature specifically, the multi-column approach mentioned earlier is better.
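For illustration, with a raw JSON payload column the extraction might look like this (table, column, and JSON field names are assumed):

```sql
SELECT
  device_id,
  SAFE_CAST(JSON_VALUE(payload, '$.temperature') AS FLOAT64) AS temperature,
  JSON_VALUE(payload, '$.unit') AS temperature_unit
FROM iot_telemetry.raw_payloads;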

That makes sense for new data, but what about the schema itself? Do I need to create a new table version or can I add columns to the existing table? We have months of historical data and downstream dashboards that query the current table structure.