Cloud SQL PostgreSQL logical replication fails to stream changes after schema alterations

We have logical replication set up to stream changes from Cloud SQL PostgreSQL to Pub/Sub for our data pipeline. Replication works fine initially but breaks whenever we apply schema changes such as adding columns or modifying constraints.

Error in Cloud SQL logs:

ERROR: logical decoding found inconsistent state
DETAIL: Schema changed during replication slot usage
CONTEXT: slot "pubsub_replication", output plugin "pgoutput"

The Pub/Sub integration stops receiving messages after DDL operations. We have to manually drop and recreate the replication slot, which causes data loss during the gap. How should a logical replication setup handle schema changes without breaking the data pipeline? The impact is significant since we deploy database migrations weekly.

The issue is that pgoutput requires the publication to be recreated after schema changes that affect replicated tables; adding columns is particularly problematic. You need a schema change workflow that pauses replication, applies the DDL, updates the publication, then resumes.

We use Flyway for migrations but it just runs the SQL scripts without any replication awareness. The migrations happen during deployment windows and replication breaks silently until someone notices missing data downstream.

Here’s the complete solution for handling logical replication with schema changes:

Understanding the Problem:

PostgreSQL logical replication using pgoutput relies on a consistent schema definition cached for the replication slot. When you ALTER TABLE on a published table, that cached schema becomes stale, producing the "inconsistent state" error and breaking the replication stream.

Solution 1: Schema-Change-Aware Migration Workflow

Implement this process for DDL changes:

-- 1. Pause downstream consumers
-- 2. Note current LSN position
SELECT pg_current_wal_lsn();

-- 3. Drop publication (preserves slot)
DROP PUBLICATION IF EXISTS pubsub_publication;

-- 4. Apply schema changes
ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);

-- 5. Recreate publication with new schema
CREATE PUBLICATION pubsub_publication FOR TABLE orders, customers;

-- 6. Advance replication slot to current position
SELECT pg_replication_slot_advance('pubsub_replication', pg_current_wal_lsn());

-- 7. Resume consumers

This approach preserves the replication slot while refreshing the publication definition. Note that pg_replication_slot_advance skips any WAL generated in between, so this is only lossless if consumers are paused and no DML runs against the published tables during the window; the only skipped changes are then the DDL itself, which logical decoding does not stream anyway.
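The steps above can be scripted rather than run by hand. Here is a minimal sketch that builds the ordered SQL for the maintenance window; the function name is illustrative, and the caller is expected to execute each statement with their own database client (pausing and resuming consumers, steps 1 and 7, happens outside the database and is not modeled):

```python
def ddl_window_statements(publication, tables, slot, ddl):
    """Build the ordered SQL for a replication-safe schema change window.

    Mirrors steps 2-6 above: note the LSN, drop the publication (the
    slot is preserved), apply the DDL, recreate the publication, then
    advance the slot past the DDL so decoding resumes cleanly.
    """
    return [
        "SELECT pg_current_wal_lsn();",                # 2. note position
        f"DROP PUBLICATION IF EXISTS {publication};",  # 3. drop publication
        *ddl,                                          # 4. schema changes
        f"CREATE PUBLICATION {publication} FOR TABLE {', '.join(tables)};",
        f"SELECT pg_replication_slot_advance('{slot}', pg_current_wal_lsn());",
    ]

statements = ddl_window_statements(
    "pubsub_publication",
    ["orders", "customers"],
    "pubsub_replication",
    ["ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);"],
)
for sql in statements:
    print(sql)
```

Keeping the statement list in one place makes it easy to review the whole window before running it inside a single deployment step.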

Solution 2: Use Datastream for Pub/Sub Integration

Google Cloud Datastream is purpose-built for CDC from Cloud SQL to Pub/Sub and handles schema evolution automatically:

  • Automatically detects schema changes
  • Updates stream configuration without manual intervention
  • Provides at-least-once delivery guarantees
  • Includes schema evolution in Pub/Sub messages

Datastream eliminates the need for custom replication slot management. For your weekly migration cadence, this is the most reliable option.

Solution 3: Backward-Compatible Schema Changes

When possible, structure migrations to avoid replication breaks:

  • Add new columns as nullable (avoid NOT NULL constraints and volatile defaults)
  • Use separate migrations for DROP operations
  • Create new tables instead of altering existing ones
  • Use views to present schema changes to applications

Example safe migration pattern:

-- Safe: adds nullable column
ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);

-- Riskier: adds a column with a default; on PostgreSQL 10 and earlier
-- (or with a volatile default on any version) this rewrites the table
ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'pending';

Recommended Implementation for Your Case:

Given your weekly migration schedule and Flyway usage, implement a hybrid approach:

  1. Migrate to Datastream for Pub/Sub integration. This eliminates schema change issues entirely and provides better monitoring and error handling.

  2. If staying with logical replication, wrap your Flyway migrations:

    Create a pre-migration script:

    • Pause downstream Pub/Sub consumers
    • Drop and recreate publication after schema changes
    • Verify replication slot health
    • Resume consumers
  3. Add monitoring for replication lag:

    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
    FROM pg_replication_slots;

    Alert when lag exceeds a threshold (e.g., 100 MB).
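That lag query can feed a simple alerting check. A minimal sketch, assuming rows shaped like `(slot_name, lag_bytes)` from the query above; the function name and the 100 MB threshold are illustrative:

```python
LAG_THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MB, per the suggestion above

def slots_over_threshold(rows, threshold=LAG_THRESHOLD_BYTES):
    """Given (slot_name, lag_bytes) rows from pg_replication_slots,
    return the slot names whose lag exceeds the threshold.

    A lag of None (confirmed_flush_lsn is NULL) is treated as
    alert-worthy: the slot is retaining WAL without making progress.
    """
    return [name for name, lag in rows if lag is None or lag > threshold]

rows = [
    ("pubsub_replication", 12_582_912),  # ~12 MB: healthy
    ("stale_slot", 524_288_000),         # 500 MB: alert
]
print(slots_over_threshold(rows))  # → ['stale_slot']
```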

Logical Replication Setup Best Practices:

- Use separate replication slots for different consumers
- Set max_slot_wal_keep_size to prevent disk exhaustion
- Monitor slot lag and inactive slots
- Test schema changes in staging environment first
- Document which tables are replicated and to where
- Implement automated slot cleanup for failed replications
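The last two practices can be combined into a periodic health check. A sketch, assuming a `run_query` callable that executes SQL and returns rows (it stands in for whatever database client you use); the function name and the drop policy are illustrative, and the generated drops should be reviewed before execution:

```python
def find_dead_slots(run_query):
    """Return (slot_name, drop_statement) pairs for inactive logical slots.

    Inactive slots keep WAL pinned on disk, so abandoned ones should be
    dropped before they exhaust storage. This only builds the statements;
    executing them is left to the caller after review.
    """
    rows = run_query(
        "SELECT slot_name FROM pg_replication_slots "
        "WHERE slot_type = 'logical' AND NOT active;"
    )
    return [
        (name, f"SELECT pg_drop_replication_slot('{name}');")
        for (name,) in rows
    ]

# Example with a stub in place of a real database client:
fake_query = lambda sql: [("abandoned_slot",)]
for name, drop_sql in find_dead_slots(fake_query):
    print(name, "->", drop_sql)
```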

For Pub/Sub Integration Specifically:

If you must use logical replication instead of Datastream, implement these safeguards:

- Add schema version to Pub/Sub message attributes
- Include full row data (not just changes) during schema transitions
- Implement downstream schema registry to validate messages
- Use Cloud Functions to transform messages if schema drift occurs
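For the first safeguard, tagging each message with a schema version is straightforward. A minimal sketch; the `SCHEMA_VERSION` constant and attribute names are illustrative (the actual publish call to the Pub/Sub client is omitted, since only the payload/attribute shape matters here):

```python
import json

SCHEMA_VERSION = "2024.07.1"  # illustrative: bump alongside each Flyway migration

def build_message(row: dict, table: str):
    """Serialize a replicated row and tag it with schema metadata.

    Returns (data, attributes): a bytes payload plus a flat dict of
    string attributes that downstream consumers or a schema registry
    can validate before decoding the payload.
    """
    data = json.dumps(row, sort_keys=True).encode("utf-8")
    attributes = {
        "schema_version": SCHEMA_VERSION,
        "table": table,
    }
    return data, attributes

data, attrs = build_message({"order_id": 42, "discount_pct": None}, "orders")
print(attrs)  # → {'schema_version': '2024.07.1', 'table': 'orders'}
```

Because attributes travel outside the payload, consumers can reject or route mismatched versions without deserializing the row first.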

The root cause of these failures is that pgoutput doesn't support transparent schema evolution. You must either adopt Datastream (recommended for production reliability) or carefully coordinate migrations with replication management. At a weekly migration cadence, manual replication slot management will become a significant operational burden.

Have you considered using Datastream instead of custom logical replication? It’s specifically designed for CDC from Cloud SQL to Pub/Sub and handles schema evolution automatically. The Pub/Sub integration is built-in.

We’re using the pgoutput plugin with the publication/subscription model, configured to track specific tables. Should we be using wal2json instead for better schema change tolerance?

For weekly schema migrations, you definitely need an automated workflow. Manual intervention isn’t scalable. What’s your current migration process? Are you using a tool like Flyway or Liquibase that could be extended to handle replication coordination?