Cloud SQL PostgreSQL logical replication fails to stream changes after schema alterations

We have logical replication set up to stream changes from Cloud SQL PostgreSQL to Pub/Sub for our data pipeline. Replication works fine initially but breaks whenever we apply schema changes such as adding columns or modifying constraints.

Error in Cloud SQL logs:

ERROR: logical decoding found inconsistent state
DETAIL: Schema changed during replication slot usage
CONTEXT: slot "pubsub_replication", output plugin "pgoutput"

The Pub/Sub integration stops receiving messages after DDL operations. We have to manually drop and recreate the replication slot, which causes data loss during the gap. How should a logical replication setup handle schema changes without breaking the data pipeline? The impact is significant since we deploy database migrations weekly.

The issue is that pgoutput requires the publication to be recreated after schema changes that affect replicated tables; adding columns is particularly problematic. You need a schema change workflow that pauses replication, applies the DDL, updates the publication, then resumes.

We use Flyway for migrations but it just runs the SQL scripts without any replication awareness. The migrations happen during deployment windows and replication breaks silently until someone notices missing data downstream.

Here’s the complete solution for handling logical replication with schema changes:

Understanding the Problem:

PostgreSQL logical replication using pgoutput relies on a consistent schema definition cached for the replication slot. When you ALTER TABLE on a published table, that cached schema becomes stale, producing the "inconsistent state" error and breaking the replication stream.

Solution 1: Schema-Change-Aware Migration Workflow

Implement this process for DDL changes:

-- 1. Pause downstream consumers
-- 2. Note current LSN position
SELECT pg_current_wal_lsn();

-- 3. Drop publication (preserves slot)
DROP PUBLICATION IF EXISTS pubsub_publication;

-- 4. Apply schema changes
ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);

-- 5. Recreate publication with new schema
CREATE PUBLICATION pubsub_publication FOR TABLE orders, customers;

-- 6. Advance replication slot to current position
SELECT pg_replication_slot_advance('pubsub_replication', pg_current_wal_lsn());

-- 7. Resume consumers

This approach preserves the replication slot while refreshing the publication definition. Note that pg_replication_slot_advance skips any WAL generated in between, so this is only lossless if consumers are paused and no DML runs against the published tables during the window; the only skipped changes are then the DDL itself, which logical decoding does not stream anyway.
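The steps above can be scripted rather than run by hand. Here is a minimal sketch that builds the ordered SQL for the maintenance window; the function name is illustrative, and the caller is expected to execute each statement with their own database client (pausing and resuming consumers, steps 1 and 7, happens outside the database and is not modeled):

```python
def ddl_window_statements(publication, tables, slot, ddl):
    """Build the ordered SQL for a replication-safe schema change window.

    Mirrors steps 2-6 above: note the LSN, drop the publication (the
    slot is preserved), apply the DDL, recreate the publication, then
    advance the slot past the DDL so decoding resumes cleanly.
    """
    return [
        "SELECT pg_current_wal_lsn();",                # 2. note position
        f"DROP PUBLICATION IF EXISTS {publication};",  # 3. drop publication
        *ddl,                                          # 4. schema changes
        f"CREATE PUBLICATION {publication} FOR TABLE {', '.join(tables)};",
        f"SELECT pg_replication_slot_advance('{slot}', pg_current_wal_lsn());",
    ]

statements = ddl_window_statements(
    "pubsub_publication",
    ["orders", "customers"],
    "pubsub_replication",
    ["ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);"],
)
for sql in statements:
    print(sql)
```

Keeping the statement list in one place makes it easy to review the whole window before running it inside a single deployment step.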

Solution 2: Use Datastream for Pub/Sub Integration

Google Cloud Datastream is purpose-built for CDC from Cloud SQL to Pub/Sub and handles schema evolution automatically:

  • Automatically detects schema changes
  • Updates stream configuration without manual intervention
  • Provides at-least-once delivery guarantees
  • Includes schema evolution in Pub/Sub messages

Datastream eliminates the need for custom replication slot management. For your weekly migration cadence, this is the most reliable option.

Solution 3: Backward-Compatible Schema Changes

When possible, structure migrations to avoid replication breaks:

  • Add new columns as nullable (avoid NOT NULL constraints and volatile defaults)
  • Use separate migrations for DROP operations
  • Create new tables instead of altering existing ones
  • Use views to present schema changes to applications

Example safe migration pattern:

-- Safe: adds nullable column
ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);

-- Riskier: adds a column with a default; on PostgreSQL 10 and earlier
-- (or with a volatile default on any version) this rewrites the table
ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'pending';

Recommended Implementation for Your Case:

Given your weekly migration schedule and Flyway usage, implement a hybrid approach:

  1. Migrate to Datastream for Pub/Sub integration. This eliminates schema change issues entirely and provides better monitoring and error handling.

  2. If staying with logical replication, wrap your Flyway migrations:

    Create a pre-migration script:

    • Pause downstream Pub/Sub consumers
    • Drop and recreate publication after schema changes
    • Verify replication slot health
    • Resume consumers
  3. Add monitoring for replication lag:

    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
    FROM pg_replication_slots;

    Alert when lag exceeds a threshold (e.g., 100 MB).
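That lag query can feed a simple alerting check. A minimal sketch, assuming rows shaped like `(slot_name, lag_bytes)` from the query above; the function name and the 100 MB threshold are illustrative:

```python
LAG_THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MB, per the suggestion above

def slots_over_threshold(rows, threshold=LAG_THRESHOLD_BYTES):
    """Given (slot_name, lag_bytes) rows from pg_replication_slots,
    return the slot names whose lag exceeds the threshold.

    A lag of None (confirmed_flush_lsn is NULL) is treated as
    alert-worthy: the slot is retaining WAL without making progress.
    """
    return [name for name, lag in rows if lag is None or lag > threshold]

rows = [
    ("pubsub_replication", 12_582_912),  # ~12 MB: healthy
    ("stale_slot", 524_288_000),         # 500 MB: alert
]
print(slots_over_threshold(rows))  # → ['stale_slot']
```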

Logical Replication Setup Best Practices:

- Use separate replication slots for different consumers
- Set max_slot_wal_keep_size to prevent disk exhaustion
- Monitor slot lag and inactive slots
- Test schema changes in staging environment first
- Document which tables are replicated and to where
- Implement automated slot cleanup for failed replications
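The last two practices can be combined into a periodic health check. A sketch, assuming a `run_query` callable that executes SQL and returns rows (it stands in for whatever database client you use); the function name and the drop policy are illustrative, and the generated drops should be reviewed before execution:

```python
def find_dead_slots(run_query):
    """Return (slot_name, drop_statement) pairs for inactive logical slots.

    Inactive slots keep WAL pinned on disk, so abandoned ones should be
    dropped before they exhaust storage. This only builds the statements;
    executing them is left to the caller after review.
    """
    rows = run_query(
        "SELECT slot_name FROM pg_replication_slots "
        "WHERE slot_type = 'logical' AND NOT active;"
    )
    return [
        (name, f"SELECT pg_drop_replication_slot('{name}');")
        for (name,) in rows
    ]

# Example with a stub in place of a real database client:
fake_query = lambda sql: [("abandoned_slot",)]
for name, drop_sql in find_dead_slots(fake_query):
    print(name, "->", drop_sql)
```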

For Pub/Sub Integration Specifically:

If you must use logical replication instead of Datastream, implement these safeguards:

- Add schema version to Pub/Sub message attributes
- Include full row data (not just changes) during schema transitions
- Implement downstream schema registry to validate messages
- Use Cloud Functions to transform messages if schema drift occurs
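For the first safeguard, tagging each message with a schema version is straightforward. A minimal sketch; the `SCHEMA_VERSION` constant and attribute names are illustrative (the actual publish call to the Pub/Sub client is omitted, since only the payload/attribute shape matters here):

```python
import json

SCHEMA_VERSION = "2024.07.1"  # illustrative: bump alongside each Flyway migration

def build_message(row: dict, table: str):
    """Serialize a replicated row and tag it with schema metadata.

    Returns (data, attributes): a bytes payload plus a flat dict of
    string attributes that downstream consumers or a schema registry
    can validate before decoding the payload.
    """
    data = json.dumps(row, sort_keys=True).encode("utf-8")
    attributes = {
        "schema_version": SCHEMA_VERSION,
        "table": table,
    }
    return data, attributes

data, attrs = build_message({"order_id": 42, "discount_pct": None}, "orders")
print(attrs)  # → {'schema_version': '2024.07.1', 'table': 'orders'}
```

Because attributes travel outside the payload, consumers can reject or route mismatched versions without deserializing the row first.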

The root cause of these failures is that pgoutput doesn't support transparent schema evolution. You must either adopt Datastream (recommended for production reliability) or carefully coordinate migrations with replication management. At a weekly migration cadence, manual replication slot management will become a significant operational burden.

Have you considered using Datastream instead of custom logical replication? It’s specifically designed for CDC from Cloud SQL to Pub/Sub and handles schema evolution automatically. The Pub/Sub integration is built-in.

We’re using the pgoutput plugin with the publication/subscription model, configured to track specific tables. Should we be using wal2json instead for better schema change tolerance?

For weekly schema migrations, you definitely need an automated workflow. Manual intervention isn’t scalable. What’s your current migration process? Are you using a tool like Flyway or Liquibase that could be extended to handle replication coordination?