Here’s the complete solution for handling logical replication with schema changes:
Understanding the Problem:
PostgreSQL logical replication using pgoutput relies on consistent schema definitions in the replication slot. When you ALTER TABLE on a published table, the slot’s cached schema becomes stale, causing the “inconsistent state” error. The schema change impact breaks the replication stream.
Solution 1: Schema-Change-Aware Migration Workflow
Implement this process for DDL changes:
-- 1. Pause downstream consumers
-- 2. Note current LSN position
SELECT pg_current_wal_lsn();
-- 3. Drop publication (preserves slot)
DROP PUBLICATION IF EXISTS pubsub_publication;
-- 4. Apply schema changes
ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);
-- 5. Recreate publication with new schema
CREATE PUBLICATION pubsub_publication FOR TABLE orders, customers;
-- 6. Advance replication slot to current position
SELECT pg_replication_slot_advance('pubsub_replication', pg_current_wal_lsn());
-- 7. Resume consumers
This approach maintains the replication slot while refreshing the publication definition. No data loss occurs because you advance the slot to the current position after schema changes are complete.
Solution 2: Use Datastream for Pub/Sub Integration
Google Cloud Datastream is purpose-built for CDC from Cloud SQL to Pub/Sub and handles schema evolution automatically:
- Automatically detects schema changes
- Updates stream configuration without manual intervention
- Provides exactly-once delivery guarantees
- Includes schema evolution in Pub/Sub messages
Datastream eliminates the need for custom replication slot management. For your weekly migration cadence, this is the most reliable option.
Solution 3: Backward-Compatible Schema Changes
When possible, structure migrations to avoid replication breaks:
- Add new columns as nullable (don’t require defaults)
- Use separate migrations for DROP operations
- Create new tables instead of altering existing ones
- Use views to present schema changes to applications
Example safe migration pattern:
-- Safe: adds nullable column
ALTER TABLE orders ADD COLUMN discount_pct DECIMAL(5,2);
-- Unsafe: adds column with default (rewrites table)
ALTER TABLE orders ADD COLUMN status VARCHAR(20) DEFAULT 'pending';
Recommended Implementation for Your Case:
Given your weekly migration schedule and Flyway usage, implement a hybrid approach:
-
Migrate to Datastream for Pub/Sub integration. This eliminates schema change issues entirely and provides better monitoring and error handling.
-
If staying with logical replication, wrap your Flyway migrations:
Create a pre-migration script:
- Pause downstream Pub/Sub consumers
- Drop and recreate publication after schema changes
- Verify replication slot health
- Resume consumers
-
Add monitoring for replication lag:
SELECT slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
Alert when lag exceeds threshold (e.g., 100MB).
**Logical Replication Setup Best Practices:**
- Use separate replication slots for different consumers
- Set max_slot_wal_keep_size to prevent disk exhaustion
- Monitor slot lag and inactive slots
- Test schema changes in staging environment first
- Document which tables are replicated and to where
- Implement automated slot cleanup for failed replications
**For Pub/Sub Integration Specifically:**
If you must use logical replication instead of Datastream, implement these safeguards:
- Add schema version to Pub/Sub message attributes
- Include full row data (not just changes) during schema transitions
- Implement downstream schema registry to validate messages
- Use Cloud Functions to transform messages if schema drift occurs
The root cause of your breaks is that pgoutput doesn't support transparent schema evolution. You must either adopt Datastream (recommended for production reliability) or implement careful coordination between migrations and replication management. For weekly changes, the manual overhead of replication slot management will become a significant operational burden.