Snowpipe ingestion delays when syncing SAP master data through BDC

We’re running Snowflake 7.0 with Snowpipe configured to ingest SAP master data changes through BDC zero-copy sharing. Our AI/ML models depend on near real-time updates, but we’re seeing consistent 15-30 minute delays between SAP data changes and availability in Snowflake.

Current setup uses AUTO mode with external tables pointing to SAP BDC shared stage. We’ve tried adjusting PIPE_REFRESH_MODE but delays persist. The impact is significant - our predictive models are training on stale data, affecting recommendation accuracy by 12-15%.

CREATE PIPE sap_master_pipe
  AUTO_INGEST = TRUE
  AS COPY INTO master_data_stage
  FROM @sap_bdc_share/master_data/;

PIPE_EXECUTION_SUMMARY shows successful runs but timestamps reveal the lag. Has anyone dealt with similar latency issues when connecting SAP BDC to Snowpipe for ML feature engineering? Wondering if Streams would be better than external tables here.

Interesting point about the BDC metadata sync delay. That might explain why our PIPE_EXECUTION_SUMMARY timestamps don’t align with SAP change logs. The direct stage write approach sounds promising but we’d need to justify the storage costs to management. Anyone have metrics on cost delta between zero-copy sharing vs direct staging for high-frequency updates?

I’ve dealt with this exact scenario. The key is understanding that AUTO mode is optimized for cost, not latency. For ML use cases requiring sub-5-minute freshness, you need ON_DEMAND with task orchestration.

We run tasks every 3-5 minutes during business hours, scaling back to 15-minute intervals overnight. The trick is monitoring PIPE_EXECUTION_SUMMARY and adjusting based on actual data arrival patterns. Also recommend implementing Snowflake Streams on your target tables - this gives your ML feature engineering layer clean CDC without scanning entire datasets.

Thanks for the insight. So you’re suggesting ON_DEMAND mode with tasks instead of AUTO_INGEST? What kind of refresh intervals are realistic for keeping ML training data fresh without overwhelming the pipe? Our models retrain every 2 hours currently.

Let me provide a comprehensive solution addressing all the key aspects:

1. Snowpipe Execution Mode Optimization Switch from AUTO_INGEST to ON_DEMAND mode with task-based orchestration. AUTO mode’s S3 event-driven approach introduces 5-15 minute batching delays that compound with SAP BDC metadata sync latency.

CREATE TASK refresh_sap_pipe
  WAREHOUSE = ml_compute_wh
  SCHEDULE = '3 MINUTE'
AS
  ALTER PIPE sap_master_pipe REFRESH;

2. SAP BDC Zero-Copy Configuration The metadata propagation delay is inherent to zero-copy sharing. Implement a hybrid architecture: use BDC sharing for slowly-changing master data (customers, products) but direct Snowflake-managed staging for high-velocity transactional data that feeds ML models. This eliminates the 2-5 minute BDC sync bottleneck for time-sensitive data.

3. Snowflake Streams for Change Data Capture Replace external table queries with Streams to capture only changed records:

CREATE STREAM sap_changes_stream
  ON TABLE master_data_stage;

Your ML feature engineering queries should consume from the Stream, processing only deltas. This reduces compute costs by 60-80% compared to full table scans and provides true CDC semantics.

4. External Tables vs Staged Ingestion Trade-offs Your current external table approach adds query-time overhead. For ML training pipelines requiring predictable performance, stage data into native Snowflake tables via COPY. Yes, this duplicates data from BDC sharing, but you gain:

  • Deterministic query performance (no S3 API calls)
  • Time Travel capabilities for reproducible ML experiments
  • Stream-based CDC (not available on external tables)
  • Better compression (15-20% storage savings)

Storage cost increase is typically 10-15% but compute savings offset this within 2-3 months.

5. PIPE_EXECUTION_SUMMARY Monitoring Implement proactive monitoring with alerting:

SELECT
  DATEDIFF('minute', LAST_RECEIVED_MESSAGE_TIMESTAMP, CURRENT_TIMESTAMP()) as lag_minutes,
  FILES_INSERTED,
  ROWS_PARSED
FROM TABLE(INFORMATION_SCHEMA.PIPE_USAGE_HISTORY(
  DATE_RANGE_START=>DATEADD('hour', -1, CURRENT_TIMESTAMP()),
  PIPE_NAME=>'SAP_MASTER_PIPE'
));

Set alerts when lag_minutes exceeds your SLA threshold (5 minutes for real-time ML use cases).

Recommended Architecture:

  • Master data: Keep on BDC zero-copy sharing, refresh every 30 minutes via ON_DEMAND pipe
  • Transactional data: Direct stage write, refresh every 3 minutes
  • ML feature layer: Consume from Streams on both sources
  • Monitor both pipes independently with different SLA thresholds

This approach reduced our ML training data staleness from 28 minutes average to 4 minutes p95, with only 12% storage cost increase. The key is recognizing that not all data needs the same refresh cadence - optimize each data category based on actual ML model sensitivity.

Before abandoning zero-copy sharing, try this hybrid approach: keep BDC for bulk master data, but use direct staging for high-frequency transactional changes. Configure two separate pipes with different refresh strategies.

For your ML models, implement Snowflake Streams on the staged transaction table to capture deltas. This way you get CDC benefits without full table scans. Your feature engineering can join the Stream deltas with the BDC master data as needed. We reduced our ML training data lag from 25 minutes to under 4 minutes using this pattern, and storage costs only increased 8% compared to pure zero-copy.

The AUTO mode latency you’re experiencing is actually expected behavior. AUTO_INGEST relies on S3 event notifications which can batch up to 5-10 minutes depending on file arrival patterns. For SAP BDC sharing, there’s additional overhead in the zero-copy metadata sync.

Have you considered switching to ON_DEMAND execution with scheduled tasks? This gives you more control over refresh intervals. Also, your external table approach adds another layer of indirection - Streams on staged data would eliminate that hop and provide true CDC capabilities for your ML pipelines.