After synthesizing all this feedback and running some POCs, here’s my consolidated perspective on the semantic vs custom feature engineering decision:
SAP Semantic Data Products and Governance:
The semantic layer provides enormous value for foundational business concepts. SAP BDC semantic models come with pre-validated business rules, data quality checks, and audit trails that are critical for regulated industries. For customer hierarchies, product classifications, financial metrics, and supply chain KPIs, leveraging these semantics saves months of validation work and ensures consistency across analytics and ML use cases. The built-in governance (lineage, access controls, change management) is production-grade and compliance-ready.
Snowpark Python UDF Development:
Snowpark excels at ML-specific transformations that semantic models don’t cover. Complex window functions, statistical aggregations, time-series feature engineering, and custom business logic all benefit from Snowpark’s flexibility. The Python UDF framework lets you implement any transformation pattern, call external libraries, and even integrate pre-trained models for feature generation. The key is treating Snowpark as the “enrichment layer” rather than replacing semantic foundations entirely.
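To make the "enrichment layer" idea concrete, here is a minimal sketch of a custom feature transformation written as a plain Python handler that could then be registered as a Snowpark UDF. The function name, the half-life constant, and the registration snippet are all illustrative assumptions, not part of any existing pipeline:

```python
# Hypothetical sketch: a plain-Python handler for a custom ML feature that a
# Snowpark UDF could wrap; the name and half-life default are illustrative.
def recency_decay(days_since_last: int, half_life: float = 30.0) -> float:
    """Exponential-decay recency feature: halves every `half_life` days."""
    return 0.5 ** (days_since_last / half_life)

# Registration would happen inside an active Snowpark session, e.g.:
# from snowflake.snowpark.functions import udf
# from snowflake.snowpark.types import FloatType, IntegerType
# recency_udf = udf(recency_decay, return_type=FloatType(),
#                   input_types=[IntegerType()], name="recency_decay")
```

Keeping the handler as ordinary Python (with the Snowpark registration as a thin wrapper) makes the transformation unit-testable outside Snowflake.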
Feature Engineering Best Practices:
Implement a layered architecture:
- L1 (Raw): SAP BDC shared data, unchanged
- L2 (Semantic): SAP semantic products with business logic
- L3 (Base Features): Direct mappings from semantics to ML features
- L4 (Derived Features): Snowpark transformations creating ML-specific features
This separation makes testing, debugging, and governance much more manageable. Each layer has clear ownership and validation criteria.
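The layer separation above can be sketched as a simple naming convention plus view scaffolding; every schema, view, and table name below is a hypothetical placeholder, not a real SAP BDC or Snowflake object:

```python
# Hypothetical four-layer naming convention; all object names are placeholders.
LAYERS = {
    "L1_raw": "sap_share.transactions",               # shared data, unchanged
    "L2_semantic": "sap_share.customer_semantic_v2",  # SAP business logic
    "L3_base": "ml.base_customer_features",           # 1:1 semantic -> feature
    "L4_derived": "ml.derived_customer_features",     # Snowpark transformations
}

def layer_view_ddl(layer_view: str, source: str) -> str:
    """Generate a CREATE VIEW statement pinning a layer to its upstream source."""
    return f"CREATE OR REPLACE VIEW {layer_view} AS SELECT * FROM {source}"

# Each layer would be materialized in a Snowpark session, e.g.:
# session.sql(layer_view_ddl(LAYERS["L3_base"], LAYERS["L2_semantic"])).collect()
```

Because each layer only references the one directly beneath it, ownership and validation criteria stay scoped to a single boundary.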
Hybrid Semantic + Custom Approach:
The hybrid pattern is the practical solution. Use SAP semantics for all business-validated attributes (customer demographics, product attributes, transaction amounts, organizational hierarchies). Build Snowpark transformations for ML-specific features (RFM scores, propensity calculations, embeddings, statistical aggregations, time-series lags).
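As an example of an ML-specific feature that belongs in the Snowpark layer rather than the semantic layer, here is a toy RFM score. The bucketing thresholds are arbitrary illustrations; a real implementation would typically derive cutoffs from quantiles of the customer base:

```python
# Hypothetical RFM scoring sketch; thresholds are illustrative, not validated.
from datetime import date

def rfm_score(last_purchase: date, n_orders: int, total_spend: float,
              as_of: date) -> int:
    """Toy 3-digit RFM score: each dimension bucketed 1-5 (5 = best)."""
    recency_days = (as_of - last_purchase).days
    if recency_days <= 30:
        r = 5
    elif recency_days <= 90:
        r = 4
    elif recency_days <= 180:
        r = 3
    elif recency_days <= 365:
        r = 2
    else:
        r = 1
    f = min(5, max(1, n_orders))                      # crude: cap count at 5
    m = min(5, max(1, int(total_spend // 100) + 1))   # $100 per bucket
    return r * 100 + f * 10 + m
```

The semantic layer supplies the validated inputs (purchase dates, order counts, spend); only the scoring logic is custom.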
Implement version pinning for semantic dependencies:
```python
# Snowpark feature pipeline pinned to a semantic version
# (assumes an active Snowpark session at registration time)
from snowflake.snowpark import Session, Window
from snowflake.snowpark.functions import avg, col, datediff, lag, sproc

@sproc(name="generate_customer_features_v3", replace=True,
       packages=["snowflake-snowpark-python"])
def build_features(session: Session, semantic_version: str) -> str:
    # Pin to a specific semantic version of the shared SAP view
    semantic_view = f"sap_share.customer_semantic_{semantic_version}"
    base = session.table(semantic_view)

    per_customer = Window.partition_by("customer_id").order_by("transaction_date")

    # Layer custom ML features on the semantic foundation
    features = base.with_columns(
        ["avg_purchase_6m", "days_since_previous"],
        [
            # Trailing average over the six preceding rows per customer
            # (a row-count approximation of a six-month window)
            avg(col("purchase_amount")).over(per_customer.rows_between(-6, -1)),
            # Days elapsed since the customer's previous transaction
            datediff(
                "day",
                lag(col("transaction_date"), 1).over(per_customer),
                col("transaction_date"),
            ),
        ],
    )
    features.write.save_as_table("customer_features_v3", mode="overwrite")
    return f"features built from {semantic_view}"
```
Model Governance and Compliance:
Implement comprehensive metadata management:
- Feature Registry: Track every feature with source semantic version, transformation logic, business owner, and approval status
- Lineage Tracking: Use Snowflake tags to link features back to SAP semantic sources and Snowpark transformation code
- Access Controls: Inherit RBAC from SAP semantics for base attributes, apply additional column-level security for derived features
- Validation Framework: Automated tests comparing semantic outputs to expected values, plus data quality checks on Snowpark transformations
- Change Management: When SAP updates semantics, create new versioned views, run regression tests on all dependent features, migrate models incrementally
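A feature-registry record covering the metadata above could be as simple as the following sketch; the field names and the example values are illustrative assumptions, not an existing schema:

```python
# Hypothetical feature-registry record; field names and values are illustrative.
from dataclasses import dataclass, asdict

@dataclass
class FeatureRecord:
    name: str
    semantic_version: str     # pinned SAP semantic source version
    transformation: str       # pointer to Snowpark code (proc name or repo path)
    business_owner: str
    approval_status: str = "pending"

record = FeatureRecord(
    name="avg_purchase_6m",
    semantic_version="v2",
    transformation="generate_customer_features_v3",
    business_owner="analytics@example.com",
)
# asdict(record) could be persisted to a registry table or attached to the
# feature column via Snowflake object tags for lineage queries.
```

Even this minimal shape answers the audit questions that matter: which semantic version a feature depends on, who owns it, and whether it is approved.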
The hybrid approach gives you SAP’s governance benefits (audit trails, validated business logic, regulatory compliance) while maintaining ML flexibility through Snowpark. Document everything, version aggressively, and treat semantic products as immutable inputs to your feature engineering pipeline.
Recommendation: Start with SAP semantics for all available business concepts. Only build custom Snowpark features when semantic products don't cover your needs. As your ML platform matures, you'll develop patterns for common transformations that can be templatized and governed as rigorously as the semantic layer itself. The goal is controlled flexibility: innovation where needed, standardization where possible.