Best practices for device data retention in SAP IoT sapiot-25

We’re planning our data retention strategy for a large IoT deployment (5,000+ devices) in sapiot-25 and would like to hear from others about best practices. Our main concerns are balancing storage costs with compliance requirements and maintaining query performance as data volume grows.

Currently considering a tiered approach: hot data (30 days) in HANA, warm data (6 months) in extended storage, cold data (7 years for compliance) in archive. But I’m curious what retention policies others are using and how you handle the transition between tiers. Also interested in hearing about data archiving strategies and any performance impacts you’ve experienced with large historical datasets.

Based on implementations across multiple large-scale IoT deployments, here’s a comprehensive retention strategy framework:

Data Retention Policy Structure: Implement a three-tier policy with clear transition criteria based on data age and access frequency:

  1. Hot Tier (HANA In-Memory): 30-45 days

    • Real-time analytics and dashboards
    • High-frequency queries (multiple times per hour)
    • Full granularity data
    • Target: Sub-second query performance
  2. Warm Tier (HANA Extended Storage): 6-12 months

    • Historical trend analysis
    • Medium-frequency queries (daily/weekly)
    • Full granularity with optional compression
    • Target: Query performance under 5 minutes
  3. Cold Tier (Archive/Data Lake): 7+ years

    • Compliance and audit requirements
    • Low-frequency queries (monthly/quarterly)
    • Aggregated summaries + raw data on-demand
    • Target: Query performance acceptable up to 30 minutes

Data Archiving Best Practices:

Automate tier transitions using lifecycle policies:


```
Data Lifecycle Policy:
- Hot → Warm: After 30 days AND query frequency < 10/day
- Warm → Cold: After 180 days AND query frequency < 1/week
- Archive immutability: Enable for compliance data
- Compression: Apply to warm/cold tiers (60-80% reduction)
```
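A minimal sketch of how such a transition rule could be evaluated per data partition (the threshold constants are the assumed values from the policy above; the function name is illustrative):

```python
# Assumed thresholds, taken from the lifecycle policy above.
HOT_MAX_AGE_DAYS = 30
HOT_MAX_QUERIES_PER_DAY = 10
WARM_MAX_AGE_DAYS = 180
WARM_MAX_QUERIES_PER_WEEK = 1

def target_tier(age_days: float, queries_per_day: float) -> str:
    """Pick the tier a data partition belongs in, using age AND
    observed query frequency (both conditions must hold to demote)."""
    if age_days > WARM_MAX_AGE_DAYS and queries_per_day * 7 < WARM_MAX_QUERIES_PER_WEEK:
        return "cold"
    if age_days > HOT_MAX_AGE_DAYS and queries_per_day < HOT_MAX_QUERIES_PER_DAY:
        return "warm"
    return "hot"
```

Note that data which is old but still queried frequently stays in a higher tier, which is the point of combining age with access frequency rather than using age alone.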

Storage Cost Optimization:

For 5,000 devices each generating a reading roughly every 5 seconds:

  • Raw data: ~2.6 billion records/month
  • Hot storage cost: Highest (HANA in-memory)
  • Warm storage cost: 60% lower (HANA extended)
  • Cold storage cost: 90% lower (archive)

Estimated monthly storage:

  • Hot (30 days): 2.6B records × 1KB = ~2.6TB
  • Warm (6 months): 15.6B records × 0.5KB (compressed) = ~7.8TB
  • Cold (7 years): 218B records × 0.2KB (highly compressed) = ~43TB
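The estimates above are straightforward to reproduce; a back-of-envelope sketch using the record counts and per-record sizes stated above (decimal TB):

```python
# Back-of-envelope storage estimate for the tiered layout above.
# Assumed inputs: ~2.6e9 records/month, per-record sizes per tier.
RECORDS_PER_MONTH = 2.6e9

def tb(records: float, kb_per_record: float) -> float:
    """Storage in decimal TB for a record count at a given record size."""
    return records * kb_per_record * 1e3 / 1e12  # KB -> bytes -> TB

hot  = tb(RECORDS_PER_MONTH * 1,  1.0)   # 30 days, uncompressed
warm = tb(RECORDS_PER_MONTH * 6,  0.5)   # 6 months, ~50% compressed
cold = tb(RECORDS_PER_MONTH * 84, 0.2)   # 7 years, heavily compressed

print(f"hot={hot:.1f}TB warm={warm:.1f}TB cold={cold:.1f}TB")
```

Running the numbers confirms the figures quoted above (~2.6TB hot, ~7.8TB warm, ~43.7TB cold).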

Key cost reduction strategies:

  1. Apply compression at the warm tier (60-80% reduction, consistent with the lifecycle policy above)
  2. Store aggregated summaries in cold tier with raw data on-demand retrieval
  3. Implement data sampling for non-critical analytics (e.g., keep every 10th reading for trend analysis)
  4. Use partition pruning in queries to minimize data scanned
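Strategy 3 (sampling) is the simplest of these; a minimal sketch, assuming readings are already sorted by timestamp:

```python
def downsample(readings, keep_every=10):
    """Keep every Nth reading (sampling strategy above): retains long-term
    trend shape while cutting stored volume by roughly (1 - 1/N).
    `readings` is assumed to be sorted by timestamp."""
    return readings[::keep_every]
```

With the default of 10, storage for the sampled series drops by about 90%; reserve this for analytics streams where per-reading fidelity is not a compliance requirement.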

Compliance Considerations:

  • Separate retention policies for regulatory vs. operational data
  • Implement audit trails for all data access and archival operations
  • Use immutable storage for compliance-critical data (cannot be modified/deleted)
  • Regular validation of archived data integrity (checksum verification)
  • Document retention policy decisions for regulatory audits
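The integrity-check point could run as a periodic job that recomputes each archive file's digest and compares it with the checksum recorded at archival time. A minimal sketch (function names are illustrative):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_archive(path: str, expected_digest: str) -> bool:
    """Compare a freshly computed digest with the one recorded at archival time."""
    return sha256_of(path) == expected_digest
```

Streaming in chunks keeps memory flat even for multi-GB archive files; log every verification result so the checks themselves appear in the audit trail.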

Performance Optimization:

  1. Create materialized views for common analytics queries (daily/weekly aggregations)
  2. Implement query result caching for frequently accessed historical data
  3. Use data partitioning by time period to improve query performance
  4. Pre-compute and store aggregates at multiple time granularities (hourly, daily, monthly)
  5. Implement a federated query layer that routes queries to appropriate tiers automatically
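The federated routing idea (item 5) can be sketched as a function mapping a query's date range to the set of tiers it must touch; the tier boundaries are the assumed 30-day and 180-day cutoffs from the policy above:

```python
from datetime import date, timedelta

# Assumed tier boundaries in days of data age, matching the retention
# windows above; None means unbounded (cold keeps everything older).
TIER_BOUNDARIES = [("hot", 30), ("warm", 180), ("cold", None)]

def tiers_for_range(start: date, end: date, today: date) -> list[str]:
    """Return the tiers a query over [start, end] must fan out to."""
    age_newest = (today - end).days    # age of the newest requested data
    age_oldest = (today - start).days  # age of the oldest requested data
    tiers, lower = [], 0
    for name, upper in TIER_BOUNDARIES:
        # A tier covers ages [lower, upper); include it if the query's
        # age range overlaps that window.
        if (upper is None or age_newest < upper) and age_oldest >= lower:
            tiers.append(name)
        if upper is not None:
            lower = upper
    return tiers
```

A query over the last week touches only the hot tier, while a two-year trend analysis fans out to all three, which is exactly the case raised in the question at the end of this thread.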

Implementation Roadmap:

Phase 1 (Months 1-2): Establish hot tier with 30-day retention, baseline storage costs

Phase 2 (Months 3-4): Implement warm tier transition, validate compression ratios

Phase 3 (Months 5-6): Deploy cold tier archiving, test compliance requirements

Phase 4 (Ongoing): Monitor and optimize based on actual usage patterns

The key is starting with conservative retention periods and adjusting based on actual query patterns. We’ve found that 80% of queries access data less than 7 days old, which validates aggressive tiering policies. Monitor your query access patterns for the first 3 months before finalizing long-term retention policies.
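Measuring that access pattern is easy if the query log records, per query, the oldest data timestamp it requested; a minimal sketch with an assumed log schema:

```python
from datetime import datetime, timedelta

def share_of_recent_queries(query_log, days=7):
    """Fraction of logged queries whose oldest requested data is within
    `days` of the query time. `query_log` is a list of
    (query_time, oldest_data_requested) datetime pairs (assumed schema)."""
    hits = sum(1 for qt, oldest in query_log if qt - oldest <= timedelta(days=days))
    return hits / len(query_log)
```

If this fraction stays high over the first few months, aggressive tiering is justified; if it drops, widen the hot window before data starts landing in slower tiers.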

sapiot-25 has a unified query interface that transparently queries across tiers, but performance varies significantly. Queries on hot data return in seconds, warm data in minutes, and cold data can take 15-30 minutes depending on archive size. We implemented a query optimizer that pre-aggregates common analytics queries and stores results in a materialized view layer. This gives near-instant results for standard reports while still allowing ad-hoc queries against raw archived data when needed. The trade-off is additional storage for the aggregated views, but it’s minimal compared to raw data volume.

Another consideration: data retention policies should account for device decommissioning. When devices are retired, we archive their complete history immediately rather than waiting for the standard retention schedule. This prevents orphaned data from accumulating in hot storage. We also implemented a device lifecycle hook that triggers archival workflows automatically when devices are marked as decommissioned in the device registry. This has helped us maintain clean hot storage and predictable storage costs.
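The lifecycle hook described above might look like the following in outline; all names here are illustrative, not an actual SAP IoT API:

```python
# Hypothetical sketch of a device-lifecycle hook: when a device is marked
# decommissioned in the registry, immediately queue its full history for
# archival instead of waiting for the age-based schedule.
from dataclasses import dataclass, field

@dataclass
class ArchivalQueue:
    jobs: list = field(default_factory=list)

    def submit(self, device_id: str) -> None:
        self.jobs.append({"device_id": device_id,
                          "action": "archive_full_history"})

def on_device_status_change(device_id: str, new_status: str,
                            queue: ArchivalQueue) -> None:
    """Registry callback: trigger archival as soon as a device is retired."""
    if new_status == "decommissioned":
        queue.submit(device_id)
```

The important design point is that the trigger is event-driven (status change in the registry) rather than schedule-driven, so retired devices never linger in hot storage.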

Your tiered approach aligns with what we implemented for a 3,500-device deployment. One key lesson: automate the tier transitions using SAP IoT’s data lifecycle policies rather than manual archiving. We set up policies that automatically move data based on age and access patterns. Hot data stays in HANA for real-time analytics, warm data moves to HANA native storage extensions after 30 days, and cold data goes to SAP Data Intelligence for long-term archiving after 180 days. The automated policies reduced our storage costs by 60% while maintaining compliance.

From a compliance perspective, make sure your retention policy clearly defines what constitutes device data versus operational metadata. In our industry (manufacturing), we must retain raw sensor readings for 10 years but operational logs only need 2 years. We use separate retention policies for different data categories. Also critical: implement immutable archiving for compliance data - once archived, data cannot be modified or deleted. sapiot-25 supports this through HANA’s secure store feature.

Great insights on automated policies and data categorization. How do you handle queries that span multiple tiers? For example, if someone needs to analyze trends across 2 years of data, does the system automatically query both HANA and archived data, or do users need to explicitly specify the data source? Performance is a concern when queries need to access archived data.