Based on implementations across multiple large-scale IoT deployments, here’s a comprehensive retention strategy framework:
Data Retention Policy Structure:
Implement a three-tier policy with clear transition criteria based on data age and access frequency:
- Hot Tier (HANA In-Memory): 30-45 days
  - Real-time analytics and dashboards
  - High-frequency queries (multiple times per hour)
  - Full granularity data
  - Target: Sub-second query performance
- Warm Tier (HANA Extended Storage): 6-12 months
  - Historical trend analysis
  - Medium-frequency queries (daily/weekly)
  - Full granularity with optional compression
  - Target: Query performance under 5 minutes
- Cold Tier (Archive/Data Lake): 7+ years
  - Compliance and audit requirements
  - Low-frequency queries (monthly/quarterly)
  - Aggregated summaries + raw data on-demand
  - Target: Query performance acceptable up to 30 minutes
Data Archiving Best Practices:
Automate tier transitions using lifecycle policies:
Data Lifecycle Policy:
- Hot → Warm: After 30 days AND query frequency < 10/day
- Warm → Cold: After 180 days AND query frequency < 1/week
- Archive immutability: Enable for compliance data
- Compression: Apply to warm and cold tiers (50-70% reduction in warm, up to 80% in cold)
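The transition rules above can be sketched as a small decision function. This is only an illustration: `PartitionStats` and `next_tier` are hypothetical names, not part of HANA, and the sketch assumes you already collect age and average query frequency per partition from your own monitoring.

```python
from dataclasses import dataclass

# Thresholds taken from the lifecycle policy above.
HOT_MAX_AGE_DAYS = 30
WARM_MAX_AGE_DAYS = 180
HOT_MAX_QUERIES_PER_DAY = 10.0
WARM_MAX_QUERIES_PER_DAY = 1 / 7  # one query per week


@dataclass
class PartitionStats:
    """Hypothetical per-partition metadata gathered by your own monitoring."""
    tier: str               # "hot", "warm", or "cold"
    age_days: int           # age of the newest record in the partition
    queries_per_day: float  # average over a recent observation window


def next_tier(p: PartitionStats) -> str:
    """Return the tier a partition should occupy after this evaluation."""
    # Both conditions must hold (age AND low query frequency).
    if (p.tier == "hot"
            and p.age_days > HOT_MAX_AGE_DAYS
            and p.queries_per_day < HOT_MAX_QUERIES_PER_DAY):
        return "warm"
    if (p.tier == "warm"
            and p.age_days > WARM_MAX_AGE_DAYS
            and p.queries_per_day < WARM_MAX_QUERIES_PER_DAY):
        return "cold"
    return p.tier  # otherwise stay put; cold data never moves back here
```

Because both conditions are required, an old partition that is still queried heavily stays in its current tier until usage drops.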
Storage Cost Optimization:
For 5,000 devices generating data every 30 seconds (assuming one record per device per reading):
- Raw data: 5,000 devices × 2,880 readings/day × 30 days = ~432 million records/month
- Hot storage cost: Highest (HANA in-memory)
- Warm storage cost: 60% lower than hot (HANA extended)
- Cold storage cost: 90% lower than hot (archive)
Estimated steady-state storage per tier:
- Hot (30 days): 432M records × 1KB = ~0.43TB
- Warm (6 months): 2.6B records × 0.5KB (compressed) = ~1.3TB
- Cold (7 years): 36.3B records × 0.2KB (highly compressed) = ~7.3TB
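To rerun this kind of estimate for your own fleet, a short calculation like the following may help. It is a sketch under stated assumptions: the function names are illustrative, `records_per_reading` defaults to one record per device per interval (devices that emit several records per reading scale the result linearly), and sizes use decimal units (1 TB = 1e9 KB).

```python
SECONDS_PER_DAY = 86_400


def records_per_month(devices: int, interval_s: int,
                      records_per_reading: int = 1, days: int = 30) -> int:
    """Records produced per month by the whole fleet."""
    readings_per_day = SECONDS_PER_DAY // interval_s
    return devices * readings_per_day * records_per_reading * days


def tier_storage_tb(monthly_records: int, months: int,
                    kb_per_record: float) -> float:
    """Steady-state tier size in TB (decimal units: 1 TB = 1e9 KB)."""
    return monthly_records * months * kb_per_record / 1e9


monthly = records_per_month(devices=5_000, interval_s=30)

# Per-record sizes (1 KB, 0.5 KB, 0.2 KB) are the ones used in the estimate above.
estimates = {
    "hot": tier_storage_tb(monthly, months=1, kb_per_record=1.0),
    "warm": tier_storage_tb(monthly, months=6, kb_per_record=0.5),
    "cold": tier_storage_tb(monthly, months=84, kb_per_record=0.2),
}

for tier, size_tb in estimates.items():
    print(f"{tier}: {size_tb:.2f} TB")
```

Measure the per-record sizes on a sample of your own data before relying on the result, since actual compression ratios depend heavily on the payload.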
Key cost reduction strategies:
- Apply compression at warm tier (50-70% reduction)
- Store aggregated summaries in cold tier with raw data on-demand retrieval
- Implement data sampling for non-critical analytics (e.g., keep every 10th reading for trend analysis)
- Use partition pruning in queries to minimize data scanned
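As an illustration of the sampling strategy, here is a minimal sketch of keeping every 10th reading. The function name `sample_every_nth` is made up for this example, and it assumes readings arrive in time order.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_every_nth(readings: Iterable[T], n: int = 10) -> Iterator[T]:
    """Yield the 1st, (n+1)th, (2n+1)th, ... reading from an ordered stream."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for index, reading in enumerate(readings):
        if index % n == 0:
            yield reading


# Example with integers standing in for sensor readings.
kept = list(sample_every_nth(range(25), n=10))
```

Fixed-interval sampling like this can miss or distort periodic patterns whose period lines up with the sampling step, so it is best limited to slow-moving trend metrics.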
Compliance Considerations:
- Separate retention policies for regulatory vs. operational data
- Implement audit trails for all data access and archival operations
- Use immutable storage for compliance-critical data (cannot be modified/deleted)
- Regular validation of archived data integrity (checksum verification)
- Document retention policy decisions for regulatory audits
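The checksum validation step can be sketched with Python's standard `hashlib`. This assumes a SHA-256 digest is recorded when a file is archived and compared again during periodic validation. The helper names are illustrative, and `io.BytesIO` stands in for a real archive file opened in binary mode.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_archive(stream: BinaryIO, expected_hex: str) -> bool:
    """Return True if the stream's digest matches the one recorded at archive time."""
    return sha256_of_stream(stream) == expected_hex


# In-memory stand-in for an archived file.
payload = b"device_id,ts,value\n42,2024-01-01T00:00:00Z,21.5\n"
recorded = sha256_of_stream(io.BytesIO(payload))  # stored at archive time

intact = verify_archive(io.BytesIO(payload), recorded)
tampered = verify_archive(io.BytesIO(payload + b"x"), recorded)
```

Store the recorded digests separately from the archive itself, otherwise corruption or tampering could affect both at once.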
Performance Optimization:
- Create materialized views for common analytics queries (daily/weekly aggregations)
- Implement query result caching for frequently accessed historical data
- Use data partitioning by time period to improve query performance
- Pre-compute and store aggregates at multiple time granularities (hourly, daily, monthly)
- Implement a federated query layer that routes queries to appropriate tiers automatically
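The routing idea in the last point can be sketched as a function that splits a requested time range into per-tier sub-ranges. This is a simplified illustration, not a HANA feature: `route_query` and the boundary table are hypothetical, the boundaries assume the 30-day and 180-day transitions from the lifecycle policy, and all timestamps are naive datetimes in the same time zone.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (tier, minimum age in days, maximum age in days or None for unbounded)
TIER_BOUNDARIES: List[Tuple[str, int, Optional[int]]] = [
    ("hot", 0, 30),
    ("warm", 30, 180),
    ("cold", 180, None),
]


def route_query(start: datetime, end: datetime,
                now: datetime) -> List[Tuple[str, datetime, datetime]]:
    """Split the range [start, end) into the sub-ranges each tier must serve."""
    plan = []
    for tier, min_age, max_age in TIER_BOUNDARIES:
        # Newest and oldest timestamps this tier holds at time `now`.
        tier_end = now - timedelta(days=min_age)
        tier_start = (now - timedelta(days=max_age)
                      if max_age is not None else datetime.min)
        lo, hi = max(start, tier_start), min(end, tier_end)
        if lo < hi:  # the requested range overlaps this tier
            plan.append((tier, lo, hi))
    return plan


now = datetime(2024, 7, 1)
plan = route_query(datetime(2024, 5, 1), datetime(2024, 7, 1), now)
```

A real router would then issue one query per entry in `plan` and merge the results, which is also where partition pruning pays off because each sub-query touches only one tier's partitions.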
Implementation Roadmap:
Phase 1 (Months 1-2): Establish hot tier with 30-day retention, baseline storage costs
Phase 2 (Months 3-4): Implement warm tier transition, validate compression ratios
Phase 3 (Months 5-6): Deploy cold tier archiving, test compliance requirements
Phase 4 (Ongoing): Monitor and optimize based on actual usage patterns
The key is to start with conservative retention periods and adjust them based on actual query patterns. We’ve found that 80% of queries access data less than 7 days old, which supports aggressive tiering once you have confirmed a similar pattern in your own workload. Monitor your query access patterns for the first 3 months before finalizing long-term retention policies.
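To check whether your own workload shows a similar access pattern, you could compute the share of queries that only touch recent data. The sketch below assumes you can extract, for each query, the age in days of the oldest data it read. How you obtain that from your query logs is left open, and the function name is illustrative.

```python
from typing import Iterable


def share_of_queries_within(data_ages_days: Iterable[float],
                            max_age_days: float) -> float:
    """Fraction of queries whose oldest accessed data is newer than max_age_days."""
    ages = list(data_ages_days)
    if not ages:
        raise ValueError("no queries observed")
    recent = sum(1 for age in ages if age < max_age_days)
    return recent / len(ages)


# Made-up sample: age in days of the oldest data touched by each of 10 queries.
observed_ages = [0.5, 1, 3, 6, 2, 1, 4, 5, 40, 200]
share = share_of_queries_within(observed_ages, max_age_days=7)
```

Tracking this share per week during the monitoring period gives a concrete basis for tightening or relaxing the hot tier window.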