I’ll provide the complete implementation details since this generated significant interest.
Container resource metrics collection started with IBM Cloud Monitoring agent deployment across all clusters using DaemonSets. We configured the agent to capture metrics at 30-second intervals for CPU usage, memory working set, network I/O, disk I/O, and container restart counts. Critical custom metrics included: container CPU throttling percentage, memory pressure indicators, and pod scheduling latency. We exported these to Prometheus for long-term storage and analysis, maintaining 90 days of high-resolution data and 1 year of downsampled metrics.
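For anyone asking how the throttling metric is computed: the kernel's cgroup `cpu.stat` file exposes `nr_periods` and `nr_throttled` counters, and the throttling percentage is just the ratio of their deltas over the scrape interval. A minimal sketch (function name is illustrative, not our agent's actual code):

```python
def throttling_percentage(nr_throttled_delta: int, nr_periods_delta: int) -> float:
    """CPU throttling % over one scrape interval, from cgroup cpu.stat counter deltas.

    nr_periods counts CFS enforcement periods elapsed; nr_throttled counts how
    many of those periods the container hit its quota and was throttled.
    """
    if nr_periods_delta == 0:
        # Container was idle (no CFS periods elapsed) - report no throttling.
        return 0.0
    return 100.0 * nr_throttled_delta / nr_periods_delta
```

A container throttled in 15 of 60 periods during a 30-second window reports 25% throttling, which is the kind of signal that flags an undersized CPU limit.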
Cluster utilization analysis revealed the core problem: massive over-provisioning. Average cluster CPU utilization was 38% and memory 42%, but individual nodes ranged from 15% to 85%. The root causes were poor workload distribution and overly conservative resource requests. We analyzed actual resource consumption patterns using percentile analysis: p50, p95, and p99 usage over 30-day windows. Most applications consumed 30-40% of their requested resources under normal load and 60-70% during peak periods. This data informed our right-sizing strategy.
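The percentile analysis is easy to reproduce. A minimal sketch with NumPy (the 15% headroom factor is illustrative, not our exact policy):

```python
import numpy as np

def usage_percentiles(samples):
    """p50/p95/p99 of observed usage; `samples` is 30 days of per-interval readings."""
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    return {"p50": float(p50), "p95": float(p95), "p99": float(p99)}

def rightsized_request(samples, headroom=1.15):
    """Candidate new resource request: p95 usage plus ~15% headroom.

    Sizing to p95 rather than peak is what closes the gap between requested
    and consumed resources; HPA absorbs the bursts above p95.
    """
    return float(np.percentile(samples, 95)) * headroom
```

Run per container over the 30-day window, this turns "most apps use 30-40% of what they request" into a concrete per-workload request value.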
Workload consolidation happened in three phases. Phase 1: We created workload profiles by analyzing resource usage patterns, identifying five distinct categories - CPU-bound batch jobs, memory-intensive data processing, I/O-heavy database workloads, bursty web services, and steady-state background services. Phase 2: We consolidated compatible workloads onto dedicated node pools with appropriate instance types. CPU-bound workloads got compute-optimized nodes, memory-intensive workloads got memory-optimized instances. Phase 3: We implemented pod topology spread constraints and anti-affinity rules to prevent resource contention. This reduced our node count from 247 to 156 while improving average utilization to 68%.
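The Phase 1 profiling can be sketched as a simple classifier over the usage statistics; the thresholds below are illustrative placeholders, not the cutoffs we actually used:

```python
def classify_workload(cpu_p95_cores, mem_p95_gib, iops_p95, cpu_variance):
    """Map a workload's usage profile to one of the five node-pool categories.

    All thresholds are hypothetical examples; in practice they should come
    from clustering your own 30-day usage data.
    """
    if iops_p95 > 5000:
        return "io-optimized"        # I/O-heavy database workloads
    if mem_p95_gib / max(cpu_p95_cores, 0.1) > 8:
        return "memory-optimized"    # memory-intensive data processing
    if cpu_variance > 0.5:
        return "burst"               # bursty web services
    if cpu_p95_cores > 2.0:
        return "compute-optimized"   # CPU-bound batch jobs
    return "general"                 # steady-state background services
```

Each category then maps to a node pool with a matching instance family, and topology spread constraints keep same-category pods from piling onto one node.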
Cost attribution required a comprehensive tagging strategy. We enforced mandatory labels on all resources: team, application, environment, cost-center, and business-unit. IBM Cloud Monitoring dashboards aggregated resource consumption by these dimensions, calculating cost per team from CPU-hours and GB-hours consumed. We implemented monthly chargeback reports showing each team's infrastructure costs, with drill-down to specific applications and namespaces. This visibility drove behavioral change: teams started optimizing their resource requests once they saw the cost impact.
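The chargeback math itself is simple once the labels are in place; a sketch (rates and team names are made up for illustration):

```python
def monthly_chargeback(usage_by_team, cpu_hour_rate, gib_hour_rate):
    """Cost per team from metered CPU-hours and GiB-hours.

    `usage_by_team` maps a team label to its aggregated monthly consumption;
    the unit rates are illustrative, derived from your blended node costs.
    """
    return {
        team: round(u["cpu_hours"] * cpu_hour_rate + u["gib_hours"] * gib_hour_rate, 2)
        for team, u in usage_by_team.items()
    }

# Hypothetical example: one team, $0.03/CPU-hour, $0.004/GiB-hour.
report = monthly_chargeback(
    {"payments": {"cpu_hours": 1000, "gib_hours": 2000}},
    cpu_hour_rate=0.03,
    gib_hour_rate=0.004,
)
```

The drill-down to applications and namespaces is the same aggregation run over finer-grained label combinations.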
Performance SLA maintenance was critical during optimization. We established baseline SLAs before changes: 99.9% uptime, p95 response time under 200ms, zero data loss. During right-sizing, we implemented a gradual rollout strategy - optimize 10% of workloads weekly, monitor for 72 hours, validate SLA compliance, then proceed. We used the Horizontal Pod Autoscaler (HPA) as a safety net - if resource constraints caused performance issues, HPA would scale out replicas. We implemented automated rollback triggers: if the error rate exceeded 0.5% or p95 latency increased by more than 20%, the system automatically reverted resource settings. Over six months, we executed 847 optimization changes with only 3 rollbacks needed.
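The rollback trigger reduces to two comparisons; a minimal sketch using the thresholds quoted above (the function wiring is illustrative, not our actual controller):

```python
ERROR_RATE_LIMIT = 0.005       # 0.5% error-rate threshold from the rollout policy
LATENCY_INCREASE_LIMIT = 0.20  # 20% p95 latency regression threshold

def should_rollback(error_rate, p95_latency_ms, baseline_p95_ms):
    """Automated rollback trigger evaluated after each optimization change.

    Reverts the resource settings if either the error rate exceeds its limit
    or p95 latency regresses more than 20% against the pre-change baseline.
    """
    latency_increase = (p95_latency_ms - baseline_p95_ms) / baseline_p95_ms
    return error_rate > ERROR_RATE_LIMIT or latency_increase > LATENCY_INCREASE_LIMIT
```

Capturing the baseline before each change is the important part; comparing against a global SLA instead of a per-workload baseline would mask slow regressions.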
The 42% cost reduction came from multiple sources: node count reduction through consolidation accounted for 37% of the savings, right-sizing individual workloads for 28%, eliminating idle resources for 20%, and better instance type selection for 15%. We maintained SLA compliance at 99.94% throughout the optimization period, actually improving on the previous 99.87%. The monitoring infrastructure itself costs $3,200 monthly but saves $47,000 monthly in infrastructure costs, a 14.7x ROI. Teams now have real-time visibility into resource consumption and costs, enabling continuous optimization rather than periodic cost-cutting exercises.
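For the skeptics, the headline numbers are internally consistent; a quick arithmetic check using the figures from the post:

```python
# Monthly figures quoted above.
monitoring_cost = 3_200    # monthly cost of the monitoring stack
monthly_savings = 47_000   # monthly infrastructure savings it enables
roi = monthly_savings / monitoring_cost  # ~14.7x

# Shares of the total savings by source - should sum to 100%.
savings_shares = {
    "node consolidation": 0.37,
    "right-sizing": 0.28,
    "idle resource elimination": 0.20,
    "instance type selection": 0.15,
}
total_share = sum(savings_shares.values())
```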