GKE node pool autoscaling not triggering when pods exceed CPU requests

We’re running a GKE cluster (v1.14) with autoscaling enabled on our primary node pool. The node pool is configured with a minimum of 3 nodes and a maximum of 15. Over the past week, we’ve noticed that when we scale out our application during peak hours, the cluster autoscaler isn’t adding new nodes as expected.

Our pods have CPU requests set to 500m and limits set to 1000m. When we deploy additional replicas (scaling from 20 to 35 pods), many pods stay in Pending state for 10-15 minutes before nodes are finally added. The node pool shows autoscaling as enabled, but scale-up is far slower than we’d expect.
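For anyone debugging the same thing, a quick way to confirm what the autoscaler sees is to look at the pending pods directly (assuming your kubectl context points at the cluster; `<pending-pod-name>` is a placeholder for one of your own pods):

```shell
# List pods stuck in Pending -- the state the cluster autoscaler reacts to
kubectl get pods --field-selector=status.phase=Pending

# Check the scheduler's reason for one pending pod: a scale-up is only
# triggered if the pod is unschedulable for capacity reasons
# (e.g. "Insufficient cpu"), not if it is blocked by a node
# selector, taint, or affinity rule
kubectl describe pod <pending-pod-name> | grep -A5 Events
```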

Has anyone experienced delayed autoscaling in GKE? We need to understand if this is a configuration issue with our pod resource requests, node pool settings, or the autoscaler itself.

I’ve seen similar behavior. First thing to check: are your pod resource requests accurately reflecting actual usage? The cluster autoscaler only reacts to pods that are unschedulable — i.e. Pending because no node has enough unreserved capacity. If requests are too low, the scheduler keeps packing pods onto existing nodes, nothing goes Pending, and no scale-up is triggered until the nodes are genuinely full.
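To compare requests against real consumption, you can put the two side by side (this needs the Metrics API, which GKE provides by default):

```shell
# Actual CPU/memory consumption per pod
kubectl top pods

# Requested CPU per pod, for comparison against the numbers above
kubectl get pods -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'
```

If `top` shows pods routinely using more than they request, the scheduler is overcommitting nodes and the autoscaler never sees a capacity problem.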

Thanks both. I checked the cluster-autoscaler-status ConfigMap and found some interesting entries about “scale up not needed” even though we had pending pods. Our actual CPU usage averages around 350m per pod, so the 500m request seems reasonable. Could there be an issue with how the autoscaler calculates available capacity?
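For others following along, this is the status ConfigMap I mean — the autoscaler publishes its most recent decisions there, including per-node-group scale-up state and reasons it declined to act:

```shell
# Dump the autoscaler's self-reported status (lives in kube-system)
kubectl describe configmap cluster-autoscaler-status -n kube-system
```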

This might be related to the autoscaler’s scan interval combined with the time it takes to provision new nodes. GCP typically takes 3-5 minutes to spin up a new node, and if your traffic spike ramps faster than that, you’ll see pending pods regardless of configuration. Consider scaling earlier with the Horizontal Pod Autoscaler (a lower CPU target gives you headroom before pods start pending), or overprovisioning with low-priority placeholder pods that get evicted when real workloads need the space. If you can upgrade, node auto-provisioning on GKE 1.15+ can also create appropriately sized node pools on demand.
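A minimal sketch of both suggestions — the deployment name `web`, cluster name `my-cluster`, and the specific limits are placeholders you’d adapt to your setup (check the gcloud reference for the current autoprovisioning flags):

```shell
# HPA with a conservative 60% CPU target, so replicas are added
# well before nodes are saturated and pods start pending
kubectl autoscale deployment web --cpu-percent=60 --min=20 --max=50

# Enable node auto-provisioning with cluster-wide resource limits
# (requires a GKE version that supports it)
gcloud container clusters update my-cluster \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 64 \
  --min-memory 1 --max-memory 256
```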