Pod autoscaling in container service fails to provision new nodes

lisadata · November 13, 2024, 1:11pm

Our OKE cluster’s horizontal pod autoscaler triggers correctly when load increases, but the cluster autoscaler fails to provision new worker nodes. The error logs indicate insufficient permissions, but I’ve verified the cluster has the default service principal.

We’re using Resource Manager stacks to manage our infrastructure, and I suspect the issue relates to how the service principal interacts with the stack execution context. The IAM policy for Resource Manager seems correct, but new nodes fail with “authorization failed” during provisioning.

I’m particularly concerned about whether the service principal has the right permissions for the specific instance family we’re trying to scale into (VM.Standard.E4.Flex). Has anyone encountered similar permission issues with autoscaling and Resource Manager?

johndata · November 20, 2024, 9:15am

We’re using the default OKE service principal. Here’s the error from the cluster autoscaler logs:


Failed to create instance: authorization failed
Service: Compute, Code: NotAuthorizedOrNotFound
OPC request ID: 8F2A...B3C9

The cluster can create nodes manually through the console, but autoscaling fails. Our Resource Manager stack has all the node pool configurations.
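
For triage, the fields in a log excerpt like the one above can be pulled out with a few lines of Python. This is only a sketch: the function name `parse_oci_error` is hypothetical, and the regular expressions assume the three-line layout quoted in this post rather than any documented log format.

```python
import re

def parse_oci_error(log_text: str) -> dict:
    """Extract service, error code, and request ID from a log excerpt.

    Assumes the layout quoted in this thread; it is not an official format.
    Fields that cannot be found are returned as None.
    """
    patterns = {
        "service": r"Service:\s*([^,\n]+)",
        "code": r"Code:\s*(\S+)",
        "request_id": r"OPC request ID:\s*(\S+)",
    }
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, log_text)
        result[field] = match.group(1).strip() if match else None
    return result

# The excerpt from this post, with the request ID left truncated as posted.
sample = """Failed to create instance: authorization failed
Service: Compute, Code: NotAuthorizedOrNotFound
OPC request ID: 8F2A...B3C9"""

print(parse_oci_error(sample))
```

NotAuthorizedOrNotFound is returned both when a permission is missing and when the target resource does not exist, so the code alone does not say which policy statement is absent. The request ID is the field to quote if you open a support request.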

lisaguru · December 2, 2024, 2:26pm

There’s another angle here - if you’re using Resource Manager to manage the cluster infrastructure, the service principal might need permissions to read or execute the stack. The autoscaler might be trying to update the stack configuration when adding nodes, which requires additional Resource Manager permissions beyond just compute instance creation. Have you checked if the OKE service has ‘manage stacks’ or at least ‘read stacks’ permissions in your compartment?

dorothylead · December 31, 2024, 7:40pm

You’ve identified the core issue correctly. Let me break down the solution by addressing each critical area:

Resource Manager IAM Policy: The OKE service principal needs explicit permissions to interact with Resource Manager when your infrastructure is stack-managed. Add this policy:


Allow service OKE to manage orm-stacks in compartment <your-compartment>
Allow service OKE to manage orm-jobs in compartment <your-compartment>

Because manage is the broadest policy verb and includes read and use, these statements let the cluster autoscaler read stack configurations and understand the infrastructure context when provisioning nodes.

Service Principal Permissions: The default OKE service principal has broad compute permissions, but you need to ensure it covers your specific use case:


Allow service OKE to manage instance-family in compartment <your-compartment>
Allow service OKE to manage compute-containers in compartment <your-compartment>
Allow service OKE to use subnets in compartment <your-compartment>
Allow service OKE to use vnics in compartment <your-compartment>

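The instance-family permission is the one that matters for launching nodes. It is an aggregate resource type that covers compute instances, so it applies to any shape, including flexible ones such as VM.Standard.E4.Flex.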
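
If you maintain these statements for several compartments, templating them avoids copy-paste typos. The sketch below is illustrative only: `render_statements` and the `GRANTS` list are hypothetical names, the verb and resource pairs are copied from the statements in this post, and `dev-compartment` is a placeholder.

```python
from typing import List

# Verb/resource pairs copied from the statements in this post.
GRANTS = [
    ("manage", "instance-family"),
    ("manage", "compute-containers"),
    ("use", "subnets"),
    ("use", "vnics"),
]

def render_statements(compartment: str, subject: str = "service OKE") -> List[str]:
    """Return one policy statement per grant for the given compartment."""
    if not compartment.strip():
        raise ValueError("compartment name must not be empty")
    return [
        f"Allow {subject} to {verb} {resource} in compartment {compartment}"
        for verb, resource in GRANTS
    ]

# Placeholder compartment name; substitute your own.
for statement in render_statements("dev-compartment"):
    print(statement)
```

The output can be pasted into the policy editor or kept alongside the stack definition so the policy and the infrastructure are reviewed together.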

Stack Execution Context: When Resource Manager manages your OKE infrastructure, the autoscaler doesn’t directly create compute instances - it triggers stack updates or references stack-defined configurations. The “authorization failed” error occurred because the service principal couldn’t read or execute within the stack context. This is why manual creation worked (your user account has direct compute permissions) but autoscaling failed (service principal lacked stack permissions).

Instance-Family Access: For VM.Standard.E4.Flex specifically, verify:

- Your tenancy has access to E4 shapes in the target region
- Service limits allow additional E4 instances
- The OKE service principal has instance-family permissions (not just generic compute permissions)

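You can check the last point by listing the policies attached to the compartment whose statements mention OKE. Policies attached to a parent compartment or the tenancy root are not returned, so repeat the command there if nothing matches:


oci iam policy list --compartment-id <compartment-ocid> --all --query "data[?statements[?contains(@, 'OKE')]]"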
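
To compare the result against the statements recommended in this post, the JSON printed by `oci iam policy list` (run without `--query`) can be checked with a short script. This is a rough sketch under stated assumptions: `missing_statements` and `REQUIRED_PREFIXES` are hypothetical names, and the comparison is a plain case-insensitive string match, so it will not recognise broader grants that also cover these resources.

```python
import json
from typing import List

# Statement prefixes taken from the policies suggested in this post.
REQUIRED_PREFIXES = [
    "Allow service OKE to manage orm-stacks",
    "Allow service OKE to manage orm-jobs",
    "Allow service OKE to manage instance-family",
    "Allow service OKE to manage compute-containers",
    "Allow service OKE to use subnets",
    "Allow service OKE to use vnics",
]

def normalize(statement: str) -> str:
    """Lower-case and collapse whitespace so formatting differences are ignored."""
    return " ".join(statement.lower().split())

def missing_statements(cli_json: str, compartment: str) -> List[str]:
    """Return the required statements that do not appear in the CLI output."""
    policies = json.loads(cli_json).get("data", [])
    present = {
        normalize(s) for policy in policies for s in policy.get("statements", [])
    }
    required = [f"{p} in compartment {compartment}" for p in REQUIRED_PREFIXES]
    return [r for r in required if normalize(r) not in present]

# Made-up sample shaped like the CLI output, for illustration only.
sample = json.dumps({"data": [{"name": "oke-policy", "statements": [
    "Allow service OKE to manage instance-family in compartment dev",
    "Allow service OKE to use subnets in compartment dev",
]}]})

for statement in missing_statements(sample, "dev"):
    print("missing:", statement)
```

An empty result only means the exact statements are present. It does not prove the autoscaler is authorized, so still confirm with a real scale-up test.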

Complete Solution: Add the Resource Manager policies above, then verify the cluster autoscaler can read your stack configuration. Test by triggering autoscaling during a load test. The autoscaler should now successfully provision nodes by referencing the Resource Manager stack definitions. If issues persist, check the ORM job logs for the specific stack execution failures - they’ll show exactly which permission is missing.

The key insight is that Resource Manager introduces an additional permission layer beyond direct compute API access. Your service principal needs both compute permissions AND Resource Manager permissions to autoscale successfully in a stack-managed environment.
