Pod autoscaling in container service fails to provision new nodes

Our OKE cluster’s horizontal pod autoscaler triggers correctly when load increases, but the cluster autoscaler fails to provision new worker nodes. The error logs indicate insufficient permissions, but I’ve verified the cluster has the default service principal.

We’re using Resource Manager stacks to manage our infrastructure, and I suspect the issue relates to how the service principal interacts with the stack execution context. The IAM policy for Resource Manager seems correct, but new nodes fail with “authorization failed” during provisioning.

I’m particularly concerned about whether the service principal has the right permissions for the specific instance family we’re trying to scale into (VM.Standard.E4.Flex). Has anyone encountered similar permission issues with autoscaling and Resource Manager?

We’re using the default OKE service principal. Here’s the error from the cluster autoscaler logs:


Failed to create instance: authorization failed
Service: Compute, Code: NotAuthorizedOrNotFound
OPC request ID: 8F2A...B3C9

We can create nodes manually through the console, but autoscaling fails. Our Resource Manager stack holds all the node pool configurations.

There’s another angle here: if you’re using Resource Manager to manage the cluster infrastructure, the service principal might need permissions to read or execute the stack. The autoscaler might be trying to update the stack configuration when adding nodes, which requires Resource Manager permissions beyond compute instance creation. Have you checked whether the OKE service has ‘manage orm-stacks’ or at least ‘read orm-stacks’ permissions in your compartment?

You’ve identified the core issue correctly. Let me break down the solution by addressing each critical area:

Resource Manager IAM Policy: The OKE service principal needs explicit permissions to interact with Resource Manager when your infrastructure is stack-managed. Add this policy:


Allow service OKE to manage orm-stacks in compartment <your-compartment>
Allow service OKE to manage orm-jobs in compartment <your-compartment>

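These statements let the cluster autoscaler read stack configurations and run jobs against them, so it can understand the infrastructure context when provisioning nodes. Note that manage is the broadest policy verb; if the autoscaler only needs to read the stack, read orm-stacks is a tighter grant.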

Service Principal Permissions: The default OKE service principal has broad compute permissions, but you need to ensure it covers your specific use case:


Allow service OKE to manage instance-family in compartment <your-compartment>
Allow service OKE to manage compute-containers in compartment <your-compartment>
Allow service OKE to use subnets in compartment <your-compartment>
Allow service OKE to use vnics in compartment <your-compartment>

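The instance-family permission is crucial for newer instance types. It is an aggregate resource type, so a single statement covers instances together with related resources such as instance images and volume attachments.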
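To check an existing policy against the four statements above, a small script can compare them textually. This is only a sketch: the function names are made up for illustration, and it assumes that ignoring case and extra whitespace is a fair way to compare statements. It will not recognize a broader grant (for example one on all-resources) that already covers the same access.

```python
import re

# Statements copied from the answer above; the compartment name is filled in at call time.
REQUIRED_TEMPLATES = [
    "Allow service OKE to manage instance-family in compartment {compartment}",
    "Allow service OKE to manage compute-containers in compartment {compartment}",
    "Allow service OKE to use subnets in compartment {compartment}",
    "Allow service OKE to use vnics in compartment {compartment}",
]


def normalize(statement):
    """Collapse whitespace and lowercase so cosmetic differences do not matter."""
    return re.sub(r"\s+", " ", statement).strip().lower()


def missing_statements(existing, compartment):
    """Return the required statements with no normalized match in `existing`."""
    have = {normalize(s) for s in existing}
    required = [t.format(compartment=compartment) for t in REQUIRED_TEMPLATES]
    return [s for s in required if normalize(s) not in have]


if __name__ == "__main__":
    # Hypothetical existing policy: two of the four statements are present.
    existing = [
        "allow service oke to manage instance-family in compartment prod",
        "Allow service OKE to use  subnets in compartment prod",
    ]
    for statement in missing_statements(existing, "prod"):
        print("missing:", statement)
```

With the sample input, the script reports the compute-containers and vnics statements as missing.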

Stack Execution Context: When Resource Manager manages your OKE infrastructure, the autoscaler doesn’t create compute instances directly. Instead, it triggers stack updates or references stack-defined configurations. The “authorization failed” error occurred because the service principal couldn’t read or execute within the stack context. This is why manual creation worked (your user account has direct compute permissions) but autoscaling failed (the service principal lacked stack permissions).
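The difference between the two paths can be shown with a toy model. This is not how OCI evaluates policies: it ignores compartments, conditions, and aggregate families, and the principal labels and function names are hypothetical. The rule that autoscaling needs stack access comes from the explanation above and is treated here as an assumption. The one part taken from OCI itself is the verb ordering (inspect, read, use, manage), where each verb includes the ones before it.

```python
# OCI policy verbs, weakest to strongest; each verb includes the ones before it.
VERBS = ["inspect", "read", "use", "manage"]


def is_allowed(grants, principal, verb, resource):
    """True if `principal` holds a grant on `resource` at least as strong as `verb`."""
    needed = VERBS.index(verb)
    return any(
        p == principal and r == resource and VERBS.index(v) >= needed
        for p, v, r in grants
    )


def can_create_manually(grants, principal):
    """Manual path: only direct compute access is involved."""
    return is_allowed(grants, principal, "manage", "instance-family")


def can_autoscale(grants, principal):
    """Assumed autoscaler path: compute access AND at least read access to the stack."""
    return (
        is_allowed(grants, principal, "manage", "instance-family")
        and is_allowed(grants, principal, "read", "orm-stacks")
    )


if __name__ == "__main__":
    # Hypothetical grants as (principal, verb, resource) tuples.
    grants = [
        ("user:admin", "manage", "instance-family"),
        ("service:OKE", "manage", "instance-family"),
    ]
    print(can_create_manually(grants, "user:admin"))  # manual creation succeeds
    print(can_autoscale(grants, "service:OKE"))       # autoscaling is blocked
    grants.append(("service:OKE", "manage", "orm-stacks"))
    print(can_autoscale(grants, "service:OKE"))       # passes once stack access exists
```

Because manage is the strongest verb, granting manage on orm-stacks also satisfies the read check in this model.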

Instance-Family Access: For VM.Standard.E4.Flex specifically, verify:

  1. Your tenancy has access to E4 shapes in the target region
  2. Service limits allow additional E4 instances
  3. The OKE service principal has instance-family permissions (not just generic compute permissions)

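You can verify this with the following command, which lists the policies that have at least one statement mentioning OKE (the match is case-sensitive):


oci iam policy list --compartment-id <compartment-ocid> --query "data[?statements[?contains(@, 'OKE')]]"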
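Substring filters inside a JMESPath query are easy to get subtly wrong, so it can help to do the same filtering in Python on the JSON the command prints. The sketch below assumes the usual CLI output shape: a top-level data list whose entries carry name and statements. The function name is made up for illustration. It matches OKE as a whole word and ignores case.

```python
import json
import re


def policies_mentioning(cli_output, needle="OKE"):
    """Return (policy name, matching statements) pairs from `oci iam policy list` JSON."""
    # Whole-word, case-insensitive match so "oke" inside another word is not counted.
    pattern = re.compile(r"\b" + re.escape(needle) + r"\b", re.IGNORECASE)
    matches = []
    for policy in json.loads(cli_output).get("data", []):
        hits = [s for s in policy.get("statements", []) if pattern.search(s)]
        if hits:
            matches.append((policy.get("name"), hits))
    return matches


if __name__ == "__main__":
    # Hypothetical output, trimmed to the two fields the function reads.
    sample = json.dumps({
        "data": [
            {
                "name": "oke-policy",
                "statements": [
                    "Allow service OKE to use vnics in compartment prod",
                    "Allow group Admins to manage all-resources in tenancy",
                ],
            },
            {
                "name": "other",
                "statements": ["Allow group Devs to read buckets in compartment prod"],
            },
        ]
    })
    for name, statements in policies_mentioning(sample):
        print(name)
        for statement in statements:
            print("  ", statement)
```

To use it on real output, save the result of the unfiltered oci iam policy list command to a file and pass the file contents to the function.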

Complete Solution: Add the Resource Manager policies above, then verify the cluster autoscaler can read your stack configuration. Test by triggering autoscaling during a load test. The autoscaler should now successfully provision nodes by referencing the Resource Manager stack definitions. If issues persist, check the ORM job logs for the specific stack execution failures; they’ll show exactly which permission is missing.

The key insight is that Resource Manager introduces an additional permission layer beyond direct compute API access. Your service principal needs both compute permissions AND Resource Manager permissions to autoscale successfully in a stack-managed environment.