You’ve identified the core issue correctly. Let me break down the solution by addressing each critical area:
Resource Manager IAM Policy:
The OKE service principal needs explicit permissions to interact with Resource Manager when your infrastructure is stack-managed. Add this policy:
Allow service OKE to manage orm-stacks in compartment <your-compartment>
Allow service OKE to manage orm-jobs in compartment <your-compartment>
This lets the cluster autoscaler read the stack configuration and run jobs against it when it provisions nodes.
Service Principal Permissions:
The default OKE service principal has broad compute permissions, but confirm that the policies in your compartment include at least the following:
Allow service OKE to manage instance-family in compartment <your-compartment>
Allow service OKE to manage compute-containers in compartment <your-compartment>
Allow service OKE to use subnets in compartment <your-compartment>
Allow service OKE to use vnics in compartment <your-compartment>
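The instance-family statement is the crucial one. instance-family is an aggregate resource type that covers instances, instance images, and volume attachments (among others), so it is what allows the service to launch and manage the node instances themselves, whatever their shape.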
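To avoid typos when applying both sets of statements, they can be rendered from one template. This is only a minimal Python sketch, assuming the statements listed above are the ones you want; render_oke_policies and the compartment name prod-oke are hypothetical names used for illustration.

```python
# Verb/resource pairs copied from the statements in this answer.
ORM_GRANTS = [("manage", "orm-stacks"), ("manage", "orm-jobs")]
COMPUTE_GRANTS = [
    ("manage", "instance-family"),
    ("manage", "compute-containers"),
    ("use", "subnets"),
    ("use", "vnics"),
]


def render_oke_policies(compartment: str) -> list[str]:
    """Return the policy statements for one compartment, in a stable order."""
    if not compartment.strip():
        raise ValueError("compartment name must not be empty")
    return [
        f"Allow service OKE to {verb} {resource} in compartment {compartment}"
        for verb, resource in ORM_GRANTS + COMPUTE_GRANTS
    ]


if __name__ == "__main__":
    # Hypothetical compartment name; substitute your own.
    print("\n".join(render_oke_policies("prod-oke")))
```

The output can be pasted into a policy in the console or kept in version control next to the stack.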
Stack Execution Context:
When Resource Manager manages your OKE infrastructure, the autoscaler doesn’t directly create compute instances - it triggers stack updates or references stack-defined configurations. The “authorization failed” error occurred because the service principal couldn’t read or execute within the stack context. This is why manual creation worked (your user account has direct compute permissions) but autoscaling failed (service principal lacked stack permissions).
Instance-Family Access:
For VM.Standard.E4.Flex specifically, verify:
- Your tenancy has access to E4 shapes in the target region
- Service limits allow additional E4 instances
- The OKE service principal is granted manage on instance-family in the node pool's compartment
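You can check the policy part with the command below. The filter tests each statement individually, because contains() on a list only matches whole elements. The match is case-sensitive, so adjust the string if your statements are written differently:
oci iam policy list --compartment-id <compartment-ocid> --query "data[?statements[?contains(@, 'OKE')]]"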
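Because policy text can be written in any letter case, a case-insensitive check over saved CLI output can be more forgiving than a query filter. The following is a sketch, assuming the output of oci iam policy list was saved as JSON with the usual top-level data list, where each policy has name and statements fields. find_oke_statements is a hypothetical helper, and the sample data is made up.

```python
import json


def find_oke_statements(cli_output: str) -> dict[str, list[str]]:
    """Map policy name -> statements that mention the OKE service principal."""
    policies = json.loads(cli_output).get("data", [])
    matches: dict[str, list[str]] = {}
    for policy in policies:
        hits = [
            s for s in policy.get("statements", [])
            # Lowercase and collapse whitespace before matching.
            if "service oke" in " ".join(s.lower().split())
        ]
        if hits:
            matches[policy.get("name", "<unnamed>")] = hits
    return matches


if __name__ == "__main__":
    # Made-up sample for illustration, not real tenancy output.
    sample = json.dumps({"data": [{
        "name": "oke-policy",
        "statements": [
            "allow service oke to manage instance-family in compartment demo",
            "Allow group Admins to manage all-resources in tenancy",
        ],
    }]})
    for name, stmts in find_oke_statements(sample).items():
        print(name)
        for s in stmts:
            print("  " + s)
```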
Complete Solution:
Add the Resource Manager policies above, then verify the cluster autoscaler can read your stack configuration. Test by triggering autoscaling during a load test. The autoscaler should now successfully provision nodes by referencing the Resource Manager stack definitions. If issues persist, check the ORM job logs for the specific stack execution failures - they’ll show exactly which permission is missing.
The key insight is that Resource Manager introduces an additional permission layer beyond direct compute API access. Your service principal needs both compute permissions AND Resource Manager permissions to autoscale successfully in a stack-managed environment.
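The two-layer requirement can be expressed as a small check over a list of policy statements. This is a simplified sketch, not a full policy evaluator: it assumes the required grants are the ones listed in this answer, only recognizes statements of the form "Allow service OKE to <verb> <resource> in ...", and ignores all-resources grants, conditions, and compartment scoping. missing_grants is a hypothetical name.

```python
import re

# OCI policy verbs in increasing order of privilege.
VERB_RANK = {"inspect": 0, "read": 1, "use": 2, "manage": 3}

# Minimum verb per resource type, grouped by permission layer (from this answer).
REQUIRED = {
    "resource-manager": {"orm-stacks": "manage", "orm-jobs": "manage"},
    "compute": {
        "instance-family": "manage",
        "compute-containers": "manage",
        "subnets": "use",
        "vnics": "use",
    },
}

_STATEMENT = re.compile(
    r"^\s*allow\s+service\s+oke\s+to\s+(\w+)\s+([\w-]+)\s+in\s", re.IGNORECASE
)


def missing_grants(statements):
    """Return {layer: [resource types lacking a sufficient grant]}."""
    granted = {}
    for statement in statements:
        match = _STATEMENT.match(statement)
        if not match:
            continue
        verb, resource = match.group(1).lower(), match.group(2).lower()
        if verb in VERB_RANK:
            # Keep the strongest verb seen for each resource type.
            granted[resource] = max(granted.get(resource, -1), VERB_RANK[verb])
    missing = {}
    for layer, needs in REQUIRED.items():
        gaps = [r for r, verb in needs.items()
                if granted.get(r, -1) < VERB_RANK[verb]]
        if gaps:
            missing[layer] = gaps
    return missing
```

A result that lists only the resource-manager layer corresponds to the situation described above: compute access is in place, but the stack permissions are not.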