I found the root cause after researching this pattern. The issue is a combination of network latency characteristics and query optimization in Analytics Cloud.
Network Latency Monitoring:
First, set up proper latency monitoring using OCI Monitoring. Create a custom metric tracking round-trip time from containers to Analytics Cloud. Use the Network Path Analyzer in OCI Console to trace the actual network path your container traffic takes. In your case, I suspect containers are routing through a NAT gateway while VMs use a service gateway, adding 50-100ms latency per request.
To verify this:
- Compare the route tables for your container subnet and your VM subnet
- Look for a NAT gateway rule on one and a service gateway rule on the other
- Run traceroute from both a container and a VM to the Analytics Cloud endpoint and compare the paths
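To put numbers on the path difference, you can time the TCP handshake to the Analytics Cloud endpoint from both a container and a VM. A minimal Python sketch (the endpoint hostname is a placeholder; substitute your instance's actual endpoint):

```python
import socket
import statistics
import time

def tcp_connect_ms(host: str, port: int, samples: int = 5) -> list[float]:
    """Time the TCP three-way handshake to host:port, in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns once the handshake completes
        with socket.create_connection((host, port), timeout=10):
            timings.append((time.perf_counter() - start) * 1000)
    return timings

# Run from a container and from a VM, then compare medians, e.g.:
#   times = tcp_connect_ms("<your-instance-endpoint>", 443)
#   print(f"median handshake: {statistics.median(times):.1f} ms")
```

A consistent gap of tens of milliseconds between the two environments points at the network path (NAT vs service gateway) rather than the queries themselves.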
Analytics Cloud Resource Usage:
The “normal” CPU/memory you’re seeing doesn’t tell the full story. Analytics Cloud tracks query concurrency separately from resource utilization. Log into Analytics Cloud admin console and check the “Active Queries” dashboard. You might be hitting concurrency limits where container queries are queuing behind other workloads.
Also check the “Query Statistics” view filtered by source IP. If your containers share a NAT gateway IP, Analytics Cloud might be applying per-client rate limiting on the assumption that all requests come from a single client. This would explain why distributed VMs with individual IPs don’t hit the same bottleneck.
Query Optimization:
The different execution plans you discovered are the key symptom. Analytics Cloud’s query optimizer considers network latency when choosing execution plans. Higher latency connections get plans optimized for fewer round-trips but potentially more server-side processing. This is why you see table scans instead of index seeks.
To fix this, either force index usage with hints or correct the detected latency profile:
- Add optimizer hints to your queries specifying index preferences
- Or, adjust the Analytics Cloud connection profile for your container subnet to mark it as “low latency”
The connection profile setting is in Analytics Cloud under Settings > Connections > Advanced. Set the latency profile to “LAN” instead of the default “WAN” that it’s probably detecting based on your container’s network characteristics.
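For the hint route, here is a minimal sketch of what a hinted query could look like, assuming the backing data source accepts Oracle-style `/*+ INDEX */` hints; the table alias and index name below are hypothetical placeholders:

```python
def with_index_hint(table_alias: str, index_name: str, body: str) -> str:
    """Prepend an Oracle-style INDEX hint to the body of a SELECT statement."""
    return f"SELECT /*+ INDEX({table_alias} {index_name}) */ {body}"

# Hypothetical alias and index name -- substitute your own schema objects
query = with_index_hint(
    "s", "sales_by_day_idx",
    "s.day, SUM(s.amount) FROM sales s GROUP BY s.day",
)
```

If the hinted plan matches what the VMs get, that confirms the optimizer was choosing scan-heavy plans because of perceived latency rather than missing statistics.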
OCI Monitoring Metrics:
Set up these custom metrics in OCI Monitoring:
- Query response time (end-to-end from container)
- Network latency to Analytics Cloud endpoint (TCP connection time)
- Query execution time (from Analytics Cloud query logs)
- Container connection pool utilization
By comparing these metrics between containers and VMs, you’ll see where the latency is accumulating. My bet is 60% network path difference (NAT vs service gateway), 30% query optimization choosing wrong plans due to perceived latency, and 10% connection profile mismatch.
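Once those four metrics exist, a quick way to see where time accumulates is to subtract the measured pieces from the end-to-end number; whatever remains is queueing or connection-pool wait. A rough sketch (the sample values are illustrative, not measurements):

```python
def latency_breakdown(end_to_end_ms: float, network_ms: float,
                      exec_ms: float) -> dict[str, float]:
    """Attribute end-to-end query time to network, server execution,
    and the residual (queueing / connection-pool wait), as percentages."""
    residual = end_to_end_ms - network_ms - exec_ms
    return {
        "network_pct": 100 * network_ms / end_to_end_ms,
        "execution_pct": 100 * exec_ms / end_to_end_ms,
        "residual_pct": 100 * residual / end_to_end_ms,
    }

# Illustrative numbers only: 100 ms end-to-end, 60 ms network, 30 ms execution
breakdown = latency_breakdown(100.0, 60.0, 30.0)
```

Computing this separately for the container path and the VM path makes the comparison concrete instead of eyeballing dashboard graphs.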
Immediate Fix:
Update your container subnet’s route table to send Analytics Cloud traffic through a service gateway instead of the NAT gateway. Add this route:
- Destination: All Oracle Services
- Target: Service Gateway
- Description: Direct path to Analytics Cloud
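If you manage the VCN with the OCI CLI, the rule above can be applied with `oci network route-table update`. The OCIDs and the region portion of the service CIDR label are placeholders, and note that this call replaces the entire rule list, so include any existing rules in the JSON:

```shell
oci network route-table update \
  --rt-id ocid1.routetable.oc1..<your-route-table-ocid> \
  --route-rules '[{
      "destinationType": "SERVICE_CIDR_BLOCK",
      "destination": "all-<region>-services-in-oracle-services-network",
      "networkEntityId": "ocid1.servicegateway.oc1..<your-service-gateway-ocid>",
      "description": "Direct path to Analytics Cloud"
  }]'
```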
This should immediately reduce your query times from 15-20 seconds back to 3-4 seconds by eliminating the NAT gateway hop and allowing Analytics Cloud’s optimizer to choose better execution plans.