We’re architecting a microservices platform on OCI and debating whether to route all internal service-to-service calls through API Gateway or allow direct REST endpoint communication within our private subnet. Our setup includes about 15 microservices that need to communicate frequently.
The API Gateway approach offers centralized security policies, rate limiting, and unified logging. However, I’m concerned about the additional latency and operational overhead of routing every internal call through an extra hop. Direct endpoint communication seems more efficient but loses the centralized control and monitoring benefits.
For those who’ve implemented similar architectures, what’s been your experience with API Gateway for internal traffic? Does the security and observability gain justify the performance trade-off? Are there hybrid approaches that balance both concerns?
We went full API Gateway initially and regretted it for internal traffic. The latency overhead was noticeable - it added 15-30ms per call, which compounds when you have service chains with 4-5 hops. For truly internal, trusted services within the same VCN, we switched to direct communication using private DNS and mutual TLS. We only use API Gateway for external-facing APIs and for internal calls that cross security zones. This hybrid approach reduced our P95 latency by about 40% while maintaining security boundaries where they matter.
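For what it's worth, the client side of the mTLS setup is simple. Here's a minimal sketch using Python's `requests` library - the file names and the internal hostname are hypothetical, and in practice the certs would come from your secrets store rather than local paths:

```python
import requests

def make_mtls_session(cert_path, key_path, ca_bundle_path):
    """Build an HTTP session for direct service-to-service calls:
    presents a client certificate AND pins server verification to our
    private CA, so both sides authenticate (mutual TLS)."""
    session = requests.Session()
    session.cert = (cert_path, key_path)   # client cert + key we present
    session.verify = ca_bundle_path        # private CA bundle for the peer
    return session

# Hypothetical usage against a private-DNS name inside the VCN:
# s = make_mtls_session("svc.pem", "svc.key", "internal-ca.pem")
# resp = s.get("https://orders.internal.example/api/v1/orders")
```

The point is that once the certs are distributed, each service only needs a couple of lines of client config - no gateway hop involved.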
The operational overhead argument cuts both ways. API Gateway gives you centralized observability, but it also becomes a single point of failure and a bottleneck. We’ve had incidents where gateway issues took down our entire service mesh. Direct communication means more distributed monitoring complexity, but it’s more resilient. You need to factor in deployment complexity too - API Gateway route updates require careful coordination, while direct service discovery can be more dynamic. Have you considered service mesh solutions like Istio as a middle ground? You get centralized policies without the gateway bottleneck.
After implementing both patterns across multiple OCI deployments, here’s my synthesis on the three key dimensions:
Centralized Security: API Gateway shines for cross-boundary communication - anything touching external networks, different compartments, or services with varying trust levels. It provides consistent policy enforcement, OAuth/JWT validation, and threat protection without code changes. For internal services within the same security zone, mutual TLS with service accounts offers comparable security with less overhead. The decision point: if a service compromise could laterally affect others, route through the gateway. If services are equally trusted and compartmentalized, direct communication is acceptable.
API Latency: Real-world data from our implementations shows 10-25ms of added latency for typical gateway routing. This compounds in service chains - a 5-hop call adds 50-125ms total. For synchronous request-response patterns with tight SLAs (< 100ms), this is prohibitive. For asynchronous workflows or batch processing, it’s negligible. Consider request patterns: high-frequency, low-latency calls (like health checks, cache lookups) should bypass the gateway; business-logic calls with looser SLAs can route through it.
Operational Overhead: Gateway management centralizes configuration but creates deployment dependencies. We use this rule: external-facing and cross-zone traffic through API Gateway for unified access control and logging; internal, same-zone traffic uses direct endpoints with service mesh for observability. This gives you centralized monitoring without centralized routing overhead.
Hybrid Architecture Pattern:
- Public APIs → API Gateway (mandatory for security and rate limiting)
- Cross-compartment internal APIs → API Gateway (centralized policy enforcement)
- Same-compartment, high-trust services → Direct with mutual TLS and service mesh
- Async/event-driven communication → Direct to message queues/streams
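The decision table above is mechanical enough to encode. A sketch of how we express it (the property names are my own, not an OCI API - order matters, since external exposure dominates, then async, then the trust boundary):

```python
def choose_transport(public, cross_compartment, is_async):
    """Map a call's properties to a transport, per the hybrid pattern.
    Precedence: external exposure > async > compartment boundary."""
    if public:
        return "api-gateway"      # mandatory: security + rate limiting
    if is_async:
        return "message-queue"    # event-driven traffic goes to queues/streams
    if cross_compartment:
        return "api-gateway"      # centralized policy at the trust boundary
    return "direct-mtls"          # same-compartment, high-trust services
```

Encoding it this way (or as a lookup table in your architecture docs) keeps the routing decision auditable instead of tribal knowledge.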
Implementation Tips:
- Use OCI Service Mesh or Istio for observability on direct connections
- Implement circuit breakers and retry logic in service code for direct calls
- Deploy API Gateway in HA mode with multiple replicas if using for critical paths
- Monitor both gateway metrics and service-level metrics to catch issues early
- Document your routing decisions in architecture diagrams - future teams need to understand the patterns
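On the circuit-breaker tip: the logic you need in service code for direct calls is small. A minimal illustrative sketch (a real deployment would use a hardened library, and the thresholds here are placeholders):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for direct service calls. Opens after
    `threshold` consecutive failures; half-opens after `reset_after`
    seconds to let a single probe through."""
    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast like this is what keeps a single slow dependency from exhausting threads across the whole service chain - exactly the failure mode the gateway would otherwise absorb for you.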
The “right” answer depends on your specific SLAs, security posture, and operational maturity. Start with gateway-first for simplicity, then optimize high-traffic paths to direct communication as you identify bottlenecks. Don’t prematurely optimize - measure first, then decide.
Run some benchmarks before deciding. The latency impact varies significantly based on your gateway configuration and payload sizes. We measured 8-12ms overhead for small JSON payloads but 50-80ms for larger requests due to gateway processing. If your services exchange high-frequency, small messages, the cumulative latency becomes problematic. For lower-frequency, larger payloads, it’s usually acceptable. Also consider that API Gateway has throughput limits - check if your expected request rates fit within OCI’s gateway capacity for your subscription tier.
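Seconding the benchmark advice - and it's worth measuring P95, not the mean, since that's what your SLAs care about. A quick harness sketch (pass it a closure that hits the direct endpoint, then one that hits the gateway route, and compare):

```python
import statistics
import time

def p95_latency_ms(call, n=200):
    """Invoke `call` n times and return the P95 latency in milliseconds.
    Run once against the direct endpoint and once against the gateway
    route to quantify the overhead for your actual payloads."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()
        samples.append((time.perf_counter() - t0) * 1000.0)
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
    return statistics.quantiles(samples, n=20)[18]
```

Make sure to benchmark with representative payload sizes - as noted above, the overhead profile for large requests differs a lot from small JSON messages.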
From a security perspective, API Gateway provides defense in depth even for internal traffic. It’s not just about external threats - insider threats, compromised services, and misconfigurations are real risks. Centralized security lets you implement consistent authentication, authorization, and audit logging across all services without duplicating code. You can enforce policies like request validation, payload inspection, and anomaly detection at the gateway level. The performance hit is real, but security shouldn’t be compromised for speed unless you have very specific latency SLAs. Consider that the operational overhead of managing security across 15 independent services might outweigh gateway management costs.
Appreciate all the perspectives. The hybrid approach seems most practical. How do you typically draw the line between what goes through the gateway versus direct communication? Is it purely based on trust boundaries, or are there other factors like data sensitivity or compliance requirements that influence the decision?