We’re evaluating our alerting strategy for OCI infrastructure, and we’re torn between using native OCI Monitoring alarm definitions and integrating with our existing third-party monitoring platform (Datadog). Native alerting has the advantage of being tightly integrated with OCI metrics and doesn’t require data egress, but our team is already familiar with Datadog’s incident management workflow. I’m curious about real-world experiences with the native alerting capabilities - specifically around customization options and how well they support complex incident response scenarios. What factors should influence this decision beyond just technical capabilities?
After reading through everyone’s experiences, here’s my analysis of the three key trade-off areas:
1. Native Alerting Capabilities:
Native OCI Monitoring alarms provide a solid foundation for alerting:
- Threshold alarms: Trigger when metrics cross defined values (CPU > 80%, disk > 90%)
- Absent alarms: Alert when expected metrics stop reporting (instance crash detection)
- Rate alarms: Detect rapid changes (sudden traffic spike, error rate increase)
- Built-in metrics: Compute, storage, network, database - all available without configuration
- Notification destinations: Email, PagerDuty, Slack (via Functions), webhooks, OCI Events
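The three alarm types above correspond to expressions in OCI's Monitoring Query Language (MQL). A minimal sketch of helpers that build those query strings - the metric names, intervals, and thresholds are illustrative, so check the Monitoring documentation for the exact MQL grammar your namespace supports:

```python
# Sketch: build OCI Monitoring Query Language (MQL) strings for the three
# basic alarm types. Metric names, intervals, and thresholds are examples,
# not a definitive reference for MQL syntax.

def threshold_query(metric: str, interval: str, stat: str, op: str, value: float) -> str:
    """Threshold alarm: fire when an aggregated metric crosses a value."""
    return f"{metric}[{interval}].{stat}() {op} {value}"

def absent_query(metric: str, interval: str) -> str:
    """Absent alarm: fire when the metric stops reporting (crash detection)."""
    return f"{metric}[{interval}].absent()"

def rate_query(metric: str, interval: str, op: str, value: float) -> str:
    """Rate alarm: fire on rapid change (e.g. sudden error-rate increase)."""
    return f"{metric}[{interval}].rate() {op} {value}"

# Example queries matching the bullets above:
cpu_alarm  = threshold_query("CpuUtilization", "1m", "mean", ">", 80)
disk_alarm = threshold_query("FileSystemUtilization", "1m", "max", ">", 90)
heartbeat  = absent_query("CpuUtilization", "1m")
```

The resulting strings go into an alarm definition's query field, whether you create alarms through the Console, CLI, or Monitoring API.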
Limitations:
- Query language is basic compared to PromQL or Datadog’s query syntax
- No multi-condition logic (can’t express ‘CPU high AND memory high’ in a single alarm)
- Limited historical context in alert notifications
- No anomaly detection or ML-based alerting
- Runbook integration requires custom Functions
Best for: Infrastructure-level alerts where speed matters, cost-sensitive deployments, teams comfortable with basic alerting logic.
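A common workaround for the single-condition limitation is a small OCI Function that receives notifications from several alarms and only pages when all of them are firing. A hedged sketch of the correlation core - the payload field names ("title", "type") are assumptions modeled on alarm notification messages, so verify them against a real message before relying on them:

```python
import json

# Sketch: correlate multiple single-metric OCI alarms into one composite
# condition ("CPU high AND memory high"). The "title" and "type" fields are
# illustrative assumptions about the alarm message schema.

# In-memory firing state, keyed by alarm name. A real Function would persist
# this externally (e.g. Object Storage), since invocations are stateless.
firing = {}

def handle_notification(raw_message: str, required: set) -> bool:
    """Record one alarm transition; return True when every required alarm fires."""
    msg = json.loads(raw_message)
    firing[msg["title"]] = (msg["type"] == "FIRING")
    return all(firing.get(name, False) for name in required)

required = {"cpu-high", "memory-high"}
handle_notification('{"title": "cpu-high", "type": "FIRING"}', required)
page = handle_notification('{"title": "memory-high", "type": "FIRING"}', required)
```

After both alarms report FIRING, `page` is True and the Function can notify; a single alarm firing alone stays silent, which is exactly the ‘AND’ semantics native alarms lack.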
2. Third-Party Integration Benefits:
Datadog and similar platforms excel at:
- Unified observability: Correlate OCI metrics with application logs, traces, and custom metrics
- Advanced alert logic: Composite conditions, forecast-based alerts, anomaly detection
- Rich notification context: Graphs, related events, automatic runbook links
- Collaborative incident management: Timelines, chat integration, postmortem tools
- Cross-cloud visibility: Monitor OCI alongside AWS, Azure, on-prem in single pane
- Extensive integrations: 500+ services and tools
Trade-offs:
- Cost: $15-50 per host/month plus metric ingestion fees
- Data egress: Metrics leave OCI, potential latency and bandwidth costs
- Complexity: Requires API integration setup, key management, IAM configuration
- Dependency: Outages in third-party service affect your monitoring
- Polling delay: Typically 1-5 minute lag vs real-time native alarms
Best for: Complex environments, teams needing advanced analytics, organizations with existing observability platform investments.
3. Customization and Incident Response:
Customization Comparison:
Native OCI:
- Alarm customization: Limited to metric, threshold, evaluation period, severity
- Notification customization: Message body templating, but context is minimal
- Remediation: Requires Functions to parse alert and trigger actions
- Dashboards: Basic metric visualization in OCI Console
- API access: Full programmatic control via Monitoring API for custom tooling
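“Remediation requires Functions to parse the alert” in practice means unpacking the notification JSON before acting on it. A sketch of that parsing step - the payload shape (dedupeKey, severity, alarmMetaData, dimensions) is an assumption modeled on OCI alarm messages, so confirm the fields against your own notifications:

```python
import json

# Sketch: extract what a remediation Function typically needs from an OCI
# alarm notification. Field names are assumptions about the alarm message
# format; verify against a real notification before relying on them.

def parse_alarm(raw: str) -> dict:
    msg = json.loads(raw)
    meta = (msg.get("alarmMetaData") or [{}])[0]
    return {
        "dedupe_key": msg.get("dedupeKey"),
        "severity": msg.get("severity"),
        "status": meta.get("status"),
        # Dimensions identify the affected resource (e.g. an instance OCID),
        # which is what a remediation action usually keys on.
        "resource": (meta.get("dimensions") or [{}])[0].get("resourceId"),
    }

example = json.dumps({
    "dedupeKey": "abc123",
    "severity": "CRITICAL",
    "alarmMetaData": [{"status": "FIRING",
                       "dimensions": [{"resourceId": "ocid1.instance.oc1..example"}]}],
})
parsed = parse_alarm(example)
```

Once parsed, the Function can branch on severity and resource - restart an instance, scale a pool, or just forward an enriched message to Slack.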
Third-Party:
- Extensive query customization: Complex expressions, calculations, aggregations
- Alert enrichment: Automatic context injection (tags, metadata, related metrics)
- Automated remediation: Workflow triggers, runbook automation, self-healing integrations
- Advanced dashboards: Correlation graphs, SLO tracking, predictive analytics
- Mobile apps: Full-featured incident response on mobile devices
Incident Response Workflow:
Native OCI workflow:
- Alarm triggers → Notification sent
- Engineer receives alert (email/Slack/PagerDuty)
- Log in to the OCI Console → Navigate to the metric
- Manual investigation using Monitoring dashboards
- Remediation via Console or CLI
- Manual documentation of resolution
Third-party workflow:
- Alert triggers with full context (graphs, logs, related events)
- Engineer receives enriched notification with investigation links
- Single-pane investigation (metrics + logs + traces)
- Collaborative incident channel auto-created
- Automated runbooks suggest remediation steps
- Timeline and postmortem generated automatically
Strategic Recommendation:
For your team’s situation (existing Datadog investment, need for sophisticated incident response), I’d recommend:
Tier 1: Critical Infrastructure - Native OCI Alarms
- Compute instance health (CPU, memory, disk)
- Network connectivity failures
- Database availability
- Storage capacity critical thresholds
Reason: Speed, reliability, no external dependencies
Tier 2: Application and Business Metrics - Datadog
- Application performance (response time, error rates)
- Business KPIs (transaction volume, user activity)
- Complex conditions requiring correlation
- Anomaly detection and forecasting
Reason: Better investigation tools, team familiarity, advanced analytics
Integration Layer: Use PagerDuty or similar as an aggregation point to:
- Deduplicate correlated alerts from both systems
- Apply consistent escalation policies
- Maintain single incident timeline
- Enable on-call rotation management
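Whatever the aggregation point, the deduplication step reduces to mapping alerts from both sources onto one incident key. A minimal stdlib sketch of that idea - the key scheme is illustrative, not PagerDuty's actual API (its Events API uses its own dedup_key mechanism):

```python
# Sketch: deduplicate alerts from two monitoring sources (native OCI alarms
# and Datadog) into single incidents, keyed by resource + condition. The
# key scheme is an illustration of the idea, not a PagerDuty implementation.

incidents = {}

def ingest(alert: dict) -> str:
    """Fold an alert into an incident; return the incident key."""
    key = f'{alert["resource"]}:{alert["condition"]}'
    incidents.setdefault(key, []).append(alert)  # same key -> same timeline
    return key

k1 = ingest({"source": "oci",     "resource": "web-01", "condition": "cpu-high"})
k2 = ingest({"source": "datadog", "resource": "web-01", "condition": "cpu-high"})
```

Both alerts land on the same incident (`k1 == k2`), which is what preserves a single timeline and prevents double-paging when OCI and Datadog each notice the same problem.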
This tiered approach optimizes for speed where it matters (infrastructure), sophistication where needed (applications), and team efficiency (unified incident workflow). The incremental cost of Datadog is justified by improved MTTR and reduced alert fatigue through better context and correlation.
Native OCI alerting has come a long way. The biggest advantage is near-real-time access to metrics - alarms trigger as soon as a threshold is breached because they’re evaluated server-side. With third-party tools, you’re polling metrics via the API, which introduces delay and costs. However, customization is where third-party tools shine. Datadog’s query language and alerting logic are more sophisticated than OCI’s alarm queries. If you need complex conditions like ‘alert if CPU is high AND memory is high AND it’s during business hours’, that’s much easier in Datadog.
We went all-in on native OCI alerting and regretted it initially. The customization is limited - you get basic threshold, absent, and rate-of-change alarms, but building sophisticated alert logic requires multiple alarms and manual correlation. The notification channels are decent (email, PagerDuty, Slack via Functions), but you don’t get the rich context and runbook integration that platforms like Datadog provide. We ended up building a hybrid approach where critical infrastructure alarms use OCI native for speed, and application-level alerting goes through Datadog for better context.
Cost is definitely a factor. Third-party tools charge based on metrics ingested and hosts monitored. For large OCI deployments, this can be substantial. Native OCI monitoring is included in your compute costs with no additional per-metric charges. Alert deduplication across systems requires either a central alerting platform (like PagerDuty) or custom logic in Functions to correlate events. We use PagerDuty as the aggregation layer - both OCI alarms and Datadog alerts flow into PagerDuty, where we apply deduplication rules and escalation policies.
Don’t underestimate the operational overhead of maintaining integrations. Datadog’s OCI integration requires API keys, proper IAM policies, and periodic updates as OCI adds new services. When we had an API key expire, we lost monitoring visibility for 6 hours before someone noticed. Native alerting has no such dependencies - it just works. However, the incident response workflow is less mature. You get an alert notification, but there’s no built-in incident timeline, collaborative investigation tools, or automated remediation triggers like you get with dedicated observability platforms.
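The 6-hour blind spot described above is avoidable with a cheap ‘monitor the monitor’ check: track when the integration last ingested a metric and alert once that goes stale. A hedged sketch of the staleness check - the 15-minute window is an arbitrary example, and in production `now` would come from the system clock:

```python
# Sketch: detect a silently broken integration (e.g. an expired API key) by
# checking how long it has been since the last metric arrived. The 15-minute
# threshold is an arbitrary example; tune it to your ingestion cadence.

STALE_AFTER_SECONDS = 15 * 60

def is_stale(last_ingest_epoch: float, now: float) -> bool:
    """True when no metric has arrived within the staleness window.

    In production, `now` would be time.time(); it is a parameter here
    so the logic is testable without a clock.
    """
    return (now - last_ingest_epoch) > STALE_AFTER_SECONDS

# A feed last seen 6 hours ago trips the check; a one-minute-old feed does not.
assert is_stale(last_ingest_epoch=0, now=6 * 3600)
assert not is_stale(last_ingest_epoch=1000, now=1000 + 60)
```

Running this from a scheduled job independent of the third-party platform turns a 6-hour silent failure into a 15-minute alert.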