Monitoring and Alerting Playbook for Cloud/CDN Outages (Actionable Templates)

2026-02-13

Ready-to-deploy monitoring rules and alert templates to detect Cloudflare/AWS outages and validate fallback compatibility—deploy within 90 minutes.

When Cloudflare or AWS goes dark: a monitoring & alerting playbook you can deploy in minutes

Pain point: You rely on third-party CDNs and cloud providers, but their outages cascade into your app and customers complain. You need rules and templates that detect dependency failures fast and validate fallback compatibility before you flip the switch.

Why this matters in 2026

Outages like the January 16, 2026 incident that affected major properties (including X) showed how a single edge provider incident can ripple across thousands of businesses. Since 2023 we've seen higher adoption of multi-CDN, RPKI-driven routing, and edge compute (Workers / Lambda@Edge). But these patterns increase operational complexity: more moving parts, more compatibility risk when failing over.

This playbook gives you: ready-to-deploy monitoring rules, alert templates, and a compatibility assessment checklist for fallback services so you can detect failures in provider stacks (Cloudflare, AWS, etc.) and safely activate fallbacks.

Inverted pyramid summary (most important first)

  • Detect provider outages using multi-path synthetics, DNS & BGP signals, and provider status pages.
  • Alert with actionable messages that include triage steps and playbook links.
  • Assess fallback compatibility automatically (TLS, URL signing, CORS, cookies, websocket, rate limits).
  • Failover decision tree and SLA mapping so you know when to switch traffic.

Core monitoring strategy (high signal, low noise)

Design monitoring so it answers these questions in under 60 seconds:

  • Is the CDN/edge provider dropping client-facing traffic?
  • Is DNS resolution or propagation the root cause?
  • Are BGP or transit-level issues manifesting?
  • Can our configured fallback accept traffic without breaking auth, caching or signed URLs?

1) Multi-path synthetics (must-have)

Run synthetic checks from at least three vantage points and two network paths per test: via the provider's public hostname (proxied) and direct to origin IP. This distinguishes provider edge failure from origin failure.

Example: use blackbox_exporter (Prometheus), Datadog synthetics, or a managed synthetic provider.

# Prometheus/blackbox_exporter probe config (snippet)
- job_name: 'synthetics'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
      - 'https://www.example.com'        # CDN-proxied hostname
      - 'https://203.0.113.10:443'       # direct-to-origin (use origin cert or IP SAN)
  relabel_configs:
  - source_labels: ['__address__']
    target_label: '__param_target'       # hand the target URL to blackbox_exporter
  - source_labels: ['__param_target']
    target_label: 'probe_target'         # readable label used in alert annotations
  - target_label: '__address__'
    replacement: '127.0.0.1:9115'        # scrape the blackbox_exporter itself (adjust host:port)

Prometheus alert rule (blackbox):

groups:
- name: synthetics.rules
  rules:
  - alert: ExternalEndpointUnavailable
    expr: probe_success{job='synthetics'} == 0
    for: 2m
    labels:
      severity: page
    annotations:
      summary: 'Synthetic probe failed for {{ $labels.probe_target }}'
      description: 'Probe failing for {{ $labels.probe_target }}. Compare the CDN-proxied and direct-to-origin targets to tell edge failure from origin failure.'

2) DNS & TTL monitoring

Monitor authoritative answers and TTL changes. A CDN outage may be accompanied by DNS anomalies (NXDOMAIN, SERVFAIL) or sudden TTL increases/decreases from automation.

# Example Datadog monitor (pseudo)
query: 'avg(last_1m):max:dns.query.response_time{domain:example.com} > 2'
name: 'DNS latency - example.com'
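
A quick resolver-side check complements the latency monitor above. This is a minimal sketch using dig; the domain and resolver IPs are placeholders, and the output is meant to feed whatever alert pipeline you already run.

# Resolver-side DNS sanity check (sketch; domain and resolver IPs are placeholders)
DOMAIN='www.example.com'
for NS in 1.1.1.1 8.8.8.8 9.9.9.9; do
  OUT=$(dig +tries=1 +time=2 "$DOMAIN" @"$NS")
  STATUS=$(echo "$OUT" | grep -o 'status: [A-Z]*' | awk '{print $2}')    # NOERROR / SERVFAIL / NXDOMAIN
  ANSWERS=$(echo "$OUT" | grep -o 'ANSWER: [0-9]*' | awk '{print $2}')
  echo "$NS status=${STATUS:-timeout} answers=${ANSWERS:-0}"
  if [[ "$STATUS" != "NOERROR" || "${ANSWERS:-0}" -eq 0 ]]; then
    echo "DNS anomaly via $NS for $DOMAIN"    # forward this signal to your alert pipeline
  fi
done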

3) BGP / transit-level signals

Subscribe to BGP monitoring feeds (such as CAIDA BGPStream or RIPE RIS/RIPEstat) or integrate with providers that expose RIB change events. Large outages often show up as mass prefix withdrawals or hijacks.

Alert when prefix withdrawals exceed a threshold or AS-path churn spikes for provider ASNs (Cloudflare: AS13335; major AWS ASNs):

# Conceptual rule (SIEM/BGP stream)
if prefix_withdrawals(provider_asn) > 50 in 5m:
  trigger 'BGPPrefixWithdrawalSpike'
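
If you do not run your own BGP collector, a public data API can approximate this signal. The sketch below assumes RIPEstat's announced-prefixes endpoint and its .data.prefixes field (verify the current schema); it compares the announced-prefix count against the last observed baseline and prints an alert line on a sharp drop. For production, prefer a proper feed such as BGPStream or RIPE RIS.

# Rough BGP visibility check via a public data API (sketch; endpoint and fields are assumptions to verify)
ASN='AS13335'                               # Cloudflare; repeat for the AWS ASNs you depend on
BASELINE_FILE="/var/tmp/prefix_count_${ASN}"
THRESHOLD_PCT=20                            # flag a >20% drop in announced prefixes

COUNT=$(curl -s "https://stat.ripe.net/data/announced-prefixes/data.json?resource=${ASN}" \
        | jq '.data.prefixes | length')
BASELINE=$(cat "$BASELINE_FILE" 2>/dev/null || echo "$COUNT")

if [[ -n "$COUNT" && "$COUNT" -lt $(( BASELINE * (100 - THRESHOLD_PCT) / 100 )) ]]; then
  echo "BGPPrefixWithdrawalSpike: ${ASN} announced prefixes ${BASELINE} -> ${COUNT}"
fi
[[ -n "$COUNT" ]] && echo "$COUNT" > "$BASELINE_FILE"   # only update the baseline on a successful fetch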

4) Provider status page + RSS scraping

Monitor provider status pages and their RSS/JSON feeds. Cloudflare and AWS publish incidents. Scrape and correlate with your synthetics before alerting external teams.

Tip: correlate provider-reported incidents with your internal failure signals to reduce noisy escalations.
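
Cloudflare's status page is Statuspage-hosted, which typically exposes JSON endpoints alongside the RSS feed; the sketch below assumes the /api/v2/incidents/unresolved.json endpoint (verify the equivalent URL for each provider you depend on) and emits a line your correlation step can consume.

# Poll provider status feeds and emit a correlation signal (sketch; URLs are assumptions to verify)
FEEDS='https://www.cloudflarestatus.com/api/v2/incidents/unresolved.json'   # add other providers' feeds here
for URL in $FEEDS; do
  OPEN=$(curl -s --max-time 10 "$URL" | jq '.incidents | length')
  if [[ -n "$OPEN" && "$OPEN" -gt 0 ]]; then
    echo "provider_incident_open url=$URL count=$OPEN"    # correlate with synthetics before paging humans
  fi
done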

Ready-to-deploy alert templates (Slack, PagerDuty, email)

Alerts must answer: What failed? Where did it fail? Next step? Include runbook link and major troubleshooting commands.

PagerDuty / Incident alert (template)

Title: [CRITICAL] External dependency failure - {{service}} via {{provider}}
Severity: P1
Summary: Synthetic checks failing for '{{hostname}}' from {{regions}}. probe_success=0 (since {{first_seen}}).
Affected: frontend, api
What to check:
  1) Provider status: {{provider_status_url}}
  2) DNS resolution: dig + timeout checks from 3 regions
  3) Origin direct test: curl --resolve '{{hostname}}:443:{{origin_ip}}' https://{{hostname}}/health
Runbook: {{runbook_url}}
Escalation: Follow on-call rota

Slack alert (compact actionable)

:rotating_light: *External CDN outage suspected* - *{{provider}}*
• Service: {{service}}
• Regions: {{regions}}
• Symptom: Synthetic probes failed (last 5m)
• Quick checks: `curl -I https://{{hostname}}` | `curl --resolve '{{hostname}}:443:{{origin_ip}}' https://{{hostname}}/health`
• Runbook: {{runbook_url}} | :pager: @oncall

Email / Ticket body

Subject: Incident - {{service}} affected by {{provider}} anomaly
Body:
- Impact: ~10% of requests returning 500 from the European edge
- First seen: {{first_seen}}
- Actions taken: Enabled additional synthetics, opened ticket with provider
- Next steps: Triage origin, evaluate failover compatibility

Failover compatibility assessment: automated checklist

Before failing over traffic to a backup CDN or direct-to-origin route, confirm these compatibility items. Automate checks where possible.

  1. TLS / certificate chain: Does the fallback serve a certificate that matches the hostname, and does it handle SNI correctly (or work with IP-based certificates)? Test with openssl s_client and automated certificate transparency checks. See our notes on domain and certificate due diligence.
  2. Signed URLs / URL expiration: Do signed URLs generated by the primary CDN work on the fallback? If not, ensure you can issue compatible signatures or switch to a permissive mode for the window of failover. Hybrid control planes can simplify this—see hybrid edge workflows.
  3. Cookies & session affinity: Will the fallback accept session cookies or maintain sticky sessions? Confirm cookie domains and SameSite settings.
  4. CORS & preflight: Confirm fallback returns Access-Control-Allow-* headers identical to primary path for JS clients.
  5. Websockets / HTTP/2 / HTTP/3: Check protocol support. Many edge providers default to QUIC/HTTP/3; fallback might only support HTTP/2 or HTTP/1.1 and break real-time features. For guidance on low-latency protocol tradeoffs see resources on HTTP/3 & low-latency patterns.
  6. WAF rules & Bot management: Are legitimate API clients blocked by strict WAF rules on the fallback? Test known user agents and API keys.
  7. Rate limits & quotas: Validate that the fallback won't throttle your traffic given expected surge during failover.
  8. Origin IP whitelists: Ensure fallback IP ranges are allowed by origin firewall and load balancers.
  9. Logging & analytics: Confirm request/response logging formats are preserved for post-incident analysis. Consider automating metadata extraction and correlation in your observability pipeline: metadata automation helps here.
  10. Signed cookies / authentication tokens: Test token verification paths (JWT keys, introspection endpoints) from the fallback path.

Automated compatibility test example (bash):

# TLS and CORS quick check
HOST='www.example.com'
FALLBACK='fallback.example-cdn.com'
OPENSSL_OUT=$(echo | openssl s_client -servername $HOST -connect $FALLBACK:443 2>/dev/null)
CORS=$(curl -s -I -H 'Origin: https://app.example.com' https://$FALLBACK/api/status)
if [[ "$OPENSSL_OUT" =~ "Verify return code: 0" ]] && [[ "$CORS" =~ "Access-Control-Allow-Origin" ]]; then
  echo 'TLS+CORS OK'
else
  echo 'Compatibility failure: inspect TLS or CORS'
fi
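
Checklist items 5 and 10 are also partly automatable. The following sketch checks the negotiated HTTP version, HTTP/3 advertisement, and a WebSocket upgrade against the fallback; the hostname and the /ws path are placeholders, and curl behaviour on a successful 101 upgrade varies, so treat the result as a smoke test.

# Protocol support smoke test against the fallback (sketch; hostname and /ws path are placeholders)
FALLBACK='fallback.example-cdn.com'

# Negotiated HTTP version (curl prints 1.1, 2, or 3)
curl -s -o /dev/null -w 'http_version=%{http_version}\n' --http2 "https://$FALLBACK/"

# HTTP/3 advertisement via Alt-Svc; clients fall back silently if absent, but real-time paths may regress
curl -sI "https://$FALLBACK/" | grep -qi '^alt-svc:.*h3' \
  && echo 'h3 advertised' || echo 'no h3 advertisement'

# WebSocket upgrade handshake: expect 101 from a WebSocket-capable endpoint
curl -s -o /dev/null -w 'ws_status=%{http_code}\n' --max-time 5 \
  -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
  "https://$FALLBACK/ws"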

Runbook: step-by-step incident triage & failover decision tree

  1. Detect: an alert fires (synthetic failures across two or more regions, plus provider-status confirmation or BGP anomalies).
  2. Scope: map impacted services (web, API, mobile). Use real-user monitoring (RUM) and backend error rates.
  3. Triage: Is the origin healthy? Run direct-to-origin checks. If the origin is failing, it's not a CDN outage; follow the origin runbook.
  4. Correlate: Cross-reference provider status, BGP signals, and DNS anomalies. If provider reports incident and you see edge-only failures, escalate provider ticket.
  5. Assess fallback readiness: Run automated compatibility checklist. If >80% checks pass, prepare controlled failover.
  6. Failover: Implement a staged switch: route a small percentage (5-10%) of traffic to the fallback and monitor error rate and latency for 5–10 minutes (a weighted-DNS sketch follows this list).
  7. Scale: If staged tests are green, increase traffic in steps. Keep provider ticket open and monitor for provider recovery.
  8. Post-incident: Revert to primary when provider stable and run post-mortem: timeline, SLA impact, RCA, and configuration changes to improve future response.
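
One common way to implement the staged switch in step 6 is weighted DNS. Below is a minimal sketch using the Route 53 CLI; the hosted zone ID, record names, and targets are placeholders, and a multi-CDN control plane or load-balancer weights work just as well.

# Staged failover via weighted DNS (sketch; zone ID, names, and targets are placeholders)
ZONE_ID='Z0EXAMPLE'
cat > /tmp/failover-change.json <<'EOF'
{
  "Comment": "Send ~10% of traffic to the fallback CDN",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "CNAME",
        "SetIdentifier": "fallback-cdn",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "fallback.example-cdn.com" }]
      }
    }
  ]
}
EOF
aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --change-batch file:///tmp/failover-change.json
# The primary record needs a matching weighted entry (e.g. SetIdentifier 'primary-cdn', Weight 90).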

SLA mapping and decision thresholds

Translate SLA commitments into actionable thresholds. Example:

  • Customer-facing SLA: 99.95% availability monthly.
  • Internal alert to evaluate failover: sustained client error rate > 2% and synthetic failures detected for > 3 minutes.
  • Automatic staged failover: when error rate > 5% and > 3 regions affected for 5+ minutes.
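
To make the internal thresholds above measurable, reduce client errors to a single rate you can compare against them. A minimal sketch querying the Prometheus HTTP API; the Prometheus address and the http_requests_total metric are assumptions, so substitute your own RED metrics.

# Client 5xx error rate over 5m as one number for threshold checks (sketch; metric name is an assumption)
PROM='http://prometheus:9090'
QUERY='sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

RATE=$(curl -s "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
       | jq -r '.data.result[0].value[1] // "0"')
echo "5xx error rate (5m): $RATE"
# Compare against the thresholds above: 0.02 to evaluate failover, 0.05 for automatic staged failover
awk -v r="$RATE" 'BEGIN { exit !(r > 0.02) }' && echo 'threshold exceeded: evaluate failover'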

Document the business impact tiers and the approval chain. For P1 incidents (revenue-impacting), automated failover can be executed by on-call without executive approval; for lower severity, require stakeholder sign-off. Also consider operational cost impacts of long-term failovers — storage and observability retention can balloon your bill; see cloud storage cost guidance.

Templates: Prometheus / Datadog / CloudWatch alerts

Prometheus (blackbox/probe)

alert: CDNProxyDown
expr: probe_success{job='synthetics',probe_target='https://www.example.com'} == 0
for: 2m
labels:
  severity: page
annotations:
  summary: 'CDN proxy appears down for example.com'
  description: 'Probes failing from multiple regions. Run direct-to-origin checks.'

Datadog monitor (HTTP synthetic)

name: 'Synthetics - example.com (CDN) - Failure rate high'
query: 'avg(last_5m):sum:synthetics.test.failed{test_name:example-cdn} > 0'
message: '{{#is_alert}} :rotating_light: CDN failure detected for example.com. Runbook: {{runbook_url}} {{/is_alert}}'

AWS CloudWatch alarm (origin health)

# CloudWatch alarm for ALB 5xxs via AWS CLI (example)
aws cloudwatch put-metric-alarm --alarm-name 'ALB-5xx-High' \
  --metric-name HTTPCode_Target_5XX_Count --namespace AWS/ApplicationELB \
  --statistic Sum --period 60 --evaluation-periods 3 \
  --threshold 10 --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=LoadBalancer,Value=app/my-alb/123456
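  # Note: add --alarm-actions <SNS topic ARN> (and --ok-actions) so the alarm actually notifies your paging integration.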

Real-world examples & lessons learned (experience)

Case: A global media company in 2025 relied on a single CDN and saw a 40-minute outage during a regional Cloudflare configuration failure. Their remediation included:

  • Implementing multi-path synthetics that would have detected the edge-only failure in 30s instead of 10 minutes.
  • Creating a fallback origin route with automated certificate issuance (ACME) to avoid TLS mismatch issues.
  • Pre-validating signed URL compatibility across their backup CDN—reducing failover time from 40 minutes to 6 minutes in subsequent tests.

Lesson: proactively test failover compatibility monthly and after any major change to signed token algorithms, TLS policy, or WAF rules.

Trends shaping this playbook in 2026

  • HTTP/3 & QUIC adoption: Fallbacks may need to support HTTP/3; otherwise, real-time features can regress. Monitor protocol negotiation failures; see low-latency pattern notes.
  • RPKI and routing security: More networks validate RPKI, reducing certain hijack risks — but still monitor prefix withdrawals for provider ASNs.
  • Edge compute everywhere: With Workers and Lambda@Edge handling auth and transforms, ensure fallback can replicate edge logic or proxy to an equivalent compute layer; architecture notes here: Edge-first patterns.
  • Multi-CDN control planes: Expect richer APIs for programmatic failover; embed these into your alerting automation for faster, safer rollouts.

Advanced strategies (future-proofing)

  • Chaos synthetic testing: Periodically simulate provider degradations (DNS flapping, provider status mixing) in a non-prod environment to validate runbooks and automation (see the sketch after this list).
  • Feature parity tests: Automate test suites that validate auth flows, signed URLs, and real-time APIs across your primary and fallback CDNs after any change.
  • Observability correlation: Use a central incident graph that correlates synthetics, BGP, DNS and RUM to reduce time-to-detect and avoid false positives. Metadata extraction automation can speed correlation: automating metadata extraction.
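
A crude way to exercise the chaos-testing item above is to blackhole the CDN edge from a non-production test host and verify that your synthetics, alerts, and runbook fire as expected. This sketch assumes a Linux host with iptables and root access; never run it against production.

# Simulate an edge outage from a NON-PRODUCTION test host (sketch; requires root and iptables)
EDGE_IPS=$(dig +short www.example.com | grep -E '^[0-9]+\.')    # current IPv4 edge addresses

for IP in $EDGE_IPS; do
  iptables -A OUTPUT -d "$IP" -j DROP         # blackhole the edge
done

sleep 300    # leave the failure in place for 5 minutes; confirm synthetics fail and the alert fires

for IP in $EDGE_IPS; do
  iptables -D OUTPUT -d "$IP" -j DROP         # restore connectivity
done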

Quick checklist to implement in the next 90 minutes

  1. Deploy blackbox_exporter probes for proxied hostname and direct origin IP in at least 3 regions.
  2. Create a Prometheus alert on probe_success == 0 with a 2-minute for duration at paging severity.
  3. Subscribe to provider status RSS/JSON and add a correlation step in your alert pipeline.
  4. Run an automated compatibility script against your fallback and record results in CI (fail the build if critical checks fail).
  5. Draft Slack/PagerDuty templates and link the service runbook.

Appendix: quick commands & sanity checks

  • Direct TLS test:
    openssl s_client -connect $HOST:443 -servername $HOST
  • Direct origin curl:
    curl -I --resolve 'www.example.com:443:203.0.113.10' https://www.example.com/health
  • DNS resolution (public resolver):
    dig +short www.example.com @8.8.8.8
  • Traceroute to CDN edge:
    traceroute www.example.com

Final takeaways

Detect early: Multi-path synthetics + DNS + BGP give the fastest signal on whether the problem is at the edge, DNS, or origin.

Alert with context: Deliver alerts that include immediate triage steps, runbook links, and compatibility status so on-call engineers can act fast.

Validate failovers: Automate compatibility checks for TLS, signed URLs, CORS, websockets, and rate limits. Failovers should be tested monthly and after any infra change.

Call to action

If you want a prebuilt repo with the Prometheus rules, Datadog monitors, alert templates and the automated compatibility scripts used in this playbook, download our 2026 Monitoring & Alerting Playbook bundle and import the configs into your observability stack. Validate your fallbacks today—don’t wait for the next provider incident.
