CloudWatch Alarms That Don’t Wake You Up at 3 AM for the Wrong Reasons

Most CloudWatch alarms are created once, never reviewed, and either fire constantly for things that don’t matter or stay silent during outages that cost real money. As someone who has been on-call for systems where both failure modes were in play simultaneously, I learned the difference between alarm setups that protect you and ones that just make noise. Today I’ll share the configuration that actually works.


The Two Categories You Need

Separate your alarms into two buckets: alarms that wake someone up and alarms that inform without paging. Most teams conflate these and end up with either alert fatigue (everything pages) or blind spots (nothing pages). The rule is simple: if the alert doesn’t require someone to act within the hour, it shouldn’t page anyone at 3 AM.

Informational alarms go to a Slack channel or email. Urgent alarms trigger PagerDuty or SNS with on-call escalation. The difference is consequence: is something actively broken for customers, or is something trending toward a problem?
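One way to wire that split (topic names, the email address, and the PagerDuty integration key below are placeholders) is two SNS topics: informational alarms notify email or a Slack integration, urgent alarms feed PagerDuty's CloudWatch endpoint.

```shell
# Informational: goes to email (or to Slack via AWS Chatbot subscribed to the same topic)
aws sns create-topic --name alerts-informational
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT:alerts-informational \
  --protocol email \
  --notification-endpoint team@example.com

# Urgent: PagerDuty's CloudWatch integration exposes an HTTPS endpoint per service
aws sns create-topic --name alerts-urgent
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT:alerts-urgent \
  --protocol https \
  --notification-endpoint "https://events.pagerduty.com/integration/YOUR_INTEGRATION_KEY/enqueue"
```

Every alarm then picks its severity by pointing `--alarm-actions` at one topic or the other.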

The Alarms That Actually Matter

Error rate, not error count. Alarming on a fixed number of errors is almost always wrong. A service handling 10,000 requests per minute with 20 errors is healthy. A service handling 100 requests per minute with 20 errors is on fire. Alarm on the percentage: errors divided by total requests, over a 5-minute evaluation period. Something like 5% error rate sustained for 2 consecutive periods before firing keeps you from being paged for a single bad request.
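As a sketch of that ratio alarm using CloudWatch metric math (the ALB name, SNS topic, and choice of 5xx/request metrics are assumptions; substitute your own error and request metrics):

```shell
# Hypothetical ALB-based error-rate alarm: (5xx / total requests) * 100,
# 5-minute periods, 2 consecutive breaches of 5% before firing.
aws cloudwatch put-metric-alarm \
  --alarm-name "api-error-rate-high" \
  --comparison-operator "GreaterThanThreshold" \
  --threshold 5 \
  --evaluation-periods 2 \
  --treat-missing-data "notBreaching" \
  --metrics '[
    {"Id":"error_rate","Expression":"(errors / requests) * 100","Label":"ErrorRate","ReturnData":true},
    {"Id":"errors","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/ApplicationELB","MetricName":"HTTPCode_Target_5XX_Count","Dimensions":[{"Name":"LoadBalancer","Value":"app/your-alb/1234567890abcdef"}]},"Period":300,"Stat":"Sum"}},
    {"Id":"requests","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/ApplicationELB","MetricName":"RequestCount","Dimensions":[{"Name":"LoadBalancer","Value":"app/your-alb/1234567890abcdef"}]},"Period":300,"Stat":"Sum"}}
  ]' \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:pagerduty-topic
```

The `notBreaching` treatment of missing data keeps the alarm quiet during zero-traffic periods, when there are no 5xx datapoints to divide.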

Latency at the 99th percentile, not average. Average latency hides the worst user experiences. If p99 latency on your API doubles, 1 in 100 requests is slow — which sounds small until you’re handling real traffic. CloudWatch supports extended statistics (p90, p95, p99) on most metrics. Use them. Engineers who’ve been burned by average-based alarms know why: the average can look fine while real users are experiencing failures.
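For illustration (the load balancer name and the one-second threshold are assumptions; tune the threshold to your own latency budget), a p99 alarm on ALB response time looks like:

```shell
# Hypothetical p99 latency alarm: fires when 1 in 100 requests takes over 1 second
# for two consecutive 5-minute periods.
aws cloudwatch put-metric-alarm \
  --alarm-name "api-p99-latency-high" \
  --namespace "AWS/ApplicationELB" \
  --metric-name "TargetResponseTime" \
  --dimensions Name=LoadBalancer,Value=app/your-alb/1234567890abcdef \
  --extended-statistic "p99" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1.0 \
  --comparison-operator "GreaterThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic
```

Note that `--extended-statistic` replaces `--statistic` here; the two flags are mutually exclusive.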

ECS task count below desired. If your ECS service is supposed to run 3 tasks and is running 2, something died and didn’t restart. This is a useful alarm that’s often missed because ECS service events don’t automatically trigger notifications.

aws cloudwatch put-metric-alarm \
  --alarm-name "ecs-tasks-below-desired" \
  --metric-name "RunningTaskCount" \
  --namespace "ECS/ContainerInsights" \
  --statistic "Minimum" \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 3 \
  --comparison-operator "LessThanThreshold" \
  --dimensions Name=ClusterName,Value=your-cluster Name=ServiceName,Value=your-service \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic

RDS storage and connection count. Storage alarms at 80% full are obvious but often missing. Connection count approaching max_connections is less obvious but equally important — a connection spike that exhausts the connection pool causes application errors that look unrelated to the database.
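Both can be covered with standard AWS/RDS metrics. The instance identifier and thresholds below are placeholders: FreeStorageSpace is reported in bytes, so compute the 80%-full mark from your allocated storage, and derive the connection threshold from your instance’s max_connections.

```shell
# Hypothetical: free storage below ~20 GB on a 100 GB instance, i.e. 80% full
# (FreeStorageSpace is in bytes)
aws cloudwatch put-metric-alarm \
  --alarm-name "rds-free-storage-low" \
  --namespace "AWS/RDS" \
  --metric-name "FreeStorageSpace" \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --statistic "Average" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 20000000000 \
  --comparison-operator "LessThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic

# Hypothetical: connection count at ~80% of a max_connections of 500
aws cloudwatch put-metric-alarm \
  --alarm-name "rds-connections-high" \
  --namespace "AWS/RDS" \
  --metric-name "DatabaseConnections" \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --statistic "Maximum" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 400 \
  --comparison-operator "GreaterThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic
```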

Composite Alarms: The Underused Feature

Composite alarms combine multiple alarms with AND/OR logic. Use them to reduce noise. Example: page me when API error rate is high AND latency is elevated. If only one is true, it might be a single bad request or a brief slowdown. Both together means something is actually wrong.

aws cloudwatch put-composite-alarm \
  --alarm-name "api-service-degraded" \
  --alarm-rule "ALARM(api-error-rate-high) AND ALARM(api-p99-latency-high)" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:pagerduty-topic

Alarm Hygiene

Review your alarms monthly. If an alarm fires and everyone ignores it, either the threshold is wrong or the alarm shouldn’t exist. Just as important: write alarm descriptions that tell the on-call engineer what to check. An alarm named “high-error-rate” with no description forces the responder to investigate before they can even start fixing. Add a runbook link or a one-sentence summary of likely causes. That investment pays off at 3 AM.
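For instance (the runbook URL, likely causes, and thresholds here are made up), a description that gives the responder a starting point:

```shell
# Hypothetical alarm with a description the on-call engineer can act on immediately
aws cloudwatch put-metric-alarm \
  --alarm-name "api-error-rate-high" \
  --alarm-description "5xx rate above 5% for 10 min. Likely causes: bad deploy, dependency outage, RDS connection exhaustion. Runbook: https://wiki.example.com/runbooks/api-error-rate" \
  --namespace "AWS/ApplicationELB" \
  --metric-name "HTTPCode_Target_5XX_Count" \
  --dimensions Name=LoadBalancer,Value=app/your-alb/1234567890abcdef \
  --statistic "Sum" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 50 \
  --comparison-operator "GreaterThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:pagerduty-topic
```

The description text is delivered in the SNS notification payload, so it reaches the pager along with the alarm name.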

Jason Michael
