CloudWatch Alarms That Don’t Wake You Up at 3 AM for the Wrong Reasons

Most CloudWatch alarms are created once, never reviewed, and either fire constantly for things that don’t matter or stay silent during outages that cost real money. As someone who has been on-call for systems where both failure modes were in play simultaneously, I learned the difference between alarm setups that protect you and ones that just make noise. Today I’ll share the configuration that actually works.


The Two Categories You Need

Separate your alarms into two buckets: alarms that wake someone up and alarms that inform without paging. Most teams conflate these and end up with either alert fatigue (everything pages) or blind spots (nothing pages). The rule is simple: if the alert doesn’t require someone to act within the hour, it shouldn’t page anyone at 3 AM.

Informational alarms go to a Slack channel or email. Urgent alarms trigger PagerDuty or SNS with on-call escalation. The difference is consequence: is something actively broken for customers, or is something trending toward a problem?
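One way to wire that split (topic names, the email address, and the PagerDuty integration key below are placeholders) is two SNS topics: informational alarms notify email or a Slack integration, urgent alarms feed PagerDuty's CloudWatch endpoint.

```shell
# Informational: goes to email (or to Slack via AWS Chatbot subscribed to the same topic)
aws sns create-topic --name alerts-informational
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT:alerts-informational \
  --protocol email \
  --notification-endpoint team@example.com

# Urgent: PagerDuty's CloudWatch integration exposes an HTTPS endpoint per service
aws sns create-topic --name alerts-urgent
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT:alerts-urgent \
  --protocol https \
  --notification-endpoint "https://events.pagerduty.com/integration/YOUR_INTEGRATION_KEY/enqueue"
```

Every alarm then picks its severity by pointing `--alarm-actions` at one topic or the other.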

The Alarms That Actually Matter

Error rate, not error count. Alarming on a fixed number of errors is almost always wrong. A service handling 10,000 requests per minute with 20 errors is healthy. A service handling 100 requests per minute with 20 errors is on fire. Alarm on the percentage: errors divided by total requests, over a 5-minute evaluation period. Something like 5% error rate sustained for 2 consecutive periods before firing keeps you from being paged for a single bad request.
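As a sketch of that ratio alarm using CloudWatch metric math (the ALB name, SNS topic, and choice of 5xx/request metrics are assumptions; substitute your own error and request metrics):

```shell
# Hypothetical ALB-based error-rate alarm: (5xx / total requests) * 100,
# 5-minute periods, 2 consecutive breaches of 5% before firing.
aws cloudwatch put-metric-alarm \
  --alarm-name "api-error-rate-high" \
  --comparison-operator "GreaterThanThreshold" \
  --threshold 5 \
  --evaluation-periods 2 \
  --treat-missing-data "notBreaching" \
  --metrics '[
    {"Id":"error_rate","Expression":"(errors / requests) * 100","Label":"ErrorRate","ReturnData":true},
    {"Id":"errors","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/ApplicationELB","MetricName":"HTTPCode_Target_5XX_Count","Dimensions":[{"Name":"LoadBalancer","Value":"app/your-alb/1234567890abcdef"}]},"Period":300,"Stat":"Sum"}},
    {"Id":"requests","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/ApplicationELB","MetricName":"RequestCount","Dimensions":[{"Name":"LoadBalancer","Value":"app/your-alb/1234567890abcdef"}]},"Period":300,"Stat":"Sum"}}
  ]' \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:pagerduty-topic
```

The `notBreaching` treatment of missing data keeps the alarm quiet during zero-traffic periods, when there are no 5xx datapoints to divide.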

Latency at the 99th percentile, not average. Average latency hides the worst user experiences. If p99 latency on your API doubles, 1 in 100 requests is slow — which sounds small until you’re handling real traffic. CloudWatch supports extended statistics (p90, p95, p99) on most metrics. Use them. Engineers who’ve been burned by average-based alarms know why: the average can look fine while real users are experiencing failures.
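For illustration (the load balancer name and the one-second threshold are assumptions; tune the threshold to your own latency budget), a p99 alarm on ALB response time looks like:

```shell
# Hypothetical p99 latency alarm: fires when 1 in 100 requests takes over 1 second
# for two consecutive 5-minute periods.
aws cloudwatch put-metric-alarm \
  --alarm-name "api-p99-latency-high" \
  --namespace "AWS/ApplicationELB" \
  --metric-name "TargetResponseTime" \
  --dimensions Name=LoadBalancer,Value=app/your-alb/1234567890abcdef \
  --extended-statistic "p99" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1.0 \
  --comparison-operator "GreaterThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic
```

Note that `--extended-statistic` replaces `--statistic` here; the two flags are mutually exclusive.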

ECS task count below desired. If your ECS service is supposed to run 3 tasks and is running 2, something died and didn’t restart. This is a useful alarm that’s often missed because ECS service events don’t automatically trigger notifications.

aws cloudwatch put-metric-alarm \
  --alarm-name "ecs-tasks-below-desired" \
  --metric-name "RunningTaskCount" \
  --namespace "ECS/ContainerInsights" \
  --statistic "Minimum" \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 3 \
  --comparison-operator "LessThanThreshold" \
  --dimensions Name=ClusterName,Value=your-cluster Name=ServiceName,Value=your-service \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic

RDS storage and connection count. Storage alarms at 80% full are obvious but often missing. Connection count approaching max_connections is less obvious but equally important — a connection spike that exhausts the connection pool causes application errors that look unrelated to the database.
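Both can be covered with standard AWS/RDS metrics. The instance identifier and thresholds below are placeholders: FreeStorageSpace is reported in bytes, so compute the 80%-full mark from your allocated storage, and derive the connection threshold from your instance’s max_connections.

```shell
# Hypothetical: free storage below ~20 GB on a 100 GB instance, i.e. 80% full
# (FreeStorageSpace is in bytes)
aws cloudwatch put-metric-alarm \
  --alarm-name "rds-free-storage-low" \
  --namespace "AWS/RDS" \
  --metric-name "FreeStorageSpace" \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --statistic "Average" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 20000000000 \
  --comparison-operator "LessThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic

# Hypothetical: connection count at ~80% of a max_connections of 500
aws cloudwatch put-metric-alarm \
  --alarm-name "rds-connections-high" \
  --namespace "AWS/RDS" \
  --metric-name "DatabaseConnections" \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \
  --statistic "Maximum" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 400 \
  --comparison-operator "GreaterThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:your-alarm-topic
```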

Composite Alarms: The Underused Feature

Composite alarms combine multiple alarms with AND/OR logic. Use them to reduce noise. Example: page me when API error rate is high AND latency is elevated. If only one is true, it might be a single bad request or a brief slowdown. Both together means something is actually wrong.

aws cloudwatch put-composite-alarm \
  --alarm-name "api-service-degraded" \
  --alarm-rule "ALARM(api-error-rate-high) AND ALARM(api-p99-latency-high)" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:pagerduty-topic

Alarm Hygiene

Review your alarms monthly. If an alarm fires and everyone ignores it, either the threshold is wrong or the alarm shouldn’t exist. Just as important: write alarm descriptions that tell the on-call engineer what to check. An alarm named “high-error-rate” with no description forces the responder to investigate before they can even start fixing. Add a runbook link or a one-sentence summary of likely causes. That investment pays off at 3 AM.
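For instance (the runbook URL, likely causes, and thresholds here are made up), a description that gives the responder a starting point:

```shell
# Hypothetical alarm with a description the on-call engineer can act on immediately
aws cloudwatch put-metric-alarm \
  --alarm-name "api-error-rate-high" \
  --alarm-description "5xx rate above 5% for 10 min. Likely causes: bad deploy, dependency outage, RDS connection exhaustion. Runbook: https://wiki.example.com/runbooks/api-error-rate" \
  --namespace "AWS/ApplicationELB" \
  --metric-name "HTTPCode_Target_5XX_Count" \
  --dimensions Name=LoadBalancer,Value=app/your-alb/1234567890abcdef \
  --statistic "Sum" \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 50 \
  --comparison-operator "GreaterThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:pagerduty-topic
```

The description text is delivered in the SNS notification payload, so it reaches the pager along with the alarm name.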

Jason Michael
