The 3 Biggest Wins — 60 Percent Bill Reduction
CloudWatch billing has gotten complicated with all the “default settings are fine” noise flying around. They’re not fine. Your CloudWatch bill can jump from $200 to $2,000 a month without a single line of application code changing, because AWS defaults are built for convenience, not your wallet.
The math is brutal and worth staring at directly. Standard ingestion: $0.50 per GB. Storage: $0.03 per GB per month. One log group left on “never expire” with moderate traffic becomes a slow financial leak that compounds every 30 days without anyone noticing.
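To see how the leak compounds, here is a rough sketch of that arithmetic in Python. It uses the two prices quoted above and assumes steady ingestion and that stored bytes equal ingested bytes (it ignores compression, so real storage charges should come in lower). The function name and parameters are my own, purely for illustration.

```python
from typing import Optional

INGEST_PER_GB = 0.50         # Standard ingestion price quoted above
STORAGE_PER_GB_MONTH = 0.03  # storage price quoted above


def monthly_log_cost(gb_per_month: float, month: int,
                     retention_months: Optional[int] = None) -> float:
    """Estimated bill in a given month (1-based) for one log group.

    retention_months=None models "never expire": stored data keeps growing.
    """
    months_kept = month if retention_months is None else min(month, retention_months)
    stored_gb = gb_per_month * months_kept
    return gb_per_month * INGEST_PER_GB + stored_gb * STORAGE_PER_GB_MONTH


# 100 GB/month, never expire: month 1 vs month 24
first = monthly_log_cost(100, 1)
two_years_in = monthly_log_cost(100, 24)
capped = monthly_log_cost(100, 24, retention_months=1)
```

With "never expire", the same log group costs more every month even though traffic never changed. With a one-month retention cap, month 24 costs the same as month 1.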
Here’s what actually moves the needle — no new tools, no platform migrations:
- Send rarely queried logs to the Infrequent Access log class: saves 50 percent on ingestion ($0.25/GB vs $0.50/GB)
- Set retention to 30 days for application logs: often saves 70 percent or more on storage costs
- Redirect VPC Flow Logs and ALB logs to S3: takes your highest-volume sources off the $0.50/GB CloudWatch ingestion meter
Those three moves combined cut most teams’ CloudWatch bills by 60 percent inside a week. No code changes. No subscriptions. Just native AWS features sitting there, unused, while you overpay.
Diagnosing What’s Actually Costing You
Before cutting anything, you need to know which log sources are burning through budget. Most teams guess wrong — blaming application logs when the real culprit is VPC Flow Logs or a rogue CloudWatch Insights query scanning terabytes of data nobody asked for.
Open AWS Cost Explorer. Filter by Service: CloudWatch. Group by Usage Type. You’ll see line items, each prefixed with a region code such as USE1-, like:
- TimedStorage-ByteHrs: log storage. Retention gone wrong, old logs sitting around forever.
- DataProcessing-Bytes: log ingestion volume ($0.50 per GB on Standard). Too much data flowing into CloudWatch.
- CW:MetricMonitorUsage: custom metrics, where high cardinality shows up (we’ll fix this).
- DataScanned-Bytes: Logs Insights queries scanning logs ($0.005 per GB scanned).
The path: Cost Explorer > Services > CloudWatch > pick your billing period > Group By: Usage Type. You’ll see exact dollar amounts tied to each cost driver. No guessing.
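If you would rather pull the same breakdown from a script, here is a minimal sketch against the Cost Explorer `get_cost_and_usage` API. The client is passed in so the function stays testable. The service name "AmazonCloudWatch" is what I would expect Cost Explorer to use for this filter, so treat it as an assumption and check it against your own account. Pagination is left out for brevity.

```python
def cloudwatch_cost_by_usage_type(ce_client, start, end):
    """Return [(usage_type, dollars), ...] sorted by cost, highest first.

    start/end are 'YYYY-MM-DD' strings; end is exclusive.
    """
    resp = ce_client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        # Assumed service name; verify in your own Cost Explorer
        Filter={"Dimensions": {"Key": "SERVICE", "Values": ["AmazonCloudWatch"]}},
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )
    totals = {}
    for period in resp["ResultsByTime"]:
        for group in period["Groups"]:
            usage_type = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[usage_type] = totals.get(usage_type, 0.0) + amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Usage (requires boto3 and credentials; the dates are examples):
#   import boto3
#   rows = cloudwatch_cost_by_usage_type(boto3.client("ce"), "2024-05-01", "2024-06-01")
#   for usage_type, dollars in rows:
#       print(f"{usage_type:40s} ${dollars:,.2f}")
```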
In my experience, the log storage and log ingestion line items account for roughly 80 percent of overages. Three of the six fixes below will solve your problem almost immediately.
Fix 1: Migrate Old Logs to Infrequent Access
Probably should have opened with this section, honestly. CloudWatch has a log class system most engineers don’t know exists: Standard at $0.50/GB ingestion, Infrequent Access at $0.25/GB ingestion. Half the price. The trade-off is a reduced feature set. At the time of writing, Infrequent Access log groups still support Logs Insights queries, but not features such as Live Tail and metric filters.
That trade-off only stings if those logs feed alarms, dashboards, or live debugging. Error logs you check once a month during a post-mortem? Audit trails nobody touches until a compliance review? Perfect fit.
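How to do it (the log class is chosen when a log group is created and can’t be changed afterward, so this means creating a new group and pointing your log source at it):
- Open CloudWatch > Log Groups
- Create log group
- Log class: Infrequent Access
- Set a retention period and create the group
- Repoint the application, agent, or service at the new log group
The cost drop applies to everything ingested from that point on. A log group ingesting 100 GB per month goes from $50 ingestion to $25. Do that for two log groups and you’ve saved $50/month for well under an hour of work. Most teams miss this entirely because they assume all CloudWatch logs cost the same. They don’t.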
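For anyone scripting this, here is a minimal boto3-style sketch. It assumes the log class is picked at creation time through the `logGroupClass` parameter of `create_log_group`. The group name in the usage comment is a made-up example, and the 30-day default is just my choice.

```python
def create_infrequent_access_group(logs_client, name, retention_days=30):
    """Create a log group in the Infrequent Access class with a retention cap."""
    logs_client.create_log_group(
        logGroupName=name,
        logGroupClass="INFREQUENT_ACCESS",  # the other option is "STANDARD"
    )
    # Set retention right away so the new group never sits on "never expire"
    logs_client.put_retention_policy(
        logGroupName=name,
        retentionInDays=retention_days,
    )
    return name


# Usage (requires boto3 and credentials; the name is hypothetical):
#   import boto3
#   create_infrequent_access_group(boto3.client("logs"), "/app/audit-ia", 365)
```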
Use Infrequent Access for:
- Error logs (only queried during incidents)
- Debug logs (rarely touched in production)
- Audit/compliance logs (queried for reports, not dashboards)
- Log groups kept mainly for long-term history (the class applies to newly ingested data, so it won’t reprice logs already stored)
Do NOT use Infrequent Access for:
- Logs powering real-time dashboards
- Logs you query multiple times daily
- Application performance logs (APM data)
Fix 2: Set Retention Policies
The default log group setting is “never expire.” That’s CloudWatch’s version of a debt trap — $0.03 per GB per month, compounding indefinitely. A 1 TB log group sitting untouched for two years costs $720 in storage alone. Nobody’s reading it after month three. That’s just money gone.
Retention policies are straightforward. Set them by log type:
- Application logs: 30 days (recent context is all you need)
- Security/authentication logs: 365 days (compliance usually requires it)
- Audit logs: check your regulations. PCI-DSS requires 12 months of audit log history, with the most recent three months immediately available. SOC 2 often wants 365 days.
- API access logs: 7 or 14 days (rarely needed for troubleshooting after that)
The storage savings are significant. A 500 GB application log group on 365-day retention costs $15/month in storage. Switch to 30 days and you’re paying about $1.25/month. Scale that across 10 log groups and you’ve freed up roughly $137/month with a configuration change, not a refactor.
Set retention with the AWS CLI, one command per log group (easy to loop over in a script):
aws logs put-retention-policy --log-group-name /aws/lambda/myfunc --retention-in-days 30
Or through the console: Log Groups > select group > Actions > Edit retention policy > choose days > Save.
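To cover many log groups in one pass, something like the following boto3-style sketch works. It walks every log group with the `describe_log_groups` paginator and applies `put_retention_policy` by name prefix. The prefix rules, the `dry_run` flag, and the function name are my own illustration, not anything AWS prescribes.

```python
def apply_retention(logs_client, rules, default_days=None, dry_run=True):
    """Apply retention by log group name prefix.

    rules: list of (prefix, days); the first matching prefix wins.
    Returns the (name, days) changes made, or planned when dry_run is True.
    """
    changes = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            name = group["logGroupName"]
            target = next(
                (days for prefix, days in rules if name.startswith(prefix)),
                default_days,
            )
            # Skip groups with no matching rule or already at the target.
            # "retentionInDays" is absent when the group never expires.
            if target is None or group.get("retentionInDays") == target:
                continue
            changes.append((name, target))
            if not dry_run:
                logs_client.put_retention_policy(
                    logGroupName=name, retentionInDays=target
                )
    return changes


# Usage (requires boto3 and credentials; the prefixes are examples):
#   import boto3
#   rules = [("/aws/lambda/", 30), ("/security/", 365)]
#   print(apply_retention(boto3.client("logs"), rules, dry_run=True))
```

Run it with `dry_run=True` first and read the list before letting it change anything. Shortening retention deletes older data for good.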
Here’s the counterintuitive part. On log groups that have piled up years of data, this can save more than moving to Infrequent Access. Ingestion is billed once per GB, as the data arrives. Storage is billed on every GB you keep, every single month, for as long as you keep it. The ongoing nature is what makes it stack up.
Fix 3: Route High-Volume Logs to S3
VPC Flow Logs, ALB access logs, CloudFront logs: these three generate enormous volume. A moderately busy ALB can produce 50–100 GB per day. At $0.50/GB ingestion plus $0.03/GB/month storage in CloudWatch, 100 GB a day is $1,500 a month in ingestion alone.
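S3 standard storage runs $0.023 per GB per month, about a quarter less than CloudWatch on storage alone. The bigger win is ingestion. ALB access logs land in S3 with no delivery charge. VPC Flow Logs are billed as vended logs, and delivery to S3 is listed at $0.25 per GB for the first 10 TB a month (check the current pricing page), half the CloudWatch rate.
The math on 100 GB/day of VPC Flow Logs:
- CloudWatch: (100 GB/day × 30 days × $0.50) + (3,000 GB × $0.03) = $1,500 + $90 = $1,590/month
- S3: (3,000 GB × $0.25 delivery) + (3,000 GB × $0.023) = $750 + $69 = $819/month
- Savings: $771/month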
You lose real-time CloudWatch Logs Insights queries on those logs, which for massive volumes are slow and expensive anyway. What you gain is cheap archival and Athena for ad-hoc queries.
To redirect ALB access logs to S3:
- EC2 > Load Balancers > select ALB
- Actions > Edit attributes
- Access logs > Enable > specify S3 bucket
- Save
For VPC Flow Logs:
- VPC > Your VPCs > select VPC
- Flow Logs tab > Create flow log
- Destination: S3 bucket
- Choose your bucket and prefix
- Create
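The same two changes can be scripted. Here is a rough boto3-style sketch using `create_flow_logs` on the EC2 client and `modify_load_balancer_attributes` on the ELBv2 client. It assumes the bucket already exists with a policy that allows log delivery. Every ID, ARN, and bucket name in the usage comments is a placeholder.

```python
def send_vpc_flow_logs_to_s3(ec2_client, vpc_id, bucket_arn):
    """Create a flow log for one VPC that delivers straight to S3."""
    return ec2_client.create_flow_logs(
        ResourceIds=[vpc_id],
        ResourceType="VPC",
        TrafficType="ALL",
        LogDestinationType="s3",
        LogDestination=bucket_arn,  # e.g. arn:aws:s3:::bucket-name/prefix/
    )


def enable_alb_access_logs(elbv2_client, load_balancer_arn, bucket, prefix=""):
    """Turn on ALB access logging to an S3 bucket."""
    attributes = [
        {"Key": "access_logs.s3.enabled", "Value": "true"},
        {"Key": "access_logs.s3.bucket", "Value": bucket},
    ]
    if prefix:
        attributes.append({"Key": "access_logs.s3.prefix", "Value": prefix})
    return elbv2_client.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=attributes,
    )


# Usage (requires boto3 and credentials; all identifiers are placeholders):
#   import boto3
#   send_vpc_flow_logs_to_s3(boto3.client("ec2"), "vpc-0123456789abcdef0",
#                            "arn:aws:s3:::example-flow-logs/vpc/")
#   enable_alb_access_logs(boto3.client("elbv2"), "<your-alb-arn>",
#                          "example-alb-logs", prefix="prod")
```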
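Flow logs start arriving in S3 within a few minutes. Athena charges about $5 per TB scanned, the same $0.005 per GB as Logs Insights, so the query savings come from scanning less. Partition by date and store the logs compressed or as Parquet, and a query over a few days of VPC Flow Logs can run $0.25–$0.50, where an unpartitioned scan of the same raw data would cost several times that.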
Fix 4: Stop the High-Cardinality Metrics Drain
Custom metrics are a hidden cost that sneaks up fast. Every unique combination of metric name and dimension values is a separate billable metric, at $0.30 per metric per month for the first 10,000.
Here’s how it spirals. You emit a metric called “api_latency” with a dimension of UserID. You have 1,000 users. That’s 1,000 billable metrics. Cost: $300/month for what feels like one metric. Now add RequestID as a dimension with 500,000 unique values per month. At the flat $0.30 rate that would be $150,000/month. Volume tiers and hourly proration pull the real figure well below that, but it is still a five-figure surprise. I’ve seen this happen: teams add “request ID” to a high-throughput metric and their bill jumps $15,000 in a single month. Nobody caught the dimension cardinality until the invoice arrived.
The fix is aggregation. Group by user tier or service instead of individual user. Example:
Bad (high cardinality):
api_latency { UserID: "user-1234" } = 145ms
api_latency { UserID: "user-5678" } = 132ms
... (1,000 unique UserIDs = 1,000 metrics)
Good (low cardinality):
api_latency { UserGroup: "premium" } = 145ms
api_latency { UserGroup: "free" } = 132ms
... (2 metrics = $0.60/month instead of $300)
Review your custom metrics in CloudWatch console: Metrics > All Metrics. Look for dimensions with thousands of unique values. Those are your culprits. Refactor them to aggregate by tier, region, or service — not by individual entity.
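If you can export a sample of what your code emits, a few lines of Python will count the billable series for you. This is a sketch under simple assumptions: each datapoint is a (metric name, dimensions dict) pair, and cost uses the flat $0.30 rate from above with no tiers or proration. The function names are mine.

```python
from collections import defaultdict

PRICE_PER_METRIC = 0.30  # flat rate quoted above; ignores tiers and proration


def count_billable_metrics(datapoints):
    """Count unique (name, dimension values) combinations per metric name.

    datapoints: iterable of (metric_name, {dimension_name: value}).
    """
    series = defaultdict(set)
    for name, dimensions in datapoints:
        # Sort so dimension order does not create false "unique" series
        series[name].add(tuple(sorted(dimensions.items())))
    return {name: len(combos) for name, combos in series.items()}


def estimate_monthly_cost(counts):
    """Rough monthly cost per metric name at the flat rate."""
    return {name: round(n * PRICE_PER_METRIC, 2) for name, n in counts.items()}


# The two examples above: per-user vs per-tier dimensions
per_user = [("api_latency", {"UserID": f"user-{i}"}) for i in range(1000)]
per_tier = [("api_latency", {"UserGroup": g}) for g in ("premium", "free") * 500]

high = estimate_monthly_cost(count_billable_metrics(per_user))
low = estimate_monthly_cost(count_billable_metrics(per_tier))
```

Same number of datapoints in both lists. Only the dimension choice changes what you get billed for.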
Fix 5: Tune Your Log Insights Queries
CloudWatch Logs Insights charges $0.005 per GB scanned. Feels cheap. Run a query over 10 TB during a week-long troubleshooting session and that single query costs $50. Do it five times and you’ve spent $250 before anyone notices.
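Most teams run inefficient queries without realizing the cost. The charge is based on data scanned, and that is set by the log groups and time range you select. No time limit means scanning everything in the group. A filter clause narrows the results, not the bill.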
Bad query (scans 10 TB, costs $50):
fields @timestamp, @message | filter @message like /ERROR/
Good query (scans 50 GB, costs $0.25):
fields @timestamp, @message | filter @message like /ERROR/ | stats count() by @message
The second version uses a tight time range selected in the UI (last 4 hours, not all-time), and that is what shrinks the scan. The stats aggregation summarizes instead of returning raw rows, which makes the output far easier to read but does not change how much data gets scanned.
Best practices for Log Insights:
- Always set start/end time. Default to the minimum window needed: 4 hours, not 30 days.
- Use stats and aggregation functions. Summarize in one pass instead of rerunning variations of a raw-row query.
- Save common queries as saved queries. Share them with your team so nobody burns scans rebuilding the same query by trial and error.
- Filter with @message before running stats. Narrow scope first, then summarize.
Tuning query efficiency takes 15 minutes once and saves $100+ per month going forward. That’s a good ratio.
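One way to make the time window non-optional is to wrap queries in a helper that always sets it. Below is a boto3-style sketch using `start_query` and `get_query_results`. It reads `bytesScanned` from the returned statistics and prices it at the $0.005/GB figure above. The helper name, the 4-hour default, and the polling interval are my own choices.

```python
import time

PRICE_PER_GB_SCANNED = 0.005  # Logs Insights price quoted above


def run_bounded_query(logs_client, log_group, query, hours=4,
                      poll_seconds=1.0, now=None):
    """Run a Logs Insights query over the last `hours` only.

    Returns (results, gb_scanned, estimated_cost_usd).
    """
    end = int(time.time() if now is None else now)
    start = end - int(hours * 3600)  # epoch seconds
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
        limit=100,
    )["queryId"]

    # Poll until the query leaves the in-progress states
    while True:
        resp = logs_client.get_query_results(queryId=query_id)
        if resp["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    gb_scanned = resp.get("statistics", {}).get("bytesScanned", 0) / 1e9
    return resp.get("results", []), gb_scanned, gb_scanned * PRICE_PER_GB_SCANNED


# Usage (requires boto3 and credentials; the log group is an example):
#   import boto3
#   q = "fields @timestamp, @message | filter @message like /ERROR/ | stats count() by @message"
#   rows, gb, cost = run_bounded_query(boto3.client("logs"), "/aws/lambda/myfunc", q)
#   print(f"scanned {gb:.1f} GB, about ${cost:.2f}")
```

Logging the scanned volume after every query is a cheap habit that makes expensive ones visible the same day.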
Fix 6: Stop Logging Things You Don’t Need
This is the hardest fix — and the most impactful long-term. Most applications log too much. Full HTTP request bodies. JWT tokens. Stack traces on every error, not just real exceptions. SDK retry logs that run 3–5x the volume of actual application output.
I learned this the hard way. A Python Lambda function using the requests library in verbose mode was logging full HTTP responses, including the entire HTML body on 404 pages. That worked out to 50 KB per invocation × 100,000 invocations per day = 5 GB per day, or about 150 GB per month × $0.50/GB ingestion = $75/month in pure noise from a single function. Disabling verbose logging cut it to $8/month. Don’t make my mistake.
Pruning logs requires code changes, but a typical application logging 100 KB per request can drop to 5 KB — request ID, status code, latency, error message — cutting log volume by 95 percent. That’s not a rounding error.
What to audit:
- HTTP request bodies — log the first 200 bytes and a request ID, not the full payload.
- Stack traces — log only for real exceptions, not every error condition.
- SDK logs — disable verbose logging from boto3, requests, or your web framework. Enable only in development.
- Database query logs — these get massive. Log slow queries only (a 500ms threshold works well).
- Sensitive data — audit for PII, API keys, tokens. Log the outcome, not the raw credentials.
This fix isn’t a quick toggle. It’s a pull request and a small refactor. But it’s where most teams find an extra $500–$2,000/month in long-term savings — compounding every month after.
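Two of the audit items translate into very little code. Here is a small standard-library sketch: one helper raises the log level on chatty SDK loggers, the other logs only the head of a payload. The logger names listed are the usual ones for boto3, botocore, and urllib3 (which requests sits on). The 200-byte limit mirrors the suggestion above, and the function names are mine.

```python
import logging

# Loggers that tend to flood output when left at DEBUG
NOISY_LOGGERS = ("boto3", "botocore", "urllib3")


def quiet_sdk_loggers(level=logging.WARNING):
    """Raise the level on SDK loggers so only warnings and errors get through."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(level)


def summarize_body(body: bytes, limit: int = 200) -> str:
    """Return at most `limit` bytes of a payload, plus its total size."""
    head = body[:limit].decode("utf-8", errors="replace")
    if len(body) > limit:
        return f"{head}... ({len(body)} bytes total)"
    return head


quiet_sdk_loggers()
log = logging.getLogger("app")
# Log a short summary and a request ID instead of the full response body
log.info("request_id=%s status=%s body=%s", "req-123", 404,
         summarize_body(b"<html>" + b"x" * 50_000))
```

Call `quiet_sdk_loggers()` once at startup, after your normal logging setup, and gate any DEBUG override behind a development-only flag.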
Set the Billing Alarm Before You Forget
You’ve just cut your CloudWatch bill by 60 percent. Good. Now make sure it doesn’t creep back up as engineers add new log sources and skip retention settings on the way out the door.
Create a CloudWatch billing alarm:
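- First, enable billing alerts under Billing preferences and switch the console to US East (N. Virginia), where billing metrics live
- CloudWatch console > Alarms > Create alarm
- Select metric: Billing > By Service > EstimatedCharges for AmazonCloudWatch
- Statistic: Maximum
- Period: 6 hours
- Threshold: 120 percent of your current monthly spend
- Add notification: SNS topic (email yourself)
- Create
The 120 percent threshold catches drift without paging you for normal variation. EstimatedCharges is a month-to-date running total, so a $1,000/month bill triggers the alarm as soon as that total passes $1,200, which leaves time to investigate before the billing period closes.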
Setup time: five minutes. Cost of skipping it: $2,000 in surprise charges when a new VPC Flow Log or a chatty application gets deployed without anyone noticing.
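The alarm can also live in code, so it survives account rebuilds. Here is a boto3-style sketch using `put_metric_alarm` against the `AWS/Billing` namespace. It assumes billing alerts are already enabled, that the client is created in us-east-1, and that "AmazonCloudWatch" is the ServiceName dimension value in your account. The alarm name and SNS topic ARN are placeholders.

```python
def create_cloudwatch_spend_alarm(cw_client, current_monthly_spend,
                                  sns_topic_arn, headroom=1.2):
    """Alarm when month-to-date CloudWatch charges exceed spend * headroom."""
    threshold = round(current_monthly_spend * headroom, 2)
    cw_client.put_metric_alarm(
        AlarmName="cloudwatch-spend-drift",  # placeholder name
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[
            {"Name": "ServiceName", "Value": "AmazonCloudWatch"},  # assumed value
            {"Name": "Currency", "Value": "USD"},
        ],
        Statistic="Maximum",
        Period=21600,  # 6 hours, in seconds
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],
    )
    return threshold


# Usage (requires boto3 and credentials; the ARN is a placeholder):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   create_cloudwatch_spend_alarm(cw, 1000, "arn:aws:sns:us-east-1:123456789012:billing-alerts")
```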
These six fixes — Infrequent Access, retention policies, S3 redirection, cardinality control, query tuning, and application-level logging pruning — get most teams to a 60 percent reduction. The first three are plug-and-play, no code required. The last three take slightly more effort but compound month over month.
Start with Cost Explorer. Find the biggest cost driver. Apply the matching fix. Repeat. Within a week, your CloudWatch bill will look like something a reasonable person approved.