Kubernetes OOMKilled Error — What It Means and How to Fix It

What OOMKilled Actually Means

Kubernetes debugging has gotten complicated, with cryptic status codes flying around. Your pod keeps restarting, you check the logs, and there it is: OOMKilled. Exit code 137. It wasn’t Kubernetes that pulled the trigger — the Linux kernel itself terminated your container. The OOM (Out Of Memory) killer made a hard decision under pressure, and your process lost.

When a container blows past its memory limit, the kernel doesn’t send a warning email. It kills the process instantly. Run kubectl describe pod and you’ll find it sitting right there: LastState.Reason: OOMKilled, exit code 137. If the pod restarts automatically — and it usually does — you get a CrashLoopBackOff pattern. That’s the tell. Something is eating memory faster than your limits allow.

How to Confirm OOMKilled Is the Real Problem

Don’t assume. I learned this the hard way — spent two hours chasing phantom memory leaks before realizing I was looking at the wrong pod entirely. Confirmation takes 90 seconds with kubectl.

Run this first:

kubectl describe pod <pod-name> -n <namespace>

Look for the Last State section. That’s where the kernel’s verdict shows up:

Last State:
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Thu, 01 Jan 2025 14:32:15 +0000
  Finished:     Thu, 01 Jan 2025 14:32:45 +0000

Exit code 137 confirms it. The math is 128 + 9 (SIGKILL). You’re in the right place.
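That arithmetic generalizes: any exit code above 128 means the process was killed by signal (code − 128). A quick local sketch of the decoding, runnable anywhere with bash:

```shell
#!/usr/bin/env bash
# Decode a container exit code into the signal that killed the process.
# Exit codes above 128 mean "terminated by signal (code - 128)".
exit_code=137
signal=$((exit_code - 128))   # 137 - 128 = 9
echo "signal number: $signal"
kill -l "$signal"             # translates 9 to its name: KILL
```

The same trick decodes other crashes: 139 is 128 + 11 (SIGSEGV, a segfault), 143 is 128 + 15 (SIGTERM, a polite shutdown).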

Now pull recent events:

kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

You should see something like:

pod-name  OOMKilled  Pod  Memory limit exceeded

Pattern match against your terminal. That exact structure means OOMKilled is your real problem. If something else shows up instead — ImagePullBackOff, a segfault-driven CrashLoopBackOff, permission denied errors — you’re debugging the wrong thing entirely. Stop. Start over.

Find the Root Cause Before You Change Anything

Probably should have opened with this section, honestly. Most fixes floating around online are just “increase your memory limit” with zero actual diagnosis attached. That works maybe 40% of the time. The other 60%, you’re just masking whatever is actually broken.

Three scenarios cause OOMKilled. You need to figure out which one you’re dealing with before touching anything.

Scenario 1 — Memory Limit Is Too Low for the Workload

Your application legitimately needs more memory than you gave it. Simplest case. Check what the pod is actually consuming:

kubectl top pod <pod-name> -n <namespace>

Real usage in megabytes, right there. If your limit is 256Mi and you’re seeing consistent usage above 200Mi — with periodic spikes nudging 240Mi — you’ve hit the ceiling.
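To put a number on “hit the ceiling”, a small sketch using the example figures above (240Mi peak against a 256Mi limit — illustrative values, not anything kubectl reports directly):

```shell
#!/usr/bin/env bash
# Rough headroom check: how close is peak usage to the memory limit?
# Values are the illustrative ones from the text above.
peak_mi=240
limit_mi=256
pct=$((100 * peak_mi / limit_mi))
echo "peak is ${pct}% of the limit"   # prints: peak is 93% of the limit
[ "$pct" -ge 90 ] && echo "dangerously close to the ceiling"
```

Anything consistently above ~90% of the limit means one bad request or GC pause away from a kill.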

Pull the pod spec YAML and look at this section:

resources:
  limits:
    memory: "256Mi"
  requests:
    memory: "128Mi"

The limit is the hard cap. The request is what Kubernetes reserves during scheduling. Actual usage climbs above the limit and the kernel kills it. No negotiation, no grace period. Done.

Scenario 2 — Memory Leak in the Application

Memory usage grows over time instead of holding steady. You redeploy the pod, it looks fine for 20 minutes, then starts climbing again. That’s a leak — not a limit problem.

Watch the trend over several minutes:

watch -n 5 'kubectl top pod <pod-name> -n <namespace>'

Memory creeping from 50Mi to 150Mi to 250Mi with no traffic change? You have a leak. Raising the limit here just gives it more runway before dying. The pod will still die — just later, and with a bigger mess.
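A crude way to separate a leak from normal fluctuation: check whether every sample is strictly higher than the last. This self-contained sketch uses the sample values above; in practice you’d feed it numbers collected from kubectl top over time:

```shell
#!/usr/bin/env bash
# Leak heuristic: is each memory sample strictly higher than the previous one?
# Sample values (Mi) are the illustrative ones from the text.
samples="50 150 250"
echo "$samples" | tr ' ' '\n' | awk '
  NR > 1 && $1 <= prev { steady = 1 }   # any dip or plateau => not monotonic
  { prev = $1 }
  END { print steady ? "usage fluctuates" : "usage only climbs - possible leak" }
'
```

A healthy app plateaus or sawtooths as garbage collection runs; a monotonic climb under constant traffic is the leak signature.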

The real fix is profiling the application. Heap dumps, flamegraphs, language-specific memory profilers — pprof for Go, JProfiler for Java, Valgrind for C. Reproduce the scenario. Find what’s holding memory. Fix it. Then set reasonable limits based on what the fixed app actually needs.

Scenario 3 — Traffic Spike or Noisy Neighbor

Memory is normally fine, then suddenly isn’t. One large request, a batch job running in the same namespace, an unexpected import — temporary pressure, not a sustained leak.

Check resource usage across the namespace:

kubectl top pods -n <namespace>

Is your pod spiking while everything else sits flat? Dig into application logs for what triggered it. Large file load? Bulk request? Caching gone wrong? A static memory limit won’t save you when traffic patterns are unpredictable — at least not without some horizontal scaling strategy backing it up.

How to Fix OOMKilled Based on Root Cause

If It’s Wrong Limits — Adjust the Pod Spec

Once you know actual peak usage, set the limit 20–30% above it. If kubectl top shows consistent 220Mi usage with occasional 240Mi spikes, 300Mi is a reasonable limit. Set the request closer to normal usage — say, 200Mi. Kubernetes gets accurate scheduling data without you massively overprovisioning the node.
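The 20–30% headroom rule as arithmetic, using the example numbers from above (not measured values):

```shell
#!/usr/bin/env bash
# Size the limit ~25% above observed peak; set the request at typical usage.
peak_mi=240      # highest spike seen in kubectl top (example value)
typical_mi=200   # steady-state usage (example value)
limit_mi=$((peak_mi * 125 / 100))   # 25% headroom over peak -> 300
echo "requests.memory: ${typical_mi}Mi"
echo "limits.memory:   ${limit_mi}Mi"
```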

Update your pod YAML:

containers:
- name: my-app
  image: my-app:v1.2.3
  resources:
    requests:
      memory: "200Mi"
    limits:
      memory: "300Mi"

Deploy it. Watch it. If the pod stabilizes without restarting, that’s your fix.

If It’s a Memory Leak — Profile and Fix the Code

Increasing limits here is tape on a leaking pipe. Temporarily useful — the pipe still leaks.

Raise the limit just enough to keep the pod alive long enough to gather diagnostics. For Java, add these JVM flags to enable heap dumps on OOM:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof
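One way to pass those flags in a pod spec without rebuilding the image is the JAVA_TOOL_OPTIONS environment variable, which the JVM picks up automatically. A sketch — the container and volume names are illustrative, and the emptyDir keeps the dump around across container restarts within the pod so you can retrieve it with kubectl cp:

```yaml
containers:
- name: my-app            # illustrative name
  image: my-app:v1.2.3
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"
  volumeMounts:
  - name: dumps           # emptyDir survives container restarts in the pod
    mountPath: /tmp
volumes:
- name: dumps
  emptyDir: {}
```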

For Go, pprof handles memory profiles. For Python, memory_profiler gets the job done. Run your workload, reproduce the leak, analyze the output, find what’s holding memory. Fix it. Then lower the limit back to something realistic.

Don’t make this mistake — I’ve watched teams run 8Gi limits on apps that genuinely need 500Mi. That’s expensive, and every one of those cases turned out to be an unresolved leak someone bumped the limit on and forgot about.

If It’s Traffic Spikes — Use Horizontal Pod Autoscaling

One pod can’t absorb every traffic pattern you throw at it. Instead of making that single pod enormous, run multiple smaller ones and scale based on actual load.

Create an HPA resource:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

This scales your deployment between 2 and 10 replicas, targeting 80% average memory utilization. Traffic spikes, new pods spin up. Traffic drops, extras terminate. You’re paying for capacity when you actually need it — not reserving a monster pod 24/7 just in case.
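The scaling decision itself follows the documented HPA formula: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A sketch with illustrative numbers:

```shell
#!/usr/bin/env bash
# HPA scaling formula: desired = ceil(current * currentUtil / targetUtil)
current_replicas=2
current_util=120   # average memory utilization in percent (example value)
target_util=80     # matches the averageUtilization in the HPA spec above
# Integer ceiling division: (a + b - 1) / b
desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
echo "desired replicas: $desired"   # ceil(2 * 120 / 80) = 3
```

So a pod group running 50% over its memory target gets one more replica, spreading the load before any single pod hits its limit.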

How to Stop OOMKilled From Happening Again

Set Resource Requests and Limits From Day One

Don’t deploy a pod without them — at least if you want Kubernetes to schedule intelligently. Without requests, the scheduler is guessing. Without limits, containers will cheerfully consume every megabyte available on the node. Start conservative: 256Mi request, 512Mi limit covers most workloads reasonably well. Then adjust based on kubectl top data after a few hours of real traffic.

Use LimitRange at the Namespace Level

This stops developers from deploying pods with missing or wildly inflated limits. A LimitRange enforces sane defaults across an entire namespace:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - max:
      memory: "2Gi"
    min:
      memory: "64Mi"
    default:
      memory: "512Mi"
    defaultRequest:
      memory: "256Mi"
    type: Container

Every container in that namespace gets a 256Mi request and 512Mi limit by default unless someone explicitly overrides it. The “forgot to set limits” mistake stops cold. I forget this on new namespaces constantly — a LimitRange works for me where manual spec reviews never quite do.

Alert on Memory Usage Before It Kills Pods

Set monitoring on container memory usage and alert at 75% of the limit. Not 100% — by then the kernel is already mid-kill. Prometheus or whatever monitoring tool you’re running can track this with a simple expression; working set is the better numerator, since it excludes reclaimable page cache and is closer to what the kernel actually counts against the limit:

container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.75
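Wired into Prometheus, an alert along those lines looks roughly like this. The group and alert names are illustrative; the extra clause skips containers with no limit set, where the denominator would be zero, and working set is used as the numerator because it is closer to what the kernel counts against the limit:

```yaml
groups:
- name: memory-alerts                  # illustrative rule group name
  rules:
  - alert: ContainerNearMemoryLimit
    expr: |
      container_memory_working_set_bytes
        / container_spec_memory_limit_bytes > 0.75
      and container_spec_memory_limit_bytes > 0
    for: 5m                            # sustained, not a momentary blip
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} is above 75% of its memory limit"
```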

Alert on it. Investigation starts before the outage, not after the 2am page. That’s the whole point.

OOMKilled errors need a diagnosis before they need a fix. Confirm the symptom with kubectl, measure actual usage, figure out whether you’re dealing with wrong limits, a leak, or traffic pressure — then address what’s actually broken. Quick limit bumps hide real problems. Do the work up front and OOMKilled becomes rare instead of something you’re debugging every other week.

Jason Michael
