Check the Release Status Before Anything Else
Helm chart debugging has gotten complicated with all the conflicting advice flying around. Everyone tells you to immediately dive into pod logs. That instinct is backwards — at least if you want to actually solve the problem fast.
Run this first:
helm status <release-name>
One command. It tells you what Helm actually thinks is happening with your release. The output lands in one of a few states: DEPLOYED (you’re probably fine), FAILED (something broke), PENDING_UPGRADE (a previous operation never finished), or SUPERSEDED (this revision has since been replaced by a newer one). Each state means something different. Each requires a different fix.
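That status-to-action mapping can be sketched as a tiny shell helper. Everything here is illustrative: the function name and messages are invented, and the status strings assume Helm 3's lowercase output from `helm status <release> -o json`.

```shell
#!/bin/sh
# Illustrative helper: map a Helm 3 status string (as reported by
# `helm status <release> -o json | jq -r .info.status`) to the next move.
next_action() {
  case "$1" in
    deployed)   echo "healthy: look at Kubernetes, not Helm" ;;
    failed)     echo "inspect: helm history, then rollback" ;;
    pending-install|pending-upgrade|pending-rollback)
                echo "blocked: clear the pending state first" ;;
    superseded) echo "stale revision: check helm history" ;;
    *)          echo "unknown status: $1" ;;
  esac
}
```

Usage would look like `next_action "$(helm status myapp -o json | jq -r .info.status)"`, assuming `jq` is installed and a release named myapp.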
I learned this the hard way — specifically, a Tuesday afternoon three years ago when I burned forty minutes crawling through pod events for a release stuck in PENDING_UPGRADE from a failed deployment six hours earlier. Helm was blocked. Completely blocked. Nothing new could run until I cleared that state. Forty minutes gone before I even checked the thing that mattered.
Always check Helm first, then check Kubernetes. Not the other way around.
PENDING_UPGRADE is your bottleneck. No new Helm operations will execute until you clear it. Two options: rollback to the last clean revision, or delete the Helm secret directly.
Pull up the full history first:
helm history <release-name>
Every attempt, timestamped, with status. Revision 5 succeeded, revision 6 failed — roll back to 5. Simple.
helm rollback <release-name> 5
If you’re impatient — and honestly, who isn’t when production is on fire — you can delete the Helm secret holding the bad state directly:
kubectl delete secret sh.helm.release.v1.<release-name>.v<revision> -n <namespace>
Aggressive. Use it only when you know what you’re touching. But it works.
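If you do go that route, it helps to construct the secret name programmatically instead of typing it out under pressure. A minimal sketch (the function name is made up; the naming scheme is Helm 3's release-secret convention):

```shell
#!/bin/sh
# Illustrative: build the exact name of the Helm 3 release secret for a
# given release and revision, so the delete hits the right object.
helm_secret_name() {
  echo "sh.helm.release.v1.$1.v$2"
}
```

Then something like `kubectl delete secret "$(helm_secret_name myapp 6)" -n <namespace>` removes exactly the revision you mean and nothing else.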
Render the Manifests to Catch Template Errors
Probably should have opened with this section, honestly. It kills half of all chart failures in about thirty seconds flat.
Before anything reaches your cluster, render what Helm is actually going to deploy:
helm template <release-name> ./chart-directory -f values.yaml
Raw Kubernetes manifests. No API server contact. You can pipe the output to a file, grep through it, whatever — just look. Typos in image names. Malformed resource names. YAML that’s technically valid but semantically broken. It’s all sitting right there in the output.
For more detail, use the dry-run install with debug:
helm install <release-name> ./chart-directory -f values.yaml --dry-run --debug
The --debug flag walks through the template rendering step by step. If you spot something like image: "{{ .Values.image.repository }}" sitting unrendered in the output, a values file is missing or a --set override never made it in.
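A quick guard for that failure is a grep over the rendered output: if any literal template markers survive, something never rendered. A sketch, assuming you pipe `helm template` output in (the function name is invented):

```shell
#!/bin/sh
# Sketch: scan rendered manifests on stdin for leftover template markers.
# A hit means a value never reached the template engine.
find_unrendered() {
  if grep -n '{{' ; then
    echo "unrendered template markers found above" >&2
    return 1
  fi
}
```

Wire it into a pipeline, e.g. `helm template myapp ./chart-directory -f values.yaml | find_unrendered`, and a nonzero exit can fail a CI step before anything reaches the cluster.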
Real example from maybe 2021: I had a chart expecting replicaCount: 3 as an integer. I passed it as a string in my values override. The template rendered without complaint. The Deployment got created with replicas: "3" — a string — and Kubernetes rejected it at validation with a thoroughly cryptic error message. The helm template output showed the problem immediately. Fifteen seconds instead of thirty minutes. Don’t make my mistake.
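That string-versus-integer trap is also catchable before the cluster ever sees it. A rough sketch that flags quoted replicas values in rendered manifests (crude pattern matching, not real YAML parsing):

```shell
#!/bin/sh
# Sketch: flag integer fields that rendered as strings (the replicas: "3"
# trap). Run against `helm template` output piped to stdin.
check_quoted_ints() {
  if grep -nE '^[[:space:]]*replicas:[[:space:]]*"' ; then
    echo 'replicas rendered as a string -- check the values type' >&2
    return 1
  fi
}
```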
Read the Kubernetes Events and Pod Logs
Manifests look clean. Now find out why the pod isn’t running.
Start here — not with logs:
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
Events are Kubernetes narrating what happened. Timestamped, namespace-scoped, specific. ImagePullBackOff. FailedScheduling. Readiness probe failures. The events tell you exactly what blocked the pod from reaching Ready state before you’ve opened a single log file.
Narrow it down if you know the namespace:
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
Then describe the pod:
kubectl describe pod <pod-name> -n <namespace>
The Conditions section. Look at the Reason field specifically. That’s the gold. Waiting on a readiness probe. Stuck in a CrashLoop. Pending resource allocation. The reason is right there.
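Most of a describe dump is noise, and the lines worth reading can be filtered out mechanically. A sketch (pipe `kubectl describe pod` output in; the function name is invented):

```shell
#!/bin/sh
# Sketch: keep only the state, reason, and readiness lines from a
# `kubectl describe pod` dump so the signal isn't buried in the noise.
pod_reasons() {
  grep -E 'State:|Reason:|Ready:'
}
```

Usage: `kubectl describe pod <pod-name> -n <namespace> | pod_reasons`.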
Three failure modes cover ninety percent of what you’ll actually see:
- ImagePullBackOff — The image doesn’t exist, the tag is wrong, or the node can’t authenticate to pull it. Check your registry, your credentials, and the tag you specified. All three.
- CrashLoopBackOff — Container started, then crashed. Grab logs with kubectl logs <pod-name> -n <namespace> — use --previous if it already crashed before you got there. The application error will tell you what broke: bad config, missing dependency, port conflict.
- Readiness probe failure — Container is running, probe isn’t passing. Your health check endpoint is misconfigured, the app isn’t listening on the expected port, or it’s just slow to start. Check the probe definition in your values and confirm the target endpoint actually exists.
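Those three failure modes lend themselves to a triage table. A hypothetical helper (the mapping and messages are mine, not a kubectl feature):

```shell
#!/bin/sh
# Hypothetical triage helper: map a pod's Reason field to the command
# worth running next. Messages are illustrative suggestions only.
triage() {
  pod="$1"; ns="$2"; reason="$3"
  case "$reason" in
    ImagePullBackOff|ErrImagePull)
      echo "check registry, credentials, tag: kubectl describe pod $pod -n $ns" ;;
    CrashLoopBackOff)
      echo "read the crash: kubectl logs $pod -n $ns --previous" ;;
    Unhealthy)
      echo "check the probe: kubectl get pod $pod -n $ns -o yaml" ;;
    *)
      echo "start broad: kubectl get events -n $ns --sort-by=.lastTimestamp" ;;
  esac
}
```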
Logs are ground truth. But they’re noisy. Don’t skip kubectl describe just to jump straight to logs. Describe tells you the state. Logs tell you why. Different questions, different tools.
Rollback Fast When the Fix Is Not Obvious
Templates render clean, pod events show something you can’t solve in five minutes — rollback. Stop burning time on a live incident. Ship the working version back and debug offline.
helm rollback <release-name> <revision-number>
Helm reverses to that state. Cluster returns to a known good version. You get space to actually think.
Rollback is not failure — it’s the correct move when you’re on the clock. I’ve watched teams spend two hours patching a deployment in-place when rolling back and investigating offline would have taken fifteen minutes. That math never makes sense.
One trap worth knowing: the --wait flag. Deploy with helm install --wait and Helm blocks until pods reach Ready state. Misconfigured readiness probe — wrong endpoint, bad timing — and the deploy times out after the default five minutes and marks the release FAILED. Kill the Helm process mid-wait and the next attempt finds the release stuck in PENDING_UPGRADE.
Don’t use --wait in production unless your probes are verified and stable. Or set a tight timeout — --timeout 2m — and watch events separately. Let the deploy fire and monitor independently.
Three Helm Deployment Mistakes That Cause Most Failures
Values not being passed correctly
You built a values-prod.yaml with all your production overrides. You ran helm install -f values.yaml. Your production values never entered the chart. Helm rendered using defaults — wrong image tag, replica count at 1 instead of 5, ingress hostname pointing nowhere useful.
Fix: name your values files explicitly on every command. One -f flag per file, and Helm merges them in order — left to right, later files win. Always run helm template ... -f values.yaml -f values-prod.yaml first to confirm what actually renders before anything touches the cluster.
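The left-to-right, later-wins rule can be demonstrated without Helm at all. A toy sketch over flat key=value files (real Helm does a deep merge of YAML maps; only the ordering rule is the point here):

```shell
#!/bin/sh
# Toy model of Helm's -f merge order using flat key=value files:
# files are read left to right, and a later file's value overwrites
# an earlier one for the same key.
merge_values() {
  cat "$@" |
    awk -F= '{ v[$1] = $2 } END { for (k in v) print k "=" v[k] }'
}
```

With a base file setting replicaCount=1 and a prod file setting replicaCount=5, passing the prod file last yields 5, which mirrors why `-f values.yaml -f values-prod.yaml` must be written in that order.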
CRDs not installed before the chart
Your chart deploys a Custom Resource. The CRD it depends on isn’t installed yet. Kubernetes has no schema for MyCustomResource, so it rejects the manifest as invalid — and the error message won’t always make that obvious.
Fix: install the CRD chart first. Most projects ship it separately. Alternatively, add a crds subdirectory to your chart and document the install order clearly. Some teams wire this up with Helm hooks at weight 0 to force CRD installation before the main release runs.
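One way to spot the dependency before installing is to list every kind in the rendered output and eyeball anything that isn't a built-in. A sketch (pipe `helm template` output in; the function name is invented):

```shell
#!/bin/sh
# Sketch: list every distinct top-level `kind:` in rendered manifests.
# Anything that isn't a core Kubernetes kind is likely a custom
# resource whose CRD must exist before install.
list_kinds() {
  grep -E '^kind:' | awk '{print $2}' | sort -u
}
```

For example, `helm template myapp ./chart-directory | list_kinds` surfacing a MyCustomResource alongside Deployment and Service is your cue to install the CRD chart first.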
RBAC permissions missing for the service account
Your chart creates a ServiceAccount. The app tries to list pods, read ConfigMaps, something routine. No Role or RoleBinding grants those permissions. The API call fails — quietly, inside your app — and you spend an hour convinced the feature is broken when really it’s a missing RoleBinding with a name that doesn’t quite match.
Fix: check your chart templates for ServiceAccount, Role, and RoleBinding resources. Confirm they exist and the names actually match across all three. Verify the serviceAccountName in the Pod spec lines up with the RoleBinding subject. On restricted clusters, talk to your cluster admin before assuming those rules will be accepted automatically.
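The name-match check can be roughed out with grep over the rendered manifests. This is crude pattern matching rather than YAML parsing, so treat it as a sketch that works for simple single-ServiceAccount charts:

```shell
#!/bin/sh
# Sketch: confirm the serviceAccountName in the pod spec matches the
# RoleBinding subject name in a rendered manifest file. Crude grep,
# not a YAML parser; assumes one ServiceAccount and one RoleBinding.
rbac_match() {
  manifest="$1"
  sa=$(grep 'serviceAccountName:' "$manifest" | awk '{print $2}' | sort -u)
  subject=$(grep -A2 'subjects:' "$manifest" | grep ' name:' | awk '{print $2}' | sort -u)
  [ -n "$sa" ] && [ "$sa" = "$subject" ]
}
```

Run it against a file produced by `helm template`; a nonzero exit is the "name that doesn’t quite match" case, caught in seconds instead of an hour.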
Quick checklist for next time: Helm status first. Render the template second. Describe the pod third. Check events fourth. Rollback if you’re stuck past five minutes. Then dig into values files, CRDs, and RBAC when the immediate fire is out.