When Your CI/CD Pipeline Fails on Deploy Stage — Start Here
CI/CD troubleshooting has turned into an absolute tangle with all the abstraction layers flying around. I’ve spent enough Friday nights staring at failed deploy logs to know exactly how the next thirty minutes of your evening will go. A pipeline failing on the deploy stage hits different than a build failure. The code compiles. Tests pass. Artifacts exist. Then your orchestrator tries to spin up the new version and everything goes sideways.
The frustrating part? In my experience, the vast majority of deploy-stage failures trace back to a handful of specific causes. I’m going to show you how to identify and fix each one in minutes, not hours.
How to Read the Deploy Stage Error — Fast
Before we diagnose anything, grab the right log output. Your pipeline dashboard shows red. That’s not a diagnosis — that’s a symptom.
Open your platform’s deploy logs directly. Not the summary. The raw output. Most platforms bury this behind a “View full logs” link that nobody clicks until they’re already forty minutes deep.
Run this immediately:
kubectl logs -n your-namespace deployment/your-app --tail=100 --timestamps=true
Or for ECS:
aws logs tail /ecs/your-task --follow
Look for the actual exit code. Most CI/CD tools report failure but hide the reason three screens down. Exit codes 1, 127, and 137 mean totally different things: code 1 is a generic failure (a command in the script returned an error, often a permission problem), code 137 means your container was killed with SIGKILL, usually by the out-of-memory killer or a forced stop after a timeout, and code 127 means the deploy script literally cannot find a command it needs.
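If you would rather not memorize these, a small lookup helper can sit in your pipeline scripts. This is a minimal sketch, not part of any CI tool: the function name explain_exit_code is made up for illustration, and the mapping relies only on the usual shell convention that codes above 128 mean "killed by signal (code minus 128)".

```shell
# Hypothetical helper: translate a process exit code into a first guess.
# Follows the common shell convention; individual apps may define their own codes.
explain_exit_code() {
  case "$1" in
    0)   echo "success" ;;
    1)   echo "general error; read the application output" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check PATH and the base image" ;;
    137) echo "killed by SIGKILL (128+9); often the OOM killer" ;;
    139) echo "segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      # Anything else above 128 is most likely a signal number offset by 128.
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application-specific error"
      fi
      ;;
  esac
}
```

Call it with whatever code your platform reports, for example explain_exit_code 137, and treat the answer as a starting point rather than a verdict.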
Scan for these patterns in the output: “permission denied,” “401 Unauthorized,” “ImagePullBackOff,” “Timeout,” “Health check failed.” One of those keywords will jump out. If nothing jumps out, scroll back further. The actual error often appears before the summary line, not after it.
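If you have saved the raw log to a file, the scan can be a single grep. A rough sketch, assuming the log is already on disk; scan_deploy_log is a hypothetical name, and the pattern list is just the keywords above, so extend it for your own platform.

```shell
# Hypothetical helper: print every line (with its line number) that matches
# one of the common deploy-failure keywords, ignoring case.
# Returns non-zero when nothing matches, which is your cue to scroll back further.
scan_deploy_log() {
  grep -nEi 'permission denied|401 unauthorized|imagepullbackoff|timeout|health check failed' "$1"
}
```

Run it as scan_deploy_log deploy.log. The line numbers tell you where to start reading the surrounding context.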
Most people panic and restart the pipeline. I used to do the same. Read the logs first.
Permissions and Secrets Are Wrong or Missing
This one kills more deploys than any other root cause. Environment variables don’t exist. IAM roles lack permissions. API tokens expired three weeks ago and nobody noticed.
Search your deploy logs for these strings: “permission denied,” “403 Forbidden,” “401 Unauthorized,” “secret not found,” “role does not have permission.”
To verify secrets are actually injected at runtime, add a debug step to your pipeline before the deploy command:
printenv | grep -E 'API_TOKEN|DATABASE_URL|AWS_' | sed 's/=.*/=***MASKED***/g'
This prints all matching variables with values masked. If a critical variable doesn’t appear, it wasn’t injected. Check your secrets manager — AWS Secrets Manager, HashiCorp Vault, GitHub Secrets, whatever you’re running.
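You can also turn that check into a hard gate that fails the job before the deploy command runs. A minimal sketch: check_required_vars is a hypothetical helper, and the variable names you pass in are whatever your own deploy needs. It never prints values, only the names that are missing or empty.

```shell
# Hypothetical helper: fail if any named environment variable is unset or empty.
# Uses printenv, so it only sees exported variables, which is what the
# deploy command itself will see.
check_required_vars() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "MISSING: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In a pipeline step this would look like check_required_vars API_TOKEN DATABASE_URL || exit 1, placed right before the deploy command.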
If the variable exists but the deploy still fails with “401 Unauthorized,” the secret itself is invalid. Maybe your token expired. Or it’s for the wrong environment. I once spent two hours debugging a deploy permission failure only to discover the staging credentials were pointing at production, which had stricter IAM rules. The environment variable injection had never been doing what I thought it was doing; pulling the secrets from Vault is what finally worked for me. Don’t make my mistake.
For Kubernetes deployments, verify the service account has the right role binding:
kubectl get rolebinding -n your-namespace -o wide
For ECS or Lambda, check the execution role directly in the AWS console. Expand the inline policy. Does it actually include the actions your deploy step needs? The fix varies by platform, but the diagnostic is universal — echo the variable, confirm it exists, then run a small test command with just that credential. Don’t deploy yet. Verify the credential works in isolation first.
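For the Kubernetes case, one way to test a credential in isolation is kubectl auth can-i, which asks the API server whether a given identity may perform a verb. The sketch below is an assumption-heavy illustration: can_deploy is a made-up name, the verbs get, patch, and update are my guess at what a typical deployment update needs, and the --as impersonation flag only works if your own user is allowed to impersonate service accounts.

```shell
# Hypothetical helper: check whether a service account can update deployments.
# $1 = namespace, $2 = service account name (both placeholders).
# The verb list is an assumption; adjust it to what your deploy step really does.
can_deploy() {
  ns="$1"
  sa="$2"
  for verb in get patch update; do
    if ! kubectl auth can-i "$verb" deployments \
        --as="system:serviceaccount:${ns}:${sa}" -n "$ns" >/dev/null 2>&1; then
      echo "missing: $verb deployments"
      return 1
    fi
  done
  echo "ok"
}
```

Something like can_deploy your-namespace your-deployer prints the first missing permission, which tells you exactly which rule to add to the role.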
Artifact or Image Version Mismatch
But what is an artifact mismatch, exactly? In essence, it’s when your build stage produces something your deploy stage can’t find, or worse, finds the wrong version of.
This happens most often with container images. The build pushes an image tagged with a git commit hash. The deploy tries to pull latest. Maybe the registries don’t match. Maybe the digest behind the tag changed between stages. That’s what makes this failure mode so frustrating for engineers who already checked everything once.
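Check what actually got pushed to your registry. docker inspect only reads images that exist locally, so pull the tag first:
docker pull your-registry.azurecr.io/your-image:latest
docker inspect --format='{{index .RepoDigests 0}}' your-registry.azurecr.io/your-image:latest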
Compare that SHA to what your deploy is trying to pull. They should match. If they don’t, your build and deploy are talking to different registries or different image names entirely.
In your deploy manifests — Kubernetes YAML, ECS task definition, whatever you’re using — pin the image reference to a digest instead of a tag:
image: your-registry.azurecr.io/your-image@sha256:abc123def456...
Tags are mutable. A tag can point to different digests at different times. Digests are immutable. Force your build stage to write the digest into the deploy manifest, then deploy that exact artifact. No ambiguity.
If you’re using GitHub Actions or GitLab CI, do this in the build job:
docker push your-image:$COMMIT_SHA
DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' "your-image:$COMMIT_SHA" | cut -d@ -f2)
sed -i "s|IMAGE_DIGEST|$DIGEST|g" deploy.yaml
Then the deploy stage uses that pinned digest, so it can only pull the exact image the build produced.
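It is worth guarding against a manifest that slipped through with a tag, or with the placeholder never substituted. Here is a minimal sketch of such a guard; assert_images_pinned is a hypothetical name, and the check assumes image references sit on their own image: lines, as they do in a typical Kubernetes manifest.

```shell
# Hypothetical guard: fail if any "image:" line in a manifest is not pinned
# to a full sha256 digest (64 hex characters).
# $1 = path to the manifest file.
assert_images_pinned() {
  # Collect image lines, then drop the ones that are digest-pinned.
  # "|| true" keeps the pipeline from tripping "set -e" when nothing is left.
  unpinned=$(grep -nE '^[[:space:]]*(-[[:space:]]*)?image:' "$1" \
    | grep -vE '@sha256:[a-f0-9]{64}' || true)
  if [ -n "$unpinned" ]; then
    echo "Unpinned image references in $1:" >&2
    echo "$unpinned" >&2
    return 1
  fi
}
```

Running assert_images_pinned deploy.yaml || exit 1 as the first line of the deploy job turns a silent tag drift into a loud, early failure.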
Health Check Timeout Killing the Deploy
The container starts. The orchestrator runs a health check. The check fails. The container gets killed. The deploy rolls back. You see “pod failed health checks” in the logs and your stomach drops.
This usually means the app takes longer to start than the health check timeout allows. A Node.js app with a slow database migration. A Java service with JVM startup overhead — sometimes 30 to 45 seconds on its own. A Python app loading a 2GB model into memory. These things take time.
Check your orchestrator config. In Kubernetes, look at the deployment YAML:
kubectl get deployment your-app -o yaml | grep -A 10 livenessProbe
You’re looking for initialDelaySeconds. That’s how long the orchestrator waits before running the first health check. If your app takes 45 seconds to start but initialDelaySeconds is set to 10, the health check will fail every single time. Guaranteed.
Increase it:
kubectl patch deployment your-app -p '{"spec":{"template":{"spec":{"containers":[{"name":"your-app","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
For ECS, the equivalent is healthCheck.startPeriod in the task definition. The log signature that confirms this is the culprit: you’ll see the container starting, then after a few seconds, “health check failed” or “Liveness probe failed,” followed shortly by termination. In Kubernetes those probe messages show up in the pod events (kubectl describe pod), not in the application logs. That sequence is the tell.
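Back on the Kubernetes side, you can estimate how long a container has before the liveness probe gets it restarted instead of guessing at a new initialDelaySeconds. The arithmetic below is a rough sketch: the function names are made up, and it ignores timeoutSeconds and scheduling jitter, so treat the result as an approximation.

```shell
# Rough time (seconds) before the kubelet restarts a container whose
# liveness probe never passes: the first probe runs after initialDelaySeconds,
# then one probe every periodSeconds until failureThreshold failures in a row.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
liveness_budget() {
  initial_delay="$1"
  period="$2"
  threshold="$3"
  echo $(( initial_delay + (threshold - 1) * period ))
}

# Succeeds only if the measured startup time fits inside that budget.
# $1 = observed startup seconds, $2..$4 = same probe settings as above
startup_fits() {
  startup_seconds="$1"
  [ "$startup_seconds" -le "$(liveness_budget "$2" "$3" "$4")" ]
}
```

With the Kubernetes defaults of periodSeconds 10 and failureThreshold 3, liveness_budget 10 10 3 prints 30, which is why a 45-second startup never survives an initialDelaySeconds of 10. Raising the delay to 60 gives a budget of 80 seconds.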
Rollback Loop and How to Break It
Worst-case scenario: the deploy fails, so your orchestrator automatically rolls back to the previous version. But the rollback also fails. Now you’re stuck in a loop — the system keeps trying both deploy and rollback, neither succeeding, your Slack notifications going absolutely haywire.
Frustrated by watching the same error repeat every ninety seconds, most engineers keep staring at the dashboard hoping something changes. It won’t.
First, pause the pipeline. Most CI/CD tools have a cancel button. Use it. Then manually force your cluster back to a known-good state. In Kubernetes:
kubectl rollout undo deployment/your-app -n your-namespace
kubectl rollout history deployment/your-app -n your-namespace
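An undo that is never verified is how rollback loops start, so it helps to wait for the rollback to settle before touching the pipeline again. A minimal sketch, assuming kubectl is already pointed at the right cluster; rollback_and_verify is a hypothetical wrapper and the 120-second timeout is an arbitrary choice.

```shell
# Hypothetical wrapper: roll back one revision, then block until the
# deployment reports healthy or the timeout expires.
# $1 = deployment name, $2 = namespace
rollback_and_verify() {
  deploy="$1"
  ns="$2"
  kubectl rollout undo "deployment/${deploy}" -n "$ns" || return 1
  if kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=120s; then
    echo "rollback healthy"
  else
    echo "rollback did not become healthy; inspect pods before retrying" >&2
    return 1
  fi
}
```

If rollback_and_verify your-app your-namespace returns non-zero, the previous revision is broken too, and re-triggering the pipeline will only feed the loop.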
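For ECS, point the service back at the last task definition revision that worked. On its own, --force-new-deployment just redeploys whatever revision the service already references, so pass the revision explicitly (LAST_GOOD_REVISION is a placeholder for your revision number):
aws ecs update-service --cluster your-cluster --service your-service --task-definition your-task:LAST_GOOD_REVISION --force-new-deployment --region us-east-1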
Once the cluster is stable on the last working version, run through this checklist before re-triggering anything:
- Verify all secrets and environment variables are correct — at least if you want to avoid looping right back into this mess
- Confirm the artifact digest matches between build and deploy stages
- Check health check timeouts in your orchestrator config
- Review the actual error message in the deploy logs one more time
- Make a small test deploy to staging first if at all possible
Deploy failures sting. They always do. But they’re almost always fixable once you know where to look, and the causes above cover the vast majority of what you’ll actually encounter. Start with logs, work through each one systematically, and you’ll be back to green before your coffee gets cold.