Terraform State File Corruption: How to Recover Fast

Terraform state corruption has gotten complicated with all the half-answers and panic-driven Stack Overflow posts flying around. As someone who managed infrastructure for a mid-sized SaaS company on AWS for three years, I learned everything there is to know about this specific nightmare. Today, I will share it all with you.

Three years ago, a developer on my team ran a terraform apply while I was mid-way through a terraform destroy on a different workspace. Our state locking wasn’t configured — we’d been meaning to set it up “later” since about March. The state file came out partially written. Malformed JSON. Missing resource IDs. Two hundred resources across three regions, and the whole thing was garbage by 4:47 PM on a Friday.

Slack lit up immediately. My hands went cold. You know the feeling.

The recovery took four hours that day. It doesn’t have to take that long for you. Often under an hour, if you know where to look.

What Actually Causes Terraform State to Corrupt

Three buckets. Knowing which one you’re in changes everything about how you recover.

Interrupted applies or destroys. Ctrl+C mid-operation. A CI/CD pipeline timing out at the 10-minute mark. A network drop right when Terraform is writing state back to S3. The file gets partially written — valid JSON up to line 847, then nothing. Terraform can’t read it next run.

Concurrent operations without state locking. Two people or two CI jobs hitting the same state simultaneously. Local state doesn’t prevent this. Neither does a remote backend without locking configured; for the S3 backend, that means a DynamoDB lock table. Both processes write. Last write wins. What you’re left with is basically a Frankenstein file — pieces of both operations, usually broken beyond a simple fix.

Manual edits gone wrong — and I’m including the classic “wrong terminal tab” scenario here. State files are JSON. Human-readable JSON. The temptation to just quickly fix that one resource ID is enormous. The consequences show up on the very next terraform plan. Don’t make my mistake.

Check If You Have a Backup Before Doing Anything Else

Probably should have opened with this section, honestly.

If you’re using S3 as your remote backend and bucket versioning is enabled, you already have a backup — likely several. This is the fastest path out of the hole.

Pull up the AWS CLI and list recent versions of your state file:

aws s3api list-object-versions \
  --bucket your-terraform-state-bucket \
  --prefix prod/terraform.tfstate \
  --query 'Versions[0:10]'

That gives you the ten most recent versions. The LastModified timestamp is your guide — find the version from before the corruption happened. Grab the VersionId and restore it:

aws s3api get-object \
  --bucket your-terraform-state-bucket \
  --key prod/terraform.tfstate \
  --version-id VERSIONID123 \
  terraform.tfstate.recovered

You’ve got a candidate file. Before pushing anything back to S3, validate it:

jq . terraform.tfstate.recovered > /dev/null && echo "Valid JSON"

Passes? Push it back to the backend:

terraform state push terraform.tfstate.recovered

If the remote copy’s serial number is higher than the recovered file’s, state push will refuse. The -force flag overrides that check; only reach for it once you’re sure this is the version you want.

Run terraform plan. No changes showing up means you’re recovered. Changes do show up? Those are resources created after your backup timestamp — your call whether to apply them based on what you know about recent deployments.

For local state, Terraform writes a terraform.tfstate.backup file automatically after every successful apply. Check if it exists and eyeball the timestamp:

ls -lah terraform.tfstate.backup

If it predates the corruption, restore it:

cp terraform.tfstate.backup terraform.tfstate

Then validate with terraform plan. No S3 versioning, no local backup, remote backend without rollback support? You’re in manual territory. That’s next.

How to Manually Fix a Corrupted State File

Open the raw JSON. It sounds dangerous. It’s slightly dangerous. But you need to actually see what broke:

jq . terraform.tfstate | head -50
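One wrinkle: jq needs a local file, and with a remote backend your working directory may not have one. terraform state pull fetches the backend’s copy; if the file is mangled enough that Terraform’s own parser chokes on it, fall back to fetching the raw object directly (the bucket and key here match the earlier examples, and state-inspect.json is just a scratch name of my choosing):

```shell
# Fetch the remote state for local inspection.
terraform state pull > state-inspect.json

# Fallback if `terraform state pull` itself errors out on the malformed file:
# copy the raw object from S3, bypassing Terraform's parser entirely.
aws s3 cp s3://your-terraform-state-bucket/prod/terraform.tfstate state-inspect.json
```

Either way, you end up with a local file you can feed to jq.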

Corrupted state looks like incomplete objects, truncated strings, or malformed JSON that jq flat-out refuses to parse. A healthy resource block looks something like this:

{
  "type": "aws_instance",
  "name": "web_server",
  "instances": [
    {
      "schema_version": 1,
      "attributes": {
        "id": "i-0123456789abcdef0",
        "ami": "ami-12345678",
        "instance_type": "t3.micro"
      }
    }
  ]
}

A corrupted entry might have the id field sliced mid-string. The attributes object might be empty. The whole resource block might just end abruptly at the curly brace that never got written.
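When the damage is bad enough that Terraform’s own commands error out, you can still enumerate what the file records straight from the JSON. A minimal sketch, assuming state format version 4 (Terraform 0.12 and later), which keeps resources in a top-level resources array; the demo runs against an inline sample so you can see the query shape, but point jq at your real file:

```shell
# Build a tiny sample state file to demonstrate the query shape.
cat > /tmp/sample.tfstate <<'EOF'
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web_server", "instances": []},
    {"mode": "managed", "type": "aws_s3_bucket", "name": "assets", "instances": []}
  ]
}
EOF

# List every resource address the file still records.
# Prints: aws_instance.web_server and aws_s3_bucket.assets
jq -r '.resources[] | "\(.type).\(.name)"' /tmp/sample.tfstate
```

Comparing that list against what you expect to exist tells you which entries were lost in the truncation.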

Before deleting anything, list what Terraform currently thinks is in state:

terraform state list

Every resource shows up here. Spot the corrupted one. Inspect it directly:

terraform state show aws_instance.web_server

Error on that command? Confirmed corrupt. Remove just that resource:

terraform state rm aws_instance.web_server

That’s it for that resource — Terraform forgets it entirely. Nothing in AWS actually changes. The instance keeps running. Terraform just stops knowing about it.

Here’s where people get tripped up, though. Run terraform plan after your state rm. If the plan shows “Create” for a resource that already exists in AWS, Terraform is about to make a duplicate. That’s fixable but annoying — and completely avoidable. Import the existing resource back into state instead of letting Terraform try to recreate it.

Rebuilding State When There Is No Backup

This is the long road. Brace yourself.

terraform import is what you need. It reads an existing resource directly from AWS and adds it back into your state file. Say you have an EC2 instance — ID i-0123456789abcdef0 — running fine in AWS but invisible to Terraform. Import it:

terraform import aws_instance.web_server i-0123456789abcdef0

Terraform queries AWS, pulls the full resource configuration, and writes it to state. Now it knows about the instance again. One catch: the matching resource block has to exist in your .tf files before you run the import; terraform import populates state, it doesn’t generate configuration.

Ten to twenty resources? Tedious but doable. Sit down with a spreadsheet, a terminal, and about an hour. A hundred resources across five regions? That’s a weekend. Possibly a long one.
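If you do go the manual route, the per-resource commands batch easily. A minimal sketch, assuming you’ve dumped your spreadsheet into a two-column text file; the resources.txt name and format are my own convention, not anything Terraform expects:

```shell
# resources.txt: one "address id" pair per line, for example:
#   aws_instance.web_server i-0123456789abcdef0
#   aws_s3_bucket.assets my-assets-bucket
while read -r address id; do
  [ -z "$address" ] && continue   # skip blank lines
  terraform import "$address" "$id" \
    || echo "FAILED: $address" >> import-failures.log
done < resources.txt
```

Failures land in import-failures.log, so one bad ID doesn’t silently derail the whole run.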

The community tool terraformer was built for exactly this scenario. It reads your actual AWS account and generates both Terraform code and a populated state file for bulk resources. Not official HashiCorp software — but it’s actively maintained and genuinely useful when you’re staring down dozens of orphaned resources. Install via Homebrew or grab the binary directly, then run something like:

terraformer import aws --resources=ec2,rds,s3 --regions=us-east-1

It generates a full directory of .tf files and a matching state file covering everything in your account. You’ll need to strip out resources you don’t want Terraform managing, but even with cleanup it beats manual import at scale.

How to Prevent This From Happening Again

Three things. Do all three. Seriously.

Enable S3 versioning on your state bucket. One CLI command. Costs almost nothing at typical state file sizes. Takes thirty seconds to run. If versioning had been on during my Friday incident, I would have been home by 5:15 PM instead of 9:00 PM:

aws s3api put-bucket-versioning \
  --bucket your-terraform-state-bucket \
  --versioning-configuration Status=Enabled

Enable state locking with DynamoDB. Create a table called terraform-locks — primary key LockID, string type. Then wire it into your backend config in backend.tf:

terraform {
  backend "s3" {
    bucket         = "your-terraform-state-bucket"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
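For reference, the lock table itself is a one-command create. On-demand billing here is my choice; provisioned capacity works just as well at this scale:

```shell
# One string attribute, LockID, as the hash key, which is exactly
# what the S3 backend's DynamoDB locking expects.
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```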

That’s the concurrent-writes problem solved. A second terraform apply waits until the first one finishes. No more Frankenstein state files from simultaneous operations.

Gate your applies through CI. GitHub Actions, GitLab CI, Jenkins — whatever you’re already running. Configure it so only one terraform apply runs at a time per environment. Most platforms support this natively through concurrency settings. I’m a GitHub Actions person myself, and enforcing the serialization there has held up where manual discipline never does under deadline pressure.

That’s what makes proper state management quietly satisfying for us infrastructure folks: it’s boring, unglamorous configuration that saves you from the worst Friday afternoons of your career. Set it up today. Not later.

Jason Michael

