
Blameless Post-Mortems

When a production outage happens, the human instinct is to find someone to blame. We want to know who pushed the bad code, who missed the alert, who misconfigured the load balancer. We think that if we find the careless person and correct them, it won't happen again.

Safety researchers call this the Bad Apple theory — the idea that complex systems fail because of a few rotten individuals, and that removing them restores health. It's intuitive but wrong.

In a complex system, human error is never the root cause — it's a symptom. If an engineer was able to crash the site with a single command, the site was already broken. The system allowed the error to happen.

The Blameless Post-Mortem starts from a different assumption: everyone involved was acting rationally, doing what made sense to them at the time with the information they had. That reframe makes "who?" the wrong question entirely. The right questions are structural: How did the deployment tool fail to catch the error? How did the monitoring stay silent for twenty minutes? How do we make it impossible for a future engineer to make this same mistake? Those questions produce changes. Blame produces silence.
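
One way to picture "make it impossible to repeat the mistake" is a guardrail in the tooling itself. The sketch below is illustrative only: the function name check_command, the DESTRUCTIVE_PREFIXES list, and the peer-confirmation rule are all assumptions invented for this example, not a description of any real deployment tool.

```python
from dataclasses import dataclass

# Hypothetical list of command prefixes treated as destructive (an assumption for this sketch).
DESTRUCTIVE_PREFIXES = ("drop ", "delete ", "rm -rf")


@dataclass
class Decision:
    allowed: bool
    reason: str


def check_command(command: str, environment: str, confirmed_by_peer: bool = False) -> Decision:
    """Decide whether a command may run, based on where it runs and who has confirmed it."""
    normalized = command.strip().lower()
    is_destructive = normalized.startswith(DESTRUCTIVE_PREFIXES)

    # Outside production, or for non-destructive commands, the guardrail stays out of the way.
    if environment != "production" or not is_destructive:
        return Decision(True, "no guardrail applies")

    # In production, a destructive command needs a second engineer to confirm it.
    if confirmed_by_peer:
        return Decision(True, "destructive command confirmed by a second engineer")
    return Decision(False, "destructive command in production requires a second engineer's confirmation")


if __name__ == "__main__":
    print(check_command("DROP TABLE users", "production"))
    print(check_command("DROP TABLE users", "production", confirmed_by_peer=True))
```

The point of a check like this is that the fix lives in the system rather than in anyone's memory: the next engineer who types the same command is stopped by the tool, not by having been warned.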

And silence is the problem. Blame and truth are incompatible — people protect themselves when they're afraid of consequences, and accurate information is the only raw material a post-mortem can work with. A culture where engineers are afraid to speak honestly about what happened is a culture that will keep having the same incidents.
