A Taxonomy of Software Failures
March 22, 2023
(Somewhat inspired by this classic article about categorizing exceptions, which anyone who writes software should read.)
There are infinitely many specific ways that a production software system can fail, and in my career I've encountered a vast array of them (of course, some percentage of them were my fault in the first place). But I've noticed that production failures tend to fall into a few broad categories, so my hope is that by sorting those categories out, it becomes easier to anticipate and fix issues (and maybe even prevent a few. That would be nice!)
New Defect Released
This is by far the most straightforward failure. Someone wrote some new code, it got deployed, and it turned out to have an immediately obvious bug in it. This sort of issue can be mitigated by a process of thorough code review, QA testing, integration testing, and so on, and it can also often (but not always!) be solved by having a robust and simple process for rolling back any newly released code. However, if, for example, something else in prod already depends on the released code, or if it bundles other features which can't be reverted out, your best fallback is some sort of accelerated process for deploying hotfixes. Note that every obstacle between writing and deploying code (review, permissions gate, deployment window, ...) which helps reduce the frequency of errors making it out to prod also slows the deployment of any fix for the errors that do make it to prod.
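To make this concrete: rolling back should ideally be one boring, well-rehearsed command that anyone on call can run. Here's a minimal Python sketch of that idea, assuming (purely for illustration) a Kubernetes-style deployment managed with the kubectl CLI; the deployment and namespace names are hypothetical placeholders.

```python
# Minimal "roll back first, ask questions later" helper.
# Assumes a Kubernetes deployment managed with kubectl; the deployment
# and namespace names below are hypothetical placeholders.
import subprocess
import sys

DEPLOYMENT = "web-frontend"  # hypothetical deployment name
NAMESPACE = "production"     # hypothetical namespace

def rollback() -> int:
    """Revert the deployment to its previous revision."""
    result = subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{DEPLOYMENT}", "-n", NAMESPACE]
    )
    return result.returncode

def wait_until_rolled_out() -> int:
    """Block until the reverted revision is fully rolled out (or time out)."""
    result = subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{DEPLOYMENT}",
         "-n", NAMESPACE, "--timeout=120s"]
    )
    return result.returncode

if __name__ == "__main__":
    sys.exit(rollback() or wait_until_rolled_out())
```

The point isn't the specific tooling; it's that the escape hatch is scripted, rehearsed, and doesn't require anyone to remember flags at 3am.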
Existing Defect Noticed
When buggy code gets released, it's actually best for someone to notice immediately. If that doesn't happen, you may end up in this situation: a bug has been out in production for quite a while, but it has only just become an emergency. There are many reasons for something like this to happen -- for example, a new customer may start using an existing feature in a way you didn't anticipate, or a new release may depend on the defective code in a high-impact way.
The best way to avoid this sort of thing is to avoid deploying buggy code in the first place, of course. However, since we live in an imperfect world and you will deploy buggy code from time to time, there are a few mitigations available. One is to make finding and fixing existing bugs a priority -- decrease the friction for your users to report them, motivate developers to fix them, and generally invest in code quality. All of that will still fail from time to time, so the last fallback is again to shorten the turnaround time between "defect noticed" and "hotfix deployed".
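On the "decrease the friction" front, one option is to have the software report its own failures instead of waiting for users to file tickets. Here's a rough Python sketch of that idea; the bug-intake endpoint is a hypothetical internal service, and the requests library is assumed to be available.

```python
# Rough sketch of automatic error reporting: unhandled exceptions get posted
# to a (hypothetical) internal bug-intake endpoint instead of waiting for a
# user to notice and file a ticket. Assumes the `requests` library.
import sys
import traceback
import requests

BUG_INTAKE_URL = "https://bugs.internal.example.com/api/reports"  # hypothetical

def report_unhandled(exc_type, exc_value, exc_tb):
    """Send the traceback somewhere a developer will actually see it."""
    payload = {
        "error": repr(exc_value),
        "traceback": "".join(traceback.format_exception(exc_type, exc_value, exc_tb)),
    }
    try:
        requests.post(BUG_INTAKE_URL, json=payload, timeout=5)
    except requests.RequestException:
        pass  # the reporter must never take the process down itself
    sys.__excepthook__(exc_type, exc_value, exc_tb)  # still fail loudly as usual

sys.excepthook = report_unhandled
```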
External Failure
A large-scale software ecosystem collects external dependencies the way the floor underneath your couch collects dust bunnies and long-lost board game pieces. However, unlike the problem under your couch, which can be solved by a vacuum cleaner, you can't just go through and clean out the external dependencies, because they provide solutions for your users. This is unfortunate, because sometimes these dependencies will fail for reasons entirely beyond your control. Sometimes third-party software breaks entirely on its own, and sometimes its maintainers just SILENTLY CHANGE THEIR API FORMAT ONE DAY BECAUSE THEY ARE MONSTERS. The best you can do is proactively identify and catalog all the external dependencies you have, make sure the systems which connect to them are as resilient as possible in case they fail, and (if possible) identify and implement ways to back up or cache external data so that you have temporary fallbacks available when you need them.
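As a sketch of the "temporary fallback" idea: fetch from the third party as usual, but keep the last good response around so that an outage on their end degrades your service gracefully instead of taking it down. The vendor URL and cache location below are hypothetical, and the requests library is assumed.

```python
# Sketch of a cached fallback for an external dependency: if the third-party
# call fails or times out, serve the last value we successfully fetched.
# The vendor URL and cache path are hypothetical; assumes `requests`.
import json
from pathlib import Path
import requests

RATES_URL = "https://api.example-vendor.com/v1/exchange-rates"  # hypothetical vendor
CACHE_PATH = Path("/var/cache/myapp/exchange_rates.json")       # hypothetical cache file

def get_exchange_rates() -> dict:
    try:
        response = requests.get(RATES_URL, timeout=3)
        response.raise_for_status()
        rates = response.json()
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(rates))  # refresh the local fallback
        return rates
    except (requests.RequestException, ValueError):
        # Vendor is down, slow, or returned garbage: fall back to the last
        # known-good copy rather than failing outright.
        if CACHE_PATH.exists():
            return json.loads(CACHE_PATH.read_text())
        raise  # no cache yet, so there's nothing left to fall back to
```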
Load/Infrastructure Failure
Though many would prefer not to think about it, your software ecosystem does ultimately rely on physical machines, existing in the real world, running your software. Sometimes these machines have physical problems. More commonly, especially if you are using the sort of cloud provider that lets you mostly abstract that away, the work you're asking some piece of infrastructure to do exceeds the work it is capable of doing, and you start running into all manner of complex and diabolical failure states. Your server runs out of memory, your database runs out of rows, your monthly subscription runs out of compute, or perhaps you have some kind of exponentially inefficient algorithm buried deep in your codebase that is suddenly running against more data than it can handle in a timely manner.
This sort of thing is why devops people get paid so well. But obviously there are ways to be proactive, such as monitoring the resources you're using and the performance of your infrastructure. Having a well-architected environment in general is a huge help here because it makes it easier for you to fix or replace infrastructure on the fly.
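For what "monitoring the resources you're using" can look like at its most basic, here's a bare-bones check in Python. The thresholds are arbitrary placeholders and the psutil library is assumed; a real setup would feed something like this into whatever alerting you already have.

```python
# Bare-bones resource check -- the kind of thing that should be running
# (and alerting) long before anything catches fire. Thresholds are arbitrary
# placeholders; assumes the `psutil` library.
import psutil

THRESHOLDS = {"cpu": 85.0, "memory": 90.0, "disk": 90.0}  # percent, illustrative

def check_resources() -> list[str]:
    usage = {
        "cpu": psutil.cpu_percent(interval=1),
        "memory": psutil.virtual_memory().percent,
        "disk": psutil.disk_usage("/").percent,
    }
    return [
        f"{name} at {value:.0f}% (limit {THRESHOLDS[name]:.0f}%)"
        for name, value in usage.items()
        if value >= THRESHOLDS[name]
    ]

if __name__ == "__main__":
    for warning in check_resources():
        print("WARNING:", warning)  # in real life: page someone, post to chat, etc.
```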
I find it helps to be aware of these different failure cases because they cover just about anything that can go wrong with production software, so when something starts to break, an easy starting point is to try to figure out which of these (it may be more than one!) is happening. This sort of knowledge can also inform your attempts to be proactive about heading off problems before they emerge, or to be reactive and categorize the failures you've been experiencing in the hope of future improvement.