Day 3 - We All Have Downtime
April 8, 2026•645 words
Part of Listed's 100 Day Writing Challenge.
Recently, I fell sick with a mild flu, which reminded me of a scenario at work where my team had a client who requested 100% service availability. Not 99.999%, not 99.999999%, a straight 100%. Upon reading that specification, one of my senior technical leads said in anger, 'Even humans have downtime.'
Downtime
For the non-technical, downtime refers to the time when the service is down. Specifically, when the application, website or service stops working for planned or unplanned reasons.
Basically, if you can't connect to Discord (or any chat application) and they haven't informed you of ongoing maintenance, it's unplanned. If it's due to maintenance that was announced in advance, it's planned.
Availability
In formal documentation, downtime is usually expressed through its opposite, 'availability': the percentage of time, within a given period, that the service must be up and running. This is commonly counted in 9's, from 99% (two 9's) up to 99.999% (five 9's) availability.
Splunk has a blog resource about it if you are curious to read more.
Usually, the availability target is written into the Service Level Agreement (SLA) given to users or clients. Depending on the organization and its users or clients, planned downtime (maintenance) may or may not count against the availability metric.
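To put the 9's into perspective, here is a minimal Python sketch that converts an availability target into a yearly downtime budget. The function name `allowed_downtime_seconds` and the 365-day year are my own assumptions for illustration, not something taken from any particular SLA.

```python
# Assumption: a plain 365-day year, ignoring leap years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget, in seconds, for an availability target.

    Hypothetical helper for illustration only.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Whatever fraction of the period is not "available" is the downtime budget.
    return period_seconds * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, two 9's leaves roughly 87.6 hours of downtime a year, five 9's leaves a little over 5 minutes, and 100% leaves exactly zero.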
100% Availability
"Anything that can go wrong will go wrong." - Murphy's Law
Pretty much, it's impossible to achieve 100% availability. There is always an uncontrolled factor for everything:
- Excel Spreadsheet on your computer? Your computer can blue screen on you.
- That service crew at McDonald's? They might catch a flu wave.
- That self-service checkout at McDonald's? Maybe someone had a bad day and intentionally broke all the machines.
- AWS Data Center? Well, its Iran site recently got hit by an ICBM.
- Your typical software service? Your servers might go down.
Or, I could just curse any scenario with a meteorite strike to prove my point.
Regardless of whatever service, object, thing or human, there is always a 'bruh moment' factor that will make your system go down. The point here isn't to compensate for all the 'bruh moments', but to compensate to a reasonable degree at a reasonable cost.
Availability and Cost
It is feasible to achieve 99.9999999% availability. In fact, you could:
- Buy multiple computers as backups in case your main computer blue screens on you.
- Hire more service crews on standby.
- Make the self-service checkout as durable as a tank.
- Have a missile interceptor at your local data center.
- Just add more servers to your software service.
But at what cost? And is it worth it? Some organizations aim for unrealistically high availability with little to no benefit. Plus, do the users even need that much availability in the first place? It might be better to invest more in incident response planning.
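As a rough sketch of the 'just add more servers' trade-off, the Python below estimates the combined availability of several redundant servers. It leans on a big assumption: servers fail independently and the service stays up as long as at least one server is up. Real outages are often correlated (same data center, same bad deploy), so treat the numbers as optimistic. The function names are hypothetical.

```python
from typing import Optional


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant servers, each `single` available.

    Assumption: failures are independent, and one healthy server is enough.
    """
    if not 0 <= single <= 1:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1 - (1 - single) ** replicas


def replicas_needed(single: float, target: float,
                    max_replicas: int = 1000) -> Optional[int]:
    """Smallest replica count that reaches `target`, or None if unreachable."""
    if target >= 1:
        # 100% is only reachable if a single server never fails. Handled
        # explicitly, because floating point would otherwise round a value
        # like 1 - 1e-18 up to exactly 1.0 and report a false success.
        return 1 if single >= 1 else None
    for n in range(1, max_replicas + 1):
        if combined_availability(single, n) >= target:
            return n
    return None


print(replicas_needed(0.99, 0.99999))  # five 9's from 99% servers
print(replicas_needed(0.99, 1.0))      # a straight 100%
```

With 99% servers, each extra server adds about two more 9's under this model, so the cost keeps growing while the number only creeps closer to 100% without ever reaching it.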
Incident Response
Since you will eventually have some unexpected downtime, why not invest in an incident response plan? This can be:
- Sending your computer to a repair shop.
- Quickly contacting off-duty staff members or emergency temporary contract staff.
- Calling a repair crew to fix the self-service checkout.
- Informing customers of the data center outage, and quickly sending in a repair crew.
- Informing customers of the service outage and restarting the server, with a prayer.
- Even NASA has an incident response team ready to fix bugs in space. The most recent occurrence was fixing a Microsoft Outlook bug in space (lol).
Point is, if you can't afford to prevent it, be ready for it to the best of your ability. Just like how I couldn't fully prevent myself from catching the flu, but at least I could see a doctor and get medication 🥲.