Uptime Nines

deanwebb · September 16, 2018, 11:13:26 AM

http://www.451alliance.com/Reports/UptimeInstitutestudyshowsoutagesarecommonandcostly.aspx

I encourage everyone to read this article and discuss here.

The parts about failure to upgrade legacy hardware and software really resonated with me. "It works, so why upgrade?" is essentially like saying, "It still chops, why sharpen the axe?" When we think we keep those four or five nines by keeping things in production and not taking them out for an upgrade, we fool ourselves.

The number about outages due to power issues made me pause - until I remembered all those data centers in cheap locations... guess why those locations are cheap? Hmmm? Could it be because those places have infamously bad power grids, among other issues?

Failures at colocation and other third-party providers? That's what happens when you put your trust in a third party. They're a third party, not one *directly* impacted by the outage. They make promises to do the right thing and then disappear to where nobody actually keeps tabs on what they're doing.

And then the finger points at the network... OK, lads, that's on *us*. But we have to ask, how many of those outages themselves are traceable back to third-party providers, ISPs that don't do things right, or that gear that didn't get upgraded or replaced?

Maybe what we need to do is predict the costs of outages and see how they stack up against the cost of doing a proper upgrade. Maybe we should be happy with two nines, which looks to me to be a consistently sustainable reality, rather than strive for 3-5 nines and wind up with some harsh long-term costs for those short-run benefits.
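For a sense of scale, here is a minimal sketch of what each count of nines allows in downtime per year. It assumes a 365-day year and treats "N nines" as an unavailability of 10^-N; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for N nines (2 -> 99%, 3 -> 99.9%, ...)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: {minutes:9.2f} min/year ({minutes / 60:6.2f} hours)")
```

Two nines leaves roughly 87.6 hours a year, which is room for real maintenance windows; five nines leaves a little over five minutes.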

If we send a report up the management chain that says, "Hey, here's how many millions we stand to lose if we fail to spend thousands and take a few weekends to do some needed upgrade and maintenance work," maybe then we'd get the budget approval and SLA relaxation to get that stuff done.
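A rough sketch of the arithmetic behind such a report might look like the following. Every figure is a made-up placeholder, not data from the article, and the model (outages per year times hours per outage times cost per hour) is a deliberately crude assumption.

```python
def expected_outage_cost(outages_per_year: float,
                         hours_per_outage: float,
                         cost_per_hour: float) -> float:
    """Crude expected yearly loss from unplanned outages."""
    return outages_per_year * hours_per_outage * cost_per_hour


# Hypothetical inputs: replace with your own estimates.
cost_per_hour = 250_000        # revenue/productivity lost per outage hour
upgrade_cost = 80_000          # hardware, licenses, weekend labor

loss_without_upgrade = expected_outage_cost(2.0, 4.0, cost_per_hour)
loss_with_upgrade = expected_outage_cost(0.5, 4.0, cost_per_hour)

net_benefit = loss_without_upgrade - loss_with_upgrade - upgrade_cost
print(f"expected loss, no upgrade:   ${loss_without_upgrade:,.0f}")
print(f"expected loss, with upgrade: ${loss_with_upgrade:,.0f}")
print(f"net benefit of upgrading:    ${net_benefit:,.0f}")
```

The point is less the exact numbers than putting the upgrade cost and the expected loss in the same units on the same page.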

Otanx · September 17, 2018, 03:29:34 PM

I think people forget that having X nines uptime is not something that just happens because nothing broke. If you have one router and a switch, and last year nothing broke, you don't have a network that supports five nines. You just got lucky. I see this a lot. Someone says their network supports 5 nines but they don't have a test network, or change control, or monitoring. What is the line from financial statements? "Past performance does not indicate future performance."

There is also the bean counter 5 nines. They look at how often the service goes down, and then how much extra they can charge if they promise 5 nines. Then they subtract how much they will pay out for SLA failures. If that is a positive number, then you are now running a 5 nines service, and are going to get blamed when a failure happens. Even better if they can put in a convoluted method for customers to request SLA credits that takes months, and ends up costing the customer more in manpower than the credits are worth. Not bitter about this at all.

I think a lot of the issues shown by the article fall into one of those two scenarios. These are services that are not engineered for four nines, they just are because of luck. Building for X nines uptime is not easy or cheap. I also think that it isn't worth it 99.999% of the time. Call it the 5 nines of 5 nines. So is 99.99% a myth? I don't think so. There are places that do it, and do it right. However, there are hundreds if not thousands more that claim to do it, but don't.

-Otanx

deanwebb · September 18, 2018, 08:40:42 AM

Hey, the Sun only has 50% uptime... they take it down every night for maintenance!

But, yes, luck... how many of us have seen the switch that has a huge uptime number because everyone's afraid what would happen if they rebooted it?

Otanx · September 18, 2018, 09:50:23 AM

They fail over to the moon before they take the sun down for maintenance. There is still some down time, but my orbital mechanics are weak so I can't give you a number.

As for the afraid-to-reboot-it comment: too many times. I really want to get to the point where I can honestly put in a change ticket to install https://netflix.github.io/chaosmonkey/

-Otanx

deanwebb · September 19, 2018, 07:32:15 AM

I saw that tool and immediately thought, "Say, while (X) is down, we can utilize this outage to upgrade (Y) and (Z)..."

KDog · October 03, 2018, 07:29:59 PM

Quote from: deanwebb on September 19, 2018, 07:32:15 AM
I saw that tool and immediately thought, "Say, while (X) is down, we can utilize this outage to upgrade (Y) and (Z)..."

In a previous role I may or may not have engineered some unscheduled downtime so that I could do exactly that...

Dieselboy · October 04, 2018, 11:15:09 PM

So is it okay to have a less-reliable service if you design mitigation?

I can give examples where the reliability of a single component is not satisfactory, so designers simply add more of them. Examples are on space probes / satellites, aeroplanes etc. Even in networks you add redundant devices to ensure higher uptime.

Just checked Microsoft Azure's uptime percentages and they guarantee 99.9%. So an app developer would either deploy resilient services into Azure or, even better, deploy into both Azure and AWS and leverage both for increased service uptime as a whole. This then somewhat decouples the lower uptime offered by the platform from the service.
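As a back-of-the-envelope sketch of that decoupling: if the service stays up as long as at least one platform is up, and if platform failures are independent (a big assumption, since shared DNS, shared code, or a bad deploy can take out both), the combined availability is one minus the product of the individual unavailabilities. The function name here is made up for illustration.

```python
import math


def combined_availability(availabilities):
    """Availability when the service needs only ONE platform to be up.

    Assumes failures are statistically independent, which is optimistic.
    """
    all_down = math.prod(1.0 - a for a in availabilities)
    return 1.0 - all_down


single = 0.999  # the 99.9% figure quoted above
dual = combined_availability([single, single])
print(f"one platform:  {single:.4%}")
print(f"two platforms: {dual:.4%}")
```

Two independent three-nines platforms come out at about six nines on paper; the failover mechanism itself and any shared dependencies will eat into that in practice.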

Dieselboy · October 05, 2018, 04:07:51 AM

What actually counts as "downtime" or an outage? For example, we had an issue earlier in the year where our ISP was suffering because another one of their customers was under DDoS. We had 50% packet loss for about 9 hours. Is this considered an outage? And does this come under the 99.9xxx SLA?
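To put that incident in perspective, here is a quick sketch of the arithmetic against a 99.9% yearly target. Whether degraded service counts at all depends on the SLA wording; weighting the hours by packet loss is purely an illustrative assumption, not a standard method.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored


def availability_after(downtime_hours: float) -> float:
    """Yearly availability if this were the only downtime all year."""
    return 1.0 - downtime_hours / HOURS_PER_YEAR


budget_hours = HOURS_PER_YEAR * (1 - 0.999)  # yearly budget at 99.9%

incident_hours = 9.0
packet_loss = 0.5

# Interpretation 1: the whole 9 hours counts as an outage.
full = availability_after(incident_hours)
# Interpretation 2 (assumption): hours weighted by the loss rate.
weighted = availability_after(incident_hours * packet_loss)

print(f"99.9% budget:        {budget_hours:.2f} hours/year")
print(f"counted in full:     {full:.4%}")
print(f"weighted by 50% loss: {weighted:.4%}")
```

Counted in full, that one incident alone uses up more than a year's budget at 99.9%; counted at half weight, it uses about half of it.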

deanwebb · October 06, 2018, 03:17:06 PM

Failover, HA, clustering, and load balancing are all used to preserve uptime in case one part fails.

Of course, if an entire *system* of one of those above things fails, then it's all over, sport!

Any time the app server can't be pinged, it's downtime. If port 1521 goes quiet on the Oracle box, it's downtime. If the Internet is slow... well, technically, it's slow time, not downtime. But if it's slow enough for TCP sessions to time out, you got some downtime.
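A minimal sketch of that kind of "is the port answering?" check, using only the Python standard library. The function name and the default timeout are arbitrary choices for illustration; a real monitor would also log timestamps and retry before declaring downtime.

```python
import socket


def port_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time.

    A refused connection, an unreachable host, or a timeout all count
    as "down", which matches the slow-enough-to-time-out case above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes timeouts and connection refused
        return False


# Example (hypothetical host): port_is_up("app01.example.net", 443)
```

Note that this only proves the listener accepts connections; it says nothing about whether the application behind it is actually answering queries.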

In that DDoS case, it's a claim that can be filed with the cyber-insurance guys.