Is Software Resiliency the New Data Center Redundancy?
Digital infrastructure leaders used to live and die by availability. Uptime was everything. But as slow becomes the new down, leaders are looking outside the traditional 2N+1 box to deliver.
This is the fifth in a series of five blog posts reflecting the top-of-mind issues discussed during the most recent Infrastructure Masons Advisory Council meeting.
It used to be that redundancy – that is, having a backup system in case the main system goes down – was the way to maintain availability. And availability was everything. Resiliency – the ability of the system to recover from a fault – was less important. As long as the system bounced back before the generators ran out of fuel, you were good.
Not so anymore. Availability still matters, to be sure, but digital infrastructure leaders are thinking outside the traditional 2N+1 redundancy box. Given the unrelenting and exponential growth of data – that is, of demand on the infrastructure – digital infrastructure leaders are looking for more efficient and sustainable ways to deliver availability without building two (or three) of everything.
Resiliency is the new redundancy
Some are finding their answers in software. As one Advisory Council member, an end user, explained, “When we had a data center fault, it directly impacted our business – we couldn’t serve our customers. That was huge and real and revenue impacting. It caused major red flags.”
The solution was not another 2N data center.
“Because of the threat,” the end user continued, “the engineering team built in software resiliency. Then as software resiliency went up, my need for data center resiliency went down.” In response to a question about resiliency versus redundancy, he clarified, “More redundancy in the portfolio, less resiliency in the elements.”
The end user added, “With zones I can fail over between those constantly – and we do, several times a week. That evolution of the software changed my data center requirements.”
“As the software resiliency went up, my need for data center resiliency went down.”
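To make “software resiliency” a bit more concrete: the constant failover between zones that the end user describes is typically implemented in the application or platform layer rather than in the facility. Below is a minimal, hypothetical Python sketch of that pattern, in which a request is simply retried against the next availability zone when one zone is unhealthy or slow. The zone endpoints and API path are invented for illustration; in practice this logic usually lives in load balancers, service meshes, or client libraries rather than in a hand-rolled loop.

```python
import urllib.error
import urllib.request

# Hypothetical zone endpoints -- in a real system these would come from
# service discovery or configuration, not a hard-coded list.
ZONE_ENDPOINTS = [
    "https://zone-a.example.internal",
    "https://zone-b.example.internal",
    "https://zone-c.example.internal",
]


def fetch_with_zone_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each availability zone in turn, failing over on error or timeout."""
    last_error = None
    for endpoint in ZONE_ENDPOINTS:
        try:
            with urllib.request.urlopen(endpoint + path, timeout=timeout) as resp:
                return resp.read()  # first healthy zone wins
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # zone down or too slow: move on to the next one
    raise RuntimeError(f"all zones failed; last error: {last_error}")
```

When every zone can serve every request, losing a zone (or deliberately draining one, as in the weekly failovers described above) becomes routine, which is exactly what lets the underlying facilities carry less redundancy.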
In response to a question about whether buying more systems is cheaper than building a more redundant facility, the end user said, “Yes, if the software is aligned to do it.” In other words, it only works if resiliency is built into the software. Then “the more availability zones I have the less equipment I need.”
The end user articulated the math: “If I have two availability zones it requires a replication factor of 2.4 (100% in each zone, plus 20% overhead each as additional buffer for when things spike during failover). If I go to three availability zones whereby I can actually distribute workloads, the replication factor goes from 2.4 to 1.9. That’s 1.9 times the equipment instead of 2.4. If I go to four availability zones the replication factor goes to 1.8. If I have 20 zones I can have a replication factor of 1.2. Then I need a total of 20% extra equipment (1.2) to run the workload instead of 140% extra equipment (2.4) to run that load.”
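The quote doesn’t spell out the exact sizing assumptions behind those figures, but the trend follows from a simple model: size every zone to absorb its share of a failed zone’s load, plus headroom for the spike during failover. The short Python sketch below uses an assumed 20% buffer; it reproduces the two-zone factor of 2.4 exactly and the same downward curve toward roughly 1.2 at twenty zones, while the quoted intermediate values of 1.9 and 1.8 suggest the end user’s own model applies the overhead a little differently.

```python
def replication_factor(zones: int, buffer: float = 0.2) -> float:
    """Total deployed capacity, as a multiple of the workload, needed so the
    surviving zones can absorb one zone's failure with `buffer` headroom.

    Each zone is sized for 1/(zones - 1) of the workload (its share once one
    zone has failed), inflated by the buffer to cover spikes during failover.
    """
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return zones / (zones - 1) * (1 + buffer)


if __name__ == "__main__":
    for n in (2, 3, 4, 20):
        print(f"{n:2d} zones -> replication factor ~{replication_factor(n):.2f}")
```

However the overhead is counted, the direction is the same: the more zones a workload can be spread across, the closer total deployed capacity gets to the workload itself, and the less spare equipment (and facility redundancy behind it) is required.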
It’s all about the workload
But what level of redundancy and resiliency is appropriate? One end user stressed what felt like a consensus: “It depends on what the workload is.” He shared an example: “Take the case of a very large manufacturing company running its entire manufacturing execution system out of a data center. It’s a very highly transactional database that has to be in sync real time because it’s driving the whole supply chain, all the factory utilization – the whole business.”
“In that case you want to have A and B in full synchronous replication at all times,” the end user explained. “Then maybe the network costs start to trump whether it’s more expensive to build in higher resiliency” – or redundancy. In other words, the question is not whether infrastructure redundancy or software resiliency is more effective at delivering availability, but which approach delivers that availability at a lower cost.
“What level of redundancy and resiliency is appropriate? It depends on the workload.”
“It depends on what the application is.”
One end user said that because resiliency-versus-redundancy decisions are so dependent on the application, “The business will drive us from the cost perspective to more compute-specific architectures.” For example, he said, “with machine learning we can go straight power. We don’t have to worry about persistent data to be restored. And we don’t have to be in major metro areas because latency isn’t an issue.” In contrast, he noted, “These are different requirements than, say, virtual machines for public clouds where we need ultra-resiliency.”
In the past, the end user continued, the economics clearly favored application-agnostic architectures that served the majority of line-of-business needs – because that approach pooled the risk of any single business unit being wildly off its capacity projections. Today, he said, “It may be that cost drivers [of having application-specific architecture] are now substantial enough that business units are willing to take on the risk of getting their predictions wrong [and either overbuilding or not having enough capacity].”
Slow is the new down
One partner predicted that, all things considered, “The more critical the data is – the bigger the impact the data will have on the profitability or survivability of the business – the more it will still tend to be in redundant environments and require some level of structural support. I think it will be that way for a long time.”
“The more critical the data is, the more it will still tend to be in redundant environments and require some level of structural support.”
Even when data isn’t objectively ‘critical,’ if it’s the reason the business exists (think of any of the social media platforms, for example), availability is as important for that business as it is for the manufacturer in the earlier example.
“Look at it from the business’s side,” suggested one partner. “If you know that if you don’t respond to customer requests in a certain number of seconds they’ll go to another app, then you’re going to do everything you can to make sure you can respond within that certain number of seconds.”
“Slow is the new down,” said another partner.
But the way the hyperscale companies are achieving availability, in many cases, is different. They’re doing it through software resiliency rather than infrastructure redundancy.
“Criticality is in the eye of the business. Slow is the new down.”
Bottom line: Availability still matters, to be sure. So does speed. But the answer to the question of how to deliver availability and speed is changing. It’s not always 2N+1.
For more insights into what’s top-of-mind in 2018 for digital infrastructure leaders, check out the previous posts in our series.