I’ve been thinking recently about availability / reliability calculations. I think people sometimes over-emphasize the impact of hardware failures on availability, and don’t focus enough on the impact that software failures and human error have. Yes, hardware does fail. But so do people. And I think software fails even more often. A reliability estimate of a network that’s based purely on the hardware MTBF isn’t just missing some small details – it’s a lie. The true reliability of a complex network depends on far more than the MTBF of the hardware.
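To put rough numbers on that, here is a minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and treating hardware, software and human error as independent causes in series. Every MTBF and MTTR figure below is an assumption I made up for illustration, not a measurement.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*availabilities):
    """Availability of a service that needs every listed cause to be 'up'."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def downtime_minutes_per_year(a):
    """Expected downtime per year implied by availability a."""
    return (1.0 - a) * HOURS_PER_YEAR * 60


# Illustrative assumptions only:
hardware = availability(mtbf_hours=200_000, mttr_hours=4)  # vendor-style MTBF
software = availability(mtbf_hours=8_760, mttr_hours=1)    # about one bug-induced outage a year
human = availability(mtbf_hours=4_380, mttr_hours=2)       # about two bad changes a year

hardware_only = hardware
combined = series(hardware, software, human)

print(f"hardware only: {downtime_minutes_per_year(hardware_only):.1f} min/year")
print(f"combined:      {downtime_minutes_per_year(combined):.1f} min/year")
```

With these made-up inputs the hardware-only estimate is off by more than an order of magnitude. The exact ratio depends entirely on the assumed numbers, but the direction won't change: the hardware term is rarely the one that dominates.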
Even when hardware does fail, in my experience it usually doesn’t just power down and disappear from the network. Maybe a port flaps, or an existing hardware issue is only exposed when the software is upgraded to a certain version. Or maybe a fiber pair isn’t cut by construction, but is crimped or damaged enough to cause a high error rate.
I get a little nervous with high availability architectures that depend on tightly coupled devices that share a single control plane, like switch stacking or Cisco’s VSS. I would much prefer something like vPC, where the two devices have independent control planes. Software issues in networking devices can cause lots of “fun” issues – like preventing a switch from processing BPDUs, or from responding to UDLD. The single control plane is a single point of failure*.
I like having separate redundant devices that are highly independent. In the long run, I think this kind of architecture will be more resilient and have fewer ways in which it can fail catastrophically. It also makes upgrades easier and less complicated to perform.
Human error is a little more complicated to protect against. Ultimately, if someone with access screws up badly enough, they can take down your network. Non-technical solutions like requiring peer reviews, or testing every change in a lab environment (to the extent that you can), will help. Sometimes changes still won’t go as planned. I think some basic design precautions can help mitigate the damage for highly critical applications. If the application is load balanced, put some of the servers on different VLANs or subnets. This prevents a single ACL or PBR misconfiguration on one of the VLANs from cutting off all of your servers. It could also help mitigate the impact of a spanning-tree “event”. I think this is nearly as important as making sure a server / application is connected to multiple switches. You should try to eliminate single points of failure in the network from a logical perspective too, not just from a physical perspective.
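One way to audit for logical single points of failure is to list what each server in a load-balanced pool depends on, and flag anything that every member has in common. This is only a sketch: the inventory format, the attribute names and the `shared_dependencies` helper are all hypothetical, and a real inventory would come from whatever source of truth you already have.

```python
def shared_dependencies(servers, attributes):
    """Return the attributes whose value is identical across every server.

    Each such attribute is a candidate single point of failure: one bad
    change to it (an ACL on that VLAN, a fault on that switch) hits the
    whole pool at once.
    """
    shared = {}
    for attr in attributes:
        values = {server[attr] for server in servers}
        if len(values) == 1:
            shared[attr] = values.pop()
    return shared


# Hypothetical pool: physically redundant, but logically all on one VLAN.
pool = [
    {"name": "web1", "switch": "sw-a", "vlan": 110},
    {"name": "web2", "switch": "sw-b", "vlan": 110},
    {"name": "web3", "switch": "sw-a", "vlan": 110},
]

print(shared_dependencies(pool, ["switch", "vlan"]))
```

Here the pool spans two switches, so the physical side looks fine, but the check still reports the VLAN as shared by every server.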
An even better idea would be to put the application in two separate data centers, using DNS load balancing (like F5’s GTM or Cisco’s GSS) to direct users to the right instance. There will be some complexity on the back end to replicate the data, and you’ll have to ensure that your monitoring system will truly detect any failure that would prevent users from getting to the application.
At a high level, you can use redundancy to increase availability, but the redundant pieces should be highly independent and loosely coupled. This is one of many reasons why stretching subnets / VLANs across data centers is such a bad idea. You are taking two relatively independent environments (the separate data centers), and making them less independent and more tightly coupled. This makes it less likely that one data center will survive an incident in the other. It is a crutch for people who are not designing their applications properly with resiliency in mind.
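The cost of coupling can be sketched with a simplified common-cause model. Assume each of two redundant units is unavailable some fraction of the time, and that a share of that unavailability comes from causes that hit both units at once (a shared control plane, a stretched VLAN). The function and the numbers below are illustrative assumptions, not measurements, and the formula is a rough approximation rather than a rigorous reliability model.

```python
def pair_unavailability(unit_unavailability, shared_fraction):
    """Approximate unavailability of a redundant pair.

    shared_fraction is the share of each unit's downtime caused by
    something that takes out both units together. The rest is treated
    as independent, so both units must fail separately to cause an outage.
    """
    independent = (1.0 - shared_fraction) * unit_unavailability
    shared = shared_fraction * unit_unavailability
    return independent ** 2 + shared


unit = 0.001  # assumed: each unit is down 0.1% of the time

loosely_coupled = pair_unavailability(unit, shared_fraction=0.0)
tightly_coupled = pair_unavailability(unit, shared_fraction=0.10)

print(f"fully independent pair: {loosely_coupled:.2e}")
print(f"10% shared-fate pair:   {tightly_coupled:.2e}")
```

Under these assumptions, once even a small share of failures is common to both sides, that shared term dominates and most of the benefit of the second unit is gone.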
* This also applies outside of networking. If you’re using cloud service providers, you should be using multiple ones. Your cloud provider’s control plane can be a single point of failure, as the recent Amazon EC2 outage demonstrated.