Take Responsibility for Resiliency // Subnet Spot

A number of applications and web services hosted in the northeast suffered issues and outages during Hurricane Sandy. I found the way some businesses described these outages somewhat frustrating.

I heard people talking about this being an “unforeseeable” event. Really, people? In August 2003, a blackout affected a much larger portion of the northeast US. So I find it pretty hard to believe that it was completely unforeseeable that a large swath of the country could be affected by a single event.

There are a variety of technologies that can be used to distribute an application across multiple data centers. On the front side, DNS-based technologies like F5’s GTM (Global Traffic Manager) can answer DNS queries and point users to different data centers based on availability, the user’s location, etc. Anycast technologies work at a lower level to route traffic that is destined for a particular IP address to the instance closest to the end user (and can withdraw routes for an instance if you’re having a problem in one data center). On the back side, basically all decent databases support some kind of synchronous or asynchronous replication to keep data in sync across multiple instances. (The trade-off between synchronous and asynchronous replication is basically one between performance and the acceptability of some data loss in an outage: synchronous replication waits for the remote copy to confirm each write, while asynchronous replication is faster but can lose the most recent writes if a site fails.) Depending on the design of your application and database, you may even be able to have multiple data centers active simultaneously.
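To make the DNS-based approach concrete, here is a minimal sketch of the kind of decision a global traffic manager makes when it answers a query. It is not F5's API or configuration. The `DataCenter` class, the `pick_answer` function, the site names, the regions, and the IP addresses (taken from the reserved documentation ranges) are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str      # hypothetical site label
    ip: str        # address handed back in the DNS answer
    region: str    # coarse location used for proximity matching
    healthy: bool  # result of the most recent health check


def pick_answer(user_region, data_centers):
    """Return the IP to answer a DNS query with, or None if every site is down."""
    healthy = [dc for dc in data_centers if dc.healthy]
    if not healthy:
        return None
    # Prefer a healthy site in the user's own region.
    for dc in healthy:
        if dc.region == user_region:
            return dc.ip
    # Otherwise fail over to the first healthy site in the list.
    return healthy[0].ip


# Hypothetical deployment: the east-coast site is failing its health checks.
sites = [
    DataCenter("nyc", "192.0.2.10", "us-east", healthy=False),
    DataCenter("chi", "198.51.100.10", "us-central", healthy=True),
    DataCenter("sfo", "203.0.113.10", "us-west", healthy=True),
]

print(pick_answer("us-east", sites))  # 198.51.100.10 (failed over to chi)
print(pick_answer("us-west", sites))  # 203.0.113.10 (local site is healthy)
```

A real deployment also has to think about DNS TTLs, since clients keep using a cached answer until it expires, but the core idea is the same: stop handing out the address of a site that is down.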

My point is: if your application is located in a single data center, you made a choice to do that.

And this is perfectly OK for many applications. In many cases, it may not be worth the cost and complexity of designing, building, and operating a more resilient design.

But in those cases where an outage in a single region takes your site down, you decided that it wasn’t worth the cost of making your application resilient across locations. Either:

You made the correct decision, and based on the cost of resiliency versus the money lost from the downtime, you’re OK with this outage. Presumably you have set your users’ expectations appropriately. If you fall into this category but your users have different expectations, then you need to work on communicating better with your users to bridge this gap.
Or, you underestimated the cost of downtime (or underestimated the risk of data center outages), and you realize you should have made the application resilient in the first place.

In either of these cases, this outage is a result of your decisions. Not an “act of God” or an “unforeseeable event”. You cannot have a highly available, resilient application when it is hosted in a single data center.

When building an application, you determine how resilient it is. It is absolutely possible to make services resilient across data centers and across regions. It’s just a matter of looking at the costs versus the benefits to the business. So, take responsibility for the resiliency of your applications and services.
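For what it's worth, the cost-versus-benefit comparison can be as simple as a back-of-the-envelope calculation. The sketch below is only that: the function names are mine, and every number in it (outage frequency, outage length, revenue lost per hour, cost of a second site) is a hypothetical placeholder that you would replace with your own estimates.

```python
def expected_annual_outage_cost(years_between_outages, hours_per_outage, loss_per_hour):
    """Average yearly loss from regional outages, spread over the years between them."""
    return (hours_per_outage * loss_per_hour) / years_between_outages


def resiliency_pays_off(annual_resiliency_cost, years_between_outages,
                        hours_per_outage, loss_per_hour):
    """True if the expected yearly outage loss exceeds the yearly cost of resiliency."""
    expected_loss = expected_annual_outage_cost(
        years_between_outages, hours_per_outage, loss_per_hour
    )
    return expected_loss > annual_resiliency_cost


# Hypothetical estimates: one regional outage every 5 years, lasting 48 hours,
# costing 10,000 per hour in lost business.
print(expected_annual_outage_cost(5, 48, 10_000))   # 96000.0
print(resiliency_pays_off(250_000, 5, 48, 10_000))  # False: a second site costs more than the outages
print(resiliency_pays_off(50_000, 5, 48, 10_000))   # True: a second site is the cheaper option
```

This ignores harder-to-measure costs like lost customer trust, but either way you end up with a decision you made on purpose.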