When AWS has an outage, people notice. The tech community explodes: every company tweets out its downtime status while swarms of developers and ops folks scramble to get their sites back up. Increasingly, though, even non-technical folks notice and get frustrated. During the great Christmas Eve outage of 2012, which took down consumer behemoths such as Netflix, social media was awash in frustrated users; even the relatively minor outage this past Friday caused enough waves for my completely non-technical wife to notice that some of her favorite websites were down.
While AWS is an incredible service, it has its issues from time to time, so it’s best to be prepared. Over the past year, the company I work for, Mindflash, has made significant strides toward preparing for these sorts of events, and during the Friday the 13th outage that work paid off.
On Friday, Amazon had a partial outage in one of its availability zones (read: one of its data centers) in Virginia. The machines there could connect to each other just fine, but the outside world was unable to reach them.
At Mindflash we run all of our services in Virginia (except for much of our S3 content – perhaps the subject of another blog post), and we actually had a substantial number of servers in the availability zone that had the problem, yet we got away relatively unscathed. How? A year’s worth of work to run redundantly across multiple availability zones, mixed with some good old-fashioned luck.
To give you a bit of background, Mindflash runs five different types of servers:
- Public web servers – explain our product and pricing, and let you sign up, log in, etc.
- Product web servers – serve what you see once you log in to the product.
- API servers – handle requests from the product to fetch and store data.
- Conversion servers – take our trainers’ content and convert it into something better suited for consumption on the web and mobile devices.
- Background services – handle scheduled or long-running tasks (aside from conversion) that don’t need to hold up a response to the client.
With a couple of caveats, we run at least two of each of these server types. They’re either load balanced on ELB, so traffic is split across machines of the same type, or set up on a queuing system, so that if a machine goes down it simply stops picking up jobs. We do this both to maintain great performance under heavy traffic and to protect the site’s uptime.
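To make the queue side concrete, here’s a minimal sketch (not our actual code) of that worker-loop pattern. The names `receive_job` and `process_job` are hypothetical stand-ins for whatever queue client you use (SQS, RabbitMQ, etc.); the point is that a worker that can’t reach the queue naturally stops taking work, while healthy workers elsewhere keep draining it.

```python
import time

def run_worker(receive_job, process_job, poll_interval=1.0, max_failures=3):
    """Pull jobs until the queue is empty or unreachable.

    If the queue can't be reached (e.g. during a network partition),
    the worker backs off and eventually stops picking up jobs --
    workers in other availability zones keep processing.
    """
    failures = 0
    processed = 0
    while failures < max_failures:
        try:
            job = receive_job()   # e.g. an SQS receive-message call
            failures = 0          # reset the failure counter on success
        except ConnectionError:
            failures += 1         # queue unreachable; back off and retry
            time.sleep(poll_interval)
            continue
        if job is None:           # queue is empty; nothing to do
            break
        process_job(job)
        processed += 1
    return processed
```

Because the worker never advertises itself anywhere, there’s nothing to deregister when it dies; the failure mode is simply “stops consuming,” which is exactly what you want.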
When we set up the machines for each of these server types, we spread them across availability zones. That way, if machines go offline in one zone, we still have at least one working machine of each type in another. Servers that run off the queue stop taking jobs when they go offline, because they can no longer reach the queue. For those being load balanced, Amazon’s load balancing service, ELB, detects when a server is offline and automatically stops routing traffic to it. Since only one availability zone went out on Friday, in theory we should have been 100% good to go.
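For reference, configuring that kind of health check on a Classic ELB looks something like this with today’s AWS CLI (the load balancer name and health-check path here are placeholders, and in 2013 you’d have used the older ELB API tools, but the idea is the same):

```shell
# ELB marks an instance unhealthy after 2 failed probes and healthy
# again after 2 successful ones. Crucially, these probes originate
# from inside AWS's network, so they can miss a failure that only
# affects traffic coming from the outside world.
aws elb configure-health-check \
    --load-balancer-name my-product-lb \
    --health-check Target=HTTP:80/health,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2
```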
Well, we weren’t quite. Remember that the outage only affected traffic from outside Amazon’s network. The load balancers therefore didn’t think the servers in that availability zone were offline and continued to serve traffic to them. This made for a shaky user experience: if you were lucky, you got one of the servers that was still responding; if you weren’t, you saw errors or the page simply wouldn’t load. Once we realized this, the fix was easy – we manually removed the machines in the bad availability zone from all of our load balancers.
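With the modern AWS CLI, that manual removal is a one-liner per load balancer (instance IDs and the load balancer name below are placeholders, not our real ones):

```shell
# Pull the instances in the bad availability zone out of the rotation.
aws elb deregister-instances-from-load-balancer \
    --load-balancer-name my-product-lb \
    --instances i-0123abcd i-4567ef01

# Alternatively, drop the whole zone from the load balancer at once:
aws elb disable-availability-zones-for-load-balancer \
    --load-balancer-name my-product-lb \
    --availability-zones us-east-1a
```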
There was a huge positive to this, though. We’re not 100% there with our redundancy strategy yet, so there are still a few server types for which we don’t run two or more machines. Even though some of those single servers were in the availability zone with the outage, none of them is hit directly by machines outside Amazon’s network. Everything continued to work, since the connections between Amazon’s own servers were still intact and working just fine. We definitely caught a break here.
What we’ve learned from this is that our strategy of spreading redundancy across multiple availability zones paid off wherever we’ve implemented it. In the areas where we haven’t, though, we’re still at risk, and during this outage we simply got lucky. There’s always more work to do.
In general, Mindflash has been very lucky with AWS outages. Even when Netflix was down last Christmas Eve, we were still up. While luck played a part this time too, much of our success was the result of the hard work we’ve put in to prepare for these events.
This is our story. I would love to read any comments on your strategies for surviving these outages and how they’ve worked out for you in the past.