Disaster recovery in AWS, GCP and Azure - Thoughts on Capacity Planning and Risks
One of the most popular cloud disaster recovery models in the industry today is the “pilot light” model, where critical applications and data are already in place so they can be quickly brought up if needed. A simple question one must ask before adopting this model: has any thought been given to whether the AWS/GCP/Azure APIs will actually work during a regional outage, and whether the requisite capacity will be available in the alternate region?
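As a minimal sketch of the first half of that question, you can probe the standby region periodically: check that the EC2 control plane answers and that the instance types your pilot light depends on are at least offered there. The region name and the `REQUIRED_TYPES` set below are illustrative, and note this only verifies that the API responds and the types exist, not that capacity will be there when you fail over.

```python
import boto3

# Illustrative: the instance types your pilot light actually needs.
REQUIRED_TYPES = {"m5.large", "i3.2xlarge"}

def standby_region_ready(region="us-east-2"):
    """Probe the standby region's EC2 API and confirm our types are offered."""
    ec2 = boto3.client("ec2", region_name=region)
    offered = {
        o["InstanceType"]
        for page in ec2.get_paginator("describe_instance_type_offerings")
                       .paginate(LocationType="region")
        for o in page["InstanceTypeOfferings"]
    }
    return REQUIRED_TYPES <= offered
```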
Having experienced times when I could not get all the instances I needed in a specific region for several hours, during normal (non-peak, non-seasonal) periods with no outages in other regions, I would expect that if there’s a major regional outage, instance availability in other regions on the same continent will get pretty scarce pretty quickly. Companies with unlimited infra budgets would certainly not want to adopt this model and would rather opt for reserved or spot instances in another (standby) region; spot capacity there would be the first thing to evaporate whenever any of AWS’s regions goes down.
Now, Spot instances are essentially made up of “unused capacity”: if there’s a sudden surge in requests for on-demand instances, AWS will kill the Spot instances and hand that capacity to the people asking for on-demand.
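AWS does give Spot instances a two-minute interruption notice through the instance metadata service, which a workload can poll so it drains cleanly before the capacity is reclaimed. The metadata paths and the roughly five-second polling interval below are from AWS documentation; the `drain()` handler is a hypothetical stand-in for your own shutdown logic.

```python
import time
import urllib.request
import urllib.error

IMDS = "http://169.254.169.254/latest"

def imds_token():
    """Fetch an IMDSv2 session token (required to read instance metadata)."""
    req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"})
    return urllib.request.urlopen(req).read().decode()

def interruption_pending(token):
    """200 on spot/instance-action means a notice was issued; 404 means not yet."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise

def drain():
    """Hypothetical: checkpoint work and deregister from load balancers."""
    pass

token = imds_token()
while not interruption_pending(token):
    time.sleep(5)  # AWS suggests polling roughly every 5 seconds
drain()
```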
People who are asking “what if our primary region is down for an extended period?” for real may actually need to spend the cash for reserved capacity in the standby region, so the cost/benefit analysis may have to include a substantial ongoing spend in the standby region. Here’s one of the blog posts about the change in the Spot pricing algorithm: https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/.
I know when we had to scale up for peak times, there would literally be no instances of some types available, on any market. I’m sure if us-east-1 fell off the face of the earth the other regions would start to feel a pinch pretty quick.
This can be mitigated somewhat if you’re using a container platform where you can use many different instance types (fewer people are using the c5.8xlarge compared to, like, the m5.large). I suspect that if you’re actually worried about a whole region disappearing, there are some larger considerations to think about. Amazon can’t just absorb that kind of hit to their capacity without us noticing.
Another option is autoscaling groups with mixed instance types. Some of the more expensive instance types would not be as popular as the less expensive ones. A cost/benefit tradeoff to consider is whether you’d rather pay extra for overprovisioned instances or have your service be down. There are a bunch of companies that run their container platforms on 6-7 different instance types to try to keep from running into an issue where one would run out, and it would still happen occasionally. Some of them ended up in situations where they needed i3.2xlarge instances in us-west-2: 180 or so at a time, 60 in each AZ. Every so often it would take some hours to get all of them, and they ended up making API calls (polling) continuously to detect instance availability.
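On AWS, this maps to an Auto Scaling group with a mixed instances policy. Here is a rough sketch, assuming a launch template already exists; the group name, template name, subnets, and the particular list of instance types are all hypothetical, and the 6-7 overrides mirror the anecdote above.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-2")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="dr-standby-asg",          # hypothetical name
    MinSize=3,
    MaxSize=180,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",  # one subnet per AZ
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "dr-standby-template",  # hypothetical
                "Version": "$Latest",
            },
            # Several interchangeable types: if one pool runs dry,
            # the group can still fill capacity from the others.
            "Overrides": [
                {"InstanceType": t}
                for t in ["m5.large", "m5a.large", "m4.large", "c5.xlarge",
                          "c5a.xlarge", "r5.large", "i3.2xlarge"]
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 3,                  # always-on floor
            "OnDemandPercentageAboveBaseCapacity": 50,  # rest split with Spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```

For the polling part of the anecdote, one approach is a loop that attempts launches and backs off whenever the API returns an `InsufficientInstanceCapacity` error, which is effectively what continuously detecting availability comes down to.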
Now, you might consider talking to the sales folks at the cloud providers to guarantee resources or reserved capacity in alternate regions, but most of them would not guarantee anything unless you’re spending >$100K/mo on their cloud. Having said that, nothing I said should be taken to suggest that reserved capacity isn’t possible. There just isn’t a way to pay, say, 5% of your us-west-2 bill and be guaranteed that you could replace your whole usw2 capacity in us-east-2 without having to fight everybody else who is also trying to fail over. If you want to reserve the capacity, you pay reserved instance pricing.
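The mechanism AWS does offer for this is On-Demand Capacity Reservations: you hold capacity in a specific AZ and you pay for it whether or not instances are running in it, which is exactly the “substantial spend in the standby region all the time” mentioned earlier. A minimal sketch, with the values (region, AZ, the 60-per-AZ count echoing the anecdote above) purely illustrative:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # standby region

# Reserve one AZ's slice of the failover fleet; repeat per AZ.
ec2.create_capacity_reservation(
    InstanceType="i3.2xlarge",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-2a",
    InstanceCount=60,            # billed whether or not it's in use
    Tenancy="default",
    EndDateType="unlimited",     # hold until explicitly cancelled
)
```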
Regardless, one should consider implementing a minimal failover mode where you would continue to ingest data while your primary region is down and would then process the backlog when it came back up. Nobody would be able to use the UI and your users wouldn’t be getting alerts, but you wouldn’t drop data aside from during that initial failover. This all assumes a regional failure; depending on what your dependencies are, that might mean it’s not so much “load shedding” as “welp, the load isn’t reaching us anymore”.
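A minimal sketch of that ingest-only mode, assuming an SQS queue in the standby region serves as the durable buffer; the queue name and the `process()` callback are hypothetical. The point is that writes survive the outage even while the UI and alerting do not.

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-2")  # standby region
queue_url = sqs.get_queue_url(QueueName="ingest-backlog")["QueueUrl"]

def ingest(event_body: str):
    """Failover ingest path: just buffer the event while the primary is down."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=event_body)

def drain_backlog(process):
    """Run once the primary region recovers: replay the buffered events."""
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            process(m["Body"])  # hypothetical downstream processing
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=m["ReceiptHandle"])
```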
Shubham Bhaskar Sharma
Time travelling through entropy