Is your ship on fire?
Yesterday, the world paused as an AWS outage in Virginia brought down hundreds of businesses, websites, and applications. The full root cause is still unknown, but the impact was definitely felt. Many companies could do nothing more than watch as their business ground to a halt, or suffered stutter-step pains as applications faltered. Entire websites failed, elastic assets disappeared from the internet, and even basic provisioning tasks in AWS failed. Though this was not a complete failure of the N. Virginia region, it is worth understanding S3's importance in the AWS ecosystem. Almost every service connects to S3 in some way, and the impact of this type of failure ripples around the world:
- Lambda = Stores your function code in an S3 bucket and pulls it down at execution time (see the sketch after this list)
- Beanstalk = Source code bundles for your auto-scaling groups live in an S3 bucket
- EC2 = EBS-backed Amazon Machine Images are really just EBS snapshots, which are stored in an S3 bucket
- EBS Snapshots/Backups = Yup… those go to S3
- Many more
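To make that dependency concrete, here is a minimal boto3 sketch that surfaces where the bits actually live. The function name and AMI ID are placeholders, not references to anything involved in the outage.

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# A Lambda function's deployment package comes back as a pre-signed
# S3 URL: the code literally lives in an S3 bucket.
fn = lambda_client.get_function(FunctionName="my-function")  # hypothetical name
print(fn["Code"]["Location"])

# An EBS-backed AMI is a collection of EBS snapshots, which are
# themselves persisted to S3 behind the scenes.
image = ec2.describe_images(ImageIds=["ami-12345678"])["Images"][0]  # placeholder AMI
for mapping in image["BlockDeviceMappings"]:
    if "Ebs" in mapping:
        print(mapping["Ebs"]["SnapshotId"])
```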
When speaking with the CIO or CTO of an organization, we generally talk about their automation strategy, and often that includes their consumption of public cloud providers like AWS. I stand firmly by the idea that you can build a fully public-cloud-ready environment. However, I believe you will never get out of the datacenter business. Today, I explained this to a colleague with a simple analogy: a full migration is like selling your house and moving into an apartment. You no longer have to deal with most of the day-to-day maintenance, but when a pipe bursts or the air conditioning goes out, you are left waiting on the maintenance team to resolve the issue. For simple inconveniences, this model works beautifully and frees up your resources. When you have a major outage, you become a very small fish in a very big cloud.
AWS best practices for highly available applications suggest spreading your servers across Availability Zones to decrease your service impact and maximize your uptime. This makes sense when you consider each Availability Zone (AZ) as a potential failure domain. Spreading my database servers across multiple AZs gives me protection within the region against an issue in any one zone. This type of solution also requires an application or middleware layer that can handle a distributed replication model.
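As a rough illustration of that spread, the boto3 sketch below launches one database node per Availability Zone. The subnet IDs, AMI ID, and instance type are hypothetical, and it assumes a VPC with one subnet already created in each zone.

```python
import boto3

# Hypothetical subnet IDs, one per Availability Zone in us-east-1.
SUBNETS_BY_AZ = {
    "us-east-1a": "subnet-aaaa1111",
    "us-east-1b": "subnet-bbbb2222",
    "us-east-1c": "subnet-cccc3333",
}

ec2 = boto3.client("ec2", region_name="us-east-1")

# One database node per AZ, so a single-zone failure takes out at
# most one copy of the data tier.
for az, subnet_id in SUBNETS_BY_AZ.items():
    ec2.run_instances(
        ImageId="ami-12345678",      # placeholder AMI
        InstanceType="m4.large",     # placeholder instance type
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Role", "Value": "db-replica-" + az}],
        }],
    )
```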
What happens when everything goes wrong and the ship is on fire? The nature of cloud is often summed up as "just another person's computer." While true, this is not always a bad thing. For small to mid-size companies, it is a huge influx of support staff that you might not otherwise have. When things fail at this level, though, everyone feels the impact. Even Docker felt major pains during this outage; consider how that affected the many companies that depend on its registry to deploy applications. Moving to the cloud means not just designing for failure but expecting failure. As I work with clients on the resiliency of their solutions, I always stress that there are two types of scenarios we must protect against: Business Continuity and Datacenter Failure. I prefer those terms over the generic disaster recovery. Yesterday's outage is an example of a Datacenter Failure that cascaded into a major service interruption, and though it might not have been a huge crater or act-of-God event, it was a disaster.
When do you push the emergency button, grab the kids, and flee? The timing will depend on how you built the resiliency of your environment. In the case of this AWS outage, the only option would be to jump either to a new region or to a new cloud. For many of you, this is not a simple option or decision; it would be tantamount to breaking the glass too soon and initiating a process that is not quickly reversible. This is where a hybrid approach to traditional cloud services could benefit your company. Specifically, let's consider moving from AWS region us-east-1 (N. Virginia) to us-west-1 (California). If we include a pair of replicating ONTAP Cloud systems, one in each region, then we could easily break the NetApp SnapMirror relationship and bring the data up in California. The hardest things to move are your applications, but this is solved by leveraging tools like CHEF, CloudFormation, Terraform, or even homegrown scripts and custom AMIs. Imagine a completely cold site that can start at the touch of a button with very little, if any, data loss.
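A homegrown "break glass" script for that cold site might look something like the boto3 sketch below. The stack name, template URL, and parameters are assumptions, and it presumes the CloudFormation template and the replicated data already exist in us-west-1.

```python
import boto3

DR_REGION = "us-west-1"
STACK_NAME = "app-dr-stack"                      # hypothetical stack name
TEMPLATE_URL = (
    "https://s3-us-west-1.amazonaws.com/my-dr-bucket/app-stack.yaml"
)                                                # template kept in the DR region's S3

cfn = boto3.client("cloudformation", region_name=DR_REGION)

# "Break glass": stand up the application tier in the DR region.
# The data is assumed to already be there via the replicating
# ONTAP Cloud / SnapMirror pair described above.
cfn.create_stack(
    StackName=STACK_NAME,
    TemplateURL=TEMPLATE_URL,
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dr"}],
    Capabilities=["CAPABILITY_IAM"],
)

# Wait for the stack to come up, then repoint DNS at its outputs.
cfn.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)
```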
AWS got you down? Instantly jump into Azure… OK, I will likely get some flak for that statement, but here is the reality: you can do exactly this and mitigate the failures that were felt yesterday. I have built solutions leveraging NetApp Private Storage (a co-located NetApp array using Direct Connect and ExpressRoute). Heck, I even went so far as to create a full failover solution with SQL Server to fail from one cloud to another. This shrinks my failure domain since I am not only on different hardware but also in a different physical datacenter and on a different platform. Two-vendor strategy, right? Why not in the cloud? Alternatively, NetApp now offers ONTAP Cloud for Azure, so the process gets even easier. I only need to keep that system running and in sync to protect my company from an outage.
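One simple way to steer traffic to the standby cloud is a DNS change. The sketch below assumes the zone is hosted in Route 53, which runs separately from the regional S3 service that failed; the hosted zone ID and Azure endpoint are placeholders.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z1EXAMPLE"                       # hypothetical hosted zone
RECORD_NAME = "app.example.com."
AZURE_ENDPOINT = "app-standby.azurewebsites.net."  # hypothetical Azure front end

# Repoint the application CNAME at the standby running in Azure.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over from AWS to the Azure standby",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": AZURE_ENDPOINT}],
            },
        }],
    },
)
```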
At the end of the day, public cloud isn't going to meet all of your uptime requirements on its own. I feel this is a critical point: failures happen and should be expected. Heck, Netflix even built a tool called Chaos Monkey to constantly test whether their systems could survive catastrophic failures. In many ways, you must plan for the maintenance guy at your proverbial apartment not fixing your issue at the speed you need. Have a backup plan, and revisit your strategy for how you handle Business Continuity and Datacenter Failures.
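The idea behind that kind of testing can be reduced to a toy script. This is not Netflix's Chaos Monkey, just a sketch of the concept: it randomly terminates an instance that has explicitly opted in to chaos testing via a hypothetical tag, so you find out whether the tier survives before a real outage does the experiment for you.

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Only target instances that explicitly opted in via a (hypothetical)
# chaos-opt-in tag, and only run this against test environments.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print("Terminating %s to verify the tier survives the loss" % victim)
    ec2.terminate_instances(InstanceIds=[victim])
```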