Minimizing the impact of cloud outages

January 18, 2022
Uma Palepu

Since the historic AWS outage in December 2021, more than 15 additional high-profile cloud outages have disrupted the availability of widely used business applications. Most recently, a Microsoft outage in January of this year created global disruption for Azure, Teams, Outlook, and SharePoint users, followed by an additional incident affecting Outlook users in February.

Cloud outages can happen for various reasons, including human error, electrical incidents, weather, vandalism, and technical errors related to deployment, maintenance and configuration. In at least one case, a ransomware attack was to blame – and it will likely happen again. Attackers continually probe public and private cloud infrastructure for weaknesses, often using large numbers of cloud-hosted virtual machines of their own to do it.

Most companies already leverage a private, public or hybrid cloud model for the significant advantages of scalability, reliability and high availability that cloud technologies enable. (We talk more about the advantages of the public cloud in this blog post.)

And while we all benefit from this expansion of the cloud, the increased use also puts tremendous pressure on cloud infrastructures to deliver. Expectations are high, and people consider the cloud as critical to modern life as basic utilities like electricity, water or gas. As a result, the impact of a cloud outage can be far-reaching, and each one places public cloud companies – and the services that rely on them – under more scrutiny. With cloud outages now occurring nearly monthly, it’s important for businesses to prepare for an inevitable outage with a plan that ensures they can recover quickly and continue to serve clients.

Six ways to prepare your team to minimize security risks and downtime from cloud outages

Don’t let the next outage catch you off guard! Cloud vendors must stay ahead of these threats and react quickly when they occur, but it’s equally important for those leveraging the cloud to be prepared. Preparation helps you manage the risks.

Analyze the readiness of your team to respond to critical situations. Develop short, medium and long-term plans to address areas of risk.
Ensure you have a business continuity and disaster recovery plan. Understand your specific business and application profile, and ensure everyone on your technology team has the same understanding of what it means to be “up and running.” Develop a plan for activities, with responsibilities clearly assigned for all tasks required to bring systems back online.
Review your testing plan for disaster recovery, reduced availability and outage scenarios. Determine how long it will take for your team to get systems back up and running in each scenario. Your application and DevOps leadership should drive this process and run periodic, automated tests to check for preparedness and recoverability.
Review the commitments you’ve made to customers in your service-level agreements (SLAs). Can you quantify the costs to reimburse customers in the case of an unexpected outage? Do you need to revise these terms?
Ask your product and technology leadership to outline the business impact of an outage qualitatively and quantitatively. For example, how will the impact be measured in terms of both lost revenue and loss of goodwill?
Assess how your current cloud architecture can reduce the impact of service outages. Ensure your team maximizes resiliency through proper application design and infrastructure choice and thoroughly tests for recoverability. In public cloud scenarios, a simple yet high-value strategy is to deploy across multiple availability zones (AZs), at a minimum.
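
The questions above about SLA reimbursements and lost revenue can be turned into a rough number. The sketch below is one simple way to do that in Python. Every name and figure in it (OutageScenario, the hourly revenue, the credit rate) is a hypothetical placeholder rather than data from any real business, and it covers only direct costs. Loss of goodwill still needs a qualitative assessment.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """Hypothetical inputs for a single outage; replace with your own figures."""
    duration_hours: float          # how long the service is unavailable
    revenue_per_hour: float        # average revenue normally earned per hour
    monthly_contract_value: float  # total monthly fees covered by customer SLAs
    sla_credit_rate: float         # fraction of those fees refunded if the SLA is missed


def estimate_direct_cost(scenario: OutageScenario) -> dict:
    """Return lost revenue, SLA credits owed and their sum for one outage."""
    lost_revenue = scenario.duration_hours * scenario.revenue_per_hour
    sla_credits = scenario.monthly_contract_value * scenario.sla_credit_rate
    return {
        "lost_revenue": lost_revenue,
        "sla_credits": sla_credits,
        "total": lost_revenue + sla_credits,
    }


# Illustrative numbers only: a four-hour outage that triggers a 10% credit.
scenario = OutageScenario(
    duration_hours=4,
    revenue_per_hour=10_000,
    monthly_contract_value=500_000,
    sla_credit_rate=0.10,
)
print(estimate_direct_cost(scenario))
```

Running the same calculation for several outage durations gives product and technology leadership a starting point for deciding whether current SLA terms are affordable.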

Three ways to manage outages and security risks with cloud providers

Ensure your team has a common understanding of the commitments and response plans of your cloud vendors and that these are accurately accounted for in your overall cloud recovery and security risk mitigation plan.

Review the terms of your vendors’ SLAs. What commitments are your cloud providers making to you regarding their service? What reimbursements will they provide for reduced availability?
Consider SLAs in quantifiable terms and make sure they are appropriate to the level of business impact. The higher the risk, the more you have to factor in the guarantees from your vendors.

Most public cloud providers calculate SLAs on a monthly basis. This means that a provider guaranteeing 99.99% uptime can be down for about four to five minutes each month (roughly 4.3 minutes in a 30-day month) and still meet its commitment. A single seven to eight-hour outage drops that month’s uptime to about 99%. So, make sure you have appropriate credits included in your SLAs.
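
To check this arithmetic against your own SLA figures, a short Python sketch such as the following can help. It assumes a 30-day month and simple wall-clock uptime. The function names are illustrative, and real providers define the measurement window and exclusions in their own SLA terms.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month that still satisfy the given uptime SLA."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def monthly_uptime_percent(outage_hours: float, days_in_month: int = 30) -> float:
    """Uptime percentage for a month containing the given hours of outage."""
    total_hours = days_in_month * 24
    return 100 * (1 - outage_hours / total_hours)


print(round(allowed_downtime_minutes(99.99), 2))  # 4.32 minutes per month
print(round(monthly_uptime_percent(7.5), 2))      # 98.96 percent
```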
Consider the implications of cloud availability for your unique business requirements. Your cloud strategy and risk will vary depending on whether you use infrastructure as a service (IaaS) or platform as a service (PaaS) capabilities from the cloud vendor.

In the case of IaaS, the vendor is only responsible for ensuring the uptime of systems under its control, such as the hardware behind the services. In this scenario, you are responsible for ensuring applications are secure and properly designed to protect against intrusions and outages. Depending upon the business context, the cost-risk-benefit trade-off of running across multiple availability zones or multiple regions may make this additional responsibility worth it to you.

In the case of PaaS (for example, Amazon DynamoDB or Azure SQL services), vendors own the responsibility of ensuring the uptime of their services. You still need to make the right choices and design for higher security and lower risk, but it is less of a burden.
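
To make the multi-zone trade-off mentioned for IaaS more concrete, the Python sketch below estimates how much availability redundancy can add. It rests on a strong simplifying assumption: that zones fail independently and that the application stays up as long as any one zone is up. Real zones share some failure modes, so treat the result as an upper bound. The 99% per-zone figure is a hypothetical input, not a vendor number.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that survives if at least one zone is up.

    Assumes independent zone failures, which is optimistic in practice.
    """
    if not 0 <= zone_availability <= 1:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The service is down only when every zone is down at the same time.
    return 1 - (1 - zone_availability) ** zones


for zones in (1, 2, 3):
    availability = combined_availability(0.99, zones)
    print(f"{zones} zone(s): {availability * 100:.4f}% available")
```

Comparing these figures with the added cost of each extra zone or region is one way to frame the cost-risk-benefit decision.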

To learn more about what you can do to strengthen cloud security and build your cloud outage recovery plan, visit our Connect page to contact our team.
