### Solution SLAs in Azure

Let’s talk about the Service Level Agreement (SLA) of your solution in Azure.

Hal Berenson recently wrote a great article about SLAs.  It provides great conceptual background for the present article.

Here we want to focus on how you should proceed to come up with an SLA for your solution.

Although we are going to use Azure in all examples, most of the guidance applies to any public cloud solution.

I see a lot of customers who take a common shortcut.  They take the component with the highest SLA (say %99.95) and assign that SLA to their entire solution.  This is quick indeed but, as we’ll see, it gives a very optimistic (hence risky for the provider) SLA.

Here we will discuss an approach to establish a theoretical SLA baseline based on the SLAs of sub components.

We’ll talk about availability SLAs.  The same logic applies to other characteristics, e.g. performance, durability, etc.

## SLA for Service Consumer

For a service consumer, the SLA is part of the specs of the service and sets the expectations.  It is something we design against.

It is also something guaranteed, i.e. it is backed financially.  That part is important and it isn’t.  It shows the service provider puts its money where its mouth is:  it will refund us when it fails the SLA.  But it doesn’t back our business.  We might get a refund of \$15 because a service was down for an extra 30 minutes a month.  But how much is 30 minutes worth of business during peak hours?

Besides the contractual SLA, what is interesting is the actual average uptime of a service.

## SLA for Service Provider

For a Service Provider, the SLA is a compromise.  It’s a compromise for engineering, who would like it to be as low as possible so they can be sure to hit it (e.g. 20 hours a week).  It’s a compromise for sales too.  Sales likes the SLA to be as high as possible so they can attract customers with an aura of reliability (e.g. %99.9999).

Missing an SLA means a financial penalty but also a reputation penalty.  If you miss it too often, consumers can’t rely on it and usage will likely drop.

It’s a tough balancing act, and this is why the approach we will lay out should help establish a baseline.

## Azure SLAs

Azure has a comprehensive list of SLA documented here.

For instance, the App Service SLA is %99.95 (at the time of this writing, early 2018).  Most (if not all) Azure SLAs are monthly, which means the counters reset every month.

Those are contractual SLAs.  For actual measured average uptimes, a good reference is Cloud Harmony.  They measure most public cloud providers (including obscure ones).  Cloud Harmony belongs to Gartner which makes it independent from cloud vendors.  They provide uptime measures for the past 30 days for VMs, storage, CDN, Web site & Databases.

Uptime routinely scores %100, even across regions.  Sometimes it is a little lower.  Exceptionally it goes below the SLA in some regions.

Microsoft, like other giant cloud providers, has a lot to lose (in reputation) by missing their SLA.  They invest a lot to prevent that from happening.

It is important to note that Azure is planned to be up %100 of the time.  When there is downtime, it is caused by a human error or hardware malfunction.

## Our Solution in Azure

So we already put some nuance on the SLA of a Service Provider.

Another fallacy of SLA is to believe the Azure Infrastructure SLA is our solution SLA.  For instance, if a solution runs on an Azure service with %99.95 SLA, our solution has %99.95 of uptime.

This assumes our solution is up %100 of the time whenever the underlying infrastructure is up.  Of course this isn’t possible.  Deployments occur, bugs occur, configuration glitches occur, etc.

## Probability 101

Can we simply compute a theoretical SLA for a solution using 2 or more Azure services?  Yes we can.

Let’s start with a simple example:  a Web App (SLA of %99.95) with a SQL DB (SLA of %99.99).  What is the SLA of that solution?

To answer that question, we’ll need a probability 101.  It will be light, it will be fast, but we need it.

We can interpret an SLA as a probability.  Let’s observe the following:  measured availability (over a long period of time) equals the probability of being up (at any given time).

$P(\text{service is up}) = availability = \dfrac {\text{total time service is up} }{\text{total time measured} }$

Indeed, if a service has an availability of %99.9, it means the probability of it being up at any time is %99.9.  It also means that during 30 days, the service can be down for around 43 minutes.
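The conversion from an availability percentage to a downtime budget can be sketched in a few lines of Python.  The function name `allowed_downtime_minutes` and the 30-day window are illustrative assumptions, not part of any Azure API.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability percentage."""
    total_minutes = days * 24 * 60  # minutes in the measurement window
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    budget = allowed_downtime_minutes(sla)
    print(f"availability {sla} over 30 days -> {budget:.1f} minutes of downtime")
```

For an availability of 99.9 this gives 43.2 minutes, which is the "around 43 minutes" mentioned above.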

We want to compute the probability of both services (Web App + SQL DB) being up at the same time.

Let’s consider the multiplication law of probability.  That is, if A & B are independent then:

$P(\text{A and B}) = P(A) \cdot P(B)$

In our case:

$\begin{array}{lcl} P(\text{Web App and SQL DB are up}) &=& P(\text{Web App is up}) \cdot P(\text{SQL DB is up})\\ &=& \%99.95 \cdot \%99.99\\ &=& \%99.94 \end{array}$

This is often a surprising result for customers we speak to.

But if we think about it, the fact the compound SLA is lower than each individual SLA does make sense.  Each component can fail independently.  So the more components, the more failures can occur.
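The multiplication rule generalizes to any number of services that all need to be up.  Here is a minimal Python sketch; the helper name `serial_sla` is ours, and the computation assumes independent failures as discussed above.

```python
from math import prod


def serial_sla(slas_percent):
    """Compound availability (in percent) of services that must ALL be up,
    assuming their failures are independent."""
    return 100 * prod(s / 100 for s in slas_percent)


web_app, sql_db = 99.95, 99.99  # SLAs quoted in the article
print(f"{serial_sla([web_app, sql_db]):.4f}")  # 99.9400
```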

## Independent SLAs

There is a bias in the previous computation.

We assumed that both service failures were independent events.  That is debatable.

Different services are isolated.  This means they fail independently.  But they aren’t %100 isolated in the sense they still share some failure modes.  Some events could take both services down.  For instance a natural disaster.  More common would be a bug introduced in the region’s compute fabric.

Parts of the failure probability for each service aren’t independent of each other.  In general, assuming both SLAs are independent is pessimistic.  It results in a compound SLA lower than measured.

There is no easy way to estimate that though.  This is one of the reasons why measures are necessary.
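To get an intuition for why the independence assumption is pessimistic for services that depend on each other, consider a toy model.  It assumes, purely for illustration, that each service is up only when a shared component (e.g. the regional fabric) is up and its own private part is up.  The probabilities below are made-up numbers, not Azure figures.

```python
def p_both_up_shared(shared: float, own_a: float, own_b: float) -> float:
    """Exact P(A and B are up) when both services need the same shared component."""
    return shared * own_a * own_b


def p_both_up_naive(shared: float, own_a: float, own_b: float) -> float:
    """P(A is up) * P(B is up), i.e. the result of treating A and B as independent."""
    return (shared * own_a) * (shared * own_b)


# Hypothetical probabilities (fractions, not percentages)
shared, own_a, own_b = 0.9999, 0.9996, 0.9999
exact = p_both_up_shared(shared, own_a, own_b)
naive = p_both_up_naive(shared, own_a, own_b)
print(f"exact: {100 * exact:.4f}  naive (independent): {100 * naive:.4f}")

# The naive product counts the shared failure mode twice, so it is never higher.
assert naive <= exact
```

In this model the naive product equals the exact value multiplied once more by the shared component's availability, hence the lower figure.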

## Boosting Availability

Let’s consider ways to boost availability in public cloud.  A classic method:  load balancing / failing over across regions.

Let’s consider a simple scenario first:  RA-GRS storage.  Locally Redundant Storage (LRS) has an SLA of %99.9.  Reads on Read-Access Geo-Redundant Storage (RA-GRS) have an SLA of %99.99.  Let’s try to understand why.

We have the following solution:

and we want to know what is the SLA of that solution.  Here the problem is different.  Both services don’t depend on each other:  they complement each other.

What we want to know is the probability that at least one of the services is up.  One of them being up is enough.

For that, we need to consider some probability laws again.  The negation of a probability is easy to compute:

$P(down) = 1 - P(up)$

Using simple mathematical manipulations, we can find the probability we are looking for:

$\begin{array}{lcl}P(\text{A or B is up}) &=& 1 - P(\text{A and B are down})\\ &=& 1 - P(\text{A is down}) \cdot P(\text{B is down}) \\&=& 1 - (1 - P(\text{A is up})) \cdot (1 - P(\text{B is up}))\end{array}$

Now let’s consider RA-GRS:

$\begin{array}{lcl}P(\text{Primary or Secondary is up}) &=& 1 - (1 - P(\text{Primary is up})) \cdot (1 - P(\text{Secondary is up}))\\&=& 1 - (1 - \%99.9) \cdot (1 - \%99.9)\\&=& 1 - \%0.1 \cdot \%0.1\\&=& 1 - \%0.0001\\&=&\%99.9999\end{array}$

Boom.  The theoretical value is even higher than the published %99.99, so we can imagine that a computation like this one is behind the read SLA of RA-GRS.
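The "at least one is up" formula can be sketched in Python as follows.  The helper name `redundant_sla` is ours, and the computation assumes the replicas fail independently.

```python
from math import prod


def redundant_sla(slas_percent):
    """Availability (in percent) of a set of replicas where ONE being up is enough,
    assuming independent failures."""
    p_all_down = prod(1 - s / 100 for s in slas_percent)
    return 100 * (1 - p_all_down)


primary, secondary = 99.9, 99.9  # LRS SLA in each region, as quoted in the article
print(f"{redundant_sla([primary, secondary]):.4f}")  # 99.9999
```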

Let’s notice a caveat though.  This is the SLA that at least one region’s storage is up.  There is no load balancing / failing over at the service level.  This means the service’s client must try one then try the other if the primary fails.  We will take that into account in the next few examples.
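What "the client must try one then try the other" means can be illustrated with a small Python sketch.  The two callables stand for reads against the primary and secondary endpoints; they are hypothetical placeholders and not calls from the Azure Storage SDK.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_failover(read_primary: Callable[[], T], read_secondary: Callable[[], T]) -> T:
    """Try the primary endpoint first; fall back to the secondary if it raises."""
    try:
        return read_primary()
    except Exception:
        # Primary is unavailable: the client itself retries against the secondary.
        # Real code would catch only the transport errors it knows how to handle.
        return read_secondary()


def failing_primary() -> str:
    raise ConnectionError("primary region unavailable")  # simulated outage


def healthy_secondary() -> str:
    return "blob content from secondary"


print(read_with_failover(failing_primary, healthy_secondary))
```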

Let’s consider our solution with a Web App and a SQL DB.  For simplification, let’s assume the SQL DB is either static or read only on the secondary site.

$\begin{array}{lcl}P(\text{Primary or Secondary is up}) &=& 1 - (1 - P(\text{Primary is up})) \cdot (1 - P(\text{Secondary is up}))\\&=& 1 - (1 - \%99.94) \cdot (1 - \%99.94)\\&=& 1 - \%0.06 \cdot \%0.06\\&=& 1 - \%0.000036\\&=&\%99.999964\end{array}$

This is a nice SLA.  But as we noticed in the RA-GRS example, the client needs to be the one implementing the failover.  This isn’t acceptable in a web scenario.  Let’s implement the failover on the service side by using Azure Traffic Manager (SLA %99.99).

How do we compute the SLA of that solution?

UPDATE (22-01-2018):  There was an error in the original publication.  Thanks @d_chapdelaine for pointing it out!

We need to chain the Traffic Manager with the fail-over solution (i.e. %99.999964).

$\begin{array}{lcl} P(\text{TM and fail over is up}) &=& P(\text{TM is up}) \cdot P(\text{Either Primary or Secondary is up})\\ &=& \%99.99 \cdot \%99.999964\\ &=& \%99.989964 \end{array}$

Yes, by adding Traffic Manager in front of our web solution, we weaken it.  Doing it on the client side was %99.999964 while on the service side it is %99.989964.  We are still higher than the %99.94 of a single region, but only inching towards something higher.

In general, as we see, load balancing across regions does boost the SLA, but it is quite an expensive way to not even gain a nine.
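The whole chain (two services per region, two regions, Traffic Manager in front) can be recomputed with a short Python sketch.  The helper names are ours, the SLA values are the ones quoted in this article, and every step assumes independent failures.

```python
from math import prod


def serial_sla(slas_percent):
    """All services must be up (independent failures assumed)."""
    return 100 * prod(s / 100 for s in slas_percent)


def redundant_sla(slas_percent):
    """At least one replica must be up (independent failures assumed)."""
    return 100 * (1 - prod(1 - s / 100 for s in slas_percent))


WEB_APP, SQL_DB, TRAFFIC_MANAGER = 99.95, 99.99, 99.99  # SLAs quoted in the article

region = serial_sla([WEB_APP, SQL_DB])                  # one region: Web App + SQL DB
two_regions = redundant_sla([region, region])           # client-side failover
with_tm = serial_sla([TRAFFIC_MANAGER, two_regions])    # failover behind Traffic Manager

for label, value in [
    ("single region", region),
    ("two regions, client-side failover", two_regions),
    ("two regions behind Traffic Manager", with_tm),
]:
    print(f"{label}: {value:.6f}")
```

Composing the two helpers this way makes it easy to test design alternatives before committing to one.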

Challenges come with stateful solutions (the majority of interesting solutions).  They require replicating data to the secondary region.  They also require failing over to the secondary region prior to being able to write there.  It is a case of a macro state machine.

## Calculating SLA on a real solution

Let’s consider a more realistic solution with 4 tiers:

The compound SLA is the product of the probabilities, hence %99.84.

This example brings to the forefront that it can be easy to drop below 3-nines if we aren’t careful.
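A quick Python check shows how fast a chain of tiers erodes availability.  The tier names and SLA values below are hypothetical placeholders chosen for illustration; they are not the tiers of the solution above.

```python
from math import prod


def serial_sla(slas_percent):
    """All tiers must be up (independent failures assumed)."""
    return 100 * prod(s / 100 for s in slas_percent)


# Hypothetical tiers: placeholder values, NOT the ones from the article's diagram.
tiers = {"front end": 99.95, "API": 99.95, "cache": 99.9, "database": 99.99}

compound = serial_sla(tiers.values())
print(f"compound SLA: {compound:.2f}")
print("meets three nines" if compound >= 99.9 else "below three nines")
```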

## Scenarios

Sometimes it can be interesting to consider scenarios instead of the entire solution.

Azure Storage is a good example:  the reading vs writing scenarios do not have the same SLA for RA-GRS.

We could consider scenarios where the DB isn’t involved in previous examples.  SLA would be higher for those scenarios.

Let’s consider a solution where Azure AD B2C authenticates end-users.  For scenarios where the user is already authenticated, Azure AD B2C doesn’t need to be up.  Again, that would improve the SLA for those scenarios.

It is useful to consider scenarios when the solution implements distinct business processes.  Those business processes are understood by the consumer.  Having different SLAs for different processes can bring nuance to the offering.  It can also add complexity.

A good example is an e-Commerce solution having different SLAs:

• One for browsing the catalog (read only)
• One for interacting with the cart (session writing)
• One for placing an order (back-end interaction)

Those three SLAs refer to business processes the end user can easily identify.
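Per-scenario SLAs can be sketched in Python by listing the services each scenario needs.  The mapping below is a hypothetical example: it only uses the Web App and SQL DB figures quoted earlier, and it leaves out the cart scenario because no session store (or its SLA) is named in this article.

```python
from math import prod

# SLAs quoted earlier in the article (in percent)
SLA = {"web_app": 99.95, "sql_db": 99.99}

# Hypothetical mapping of business scenarios to the services they depend on
SCENARIOS = {
    "browse catalog": ["web_app"],
    "place order": ["web_app", "sql_db"],
}


def scenario_sla(services):
    """Compound SLA of the services a scenario needs (independent failures assumed)."""
    return 100 * prod(SLA[name] / 100 for name in services)


for scenario, services in SCENARIOS.items():
    print(f"{scenario}: {scenario_sla(services):.4f}")
```

The fewer services a scenario touches, the higher its theoretical SLA.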

Architecting our solution to “gracefully degrade experience” when components fail also improves the SLA.  This is easier said than done in most cases, especially for existing applications.

## Theory vs Measure

As Hal’s article mentioned, the sure way to establish an SLA is to combine theoretical values and measures.

We definitely encourage you to measure the availability of your solution.  This would allow you to iron out all the biases we mentioned.
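As a minimal sketch of what measuring could look like, the following Python snippet derives availability from the results of periodic health probes.  The probe data is simulated, and the function name `measured_availability` is ours; a real setup would take its data from a monitoring tool.

```python
def measured_availability(probe_results):
    """Percentage of successful probes.  With evenly spaced probes this
    approximates (total time up) / (total time measured)."""
    if not probe_results:
        raise ValueError("no probe results to measure")
    return 100 * sum(probe_results) / len(probe_results)


# Simulated data: one probe per minute for 30 days, 25 of which failed
probes = [True] * (30 * 24 * 60 - 25) + [False] * 25
print(f"{measured_availability(probes):.3f}")  # 99.942
```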

In this article we articulated a methodology.  We should use it to establish a theoretical baseline.  This is useful before measures are available.  For instance, it is useful to orient the design at the architecture stage.  Measures should definitely complement it once they become available.

## Summary

Let’s recap as we are aware that was a lot of material.

We need to establish the SLA of our solution at a level:

• Low enough we are comfortable delivering
• High enough to be attractive to service consumers

The methodology we articulated here helps determine a theoretical baseline.  It combines SLAs of services used for the solution.

When services depend on each other (e.g. the Web App on the SQL DB), we multiply SLAs.  When services complement each other (multi-region failovers), we take one minus the product of their probabilities of being down.

Beyond compounding Azure services, we need to consider application failure modes:

• Down-time related to deployment
• Bugs introduced in deployment
• etc.

We can boost the SLA of a solution by implementing a multi-region failover.  This increases the complexity and cost of the solution.

We can consider separate scenarios with separate SLAs.  We can also architect in a way to gracefully degrade experience if some sub-services go down.

It is important to measure availability to get a realistic SLA.

We can then iterate to either improve the SLA or optimize the cost of the solution.