Meaningful Availability

I actually read this paper for the first time a year ago. But I found it so good that I decided to give it another read, this time with a written summary.

We all know that availability is a critical requirement for web applications. Yet capturing a metric that meaningfully reflects an application's availability is far from trivial.
This paper proposes a way to calculate availability, named windowed user-uptime.

The State of the Metrics

There are many ways to compute availability.

Historically, it has been described in terms of Mean Time to Failure (MTTF) and Mean Time to Recovery (MTTR).
However, a binary up/down metric doesn’t work for complex distributed systems: something is always failing somewhere, which doesn’t mean the system has availability issues.

Over time, big cloud providers have used various availability metrics, which fall into two categories: time-based and count-based.

Time-based availability metrics

Time-based availability metrics define availability as:

availability = uptime / (uptime + downtime)

This is a very meaningful metric for users, as it is directly based on whether the system is up or down.
However, complex distributed systems make this kind of computation complicated, as it treats an outage the same way whether the entire platform is down or only a single service is.

Count-based availability metrics

Count-based availability metrics define availability as:

availability = successful requests / total requests

This metric copes better with complex systems: if a single service is unavailable, all the successful requests made elsewhere still count.
However, it is biased towards the most active users, who make far more requests than less active ones.
It is also biased by client behavior during outages: clients make fewer requests when the system is down, which artificially inflates availability.
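
To make the two definitions concrete, here is a minimal sketch in Python; the function names and example numbers are mine, not from the paper.

    def time_based_availability(uptime_seconds, downtime_seconds):
        # availability = uptime / (uptime + downtime)
        return uptime_seconds / (uptime_seconds + downtime_seconds)

    def count_based_availability(successful_requests, total_requests):
        # availability = successful requests / total requests
        return successful_requests / total_requests

    # Example: 15 minutes of downtime in an hour, or 15 failed requests out of 60.
    print(time_based_availability(45 * 60, 15 * 60))   # 0.75
    print(count_based_availability(45, 60))            # 0.75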

User-Uptime

The idea of user-uptime is to compute uptime per user rather than for the system as a whole, combining the strengths of both time-based and count-based availability.

user-uptime

As an example, let’s take the following outage, which affects users selectively.

per-user incident

In this example, user1 makes far more requests than any other user, so their failures weigh disproportionately on a count-based availability.
User-uptime, which sums the uptime of each individual user, weighs every user equally.
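
A minimal sketch of that aggregation, assuming per-user uptime and downtime durations have already been labelled; the function and inputs are mine, not the paper's implementation.

    def user_uptime(per_user_durations):
        # per_user_durations: list of (uptime_seconds, downtime_seconds), one per user.
        # availability = sum of all users' uptime / sum of all users' (uptime + downtime)
        total_up = sum(up for up, _ in per_user_durations)
        total = sum(up + down for up, down in per_user_durations)
        return total_up / total

    # A heavy user and a light user weigh the same: each contributes at most
    # their own active time, regardless of how many requests they made.
    print(user_uptime([(55 * 60, 5 * 60), (30 * 60, 30 * 60)]))  # ~0.71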

Challenges

Of course, user-uptime has challenges too.

Duration labelling is easy as long as back-to-back events are both failures or both successes. However, as soon as successes are interleaved with failures, labelling the duration of an unavailability becomes much less trivial.

If a user is not using the system at a given time, they have no perception of it being up or down.
So only active users are used to compute availability. A user is considered active if their last request was within the 99th percentile of the interarrival time between requests, which is 30 minutes for Gmail.

active and inactive periods

In this example, the user enters a period of downtime with their first failed request. Then, after a second failure, they become inactive. During that inactivity, they are not counted towards downtime, as they are no longer using the application.

Once they make a request again, and it fails, they count towards downtime again, until the service becomes available.
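
Here is a rough sketch of that labelling rule, assuming a time-ordered list of (timestamp, success) events for a single user and the 30-minute cutoff from the Gmail example; this is my own simplification of the paper's rules, not their implementation.

    from datetime import timedelta

    INACTIVITY_CUTOFF = timedelta(minutes=30)  # p99 interarrival time in the Gmail example

    def label_durations(events):
        # events: time-ordered list of (timestamp, success) for one user.
        # The interval after each request is labelled with that request's outcome,
        # but only up to the inactivity cutoff: once the user stops making
        # requests, the remaining time counts as neither uptime nor downtime.
        uptime = downtime = timedelta(0)
        for (ts, success), (next_ts, _) in zip(events, events[1:]):
            active = min(next_ts - ts, INACTIVITY_CUTOFF)
            if success:
                uptime += active
            else:
                downtime += active
        return uptime, downtime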

Properties

To analyze user-uptime, they generated synthetic user requests for an hour, and caused all of them to fail during a contiguous 15-minute window.
During that hour, each client made a different number of requests, to mimic real user behavior.

Availability was computed using both a count-based availability metric, and user-uptime.

comparison of success-ratio and user-uptime

This graph shows that both availabilities have similar values, around 0.75 (45 minutes of uptime out of an hour, i.e. 15 minutes of downtime).

However, the standard deviation of user-uptime is much lower (0.05, compared to 0.08 for the count-based metric), showing that user-uptime captures the outage more precisely.

An even more interesting graph appears when retries are added: whenever a request fails, the user retries it.

comparison of success-ratio and user-uptime with retries

Here we can see that, as more failed requests are made, the count-based availability goes down, even though availability from the user’s point of view should still be 0.75.
User-uptime, however, as it computes the data per user, keeps a stable value.
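
A quick hypothetical illustration of that bias: suppose a user makes one request per minute for an hour and the system is down for 15 minutes. Without retries, success-ratio is 45 / 60 = 0.75. If the user retries each failed request twice, successes stay at 45 but the total grows to 45 + 45 = 90, so success-ratio drops to 0.5, even though the user still experienced exactly 45 minutes of uptime out of 60, which is what user-uptime reports.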

Windowed User-Uptime

Windowed User-Uptime is a way to compute uptime for any period of interest.

For any period of interest (say, a quarter), they define windows of different sizes (e.g., 1 minute, 1 hour, 1 day, 1 month, 1 quarter).
Then, for each window size, they compute the availability of every window of that size within the period and keep the worst one.

For example:

windowed user-uptime

In this figure, we can see:

  • The overall availability for the quarter is 99.991% (the point at the far right of the curve).
  • The worst 1-minute availability was 92%: there was no minute in the quarter where availability was worse than that.
  • The knee of the curve, at about 2 hours, shows the length of the longest incident, which brought availability down to 92%.
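
A minimal sketch of the windowing idea, assuming an availability(start, end) helper that computes user-uptime over an interval; the window sizes and the non-overlapping sliding are my own simplifications, not the paper's implementation.

    from datetime import timedelta

    WINDOW_SIZES = [timedelta(minutes=1), timedelta(hours=1), timedelta(days=1),
                    timedelta(days=30), timedelta(days=90)]

    def windowed_user_uptime(availability, period_start, period_end):
        # For each window size, keep the worst availability over all windows
        # of that size within the period of interest.
        worst = {}
        for size in WINDOW_SIZES:
            start, values = period_start, []
            while start + size <= period_end:
                values.append(availability(start, start + size))
                start += size  # non-overlapping windows, for simplicity
            worst[size] = min(values)
        return worst

Plotting the worst availability against the window size produces the kind of curve shown in the figure above.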

The rest of the section (5.2) gives the mathematical proof that windowed user-uptime is monotonically non-decreasing in the window size.

Production Evaluation

Rolling out user-uptime alongside success-ratio has revealed some biases in the latter.

They discovered that 99% of users contribute only 38% of all requests, while the remaining 1% contribute 62% of the total. Success-ratio is dominated by that 1%, which makes availability consistently appear better or worse than what most users actually experience.

Several examples are included; here is one of them:

effect of abusive users

In this example, both availability metrics look alike before the event. Once it starts, however, user-uptime shows almost no impact, while success-ratio shows a clear drop in availability.

Their investigation uncovered that a small number of users had enabled a third-party application that was making invalid requests, resulting in errors that were retried without exponential back-off.

Even though it impacted only a handful of users, the success-ratio went down. User-uptime, which barely moved, matches users’ perception much better.

Windowed user-uptime also enables nice interpretations of the burstiness of unavailability.

Let’s look at an example:

monthly windowed user-uptime between two services

Both services have the same monthly availability. Yet their graphs tell very different stories.

The Hangouts graph has a clear knee at about 4 hours, which corresponds to the longest unavailability event that happened for that service.
We can therefore assume that a single large event happened this month, and that next month should be better for this service.

Drive, on the other hand, has no knee. We can therefore assume that the service suffers from many short incidents, which it is likely to keep suffering from in the following month.
That service probably has more issues to fix.

Conclusion

This paper is a bit unusual in the sense that it could have been two papers: one about user-uptime and the other about windowed uptime.
Both are related, but could also be applied independently.

In the last section, discussing the applicability of windowed user-uptime, they mention that achieving this required fine-grained logs of individual user operations, which screams “traces” to me. It seems very hard to implement these methods with metrics only.

I believe these two methods are great ways to compute availability, and user-uptime is a great metric to base alerts on.

As they write at the end of the paper:

We are confident that windowed user-uptime is broadly applicable: any cloud service provider should be able to implement it.