I actually read this paper for the first time a year ago. But I found it so good that I’ve decided to give it another read, this time with a written summary.
We all know that availability is a critical requirement for web applications.
Yet capturing a metric that meaningfully reflects an app's availability is far
from trivial.
This paper proposes a way to calculate availability, named windowed user-uptime.
There are many ways to compute availability.
Historically, it has been described as a comparison between Mean Time to Failure
and Mean Time to Recovery.
However, a binary up/down metric doesn’t work for complex distributed systems: something is always failing somewhere, which doesn’t mean the system has availability issues.
Over time, big cloud providers have used various metric systems, which fall into two categories: time-based and count-based.
Time-based availability metrics
Time-based availability metrics define availability as:
availability = uptime / (uptime + downtime)
This is a very meaningful metric for users, as it is directly based on whether
the system is up or down.
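As a quick worked example (the numbers here are made up, not from the paper), a 30-day month with 43 minutes of downtime lands right around “three nines”:

```python
# Worked example of time-based availability:
# availability = uptime / (uptime + downtime)
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

downtime = 43  # minutes of downtime (illustrative)
uptime = MINUTES_PER_MONTH - downtime

availability = uptime / (uptime + downtime)
print(f"{availability:.5f}")  # 0.99900 — roughly "three nines"
```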
However, complex distributed systems make this kind of computation complicated, as it weighs an availability issue the same whether the entire platform is down or a single service is.
Count-based availability metrics
Count-based availability metrics define availability as:
availability = successful requests / total requests
This metric acknowledges the complexity of the system a bit more: if a single
service is unavailable, all the successful requests to other services still count.
However, it is biased toward the most active users, who make far more requests than less active ones.
It is also biased by client behavior during outages: users make fewer requests while the system is down, which artificially inflates availability.
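A small Python sketch (all user names and counts are made up) shows how a single heavy user can drag the count-based metric down even when most users see no failures at all:

```python
# Hypothetical traffic: user1 makes 1000 requests and is hit by an outage;
# four lighter users make 10 requests each and see no failures.
requests = {
    "user1": {"ok": 500, "failed": 500},  # heavy user, affected by the outage
    "user2": {"ok": 10, "failed": 0},
    "user3": {"ok": 10, "failed": 0},
    "user4": {"ok": 10, "failed": 0},
    "user5": {"ok": 10, "failed": 0},
}

total = sum(u["ok"] + u["failed"] for u in requests.values())
ok = sum(u["ok"] for u in requests.values())
count_based = ok / total
print(f"count-based availability: {count_based:.3f}")  # 0.519

# Yet 4 out of 5 users saw 100% success; only user1 saw 50%.
per_user = [u["ok"] / (u["ok"] + u["failed"]) for u in requests.values()]
print(f"mean per-user success:    {sum(per_user) / len(per_user):.3f}")  # 0.900
```

One user out of five dominates the count-based number simply by being chattier.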
The idea of user-uptime is to consider every user individually, rather than the system as a whole, to combine the strengths of both time-based and count-based availability.
As an example, let’s take the following outage, which affects users selectively.
In this example, user1 makes far more requests than any other user, so their
failures weigh disproportionately on availability.
User-uptime, by summing the uptime of each individual user, weighs them all equally.
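Continuing with made-up numbers, a sketch of the per-user aggregation: each user contributes their own perceived uptime and downtime over the same window, so an outage affecting one heavy user no longer dominates:

```python
# Hypothetical per-user perceived durations (in minutes) over one hour.
users = {
    "user1": {"up": 45, "down": 15},  # affected by the outage
    "user2": {"up": 60, "down": 0},
    "user3": {"up": 60, "down": 0},
}

# user-uptime = sum of per-user uptime / sum of per-user (uptime + downtime)
total_up = sum(u["up"] for u in users.values())
total_down = sum(u["down"] for u in users.values())
user_uptime = total_up / (total_up + total_down)
print(f"user-uptime: {user_uptime:.3f}")  # 165/180 = 0.917
```

Each user contributes the same 60 minutes of observation, regardless of how many requests they made.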
Of course, user-uptime has challenges too.
Duration labelling remains easy as long as back-to-back events are both failures or both successes. As soon as successes are interleaved with failures, however, labelling the duration of an unavailability becomes a lot less trivial.
If a user is not using the system at a given time, they have no perception of
it being up or down.
So they only use active users to compute availability, and define activity using the 99th percentile of interarrival times between requests, which is 30 minutes for Gmail.
In this example, the user enters a period of downtime with their first failed request. After a second failure, they become inactive; during that inactivity, they do not count toward downtime, as they are no longer using the application.
Once they make a request again and it fails, they count toward downtime again, until the service becomes available again.
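A minimal sketch of this labelling, under one simple interpretation: the gap after each request inherits that request's status, capped at the inactivity cutoff, and anything beyond the cap counts as inactive. The timestamps, and the 30-minute cutoff, are illustrative:

```python
from datetime import datetime, timedelta

CUTOFF = timedelta(minutes=30)  # assumed inactivity cutoff (p99 interarrival)

# Hypothetical request log for one user: (timestamp, succeeded?)
events = [
    (datetime(2024, 1, 1, 10, 0), True),
    (datetime(2024, 1, 1, 10, 10), False),  # downtime starts
    (datetime(2024, 1, 1, 10, 15), False),
    (datetime(2024, 1, 1, 11, 30), False),  # user was inactive in between
    (datetime(2024, 1, 1, 11, 40), True),   # service recovered
]

up = down = timedelta()
for (t0, ok), (t1, _) in zip(events, events[1:]):
    # The gap inherits the status of the request that opens it,
    # but only up to the cutoff; the remainder is treated as inactive.
    segment = min(t1 - t0, CUTOFF)
    if ok:
        up += segment
    else:
        down += segment

availability = up / (up + down)
print(f"up={up} down={down} availability={availability:.3f}")
# up=0:10:00 down=0:45:00 availability=0.182
```

The 75-minute gap between the third and fourth requests contributes only 30 minutes of downtime; the rest is inactivity and doesn't count against the service.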
To evaluate user-uptime, they generated synthetic user requests for an hour,
and caused all of them to fail during a contiguous 15-minute window.
During that time, each client made a different number of requests, to mimic real user behavior.
Availability was computed using both a count-based availability metric, and user-uptime.
This graph shows that both availabilities have similar values, around 0.75 (45 minutes of uptime out of an hour, hence 15 minutes of downtime).
However, the standard deviation is much lower for user-uptime (0.05, versus 0.08 for the count-based metric), showing that user-uptime captures the outage more precisely.
An even more interesting graph is the one where they added retries: whenever a request failed, the user retried it.
Here we can see that, as more failed requests are made, the count-based
availability goes down, even though availability from the user’s point of view
is still 0.75.
User-uptime, however, because it computes durations per user, keeps a sane availability value.
Windowed User-Uptime is a way to compute uptime for any period of interest.
For any period of interest (say, a quarter), they create windows of different
sizes (e.g., 1 minute, 1 hour, 1 day, 1 month, 1 quarter).
Then, for each window size, they look at the availability of every window of that size and keep the worst one.
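A toy sketch of that computation over a minute-by-minute availability series (the data and window sizes are made up):

```python
# One hour of per-minute availability (1.0 = fully up),
# with a single contiguous 15-minute outage.
minutes = [1.0] * 60
minutes[20:35] = [0.0] * 15

def worst_window(series, size):
    """Worst average availability over all contiguous windows of `size`."""
    return min(
        sum(series[i:i + size]) / size
        for i in range(len(series) - size + 1)
    )

for size in (1, 5, 15, 30, 60):
    print(f"{size:>2}-minute window: {worst_window(minutes, size):.2f}")
# 1, 5, and 15-minute windows: 0.00 (fully inside the outage)
# 30-minute window: 0.50, 60-minute window: 0.75
```

Note how the worst-window value can only rise as the window grows: that is the monotonicity property the paper proves.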
In this figure, we can see:
The rest of the section (5.2) gives the mathematical proof that windowed user-uptime is monotonically non-decreasing as the window size grows.
Rolling out user-uptime alongside success-ratio exposed biases in the latter.
They discovered that 99% of users contribute only 38% of all requests, while the remaining 1% contribute 62% of the total. Success-ratio over-weights that 1%, making availability consistently appear better or worse than users actually experience it.
Several examples are included; here is one of them:
In this example, both availability graphs look the same before and after
the event.
During the event, however, user-uptime shows almost no impact, while success-ratio shows a clear decrease in availability.
Their investigation uncovered that a small number of users had enabled a third-party application that was making invalid requests, resulting in errors that were retried with no exponential back-off.
Even though it impacted only a handful of users, the success-ratio went down. Since only a small number of users were affected, the user-uptime metric better matches users’ perception.
Windowed uptime also allows very nice interpretation of the burstiness of availability.
Let’s look at an example:
Both services have the same monthly availability. Yet their graphs tell very different things.
The Hangouts graph has a clear knee at about 4 hours, which is the
longest unavailability event that happened for that service.
We can therefore make the assumption that a large event happened this month, and that next month should be better for this service.
Drive, on the other hand, has no knee. We can therefore assume that the
service suffers from many short incidents, and is likely to keep suffering
from them in the following month.
That service probably has more issues to fix.
This paper is a bit weird in the sense that it could have been two, one about
user-uptime and the other about windowed uptime.
Both are related, but could also be applied independently.
In the last section, on the applicability of windowed user-uptime, they mention that achieving this required fine-grained logs of individual user operations, which screams “traces” to me. It does seem very hard to implement those methods with metrics only.
I believe those two methods are great ways to compute availability, and user-uptime is a great metric to base alerts on.
As they end the paper with:
We are confident that windowed user-uptime is broadly applicable: any cloud service provider should be able to implement it.