Measuring the performance of services is tricky. There is an almost irresistible desire to measure average performance, but measuring service performance using averages is pretty much guaranteed to provide misleading results. The best way (that I know of, anyway) to get accurate results when measuring service performance is to measure percentiles, not averages. So do not use averages or standard deviations; do use percentiles. See below for the details.
1 Averages for service performance are typically wrong
So let's say you have your shiny new service and you want to know how it's performing. You likely set up some kind of test bench, start firing off a bunch of requests to the service, and record the latency and throughput of those requests over some period of time. But how should one present a summary of the performance data?
In most cases what I see people do next is calculate average latency and average throughput. If they are particularly fancy they might even throw in standard deviations for both.
Unfortunately in most cases both the average and the standard deviation don't accurately represent the performance of the system.
The reason for this is pretty straightforward - the average and standard deviation attempt to describe the characteristics of a set of data assuming it follows a normal distribution. This is the famous bell-shaped curve. But service performance is almost never normal. In fact the performance distribution tends to be pretty flat for most requests and then fall off a cliff. That is not what you would call a normal distribution. There are lots of reasons for this.
For example, most services have some kind of caching. Caching is typically a pretty good technique, but cache misses are expensive. So while most requests will be serviced quickly out of the cache, some number of requests will cause a cache miss and be substantially more expensive to handle. This behavior isn't really a curve, it's more like a step function.
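To make that concrete, here is a minimal sketch (with made-up numbers: 95% cache hits around 2 ms, 5% misses around 80 ms) showing how the average lands in a latency band that essentially no real request experiences:

```python
import random
import statistics

random.seed(0)

# Made-up service: 95% of requests hit a cache (~2 ms), 5% miss and go
# to the backing store (~80 ms).
latencies = [
    random.gauss(2, 0.5) if random.random() < 0.95 else random.gauss(80, 10)
    for _ in range(10_000)
]

mean = statistics.mean(latencies)
near_mean = sum(1 for x in latencies if abs(x - mean) < 2)

print(f"average latency: {mean:.1f} ms")
print(f"requests within 2 ms of that average: {near_mean} out of {len(latencies)}")
# The average lands in the gap between the two clusters - a latency that
# essentially no real request actually experienced.
```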
Another reason is queuing behavior. Most services can be thought of as a series of queues. As long as the load is within the queues' capacity everything is fine, but once that capacity is exceeded timeouts, failures, etc. will start to happen, since the system can't recover until incoming requests fall enough to let it catch up. So pretty much the system just shuts down until the queues are cleared.
Now there are ways to torture normal distributions to get behavior that is closer to what one sees in services. We can start to talk about kurtosis, skew, etc. But these are all attempts to force the data to be summarized by some model, and if that model isn't appropriate to the data set then the information the model is giving is just plain wrong. So if one is going to start playing around with different types of distributions, one still has to collect the same kind of data that percentiles (discussed below) require in order to prove that the chosen distribution is accurate. And if one is going to do all the work to collect percentile data, then why not just use the percentile data?
But the punch line is this - characterizing a service's performance across many requests using averages will almost certainly produce misleading data. So please, just say no to averages.
2 Even small numbers of customers having a bad experience costs real money
O.k., o.k., so averages are wrong. But they are really easy to calculate, and as long as they're 'close enough' aren't we happy? This is actually something that has been studied and the answer is - no. To help frame this discussion consider the following:
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO 2007 - 100 ms delays caused a 1% drop in sales at Amazon.
Performance Related Changes and their User Impact 2009 - 0.5 second delays saw a loss of 1.2% of revenue/user. This went to 2.8% at 1 second and 4.3% at 2 seconds.
Why Web Performance Matters: Is Your Site Driving Customers Away? 2010 - Between 2-4 seconds 8% of users abandon the site, from 2-6 seconds it's 25% and by 10 seconds it's 38%.
With this data in mind let's go back to look at those averages. If the average is, say, a 50 ms delay, shouldn't everyone be happy? Well, not if, say, 20% of users are seeing 100 ms plus latencies. This wouldn't show up in the average, and the standard deviation is largely meaningless anyway since it's describing probability for the wrong curve.
In that case, for the 20% of users seeing 100 ms plus latencies, using the Amazon number, 20% * 1% = 0.2% of sales just walked out the door. That isn't a healthy way to run a business. In fact major service companies measure the experience of their users up to the 99.9th percentile (as will be explored below) because bad experiences for even small numbers of users have significant financial consequences.
Put another way, it's cheaper to create systems that have predictable performance out to three 9s than to lose the sales caused by bad performance at the tail end of the performance curve.
3 Percentiles
So typically the way to accurately represent system performance is with percentiles. The idea behind percentiles is pretty straightforward - what percentage of users had a particular experience?
Imagine, for example, that we ran a test 10 times and the latencies we got back were 1, 2, 1, 4, 50, 30, 1, 3, 2 & 1 ms. The first thing we would do is order the latencies from smallest to largest - 1, 1, 1, 1, 2, 2, 3, 4, 30, 50.
The median, or 50th percentile, is the latency that 50% of the requests experienced or did better than. In this case we would count half the results, 10/2 = 5, and take the 5th value in the sorted list, which is 2 ms. So 50% of the requests had a latency of 2 ms or better.
The 90th percentile is the latency that 90% of the requests experienced or did better than. In this case that's the 9th result (0.9 * 10 = 9), which is 30 ms.
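For the curious, here is a minimal sketch of that nearest-rank calculation in Python, run against the example numbers above (this is just one common way of computing percentiles; real tools may interpolate slightly differently):

```python
import math

# The latencies (in ms) from the example above.
latencies = [1, 2, 1, 4, 50, 30, 1, 3, 2, 1]

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of the samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]

print(percentile(latencies, 50))  # 2  -> 50% of requests took 2 ms or better
print(percentile(latencies, 90))  # 30 -> 90% of requests took 30 ms or better
```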
Typically the results are shown as a graph from the 1st percentile through the 90th in increments of 1%, followed by 99.9% or higher if appropriate. In most cases the results have to be shown on a logarithmic scale to be easily viewable.
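If you want to produce that kind of graph yourself, here is a rough sketch using matplotlib; the small `latencies` list is just a stand-in for the full set of per-request measurements from a run:

```python
import math
import matplotlib.pyplot as plt

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]

# Stand-in data; in practice this would be every per-request latency (ms)
# recorded during the test run.
latencies = [1, 2, 1, 4, 50, 30, 1, 3, 2, 1]

points = list(range(1, 91)) + [99.9]  # 1% through 90% in 1% steps, then 99.9%
values = [percentile(latencies, p) for p in points]

plt.plot(points, values, marker=".")
plt.yscale("log")                     # latency usually needs a log scale to be readable
plt.xlabel("percentile")
plt.ylabel("latency (ms)")
plt.title("Request latency by percentile")
plt.show()
```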
Throughput is measured in a similar way. The big difference is that for throughput one is measuring the number of requests completed over some window of time; 1 second is a pretty typical window. I usually just measure how many requests completed during each window.
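As a sketch, assuming we have the completion timestamp (in seconds) of each request, bucketing into 1-second windows looks something like this:

```python
from collections import Counter

# Completion times of requests, in seconds since the start of the test
# (made-up data; in practice these come from the test bench's log).
completion_times = [0.1, 0.4, 0.9, 1.2, 1.3, 1.8, 2.5, 3.1, 3.2, 3.3, 3.4]

# Count how many requests completed in each 1-second window.
window_seconds = 1
counts = Counter(int(t // window_seconds) for t in completion_times)

# One throughput sample per window; these can then get the same
# percentile treatment used for latency.
throughput = [counts[w] for w in range(max(counts) + 1)]
print(throughput)  # [3, 3, 1, 4] requests completed per second
```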
Is it ok to average percentiles? For example, I get a 90th percentile result from a performance test tool. I run that test 3 times. Being a tool, it hides the detail. Is there anything wrong with averaging the 90th percentiles from the 3 test runs, to avoid selection bias? To avoid picking the best or worst of the 3 in other words, to report something that represents all the tests?
I am not a mathematician or a statistician so I wouldn’t trust anything I say.
When I run a latency or throughput test multiple independent times I typically show all the results as a chart. It gives a ‘sense’ of what kind of numbers we are seeing. I don’t typically combine them together but it’s not clear that doing so gives a meaningful result.
Another way to use the multiple results is to show min and max values, e.g. 'this is the worst 90th percentile we saw and this is the best over X runs'.
But just averaging the results together is iffy, especially since it's not clear that the average represents anything useful. If the way your system behaves is such that the 90th percentile results don't follow a normal curve, then the average isn't telling you anything.
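To make the min/max idea concrete, here's a rough sketch with made-up per-run latencies; the point is to report the spread of 90th percentiles rather than collapsing them into one average:

```python
import math

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]

# Made-up raw latencies (ms) from three independent runs of the same test.
runs = [
    [1, 2, 1, 4, 50, 30, 1, 3, 2, 1],
    [2, 2, 1, 5, 40, 25, 1, 2, 3, 1],
    [1, 3, 2, 4, 90, 35, 2, 3, 2, 1],
]

p90s = [percentile(run, 90) for run in runs]
# Report the spread instead of an average so the worst run stays visible.
print(f"90th percentile across {len(runs)} runs: "
      f"best {min(p90s)} ms, worst {max(p90s)} ms")
```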
Well that’s kind of the thing. The test tool we use hides the detail because there is just so much of it. I’m testing a web app, so I’m measuring hundreds of individual page loads, dozens or hundreds of times per test. Then I run that whole test multiple times, and those are the 90th percentiles I want to average — for each page. It’s a different kind of performance test from service testing… or at least there’s a lot more results info to wrangle afterward…
Having slept on it, and when you suggest min and max 90th percentiles, it makes me think averaging 90th percentiles should be fine as long as they’re close anyway between test runs, and I could provide min, max, and std. dev. too just to round out the picture. Because that’s the problem with averages anyway right? They can be misleading because of the detail they hide…
Understand what you’re saying about averages being meaningless if the results don’t fall on a bell curve. I don’t think we have that issue, but won’t know for sure until we get 90th percentile measurements implemented…
Thanks for soundboarding this with me. :)
I would make sure to at least print out the min and max in addition to the averages (e.g. a standard range error chart which shows the average in the middle and the min and max as lines) to help people understand the worst case scenarios.
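Something like the following matplotlib sketch, with made-up per-page numbers, is the kind of range chart I mean:

```python
import matplotlib.pyplot as plt

# Made-up 90th percentile latencies (ms) for a few pages: min/avg/max
# across several test runs.
pages = ["home", "search", "checkout"]
p90_min = [120, 340, 610]
p90_avg = [150, 420, 700]
p90_max = [210, 580, 950]

# Range chart: the average in the middle, whiskers down to the best run
# and up to the worst run, so the worst case stays visible.
lower = [a - lo for a, lo in zip(p90_avg, p90_min)]
upper = [hi - a for a, hi in zip(p90_avg, p90_max)]
plt.errorbar(range(len(pages)), p90_avg, yerr=[lower, upper], fmt="o", capsize=5)
plt.xticks(range(len(pages)), pages)
plt.ylabel("90th percentile latency (ms)")
plt.show()
```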
How would the typical graph look? I mean, do you have any sample graph for the given latency measurements?
Sorry for taking literally months to post your comment but I kept thinking I would find a graph and I never did and then I just felt guilty. I suck.