Benchmarking
Benchmarking is surprisingly difficult. Full accuracy requires knowledge of your entire environment (to ensure there aren't excessive sources of noise), of statistics (to interpret the results), and of exactly how your code executes. JMH includes a warning in the output of your benchmark:
The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.
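For a sense of what a JMH microbenchmark looks like in practice, here's a minimal sketch; the class name (SumBenchmark) and the array-summing workload are purely illustrative, while the annotations and the Blackhole are standard JMH.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

// Minimal JMH benchmark: forks, warmup, and measurement are declared
// explicitly, and the result is sunk into a Blackhole so the JIT can't
// eliminate the work being measured.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Thread)
public class SumBenchmark {
    int[] data;

    @Setup
    public void setup() {
        data = new int[1024];
        for (int i = 0; i < data.length; i++) data[i] = i;
    }

    @Benchmark
    public void sum(Blackhole bh) {
        long total = 0;
        for (int x : data) total += x;
        bh.consume(total);
    }
}
```

Even with a harness handling warmup and forking for you, the warning above still applies: the harness doesn't interpret the numbers for you.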
Note that "benchmarking" is sometimes used narrowly, to mean only benchmarking entire systems for comparison. As I use the term, it refers to any structured attempt to understand the performance of some operation, whether that means exercising an entire OS/application or running a microbenchmark that measures a specific operation in code.
Systems Performance: Enterprise and the Cloud
This book is not primarily about benchmarking, but contains a good chapter on the subject. Of course, it's also a tremendous resource for understanding performance in general.
Benchmarks That Aren't Your Friends
There are roughly three roles in which someone might find themselves authoring a benchmark:
- They maintain a system and want to be able to track its performance over time, or to be able to state with some objectivity what its “performance” is,
- they are in the market for a system and want to compare several different possible choices, or,
- they are a member of a committee and are tasked with producing a standard benchmark that vendors can use and advertise.
For the first two sorts of benchmarks, you only have to worry about inadvertently tricking yourself. For the third, you have to worry about the people using the benchmark trying to cheat.
This taxonomy doesn't mention microbenchmarks used for understanding the behavior of code, but they're much like the first type of benchmark.
Active Benchmarking
To perform active benchmarking:
- If possible, configure the benchmark to run for a long duration in a steady state: e.g., hours.
- While the benchmark is running, analyze the performance of all components involved using other tools, to identify the true limiter of the benchmark.
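As a rough sketch of what that looks like in code (not a real tool; doOperation is a placeholder for whatever you're measuring), a harness can run the workload at steady state for a long, fixed duration and report throughput periodically, so you can attach profilers and system tools while it runs:

```java
import java.util.concurrent.TimeUnit;

// Illustrative steady-state harness: run the workload for a long fixed
// duration and report throughput every few seconds. While it runs, observe
// CPU, memory, disk, and network with other tools to find the real limiter.
public class SteadyStateHarness {
    public static void main(String[] args) {
        long runNanos = TimeUnit.HOURS.toNanos(1);       // long duration, per the advice above
        long reportNanos = TimeUnit.SECONDS.toNanos(10);
        long start = System.nanoTime();
        long lastReport = start;
        long ops = 0;

        while (System.nanoTime() - start < runNanos) {
            doOperation();                                // the workload under test (placeholder)
            ops++;
            long now = System.nanoTime();
            if (now - lastReport >= reportNanos) {
                System.out.printf("%.1f ops/s%n", ops * 1e9 / (now - lastReport));
                ops = 0;
                lastReport = now;
            }
        }
    }

    static void doOperation() { /* placeholder for the operation being benchmarked */ }
}
```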
Evaluating the Evaluation: A Benchmarking Checklist
Six questions from Brendan Gregg that can indicate a flawed benchmark:
- Why not double?
- Did it break limits?
- Did it error?
- Does it reproduce?
- Does it matter?
- Did it even happen?
Benchmarking: You're Doing it Wrong
Strangeloop talk, giving an overview of the challenges of accurate benchmarking.
- accuracy requires understanding as many levels as possible, from code to hardware
- the impact of caches
- the impact of warmup
- noisy environments, frequency scaling
- use representative hardware specs
- don't assume your sample is normal, or unimodal
Producing Wrong Data Without Doing Anything Obviously Wrong!
Shows that seemingly innocuous changes, such as
- including additional environment variables, or
- changing link order,
may produce variations in experimental results large enough to indicate that -O2 is better than -O3 or vice-versa.
Minimum Times Tend to Mislead When Benchmarking
A detailed explanation of why you cannot just look at the minimum times a benchmark gives. The article does not suggest a precise alternative, but mentions looking at histograms and confidence intervals as two approaches that would be vastly better than using the minimum. This post also contains a minimal example of active benchmarking: by measuring the number of lookups in a program, we find a source of non-determinism that correlates well with the execution time.
Dan Luu suggests that the resurgence of people advocating for taking the minimum time when benchmarking might be based on people reading Robust Benchmarking in Noisy Environments.
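As a rough illustration of the histogram and confidence-interval alternatives mentioned above, the sketch below summarizes repeated run times with a 95% confidence interval for the mean (using a normal approximation, which is only a crude guide for small or skewed samples) and a coarse text histogram; the run times are made up, and bimodal on purpose.

```java
import java.util.Arrays;

// Summarize repeated benchmark runs with a confidence interval and a
// histogram rather than reporting only the minimum. Requires Java 11+
// for String.repeat.
public class SummarizeRuns {
    public static void main(String[] args) {
        double[] runsMs = {12.1, 12.4, 12.2, 15.9, 12.3, 12.5, 16.1, 12.2}; // made-up, bimodal

        double mean = Arrays.stream(runsMs).average().orElse(Double.NaN);
        double variance = Arrays.stream(runsMs)
                .map(x -> (x - mean) * (x - mean))
                .sum() / (runsMs.length - 1);
        double halfWidth = 1.96 * Math.sqrt(variance / runsMs.length);   // normal approximation
        System.out.printf("mean = %.2f ms, 95%% CI ~ [%.2f, %.2f], min = %.2f ms%n",
                mean, mean - halfWidth, mean + halfWidth,
                Arrays.stream(runsMs).min().orElse(Double.NaN));

        // Coarse histogram, one bucket per millisecond, to make multimodality visible.
        double lo = Math.floor(Arrays.stream(runsMs).min().orElse(0));
        int[] buckets = new int[8];
        for (double r : runsMs) {
            buckets[Math.min(buckets.length - 1, (int) (r - lo))]++;
        }
        for (int i = 0; i < buckets.length; i++) {
            System.out.printf("%5.1f ms | %s%n", lo + i, "#".repeat(buckets[i]));
        }
    }
}
```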
Systems Benchmarking Crimes
A long list of weaknesses found in systems benchmarks.
How Not to Measure Latency
Strangeloop talk by Gil Tene.
- Server performance analysis should almost never measure just the average, or even the 99th percentile, latency: these measurements hide the worst case. You should know the max latency. The same author also wrote in blog form about why graphing percentiles is misleading.
- Typical users will frequently hit 99th-percentile responses, because there are many HTTP requests per user interaction. If their page load waits for all of those requests, almost every interaction will be worse than the median response time.
- Coordinated omission: if your benchmarking harness waits for one response before issuing the next request, you'll send few requests while the server is performing badly and many while it's performing well, making the results look good when they are terrible (a sketch of an open-loop generator that avoids this appears after this list).
- Under sufficient load, your response time will eventually go to infinity. If your load generator can't provoke that behavior, it's exhibiting coordinated omission.
- What goals should you evaluate when benchmarking? While you can learn by pushing systems beyond their breaking point, you usually want to test at much lower loads--where you'll (hopefully) be running in prod. How exactly the system fails when overloaded is usually less important; setting latency requirements and seeing under what conditions the system can meet them typically matters more.
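To make the coordinated-omission point concrete, here's a sketch (the request rate, request count, and sendRequest are placeholders) of a load generator that issues requests on a fixed schedule and measures each latency from its intended start time, so time spent stuck behind a slow response is charged to the system instead of silently dropped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Open-loop load generator sketch. Even though this single thread can fall
// behind schedule when the server is slow, measuring from the *intended*
// start time means that backlog shows up as latency, avoiding coordinated
// omission. A closed-loop harness would instead wait for each response
// before deciding when to send the next request.
public class OpenLoopGenerator {
    public static void main(String[] args) {
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(10); // target: 100 requests/s
        int requests = 10_000;
        List<Long> latencies = new ArrayList<>();

        long next = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            long intendedStart = next;
            sendRequest();                                      // placeholder: one request/response
            latencies.add(System.nanoTime() - intendedStart);   // includes any backlog delay
            next += intervalNanos;
            while (System.nanoTime() < next) { /* wait for the next scheduled send */ }
        }

        latencies.sort(null);
        System.out.printf("p99 = %d us, max = %d us%n",
                TimeUnit.NANOSECONDS.toMicros(latencies.get((int) (latencies.size() * 0.99))),
                TimeUnit.NANOSECONDS.toMicros(latencies.get(latencies.size() - 1)));
    }

    static void sendRequest() { /* placeholder for a request to the system under test */ }
}
```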
Open Versus Closed: A Cautionary Tale
Introduces the concepts of open and closed systems that underlie coordinated omission, and notes that real-world systems are often partially open (note: in this sense, a system is not a backend such as a DB or web server, but the combination of a backend system and a source of load). In a closed system, new requests only start after previous ones finish (which can cause coordinated omission). In an open system, new requests begin at a fixed rate. In a partially open system, users arrive at a fixed rate, but each user makes a series of requests, with later requests only beginning when earlier requests complete (e.g. a user logging in, then navigating a web UI); a sketch of this model follows at the end of this entry.
Also observes an interesting phenomenon: scheduling policies that reduce response time in open systems may have very limited effect on closed systems. Improperly using a closed benchmark doesn't just make the system under test look better, it benefits certain types of systems more than others (this is true of everything that introduces noise or otherwise distorts benchmarks, but it's a nice example).
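To make the partially open model concrete, here's a minimal sketch; the arrival rate, session length, and sendRequest are illustrative placeholders. Sessions start on a fixed schedule regardless of how earlier sessions are doing (open), while requests within a session are sequential (closed per user):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Partially open load model sketch: user sessions arrive at a fixed rate,
// but each session issues its requests one after another.
public class PartiallyOpenLoad {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService arrivalClock = Executors.newSingleThreadScheduledExecutor();
        ExecutorService sessions = Executors.newCachedThreadPool();

        // A new user session starts every 50 ms, no matter how earlier sessions
        // are progressing; each session runs concurrently in its own task.
        arrivalClock.scheduleAtFixedRate(
                () -> sessions.submit(PartiallyOpenLoad::runSession),
                0, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(TimeUnit.MINUTES.toMillis(1));      // run the experiment for a while
        arrivalClock.shutdownNow();
        sessions.shutdown();
    }

    static void runSession() {
        for (int i = 0; i < 5; i++) {   // e.g. log in, then navigate a few pages
            sendRequest();              // blocks until the response arrives
        }
    }

    static void sendRequest() { /* placeholder for one request/response */ }
}
```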
JMH vs. Caliper: Reference Thread
A comparison of Caliper and JMH, two JVM benchmark harnesses. Some JVM-specific information, but very informative about the design of microbenchmark harnesses.
IMO, Caliper is not as bad for large benchmarks. In fact, Caliper feels just like pre-JMH harnesses we had internally in Sun/BEA. And that is not a coincidence, because Caliper benchmark interface is very intuitive and an obvious one. The sad revelation that came upon me over previous several years is that the simplicity of benchmark APIs does not correlate with benchmark reliability.
Rigorous Benchmarking in Reasonable Time
We explain the shortcomings of statistical methods commonly used in our field. Instead, we offer a sound method based on effect sizes and confidence intervals.
- We provide an observational study of non-determinism in the DaCapo and SPEC CPU benchmarks.
- We offer a sound experimental methodology that makes best use of experiment time. We establish, both formally and in practice, the optimum number of repetitions of experiments to achieve the most precise results for a given experiment time.
- We compare our methodology with heuristics-based practice, and show that the latter often leads to either too few or too many experiments.
- We revisit the question of the effect of code layout on the performance of DaCapo and SPEC CPU and show that it is less important than prior work had shown.
I haven't fully digested this paper. Julia Evans wrote some nice notes on it: Tiny Notes on Rigorous Benchmarking in Reasonable Time.
Virtual Machine Warmup Blows Hot and Cold
A study of the volatility of performance measurements for virtual machines.
The fundamental aim of this paper is to test, in a highly idealised setting, the following hypothesis:
H1 Small, deterministic programs reach a steady state of peak performance.
.... Instead we encountered many benchmarks which slowdown, never hit a steady state, or have inconsistent performance from one run to the next. Only 30.0–43.5% (depending on the machine and OS combination) of ⟨VM, benchmark⟩ pairs consistently reach a steady state of peak performance. Of the seven VMs we studied, none consistently reached a steady state of peak performance. These results are much worse than reported in previous works. Our results suggest that much real-world VM benchmarking, which nearly all relies on assuming that benchmarks do reach a steady state of peak performance, is likely to be partly or wholly misleading. Since microbenchmarks similar to those in this paper are often used in isolation to gauge the efficacy of VM optimisations, it is also likely that ineffective, or deleterious, optimisations may have been incorrectly judged as improving performance and included in VMs.
I need to reread and annotate this paper. It also describes how the authors designed krun.
xkcd Statistics
Statistics tip: always try to get data that's good enough that you don't need to do statistics on it.
While much of this material emphasizes rigor, precise measurement, and the scientific method, a significant amount of real-world performance work is about harvesting the low-hanging fruit. Thomas Dullien once commented that he was shocked by how much real-world performance work consists of avoiding allocation in GC'd languages, rather than lower-level optimization. I had to make him sad by pointing out that avoiding N+1 database queries is a bigger source of gains in the wild.
Performance Fundamentals
Brendan Gregg has a good observation:
Understanding benchmarks is particularly difficult for the beginner, who has no instinct for whether numbers are suspicious or not. If you bought a thermometer that showed the temperature of the room you’re in as 1,000 degrees Fahrenheit (or Celsius), you’d immediately know that something was amiss. The same isn’t true of benchmarks, which produce numbers that are probably unfamiliar to you (Systems Performance).
In addition to Brendan's book, some useful resources for developing this intuition:
- Performance Speed Limits
- Performance Numbers Every Programmer Should Know (2012)
- Napkin Math Numbers
Be careful to understand how these numbers may change. The latency and throughput of storage, main memory, or your network stack can change significantly from one hardware generation to the next, though wide-area network latency can only improve so much (it is bounded below by the speed of light).
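As a trivial example of the napkin-math style, the sketch below estimates how long a 1 GiB scan takes at a few assumed bandwidths; the figures are placeholders, to be replaced with numbers measured on (or quoted for) your own hardware.

```java
// Napkin math: order-of-magnitude estimates for scanning 1 GiB. The
// bandwidth figures are assumptions for illustration, not measurements.
public class NapkinMath {
    public static void main(String[] args) {
        long bytes = 1L << 30;                        // 1 GiB
        String[] medium = {"main memory", "SSD", "network"};
        double[] assumedGBperSec = {10.0, 0.5, 0.1};  // placeholder bandwidths

        for (int i = 0; i < medium.length; i++) {
            double seconds = (bytes / 1e9) / assumedGBperSec[i];
            System.out.printf("scan 1 GiB from %s: ~%.2f s%n", medium[i], seconds);
        }
    }
}
```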
Understanding Software Dynamics
The first several chapters show experimental methods for measuring these performance fundamentals. Regrettably, I got diverted and have to come back and finish this book.
Tooling
I don't know much about configuring an environment for reliable benchmarking. These tools look interesting, but I can't comment beyond that.
krun
A tool for benchmarking reliably with extreme care to minimize noise.
ReBench: Execute and Document Benchmarks Reproducibly
Tool for running benchmarks and collecting and packaging data. Contains denoise, for adjusting Linux system settings to remove sources of noise. The readme mentions that krun may be more capable than denoise.
Heap
The Art of Computer Systems Performance Analysis
Unread. Looks dry, but contains lots of material on benchmarking, especially statistical models, including a discussion of factorial and partial factorial designs.