Benchmarking is surprisingly difficult. Getting accurate results requires knowing your entire environment well enough to rule out excessive sources of noise, and enough statistics to interpret the results. JMH prints this warning with the output of your benchmark:

The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.

Note that "benchmarking" is sometimes used to mean only the benchmarking of entire systems for comparison. As I use the term, it refers to any structured attempt to understand the performance of some operation, whether that means exercising an entire OS or application, or running a microbenchmark that measures a specific operation in code.

Active Benchmarking

To perform active benchmarking:

  1. If possible, configure the benchmark to run for a long duration in a steady state: eg, hours.
  2. While the benchmark is running, analyze the performance of all components involved using other tools, to identify the true limiter of the benchmark.
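
As a rough illustration of step 2, the sketch below runs a workload for a fixed duration and compares the CPU time the process consumed with the wall-clock time that elapsed. It is only an in-process stand-in for the idea: real active benchmarking would use external system tools to look at every component involved. The function names (`run_with_cpu_accounting`, `cpu_bound`, `sleep_bound`) and the short durations are assumptions made for the example.

```python
import time


def run_with_cpu_accounting(workload, duration_s=1.0):
    """Call workload() repeatedly for about duration_s seconds.

    Returns the operation count together with wall-clock time, process CPU
    time, and their ratio. A ratio near 1.0 suggests the benchmark is limited
    by CPU in this process; a ratio near 0 suggests the limiter is elsewhere
    (sleeping, I/O, waiting on another process, ...).
    """
    ops = 0
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    while time.perf_counter() - wall_start < duration_s:
        workload()
        ops += 1
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return {"ops": ops, "wall_s": wall, "cpu_s": cpu, "cpu_util": cpu / wall}


def cpu_bound():
    # Hypothetical workload that keeps the CPU busy.
    sum(i * i for i in range(10_000))


def sleep_bound():
    # Hypothetical workload that mostly waits, standing in for I/O.
    time.sleep(0.01)


if __name__ == "__main__":
    for name, fn in [("cpu_bound", cpu_bound), ("sleep_bound", sleep_bound)]:
        r = run_with_cpu_accounting(fn, duration_s=0.5)
        print(f"{name:12s} ops={r['ops']:6d} cpu_util={r['cpu_util']:.2f}")
```

If the two workloads reported similar operations per second, the utilisation figure would still show that they are limited by different things, which is the kind of question active benchmarking is meant to answer.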

Evaluating the Evaluation: A Benchmarking Checklist

Six questions from Brendan Gregg that can indicate a flawed benchmark:

  1. Why not double?
  2. Did it break limits?
  3. Did it error?
  4. Does it reproduce?
  5. Does it matter?
  6. Did it even happen?

Benchmarking: You're Doing it Wrong

Strange Loop talk giving an overview of the challenges of accurate benchmarking.

Producing Wrong Data Without Doing Anything Obviously Wrong!

Shows that seemingly innocuous changes to the experimental setup, such as

  1. Including additional environment variables
  2. Changing linking order

may produce variations in experimental results large enough to indicate that -O2 is better than -O3 or vice versa.
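
One way to check whether your own measurements are sensitive to the first factor is to rerun the same command with the environment padded to different sizes. The sketch below is a minimal harness for that, not the paper's methodology: the variable name `BENCH_PADDING`, the padding sizes, and the placeholder command are all assumptions, and the timings include process startup. With the Python child process used here the effect may well be invisible; substitute the binary you actually care about.

```python
import os
import subprocess
import sys
import time

PAD_VAR = "BENCH_PADDING"  # hypothetical variable name, used only as padding


def padded_env(pad_bytes, base=None):
    """Return a copy of the environment with pad_bytes of extra data added."""
    env = dict(os.environ if base is None else base)
    env[PAD_VAR] = "x" * pad_bytes
    return env


def time_command(cmd, env, repeats=5):
    """Run cmd under env several times and return the wall-clock times."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


if __name__ == "__main__":
    # Placeholder command; replace with the program under test.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(200_000))"]
    for pad in (0, 64, 512, 4096):
        times = sorted(time_command(cmd, padded_env(pad)))
        print(f"pad={pad:5d} bytes  median={times[len(times) // 2]:.4f}s")
```

If the median shifts with the padding size by more than the difference you are trying to measure, the comparison cannot be trusted without randomising or controlling the setup.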

Minimum Times Tend to Mislead When Benchmarking

A detailed explanation of why you cannot just look at the minimum times a benchmark gives. The article does not suggest a precise alternative, but mentions looking at histograms and confidence intervals as two approaches that would be vastly better than using the minimum. This post also contains a minimal example of active benchmarking: by measuring the number of lookups in a program, we find a source of non-determinism that correlates well with the execution time.
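
To make the contrast concrete, here is a small sketch that summarises the same set of timings three ways: the minimum, a bootstrap confidence interval for the mean, and a text histogram. The choice of a percentile bootstrap, the bin count, and the synthetic bimodal data are assumptions for illustration; the article itself does not prescribe a specific method.

```python
import random
import statistics


def bootstrap_ci(samples, stat=statistics.mean, n_resamples=2000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(samples)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(samples, k=n))
                       for _ in range(n_resamples))
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


def text_histogram(samples, bins=8, width=40):
    """Render a fixed-width text histogram, one line per bin."""
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0  # avoid dividing by zero if all samples are equal
    counts = [0] * bins
    for s in samples:
        idx = min(int((s - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        left_edge = lo + span * i / bins
        lines.append(f"{left_edge:8.3f} | {'#' * round(c / peak * width)} {c}")
    return "\n".join(lines)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic timings in seconds: a fast mode and a slower, rarer mode.
    samples = ([rng.gauss(1.00, 0.02) for _ in range(70)]
               + [rng.gauss(1.30, 0.03) for _ in range(30)])
    ci_lo, ci_hi = bootstrap_ci(samples)
    print(f"min  = {min(samples):.3f}")
    print(f"mean = {statistics.mean(samples):.3f}  95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
    print(text_histogram(samples))
```

With data like this the minimum describes only the best run from the fast mode, while the histogram shows that a sizeable share of runs land in a second, slower mode that the minimum hides entirely.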

Dan Luu suggests that the resurgence of people advocating for taking the minimum time when benchmarking might stem from people reading Robust Benchmarking in Noisy Environments.

Systems Benchmarking Crimes

A long list of weaknesses found in systems benchmarks.

How Not to Measure Latency

Strange Loop talk.

JMH vs. Caliper: Reference Thread

A comparison of Caliper and JMH, two JVM benchmark harnesses. Some of the information is JVM-specific, but the thread is very informative about the design of microbenchmark harnesses in general.

IMO, Caliper is not as bad for large benchmarks. In fact, Caliper feels just like pre-JMH harnesses we had internally in Sun/BEA. And that is not a coincidence, because Caliper benchmark interface is very intuitive and an obvious one. The sad revelation that came upon me over previous several years is that the simplicity of benchmark APIs does not correlate with benchmark reliability.

Systems Performance: Enterprise and the Cloud

This book is not primarily about benchmarking, but contains a good chapter on the subject.