Benchmarking is surprisingly difficult. Accurate results require knowing your entire environment well enough to rule out excessive sources of noise, and enough statistics to interpret the results. JMH includes a warning in the output of your benchmark:
The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell.
Note that "benchmarking" is sometimes used to mean only the comparison of entire systems. As I use the term, it refers to any structured attempt to understand the performance of some operation, whether that means exercising an entire OS or application, or running a microbenchmark that measures one specific operation in code.
To perform active benchmarking:
- If possible, configure the benchmark to run for a long duration in a steady state: e.g., hours.
- While the benchmark is running, analyze the performance of all components involved using other tools, to identify the true limiter of the benchmark.
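A minimal sketch of the second point, in Python, under simplifying assumptions: the "other tool" here is only a background thread that samples the process's own CPU time, and `run_with_cpu_sampling` is a name made up for illustration. Real active benchmarking would watch the whole system with dedicated tools, but the idea is the same: if CPU utilization stays low while the benchmark runs, the CPU is probably not the limiter.

```python
import threading
import time


def run_with_cpu_sampling(workload, interval_s=0.05):
    """Run workload() while a background thread samples process CPU time.

    Hypothetical helper: it only observes this process, not the whole system.
    """
    samples = []  # (wall_clock_seconds, process_cpu_seconds) pairs
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append((time.perf_counter(), time.process_time()))
            stop.wait(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    start_wall, start_cpu = time.perf_counter(), time.process_time()
    thread.start()
    try:
        workload()
    finally:
        stop.set()
        thread.join()
    wall = time.perf_counter() - start_wall
    cpu = time.process_time() - start_cpu
    return {
        "wall_s": wall,
        "cpu_s": cpu,
        # Near 1.0: CPU-bound on one core. Near 0.0: waiting on something else.
        "cpu_utilization": cpu / wall if wall > 0 else 0.0,
        "samples": samples,
    }


if __name__ == "__main__":
    # A "benchmark" that mostly waits: its limiter is not the CPU.
    report = run_with_cpu_sampling(lambda: time.sleep(0.2))
    print(f"wall={report['wall_s']:.3f}s cpu={report['cpu_s']:.3f}s "
          f"utilization={report['cpu_utilization']:.2f}")
```

A benchmark that reports low utilization here is being limited by something other than computation (I/O, locks, sleeps, a remote service), which is worth knowing before drawing conclusions from its numbers.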
Six questions from Brendan Gregg that can indicate a flawed benchmark:
- Why not double?
- Did it break limits?
- Did it error?
- Does it reproduce?
- Does it matter?
- Did it even happen?
A Strange Loop talk giving an overview of the challenges of accurate benchmarking:
- accuracy requires understanding as many levels as possible, from code to hardware
- the impact of caches
- the impact of warmup
- noisy environments, frequency scaling
- use representative hardware specs
- don't assume your sample is normal, or unimodal
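As a rough illustration of the last point, this Python sketch summarizes a sample with a histogram alongside the mean and median. The timings are made up (two clusters, as might appear if some runs hit a cache and others miss it), and `summarize` is a hypothetical helper, not part of any harness.

```python
import statistics


def summarize(samples, bins=10):
    """Return mean, median, stdev and a fixed-width histogram of the samples."""
    ordered = sorted(samples)
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all samples are equal
    counts = [0] * bins
    for x in ordered:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
        "histogram": counts,
    }


# Invented data: 50 fast runs near 10 ms and 50 slow runs near 20 ms.
timings_ms = [10.0 + 0.01 * i for i in range(50)] + [20.0 + 0.01 * i for i in range(50)]

if __name__ == "__main__":
    summary = summarize(timings_ms)
    print(f"mean={summary['mean']:.2f} median={summary['median']:.2f} "
          f"stdev={summary['stdev']:.2f}")
    for i, count in enumerate(summary["histogram"]):
        print(f"bin {i}: {'#' * count}")
```

For this sample, both the mean and the median land in the gap between the two clusters, where no measurement actually lies. The histogram makes the two modes obvious; the summary statistics alone do not.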
Shows that seemingly innocuous changes to the experimental setup, such as
- Including additional environment variables
- Changing linking order
may change measured results enough to make -O2 look better than -O3, or vice versa.
A detailed explanation of why you cannot just look at the minimum times a benchmark gives. The article does not suggest a precise alternative, but mentions looking at histograms and confidence intervals as two approaches that would be vastly better than using the minimum. This post also contains a minimal example of active benchmarking: by measuring the number of lookups in a program, we find a source of non-determinism that correlates well with the execution time.
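The article does not prescribe a method, so the following is only one possible way to get a confidence interval: a percentile bootstrap of the median, sketched in Python. The timings are invented, and `bootstrap_median_ci` is a hypothetical helper rather than anything taken from the post.

```python
import random
import statistics


def bootstrap_median_ci(samples, confidence=0.95, resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the median.

    Resamples with replacement, takes the median of each resample, and
    returns the (lower, upper) percentiles of those medians.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(samples)
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = medians[int(tail * resamples)]
    upper = medians[min(int((1.0 - tail) * resamples), resamples - 1)]
    return lower, upper


# Invented timings in milliseconds, with one unusually fast run (9.5).
timings_ms = [12.1, 12.3, 11.9, 12.0, 15.8, 12.2, 16.1, 12.4, 9.5, 12.2, 12.6, 12.0]

if __name__ == "__main__":
    low, high = bootstrap_median_ci(timings_ms)
    print(f"min    = {min(timings_ms)}")
    print(f"median = {statistics.median(timings_ms)}")
    print(f"95% CI for the median = [{low}, {high}]")
```

The single fast run defines the minimum but barely affects the interval, which is the kind of difference the article is pointing at. A bootstrap over a dozen samples is itself crude; more runs give a more trustworthy interval.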
A long list of weaknesses found in systems benchmarks.
- Server performance analysis should almost never measure just the average, or even the 99th percentile latency--these measurements hide the worst case. You should know the max latency. The same author wrote about why graphing percentiles is misleading in blog form.
- Typical users will frequently hit 99th percentile responses, because there are many HTTP requests per user interaction. Almost every interaction is worse than the median. If their page load waits for all those requests, their experience will always be worse than the median response time.
- Coordinated omission: if your benchmarking harness waits for one response before issuing another, you'll send few requests to the server while it's performing badly, and many while it's performing well, making the results look good when they are terrible.
- Under sufficient load, your response time will eventually go to infinity. If your load generator can't provoke that behavior, it's exhibiting coordinated omission.
- What goals to evaluate when benchmarking? While you can learn by pushing systems beyond their breaking point, you usually want to test at much lower loads--where you'll (hopefully) be running in prod. How exactly the system fails when overloaded is usually less important. Setting latency requirements and seeing under what conditions the system can meet them is typically more important.
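The coordinated omission point above can be illustrated with a small simulation in virtual time, sketched in Python. Everything here is an assumption made for illustration: a single server that answers in 1 ms but freezes for one second, a harness that intends to send one request every 10 ms but waits for each response before sending the next, and the names `simulate` and `percentile`.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile, with p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def simulate(n_requests=500, interval=0.010, service=0.001,
             stall_start=2.0, stall_len=1.0):
    """Simulate a closed-loop harness against a server that stalls once.

    Returns two latency lists (in seconds):
      naive     - measured from when the request was actually sent
      corrected - measured from when it was supposed to be sent
    """
    naive, corrected = [], []
    free_at = 0.0  # virtual time at which the previous response arrived
    for i in range(n_requests):
        intended = i * interval
        sent = max(intended, free_at)  # closed loop: wait for the last response
        begin = sent
        if stall_start <= begin < stall_start + stall_len:
            begin = stall_start + stall_len  # server is frozen until the stall ends
        done = begin + service
        naive.append(done - sent)
        corrected.append(done - intended)
        free_at = done
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate()
    for name, latencies in (("naive", naive), ("corrected", corrected)):
        print(f"{name:9s} p99={percentile(latencies, 99):.3f}s "
              f"max={max(latencies):.3f}s")
```

With these made-up parameters, only one request is in flight during the stall, so the naive 99th percentile stays at about 1 ms, while measuring from the intended send time puts it at roughly 0.96 s. Both views agree on the max, which is one reason to look at it.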
A comparison of Caliper and JMH, two JVM benchmark harnesses. Some of the information is JVM-specific, but it is very informative about the design of microbenchmark harnesses.
IMO, Caliper is not as bad for large benchmarks. In fact, Caliper feels just like pre-JMH harnesses we had internally in Sun/BEA. And that is not a coincidence, because Caliper benchmark interface is very intuitive and an obvious one. The sad revelation that came upon me over previous several years is that the simplicity of benchmark APIs does not correlate with benchmark reliability.
This book is not primarily about benchmarking, but contains a good chapter on the subject.