Flat Profiles

These notes should be considered pretty tentative.

If I understand correctly, a flat profile is one where the code has no obvious hotspots. In such a profile, there's no way to get large performance improvements by looking at any particular function and optimizing it. Improvements could come from one of three sources:

  1. re-architecting the application
  2. many micro-optimizations spread throughout the program, targeting a wide range of different areas
  3. compiler/runtime improvements that raise performance in many spots throughout the program
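
One way to make the definition above concrete is Amdahl's law: if a function accounts for a fraction p of run time and is made s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). The sketch below is only illustrative. The two profiles and their function names are invented for the example, and the helper names are hypothetical, not taken from any profiler.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when a part taking `fraction`
    of run time is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def best_single_function_gain(profile):
    """Upper bound on the speedup from optimizing one function: pretend the
    hottest function could be made infinitely fast (its time drops to 0)."""
    total = sum(profile.values())
    hottest = max(profile.values()) / total
    if hottest >= 1.0:
        return float("inf")  # a single function is the whole program
    return 1.0 / (1.0 - hottest)


# Hypothetical profiles (function name -> share of samples); invented data.
hotspot_profile = {"parse": 70, "render": 10, "io": 10, "misc": 10}
flat_profile = {f"fn_{i}": 1 for i in range(100)}  # 100 functions, 1% each

print(best_single_function_gain(hotspot_profile))  # bound with a 70% hotspot
print(best_single_function_gain(flat_profile))     # bound when nothing exceeds 1%
print(overall_speedup(0.01, 2.0))                  # doubling the speed of a 1% function
```

In the invented flat profile, even deleting the hottest function outright caps the gain at about 1%, which is why the list above points at architecture, broad micro-optimization, or compiler/runtime work instead.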

The discussions I've seen have centered around two questions:

  1. are flat profiles common, or the outlier?
  2. what does that mean for performance work?

The Death of Optimizing Compilers

Flat profiles are dying. Already dead for most programs. Larger and larger fraction of code runs freezingly cold, while hot spots run hotter.

Slides from a Daniel Bernstein talk (which I don't think is available in full). Bernstein claims that software performance is dominated by very specific hotspots, and that optimizing compilers do not help with those hotspots.

Danny Bee on Flat Profiles

Oh great, that's nice, i guess i can stop worrying about the thousands of C++ applications google has built that display this property, and ignore the fact that in fact, the profiles have gotten more flat over time, not less flat.

Pack it up boys, time to go home!

Basically, [Bernstein] is just asserting i'm wrong, with little to no data presented, when i'm basing mine on the results of not only thousands of google programs (which i know with incredible accuracy), but the thousands of others at other companies that have found the same. I'm not aware of him poring over performance bugs for many many thousands of programs for the past 17 years. I can understand if he's done it for his open source programs (which are wonderful, BTW :P)

Response to Bernstein's slides.

AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers

ABSTRACT ....Second, based on a longitudinal analysis of AsmDB data from real-world online services, we present two detailed insights on the sources of front-end stalls: (1) cold code that is brought in along with hot code leads to significant cache fragmentation and a corresponding large number of instruction cache misses; (2) distant branches and calls that are not amenable to traditional cache locality or next-line prefetching strategies account for a large fraction of cache misses. Third, we prototype two optimizations that target these insights. For misses caused by fragmentation, we focus on memcmp, one of the hottest functions contributing to cache misses, and show how fine-grained layout optimizations lead to significant benefits. For misses at the targets of distant jumps, we propose new hardware support for software code prefetching and prototype a new feedback-directed compiler optimization that combines static program flow analysis with dynamic miss profiles to demonstrate significant benefits for several large warehouse-scale workloads. Improving upon prior work, our proposal avoids invasive hardware modifications by prefetching via software in an efficient and scalable way. Simulation results show that such an approach can eliminate up to 96% of instruction cache misses with negligible overheads.

Relevant to the claim that performance is dominated by a handful of hot loops that are best optimized by hand rather than by optimizing compilers.

Miscellany

These are a handful of comments about flat profiles in the wild and what they mean for the value of optimizing first vs. profiling. This feels like a subtly different discussion from the Bernstein/Danny Bee discussion above. In particular, several of these comments seem to suggest that flat profiles are a sign of architectural problems, whereas Bernstein and Danny Bee seem to be arguing about the behavior of basically well-designed, well-performing software.

Kasey Junk

Literally the thread I’m responding too is asking what the value of optimize then profile is. My experience is that you are likely hosed if you’ve gotten to this point in the 80% case. That isn’t to say these techniques have no value, I’ve used many of them myself. But the question was, “how often does it happen that your optimizations are for a flat profile”. For me, the answer is “most of the time”.

David Goldblatt

The longer I work on performance teams, the more I agree with the "performance always matters" point of view[1]. In large binaries (or at least, in the large binaries I've worked on), we don't see a few tight loops taking the bulk of the time. Instead, there's a death by a thousand cuts in which many small inefficiencies add up, and have to be clawed back slowly and painfully, often by people with less understanding of the semantics of the code in question than the original authors. Most performance work I see isn't stuff like "write the tight loop in assembly; save 30% on execution time", it's stuff like "reuse the locale object to avoid construction penalties; save 0.4%".

Features a response saying flat profiles suggest poor architecture.
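
The arithmetic behind "death by a thousand cuts" can be sketched as follows. The 0.4% figure comes from the quote above; the count of 50 fixes is an arbitrary assumption for illustration, as is the simplification that each fix saves the same fraction of whatever run time remains when it lands.

```python
def remaining_time(savings):
    """Fraction of the original run time left after applying each saving in
    turn. Each saving is a fraction of the run time at the moment it lands."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return remaining


# 0.4% per fix is from the quote; 50 fixes is an assumed, illustrative count.
fixes = [0.004] * 50
total_saved = 1.0 - remaining_time(fixes)
print(f"total saved: {total_saved:.1%}")
```

Under those assumptions the fixes together recover a bit over 18% of run time, even though no single one would stand out in a profile.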

mlthoughts

Says he doesn't see flat profiles, and attributes this to designing for performance and writing performance tests for specific modules.

Footnotes