G1 String Deduplication Doesn't Fully Deduplicate Strings

Starting with Java 8u20, there is a feature for the G1 garbage collector to support string deduplication (using the VM argument XX:+UseStringDeduplication) as a memory saving optimization. In later Java versions, the Shenendoah and ZGC collectors also support the feature.

Over time, I've noticed a lot of unclarity about the scope of this feature. This piece aims to make it clear what GC based string deduplication accomplishes, and why it might still sometimes be useful to manually duplicate strings.

Why Deduplicate Strings?

Strings are a large portion of the data used by many applications and it's common to have certain strings you which are repeated throughout an application. 1 These duplicated strings might be fragments of HTML, SQL statements, or strings for domain concepts like the USA.

If these duplicated strings all correspond to constants2, then they might be replacable with an enum, but that's not necessarily possible for user data. Suppose thousands of users are entering the variants "USA", "United States" or "United States of America". Perhaps we don't know the variants users actually use ahead of time. Or alternatively, we know what users might use, but have a requirement to accurately round-trip user data. In either case, a string is the only data structure we can use.

Regardless of why, when we have to store large quantities of data as strings, a significant portion of our memory may be wasted storing copies of the same string. If we can replace these copies with shared references to a canonical instance, we can save a significant amount of memory.

A Caution About String.intern

String interning is a feature of the JVM that has existed forever. Some readers probably assume that I'm going to recommend it. Calling String.intern returns a canonical instance of a string. Two entirely separate pieces of code within the same JVM calling "abc".intern() will get back references to the same string. On the face of it, this is exactly what we want.

Unfortunately, in older versions of the JVM, misuse of String.intern could crash the JVM. Starting in Java 6 and 7, these crashes were fixed. This left only subtler performance problems as reasons to avoid String.intern, and led some developers to think it was a suitable way to save memory. In fact, using String.intern risks major slowdowns and increased GC pauses. I'll defer explaining in favor of Aleksey Shipilëv's excellent article, but will quote his summary:

In almost every project we were taking care of, removing String.intern() from the hotpaths, or optionally replacing it with a handrolled deduplicator, was the very profitable performance optimization. Do not use String.intern() without thinking very hard about it, okay?

Memory Use of Strings

In the open JDK implementation, each Java String contains two primitive fields and a reference to an array. Here are a String's fields as defined in the JDK Source, minus comments, annotations and newlines:

public final class String {
    private final byte[] value;
    private final byte coder;
    private int hash;
}

In a <= 32GB heap, the empty string uses 24 bytes for the String excluding the array, and another 16 bytes for the array, or 40 bytes total.3 4 A 1024 character ascii string also uses 24 bytes for the String excluding the array, but uses 1040 bytes for its array. 5 For small strings, the majority of the memory use is the string excluding the array, for longer strings, the array dominates.

GC Based String Deduplication

At a very high level, the idea behind the GC's string deduplication is very similar to String#intern. Repeated instances of the same string are replaced with one canonical copy. However, aside from using a different implementation that avoids the performance problems associated with String.intern, there is a further difference.

While String.intern canonicalizes the entire string object, GC based deduplication only modifies the backing byte[] array for strings. After the GC deduplicates strings, two copies of the same string will only have one array instance. But the two String instances will remain separate objects. Since the String object itself is not deduplicated, repeated ~String~s will have a significant overhead.6

With String.intern, we decide when and where to deduplicate strings. The GC, on the other hand, can arrange so that it tries to deduplicate each individual string exactly, after it's survived a number of garbage collections, or is being moved to the old generation (hotspot source).

The choice to deduplicate as part of the GC is designed to avoid overhead from duplicating short-lived strings. JVM implementations make it extremely cheap to allocate memory in the young generation. If a string doesn't survive past that generation, there's no advantage to deduplication. The initial allocation is cheap, and garbage collection of objects that don't survive the young generation is also cheap. By deferring the decision to deduplicate a string until garbage collection happens, you save CPU time for short-lived strings, at the expense of temporarily increased memory use and more frequent young collections.

Manual String Deduplication

For manual string deduplication, the idea is that when the application code handles data, and allocates new strings, it keeps a "pool" of string instances it's seen, and canonicalizes string instances it encounters. Using a map as the pool, this means code that looks like:

String deduplicate(ConcurrentHashMap<String, String> pool, String s) {
    String found = pool.putIfAbsent(s, s);
    return found != null ? found : s;
}

This results in having two local variables pointing at the same object. The first copy of the string will use the full memory, while the second copy will only cost a pointer.

The pool itself can be implemented with various structures--a simple ConcurrentHashMap, an LRU cache, etc.7 There are a lot of degrees of freedom about regarding the length of time that you store the strings, and the scope of what you deduplicate. The more strings your deduplicator sees, the more memory savings it can provide. However, the longer the deduplicator lives, the greater the potential it can hold on to strings that would otherwise be dead.

The most aggressive sort of deduplication would use a global static string pool, that stores all strings for any portion of your application and deduplicates them all. The downsides are obvious: a global pool is effectively a sort of memory leak, and if multiple subsystems use it, you're repeatedly checking a string to see if it's a duplicate.

More likely, to manually deduplicate strings, you should pick some component that loads data that may contain duplicate strings, and deduplicate as that component operates.8

Experiments

It seems obvious enough that GC based deduplication will save some memory, but that explicit deduplication might sometimes save more. Let's test that out.

We'll use a simple design, accumulating strings in a list of lists, and measuring the number of strings we can store before throwing an OutOfMemoryError.

The core of the code is:

try {
    var strings = new ArrayList<List<String>>();
    for (var i = 0; i < 1000000; i++) {
        var newStrings = new ArrayList<String>();
        for (var j = 0; j < 1000; j++) {
            var s = copyString(bh, j);
            seen++;
            newStrings.add(s);
        }
    strings.add(newStrings);
    return strings;
}
catch (OutOfMemoryError e) {
    System.out.println("TotalSeen=" + seen + ", Time=" + new Date());
    return null;
}

The copyString method reads from a preinitialized list of strings, and creates a copy. If the JMH parameter dedup is true, we deduplicate OutOfMemorythe string using ConcurrentHashMap.

private String copyString(Blackhole bh, int j) {
    var b = preAllocatedStrings.get(j % preAllocatedStrings.size()).getBytes();
    bh.consume(b);
    var s = new String(b);
    if (dedup) {
        s = stringMap.computeIfAbsent(s, Function.identity());
    }
    return s;
}

Our variables are:

  1. Type of deduplication (none, G1, manual deduplication, G1 and manual deduplication)
  2. The number of distinct strings (1, 1024, 65536)
  3. The length of the strings (0, 16, 256, 2048)

We'll consider every permutation of our options. Code for the experiment can be found on GitHub.

Results

Before getting results, I wondered if it was worth running experiments. It was obvious that manual deduplication would save more memory than relying on GC based deduplication. Further, this effect would be most pronounced with smaller strings.

In fact, there were two surprises from the first run. Although neither one changes the overall conclusion, I found them interesting. Let's look at some selections from the results (full results):

manual_dedup g1_dedup string_count string_length total_seen
True True 1 0 25937667
True True 1024 0 25952237
True True 49152 0 25885824
True True 1 0 25954704
True True 1024 0 25934483
True True 49152 0 25893245
True False 1 0 25935941
True False 1024 0 25926346
True False 49152 0 25883550
True False 1 0 25929824
True False 1024 0 25929245
True False 49152 0 25883362
False False 1 0 4467367
False False 1024 0 4481169
False False 49152 0 4463225
False False 1 0 4470115
False False 1024 0 4467640
False False 49152 0 4468824
False True 1 0 4482684
False True 1024 0 4481550
False True 49152 0 4465321
False True 1 0 4470331
False True 1024 0 4483550
False True 49152 0 4468550

First of all, when using the empty string, manual deduplication makes a difference, but GC string deduplication makes no difference. It turns out that the String constructor special cases the empty string, and uses a canonical backing array, regardless of any other JVM parameters. Effectively, regardless of the JVM parameters, GC deduplication is automatically applied while the 0-length string is constructed.

The next table comes from preliminary results:

manual_dedup g1_dedup string_count string_length total_seen
False True 1 16 2490376
False False 1 16 2126530
False True 1 256 500079
False False 1 256 430665
False True 1 2048 70263
False False 1 2048 61823
False True 1024 16 2417633
False False 1024 16 2126110
False True 1024 256 487646
False False 1024 256 430428
False True 1024 2048 70795
False False 1024 2048 61037
False True 49152 16 2283492
False False 49152 16 2081110
False True 49152 256 389843
False False 49152 256 380974
False True 49152 2048 15742
False False 49152 2048 12752

The numbers from the first experiment don't seem right. There is a difference between using GC deduplication and no string deduplication, but it is relatively small. Moreover, the results don't seem internally consistent--with just one string, GC deduplication should give us almost identical results whether that string is 16 bytes or 2048 bytes (the extra 2032 bytes give us room to store less than 100 extra copies). But the results drop off precipitously, showing that deduplication isn't having the full effect it could.

The explanation is that GC deduplication only happens as part of the garbage collector's work. While processing data, the GC threads register strings, and a background thread processes them and deduplicates them.

Since our benchmark only allocates strings without generating other garbage, there isn't enough gc pressure to force deduplication. The solution is to create synthetic GC pressure, by allocating some data, and then throwing it away. Tests show that it was important to create short lived data to trigger young-gen collections. Calling System.gc() to trigger full collections had small effects.

So I added code to periodically trigger young generation churn:

accumulated += s.length();
if (accumulated >= gcFreq) {
    churnYoungGen(bh);
    accumulated = 0;
}

private void churnYoungGen(Blackhole bh) {
    for (int i = 0; i < 8192; i++) {
        byte[] bytes = new byte[1024];
        bh.consume(bytes);
    }
}

This worked--adding extra allocations made deduplication have a much larger effect.9 The fact that GC based string deduplication only works with another source of GC pressure is interesting, and could lead to missed memory savings. In practice, my instinct is that it would rarely matter.

Final Results

manual_dedup g1_dedup churn_gc string_count string_length gc_freq total_seen
True True True 1 0 65536 25934037
True True True 1 16 65536 25933724
True True True 1 256 65536 25931536
True True True 1 2048 65536 25928043
True True True 1024 0 65536 25931828
True True True 1024 16 65536 25909387
True True True 1024 256 65536 25864074
True True True 1024 2048 65536 25484952
True True True 49152 0 65536 25875071
True True True 49152 16 65536 25315824
True True True 49152 256 65536 22949606
True True True 49152 2048 65536 5305588
False True True 1 0 65536 4470129
False True True 1 16 65536 3835000
False True True 1 256 65536 4199465
False True True 1 2048 65536 4020224
False True True 1024 0 65536 4480948
False True True 1024 16 65536 3399859
False True True 1024 256 65536 4207463
False True True 1024 2048 65536 3950660
False True True 49152 0 65536 4466367
False True True 49152 16 65536 310269910
False True True 49152 256 65536 3692237
False True True 49152 2048 65536 522091
True False True 1 0 65536 25930263
True False True 1 16 65536 25939592
True False True 1 256 65536 25925000
True False True 1 2048 65536 25916034
True False True 1024 0 65536 25937824
True False True 1024 16 65536 25899531
True False True 1024 256 65536 25799765
True False True 1024 2048 65536 25062721
True False True 49152 0 65536 25877468
True False True 49152 16 65536 25296518
True False True 49152 256 65536 22885916
True False True 49152 2048 65536 4886967
False False True 1 0 65536 4469860
False False True 1 16 65536 2123819
False False True 1 256 65536 430343
False False True 1 2048 65536 61841
False False True 1024 0 65536 4478824
False False True 1024 16 65536 2123819
False False True 1024 256 65536 429311
False False True 1024 2048 65536 60818
False False True 49152 0 65536 4466814
False False True 49152 16 65536 2075749
False False True 49152 256 65536 381065
False False True 49152 2048 65536 12671

These results are in line with what we we expect. Manual deduplication lets us store many more copies of the same strings compared to using GC based string deduplication, which itself saves more than using no deduplication.

Conclusion

I've shown how gc based deduplication works, and that you can sometimes achieve larger memory savings with manual deduplication. That's all.

Should you use manual deduplication? Maybe, maybe not. How much benefit it can have depends on your application, and how it uses strings. Taking heap dumps and analyzing them for duplicated strings will show you the maximum possible savings.

There are also costs to manual duplication--manual deduplication requires having an idea about the type of strings your application uses, and their lifetime. The garbage collector design ensures that short-lived strings are not deduplicated, while each long-lived string is deduplicated exactly once. Finding a place to do that in your application may be difficult. Getting it wrong risks wasting cycles deduplicating strings that would've died young. In the worst case, the cache you use for deduplication can cause an unbounded memory leak. Though I didn't show any cases where manual deduplication made an application use more memory, it's easy to see that it's possible.

Instead of a recommendation, then, my goal is just to clarify what the GC does and doesn't provide, and make you aware that manual deduplication is a further technique that you can consider using.

Footnotes


  1. Authors of JEP 192 which proposed adding string deduplication to G1 estimated that strings averaged 25% of memory contents in various applications, and duplicated strings averaged 13.5%.

  2. String constants contained in class files are interned by the JVM, so we are concerned with strings that are created at runtime by application and library code.

  3. Project Lilliput aims to reduce the size of the object header to 8 bytes or less. That would significantly change all the numbers in this post, and would make manually deduplicating somewhat less compelling, but not change the overall shape of the results.

  4. There is a further complication for empty strings, that we'll see later.

  5. For a non-ascii string, it's 16 bytes per UTF-16 code unit. Unicode is complex, but that means that what we think of as a "letter" will typically but not always take 16 bytes (see unicode). The different treatment of ASCII and non-ascii strings came with JEP 254 in Java 9.

  6. One question is why the GC doesn't deduplicate the strings themselves. I haven't read an explanation, but I can think of three good reasons it doesn't. First, while comparing strings for reference equality is rarely good practice, it would be confusing for its results to depend on when GC had run. Second, code in the wild can synchronize on strings, and deduplicating them could lead to contention or deadlocks (personally, I think synchronizing on Strings is dubious, but VM implementors have to keep existing code working).

    Third, if I read the GC code correctly, while the GC registers the strings for deduplication, a background thread deduplicates strings concurrently with the application code that is using them. If so, I believe the deduplication code cannot invalidate references to the strings.

  7. My coworker David Bogus told me about an extremely simple form of this technique that he used at an old job: when iterating through database results, deduplicate strings from each row against the contents of the previous row. By considering just the last row, he avoided any complexity, shipped the tweak immediately, and kept overhead extremely minimal.

  8. There's no requirement that you deduplicate as you load the data, but the period after a string is instantiated, but not deduplicated is pure waste. Those strings might live through garbage collection, and push the overall memory use of your application up. At the extreme, you might not be able to load all your data without deduplicating as you go, in which case, you're going to crash your application, if you don't do the duplication.

  9. gcfreq is a parameter to the benchmarks, corresponding to how many bytes of strings are copied in between allocating memory just for the sake of triggering GC. I used two values, roughly 65KB and 1MB. To precisely measure how much memory manual deduplication saves, you would want to run with a series of decreasing values, and validate that you'd reached a value where essentially all strings were being deduplicated. You can use my code to do that by adding more values to the gcFreq param, but I confess I'm a bit tired of rerunning and reanalyzing it.

  10. Calling this row out, as I don't understand it.