G1 String Deduplication Doesn't Fully Deduplicate Strings
Starting with Java 8u20, the G1 garbage collector supports string deduplication (enabled with the VM argument -XX:+UseStringDeduplication) as a memory saving optimization. In later Java versions, the Shenandoah and ZGC collectors also support the feature.
Over time, I've noticed a lot of confusion about the scope of this feature. This piece aims to make clear what GC based string deduplication accomplishes, and why it might still sometimes be useful to manually deduplicate strings.
Why Deduplicate Strings?
Strings are a large portion of the data used by many applications, and it's common to have certain strings repeated throughout an application.1 These duplicated strings might be fragments of HTML, SQL statements, or strings for domain concepts like the USA.
If these duplicated strings all correspond to constants,2 then they might be replaceable with an enum, but that's not necessarily possible for user data. Suppose thousands of users are entering the variants "USA", "United States" or "United States of America". Perhaps we don't know ahead of time which variants users will actually use. Or alternatively, we know what users might enter, but have a requirement to accurately round-trip user data. In either case, a string is the only data structure we can use.
Regardless of why, when we have to store large quantities of data as strings, a significant portion of our memory may be wasted storing copies of the same string. If we can replace these copies with shared references to a canonical instance, we can save a significant amount of memory.
A Caution About String.intern
String interning is a feature of the JVM that has existed forever. Some readers probably assume that I'm going to recommend it. Calling String.intern returns a canonical instance of a string. Two entirely separate pieces of code within the same JVM calling "abc".intern() will get back references to the same string. On the face of it, this is exactly what we want.
Unfortunately, in older versions of the JVM, misuse of String.intern could crash the JVM. Across Java 6 and 7, these crashes were fixed. This left only subtler performance problems as reasons to avoid String.intern, and led some developers to think it was a suitable way to save memory. In fact, using String.intern risks major slowdowns and increased GC pauses. I'll defer explaining in favor of Aleksey Shipilëv's excellent article, but will quote his summary:
In almost every project we were taking care of, removing String.intern() from the hotpaths, or optionally replacing it with a handrolled deduplicator, was the very profitable performance optimization. Do not use String.intern() without thinking very hard about it, okay?
Memory Use of Strings
In the OpenJDK implementation, each Java String contains two primitive fields and a reference to an array. Here are a String's fields as defined in the JDK source, minus comments, annotations and newlines:
public final class String {
    private final byte[] value;
    private final byte coder;
    private int hash;
}
In a <= 32GB heap, the empty string uses 24 bytes for the String excluding the array, and another 16 bytes for the array, or 40 bytes total.3 4 A 1024 character ASCII string also uses 24 bytes for the String excluding the array, but 1040 bytes for its array.5 For small strings, the majority of the memory use is the String object itself; for longer strings, the array dominates.
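If you want to check these numbers on your own JVM, the JOL library (org.openjdk.jol:jol-core) can measure them directly. Here's a minimal sketch, assuming JOL is on the classpath; the exact sizes depend on your JVM version and flags:

import org.openjdk.jol.info.GraphLayout;

public class StringFootprint {
    public static void main(String[] args) {
        String empty = new String(new byte[0]);
        String ascii = new String(new byte[1024]); // 1024 one-byte characters
        // totalSize() walks the object graph: the String plus its backing array.
        System.out.println("empty: " + GraphLayout.parseInstance(empty).totalSize() + " bytes");
        System.out.println("1024 chars: " + GraphLayout.parseInstance(ascii).totalSize() + " bytes");
    }
}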
GC Based String Deduplication
At a very high level, the idea behind the GC's string deduplication is very similar to String#intern: repeated instances of the same string are replaced with one canonical copy. However, aside from using a different implementation that avoids the performance problems associated with String.intern, there is a further difference.
While String.intern canonicalizes the entire string object, GC based deduplication only modifies the backing byte[] array of strings. After the GC deduplicates strings, two copies of the same string will share one array instance, but the two String instances remain separate objects. Since the String object itself is not deduplicated, repeated Strings still carry a significant per-object overhead.6
With String.intern, we decide when and where to deduplicate strings. The GC, on the other hand, arranges to try to deduplicate each individual string exactly once, after it has survived a number of garbage collections or is being promoted to the old generation (hotspot source).
The choice to deduplicate as part of the GC is designed to avoid overhead from deduplicating short-lived strings. JVM implementations make it extremely cheap to allocate memory in the young generation. If a string doesn't survive past that generation, there's no advantage to deduplication: the initial allocation is cheap, and garbage collection of objects that don't survive the young generation is also cheap. By deferring the decision to deduplicate a string until garbage collection happens, you save CPU time for short-lived strings, at the expense of temporarily increased memory use and more frequent young collections.
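For reference, this is roughly how you would enable the feature and watch it work on a modern JDK. The age threshold controls how many young collections a string must survive before it becomes a deduplication candidate; the flag names below are as I recall them, so double-check them against your JDK's documentation, and MyApp is just a placeholder for your main class.

java -XX:+UseG1GC \
     -XX:+UseStringDeduplication \
     -XX:StringDeduplicationAgeThreshold=3 \
     -Xlog:stringdedup*=debug \
     MyApp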
Manual String Deduplication
For manual string deduplication, the idea is that as the application code handles data and allocates new strings, it keeps a "pool" of string instances it has seen, and canonicalizes each string it encounters against that pool. Using a map as the pool, this means code that looks like:
String deduplicate(ConcurrentHashMap<String, String> pool, String s) {
    String found = pool.putIfAbsent(s, s);
    return found != null ? found : s;
}
The result is that duplicate strings end up as references to the same object. The first copy of a given string pays its full memory cost, while each later duplicate only costs a reference.
The pool itself can be implemented with various structures--a simple ConcurrentHashMap, an LRU cache, etc.7 There are a lot of degrees of freedom regarding how long you store the strings, and the scope of what you deduplicate. The more strings your deduplicator sees, the more memory savings it can provide. However, the longer the deduplicator lives, the greater the potential for it to hold on to strings that would otherwise be dead.
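As an illustration of the bounded variant, here is a minimal sketch of an LRU-style pool built on LinkedHashMap; the class name and size limit are my own inventions, not taken from any library.

import java.util.LinkedHashMap;
import java.util.Map;

final class LruStringPool {
    private final Map<String, String> pool;

    LruStringPool(int maxEntries) {
        // accessOrder = true gives LRU iteration order; removeEldestEntry evicts
        // the least recently used string once the pool grows past maxEntries.
        this.pool = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Same shape as the deduplicate method above, but synchronized because
    // LinkedHashMap is not thread-safe.
    synchronized String deduplicate(String s) {
        String found = pool.putIfAbsent(s, s);
        return found != null ? found : s;
    }
}

A weak-reference interner, like the ones found in common utility libraries, is another option if you would rather let the GC decide when pooled strings can be discarded.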
The most aggressive sort of deduplication would use a global static string pool that stores strings from every part of your application and deduplicates them all. The downsides are obvious: a global pool is effectively a sort of memory leak, and if multiple subsystems use it, you're repeatedly checking the same strings to see if they're duplicates.
More likely, to manually deduplicate strings, you should pick some component that loads data that may contain duplicate strings, and deduplicate as that component operates.8
Experiments
It seems obvious enough that GC based deduplication will save some memory, but that explicit deduplication might sometimes save more. Let's test that out.
We'll use a simple design, accumulating strings in a list of lists, and measuring the number of strings we can store before throwing an OutOfMemoryError.
The core of the code is:
try {
    var strings = new ArrayList<List<String>>();
    for (var i = 0; i < 1000000; i++) {
        var newStrings = new ArrayList<String>();
        for (var j = 0; j < 1000; j++) {
            var s = copyString(bh, j);
            seen++;
            newStrings.add(s);
        }
        strings.add(newStrings);
    }
    return strings;
} catch (OutOfMemoryError e) {
    System.out.println("TotalSeen=" + seen + ", Time=" + new Date());
    return null;
}
The copyString method reads from a preinitialized list of strings and creates a copy. If the JMH parameter dedup is true, we deduplicate the string using a ConcurrentHashMap.
private String copyString(Blackhole bh, int j) {
    var b = preAllocatedStrings.get(j % preAllocatedStrings.size()).getBytes();
    bh.consume(b);
    var s = new String(b);
    if (dedup) {
        s = stringMap.computeIfAbsent(s, Function.identity());
    }
    return s;
}
Our variables are:
- Type of deduplication (none, G1, manual deduplication, G1 and manual deduplication)
- The number of distinct strings (1, 1024, 49152)
- The length of the strings (0, 16, 256, 2048)
We'll consider every combination of these options. Code for the experiment can be found on GitHub.
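For concreteness, the knobs above map onto JMH parameters along these lines. This is just a sketch: the field names are my guesses rather than the ones in the repository, and G1 deduplication is toggled with JVM flags per fork rather than with a parameter.

import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class DedupState {
    // Whether copyString canonicalizes strings through the ConcurrentHashMap pool.
    @Param({"true", "false"})
    boolean dedup;

    // Number of distinct preallocated strings to copy from.
    @Param({"1", "1024", "49152"})
    int stringCount;

    // Length of each preallocated string, in characters.
    @Param({"0", "16", "256", "2048"})
    int stringLength;
}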
Results
Before getting results, I wondered whether the experiments were even worth running. It seemed obvious that manual deduplication would save more memory than relying on GC based deduplication, and that the effect would be most pronounced with smaller strings.
In fact, there were two surprises from the first run. Although neither one changes the overall conclusion, I found them interesting. Let's look at some selections from the results (full results):
manual_dedup | g1_dedup | string_count | string_length | total_seen |
---|---|---|---|---|
True | True | 1 | 0 | 25937667 |
True | True | 1024 | 0 | 25952237 |
True | True | 49152 | 0 | 25885824 |
True | True | 1 | 0 | 25954704 |
True | True | 1024 | 0 | 25934483 |
True | True | 49152 | 0 | 25893245 |
True | False | 1 | 0 | 25935941 |
True | False | 1024 | 0 | 25926346 |
True | False | 49152 | 0 | 25883550 |
True | False | 1 | 0 | 25929824 |
True | False | 1024 | 0 | 25929245 |
True | False | 49152 | 0 | 25883362 |
False | False | 1 | 0 | 4467367 |
False | False | 1024 | 0 | 4481169 |
False | False | 49152 | 0 | 4463225 |
False | False | 1 | 0 | 4470115 |
False | False | 1024 | 0 | 4467640 |
False | False | 49152 | 0 | 4468824 |
False | True | 1 | 0 | 4482684 |
False | True | 1024 | 0 | 4481550 |
False | True | 49152 | 0 | 4465321 |
False | True | 1 | 0 | 4470331 |
False | True | 1024 | 0 | 4483550 |
False | True | 49152 | 0 | 4468550 |
First of all, when using the empty string, manual deduplication makes a difference, but GC string deduplication makes no difference. It turns out that the String constructor special-cases the empty string and uses a canonical backing array, regardless of any JVM parameters. Effectively, every 0-length string already shares its backing array at construction time, so there is nothing left for GC deduplication to do.
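You can check this with the same JOL sketch as before (again assuming JOL on the classpath). If the constructor special-casing described above holds, walking two separately constructed empty strings together should report less than twice the footprint of one, because the graph contains two String objects but only one shared array.

import org.openjdk.jol.info.GraphLayout;

public class EmptyStringSharing {
    public static void main(String[] args) {
        String a = new String(new byte[0]);
        String b = new String(new byte[0]);
        // parseInstance with several roots counts shared objects only once.
        System.out.println("one:  " + GraphLayout.parseInstance(a).totalSize());
        System.out.println("both: " + GraphLayout.parseInstance(a, b).totalSize());
    }
}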
The next table comes from preliminary results:
manual_dedup | g1_dedup | string_count | string_length | total_seen |
---|---|---|---|---|
False | True | 1 | 16 | 2490376 |
False | False | 1 | 16 | 2126530 |
False | True | 1 | 256 | 500079 |
False | False | 1 | 256 | 430665 |
False | True | 1 | 2048 | 70263 |
False | False | 1 | 2048 | 61823 |
False | True | 1024 | 16 | 2417633 |
False | False | 1024 | 16 | 2126110 |
False | True | 1024 | 256 | 487646 |
False | False | 1024 | 256 | 430428 |
False | True | 1024 | 2048 | 70795 |
False | False | 1024 | 2048 | 61037 |
False | True | 49152 | 16 | 2283492 |
False | False | 49152 | 16 | 2081110 |
False | True | 49152 | 256 | 389843 |
False | False | 49152 | 256 | 380974 |
False | True | 49152 | 2048 | 15742 |
False | False | 49152 | 2048 | 12752 |
The numbers from the first experiment don't seem right. There is a difference between using GC deduplication and no string deduplication, but it is relatively small. Moreover, the results don't seem internally consistent--with just one distinct string, GC deduplication should give almost identical results whether that string is 16 bytes or 2048 bytes (the extra 2032 bytes only leave room for fewer than 100 extra copies). But the results drop off precipitously, showing that deduplication isn't having the full effect it could.
The explanation is that GC deduplication only happens as part of the garbage collector's work. During a collection, the GC threads register candidate strings, and a background thread processes and deduplicates them.
Since our benchmark only allocates strings without generating other garbage, there isn't enough GC pressure to force deduplication. The solution is to create synthetic GC pressure by allocating some data and then throwing it away. Testing showed that it was important to create short-lived data to trigger young-gen collections; calling System.gc() to trigger full collections had only a small effect.
So I added code to periodically trigger young generation churn:
accumulated += s.length();
if (accumulated >= gcFreq) {
    churnYoungGen(bh);
    accumulated = 0;
}

private void churnYoungGen(Blackhole bh) {
    for (int i = 0; i < 8192; i++) {
        byte[] bytes = new byte[1024];
        bh.consume(bytes);
    }
}
This worked--adding extra allocations made deduplication have a much larger effect.9 The fact that GC based string deduplication only works with another source of GC pressure is interesting, and could lead to missed memory savings. In practice, my instinct is that it would rarely matter.
Final Results
manual_dedup | g1_dedup | churn_gc | string_count | string_length | gc_freq | total_seen |
---|---|---|---|---|---|---|
True | True | True | 1 | 0 | 65536 | 25934037 |
True | True | True | 1 | 16 | 65536 | 25933724 |
True | True | True | 1 | 256 | 65536 | 25931536 |
True | True | True | 1 | 2048 | 65536 | 25928043 |
True | True | True | 1024 | 0 | 65536 | 25931828 |
True | True | True | 1024 | 16 | 65536 | 25909387 |
True | True | True | 1024 | 256 | 65536 | 25864074 |
True | True | True | 1024 | 2048 | 65536 | 25484952 |
True | True | True | 49152 | 0 | 65536 | 25875071 |
True | True | True | 49152 | 16 | 65536 | 25315824 |
True | True | True | 49152 | 256 | 65536 | 22949606 |
True | True | True | 49152 | 2048 | 65536 | 5305588 |
False | True | True | 1 | 0 | 65536 | 4470129 |
False | True | True | 1 | 16 | 65536 | 3835000 |
False | True | True | 1 | 256 | 65536 | 4199465 |
False | True | True | 1 | 2048 | 65536 | 4020224 |
False | True | True | 1024 | 0 | 65536 | 4480948 |
False | True | True | 1024 | 16 | 65536 | 3399859 |
False | True | True | 1024 | 256 | 65536 | 4207463 |
False | True | True | 1024 | 2048 | 65536 | 3950660 |
False | True | True | 49152 | 0 | 65536 | 4466367 |
False | True | True | 49152 | 16 | 65536 | 3102699¹⁰
False | True | True | 49152 | 256 | 65536 | 3692237 |
False | True | True | 49152 | 2048 | 65536 | 522091 |
True | False | True | 1 | 0 | 65536 | 25930263 |
True | False | True | 1 | 16 | 65536 | 25939592 |
True | False | True | 1 | 256 | 65536 | 25925000 |
True | False | True | 1 | 2048 | 65536 | 25916034 |
True | False | True | 1024 | 0 | 65536 | 25937824 |
True | False | True | 1024 | 16 | 65536 | 25899531 |
True | False | True | 1024 | 256 | 65536 | 25799765 |
True | False | True | 1024 | 2048 | 65536 | 25062721 |
True | False | True | 49152 | 0 | 65536 | 25877468 |
True | False | True | 49152 | 16 | 65536 | 25296518 |
True | False | True | 49152 | 256 | 65536 | 22885916 |
True | False | True | 49152 | 2048 | 65536 | 4886967 |
False | False | True | 1 | 0 | 65536 | 4469860 |
False | False | True | 1 | 16 | 65536 | 2123819 |
False | False | True | 1 | 256 | 65536 | 430343 |
False | False | True | 1 | 2048 | 65536 | 61841 |
False | False | True | 1024 | 0 | 65536 | 4478824 |
False | False | True | 1024 | 16 | 65536 | 2123819 |
False | False | True | 1024 | 256 | 65536 | 429311 |
False | False | True | 1024 | 2048 | 65536 | 60818 |
False | False | True | 49152 | 0 | 65536 | 4466814 |
False | False | True | 49152 | 16 | 65536 | 2075749 |
False | False | True | 49152 | 256 | 65536 | 381065 |
False | False | True | 49152 | 2048 | 65536 | 12671 |
These results are in line with what we expect. Manual deduplication lets us store many more copies of the same strings than GC based string deduplication does, which in turn saves more than using no deduplication at all.
Conclusion
I've shown how GC based deduplication works, and that you can sometimes achieve larger memory savings with manual deduplication. That's all.
Should you use manual deduplication? Maybe, maybe not. How much benefit it can have depends on your application, and how it uses strings. Taking heap dumps and analyzing them for duplicated strings will show you the maximum possible savings.
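For example, you can dump a running JVM's heap with jmap (the pid and file name here are placeholders) and then load the dump into a tool like Eclipse MAT, which can group strings by value to estimate how much memory duplicates are costing you:

jmap -dump:live,format=b,file=app.hprof <pid>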
There are also costs to manual deduplication--it requires having an idea of which strings your application uses and how long they live. The garbage collector's design ensures that short-lived strings are not deduplicated, while each long-lived string is deduplicated exactly once. Finding a place in your application that gets the same trade-off right may be difficult. Getting it wrong risks wasting cycles deduplicating strings that would've died young. In the worst case, the cache you use for deduplication can cause an unbounded memory leak. Though I didn't show any cases where manual deduplication made an application use more memory, it's easy to see that it's possible.
Instead of a recommendation, then, my goal is just to clarify what the GC does and doesn't provide, and make you aware that manual deduplication is a further technique that you can consider using.
Footnotes
The authors of JEP 192, which proposed adding string deduplication to G1, estimated that strings averaged 25% of memory contents in various applications, and duplicated strings averaged 13.5%.↩
String constants contained in class files are interned by the JVM, so we are concerned with strings that are created at runtime by application and library code.↩
Project Lilliput aims to reduce the size of the object header to 8 bytes or less. That would significantly change all the numbers in this post, and would make manually deduplicating somewhat less compelling, but not change the overall shape of the results.↩
There is a further complication for empty strings, that we'll see later.↩
For a non-ASCII string, it's 2 bytes per UTF-16 code unit. Unicode is complex, but that means that what we think of as a "letter" will typically, but not always, take 2 bytes (see Unicode). The different treatment of ASCII and non-ASCII strings came with JEP 254 in Java 9.↩
One question is why the GC doesn't deduplicate the strings themselves. I haven't read an explanation, but I can think of three good reasons it doesn't. First, while comparing strings for reference equality is rarely good practice, it would be confusing for its results to depend on when GC had run. Second, code in the wild can synchronize on strings, and deduplicating them could lead to contention or deadlocks (personally, I think synchronizing on Strings is dubious, but VM implementors have to keep existing code working).
Third, if I read the GC code correctly, while the GC registers the strings for deduplication, a background thread deduplicates strings concurrently with the application code that is using them. If so, I believe the deduplication code cannot invalidate references to the strings.↩
My coworker David Bogus told me about an extremely simple form of this technique that he used at an old job: when iterating through database results, deduplicate strings from each row against the contents of the previous row. By considering just the last row, he avoided any complexity, shipped the tweak immediately, and kept overhead extremely minimal.↩
There's no requirement that you deduplicate as you load the data, but the period after a string is instantiated and before it is deduplicated is pure waste. Those strings might live through garbage collections and push up the overall memory use of your application. At the extreme, you might not be able to load all your data without deduplicating as you go, in which case your application will crash if you don't deduplicate.↩
gcfreq is a parameter to the benchmarks, corresponding to how many bytes of strings are copied in between allocating memory just for the sake of triggering GC. I used two values, roughly 65KB and 1MB. To precisely measure how much memory manual deduplication saves, you would want to run with a series of decreasing values, and validate that you'd reached a value where essentially all strings were being deduplicated. You can use my code to do that by adding more values to the gcFreq param, but I confess I'm a bit tired of rerunning and reanalyzing it.↩
Calling this row out, as I don't understand it.↩