Optimizing .tar.gz Archives by Changing File Order

A few weeks ago, I was doing something with a sizeable .tar.gz file and wondered how the order of files affected the process. I'm not that knowledgeable about compression, but I know that gzip uses a sliding window in which it looks for opportunities to compress repeating chunks of text. If you give it highly repetitive text, it does well; if you give it random data, it will probably give you a bigger file than you started with. So reordering files seems like it could matter.
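As a toy illustration of why order might matter (nothing to do with tar yet, and not part of the experiment below): gzip can only reuse matches that fall within a window of roughly 32 KB, so identical chunks compress much better when they sit next to each other than when something large separates them.

```python
import gzip
import os

# Two identical 20 KB blobs ("similar files") and 40 KB of unrelated noise.
a = os.urandom(20_000)
b = os.urandom(40_000)

adjacent  = gzip.compress(a + a + b)  # second copy of `a` is still inside the ~32 KB window
separated = gzip.compress(a + b + a)  # `b` pushes the first copy of `a` out of the window

print(len(adjacent), len(separated))  # the adjacent ordering comes out roughly 20 KB smaller
```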

This idea seems obvious enough that someone else has probably explored it, but I decided to spend a few hours playing around with code rather than investigate whether it's already been implemented.1

I used my project StringMatching, for no reason beyond convenience. I cloned commit 0a4ad368d25c5af7d3d751f9e903abc8ae792dae from GitHub and removed the .git directory, because I thought it would be easier to reason about how the process treated text files. The resulting .tar is 337920 bytes, and the .tar.gz is 45768 bytes.

The code is on GitHub.
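Every attempt below boils down to the same primitive: write a tar archive with the files in some explicit order, gzip it, and measure the result. A minimal sketch of that primitive (the helper names are illustrative, and the sizes it produces won't byte-for-byte match GNU tar's output, but the comparisons work the same way):

```python
import gzip
import io
import os
import tarfile

def all_files(root: str) -> list[str]:
    """Walk the tree and return every file path (the .git directory has already been removed)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

def archive_size(root: str, ordered_files: list[str]) -> int:
    """Tar the files in the given order, gzip the tar, and return the compressed size in bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:  # plain tar; gzip applied separately
        for path in ordered_files:
            tar.add(path, arcname=os.path.relpath(path, root))
    return len(gzip.compress(buf.getvalue()))
```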

Attempts

For the attempts that involve some randomness, I did 10 runs and captured the best, worst, and average results. I did comparisons using the best result, since if you wanted to use these techniques, you could do multiple runs and pick the best one.

All figures are .tar.gz sizes in bytes; the difference column compares each attempt's best result against the tar -czvf baseline.

Method                                  Best    Worst   Average  Difference
shuffled files                          47819   49564   48802    +4.5%
most common token                       46641   -       -        +1.9%
sorting by size                         46431   -       -        +1.4%
python default                          45890   -       -        +0.2%
shuffling files within directories      45771   46817   46466    +0.0%
tar -czvf (baseline)                    45768   -       -        -
swapping                                45202   -       -        -1.2%
swapping within directories             44959   46317   45840    -1.8%
  1. Shuffled Files: I shuffled a list of all found files, then added them to the archive.

  2. Most Common Token: I applied a naive regex to split Java code into fragments that might sometimes correspond to tokens in Java syntax, then sorted files by their most common "token", hoping files dominated by the same token would end up near each other. In hindsight, it's not surprising this gave horrible results. Perhaps return is the most common token across arbitrary Java files, which would make the grouping close to meaningless.

  3. Shuffling Files Within Directories: Same as #1, except that we only shuffle within a directory, keeping all the files from that directory together.

  4. Sort Files by Size: This one seems absurd. The fact that it ends up doing well may just be luck, though you could imagine that larger pieces of code have structural similarities.

  5. Script With Default Ordering: I used the same process as my other attempts, without changing the order of files returned by walking the directories. I suspect that traversal order differs from the order tar uses, which would explain the small size increase, but I didn't check.

  6. tar -czvf: The baseline implementation. It makes sense that it does well relative to several other attempts. Files in the same package are likely to use many of the same imports, types and methods.

  7. Swapping Adjacent Files: A sort of bubble-sort-like algorithm. Starting with the files in their natural order, we try swapping adjacent files, recreating the gzipped archive after each swap. If the result is smaller than our previous best, we keep the swap and then try swapping the moved file with the one before it.

  8. Swapping Adjacent Files After Shuffling Within Directories: Apply the same algorithm as in #7, but first shuffle files within directories, as in #3. (A rough sketch of this combination appears after this list.)
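Combining #3 and #7 looks something like the sketch below, reusing the archive_size and all_files helpers from earlier. This is a simplified illustration rather than the exact code, and the repeat-and-keep-the-best loop at the bottom mirrors how the randomized attempts were evaluated:

```python
import itertools
import os
import random

def shuffle_within_directories(ordered_files: list[str]) -> list[str]:
    """Attempt #3: keep directories in their walk order, but shuffle the files inside each one.

    Assumes each directory's files are already grouped together in the input,
    which is what os.walk produces.
    """
    result = []
    for _directory, group in itertools.groupby(ordered_files, key=os.path.dirname):
        files = list(group)
        random.shuffle(files)
        result.extend(files)
    return result

def swap_adjacent(root: str, ordered_files: list[str]) -> list[str]:
    """Attempt #7: a bubble-sort-like pass that keeps a swap only if the .tar.gz shrinks."""
    order = list(ordered_files)
    best = archive_size(root, order)
    i = 1
    while i < len(order):
        order[i - 1], order[i] = order[i], order[i - 1]
        size = archive_size(root, order)
        if size < best:
            # Keep the swap, then see whether the moved file also beats its new left neighbour.
            best = size
            i = max(i - 1, 1)
        else:
            # Undo the swap and move on.
            order[i - 1], order[i] = order[i], order[i - 1]
            i += 1
    return order

def best_of_runs(root: str, runs: int = 10) -> list[str]:
    """Attempt #8: shuffle within directories, then swap adjacent files; keep the best of several runs."""
    initial = all_files(root)
    candidates = [swap_adjacent(root, shuffle_within_directories(initial)) for _ in range(runs)]
    return min(candidates, key=lambda order: archive_size(root, order))
```

Every candidate swap rebuilds and recompresses the entire archive, which is why this approach is so slow.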

Does Any of This Matter?

No clue. It's hard to imagine these implementations are good for anything: the best results require at least O(n) calls to tar -czvf for an n-file archive, each of which recompresses the whole thing. On the other hand, Google's zopfli spends a lot of extra CPU time in exchange for moderate improvements in compressed file size, so a better implementation of this idea might be interesting. Lastly, I did one experiment on one library, so real-world results could be completely different.

Footnotes


  1. Then I sat on my results for a few weeks before writing them up.