Optimizing .tar.gz Archives by Changing File Order
A few weeks ago, I was doing something with a sizeable
.tar.gz file, and wondered how the order of files affected the process. I'm not that knowledgable about compression, but I know that gzip uses a sliding window in which it looks for opportunities to compress repeating chunks of text. If you give it highly repetitive text, it does well, if you give it random data, it will probably give you a bigger file than when you started. So reordering files seems like it could matter.
This idea seems obvious enough someone else has probably explored it, but I decided to spend a few hours playing around with code, rather than investigate see if it's already been implemented.1
I used my project StringMatching, for no reason beyod convenience. I cloned commit
0a4ad368d25c5af7d3d751f9e903abc8ae792dae from GitHub. I decided to remove the .git directory, because I thought it would be easier to guess about how the process treated text files. The resulting
.tar is 337920 bytes, with the
.tar.gz being 45768 bytes.
The code is on GitHub.
For the attempts that involve some randomness, I did 10 runs, and captured the best, worse and average results. I did comparisons using the best result, since if you wanted to use these techniques, you could do multiple runs, and pick the best one.
|shuffled files||47819||49564||48802||+ 6.6%|
|most common token||46641||-||-||+ 1.9%|
|sorting by size||46431||-||-||+ 1.4%|
|python default||45890||-||-||+ 0.2%|
|shuffling files within directories||45771||46817||46466||+ 0.0%|
|swapping within directories||44959||46317||45840||- 1.8%|
Shuffled Files: I shuffled a list of all found files, then added them to the archive.
Most Common Token: I applied a naive regex to split Java code into fragments that might sometimes correspond to tokens in Java syntax, then sorted files by their most common "token". In hindsight, not surprising this gave horrible results. Perhaps return is the most common token across arbitrary java files.
Shuffling Files Within Directories: Same as #1, except that we only shuffle within a directory, keeping all the files from that directory together.
Sort Files by Size: This one seems absurd. The fact that it ends up doing well may just be luck, though you could imagine that larger pieces of code have structural similarities.
Script With Default Ordering: I used the same process as my other attempts, without changing the order of files returned by walking the directories. I suspect the directory order is different than what tar uses, leading to a space increase, but I didn't check it.
tar -czvf: The baseline implementation. It makes sense that it does well relative to several other attempts. Files in the same package are likely to use many of the same imports, types and methods.
Swapping Adjacent Files A sort of bubble-sort lik algorithm. Starting with the files in their natural order, we try swapping files, then recreating the gzipped archive. If the result is smaller than our previous best result, we keep the swap, and try swapping the moved element with the one before it.
Swapping Adjacent Files After Shuffling Within Directories: Apply the same algorithm as in 5, but beforehand, shuffle files within directories, as in #3.
Does Any of This Matter?
No clue. It's hard to imagine these implementations are good for anything--the best results require doing at least O(n) calls to
-czvf for an n-file archive. Google's zopfli algorithm spends major CPU time in exchange for moderate improvements in compressed file size, so a better implementation might be interesting. Lastly, I did one experiment on one library, so real-world results could be completely different.
Then I sat on my results for a few weeks before writing them up.↩