File Order Effects on Compression

When creating a compressed archive (e.g. foo.tar.gz), the order of the files within the archive can matter significantly. At the extreme, placing duplicate files next to each other yields huge savings, but even without duplicates, grouping similar files together improves the compression ratio.
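The effect is easy to demonstrate with DEFLATE (the algorithm behind gzip), whose back-reference window is 32 KB: a duplicate only compresses well when it sits within the window of its earlier copy. A minimal sketch, using synthetic in-memory "files" rather than a real tarball:

```python
import os
import zlib

# Two distinct pseudo-random 20 KB "files"; each appears twice in the
# archive. At 20 KB, a duplicate falls inside DEFLATE's 32 KB window
# only when it is adjacent to its first copy.
a = os.urandom(20 * 1024)
b = os.urandom(20 * 1024)

grouped = a + a + b + b      # duplicates adjacent: within the window
interleaved = a + b + a + b  # duplicates 40 KB apart: outside the window

size_grouped = len(zlib.compress(grouped, 9))
size_interleaved = len(zlib.compress(interleaved, 9))
print(size_grouped, size_interleaved)
```

The grouped ordering compresses to roughly half the size of the interleaved one, because the second copy of each block can be encoded as back-references instead of literal bytes.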

I rediscovered this idea a few years ago, but suspected I wasn't the first. After presenting my own efforts, many people shared their own ways of achieving the same thing.

Programming Folklore

Gwern calls the technique of grouping similar files together "programming folklore", which is a perfect description: if there's a canonical explanation, I have yet to find an internet-accessible version of it.

Simple Approaches


One tool, BinHash, scans the contents of a directory, groups the files by binary similarity, generates a filelist, and prints the list to stdout.

It is based on Bart Massey's implementation of simhash. My own experience with BinHash suggests it works quite well. It can be beaten, but for optimizing an arbitrary archive, I would absolutely give it a try.
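The core idea behind simhash-based grouping can be sketched in a few lines. This is not BinHash itself, just an illustration of the technique: fingerprint each file's byte n-grams into a 64-bit simhash, then sort the filelist by fingerprint so binary-similar files tend to land next to each other. The function names are my own, not from any of the tools mentioned.

```python
import hashlib

def simhash(data: bytes, ngram: int = 8, bits: int = 64) -> int:
    """64-bit simhash over overlapping byte n-grams."""
    votes = [0] * bits
    for i in range(max(1, len(data) - ngram + 1)):
        h = int.from_bytes(
            hashlib.blake2b(data[i:i + ngram], digest_size=8).digest(), "big"
        )
        for b in range(bits):
            # each n-gram votes +1/-1 on each bit position
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def order_by_similarity(named_blobs):
    """Return file names sorted by simhash of their contents.

    Similar contents get similar fingerprints, so sorting by the
    fingerprint is a crude but cheap way to cluster them in a filelist.
    """
    return [name for name, blob in
            sorted(named_blobs, key=lambda nb: simhash(nb[1]))]
```

Feeding the resulting order to `tar` (e.g. via a filelist and `-T`) is then enough to realize the savings; no change to the compressor itself is needed.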

Lobsters Comments on the Experiment

References to Git packfiles, RAR archiving, etc. See also the HN comments on the experiment.