File Order Effects on Compression
When creating a compressed archive (e.g. foo.tar.gz), the order of the files in the archive can matter significantly. At the extreme, placing duplicate files next to each other yields huge savings, but even without duplicates, grouping similar files together can still noticeably improve the compression ratio.
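A quick way to see the effect is with gzip, whose DEFLATE window is only 32 KB: a duplicate placed right after the original compresses to almost nothing, while the same duplicate placed past 32 KB of unrelated data cannot be matched at all. A minimal sketch (the file names `dup`, `dup2`, and `noise` are made up for the demo):

```shell
# Incompressible filler larger than gzip's 32 KB window.
head -c 100000 /dev/urandom > noise
# A file we will duplicate.
head -c 20000 /dev/urandom > dup
cp dup dup2

# Duplicates adjacent: the second copy falls inside the window.
cat dup dup2 noise | gzip -9 | wc -c
# Duplicates separated by the filler: the first copy is out of reach.
cat dup noise dup2 | gzip -9 | wc -c
```

The first pipeline should report roughly 20 KB less than the second, since the adjacent duplicate is encoded as back-references instead of literal bytes.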
I rediscovered this idea a few years ago, though I suspected I wasn't the first. After I presented my own efforts, many people shared their own ways of achieving it.
Gwern calls the technique of grouping similar files together "programming folklore", which is a perfect description. If there's a canonical explanation, I've yet to find an internet-accessible version.
- Kornel's shell script for grouping by file types
- Manschott's shell script for grouping files of similar types and names
- Aleksey Shipilëv's shell script for grouping by file and name
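The common shape of these scripts is: sort the file list by some similarity proxy (file type, name, or both), then feed the sorted list to tar. A minimal sketch of the extension-grouping variant (the directory name `src` and the sample files are hypothetical; this is not any of the scripts above):

```shell
# Hypothetical sample tree for the demo.
mkdir -p src
touch src/a.txt src/b.png src/c.txt

# Sort the file list by extension so similar files sit together in the
# archive. (Naive: breaks on paths containing spaces, and files without
# an extension sort under their full path.)
find src -type f | awk -F. '{print $NF, $0}' | sort | cut -d' ' -f2- > filelist

# tar -T reads the member list from a file, preserving its order.
tar -czf grouped.tar.gz -T filelist
```

Here `tar -T` (a.k.a. `--files-from`) is what makes the trick work: the archive stores members in exactly the order the list gives them.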
Scans the contents of a directory, groups the files by binary similarity, generates a filelist, and prints the list to stdout.
It is based on Bart Massey's implementation of Simhash. In my own experience, BinHash works quite well: it can be beaten, but for optimizing an arbitrary archive, I would absolutely give it a try.
Related techniques appear in Git pack files, RAR archiving, and elsewhere. See also the HN comments on the experiment.