On predicting predictors: hacking archive formats for fun and prophecy



This talk explains the archive formats you use every day. We take an in-depth look at the tar, ar, cpio, gzip, bzip2, and deb formats, as well as the internals of the Git object store. Armed with this information, we show a practical application: removing the redundancy between files kept in version control and the source and binary distributions that contain those same files.
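To make the redundancy concrete, here is a minimal Python sketch (not the tooling presented in the talk). It assumes a repository using Git's SHA-1 object format, where a blob's ID is the SHA-1 of the header "blob <size>\0" followed by the file content. The helper names git_blob_id and tar_member_blob_ids are hypothetical and exist only for this illustration. Hashing each regular file inside a tarball this way yields IDs that can be looked up in a repository's object store, which shows which members the repository already holds.

```python
import hashlib
import io
import tarfile


def git_blob_id(content: bytes) -> str:
    """Return the ID Git (SHA-1 object format) assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def tar_member_blob_ids(tar_bytes: bytes) -> dict:
    """Map each regular file in an uncompressed tar archive to its Git blob ID.

    Hypothetical helper: members whose IDs already exist in a repository
    are redundant copies and need not be stored a second time.
    """
    ids = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                ids[member.name] = git_blob_id(tar.extractfile(member).read())
    return ids
```

Only file contents are compared here. Reproducing the archive byte for byte also requires the per-member metadata (names, modes, timestamps, ordering, padding), which this sketch ignores.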


Existing projects like pristine-tar focus on finding the right compressor options to reproduce the file from the uncompressed data (“gzip -9 --rsyncable”), treating the file formats as magic black boxes. Our in-depth analysis of archive formats lets us record just enough information to reproduce any archive regardless of the tool used to produce it.
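The option-search approach can be sketched in a few lines of Python. This is an illustration only, not pristine-tar's actual code: it assumes the archive was written by a zlib-compatible deflate implementation and has no optional gzip header fields (FLG is zero). The function names make_gzip, parse_header, and find_level are hypothetical. The sketch reads the fixed 10-byte gzip header defined in RFC 1952, then tries each compression level until the recompressed output matches the original byte for byte.

```python
import struct
import zlib

GZIP_HEADER = "<BBBBIBB"  # ID1, ID2, CM, FLG, MTIME, XFL, OS (10 bytes, little-endian)


def make_gzip(data: bytes, level: int, mtime: int = 0, xfl: int = 0, os_byte: int = 3) -> bytes:
    """Build a gzip file with a minimal header and a raw deflate body."""
    header = struct.pack(GZIP_HEADER, 0x1F, 0x8B, 8, 0, mtime, xfl, os_byte)
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # wbits=-15: raw deflate, no wrapper
    body = comp.compress(data) + comp.flush()
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF)
    return header + body + trailer


def parse_header(blob: bytes) -> dict:
    """Extract the header fields that must be recorded to reproduce the file."""
    id1, id2, cm, flg, mtime, xfl, os_byte = struct.unpack_from(GZIP_HEADER, blob, 0)
    if (id1, id2, cm) != (0x1F, 0x8B, 8):
        raise ValueError("not a deflate-compressed gzip file")
    if flg != 0:
        raise ValueError("optional header fields are not handled in this sketch")
    return {"mtime": mtime, "xfl": xfl, "os_byte": os_byte}


def find_level(blob: bytes):
    """Search for a compression level that reproduces blob exactly, or return None."""
    fields = parse_header(blob)
    data = zlib.decompress(blob, 31)  # wbits=31: expect a gzip container
    for level in range(1, 10):
        if make_gzip(data, level, **fields) == blob:
            return level, fields
    return None
```

The weakness of this approach is visible in the final line: when the file came from a different deflate implementation, or from options the search does not cover, no level matches and the function returns None. Recording what the format itself contains avoids depending on the original tool.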

Speaking experience