New file format?

Duplicity uses a standard backup scheme: the first backup is a full backup, and each additional backup is stored in separate files and records only the files that changed.

This scheme is similar to the one people have used with tar for decades, and duplicity does in fact use the tar format. However, for various reasons it may be time to design a new archive format that is better suited to hard disks and to duplicity.

Why not tar?

Tar is a very standard format, so it seems best to use it unless there are good reasons not to. However, tar (Tape ARchive) is showing its age and has some disadvantages:

Current Proposal

This section describes a possible format in general terms. Please let me know if you have suggestions or corrections.

Block Level Formatting

The archive will have two different levels of formatting, much like a .tar.gz file has two levels of formatting. At the outer level, the archive will look like a header, two collections of blocks, and a footer. The collections of blocks hold the important archive data, usually in compressed and/or encrypted form.

At the outer, or block, level, the file will look like this:

[Diagram: Outer Structure]

Here is more detail on the items in that diagram:

Inner Level Formatting

The block level formatting described above gives us a file within a file. However, this inner file can now be completely encrypted or compressed while still retaining some level of random access.
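To illustrate why processing the inner file block by block preserves random access, here is a minimal Python sketch. It is not the proposed format itself: the block size, the use of zlib, and the names `pack_blocks` and `read_block` are assumptions made for illustration. Each slice of the inner file is compressed independently, so any one block can be decoded without touching the others.

```python
import zlib

BLOCK_SIZE = 16  # hypothetical and tiny, so the example is easy to follow


def pack_blocks(inner: bytes, block_size: int = BLOCK_SIZE):
    """Compress each slice of the inner file independently.

    Returns the concatenated compressed blocks and a list of
    (inner_offset, outer_offset, compressed_length) entries.
    The stored length is a convenience for this sketch; it could
    also be derived from the next block's outer offset.
    """
    outer = bytearray()
    entries = []
    for inner_offset in range(0, len(inner), block_size):
        chunk = zlib.compress(inner[inner_offset:inner_offset + block_size])
        entries.append((inner_offset, len(outer), len(chunk)))
        outer += chunk
    return bytes(outer), entries


def read_block(outer: bytes, entry) -> bytes:
    """Decode a single block without reading any other block."""
    _inner_offset, outer_offset, length = entry
    return zlib.decompress(outer[outer_offset:outer_offset + length])


inner_file = bytes(range(50))
outer_file, entries = pack_blocks(inner_file)
third_block = read_block(outer_file, entries[2])  # inner bytes 32..47
```

The same idea applies to encryption: as long as each block is processed on its own, a reader only pays for the blocks it actually needs.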

The inner file may be laid out as follows:

[Diagram: Inner Structure]

At the top level there are two data streams. The first, called the inner file, has two parts: the inner data file and the inner index file. The inner data file is laid out similarly to tar, with each file header followed by that file's data. The inner index file is in some sense redundant: it merely provides an easy way to access the inner data file.

The block table is a second stream all by itself. It should contain a list of the outer offsets of all the blocks (at the outer or block level), each paired with the inner offset at which that block's data begins. Given an inner offset (for instance, one obtained from a file header), the block table allows an application to find the correct block, seek to it, and decompress and/or decrypt it. The block table should also include the offset in the inner index file of the root (/) directory's header.
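The lookup described above can be sketched in Python as a binary search over the block table. This is only an illustration under stated assumptions: the `BlockEntry` type, the `find_block` function, and the sample offsets are hypothetical, and the table is assumed to be sorted by inner offset.

```python
from bisect import bisect_right
from typing import NamedTuple


class BlockEntry(NamedTuple):
    inner_offset: int  # where this block's data begins in the inner file
    outer_offset: int  # where the block begins in the archive file


def find_block(table, inner_offset):
    """Return the entry of the block that contains inner_offset.

    Assumes table is sorted by inner_offset (hypothetical layout).
    """
    starts = [entry.inner_offset for entry in table]
    # The rightmost block whose start is <= inner_offset contains it.
    index = bisect_right(starts, inner_offset) - 1
    if index < 0:
        raise ValueError("inner offset precedes the first block")
    return table[index]


# Made-up offsets, purely for illustration.
table = [BlockEntry(0, 64), BlockEntry(4096, 1200), BlockEntry(8192, 2950)]

entry = find_block(table, 5000)
# Seek to entry.outer_offset in the archive, decode that block, then
# skip this many bytes inside the decoded data:
position_in_block = 5000 - entry.inner_offset
```

Because the table holds one small entry per block, it can plausibly be kept in memory, making each lookup logarithmic in the number of blocks.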

Here are some more remarks and details on the inner level structure:

Benefits of this format

This section is basically the inverse of the "Why not tar?" section above.

Open questions on above

Here are some issues I am unsure about.

Acknowledgments

Thanks to Will Dyson, John Goerzen, Randall Nortman, and Kevin Spicer for valuable discussion.