Table of contents

  1. How do I pick out directories using the file:// and scp:// syntax? Something is going wrong.
  2. Duplicity uses too much temp space, what can I do?
  3. Why are long incremental chains bad?

Questions and Answers

  1. How do I pick out directories using the file:// and scp:// syntax? Something is going wrong.

    Perhaps you need an extra slash. For instance, to specify the /usr/backup directory you have to use the URL file:///usr/backup. Note the three slashes: the first two belong to the file:// prefix, and the third is the beginning of /usr/backup.

    Similarly, to specify /usr/backup on machine host.net using scp, you would say scp://host.net//usr/backup (note the two slashes after host.net). If you wanted to use the directory temp right under the scp default directory (usually the user's home directory), then you could just use scp://host.net/temp.
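    The slash rules above can be sketched in a few lines of Python. The helper name backup_url is hypothetical and not part of duplicity; it only concatenates strings the way the two paragraphs above describe.

```python
def backup_url(scheme, path, host=None):
    """Build a file:// or scp:// URL following the slash rules above.

    Hypothetical helper for illustration only; not part of duplicity.
    """
    if scheme == "file":
        # "file://" plus an absolute path such as "/usr/backup"
        # yields three slashes in a row.
        return "file://" + path
    if scheme == "scp":
        if host is None:
            raise ValueError("scp URLs need a host")
        # One slash separates host and path. An absolute path brings its
        # own leading slash (two in total); a relative path has only one.
        return "scp://" + host + "/" + path
    raise ValueError("unsupported scheme: " + scheme)


print(backup_url("file", "/usr/backup"))             # file:///usr/backup
print(backup_url("scp", "/usr/backup", "host.net"))  # scp://host.net//usr/backup
print(backup_url("scp", "temp", "host.net"))         # scp://host.net/temp
```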

  2. Duplicity uses too much temp space, what can I do?

    Duplicity can need a lot of temp space, depending on the size of the volumes it creates. You can control where many of its temp files go by setting the TMPDIR environment variable. The default is either /usr/tmp or /tmp, depending on the system.
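    As a rough sketch, a wrapper script could check free space and set TMPDIR before launching the backup. The function names env_with_tmpdir and free_gib are made up for this example, and the scratch directory is a stand-in for whatever large disk you would actually choose.

```python
import os
import shutil
import tempfile


def env_with_tmpdir(tmpdir, base_env=None):
    """Return a copy of the environment with TMPDIR pointing at tmpdir."""
    env = dict(os.environ if base_env is None else base_env)
    env["TMPDIR"] = tmpdir
    return env


def free_gib(path):
    """Free space on the filesystem holding path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


# Stand-in for a roomy scratch directory; on a real system this would be
# a path on a large disk picked by the administrator.
scratch = tempfile.mkdtemp()
env = env_with_tmpdir(scratch)
print("TMPDIR =", env["TMPDIR"])
print("free space: %.1f GiB" % free_gib(scratch))
# The env mapping could then be passed as subprocess.run(..., env=env)
# when launching the backup command.
```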

  3. Why are long incremental chains bad?

    Let's look at how duplicity chains are formed and why long chains are a very bad choice for backups. We intentionally leave out a lot of detail to concentrate on the 10,000-foot view.

    FULL: For example, start with a full backup of three files: A (3 blocks), B (5 blocks), and C (2 blocks).

    DAT - A1 A2 A3 B1 B2 B3 B4 B5 C1 C2

    SIG - A1 A2 A3 B1 B2 B3 B4 B5 C1 C2

    INC1: We modify B and don't touch A or C. The incremental contains:

    DAT -             B2    B4

    SIG -             B2    B4

    INC2: We modify B and C and don't touch A. The incremental contains:

    DAT -          B1                C2

    SIG -          B1                C2

    As you can see, the incrementals take up very little room compared to full backups, thus saving network bandwidth and disk space.

    So now we want to restore the backup completely. The restore starts with Full: all the blocks of A are restored to the filesystem, and we search Inc1 and Inc2 for changes to A. Finding none, we proceed to B and restore all its blocks from Full. Next we restore B2 and B4 from Inc1, then B1 from Inc2. Finally we restore all the blocks of C from Full, find no changes in Inc1, but find a block (C2) to restore in Inc2. At this point all files have been restored, and we can close them and finish.
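    The walk-through above can be replayed as a toy simulation. This is a deliberately simplified model, not duplicity's real archive format or restore code: each backup set is a dictionary from block name to the set that supplied it, and the restore applies the sets in chain order rather than file by file.

```python
# Each backup set maps a block name to the set that stores its data.
full = {b: "full" for b in
        ["A1", "A2", "A3", "B1", "B2", "B3", "B4", "B5", "C1", "C2"]}
inc1 = {"B2": "inc1", "B4": "inc1"}
inc2 = {"B1": "inc2", "C2": "inc2"}


def restore(full, incrementals):
    """Replay the chain: start from the full, then apply each inc in order."""
    state = dict(full)
    sets_read = 1  # the full backup
    for inc in incrementals:
        state.update(inc)  # blocks from later sets overwrite earlier ones
        sets_read += 1
    return state, sets_read


state, sets_read = restore(full, [inc1, inc2])
for name in sorted(state):
    print(name, "from", state[name])
print("backup sets read:", sets_read)
```

    Every set in the chain has to be read, even though only four blocks changed after the full backup.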

    As you can see from this small example, all files except A had changes in the incremental chain (Inc1, Inc2). Files may also be created or deleted at any point, and all of it is represented in a space-saving incremental change set. However, the incremental model has a couple of drawbacks.

    Recovery is not efficient over long chains. Expand the above to 100 or more incrementals and you'll see the problem: a file may start off in one state and be completely changed by just a few of the incrementals, yet the restore still has to search every incremental in the chain for changes. The savings come from reduced network bandwidth, not from speed of recovery.

    Remote files are easily corrupted. Consumer-grade hardware is used in many data centers, and it sits on the low end of the quality spectrum, so things like memory, disks, and network routers can introduce errors into the upload or download process. Add the fact that most users do not routinely verify their backups, and you have the makings of a nasty recovery problem if you use a long incremental chain. Some of this could be mitigated with PAR2 and other corrupt-data recovery systems, but PAR2 usage is also low among users.
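    To see why chain length matters here, consider a back-of-the-envelope model. It rests on assumptions that are ours, not measurements: each backup set is corrupted independently with the same probability, and a restore to the latest state needs the full backup plus every incremental intact. The 0.1% figure is purely illustrative.

```python
def chain_intact_probability(p_corrupt, incrementals):
    """Probability that the full backup and every incremental are intact.

    Assumes each set is corrupted independently with probability p_corrupt.
    """
    sets_needed = incrementals + 1  # the full backup plus the incrementals
    return (1.0 - p_corrupt) ** sets_needed


# Illustrative per-set corruption probability; not a measured value.
P_CORRUPT = 0.001

for n in (7, 30, 100):
    chance = chain_intact_probability(P_CORRUPT, n)
    print("%3d incrementals: %.4f chance of a clean restore" % (n, chance))
```

    Under this model the chance of a clean restore shrinks with every incremental added to the chain, whatever the per-set probability actually is.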

    The bottom line is simple. Let a chain of incrementals run for a maximum of one month, then do a full backup. If you have things that don't change much, split your backup into stable and unstable sets. Do incrementals on the stable set only when you've done a system update, or weekly if you feel paranoid. The unstable set should get an incremental each day for a month, then a full backup. That means a lot less data and a lot safer backups.
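    A scheduling script could encode the one-month rule along these lines. The function next_backup_type is a hypothetical helper, and "one month" is approximated as 30 days. Duplicity's own --full-if-older-than option covers similar ground; check the man page for its exact behavior before relying on it.

```python
from datetime import date, timedelta

MAX_CHAIN_AGE = timedelta(days=30)  # "one month", approximated as 30 days


def next_backup_type(last_full, today, max_age=MAX_CHAIN_AGE):
    """Return "full" if there is no full backup yet or the chain is too old."""
    if last_full is None or today - last_full >= max_age:
        return "full"
    return "incremental"


print(next_backup_type(None, date(2024, 3, 1)))               # full
print(next_backup_type(date(2024, 3, 1), date(2024, 3, 15)))  # incremental
print(next_backup_type(date(2024, 3, 1), date(2024, 3, 31)))  # full
```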