For a backup system to work and be of value when something goes wrong, it needs to have these properties:

  • Fully automated: If you have to think about, you will forget or skip it.
  • Off site storage: RAID will not prevent fire or theft; nor accidentally deleting the wrong file.
  • Secured transfer and access: The backup drive can also be stolen or corrupt.

For the transfer, this already restricts the number of tools to pick from: scp, sftp, rsync. And assuming the files to transfer are large, while bandwidth is limited and/or uptime of source/destination systems are limited is only one left: rsync. It is the only tool which is able to resume a previous transfer.

Rsync can use the ssh protocol to transfer files, thus securing the connection. Furthermore, it can utilize the automated authentication through public key. It does require an ssh server on either source or destination though, which will have to be available on the Internet. Thus it's necessary to take a few security precautions. Not running sshd on the standard port 22 will already filter out a lot of attacks, so let's pick another port, e.g. 222.

** First try

For this example, let's assume a pull-backup, that is the destination machine requests files from the source (user foo at example.com) where the original backup file is located. Typically, this will happen on a regular interval, through a cron job. For example, we could imagine running this command every hour (assuming some lock file so we don't disturb an ongoing sync):

rsync --bwlimit=25 --checksum --partial -e "ssh -p 222" -r foo@example.com:/backup /backup

  • bwlimit will limit the transfer to 25 kilo bytes / second, to avoid saturating the line.
  • checksum verifies the file checksum, rather than assuming they are the same only based on size and date.
  • partial enables resuming the download.
  • -e "ssh -p 222" sets the SSH port used by the source.
  • -r syncs recursivly into directories.

The problem with the last option, though, is that it will overwrite existing files on the destination. Imagine a backup file getting corrupt on the source; it will now propagate the same error to the destination and render both files useless. Thus, instead of syncing a whole directory, we'll have to find a way to select files to transfer. I wont go into that here, so maybe it will be a later post.

** Automated login

For the above line to work as part of a cron job, the destination has to be automatically authenticated. Public key authentication with SSH is fairly simple to set up. On the destination machine (which is the client in the ssh connection), run this command to generate a key. Do not set a password. Then copy the key over to the source machine (still assuming it runs SSH on port 222).

ssh-keygen -t dsa

scp -P 222 ~/.ssh/id_dsa.pub foo@example.com:/tmp

On the source machine, copy the keyfile to its correct location. Assuming .ssh and authorized_keys do not already exist.

mkdir ~/.ssh
chmod 700 ~/.ssh
cp /tmp/id_dsa.pub ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

You should now be able to log in from destination to source without having to enter a password:

ssh -p 222 example.com

** Restricted shell

The basic feature of ssh is to give you a shell on the remote host. However, in our situation, we've just granted the destination machine, and anybody there full access to our machine. We might want to restrict this a bit; only allowing the rsync command to run, without other access. rssh handles this. On the source machine, install rssh, and make sure whatever user is using this shell is in the rsshusers group.

yum install rssh

usermod -a -G rsshusers -s /usr/bin/rssh foo

Modify /etc/rssh.conf and enable rsync access by uncommenting allowrsync.

** Ready to backup

Now everything should be ready to run. I'll still skip some of the details of the backup script, but assume there is a file on the source machine which lists which file to copy. (Alternatively, lists all file so we can compare what we have and don't have on the destination). Furthermore, it is assumed that each backup file comes with a corresponding checksum file, e.g. .MD5. The beginning of a script might look like this:

alias backup="rsync --bwlimit=25 --checksum --partial -e 'ssh -p 222' --protocol=29"

backup foo@example.com:/backup/list /tmp
[Determine which file to transfer next, e.g. filename.tar.gz]

backup foo@example.com:/backup/filename.tar.gz /backup
backup foo@example.com:/backup/filename.tar.gz.md5 /backup

There's a few things to note here:

  • An alias, backup, is used to avoid repeating all the options every time.
  • Since we run sshd on port 222, we have to use the -e option. However rssh will not accept this. The option --protocol 29 is used to work around this incomparability in rsync / rssh. (Unfortunately, it seems rssh is not maintained any more).
  • The list file is assumed to contain the list of available backup files, so we can compare to the files already on the destination machine.
  • The main file and its .md5 file is transferred separately, with the .md5 last. This is so we can use that as a flag to mark a finished transfer. If the transfer of the main file is interrupted, we can resume it when the .md5 is not yet there.