Backup and Restore

If you're anything like me, you have an ever-increasing volume of both personal and professional data stashed away on your computer: correspondence, reports, articles, accounts, photographs, presentations and perhaps even music. The loss of this data would be catastrophic - so much so that I rely on three different layers of defence to protect it (database replication between machines, RAID mirroring of server drives, and tape backup). Not everyone is as computer-dependent as I am - or as paranoid - but everyone should give some thought to backing up their most important data.

Those who administer servers also have responsibility for other users' data, and so have extra reason to be concerned with safeguarding it.

You can lose data in many ways: hardware failure, theft, security compromises, virus infection of attached Windows clients, and perhaps the commonest - finger trouble, like typing "rm *.doc" when you meant "rm *.bak". (Technicians diagnose this as "PEBCAK" - "Problem Exists Between Chair And Keyboard"). However the data disappears, it's nice to know that you can be back in business by restoring it from a tape or CD.

A Few Golden Rules:


What Should You Back Up?

Where is the critical data on a Linux system? Here are a few obvious candidates:

| Location | What's there |
|---|---|
| /home | Users' home directories - their working files |
| /var | |
| /var/www | Your web site |
| /var/ftp/pub | Files hosted on your FTP server |
| /var/lib/mysql | MySQL databases |
| /var/named | Domain name server zone files |
| /etc | Various system configuration files |

Table 1 - Directories that are candidates for backup

You should review the applications installed on your system - particularly multi-user server applications such as databases - and check where they store their data files.

However, for some systems, the system configuration itself represents many hours of work, and so you should consider backing up configuration files in /etc, under /var and possibly in some other locations.
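As a rough sketch of how a script might turn the candidate list into something usable, the fragment below checks which of the directories from Table 1 actually exist on the machine it runs on. The function name existing_dirs and the variable names are my own inventions for illustration; adjust the list to suit your own applications.

```shell
#!/bin/bash
# Sketch: work out which of the candidate directories exist on this machine.
# The list is taken from Table 1; add your own application data directories.
CANDIDATES="/home /var/www /var/ftp/pub /var/lib/mysql /var/named /etc"

existing_dirs() {
    # Print each argument that is a directory, one per line.
    for dir in "$@" ; do
        [ -d "$dir" ] && echo "$dir"
    done
    return 0
}

# CANDIDATES is deliberately unquoted so that it splits into separate words
TO_BACK_UP=$(existing_dirs $CANDIDATES)
echo "Directories to back up:"
echo "$TO_BACK_UP"
```

The resulting list can then be handed to whichever backup command you settle on.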

Backup Philosophies

There are several different types of backup, as shown in Table 2. Most people work out a scheme that makes a full backup once a week, usually on a weekend night because it can be very time-consuming, with differential or incremental backups on weekday nights. This will require at least two tapes, possibly more; however, you should also consider using two complete sets of media, alternated weekly, with the set not currently in use being stored off-site. Take the tapes home with you, or store them at a bank or other location, but don't forget about security - if the data on your tapes is valuable, it should be subject to the same safeguards and controls as the live data on your systems.

| Type | Description |
|---|---|
| System Backup | A backup of the complete system - all binaries, configuration files, and user data, plus - ideally - non-filesystem data, such as the master boot record and boot sector(s). Backup utilities that can do this are also usually able to create a bootable CD-ROM (or other media) that can restore the system onto bare metal. None of the utilities described here can do this, but commercial backup programs sometimes have this feature. Also, check out mkCDrec (http://mkcdrec.ota.be/) |
| Full Backup | A backup of all the desired files. Essentially, this is a snapshot of the data of interest, at the time the backup is made. |
| Differential Backup | A backup of all the files that have been updated, changed or created since the last full backup. With a full backup on Sunday night, Monday night's differential backup will contain Monday's work, Tuesday night's differential backup will contain Monday and Tuesday's work, and so on. A restore will require the full backup media, plus the latest differential backup. |
| Incremental Backup | A backup of all the files that have been updated, changed or created since the last full or incremental backup. With a full backup on Sunday night, Monday night's incremental backup will contain Monday's work, Tuesday night's incremental backup will contain Tuesday's work, and so on. A restore will require the full backup media, plus all incremental backups made since then. |

Table 2 - Types of backup
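To make the difference between a full and a differential backup concrete, here is a small sketch that works entirely in a scratch directory. It is only one way of doing it, and it rests on a few assumptions of mine: GNU tar, find and touch are available, and a marker file (which I have called last-full) records when the full backup was taken. The dates are set by hand so that the outcome does not depend on timing.

```shell
#!/bin/bash
# Sketch: full backup plus a differential backup using tar and find.
# Everything lives under a throw-away temporary directory.

WORK=$(mktemp -d)
DATA="$WORK/data"          # hypothetical directory to protect
STAMP="$WORK/last-full"    # hypothetical marker: time of the last full backup
mkdir -p "$DATA"

# Sunday night: two files exist, dated before the full backup.
echo "report" > "$DATA/report.txt"
echo "letter" > "$DATA/letter.txt"
touch -d '2004-01-04 01:00' "$DATA/report.txt" "$DATA/letter.txt"

# Full backup: archive everything, then record when it was taken.
tar czf "$WORK/full.tar.gz" -C "$WORK" data
touch -d '2004-01-04 02:00' "$STAMP"

# Monday: one file is edited, so its modification time becomes "now".
echo "more text" >> "$DATA/report.txt"

# Differential backup: only files modified since the last full backup.
( cd "$WORK" && find data -type f -newer "$STAMP" -print0 |
    tar czf "$WORK/diff.tar.gz" --null -T - )

FULL_LIST=$(tar tzf "$WORK/full.tar.gz" | sort)
DIFF_LIST=$(tar tzf "$WORK/diff.tar.gz" | sort)
echo "Full backup contains:";         echo "$FULL_LIST"
echo "Differential backup contains:"; echo "$DIFF_LIST"
```

Leaving the marker file alone between full backups gives differential behaviour; touching it after every run would turn the same script into an incremental scheme.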

Backup Media

Tape

Tape is expensive - both for the drives and the media - and seems to be an area where the more you spend, the less trouble you will have. However, tape is hard to beat for convenience and reliability, particularly for large systems. Tape drives can be attached in several ways - via IDE interface, floppy interface, SCSI or parallel port. Your first step is to work out the device name by which Linux will refer to your drive; for example, a SCSI tape drive will be known as /dev/st0, while a QIC-117 (Travan) drive will be seen as /dev/qft0 and a parallel port drive will be /dev/pt0. For full details of device names, refer to the Linux kernel source code file /usr/src/linux/Documentation/devices.txt

Next, you need to understand that tape drives are sequential devices - random access is generally not possible. When you write a file to tape, the general behaviour will be for the tape to be written from the beginning to the end of the file, and then rewind automatically. This means that the next file written will overwrite the last one. To avoid this, you can stop the tape from rewinding by prefixing the device name with an 'n' to indicate no-rewind mode. So, if you back up to /dev/nst0, the tape will remain positioned after the end of the file, and what you write next will be appended.

To position the tape at the correct point, as well as performing tasks like erasing and retensioning tape, you will need to master the mt command. This usually takes an argument, -f devicename, followed by a command, for example

mt -f /dev/st0 retension

to retension a tape. To skip over the first file on a tape and position to the first block of the next file, use the fsf (forward space file) command:

mt -f /dev/nst0 fsf 1

Note the use of the non-rewinding device name - without this, the tape will be positioned but will immediately rewind again. You can also position to a file by number with the asf command; for example, to position to the first block of the third file on the tape:

mt -f /dev/nst0 asf 2

Remember, programmers start counting from zero!

[root@fulbert root]# mt -f /dev/st1 tell
At block 0.
[root@fulbert root]# mt -f /dev/nst1 fsf 1
[root@fulbert root]# mt -f /dev/nst1 tell
At block 313193.
[root@fulbert root]#
Listing 1 - Using the mt command to position a tape
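Putting the non-rewinding device and the mt command together, a session that writes two archives to one tape and then reads back the second might look something like the sketch below. The device name and the two directories are assumptions on my part, and the run wrapper is my own addition: with DRY_RUN left at 1 it only prints the commands, so you can try the script on a machine without a tape drive.

```shell
#!/bin/bash
# Sketch: put two tar archives on one tape, then list the second one.
# TAPE and the directories are assumptions. DRY_RUN=1 (the default here)
# only prints the commands instead of running them.

TAPE=/dev/nst0              # non-rewinding device, so the tape stays in place
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ] ; then
        echo "would run: $*"
    else
        "$@"
    fi
}

tape_session() {
    run mt -f "$TAPE" rewind          # start from the beginning of the tape
    run tar cf "$TAPE" /etc           # file 0 on the tape
    run tar cf "$TAPE" /home          # file 1, appended after file 0
    run mt -f "$TAPE" rewind
    run mt -f "$TAPE" fsf 1           # skip file 0, stop at the start of file 1
    run tar tf "$TAPE"                # list the second archive (/home)
}

PLAN=$(tape_session)
echo "$PLAN"
```

Set DRY_RUN=0 only once you are sure the device name is right for your drive and the tape holds nothing you want to keep.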

Removable Hard Drives

Hard drives are cheaper than ever, and have as much capacity as... well, as your hard drive. With the addition of a drawer ($A40 - $A60) to make the drive removable, hard drives provide fast, high-capacity and inexpensive backup and archival media. The major concern has to be their robustness when removed from the computer, particularly when being transported for off-site storage. Another concern is that most systems have to be powered down before hard drives can be added or removed.

As far as Linux is concerned, hard drives can be formatted and mounted, then files copied to them, or they can be used as raw media and written to by commands like tar. For this reason, in many of the examples that follow, you can treat tapes and hard drives as equivalent.

CD-R/CD-RW, DVD[+/-]R[W], DVD-RAM

Many PCs ship with a CD burner as standard these days, and DVD burners are dropping in price. The commonest technique for backing up to CD-R or CD-RW is a script which runs mkisofs to create an ISO 9660 image, and then runs cdrecord to burn the CD. This is a little slow, and it requires around 700 MB of working space (usually in /tmp), but it does work. In general, CD-R is most useful as an archival mechanism - for example, to periodically clean out that download directory that's full of tarballs and RPMs.

A typical command that can be used to burn a directory tree to CD might be:

mkisofs -R /master/tree | cdrecord -v speed=8 dev=0,0 -
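The two-step approach described earlier - build the image in working space, then burn it - could be scripted along the following lines. Treat it as a sketch: the image path and the function name burn_tree are made up for illustration, the dev=0,0 address is simply copied from the command above, and the run wrapper (my own addition) only prints the commands while DRY_RUN is left at 1.

```shell
#!/bin/bash
# Sketch: build an ISO 9660 image in working space, then burn it.
# SRC, IMAGE and the dev= address are assumptions. DRY_RUN=1 (the default
# here) prints the commands instead of running them.

SRC=/master/tree
IMAGE=/tmp/backup.iso
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ] ; then
        echo "would run: $*"
    else
        "$@"
    fi
}

burn_tree() {
    # -R adds Rock Ridge extensions so Unix permissions and long names survive
    run mkisofs -R -o "$IMAGE" "$SRC" || return 1
    # Burn only if the image was built; dev= is the burner's address
    run cdrecord -v speed=8 dev=0,0 "$IMAGE" || return 1
    # Free the working space again once the disc has been written
    run rm -f "$IMAGE"
}

PLAN=$(burn_tree)
echo "$PLAN"
```

Keeping the image on disk until cdrecord has finished costs working space, but it means a failed burn can be retried without rebuilding the image.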

Direct access is possible using the UDF filesystem; however, the Linux kernel does not currently have support for UDF writes to CD-RW or DVD media. See http://fy.chalmers.se/~appro/linux/DVD+RW/ for useful information on burning DVDs. The best support under Linux appears to be for DVD-RAM media, which is designed for data storage but is also able to store video. DVD-RAM drives, such as those made by Panasonic, put the media in a cartridge, which protects it from scratches, and are good for 100,000 writes - a significantly better lifetime than tapes, which typically wear out after a few hundred writes.

By contrast, DVD-RW requires the whole disk to be written at once, but provides random access reads.

Backup Between Systems

Another useful technique - especially if your systems have larger hard drives than you really need - is to back up data between systems. You can set this up quite simply using scripts which invoke the scp (secure copy) command, but a much more efficient way is to use the rsync command, which transfers only changes between the two systems.

Standard Utilities

The good news is that every Linux distribution comes with a number of free utilities which can be used to make backups. Although commercial backup programs provide some additional capabilities and prettier (graphical) interfaces, it's not difficult to completely automate your backup routine with cron and some simple scripts.

tar

tar has been around since the early days of UNIX. Although the name implies operation as a tape archiver, the way that *ix treats files and devices interchangeably makes it much more generally useful.

Most people are familiar with the use of tar, coupled with gzip or bzip2, as a way of distributing packages of source code, commonly known as tarballs. However, tar is much more versatile.

The GNU tar utility supplied with most Linux distributions has lots of options, but they're easy to sort out. First, there are the commands:


| Command | Explanation |
|---|---|
| c | Create an archive |
| x | Extract an archive |
| t | List the contents of an archive |
| u | Update - only append files that are newer than the copy in the archive file |

Table 3 - The essential tar commands

Exactly one of these commands must be present. Next, some necessary or useful options:

| Option | Explanation |
|---|---|
| f filename | Specifies the name of the archive to be created, extracted, listed, etc. This can be a file or a device such as a tape drive (e.g. /dev/st0) or raw disk partition (e.g. /dev/hda9) |
| v | Verbose operation - probably undesirable in scripts, but reassuring when working interactively |
| z | gzip or gunzip the archive - compresses an archive being created, decompresses on extraction |
| j | bzip2 or bunzip2 the archive - compresses an archive being created, decompresses on extraction |

Table 4 - The most important tar options

The complete syntax is

tar command&options file-or-directory ...

Since one of the options is usually the -f filename option, the command to create an archive of an entire directory tree usually looks like this

tar cvf archive-name.tar directory-name

or, with compression:

tar czvf archive-name.tar.gz directory-name
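If you would like to try the create, list and extract commands without risking real data, the following sketch does all three in a scratch directory. The file names are invented for the example, and the -C option (change directory before operating) is not in the tables above; I have used it so that the archive holds relative path names.

```shell
#!/bin/bash
# Sketch: create, list and extract a compressed archive in a scratch directory.
WORK=$(mktemp -d)
mkdir -p "$WORK/project/src" "$WORK/restore"
echo "int main(void) { return 0; }" > "$WORK/project/src/main.c"

# c = create, z = gzip, f = archive name; -C changes directory first so the
# archive holds relative names (project/...) rather than absolute paths.
tar czf "$WORK/project.tar.gz" -C "$WORK" project

# t = list the contents without extracting anything
LISTING=$(tar tzf "$WORK/project.tar.gz")
echo "$LISTING"

# x = extract, here into a different directory
tar xzf "$WORK/project.tar.gz" -C "$WORK/restore"
RESTORED=$(cat "$WORK/restore/project/src/main.c")
echo "$RESTORED"
```

Listing an archive with t before extracting it is a cheap way of checking what a restore is about to write, and where.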

Here is an example of a very short script that uses tar to back up the databases on a Lotus Domino server to a SCSI tape drive. This script is simply placed in /etc/cron.daily and runs shortly after four a.m. along with the other cron jobs:

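#!/bin/bash
# Back up the Domino databases to tape. Domino is stopped first so that
# the .nsf files are not changing while tar reads them.

TAPE=/dev/st1
service domino stop
# Only run tar if the data directory is reachable
if cd /var/local/notesdata ; then
    tar czf "$TAPE" *.nsf *.NSF
fi
# Restart Domino whether or not the backup succeeded
service domino start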

Listing 2 - A simple script using tar for backup

cpio

The cpio (CoPy Input to Output) command is rather more obscure. In copy-out mode (the -o option), cpio reads the names of files on its standard input, one filename per line, and writes the files to an archive. The filenames are typically piped from some other program, often the find command.

In copy-in mode (the -i option), cpio reads files from an archive. By default, all files are extracted, but if any filenames are specified on the command line, they are interpreted as wildcard globbing patterns and only files which match will be extracted. The patterns do not work quite like shell filename wildcards; for example, slashes in filenames are treated as ordinary characters for pattern-matching purposes.

By default, cpio reads and writes archives through stdin and stdout. Use the -F option, or the -I or -O options, to specify an archive file.

cpio is much less commonly used in the Linux world than tar, so I do not propose to go into details here.
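For the curious, though, here is a minimal sketch of the usual find-plus-cpio pairing, working in a scratch directory. The directory and file names are invented for the example, and since cpio is not installed everywhere the script checks for it first.

```shell
#!/bin/bash
# Sketch: archive a directory tree with find | cpio, then restore it elsewhere.
WORK=$(mktemp -d)
mkdir -p "$WORK/src/docs" "$WORK/restore"
echo "hello" > "$WORK/src/docs/note.txt"

if command -v cpio >/dev/null 2>&1 ; then
    # Copy-out: find supplies the file names, cpio writes the archive to stdout.
    ( cd "$WORK/src" && find . -depth -print | cpio -o > "$WORK/backup.cpio" ) 2>/dev/null
    # Copy-in: -d creates leading directories as needed, -m keeps the mtimes.
    ( cd "$WORK/restore" && cpio -idm < "$WORK/backup.cpio" ) 2>/dev/null
    RESTORED=$(cat "$WORK/restore/docs/note.txt")
else
    RESTORED="cpio not installed"
fi
echo "$RESTORED"
```

Because the names come from find, any of its tests (by age, owner, size and so on) can be used to decide what goes into the archive.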

dump & restore

The dump utility is able to perform both full and differential (and - by implication - incremental) backups. It also records the dates and times at which various backups were performed, in the file /etc/dumpdates. To use dump effectively, you need to understand the concept of dump levels. A level zero (-0) dump performs a full backup of a filesystem (or directory tree, in the latest versions). Any other dump level (-1 through -9) directs dump to back up all files which were created or modified after the last dump of a lower level. So, to perform a full backup to tape:

dump -0uf /dev/st0 /home

will perform a full backup of the /home filesystem to the first SCSI tape drive, and the -u option will cause it to update the /etc/dumpdates file so that this backup is recorded. Now, a couple of nights later, the command

dump -1uf /dev/st0 /home

will cause a differential backup to be done. Since it is a level 1 backup, it will back up files created or modified since the last level 0 (full) backup, and the result will be output like this:

DUMP: Date of this level 1 dump: Tue Jan 6 01:17:36 2004
DUMP: Date of last level 0 dump: Sun Jan 4 01:17:35 2004
DUMP: Dumping /dev/vg00/lv01 (/home) to /dev/st0
DUMP: Added inode 8 to exclude list (journal inode)
DUMP: Added inode 7 to exclude list (resize inode)
DUMP: Label: none
DUMP: mapping (Pass I) [regular files]
DUMP: mapping (Pass II) [directories]
DUMP: estimated 29282 tape blocks.
DUMP: Volume 1 started with block 1 at: Tue Jan 6 01:18:02 2004
DUMP: dumping (Pass III) [directories]
DUMP: dumping (Pass IV) [regular files]
DUMP: Closing /dev/st0
DUMP: Volume 1 completed at: Tue Jan 6 01:19:29 2004
DUMP: Volume 1 29500 tape blocks (28.81MB)
DUMP: Volume 1 took 0:01:27
DUMP: Volume 1 transfer rate: 339 kB/s
DUMP: 29500 tape blocks (28.81MB) on 1 volume(s)
DUMP: finished in 46 seconds, throughput 641 kBytes/sec
DUMP: Date of this level 1 dump: Tue Jan 6 01:17:36 2004
DUMP: Date this dump completed: Tue Jan 6 01:19:29 2004
DUMP: Average transfer rate: 339 kB/s
DUMP: DUMP IS DONE

Listing 2 - Output of a typical dump command

You can arrange for dump to perform incremental backups by using a higher dump level number each night for a week - using -1 on Monday night, -2 on Tuesday night, and so on.
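One way to automate that schedule is to derive the dump level from the day of the week, as in the sketch below. The function name dump_level_for_day is my own, the tape device and filesystem are assumptions, and I have assumed the full backup happens on Sunday night. The script prints the dump command rather than running it, so it is safe to try anywhere.

```shell
#!/bin/bash
# Sketch: choose a dump level from the day of the week.
# Sunday gets level 0 (full); Monday to Saturday get levels 1 to 6.

dump_level_for_day() {
    # $1 is the ISO day number from `date +%u` (1 = Monday ... 7 = Sunday)
    echo $(( $1 % 7 ))
}

TAPE=/dev/st0          # assumed tape device
FILESYSTEM=/home       # assumed filesystem to dump

LEVEL=$(dump_level_for_day "$(date +%u)")
# Printed rather than executed, so this can be tried without a tape drive.
echo "dump -${LEVEL}uf $TAPE $FILESYSTEM"
```

Because each night's level is one higher than the night before, each dump picks up only the files changed since the previous night, and Sunday's level 0 starts the cycle again.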

The restore command is used to restore files from a dump tape. The simplest way to use it is to restore everything, but before doing that the filesystem should be (re)formatted, mounted and should be the current working directory:

mke2fs -j /dev/vg00/lv01
mount /dev/vg00/lv01 /home
cd /home
restore -rf /dev/st0

However, in the case where only one or a few files have been lost, it is best to use restore in its interactive mode, with the restore -i option. This will read the directory from the dump tape, then prompt you to navigate around the directory tree, adding files to the extraction list. Finally, an extract command will restore the files from the tape.

| Subcommand | Explanation |
|---|---|
| add filespec | Adds the specified file(s) or directory (and its subdirectories) to the list of files to be extracted. |
| cd dir | Changes the working directory |
| delete filespec | Deletes the specified file(s) or directory from the list of files to be extracted |
| extract | Begins extracting the required file(s) from the dump. You will be asked which volume you wish to start with - in a multi-volume tape set, it's usually fastest to start with the last volume and work back towards the first. |
| help | Summarizes available commands |
| ls filespec | Lists the specified file(s) or directory, much like the shell's ls command |
| pwd | Prints the working directory |
| quit | Quits without extracting |
| setmodes | Sets the owner, permissions and timestamps on directories in the extract list, but does not extract anything from the tape. |
| verbose | Toggles the verbose flag. With -v turned on, information about each file is printed as it is extracted. |

Table 5 - Interactive restore subcommands

Here is an example of a short interactive restore session, in which two files are restored. Note that the restore command has to be started in the directory that was dumped to tape:

[root@fulbert root]# cd /home
[root@fulbert home]# restore -if /dev/st0
restore > cd les
restore > ls *.txt
curfloo-emote-think-patch.txt
curfloo-gcc3_1-partial_patch.txt
curfloo-update.txt
matilda.txt
rfc1163.txt
spamips.txt
spamsort.txt
restore > add rfc1163.txt
restore: ./les: File exists
restore > cd work
restore > ls *.DOC
7S.DOC
BTW42.DOC
CSA.DOC
DRMMACH.DOC
HWCONTRT.DOC
HWCTRCOV.DOC
PACKLIST.DOC
SOLSDIR.DOC
VIDP1.DOC
restore > restore 7S.DOC
restore: unknown command; type ? for help
restore > add 7S.DOC
restore: ./les/work: File exists
restore > extract
You have not read any volumes yet.
Unless you know which volume your file(s) are on you should start
with the last volume and work towards the first.
Specify next volume # (none if no more volumes): 1
set owner/mode for '.'? [yn] n
restore > quit
[root@fulbert home]#
Listing 3 - Performing an interactive restore

Although the preceding examples have used tape, it's worth noting that the same commands can be used with removable hard drives, DVD-RAM, optical drives and many other kinds of media.

rsync

The rsync command is A Thing of Beauty and a Joy Forever, and all the nicer for being written by a local leading light of the Linux community, Andrew Tridgell (of Samba fame). The program is based upon an algorithm that compares two files and only transfers the bits of the files that are different, thereby conserving bandwidth and making the best use of slow links. You can use rsync to back up files between directories (and implicitly, media) on a single machine, or to synchronise directories between machines. The basic syntax of the command is:

rsync option ... source ... [user@]host:destination

which copies local files or directories to a remote system, or

rsync option ... [user@]host:source ... destination

which copies from a remote machine to the local one. In both cases, the rsync command will connect to the remote system, invoke a shell (using either the rsh or ssh commands) and run rsync at the far end. It is also possible to connect directly to an rsync daemon on the remote system, but this technique is less common today, since using ssh provides better security with less configuration effort. In fact, the latest version of rsync (2.6.0) uses the ssh protocol by default.

There are lots of options for the rsync command, but some of the most important are:

| Option | Explanation |
|---|---|
| -e ssh | Uses ssh as the shell on the remote machine. This is much more secure than rsh (the default in versions before 2.6.0), since it utilizes stronger authentication and encrypts the data being transferred. |
| -u | Update only - will not overwrite newer files on the destination |
| -t | Set the timestamps on the remote files to the same as the local files. This allows rsync to skip files that have the same length and mtime. |
| -v | Increases verbosity. Use two v's for more detail, or three if you want to be bombarded with detail. |
| -q | Quiet operation - suppresses informational messages from the remote server. |
| -r | Recursive operation - required to copy directories |
| -l | Preserve symbolic links |
| -p | Set the permissions on the remote system to be the same as the local permissions. |
| -o | Sets the owner of the destination file to be the same as for the local file. Can only be done as root. |
| -g | Sets the group of the destination file to be the same as for the local file. Only root can set arbitrary groups; other users are limited to groups they belong to. |
| -z | Compress data being sent to a remote machine |
| -a, --archive | Equivalent to -rlptgoD |
| -b | Make backups of files (by default, with a ~ suffix) |
| --progress | Show progress during file transfers |

Table 6 - Major rsync options

Some examples will make these options clearer:

To transfer the contents of the directory scripts on my workstation to the same directory on our Samba server, I'll first of all run the ssh-agent, so that I can get it to hold my ssh keys, then add a key with the ssh-add command. This means that I needn't keep supplying the password for each ssh-related command. Then, I invoke rsync, with a few well-chosen options:

[les@sleipnir les]$ ssh-agent $SHELL
[les@sleipnir les]$ ssh-add
Enter passphrase for /home/les/.ssh/id_rsa:
Identity added: /home/les/.ssh/id_rsa (/home/les/.ssh/id_rsa)
[les@sleipnir les]$ rsync -rtpo -e ssh --progress scripts les@fulbert:
        189 100%    0.00kB/s    0:00:00
        503 100%    0.00kB/s    0:00:00
         86 100%    0.00kB/s    0:00:00
        107 100%    0.00kB/s    0:00:00
[les@sleipnir les]$

Now, I can go ahead and edit files in that directory; when I reinvoke the rsync command, only the modified files will be transferred (because the files are so small, the entire files are sent - with larger files, only parts of them would be):

[les@sleipnir les]$ rsync -rtpo -e ssh --progress scripts les@fulbert:
         87 100%    0.00kB/s    0:00:00
[les@sleipnir les]$

If I use the -b (backup) option, the destination file will be backed up before being overwritten:

[les@sleipnir les]$ rsync -rtpob -e ssh --progress scripts les@fulbert:
        188 100%    0.00kB/s    0:00:00
[les@sleipnir les]$

Following this command, the remote directory (/home/les/scripts on fulbert) looks like this:

[les@fulbert scripts]$ ls -l
total 20
-rwxrwxr-x    1 les      les           188 Jan  6 17:37 counter
-rwxrwxr-x    1 les      les           189 Jan  6 17:36 counter~
-rwxrwxr-x    1 les      les           503 Jan  6 16:22 remindme
-rwxrwxr-x    1 les      les            87 Jan  6 17:30 shifter
-rwxrwxr-x    1 les      les           107 Jan  6 17:13 stat
[les@fulbert scripts]$

As you can see, the counter file has been updated, but the previous version remains as counter~.
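You can see the same behaviour without a second machine, since rsync works just as well between two local directories. The sketch below uses a scratch directory and invented file names, and it checks that rsync is installed before doing anything.

```shell
#!/bin/bash
# Sketch: rsync between two local directories, with -b keeping old versions.
WORK=$(mktemp -d)
mkdir -p "$WORK/scripts" "$WORK/mirror"
echo "version 1" > "$WORK/scripts/counter"

if command -v rsync >/dev/null 2>&1 ; then
    # First run copies everything; the trailing slash means "contents of".
    rsync -a "$WORK/scripts/" "$WORK/mirror/"
    # Change the file (and its length), then sync again with -b so that the
    # old copy in the mirror is kept as counter~ before being replaced.
    echo "version 2, now longer" > "$WORK/scripts/counter"
    rsync -ab "$WORK/scripts/" "$WORK/mirror/"
    OLD_COPY=$(cat "$WORK/mirror/counter~")
    NEW_COPY=$(cat "$WORK/mirror/counter")
else
    OLD_COPY="rsync not installed"
    NEW_COPY="rsync not installed"
fi
echo "mirror/counter~ : $OLD_COPY"
echo "mirror/counter  : $NEW_COPY"
```

Pointing the destination at a mounted removable drive instead of the mirror directory gives a simple, fast backup to a second disk.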

Perhaps the best way to use rsync is to invoke it from within a script, possibly run as a cron job. The following short example script provides basic functionality, including a test to check that the backup host is reachable and logging of errors to the system log. While the script is suitable for running as a cron job, it is up to the user to arrange a suitable authentication method for remote access as root (e.g. a private key with no passphrase, but the security implications are obvious).

#!/bin/bash
# Push a set of directories to a backup host with rsync over ssh.

export PATH="/bin:/usr/bin:/usr/local/bin"

BACKUPHOST="samba"
CONNECTAS="root"
SOURCEDIRS="/home /var/profile /var/samba/netlogon /var/www"
DESTDIR="/home/backup"
OPTIONS="-rtogp -e ssh"

# Give up early if the backup host does not answer a single ping
if ! ping -c 1 "$BACKUPHOST" > /dev/null 2>&1 ; then
   logger -s -p syslog.err "$0: Backup host $BACKUPHOST is down or unreachable"
   exit 1
fi

# SOURCEDIRS and OPTIONS are deliberately unquoted so that they split into words
for DIR in $SOURCEDIRS ; do
   if ! rsync $OPTIONS "$DIR" "$CONNECTAS@$BACKUPHOST:$DESTDIR" > /dev/null 2>&1 ; then
       logger -s -p syslog.err "$0: rsync backup of $DIR failed"
       exit 1
   fi
done

Listing 4 - A simple backup script using rsync

In short, rsync is a tremendously powerful utility that will easily repay time spent studying and experimenting with it.

Summary

In this article, I have shown a number of different commands that are part of every Linux distribution and can be used to back up to various types of media. I hate to say it, but you now have no excuses!

References and Further Reading

Preston, W. Curtis, "Unix Backup and Recovery", O'Reilly

The rsync Web Site: http://rsync.samba.org/

Various man pages for tar, cpio, dump
Page last updated: 07/Sep/2004. Copyright © 1987-2010 Les Bell and Associates Pty Ltd. All rights reserved. webmaster@lesbell.com.au