# Remote backups server using FreeBSD, ZFS, and Rsync



## phoenix (Apr 30, 2009)

Updates
*2010-04-04*:
We've rolled out version 3 of our rsbackup system.  See this thread for more information on it.  (Version 2 never really saw the light of day, but was used as a stepping stone to the refactored version 3.)

Intro
A co-worker and I developed a centralised backup solution using FreeBSD, ZFS, and Rsync.  The following set of posts describe how we did it.

*Note:* this is fairly long, and includes code dumps from all the scripts and config files used.

Server Hardware
Our central backup server uses the following hardware:

Chenbro 5U rackmount case, with 24 hot-swappable drive bays, and a 4-way redundant PSU
Tyan h2000M motherboard
2x dual-core Opteron 2200-series CPUs at 2.2 GHz
8 GB ECC DDR2-SDRAM
3Ware 9550SXU PCI-X RAID controller in a 64-bit/133 MHz PCI-X slot
3Ware 9650SE PCIe RAID controller in an 8x PCIe slot
Intel PRO/1000MT 4-port gigabit PCI-X NIC
24x 500 GB SATA harddrives
2x 2 GB CompactFlash cards in CF-to-IDE adapters

OS Configuration
We're currently running the 64-bit amd64 version of FreeBSD 7.1.  We'll be upgrading to 7.2 once it's released.  And we are anxiously awaiting the release of 8.0 with ZFSv13 support.

Two of the gigabit NIC ports are combined using lagg(4) and connected to one gigabit switch.  We're considering adding the other two ports to the lagg interface, but we're waiting for a new managed switch that supports LACP before we do.
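
For reference, a bonded setup like ours can be sketched in /etc/rc.conf (the em0/em1 interface names and the address are assumptions; failover is used here since our current switch doesn't speak LACP):

```
ifconfig_em0="up"
ifconfig_em1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport em0 laggport em1 192.168.1.10/24"
```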

The 2 CF cards are configured as gm0 using gmirror(8).  / and /usr are installed on gm0.

The 3Ware RAID controllers are configured basically as glorified SATA controllers.  Each drive is configured as a "SingleDrive" array and appears to the OS as a separate drive.  Using SingleDrive instead of JBOD allows the RAID controller to use the onboard cache, and allows us to use the 3dm2 monitoring software.  Each drive is also named after the slot/port it is connected to (disk01 through disk24).

The 24 harddrives are also labelled using glabel(8), according to the slot they are in, using the same names as the RAID controller uses (disk01 through disk24).
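
A sketch of the mirroring and labelling commands (the ad0/ad2 and da0/da1 device names are assumptions; the glabel command is repeated for each of the 24 drives):

```
# gmirror label -v gm0 ad0 ad2
# glabel label -v disk01 da0
# glabel label -v disk02 da1
```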

The drives are added to a ZFS pool as 3 separate 8-drive raidz2 vdevs, as follows:

```
# zpool create storage raidz2 label/disk01 label/disk02 label/disk03 label/disk04 label/disk05 label/disk06 label/disk07 label/disk08
# zpool add    storage raidz2 label/disk09 label/disk10 label/disk11 label/disk12 label/disk13 label/disk14 label/disk15 label/disk16
# zpool add    storage raidz2 label/disk17 label/disk18 label/disk19 label/disk20 label/disk21 label/disk22 label/disk23 label/disk24
```

This creates a "RAID0" stripe across the three "RAID6" arrays.  The total storage pool size is just under 11 TB.


```
# zpool status
  pool: storage
 state: ONLINE
 scrub: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        storage           ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk01  ONLINE       0     0     0
            label/disk02  ONLINE       0     0     0
            label/disk03  ONLINE       0     0     0
            label/disk04  ONLINE       0     0     0
            label/disk05  ONLINE       0     0     0
            label/disk06  ONLINE       0     0     0
            label/disk07  ONLINE       0     0     0
            label/disk08  ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk09  ONLINE       0     0     0
            label/disk10  ONLINE       0     0     0
            label/disk11  ONLINE       0     0     0
            label/disk12  ONLINE       0     0     0
            label/disk13  ONLINE       0     0     0
            label/disk14  ONLINE       0     0     0
            label/disk15  ONLINE       0     0     0
            label/disk16  ONLINE       0     0     0
          raidz2          ONLINE       0     0     0
            label/disk17  ONLINE       0     0     0
            label/disk18  ONLINE       0     0     0
            label/disk19  ONLINE       0     0     0
            label/disk20  ONLINE       0     0     0
            label/disk21  ONLINE       0     0     0
            label/disk22  ONLINE       0     0     0
            label/disk23  ONLINE       0     0     0
            label/disk24  ONLINE       0     0     0
```


```
# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
storage                10.9T   5.11T   5.76T    47%  ONLINE     -
```

We then created ZFS filesystems for basically everything except / and /usr:

/home
/tmp
/usr/local
/usr/obj
/usr/ports
/usr/ports/distfiles
/usr/src
/var
/storage/backup

We enabled lzjb compression on /usr/ports and /usr/src, and disabled it on /usr/ports/distfiles.  And we enabled gzip-9 compression on /storage/backup.  We also disabled atime updates on everything except /var.
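
The compression and atime settings are applied per-filesystem; a sketch (assuming the /usr/* filesystems are datasets in the same pool, e.g. storage/usr/ports mounted on /usr/ports):

```
# zfs create storage/backup
# zfs set compression=gzip-9 storage/backup
# zfs set atime=off storage/backup
# zfs set compression=lzjb storage/usr/ports
# zfs set compression=lzjb storage/usr/src
# zfs set compression=off storage/usr/ports/distfiles
```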


----------



## phoenix (Apr 30, 2009)

RSBackup
We developed a "simple" set of shell scripts that perform remote backups of Linux and FreeBSD systems using rsync and ZFS snapshots.  The scripts run a sequential series of rsync connections for all servers at a remote site, while also doing multiple sites in parallel. It uses SSH (as user rsbackup, with a password-less RSA key) to connect to the remote server, then uses rsync to send data back through the SSH connection. Backups are stored on a ZFS filesystem (/storage/backup/), with a separate directory for each site, and separate sub-directories for each server. Before each nightly backup run, a ZFS snapshot is taken of the /storage/backup filesystem, named using the current date, in YYYY-MM-DD format.

We called our solution *rsbackup*.

rsbackup is configured to run every night starting at 7 pm, via root's crontab.  The crontab looks like this:


```
SHELL=/bin/sh
MAILTO=root
PATH=/sbin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin

#min    hour    day     month   weekday         command
*/15    *       *       *       *               /root/scripts/check-fs.sh

# Take a snapshot of the backups filesystem
50      18      *       *       *               /root/rsb/rsb-snapshot

# Run the rsbackup script
0       19      *       *       mon-fri         /root/rsb/rsb-wrapper force
0       19      *       *       sat-sun         /root/rsb/rsb-wrapper start
50      6       *       *       mon-fri         /root/rsb/rsb-wrapper stop
```

The crontab above shows most of the helper scripts that are used (rsb-one is run manually):

check-fs
rsb-snapshot
rsb-wrapper
rsb-one

*check-fs* checks the status of the gmirror and the zpool to make sure there are no checksum errors, dying drives, missing drives, degraded vdevs, and so on.  If there are, then an e-mail is delivered with the details of the issues.

*rsb-snapshot* pulls in the rsbackup config file to determine which filesystem to snapshot, then creates a snapshot using the current date as the snapshot name (YYYY-MM-DD format).

*rsb-wrapper* pulls in the rsbackup config file, then checks if any other rsbackup processes are running. If there are any, a warning is displayed and the wrapper exits. If there are none, then the backup process is started. rsb-wrapper is also run just prior to 7 am, to check if any rsync processes are still running, and to kill them if they are (we didn't want backups running during the day, as they would hog all the upload bandwidth for the remote sites). Error and warning messages from all the log files are then sent via e-mail to the address listed in the crontab.

*rsb-one* can be used from the command-line to do a manual backup of a single server at a single site.  It uses the same config file as the rest of the scripts.  Command syntax is:
`# rsb-one -s sitename -h hostname`

The gist of the backup process is this:

every night, a ZFS snapshot is created of the /storage/backup filesystem.  This becomes the historical backup for everything, as one can navigate through all the snapshots via /storage/backup/.zfs/snapshot/<snapname>/<sitename>/<server>/.
every night, a full rsync is done of virtually every file on the remote systems against a local directory for that server.
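
The snapshot hierarchy makes browsing history trivial; for example (the snapshot dates and site/server names here are made up):

```
# ls /storage/backup/.zfs/snapshot/
2009-04-28      2009-04-29      2009-04-30
# ls /storage/backup/.zfs/snapshot/2009-04-30/testsite/testserver/etc/
```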

We are currently backing up 102 remote servers.  The backups start at 7pm, the rsync for the last server starts around 2am, and everything is finished by 4am.

The size of the snapshots fluctuates daily, but the average is under 10 GB.  The base storage required for those 102 servers is ~ 4 TB, which gives us over 500 days of daily backups, well over the 13 months we were hoping for.
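
The retention figure is simple arithmetic (a back-of-the-envelope sketch using the numbers above, treating 1 TB as 1000 GB and ignoring compression, which only helps):

```shell
pool_gb=10900     # ~10.9 TB pool
base_gb=4000      # ~4 TB of base backup data for the 102 servers
per_day_gb=10     # average daily snapshot growth
days=$(( (pool_gb - base_gb) / per_day_gb ))
echo "$days"      # 690 days of history
```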


----------



## phoenix (Apr 30, 2009)

Remote Server Config
The rsbackup system requires a bit of setup on both the central backup server and the remote server(s).  The following shows how to configure a Debian Linux host for backups.

On the remote host


install rsync (preferably 3.0.x, as it has much reduced CPU and RAM usage, and it starts sending file changes while generating the file list)
create backups group
*addgroup --system backups*
create rsbackup user
*adduser rsbackup*
manually set the password field to * in /etc/shadow to prevent password logins; the shell can be left as /bin/sh, as there are no interactive logins
add rsbackup to group backups
*adduser rsbackup backups*
edit sudo config to allow backups group to run rsync with no password
*visudo
Cmnd_Alias RSYNC = /usr/bin/rsync
%backups ALL=(ALL) NOPASSWD: RSYNC*
create .ssh/ directory in ~rsbackup/
*mkdir ~rsbackup/.ssh*
create blank authorized_keys file
*touch ~rsbackup/.ssh/authorized_keys*
set correct permissions on .ssh/ directory and .ssh/authorized_keys file
*chmod 700 ~rsbackup/.ssh
chmod 600 ~rsbackup/.ssh/authorized_keys*
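
The steps above can be rolled into a single setup script; a sketch for Debian, run as root (untested here; it appends to /etc/sudoers directly for brevity, where the steps above use visudo):

```
#!/bin/sh
# one-time rsbackup setup on a Debian host (sketch)
addgroup --system backups
adduser --disabled-password --gecos "rsbackup" rsbackup
adduser rsbackup backups

# let the backups group run rsync via sudo with no password
echo 'Cmnd_Alias RSYNC = /usr/bin/rsync'  >> /etc/sudoers
echo '%backups ALL=(ALL) NOPASSWD: RSYNC' >> /etc/sudoers

# SSH key directory for the rsbackup user
mkdir ~rsbackup/.ssh
touch ~rsbackup/.ssh/authorized_keys
chmod 700 ~rsbackup/.ssh
chmod 600 ~rsbackup/.ssh/authorized_keys
chown -R rsbackup ~rsbackup/.ssh
```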

On the central backup server

copy public SSH key for rsbackup to remote server
*scp /root/rsb/conf/rsbackup.rsa.pub remoteserver:*
on the remote server, move the rsbackup.rsa.pub file to ~rsbackup/.ssh/authorized_keys
test SSH logins using the key (must be done as root)
*ssh -l rsbackup -i /root/rsb/conf/rsbackup.rsa -p <portnum> <server>*
test that rsbackup can run rsync via sudo without passwords, but cannot run any other commands via sudo
*sudo /bin/ls   (should fail)
sudo rsync --version     (should work)*


----------



## phoenix (Apr 30, 2009)

Central Backup Server Config
All rsbackup-related stuff is (currently) stored under */root/rsb/* (ideally, it should be stored under /usr/local/ to follow hier(7)).

The example below shows the configuration steps used for *testserver*.


If this is the first server added for a site, create a site directory, using the DNS name for the site, under */root/rsb/sites/*
*mkdir sites/testsite*
Create/edit the site_defaults file
*cp /root/rsb/conf/site_defaults /root/rsb/sites/testsite/
ee sites/testsite/site_defaults*
Create a config file for the server.
*ee sites/testsite/testserver*
Add/edit at least the following:
*RSYNC_SERVER=testserver.hostname
SERV_DIR=testserver*
Add any overrides for items in the global defaults (mainly SSH port to use)
If there are special excludes for this server, add the following
*RSYNC_EX_SERVER=$SITE_CONF/exclude.testserver*
Add/edit the exclude file listed above
*ee sites/testsite/exclude.testserver*
Connect via ssh to add the host to the known_hosts file
*ssh -l rsbackup -i /root/rsb/conf/rsbackup.rsa testserver.hostname*
Add the site to the global sites list
*ee conf/sites.lst*
Rename the server config file to end in .cfg (only server config files ending in .cfg are processed)
*mv sites/testsite/testserver sites/testsite/testserver.cfg*
The site and server(s) will be picked up in the next run of rsbackup via cron

Try to only add 1 or 2 new servers per day. The initial rsync run takes a long time, as it has to copy over every file in the system. Any still-running rsync processes will be killed at 7 am weekdays, so the initial sync may be spread across multiple days. Adding servers on Friday is best, as the rsync processes will run until complete or Monday at 7 am, whichever comes first.


----------



## phoenix (Apr 30, 2009)

Restoring From Backups
Every snapshot that is created can be navigated via the hidden *.zfs/snapshot/<snapshotname>/* directory hierarchy. The .zfs directory is placed in the root of the ZFS filesystem. As you navigate through the snapshot hierarchy, ZFS automatically mounts the snapshot as a read-only filesystem. You can also manually mount the snapshot as read-only using *mount -t zfs*. In this way, you can restore files from either the most recent backup (the normal filesystem hierarchy) or from any previous backup (the snapshot hierarchy).

To manually mount a snapshot (as root):
`# mount -t zfs -r storage/backup@2008-09-12 /mnt`

You can clean up the output of mount by periodically running (as root):
`# mount | grep 'backup@' | awk '{ print $3 }' | xargs -n 1 umount`

Individual Files/Folders

SSH to the central backups server
Switch to root
cd into the /storage/backup/.zfs/snapshot/ directory
Do an ls to see all the available snapshot dates
cd into the desired snapshot directory
cd into the <site>/<server>/ directory
find the file/folder you need and scp it back to the server in question
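
For example, pulling a week-old copy of a config file back to the server it came from (the date, site, and server names are made up):

```
# cd /storage/backup/.zfs/snapshot/2009-04-23/testsite/testserver/etc/
# scp fstab root@testserver.hostname:/etc/fstab
```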

Complete System Restore - Linux
In order for this to work correctly, the username you use in the rsync command will need to be part of the sudoers users/groups that can run rsync on the central backup server.

Boot replacement server off a Linux LiveCD (Knoppix/Kanotix/etc).
Partition the drive(s) as needed using cfdisk (see fstab in the server's etc directory on the central backup server).
Format the partitions as needed (see fstab in the server's etc directory on the central backup server).
*mkfs -t ext3 /dev/sda1
mkfs -t xfs /dev/sda5 
mkfs -t xfs /dev/sda6 
and so on*
Mount the partitions under /mnt.
*mount -t xfs /dev/sda5 /mnt 
mkdir /mnt/boot /mnt/usr /mnt/home /mnt/var 
mount -t ext3 /dev/sda1 /mnt/boot 
mount -t xfs /dev/sda6 /mnt/usr 
and so on*
cd to /mnt (not really needed, but a good safety-net, just in case).
Run rsync to copy everything from the central backup server to the local server
*Note 1:* --numeric-ids is *very* important, do not forget this option, or things will fail in spectacular ways!
*Note 2:* -H is needed to restore hardlinks to various files. Without this, the restore will be significantly larger. 
*# rsync -vaH --partial --stats --numeric-ids --rsh=ssh --rsync-path="sudo rsync" username@backupserver:/storage/backup/<site>/<server>/ /mnt/*
Grab a coffee as it does the transfer. Time depends on the size of the dataset being restored.
Install GRUB into the boot sector of the harddrive.
*grub-install --no-floppy --recheck /dev/sda 
grub-install --no-floppy /dev/sda*
Reboot the server to make sure everything comes up correctly.

For the last step, where you run rsync, you can use a ZFS snapshot directory to restore the server to any day. Instead of */storage/backup/<site>/<server>/* you can use */storage/backup/.zfs/snapshot/<snapshotdate>/<site>/<server>/*
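
For example, restoring from the April 23rd snapshot instead of the most recent backup (the date is made up):

```
# rsync -vaH --partial --stats --numeric-ids --rsh=ssh --rsync-path="sudo rsync" \
    username@backupserver:/storage/backup/.zfs/snapshot/2009-04-23/<site>/<server>/ /mnt/
```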


----------



## phoenix (Apr 30, 2009)

Complete System Restore - FreeBSD
In order for this to work correctly, you will need to be part of the sudoers users/groups on the central backup server that can run rsync without requiring a password.

First, do a minimal install of FreeBSD, to make the drives bootable:

Boot replacement server using the FreeBSD install CD.
Select Canada as the country.
Select USA ISO as the keymap.
Select Standard install.
Select OK on the warning message.
Delete all existing partitions. Press A to create a single partition for FreeBSD. Mark it as Bootable. Press Q to save the changes.
Select Standard MBR (no boot manager).
Select OK on the warning message.
Create the partitions needed (see the fstab under /storage/backup/<site>/<server>/etc/). Press Q to save the changes.
Select Minimal install.
Select FTP Passive for the installation media (or CD/DVD if using the full CD1).
Select Main Site.
Select the correct network device (xl0 on my test server).
Select No for IPv6.
Select Yes for DHCP.
Enter the correct hostname.
Select Yes on the warning message.
Wait as it does the minimal install.
Select OK on the completion message.
Select No for "function as a network gateway".
Select No for "configure inetd".
Select No for "enable SSH login".
Select No for "anonymous FTP".
Select No for "NFS server".
Select No for "NFS client".
Select No for "customize system console".
Select Yes for "set this machine's time zone".
Select No for "Is this machine's CMOS clock set to UTC".
Select America -- North and South for region.
Select Canada for country.
Select Pacific Time - west British Columbia for timezone.
Select Yes for "PDT".
Select No for "Linux compatibility".
Select No for "mouse".
Select No for "browse the package collection".
Select Yes for "add any initial user accounts".
Select User for "User and group management".
Fill in the blanks. The exact contents don't matter, as the rsync restore will wipe this out. This is just for testing during the initial boot.
Select Exit for "User and group management".
Select OK on the warning message.
Type root's password twice.
Select No on the warning message.
Press Tab key to get to "Exit Install". Press enter.
Select Yes on the warning message to exit the installer and reboot the system.

Test that the new install boots correctly, and that you can login from the console.

Then follow the steps below to restore the data from the backups server.


Boot replacement server off a FreeBSD LiveCD that includes rsync (Frenzy/FreeSBIE/etc). Frenzy 1.1 seems to work best.
Type *nohdmnt* at the boot menu, to prevent the existing filesystems from being mounted automatically.
Enable modifying of drives while the system is running.
*sysctl -w kern.geom.debugflags=16*
Create a directory to use for the mount point of the harddrive partitions.
*mkdir /root/media*
Mount the partitions under /root/media
*mount /dev/ad4s1a /root/media
mkdir /root/media/usr /root/media/var /root/media/home
mount /dev/ad4s1d /root/media/usr
mount /dev/ad4s1e /root/media/var
mount /dev/ad4s1f /root/media/home
and so on*
Change to /root/media (not really needed, but a good safety-net, just in case).
Run rsync to copy everything from the central backup server to the local server.
*Note 1:* --numeric-ids is *very* important, do not forget this option, or things will fail in spectacular ways!
*Note 2:* -H is needed to restore hardlinks to various files. Without this, the restore will be huge, and will fail. FreeBSD uses hardlinks a lot!
*rsync -vaH --partial --inplace --stats --numeric-ids --rsh="ssh" --rsync-path="sudo rsync" username@backupserver:/storage/backup/<site>/<server>/ /root/media/*
Grab a coffee as it does the transfer. Length of the restore depends on the size of the dataset being restored.
Reboot the server, without any CDs in the drive, to make sure everything comes up correctly. Test that you can login from the console.

For the last step, where you run rsync, you can use a ZFS snapshot directory to restore the server to any day. Instead of */storage/backup/<site>/<server>/* you can use */storage/backup/.zfs/snapshot/<snapshotdate>/<site>/<server>/*


----------



## phoenix (Apr 30, 2009)

The rsbackup Script
This is version two of our prototype rsbackup. It's still a little rough around the edges, and spread across too many separate files. It works well for us, but it's not as pretty as it should be. There are a couple of different coding styles, and some options may no longer be in use. We're hoping to clean it up over the summer, when school is not in session (we don't want to disrupt the backups during the school year). We'd also like to amalgamate rsbackup, rsb-one, and rsb-snapshot.


```
#!/bin/sh                                                                                                                                              

Defaults="rsbackup.conf"
. $Defaults             

# Functions used in this script
do_rsync()                     
{                              
        SITE_CONF="${SERVERS_DIR}/${1}"

        #find each .cfg file in the passed dir, load defaults, site_defaults, server_defaults, and run the rsync
        for I in $( find $SITE_CONF  -type f -name "*.cfg" ); do                                                  

                # Load Standard defaults
                . $Defaults            

                # Load site wide defaults
                if [ -f $SITE_CONF/site_defaults ]; then
                     . $SITE_CONF/site_defaults         
                fi                                      

                # Load server specific options
                if [ ! -z $I ]; then         
                        if [ -f $I ]; then   
                          . $I               
                        fi                   
                fi                           

                # make sure the site directory exists
                if [ ! -e $BACKUP_DIR/$SITE_DIR ]; then
                        mkdir $BACKUP_DIR/$SITE_DIR    
                fi                                     

                # just to make typing easier
                S_DIR="$BACKUP_DIR/$SITE_DIR/$SERV_DIR"

                # make sure the directory for the server itself exists
                if [ ! -e $S_DIR ]; then
                        mkdir $S_DIR
                fi

                echo "" >> $logfile
                echo "====>> $( date "+%b %d %Y: %H:%M" )  Starting rsync for $RSYNC_SERVER" >> $logfile
                echo "" >> $logfile

                # The actual rsync command
                rsync $RSYNC_OPTIONS $RSYNC_SITE_OPTIONS $RSYNC_SRV_OPTIONS \
                        --exclude-from=$RSYNC_EX_DEF $RSYNC_EXTRA_EXCLUDE \
                        --rsync-path="$RSYNC_EXEC" --rsh="$RSYNC_SSHCMD -p $RSYNC_PORT -i $RSYNC_SSH_KEY" \
                        --log-file=/var/log/rsbackup/$RSYNC_SERVER.log \
                        $RSYNC_USER@$RSYNC_SERVER:/ $S_DIR

                echo "" >> $logfile
                echo "====>> $( date "+%b %d %Y: %H:%M" )  Ending rsync run for $RSYNC_SERVER" >> $logfile

        done
}


# run the rsync for each site listed in sites.lst
for site in $( cat ${CONF_DIR}/sites.lst ); do
        echo ""
        echo "****>> $( date "+%b %d %Y: %H:%M" )  Starting sequential run for servers at ${site}" >> $logfile
        do_rsync ${site} &
        sleep $SLEEPTIME
done
```


----------



## phoenix (Apr 30, 2009)

rsbackup.conf
This is the main configuration file that all the scripts use.  It lists where the log files should be stored, how long to wait between sites, the default options used for the rsync command, and so on.


```
RS_DIR="/root/rsb"
SERVERS_DIR="$RS_DIR/sites"
CONF_DIR="$RS_DIR/conf"
logfile="/var/log/rsbackup/rsbackup.log"

# Where all the backups are stored
BACKUP_DIR="/storage/backup"

#Default options for rsync
# RSYNC_OPTIONS      are the defaults for the rsbackup system
# RSYNC_SITE_OPTIONS are the overrides that apply to all systems at one site (set in sites/<site>/site_defaults file)
# RSYNC_SRV_OPTIONS  are the overrides that apply to one specific server (set in sites/<site>/<server>.cfg file)

RSYNC_OPTIONS="--archive --stats --numeric-ids --delete-during --partial --inplace --hard-links"
RSYNC_SITE_OPTIONS="--compress --compress-level=9"
#RSYNC_SITE_OPTIONS=""
RSYNC_USER="rsbackup"
RSYNC_PORT="55556"
RSYNC_EX_DEF="$CONF_DIR/exclude.default.linux"
RSYNC_SSH_KEY="$CONF_DIR/rsbackup.rsa"
RSYNC_EXEC="sudo rsync"
RSYNC_EX_MEDIA="$CONF_DIR/exclude.pass1"
RSYNC_EX_SERVER=""
RSYNC_SSHCMD="/usr/local/bin/ssh"

SLEEPTIME=250
```


----------



## phoenix (Apr 30, 2009)

rsb-wrapper
This is the wrapper script that is run via cron.

When called with the parameter *force*, it will start rsbackup, no questions asked.

When called with the parameter *start*, it will check for other running rsbackup processes.  If there are any, then it outputs a warning message and exits without starting rsbackup.  If there are no running rsbackup processes, then it starts one.

When called with the parameter *stop*, it will unconditionally kill any running rsync and rsbackup processes.  It will then tail and grep all the log files for warnings and errors, and echo them so cron can send them in an e-mail.


```
#!/bin/sh                               

# Set custom PATH
export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

# Set the default exit value
exitval=0                   

# Grab the PID of the current script
pid=$$                              

# Get info on how we were called
curdir=$( /usr/bin/dirname ${0} )

# Pull in the config file for rsbackup, which should be in the same directory we are called from
if [ -e ${curdir}/rsbackup.conf ]; then                                                         
        . ${curdir}/rsbackup.conf                                                               
        cd $RS_DIR                                                                              
else                                                                                            
        echo "Error:  unable to load the config file."                                          
        exit 1                                                                                  
fi                                                                                              

# Functions used in this script
check_logs()                   
{                              
        local word="${1}"      

        cd /var/log/rsbackup

        for log in $( ls *.log ); do
                msg_head="${log}: " 
                msg_body="$(tail -1 ${log} | grep "${word}" )"
                if [ "${msg_body}x" != "x" ]; then            
                        echo ${msg_head}                      
                        echo ${msg_body}                      
                        echo ""                               
                fi                                            
        done                                                  
}                                                             


# Main script
case "$1" in 
        [Ff][Oo][Rr][Cc][Ee])
                echo "Forcing rsbackup to start"
                ./rsbackup > /dev/null 2>&1 &
                ;;
        [Ss][Tt][Aa][Rr][Tt])
                # Check if any rsync/rsbackup processes are already running, and abort if there are
                numrunning=$( pgrep -lf rsbackup | grep rsync | wc -l | cut -c 8- )

                if [ ${numrunning} -eq 0 ]; then
                        echo "Starting rsbackup"
                        cd ${RS_DIR}
                        ./rsbackup > /dev/null 2>&1 &
                else
                        echo "Warning:  other rsbackup processes are running.  Not starting."
                        exitval=2
                fi
                ;;
        [Ss][Tt][Oo][Pp])
                # Check if there are any running rsync/rsbackup processes; nothing to stop if there aren't
                numrunning=$( pgrep -lf rsbackup | grep rsync | grep -v rsb-wrapper | wc -l | cut -c 8- )

                if [ ${numrunning} -gt 0 ]; then
                        echo -n "Attempting to forcibly stop rsbackup ... "

                        pkill -9 -f rsbackup

                        sleep 3
                        numrunning=$( pgrep -lf rsbackup | grep rsync | grep -v rsb-wrapper | wc -l | cut -c 8- )
                        if [ ${numrunning} -gt 0 ]; then
                                echo "ERROR!"
                                echo "Unable to stop all processes."
                                exitval=1
                        else
                                echo "done."
                                echo ""
                                exitval=0
                        fi
                else
                        echo "No running rsbackup processes.  Nothing to stop."
                        exitval=0
                fi


                echo "Checking logs for warnings"
                echo "----------------------------------"
                check_logs "warning"

                echo ""

                echo "Checking logs for errors"
                echo "----------------------------------"
                check_logs "error"
                ;;
esac

exit $exitval
```


----------



## phoenix (Apr 30, 2009)

rsb-snapshot
This script just pulls in the central config file, figures out which ZFS filesystem is being used, and creates a snapshot of it.  The snapshot is named after the current date, using YYYY-MM-DD as the format.


```
#!/bin/sh

export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin

# Get info on how we were called
curdir=$( /usr/bin/dirname ${0} )

# Pull in the config file for rsbackup, which should be in the same directory we are called from
if [ -e ${curdir}/rsbackup.conf ]; then
        . ${curdir}/rsbackup.conf
else
        echo "Error:  unable to load the config file."
        exit 1
fi

# Get today's date, formatted as YYYY-MM-DD
today=$( date "+%Y-%m-%d" )

# Remove any leading slashes from storage directory
if [ $( echo $BACKUP_DIR | /usr/bin/cut -c 1 ) = "/" ]; then
        backupdir=$( echo $BACKUP_DIR | /usr/bin/cut -c 2- )
else
        backupdir=$BACKUP_DIR
fi

# Create a snapshot using the date in the name
/sbin/zfs snapshot ${backupdir}@${today}

if [ $? -ne 0 ]; then
        echo "Error:  unable to create the snapshot (${backupdir}@${today})."
        exit 1
fi

exit 0
```


rsb-one
This script can be used to do a manual backup of a single server at a single site.  Mainly used for testing, but has also come in handy on a couple of occasions when the automatic backup failed.

This pulls in the central config file, but duplicates the rsync command, so one has to keep this file and the rsbackup file in sync.  We're planning on amalgamating this into the main rsbackup script.


```
#!/bin/sh                          

which="/usr/bin/which"
basename=$( ${which} basename )
dirname=$( ${which} dirname )  
scriptname=$( ${basename} ${0} )
scriptdir=$( ${dirname} ${0} )  
scriptversion=1.0               
sshcmd="/usr/local/bin/ssh "    

# Pull in the defaults file
defaults="rsbackup.conf"   
if [ -r ${scriptdir}/${defaults} ]; then
        . ${scriptdir}/${defaults}      
else                                    
        echo "Error!  Main config file doesn't exist."
        exit 1                                        
fi                                                    

# Arguments passed to this script are:
#  -s sitename     this tells the script where to find the site settings
#  -h hostname     this tells the script which host config file to grab
if [ $# -gt 0 ]; then
        while getopts "s:h:" OPTION; do
                case "${OPTION}" in
                        "s")
                                # Check if site config file exists, and read it in
                                if [ -r ${RS_DIR}/sites/${OPTARG}/site_defaults ]; then
                                        sitedir=${RS_DIR}/sites/${OPTARG}
                                        . ${sitedir}/site_defaults
                                else
                                        echo "Error!  Site directory doesn't exist."
                                        exit 1
                                fi
                                ;;
                        "h")
                                # Check if host config file exists, and read it in
                                if [ -r ${sitedir}/${OPTARG}.cfg ]; then
                                        hostconf=${sitedir}/${OPTARG}.cfg
                                        . ${hostconf}
                                else
                                        echo "Error!  Host conf file doesn't exist."
                                        exit 1
                                fi
                                ;;
                        *)
                                echo "Usage: ${0} -s sitename -h hostname"
                                ;;
                esac
        done
else
        echo "No arguments given.  Nothing to do."
        echo ""
        echo "Usage: ${0} -s sitename -h hostname"
        exit 1
fi

# Check whether there's a server-specific exclude file needed
if [ -z "${RSYNC_EX_SERVER}" ]; then
        RSYNC_EXTRA_EXCLUDE=""
else
        RSYNC_EXTRA_EXCLUDE="--exclude-from=${sitedir}/$RSYNC_EX_SERVER"
fi

# Make sure that the backup directory exists
if [ ! -e "${BACKUP_DIR}/${SITE_DIR}/${SERV_DIR}" ]; then
        mkdir -p "${BACKUP_DIR}/${SITE_DIR}/${SERV_DIR}"
fi

# Do the rsync
rsync $RSYNC_OPTIONS $RSYNC_SITE_OPTIONS $RSYNC_SRV_OPTIONS \
        --exclude-from=$RSYNC_EX_DEF $RSYNC_EXTRA_EXCLUDE \
        --rsync-path="$RSYNC_EXEC" --rsh="$sshcmd -p $RSYNC_PORT -i $RSYNC_SSH_KEY" \
        --log-file=/var/log/rsbackup/$RSYNC_SERVER.log \
        $RSYNC_USER@$RSYNC_SERVER:/ $BACKUP_DIR/$SITE_DIR/$SERV_DIR
```
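With a site directory and host config in place under /root/rsb/sites/, a single-host run would look something like this (site1/host1 are the example names from the layout post; echoed here as a dry run):

```shell
# Hypothetical invocation for one host; assumes
# /root/rsb/sites/site1/site_defaults and host1.cfg exist.
cmd="sh /root/rsb/rsbackup -s site1 -h host1"
echo "${cmd}"    # dry run: review before running for real
```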


----------



## phoenix (Apr 30, 2009)

Example site_default file
This is the config file that lists defaults for all servers at a specific site, as well as the main directory to use for the backups for all the servers at that site.


```
#Site wide options
#required
SITE_DIR=site
```

Example server config file
This is the config file that each remote server would have.  It lists any server-specific exclude files to use, the hostname of the server, and the name of the directory to store the backup under (usually named after the server).


```
# adding an additional exclude file
RSYNC_EX_SERVER=exclude.server

# These 2 are required, and specific to each server
RSYNC_SERVER=server.hostname
SERV_DIR=server
```

Default exclude file for Linux servers
This is an example of the default exclude file used for all Linux servers.


```
/sys/*
/proc/*
*mozilla/firefox/*/Cache/**
/var/lib/vservers/vs1/home/*
*/.googleearth/Cache/**
*/.googleearth/Cache/temp/**
/var/spool/squid/**
/backup/*
/var/spool/cups/**
/var/log/**.gz
*/cache/apt/archives/**
/var/lib/vservers/vs1/var/tmp/**
/home/programs/tmp/**
/home/programs/vmware/**
/home/**/.thumbnails/**
/home/**/.java*/deployment/cache/**
/home/**/profile/**
/home/**/.local/Trash/**
/home/**/.macromedia/**
```


----------



## phoenix (Apr 30, 2009)

check-fs
This script monitors the health of the gmirror and the zpool.  It runs via cron.  If any anomalies are detected, an e-mail is sent with all the details.  It's based on the zpool check script run by periodic(8).


```
#!/bin/sh

send=0

# Check zpool status
status=$( zpool status -x )

if [ "${status}" != "all pools are healthy" ]; then
        zpoolmsg="Problems with ZFS: ${status}"
        send=1
fi

# Check gmirror status
status=$(gmirror status)

if echo "${status}" | grep -q DEGRADED; then
        gmirrormsg="Problems with gmirror: ${status}"
        send=1
fi

# Send status e-mail if needed
if [ "${send}" -eq 1 ]; then
        echo "${zpoolmsg} ${gmirrormsg}" | mail -s "Filesystem Issues on backup server" someone@somewhere.com
fi

exit 0
```
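It's scheduled from cron; a hypothetical crontab entry (the actual interval and script path aren't stated above):

```
# /etc/crontab -- check gmirror and zpool health every 15 minutes
*/15    *       *       *       *       root    /root/rsb/check-fs
```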


----------



## DutchDaemon (May 1, 2009)

Could you post some details?


----------



## phoenix (May 1, 2009)

Filesystem Layout
And finally, here's the directory structure used, to show where the different files go, where the backups go, etc.

*/root/rsb/*
conf/
rsb-one
rsb-snapshot
rsb-wrapper
rsbackup
rsbackup.conf
sites/

*/root/rsb/conf/*
exclude.default.bsd
exclude.default.linux
rsbackup.rsa
rsbackup.rsa.pub
server.rs.example
site_default.example
sites.lst

*/root/rsb/sites/*
site1/
site2/
site3/
site4/

*/root/rsb/sites/site1/*
exclude.host1
exclude.host2
exclude.host3
host1.cfg
host2.cfg
host3.cfg
host4.cfg
site_defaults


----------



## FBSDin20Steps (May 1, 2009)

Spam alert!!!


----------



## ArtemD (May 1, 2009)

Thank you for the great howto (*very* informative, btw).  I was wondering, though, how stable do you find ZFS on FreeBSD?  Did you have any issues?  How about performance under heavy load?


----------



## phoenix (May 1, 2009)

The original server setup, using a single 24-drive raidz2 vdev in the storage pool, was not very good.  We learnt the hard way that the IOps performance of a raidz vdev is equivalent to that of a single drive.  IOW, a 24-drive raidz2 is no faster than a single SATA drive!!

Plus, when you have to replace a drive in the vdev, as we had to, it will thrash all the drives in the raidz vdev ... and thrashing 24 drives 24 hours a day *really* slows things down, usually leading to re-starts of the resilver process.  After a week of that, we rebuilt the box using 3x raidz2 vdevs of 8 drives each.  Performance went through the roof after that.

Turns out, the official recommendation from Sun is to use <=10 drives per raidz vdev, preferably 6-8.

The original setup would complete ~60 server backups between 7pm and 7am.  We really fiddled with the sleep times between starting the parallel rsyncs, and with the ordering of the sites, but we couldn't really get it much better than ~60 servers in one run.

Moving to the 3x raidz2 setup, we can complete 102 server backups within 5 hours, leaving plenty of time for extra servers.
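The rebuilt layout can be sketched from the shell.  This is a hypothetical reconstruction (the pool name "storage" comes from the /storage/backup path mentioned below; the glabel names disk01..disk24 come from the setup post), echoed as a dry run since the real command needs the actual hardware:

```shell
# Build the "3x raidz2, 8 drives each" zpool create command from the
# glabel(8) names disk01..disk24 used on this server.
pool=storage
vdevs=""
for start in 1 9 17; do
    vdevs="${vdevs} raidz2"
    i=${start}
    while [ ${i} -le $((start + 7)) ]; do
        vdevs="${vdevs} label/disk$(printf '%02d' ${i})"
        i=$((i + 1))
    done
done
echo "zpool create ${pool}${vdevs}"   # dry run: review, then run for real
```

The point of the split: each raidz2 vdev delivers roughly one drive's worth of IOps, so three vdevs give roughly 3x the IOps of the original single 24-drive vdev, and a resilver only thrashes 8 drives instead of 24.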

We did have to do some manual tuning of various sysctls, and loader tunables.  And we switched to using OpenSSH from the ports tree, with the HPN patches (went from ~30 Mbits/sec max network throughput to over 90 Mbits/sec, per SSH connection).

We monitor the server using SNMP, MRTG, and Routers2.  Even though we can only poll the 32-bit disk counters every 60 seconds, we average 80 MBytes/sec disk I/O during the backup run, with the odd peak at 120 MBytes/sec.  The system is still very responsive to SSH connections, log tailing, and other interactive duties.

We also push the contents of the /storage/backup directory out to a second, identical system, at an off-site location.  Takes a little under 4 hours for that.  Using a slightly modified rsync script (basically just a for loop through the directories under /storage/backup, with a separate rsync per sub-directory).

The kicker to all this:  ~$10,000 CDN for each storage server!!  And we're working on a method to automate backups for the few Windows stations we still have (also using ssh and rsync).

Another school district in the province spent over $250,000 CDN for their backup setup, with less storage space, a lot more administrative overhead, and more physical servers.  Without off-site redundancy.    Sometimes, I really like working with FreeBSD and Linux systems!!


----------



## vivek (May 2, 2009)

Nice. I've also built a server, but without ZFS. I run rsnapshot and shell scripts to back up 3 MySQL servers and 5 webservers. I'm using 4x 1 TB hard disks in RAID 10. We make a full backup to tape.

Your setup is awesome. Were you able to run any disk I/O tests?  If so, could you paste your results?

TIA.


----------



## phoenix (May 2, 2009)

I did, way back when we first started, but didn't keep them (had nothing to compare them to).  Just simple dd runs, so nothing really useful.

Any suggestions on disk benchmarks to run?


----------



## vivek (May 2, 2009)

benchmarks/bonnie++/

OR 
benchmarks/iozone/

Later can create graphs too from data.


----------



## phoenix (Jun 9, 2009)

I ran some iozone benchmarks on one of the servers.  Created a new ZFS filesystem, with all the default settings (noatime off, compression off).

The iozone commands used:
`# iozone -M -e -+u -T -t <threads> -r 128k -s 40960 -i 0 -i 1 -i 2 -i 8 -+p 70 -C`
I ran the command using 32, 64, 128, and 256 for <threads>.

*Write* speeds range from 236 MBytes/sec to 582 MBytes/sec for sequential; and from 242 MBytes/sec to 550 MBytes/sec for random.

*Read* speeds range from 3.3 GBytes/sec to 5.5 GBytes/sec for sequential; and from 1.8 GBytes/sec to 5.5 GBytes/sec for random.

All the gory details are below.


```
32-threads:  Children see ...  32 initial writers =  582468.13 KB/sec
32-threads:  Parent sees  ...  32 initial writers =  108808.46 KB/sec
64-threads:  Children see ...  64 initial writers =  236144.47 KB/sec
64-threads:  Parent sees  ...  64 initial writers =   86942.94 KB/sec
128-threads: Children see ... 128 initial writers =  284706.68 KB/sec
128-threads: Parent sees  ... 128 initial writers =   10850.40 KB/sec
256-threads: Children see ... 256 initial writers =  258260.59 KB/sec
256-threads: Parent sees  ... 256 initial writers =    9882.16 KB/sec

32-threads:  Children see ...  32 rewriters =  545347.52 KB/sec
32-threads:  Parent sees  ...  32 rewriters =  339308.08 KB/sec
64-threads:  Children see ...  64 rewriters =  419838.51 KB/sec
64-threads:  Parent sees  ...  64 rewriters =  335620.45 KB/sec
128-threads: Children see ... 128 rewriters =  350668.51 KB/sec
128-threads: Parent sees  ... 128 rewriters =  319452.97 KB/sec
256-threads: Children see ... 256 rewriters =  317751.52 KB/sec
256-threads: Parent sees  ... 256 rewriters =  295579.66 KB/sec

32-threads:  Children see ...  32 random writers =  379256.37 KB/sec
32-threads:  Parent sees  ...  32 random writers =   95298.44 KB/sec
64-threads:  Children see ...  64 random writers =  551767.68 KB/sec
64-threads:  Parent sees  ...  64 random writers =  113397.95 KB/sec
128-threads: Children see ... 128 random writers =  241980.60 KB/sec
128-threads: Parent sees  ... 128 random writers =   74584.01 KB/sec
256-threads: Children see ... 256 random writers =  398427.84 KB/sec
256-threads: Parent sees  ... 256 random writers =   20219.56 KB/sec

32-threads:  Children see ...  32 readers = 5023742.86 KB/sec
32-threads:  Parent sees  ...  32 readers = 4661309.72 KB/sec
64-threads:  Children see ...  64 readers = 5516460.71 KB/sec
64-threads:  Parent sees  ...  64 readers = 3949337.61 KB/sec
128-threads: Children see ... 128 readers = 4748635.74 KB/sec
128-threads: Parent sees  ... 128 readers = 3208982.03 KB/sec
256-threads: Children see ... 256 readers = 4358453.38 KB/sec
256-threads: Parent sees  ... 256 readers = 2741593.08 KB/sec

32-threads:  Children see ...  32 re-readers = 5502926.62 KB/sec
32-threads:  Parent sees  ...  32 re-readers = 4650327.75 KB/sec
64-threads:  Children see ...  64 re-readers = 5509400.02 KB/sec
64-threads:  Parent sees  ...  64 re-readers = 4526444.40 KB/sec
128-threads: Children see ... 128 re-readers = 4072363.55 KB/sec
128-threads: Parent sees  ... 128 re-readers = 2840317.47 KB/sec
256-threads: Children see ... 256 re-readers = 3329375.95 KB/sec
256-threads: Parent sees  ... 256 re-readers = 2183894.33 KB/sec

32-threads:  Children see ...  32 random readers = 5555090.45 KB/sec
32-threads:  Parent sees  ...  32 random readers = 4602383.62 KB/sec
64-threads:  Children see ...  64 random readers = 4402270.77 KB/sec
64-threads:  Parent sees  ...  64 random readers = 2059081.52 KB/sec
128-threads: Children see ... 128 random readers = 3070466.93 KB/sec
128-threads: Parent sees  ... 128 random readers =  525076.11 KB/sec
256-threads: Children see ... 256 random readers = 1888676.12 KB/sec
256-threads: Parent sees  ... 256 random readers =  293304.53 KB/sec

32-threads:  Children see ...  32 mixed workload = 3130000.18 KB/sec
32-threads:  Parent sees  ...  32 mixed workload =  123281.78 KB/sec
64-threads:  Children see ...  64 mixed workload = 1587053.33 KB/sec
64-threads:  Parent sees  ...  64 mixed workload =  294586.82 KB/sec
128-threads: Children see ... 128 mixed workload =  807349.95 KB/sec
128-threads: Parent sees  ... 128 mixed workload =   98998.77 KB/sec
256-threads: Children see ... 256 mixed workload =  393469.55 KB/sec
256-threads: Parent sees  ... 256 mixed workload =  112394.90 KB/sec
```


----------



## vivek (Jun 11, 2009)

Thanks. You got impressive disk I/O!


----------



## jyavenard (Aug 1, 2009)

*Great post, useless benchmarks*

Hi

Fantastic posts, I wish I had found this thread earlier. I created a similar setting, though using higher capacity drives.

One note however, the iozone benchmarks are useless here, especially the read speed.
All it is showing is that the data is in RAM or CPU cache...

A more valid test would be:
[cmd=]iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>[/cmd]

size needs to be at least twice the amount of RAM, rounded up to a power of 2 (for more accuracy). E.g. with 6 GB of RAM, use size = 16g

testfile is the path to the file on the disk to test...

There's physically no way a RAID array with 24 disks, each having a physical limit of around 100 MB/s, could achieve over 5 GB/s reading; heck, that's more than a single PCIe lane can carry!
The 3Ware 9650 is PCIe 1.0 with 8 lanes; a PCIe 1.0 x8 port can carry 2 GB/s maximum...

I tested a raidz setup with 6x 2 TB Western Digital RE4 drives and achieved 280 MB/s write and 320 MB/s read. Which is OK (faster than what the dual NIC could output), but not exceptionally great.

A Linux box with an E8200 (2.6 GHz) dual core, 2 GB of RAM, and 5x 1.5 TB consumer-level drives achieved, with md, 270 MB/s write but 455 MB/s read ...


----------



## User23 (Aug 4, 2009)

Nice setup & howto 

I have a question about your settings on the 3Ware controller.

Did you have the write cache enabled in the 3dm2 web interface?

I got very poor throughput without the write cache enabled on 7.2-RELEASE amd64,
on different machines with different 3Ware controllers. And it looks like I'm not the only person who hit that bug(?).

twa, 3ware performance issue

thx & best regards


----------



## phoenix (Aug 4, 2009)

Yes, we have the write cache enabled on the controller, and use the performance profile for each of the disks.  This gives us a nice, fast, "2nd level" cache (disk cache -> controller cache -> ZFS ARC), and it allows the controller to re-order writes to the drives, as needed.

Makes it a bit more intelligent than a plain JBOD setup, where the controller would be just a dumb SATA controller.
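For reference, the same settings can be applied from the shell with 3ware's tw_cli instead of the 3dm2 web UI.  This is a hypothetical sketch (controller /c0 and units u0-u23 are assumptions; "storsave=perform" is the performance profile), echoed as a dry run:

```shell
# Build the tw_cli commands to enable write cache and the "perform"
# storsave profile on every SingleDrive unit of controller /c0.
cmds=""
u=0
while [ ${u} -le 23 ]; do
    cmds="${cmds}tw_cli /c0/u${u} set cache=on
tw_cli /c0/u${u} set storsave=perform
"
    u=$((u + 1))
done
printf '%s' "${cmds}"    # dry run: review, then pipe to sh to apply
```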


----------



## phoenix (Aug 4, 2009)

jyavenard said:

> One note however, the iozone benchmarks are useless here, especially the read speed.
> All it is showing is that the data is in RAM or CPU cache...



It's not useless, considering the bulk of our data will be in the ARC, and it's mostly reads to compare the data to what's on the remote servers.  Plus, the transfer from one backup server to the other is all reads.  Also, since the servers are on UPSes, and the RAID controllers have batteries, all the caches are configured as write-back, so as soon as data hits one of the caches, it's considered "written to disk".



> A more valid test would be:
> [cmd=]iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>[/cmd]



I'll see if I can run the above, for comparison.  However, I'm off on holidays starting tomorrow, so it won't be until after the 13th that I'll be able to try this.


----------



## SuperMiguel (Aug 11, 2009)

why u install your system on 2 CF card?? instead on a small HD? speed?


----------



## phoenix (Aug 13, 2009)

We wanted to maximise the use of all 24 drive bays for data storage.

We didn't want to have to partition one of the drives to make room for the OS, we didn't want to dedicate an entire 500 GB drive to the OS, and we didn't have any extra internal drive bays that could be used.

Thus, we used small (2 GB and 4 GB) CompactFlash drives for the OS install (uses less than 2 GB for / and /usr), and used all 24 drives for data storage.  These were small enough that they could be attached to the inside of the case.


----------



## saxon3049 (Aug 20, 2009)

Very useful.  I already linked to it on another forum, and it's provided me with a new solution to try out.


----------



## confusion (Aug 31, 2009)

This is an interesting write-up. Is there an advantage of using zfs snapshots vs. rsnapshot?


----------



## phoenix (Aug 31, 2009)

rsnapshot uses hardlinks and directories on standard filesystems.  We looked into doing this originally, but managing the hardlinks and directories and what-not was not fun.

ZFS snapshots are internal to the filesystem.  They are accessible at any time via the /<path>/.zfs/snapshot/<snap name>/ directory.  And you get all the added bonuses of ZFS (compression, pooled storage, easy admin, etc).
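A minimal sketch of the snapshot-per-run idea (the dataset name is assumed from the /storage/backup path mentioned earlier; echoed as a dry run since it needs a live ZFS pool):

```shell
# Take a dated snapshot after a backup run; older runs then stay
# browsable read-only via the .zfs pseudo-directory, no mounting needed.
fs="storage/backup"
today=$(date +%Y-%m-%d)
echo "zfs snapshot ${fs}@${today}"
echo "ls /${fs}/.zfs/snapshot/${today}/"
```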

We looked at a lot of different remote backup tools, especially ones that use rsync, and even tried coming up with some custom stuff using hardlinks, squashfs, other compressed filesystems, LVM, etc, and just could not find a storage stack that was usable and simple.

Then ZFS was imported into FreeBSD (we're a Debian Linux shop, but we use FreeBSD on the firewalls, so getting a FreeBSD storage box was not a hard sell).  And the rest is history.


----------



## phoenix (Sep 11, 2009)

jyavenard said:

> A more valid test would be:
> [cmd=]iozone -R -a -i 0 -i 1 -i 2 -g <size> -f <testfile> -b <excelfile>[/cmd]



Test is still running, but here's some preliminary results:

```
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
              64      64  781660  985384  1772821  3758700 3559345 2210854                                                             
             128     128  810472 2515594  3043191  4280683 4116568 2775716                                                             
             256     256 1019294 1741907  2327052  3762009 3610222 2051407                                                             
             512     512 1062365 1166212  1816181   530057 2188145 1236732                                                             
            1024    1024  812607 1173034  1218977  1190593 1190593 1022996                                                             
            2048    2048  423683  977971  1390538  1392341 1360799 1142823                                                             
            4096    4096  888472 1130223  1386544  1382416 1382861 1098427                                                             
            8192    8192  688925 1068884  1152028  1176646 1178219 1027706                                                             
           16384   16384  872934 1051746  1160754  1109570 1164057  993998
           32768   16384  663842 1018810  1254801  1227216 1249020  976341                                                             
           65536   16384  926499 1079457  1099078  1214551 1287518 1057221
          131072   16384  568620 1043002   829028  1156366  771242 1012178
          262144   16384  893503 1046252  1225213  1180366 1139746 1023720
          524288   16384  173176  268374  1126224  1030217 1101103  259177
         1048576   16384  163071  231045   279858    37752  266438  200727
```

So you can see that it's ranging between ~200 MBps and 1 GBps for writes and between 300 MBps and 3 GBps for reads.

Once the test completes, I'll post the full results.


----------



## jyavenard (Sep 12, 2009)

phoenix said:

> Test is still running, but here's some preliminary results:
> 
> ```
> random  random
> ...



200 MB/s and 300 MB/s are the only values that actually mean something in your setup. Given the kind of data you are writing (mirroring external machines), the cache effect is irrelevant. 

It's surprising that you are only achieving 200 MB/s write given the number of disks you are using. I get the same speeds with only 6 disks. 

But don't quote that you get 3 GB/s read. It's nonsense when performing disk benchmarks.


----------



## ttsiod (Sep 18, 2009)

*You mentioned backing up Windows machines...*

I am backing up Windows machines using rsync to our OpenSolaris/ZFS server. Here is how I do it:

http://users.softlab.ntua.gr/~ttsiod/win32backup.html


----------



## phoenix (Sep 18, 2009)

Yeah, there are a lot of different methods to run rsync on the Windows machine (I personally prefer the rsync.net backup agent, which supports SSH).  However, they are all client solutions.  I've yet to find a server-hosted solution for this.  We'd prefer to keep all the backup configuration on the server.  Makes it easier to schedule and manage the network/disk load.

There are SSH daemons for Windows, and there are rsync apps for Windows.  But I have yet to find a pair that will allow:

server to connect via SSH
server to initiate the rsync program on the client
client connect back through the SSH tunnel to push the data to the server

If we could find that, then we could backup everything, via a single set of config files on the backup server.
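The desired flow can be sketched with a reverse SSH tunnel back to an rsync daemon on the server.  This is a hypothetical sketch, not something we have working: hostname, module name, and tunnel port are all illustrative (the key path is the one from our conf/ directory), and it's echoed as a dry run:

```shell
# Server opens a reverse tunnel to its own rsync daemon (port 873),
# then starts rsync on the client, which pushes through that tunnel.
client="winbox.example.com"
cmd="ssh -i /root/rsb/conf/rsbackup.rsa -R 10873:localhost:873 ${client} \
rsync -a /cygdrive/c/ rsync://localhost:10873/backup/"
echo "${cmd}"    # dry run; needs sshd + rsync installed on the client
```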


----------



## gene (Sep 19, 2009)

*Cygwin*



			
phoenix said:

> There are SSH daemons for Windows, and there are rsync apps for Windows.  But I have yet to find a pair that will allow:
> 
> server to connect via SSH
> server to initiate the rsync program on the client
> client connect back through the SSH tunnel to push the data to the server



Install cygwin with its openssh  and rsync packages then run 'ssh-host-config'.  It should set up everything needed to make sshd a windows service.

Once you have cygwin installed you can refer to '/usr/share/doc/Cygwin/openssh.README' if you have problems.


----------



## gene (Sep 19, 2009)

*A few questions*

Thank you very much for posting all of this information.  I've been planning on doing something very similar and it's great to see someone else accomplishing it at a much larger scale than I'm planning.

Are you backing up any databases?  If so how are you doing it?

To hazard a guess, if you have MySQL databases, I'm thinking you are locking all the tables and then directly copying the contents of '/var/mysql' (or wherever the databases are stored on the file system), or you are dumping the contents of the databases to flat files before doing the rsync.

What is your retention policy?  I see that you were hoping for 13 months, but was that based off an SLA?

When you do hit the ~500 day mark I'm guessing will you be removing snapshots, starting with the oldest, to free up space. Have you considered keeping one snapshot for each week or month, that way you can still have access to data that was backed up more than 500 days prior and still have space for future back ups?  The storage pool will still eventually fill up doing this, I'm sure, but it would be interesting to see how long it could last.

Did you consider using the larger Chenbro chassis (50 bay) instead?

And just to satiate the geek in me, do you have any pictures of the servers?


----------



## phoenix (Sep 21, 2009)

gene said:

> Are you backing up any databases?  If so how are you doing it?



Yes.  MySQL databases.  We dump the databases to text files, and then rsync both those and the db directory as part of the rsync process.  We've done recoveries using both the dumps and the binary files.
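A sketch of the client-side dump step (the dump directory and database names are illustrative, not our actual setup; the dump directory then gets picked up by the normal rsync run).  Echoed as a dry run so it doesn't need a live MySQL server:

```shell
# Dump each database to a text file before rsync picks the tree up.
dumpdir=/var/backups/mysql
dblist="mysql app1 app2"   # in practice: mysql -N -e 'show databases'
cmds=""
for db in ${dblist}; do
    cmds="${cmds}mysqldump --single-transaction ${db} > ${dumpdir}/${db}.sql
"
done
printf '%s' "${cmds}"      # dry run: review, then pipe to sh
```

(--single-transaction gives a consistent dump for InnoDB tables without locking; for MyISAM you'd want --lock-tables instead.)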



> What is your retention policy?  I see that you were hoping for 13 months, but was that based off an SLA?



We're aiming for 13 months.  It looks like we'll have to move to larger harddrives before the first year is out, though.  Using 500 GB drives, we only have 2 TB of disk space left.  1 TB drives are coming down in price, though, and the issues with them appear to be solved.



> When you do hit the ~500 day mark I'm guessing will you be removing snapshots, starting with the oldest, to free up space. Have you considered keeping one snapshot for each week or month, that way you can still have access to data that was backed up more than 500 days prior and still have space for future back ups?  The storage pool will still eventually fill up doing this, I'm sure, but it would be interesting to see how long it could last.



Yes, that is one possibility we are looking at.  Keeping the backups from the 7th, 14th, 21st, and 28th of each month, starting on the 14th month.  And then keeping those for an extra year.
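That retention rule can be written as a small predicate.  This is a sketch of the idea, not the deployed policy: the 425-day cutoff (roughly 14 months) and the day-of-month anchors are the ones discussed above, but the snapshot naming and exact cutoff are assumptions:

```shell
# Decide whether a dated snapshot survives pruning:
# keep everything younger than ~14 months; after that, keep only
# the 7th/14th/21st/28th for an extra year.
keep_snapshot() {
    # $1 = age of the snapshot in days, $2 = day-of-month it was taken
    if [ "$1" -lt 425 ]; then
        echo keep
    else
        case "$2" in
            07|14|21|28) echo keep ;;
            *) echo destroy ;;
        esac
    fi
}
keep_snapshot 100 03   # recent: keep everything
keep_snapshot 500 14   # old, but a weekly anchor day: keep
keep_snapshot 500 03   # old and not an anchor: destroy
```

The "destroy" branch would feed `zfs destroy storage/backup@<date>` in a cron job.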



> Did you consider using the larger Chenbro chassis (50 bay) instead?



We didn't know about the 50-bay cases until after we had things installed and working.

We re-purposed servers for this.  The 5U mega-servers were originally purchased to act as Xen/KVM/VMWare hosts.  Then we realised that CPU and RAM are more important for VM hosts than disk space.  So these became the backup servers.  And the other 5U servers will become storage servers for the VM hosts (which will probably be net-booted 1U or 2U systems with gobs of CPU and RAM).



> And just to satiate the geek in me, do you have any pictures of the servers?



Not currently, no.


----------



## gene (Sep 21, 2009)

*Drives*

Are all of your drives the same make and model?


----------



## phoenix (Sep 21, 2009)

No.  We use 12 Seagate drives and 12 Western Digital drives, bought in four batches of 6 drives each, to try and minimise the "all from the same manufacturing batch" issue (it would really suck if they all died at the same time).  A pair of the drives have been replaced already with newer WD drives.


----------



## gene (Sep 28, 2009)

gene said:

> Install cygwin with its openssh  and rsync packages then run 'ssh-host-config'.  It should set up everything needed to make sshd a windows service.
> 
> Once you have cygwin installed you can refer to '/usr/share/doc/Cygwin/openssh.README' if you have problems.



Have you given this a try?  I've done it with 2003 server successfully, and just over the weekend did it with an XP box with success.


----------



## phoenix (Sep 28, 2009)

I'm in the process of testing it.

It's going to require making some (possibly massive) changes to our backup script.  For example, there's no sudo in cygwin.

I've got it working manually.  Now to figure out how to automate it, and to test a system restore.  And to figure out what needs to be added to the exclude file.


----------



## spork (Feb 6, 2010)

Quick question for the author...  I'm using your set of scripts as a starting point because it all seems pretty sane.  Once I'm happy with it, I'll probably change things up a bit.

I'm having one bizarre issue that I can't track down though...  My backups box has much more storage than all the machines it's backing up, so I have not been paying much attention to the space used over the last few weeks.  As I was copying some things off, I realized that there's more data than I'd expect on the backups server.  After poking around a bit I found that rsync is simply not deleting files.  I see the "--delete-during" option in the script, also tried plain old "--delete" with the same result.

Any ideas?  I see people with similar problems when they are working from a file list or with wildcards, but the only wildcards I've got are in my exclude lists...

I'm totally stumped by this one.


----------



## jb_fvwm2 (Feb 6, 2010)

--delete-after ?? That fixed something in rsync here, maybe it would fix
it...


----------



## spork (Feb 7, 2010)

Nope...  It grew some more tonight with "--delete-after" as well.  Seems like a common problem with rsync, I'll have to figure out how to step through what the script is doing but scale things down enough so I can see what's happening.


----------



## spork (Feb 18, 2010)

Almost there...  Since the boxes are active while they are being backed up, rsync throws errors here and there about files disappearing and the like, which is fairly normal.

What I did not know is that rsync skips ALL file deletion operations if it encounters ANY errors.  There's an "--ignore-errors" flag, but it's a bit blunt - it ignores any errors, which could be problematic.  I have a query out about this on the rsync list.

So if you're using this script, or a similar method, you might want to look for this line in your backup logs:

2010/02/18 01:42:30 [75398] IO error encountered -- skipping file deletion

That does not refer to a single file; it means NO files were deleted in the entire run.
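A small sketch for catching this automatically from cron: flag any rsbackup log that hit the "skipping file deletion" condition.  The log path is the --log-file location from the backup script; the helper function is hypothetical:

```shell
# Flag log lines carrying rsync's "skipped all deletions" marker.
flag_line() {
    case "$1" in
        *"IO error encountered -- skipping file deletion"*) echo FLAG ;;
        *) echo ok ;;
    esac
}
flag_line "2010/02/18 01:42:30 [75398] IO error encountered -- skipping file deletion"
# Cron version: grep -l 'skipping file deletion' /var/log/rsbackup/*.log
```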


----------



## phoenix (Feb 18, 2010)

Even when --delete-during is used, which does the file deletions as it comes across them, instead of batching them up at the end?


----------



## spork (Feb 18, 2010)

phoenix said:

> Even when --delete-during is used, which does the file deletions as it comes across them, instead of batching them up at the end?



Yep, --delete-during was the initial option I used.  The number of errors is small, and they all give a "bad file descriptor (9)" error, which I think rsync feels is a "really bad" error compared to the normal "file disappeared" type errors.  Googling around on the "bad file descriptor" error gives me lots of hits on problems with smbfs mounts, but not much else (and I have no smbfs mounts).

I'll try a run with "--ignore-errors" tonight and see what happens.  Not an optimal solution, but a good stopgap.


----------



## dennylin93 (Jun 13, 2010)

phoenix said:

> ```
> /sys/*
> /proc/*
> *mozilla/firefox/*/Cache/**
> ...



I'm wondering why both "*" and "**" are used at the ends of patterns?  Is there any particular reason, since "*" seems to be sufficient?


----------



## phoenix (Jun 16, 2010)

This file just grew organically, with three of us adding to it, so some things have one *, and others have two.  No real reason beyond that, I don't think.

I believe the ** in the middle of a path is important, though.

The globbing/regex stuff in rsync is confusing, to say the least.
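For anyone else bitten by this, the rule from the rsync(1) filter-rules section: a single `*` stops at slashes, while `**` also matches across them.  A small annotated exclude fragment (comment lines are legal in exclude files; filenames are illustrative):

```
# /var/log/*.gz matches /var/log/messages.1.gz,
# but NOT /var/log/apache/access.log.1.gz (the * stops at "/").
/var/log/*.gz
# /var/log/**.gz matches both of the above.
/var/log/**.gz
```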


----------



## dennylin93 (Jun 16, 2010)

phoenix said:

> The globbing/regex stuff in rsync is confusing, to say the least.




I can't agree more.  I actually had to run test cases to understand the man page better.


----------

