# What is the best compression method for backups?



## Maelstorm (Jan 14, 2018)

The thread title pretty much says it all: What is the best compression method for backups?

Due to the...spectacular...hardware failure that my FreeBSD box suffered last month, I've been making plans to perform automated system backups.  In addition to the USB ports on the mainboard, this machine also has a PCI Firewire/USB combo card which has an *INTERNAL* USB port.  I purchased a 128GB USB memory stick and plugged it into the internal port and formatted the drive.

I want to do backups using tar(1) and there is a list of compression algorithms to choose from.  I know that the fastest would be no compression, but it also takes up the most space as well.  So I am looking for a balance between time and space.  Some of the compression algorithms I have not heard of before.  The list as given in the man page is as follows:


- xz
- gzip
- bzip2
- lrzip
- lz4
- lzma
- lzop
- compress

I know what gzip, bzip2, lzma, and compress are.  I haven't heard of xz, lrzip, lz4, or lzop.

What do you recommend?


----------



## geheimnisse (Jan 14, 2018)

In my experience, xz is nice. Smaller file sizes than gzip, faster than bzip2 (and usually smaller file sizes than bzip2).
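If you want to see it for yourself, a quick throwaway comparison is easy (the sample data and paths here are just made up for illustration):

```shell
# Build the same archive with gzip (-z) and xz (-J), then compare sizes.
mkdir -p /tmp/cmp
awk 'BEGIN { for (i = 0; i < 100000; i++) print "fairly compressible sample text" }' > /tmp/cmp/sample.txt
tar -czf /tmp/cmp/sample.tgz -C /tmp/cmp sample.txt
tar -cJf /tmp/cmp/sample.txz -C /tmp/cmp sample.txt
wc -c /tmp/cmp/sample.tgz /tmp/cmp/sample.txz
```

On repetitive input like this the xz archive comes out noticeably smaller; real data will vary, so test with your own files.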


----------



## Snurg (Jan 14, 2018)

Whatever compression you use, make sure it works in a way that is error-tolerant in case the compressed file is damaged. Tape dropouts, for example, can easily corrupt parts of files.

This is the reason why, after bad experiences in the past, I usually do not compress backups. If I cannot extract an archive because its checksum is invalid, all of its contents are lost. Some archivers can skip over damaged files or extract the parts that are intact; others refuse to unpack anything at all.
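To illustrate the failure mode: a few flipped bytes are enough to make a checksummed archive fail verification (throwaway paths, gzip used here just as an example):

```shell
# Write a tiny gzip archive, then overwrite bytes mid-stream to simulate damage.
echo "some important data" | gzip > /tmp/it.gz
gzip -t /tmp/it.gz && echo "archive OK"
# Clobber three bytes inside the compressed stream (past the 10-byte header).
printf 'XXX' | dd of=/tmp/it.gz bs=1 seek=14 conv=notrunc 2>/dev/null
gzip -t /tmp/it.gz 2>/dev/null || echo "archive corrupted"
```

This is also why testing your archives (gzip -t, xz -t, tar -t) after writing them is worth the extra minute.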

Another sweet method of backing up, when using a ZFS mirrored configuration, is to just remove a drive and store it in a safe place, then put in another drive and let it resilver. With ZFS compression enabled, another layer of compression probably does not make much sense.

By the way, what was that spectacular kind of hardware failure? *curious*


----------



## Maelstorm (Jan 14, 2018)

Snurg said:


> Whatever compression you use, make sure it works in a way that is error-tolerant in case the compressed file is damaged. Tape dropouts, for example, can easily corrupt parts of files.
> 
> This is the reason why, after bad experiences in the past, I usually do not compress backups. If I cannot extract an archive because its checksum is invalid, all of its contents are lost. Some archivers can skip over damaged files or extract the parts that are intact; others refuse to unpack anything at all.
> 
> Another sweet method of backing up, when using a ZFS mirrored configuration, is to just remove a drive and store it in a safe place, then put in another drive and let it resilver. With ZFS compression enabled, another layer of compression probably does not make much sense.



I do not use ZFS.  Besides, I'm backing up to a very spacious thumb drive.  I doubt there will be any errors on it.  I've decided to use geheimnisse's suggestion of xz.  I will test to make sure it does work as intended though.  The thumb drive is plugged into an I/O expansion card on the INSIDE of the case, so it's a permanent mount, like a hard disk partition.
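Something along these lines should do for that sanity check (the paths are placeholders, not my actual backup layout):

```shell
# Build a small test archive, then verify both the xz stream and the listing.
mkdir -p /tmp/bk/src
echo "config data" > /tmp/bk/src/rc.conf
tar -cJf /tmp/bk/test.txz -C /tmp/bk src
xz -t /tmp/bk/test.txz && echo "xz stream OK"
tar -tJf /tmp/bk/test.txz > /dev/null && echo "listing OK"
```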



Snurg said:


> By the way, what was that spectacular kind of hardware failure? *curious*



Well, you could search through my posts...but I'll be nice.

What happened was that I had a hard disk failure and it wiped out the /usr directory.  So when it went to single user mode, I had no /usr directory.  That's because when I first learned about FreeBSD many years ago, I followed the recommended setup that was in the printed version of the handbook.  That was version 3.4.  I stayed with that ever since.

Some more reading:

https://forums.freebsd.org/threads/63763/
https://forums.freebsd.org/threads/63815/
https://forums.freebsd.org/threads/63830/


----------



## Sensucht94 (Jan 14, 2018)

I suggest you check out *lrzip* - Long Range ZIP or LZMA RZIP.

Don't be put off by the heavy GNU dependencies the GitHub page lists (Coreutils, Bash, GCC and so on); the FreeBSD port has none of them, see: archivers/lrzip


----------



## Snurg (Jan 14, 2018)

Ahh, sorry... I actually thought I had missed something. 
Because I do not consider disk failures "spectacular" things.
In my perception they are just occasional and inevitable nuisances requiring drive swapping and resilvering, owing to my habit of using cheap old used 15k SAS drives.

The worst thing I experienced in that direction was a system using dual Deathstar drives, of which one failed and the other half an hour later. It was the generation that first used glass platters, which were highly sensitive to physical separation of the magnetic layer. I suppose the cause was that the computer was not fully acclimatized and some condensation occurred.

I admit I had hoped for a spectacular story like the one with a friend's computer, whose power supply suddenly emitted a big shower of sparks through the airflow exhaust and then died.
When I examined the thing, it was nicely burnt inside, and the mainboard was fried as well. The hard disk miraculously survived, though.


----------



## Maelstorm (Jan 14, 2018)

That's spectacular.  It was probably the 3.3V line that fried.  If it had been the 5V or 12V line, the HD would have been toast.


----------



## Maelstorm (Jan 14, 2018)

Here's a little piece of software that I wrote in sh.  It backs up certain aspects of the system.  Use it as you see fit.  Needless to say, you have to be root to run this.  Note that this is tailored to my system, so you may need to make some modifications before you deploy this.  In the future, I may have it delete old backups by counting how many are in the backup directory and doing some math with head or something.


```
#!/bin/sh

PATH=/bin:/sbin:/usr/bin:/usr/sbin

DATE=`date -j "+%Y%m%d"`
OPTIONS=-cPJvf
BKPATH=/usr/backup/$DATE
EXFILE=.sujournal

# Removes the existing file/directory
# if it exists
remove_exist()
  {
    if [ -e $BKPATH ] ; then
      rm -Rf $BKPATH
    fi
  }

# Creates a new directory, if needed
create_dir()
  {
    if [ ! -e $BKPATH ]; then
        mkdir $BKPATH
        chmod 0700 $BKPATH
      elif [ ! -d $BKPATH ] ; then
        rm -Rf $BKPATH
        mkdir $BKPATH
        chmod 0700 $BKPATH
    fi
  }

# Prints instructions
usage()
  {
    echo ''
    echo 'usage: backup [ all | home | etc | src |'
    echo '  obj | doc | ports | local ]'
    echo ''
    echo 'Any of the above combinations will'
    echo 'be recognized.'
    echo ''
    echo 'The options above will back up the'
    echo 'following components:'
    echo ''
    echo '  all:   Everything'
    echo '  home:  Home directories'
    echo '  etc:   System configuration'
    echo '  src:   System source code'
    echo '  obj:   System object code'
    echo '  doc:   System documentation'
    echo '  ports: The ports tree'
    echo '  local: Local software configuration'
    echo ''
    echo 'If no options are given, then'
    echo 'all is assumed.'
    echo ''
  }

# Archives everything
tar_everything()
  {
    remove_exist
    create_dir
    tar_home
    tar_i386conf
    tar_etc
    tar_usrobj
    tar_usrsrc
    tar_usrdoc
    tar_usrports
    tar_usrlocaletc
  }

# Archives all the home directories
tar_home()
  {
    local exclude
    local target
    target=/home
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/home.txz $exclude $target
  }

# Archives the kernel config
tar_i386conf ()
  {
    local target
    target=/usr/src/sys/i386/conf
    tar $OPTIONS $BKPATH/usr.src.i386conf.txz $target
  }

# Archives the system config
tar_etc()
  {
    local target
    target=/etc
    tar $OPTIONS $BKPATH/etc.txz $target
  }

# Archives the compiled base system
tar_usrobj()
  {
    local exclude
    local target
    target=/usr/obj
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.obj.txz $exclude $target
  }

# Archives the base system source code
tar_usrsrc()
  {
    local exclude
    local target
    target=/usr/src
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.src.txz $exclude $target
  }

# Archives the documentation
tar_usrdoc()
  {
    local exclude
    local target
    target=/usr/doc
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.doc.txz $exclude $target
  }

# Archives the ports tree
tar_usrports()
  {
    local exclude
    local target
    target=/usr/ports
    exclude="--exclude $target/$EXFILE"
    tar $OPTIONS $BKPATH/usr.ports.txz $exclude $target
  }

# Archives the installed software
tar_usrlocaletc()
  {
    local target
    target=/usr/local/etc
    tar $OPTIONS $BKPATH/usr.local.etc.txz $target
  }

# Begins processing
process()
  {
    create_dir
    if [ -n "$1" ]; then
        for loopvar in "$@"
          do
            case "$loopvar" in
              [Hh][Oo][Mm][Ee])
                tar_home
              ;;
              [Ee][Tt][Cc])
                tar_etc
              ;;
              [Ss][Rr][Cc])
                tar_usrsrc
              ;;
              [Oo][Bb][Jj])
                tar_usrobj
              ;;
              [Dd][Oo][Cc])
                tar_usrdoc
              ;;
              [Pp][Oo][Rr][Tt][Ss])
                tar_usrports
              ;;
              [Ll][Oo][Cc][Aa][Ll])
                tar_usrlocaletc
              ;;
              [Aa][Ll][Ll])
                tar_everything
              ;;
              *)
                usage
              ;;
            esac
          done
      else
        tar_everything
    fi
  }


# Entry Point
process "$@"
```
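A rough sketch of that pruning idea might look like the function below (the function name, and the assumption that the dated YYYYMMDD directory names sort correctly as plain strings, are mine):

```shell
# Keep only the newest $2 dated backup directories under $1, remove the rest.
# Works because YYYYMMDD names sort chronologically as plain strings.
prune_backups()
  {
    bkroot=$1
    keep=$2
    cd "$bkroot" || return 1
    ls -d [0-9]*/ 2>/dev/null | sort -r | tail -n +$((keep + 1)) |
    while read -r d; do
      rm -rf "./$d"
    done
  }
```

Something like `prune_backups /usr/backup 5` at the end of the script would then keep the five most recent runs.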


----------



## ShelLuser (Jan 14, 2018)

RAR, see archivers/rar.

When I want to compress something while also keeping my data safe, I always rely on RAR. It's a commercial archiver, usable free of charge (though they obviously want you to get a license, just like ARJ, PKZIP and others in the past), but it has some very interesting features to keep your data safe.

And the features which can do miracles here are:


```
rr[N]          Add data recovery record
rv[N]          Create recovery volumes
s[<N>,v[-],e]  Create solid archive
```
A solid archive means that the data gets added in a specifically sorted and compact way, which decreases the archive size even before any compression is applied. But the cool parts are the data recovery records. These are comparable to the PAR CRC methods often used within Usenet; see also this article on Parchive.

*Edit*: I'm mixing up my facts here. This has nothing to do with UUEncode but more so with PAR2. It's embarrassing but I forgot the name of the CRC method, but updated my post.

Another thing is that RAR often manages to create a better compression than others. And the extra space that gains me is often used by me to fill in the recovery records. Or in some cases recovery volumes (these are external CRC files). When I have data which is really important to me I usually keep the archives on one storage medium and my recovery volumes on the other.

The cool part is that when I have a multi-volume archive (for example to cater to the 2GB file size limit on some filesystems), these recovery records can protect at least one volume completely. Meaning: if one archive volume becomes corrupted, I can fully recreate it using my recovery volumes. And not a predetermined volume: any random volume can go b0rk and it will be easily recreated.

Which makes RAR the best archiver for me. But I fully agree with Snurg up there: sometimes using no compression is the better choice.


----------



## Eric A. Borisch (Jan 14, 2018)

Worth a read: the lzip author says don't use xz for archives.

http://www.nongnu.org/lzip/xz_inadequate.html


----------



## Eric A. Borisch (Jan 14, 2018)

Of course, another option would be ZFS. You can turn on gzip compression, and even set copies=2 if you want some additional protection from bad blocks on the devices. (Counteracts savings from compression, of course.) In addition, you can do your backup with rsync and have all your files available without going through another tool. Need versioning of backups? Use a snapshot!

I know you “do not use ZFS”. But imagine if you did... (if you were using it for your source, you could even use send/recv goodness.) Alas.


----------



## Maelstorm (Jan 14, 2018)

Eric A. Borisch said:


> Worth a read: lzip author says don’t use XZ for archives.
> 
> http://www.nongnu.org/lzip/xz_inadequate.html



If xz is that bad, then why did the developers include it in the tar options?  I can always change the options to tar.


----------



## Eric A. Borisch (Jan 14, 2018)

Maelstorm said:


> If xz is that bad, then why did the developers include it in the tar options?  I can always change the options to tar.



Because it compresses really well. So long as you can avoid errors, any issues with error handling are unseen. For distribution (downloads) decreased download size leads to saved time/money for everyone. Checking checksums is sufficient since you can redownload if something went wrong.

If it’s an archive, or a backup — something you don’t use unless the original is gone and you really need it to work — then resilience to/handling of errors becomes more important. So don’t equate “it is a popular format for distributing software” with “it is a good way to backup my important data.”

Also, quoting the article:


> It is said that given enough eyeballs, all bugs are shallow. But the adoption of xz by several GNU/Linux distributions shows that if those eyeballs lack the required experience, it may take too long for them to find the bugs.


----------



## Snurg (Jan 15, 2018)

The article Eric A. Borisch linked to is really excellent. One thing that I found particularly interesting is the aspect of undetected/undetectable errors in archives. I have experienced such things a few times, and I now understand better what happened with some archives that, for some reason, produced damaged and thus unusable files when unpacked.

And, albeit a bit off-topic, I really like the introductory quotes of the article.


> There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
> -- C.A.R. Hoare
> 
> Perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away.
> -- Antoine de Saint-Exupery


The microcode update stuff in cpucontrol (sysutils/devcpu-data) is a good example. 
The code is so contrived and complicated that it is really hard not to use words like "crappy".
It is code of the latter type by Hoare's definition above, and one of the worst pieces of code I have ever touched.

Big chunky functions with many, many gotos, and a good number of bugs that are hard to see because of this. I am starting to understand why the devs withdrew the microcode update and keep quiet about it.
Because I do not want to wait for 11.2, I am currently breaking the code down into smaller functional pieces that are easy to understand and make mistakes easy to spot.
There are also many bit operations, some of them contaminated with hard-to-see bugs. Bit-field structures and unions exist for good reason: to avoid exactly this difficult-to-understand and error-prone ANDing, ORing and shifting. Using them, it is easy to recode the whole thing in a way that is easy to understand and in which errors are easy to spot. I love the KISS principle.

I have had so many failed disk drives, backup tapes and USB sticks that I am just happy to be able to use ZFS mirrors: I can stash away a $20 HDD and put in another one.
In addition, I regularly back up my text-based data onto DVDs to have write-protected backups.

If the computer blows up, I can just get another, put in a backup drive, boot, and start using it, without all the disruption you experienced.
This is my way of backing up. Cheaper, safer and far more durable than tapes, flash and the like.

So what I have difficulty understanding is why you are considering such a complicated "backup" method with so many possibilities of getting hosed again.
I mean, what makes you not want to use ZFS?


----------



## vermaden (Jan 15, 2018)

@Maelstorm
Borgbackup with its deduplication, or ZFS with XZ and deduplication.


----------



## Maelstorm (Jan 15, 2018)

Snurg,

It sounds to me like someone was inexperienced in writing software.  Either that, or the code was contributed by a paid programmer who was looking out for job security.  Granted, schools teach that you should never use goto because of the danger of generating spaghetti code.

With that being said, I *DO* use goto when doing error handling.  Since I write system level software and some very low level stuff, there are cases where you have no choice.  A goto is an unconditional jump to another part of the code.  Sometimes that is useful.  Consider the following:


```
/* Struct and variable names here are illustrative placeholders. */
ptr1 = malloc(sizeof(struct type1));
if (ptr1 == NULL)
  {
    errcd = errno;
    goto error1;
  }
ptr2 = malloc(sizeof(struct type2));
if (ptr2 == NULL)
  {
    errcd = errno;
    goto error2;
  }
ptr3 = malloc(sizeof(struct type3));
if (ptr3 == NULL)
  {
    errcd = errno;
    goto error3;
  }

/* ... do something useful ... */

return (result);

error3:
free(ptr2);   /* ptr3 failed: unwind the earlier allocations */
error2:
free(ptr1);
error1:
return (errcd);
```

To me, the error handling just makes sense.  You are freeing memory that was allocated before the error occurred.


Anyways, I'm doing some benchmark testing based on space.  So far, this is the order that I have in regards to least amount of space occupied:


- xz
- bzip2
- gzip

I'll try some of the others to see how they work.


----------



## Eric A. Borisch (Jan 15, 2018)

Compression benchmarks: https://quixdb.github.io/squash-benchmark/


----------



## Maelstorm (Jan 16, 2018)

In trying out the other compression methods, I ran into a problem...

It seems that tar does not recognize the -- long options for compressors.  I get an error for --lrzip, --lz4, and --lzop.  I even installed the libraries for those from ports and it still will not work.  What's interesting is that it will not recognize --bzip or --bzip2 either, even though they are in the man page, but it will recognize -j, which selects the bzip2 compression algorithm.

Can someone else check this to make sure that it's not me?

This is on 11.1.


----------



## usdmatt (Jan 16, 2018)

What are the actual commands you are running?


----------



## ronaldlees (Jan 16, 2018)

Eric A. Borisch said:


> Because it compresses really well. So long as you can avoid errors, any issues with error handling are unseen. For distribution (downloads) decreased download size leads to saved time/money for everyone. Checking checksums is sufficient since you can redownload if something went wrong.
> 
> If it’s an archive, or a backup — something you don’t use unless the original is gone and you really need it to work — then resilience to/handling of errors becomes more important. So don’t equate “it is a popular format for distributing software” with “it is a good way to backup my important data.”
> 
> Also, quoting the article:



++

Also, xz tends to take longer to compress, especially when doing a backup on a system without much memory.


----------



## azathoth (Jan 16, 2018)

Maelstorm said:


> The thread title pretty much says it all: What is the best compression method for backups?
> 
> Due to the...spectacular...hardware failure that my FreeBSD box suffered last month, I've been making plans to perform automated system backups.  In addition to the USB ports on the mainboard, this machine also has a PCI Firewire/USB combo card which has an *INTERNAL* USB port.  I purchased a 128GB USB memory stick and plugged it into the internal port and formatted the drive.
> 
> ...






I know from direct experiment that pbzip2 -1 gives the best balance of speed and small output size.
I tried many of these when I was doing MySQL backups.
lzma might be worth retesting... but parallel bzip2 (pbzip2) beats parallel gzip (pigz), and using the highest compression level actually loses you a lot of time for little gain.

Of course, if you care nothing for time, then compress to the max with lzma.


----------



## Maelstorm (Jan 16, 2018)

usdmatt said:


> What are the actual commands you are running?



`tar -cPv --bzip2 -f <tar filename> --exclude <exclude filename> <target dir for archiving>`

The above command does not work...  However, the following command does:

`tar -cPvjf <tar filename> --exclude <exclude filename> <target dir for archiving>`


----------



## azathoth (Jan 16, 2018)

use parallel pbzip2
or pigz


----------



## Eric A. Borisch (Jan 16, 2018)

azathoth said:


> use parallel pbzip2
> or pigz



Or `lbzip2`, or `xz -T 0` ... all of these are options for using all of your cores for compression.
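For example, a quick sketch with multi-threaded xz (the paths are throwaway examples):

```shell
# Multi-threaded compression in a tar pipeline; -T0 uses all available cores.
mkdir -p /tmp/mt/src
awk 'BEGIN { for (i = 0; i < 50000; i++) print "sample line" }' > /tmp/mt/src/data
tar -cf - -C /tmp/mt src | xz -T0 > /tmp/mt/src.txz
xz -t /tmp/mt/src.txz && echo "stream OK"
```

Note that xz only splits work across threads once the input is large enough to chunk, so tiny archives won't show a speedup.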


----------



## stratacast1 (Jan 17, 2018)

Have you considered borg? I use its lz4 option when I do backups with it. I have used 550GB of storage on my external drive, the original size of the disk that gets backed up is 502GB, and I have 19 backups on the drive right now. It deduplicates the data too. I have found it to be reliable and have restored data from it many times. It's neat because it's designed to make encryption stupidly easy too (and it is), though both encryption and compression are optional.


----------



## T. Braun (Jan 18, 2018)

Another possibility, if a dedicated backup tool is an option: I'd like to recommend the restic project. I've been using it for more than a year now on multiple FreeBSD servers and I'm really happy with it. Backups are encrypted, files are split into chunks and compressed, deduplication of the chunks is included, it's very reliable, and (after the initial backup) incremental backups are incredibly fast. Restore also works flawlessly, as I've had to verify on multiple occasions.


----------



## herrbischoff (Jan 18, 2018)

stratacast1 said:


> Have you considered borg?



I can attest to the reliability of Borg. It's the main backup tool in place at all of my clients' servers and has proven rock solid thus far. Several backups are multiple terabytes in size and every restore ever has been successful. Also, due to (optional) built-in AES-CTR-256 encryption, offsite storage of the backups is one less headache to have.

https://borgbackup.readthedocs.io/en/stable/


----------



## Chris_H (Jan 18, 2018)

Maelstorm said:


> `tar -cPv --bzip2 -f <tar filename> --exclude <exclude filename> <target dir for archiving>`
> 
> The above command does not work...  However, the following command does:
> 
> `tar -cPvjf <tar filename> --exclude <exclude filename> <target dir for archiving>`


I've noticed that too. But then again, why would I not want to use
`tar -cPvjf <tar filename> --exclude <exclude filename> <target dir for archiving>`
in the first place? Fewer characters to enter, and easier to understand and remember...
In the end, I _do_ remember struggling with something like that in the past, and as memory serves, I think I finally determined that it was the dangling `-f` that tripped it up.

But for my money
`tar cvf - <some-data/some-place> | xz -9e > <some-place/some-filename>`
works really well, for all my needs.

--Chris


----------



## phoenix (Jan 18, 2018)

Maelstorm said:


> `tar -cPv --bzip2 -f <tar filename> --exclude <exclude filename> <target dir for archiving>`
> 
> The above command does not work...  However, the following command does:
> 
> `tar -cPvjf <tar filename> --exclude <exclude filename> <target dir for archiving>`



`tar -cPv --use-compress-program /path/to/program -f <tar filename> --exclude <exclude filename> <target dir for archiving>`

Then you can use any compression program you want.

Or, do it the old-fashioned way where you tar everything up into a single giant tarball, and then run the compression program against that file afterward.
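For example, the old-fashioned two-step on throwaway data (paths are examples):

```shell
# Step 1: tar everything into one plain tarball; step 2: compress it.
mkdir -p /tmp/two/src
echo "hello" > /tmp/two/src/file
tar -cf /tmp/two/backup.tar -C /tmp/two src
xz -f /tmp/two/backup.tar            # replaces it with backup.tar.xz
tar -tJf /tmp/two/backup.tar.xz      # list contents to verify
```

The drawback is the temporary disk space for the uncompressed tarball, which the pipe-based approaches avoid.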


----------



## Eric A. Borisch (Jan 19, 2018)

Or the Unix way:

`tar -vcPf - --exclude <excl> <target dir> | lbzip2 > file.tbz`

This outputs the archive from tar to stdout (`-f -`), then compresses it (stdin to stdout) with lbzip2 (replace with the tool of your choice) and redirects the output into file.tbz.
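And the matching restore pipe, shown here end to end on throwaway data (plain bzip2 used for illustration; substitute lbzip2 or pbzip2 for the parallel versions):

```shell
# Create, compress through a pipe, then restore through the reverse pipe.
mkdir -p /tmp/pipe/src /tmp/pipe/restore
echo "payload" > /tmp/pipe/src/f
tar -cf - -C /tmp/pipe src | bzip2 > /tmp/pipe/file.tbz
bzip2 -dc /tmp/pipe/file.tbz | tar -xf - -C /tmp/pipe/restore
cat /tmp/pipe/restore/src/f
```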


----------



## poorandunlucky (Jan 19, 2018)

I haven't read the whole thread, but I just want to say that for something like this you should definitely run benchmarks to see what works best with your hardware and your load.  Prepare data sets that are representative of what's on your system, maybe a full image if you have enough time/space, and compare the methods/algorithms.

Backups are recurrent, and they consume time and resources.  Because of that, small differences can become big differences over weeks, months, years, decades...

Also, if you're in a mission-critical situation, you may want to benchmark and factor in your decompression time as well...
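A rough sketch of such a benchmark (the sample data and paths are made up; wrap each compressor in time(1) to capture the speed side as well):

```shell
# Same input, three compressors; compare the resulting sizes.
mkdir -p /tmp/bench
awk 'BEGIN { for (i = 0; i < 100000; i++) print "representative sample data" }' > /tmp/bench/data
tar -cf /tmp/bench/data.tar -C /tmp/bench data
for c in gzip bzip2 xz; do
  "$c" -c /tmp/bench/data.tar > "/tmp/bench/data.tar.$c"
  echo "$c: $(wc -c < "/tmp/bench/data.tar.$c") bytes"
done
```

Repeat with data that actually resembles your system (text config files, binaries, already-compressed media) before committing to one algorithm.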


----------

