# ZFS dedup switched on with data present. How to force a 'scan'?



## frijsdijk (Feb 29, 2012)

I have a ZFS filesystem used for backups of a bunch of servers. This has been running a couple of days, and today I've switched on dedup. Is there a way to let ZFS 'scan' the filesystems for duplicate data, or can this only happen realtime while data is written?


----------



## SirDice (Feb 29, 2012)

ZFS's dedup is synchronous, which means it only works online while the data is being written.

https://blogs.oracle.com/bonwick/entry/zfs_dedup


----------



## frijsdijk (Feb 29, 2012)

Thanks SirDice. That's good reading!


----------



## phoenix (Feb 29, 2012)

Note:  do *NOT* enable dedupe if you have less than 16 GB of RAM in the system.  And do *NOT* enable dedupe if you do not have a cache device enabled in the pool.  And, if you have over 10 TB of data in the pool, you'll want at least 32 GB of RAM.  IOW, stick as much RAM into the box as you can afford.

ZFS requires a lot of RAM.  Dedupe requires even more RAM.  And a pool with multiple tens of TB requires even more RAM.


----------



## gkontos (Feb 29, 2012)

IMO and judging from my experience so far, there should be a warning regarding dedup. 

*---> Stay away unless you really know what you are doing! <---*


----------



## peetaur (Mar 1, 2012)

Remember that dedup makes things very slow. Make sure to test it before it is permanently in place. 

I found that it is slightly slow with only a few TB, but completely unusable with what I tried it with (over 10TB, up to 30TB). So maybe on a home system with 6 TB and 4 GB of RAM, things are fine... or maybe they will only seem fine and then get super slow later.

For example: Last week I tried it again to transfer 11.5 TB to some disks that were a tiny bit too small with 2 x raidz2. It started out fast at around 200-350 MB/s (making the whole process take around 11 hours; without dedup I expect about 350-400 MB/s with such a zfs send&recv), and then after a few hours, it was down to 20 or so MB/s (slowing the process to around 7 days). So I restarted with more space (1 x raidz2) instead of dedup. The system I used has 48GB of RAM all dedicated to ZFS.
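The time estimates above are simple arithmetic. A minimal Python sketch, assuming decimal units (1 TB = 1,000,000 MB) and a constant transfer rate; the function name is made up for illustration:

```python
def transfer_hours(data_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move data_tb terabytes at a steady rate_mb_s MB/s."""
    total_mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    return total_mb / rate_mb_s / 3600


if __name__ == "__main__":
    for rate in (300, 20):
        hours = transfer_hours(11.5, rate)
        print(f"{rate:>3} MB/s: {hours:6.1f} hours ({hours / 24:.1f} days)")
```

At 20 MB/s this works out to roughly 6.7 days, in line with the "around 7 days" figure.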


----------



## peetaur (Mar 1, 2012)

phoenix said:

> Note:  do *NOT* enable dedupe if you have less than 16 GB of RAM in the system. ... 10 TB of data in the pool, ... at least 32 GB of RAM.



Have you run any successful fast systems with dedup with more than 10 TB of data? If so, how?

As I said above, I tested it on a system with 48GB of RAM and a striped cache of 150x2 on SSDs, and about 15-22 TB and it was horribly slow. And last week I tried it just enabling dedup on an empty dataset on new disks that I was moving 11.5 TB of data to, and it went horribly slow again (slower than a low end home PC from 15 years ago). I also tried setting primarycache to metadata (possibly meaning to exclude data and therefore have more space for the dedup table) and secondarycache to all, which gave the same performance.


----------



## frijsdijk (Mar 1, 2012)

phoenix said:

> Note:  do *NOT* enable dedupe if you have less than 16 GB of RAM in the system.  And do *NOT* enable dedupe if you do not have a cache device enabled in the pool.  And, if you have over 10 TB of data in the pool, you'll want at least 32 GB of RAM.  IOW, stick as much RAM into the box as you can afford.
> 
> ZFS requires a lot of RAM.  Dedupe requires even more RAM.  And a pool with multiple tens of TB requires even more RAM.



Phoenix, can you refer me to any URL stating all these warnings? I'd like to verify. Also, I'd like to know how much RAM is needed for a filesystem of n-TB. Is there a lookup-table for this? I've been searching, but can't find any. I know RAM is needed for the tables to be kept in memory, in order to keep it all snappy.

Speed, in my/this case, is not the most important factor. This will be a backup server which will probably start with something like 7-10TB of disk space, and I can stick in let's say at least 16GB of RAM. That's not a problem. The rate at which data is written to the fs is not that high and there is not too much concurrency.


----------



## funky (Mar 1, 2012)

frijsdijk said:

> Phoenix, can you refer me to any URL stating all these warnings? [...]


I hope Phoenix won't mind if I answer in his place ...

http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe
http://hub.opensolaris.org/bin/view/Community+Group+zfs/dedup

and maybe he will add some more.


----------



## throAU (Mar 1, 2012)

Here's a link with some de-dupe info:  http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe

The general consensus seems to be that compression is more of a win than de-dupe, but YMMV.  I'd get some test data in the pool and see what sort of de-dupe ratio you can achieve to see whether or not it is worth turning on.


----------



## phoenix (Mar 1, 2012)

peetaur said:

> Have you run any successful fast systems with dedup with more than 10 TB of data? If so, how?
> 
> As I said above, I tested it on a system with 48GB of RAM and a striped cache of 150x2 on SSDs, and about 15-22 TB and it was horribly slow. And last week I tried it just enabling dedup on an empty dataset on new disks that I was moving 11.5 TB of data to, and it went horribly slow again (slower than a low end home PC from 15 years ago). I also tried setting primarycache to metadata (possibly meaning to exclude data and therefore have more space for the dedup table) and secondarycache to all, which gave the same performance.



If you set *primarycache=metadata* then L2ARC is not used, since there's no data to move from ARC to L2ARC.  You have to set *secondarycache=metadata* and *primarycache=all*.

We use dedupe on our backup boxes, which do rsync backups for 150-odd servers each night, in under 14 hours.  We're limited by the uplink speeds of the remote sites (most are 768 Kbps ADSL, some are 10 Mbps E10, some are 100 Mbps E100, a few are gigabit fibre), but still manage to get over 250 Mbps of traffic through.  Sure, that's only ~30 MBps, but that's more than enough for our uses.  Especially since we have over 5x combined compression/dedupe ratio with over 10 TB of pool storage used (pool about 30% full).

These boxes have 20 GB and 24 GB of RAM, with 32 GB of L2ARC on SSD, using 4x 6-drive raidz2 vdevs, with 500 GB, 1 TB, 1.5 TB drives.

Throughput on these boxes with dedupe enabled is the same as on the previous storage boxes without dedupe.  For our uses, we are not seeing "slowness" due to dedupe.  Instead, we are seeing a massive savings on disk usage.

The newest storage box that I'm putting together will use all 2.0 TB drives, with 32 GB of RAM and 32 GB of L2ARC.  It will be the off-site replica of the other two, using zfs send/recv to transfer data.  Still waiting on some hardware to arrive before I can start testing it.


----------



## phoenix (Mar 1, 2012)

frijsdijk said:

> Phoenix, can you refer me to any URL stating all these warnings? I'd like to verify. Also, I'd like to know how much RAM is needed for a filesystem of n-TB. Is there a lookup-table for this? I've been searching, but can't find any. I know RAM is needed for the tables to be kept in memory, in order to keep it all snappy.



No URLs handy, but there are plenty of warnings to this effect in the zfs-discuss mailing list archives.  Basically, you need an extra 260 bytes of ARC space per dedupe table (DDT) entry, and there's 1 entry per unique block of data in the pool.  And, if the DDT spills over into the L2ARC, then you need an extra 200-odd bytes of ARC space to track the L2ARC usage.  The rough calculations shown on the zfs-discuss mailing list recommend approx 1 GB of ARC per TB of data in the pool.

That's an extra 1 GB of ARC space per TB of data, over and above whatever normal ARC space you need for your normal pool usage, and over and above whatever other RAM you need to run the other services on the box.
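As a rough illustration of that arithmetic, here is a Python sketch using the 260 bytes per entry figure from above. The average block size is an assumption you have to supply yourself (it does not come from the mailing list), and the function names are made up for this example.

```python
def ddt_entries(unique_data_bytes: int, avg_block_bytes: int) -> int:
    """One DDT entry per unique block (ceiling division)."""
    return -(-unique_data_bytes // avg_block_bytes)


def ddt_arc_bytes(entries: int, bytes_per_entry: int = 260) -> int:
    """ARC space for the DDT, using the per-entry figure quoted above."""
    return entries * bytes_per_entry


if __name__ == "__main__":
    TIB = 1024 ** 4
    for block_kib in (32, 64, 128):  # assumed average block sizes
        entries = ddt_entries(1 * TIB, block_kib * 1024)
        gib = ddt_arc_bytes(entries) / 1024 ** 3
        print(f"avg block {block_kib:>3} KiB: {entries:>10,} entries, "
              f"~{gib:.1f} GiB of ARC per TiB")
```

The result swings by a factor of several depending on the block size you assume, which is why the 1 GB per TB figure should only be read as a rough guideline.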

IOW, more RAM is always better.  



> Speed, in my/this case, is not the most important factor. This will be a backup server which will probably start with something like 7-10TB of disk space, and I can stick in let's say at least 16GB of RAM. That's not a problem. The rate at which data is written to the fs is not that high and there is not too much concurrency.



16 GB will work to start.  We started out with 16 GB of RAM in our storage boxes, using rsync to do backups for over 150 remote Linux/FreeBSD servers.  Worked well, with dedupe enabled from the start.  However, trying to delete a snapshot with only 16 GB of RAM would lock up the box as it very quickly ran out of RAM (all going to Wired aka ARC).  We had to bump it to 20 GB, and lock the ARC to 17 GB, before we could delete snapshots without lockups.  We have plans to push these to 32 GB of RAM over the summer.


----------



## phoenix (Mar 1, 2012)

throAU said:

> Here's a link with some de-dupe info:  http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe
> 
> The general consensus seems to be that compression is more of a win than de-dupe, but YMMV.  I'd get some test data in the pool and see what sort of de-dupe ratio you can achieve to see whether or not it is worth turning on.




```
[fcash@betadrive  ~]$ zfs list storage
NAME      USED  AVAIL  REFER  MOUNTPOINT
storage  9.76T  11.0T   256K  none

[fcash@betadrive  ~]$ sudo zdb -DD storage
DDT-sha256-zap-duplicate: 15200443 entries, size 715 on disk, 160 in core
DDT-sha256-zap-unique: 49903545 entries, size 774 on disk, 176 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    47.6M   4.24T   3.10T   3.21T    47.6M   4.24T   3.10T   3.21T
     2    9.72M   1022G    808G    826G    21.7M   2.24T   1.78T   1.82T
     4    2.42M    250G    183G    189G    11.7M   1.17T    871G    899G
     8     944K   72.4G   45.4G   48.7G    9.66M    768G    485G    519G
    16     333K   30.1G   14.8G   16.0G    6.63M    592G    295G    318G
    32     823K   29.5G   18.8G   23.0G    37.9M   1.39T    911G   1.08T
    64     258K   20.3G   10.3G   11.5G    21.2M   1.58T    824G    927G
   128    39.9K    681M    406M    634M    6.46M    114G   68.1G    105G
   256    9.57K    221M    126M    180M    3.25M   82.0G   47.1G   65.5G
   512    4.65K    183M    116M    141M    3.06M    118G   74.9G   91.0G
    1K    1.26K   40.9M   23.8M   30.9M    1.65M   55.0G   31.4G   40.7G
    2K      987   26.5M   10.9M   16.5M    2.56M   70.5G   29.8G   44.8G
    4K      345   12.3M   5.66M   7.63M    1.94M   69.5G   31.3G   42.7G
    8K      158   2.90M    670K   1.61M    1.78M   33.1G   7.14G   18.2G
   16K      312   2.38M    964K   2.90M    7.81M   50.3G   19.1G   69.7G
   32K       70   1.41M    604K   1.01M    2.69M   55.6G   23.2G   39.6G
   64K        7   5.50K   3.50K   49.7K     585K    459M    293M   4.06G
  128K        9     13K      8K   63.9K    1.58M   2.24G   1.39G   11.2G
  256K        2      1K      1K   14.2K     739K    369M    369M   5.12G
 Total    62.1M   5.63T   4.16T   4.30T     191M   12.6T   8.51T   9.23T

dedup = 2.15, compress = 1.48, copies = 1.08, dedup * compress / copies = 2.93
```

Dedupe saves us over 2x the disk space.  Lzjb compression gives us a ratio of a little under 1.5x.  IOW, dedupe is the bigger win for this box, with a combined savings of just under 3x.  Meaning, we're storing just under 30 TB of data in only 9 TB of physical disk space.
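For anyone wanting to check the summary line, it can be reproduced from the Total row of the histogram. This is a Python sketch, assuming zdb derives the ratios as referenced DSIZE / allocated DSIZE (dedup), referenced LSIZE / referenced PSIZE (compress) and referenced DSIZE / referenced PSIZE (copies). The inputs are the rounded figures printed above, so on other pools the last digit can differ from what zdb reports.

```python
def ddt_ratios(alloc_dsize, ref_lsize, ref_psize, ref_dsize):
    """Return (dedup, compress, copies, combined) from a DDT Total row.

    All four arguments must be in the same unit (here: TB as printed by zdb).
    """
    dedup = ref_dsize / alloc_dsize
    compress = ref_lsize / ref_psize
    copies = ref_dsize / ref_psize
    combined = dedup * compress / copies
    return dedup, compress, copies, combined


if __name__ == "__main__":
    # Total row above: allocated DSIZE 4.30T; referenced 12.6T / 8.51T / 9.23T
    dedup, compress, copies, combined = ddt_ratios(4.30, 12.6, 8.51, 9.23)
    print(f"dedup = {dedup:.2f}, compress = {compress:.2f}, "
          f"copies = {copies:.2f}, combined = {combined:.2f}")
```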

What's most impressive is the last line (256K line) of the zdb output.  2 blocks of data are referenced over 256,000 times, storing 5 GB of data in just 256 KB (or less).  


```
[fcash@alphadrive  ~]$ sudo zfs list storage
NAME      USED  AVAIL  REFER  MOUNTPOINT
storage  31.0T  5.81T   288K  none

[fcash@alphadrive  ~]$ sudo zdb -DD storage
DDT-sha256-zap-duplicate: 36317745 entries, size 1099 on disk, 177 in core
DDT-sha256-zap-unique: 77168470 entries, size 1163 on disk, 188 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    73.6M   7.23T   4.85T   5.14T    73.6M   7.23T   4.85T   5.14T
     2    20.8M   2.27T   1.65T   1.72T    45.9M   4.98T   3.64T   3.78T
     4    8.48M    760G    543G    577G    43.4M   3.74T   2.68T   2.85T
     8    1.85M    163G    109G    117G    19.6M   1.68T   1.11T   1.20T
    16    1.42M    149G   80.1G   87.2G    31.9M   3.34T   1.76T   1.92T
    32    1.10M    101G   63.2G   68.6G    49.3M   4.26T   2.67T   2.90T
    64     369K   32.7G   20.3G   22.0G    32.7M   2.93T   1.79T   1.94T
   128     588K   68.1G   41.0G   43.4G     104M   12.0T   7.18T   7.60T
   256    41.6K   3.34G   2.42G   2.59G    15.3M   1.29T    942G   1007G
   512    10.6K    484M    272M    337M    7.68M    339G    181G    229G
    1K    1.74K   52.3M   24.3M   36.2M    2.40M   71.6G   33.2G   49.7G
    2K      889   24.8M   10.2M   16.3M    2.29M   71.1G   29.3G   45.4G
    4K      343   5.30M   2.41M   4.74M    1.73M   31.6G   14.8G   26.8G
    8K      256   7.54M   3.58M   5.32M    2.88M   91.7G   43.9G   63.9G
   16K      293   2.97M   2.14M   4.18M    7.24M   57.3G   40.1G   92.4G
   32K       50    697K    464K    799K    1.86M   25.4G   16.0G   28.5G
   64K       11    264K     16K    104K     952K   20.9G   1.32G   8.69G
  128K        2      1K      1K   16.0K     392K    196M    196M   3.06G
  256K        1     512     512   7.99K     406K    203M    203M   3.17G
 Total     108M   10.7T   7.33T   7.75T     443M   42.2T   26.9T   28.9T

dedup = 3.72, compress = 1.57, copies = 1.07, dedup * compress / copies = 5.44
```

Dedupe saves us almost 4x the disk space, compared to a ratio of only about 1.5x from Lzjb compression, for a combined disk space savings of over 5x.  IOW, we are storing 150 TB of data in only 30 TB of physical disk.  

Note the 256K line here:  3 GB of data stored in 1 data block.  

For backup servers, where the systems being backed up are all very similar, dedupe is a very big win.  For other setups, dedupe may not work as well.  YMMV, but we are extremely happy with FreeBSD 9.x, ZFSv28, and dedupe.
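The two DDT summary lines at the top of the `zdb -DD` output can also be turned into a rough memory estimate. A Python sketch, assuming the "in core" number is the average in-memory size of one entry in bytes, and that the whole table would be held in memory; the helper name is made up for this example:

```python
import re

DDT_LINE = re.compile(
    r"^DDT-\S+: (\d+) entries, size (\d+) on disk, (\d+) in core$"
)


def ddt_in_core_bytes(zdb_output: str) -> int:
    """Sum entries * in-core bytes over all DDT summary lines."""
    total = 0
    for line in zdb_output.splitlines():
        match = DDT_LINE.match(line.strip())
        if match:
            entries, _on_disk, in_core = map(int, match.groups())
            total += entries * in_core
    return total


# Summary lines copied from the alphadrive output above.
SAMPLE = """\
DDT-sha256-zap-duplicate: 36317745 entries, size 1099 on disk, 177 in core
DDT-sha256-zap-unique: 77168470 entries, size 1163 on disk, 188 in core
"""

if __name__ == "__main__":
    total = ddt_in_core_bytes(SAMPLE)
    print(f"{total} bytes, ~{total / 1024 ** 3:.1f} GiB")
```

For the alphadrive numbers this comes to about 19.5 GiB under those assumptions.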


----------



## gkontos (Mar 1, 2012)

@phoenix,

That is really very interesting data. Thank you for sharing!

As a general rule of thumb, would you say that:


Combined dedup + compression, requires 1 GiB of RAM for 1 TB of storage with a double sized (RAM) SSD cache device?


The calculation is based on total raw storage rather than on available storage. Meaning that a raidz1 consisting of 3x1 TB would require 3 GiB of RAM and 6 GiB for a cache device. Similarly, a 3x2 TB configuration would require 12 GiB of RAM and 24 GiB for a cache device.

George


----------



## phoenix (Mar 1, 2012)

As with everything in computing, _it all depends_.  

The "rule of thumb" according to the zfs-discuss mailing list is 1 GB of ARC for every 1 TB of unique data.  Note the unique part.    You need 1 DDT entry for each unique block of data.  Duplicate blocks don't require DDT entries, they're just a reference count in the DDT entry.  So, if you have extremely duplicate data in the pool, you don't need a lot of ARC.  However, if you have very non-duplicate data in the pool, then you will need a lot of extra ARC.  It all depends.    In the last case, compression would be better than dedupe.

And it's based on the amount of data in the pool, not the total size of the pool.  Although you could use the total size of the pool as an upper bound for the amount of RAM you'd need.  That upper bound is the storage available in the pool after all redundancy is taken out (i.e. a 3x 1 TB raidz1 would be 2 TB of storage available).
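To make the upper bound concrete, here is a small Python sketch. It assumes a single raidz vdev, ignores metadata and padding overhead, and applies the 1 GB per TB rule of thumb as-is; the function names are made up for this example.

```python
def raidz_usable_tb(disks: int, disk_tb: float, parity: int) -> float:
    """Usable capacity of one raidz vdev, ignoring metadata and padding overhead."""
    if disks <= parity:
        raise ValueError("a raidz vdev needs more disks than parity devices")
    return (disks - parity) * disk_tb


def arc_upper_bound_gb(usable_tb: float, gb_per_tb: float = 1.0) -> float:
    """Upper bound on extra ARC for dedup under the 1 GB per TB rule of thumb."""
    return usable_tb * gb_per_tb


if __name__ == "__main__":
    usable = raidz_usable_tb(disks=3, disk_tb=1.0, parity=1)
    print(f"usable: {usable} TB, extra ARC upper bound: {arc_upper_bound_gb(usable)} GB")
```

The real requirement will usually be lower, since it scales with the unique data actually stored rather than with the pool size.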

But, remember, it's just a "rule of thumb", and it all depends on your pool layout, your data, and your usage patterns.  What works for the goose, may not work for the gander.


----------



## gkontos (Mar 1, 2012)

Thanks for your super fast reply! 

The reason I put it so bluntly is that I am trying to determine a nice quick recipe involving only the minimum values.

To be honest I have almost zero experience in regards to dedup + compression with very large storage. 
But assuming you are correct, even if I double your RAM specifications (which I will), we still get a much better value for money solution compared to a brand named solution. 

Add to that all the cool ZFS features. I love FreeBSD.


----------



## throAU (Mar 2, 2012)

Sorry, by "more of a win", I meant for the average non-enterprise user who doesn't have 20-30 gigs of ram and an SSD in their box to spare for caching.

If you have the resources to use de-dupe, go for it.  However, if not... compression will get you some decent savings without anywhere near as much RAM needed. 

I didn't mean to imply that de-dupe won't get better space savings, sorry I should have clarified.  Was more trying to reinforce that it's not a magic bullet that some (previously, me included) seem to think it is.  There are definite trade-offs in terms of resource requirements in addition to speed.  The trade-offs for compression are nowhere near as severe.


----------



## peetaur (Mar 2, 2012)

Phoenix, 

I tried a simple dedup test on my server again. I made sure there is a cache device (I realize now that my test last week didn't have a cache device, because it was a different pool). I made sure primarycache and secondarycache are unset (default all). I started copying from the same pool to a new dedup dataset, and in the beginning, the copy was slow and bursty (3-54 MB/s, avg 17 MB/s), and then in a few minutes, was faster (22-207 MB/s, avg 95 MB/s) but not nearly as fast as non-deduped (350-400 MB/s).

The resulting dedupratio was 1.59x, so multiply the dedup speeds above by that number.

Then I enabled compression. It kept going around the same rate.

(side note: and then something not related to dedup happened and the system hung)

So in conclusion, I can only assume that dedup was horribly broken whenever I tried it before, which was either with the 8-STABLE iso on ftp (May 2011) or what I csuped in September 2011 (not sure which I tested), but seems to work now from what I csuped February 4th 2012.


----------



## phoenix (Mar 2, 2012)

There were many, many, many, many ZFS-related stability and performance fixes that hit the FreeBSD source tree after the release of 8.2.  And most of the big changes happened in Q4 (Oct-Dec) 2011.  So you probably tested just before all the big fixes hit the tree.  

Anyone wanting to use ZFSv28 to its fullest should be running 8-STABLE from 2012, 8.3-PRERELEASE/RELEASE, or 9.0-RELEASE/9-STABLE from 2012.


----------

