# ZFS/HAST - Why so much data replication?



## GeorgeLinn (Dec 12, 2013)

I am running FreeBSD 9.2-STABLE and use ZFS with HAST. I am trying to figure out why, when I write a 100 MB file to my ZFS pool1, HAST replicates around 400 MB of data. I would expect around 200 MB to get replicated because I am using mirrored vdevs, but why 400 MB?

Pool configuration:

```
        NAME                 STATE     READ WRITE CKSUM
        pool1                ONLINE       0     0     0
          mirror-0           ONLINE       0     0     0
            hast/disk1.eli   ONLINE       0     0     0
            hast/disk2.eli   ONLINE       0     0     0
          mirror-1           ONLINE       0     0     0
            hast/disk3.eli   ONLINE       0     0     0
            hast/disk4.eli   ONLINE       0     0     0
          mirror-2           ONLINE       0     0     0
            hast/disk5.eli   ONLINE       0     0     0
            hast/disk6.eli   ONLINE       0     0     0
          mirror-3           ONLINE       0     0     0
            hast/disk7.eli   ONLINE       0     0     0
            hast/disk8.eli   ONLINE       0     0     0
          mirror-4           ONLINE       0     0     0
            hast/disk9.eli   ONLINE       0     0     0
            hast/disk10.eli  ONLINE       0     0     0
          mirror-5           ONLINE       0     0     0
            hast/disk11.eli  ONLINE       0     0     0
            hast/disk12.eli  ONLINE       0     0     0
        spares
          hast/disk13.eli    AVAIL
```

The command I ran to create the ~100 MB file: `dd if=/dev/random of=./file.out bs=1000000 count=100`. The approximate bandwidth used while HAST replicates is 23-24 Mbps (measured with iftop).
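
As an aside for anyone reproducing this: with `bs=1000000 count=100`, dd writes exactly 100,000,000 bytes (about 95.4 MiB), which matters when comparing against the replication totals. A quick local check (a sketch; `/dev/urandom` is substituted here so the read never blocks, and the path is illustrative):

```shell
# Create the same-sized test file and print its exact size in bytes.
dd if=/dev/urandom of=/tmp/file.out bs=1000000 count=100 2>/dev/null

# stat flags differ between GNU (-c %s) and BSD (-f %z) userlands;
# try GNU first and fall back to BSD.
stat -c %s /tmp/file.out 2>/dev/null || stat -f %z /tmp/file.out
```

Either stat variant should print 100000000.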

Three separate runs of the dd command to create the 100 MB file:

Run 1: ~2 minutes 17 seconds, ~411 MB transferred
Run 2: ~2 minutes 4 seconds, ~345 MB transferred
Run 3: ~2 minutes 12 seconds, ~383 MB transferred
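
Dividing each replicated total by the 100 MB written gives the write amplification for each run, which is a handy number to track across tests. A small sketch using the figures reported above:

```shell
# Write amplification = MB replicated by HAST / MB written by dd.
written=100
for replicated in 411 345 383; do
    awk -v r="$replicated" -v w="$written" \
        'BEGIN { printf "%d MB replicated -> %.2fx amplification\n", r, r / w }'
done
```

This prints amplification factors of 4.11x, 3.45x and 3.83x, i.e. roughly double what the 2x mirroring alone would explain.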

I hope I am not missing something obvious.

Thanks.
George


----------



## wblock@ (Dec 12, 2013)

100 MB written to mirrored vdevs = 200 MB, and that is then replicated to the other HAST system.  But because ZFS is COW (copy-on-write), if a previous version of the file exists, ZFS also has to go back and free the blocks the old version used.  The variations might be due to compression.  Using the same file repeatedly may make it easier to see a pattern.


----------



## GeorgeLinn (Dec 13, 2013)

Thanks for your reply.  I forgot to mention that I was creating a different file each time I ran the dd command.  I can see how COW could matter when overwriting the same file name, though.

I am now thinking this extra data is being generated by the ZIL.  In the past I have used separate disks for the ZIL, but not in this HAST environment, so the ZIL is located directly within my vdevs.  http://docs.oracle.com/cd/E19253-01/819 ... index.html says "By default, the ZIL is allocated from blocks within the main storage pool."  That would explain why the 200 MB turns into 400 MB: 200 MB is written to the ZIL first and then committed to the file system?
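
Two quick checks before settling on the ZIL theory (a sketch to run on the live system; pool name as above): confirm the pool really has no separate log vdev, and see whether any dataset forces synchronous writes.

```
# No "logs" section appears in the zpool status output above, but this
# confirms there is no dedicated log vdev:
zpool status pool1

# With the default sync=standard, only writes the application itself
# makes synchronous go through the ZIL:
zfs get -r sync pool1
```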


----------



## usdmatt (Dec 13, 2013)

I wouldn't expect dd to use sync writes, so the data shouldn't be written to the ZIL.

It would be interesting to know what makes up that extra data, though. HAST itself probably accounts for some of it (if those transfer figures come from observing the network traffic): there is protocol overhead, and I believe HAST also marks a certain number of blocks as dirty even if they haven't changed, to optimise local performance at the expense of sending more data to the secondary.
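
If the totals are coming from iftop, one way to take protocol overhead out of the measurement (a sketch to run on the HAST primary; `disk1` and the output path are illustrative names) is to compare HAST's own per-resource counters before and after a single write:

```
# Snapshot HAST's per-resource statistics, perform one write, then
# compare; the dirty byte count reflects what HAST queues for the
# secondary, independent of network framing.
hastctl status disk1
dd if=/dev/random of=/pool1/file.out bs=1000000 count=100
hastctl status disk1
```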


----------



## GeorgeLinn (Dec 15, 2013)

I have repeated each of the following three times and taken an average:

- dd a new 5 gigabyte random file onto the HAST ZFS pool: ~14.5 gigabytes transferred to the second HAST node.
- dd a new 100 megabyte random file onto the HAST ZFS pool: ~290 megabytes transferred to the second HAST node.
- cp an existing 100 megabyte file from the HAST ZFS pool to a local UFS volume, remove the original from the HAST ZFS pool, then copy the 100 megabyte file from UFS back to the HAST ZFS pool: ~300 megabytes transferred to the second HAST node.
- mv an existing 100 megabyte file from the HAST ZFS pool to a local UFS volume, then move the 100 megabyte file from UFS back to the HAST ZFS pool: ~325 megabytes transferred to the second HAST node.
- cp an existing 100 megabyte file from the HAST ZFS pool to another filename on the same HAST ZFS pool: ~290 megabytes transferred to the second HAST node.

Everything seems to transfer three times over: twice because of the mirroring, but I am still not sure where the third copy comes from.  Maybe the ZIL?

Now that I am writing this post, I am thinking I should look up how to monitor ZIL usage and, if that is possible, run the tests again to see if/when the ZIL is being used.


----------



## kpa (Dec 15, 2013)

The ZIL is only used when there are synchronous writes, and none of the tests you have done there involve synchronous writes.
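
A direct way to test that (a sketch; `sync` is a standard ZFS dataset property, and the dtrace one-liner assumes the kernel's zil_commit function is visible to the fbt provider) is to force synchronous semantics temporarily and compare the replicated totals:

```
# Force every write through the ZIL, rerun the dd test, then restore
# the default. If the ZIL were responsible for the extra copy, the
# replicated total should grow noticeably with sync=always.
zfs set sync=always pool1
dd if=/dev/random of=/pool1/file.out bs=1000000 count=100
zfs set sync=standard pool1

# Optionally, count ZIL commits while the test runs (requires root):
dtrace -n 'fbt::zil_commit:entry { @commits = count(); }'
```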


----------

