# ZFS using 'advanced format drives' with FreeBSD (8.2-RC3)



## Bucky (Feb 8, 2011)

Goal:  Add a large amount of storage to my existing home server, using zfs and a bunch of big, cheap HDDs.  Not interested in booting into zfs.  My home server already runs 24/7 as it has a mail server and DHCP server running, too.

My H/W (you want pretty fast, and mostly LOTS OF RAM):


```
mobo = Asus P5WDH (only 3 'normal', on-board SATA ports)
	RAM = 6 gig ECC DDR2 (not in dual channel mode)
	CPU = C2Q6600 or C2Q9400 CPU (two otherwise identical servers)
	SATA = Supermicro AOC-USAS-L8i PCIe 8 port SATA card
	HDDs = 2 identical Brand X HDDs for mobo connectors port 0 & 1
	HDDs = 8 identical Samsung F4 HD204UI 2 TB HDDs (these are 'advanced format drives') connected to the Supermicro card.
```

Running FreeBSD 8.x AMD64 (64 bit version).

Kudos to sub.mesa and the many, many other folks who know so much more than I do, and especially to Pawel for porting zfs to FreeBSD.  Sub.mesa's website has a very good how-to on installing FreeBSD and other stuff.

This isn't *the* way to do it, just *a* way to do it.

Install FreeBSD on one of the Brand X HDDs, then gmirror(8) that drive to the other identical drive.  Edit rc.conf to enable zfs and reboot.

Identify the 8 HDDs for the zfs array by watching the bootup process, running dmesg(8), and/or looking in /dev/ (e.g. /dev/da0-7).  Make sure the drives are free of any partitioning info:
`# dd if=/dev/zero of=/dev/da0 bs=1m count=1`

Repeat for each drive.
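
Those two steps can be wrapped in one sh(1) loop. A sketch only: the device names below are this box's da0-da7, and the command is destructive, so double-check yours first:

```shell
# DESTRUCTIVE: wipes the first 1 MB of every listed drive
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
	dd if=/dev/zero of=/dev/${d} bs=1m count=1
done
```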

Force zfs to ignore the drives' (mis)reported 512 byte sector size by putting a 4k gnop(8) device on top of the first drive, thus:
`# gnop create -S 4096 /dev/da0`

Now build your zfs array, thus:
`# zpool create media raidz2 da0.nop da1 da2 da3 da4 da5 da6 da7`

Save it
`# zpool export media`

Trash the gnop
`# gnop destroy /dev/da0.nop`

Rebuild it
`# zpool import media`

REBOOT

The device list shouldn't show anything like da0.nop anymore:
`# ls /dev/`

Now test...


```
# ls /media	; should be there
# zdb		; look for ashift = 12 somewhere
```

Run a write/read speed test like this:


```
write:	dd if=/dev/zero of=/media/zerofile.000 bs=1m count=20000
read:	dd if=/media/zerofile.000 of=/dev/null bs=1m
```
My write speed showed this:

```
20971520000 bytes transferred in 52.457735 secs (399779365 bytes/sec)
```

My read speed showed this:

```
20971520000 bytes transferred in 46.018040 secs (455723884 bytes/sec)
```

I'm happy.  Similar results should be had with the Western Digital advanced format drives.

Eventually, I'll put up a web site with a much more detailed How-To, along with the theoretical basis as I understand it (probably wrong).

Hope this helps someone.


----------



## ian-nai (Feb 8, 2011)

Interesting!  That's the first time I've heard of the gnop utility.  I followed all of the steps you took but that one.  What was achieved by creating a 4096 byte (?) gnop provider?


----------



## AndyUKG (Feb 8, 2011)

The gnop device reports its sector size to zpool as 4096 bytes; based on this, the pool is created with an appropriate ashift value for 4k disks. This normally wouldn't happen, as 4k disks typically emulate 512 byte sectors, so zpool creates a pool optimised for 512 byte sectors of course!
The value of ashift is a per-pool setting that should reflect the physical sector size of the disks in order to achieve normal/optimal performance.

ta Andy.

PS nice solution Bucky!


----------



## nakal (Feb 8, 2011)

I would not test with _/dev/zero_. ZFS seems to optimize something with such streams (I got over 500 MB/s write speed on my mirrored drive). Also ZFS has a large cache, so reading a 2 GB file twice gives me 2 different results: first about 160 MB/s, then the second time 1400 MB/s.

Just for information, this is not how you would do benchmarks on ZFS.
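
A fairer quick check (still not a proper benchmark; file names and sizes here are only examples) would use incompressible data and clear the ARC between the write and the read:

```shell
# Stage ~2 GB of incompressible data outside the pool
dd if=/dev/random of=/tmp/random.bin bs=1m count=2000
# Write test
dd if=/tmp/random.bin of=/media/random.bin bs=1m
# Drop the cached contents before reading back (export/import, or reboot)
zpool export media && zpool import media
# Read test
dd if=/media/random.bin of=/dev/null bs=1m
```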


----------



## Bucky (Feb 8, 2011)

*Benchmarking..*

Thank you, nakal.  Shows how little I really know.

Since 2000 hrs last night, my server has been copying my media collection from the [soon-to-be] old server to the new server.  Using rsync over my gigabit home network, it's about 90% done transferring 5 TB of files (about 6000 files) from the old zfs pool to the new zfs pool.  Average transfer "speed" showing is 70-80 MB/s.  Been running 569 minutes as of now.

Just for reference sake...  Old server has 8 WD 1TB Caviar Black drives in zfs raidz2.  I'm running out of room over there.  New server has 8 Samsung 2TB F4 drives in zfs raidz2.


----------



## phoenix (Feb 8, 2011)

Why are you transferring the data and not simply replacing the drives in the original raidz2?  Offline 1 drive, remove it from the server, insert the new disk, and then just `# zpool replace <poolname> <olddisk> <newdisk>`

Wait for that to finish.  Then repeat with the rest of the disks.  Once all 8 are replaced ... you have 50% free space (may need to reboot or export/import the pool).
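
Roughly, each disk's cycle would look like this ('tank' and the device name are placeholders for your pool and disk):

```shell
zpool offline tank da0    # take the old disk out of service
# ...physically swap the old drive for the new, larger one...
zpool replace tank da0    # resilver onto the new disk in the same slot
zpool status tank         # wait until the resilver completes before the next disk
```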

Hrm, thinking about it, though, going from 512B disks to 4096B disks would lead to performance issues, as you'd need the pool created with a 4K sector alignment (ashift=12).


----------



## aragon (Feb 9, 2011)

Bucky, it'd be nice if you could do more benchmarks.


----------



## phoenix (Feb 9, 2011)

Installing security/openssh-portable with the HPN patches, and configuring /usr/local/etc/ssh/sshd_config to enable HPN with a buffersize of 8192, and using the *None* cipher, will increase your transfer speeds, even with rsync, *a lot*.
`# rsync --archive --hard-links --delete-during --delete-excluded --partial --inplace --numeric-ids --rsh="/usr/local/bin/ssh -oNoneEnabled=yes -oNoneSwitch=yes -oHPNBufferSize=8192" --rsync-path="sudo rsync" [email]username@remote.host[/email]:/path/to/start/from/ /path/to/copy/to/`

Be sure to disable the included SSH server (*sshd_enable="NO"*) and enable the ports version (*openssh_enable="YES"*) in /etc/rc.conf
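
Server-side, that works out to something like the following in /usr/local/etc/ssh/sshd_config (option names as I recall them from the HPN patch set - check the version you install):

```
# HPN-SSH server options (names assumed from the HPN patches)
HPNDisabled no
HPNBufferSize 8192
NoneEnabled yes
```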


----------



## Bucky (Feb 9, 2011)

Ummm, now you will see just how little I know about all this.

I didn't think to do the disc replacement with a bigger disc in the existing array, but figured nothing short of the way I did it would bypass the 512b vs 4096b sector size issue.  Somewhere I read that zfs uses the physical sector size reported by the *first* disc added to an array as the size for all the discs in the array, even those added or replaced later.  Can't mix/match drives too much, or else stability becomes an issue.

I'd be happy to run proper benchmarks if someone can tell me what program(s) to use and how to run them to your specification.

I use the base ssh as it works perfectly adequately for me.  I used to install the one from the ports, but it doesn't get me anything else that I need.

And Phoenix, thank you for formatting my initial message - I didn't know how to do that and it didn't occur to me either - first posting ever on this forum.  It looks very pretty the way you've done it and is eminently more readable.

For now, my media collection has been transferred to the new 2TB Samsung drives and I've shut down that machine until the RELEASE version of 8.2 is out.  Then I'll do an 'export' and pull out the SATA card to those drives while I build up a new server on the gmirror drives, then do an 'import' and I'll be ready to go again.


----------



## danbi (Feb 9, 2011)

The ashift value is set per vdev, not per zpool.  Therefore, in a zpool you may have one vdev with 4k drives, another with 512b drives, etc.  Just make sure you do not mix both types within a vdev, unless you create the vdev using the larger sector size.


----------



## phoenix (Feb 9, 2011)

Are you sure about that?  From what I can see on the zfs-discuss mailing list, it's a per-pool setting.  But, I haven't looked at the code to confirm that.


----------



## danbi (Feb 10, 2011)

Look at the output of zdb -- the ashift values are listed as property of the vdev.

One could easily test this with multi-vdev zpool..


----------



## AndyUKG (Feb 10, 2011)

danbi said:

> One could easily test this with multi-vdev zpool..



Yeah, I'd seen those comments on zfs-discuss too. I've just tested it; it is indeed a vdev-specific setting, which makes sense.


----------



## phoenix (Feb 10, 2011)

Hrm, interesting.

Wonder how well it copes with the situation where vdev A has ashift=9 and vdev B has ashift=12.  Wonder if it would impact performance at all, where you go to write 12 KB of data across the two vdevs (multiples of 4K written to vdev B and multiples of 0.5K written to vdev A).

Anyone done any benchmarking of creating a vdev using 4K gnop devices, with non-4K Advanced Format (aka 512B sectors) drives?  Just wondering if for our next storage box, we should create the vdevs using gnop to set ashift=12, to allow us to migrate down the road to 4K drives without issues.


----------



## danbi (Feb 11, 2011)

phoenix, indeed, this makes perfect sense. Remember that ZFS is designed to use any block device for storage, so it is very likely that the block size, and therefore ashift, will differ between devices. I wonder if it is indeed true that zpool create considers only the block size of the first device, or whether, as is more reasonable to expect, it considers the largest block size of all devices participating in the vdev.

Funny: creating a vdev out of 4k drives lets you replace them later with 512b drives without loss of performance, but the opposite is not quite true. I think, with current multi-terabyte drive capacities, using 512b as the minimum data unit does not make much sense.

Even better would be to have the ability to drop a vdev off the zpool...


----------



## boolean (Mar 30, 2011)

Thanks for this tip!


----------



## tonyalbers (Apr 7, 2011)

Slightly O/T here, but wouldn't it make more sense to copy the data using NFS instead of rsync, since ZFS is pretty much born with NFS support?

/tony


----------



## dennylin93 (Apr 7, 2011)

In some cases, rsync is much faster than NFS.


----------



## phoenix (Apr 7, 2011)

tonyalbers said:

> Slightly O/T here, but wouldn't it make more sense to copy the data using NFS instead of rsync, since ZFS is pretty much born with NFS support?



ZFS on Solaris has built-in, in-kernel servers for NFS and CIFS.  However, ZFS on FreeBSD just uses the normal nfs daemons and samba daemons.

The only difference between exporting a UFS filesystem via NFS and exporting a ZFS filesystem via NFS is that you have two different ways of configuring the NFS export line for ZFS:  via the *sharenfs* property, or via the normal /etc/exports file.  And all the *sharenfs* property does on FreeBSD is write a line to /etc/zfs/exports.

There's nothing magical or even new about NFS exporting a ZFS filesystem on FreeBSD.
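
For example (the dataset name and export options here are made up), the property route is just:

```shell
# Export a dataset to the local subnet; the options are normal exports(5) syntax
zfs set sharenfs="-maproot=root -network 192.168.1.0 -mask 255.255.255.0" media/video
# ...which simply ends up as a line in this file:
cat /etc/zfs/exports
```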


----------



## tonyalbers (Apr 8, 2011)

Thanks phoenix, I didn't know that zfs nfs shares were dealt with that way in FreeBSD.

/tony


----------



## serverhamster (May 17, 2011)

How can you see if a drive has 4k sectors?

```
da0 at mps0 bus 0 scbus0 target 0 lun 0
da0: <ATA WDC WD30EZRS-00J 0A80> Fixed Direct Access SCSI-5 device 
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 2861588MB (5860533168 512 byte sectors: 255H 63S/T 364801C)
```
Can we believe the value of 512 byte sectors? These drives probably have 4k, but I want to make sure. There is a lot of talk about the *EADR* drives, but this one is *EZRS*.


----------



## aragon (May 18, 2011)

You can't believe the advertised sector size.  Check your drive specifications - if it is said to be "advanced format" then it is a 4k drive.  You can also test by writing to it on and off 4k boundaries - it'll be reproducibly slower unaligned.
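
One way to run that test (a destructive sketch for a blank disk; the partition sizes and dd counts are arbitrary) is to benchmark through a partition that starts off, and then on, a 4 KB boundary:

```shell
# DESTRUCTIVE: only on a disk with nothing on it
gpart create -s gpt da0
gpart add -t freebsd-ufs -b 63 -s 10G da0    # starts at sector 63: NOT 4 KB aligned
dd if=/dev/zero of=/dev/da0p1 bs=1m count=2000
gpart delete -i 1 da0
gpart add -t freebsd-ufs -b 64 -s 10G da0    # sector 64 = 32 KiB: 4 KB aligned
dd if=/dev/zero of=/dev/da0p1 bs=1m count=2000
gpart destroy -F da0
```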


----------



## boolean (May 18, 2011)

Trying to figure out why my post above was edited.


----------



## serverhamster (May 19, 2011)

Thanks. It is "advanced format". I recreated the zpool.


----------



## danpritts (Jul 17, 2011)

*any downside to using the ashift=12 with 512b drives?*

I'm about to rebuild my system, as it turns out with 512-byte sector drives.  

Is there any downside, other than a slight loss of capacity with small files, in building an array assuming the 4k sector size?  I'm imagining replacing a disk later might be easier this way.


----------



## peetaur (Dec 6, 2011)

Firstly, some praise: thanks for this informative post. Previously, the only thing I had read about these non-512B sector disks was to avoid them for zfs, but your numbers look great.

And an addition: my system has 48 GB of RAM, and I found that zfs was only using around 600 MB until I raised the kmem size and other zfs tunables. You should definitely look into that. Here is basically what I used as a template: ZFSguru advanced tuning. Before tuning, I found my numbers were 60-100% higher than yours (with more expensive hardware and 16 disks) with tests like yours (simple dd or cp, with something other than /dev/zero though), after verifying no caching. But when doing reading and writing at the same time, it went horribly slow... maybe 50-100 MB/s, which is slower than my raid0 fake raid desktop. After tuning the memory, it adds up to around 500 MB/s, which is faster than an XFS 22 disk SAS RAID6 system we have.

And the main reason for my post... a question: does anyone know what sort of ashift/sector size an SSD should use? I am thinking it might affect the performance of my ZIL, which I have on an SSD and which seems far too slow. I checked now, and it says ashift=9.


----------



## bbzz (Dec 6, 2011)

I use 4K on SSD just like I use on HDD. Not using it as ZIL, though.

How did you figure out what settings you need with a 48GB RAM system? Out of curiosity, what settings did you use, or did you just copy the ones from ZFSguru?


----------



## phoenix (Dec 6, 2011)

danpritts said:

> I'm about to rebuild my system, as it turns out with 512-byte sector drives.
> 
> Is there any downside, other than a slight loss of capacity with small files, in building an array assuming the 4k sector size?  I'm imagining replacing a disk later might be easier this way.



Now's about the time to start building all ZFS pools using ashift=12, regardless of the block size used on the drives themselves.  There are negligible performance differences when using 512B disks in a 4K pool, but it future-proofs the pool:  you can add 512B and 4K disks to an ashift=12 pool, but you *cannot* add a 4K drive to an ashift=9 pool.


----------



## phoenix (Dec 6, 2011)

peetaur said:

> And an addition: my system  set up has 48 GB of ram and I found that zfs was only using around 600 MB until I raised the kmem size and other zfs tunables.



That should no longer be needed on 8-STABLE/9.x systems, as the default kmem_size has been expanded to 64 GB or thereabouts on amd64 installs.

And you should not need to tune the arc_max setting anymore, unless you need to limit it (allow more RAM for other things), as that is now auto-tuned as well.


----------



## peetaur (Dec 7, 2011)

bbzz said:

> I use 4K on SSD just like I use on HDD. Not using it as ZIL, though.
> 
> How did you figure out what setting you need with 48GB RAM system? Out of curiosity what setting did you use, or did you just copied one from ZFSguru?



I just took the template, and filled in my max memory minus 4 (leaving 1 as the guide suggested for the system, plus 3 more for no particular reason), and then set the other numbers proportionally high based on my max memory.

Here are my settings right now.

```
vm.kmem_size="44g"
vm.kmem_size_max="44g"
vfs.zfs.arc_min="80m"
vfs.zfs.arc_max="42g"
vfs.zfs.arc_meta_limit="24g"
vfs.zfs.vdev.cache.size="32m"
vfs.zfs.vdev.cache.max="256m"
vfs.zfs.vdev.min_pending="4"
vfs.zfs.vdev.max_pending="32"
kern.maxfiles="950000"
```

arc_meta_limit may be low, but my system isn't full enough to matter (16 disks filled to 30%, with 24 empty bays for the future). None of the above memory limits are ever hit. When I have more data, I expect that something will hit a limit and I will tweak further.

I check to see if I hit my limits with
`# zfs-stats -a`
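
(zfs-stats is the sysutils/zfs-stats port; the counters it summarises are plain sysctls, so a quick manual check is also possible - names as on my box, so verify on yours:)

```shell
# Current ARC size versus its configured ceiling
sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.c_max
sysctl vfs.zfs.arc_max
```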


----------



## yayix (Feb 15, 2013)

How would we go about replacing drives in a mirrored or raidzN vdev previously created with gnop (all vdevs show ashift=12)?  Do we need to create a gnop for the new drive and use the gnop device in the zpool replace command?


----------



## phoenix (Feb 15, 2013)

Nope.  The ashift is set permanently on the vdev when the vdev is created.  Just replace the drives normally, and the ashift won't change.


----------



## yayix (Feb 15, 2013)

phoenix said:

> Nope.  The ashift is set permanently on the vdev when the vdev is created.  Just replace the drives normally, and the ashift won't change.



Got it. Thanks.

What if the mirrored vdev (the sample below is for the ZIL) has a current ashift=9 and I want to change it to ashift=12?  Are my steps below perfectly safe?  Notice the resilver process kicking in after `zpool attach`.  Do I need the extra detach/attach steps to remove the "da2.nop" device from the vdev and use "da2" directly?  Is there any difference with having da2.nop show up in zpool status?

`# zpool status`

```
pool: data
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
        logs
          da2       ONLINE       0     0     0
          da3       ONLINE       0     0     0

errors: No known data errors
```

`# zdb | grep ashift`

```
ashift: 9
            ashift: 9
            ashift: 9
            ashift: 9
```

`# zpool remove data da2 da3`
`# zpool status`

```
pool: data
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0

errors: No known data errors
```
`# gnop create -S 4096 da2`
`# zpool add data log mirror da2.nop da3`
`# zpool status`

```
pool: data
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        data         ONLINE       0     0     0
          da0        ONLINE       0     0     0
          da1        ONLINE       0     0     0
        logs
          mirror-2   ONLINE       0     0     0
            da2.nop  ONLINE       0     0     0
            da3      ONLINE       0     0     0

errors: No known data errors
```
`# zdb | grep ashift`

```
ashift: 9
            ashift: 9
            ashift: 12
```
`# zpool detach data da2.nop`
`# zpool status`

```
pool: data
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
        logs
          da3       ONLINE       0     0     0

errors: No known data errors
```
`# zdb | grep ashift`

```
ashift: 9
            ashift: 9
            ashift: 12
```
`# gnop destroy /dev/da2.nop`
`# zpool attach data da3 da2`
`# zpool status`

```
pool: data
 state: ONLINE
 scan: resilvered 0 in 0h0m with 0 errors on Fri Feb 15 11:03:08 2013
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
        logs
          mirror-2  ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da2     ONLINE       0     0     0

errors: No known data errors
```
`# zdb | grep ashift`

```
ashift: 9
            ashift: 9
            ashift: 12
```


----------

