# ZFS write performance issues with WD20EARS



## mjrosenb (Jun 25, 2010)

I've seen a couple of other posts about using zfs on WD EARS drives which have the fun
feature of using 4096 byte sectors while reporting 512 byte sectors to the OS.

I tried the trick of using gnop to make a new device that reports itself as having 4096 byte sectors, but the improvements seem to be superficial.
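In case it helps anyone searching later, the trick looks roughly like this (a sketch with four example device names, not my full 22-drive layout):

```shell
#!/bin/sh
# Sketch of the gnop workaround (da0-da3 are example device names).
# gnop creates a pass-through provider that reports 4096-byte sectors,
# so ZFS should pick a 4K allocation size when the pool is created.
for d in da0 da1 da2 da3; do
    gnop create -S 4096 /dev/$d
done

# Build the pool on the .nop providers, not the raw devices:
zpool create store raidz1 /dev/da0.nop /dev/da1.nop /dev/da2.nop /dev/da3.nop
```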

system specs:
22x 2TB WD20EARS attached to a single 3ware 9690SA-4I with a SAS expander in the backplane.
They are configured as 22 separate 'single' units on the controller, and the cache is on.

`# tw_cli /c0 show`

```
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    SINGLE    OK             -       -       -       1862.63   RiW    OFF
```

CPU is a Xeon 3450 @ 2.67 GHz.
8GB ECC DDR3
FreeBSD 8.1-BETA1 amd64
the drives are in a 3x (7-disk raidz) configuration.  I've tried using the last drive as a
hot spare, log device, and cache, all of which have the same performance issues.

```
vm.kmem_size="4G"
vm.kmem_size_max="4G"
vfs.zfs.arc_max="2G"
```

Before gnop, I was getting about 1.5 MB/s write speeds.
Now it looks like I'm getting about 3 MB/s.

I've tested the bandwidth to the drives and found that, using 20 simultaneous dd processes, I can write 15-30 MB/s to each drive.
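The per-drive test was essentially one dd per disk writing straight to the raw devices, something like this (destructive, and da0-da3 here are example names standing in for all 20 drives):

```shell
#!/bin/sh
# One dd per drive, all running in parallel; writes directly to the
# raw devices and destroys their contents. Example device names only.
for d in da0 da1 da2 da3; do
    dd if=/dev/zero of=/dev/$d bs=1m count=1024 &
done
wait    # then compare the throughput each dd reports
```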

I've done tests dd'ing into a file on the ZFS filesystem, copying a file over rsync and NFS,
and just copying a file from a local UFS2 partition.

Looking at the drives with zpool iostat, it looks like writes are occurring sporadically (this is with gnop).

`# zpool iostat 1`

```
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
store        114G  37.8T      1     19   127K  1002K
store        114G  37.8T      0      0      0      0
store        114G  37.8T      0      0      0      0
store        114G  37.8T      2    252  12.0K  12.2M
store        114G  37.8T      0      0      0      0
store        114G  37.8T      1    118  7.97K  10.9M
```

I've also seen posts about writes starving everything else, causing stalls and only intermittent disk activity,
so I attempted to follow the advice for that by setting vfs.zfs.txg.write_limit_override.
I am unsure of the units for that, so I tried a bunch of different values: 256, 262144, and 268435456 (and some others).
I mostly noticed either no change in the speed, or it dropping to under 100 kB/s (262144 did that).
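(For reference, the value of that tunable appears to be in bytes, which would make the three values above 256 B, 256 KB, and 256 MB respectively; e.g. the last one:)

```shell
# If the units are bytes, as they appear to be, 268435456 sets a
# 256 MB limit per transaction group. Needs root; sketch only.
sysctl vfs.zfs.txg.write_limit_override=268435456
```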

I've tried using only 4 disks, and it looked like it was working better, but I did not do thorough testing.
I'm sure there is more useful information that I haven't provided, and I am more than willing to provide anything that I've left out.


----------



## pennello (Jun 26, 2010)

Have you seen this post?

I'd be interested in seeing the results of your more thorough testing on a configuration with a greater number of fewer-drive raidzs.


----------



## User23 (Jun 27, 2010)

I don't really know, but maybe the 3ware 9690SA can't handle that sort of drive with the present firmware version.

At least it is not tested with it:
http://www.3ware.com/products/pdf/Drive_compatibility_list_9690SA_9.5.3codeset_900-0070-02RevK.pdf

And well, it is a "desktop drive"; it is not certified for 24/7 operation. It could still be a good drive, but I don't know if LSI/3ware will care about supporting it.

In the last months I saw some posts in German forums where people reported performance problems with the WD20EARS + 3ware controllers.

In your case I would contact LSI/3ware support asking for a solution. I hope there is one.


----------



## pennello (Jun 27, 2010)

I have a similar setup, and am having similar issues.  However, I think the problem lies with the ZFS configuration, and not with the 3ware / EARS combination.  Here's why.

There are ten 2T EARS drives connected to a 3ware 9650SE-24M8, and 8 of them are configured (with gnop in-between for 4k sectors) in a pool with two four-drive raidzs.

First, I make a memory disk to hold the test data, so we won't be bound by the speed of /dev/random.

`# mdmfs -s 500m md2 /mnt`

Next, I have a script for grabbing 400 megs of random junk.

```
# cat freshen.sh
dd if=/dev/random of=/mnt/a bs=1m count=400
```
Copying straight to the device (even without gnop) is quite zippy.

```
# sh freshen.sh
419430400 bytes transferred in 5.719221 secs (73336979 bytes/sec)
# dd if=/mnt/a of=/dev/da9 bs=1m
419430400 bytes transferred in 3.333524 secs (125821923 bytes/sec)
```
But copying onto the zfs filesystem is much slower.

```
# sh freshen.sh
419430400 bytes transferred in 5.721352 secs (73309667 bytes/sec)
# dd if=/mnt/a of=/tank/tmp/a bs=1m
419430400 bytes transferred in 25.927887 secs (16176806 bytes/sec)
```

So it's gotta be an issue with the ZFS config, right?

mjrosenb, do you have a similar result?  That is, if you copy directly to the device, do your benchmarks perform well?


----------



## sub_mesa (Jun 28, 2010)

zpool status output? I can't figure out how you partitioned your disks.


----------



## pennello (Jun 28, 2010)

sub_mesa said:

> zpool status output?



Here is my config.


```
# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da1.nop  ONLINE       0     0     0
            da2.nop  ONLINE       0     0     0
            da3.nop  ONLINE       0     0     0
            da4.nop  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da5.nop  ONLINE       0     0     0
            da6.nop  ONLINE       0     0     0
            da7.nop  ONLINE       0     0     0
            da8.nop  ONLINE       0     0     0

errors: No known data errors
```


----------



## mjrosenb (Jun 29, 2010)

Yes, writing to a single drive maxed it out at about 90 MB/s, and I was able to write to 20 drives at the same time at 20 MB/s *each*.

I tried creating 5x (4-disk raidz). Initially the performance was good (as with 3x (7-disk raidz)); however, the performance quickly dropped off.

```
  9386557875 100%   25.25MB/s    0:05:54 (xfer#1, to-check=799/801)
  4699592172 100%    7.39MB/s    0:10:06 (xfer#2, to-check=798/801)
  4694425121 100%    5.38MB/s    0:13:51 (xfer#3, to-check=797/801)
  9391526150 100%    4.71MB/s    0:31:43 (xfer#4, to-check=796/801)
```


----------



## pennello (Jun 29, 2010)

As a side note, wouldn't it make little sense, from a speed perspective, to use one of the other drives as a log or cache device?  I thought the idea behind those was that they would be hosted on speedier hardware than the main storage devices.

For the cache, for example, ZFS uses main memory if you don't specify any cache devices.  And for the ZIL, "the intent log is allocated from blocks within the main pool", according to the man page for zpool.


----------



## pennello (Jun 29, 2010)

mjrosenb said:

> and as far as I can tell, it does not make sense to use one of the raid disks as a log or a cache, but I figured it could not hurt to try it.



Ah, agreed.


----------



## mjrosenb (Jun 29, 2010)

gah, why is there no "edit reply" button

`# zpool status`

```
pool: store
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        store         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da0.nop   ONLINE       0     0     0
            da1.nop   ONLINE       0     0     0
            da2.nop   ONLINE       0     0     0
            da3.nop   ONLINE       0     0     0
            da4.nop   ONLINE       0     0     0
            da5.nop   ONLINE       0     0     0
            da6.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da7.nop   ONLINE       0     0     0
            da8.nop   ONLINE       0     0     0
            da9.nop   ONLINE       0     0     0
            da10.nop  ONLINE       0     0     0
            da11.nop  ONLINE       0     0     0
            da12.nop  ONLINE       0     0     0
            da13.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da14.nop  ONLINE       0     0     0
            da15.nop  ONLINE       0     0     0
            da16.nop  ONLINE       0     0     0
            da17.nop  ONLINE       0     0     0
            da18.nop  ONLINE       0     0     0
            da19.nop  ONLINE       0     0     0
            da20.nop  ONLINE       0     0     0

errors: No known data errors
```
and for the smaller vdev raidz:
`# zpool status`

```
pool: store
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        store         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da0.nop   ONLINE       0     0     0
            da1.nop   ONLINE       0     0     0
            da2.nop   ONLINE       0     0     0
            da3.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da4.nop   ONLINE       0     0     0
            da5.nop   ONLINE       0     0     0
            da6.nop   ONLINE       0     0     0
            da7.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da8.nop   ONLINE       0     0     0
            da9.nop   ONLINE       0     0     0
            da10.nop  ONLINE       0     0     0
            da11.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da12.nop  ONLINE       0     0     0
            da13.nop  ONLINE       0     0     0
            da14.nop  ONLINE       0     0     0
            da15.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da16.nop  ONLINE       0     0     0
            da17.nop  ONLINE       0     0     0
            da18.nop  ONLINE       0     0     0
            da19.nop  ONLINE       0     0     0

errors: No known data errors
```

and as far as I can tell, it does not make sense to use one of the raid disks as a log or a cache, but I figured it could not hurt to try it.


----------



## pennello (Jun 29, 2010)

mjrosenb said:

> gah, why is there no "edit reply" button



I think you want a [CODE] block, not a [CMD] block.  The columns get all wonky in a [CMD] block.


----------



## pennello (Jun 29, 2010)

I was able to get decent write speeds (61 MB/s) out of my array, but only by making a bunch of mirror pairs instead of raidzs.

I still haven't given up, though, as I really want that extra space.


----------



## mjrosenb (Jun 29, 2010)

I tried putting each disk at the top level and got speeds somewhere around 6 MB/s.
I'm going to see if OpenSolaris has similar behavior with this setup.


----------



## sub_mesa (Jun 29, 2010)

Check each disk separately. Also check the SMART data for each HDD for UDMA CRC Error Count, which indicates cabling problems.

It's entirely possible for one or more disks to kill the array's performance due to them being duds or having cabling CRC errors.


----------



## phoenix (Jun 29, 2010)

ZFS works wonderfully on 4K disks *IF AND ONLY IF* the harddrive firmware reports a 4K physical sector size.

IOW, ZFS is broken on the WD 4K disks, as they report a physical sector size of 512 B.

This is covered fairly often on the zfs-discuss mailing list.    It applies to ZFS, not the OS it's running on.  IOW, it affects OSol, Sol, FreeBSD, Linux (via FUSE), etc.

Until WD fixes their firmware, you cannot use their 4K disks with ZFS.


----------



## mjrosenb (Jun 29, 2010)

> ZFS works wonderfully on 4K disks IF AND ONLY IF the harddrive firmware reports a 4K physical disk size.


I have read those, and it seems like the recommendation is "use gnop to create a dummy device that is a duplicate of the original device, but reports back a 4096-byte sector size".



> Until Maxtor fixes their firmware, you cannot use their 4K disks with ZFS.


Maxtor? You mean WD? Pretty sure Maxtor was bought out several years ago at this point.


----------



## pennello (Jun 30, 2010)

sub_mesa said:

> Check each disk separately. Also check the SMART data for each HDD for UDMA CRC Error Count, which indicates cabling problems.
> 
> It's entirely possible for one or more disks to kill the array's performance due to them being duds or having cabling CRC errors.



Neat suggestion; thanks.  Although mine were all good in this respect.


----------



## wonslung (Jul 3, 2010)

I think everyone who owns these drives needs to send a letter to WD and let them know you want a firmware upgrade.

This is going to be a problem for quite a while... This is why I use Hitachi, Seagate, and sometimes Samsung drives.

Right now I'm getting the best ZFS performance from the Hitachi 7200 RPM 2TB drives.


----------



## mjrosenb (Jul 9, 2010)

Yeah, being able to upgrade the firmware to report 4K sectors seems like it would be trivial, but gnop and/or other means should be able to circumvent these issues.

Increasing the amount of memory for the ARC and the kernel seems to have increased the write speed that it bottoms out at; however, I am still confused as to why
it decreases over the course of 20GB or so, when this is significantly larger than the amount of memory on the system.


----------



## Epikurean (Jul 10, 2010)

I've got the same problem. Western Digital offers a tool which could solve the problem.
From http://www.amandtec.com:


> In order to solve the misalignment issue, Western Digital is offering two solutions. The first solution for correcting misaligned partitions is specifically geared towards Win 5.x, and that is an option on the drive itself to use an offset. Through the jumpering of pins 7 and 8 on an Advanced Format drive, the drive controller will use a +1 offset, resolving Win 5.xx's insistence on starting the first partition at LBA 63 by actually starting it at LBA 64, an aligned position. This is exactly the kind of crude hack it sounds like since it means the operating system is no longer writing to the sector it thinks its writing to, but it's simple to activate and effective in solving the issue so long as only a single partition is being used. If multiple partitions are being used, then this offset cannot be used as it can negatively impact the later partitions. The offset can also not be removed without repartitioning the drive, as the removal of the offset would break the partition table.
> 
> The second method of resolving misaligned partitions is through the use of Western Digital's WD Align utility, which moves a partition and its data from a misaligned position to an aligned position.



Has anyone already tried this?


----------



## Deleted member 2077 (Jul 10, 2010)

These drives (Advanced Format) work fine in Windows, and it seems the Linux guys have already figured it out too.  Reports say that performance is good in Mac OS X as well.  Not sure why FreeBSD is always way behind the curve.


----------



## mjrosenb (Jul 11, 2010)

> I've got the same problem. Western Digital offers a Tool, which could solve the problem.


I do not believe that ZFS's issue is alignment.  The issue with older Windows is that it likes to write 4K chunks at offsets that are a multiple of 4K from the beginning of the partition.  Since the default offset of the first partition is 63 512-byte sectors, it is not actually aligned to a multiple of 4096 bytes, which destroys the performance.
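The arithmetic makes the misalignment plain:

```shell
#!/bin/sh
# A partition starting at LBA 63 with 512-byte sectors begins at byte
# 32256, which is not a multiple of 4096, so every "4K-aligned" write
# from the OS straddles two physical sectors.
offset=$((63 * 512))
echo "offset=$offset remainder=$((offset % 4096))"
```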

The issue with ZFS is that it attempts to pack data into sector-sized chunks.  This means that as long as the drive reports 512-byte sectors, ZFS will attempt to write 512 bytes at a time, so no matter where you start the partition, as long as ZFS thinks the sector size is 512 bytes, it will make no attempt to align to 4K boundaries.  This is why gnop should theoretically fix the issue: if ZFS thinks the disk has 4K sectors, it should write out everything aligned to 4K.  It is also recommended not to use a partition table at all and let ZFS handle the raw device directly, so aligning the "partition" with the WD tool would most likely do nothing.
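One way to check whether the gnop trick actually took effect (as far as I can tell) is to look at the ashift value ZFS chose for each vdev; ashift=9 means 512-byte allocation units, ashift=12 means 4096:

```shell
# Dump the cached pool configuration and look for the ashift values.
# Needs root and an imported pool; sketch only.
zdb | grep ashift
```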


----------



## wonslung (Jul 12, 2010)

The bottom line is WD just doesn't care about ZFS customers.

If they did, they'd release firmware for these drives.

The correct response is:

do not buy WD for ZFS.

If you're unlucky enough to have already done this, then you need to write them and ask for:

a working firmware
or a refund.

It's unlikely this will work at first, but if enough people voice this concern, it will eventually have an effect.


----------



## pennello (Jul 12, 2010)

I wrote them asking for this.  I also encourage others to do the same.

In the meantime, gnop is working out just fine for me.


----------



## mjrosenb (Jul 12, 2010)

> I wrote them, asking for this. I also encourage others to do the same.


Who did you write to, the customer service department?

Also, how do you have your disks attached to the system?

I am also beginning to wonder if this is an issue with the sum total of raidz, WD*EARS, and my 3ware RAID controller.  E.g., ZFS says "write these 4K chunks to these 20 disks", then the 3ware card itself breaks them up into 512-byte chunks.


----------



## pennello (Jul 12, 2010)

mjrosenb said:

> who did you write to, customer service department?


Yeah, I just emailed support.  I believe I used this link.

"Pablo R." replied the following back to me:


> I apologize for the inconvenience. I will forward your request to our engineer department.
> 
> Please bear in mind that this is not an indication that they will make the firmware update.



Still, I feel like we should all email them in the hopes that our voices will be heard.  It's not like it's hard to write a short email requesting a firmware change.



			
mjrosenb said:

> also, how do you have your disks attached to the system?



They're attached as single disks through my 3ware 9650SE-24M8 with no auto-verify and the storsave profile set to "perform".  On top of that, I have gnop providing 4K-sectored disks, and ZFS uses those transparent 4K providers.  So, all in all, my zpool status looks like this:


```
pool: pool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        pool          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da6.nop   ONLINE       0     0     0
            da9.nop   ONLINE       0     0     0
            da5.nop   ONLINE       0     0     0
            da1.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da2.nop   ONLINE       0     0     0
            da4.nop   ONLINE       0     0     0
            da0.nop   ONLINE       0     0     0
            da7.nop   ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da3.nop   ONLINE       0     0     0
            da10.nop  ONLINE       0     0     0
            da16.nop  ONLINE       0     0     0
            da12.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da8.nop   ONLINE       0     0     0
            da19.nop  ONLINE       0     0     0
            da13.nop  ONLINE       0     0     0
            da17.nop  ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            da15.nop  ONLINE       0     0     0
            da11.nop  ONLINE       0     0     0
            da18.nop  ONLINE       0     0     0
            da14.nop  ONLINE       0     0     0
        cache
          ad8         ONLINE       0     0     0

errors: No known data errors
```

ad8 is a 40 GB SSD.


----------



## mjrosenb (Jul 12, 2010)

Wow, your setup is looking mighty similar to mine.
Is there a reason that you have the drives distributed throughout the zpools like that?

I found that I get an improvement from 8 MB/s to 11 MB/s write speeds by changing from

```
zpool create store raidz1 /dev/da0.nop /dev/da1.nop /dev/da2.nop /dev/da3.nop
```
to

```
zpool create store raidz1 /dev/da0.nop /dev/da4.nop /dev/da8.nop /dev/da12.nop
```

And the same question goes for the ordering of the devices within the raidzs and the ordering of the raidzs.

And what speeds are you getting while writing to your raidz now? Still the 16.1 MB/s you stated earlier?


----------



## pennello (Jul 12, 2010)

mjrosenb said:

> Is there a reason that you have the drives distributed throughout the zpools like that?



Yes, although I'm not sure of its legitimacy.  My reasoning was that I wanted to decorrelate drive failures as much as possible since, for example, two drives failing in the same raidz is much worse than two drives failing in different raidzs.  So I arranged the drive assignments such that no two drives in any raidz are physically adjacent in my case.



			
mjrosenb said:

> and the same question goes for the ordering of the devices within the raidz's and the ordering of the raidz's.



When it comes to the ordering of the vdevs, you may have no control over that, or at least no useful control.  It seems highly plausible to me that ZFS will decide how to stripe blocks independently of how the vdevs were ordered when you created the pool, or how they're ordered when you run zpool status.



			
mjrosenb said:

> and what speeds are you getting while writing to your raidz now? still the 16.1 MB/s you stated earlier?



I removed the cache device, in case it might have gotten in the way of my benchmarks.

I created a 2 GB ramdisk, copied a bunch of random data from /dev/random to it, and tried dd-ing that onto ZFS.  With a bs of 4m, it claimed 150 MB/s write speeds.  With a bs of 1m, it claimed 210 MB/s write speeds.

Doing a sustained rsync (of a huge amount of data, 8T, way more than the RAM I have) from one ZFS filesystem in the pool to another, I got sustained simultaneous rates of 80 MB/s read and write.

I should also note that I've done absolutely no tuning of ZFS.  Stock 8.0 installation.  8GB DDR2 RAM, 3.2 GHz quad-core AMD Phenom II X4 955.


----------



## pennello (Jul 12, 2010)

I can also give the physical port <--> unit mapping from my 3ware configuration, since it's not completely straightforward.


```
# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u1    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u2    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u3    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u4    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u5    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u6    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u7    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u8    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u9    SINGLE    OK             -       -       -       1862.63   RiW    OFF
u10   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u11   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u12   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u13   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u14   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u15   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u16   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u17   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u18   SINGLE    OK             -       -       -       1862.63   RiW    OFF
u19   SINGLE    OK             -       -       -       1862.63   RiW    OFF

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u10  1.82 TB   SATA  0   -            WDC WD20EARS-00MVWB0
p1    OK             u11  1.82 TB   SATA  1   -            WDC WD20EARS-00MVWB0
p2    OK             u12  1.82 TB   SATA  2   -            WDC WD20EARS-00MVWB0
p3    OK             u13  1.82 TB   SATA  3   -            WDC WD20EARS-00MVWB0
p4    OK             u14  1.82 TB   SATA  4   -            WDC WD20EARS-00MVWB0
p5    OK             u0   1.82 TB   SATA  5   -            WDC WD20EARS-00J2GB0
p6    OK             u1   1.82 TB   SATA  6   -            WDC WD20EARS-00J2GB0
p7    OK             u2   1.82 TB   SATA  7   -            WDC WD20EARS-00J2GB0
p8    OK             u15  1.82 TB   SATA  8   -            WDC WD20EARS-00MVWB0
p9    OK             u16  1.82 TB   SATA  9   -            WDC WD20EARS-00MVWB0
p10   OK             u17  1.82 TB   SATA  10  -            WDC WD20EARS-00MVWB0
p11   OK             u18  1.82 TB   SATA  11  -            WDC WD20EARS-00MVWB0
p12   OK             u19  1.82 TB   SATA  12  -            WDC WD20EARS-00MVWB0
p13   OK             u3   1.82 TB   SATA  13  -            WDC WD20EARS-00J2GB0
p14   OK             u4   1.82 TB   SATA  14  -            WDC WD20EARS-00J2GB0
p15   OK             u5   1.82 TB   SATA  15  -            WDC WD20EARS-00J2GB0
p16   OK             u6   1.82 TB   SATA  16  -            WDC WD20EARS-00J2GB0
p17   OK             u7   1.82 TB   SATA  17  -            WDC WD20EARS-00J2GB0
p18   OK             u8   1.82 TB   SATA  18  -            WDC WD20EARS-00J2GB0
p19   OK             u9   1.82 TB   SATA  19  -            WDC WD20EARS-00J2GB0
```


----------



## mjrosenb (Jul 13, 2010)

Also, for whatever reason, my VPorts start at p8 and run up to p29.

```
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u1    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u2    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u3    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u4    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u5    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u6    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u7    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u8    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u9    SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u10   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u11   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u12   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u13   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u14   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u15   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u16   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u17   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u18   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u19   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u20   SINGLE    OK             -       -       -       1862.63   RiW    OFF    
u21   SINGLE    OK             -       -       -       1862.63   RiW    OFF    

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p8    OK             u0   1.82 TB   SATA  -   /c0/e0/slt0  WDC WD20EARS-00S8B1 
p9    OK             u1   1.82 TB   SATA  -   /c0/e0/slt1  WDC WD20EARS-00S8B1 
p10   OK             u2   1.82 TB   SATA  -   /c0/e0/slt2  WDC WD20EARS-00S8B1 
p11   OK             u3   1.82 TB   SATA  -   /c0/e0/slt4  WDC WD20EARS-00J2GB0
p12   OK             u4   1.82 TB   SATA  -   /c0/e0/slt6  WDC WD20EARS-00S8B1 
p13   OK             u5   1.82 TB   SATA  -   /c0/e0/slt7  WDC WD20EARS-00S8B1 
p14   OK             u6   1.82 TB   SATA  -   /c0/e0/slt8  WDC WD20EARS-00S8B1 
p15   OK             u7   1.82 TB   SATA  -   /c0/e0/slt5  WDC WD20EARS-00J2GB0
p16   OK             u8   1.82 TB   SATA  -   /c0/e0/slt9  WDC WD20EARS-00J2GB0
p17   OK             u9   1.82 TB   SATA  -   /c0/e0/slt10 WDC WD20EARS-00J2GB0
p18   OK             u10  1.82 TB   SATA  -   /c0/e0/slt11 WDC WD20EARS-00J2GB0
p19   OK             u11  1.82 TB   SATA  -   /c0/e0/slt12 WDC WD20EARS-00S8B1 
p20   OK             u12  1.82 TB   SATA  -   /c0/e0/slt13 WDC WD20EARS-00S8B1 
p21   OK             u13  1.82 TB   SATA  -   /c0/e0/slt14 WDC WD20EARS-00S8B1 
p22   OK             u14  1.82 TB   SATA  -   /c0/e0/slt15 WDC WD20EARS-00J2GB0
p23   OK             u15  1.82 TB   SATA  -   /c0/e0/slt16 WDC WD20EARS-00J2GB0
p24   OK             u16  1.82 TB   SATA  -   /c0/e0/slt17 WDC WD20EARS-00J2GB0
p25   OK             u17  1.82 TB   SATA  -   /c0/e0/slt18 WDC WD20EARS-00S8B1 
p26   OK             u18  1.82 TB   SATA  -   /c0/e0/slt19 WDC WD20EARS-00S8B1 
p27   OK             u19  1.82 TB   SATA  -   /c0/e0/slt20 WDC WD20EARS-00S8B1 
p28   OK             u20  1.82 TB   SATA  -   /c0/e0/slt22 WDC WD20EARS-00J2GB0
p29   OK             u21  1.82 TB   SATA  -   /c0/e0/slt23 WDC WD20EARS-00J2GB0
```


----------



## JeremyFox (Aug 7, 2010)

In case anyone is interested, someone started a post over in the Western Digital community idea lab about this very thing.  Please head over there and give it a vote if you're interested and have a few moments.

http://community.wdc.com/t5/Other-I...tor-emulated-firmware-for-WD-EARS/idi-p/21347


----------



## danbi (Aug 11, 2010)

Just out of curiosity, have you tried doing the RAID on the 3ware controller itself? That is, not using ZFS at all, or using it on a single exported volume. That way, you could figure out whether the 3ware/WD drive combo is part of the problem.

It is sort of worrying that you can only get up to 20MB/sec writing to a single drive.


----------



## tty23 (Aug 11, 2010)

I have 4 EARS disks running using raidz and the performance is bad as well.
When I run gstat I can see that the reason for the bad performance is that one disk is saturated with write requests and takes a long time to complete them.
This one disk is always the same one, which makes me believe that the real reason is the ZFS intent log, which is written on that disk. IIRC the intent log writes many small blocks to the disk...
Did any of you have a look at gstat as well and see similar issues?


----------



## phoenix (Aug 11, 2010)

The underlying issue with EARS (aka 4K Advanced Format drives) is that the firmware lies to the OS about the size of the physical sectors.  If you query the disk, it tells you it has 512 B logical sectors *AND* 512 B physical sectors.  Thus, all filesystems and OSes on top try to use it like a normal harddrive with 512 B sectors, leading to all kinds of performance and mis-alignment issues.
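You can see the bogus value for yourself; on FreeBSD, diskinfo prints the sector size the drive reports (da0 here is an example device name):

```shell
# The EARS drives answer 512 here even though the media actually
# uses 4096-byte physical sectors. Needs an attached disk; sketch only.
diskinfo -v /dev/da0 | grep sectorsize
```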

There's nothing that can be done until either WD fixes their firmware to report the correct physical sector size, or DES finishes his work to incorporate a workaround for these drives (see the mailing list archives for more details).

IOW, there's no point in beating this horse anymore.  It's been done to death already on the forums, the mailing lists, and all over the Internet.

Avoid EARS (Advanced Format) drives from WD.

4K drives from other manufacturers work correctly, and specify proper values for logical and physical sector size.


----------



## wonslung (Aug 12, 2010)

It's obvious this is all about money.

WD has done a few things recently which make this clear.  They've made WDTLER.EXE stop working on their drives, for one.

They want people running their consumer drives to use them only in Windows or on a desktop, and they want people who are building RAID arrays to buy the more expensive RAID drives.

This is why I only use Hitachi 2TB drives and Seagate and Samsung 1TB drives.


----------



## UNIXgod (Aug 12, 2010)

wonslung said:
			
		

> it's obvious this is all about money.
> 
> WD has done a few things recently which makes this clear.  They've made the WDTLER.EXE stop working on drives, for one.
> 
> ...



I agree with you 100%. I plan never to recommend WD to anyone after the BS I had to deal with from these drives.


----------



## vermaden (Aug 12, 2010)

wonslung said:
			
		

> This is why i only use hitachi 2TB drives and seagate and samsung 1TB drives.



Also watch out for the Seagate 7200.11 series (avoid them).


----------



## jem (Aug 12, 2010)

phoenix said:
			
		

> 4K drives from other manufacturers work correctly, and specify proper values for logical and physical sector size.



Other manufacturers?  Who else is making 4K drives at the moment?


----------



## wonslung (Aug 12, 2010)

vermaden said:
			
		

> Watch out also for seagate 7200.11 series (omit them).





This isn't entirely true.

Only one batch of 7200.11 drives was flawed.  I had a few of them.

Seagate not only released a firmware patch/fix for these drives, but they also extended the warranty of the drives by 3 years.  I've only had one of the 7200.11 drives fail and need to be replaced, and I've had tons of them in working systems which have worked fine since I bought them.

Granted, the 7200.12s are generally a much better drive (and these are the ones I currently use), but the 7200.11s are fine if you get a good deal on them.

The main reason to go with the .12s over the .11s has nothing to do with the firmware issue (which has been solved); it's that they use less power and generally perform better.  But even the .11s will outperform the 4K drives in ZFS-based systems.


----------



## wonslung (Aug 12, 2010)

jem said:
			
		

> Other manufacturers?  Who else is making 4K drives at the moment?







The entire hard drive industry is making 4K drives now.

I'm not sure how many others have hit the market yet, but I know for sure Seagate, Maxtor, Fujitsu, and Hitachi are making them.


----------



## tty23 (Aug 13, 2010)

Dag-Erling Smørgrav did some EARS testing, with ugly results:
http://maycontaintracesofbolts.blogspot.com/2010/08/benchmarking-advanced-format-drives.html

found on planet.freebsdish.org


----------



## vermaden (Aug 13, 2010)

WD20EARS drives use these 'problematic/hidden' 4K sectors, but the WD20EADS uses 512B sectors, so just use the non-problematic version, at least for future purchases.


----------



## nakal (Aug 17, 2010)

The latest EADS drives also have a firmware problem. You have to change some settings from MS-DOS or MS-Windows so the drive does not die after approximately 2 months (the Load_Cycle_Count bug). The ataidle and camcontrol commands cannot switch off power management on these latest drives (the older EADS drives still support it). The new series simply does not support "advanced power management" anymore and takes full control of it by itself.

I recommend avoiding _all_ WD Green drives for a while, until WD decides to consider how operating systems (!= MS-Windows) work.


----------



## vermaden (Aug 17, 2010)

@nakal

That is the reason I sold my WD Green; I thought the EARS were also 'broken' in that way (8 secs idle means power off). I definitely avoid WD Green drives now (while my WD Blue 3.5 and 2.5 drives, and my current WD Passport 2.5 1TB, work OK). Seagate LP drives (5900 RPM) seem a good alternative to the WD Green here; at least I haven't heard any bad input about them.


----------



## aragon (Aug 17, 2010)

wonslung said:
			
		

> This is why i only use hitachi 2TB drives and seagate and samsung 1TB drives.


Why only 1TB for Samsung and Seagate?


----------



## vermaden (Aug 18, 2010)

vermaden said:
			
		

> Seagate LP drives (5900RPM) seems a good alternative for the WD Green here, at least I havent heard any bad input about them.



It seems that the Barracuda LP drives have firmware issues similar to those of the Seagate 7200.11 drives:
http://en.wikipedia.org/wiki/Seagate_Barracuda


----------



## nakal (Aug 18, 2010)

I have 2 of these problematic .11 drives from Seagate. The nice thing is that you get a bootable ISO image that can upgrade the firmware, so you (mostly) don't need to install any weird operating systems to get rid of the problem.

Also, SMART on these 2 drives is totally broken; it shows

```
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       2047
```

on both drives. It looks dangerous, but they are completely OK. There are no read errors at all on the entire surface.

At the moment I prefer Samsung drives without any "green/low-power" technologies. 2 watts more, but at least the power management works the way I want it to. They are a bit noisy while starting and seeking, but they are twice as fast compared to such power-saving drives.


----------



## vermaden (Aug 18, 2010)

nakal said:
			
		

> At the moment I prefer Samsung without any "green/low-power" technologies. 2 Watts more, but at least, the power management works as I want it to work.



The *Samsung F3* ones are fast and low on power at the same time (available sizes 500GB/1TB):
http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-10.html

The only drawback is a slightly worse access time than in the *WD Black/Blue* drives.


----------



## wonslung (Aug 18, 2010)

aragon said:
			
		

> Why only 1TB for Samsung and Seagate?




Because for 2tb consumer drives, hitachi is a better choice.


----------



## vermaden (Aug 23, 2010)

Another 4K drive seems to be Seagate's new Barracuda XT 3TB (7200RPM):
http://www.anandtech.com/show/3858/the-worlds-first-3tb-hdd-seagate-goflex-desk-3tb-review

... it uses internal translation of the logical 512B sectors presented to the OS into physical 4K sectors, _'which ensures Windows XP compatibility'_ as the article states. Time will tell how well that works.


----------



## vermaden (Aug 27, 2010)

*newegg.com* currently has a very nice offer on a *Hitachi* *2TB* drive spinning at 7200 RPM, for only *$90* (*$89.99* to be precise):
http://newegg.com/Product/Product.aspx?Item=N82E16822145369

More information here:
http://fudzilla.com/home/news/latest/newegg-selling-hitachi-deskstar-2tb-7200rpm-drive-for-8999-shipped


----------



## wonslung (Sep 3, 2010)

I have 20 of these drives.  They are fantastic.


----------



## sub_mesa (Sep 11, 2010)

I'm not sure how you can recommend a 5-platter MONSTER drive when clean 3-platter drives have now reached the same 2TB capacity.

So the newest drives will be:
3-platter WD EARS (4K sectors)
3-platter Samsung F4 EcoGreen (4K sectors)

That saves a lot of power and should be much more reliable as well. It also makes the 5400rpm drives go faster than a lot of lower-density 7200rpm drives. With ZFS there's no real reason to go 7200rpm anymore, I think. 5400rpm and sequential I/O is what HDDs do well; for random access you use an SSD configured as L2ARC, so your HDDs won't have to seek. 5400rpm or 7200rpm doesn't matter; HDDs suck like floppy disks when they have to seek, so let's prevent that and use them for sequential I/O as much as possible.

The 4K sector issue is annoying though; I think the recordsize tuning or a different number of disks in a vdev could solve any issues, as posted before in this thread.


----------



## wonslung (Sep 13, 2010)

sub_mesa said:
			
		

> I' m not sure how you can recommend a 5-platter MONSTER drive when clean 3-platter drives now reached the same 2TB capacity.
> 
> So the newest drives will be:
> 3-platter WD EARS (4K sectors)
> ...




I can recommend them because they work, and they are the best 2TB drives out right now.

I agree it would be nice to have a 7200 RPM 3-platter Hitachi or Seagate drive, but it doesn't change the fact that if you are building large storage arrays with cheap commodity parts, the Hitachi drive is the only choice which makes sense at 2TB.

And as far as power usage goes, they don't use any more than any of the other 7200 RPM drives I've used.

In fact, among 7200 RPM 2TB drives, they use less than the WD drives and are about even with the Seagate RAID drives (which cost about 70 dollars more).

And to say there is no reason to use 7200 RPM drives isn't backed up by the mountains of evidence or the loads of anecdotal reports online.

The fact of the matter is, if you were a frequent reader of the zfs mailing list, you'd see plenty of reports of "green" drives (5400-5900 RPM) being terrible for ZFS; generally 7200 RPM drives are recommended.


----------



## vermaden (Sep 13, 2010)

wonslung said:
			
		

> the hitachi drive is the only choice which makes sense at 2tb
> 
> (...)
> 
> The fact of the matter is, if you were a frequent reader of the zfs mailing list, you'd see plenty of reports of "green" drives (54-5900rpm) being terrible for ZFS, and generally 7200 rpm drives are recommended.



I just bought 2 x Seagate Barracuda LP 5900RPM 2TB (they use 4 x 500GB platters); I will share how they perform (compared to the Samsung F3 1TB drives that I currently have).

They should not bring any troubles, as they have 512B sectors.


----------



## aragon (Sep 13, 2010)

There are a few people experimenting with ZFS, RAID, and these new low-power hard drives.  I'll probably join the fray soon.  Is there a standard, useful file system benchmark which we can all use to compare?


----------



## wonslung (Sep 16, 2010)

Don't get me wrong, ZFS will function fine on lower RPM drives provided the firmware doesn't lie, but I'm just one of the people who believe there is not much gain in going under 7200 RPM (the 5400-5900 RPM drives' power difference isn't enough to warrant the loss in performance for me).

If you are doing write once / read many with mostly sequential access, the 5400-5900 RPM drives will probably be fine, especially in mirrored vdevs.

But when you start talking about raidz and raidz2 you're going to notice a huge issue, especially if you have ANY random writes/reads.

The problem comes in because of how ZFS and raidz work, combined with the low rotational speed.

Remember, raidz writes a single block across all drives, so whenever you have random I/O it has to sync those blocks across ALL drives in the vdev.

Having multiple vdevs can speed this up some, but basically a raidz vdev is as slow as its slowest drive.  This is why 5400 RPM drives are just so bad for raidz and raidz2.


----------



## sub_mesa (Sep 16, 2010)

wonslung, I disagree with almost everything you said. I also follow several mailing lists, so feel free to refer to any concrete messages to make your point.

My point is:
You recommend a few-generations-old HDD that was the first 2TB iteration; the monster disk to avoid: 5 massive platters of 400GB each. Twice the power consumption of Green drives, while being slower for sequential workloads than the newest-generation Green drives, which now have 666GB platters. The new Samsung F4EG pulls around 140MB/s, which is exceptional for a 5400rpm disk. And you could also argue that fewer mechanical parts and less friction / heat generation generally improve the reliability of a HDD as well.

The whole issue of '5400rpm being slow on ZFS' is not the meager ~25% difference in seek times and random I/O IOps, but rather that the WD EARS drives use 4K sectors with 512-byte emulation. So there is no reason to avoid 5400rpm; newer HDDs are getting this too. The Samsung F4 7200rpm series, for example, also gets 4K sectors, and thus would have the same issues in RAID-Z.

I believe the issue here is that the 128KiB recordsize is being spread over all data disk members in the RAID-Z (minus parity disks). Thus 4 disks in RAID-Z would give 128 / (4 - 1) = ~43KiB per disk, not aligned with 4K, while 3 disks would give 128 / (3 - 1) = 64KiB, which is aligned. I still have to test this theory, and I hope I can suggest some fixes. Also, some people assume it's the 4K sectors when they get low ZFS performance, while I have personally inspected some systems which actually had memory starvation and needed kmem tuning with only 2 or 4GB RAM.
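That division can be sanity-checked with plain shell arithmetic (nothing ZFS-specific here; the vdev widths are just illustrative examples):

```shell
# Per-disk stripe size for the default 128 KiB recordsize, and whether
# it lands on a 4 KiB boundary.  $1 = disks in vdev, $2 = parity disks.
record=131072
check() {
    stripe=$((record / ($1 - $2)))
    if [ $((stripe % 4096)) -eq 0 ]; then
        echo "$1-disk raidz$2: ${stripe}B aligned"
    else
        echo "$1-disk raidz$2: ~${stripe}B NOT aligned"
    fi
}
check 4 1    # 128K over 3 data disks -> ~43690B, misaligned
check 5 1    # 128K over 4 data disks -> 32768B, aligned
check 6 2    # 128K over 4 data disks -> 32768B, aligned
```

On 512B-sector drives the odd remainder is harmless; on a 4K drive behind 512B emulation, a misaligned per-disk stripe forces read-modify-write cycles.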

Once i've completed my tests i'll post my findings in here.
Cheers.


----------



## mjrosenb (Sep 16, 2010)

I don't know what the standard usage case for zfs is, but I'm serving files over nfs, and my
gigabit ethernet card seems to reliably be the bottleneck when serving files.


----------



## danbi (Sep 16, 2010)

There is some big confusion being spread here...

There is nothing bad about 5400 rpm as such. Recording density has increased enough recently that a 5400 rpm drive can sustain 150 MBps read/write, while a 10-15,000 rpm drive can sustain, say, 300 MBps.
However, this only applies to sequential operations, where the drive can prefetch data into its cache (data is read from the media at much higher speeds).

The big issue is the 'green' drives. These drives save energy by... not using it. How does a drive not use energy? It can do this in many ways. Spinning the platters at lower speed is only one of them. Another is to use slower, less power-hungry electronics (SAS drives have more complex electronic assemblies, and the same drive in its SAS version consumes more than the SATA version; of course, the SATA version's performance is worse). Another very big power drain in a drive is head assembly movement. It is simple: if you want short reposition times, you have to spend more energy. One significant way the 'green' drives save energy is by moving the head assembly more slowly, less aggressively, so they have abysmal random seek times. There is almost nothing else a drive does. A 'green' drive could also shut off parts (motors, chips), but this hurts its ability to respond quickly to requests.

In a single-user system, or in systems with a predictable sequential load, any drive will perform well. In a multi-tasking system, where the drive is asked for data scattered all over, things are much different. I still have some old Cheetah drives here. Compared to a green drive, they are better for random loads, but cannot compete on sequential tasks with their 'poor' 30 MBps.

So, in summary: while improvements in recording density have diminished the difference between slower- and faster-spinning drives for sequential loads, the higher-end drives spend power on tasks more critical for performance, such as seeks and faster electronics. Recently it seems it is the electronics that is the limiting factor for drive performance. When we talk about 'slow' drives, we usually mean the random access times and multitasking performance of the drive.


----------



## aragon (Sep 17, 2010)

sub_mesa said:
			
		

> Twice the power consumption of Green drives while being slower for sequential workloads than the newest generation Green drives, which now have 666GB platters.


FWIW, higher density media improves random seek times too.


----------



## mjrosenb (Sep 17, 2010)

> FWIW, higher density media improves random seek times too.


I do not believe this is true.  Yes, the distance between two tracks is smaller, so if you only need to shift N tracks it will take less time, and with more bits in each track you will probably need to seek less often on some workloads.  But this still does not change the distance (and the amount of time it takes) that the head needs to move between two random tracks.


----------



## noz (Oct 14, 2010)

I've read through the thread and apparently the EARS and even some of the EADS drives are problematic.  Are there any drives by WD that are actually GOOD for ZFS (green/blue/black or model number)?  Or, is the only solution to use non-WD 4K drives?

I'm eyeing the Samsung Spinpoint F4: http://www.newegg.com/Product/Product.aspx?Item=N82E16822152245


----------



## vermaden (Oct 15, 2010)

I have bought 2 x Seagate LP 5900RPM 2TB, put ZFS mirror on top of them, so far so good ;p

PS: They are 512B/sector drives; at least they do not include any 4K emulation.


----------



## oliverh (Oct 15, 2010)

vermaden said:
			
		

> I have bought 2 x Seagate LP 5900RPM 2TB, put ZFS mirror on top of them, so far so good ;p
> 
> PS: They are 512/sector drives, they do not include any 4k/emulations at least.



What's your usual "workload" on those?


----------



## vermaden (Oct 15, 2010)

oliverh said:
			
		

> What's your usual "workload" on those?



I would not be lying in saying it's close to zero; it's a home (not only file) server, and about 1.8TB of storage space is enough for me. I have done several benchmarks with iozone/dd and it seems to work OK under load, and I haven't faced any 'sleep issues' with these drives, the way WD drives like to sleep after 8 seconds or so.

It's still 'under development' hardware, so I may run some tests. It's also not CPU-limited, since it uses a motherboard with the mobile Intel GM965 chipset and a T8100 CPU: http://www.msi.com/index.php?func=downloaddetail&type=bios&maincat_no=388&prod_no=1267.

The FreeBSD base system is on an 8GB Kingston 133x CompactFlash card.

It currently has 1GB of RAM, but after I move all my data to it, I will transfer in 4GB of RAM after selling the older server parts.


----------



## wonslung (Oct 15, 2010)

sub_mesa said:
			
		

> wonslung, i disagree almost with everything you said. I also do follow several mailinglists, so feel free to refer to any concrete messages to make your point.
> 
> My point is:
> You recommend a few generations old HDD that was the first 2TB iteration; the monster disk to avoid; 5 massive platters of 400GB each. Twice the power consumption of Green drives while being slower for sequential workloads than the newest generation Green drives, which now have 666GB platters. The new Samsung F4EG pulls around 140MB/s which is exceptional for a 5400rpm disk. And you could also argue that less mechanical parts and at less friction / heat generation generally improves reliability of a HDD as well.
> ...



You can disagree all you want.  I'm well aware of the 4K issue.  It is the number one reason those drives aren't good for ZFS (and more accurately, raidz1/2/3).

Raidz uses a variable block size, which is one of the main reasons the 4K drives suffer.  I'm on all the same mailing lists, but I have also tested most of the major drives.  5400 RPM drives are CONSIDERABLY slower in raidz configurations than 7200 RPM drives, due to the same IOPS issues that make raidz a poor choice for random I/O in the first place (it all goes back to how raidz works).

Now, for sequential access this is much less of a problem, but the simple fact is that right now the best 2TB drive for ZFS is the Hitachi 2TB, hands down, regardless of what you seem to think.

I do believe this will change, and it could change overnight if WD would release a firmware that wasn't flawed.


----------



## wonslung (Oct 15, 2010)

noz said:
			
		

> I've read through the thread and apparently the EARS and even some of the EADS drives are problematic.  Are there any drives by WD that are actually GOOD for ZFS (green/blue/black or model number)?  Or, is the only solution to use non-WD 4K drives?
> 
> I'm eyeing the Samsung Spinpoint F4: http://www.newegg.com/Product/Product.aspx?Item=N82E16822152245





WD makes some older non-4K drives which are fine, and some more expensive RAID drives which also work well.  Right now the best drives for ZFS at 2TB are the Hitachi drives.  The best 1TB drives from my testing are a tie between the Samsung Spinpoint F3s and the Seagate 7200.12s.

Currently we aren't using any 1.5TB drives, but in the past we tried several 5400 and 5900 RPM drives and found them very slow for raidz arrays (not too bad for mirrored arrays with several vdevs).  Though if you add a good SSD SLOG device, it can help dramatically.


----------



## wonslung (Oct 15, 2010)

danbi said:
			
		

> There is some big confusion spread here....
> 
> There is nothing bad with the 5400 rpm speed as such. Recording density has increased recently sufficiently, that a 5400 rps drive is able to sustain 150 MBps read/write, while a 10-15,000 rpm drive is able to sustain say 300 MBps read/write.
> However, this only applies to sequential operations, where the drive can prefetch data in the cache (data is read form the media at much higher speeds).
> ...






EXACTLY.  Too many people here don't understand that the so-called "green" drives aren't green at all in RAID arrays.  They sometimes actually use MORE energy due to poor seek times (they have to spin longer and more often to read the same amount of data).  The fact of the matter is, they work well enough for what they are designed for, but in RAID arrays they tend not to offer any considerable power saving.  If you REALLY want to save power, use 2.5-inch drives.

Let me be clear: the 5400 RPM drives aren't bad drives, they just aren't great for raidz.  If you are building mirrored vdevs, or just don't care about random I/O at all, then go ahead and use them, but don't use them thinking they are going to save you a lot of energy in a ZFS raidz array... One of the biggest problems with them is that they are set to "die" after 1-2 seconds of idle, so if you use them in RAID arrays they will wear out FAST AS HELL due to all the power-up/power-down cycles.

But hey, if you really don't want to take my word for it, just ask someone else who builds RAID systems on a regular basis.  I don't know ANYONE building RAID systems who recommends using green drives.


----------



## aragon (Oct 15, 2010)

wonslung said:
			
		

> One of the biggest problems with them is they are set to "die" after 1-2 seconds of use


Are you sure this holds true for all green drives though?  As vermaden reports, Seagate LP drives (for one) don't appear to be sleeping...


----------



## noz (Oct 16, 2010)

wonslung said:
			
		

> There are some older, non 4k drives which wd makes which are fine, and some more expensive raid drives which wd makes which work well as well.  Right now the best drives for ZFS in a 2TB size are the hitachi drives.  The best drives in 1tb from my testing is a tie between the samsung spinpoint f3's and the seagate 7200.12's.



I actually bought two of those F4's I mentioned to replace the green EARS drives I'm using at the moment.  I'll know in a few days whether or not they're good.


----------



## vermaden (Oct 16, 2010)

Here is a good review of these 2TB Samsung F4s:
http://www.silentpcreview.com/samsung-f4-seagate-xt-2tb

In short: very low power consumption and noise, with 'typical 5400 RPM' performance.


----------



## aragon (Oct 16, 2010)

noz said:
			
		

> I actually bought two of those F4's I mentioned to replace the green EARS drives I'm using at the moment.  I'll know in a few days whether or not they're good.


Looking forward to this. 




			
				vermaden said:
			
		

> In short, very low power consumption and noise, with 'typical 5400 RPM' performance.


I also noticed this remark:



			
				silentpcreview said:
			
		

> The F4 seems to be the just the ticket for users who want quiet, high efficiency drives but are paranoid about the frequent head-parking endemic to the Caviar Greens.


----------



## vermaden (Oct 17, 2010)

It seems that this little patch can 'fix' the issues with 4K WD Green drives:
http://lists.freebsd.org/pipermail/freebsd-fs/2010-October/009706.html

```
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
-*ashift = highbit(MAX(pp->sectorsize, SPA_MINBLOCKSIZE)) - 1;
+*ashift = highbit(MAX(MAX(4096, pp->sectorsize), SPA_MINBLOCKSIZE)) - 1;
```



> I'm using this for 3 months with 20 2TB 4kb sector WDC disks (in 2
> raidz2 arrays of 10) without any issues. Writes go at 300MB/s.


----------



## Epikurean (Oct 18, 2010)

I tried the suggested patch, but unfortunately it killed my ZFS Pool:


```
pool: tank
state: UNAVAIL
scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	tank        UNAVAIL      0     0     0  insufficient replicas
	  raidz1    UNAVAIL      0     0     0  corrupted data
	    ad6     ONLINE       0     0     0
	    ad10    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0
```


----------



## vermaden (Oct 18, 2010)

Maybe it works only for NEW pools.


----------



## Epikurean (Oct 19, 2010)

I too believe that the patch only works for new pools. The problem is that even after recompiling the kernel without the patch, the pool has not returned.

I thought it would be best to leave a comment like this in the thread: I lost ~3.5TB of data. Fortunately I lost only nonessential things: all the important stuff was on a backup.


----------



## palmboy5 (Oct 20, 2010)

What if you first used gnop (one at a time, allowing ZFS to recover each time) on each drive, and then tried the patch?

EDIT:
Also has anyone done performance comparisons for the patch yet?


----------



## Epikurean (Oct 21, 2010)

After following Palmboy's suggestion, here is what I did

1. Created a new ZFS Pool
2. gnop create -S 4096 on each drive at the same time
3. copied some data on the pool
4. compiled and installed a new kernel with the patch
5. reboot

As expected, the *.nop drives are "lost" after a reboot, BUT the ZFS Pool is in perfect shape!

The only thing that bugs me is that there was no message whatsoever indicating the use of the .nop drives in the ZFS pool: no degraded state, no indication that the *.nop drives are in use when entering "zpool status" (except that when I wanted to replace my adX drive with the adX.nop drive, it didn't work: ZFS told me the .nop drive was already in use).


----------



## wonslung (Oct 21, 2010)

Epikurean said:
			
		

> I tried the suggested patch, but unfortunately it killed my ZFS Pool:
> 
> 
> ```
> ...



It won't work because of the variable block size ZFS uses for RAIDZ.

Until firmware updates come out, avoid those drives for raidz.


----------



## palmboy5 (Oct 21, 2010)

wonslung said:
			
		

> it won't work because of the variable block size ZFS uses for RAIDZ.
> 
> until firmware updates come out, aviod those drives for raidz


Believe me, I would have avoided the 4K drives if I knew then what I know now, but most of us in this thread have already invested in 4K drives and are stuck with them. As such, we are trying to find workarounds to make these 4K drives behave adequately. I see that you repeatedly slam these 4K drives but otherwise are not offering any real help.


----------



## vermaden (Oct 21, 2010)

palmboy5 said:
			
		

> Believe me I would have avoided the 4K drives if I knew then what I know now, but most of us in this thread already invested in 4K drives and are stuck with them.



Why not just SELL them and buy good ones?


----------



## palmboy5 (Oct 21, 2010)

To whom, and at what monetary loss? They would have to sell for less than the purchase price, AND there is the cost of shipping them. Not practical.


----------



## vermaden (Oct 21, 2010)

palmboy5 said:
			
		

> To who, and with what monetary loss? It would have to sell for less than the purchase price AND cost more in shipping it. Not practical.



Have you ever heard of eBay, maybe?

You will probably lose a little money (used vs. new price), but they are under warranty, so there is not much difference in price; shipping is paid by the buyer, so it costs you nothing to ship.

It seems it's not that big a problem if you want to stay with them (along with the bundled PITA).


----------



## phoenix (Oct 21, 2010)

vermaden said:
			
		

> It seems that this little patch can 'fix' issues with 4k WD Green drives:
> http://lists.freebsd.org/pipermail/freebsd-fs/2010-October/009706.html
> 
> 
> ...



Is this patch along the same lines as this one:
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

They both deal with ashift, to set the minimum recordsize for the pool to 4 KB, but they are done in two very different places in the code.

(I've posted a reply to that message to find out.)


----------



## noz (Oct 21, 2010)

phoenix said:
			
		

> Is this patch along the same lines as this one:
> http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
> 
> They both deal with ashift, to set the minimum recordsize for the pool to 4 KB, but they are done in two very different places in the code.
> ...




It looks like it's doing the same thing.  The patch from the freebsd mailing list alters the calculation of ashift, while the solarismen.de solution bypasses the calculation and directly sets ashift to the correct value.

Assuming 4096 is larger than pp->sectorsize and SPA_MINBLOCKSIZE, and that highbit() returns the position of the highbit in an int, the freebsd patch also sets ashift to 12.
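The arithmetic can be mimicked in plain shell (this is only an illustration of what highbit() computes, not the kernel code itself):

```shell
# highbit(x): 1-based position of the highest set bit, so for powers
# of two, ashift = highbit(blocksize) - 1 = log2(blocksize)
highbit() {
    x=$1
    bit=0
    while [ "$x" -gt 0 ]; do
        x=$((x >> 1))
        bit=$((bit + 1))
    done
    echo "$bit"
}
echo $(( $(highbit 4096) - 1 ))   # ashift 12 -> 4096-byte minimum blocks
echo $(( $(highbit 512) - 1 ))    # ashift 9  -> plain 512-byte sectors
```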

Oh, I got my Samsung Spinpoint F4 drives.  They have 4K sectors but use 512B emulation.  There doesn't seem to be a way to turn the emulation off.  My ashift is 9.  However, I did see an improvement over my EARS drives, possibly due to the larger platters.

`$ dd if=/dev/random of=./testfile bs=1m count=500`
Running the above is about 10 seconds quicker, at 19.446565 secs (26960443 bytes/sec).  It's still terrible, but it's better than before and I now have 2TB.  :B

I think I'll wait for an official fix before I try to recompile stuff on my own.


----------



## wonslung (Oct 23, 2010)

I'd sell them at a loss before I'd use them with raidz.


----------



## sub_mesa (Oct 24, 2010)

Instead, you may want to check a place with valuable information regarding Samsung F4 / WD EARS and FreeBSD + ZFS; they can perform quite nicely:

http://hardforum.com/showthread.php?t=1546137

More likely you just have some ZFS tuning to do. Feel free to use the benchmark feature described in the thread above, so you can test your own disks and get the same kind of benchmark charts as posted there.


----------



## tty23 (Nov 7, 2010)

> After following Palmboy's suggestion, here is what I did
> 
> 1. Created a new ZFS Pool
> 2. gnop create -S 4096 on each drive at the same time
> ...




I did the same, and it works perfectly. 

I hope there won't be any problems later on...


----------



## raab (Nov 9, 2010)

Does anyone have any before/after performance stats after applying this patch?

Having just bought 6 WD20EARS and then come across this issue, I want to know if it's worth applying the patch, or if I should just sell them and get non-4K drives.


----------



## vermaden (Nov 9, 2010)

raab said:
			
		

> Does anyone have any before/after performance stats after applying this patch?
> 
> Having just bought 6 WD20EARS then coming across this issue I want to know if its worth applying the patch or just selling them and getting non 4k drives



I havent tried, but You may also check method I found on [H]ard|Forum,
to align pool blocks with appreciate raid level:


```
disks  type    recordsize / (disks - parity disks)  sector  status
3      raidz1  128KiB / 2                           64KiB   good
4      raidz1  128KiB / 3                           43KiB   BAD
4      raidz2  128KiB / 2                           64KiB   good
5      raidz1  128KiB / 4                           32KiB   good
6      raidz2  128KiB / 4                           32KiB   good
9      raidz1  128KiB / 8                           16KiB   good
10     raidz2  128KiB / 8                           16KiB   good
```
[H]ard|Forum --> http://hardforum.com/
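The good/BAD column follows from simple arithmetic: the default 128KiB record is striped across the data disks, and the result should be a whole multiple of the drive's 4KiB physical sector. A minimal sh sketch of that check (the disk counts are the ones from the table; nothing here is measured):

```shell
#!/bin/sh
# For each "disks parity" combination, check whether the default 128KiB
# ZFS record splits across the data disks into a whole number of 4KiB sectors.
for cfg in "3 1" "4 1" "4 2" "5 1" "6 2" "9 1" "10 2"; do
    set -- $cfg
    data=$(($1 - $2))   # number of data disks = total disks minus parity disks
    if [ $(( (128 * 1024) % (data * 4096) )) -eq 0 ]; then
        echo "$1-disk raidz$2: good"
    else
        echo "$1-disk raidz$2: BAD"
    fi
done
```

In short, the "good" configurations are exactly those where the data-disk count is a power of two.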


----------



## raab (Nov 9, 2010)

Yeah, I'll have 8 in total; however, I only ordered 6 to ease the pain between pay cycles, although I'm hesitant to get an additional two, what with the 512 byte emulation issues.

Which thread in particular on hardforum were you referring to?


----------



## palmboy5 (Nov 9, 2010)

sub.mesa posted that table/chart in many threads, including, I believe, this one on one of the previous pages. He says it is just a theory he has; it isn't actually proven yet.


----------



## vermaden (Nov 9, 2010)

Here are results of that theory in practice:
http://hardforum.com/showpost.php?p=1036276885&postcount=26


----------



## tty23 (Nov 9, 2010)

phoenix said:
			
		

> Is this patch along the same lines as this one:
> http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
> 
> They both deal with ashift, to set the minimum recordsize for the pool to 4 KB, but they are done in two very different places in the code.
> ...



Those patches are not needed. It seems ZFS figures out the sector size of the zpool devices by looking at the sector size of the "real" devices.

Now if you create your zpool on top of gnop devices emulating 4k sectors, your zpool will end up using 4k sectors (you can verify this by running zdb; it will display ashift=12, for 2^12 = 4k sectors). Then you can reboot, which removes the gnop 4k sector emulation. ZFS won't even notice, as it identifies the disks by GUID and not by device name.

Just do this:

```
gnop create -S 4096 adaX
```
for each device.
Create a zpool on top of the adaX.nop devices.
Check with zdb that the ashift value is correct:

```
zdb <poolname>
```
Reboot.
Recheck the ashift value: it's still 12, the zpool is fine, and the nop devices are gone.

And you have a zpool with a 4k sector size without having to patch the source.


----------



## palmboy5 (Nov 9, 2010)

I see in the zdb output that ashift exists at the level above the drives, so ashift doesn't exist for each drive. Does this mean that one would only need to gnop create -S 4096 one of the drives in order to force ZFS to do 4K on all?


----------



## tty23 (Nov 9, 2010)

> Does this mean that one would only need to gnop create -S 4096 one of the drives in order to force ZFS to do 4K on all?



Good question; it could be. As I've already set up my drives and am using them again, maybe someone else wants to give it a shot?

Or even better, I guess this behavior should be described somewhere in the zfs docs.

Perhaps I will have a look on the weekend... but maybe someone wants to try it..?


----------



## tty23 (Nov 9, 2010)

This thread over at Ars Technica is quite interesting, too. They also figured out this issue, and there is some useful information not mentioned here before:
http://arstechnica.com/civis/viewtopic.php?f=11&t=37779&start=1200

Especially this post, where sub.mesa explains why he thinks raidz performance is that bad with the WD EARS drives:
http://arstechnica.com/civis/viewtopic.php?p=20797605&sid=d09ee0bd397ffd18c0dbaac4ba2e0678#p20797605


----------



## usdmatt (Nov 9, 2010)

I was interested in finding this out the other day, when I learnt that ashift appears to be stored at the vdev level. I wondered what would happen if you mixed drives, or used a new 4k native device to replace a drive in a 512b vdev.

Here's what happens when you do it with mdX devices:


```
files-backup# mdconfig -a -t malloc -s 100M -S 512
md2
files-backup# mdconfig -a -t malloc -s 100M -S 4096
md3
files-backup#
files-backup#
files-backup# zpool create test mirror md2 md3 # <- 512b disk is specified first
files-backup# zdb |grep 'ashift'
                ashift=12
files-backup# zpool destroy test
files-backup#
files-backup# zpool create test md2
files-backup# zdb | grep 'ashift'
                ashift=9
files-backup# zpool attach test md2 md3
cannot attach md3 to md2: devices have different sector alignment
files-backup#
```

It appears ZFS uses the largest block size among the disks you are adding to the vdev, so in theory you could gnop just one of them.

Also, the ZFS designers have clearly thought about all this (although it's not well documented), and you can't add a 4k native disk to a 512b vdev (to be expected). Not much of an issue at the moment, and maybe it never will be, but it means that in the future you *could* (unless there's some workaround I'm not aware of) be in trouble if you need to replace a failed disk in a 512b vdev and can only get hold of 4k native disks.


----------



## sub_mesa (Nov 9, 2010)

If the .nop procedure only has to be performed once, this would simplify setup, since you won't need the .nop providers anymore after a reboot, correct? This makes it possible to automate 4K installations in my ZFSguru distribution with ease. Update: I've now implemented this feature on the Pools->Create page, available in ZFSguru version 0.1.7-preview2c; update via System->Update.

@usdmatt
perhaps a GEOM class can be designed that emulates lower sector sizes. Right now you can go up using GNOP or GELI or something similar, but you can't go lower; that's usually a job for the filesystem to handle. But this may be an interesting small project for anyone interested. Something like:

`gsect -S 512 /dev/my4Ksectdrive`
and you would get:

```
/dev/my4Ksectdrive  (4K sectorsize)
/dev/my4Ksectdrive.sect (512B sectorsize)
```


----------



## aragon (Nov 10, 2010)

sub_mesa said:
			
		

> If the .nop procedure would only have to be performed once, this would simplify setup since you won't need the .nop providers anymore upon reboot, correct?


That would make these lying 4k drives much less discouraging to purchase.  Can anyone confirm if this works in practice?


----------



## palmboy5 (Nov 10, 2010)

Never mind whether it works in practice or not; does it actually make performance acceptable? The benchmark results from sub.mesa show negligible improvement at best. I'm starting to not blame ZFS or the drives at all for my situation. Even the UFS OS drive has poor performance through Samba (which is what I need).


----------



## sub_mesa (Nov 10, 2010)

I'd say the performance of 4K disks is very acceptable, especially considering the workloads these disks are most suitable for (sequential I/O), and they develop unreadable sectors less rapidly than 512-byte-sector disks with their smaller per-sector ECC fields.

It would make sense to avoid some disk configurations when using 4K sector disks. A 5-disk RAID-Z will be good, but a 6-disk RAID-Z will be bad. I posted a full list of these combinations somewhere; if you can't find it I'll write it again.

For Samba performance you would want ASYNC I/O enabled when compiling the samba port, as well as a client capable of ASYNC I/O; Windows 7 clients for example, and I think Vista as well. XP is old, though, and would indeed see lower performance. Tuning Samba may be worthwhile, but it appears to be less necessary on FreeBSD 8+, thanks to automatic TCP buffer size tuning.
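For reference, the AIO thresholds sub.mesa mentions are set in smb.conf; a minimal sketch, assuming Samba was built with AIO support (the 16384-byte threshold here is an illustrative value, not one recommended in this thread):

```
# smb.conf fragment (typically /usr/local/etc/smb.conf on FreeBSD)
[global]
    # requests larger than this many bytes are handed to the AIO engine
    aio read size = 16384
    aio write size = 16384
```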


----------



## tty23 (Nov 11, 2010)

> I posted a full list of these combinations somewhere


I already posted that a few posts ago, but here it is again:
http://arstechnica.com/civis/viewtopic.php?p=20797605&sid=d09ee0bd397ffd18c0dbaac4ba2e0678#p20797605


----------



## bthomson (Nov 14, 2010)

In case anyone is curious about the performance of mixed sector-size pools, I replaced two dead disks in a legacy 6-drive 512b raidz2 pool with two 4k WD greens (512b emulation mode).

Looking at gstat it appears sometimes write cycles end up aligned and sometimes not. When aligned I get better than 90MB/s. When misaligned less than 10MB/s.

On average (say, writing 30GB), write speed seems to average out to 30MB/s which is better than some results earlier in this thread with all drives running in emulated mode. So perhaps the more emulated drives in the pool, the higher the chances of misalignment on a given write cycle and the lower the performance.


----------



## palmboy5 (Nov 17, 2010)

So I'm about to redo my array, since it's currently 2x WD20EADS and 2x WD20EARS with 512 byte sectors, and apparently I wouldn't be able to add an actual 4K (or gnop'ed) drive to the current array. Right now I know to *gnop create -S 4096* at least one of the drives before doing *zpool create*, but is there something else that I'm missing? Will there be an alignment issue? I'm looking for things that can't be done after the array is created...

Thanks!


----------



## bthomson (Nov 18, 2010)

palmboy5 said:
			
		

> I see in the zdb output that ashift exists in the level above the drives, so ashift doesn't exist for each drive. Does this mean that one would only need to gnop create -S 4096 one of the drives in order to force ZFS to do 4K on all?



I only made one 4k gnop device and after a reboot my pool still has ashift=12. So unless there is some parameter other than ashift, it works.


----------



## palmboy5 (Nov 21, 2010)

I can't seem to glabel the .nop drive

```
[root@brisbane-1 /dev]# gnop create -S 4096 /dev/ada0
[root@brisbane-1 /dev]# glabel label wd20ears01 /dev/ada0.nop
glabel: Can't store metadata on /dev/ada0.nop: Invalid argument.
```

Any help is appreciated! :|


----------



## noz (Dec 28, 2010)

Sorry to bump an old thread, but with the upcoming release of 8.2 I'd like to ask if anyone knows whether or not it includes fixes for the 4K sector problem.  I took a quick glance at the FreeBSD 8.2 Release Engineering TODO but it doesn't seem to mention anything.


----------



## AndyUKG (Dec 29, 2010)

Epikurean said:
			
		

> I tried the suggested patch, but unfortunately it killed my ZFS Pool:



Maybe someone else can confirm this, but I'd guess the patch makes your system incompatible with pools created on an unpatched system. If you reboot with your old kernel you may still be able to get your pool back...


----------



## olav (Feb 9, 2011)

Does anyone know if there are still problems with the newer WD20EARS drives?
They've become REALLY cheap here now.


----------



## danbi (Feb 9, 2011)

Probably because users increasingly do not want them anymore?


----------



## AndyUKG (Feb 9, 2011)

olav said:
			
		

> Does anyone know if there are still problems with the newer WD20EARS drives?
> They've become REALLY cheap here now.



Actually someone just posted a solution to one of the fundamental problems with 4k drives.

http://forums.freebsd.org/showthread.php?p=122617

The thread has gone off a bit onto how to measure performance, but the important thing is the trick to set ashift correctly for a pool containing EARS or other 4k disks. You use gnop to temporarily create a 4k device in /dev which is used in the initial zpool create. Once the pool is created, the ashift is set correctly and can never be changed, even when you remove the gnop device and replace it with a regular disk device.
I've not tested this, but setting ashift is fundamental to ZFS performing correctly on 4k drives so this is a solution in theory.

cheers Andy.

PS misalignment will still be possible depending on partitioning of the disk, so care is needed for this also. You can't just set ashift correctly and assume that is all that is needed.


----------



## olav (Feb 10, 2011)

Aha,

So if I have a raidz with 10 disks I only need to format the "first" one with gnop?

What about the spin-down after 8 seconds issue?


----------



## AndyUKG (Feb 10, 2011)

olav said:
			
		

> So if I have a raidz with 10 disks I only need to format the "first" one with gnop?



Yep, and after the raidz creation you can destroy the one gnop device. I just tested this; ZFS automatically uses the real devices after you delete the gnop device. I.e., you export the pool, destroy the ada1.nop device, and re-import the pool; ZFS uses the ada1 device instead. Therefore there's no further need for gnop, apart from at the time of vdev creation.



			
				olav said:
			
		

> What about the spin-down after 8 seconds issue?



Separate issue; this isn't related to ashift or 4k sectors.


----------



## bthomson (Feb 10, 2011)

olav said:
			
		

> Does anyone know if there are still problems with the newer WD20EARS drives?
> They've become REALLY cheap here now.



I have been running 6x WD15EARS (4k sector) for several months now with no problems... set ashift with gnop and you are good to go.


----------



## jem (Feb 25, 2011)

olav said:
			
		

> What about the spin-down after 8 seconds issue?



I've just noticed in the 8.2 Release Notes the following:



> The ada(4) driver now supports a new sysctl(8) variable kern.cam.ada.spindown_shutdown which controls whether or not to spin-down disks when shutting down if the device supports the functionality. The default value is 1.[r215173]



I wonder if this will prevent these Green drives from powering themselves down?


----------



## bfreek (Mar 31, 2011)

Hello. I'll just join the round.

I've got this baby: an HP ProLiant N36L,
plus *3x WD20EARS* to use in a *raidz1 pool* (FreeNAS 0.7.2 w/ ZFS v3 or v13, I think :q).

If I format them with *ufs*, I get *dd* write speeds of *~120mbyte/s*.
If I use them in the *3-drive raidz1 pool*, I get *dd* write speeds of *~35mbyte/s*.
I've used *gnop* pseudo-4k devices (ad*.nop) in both cases.

For days I've been crawling the web, reading threads like this one and completely lost my head over this.


- If I had known about the 4k issues with raidz, I wouldn't have bought such drives.
- 4k drives seem to be the future. What's the *future on the software side*? When will ZFS natively deal with this?
- I've spent 215 euros on these 3 drives and I sincerely believe there's a way to have them perform nearly as fast as in single usage (as stated above).
- Afaik, the Samsung F4 emulates 512b just like the WD20EARS does. Then how is it that those drives perform better in a raidz pool?
- *Everyone who's reporting NO PROBLEMS with 4k + raidz, for god's sake, tell us your read/write speeds!* "No problems" doesn't necessarily mean "great speed", as I also have "no problems" using them but also "no speed"...


----------



## aragon (Apr 1, 2011)

bfreek said:
			
		

> Everyone who's reporting NO PROBLEMS with 4k + raidz, for god's sake, tell us your read/write speeds!




```
$ dd if=/tmp/blah of=blah bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 2.584883 secs (405657012 bytes/sec)
```

/tmp/blah is on a TMPFS mount, generated from /dev/urandom.

5 Samsung F4 drives in raidz1.


----------



## AndyUKG (Apr 1, 2011)

bfreek said:
			
		

> If I format them with *ufs*, I get *dd* write speeds of *~120mbyte/s*.
> If I use them in the *3-drive raidz1 pool*, I get *dd* write speeds of *~35mbyte/s*.
> I've used *gnop* pseudo-4k-devices (ad*.nop) in both cases.


Quick point: write performance will be lower on ZFS RAIDZ than on UFS; to what degree depends on a lot of things. About your setup: have you verified that your pool actually has ashift=12 set? Also, even with ashift set correctly you can have misalignment problems. You seem to be using the whole disk, which I assume will be OK; maybe others with "no problems" can comment on what they have done to avoid these issues.



			
				bfreek said:
			
		

> 4k drives seem to be the future. What's the *future on the software side*? When will zfs natively deal with this?





ZFS already supports 4k drives perfectly; unfortunately, due to lack of support in other OSes, current 4k drives all emulate 512-byte drives! When drives start reporting their real block size to ZFS, ZFS will work great without any messing about. In the meantime, I agree it would be nice to have a built-in workaround for the 512-byte emulation.

thanks Andy.


----------



## jem (Apr 1, 2011)

ZFS performance is highly dependent on the amount of available memory.  The amount of RAM a ZFS system has should be stated along with any speed test results.

I'm no expert on such things, but as I understand it you need to ensure that the size of a read or write test is large enough so that it will definitely involve accessing the disk immediately instead of just caching/buffering the entire I/O operation in RAM.  A 1GB file read or write on a ZFS system with 4GB of RAM will probably give unrealistic results.

I too have just set up an HP ProLiant MicroServer with 8GB RAM and four 2TB Samsung F4 disks (which have 4KB sector sizes).  I have a raidz1 pool with ashift=12 across those four disks.  I ran some speed tests writing 100GB of /dev/zero to a file on the zpool.  I don't have my exact results to hand as the server is currently packed up for a house move, but if I recall correctly the speeds were around the 150MB/sec mark.

This is also with a sub-optimal number of disks in the raidz pool: 4 = 3 data + 1 parity.


----------



## bfreek (Apr 1, 2011)

AndyUKG said:
			
		

> Quick point, write performance will be lower on ZFS RAIDZ than UFS, to what degree depends on a lot of things. About your setup, have you verified that your pool actually has ashift=12 set? Also, even with ashift set correctly you can have misalignment problems, you seem to be using the whole disk which I assume will be ok, maybe others with "no problems" can comment what they have done to avoid these issues.


Sure, I just wanted to prove that the drives are capable of more.
But 120mb/s compared to 35mb/s is too much of a difference.

ashift: Yes, all drives were *.nop devices, at least at the time of `zpool create`.
And, of course, I have no partitions on them; the entire drive is used.



			
				AndyUKG said:
			
		

> ZFS already supports 4k drives perfectly; unfortunately, due to lack of support in other OSes, current 4k drives all emulate 512-byte drives! When drives start reporting their real block size to ZFS, ZFS will work great without any messing about. In the meantime, I agree it would be nice to have a built-in workaround for the 512-byte emulation.


I may be lacking a bit of understanding here: don't we circumvent the emulation with the *.nop method?
Are native 4k drives available already?



			
				jem said:
			
		

> ZFS performance is highly dependant on the amount of available memory.  The amount of RAM a ZFS system has should be stated along with any speed test results.


Good point, I missed it: *1GB ECC RAM*.
freenas 0.7.2 (freebsd 7.2-based)



			
				jem said:
			
		

> I'm no expert on such things, but as I understand it you need to ensure that the size of a read or write test is large enough so that it will definitely involve accessing the disk immediately instead of just caching/buffering the entire I/O operation in RAM.  A 1GB file read or write on a ZFS system with 4GB of RAM will probably give unrealistic results.


As stated above, 1GB. The results have been far too bad to indicate any caching in my tests, though.



			
				jem said:
			
		

> I too have just set up an HP ProLiant MicroServer with 8GB RAM and four 2TB Samsung F4 disks (which have 4KB sector sizes).  I have a raidz1 pool with ashift=12 across those four disks.  I ran some speed tests writing 100GB of /dev/zero to a file on the zpool.  I don't have my exact results to hand as the server is currently packed up for a house move, but if I recall correctly the speeds were around the 150MB/sec mark.
> 
> This is also with a sub-optimal number of disks in the raidz pool: 4 = 3 data + 1 parity.


150MB/s. My eyes are getting wet. What OS are you using?
*To clear this one up: does the F4 do 4k emulation just like the WD20EARS?* (That's what I know/read...)
*If that's the case, why do they perform better? RAM?*

btw: Forced capitalization? The internet doesn't have time for such a pointless thing.


----------



## usdmatt (Apr 1, 2011)

Firstly, don't use dd for performance testing. When I first started testing my own server I preferred the simple, clear throughput figure you get from dd, but I quickly realised it's all over the place. Just changing dd's blocksize option can make a massive difference to the output. Please use something like bonnie with a test file at least twice the size of your RAM.

Looking at your results, I am doubtful you're actually getting a real sustained 120MB/s throughput on UFS. Also, jem's result of 150MB/s seems unrealistic (although possible with fast hardware). It's well known that using /dev/zero for throughput testing is a bad idea.
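The block-size effect is easy to reproduce: write the same amount of data while varying only dd's bs argument and compare the reported rates. A quick sketch (GNU dd syntax with bs=1M; FreeBSD's dd takes bs=1m, and /tmp/ddtest is just a scratch path):

```shell
# Same 64 MiB of zeroes, same destination file; only the block size differs,
# yet the throughput dd reports can vary wildly between the two runs.
dd if=/dev/zero of=/tmp/ddtest bs=512 count=131072
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64
```

Neither number says much about the disk itself, which is why a tool like bonnie with a file larger than RAM is the better test.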

You also have to remember that ZFS is a complex filesystem. It has to calculate all the checksums and redundancy data and write all of this to disk along with metadata and backup metadata. It's designed to run on machines with at least 2GB of RAM (preferably 4) and modern, powerful CPUs. You'll probably see an improvement by increasing RAM and moving to FreeBSD 8.2 amd64 with ZFS v15.

In regard to Samsung/WD 4k drives, they both do the same emulation.

Just for comparison, here are some quick throughput figures from my system using bonnie (and dd) with 4GB files. (I have 4x 2TB 512b disks in raidz.)

Bonnie Write: 86MB/s
Bonnie Read: 130MB/s
Bonnie Write (compress=on): 95MB/s
Bonnie Read (compress=on): 104MB/s (Think my poor atom is struggling here...)

I think these are reasonable results for a zfs redundant array of 4 consumer disks with an atom d525 cpu / 2GB ram (FreeBSD 8.2 amd64).

Completely useless dd results:

Write from /dev/random and 8m block size to file: 19MB/s
Write from /dev/zero and 8m block size to file: 150MB/s
Write from /dev/zero with 1m block size: 379MB/s
Read with standard block size: 27MB/s
Read with 1m block size: 192MB/s


----------



## AndyUKG (Apr 1, 2011)

bfreek said:
			
		

> I may be lacking a bit of understanding here: don't we circumvent the emulation with the *.nop method?
> Are native 4k drives available already?



I was responding to your question of when ZFS will natively deal with 4k disks; as I stated, it already does. What ZFS doesn't currently deal with natively is 4k disks that report 512-byte sectors to the OS. My little afterthought was simply that in the meantime, while vendors are selling these 4k drives with 512-byte emulation forced on us, it would be nice if this was automatically taken care of by ZFS. Using gnop is obviously not "native"; it will be missed by some people completely and still allows for misalignment (when using partitions).

Andy.

PS: or perhaps the answer you were looking for is that gnop does not directly circumvent the 512-byte emulation. The disk presents each 4096-byte sector to the OS as eight 512-byte sectors, and gnop maps eight of the emulated sectors back into one 4096-byte sector. So you are not disabling the emulation, just putting another layer of abstraction on top to get back to where you started from (i.e. the real physical sector size).
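That 8:1 mapping also explains the classic misalignment case: older partitioning tools started partitions at logical sector 63, which is not a multiple of 8, so writes straddle physical sector boundaries. A small sketch of the arithmetic (the LBA values are illustrative):

```shell
#!/bin/sh
# 512e drive: eight 512-byte logical sectors per 4096-byte physical sector.
# A write starting at a logical LBA that is not a multiple of 8 forces the
# drive into a read-modify-write cycle on the underlying physical sector.
for lba in 63 64; do
    echo "LBA $lba -> physical 4K sector $((lba / 8)), byte offset $((lba % 8 * 512)) within it"
    if [ $((lba % 8)) -eq 0 ]; then
        echo "  aligned: whole physical sectors can be written directly"
    else
        echo "  misaligned: read-modify-write needed"
    fi
done
```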


----------



## danbi (Apr 2, 2011)

bfreek said:
			
		

> freenas 0.7.2 (freebsd 7.2-based)



You should note that ZFS has different versions. Each new version adds new features and sometimes dramatic performance improvements.

FreeBSD 7.2 has ZFS v6, 7.3+ has v13, 8.0 has v13, 8.1+ has v14, 8.2+ has v15, and 9.0+ has v28.

v15, which is in the current freebsd-stable, already made significant performance improvements even over v13, which in turn is far better than v6. v28 has even greater improvements for NAS use, primarily because of the "slim ZIL" feature.

It is a real pity that FreeNAS has not moved to a more recent FreeBSD version.


----------



## bfreek (Apr 2, 2011)

Sorry, it's FreeBSD 7.3 actually. The ZFS version is 13.

Your info leaves me thinking the 4k drives aren't the problem, but rather ZFS. I'm thinking about moving to ZFSguru or my own BSD setup. All this is really sickening.


----------



## bfreek (Apr 3, 2011)

Updates:

I installed FreeBSD 8.2, created a pool, installed samba, shared the pool and put some files (6-8gb) on it (remotely from a windows 7 machine).
Write speed: 60-70 mbyte/s
Read speed: 70-80 mbyte/s

I then did nothing for quite some time and eventually decided to do some more file copying.
All of a sudden, without further ado, transfer rates dropped to *15-20 mbyte/s* in both directions.

Any explanation, anyone?


----------



## olav (Apr 4, 2011)

You might have the same problem I had: a memory leak (memory disappearing). It happens with the combination of 8.2-RELEASE and Samba.


----------



## jem (Apr 4, 2011)

bfreek said:
			
		

> *1gb ecc ram*.



Sorry I didn't point this out earlier, but that there is your problem.  ZFS wants LOTS of memory.  I'd strongly recommend at least 4GB for a system running raidz pools.  It's why I threw 8GB RAM into my MicroServer.


----------



## bfreek (Apr 4, 2011)

olav said:
			
		

> You might have the same problem I had. Memory leak (disappearing). It happens with the combination of 8.2-release and Samba.


Please elaborate on that. What's up with that (disappearing) memory leak?



			
				jem said:
			
		

> Sorry I didn't point this out earlier, but that there is your problem.  ZFS wants LOTS of memory.  I'd strongly recommend at least 4GB for a system running raidz pools.  It's why I threw 8GB RAM into my MicroServer.


I've now put 5GB of RAM into it. Nothing really changed -.-


----------



## olav (Apr 5, 2011)

It's a bug in 8.2-RELEASE; there is already a patch for it available in 8.2-STABLE. Use top to check whether all your memory is there.


----------



## AndyUKG (Apr 5, 2011)

olav said:
			
		

> It's a bug in 8.2-Release, there is already a patch for it available with 8.2-Stable. Use top to check if all your memory is there.



Have you got a link to a bug report or anything? What does the leak affect? ZFS?

ta Andy.


----------



## AndyUKG (Apr 5, 2011)

bfreek said:
			
		

> I now have 5GB of ram put into it. Nothing really changed -.-



A lot of RAM won't help write performance in a benchmark (where the system is otherwise idle). RAM helps read performance with ZFS, and can help write performance on busy systems by avoiding reads from the disks, which cause IO contention...

ta Andy.


----------



## olav (Apr 5, 2011)

AndyUKG said:
			
		

> Have you got a link to a bug report or anything? What does the leak affect? ZFS?
> 
> ta Andy.



http://blog.vx.sk/archives/24-Backported-patches-for-FreeBSD-82-RELEASE.html


----------



## carlton_draught (Apr 8, 2011)

vermaden said:
			
		

> The *Samsung F3* ones are fast and low on power at the same time (available sizes 500GB/1TB):
> http://www.tomshardware.com/reviews/2tb-hdd-7200,2430-10.html
> 
> The only drawback is a slightly worse 'access time' than in *WD Black/Blue* drives.


Not sure about the 2TB drives, but the 1.5TB Samsungs are hopeless IME. I bought 5 of them, have RMA'd 3, and am about to send back the 4th, all in less than 1 year. I can't understand how they get good reviews on Newegg.


----------



## bfreek (Apr 8, 2011)

olav said:
			
		

> http://blog.vx.sk/archives/24-Backported-patches-for-FreeBSD-82-RELEASE.html


Interesting, thanks!



			
				carlton_draught said:
			
		

> Not sure about the 2TB drives, but the 1.5TB samsungs are hopeless IME. I bought 5 of them, have RMA'd 3 of them and about to send back the 4th, in less than 1 year. I can't understand how they get good reviews on Newegg.


Because they test them under unrealistic circumstances; that is, they buy them and test them for a few days. This is nowhere near a real use case...
I have very bad experience with Samsung drives in the long run - return rates of 60-70% within 1-2 years.

In the meantime, I just set up FreeNAS again, and so far I'm getting good Samba speeds of 60-90 mbyte/s in both directions across my 1 gbit LAN. (Drives set up via gnop 4k providers and a usual zpool raidz1 spanning my 3 WD20EARS; head parking on the WD20EARS disabled as well.)

Sidenote: the distinction between consumer and non-consumer drives is just bullshit, as computers tend to run for hours each day even at home.


----------



## carlton_draught (Apr 8, 2011)

bfreek said:
			
		

> Because they test them under unrealistic circumstances, that is buy them and test them for a few days. This is nothing near a real use case...


Well, to Newegg's credit, most of the 3-star-or-less reviews involve an RMA, so judging reliability by the summed percentage of 1-3 star reviews is not a terrible methodology in the scheme of things. If you have a better one, I'm all ears. Perhaps only use reviews from the last 6 months and do the same thing. If I do this for the drive I bought, it jumps from 26% to 40%. That's only 56 reviews to judge them on (down from 505), but I suppose as long as the sample size is 30+ it's reliable enough.

And by contrast, if you look at what I see as the shining example of reliability for the money (and it's an SSD of course, but at least it points to the utility of Newegg as a review site), the Intel X25 series scores something like 7% 3-star-or-less reviews. That's awesome. And when you browse to see why people actually gave them those ratings, it's stuff like "Never got the rebate", "cant use ssd toolbox with raid hp aloe motherboard", "Windows boot up wasn't all that impressive", and "Intel's Data Migration Utility software doesn't recognize the Intel SSD as made by Intel, so it won't even install!". Those are the 3-star reviews. And the 1-star? (There are no 2-star reviews.) 3/5 are non-reliability related! (2 are rebates, 1 doesn't know how to format a HDD.) To me, all those are actually really encouraging!

And when Intel publishes, alongside their new 320 series, that they had <1% return rates for their 2nd gen, it's believable to me. It cross-checks with what I see on Newegg and other sites (e.g. Amazon). And it's why, when I go to buy another SSD, I will be buying Intel despite other drives being reputedly faster.

Still, using that same methodology, Hitachi, which I think I'll try next, does not seem to be much better: 39% overall and 36% over the last 6 months. That is for the 7K2000. The 7K3000 scores only 14%, but all its reviews are 6 months old or less; too early to tell, really. I don't think any drive that has been out a while gets less than about 35% over the last 6 months.

And if you look at the 2TB Samsungs, which can't have been out for much more than 6 months judging by the number of reviews, they start at 17% overall and climb to 22% in the last 2 weeks (i.e. after approx 6 months). Again, if we go back to the Hitachi 2TB 7K3000, it appears to be better, as the overall is 14% and it drops to 12% in the last two weeks; but with only 16 reviews, how reliable is that?

Meh. The only thing I can gather from this is that HDDs do not seem to be made reliably in the 1.5TB+ era. It's like riding a motorcycle: it's not a matter of if you come off, but when. With HDDs it's not a matter of if they will fail, but what percentage will fail in the first year of ownership.

Which is why running ZFS + redundancy + backups is a no-brainer for any data you care about these days.



> I have very bad experience with Samsung drives on the long run - return rates of 60-70% within 1-2 years.


Well, I'm at 80% in 1 year. We'll see how I go.


> Sidenote: The distinction between consumer and non-consumer drives is just bullshit as computers tend to run for hours each day even at home.


Exactly. I'm not sure what the solution is. At least Hitachi is honest enough to rate for 24x7 usage, though they qualify that and say it's for low duty cycles.


----------



## carlton_draught (Apr 8, 2011)

Lol, I've just been browsing storagereview.com, and it appears that someone else uses the exact same methodology that I do!


----------



## Sebulon (Apr 8, 2011)

I recently bought a WD30EZRS, a 3TB 4K drive, and actually didn't get what all the fuss was about. I've been using it for some time now with zfs send/recv as a secondary pool, replicating my primary for disaster recovery - no problemo. I just shrugged it off, thinking that I'm just lucky, or that it liked me better than everyone else... =)

Well... it didn't. Yesterday I set into place eight 1TB drives as raidz2 for my primary pool and (until it's filled) one 3TB for the secondary, planning to add another 3TB later to match the size of the primary. So I had switched the polarity of the pools, so that I would be able to scrap my previous primary 4-drive pool and build this:


```
[karli@main ~]$ zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

	NAME               STATE     READ WRITE CKSUM
	pool1              ONLINE       0     0     0
        label/rack1:1      ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: scrub completed after 2h59m with 0 errors on Fri Apr  8 16:48:04 2011
config:

	NAME                STATE     READ WRITE CKSUM
	pool2               ONLINE       0     0     0
	  raidz2            ONLINE       0     0     0
	    label/rack-1:2  ONLINE       0     0     0
	    label/rack-1:3  ONLINE       0     0     0
	    label/rack-1:4  ONLINE       0     0     0
	    label/rack-1:5  ONLINE       0     0     0
	    label/rack-2:1  ONLINE       0     0     0
	    label/rack-2:2  ONLINE       0     0     0
	    label/rack-2:3  ONLINE       0     0     0
	    label/rack-2:4  ONLINE       0     0     0
	cache
	  label/cache1      ONLINE       0     0     0
	  label/cache2      ONLINE       0     0     0

errors: No known data errors
```

Then, when it was time to replicate everything back from the single-3TB pool1 to the eight 1TB drives in pool2, it backfired, with force. The messages log was flooded with ZFS errors and the send/recv failed shortly after. Scrubbing just made it worse. Fortunately for me, these things have made me paranoid enough to keep another, somewhat older backup on an ordinary 2TB drive, from the last time the entire system almost went down the drain. What doesn't kill you...

So to sum up: don't think you're safe just because you can write to the drive, because once you try to read that data back out, you're going to be sorry if you haven't gnop'ed it first!
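If you want to check a replica before you actually need it, a scrub forces every block to be read back and checksum-verified; something along these lines (using pool1 as the example name):

```
# zpool scrub pool1
# zpool status -v pool1
```

Once the scrub finishes, `zpool status -v` lists any files with unrecoverable errors.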

/Sebulon


----------



## bthomson (Apr 9, 2011)

carlton_draught said:
			
		

> And when Intel publishes that with their new 320 series, that they had <1% return rates for their 2nd gen, it's believable to me. And it cross checks with what I see in newegg and other sites (e.g. Amazon). And it is why that when I go to buy another SSD, I will be buying Intel despite other drives being reputedly faster.



No doubt. I bought a 40G 320 this week even though it is slower and more expensive than the alternatives.


----------



## fadolf (May 5, 2011)

I'm not sure if I get this right. I have 6 of those drives, which I plan to use for a raidz2 pool. Do I have to worry about the alignment if I use the gnop method to create the zpool? And if so, would this achieve what is necessary?


```
for i in /dev/ada*;do dd if=/dev/zero of=$i bs=1m count=1;done
for i in {0..5};do glabel label disk$i /dev/ada$i;done
for i in /dev/ada*;do gnop create -S 4096 $i;done
for i in /dev/ada*;do gpart create -s gpt $i;done
for i in /dev/ada*;do gpart add -t freebsd-zfs -b 2048 $i.nop;done
zpool create storage raidz2 ada0p1.nop ada1p1.nop ada2p1.nop ada3p1.nop ada4p1.nop ada5p1.nop
```


----------



## wonslung (May 6, 2011)

Sebulon said:
			
		

> Recently I bought a WD30EZRS, a 3TB 4K drive, and I didn't actually get what all the fuss was about. I'd been using it for some time with zfs send/recv as a secondary pool, replicating my primary for disaster recovery, no problem. I just shrugged it off, thinking that I was lucky, or that it liked me better than everyone else... =)
> 
> Well... it didn't. Yesterday I put into place eight 1TB drives in raidz2 for my primary pool and (until it's filled) one 3TB drive for the secondary, with another 3TB to be added later to match the size of the primary. So I had switched the polarity of the pools, so that I could tear down my previous 4-drive primary pool and build this:



The issue doesn't affect single drive or mirrored pools in nearly the way it does raidz and raidz2.


----------



## wonslung (May 6, 2011)

carlton_draught said:
			
		

> Exactly. I'm not sure what the solution is. At least Hitachi are honest enough to rate their drives for 24x7 usage, though they qualify that by saying it's for a low duty cycle.



I've used TONS of Hitachi 7k2000 drives with raidz and have had hardly any fail.  I've had a COUPLE arrive DOA, but I attribute that to the fact that Hitachi uses glass platters, which can get broken during shipping.

I've recently started using the 5k3000's and 7k3000's and so far haven't had to return any.

Now, I am aware this is anecdotal evidence, but I've used about 300+ of the 7k2000's in different servers (most running opensolaris or open indiana).

Take it for what it's worth.


----------



## AndyUKG (May 9, 2011)

fadolf said:
			
		

> I'm not sure if I get this right. I have 6 of those drives, which I plan to use for a raidz2 pool. Do I have to worry about the alignment if I use the gnop method to create the zpool? And if so, would this achieve what is necessary?
> 
> 
> ```
> ...



I've only played around a little with gnop, but I have an idea that if you create a gnop device for, say, ada0, that doesn't mean that ada0p1.nop will also exist. If I'm wrong, then your steps look good; if I'm right, then you will need to do a gnop create for the ada*p1 devices. Apart from that, yes, you do need to worry about alignment, but you are covered there by using the "-b 2048" option with gpart.

Thanks, Andy.


----------



## phoenix (May 9, 2011)

Why are you labeling the disks, then partitioning the disks directly, then using the partitions to create the pool?

A simpler method is to just label the disks, create the gnop devices using the labels, then create the pool using the gnop devices:

```
$ for i in 0 1 2 3 4 5; do glabel label disk0$i ada$i; done
$ for i in 00 01 02 03 04 05; do gnop create -S 4096 label/disk$i; done
$ zpool create storage raidz2 label/disk00.nop label/disk01.nop label/disk02.nop label/disk03.nop label/disk04.nop label/disk05.nop
```


----------



## fadolf (May 10, 2011)

AndyUKG said:
			
		

> I've only played around a little with gnop, but I have an idea that if you create a gnop device for, say, ada0, that doesn't mean that ada0p1.nop will also exist. If I'm wrong, then your steps look good; if I'm right, then you will need to do a gnop create for the ada*p1 devices. Apart from that, yes, you do need to worry about alignment, but you are covered there by using the "-b 2048" option with gpart



You're actually right, or well, half right: upon creating the partitions on the ada*.nop devices, there were ada*.nopp1 slices, which I used to create the zpool; but when I exported the pool and destroyed the gnop devices, the zpool was gone too. Those probably weren't valid devices to begin with. I should probably have made ada*p1.nop devices instead.

How can I verify the alignment, by the way?



			
				phoenix said:
			
		

> Why are you labelling the disks, then partitioning the disks directly, then using the partitions to create the pool?
> A simpler method is to just label the disks, create the gnop devices using the labels, then create the pool using the gnop devices:
> 
> ```
> ...



The gnop devices are needed to tell the OS the disks have 4K sectors, and the partitions are needed to get the proper alignment.
I only labelled the disks so I know which one is connected to which port in case the disks get disconnected from their cables (they also have a physical label, but those come off more easily).

So, taking everything into consideration, this would probably have been better (I have yet to try whether it works):


```
for i in {0..5};do glabel label disk$i /dev/ada$i;done
for i in {0..5};do glabel label disk$i /dev/label/disk$i;done
for i in /dev/ada*;do gpart create -s gpt $i;done
for i in label/disk*;do gpart add -t freebsd-zfs -b 2048 $i;done
do gnop create -S 4096 label/disk0p1
zpool create storage raidz2 label/disk0p1.nop label/disk1p1 label/disk2p1 label/disk3p1 label/disk4p1 label/disk5p1
```


----------



## phoenix (May 10, 2011)

If you gnop the disk but then add a GPT table to the disk and add the partition to the pool ... you get an ashift of 9 (512 byte sectors).

You have to gnop the device that you add to the pool.  Meaning, you have to gnop the partition in order to get ashift=12.  Which means you can't label the disk.

And, you can't gnop a GPT label.

However, if you use the entire disk, you automatically get proper alignment, since you start at sector 0.

So, again, why not just label the disk, gnop the label, and add the gnop device to the pool?

I've been playing around with glabel, gpart, gnop, and zpool for a couple of hours now, and what you want to do is not possible.
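As an aside, one way to confirm which ashift a pool actually ended up with is to dump the cached configuration with zdb (the pool name here is just a placeholder); a pool created from 4K gnop devices should show something like:

```
# zdb -C storage | grep ashift
                ashift: 12
```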


----------



## phoenix (May 10, 2011)

Okay, after more testing, and realising the difference between `# gnop create -s 4096` and `# gnop create -S 4096`, the following is what you want:


```
# gpart create -s GPT ada0
# gpart add -b 2048 -t freebsd-zfs ada0
# gpart modify -i 1 -l disk1 ada0
# gnop create -S 4096 gpt/disk1
# zpool create poolname gpt/disk1.nop
```

That will:

- create a single partition on the disk, starting at 1 MB
- label the partition with a meaningful name
- configure the labelled partition's GEOM provider to use 4 KB sectors
- add the labelled/gnop'd partition to the pool, thus setting ashift to 12

I'll leave it up to you to figure out how to script it.
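One possible way to script it, as a rough sketch: the disk names (ada0 through ada5), label names, and pool layout below are only assumptions, and each command is prefixed with echo so the script just prints what it would do. Drop the echo (and run as root) to execute for real.

```sh
#!/bin/sh
# Dry run: print the gpart/gnop/zpool commands for ada0..ada5.
# Delete the leading "echo" on each line to actually execute them.
n=1
for d in ada0 ada1 ada2 ada3 ada4 ada5; do
    echo gpart create -s GPT $d
    echo gpart add -b 2048 -t freebsd-zfs $d
    echo gpart modify -i 1 -l disk$n $d
    echo gnop create -S 4096 gpt/disk$n
    n=$((n + 1))
done
echo zpool create storage raidz2 \
    gpt/disk1.nop gpt/disk2.nop gpt/disk3.nop \
    gpt/disk4.nop gpt/disk5.nop gpt/disk6.nop
```

Printing the commands first makes it easy to sanity-check the device names before touching the disks.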


----------



## fadolf (May 10, 2011)

phoenix said:
			
		

> However, if you use the entire disk, you automatically get proper alignment, since you start at sector 0.



So if I just add a gnop provider with -S 4096 to the whole device, will it already be aligned correctly? So far the pool has an ashift of 12, which would be correct. 
Is this the way it is supposed to look if it's aligned correctly?


```
geom label list ada0
Geom name: ada0
Providers:
1. Name: label/disk0
   Mediasize: 2000398933504 (1.8T)
   Sectorsize: 512
   Mode: r1w1e1
   secoffset: 0
   offset: 0
   seclength: 3907029167
   length: 2000398933504
   index: 0
Consumers:
1. Name: ada0
   Mediasize: 2000398934016 (1.8T)
   Sectorsize: 512
   Mode: r1w1e2
```


----------



## Zare (Oct 22, 2012)

FreeBSD 9.0-RELEASE, Atom D510, 2GB RAM, 2x WD20EARX

UFS (frag-size 4096) on gmirror = 110M/s write, 156M/s read
ZFS mirror = 39M/s write, 48M/s read
ZFS mirror on 4k nop device = 80M/s write, 82M/s read
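Numbers like these are typically gathered with plain dd against a file on each filesystem. A rough sketch (the path and sizes are placeholders; for real measurements the file should be larger than RAM so ARC/page caching doesn't inflate the read figure):

```sh
#!/bin/sh
# Crude sequential-throughput check. bs=1048576 is 1 MB
# (FreeBSD's dd also accepts bs=1m); count is kept tiny here
# for illustration, so use a file bigger than RAM for real numbers.
f=/tmp/ddtest.$$
dd if=/dev/zero of="$f" bs=1048576 count=16 2>&1 | tail -1   # write speed
dd if="$f" of=/dev/null bs=1048576 2>&1 | tail -1            # read speed
rm -f "$f"
```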


----------

