# ZFS slow on FreeBSD 9



## Qaz (Apr 3, 2012)

Hello!

I'm installing FreeBSD 9.0-RELEASE on an HP DL120 G7. It has:
CPU: Intel E3-1240, 4 cores @ 3.3 GHz
RAM: 8 GB
Disks: 4x 2 TB HDD


```
# camcontrol devlist
<WDC WD20EARS-00MVWB0 51.0AB51>    at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD20EARS-00MVWB0 51.0AB51>    at scbus0 target 1 lun 0 (ada1,pass1)
<WDC WD20EARS-00MVWB0 51.0AB51>    at scbus1 target 0 lun 0 (ada2,pass2)
<WDC WD20EARS-00MVWB0 51.0AB51>    at scbus1 target 1 lun 0 (ada3,pass3)
```

No tuning in /boot/loader.conf, only these sysctls in /etc/sysctl.conf:

```
vfs.zfs.txg.write_limit_override=1073741824
kern.maxvnodes=250000
```

I created a raidz pool:


```
# zpool status
  pool: tank
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
```

When I just try to move files from one place to another I get around 10-15 MB/s.
What can I do about that?
Thanks.


----------



## SirDice (Apr 3, 2012)

Known issue. It's because those WD20EARS drives use 4K sectors but lie about it to the OS.

There are several threads covering these 4K drives.

Thread 15402
Thread 18879
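
You can check what a drive is reporting with diskinfo(8); the output below is only illustrative (exact fields vary by version), but these EARS drives typically report plain 512-byte sectors, which is exactly the problem:

```
# diskinfo -v ada0
ada0
        512             # sectorsize
        2000398934016   # mediasize in bytes (1.8T)
        3907029168      # mediasize in sectors
        0               # stripesize
```

A stripesize of 0 with a 512-byte sectorsize means the OS has no idea the media is really 4K, so ZFS has to be told manually.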


----------



## gkontos (Apr 3, 2012)

You need to align your drives for 4K. Have a look at this thread.
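
In short, the usual trick from those threads is to create the pool through a 4K gnop(8) provider so the pool ends up with ashift=12; roughly like this (a sketch, adjust partition names to your layout):

```
# gnop create -S 4096 ada0p3
# zpool create tank raidz ada0p3.nop ada1p3 ada2p3 ada3p3
# zpool export tank
# gnop destroy ada0p3.nop
# zpool import tank
# zdb | grep ashift     # should now show ashift: 12
```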


----------



## Qaz (Apr 5, 2012)

I fixed that issue, but I still get around 15 MB/s when I copy a file from one directory to another.
My /boot/loader.conf:


```
zfs_load="YES"
vfs.zfs.prefetch_disable="1"
vfs.root.mountfrom="zfs:tank/root"
vfs.zfs.txg.write_limit_override=1073741824
```


```
r8# zfs get all tank/root/d1
NAME          PROPERTY              VALUE                  SOURCE
tank/root/d1  type                  filesystem             -
tank/root/d1  creation              Wed Apr  4 13:29 2012  -
tank/root/d1  used                  502G                   -
tank/root/d1  available             4.65T                  -
tank/root/d1  referenced            502G                   -
tank/root/d1  compressratio         1.00x                  -
tank/root/d1  mounted               yes                    -
tank/root/d1  quota                 none                   default
tank/root/d1  reservation           none                   default
tank/root/d1  recordsize            128K                   default
tank/root/d1  mountpoint            /mnt/d1                inherited from tank/root
tank/root/d1  sharenfs              off                    default
tank/root/d1  checksum              off                    local
tank/root/d1  compression           off                    default
tank/root/d1  atime                 on                     default
tank/root/d1  devices               on                     default
tank/root/d1  exec                  on                     default
tank/root/d1  setuid                on                     default
tank/root/d1  readonly              off                    default
tank/root/d1  jailed                off                    default
tank/root/d1  snapdir               hidden                 default
tank/root/d1  aclmode               discard                default
tank/root/d1  aclinherit            restricted             default
tank/root/d1  canmount              on                     default
tank/root/d1  xattr                 off                    temporary
tank/root/d1  copies                1                      default
tank/root/d1  version               5                      -
tank/root/d1  utf8only              off                    -
tank/root/d1  normalization         none                   -
tank/root/d1  casesensitivity       sensitive              -
tank/root/d1  vscan                 off                    default
tank/root/d1  nbmand                off                    default
tank/root/d1  sharesmb              off                    default
tank/root/d1  refquota              none                   default
tank/root/d1  refreservation        none                   default
tank/root/d1  primarycache          all                    default
tank/root/d1  secondarycache        all                    default
tank/root/d1  usedbysnapshots       0                      -
tank/root/d1  usedbydataset         502G                   -
tank/root/d1  usedbychildren        0                      -
tank/root/d1  usedbyrefreservation  0                      -
tank/root/d1  logbias               latency                default
tank/root/d1  dedup                 off                    default
tank/root/d1  mlslabel                                     -
tank/root/d1  sync                  disabled               local
tank/root/d1  refcompressratio      1.00x                  -
```


```
r8# zpool status
  pool: tank
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0

errors: No known data errors
```


----------



## Zare (Apr 5, 2012)

How exactly did you "fix the issue"?


----------



## Qaz (Apr 5, 2012)

```
r8# zdb|grep ashift
            ashift: 12
```


----------



## Qaz (Apr 12, 2012)

Well... there are no answers, so we moved the system to Debian and disk performance is better... I'm very upset.


----------



## UNIXgod (Apr 12, 2012)

Qaz said:

> Well... there are no answers, so we moved the system to Debian and disk performance is better... I'm very upset.

How about UFS? Debian won't fix the disk problem either.


----------



## Crivens (Apr 12, 2012)

The Debian installer might align the partitions to the correct offsets.
Isn't it best practice to hand whole disks to ZFS, thus avoiding such things altogether?


----------



## Qaz (Apr 12, 2012)

We tried to set up the system on UFS; it was slower than ZFS, and with ZFS we have the right offsets:


```
r8# zdb|grep ashift
            ashift: 12
```

It's just slower and I don't know what to do.


----------



## Sebulon (Apr 12, 2012)

@Qaz

gkontos said:

> You need to align your drives for 4K. Have a look at this thread.



Using gnop is the only thing we know you have done, since you never told us anything more than that. Without more info, there's very little we can do for you. But depending on your setup, there may very well be more that could be done; we might have suggestions both config-wise and perhaps hardware-wise too.

Although, if you're happier using Debian, stick to that. At least you won't be disappointed.

/Sebulon


----------



## Qaz (Apr 12, 2012)

I have posted my sysctls, the type of RAID I use, and so on. I can give more information, just say what else is needed. I am aligning my drives for 4K and I think I have good hardware, hence my question: why is it so slow?


----------



## t1066 (Apr 13, 2012)

Why do you set vfs.zfs.prefetch_disable to 1? Set it back to 0 and see if performance improves.


----------



## Sebulon (Apr 13, 2012)

Crivens said:

> Isn't it best practice to hand whole disks to ZFS, thus avoiding such things altogether?



Well, it does simplify things. But you have to partition if you want to be able to boot from them.
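
For reference, making the disks bootable usually comes down to something like this (a sketch, device names are examples):

```
# gpart add -t freebsd-boot -s 64k ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
```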

/Sebulon


----------



## jem (Apr 13, 2012)

Qaz said:

> ```
> r8# zdb|grep ashift
> ashift: 12
> ```



I see that your ZFS pool was built on top of GPT partitions.  If those partitions were not 4KB sector aligned, then even a zpool with an ashift of 12 wouldn't have performed well.

Really, we would need to see the output of *gpart show* to check your GPT partition alignment, but I guess it's too late now.
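
For anyone checking this later: a GPT partition is 4K-aligned when its starting LBA is a multiple of 8 (8 × 512 B = 4096 B). With hypothetical output like the following, partition 2 starting at LBA 162 would not be aligned, since 162 is not divisible by 8:

```
# gpart show ada0
=>        34  3907029101  ada0  GPT  (1.8T)
          34         128     1  freebsd-boot  (64k)
         162  3907028973     2  freebsd-zfs  (1.8T)
```

Starting the partition at a multiple of 8 (for example 2048, the 1 MB boundary) avoids every 4K ZFS block straddling two physical sectors.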


----------



## Sylhouette (Apr 13, 2012)

Could you tell me how to make sure GPT uses 4K alignment?

Regards,
Johan


----------



## Qaz (Apr 13, 2012)

Yes, we did the gpart alignment.


----------



## t1066 (Apr 13, 2012)

Let us see how the drives are performing when you copy files.

First run the commands

`$ iostat -xz -w 1 -c 120 > iostat-2min.txt`

and

`$ zpool iostat -v 10 12 > zpool-2min.txt`.

This should capture two minutes of iostat output from your drives. In the meantime, also start copying files. Then post the results back here.


----------



## Qaz (Apr 17, 2012)

I don't have that server anymore, but I have another one and its performance is not very good either. Here are the listings:

http://pastehtml.com/view/bv1gfavuu.txt

http://pastehtml.com/view/bv1gp2e6e.txt


```
>uname -a
FreeBSD 8.2-RELEASE-p3 FreeBSD 8.2-RELEASE-p3 #0: Tue Sep 27 18:45:57 UTC 2011     
root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
```


```
NAME  PROPERTY  VALUE    SOURCE
tank  version   15       default
```


```
zfs get version
NAME           PROPERTY  VALUE    SOURCE
ssd            version   4        -
tank           version   4        -
tank/root      version   4        -
tank/root/tmp  version   4        -
tank/root/var  version   4        -
```


```
cat /boot/loader.conf
autoboot_delay="3"
loader_logo="beastie"

zfs_load="YES"
vfs.root.mountfrom="zfs:tank/root"
#geom_mirror_load="YES"
vfs.zfs.zio.use_uma="0"

pf_load="YES"

accf_data_load="YES"
accf_http_load="YES"

aio_load="YES"
```


```
#for zfs
vfs.zfs.txg.write_limit_override=1073741824
kern.maxvnodes=250000

#mysql zfs tuning
vfs.zfs.prefetch_disable=1
```


```
zpool status
  pool: ssd
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ssd         ONLINE       0     0     0
          ad8       ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ad4p3   ONLINE       0     0     0
            ad6p3   ONLINE       0     0     0

errors: No known data errors
```


```
CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz (3411.50-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x206a7  Family = 6  Model = 2a  Stepping = 7
real memory  = 17179869184 (16384 MB)
avail memory = 16431161344 (15669 MB)

atacontrol list
ATA channel 2:
    Master:  ad4 <ST33000651AS/CC45> SATA revision 2.x
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <ST33000651AS/CC45> SATA revision 2.x
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <OCZ-VERTEX3/2.15> SATA revision 2.x
    Slave:       no device present
```


----------



## lucinda (Apr 17, 2012)

Qaz, for the record I also have been considering 9.0-RELEASE as a replacement for a Solaris box. My setup is a fast motherboard with a Core i3, 2 GB RAM, one Adaptec 3085, and 4x 400 GB SAS disks. With an old release of Solaris Express Community Edition, build snv_113, I get > 300 MB/s read performance out of that setup. FreeBSD 9.0 cuts these speeds by a factor of 4, to 60-80 MB/s. I've done no tuning and I'll open another post asking for tips here, but it's frustrating that the same hardware is so slow. If it were for my home NAS I'd be worried about backups taking 24 hours instead of 6, but there's no way I can substitute a busy Solaris box at the office with FreeBSD with these numbers. The FreeBSD Handbook says amd64 is largely autotuning (Solaris is), but if this is the best FreeBSD can do I wonder why Solaris is 4 times as fast on the same hardware.


----------



## t1066 (Apr 17, 2012)

Looking at some sample output in bv1gfavuu.txt, 


```
extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4       56.9   0.0  4009.1     0.0    0  48.4  87
ad6       60.9   0.0  3551.1     0.0    7  54.8 101
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4       54.9   0.0  3603.0     0.0    7  49.8  90
ad6       57.9   0.0  3708.4     0.0    0  64.0  91
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4       57.9   0.0  4075.6     0.0    2  66.1  94
ad6       40.0   0.0  2756.3     0.0    2  37.6  69
```

both ad4 and ad6 are almost 100% busy but only achieve a total read speed of about 6 MB/s. The following snippets


```
extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4       24.0 364.3  1476.1 23789.7   10  17.5  99
ad6       28.9 355.3  1802.5 23009.7   10  20.6 101
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        2.0 561.6   127.7 55570.3   10  17.3 100
ad6        3.0 571.6   178.1 56349.8   10  17.2  99
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4        5.0 325.7   319.7 24960.3   10  35.0 101
ad6        3.0 431.6   114.9 38440.3   10  25.2  98
                        extended device statistics
device     r/s   w/s    kr/s    kw/s wait svc_t  %b
ad4       47.0 108.0  2808.8 13496.4   10  59.3 100
ad6       63.0   2.0  3662.5     9.0    7  98.1 103
```

show that writing is much more impressive. So your problem is actually read speed. I would advise setting vfs.zfs.prefetch_disable=0 first to see how it impacts the performance of the whole system.

P.S. The recordsize should be set to match the record/page size of the database.
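
For example, for a MySQL/InnoDB dataset that would be something like this (a sketch; 16K is InnoDB's default page size, assuming that is the engine in use):

```
# zfs create tank/mysql
# zfs set recordsize=16k tank/mysql
# zfs get recordsize tank/mysql
NAME        PROPERTY    VALUE  SOURCE
tank/mysql  recordsize  16K    local
```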


----------



## peter@ (Apr 18, 2012)

jem said:

> I see that your ZFS pool was built on top of GPT partitions.  If those partitions were not 4KB sector aligned, then even a zpool with an ashift of 12 wouldn't have performed well.



Speaking of GPT alignment, we're not doing ourselves any favors with the default number of partitions. The default of 128 partition entries in the header causes 34 sectors to be used. I don't recall the number I used; it might have been 240.

Mine typically look like this:

```
silo# gpart show ada0
=>        64  5860533041  ada0  GPT  (2.7T)
          64         192     1  freebsd-boot  (96k)
         256    33554432     2  freebsd-swap  (16G)
    33554688  5826978416     3  freebsd-zfs  (2.7T)
  5860533104           1        - free -  (512B)
```

I think I used commands like `gpart create -s gpt -n 240 ada0`, which ends up with the free space starting at sector 64 instead of 34. Then I used gnop with a 4096-byte sector size to get the 4K arrangement within ZFS.

I've seen recipes where people specify the start address to cause the partitions to be 4k aligned, but the gap irritated me so I made the table a little bigger.


```
silo# zdb |grep ashift
            ashift: 12
```
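
A sequence that would produce a layout like the above might be (a sketch reconstructed from the listing, not the exact commands I ran):

```
# gpart create -s gpt -n 240 ada0
# gpart add -t freebsd-boot -s 192 ada0
# gpart add -t freebsd-swap -s 16g ada0
# gpart add -t freebsd-zfs ada0
```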


----------



## phoenix (Apr 18, 2012)

Note:  ZFS in FreeBSD 8.2, while stable for most things, is not all that performant.  This is a known issue, and why you'll always get "upgrade to 8-STABLE after 8.2" advice when asking ZFS questions on the mailing lists.

Fortunately, FreeBSD 8.3 was just released, so you can stick to -RELEASE, and get all the latest ZFS fixes and speed-ups.

Note also, that just handing the entire drive to ZFS and using gnop is not enough to get proper 4K alignment.  Even if you give the whole drive to ZFS, it doesn't use the entire disk starting at LBA 0.  There's some slack at the beginning for various things.  To make sure you are 4K-aligned, use a single GPT partition that starts at the 1 MB boundary (-b 2048 or -b 512 depending) and spans the entire disk.  Then use that partition for ZFS.  You'll find things go much smoother/faster that way.
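
Put together, that is roughly (a sketch; -b 2048 assumes 512-byte logical sectors, i.e. a 1 MB start, and the label name is an example):

```
# gpart create -s gpt da0
# gpart add -t freebsd-zfs -b 2048 -l disk0 da0
# zpool create tank /dev/gpt/disk0
```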

Finally, with those drives, you need to get the wdidle3.exe program from Western Digital, and disable all the power-saving features (mainly the Idle Timeout).  The default for those drives is under 8 seconds, and will cause all kinds of havoc with RAID controllers and ZFS setups.  (We've actually returned/replaced all our *EARS* drives due to the horrible performance of the drives under FreeBSD.)


----------



## phoenix (Apr 18, 2012)

lucinda said:

> Qaz, for the record I also have been considering 9.0-RELEASE as a replacement for a Solaris box. My setup is a fast motherboard with a Core i3, 2 GB RAM, one Adaptec 3085, and 4x 400 GB SAS disks. With an old release of Solaris Express Community Edition, build snv_113, I get > 300 MB/s read performance out of that setup. FreeBSD 9.0 cuts these speeds by a factor of 4, to 60-80 MB/s. I've done no tuning and I'll open another post asking for tips here, but it's frustrating that the same hardware is so slow. If it were for my home NAS I'd be worried about backups taking 24 hours instead of 6, but there's no way I can substitute a busy Solaris box at the office with FreeBSD with these numbers. The FreeBSD Handbook says amd64 is largely autotuning (Solaris is), but if this is the best FreeBSD can do I wonder why Solaris is 4 times as fast on the same hardware.



I wouldn't qualify a Core i3 and 2 GB of RAM as "a fast motherboard".  I'd barely qualify it as "a usable desktop".    Especially if you are sticking SAS drives onto it.

There most certainly is something wrong with your setup.  My home NAS box can barely be considered "desktop-class", considering it's only a lowly 2.8 GHz P4 CPU (single-core, HTT-enabled) with 2 GB of RAM, running 32-bit FreeBSD 8-stable from January (r226546) using the on-board ICH7 SATA controller (non-AHCI) with 4x 500 GB WD Caviar Black HDs.

Pool configuration is a simple dual-mirror setup.  And I get, under normal usage, 40-60 MBps of throughput (as shown by *zpool iostat* and *gstat*) locally.  And I can see the odd burst up to 30 MBps per disk (120 MBps for the pool).

If my horribly under-powered setup can do 60 MBps with ancient SATA controllers and harddrives, then you're doing something wrong if you can only get 60 MBps out of SAS drives and controller.


----------



## gkontos (Apr 18, 2012)

peter@ said:

> I've seen recipes where people specify the start address to cause the partitions to be 4k aligned, but the gap irritated me so I made the table a little bigger.



True, there are many recipes out there regarding proper alignment. Quoting from revision r230059

`# gpart add -b 34 -s 94 -t freebsd-boot ad0`

I find this always works when dealing with ZFS-on-root systems.

Pools on large arrays also need proper alignment:

`# gpart add -t freebsd-zfs -l disk0 -b 2048 -a 4k daX`

I recently had to deal with a large array consisting of 19 striped mirrors on Intel SSDs with very poor performance.

The array had to be destroyed. Then gpart(8) was used to align each disk, followed by gnop(8). On top of that, a stripe of two SSDs was used for cache and a mirror for log.

The results were amazing; we could then easily see 100 MB/s of write speed on that array. At some point it reached 300 MB/s, which means disk speed was no longer the bottleneck.
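
Adding cache and log devices looks roughly like this (a sketch; the gpt labels are just examples):

```
# zpool add tank cache gpt/cache0 gpt/cache1
# zpool add tank log mirror gpt/log0 gpt/log1
```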


----------



## Crivens (Apr 19, 2012)

phoenix said:

> Note also, that just handing the entire drive to ZFS and using gnop is not enough to get proper 4K alignment.  Even if you give the whole drive to ZFS, it doesn't use the entire disk starting at LBA 0.


This is very interesting and certainly not what I would have expected. So what would be the point, then, of handing the complete device to ZFS?


----------



## lucinda (Apr 19, 2012)

phoenix said:

> I wouldn't qualify a Core i3 and 2 GB of RAM as "a fast motherboard".  I'd barely qualify it as "a usable desktop".    Especially if you are sticking SAS drives onto it.
> 
> There most certainly is something wrong with your setup.  My home NAS (...)



You know, you should perhaps apply to the Vatican, your mindset might be of use there. In religion you attack or demean anyone who dares to go against your world view, facts couldn't be less important. In science, we have respect for the facts and that's what makes the world a better place: a little humility in the face of reality. You ignore what I am saying and go on nitpicking what is fast or slow. When I said fast I said it in the context of the task at hand: ZFS. Perhaps I should have said 'fast like > 300 MB/s when Solaris is installed on it and "barely a usable desktop" when FreeBSD is running.' Perhaps then you would not have had so much opportunity for nitpicking and would have paid more attention to the facts. Besides, I never said it was a desktop, I don't know where you got that impression. But never let facts (or the absence of them) get in the way of a good disdainful post to put n00bies in their place. Administering Solaris boxes I would not qualify myself as a newbie but I certainly am a FreeBSD newbie, so you could have a point there 



			
phoenix said:

> If my horribly under-powered setup can do 60 MBps with ancient SATA controllers and harddrives, then you're doing something wrong if you can only get 60 MBps out of SAS drives and controller.



According to your logic I must be doing something wrong when FreeBSD is installed, but as my barely usable desktop smokes yours (and I didn't start the e-Penis comparison, you did, and for the record, I don't have one) when Solaris runs on it, I must be doing something right! We arrive at a contradiction. I think this is all the answer you will get from me, especially when you say what you say here - can you be held to your words? Then why the bickering in your reply?


----------



## gkontos (Apr 19, 2012)

lucinda said:

> In science, we have respect for the facts and that's what makes the world a better place:



Link: http://en.wikipedia.org/wiki/Scientific_method



			
lucinda said:

> Wow. This situation is so unprofessional of FreeBSD devs. I was about to create a thread about another slowness on FBSD 9.0 where SXCE is going four times as fast, same hardware, same disks, 512-byte sectors by the way, good old fast SAS disks. Now I won't even bother and I'll keep using OpenSolaris or Solaris Express. The numbers don't add up regarding ZFS on FreeBSD, which by any other measure is pretty good and stable. I use it for a number of things and it's been good but back to my complaint (and I don't mean to hijack the thread) I wonder why FreeBSD is trumpeting ZFS as 'stable' on http://www.freebsd.org since 8.2 when it's obviously incomplete, demonstrably slow, and now panics too. Just DON'T lie! ZFS on FreeBSD is not ready for production so don't lie on the front page, it hurts credibility terribly.



You cannot PROVE a hypothesis with a single experiment, because there is a chance that you made an error somewhere along the way.

Best Regards,
George


----------



## Sebulon (Apr 19, 2012)

@lucinda

I have set up many (more than I have fingers) FreeBSD systems; NAS and others, that have the same performance as when running Solaris. It's not that FreeBSD is incapable, but it can demand more fine-tuning and experience to reach the same level of performance. That's why I try my best to test and document as much as possible so that others can read and use that information. But it does require you to search for information that you perhaps didn't know beforehand you would have to search for, and that may be a part of why Solaris can feel "better": it requires less experience. Personally, I instead feel "confined" by Solaris, and that's why I choose FreeBSD any day.

But because there are plenty of people who can account for FreeBSD systems with equal or better performance than this particular system, with "similar" hardware, it is hard to just say:
FreeBSD = BAD
Solaris = GOOD

Although, until the official documentation reflects how to achieve this performance in FreeBSD from the beginning, we as a community can just do our best to come up with suggestions on how you can improve your current situation.

As always, no one will ever force you to use FreeBSD. So if you feel more happy using Solaris, then go ahead.

And lucinda, your gender is pretty obvious from your alias, and frankly, I don't care. And I hope this stays a place where *beep* size is only measured in IOPS.

/Sebulon


----------



## lucinda (Apr 19, 2012)

Sebulon said:

> I have set up many (more than I have fingers) FreeBSD systems; NAS and others, that have the same performance as when running Solaris.



During what? The last year? Then it would have been 8.2 at most. Can you point to a single benchmark on the Internet that substantiates your claim, namely FreeBSD ZFS as fast as Solaris ZFS?



			
Sebulon said:

> It's not that FreeBSD is incapable, but it can demand more fine-tuning and experience to reach the same level of performance. That's why I try my best to test and document as much as possible so that others can read and use that information.



I also like to do this, to some extent. Can you share some insight? My disks advertise 512-byte sectors, both physical and logical (Hitachi Ultrastar 15K450, HUS154545VLS300), which means no 4K strangeness. I handed them raw to zpool, which means they are 4K-unaligned, but that doesn't matter on 512-byte-sector disks, as we all know.

(If anyone is going to say "no, it matters, sometimes", please show some data with your claim)



			
Sebulon said:

> But it does require you to search for information that you perhaps didn't know beforehand you would have to search for, and that may be a part of why Solaris can feel "better": it requires less experience. Personally, I instead feel "confined" by Solaris, and that's why I choose FreeBSD any day.



Well, not exactly. I appreciate your point. I am not frightened by complexity when it buys me flexibility. I like FreeBSD; it's just that ZFS is very subpar compared to Solaris. When running FreeBSD no part of the system looks stressed (iostat/vmstat/gstat), yet it's slow. As for complexity vs. banging your head against a wall: the difference is very clear.



			
Sebulon said:

> Although, until the official documentation reflects how to achieve this performance in FreeBSD from the beginning, we as a community can just do our best to come up with suggestions on how you can improve your current situation.



This is what I miss: documentation. Short of reading the source code, what's left? The FreeBSD ZFS wiki and the Handbook both say amd64 is autotuning. Maybe it is, but again, it's so slow. Even if it weren't, there are only three sysctls listed and none apply to my case. Benchmarks everywhere show it's far slower. I don't want to be unjust; I know it's a complicated piece of software, and evolving. But on the wiki, just say it's slower. Every release saying it is 'significantly better' than the last gives no real information.


----------



## Sebulon (Apr 19, 2012)

@lucinda



> Can you point to a single benchmark on the Internet that substantiates your claim, namely FreeBSD ZFS as fast as Solaris ZFS?


I'll have to get back to you on the Solaris part. I know I put them somewhere around *newfs /dev/null*; the storage space there is unbelievable, so that's going to take a while to find again.

But I can happily link to my latest shenanigan: GELI Benchmarks. A smaller but powerful NAS, aimed at around 24 TB at most, but encrypted in real time with geli, and it scored 400 MB/s write with bonnie++. Speaking of bonnie, there are more people posting on the matter, myself included, with performance results from another of my systems: My ZFS V28 benchmarks. I would give more examples, but I have to run...

/Sebulon


----------



## Crivens (Apr 19, 2012)

OK, calm down a bit, will the lot of you? No, I am not a moderator and I think I would not like to be, but I do not think this is going in the right or polite direction.

@lucinda:
Maybe phoenix should have written that _something_ is going wrong, not that _you_ are doing this.

You mention iostat/vmstat/gstat being low, so CPU load should not be an issue there, should it? May I ask how much bandwidth the drives deliver raw when all of them are streaming to /dev/null? Just so we can rule out other factors like drivers and PCIe lanes, and be content to argue about ZFS on FreeBSD.


----------



## Beeblebrox (Apr 19, 2012)

Lucinda, like any one of us, is free to believe whatever she chooses.

I have not seen Phoenix make fun of / belittle anyone in any of his posts since I have been here; allegedly "controversial" post included.


----------



## phoenix (Apr 19, 2012)

Crivens said:

> This is very interesting and certainly not what I would have expected. So what would be the point, then, of handing the complete device to ZFS?



Works fine for non-4K drives, and makes management simpler, as you just label the drive.  No partitioning needed.
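
That labelling is typically done with glabel(8), for example (a sketch; label names are made up):

```
# glabel label disk0 da0
# glabel label disk1 da1
# zpool create tank mirror label/disk0 label/disk1
```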

But, with 4K drives, especially ones that emulate 512B sectors, you need to do some manual twiddling to get things properly optimised.


----------



## t1066 (Apr 19, 2012)

lucinda said:

> Qaz, for the record I also have been considering 9.0-RELEASE as a replacement for a Solaris box. My setup is a fast motherboard with a Core i3, 2 GB RAM, one Adaptec 3085, and 4x 400 GB SAS disks. With an old release of Solaris Express Community Edition, build snv_113, I get > 300 MB/s read performance out of that setup. FreeBSD 9.0 cuts these speeds by a factor of 4, to 60-80 MB/s. I've done no tuning and I'll open another post asking for tips here, but it's frustrating that the same hardware is so slow. If it were for my home NAS I'd be worried about backups taking 24 hours instead of 6, but there's no way I can substitute a busy Solaris box at the office with FreeBSD with these numbers. The FreeBSD Handbook says amd64 is largely autotuning (Solaris is), but if this is the best FreeBSD can do I wonder why Solaris is 4 times as fast on the same hardware.



I would make a guess. Since your system has only 2 GB of RAM, prefetch would be disabled in FreeBSD. This may be the cause of the perceived slowness. Could you check the output of

`$ zpool iostat -v`

Look at the read-bandwidth column. Is the bandwidth of the whole pool just a fraction of the sum of the bandwidths of the individual drives? If possible, also run the above command in Solaris and compare the two results.


----------



## Qaz (Apr 20, 2012)

I've enabled prefetch and run the test again

```
vfs.zfs.prefetch_disable: 0
```
iostat
zpool iostat


----------



## t1066 (Apr 20, 2012)

@Qaz

Comparing the output of *zpool iostat*, you can see that before the change the read speed was around 1.5 MB/s. After the change, it increases to 8 MB/s. Also, writing is more consistent at 16 MB/s. I guess that with your original machine an average write speed of 30 MB/s may be achievable.


----------



## lucinda (Apr 20, 2012)

t1066 said:

> I would make a guess. Since your system has only 2 GB of RAM, prefetch would be disabled in FreeBSD. This may be the cause of the perceived slowness. Could you check the output of
> 
> `$ zpool iostat -v`
> 
> Look at the read-bandwidth column. Is the bandwidth of the whole pool just a fraction of the sum of the bandwidths of the individual drives? If possible, also run the above command in Solaris and compare the two results.



I'll post all the data in one place, and your suggestion improved the situation a lot. I'll discuss below:
(Using code tags since nothing else preserves leading space)


```
# camcontrol devlist
<HITACHI HUS154545VLS300 A570>     at scbus0 target 0 lun 0 (pass0)
<HITACHI HUS154545VLS300 A570>     at scbus0 target 1 lun 0 (pass1)
<HITACHI HUS154545VLS300 A570>     at scbus0 target 2 lun 0 (pass2)
<HITACHI HUS154545VLS300 A570>     at scbus0 target 3 lun 0 (pass3)
```

dmesg (some, hopefully relevant)


```
CPU: Intel(R) Core(TM) i3 CPU         530  @ 2.93GHz (2942.49-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x20652  Family = 6  Model = 25  Stepping = 2
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x98e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 2147483648 (2048 MB)

(...)

aac0: <Adaptec RAID 3085> mem 0xfb600000-0xfb7fffff irq 18 at device 14.0 on pci3
aac0: Enabling 64-bit address support
aac0: Enable Raw I/O
aac0: Enable 64-bit array
aac0: New comm. interface enabled
aac0: [ITHREAD]
aac0: Adaptec 3085, aac driver 2.1.9-1
aacp0: <SCSI Passthrough Bus> on aac0
aacp1: <SCSI Passthrough Bus> on aac0
aacp2: <SCSI Passthrough Bus> on aac0

(...)

aacd0: <Volume> on aac0
aacd0: 429056MB (878706688 sectors)
aacd1: <Volume> on aac0
aacd1: 429056MB (878706688 sectors)
aacd2: <Volume> on aac0
aacd2: 429056MB (878706688 sectors)
aacd3: <Volume> on aac0
aacd3: 429056MB (878706688 sectors)

(...)

pass0 at aacp0 bus 0 scbus0 target 0 lun 0
pass0: <HITACHI HUS154545VLS300 A570> Fixed Uninstalled SCSI-5 device 
pass0: 3.300MB/s transfers
pass1 at aacp0 bus 0 scbus0 target 1 lun 0
pass1: <HITACHI HUS154545VLS300 A570> Fixed Uninstalled SCSI-5 device 
pass1: 3.300MB/s transfers
pass2 at aacp0 bus 0 scbus0 target 2 lun 0
pass2: <HITACHI HUS154545VLS300 A570> Fixed Uninstalled SCSI-5 device 
pass2: 3.300MB/s transfers
pass3 at aacp0 bus 0 scbus0 target 3 lun 0
pass3: <HITACHI HUS154545VLS300 A570> Fixed Uninstalled SCSI-5 device 
pass3: 3.300MB/s transfers
```

pool (note it's simply striped)


```
# zpool status
  pool: lykke2
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        lykke2      ONLINE       0     0     0
          aacd0     ONLINE       0     0     0
          aacd1     ONLINE       0     0     0
          aacd2     ONLINE       0     0     0
          aacd3     ONLINE       0     0     0
```

the default setting for < 4 GB


```
# sysctl -a | grep vfs.zfs.prefetch_disable
vfs.zfs.prefetch_disable: 0
```

and our friend ashift


```
# zdb | grep ashift
            ashift: 9
            ashift: 9
            ashift: 9
            ashift: 9
```

When reading 4 GB, this happens:

cores are mostly idle


```
# vmstat -P 5
 procs      memory      page                    disks     faults         cpu0     cpu1     cpu2     cpu3     
 r b w     avm    fre   flt  re  pi  po    fr  sr ad12 aa0   in   sy   cs us sy id us sy id us sy id us sy id
 0 0 0    773M   456M     0   0   0   0   201   0   0 241  976 67079 9966  0  3 97  0  6 94  0  4 96  0  3 97
 0 0 0    773M   456M     0   0   0   0   187   0   0 197  795 58149 8144  0  5 95  0  3 97  0  2 98  0  3 97
 0 0 0    773M   456M     0   0   0   0  1670   0   0 240  970 66496 9840  0  7 93  0  5 95  0  5 94  0  4 96
 0 0 0    746M   457M     7   0   0   0  3806   0   0  71  311 21995 3400  0  2 98  0  3 97  0  3 97  0  8 92
```

disks very lightly used


```
# iostat -d -n5 5
            ad12            aacd0            aacd1            aacd2            aacd3 
KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s 
  0.00   0  0.00  113.41  58  6.38  104.43  65  6.59  101.04  65  6.43  101.86  68  6.80 
  0.00   0  0.00  126.92 252 31.18  125.99 253 31.07  126.05 251 30.94  126.14 250 30.79 
  0.00   0  0.00  125.56 224 27.51  125.34 226 27.65  124.10 223 26.97  125.67 227 27.90 
 16.00   0  0.00  125.17 261 31.92  125.06 256 31.26  125.26 257 31.48  125.05 259 31.62 
  0.00   0  0.00  127.66 237 29.57  127.93 240 29.93  127.85 239 29.83  127.84 238 29.76 
  0.00   0  0.00  127.97 195 24.42  128.00 195 24.35  127.98 196 24.54  128.00 196 24.50
```

zpool iostat

```
# zpool iostat 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
lykke2      1.08T   562G    132      0  10.9M      0
lykke2      1.08T   562G  1.00K      0   126M      0
lykke2      1.08T   562G    887      0   108M      0
lykke2      1.08T   562G  1.02K      0   128M      0
lykke2      1.08T   562G    955      0   119M      0
lykke2      1.08T   562G    787      0  98.4M      0
```

and gstat shows 19-20% utilization

Now, doing


```
# sysctl vfs.zfs.prefetch_disable=0
vfs.zfs.prefetch_disable: 1 -> 0
```

the situation is somewhat improved, though with high variability: I do a find on a directory with 30 one-GB files and cat them to /dev/null; sometimes the numbers add up to what they should be, sometimes they fall off. This filesystem was created by zfs send/recv and I don't know whether files fragmented at the origin are fragmented at the destination, or whether these files were fragmented at the origin at all, or how much; I don't know how to check that in ZFS. A raw dd is shown further below, so if it isn't fragmentation, something (probably significant) is going on:


```
# zpool iostat 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
lykke2      1.08T   562G  2.92K      0   371M      0
lykke2      1.08T   562G  3.58K      0   454M      0
lykke2      1.08T   562G  3.51K      0   446M      0
lykke2      1.08T   562G  1.91K      0   241M      0
lykke2      1.08T   562G    696      0  83.6M      0
lykke2      1.08T   562G    753      0  90.8M      0
lykke2      1.08T   562G    812      0  98.0M      0
lykke2      1.08T   562G    709      0  85.4M      0
lykke2      1.08T   562G    727      0  87.6M      0
lykke2      1.08T   562G    698      0  84.2M      0
lykke2      1.08T   562G    753      0  90.5M      0
lykke2      1.08T   562G    748      0  89.9M      0
lykke2      1.08T   562G    961      0   116M      0
lykke2      1.08T   562G    858      0   104M      0
lykke2      1.08T   562G   1003      0   122M      0
lykke2      1.08T   562G    989      0   120M      0
lykke2      1.08T   562G    827      0   100M      0
lykke2      1.08T   562G  1.10K      0   137M      0
lykke2      1.08T   562G    860      0   104M      0
lykke2      1.08T   562G  1.02K      0   127M      0
lykke2      1.08T   562G  2.34K      0   296M      0
lykke2      1.08T   562G  3.69K      0   469M      0
lykke2      1.08T   562G  1.29K      0   162M      0
lykke2      1.08T   562G  1.03K      0   128M      0
lykke2      1.08T   562G  1.07K      0   133M      0
lykke2      1.08T   562G  1.06K      0   132M      0
lykke2      1.08T   562G  1.04K      0   130M      0
lykke2      1.08T   562G  1.08K      0   134M      0
lykke2      1.08T   562G  1.14K      0   142M      0
lykke2      1.08T   562G  1.02K      0   126M      0
lykke2      1.08T   562G    855      0   103M      0
lykke2      1.08T   562G    713      0  86.0M      0
lykke2      1.08T   562G    712      0  85.3M      0
```

The raw devices can do better:


```
# dd if=/dev/aacd1 of=/dev/null bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes transferred in 26.147740 secs (164257686 bytes/sec)
```

That's 157 MB/s, admittedly at the beginning of the disk. The average speed over the whole disk is probably closer to 100 MB/s, times 4 drives, which I hardly dare to calculate.

In fact the data path (disks + controller + OS) can handle the four devices at the same time: launching a dd for each disk, gstat shows 96-97%/disk and iostat shows this nice view:


```
# iostat -d -n5 5
            ad12            aacd0            aacd1            aacd2            aacd3 
KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s 
 16.00   0  0.00  128.00 1219 152.37  128.00 903 112.88  128.00 1018 127.25  128.00 1121 140.17 
 16.00   0  0.00  128.00 1225 153.14  128.00 1212 151.49  128.00 1220 152.54  128.00 1228 153.44 
  0.00   0  0.00  128.00 1228 153.52  128.00 1225 153.07  128.00 1227 153.35  128.00 1225 153.17 
  0.00   0  0.00  128.00 1220 152.51  128.00 1219 152.36  128.00 1226 153.23  128.00 1228 153.53 
 16.00   0  0.00  128.00 1225 153.10  128.00 1227 153.33  128.00 1227 153.40  128.00 1229 153.68 
  0.00   0  0.00  128.00 425 53.18  128.00 767 95.92  128.00 634 79.28  128.00 520 65.06
```

The write tests will have to wait for another day, as I have nothing that can come near 200-400 MB/s to keep the board busy, and this is a pool without redundancy, just to test the state of the art; lots of two-way mirrors (my preferred setup) will perform worse.

So, it's a nice improvement and I'll keep prefetch enabled; still not perfect, but if this speed does not lead to instability I am much happier. Thanks to everyone for the tips, and sorry for my grumpiness at the beginning!


----------



## t1066 (Apr 21, 2012)

@lucinda

Glad to be helpful.

Since FreeBSD and Solaris handle memory differently, as a safeguard you may want to add

```
vfs.zfs.arc_max="1G"
```

to /boot/loader.conf. Rerun the above test to see what impact this change has.


----------



## Sebulon (Apr 21, 2012)

@lucinda

If you could install benchmarks/bonnie++, I would like to see the output of:
`# bonnie++ -d /zfs/dir -u 0 (if root) -s 8g`

/Sebulon


----------



## lucinda (Apr 21, 2012)

Sebulon said:

> If you could install benchmarks/bonnie++, I would like to see the output of:
> `# bonnie++ -d /zfs/dir -u 0 (if root) -s 8g`



There you go:


```
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
bsdginz2.example 8G   157  99 491934  79 349396  66   412  99 887696  71  1039  16
Latency             68512us     162ms     304ms   54978us   34702us     111ms
Version  1.96       ------Sequential Create------ --------Random Create--------
bsdginz2.example.co -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20879  70 +++++ +++ 22373  99 31393  99 +++++ +++ 28047  99
Latency             91972us     124ms     686us   15007us      99us     133us
```


----------



## lucinda (Apr 21, 2012)

t1066 said:

> Since FreeBSD and Solaris handle memory differently, as a safeguard you may want to add vfs.zfs.arc_max="1G" to /boot/loader.conf. Rerun the above test to see what impact this change has.



At 1 G the results are much alike: it starts at 450 MB/s for 15 seconds (*iostat* and *zpool iostat* show the same; there is very little compression) and then quickly goes down to 100, hitting 250 once or twice, but mostly staying around 100 MB/s.

I looked into this filesystem and I thought all the files were 1 GB, but only about 10% of the files really are 1 GB; the rest are 700-800 MB and incomplete, because this is a torrent that was caught in a snapshot before finishing, which is what I have been using for testing.

So there is a bit of sparseness here; might this explain the drop in performance? I have also noticed abysmal performance in two filesystems that contain VMware ESXi vmdks, heavily sparse (like 10 to 1). The rest of the filesystem (950 GB, 370,000 files) takes 1 hour to *md5*, while these two (50 GB, only 280 files) take another hour... The disks are not loaded at all except for 40 MB/s every minute or two, and I see a big increase in the number of freed pages per second in vmstat, from 10k-200k over the 950 GB of non-sparse files to 0.9-1.2M when going over the sparse filesystems, which I don't understand at all. All the time there are about 100-170 MB free (*top*).

If I leave out the md5 and just *cat* a 21 GB file, 1.2 GB on disk, to /dev/null it takes three minutes, 110 MB/s. The 'real' files are going twice as fast! Has anyone also observed this slowdown when touching sparse files? Why are so many pages being freed at the same time?


----------



## t1066 (Apr 22, 2012)

lucinda said:

> There you go:
> 
> 
> ```
> ...



Have you enabled compression on the file system? The speed seems faster than what the hardware can deliver.



			
lucinda said:

> At 1 G the results are much alike: it starts at 450 MB/s for 15 seconds (*iostat* and *zpool iostat* show the same; there is very little compression) and then quickly goes down to 100, hitting 250 once or twice, but mostly staying around 100 MB/s.
> 
> I looked into this filesystem and I thought all the files were 1 GB, but only about 10% of the files really are 1 GB; the rest are 700-800 MB and incomplete, because this is a torrent that was caught in a snapshot before finishing, which is what I have been using for testing.
> 
> ...



Setting the limit to 1G is mainly for stability and to leave some memory for other programs. But maybe FreeBSD has improved in memory management so that this is no longer necessary. As for how FreeBSD deals with sparse files, you had better ask on the mailing lists.


----------



## lucinda (Apr 23, 2012)

t1066 said:

> Have you enabled compression on the file system? The speed seems faster than what the hardware can deliver.



Yes; these are the results with compression off:


```
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
bsdginz2.example 8G   165  99 354348  59 165725  29   405  96 485177  34 498.4   5
Latency             58438us     495ms    1274ms     150ms     119ms     198ms
Version  1.96       ------Sequential Create------ --------Random Create--------
bsdginz2.example.co -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 22161  73 +++++ +++ 20936  99 23220  87 28977  93 14834  92
Latency             87640us     111ms     257us     110ms   29366us   61132us
```




			
t1066 said:

> Setting the limit to 1G is mainly for stability and to leave some memory for other programs. But maybe FreeBSD has improved in memory management so that this is no longer necessary. As for how FreeBSD deals with sparse files, you had better ask on the mailing lists.



Thanks, I'll do that


----------



## gkontos (Apr 23, 2012)

@lucinda,

If this is not a production server and you are just experimenting, I would suggest an upgrade to FreeBSD 9.0-STABLE.

Perform the same tests as before with only the following options in your /boot/loader.conf:

First test:

```
vfs.zfs.arc_max="1024M"
```

Second test:

```
vfs.zfs.prefetch_disable=0
```

The first one should give you a more stable system given your RAM. The second one should give you much better read performance.


----------



## lucinda (Apr 23, 2012)

@gkontos,

Thanks for the suggestions. I tried 9.0 in the beginning, but 8.3 was recommended as a more stable option, so that is what I installed. Whatever I end up using, I should remind the thread that this pool is just a stripe and thus the speed is a bit unrealistic; everything should be divided by at least 2 for a pool with redundancy, so it's pointless to keep testing. I did a stripe to have enough room instead of destroying the real backup machine, which is also Solaris. I like speed, but the main point is stability, and for the moment I'll unload the filesystem here to an old machine running 8.3 or maybe 9.0 that I will dedicate to backups if all goes well. After some months of testing, adding snapshots, checksumming, comparing the FreeBSD and the Solaris backups, making sure it survives failures, and whatever else I can come up with, if I see no problems I will trust ZFS on FreeBSD more... Doing more tests now will show little. But thanks for all the help. FreeBSD enables me to reuse old machines, old disks (and old Promise ATA controllers!) that Solaris wouldn't even look at, thus freeing the fast hardware for more play.

I should now go and understand things like how the ARC behaves, whether it is given back to the OS under pressure from other programs (like a 'normal' cache or not), what is going on with what I saw earlier about sparse files, and other things that will come up as I go.


----------

