# ZFS stripe performance



## unixro (Sep 24, 2013)

Hello all,

I have the following problem which I don't understand and I've been struggling for the past few days to solve or at least understand. My configuration is as follows:

```
FreeBSD rabbit.example.com 9.2-RC4 FreeBSD 9.2-RC4 #0: Tue Sep 24 15:33:26 UTC 2013     root@rabbit.example.com:/usr/obj/usr/src/sys/GENERIC  amd64
2xE5-2620
128Gb 
3xSAS2308 (built in mps driver, driver_version: 14.00.00.01-fbsd)
12xSamsung SSD 840 
and some others not important here.
```
The issue is that performance on the striped ZFS pool degrades as I add more disks, when it should be the other way around.

ZFS customization:

```
zfs set recordsize=4k Store
zfs set compression=lz4 Store
(rest default so sync=default)
```
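For reference, the layout and settings above can be applied and verified roughly like this (a sketch; `da0`/`da4`/`da8` are placeholder device names, one per HBA):

```shell
# Hypothetical sketch: create a striped pool across the three controllers.
zpool create Store da0 da4 da8

# Apply the customizations listed above.
zfs set recordsize=4k Store
zfs set compression=lz4 Store

# Verify what is actually in effect (sync stays at its default).
zfs get recordsize,compression,sync Store
```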
And now for some tests:
1 SSD in pool 


```
No retest option selected
        Record Size 4 KB
        File size set to 1048576 KB
        Command line used: iozone -i 0 -i 1 -i 2 -+n -r4K -s1G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
         1048576       4  328333       0   728087        0  676709  134946
```
2 SSD in pool from second controller 

```
No retest option selected
        Record Size 4 KB
        File size set to 1048576 KB
        Command line used: iozone -i 0 -i 1 -i 2 -+n -r4K -s1G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
         1048576       4  277307       0   695430        0  678698  140415
```
3 SSD in stripe with all 3 controllers

```
No retest option selected
        Record Size 4 KB
        File size set to 1048576 KB
        Command line used: iozone -i 0 -i 1 -i 2 -+n -r4K -s1G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
         1048576       4  235144       0   428078        0  451033  214316
```


----------



## unixro (Sep 24, 2013)

What makes it worse is that the IOPS actually degrade as well: `gstat` reports around 6000 IOPS with one SSD but only ~3000 with 2 SSDs, and `iostat` shows the same drop.
Tests were made with sync=always:

```
File size set to 204800 KB
        Command line used: iozone -O -i 0 -i 1 -i 2 -+n -r4K -s 200m
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
          204800       4    6103       0   207337        0  190212    3511
```
And with 2 SSDs:

```
File size set to 204800 KB
        Command line used: iozone -O -i 0 -i 1 -i 2 -+n -r4K -s 200m
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  random    bkwd   record   stride
              KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
          204800       4    5900       0   207428        0  190251    4704
```
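For anyone wanting to reproduce the observation, the per-disk IOPS can be watched while iozone runs (device names below are placeholders for this setup):

```shell
# In one terminal: per-provider ops/s and latency, refreshed every second.
gstat -I 1s

# Or with iostat, printing extended per-disk statistics every second
# for the pool members (da0 and da4 are placeholder device names):
iostat -x -w 1 da0 da4
```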

Here are the ZFS settings

```
vfs.zfs.l2c_only_size: 0
vfs.zfs.mfu_ghost_data_lsize: 131072
vfs.zfs.mfu_ghost_metadata_lsize: 24576
vfs.zfs.mfu_ghost_size: 155648
vfs.zfs.mfu_data_lsize: 5358592
vfs.zfs.mfu_metadata_lsize: 614400
vfs.zfs.mfu_size: 6358016
vfs.zfs.mru_ghost_data_lsize: 313931776
vfs.zfs.mru_ghost_metadata_lsize: 776704
vfs.zfs.mru_ghost_size: 314708480
vfs.zfs.mru_data_lsize: 12297728
vfs.zfs.mru_metadata_lsize: 4158976
vfs.zfs.mru_size: 20823040
vfs.zfs.anon_data_lsize: 0
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_size: 49152
vfs.zfs.l2arc_norw: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.arc_meta_limit: 33051212800
vfs.zfs.arc_meta_used: 27936768
vfs.zfs.arc_min: 16525606400
vfs.zfs.arc_max: 132204851200
vfs.zfs.dedup.prefetch: 1
vfs.zfs.mdcomp_disable: 0
vfs.zfs.nopwrite_enabled: 1
vfs.zfs.write_limit_override: 3221225472
vfs.zfs.write_limit_inflated: 0
vfs.zfs.write_limit_max: 8584478720
vfs.zfs.write_limit_min: 33554432
vfs.zfs.write_limit_shift: 0
vfs.zfs.no_write_throttle: 1
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.prefetch_disable: 1
vfs.zfs.no_scrub_prefetch: 0
vfs.zfs.no_scrub_io: 0
vfs.zfs.resilver_min_time_ms: 3000
vfs.zfs.free_min_time_ms: 1000
vfs.zfs.scan_min_time_ms: 1000
vfs.zfs.scan_idle: 50
vfs.zfs.scrub_delay: 4
vfs.zfs.resilver_delay: 2
vfs.zfs.top_maxinflight: 32
vfs.zfs.write_to_degraded: 0
vfs.zfs.mg_alloc_failures: 36
vfs.zfs.check_hostid: 1
vfs.zfs.deadman_enabled: 1
vfs.zfs.deadman_synctime: 1000
vfs.zfs.recover: 0
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5
vfs.zfs.vdev.cache.bshift: 13
vfs.zfs.vdev.cache.size: 0
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.write_gap_limit: 4096
vfs.zfs.vdev.read_gap_limit: 32768
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.ramp_rate: 2
vfs.zfs.vdev.time_shift: 29
vfs.zfs.vdev.min_pending: 1
vfs.zfs.vdev.max_pending: 2
vfs.zfs.vdev.bio_delete_disable: 0
vfs.zfs.vdev.bio_flush_disable: 0
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.cache_flush_disable: 1
vfs.zfs.zil_replay_disable: 0
vfs.zfs.sync_pass_rewrite: 2
vfs.zfs.sync_pass_dont_compress: 5
vfs.zfs.sync_pass_deferred_free: 2
vfs.zfs.zio.use_uma: 0
vfs.zfs.snapshot_list_prefetch: 0
vfs.zfs.version.ioctl: 3
vfs.zfs.version.zpl: 5
vfs.zfs.version.spa: 5000
vfs.zfs.version.acl: 1
vfs.zfs.debug: 0
vfs.zfs.super_owner: 0
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
```


----------



## webxtra (Oct 10, 2013)

Strange. Did you test all the SSDs individually? One of them might be bad.
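One quick way to rule that out is to benchmark each disk below ZFS with `diskinfo`'s built-in read test (a sketch; `da0`-`da2` are placeholder device names):

```shell
# Non-destructive per-disk read benchmark, one disk at a time,
# so a single slow SSD stands out immediately.
for d in da0 da1 da2; do
    echo "== ${d} =="
    diskinfo -t /dev/${d}
done
```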


----------



## unixro (Oct 11, 2013)

Yes, I did, and the performance gap really shows when I limit the L2ARC so that caching is disabled. Read throughput almost doubles when the second SSD is added, but when I add the third the increase is very small, and the same goes for the fourth, fifth, etc.
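For benchmarks like this, the usual ways to take caching out of the picture are a loader tunable capping the ARC, or disabling the data cache per dataset (a sketch; the 512M cap is an arbitrary example value, not a recommendation):

```shell
# /boot/loader.conf - cap the ARC so reads actually hit the disks
# (takes effect after a reboot):
#   vfs.zfs.arc_max="512M"

# Or, without a reboot, stop caching data blocks for this dataset only:
zfs set primarycache=metadata Store
```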


----------



## mav@ (Oct 13, 2013)

It is hard to say anything for sure without looking at a live system with proper tools. At the moment I am working on improving FreeBSD block-level storage performance. After a lot of changes, on a similar system with 16 SSDs on 4 LSI HBAs I was able to reach 800-900K IOPS and about 3.5GB/s of total bandwidth from the raw disks (http://people.freebsd.org/~mav/disk.pdf). Experiments with ZFS on top of that unfortunately showed much lower numbers, only about 120-180K IOPS (from the disks, without ARC caching). Analysis shows congestion on several locks inside ZFS. I've shown that to Andriy Gapon (avg@) and he is now thinking about it too. Your situation may be similar.
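If DTrace is available on the box, the lockstat provider gives a rough picture of where kernel threads block on locks (a hedged sketch; probe availability depends on the kernel build, and `dtraceall` may need to be loaded first):

```shell
# Load the DTrace modules if they are not already in the kernel.
kldload dtraceall

# Sum time spent blocked on adaptive (sleep) locks, keyed by kernel
# stack, while the benchmark runs; stop with Ctrl-C to print results.
dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
```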


----------

