# NFS write performance with mirrored ZIL



## Sebulon (May 3, 2011)

Hi all!

*Current Highscore*

```
[B]Local writes 4k[/B]
5.4k rpm     160GB = 32MB/s
OCZ Vertex 2 60GB  = 33MB/s
Intel 320    120GB = 52MB/s
Zeus IOPS    16GB  = 55MB/s
OCZ Vertex 2 120GB = 56MB/s
Intel S3700  200GB = 58MB/s
HP 10k SAS   146GB = 59MB/s
SAVVIO 15k.2 146GB = 60MB/s
OCZ Deneva 2 200GB = 60MB/s
OCZ Vertex 3 240GB = 61MB/s
Intel X25-E  32GB  = 72MB/s


[B]Local writes 128k[/B]
OCZ Vertex 2 60GB  = 51MB/s
OCZ Vertex 2 120GB = 61MB/s
HP 10k SAS   146GB = 101MB/s
Intel 320    120GB = 128MB/s
Zeus IOPS    16GB  = 133MB/s
SAVVIO 15k.2 146GB = 165MB/s
Intel X25-E  32GB  = 197MB/s
OCZ Vertex 3 240GB = 271MB/s
OCZ Deneva 2 200GB = 284MB/s
Intel S3700  200GB = 295MB/s
```


```
[B][U]NFS Mirrored ZIL[/U][/B]

[B]Ordinary HW[/B]
Intel 320    40GB  = 30MB/s
OCZ Vertex 2 60GB  = 32MB/s
OCZ Vertex 2 120GB = 36MB/s
Intel 320    120GB = 52MB/s
Zeus IOPS    16GB  = 55MB/s
Intel X25-E  32GB  = 60MB/s
Intel S3700  200GB = 65MB/s
OCZ Deneva 2 200GB = 67MB/s
OCZ Vertex 3 240GB = 70MB/s


[B]HP DL380 G5[/B] (default controller settings)
Controller write cache = on
Drive write cache = off

OCZ Vertex 2 120GB = 49MB/s
Intel 320    120GB = 52MB/s
SAVVIO 15k.2 146GB = 56MB/s
HP 10k SAS   146GB = 58MB/s
Intel X25-E  32GB  = 67MB/s
```

I've been experimenting with a high-performance NAS based on FreeBSD and ZFS, and I have some questions about NFS write performance with SSD ZIL accelerators. First, the specs:

Hardware
Supermicro X7SBE
Intel Core2 Duo 2.13GHz
8GB 667MHz RAM
3x Lycom SATA II PCI-X controllers
2x Supermicro CSE-M35T-1


```
# camcontrol devlist
<WDC WD30EZRS-00J99B0 80.00A80>    at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD103SJ 1AJ10001>         at scbus1 target 0 lun 0 (ada1,pass1)
<SAMSUNG HD103SJ 1AJ10001>         at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD103SJ 1AJ10001>         at scbus5 target 0 lun 0 (ada3,pass3)
<SAMSUNG HD103SJ 1AJ10001>         at scbus6 target 0 lun 0 (ada4,pass4)
<SAMSUNG HD103SJ 1AJ10001>         at scbus7 target 0 lun 0 (ada5,pass5)
<SAMSUNG HD103SJ 1AJ10001>         at scbus8 target 0 lun 0 (ada6,pass6)
<SAMSUNG HD103SJ 1AJ10001>         at scbus9 target 0 lun 0 (ada7,pass7)
<SAMSUNG HD103SJ 1AJ10001>         at scbus10 target 0 lun 0 (ada8,pass8)
<OCZ-VERTEX2 1.29>                 at scbus12 target 0 lun 0 (ada9,pass9)
<OCZ-VERTEX2 1.29>                 at scbus13 target 0 lun 0 (ada10,pass10)
```


```
# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

	NAME                STATE     READ WRITE CKSUM
	pool1               ONLINE       0     0     0
	  raidz2            ONLINE       0     0     0
	    label/rack-1:2  ONLINE       0     0     0
	    label/rack-1:3  ONLINE       0     0     0
	    label/rack-1:4  ONLINE       0     0     0
	    label/rack-1:5  ONLINE       0     0     0
	    label/rack-2:1  ONLINE       0     0     0
	    label/rack-2:2  ONLINE       0     0     0
	    label/rack-2:3  ONLINE       0     0     0
	    label/rack-2:4  ONLINE       0     0     0
	logs
	  mirror            ONLINE       0     0     0
	    gpt/ssd-1:1     ONLINE       0     0     0
	    gpt/ssd-2:1     ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: none requested
config:

	NAME              STATE     READ WRITE CKSUM
	pool2             ONLINE       0     0     0
	  label/rack-1:1  ONLINE       0     0     0

errors: No known data errors
```

I've partitioned the SSDs to be used as both ZIL and L2ARC devices with mirroring, but I'm only using the ZIL partitions for the moment. I noticed a rather big performance hit when using them as both at the same time: about a 30% drop, in fact.

```
# gpart show ada9
=>       34  117231341  ada9  GPT  (55G)
         34         30        - free -  (15k)
         64   33554432     1  freebsd-zfs  (16G)
   33554496   83676879     2  freebsd-zfs  (39G)

# gpart show ada10
=>       34  117231341  ada10  GPT  (55G)
         34         30         - free -  (15k)
         64   33554432      1  freebsd-zfs  (16G)
   33554496   83676879      2  freebsd-zfs  (39G)
```

Now for local performance:

```
# dd if=/dev/random of=/tmp/test16GB.bin bs=1m count=16384
# dd if=/tmp/test16GB.bin of=/dev/zero bs=4096 seek=$RANDOM
17179869184 bytes transferred in 50.711092 secs (338779318 bytes/sec)
# dd if=/tmp/test16GB.bin of=/dev/gpt/ssd-1\:2 bs=4k seek=$RANDOM
17179869184 bytes transferred in 381.576081 secs (45023444 bytes/sec)
# dd if=/dev/zero of=/dev/gpt/ssd-1\:2 bs=4k seek=$RANDOM
42738441728 bytes transferred in 623.003697 secs (68600623 bytes/sec)

In comparison to:
# newfs -b 32768 /dev/gpt/ssd-1\:2
# mount /dev/gpt/ssd-1\:2 /mnt/ssd/
# dd if=/tmp/test16GB.bin of=/mnt/ssd/test16GB.bin bs=1m
17179869184 bytes transferred in 348.755907 secs (49260439 bytes/sec)

And also:
# dd if=/dev/zero of=/dev/gpt/ssd-1\:2 bs=1m
42842562048 bytes transferred in 187.442289 secs (228564015 bytes/sec)
```

So just writing zeros is fine at 228MB/s, but writing random data, as in everyday use, tops out at about 45-50MB/s no matter how, when, or where. Weird.

Accordingly, that is about the performance I get through NFS: roughly 30MB/s write speed.

Am I missing something? The data sheet for these SSDs boasts 50,000 4k random-write IOPS, and I had expected at least 100MB/s NFS write.
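As a quick sanity check on that expectation (using the data-sheet numbers, not measurements), the quoted IOPS figure converts to raw throughput like this:

```python
# Convert a quoted 4k random-write IOPS figure into throughput.
# 50,000 IOPS is the data-sheet claim; 4 KiB per operation is assumed.
iops = 50_000
bytes_per_op = 4 * 1024

throughput_mb = iops * bytes_per_op / 1_000_000  # decimal MB/s, as vendors quote
print(f"{throughput_mb:.1f} MB/s")  # prints "204.8 MB/s"
```

On paper, then, the drives should sustain roughly 200MB/s of 4k writes, several times what I'm actually seeing.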

/Sebulon


----------



## SirDice (May 3, 2011)

As I understand it, a separate ZIL only helps with synchronous writes. It does nothing for asynchronous ones.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

Also keep in mind that reading from /dev/random is quite a lot slower than reading from /dev/zero. Obviously dd can't write data faster than it receives it.


----------



## AndyUKG (May 3, 2011)

Sebulon said:
			
		

> ```
> # newfs -b 32768 /dev/gpt/ssd-1\:2
> # mount /dev/gpt/ssd-1\:2 /mnt/ssd/
> # dd if=/tmp/test16GB.bin of=/mnt/ssd/test16GB.bin bs=1m
> ...



Hi,

  In the first case above you are copying from /tmp, which is going to be limited by the read speed of whatever disk your /tmp file system is on. Doesn't that explain why you are limited to 50MB/sec? Try *dd*'ing from /tmp to /dev/null and see if you hit the same limit...

cheers Andy.


----------



## Sebulon (May 3, 2011)

Hi,

thank you so much for your input on this. I'm reading and reading and reading, but I can't seem to make any sense of this. After reading tons of performance tests, charts, and whitepapers on these SSDs, comparing them to, for example, the ZeusIOPS SSDs that Sun/Oracle themselves use in their Unified Storage Systems, it looks to me like OCZ's Vertex 2 actually has better specs. But we have Oracle storage systems at work that can easily shuffle 100MB/s NFS write over 1Gbps, while I only get around 30MB/s out of my system.

@Andy
Quoting myself:

```
# dd if=/tmp/test16GB.bin of=/dev/zero bs=4096 seek=$RANDOM
17179869184 bytes transferred in 50.711092 secs (338779318 bytes/sec
```
/tmp is on my primary pool and scored 338MB/s to /dev/zero, so no, that shouldn't be a bottleneck.

@SirDice
Before I started with this, I used mdmfs to slice off 3GB of RAM and added that as a single ZIL device, and got around 80-90MB/s NFS write, instead of the 30MB/s I have now. Then I had to crash my primary pool and restore from the secondary to get rid of it again, but I can definitely say that it matters to have a good SLOG device or devices.

----------------------------

Looking at performance tests of this drive from AnandTech, the drive is supposed to be able to write 51.6MB/s random 4k unaligned, and a whopping *164MB/s random 4k aligned*. Following that tidbit, it would seem as if my writes aren't really 4k aligned, but I think they are. This is what I did when I added the partitions to the pool:

```
# gpart create -s gpt ada9
# gpart add -l ssd-1:1 -t freebsd-zfs -b 64 -s 16G ada9
# gnop create -S 4096 /dev/gpt/ssd-1:1
```
and then added the gnop device to the pool:

```
# zpool add pool1 log mirror gpt/ssd-{1:1.nop,2:1.nop}
```
I then exported the pool, destroyed the nop devices, and reimported the pool again; it worked like a charm. I then confirmed this by running zdb and seeing ashift=12 on the mirrored log vdev.
By doing all this, I believe that 1) the partitions are aligned to 4k and 2) they are recognized and used as 4k devices by ZFS.
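Both halves of that belief can be checked with a little arithmetic (a sketch; GPT `-b` offsets count 512-byte sectors):

```python
SECTOR = 512  # gpart "-b" offsets are given in 512-byte sectors

# 1) A partition starting at sector 64 begins at a 4 KiB-aligned byte offset.
start_bytes = 64 * SECTOR
assert start_bytes == 32768 and start_bytes % 4096 == 0

# 2) ashift=12 means ZFS treats the vdev as having 2^12 = 4096-byte sectors,
# so every block it issues is a multiple of 4 KiB.
assert 2 ** 12 == 4096
print("both checks pass")
```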

What am I missing here, guys and girls?

I'm really hoping someone will spot some obvious flaw that I've been too close to see myself.
If I (we?) can figure out how to squeeze out the 30,000 random 4k IOPS that the specs claim, you'll later find me skipping across little white clouds up in the blue, and I will be more than happy to share my experience =)

/Sebulon


----------



## AndyUKG (May 4, 2011)

Have you had a look at the %busy of the SSDs during the tests, e.g. with gstat? How does it look? Bad?


----------



## aragon (May 4, 2011)

You aren't by any chance running with power-saving options enabled?  Maybe post us the output of:

`$ sysctl dev.cpu`


----------



## Sebulon (May 5, 2011)

@aragon:

```
# sysctl dev.cpu
dev.cpu.0.%desc: ACPI CPU
dev.cpu.0.%driver: cpu
dev.cpu.0.%location: handle=\_PR_.CPU0
dev.cpu.0.%pnpinfo: _HID=none _UID=0
dev.cpu.0.%parent: acpi0
dev.cpu.0.freq: 2133
dev.cpu.0.freq_levels: 2133/35000 1866/30625 1600/16000 1400/14000 1200/12000 1000/10000 800/8000 600/6000 400/4000 200/2000
dev.cpu.0.cx_supported: C1/0 C2/85
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_usage: 100.00% 0.00% last 274us
dev.cpu.1.%desc: ACPI CPU
dev.cpu.1.%driver: cpu
dev.cpu.1.%location: handle=\_PR_.CPU1
dev.cpu.1.%pnpinfo: _HID=none _UID=0
dev.cpu.1.%parent: acpi0
dev.cpu.1.cx_supported: C1/0 C2/85
dev.cpu.1.cx_lowest: C1
dev.cpu.1.cx_usage: 100.00% 0.00% last 246us
```
Good? Bad?

@AndyUKG:
Pegged at 100%,
which leads me to believe that they really can't handle more than that. Plus, I've found the REALLY fine print on OCZ's webpage, which I think shows the drives' small sequential write performance. It doesn't say how big the block size is anywhere there, but the numbers are the same as what I'm getting when using them as ZIL devices, so...

-------------------------------

Now for the conclusion:

For NFS write on my 8-drive 7.2k rpm SATA II raidz2 pool:

~30MB/s - mirrored ZIL
~40MB/s - no ZIL at all
~50MB/s - striped ZIL ashift=9
~60MB/s - striped ZIL ashift=12

-------------------------------

More interesting is reading on in the fine print from OCZ:
*OCZSSD2-2VTXE60G    35MB/s* (that's mine)
*OCZSSD2-2VTX100G    75MB/s*
*OCZSSD2-2VTXE120G   80MB/s*
Double the size, double the performance; seems logical. So if I could get my hands on two 100GB or 120GB drives, I should be able to attain about the same performance as I have now, but with a mirrored ZIL instead, which is a must right now, at least in FreeBSD. Very interesting indeed.

I'm rather satisfied with this, because I now know exactly what to expect from SSDs as ZIL accelerators in 2011. I mean, if we look back just 2 or 3 years, SSDs were terrible. If we give it 2 or 3 more years, perhaps they will be all the way there, with 100MB/s small sequential write. But my fear is that the manufacturers just don't care about that. They have produced drives that push the boundaries even of SATA 6Gb/s now, at least at larger block sizes. So as far as the masses are concerned, they are flawless.

Oracle uses ZeusIOPS SSD drives; NetApp sacrifices one DIMM slot, giving the system 5GB of RAM, keeps the last 1GB battery-backed and uses that as a ZIL. There's also stuff like the DDRdrive X1, and OCZ has their Z-Drive series, plus the Velo- and RevoDrive series. All of these are just too expensive for a normal person, especially since you need not just one, but two =)

Is it really that hard to get 100MB/s NFS write with standard consumer-grade products? Impossible, even? I find that odd. If anyone out there has achieved this, please speak up.

/Sebulon


----------



## AndyUKG (May 5, 2011)

http://www.ocztechnology.com/ocz-vertex-3-sata-iii-2-5-ssd.html ?


----------



## Sebulon (May 5, 2011)

@Andy
Pfft! Rubbish
"4k file writes up to 60,000 IOPS"
The leading part of it being *up to*. But they'll make you pay twice the price for it =)

There's also the Vertex 3 MAX IOPS, with *"up to"* 75,000 IOPS. But that's random writes. What you want for a ZIL drive is sequential 4k writes.

My two SSDs may not make it all the way up to 100MB/s over NFS, but still, I think it's really cool that 2 SSDs beat 8 regular drives in raidz2.

The results would have been different with 4x mirrored vdevs. Regardless, I scored 60MB/s with two striped 60GB logs. With two striped 120GB drives, I could potentially score twice that = 120MB/s, wire speed. That also means that as long as you have two SSDs, or four to have them mirrored, you can build the pool any way you'd like: mirrored vdevs, raidz2, raidz3, it doesn't matter; you'd be able to get 100MB/s NFS writes.

I'm going to have to buy two of those 120GB drives just to see this with my own eyes.

PS. I just realized there's got to be a break-even point. I scored 40MB/s with 8 drives, which means that with 3x the drives (24), you could potentially score 120MB/s without any SLOG at all. But if you're going to build a system with fewer drives than that, you can definitely benefit from having two or more SSDs to even out the load.
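That break-even figure is just linear extrapolation of the per-drive contribution (a rough model that assumes throughput scales linearly with drive count, which real pools rarely do exactly):

```python
# Observed: ~40 MB/s NFS write from an 8-drive raidz2 pool with no SLOG.
per_drive_mb = 40 / 8              # ~5 MB/s contributed per drive
target_mb = 120                    # roughly gigabit wire speed
drives_needed = target_mb / per_drive_mb
print(drives_needed)  # prints 24.0
```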

/Sebulon


----------



## AndyUKG (May 5, 2011)

Sebulon said:
			
		

> @Andy
> Pfft! Rubbish
> "4k file writes up to 60,000 IOPS"
> The leading part of it being *up to*.



Did you look at the full specs?



> Max Write: up to 500MB/s


----------



## Sebulon (May 5, 2011)

@AndyUKG
Yes *"up to"* 500MB/s
But that's with 128k block size. We are after 4k block size.

Reading even more on this subject, I've found out that Intel's 320 120GB is supposed to handle 130MB/s sequential 4k writes!!!
If that's true, it means that you could have two of those mirrored and still push 100MB/s write over NFS!!!
That means they are better and also cheaper than the 120GB OCZ drives. God damn it, I just ordered two OCZs =(

But I will take this opportunity to first test the 120GB OCZ drives, then send them back and order two Intel 120GB drives instead =)

I will keep you all posted.

/Sebulon


----------



## AndyUKG (May 5, 2011)

Sebulon said:
			
		

> @AndyUKG
> Yes *"up to"* 500MB/s
> But that's with 128k block size. We are after 4k block size.



Even with 4k writes, at 60k IO/sec that gives you 240MB/sec. Also, with regard to random vs sequential writes, I think even on SSDs random is slower than sequential, so where you read a random-write figure you should be able to assume equal or greater sequential performance.
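The arithmetic behind that figure, assuming 4 KiB per operation:

```python
iops = 60_000           # the vendor's "up to" 4k write IOPS
bytes_per_op = 4 * 1024

mb_per_s = iops * bytes_per_op / 1_000_000
print(f"{mb_per_s:.2f} MB/s")  # prints "245.76 MB/s", i.e. roughly 240MB/sec
```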

Andy.


----------



## Sebulon (May 5, 2011)

@Andy:
No, they don't handle that much. You don't get 60k IOPS at 4k sequential write.
If you manage to find a drive that actually handles 240MB/s 4k sequential write, let me know; I'll be the first to buy =)
The ones claimed to be best so far at 4k sequential write are:
OCZ Vertex 2 120GB - 80MB/s
Intel 320 120GB - 130MB/s

So far, I have tested and scored the following NFS write speeds with a mirrored log:
OCZ Vertex 2 60GB - 30MB/s
Intel 320 40GB - 30MB/s

So far, the manufacturers' numbers for sequential 4k writes seem to match what I've been able to score on NFS write, so maybe in this case "size does matter": hopefully I will score higher NFS write the bigger the SSDs I use.

But remember, that higher throughput remains to be proven! I haven't seen that multiplied performance IRL yet, but I promise, you will be the first to know =)

Exciting times

/Sebulon


----------



## AndyUKG (May 5, 2011)

Sebulon said:
			
		

> @Andy:
> No, they don't handle that much. You don't get 60k IOPS at 4k sequential write.



Where are you getting that info? Searching "benchmark OCZ Vertex 3" the second link that comes up is this:

http://www.tomshardware.com/reviews/vertex-3-sandforce-ssd,2869-10.html

Shows writes near 60K IO/sec, and about 240MB/sec using 4k....


----------



## Sebulon (May 5, 2011)

@Andy:
You are correct. That adds up! They were testing the 240GB model from your link.

Sources:
From anandtech on Vertex 3
And same on Vertex 2

Vertex 3 240GB scored 49152 IOPS 6Gb/s with random data
Vertex 3 240GB scored 39168 IOPS 3Gb/s with random data

Vertex 3 120GB scored 41472 IOPS 6Gb/s with random data
Vertex 3 120GB scored 38912 IOPS 3Gb/s with random data

Vertex 2 100GB scored 41984 IOPS 3Gb/s with random data

Then, if 4k random writes are exactly the same as (or even better than) 4k sequential writes, my drive should be able to push 84MB/s. From your link, the 120GB model scored 43411 IOPS, or 169MB/s. The one I used was 60GB. Half the size, half the throughput: 169/2 = 84MB/s.

But it didn't! Curiously enough, I'm seeing about half of *that*:

```
# dd if=/tmp/test16GB.bin of=/dev/gpt/ssd-1\:2 bs=4k seek=$RANDOM
17179869184 bytes transferred in 381.576081 secs (45023444 bytes/sec)
```

Explain that! Because I certainly can't =)
The closest I can think of is that it's twice as hard doing sequential writes instead of random writes, and that "seek=$RANDOM" only makes dd start at a random offset in the output and still write sequentially from there.

This just gets harder the more you try to understand it =)
Anyhow, I'll keep updating when I've had a chance to test the Vertex 2 120GB, to see if twice the size really equals twice the throughput.
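For reference, the conversion used above works out like this (the halving is an assumption about capacity scaling, not a measurement):

```python
iops_120gb = 43411                   # Tom's Hardware 4k write result, 120GB model
mib_s_120gb = iops_120gb * 4 / 1024  # 4 KiB operations expressed in MiB/s
print(round(mib_s_120gb))            # prints 170 (the "169MB/s" quoted above)

expected_60gb = mib_s_120gb / 2      # assume half the capacity, half the speed
print(round(expected_60gb))          # prints 85, versus the ~45MB/s measured
```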

/Sebulon


----------



## aragon (May 5, 2011)

Sebulon said:
			
		

> @aragon:
> 
> ```
> # sysctl dev.cpu
> ...



Looks good.  Only other suggestion that comes to mind is to change your partition alignments.  Looks like you're using a 32k alignment, but SSDs don't necessarily have a 32k erase boundary.  1 MiB would be a safer option.
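A quick check of what the two starting offsets line up with (erase-block sizes differ per drive model and are assumed values here):

```python
SECTOR = 512
current = 64 * SECTOR    # gpart -b 64   -> 32 KiB into the disk
safer = 2048 * SECTOR    # gpart -b 2048 -> 1 MiB into the disk

# Plausible NAND erase-block sizes to test against (assumptions).
for erase in (128 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024):
    print(f"{erase // 1024:4d} KiB erase block: "
          f"32k-aligned={current % erase == 0}, 1MiB-aligned={safer % erase == 0}")
```

A 1 MiB start is a multiple of every power-of-two erase size up to 1 MiB, which is why it is the safer default.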


----------



## AndyUKG (May 6, 2011)

Sebulon said:
			
		

> Then, if 4k random writes are exactly the same (or even better) as 4k sequential writes, then my drive should be able to push 84MB/s. From your link the 120GB model scored 43411 IOPS, or 169MB/s. The one I used was 60GB. Half the size, half the throughput - 169/2=84. 84MB/s
> 
> But it didn't! Curiously enough, I'm seeing about half of *that*:
> 
> ...



So I'm also wondering where you got the 35MB/sec figure from. I didn't see it, but I did see this:

OK, I just had a look at the spec sheets for the Vertex 2 and 3 drives. This seems to explain everything. The Vertex 2 drives have two different metrics quoted, "max write" and "max sustained write". Your drive has no figure stated for sustained write, but on the other drives it is half the max write, so we might assume your drive has a max sustained write of about 32.5MB/sec. The Vertex 3 drives don't quote sustained-write figures, so I'd assume they don't suffer from lower sustained write speeds. That sounds about the same as your 35MB/sec figure.

I was reading from this page:
http://www.ocztechnology.com/res/manuals/OCZ_Vertex_Product_sheet_1.pdf

cheers Andy.


----------



## AndyUKG (May 6, 2011)

AndyUKG said:
			
		

> I was reading from this page:
> http://www.ocztechnology.com/res/manuals/OCZ_Vertex_Product_sheet_1.pdf
> 
> cheers Andy.



Ah no, that isn't the correct product; yours is this one, isn't it?

http://www.ocztechnology.com/res/manuals/OCZ_Vertex2_Product_sheet_6.pdf

So that kills my last post, and only one question remains: where did you get the 35MB/sec rating from?

ta Andy.


----------



## Sebulon (May 6, 2011)

@AndyUKG:
Sorry, I forgot to include the source for that one, but it's in the product sheet for the Vertex 2.
At the bottom of page two, it says "*For AS-SSD Performance metrics, go* *here*".

@aragon:
Very good suggestion. I put it to the test:

```
# mdmfs -s 2G md0 /mnt/ram/
# cp /tmp/test2GB.bin /mnt/ram/
# gpart create -s gpt ada9
# gpart add -l log1 -t freebsd-ufs -b 2048 -s 16G ada9
# newfs /dev/gpt/log1
# mount /dev/gpt/log1 /mnt/log1
# dd if=/mnt/ram/test2GB.bin of=/mnt/log1/test2GB.bin bs=4k
2073034752 bytes transferred in 40.751293 secs (50870404 bytes/sec)
```
Compared to:

```
# gpart create -s gpt ada10
# gpart add -l log2 -t freebsd-zfs -b 64 -s 16G ada10
# newfs /dev/gpt/log2
# mount /dev/gpt/log2 /mnt/log2
# dd if=/mnt/ram/test2GB.bin of=/mnt/log2/test2GB.bin bs=4k
2073034752 bytes transferred in 40.612978 secs (51043653 bytes/sec)
```

So no real difference there. But I remember you using the word "safer" about 1MiB, so I'm going to go with that in the future as well.

/Sebulon


----------



## AndyUKG (May 6, 2011)

Sebulon said:
			
		

> @AndyUKG:
> Sorry, no I forgot to include source for that one. But itÂ´s in the product sheet for Vertex 2
> At the bottom of page two, it says "*For AS-SSD Performance metrics, go* *here*"



Kind of bizarre; their documents seem to contradict each other. In the main product sheet:

http://www.ocztechnology.com/res/manuals/OCZ_Vertex2_Product_sheet_6.pdf

The "max write" speeds have this note:

"Maximum Sequential Speeds are determined using ATT"

So it states that this figure is also sequential, which gives a speed of up to 250MB/sec, while the other doc states that sequential write is only 35MB/sec. There doesn't seem to be an additional performance-metrics document for the Vertex 3, so you are just left wondering...

Andy.


----------



## Sebulon (May 6, 2011)

@AndyUKG:
Been there, felt that =)

/Sebulon


----------



## Sebulon (May 9, 2011)

Update:

This thread really should have been called "SSDs: a crapshoot" =)

I've gotten two OCZSSD2-2VTXE120G drives, based on the thought that twice the size gives twice the performance. I was half right.

4k sustained write on random data locally:
OCZSSD2-2VTXE60G  = *33.5MB/s*
OCZSSD2-2VTXE120G = *83MB/s*

1m sustained write random data over NFS with mirrored log:
OCZSSD2-2VTXE60G  = *32.6MB/s*
OCZSSD2-2VTXE120G = *36MB/s*

The oddest thing is that gstat shows the logs are only around 50% busy. Why isn't ZFS using the other 50%?

I've tried with both ashift=9 and ashift=12 on the log vdev, without any difference whatsoever. So I'm going with the default ashift=9 now; less hassle that way.

I'm crashing my primary pool now for the 100th time to try again, but with striped logs instead. I have also ordered two Intel 320 120GB drives (SSDSA2CW120G310), since I found in the fine fine print that they are supposed to handle 130MB/s sustained 4k writes of random data. We'll see how that translates into NFS write performance in the end, since it sort of didn't with the OCZ drives.

/Sebulon


----------



## Sebulon (May 10, 2011)

OK, some serious performance testing with the OCZ Vertex 2 120GB

Locally UFS, 4k sustained write with random data
One transfer    - 61.3 MB/s - one drive
One transfer    - 120.6 MB/s - two gstriped

One SSD, UFS, write over NFS
One transfer    - 45.7 MB/s
Two simultaneous - 23.4 MB/s

The SSD gstripe over NFS
One transfer    - 61.8 MB/s
Two simultaneous - 32.8 MB/s

ZFS - one log
One transfer    - 38.2 MB/s ashift=12
(ashift=9 37.8 MB/s)
Two simultaneous - 23.7 MB/s ashift=12
(ashift=9 23.4 MB/s)

ZFS - two log mirror
One transfer    - 35.9 MB/s ashift=12
(ashift=9 35.9 MB/s)
Two simultaneous - 23.4 MB/s ashift=12
(ashift=9 23.4 MB/s)

ZFS - two log stripe
One transfer    - 53.9 MB/s ashift=12
(ashift=9 53.6 MB/s)
Two simultaneous - 37.4 MB/s ashift=12
(ashift=9 38.5 MB/s)

What's odd here is that one drive formatted with UFS and written to over NFS scores 45.7MB/s, but two striped only scored 61.8MB/s, which is exactly half of the performance I got from writing locally to the stripe. Am I imagining things? Is that just unrelated numerology?
I imagined that two striped UFS drives over NFS would have given 45.7 x 2 = 91.4MB/s, but no, that setup only scored 61.8MB/s...

That same client has previously pushed 100MB/s over NFS with the 3GB md-drive as ZIL, so the client wasn't a bottleneck.

I have at least proven that it doesn't make any real difference having ashift=9 or 12. Partitioning with either -b 64 or -b 2048 has been the only performance optimization so far, and it was crucial by the way: more than a 100% performance improvement compared to letting sysinstall partition and format. But except for that, this journey has been a complete mystery. The only thing left to test is the Intel drives, then I'm done.

/Sebulon


----------



## danbi (May 10, 2011)

Sebulon said:
			
		

> @Andy:
> If you manage to find a drive that actually handles 240MB/s 4k sequential write, let me know; I'll be the first to buy =)



Without asking for the price? 
There ARE drives that do way more than 240MB/s of 4k sequential, or even random, writes. But I believe they will be way out of your budget.

I am no expert on OCZ drives, but reading recent specs/reviews it seems these use compression to achieve most of their 'performance'. While this might be OK for storing DOS, err, Windows data, it does not help the SLOG (separate ZIL).
Therefore, don't test these SSDs with /dev/zero. Best to create a (large) RAM disk, copy data from /dev/random there, and use the 'random' file for your tests.

Anyway, I am missing the FreeBSD (and thus ZFS) version you are playing with. There have been significant performance improvements in recent versions. Also, if you play with the ZIL, using the experimental ZFS v28 may provide much better performance.

The ZIL will help with IOPS, not with throughput. After all, however fast your SSDs are, the data eventually has to be written to the 'slower' rotating disks.

The idea of a separate ZIL is also to have a device that is undisturbed by other tasks and can just write sequentially. As such, having the ZIL on a pair of normal disks may give you better results. I believe your magnetic disks are capable of at least 100MB/s sequential write.

These SSDs you have are good for L2ARC, because they are not that fast for writing anyway.

Try using a magnetic disk or two for the ZIL and SSDs for cache and see if this helps.

FreeBSD with ZFS can certainly push 100MB/s over NFS


----------



## Sebulon (May 10, 2011)

@Danbi:
Well, thank god, I was beginning to worry there =)
Yes, I know the difference between RAM SSDs and NAND SSDs; I have a big list of those manufacturers, but even if I wanted to buy, for example, a ZeusIOPS drive, there aren't any resellers around.

Yes, the testing has been performed exactly as you described, with randomly generated data stored on a RAM disk, and I can write to my pool at around 200MB/s, so that shouldn't be a bottleneck.
I'm sticking with FreeBSD 8 because of its stability.

But man, am I banging my head right now. Of course an ordinary drive is better at sequential writes! It's so obvious when you think about it. But why does it say everywhere that you should have an SSD for a ZIL?

The Evil Tuning Guide:
"...using an SSD as a separate ZIL log is a good thing"

Solaris Internals ZFS Best Practices Guide:
"Better performance might be possible by using dedicated nonvolatile log devices such as NVRAM, SSD drives"

Neelakanth Nadgir's blog - The ZFS Intent Log:
"...using nvram/solid state disks for the log would make it scream!"

Those were the first three hits when searching for "zfs zil" on Google =)
It kind of sends you over the bridge to fetch water, you know what I mean? There's a very big difference between buying two regular 2.5" drives for about $130 and buying two SSDs of the same size for about $560 and getting the same performance in the end. My wallet is feeling that difference right now, for example =) Luckily for me, I have the right to send them back if I want.
Maybe this should be more clearly explained in the first place you usually consult: the Handbook?

I've gotten my hands on a 2.5" 7200rpm SATA II drive that I will test out tonight. Fingers crossed!

/Sebulon


----------



## usdmatt (May 11, 2011)

I'm not sure how you are mounting your NFS share, but I've been looking for ways to get decent NFS sync write performance for a while. I have a Linux box serving some VMware images that I really, really would like to replace with FreeBSD+ZFS. Not only am I much more comfortable with FreeBSD, but I would get better snapshots, zfs send, zfs scrub, etc.

From what I understand the slow NFS write performance is due to the fact that every sync NFS request has to be flushed to disk before the client can proceed with the next request. The (relatively) slow access times of mechanical disks means you can't actually process that many sync NFS requests per second, dragging the throughput down.
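That mechanism can be put into a simple back-of-the-envelope model (illustrative latencies, not measurements): with sync writes, a single client's throughput is bounded by the write size divided by the time each flush takes to commit.

```python
def sync_throughput_mb(write_kib: float, flush_ms: float) -> float:
    """Upper bound on a single client's sync NFS write throughput, in MB/s."""
    return write_kib * 1024 / (flush_ms / 1000) / 1_000_000

# 32 KiB NFS writes, each waiting on a ~8 ms commit to spinning disks:
print(f"{sync_throughput_mb(32, 8):.1f}")   # prints "4.1" - the ~5MB/s ballpark
# The same writes committing to an SSD log in ~1 ms:
print(f"{sync_throughput_mb(32, 1):.1f}")   # prints "32.8" - the ~35MB/s ballpark
```

The 32 KiB request size and the latencies are assumptions chosen to show how commit latency, not disk bandwidth, sets the ceiling.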

NFS to a small test pool only gave me ~5MB/s, which matches what is seen in this thread:
http://lists.freebsd.org/pipermail/freebsd-fs/2009-September/006884.html
Obviously the local performance was way above this.

I was able to increase this to ~35MB/s by adding a single 60GB Vertex 2 SSD.

In order to replace my Linux box I ideally need to be seeing 80MB/s+, so I'd be really interested if anyone can pull it off without spending a fortune on PCIe SSDs. So far, I'm not aware of anyone who has managed more than about 60MB/s with standard hardware. Hacks that could jeopardize data integrity, like trying to mount async (I don't think that's even possible in VMware) or disabling the ZIL, are not really an option.

Interestingly, iXsystems makes some FreeBSD 8.2 NAS boxes with Fusion-io cards that can apparently max out 10Gb Ethernet, although I assume that's not with NFS. I'd be interested to see what their NFS performance is like though (not that I could afford them).


----------



## Sebulon (May 16, 2011)

Extended 4k sustained write tests

MO:

```
# mdmfs -s 4096m md0 /mnt/ram
# dd if=/dev/urandom of=/mnt/ram/test4GB.bin bs=1m count=4096
# dd if=/mnt/ram/test4GB.bin of=/dev/ada0(.nop,s1,s1d,p1,p1.nop) bs=4k
Or, if towards filesystem:
# dd if=/mnt/ram/test4GB.bin of=/mnt/ada0/test4GB.bin bs=4k
```


```
[B]OCZ Vertex 2 120GB[/B]
  Local writes:
  raw            56 MB/s
  fdisk          17 MB/s
  fdisk/label    17 MB/s
  gpart          55 MB/s
  gnop           55 MB/s
  gpart/gnop     54 MB/s

  raw [B]bs=128k[/B]    60 MB/s

  sysinstall ufs 56 MB/s
  gpart ufs      60 MB/s
  raw zfs        44 MB/s
  gpart zfs      43 MB/s
  gnop zfs       51 MB/s

  [U]Score as mirrored ZIL:[/U]
  raw            49 MB/s
  fdisk          56 MB/s
  gpart          56 MB/s
  gnop           56 MB/s
  gpart/gnop     53 MB/s
--------------------------------


--------------------------------
[B]Intel 320 120GB[/B]
  Local writes:
  raw            52 MB/s
  fdisk          52 MB/s
  fdisk/label    51 MB/s
  gpart          52 MB/s
  gnop           51 MB/s
  gpart/gnop     50 MB/s

  raw [B]bs=128k[/B]    128 MB/s

  sysinstall ufs 132 MB/s
  gpart ufs      131 MB/s
  raw zfs        70 MB/s
  gpart zfs      73 MB/s
  gnop zfs       72 MB/s

  [U]Score as mirrored ZIL:[/U]
  raw            52 MB/s
  fdisk          52 MB/s
  gpart          52 MB/s
  gnop           52 MB/s
  gpart/gnop     52 MB/s
--------------------------------


  References:
--------------------------------
[B]CompactFlash 16GB[/B]
  Local writes:
  raw            18 MB/s
  fdisk          17 MB/s
  gpart          18 MB/s

  raw [B]bs=128k[/B]    64 MB/s

  sysinstall ufs 46 MB/s
  gpart ufs      37 MB/s
--------------------------------


--------------------------------
[B]5.4k rpm 160GB[/B]
  Local writes:
  raw            32 MB/s
  fdisk          34 MB/s
--------------------------------


--------------------------------
[B]10k rpm 146GB[/B]
  Local writes:
  raw            59 MB/s
  fdisk          55 MB/s

  raw [B]bs=128k[/B]    101 MB/s

  sysinstall ufs 96 MB/s

  [U]Score as mirrored ZIL:[/U]
  raw            52 MB/s
  fdisk          52 MB/s
```

The NAND SSD ZIL pimple is popped!

These tests have been performed on my rig and also on an HP DL380 G5, producing the same results. You get the exact same performance out of a 10k rpm rotating disk as you get from an SSD of the same size. Intel says 130MB/s, and they are only half lying, since the drive does reach that, but only at 128k block size.

At 4k, the performance dropped down to the same level as both the OCZ and the 10k rpm drive. So far, the performance you get from doing 4k writes to a raw device or partition is the same performance you get over NFS in the end. Therefore, my eyes have turned towards the Seagate SAVVIO 15k.2.

Looking at the 5.4k rpm and 10k rpm drives, the performance has been linear to the amount of rpmÂ´s, so what IÂ´m hoping, is that a 15k rpm drive will have the combined performance off of that, like "5.4k rpm + 10k rpm = 15k rpm", which in performance would be "32MB/s + 59MB/s = 91MB/s" at 4k block size. That would make them perfect as SLOG!. IÂ´ll update once IÂ´ve had a chance to test it out.
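The hoped-for scaling above can be sketched as a quick shell calculation. This is pure speculation from two data points, using only the 4k figures measured earlier in the thread (5.4k rpm = 32MB/s, 10k rpm = 59MB/s); a straight line through them lands close to the "32 + 59 = 91" guess.

```
#!/bin/sh
# Back-of-the-envelope linear extrapolation of 4k write speed vs spindle speed,
# from the two measured points above. A speculation sketch, not a measurement.
rate_low=32;  rpm_low=5400
rate_high=59; rpm_high=10000
# slope scaled by 100000 to keep the arithmetic in integers
slope=$(( (rate_high - rate_low) * 100000 / (rpm_high - rpm_low) ))
est_15k=$(( rate_high + slope * (15000 - rpm_high) / 100000 ))
echo "linear estimate for 15k rpm: ${est_15k} MB/s"
```

As the later posts in this thread show, real drives did not follow this line, so treat it as a guess only.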

/Sebulon


----------



## Sebulon (May 17, 2011)

Seen around the net about Seagate's Pulsar:

http://www.foodreview.imix.co.za/node/91093
"Seagate claims the solid-state drives can perform up to 30,000 random read and up to 25,000 random write IOPS, or 240 MBps sequential reads and 210 MBps sequential writes with a 4K block size."

http://www.theregister.co.uk/2011/03/15/seagate_enterprise_drive_refresh/
"The Pulsar XT.2 – with 100, 200 and 400GB capacity points – has the same endurance and reliability, and offers 48,000 sustained random read IOPS (4K blocks), 22,000 write IOPS, 360MB/sec bandwidth for sequential reads and 300MB/sec write bandwidth: *good figures.*"

I couldn't agree more =)
If anyone has a chance to test them out, please do!


Also read here:
http://www.natecarlson.com/2010/05/07/review-supermicros-sc847a-4u-chassis-with-36-drive-bays/
Write-up about a rig that can saturate two trunked 1GigE links using two Intel X25-Es as mirrored ZIL.

It's so hard to really understand the differences between SSDs, because when you're looking at Intel's datasheets, the 320s I tested actually looked better in comparison. They really should start publishing "sequential *4k* read/writes" to make this easier to spot. Right now, they only post "sequential read/write" without mentioning block size. I know from experience that the Intel 320 120GB scored 130MB/s only at 128k block size, and not even half of that at 4k. Being clearer on that point would save you the trouble of having to test everything yourself before you could be completely sure.

Intel 320 120GB posts 130MB/s sustained write
Intel 320 160GB posts 165MB/s sustained write
Intel X25-E 32GB posts 170MB/s sustained write
Intel 320 300GB posts 205MB/s sustained write

As tested before, the 320 only scored *52MB/s* sustained write at 4k
And judging from the earlier linked write-up, the X25-E really could have 170MB/s sustained write at 4k, but what would be the cost of testing all of this out?

So to get about the same performance:
2x Intel 320 160GB costs about 790$
2x Intel X25-E 32GB costs about 1080$
2x Intel 320 300GB costs about 1400$

Yeah, like that's gonna happen =)

2x Seagate SAVVIO 15k.2 146GB costs about 560$ in comparison.
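The price-to-throughput trade-off above can be made explicit with a small calculation. Note the speeds used here are the *datasheet* sustained-write numbers quoted above, not the measured 4k figures, and the arithmetic is done in integer cents as a convenience:

```
#!/bin/sh
# Cost per MB/s of datasheet sustained write, per mirrored pair,
# using the prices and speeds quoted above (integer cents to avoid floats).
for entry in "Intel-320-160GB:790:165" "Intel-X25-E-32GB:1080:170" "Intel-320-300GB:1400:205"; do
    name=${entry%%:*}
    rest=${entry#*:}
    price=${rest%%:*}          # USD for the pair
    mbs=${rest#*:}             # datasheet MB/s sustained write
    cost=$(( price * 100 / mbs ))
    echo "$name: ${cost} cents per MB/s"
done
```

By this datasheet metric the 320 160GB pair looks cheapest per MB/s; the catch, as tested above, is that the datasheet numbers say nothing about 4k performance.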

/Sebulon


----------



## danbi (May 18, 2011)

You will never, ever need 146GB for a ZIL. Sharing that (rotating) disk with another task will make the ZIL perform very poorly.


----------



## Sebulon (May 18, 2011)

@danbi:
You wouldn't need 32, 64, 120, or even 300 either, but I just couldn't find any smaller than that =) It's all about raw performance. Either way, 15k rpm drives are faster than regular 7.2k rpm, or perhaps 5.4k rpm if you're building your pool with 2.5" drives, for example. And for random writes, IOPS is limited by the number of vdevs in your pool. So if you're building a pool with raidz(2,3), you're probably going to get more random write IOPS from one 15k rpm drive anyway.
And I'll only be using them as ZIL, doing sequential writes. Think transferring large ISOs, GIS data, VMs and so on.

I am aware that you'll only use as much of it as it takes to saturate your network IO, in my case 100MB/s, or about 1GB. However, I have measured the difference between having a 1GB md-drive as ZIL and having 3GB as ZIL, and the larger one scores higher in the end, so it's at least good to have a little overhead.
The Oracle systems have as much as 16GB of ZIL. So partitioning the SAVVIOs for the first 8GB would ensure both proper size and best performance, using the outermost tracks of the disk.
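A sketch of that partitioning, assuming the SAVVIOs show up as da4 and da5 (the device names and the pool name "tank" are placeholders; adjust to your system). An 8GB slice at the start of each disk sits on the outermost, fastest tracks:

```
# Hypothetical device names (da4/da5) and pool name (tank).
gpart create -s gpt da4
gpart add -t freebsd-zfs -b 2048 -s 16777216 -l slog0 da4   # 16777216 sectors = 8GB
gpart create -s gpt da5
gpart add -t freebsd-zfs -b 2048 -s 16777216 -l slog1 da5
zpool add tank log mirror gpt/slog0 gpt/slog1
```

The -b 2048 offset matches the alignment used elsewhere in this thread; the size is given in 512-byte sectors for portability across gpart versions.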

/Sebulon


----------



## Sebulon (May 27, 2011)

Finally gotten my hands on two SAVVIO 15k.2 and had a chance to test them out. These tests have been performed on an HP DL380 G5. The pool is made up of three 10k rpm HP drives in raidz, with two SAVVIOs as mirrored ZIL.

First checking the network:

```
iPerf:
------------------------------------------------------------
Client connecting to 10.20.0.99, TCP port 5001
TCP window size: 32.5 KByte (default)
------------------------------------------------------------
[  3] local 10.20.0.56 port 46240 connected with 10.20.0.99 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   937 Mbits/sec
```

Local writes on server:

```
# mdmfs -s 2304m md0 /mnt/ram
# dd if=/dev/urandom of=/mnt/ram/test2GB.bin bs=1m count=2048
# dd if=/mnt/ram/test2GB.bin of=/dev/ada0(s1,s1d) bs=4k
```

NFS writes from a client:

```
# mdmfs -s 2304m md0 /mnt/ram
# dd if=/dev/urandom of=/mnt/ram/test2GB.bin bs=1m count=2048
# mount 10.20.0.99:/export/tank /mnt/tank/perftest
# dd if=/mnt/ram/test2GB.bin of=/mnt/tank/perftest/test2GB.bin bs=1m
# dd if=/mnt/ram/test2GB.bin of=/mnt/tank/perftest/test2GB-2.bin bs=1m
# dd if=/mnt/ram/test2GB.bin of=/mnt/tank/perftest/test2GB-3.bin bs=1m
# umount /mnt/tank/perftest
```

Finally, the results:

```
--------------------------------
[B]SAVVIO 15k.2 146GB[/B]
  Local writes:
  raw            60 MB/s
  fdisk          56 MB/s

  raw bs=128k    165 MB/s

  [U]Score as mirrored ZIL:[/U]
  raw            50 MB/s
  fdisk          50 MB/s
--------------------------------
```

As you can see, the results are amazingly bad, unfortunately. They give the exact same performance as just about any other drive I've tested so far. Therefore, since I've apparently started chasing windmills, I've also ordered the supposed knight in shining armor, the Intel X25-E, which is about the only drive I've been able to find any kind of published performance benchmarking for elsewhere.

The best source of info I've found on the matter is here:
http://constantin.glez.de/blog/2011/02/frequently-asked-questions-about-flash-memory-ssds-and-zfs

Where the writer states:
"SLC flash is faster for writes and more reliable, therefore it's the best choice for a ZIL"

And then:
"MLC flash will give you more capacity for your money, but is less reliable and less fast than SLC. Therefore, MLC makes a good read accelerator when you're budget-constrained."

But at the same time, these statements don't seem to be based on factual experience:
"I don't have any empirical data, but from the way SLC and MLC work, *it is to be expected* that SLC drives are faster and more reliable than MLC drives."

Well, I had great *expectations* when I woke up this morning too, but look how that turned out=)

I'm also gonna give the SAVVIOs a fair chance in some other machines as well. I bought SAS to SATA converters for testing, but the drives didn't even spin up, much less show up in the OS. So I've ordered a pair of these instead, hoping they'll work better.

/Sebulon


----------



## usdmatt (May 27, 2011)

Hi Sebulon,

Thank you for providing this information. It's great to know how these drives perform locally *and* over NFS. Pretty much all our storage needs are NAS/SAN based so network storage performance is my main concern. I can't afford to buy drives just for testing and my company wouldn't be too happy spending a fortune on new hardware just to find out it doesn't perform any better than our existing systems, even if it does make our data more secure and easier to manage.

Have you tested any of these drives as striped ZIL?
Obviously in a live system you'd want striped mirrors but if the performance does scale, it wouldn't be out of the way to have 4 or 6 drives as ZIL, enough to max out 1Gb ethernet at least.

Looking forward to the X25-E results. Hopefully we'll get some good news for once...


----------



## Sebulon (May 27, 2011)

@usdmatt:
No problem man, happy to help. But don't think I'm sponsored or that it's through my job or such. I buy these drives out of my own budget, and then I have the right to send them back within 14 days by law. I think of it as paying to test them out =)


```
Over NFS single transfer:
1x md-drive     128MB             = 54MB/s
1x md-drive     256MB             = 57MB/s
1x md-drive     512MB             = 60MB/s
1x md-drive     768MB             = 67MB/s
1x md-drive     1GB               = 77-80MB/s
1x md-drive     2,4GB             = 77-80MB/s

Over NFS double transfer total:
1x md-drive     128MB             = 64MB/s
1x md-drive     256MB             = 66MB/s
1x md-drive     512MB             = 70MB/s
1x md-drive     768MB             = 78MB/s
1x md-drive     1GB               = 90MB/s
1x md-drive     2,4GB             = 90MB/s


---------------------------------------------------
Over NFS single transfer:
2x OCZ Vertex 2	120GB stripe log  = 69MB/s
2x Intel 320    120GB stripe log  = 69MB/s
2x HP 10k SAS   146GB stripe log  = 69MB/s
2x SAVVIO 15k.2	146GB stripe log  = 66MB/s

2x HP 10k SAS   146GB mirror log  = 58MB/s
2x SAVVIO 15k.2 146GB mirror log  = 56MB/s
```

Well, what about that! The HP drives actually outrun the SAVVIOs. God damn it.
I also redid the tests with mirrored ZIL and got better results than what I posted before. The reason was that I had a 2.2GB md-drive configured on the server; when I deleted it and gave that RAM back to the OS, the server performed better.

Also, I tested what difference the size of the ZIL actually makes. It's exactly as explained: the ZIL only has to be as large as what your bandwidth can deliver. If you have 1GigE, you only need 1GB of ZIL. If you have 10GigE, you're going to need 10GB of ZIL. Good to know.
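That rule of thumb can be written down as a tiny calculation. The only assumption beyond the post is how many seconds of incoming writes the slog may need to buffer between transaction group commits; ~10 s is a deliberately generous guess:

```
#!/bin/sh
# slog sizing rule of thumb: link bandwidth x seconds of buffered writes.
link_mbit=1000        # 1GigE
buffer_secs=10        # assumed worst-case txg commit interval (a guess)
need_mb=$(( link_mbit / 8 * buffer_secs ))
echo "slog should hold at least ${need_mb} MB"
```

For 1GigE this comes out around 1.25GB, which matches the ~1GB-per-1GigE observation above; for 10GigE, scale link_mbit up accordingly.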

/Sebulon


----------



## danbi (May 28, 2011)

usdmatt said:
			
		

> I can't afford to buy drives just for testing and my company wouldn't be too happy spending a fortune on new hardware just to find out it doesn't perform any better than our existing systems, even if it does make our data more secure and easier to manage.



That would be a non-commercial company then 

wasted time = loss of money
data loss = loss of lots of money
secure data = less data loss
easier to manage data = less time spent

and so on 

Anyway, I just tried to sort of repeat Sebulon's tests, on similar hardware.

system 1: Xeon X3450, 8GB RAM, LSI 1068e, 2xST9146852SS (Savvio 15k), 2xMBF2600RC (VERY BUSY database server)

This is copying to ZFS filesystem.


```
# mdmfs -s 2304m md1 /mnt
# dd if=/dev/urandom of=/mnt/test2GB.bin bs=1m count=2048
2048+0 records in
2048+0 records out
2147483648 bytes transferred in 29.238521 secs (73447068 bytes/sec)
# dd if=/mnt/test2GB.bin of=/fast/junk bs=4k
524288+0 records in
524288+0 records out
2147483648 bytes transferred in 24.428726 secs (87908131 bytes/sec)
```

system 2: 2x Xeon E5620, 48GB RAM, LSI 2008, 2xMBF2600RC (idle)

This is copying to the raw GPT partition.


```
# mdmfs -s 2304m md1 /mnt
# dd if=/dev/urandom of=/mnt/test2GB.bin bs=1m count=2048
2048+0 records in
2048+0 records out
2147483648 bytes transferred in 32.296183 secs (66493419 bytes/sec)
# dd if=/mnt/test2GB.bin of=/dev/gpt/data0 bs=4k
524288+0 records in
524288+0 records out
2147483648 bytes transferred in 3153.543603 secs (680975 bytes/sec)
```

Observations: writing to the raw devices for some reason has always been slow. Putting those under management of ZFS made them run much faster. In fact, using UFS made them run faster. Go figure. 

On system 2, the disk saturated at 166 IOPS (about 680KB/s), weird! With 1m block size, it saturated at 145 IOPS (about 19 MB/s).

A non-raw example, on system 2:


```
# newfs /dev/gpt/data0
# mount /dev/gpt/data0 /media
# dd if=/mnt/test2GB.bin of=/media/junk bs=4k
524288+0 records in
524288+0 records out
2147483648 bytes transferred in 14.184637 secs (151395037 bytes/sec)
```

The drive was doing over 1150 IOPS and 150MB/s. So there might be some raw device IO weirdness going on here.

What is your IOPS value (observed with gstat)?


----------



## Sebulon (May 28, 2011)

To all:
I'm just a dude, you know. And not the most 1337 dude in the world either =) Chances are that I've been doing something wrong all along, but so far no one on this forum seems to argue with my methods. But don't just take my word as absolute; go research all this and find out for yourselves! Because this has felt like a complete crapshoot from the start.

@danbi
Yes, way to go man, starting a knowledge revolution over here=) Awesome to have others testing this as well and posting their results!
I'm testing default installs of amd64 FreeBSD 8.2-RELEASE, with the same methods, similar networks, clients and the same SLOGs, but on as many different servers and hardware as possible, to see if anything changes depending on what hardware/environment you have. So far, the results have been pretty much the same, and not a single drive I've tested so far seems to hold its worth as ZIL.

I've also noticed the same behaviour as you: the difference between writing to a filesystem and writing to a device. The only explanation I can think of is:


```
dd if=ram of=/dev/something bs=4k (little MB/s)
dd if=ram of=/dev/something bs=128k (lots of MB/s)

dd if=ram of=/mount/file bs=4k (still lots of MB/s)
dd if=ram of=/mount/file bs=128k (same same)
```

That should mean that writing to a file in a filesystem transforms the write block size into whatever the filesystem likes best, which is about as big as possible, no matter what you specify as bs to the application. ZFS, for example, has 128k as default, simply because that's the block size hard drives write best.
When you're writing directly to a device, however, there is no filesystem that's allowed to have a say in the matter, so dd can write exactly what you specify with bs, without any conversion, showing the drive's real speed at 4k. And that's the performance you get over NFS when using them as SLOG devices.

/Sebulon


----------



## danbi (May 28, 2011)

By the way, I just discovered that my SAS drives had write cache disabled. Enabling it on the Toshiba drives:


```
dd if=/dev/zero of=/dev/gpt/data0 bs=4k count=1m
1048576+0 records in
1048576+0 records out
4294967296 bytes transferred in 125.264367 secs (34287223 bytes/sec)
```
So, much different than before.

Check your SAS drives:

`camcontrol modepage da0 -m 0x08`

look for WCE, if this is 0 your write cache is disabled; to change, use

`camcontrol modepage da0 -m 0x08 -e`

set 


```
WCE:1
```

You may discover what your Savvio can do for you
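For scripting that check, the WCE bit can be pulled out of camcontrol's output with awk. Below it is fed a canned sample of typical modepage output so the snippet runs without a SAS drive; on a live system you would pipe `camcontrol modepage daN -m 0x08` into the same awk:

```
#!/bin/sh
# Extract the WCE bit from camcontrol modepage output.
# 'sample' stands in for the live command output, so this runs anywhere.
sample='IC:  0
ABPF:  0
WCE:  0
RCD:  0'
wce=$(printf '%s\n' "$sample" | awk -F':[[:space:]]*' '$1 == "WCE" { print $2 }')
if [ "$wce" = "1" ]; then
    echo "write cache enabled"
else
    echo "write cache disabled"
fi
```

Wrapped in a loop over your da devices, this makes it easy to spot a drive that shipped with the cache off.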


----------



## Sebulon (Jun 20, 2011)

@danbi
Tried the camcontrol commands you specified. None of them have worked so far, on any system, unfortunately. But I know that write cache is on by default in FreeBSD, and that is true with AHCI as well. If I booted with the old ata driver and had:

```
hw.ata.wc=0
```
in loader.conf, then the local write speed went down to about 8MB/s on the X25-E, proving that the cache had been active before as well.

---------------------------------------------

OK, before you read further I want you all to be comfortably seated, because the results were shocking... I take no responsibility for sudden faints, broken arms, bones and so on.

I've... ahh... gotten the X25-E now, and had a chance to test it. Also, at work, I took the opportunity to pull a STEC Zeus IOPS SSD 16GB out of a shut-down Oracle system, to test it the same way as the rest.


```
[B]Intel X25-E 32GB[/B]
  Local writes:
  raw            77 MB/s

  raw bs=128k    197 MB/s

  [U]Score as ZIL:[/U]
  raw            60 MB/s
```


```
[B]Zeus IOPS 16GB[/B]
  Local writes:
  raw            64 MB/s

  raw bs=128k    133 MB/s

  [U]Score as ZIL:[/U]
  raw            55 MB/s
```

Yeah, that's right. The X25-E outperformed the almighty Zeus. Also, I have compiled my numbers from all the previous tests, to get a better overview. These tests have been made on as many different machines as possible and the results have been about the same.

*2011-11-01: The highscore has moved to the top of the first post. It's a more logical place for it: faster to find for those reading for the first time, and also easier for me to update as time passes.*

The results were definitely not what I expected. Not one device is able to shuffle NFS at 100MB/s, and testing the Zeus was probably the biggest anticlimax ever. But no matter how you look at it, we can see that Intel's X25-E is the winner of every test.

So if speed is your top priority, my advice would be to buy X25-Es until you hit your target. ZFS v28 is in STABLE now, so at least you won't have to mirror the logs any more. Just don't count on two logs giving double performance; at least with ZFS v15, I would say I gained about 20% with striped logs vs mirrored.

As a last test, I am going to upgrade to V28 and test one more time with the X25-E to report what difference that might give.

/Sebulon


----------



## danbi (Jun 21, 2011)

Sebulon said:
			
		

> @danbi
> Tried to use the camcontrols you specified. None of them have worked so far, on any system unfortunately.



There is difference in SAS and SATA write cache and how it is enabled in FreeBSD. You have enabled SATA write cache, but that does not affect SAS drives at all.

What does

`# camcontrol modepage da0 -m 0x08`

produce?


----------



## Sebulon (Jun 21, 2011)

@danbi
OK, didn't know that. Good to know, but it does little for me:

```
[root@tank ~]# camcontrol modepage da0 -m 0x08
camcontrol: error sending mode sense command
```
This is probably because of the HP Smart Array controller. It has no JBOD mode; you have to create a RAID0 for each drive.

/Sebulon


----------



## Sebulon (Jun 29, 2011)

Hi,

Upgraded to 8-STABLE ZFS V28

Still the same performance with 1x Intel X25-E 32GB: about 60-70MB/s at best. I'm hoping someone will prove me wrong, but it seems that is as good as it gets. Striping several has a proven positive effect, but not linearly multiplied by the number of devices, so I can't say for sure how many you'd need to hit 100MB/s.

Perhaps someone here is feeling daring enough to pick up where I leave off?

The truth is out there.

/Sebulon


----------



## gyrex (Aug 10, 2011)

Apologies for hijacking this thread but I was trying to figure out whether the Supermicro CSE-M35T-1 is SATA II capable. This was about the only result that came up!

Sebulon, do you know if the CSE-M35T-1 is SATA II capable? Could this be the reason why your performance is degrading?

There's no mention on Supermicro's site (http://www.supermicro.com/products/accessories/mobilerack/CSE-M35T-1.cfm) that indicates that this drive caddy is SATA II capable - it only says that it's a SATA (assuming SATA I) drive caddy/backplane.

I have 2 of these and was just wondering if this backplane will limit my performance.

Cheers,

John


----------



## Sebulon (Aug 10, 2011)

Hi gyrex and welcome!

I have two of those. Each port is SATA-150, and I have never connected a disk that could have exceeded that in IO. I mean, I only have regular hard drives connected to the caddies, and they only generate about 80-100MB/s tops. I have however tested running:

```
# dd if=/dev/ada0-9 of=/dev/zero bs=1m
```
to test the total backplane and controller bandwidth, as the caddies are evenly connected across three PCI-X SATA-300 controllers. It generated IO of around 1GB/s. So cool =)

The SSDs I've tested have been connected directly to the built-in ICH9 SATA-300 controller on the motherboard, so they don't get "disturbed" by IO coming from the other drives. The Intel X25-E could generate as much as 197MB/s, but only at bs=1m. At bs=4k it shuffled only 77MB/s.

/Sebulon


----------



## danbi (Aug 11, 2011)

Isn't this a passive drive enclosure? As far as I can see, it doesn't even have a backplane or port multiplier.


----------



## Sebulon (Aug 11, 2011)

@danbi

http://imageshack.us/photo/my-images/829/skrmavbild20110811kl103.png/

It's not passive.

/Sebulon


----------



## Sebulon (Oct 17, 2011)

Have an update!

I have gotten my hands on an OCZ Vertex 3 240GB. I took my time testing it in the following rig:

```
[B][U]HW[/U][/B]
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Core i5 650 3,2GHz
4x  2GB 1333MHZ DDR3 ECC UDIMM
10x SAMSUNG HD204UI (in a raidz2 zpool)
1x  OCZ Vertex 3 240GB

[B][U]SW[/U][/B]
[CMD="#"]uname -a[/CMD]
FreeBSD server 8.2-STABLE FreeBSD 8.2-STABLE #0: Mon Oct 10 09:12:25 UTC 2011     root@server:/usr/obj/usr/src/sys/GENERIC  amd64
[CMD="#"]zpool get version pool1[/CMD]
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default
```


```
[CMD="#"]iperf -c server[/CMD]
------------------------------------------------------------
Client connecting to server, TCP port 5001
TCP window size: 32.5 KByte (default)
------------------------------------------------------------
[  3] local client port 45921 connected with server port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.08 GBytes   927 Mbits/sec
```


```
[B][U]LOCAL WRITES[/U][/B]
[CMD="#"]gpart create -s gpt da5[/CMD]
da5 created
[CMD="#"]gpart add -t freebsd-zfs -b 2048 -l log1 da5[/CMD]
da5p1 added
[CMD="#"]gpart show da5[/CMD] 
=>       34  468862061  da5  GPT  (223G)
         34       2014       - free -  (1M)
       2048  468860047    1  freebsd-zfs  (223G)
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/dev/gpt/log1 bs=4k[/CMD]
524288+0 records in
524288+0 records out
2147483648 bytes transferred in 34.741185 secs (61813771 bytes/sec)
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/dev/gpt/log1 bs=128k[/CMD]
16384+0 records in
16384+0 records out
2147483648 bytes transferred in 7.921530 secs (271094554 bytes/sec)
```


```
[B][U]OVER NFS[/U][/B]
[B]with ssd log:[/B]
async)  2147483648 bytes transferred in 26.854198 secs (79968266 bytes/sec)
sync)   2147483648 bytes transferred in 30.528600 secs (70343339 bytes/sec)
[B]with md log:[/B]
async)  2147483648 bytes transferred in 38.788051 secs (55364567 bytes/sec)
sync)   2147483648 bytes transferred in 121.933071 secs (17611987 bytes/sec)
[B]without log:[/B]
async)  2147483648 bytes transferred in 38.690945 secs (55503520 bytes/sec)
sync)   2147483648 bytes transferred in 136.648112 secs (15715429 bytes/sec)
```

Tests over NFS have been made like:

```
[B]async)[/B]
[CMD="#"]mount -o async server:/export/perftest /mnt/tank/perftest[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m[/CMD]
[CMD="#"]umount /mnt/tank/perftest[/CMD]
[B]sync)[/B]
[CMD="#"]mount server:/export/perftest /mnt/tank/perftest[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m[/CMD]
[CMD="#"]umount /mnt/tank/perftest[/CMD]
```

The ZIL tests were positive! It is the fastest disk I've tested so far, and it bested even the X25-E as ZIL. The "Highscore" a couple of posts above has also been updated.

One odd reflection from these tests: when I added a 1GB ram-md disk as a "best possible" ZIL device, ZFS didn't even use it?! I mean, I added md0 as ZIL in the pool, started the synced transfers from the test client, watched gstat on the server during this time, and the md0 drive was never written to. I then removed it from the pool and re-added the Vertex as ZIL instead, and ZFS instantly started using the ZIL as it normally does. I tried restarting the server, destroying and recreating a bigger md-device, partitioning it, and even destroying the Vertex's partition and label and putting that same gpart label on md0p1 (gpt/log1) instead. Nothing worked; the only disk it wrote to as ZIL was the Vertex. It has worked in earlier tests, though. Very odd.

/Sebulon


----------



## olav (Oct 18, 2011)

Impressive results!
Is the OCZ Vertex 3 240GB safe to use as ZIL?


----------



## Sebulon (Oct 19, 2011)

No capacitors on this model. I'm just going to be using it as L2ARC, so I don't mind. I have however found information that the Vertex 3 *Pro* has "Power loss data protection":


> http://images.anandtech.com/doci/4100/VERTEX3PRO_specs.jpg


Also found this:
http://www.legitreviews.com/article/1547/2/


> On the bottom left is the large Cap-XX HZ202 super-capacitor that ensures that all writes are completed in the event of power interruption.



So Vertex 3 Pro is what you want.

/Sebulon


----------



## danbi (Oct 19, 2011)

It is faster than the X25-E because that Intel drive is really old. But the X25-E also uses SLC flash, which means:

- less risk of device/data failure at power outage
- much, much larger lifespan

If you are using it in enterprise environment (X25-E is clearly an enterprise drive), you must care about both features. If not.. you know, you can build an indefinitely fast system, if data integrity is not a concern.


----------



## Sebulon (Oct 19, 2011)

That is something that only time could tell, honestly. And if you look at the numbers:

*OCZ Vertex 3:*
http://www.ocztechnology.com/ocz-vertex-3-sata-iii-2-5-ssd.html


> MTBF: 2 million hours



*Intel X25-E:*
http://www.intel.com/design/flash/nand/extreme/index.htm


> Life expectancy 	2 Million Hours Mean Time Before Failure (MTBF)



Which one of us is gonna be around to judge?
In other words; hakuna matata=)

/Sebulon


----------



## peetaur (Dec 5, 2011)

How does your spinning disk pool do when reading and writing to the same zpool?

eg. 


```
#clear cache (does this work? works on Linux for ext3/4)
zfs umount -a
zfs mount -a
#or maybe:
zpool export pool
zpool import pool
#or pick a file that nobody read for days

dd if=/tank/openSUSE-11.4-DVD-x86_64.iso of=/tank/testfile bs=128k
35208+0 records in
35208+0 records out
4614782976 bytes transferred in 20.468894 secs (225453460 bytes/sec)

[edit: since FreeBSD's dd has no conv=fdatasync I didn't know what to do above for the best results, but now I do, and here it is]

gdd if=/tank/openSUSE-11.4-DVD-x86_64.iso of=/tank/testfile bs=128k conv=fdatasync
35208+0 records in
35208+0 records out
4614782976 bytes (4.6 GB) copied, 26.8473 s, 172 MB/s
```


My performance was surprisingly lame (16 disks doing read at 600MB/s, and write+read together at 150 or so when combined [slower than a consumer fake raid 4 disk stripe]) until I set up the /boot/loader.conf to change the zfs tunables. Now as you can see, it will read and write at around 450 (faster than some other untuned 24 disk SAS system we have). I used this page as my template:
http://hardforum.com/archive/index.php/t-1551326.html


----------



## danbi (Dec 6, 2011)

Perhaps the only useful settings in that thread are


```
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"
```
This is because SATA disks don't perform well under concurrent load. ZFS has defaults that are probably designed for SAS drives, although the current values were updated to


```
vfs.zfs.vdev.min_pending: 4
vfs.zfs.vdev.max_pending: 10
```

What is your current ZFS tuning?


----------



## peetaur (Dec 6, 2011)

My post above (#50) is confusing... I remember writing that, but not in this thread. It is out of context... nothing to do with ZIL from what I can see. I don't even remember going to this thread yesterday. So I'll reply to danbi in a visitor message. (And FYI 1200 MB/s is only when there is cache so it would be misleading to read it like I worded it).


----------



## peetaur (Dec 6, 2011)

Is your 70 MB/s number over a plain ol' 1Gbps link? I've never been able to get FreeBSD to report more than 77 MB/s (using ssh, scp, rsync, etc.) over 1Gbps. Linux commands sometimes report 110. iperf always shows the same numbers for FreeBSD and Linux though, so I think it is just different calculations (compression? overhead included? Linux reporting MB and FreeBSD reporting MiB?) rather than different tuning or performance capability.

So I suggest you should repeat some of your tests with 10Gbps... or a local synchronous NFS mount. If NFS refuses to work, use an ssh tunnel to localhost to trick it.

Also, I suggest you test a ramdisk ZIL just to prove it is not a stupid software bug, and to have a practical test of what the real limit of your network might be. My ramdisk ZIL used for an ESXi virtual disk storage host makes it write at only 80 MB/s. That makes me think it is a software bug. But with an SSD, it goes 5-9MB/s unlike yours, so my weird virtual disk case is likely the worst case... maybe your ramdisk would saturate your network and be limited only by that.

BTW what I wanted to do and didn't have time to was:

Share a ramdisk on a few hosts as an iSCSI target.

Connect them together with a dedicated network (we have so many servers with dual/quad network onboard and we only use one port... could use the 2nd for this)

Mirror the local ramdisk and the iSCSI ramdisks together into my ZIL device.

Run off of that and hope at least one of the systems is alive at all times (so the ZIL is not lost, corrupting whatever changes were there).

Based on my lame 80 MB/s ramdisk number, I was discouraged from investing any time in that.
Maybe you have time to try that if you are interested.

I was also thinking of just hooking the machine up to the UPS over the serial port, setting sync=disabled, and when the UPS reports that power went out, have a script set sync=standard again... but there are other ways to fail other than power outage. I don't know if this is a good idea (or if the UPS can do this).
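A minimal sketch of that idea, assuming apcupsd is what monitors the UPS (apcupsd runs hook scripts such as /etc/apcupsd/onbattery and offbattery on power events; the pool name "tank" and the use of apcupsd are both assumptions). As the post itself warns, this only covers power loss, not crashes or panics, so it trades real durability for speed:

```
# Hypothetical apcupsd(8) hook scripts; pool name "tank" is a placeholder.
# apcupsd executes these when the UPS state changes.

# --- /etc/apcupsd/onbattery: mains power lost, restore safe sync writes ---
zfs set sync=standard tank

# --- /etc/apcupsd/offbattery: mains power back, favor speed again ---
zfs set sync=disabled tank
```

Note that sync= requires ZFS v28, and any writes accepted while sync=disabled are simply lost on a crash, UPS or not.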

But my dream is for them to come up with something like this cheap consumer quality Gigabyte iRAM thing but have it copy the RAM to flash memory when the power goes out, and not be so expensive. NetApp has such a thing, but for like $5,000. I think this might be one.


----------



## Sebulon (Dec 7, 2011)

@peetaur
LOL Too much blood in your caffeine system

My 70MB/s is indeed over a plain ol' 1Gbps link. I have tried setting up server and client with just one switch in between, and also a wire going directly from server to client. Both server and client were FreeBSD, and with *async* NFS it actually did perform around 100-110MB/s.

About that ramdisk: have you checked if it has write cache enabled or disabled? At least that SSD performing at 5-9MB/s sounds exactly like how the X25-E performed for me when write cache was disabled.

/Sebulon


----------



## peetaur (Dec 7, 2011)

The SSD goes way faster than 5-9MB/s when I write directly to it... it goes over 220 with a simple async dd test (I forget what the bs=4k test says, but it is way higher than 10). I said it only goes that slow when I do this:

on ESXi, mount the zfs share over NFS, create a virtual machine with a virtual disk on that NFS share, install a file system on it (another layer of caching and flushing), and then do writes with dd, copy, scp, etc. to the disk in the running virtual machine's OS.

Another solution I thought of is to use iSCSI to share zvols, but zvols seem dangerously unstable and inefficient. At some point I will test a file on the zfs fs as an iSCSI target instead of a zvol. I tested a UFS zvol the same way with a virtual disk, and it went 60 MB/s, which is 10x as fast, but when doing so, it would put all of my disks in the pool at 100% load.

1 Gbps link, sync NFS client


```
# dd if=/dev/zero of=/tmp/testfile bs=128k count=6000
6000+0 records in
6000+0 records out
786432000 bytes (786 MB) copied, 11.8221 s, 66.5 MB/s
```


```
dT: 5.021s  w: 5.000s  filter: gpt/root|label/.ank|gpt/log|gpt/cache
    0    715      0      0    0.0    703  69324   11.8   30.4| gpt/log0
    0    715      0      0    0.0    703  69324   19.2   47.2| gpt/log1
```

1 Gbps link, virtual disk files (ESXi NFS client) (10 Gbps seems the same)


```
# dd if=/dev/zero of=testfile bs=128k count=2000
2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 37,618 s, 7,0 MB/s
```


```
dT: 5.021s  w: 5.000s  filter: gpt/root|label/.ank|gpt/log|gpt/cache
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    372      0      0    0.0    186   6608    0.1   75.5| gpt/log0
    1    372      0      0    0.0    186   6608    0.1   75.5| gpt/log1
```

The VMs are running Linux.

The SSD is a 256GB Crucial m4 CT256M4SSD2, 2.5" SATA 6Gb/s, synchronous MLC NAND.


```
root@bcnas1:/tank/bcnasvm1/bcvm01-02# camcontrol modepage da5 -m 0x08
IC:  0
ABPF:  0
CAP:  0
DISC:  0
SIZE:  0
WCE:  1
MF:  0
RCD:  0
Demand Retention Priority:  0
Write Retention Priority:  0
Disable Pre-fetch Transfer Length:  0
Minimum Pre-fetch:  0
Maximum Pre-fetch:  0
Maximum Pre-fetch Ceiling:  0
```


And no I don't know how to check the write cache on the ramdisk.

`# camcontrol modepage md10 -m 0x08`

```
camcontrol: cam_lookup_pass: CAMGETPASSTHRU ioctl failed
cam_lookup_pass: No such file or directory
cam_lookup_pass: either the pass driver isn't in your kernel
cam_lookup_pass: or md10 doesn't exist
```

`# kldstat`

```
Id Refs Address            Size     Name
 1   35 0xffffffff80100000 dc8658   kernel
 2    1 0xffffffff80ec9000 203848   zfs.ko
 3    2 0xffffffff810cd000 49d0     opensolaris.ko
 4    1 0xffffffff810d2000 c818     if_ixgb.ko
 5    1 0xffffffff810e1000 30990    mpslsi.ko
 6    1 0xffffffff81112000 117d8    ahci.ko
 7    1 0xffffffff81124000 ab40     siis.ko
 8    1 0xffffffff81212000 40c2     linprocfs.ko
 9    1 0xffffffff81217000 1de21    linux.ko
```

`# ls /boot/kernel/*pass*`

```
ls: /boot/kernel/*pass*: No such file or directory
```
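For what it's worth, an md(4) device is RAM-backed and has no write cache to query; camcontrol only talks to CAM devices (ada/da via the pass driver, which explains the error above). On a real disk the cache can be checked like this (a sketch; device names are examples):

```
# SATA disk: look for the write cache feature in the IDENTIFY data
camcontrol identify ada0 | grep -i "write cache"

# SCSI/SAS (or SATA behind a SAS controller): the caching mode page,
# where WCE: 1 means the volatile write cache is enabled
camcontrol modepage da0 -m 0x08 | grep WCE
```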



The ramdisk was created this way (seems like a hack... unmounting it because mdmfs implicitly formats it UFS and mounts it for me):


```
mkdir /mnt/ramdisk
/sbin/mdmfs -s 4G md10 /mnt/ramdisk
umount /mnt/ramdisk/
zpool add tank log md10
```
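A slightly less hacky way (untested here, but mdconfig(8) supports it) would be to create the md device directly and skip the implicit newfs/mount that mdmfs does:

```
# create a 4GB swap-backed memory disk as unit 10, with no file system on it
mdconfig -a -t swap -s 4g -u 10
# hand the raw md device to ZFS as a log vdev
zpool add tank log md10
```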


----------



## Sebulon (Dec 8, 2011)

@peetaur

Oh, you meant THAT kind of ramdisk! I thought it was a hardware-based PCIe card or something. Yeah, strangely, I've noticed that behavior as well. Quoting myself here:


> An odd reflection from these tests was when I added a 1GB ram-md disk as a "best possible" disk for ZIL and it didn't even use it?! I mean, I added md0 as ZIL in the pool, started the sync'ed transfers from the test client, watched gstat on the server during this time, and the md0 drive was never written to. I then removed it from the pool and re-added the Vertex as ZIL instead, and instantly ZFS started using the ZIL as it normally does. Tried restarting the server, destroyed and recreated a bigger md device, partitioned it, destroyed the Vertex's partition and label and used that same gpart label on md0p1 (gpt/log1) instead. Nothing worked. The only disk it wrote to as ZIL was the Vertex. It has worked in earlier tests though. Very odd.

> Ramdisk was created this way (seems like a hack... unmounting it because for some reason it decides to implicitly format it UFS and mount it for me)


Yeah, it does that. I used the same approach as you, except I sized the log for a 1Gbps connection:

```
# mkdir /mnt/ramdisk
# mdmfs -s 1G md10 /mnt/ramdisk
# umount /mnt/ramdisk/
# zpool add tank log md10
```

I have only benchmarked the server against a client, an ESXi host for example. That was done to see what kind of performance the ESXi could expect from its datastore. You took this one step further, which was interesting. I will try that as well and see how it performs inside a VM.

That Crucial SSD looks like a good performer; I read some benchmarks on it, 4k writes should be around 60-70MB/s. Shame it's not safe to use as a ZIL, since it's without any capacitor.

/Sebulon


----------



## peetaur (Dec 8, 2011)

Isn't an Intel X25-E also without a capacitor? Would you use one as a ZIL without a capacitor? Doesn't synchronous writing prevent problems with not having a capacitor? (doesn't sync mean that the write cache must be fully flushed before the command is complete?)


----------



## peetaur (Dec 8, 2011)

> Yeah, strangely, I've noticed that behavior as well. Quoting myself here:
> Quote:
> ... added a 1GB large ram-md disk as a "best possible" disk for ZIL and it didn't even use it?! [...] was never written to. I then removed it from the pool and re-added the Vertex as ZIL instead, and instantly ZFS started using the ZIL as it normally does. ... Nothing worked. The only disk it wrote to as ZIL was the Vertex. Has worked in earlier tests though. Very odd.



That isn't the behavior I saw. I could see the ramdisk being used in gstat every time. The "odd" thing I was talking about was just that when I have a virtual machine with a virtual disk file accessed through ESXi's NFS client, even a ramdisk ZIL goes way below the normal sync NFS client speed.

Here are all my numbers I got with various experiments... to add to your MUCH appreciated data.


```
virtual disk comparison tests
    no log device (offline mirror)

        $ sudo dd if=/dev/zero of=/testfile4 bs=128k count=10000
        ^C924+0 records in
        924+0 records out
        121110528 bytes (121 MB) copied, 45.4686 s, 2.7 MB/s

    2 way mirror with ramdisk, gpt/log0 (crazy idea, just to see what happens)

        $ sudo dd if=/dev/zero of=/testfile10 bs=128k count=3280
        3280+0 records in
        3280+0 records out
        429916160 bytes (430 MB) copied, 65.7471 s, 6.5 MB/s

    'striped' SSD log device

        $ sudo dd if=/dev/zero of=/testfile5 bs=128k count=10000
        ^C2682+0 records in
        2682+0 records out
        351535104 bytes (352 MB) copied, 54.3825 s, 6.5 MB/s

    mirrored SSD log device

        $ sudo dd if=/dev/zero of=/testfile5 bs=128k count=10000
        ^C2726+0 records in
        2726+0 records out
        357302272 bytes (357 MB) copied, 45.5657 s, 7.8 MB/s

    software striped  (non-ZFS stripe) log device

        zpool remove tank gpt/log0 gpt/log1
        kldload geom_stripe
        gstripe label -v st0 gpt/log0 gpt/log1
        zpool add tank log stripe/st0

        $ sudo dd if=/dev/zero of=/testfile6 bs=128k count=10000
        ^C2816+0 records in
        2816+0 records out
        369098752 bytes (369 MB) copied, 45.2785 s, 8.2 MB/s

    3 way stripe with ramdisk, gpt/log0, gpt/log1

        $ sudo dd if=/dev/zero of=/testfile10 bs=128k count=10000
        ^C3174+0 records in
        3174+0 records out
        416022528 bytes (416 MB) copied, 46.7078 s, 8.9 MB/s

    single SSD log device

        $ sudo dd if=/dev/zero of=/testfile5 bs=128k count=10000
        ^C3130+0 records in
        3130+0 records out
        410255360 bytes (410 MB) copied, 44.9023 s, 9.1 MB/s

    ramdisk log device

        mkdir /mnt/ramdisk
        /sbin/mdmfs -s 4G md10 /mnt/ramdisk
        umount /mnt/ramdisk/
        zpool add tank log md10

        dd if=/dev/zero of=/testfile10 bs=128k count=10000
        10000+0 records in
        10000+0 records out
        1310720000 bytes (1.3 GB) copied, 16.2831 s, 80.5 MB/s

    UFS zvol (bypassing log)

        zfs create -V 110g tank/vmufstest

        45-117 MB/s

    unexpected:
        ZFS-Striped <  mirror
        ZFS-Striped < single disk
        software-striped < single disk

    expected but noteworthy anyway:
        mirror < single disk
```

2 tests seem to be missing.

I don't see my "spinning disk ZIL" test which I think was somewhere between the no log test, and the mirrored SSD test.

I don't see the test where I used an 8 consumer disk striped mirrored setup with no log device instead of my 16 enterprise disk raidz2, which was the same as the other "no log" test. (and in my 16 consumer vs 16 enterprise tests, the numbers were only a few % different, so I don't think that threw it off)


----------



## peetaur (Dec 8, 2011)

Here is some text about SSDs that I had not read before, but something like it made me think that with a ZFS ZIL, I don't need to care about a capacitor. Do you suggest otherwise? And I don't know if MLC vs SLC is an issue. What do you think?

"But the drive may have answered to the OS, that it wrote the data to the non-volatile media. That's really a problem."
"Someone has lied in the chain OS to disk."
"You have to disable the drive caches, to ensure that the data is really on disk"
"ZFS doesn't have that problem because it flushes the cache after every write to the ZIL. It circumvents the problem of the write cache by effectively disabling it due to the frequent flushing of the caches."

http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html
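If that quote is right, the behavior on FreeBSD's ZFS of that era should be controlled by the sysctl below (a sketch; leaving it at 0 is what makes the frequent flushing happen, and setting it to 1 is presumably only safe when every log/data device has a power-protected cache):

```
# 0 (default) = ZFS issues a cache flush after every ZIL commit
sysctl vfs.zfs.cache_flush_disable
# only consider this if every device's cache is power-safe
sysctl vfs.zfs.cache_flush_disable=1
```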


----------



## Sebulon (Dec 8, 2011)

@peetaur

I was misinformed that the X25-E had a capacitor; I always believed it did. OK, screw that, I've dug up something even better: the OCZ Deneva 2 240GB MLC. It has the exact same specs as the Vertex 3 240GB and is also equipped with a capacitor. I would love to get my hands on one of those and test whether it really does what it's supposed to in case of a power failure.

SLC is definitely better than MLC at the same size; for example, 32GB SLC is way better than 32GB MLC. But it's also a question of market price. SLC is too expensive for manufacturers to build in large sizes, so far fewer people buy SLC, and because of that the drives are more expensive. In fact, I tried to find SLC drives on Sweden's consumer market and found just one internet shop that still sells the X25-E, and it costs more than a 240GB MLC with the same performance, if not better.

The Zeus IOPS drives have capacitors. It is a must for any SLOG to ensure that data gets written to the ZIL safely. It's either a capacitor that works or disabling the write cache altogether, and that lands you at about 5-10MB/s.

/Sebulon


----------



## peetaur (Dec 8, 2011)

Did you do a test yet, with an SSD that has a capacitor, in an ESXi virtual disk over NFS test like mine that goes 5-9MB/s?

My SSD has the write cache enabled. My normal NFS clients set to synchronous get 65MB/s. So why should the ESXi client somehow tell my SSD not to use the write cache? Is that what is really happening, or is it just a client-side bug/braindead configuration?

That page is mostly unrelated... it's about ESXi's iSCSI initiator and the "StarWind" iSCSI target (probably not ZFS). But it sounds like a very similar problem, and the guy reports better results when using the recommendations on that page. I don't think it applies to NFS though.

But it did make me think of one thing... I can do a test that does not have another file system in there to remove "another layer of caching and flushing" as I said above.

```
mirrored SSD log device. writing to direct virtual disk with no file system

        # dd if=/dev/zero of=/dev/sdc1 bs=128k count=1000
        1000+0 records in
        1000+0 records out
        131072000 bytes (131 MB) copied, 24.0635 s, 5.4 MB/s
```



So there is something about the ESXi NFS client that really sucks. And you were planning on using it for that, right? So we are in the same boat.

And you didn't answer my question about the necessity of a capacitor on a ZFS ZIL.
"ZFS doesn't have that problem because it flushes the cache after every write to the ZIL. It circumvents the problem of the write cache by effectively disabling it due to the frequent flushing of the caches."
I believed that was true in my initial research, and everything in between, and I still do. *Do you believe that also?* If so, then unlike other SSD applications, using one as a ZIL does not require it to have a capacitor, neither for performance nor data integrity (though the flushing makes it slower than in non-ZFS applications).


----------



## danbi (Dec 9, 2011)

peetaur said:

> I don't see the test where I used an 8 consumer disk striped mirrored setup with no log device instead of my 16 enterprise disk raidz2, which was the same as the other "no log" test. (and in my 16 consumer vs 16 enterprise tests, the numbers were only a few % different, so I don't think that threw it off)



This sort of test is pretty much useless and very misleading.

Consumer disks may have the same or even better sequential read/write speeds compared to enterprise drives. Where enterprise drives excel is reliability and multi-threaded performance. Also, it is unwise to compare SATA and SAS disks with sequential loads such as dd, because such scenarios almost never happen in a multitasking OS.
For example, SAS drives have independent read and write data paths, while SATA devices by design have a single data path that is switched between read and write; there are usage patterns where this significantly impacts performance.

You will be much better off doing benchmarks with bonnie++, especially multi-threaded tests.

The SLOG is intended to help with many small sync writes, such as would typically happen in a database or heavily multitasking setup. I believe that in newer versions of ZFS, large synchronous writes bypass the SLOG already. The primary purpose of the SLOG is to reduce sync write latency by using a small, fast dedicated device.
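In those newer ZFS versions this behavior can be steered per dataset with the logbias property (a sketch; the pool/dataset name is an example):

```
# latency (the default): sync writes go through the SLOG for low latency
zfs get logbias tank/vmstore
# throughput: sync writes bypass the SLOG and go straight to the pool disks
zfs set logbias=throughput tank/vmstore
```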

Daniel


----------



## peetaur (Dec 9, 2011)

If the test is misleading, you read it wrong. I didn't say the consumer disks are the same speed; I said that in my testing (ZFS file system, caching, sync NFS, in some cases an SSD ZIL), the system performs the same. And in real-world tests I am not testing raw disks without cache, so I don't want to run your benchmarks with cache disabled; that would tell me nothing practical.

And what I was saying is not useless; I was talking about the effect of mirroring to try to fix the slow ESXi problem, not enterprise vs consumer disks. You need to be wise about how you read results from anywhere, considering the context, including benchmarks. And benchmarks are even worse with ZFS, showing obviously fake results due to caching. I can disable the cache or do real-world tests to avoid that. So don't tell me it is useless. Try to be constructive instead.

And bonnie++ won't run at all for me on FreeBSD, only on the Linux machines I tried it on.


```
/tank/test# bonnie++ -d /tank/test -c 8 -s 1500 -x 10 -u peter
Using uid:1001, gid:1001.
File size should be double RAM for good results, RAM is 49124M.

/tank/test/bonnie# bonnie++ -d /tank/test -c 8 -s 100000 -x 10 -u peter
Using uid:1001, gid:1001.
format_version,bonnie_version,name,file_size,io_chunk_size,putc,putc_cpu,put_block,
put_block_cpu,rewrite,rewrite_cpu,getc,getc_cpu,get_block,get_block_cpu,seeks,
seeks_cpu,num_files,max_size,min_size,num_dirs,file_chunk_size,seq_create,
seq_create_cpu,seq_stat,seq_stat_cpu,seq_del,seq_del_cpu,ran_create,ran_create_cpu,
ran_stat,ran_stat_cpu,ran_del,ran_del_cpu,putc_latency,put_block_latency,
rewrite_latency,getc_latency,get_block_latency,seeks_latency,seq_create_latency,
seq_stat_latency,seq_del_latency,ran_create_latency,ran_stat_latency,ran_del_latency
Can't open file ./Bonnie.97370

root@bcnas1bak:/tank/test/bonnie# ls -l
total 3
drwxr-xr-x  3 peter  peter  3 Dec  9 08:37 largefile1
```

The directory called largefile1 was created with filebench, not bonnie++. Filebench seems to work, other than reporting bogus CPU numbers, but again, it gives bogus "disk" results due to caching, and basically valid but not real-world "file system" results.


----------



## Sebulon (Dec 9, 2011)

@peetaur

OK, to answer your question; I believe you MUST have an SSD with a capacitor, or you WILL end up with a corrupt ZIL (possibly killing the entire pool) after a power failure. And I'm far from alone in that:

http://hardforum.com/showthread.php?t=1591181


> use SSDs with supercapacitor protection to protect your SLOG from corruption



http://www.nexentastor.org/boards/1/topics/972


> in case of a power failure, data in the device's internal cache must not be lost, or the device will at least have to honor ZFS cache flush requests



http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg30412.html


> You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache that can't be disabled neither flushed by ZFS.



http://www.natecarlson.com/2010/05/07/review-supermicros-sc847a-4u-chassis-with-36-drive-bays/


> The only thing it lacks is a supercapacitor, but that shouldn't be a problem if you have dual PSUs connected to dual UPSes



BUT I haven't had the opportunity to test this personally, so I cannot say I know it for sure yet. I'm hoping I can soon build a rig where it's OK to test a power failure, first with a non-capacitor SSD like the Vertex 3, and then again with a capacitor-backed SSD like the Deneva 2, to see if there's any difference. I mean, testing whether the pool gets faulted with the Vertex as SLOG, and then the same for the Deneva. That kind of test.
After that, we will know for sure whether the capacitor really has an impact or not.

/Sebulon


----------



## peetaur (Dec 9, 2011)

Sounds like fun. Can't wait to hear the results.

Add this to your fun list: ramdisk ZIL with hard reboot during a sync write.


----------



## RusDyr (Jan 27, 2012)

Very interesting topic. I have two Intel SSDSA2CW120G3 (120GB SSDs) and can say that UFS is really good. I got *double* the speed on UFS over ZFS:
ZFS partition:

```
dd if=/tmp/ram/rand of=/tmp/z/testfile.bin bs=4k
515584+0 records in
515584+0 records out
2111832064 bytes transferred in 32.458456 secs (65062616 bytes/sec)
```

UFS partition:

```
dd if=/tmp/ram/rand of=/tmp/ufs/testfile.bin bs=4k
515584+0 records in
515584+0 records out
2111832064 bytes transferred in 14.829704 secs (142405543 bytes/sec)
```

And since I want to use ZFS, this is driving me really crazy.


----------



## Sebulon (Jan 27, 2012)

@RusDyr

That doesn't quite show the whole picture; I've got a couple of questions to go with those dd's:

1. Are you writing to one disk with ZFS and one with UFS?
2. How are the disks partitioned? Output of:
`# gpart show`
3. What is the output of:
`# zdb | grep ashift`
4. Were you planning on having those two as a system pool and a bunch of other disks for "tank", or are those two it?
5. Are you planning on using some type of redundancy on the SSDs, like gmirror or ZFS mirror?

You are writing to a file in a file system, which overrides the bs flag on dd and uses whatever block size the file system wants. If you want to benchmark that difference, you have to write directly to a device, e.g. /dev/daX(pX).
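For example (a sketch; daXpY is a placeholder for a scratch partition, and this destroys whatever is on it):

```
# write 4k blocks straight to the raw partition, bypassing any file system
# WARNING: overwrites the target partition
dd if=/dev/zero of=/dev/daXpY bs=4k count=100000
```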

/Sebulon


----------



## RusDyr (Jan 30, 2012)

1. Yes.
2.

```
=>       34  234441581  ada4  GPT  (111G)
         34          6        - free -  (3.0k)
         40        128     1  freebsd-boot  (64k)
        168    2097152     2  freebsd-swap  (1.0G)
    2097320   52428800     3  freebsd-zfs  (25G)
   54526120    6291456     4  freebsd-zfs  (3.0G)
   60817576    6291456     5  freebsd-zfs  (3.0G)
   67109032   83886080     6  freebsd-zfs  (40G)
  150995112   83446496     7  freebsd-zfs  (39G)
  234441608          7        - free -  (3.5k)

=>       34  234441581  ada5  GPT  (111G)
         34          6        - free -  (3.0k)
         40        128     1  freebsd-boot  (64k)
        168    2097152     2  freebsd-swap  (1.0G)
    2097320   52428800     3  freebsd-ufs  (25G)
   54526120    6291456     4  freebsd-zfs  (3.0G)
   60817576    6291456     5  freebsd-zfs  (3.0G)
   67109032   83886080     6  freebsd-zfs  (40G)
  150995112   83446496     7  freebsd-zfs  (39G)
  234441608          7        - free -  (3.5k)
```

3. It was "ashift 9".
4. I would like to use them for a system pool (ZFS mirror), for ZIL (ZFS mirror, per pool), and for L2ARC (ZFS stripe, per pool).
5. Yeah, currently the system partition is gmirror'ed, but I'm slightly disappointed that TRIM isn't supported over gmirror.



> You are writing to a file in a file system, which cancels the bs-flag on dd and uses the block-size the file system wants. If you want to benchmark that difference, you have to write directly to a device, eg: /dev/daX(pX)


I did it like you did.
Benchmarking against the raw device (or more accurately, a partition) gives results pretty close to UFS.

P.S. Current config:
`#  camcontrol devlist`

```
<SAMSUNG HD204UI 1AQ10001>         at scbus0 target 0 lun 0 (pass0,ada0)
<SAMSUNG HD204UI 1AQ10001>         at scbus1 target 0 lun 0 (pass1,ada1)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (pass2,ada2)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (pass3,ada3)
<INTEL SSDSA2CW120G3 4PC10362>     at scbus4 target 0 lun 0 (pass4,ada4)
<INTEL SSDSA2CW120G3 4PC10362>     at scbus5 target 0 lun 0 (pass5,ada5)
```

`# zpool list`

```
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
storage  3,53T  1,56T  1,97T    44%  1.00x  ONLINE  -
zstripe   103G  9,73G  93,3G     9%  1.00x  ONLINE  -
```

`# mount -v`

```
/dev/mirror/system on / (ufs, local, noatime, journaled soft-updates, fsid 72fe234fbf939f57)
devfs on /dev (devfs, local, multilabel, fsid 00ff007171000000)
procfs on /proc (procfs, local, fsid 01ff000202000000)
storage on /storage (zfs, local, noatime, nfsv4acls, fsid 791cf6fedea1d203)
zstripe on /zstripe (zfs, local, noatime, nfsv4acls, fsid bb656ff7de90fc76)
```

`# zpool status`

```
pool: storage
 state: ONLINE
  scan: resilvered 52K in 0h0m with 0 errors on Fri Jan 27 10:39:43 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        storage                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/73247b20-46b2-11e1-8642-a0369f0010fc  ONLINE       0     0     0
            gptid/cc27ad35-475d-11e1-8383-00259052b005  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/84178b08-46b2-11e1-8642-a0369f0010fc  ONLINE       0     0     0
            gptid/cae28809-475d-11e1-8383-00259052b005  ONLINE       0     0     0
        logs
          mirror-2                                      ONLINE       0     0     0
            gptid/3600f2e2-48eb-11e1-9076-485b39c5c747  ONLINE       0     0     0
            gptid/3b54ceb9-48eb-11e1-9076-485b39c5c747  ONLINE       0     0     0
        cache
          ada4p6                                        ONLINE       0     0     0
          ada5p6                                        ONLINE       0     0     0

errors: No known data errors

  pool: zstripe
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zstripe     ONLINE       0     0     0
          ada3p4    ONLINE       0     0     0
          ada0p4    ONLINE       0     0     0
          ada1p4    ONLINE       0     0     0
          ada2p4    ONLINE       0     0     0
        logs
          mirror-4  ONLINE       0     0     0
            ada4p5  ONLINE       0     0     0
            ada5p5  ONLINE       0     0     0
        cache
          ada4p7    ONLINE       0     0     0
          ada5p7    ONLINE       0     0     0

errors: No known data errors
```


----------



## Sebulon (Jan 30, 2012)

@RusDyr

Thank you, that helped me understand your system better.



> I did it like you did.
> Benchmark with direct device (or more accuratly, to partition) pretty close to UFS results.


Don't do what I do, do what I say.
I know, I did that, and it was wrong of me. I didn't know before that the file system interfered that way. I did, however, explain it in a later post (#35).

Now, I'm like the worst person at math, so I'm just gonna ask: those partitions don't look to me as if they are aligned to 4k? The partition starts should be evenly divisible by 4k; beginning at 1M covers that, like:
`# gpart add -t freebsd-zfs -b 1m -a 4k daX`
On some SSDs I've tested, that made a huge difference, like doubled the performance. It might not have been the case with the 320s, but it can't hurt anyway.
Also, I would suggest you make sure ashift is set to 12 on a SLOG. Have you used the gnop trick before? I can describe the steps for you if you like.
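For reference, the gnop trick roughly goes like this (a sketch; the label gpt/log0 is an example, so double-check device names before trying it):

```
# create a transparent nop provider that reports 4k sectors
gnop create -S 4096 /dev/gpt/log0
# add the log via the .nop device so ZFS picks ashift=12 for the vdev
zpool add tank log gpt/log0.nop
# the ashift sticks to the vdev, so the nop layer can now be removed
zpool export tank
gnop destroy /dev/gpt/log0.nop
zpool import tank
```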

In the beginning, I tested having two disks with two partitions each, acting as both ZIL and L2ARC, which was a very bad idea; I noticed a performance hit of about 30% that you should be aware of. Using the same disks for multiple things, like boot, ZIL and L2ARC, is bad practice. The point is to have separate (undisturbed) ZIL and L2ARC devices that you provision to your desired performance level, so that the data disks behind them matter less performance-wise, because the ZIL and L2ARC will always guarantee that level.
Or more clearly: if you have good enough ZIL and L2ARC, you can configure the data disks for the biggest possible storage instead of acceptable performance.

I would suggest you boot off a USB stick, set up and mount _root_ on the pool, and have one of the 320s dedicated as ZIL and the other as L2ARC for best performance. That way you have only one big pool with the best possible performance. If you run 8.2-STABLE or 9.0-RELEASE you don't have to worry about mirroring the ZIL any more.

/Sebulon


----------



## peetaur (Feb 1, 2012)

RusDyr, can we see your /boot/loader.conf and /etc/sysctl.conf?

In particular, I am looking for these settings (and here are my values for my 48GB machine, all in /boot/loader.conf, nothing zfs related in my /etc/sysctl.conf):


```
vm.kmem_size="44g"
vm.kmem_size_max="44g"
vfs.zfs.arc_min="80m"
vfs.zfs.arc_max="38g"
vfs.zfs.arc_meta_limit="24g"
vfs.zfs.vdev.cache.size="32m"
vfs.zfs.vdev.cache.max="256m"
kern.maxfiles="950000"
```

Please note that at every chance they get, many people say you should never set "vm.kmem_size_max", so you probably should not set that one. But I have not had any problems with it so far.


----------



## peetaur (Feb 1, 2012)

RusDyr,

I did some tests on an old computer: Dell PowerEdge 2850


```
device     = 'Expandable RAID Controller (PERC 4e/Si and PERC 4e/Di)'
    class      = mass storage
    subclass   = RAID
```

Can't tell you exactly what the disks are, but they are 10k RPM SCSI.


```
6 disk gstripe UFS:
    write at 161356565 B/s, 
    read at 738257082 B/s (cached)
    unmount, remount and read 37920878 B/s (uncached, but bursty looking in gstat, with disks spending much time at 0 kbps 0% load)

6 disk zfs 'stripe':
    write at 159819737 B/s
    read at 1996571694 B/s (cached)
    playing with "primarycache" setting to read uncached: 224047484 (confirmed this works as uncached in gstat) 
    [EDIT: apparently I repasted the above read number here before, so fixed that with the correct value averaged 
    over 5 tests; and previous edit of 211382222 was with raidz1]
```

So it would look like the generalization "UFS is twice as fast as ZFS" is not correct, and we should look into why your numbers come out that way. My post above about loader.conf is based on my experience that ZFS is a terrible performer with low RAM (such as with the default settings), even compared to other file systems with low RAM.

I was mainly testing the difference between a RAID10-like setup and raidz1. Raidz1 is much faster writing sequentially, and the same reading (as I hypothesized). But you were talking about UFS vs ZFS rather than gmirror vs ZFS. Can someone recommend the best way to test random reads and writes (to compare raidz1 and a striped mirror config)? I am probably not interested in avoiding caching (a disk test, such as what bonnie++ would do for me), only in the 'real world' test (a file system test), which would include caching and sync writes.


----------



## Sebulon (Feb 1, 2012)

@peetaur

I believe you are confusing RAID-levels:

RAID0 = 1x- or >1x no parity zfs stripe/d vdev/s (gstripe)

RAID1 = 1x zfs mirror vdev (gmirror)
RAID5 = 1x zfs raidz1 vdev
RAID6 = 1x zfs raidz2 vdev

RAID10 = >1x zfs mirror vdev (gstriped gmirrors)
RAID50 = >1x zfs raidz1 vdev
RAID60 = >1x zfs raidz2 vdev

You can, of course, make a raidz1 vdev with only two disks in it, but IIRC performance is better using mirrored vdevs.

/Sebulon


----------



## peetaur (Feb 1, 2012)

I agree with you on those definitions. What part of my previous post was wrong? [and not listed is RAID 0+1 which as far as I know is not possible with pure zfs.]

I mentioned RAID10 in the last paragraph, which I also tested, but I didn't include the results in the code blocks above because (1) they are out of context for this "ZFS vs UFS" performance question and I wanted to keep my post shorter, since your thread is about NFS and ZILs, not comparing UFS, and (2) I didn't want to bother creating gstriped gmirrors in my test (the UFS counterpart of that otherwise incomparable test), because although related to this discussion, it is not related to what I want to do with the server I am building.

Here is my zfs stripe:


```
zpool create pool \
    gpt/pool1d1 gpt/pool1d2 \
    gpt/pool2d1 gpt/pool2d2 \
    gpt/pool3d1 gpt/pool3d2
```

Here is my UFS stripe:


```
gstripe create pool \
    gpt/pool1d1 gpt/pool1d2 \
    gpt/pool2d1 gpt/pool2d2 \
    gpt/pool3d1 gpt/pool3d2

newfs /dev/stripe/pool
mkdir /pool
mount /dev/stripe/pool /pool
```

Here is my zfs RAID10:


```
zpool create pool \
    mirror gpt/pool1d1 gpt/pool1d2 \
    mirror gpt/pool2d1 gpt/pool2d2 \
    mirror gpt/pool3d1 gpt/pool3d2
```


----------



## Sebulon (Feb 2, 2012)

@peetaur

Yeah, a little OT, but I don't mind. This is what I reacted to:


> I was mainly testing the difference between a raid10 like setup and raidz1.


RAID10 and raidz1 (RAID5/0) are nothing alike. You're comparing apples and pears. It's not useless, but it's untrue.

What you can benchmark instead is the difference between striped zfs mirrors and gstriped gmirrors. That would be a fun and true comparison.



> and not listed is RAID 0+1 which as far as I know is not possible with pure zfs.


Correct. You could perhaps create two gstripe devices made up of N disks each and create a ZFS pool with one mirrored vdev using those two gstripes. Might also be a fun test.

/Sebulon


----------



## peetaur (Feb 2, 2012)

It is not untrue to compare apples and pears; it just depends on what you want. If you simply say "apples are better than pears", then you are mistaken, but if you say "apples suit my needs better than pears", then there is no problem; it's just that not everyone can say the same.

What I am not sure of:

Do I need the excellent random read and write of RAID10?
Do I need the extra space or faster sequential speed from raidz1?

What I was testing was:

Is raidz1 slower than RAID10 like everyone says, or does it match my hypothesis: sequentially it is always faster than RAID10; equal to RAID0 in read, and about 80% of RAID0 in write (1/(disks-1) slower). And in "random" operations, I expect it to be slower, but I'm not sure how to test... My plan for testing random access is to run make buildworld. I don't like benchmark tools; they seem to just test block-level access to a big randomly generated file, not a full file system (creating files, reading directories, etc.), which is more real-world.

Others make stupid but intelligent sounding conclusions. So it is very difficult to figure out the whole truth just from reading. 

And I am slowly getting fed up with ESXi. 

<ESXi rant>
Today, the ESXi server was happily running while it said some VMs were down and others up. Pinging the "down" VMs and using their web servers, etc. worked. So ESXi reports them as "down" when really they are up. How is that even possible? One server that was "up" was responding VERY slowly, and others report load >4 when idle. So I rebooted ESXi. Why should I need to reboot just to fix this? I feel like I'm running Windows... And commands like "top" and "vmstat", or even looking in /proc/... to find CPU stats, aren't possible in the ESXi command line. (I'm guessing there is a way, but it is a mystery to me.) Things like that make me want a real OS, not this incomplete VMware ESXi Busybox, with a semi-well-documented GUI and a completely obscure proprietary command line. I wanted to know the CPU usage of the vmware-vmx processes (or the ESXi equivalent), because it is a common problem for them to all hog 100% CPU and make the system crawl in other VMware products.

And a week ago, I wanted to create my first non-NFS virtual machine, and to my surprise, there was no local datasource. The path was "dead", it said. But the OS disk worked fine, so clearly the hardware was working. The BIOS RAID setup said things were optimal. A reboot fixed it, so I can only assume the hardware is fine, but how can I trust it in the future?

So my new plan is: run my NAS and replication/backup server with lots of disk space. Run the VMs on separate FreeBSD + ZFS + VirtualBox machines, with the virtual disk files stored locally. Send replication snapshots to the backup server, and put large-volume, low-latency-demanding stuff (which is most of what we do here) on the NAS directly. (Luckily, I am in charge of this, so I can decide whether or not to throw away ESXi; do you have the same control?) Another option would be to netboot the ESXi, or to run it in a VirtualBox. Both of those sound like bad hacks, and since I'm fed up with ESXi, I am leaning towards a VirtualBox solution.

So my "raid10 apples vs raidz1 pears" comparison is just to decide... do I want the significantly higher space and sequential performance from raidz (raidz1 in the case of this 6 disk machine, and maybe also the 4 disk machine that currently runs ESXi, and raidz2 for larger ones), or do I want the performance characteristics of the raid10 (~50% better random [did not test myself], ~33% slower sequential write (63MB/s vs 92MB/s) and equal read (204MB/s vs 193MB/s) [my own testing on the old PowerEdge 2850]).

So far, I think I will choose raidz1 for the faster sequential writes. I think for my needs, faster transfers over the network are more important than random performance (like for a database, or compiling things). (Maybe I will compare random speeds by running make buildworld)

We are getting further off the topic of ZILs... but I think what you are interested in is virtual machine performance... so I guess we are slightly on topic. And let me know if you prefer less long-winded replies in the future.


----------



## peetaur (Feb 2, 2012)

peetaur said:
			
		

> That isn't the behavior I had. I could see the ramdisk [ZIL] being used in gstat every time.



Oh by the way... now *gstat* doesn't show much load on the ZIL during a sync Linux client write. I don't know why it stopped, but my best guess is that it is because I destroyed the pool and recreated it. The old one was an old version that was upgraded to v28. The new one was created v28. (another quirk I found in upgraded pools vs created as v28 is that you really can't remove the log... You can remove log vdevs, or run with the log OFFLINE, but the last one won't go away.)


----------



## Sebulon (Feb 6, 2012)

@RusDyr

To find out for yourself the exact difference between ZFS and UFS, you can benchmark striped ZFS mirrors (without log or cache device) against gstriped gmirrors to make a completely accurate comparison. Then, as peetaur is on to, comes the next question: "What do I really need?". Then you can compare your initial results with the results from one raidz vdev, two raidz's, one raidz2, two raidz2's, and so on. Please post your results in a new thread called something like "Comparative benchmark between ZFS mirrors and gstriped gmirrors". Use my MO to create a RAM-disk, fill up a big file of random data, and use that with dd to test write speed. Then it's OK to read from that random data file and write towards another file in the ZFS or UFS file system, like:
`# dd if=/mnt/ram/randfile of=/foo/bar/randfile bs=1m`
Also install benchmarks/bonnie++ from ports and test:
`# bonnie++ -d /foo/bar -u 0 -s Xg`
-d /foo/bar (the directory with the ZFS or UFS filesystem mounted)
-u 0 (if youÂ´re running as root)
-s Xg ("X" should be double the size of your RAM)
I would love to see those numbers.
Also keep in mind the tips I gave you about gpart and gnop. They have been big performance enhancers for me personally.
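For completeness, that RAM-disk MO could be sketched like this on FreeBSD (a sketch only; the md unit number, sizes, and mount points are assumptions, and /foo/bar stands for whatever ZFS or UFS filesystem you are testing):

```shell
# Create a 4 GB swap-backed RAM-disk and mount it
mdconfig -a -t swap -s 4g -u 0
newfs -U /dev/md0
mkdir -p /mnt/ram
mount /dev/md0 /mnt/ram

# Fill a 2 GB file with random data to use as the benchmark source
dd if=/dev/random of=/mnt/ram/randfile bs=1m count=2048

# Write-speed test: read the random file, write into the
# filesystem under test (ZFS or UFS mounted at /foo/bar)
dd if=/mnt/ram/randfile of=/foo/bar/randfile bs=1m
```

Reading from a RAM-disk keeps the source side out of the measurement, so dd's reported rate reflects the write path of the filesystem under test.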



@peetaur

I quite like rants. It's about the only way to find out what support and sales won't tell you. How many times have you heard "Our products are *not* good at etc, etc" or "Our products have N bugs that muck up this and that in this way"?


> (Luckily, I am in charge of this, so I can decide whether or not to throw away ESXi; Do you have the same control?)


No. We are running 300-400 VMs in an HP blade chassis with a NetApp NAS serving NFS to VMware and SMB to our users, farming about 200TB. We are, however, planning a much more price-efficient solution for a gigantic video archive running Supermicro HW and FreeBSD or FreeNAS, which I will be in charge of.



> Oh by the way... now gstat doesn't show much load on the ZIL during a sync Linux client write. I don't know why it stopped, but my best guess is that it is because I destroyed the pool and recreated it. The old one was an old version that was upgraded to v28. The new one was created v28.


Aha! So we have the same behaviour. That is so strange. A big regression, I'd say.


> (another quirk I found in upgraded pools vs created as v28 is that you really can't remove the log... You can remove log vdevs, or run with the log OFFLINE, but the last one won't go away.)


Big bummer. I wonder how a power outage would affect the pool if running in that state... Best to have a new pool created with v28 and send/recv between them.

/Sebulon


----------



## phoenix (Feb 6, 2012)

Just a side note:  VMware ESXi's NFS client has absolutely horrible performance when connected to a FreeBSD NFS server backed by ZFS. A separate ZIL does not help much. An async mount does not help much. Disabling the ZIL doesn't help much. The only way to get good performance from it is to modify the FreeBSD NFS server in such a way that it eliminates all data protection.

There are several threads about this on the -stable and -current mailing lists.

Best solution is to dump ESXi.  Second best solution is to dump ZFS.


----------



## Sebulon (Feb 6, 2012)

@phoenix

WHAT, WHAT, WHAAAT?!

Please post a link or three, for us poor search-impaired people. It would be nice for others reading this to have a direct reference from someone who's read about it. I could start searching for them and post links myself, but chances are I'd just find the "wrong" ones...

/Sebulon


----------



## phoenix (Feb 6, 2012)

For the searching impaired: ZFS sync / ZIL clarification


----------



## Sebulon (Feb 7, 2012)

@phoenix

Awesome, thanks!

@peetaur

Wooow! I remember you writing about this earlier, but I didn't know quite how bad it was, or that more people are starting to notice this as well. Fortunately, the article link you posted in the mail thread completely solves the problem:
http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html
When you are running ZFS with a ZIL you know you can trust (after extensive testing, of course), you'll want to depend on that, instead of having the NFS server flushing the ZIL all of the time.

/Sebulon


----------



## AndyUKG (Feb 8, 2012)

Sebulon said:
			
		

> http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html



With respect to the link, I would guess disabling NFS flushing will likely cause data corruption of the VMware file system in the event of a power loss on the NFS server. The article said that this hack had been in use for some time without issues, but didn't specify whether they had actually tested an NFS server power failure. The whole point of NFS flushes and the ZIL etc. is to assure data consistency when something goes wrong.
Anyway something to at least investigate before you start to use a hack like this...

cheers Andy.


----------



## Sebulon (Feb 8, 2012)

@AndyUKG

That is 100% correct, excellent of you to point out! I'm hoping I have a chance to test this personally very soon. I'm hunting for the perfect SLOG at the moment, which I then intend to install in our dev storage and serve out an NFS datastore to our dev ESXi, so this can be investigated properly.

/Sebulon


----------



## AndyUKG (Feb 8, 2012)

I've read that istgt with ZFS provides better performance with ESXi than NFS, have you looked into that/discounted that for any reason?

cheers Andy.


----------



## peetaur (Feb 8, 2012)

If you use iSCSI for virtual disks, is it easy to export and use the disks in other ways without ESXi?

For example, if I create a VMware ESXi vm that uses iSCSI, and then later change it to VirtualBox, can I simply use it as is? Can I mount it and run `# VBoxManage clonehd ...` to recover it?

Or can I mount the disk and read the files with the ESXi host machine shut down?


----------



## AndyUKG (Feb 8, 2012)

peetaur said:
			
		

> If you use iSCSI for virtual disks, is it easy to export and use the disks in other ways without ESXi??



An iSCSI disk device, when connected to an iSCSI client system, behaves just like a direct-attached disk. So anything you can do with a physical disk attached to VMware, you can do with an iSCSI disk in terms of connecting to other systems. I haven't tried mounting a VMFS volume on any other OS, if that's what you want to do, but a quick google did turn up an open source read-only VMFS driver:

http://code.google.com/p/vmfs/

thanks Andy.


----------



## peetaur (Feb 8, 2012)

I know you can mount a vmdk file in other VMs, or mount it with a fuse driver, but does VMware's iSCSI client add some metadata junk to the iSCSI share that is unique to VMware's implementation and unusable in others?


----------



## AndyUKG (Feb 8, 2012)

I haven't tried it, but I'd guess it will be mountable on non-VMware systems. As I said, an iSCSI volume is treated as a local disk (VMware state this is the case on VMware too), so if fuse etc. can mount normal VMware devices, it should work with an iSCSI device too. But unless anyone else can confirm, I guess you'll have to try to be sure...

thanks Andy.


----------



## Sebulon (Feb 27, 2012)

[GirlieGiggle] It has arrived... [/GirlieGiggle]

/Sebulon


----------



## lockdoc (Mar 8, 2012)

Hi Sebulon


			
				Sebulon said:
			
		

> Hi all!
> [...]
> 
> ```
> ...


I also have 6 of those HD103SJ and 2 Spinpoint F1. Are you sure they really have a 4k block size? I think I read somewhere that they are actually 512b, but I cannot find the source anymore.

My setup is: 4 mirrors of geli-encrypted disks in a stripe, without any cache or log. If yours is also encrypted, we could do a couple of benchmarks together, as we have the same disks.


----------



## Sebulon (Mar 8, 2012)

@lockdoc

I've posted several HW specs for different servers.

Those HD's are in my home-NAS and they are good old 512's, ashift=9.

No geli for me, yet. I've found that my CPU (Core i5 560) supports HW-assisted AES-256, so I'm very tempted to give it a go some time and see how the performance is affected. But that testing is another topic completely.

/Sebulon


----------



## lockdoc (Mar 8, 2012)

Sebulon said:
			
		

> Those HD's are in my home-NAS and they are good old 512's, ashift=9.



Hmm, mine report ashift=12, with geli at 4096 block size. Could this be a bottleneck?


----------



## phoenix (Mar 8, 2012)

GELI is setting the minimum block size to 4K.  Hence, ZFS configured the vdev to use 4K blocks as the smallest size (aka *ashift=12*).


----------



## Sebulon (Mar 9, 2012)

Initial scores from the OCZ Deneva 2 R-series MLC (sync). The HW is set up to test the Deneva first and will be reconfigured with 4x Vertex 3's in two mirrors for fault tolerance later.


```
[B][U]HW[/U][/B]
1x  Supermicro H8SGL-F
1x  Supermicro AOC-USAS2-L8i
1x  Supermicro SC111T-560CB
1x  AMD Opteron 8C 6128 2.0GHz
2x  16GB 1333MHZ DDR3 ECC REG
1x  OCZ Deneva 2 R-series 200GB
2x  OCZ Vertex 3 240GB

[B][U]SW[/U][/B]
[CMD="#"]uname -a[/CMD]
FreeBSD default 9.0-RELEASE FreeBSD 9.0-RELEASE #0: Tue Jan  3 07:46:30 UTC 2012     
root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
[CMD="#"]zpool get version pool2[/CMD]
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default
[CMD="#"]zpool status[/CMD]
  pool: pool1
 state: ONLINE
 scan: resilvered 392K in 0h0m with 0 errors on Wed Mar  7 14:32:33 2012
config:

	NAME          STATE     READ WRITE CKSUM
	pool1         ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    gpt/usb0  ONLINE       0     0     0
	    gpt/usb1  ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
 scan: resilvered 0 in 0h0m with 0 errors on Thu Mar  8 16:42:41 2012
config:

	NAME         STATE     READ WRITE CKSUM
	pool2        ONLINE       0     0     0
	  gpt/disk0  ONLINE       0     0     0
	  gpt/disk1  ONLINE       0     0     0
	logs
	  gpt/log1   ONLINE       0     0     0

errors: No known data errors
[CMD="#"]zdb | grep ashift[/CMD]
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12
[CMD="#"]camcontrol devlist[/CMD]
<ATA D2RSTK251M11-020 E>           at scbus0 target 0 lun 0 (da0,pass0)
<ATA OCZ-VERTEX3 2.15>             at scbus0 target 4 lun 0 (da1,pass1)
<ATA OCZ-VERTEX3 2.15>             at scbus0 target 5 lun 0 (da2,pass2)
<USB Mass Storage Device \001\000\000?>  at scbus7 target 0 lun 0 (da3,pass3)
<USB Mass Storage Device \001\000\000?>  at scbus8 target 0 lun 0 (da4,pass4)
[CMD="#"]gpart show[/CMD]
=>       34  390721901  da0  GPT  (186G)
         34       2014       - free -  (1M)
       2048  390719880    1  freebsd-zfs  (186G)
  390721928          7       - free -  (3.5k)

=>       34  468862061  da1  GPT  (223G)
         34       2014       - free -  (1M)
       2048  468860040    1  freebsd-zfs  (223G)
  468862088          7       - free -  (3.5k)

=>       34  468862061  da2  GPT  (223G)
         34       2014       - free -  (1M)
       2048  468860040    1  freebsd-zfs  (223G)
  468862088          7       - free -  (3.5k)

=>     34  7744445  da3  GPT  (3.7G)
       34       30       - free -  (15k)
       64      128    2  freebsd-boot  (64k)
      192     1856       - free -  (928k)
     2048  7742424    1  freebsd-zfs  (3.7G)
  7744472        7       - free -  (3.5k)

=>     34  7744445  da4  GPT  (3.7G)
       34       30       - free -  (15k)
       64      128    2  freebsd-boot  (64k)
      192     1856       - free -  (928k)
     2048  7742424    1  freebsd-zfs  (3.7G)
  7744472        7       - free -  (3.5k)
```


```
[CMD="#"]iperf -c 10.10.0.12[/CMD]
------------------------------------------------------------
Client connecting to 10.10.0.12, TCP port 5001
TCP window size: 32.5 KByte (default)
------------------------------------------------------------
[  3] local 10.10.0.10 port 63461 connected with 10.10.0.12 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   934 Mbits/sec
```


```
[B][U]LOCAL WRITES[/U][/B]
128k)  284MB/s
4k)    60MB/s
```


```
[B][U]OVER NFS[/U][/B]
sync)   2147483648 bytes transferred in 31.963803 secs (67184860 bytes/sec)
async)  2147483648 bytes transferred in 22.719088 secs (94523321 bytes/sec)
```

Tests over NFS have been made like:

```
[B]sync)[/B]
[CMD="#"]mount 10.10.0.12:/export/perftest /mnt/tank/perftest[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m[/CMD]
[CMD="#"]umount /mnt/tank/perftest[/CMD]
[B]async)[/B]
[CMD="#"]mount -o async 10.10.0.12:/export/perftest /mnt/tank/perftest[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m[/CMD]
[CMD="#"]dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m[/CMD]
[CMD="#"]umount /mnt/tank/perftest[/CMD]
```

Next up is Ivan Drago: "I will break you."

/Sebulon


----------



## peetaur (Mar 9, 2012)

Almost everything running around 60-65 MB/s (even the ZEUS) makes me think there is a software bottleneck: a needless sleep(), bad interrupt handling, lock acquisition, etc.

Maybe it would be interesting to try setting your CPU multiplier or some buses lower (underclock it) to see if it changes the speed, but I don't think server boards have those options. Or run a nice -n -19 CPU waster with 100+ threads in the background. Or someone else more familiar with low-level things could suggest a better test.

Also, have you tried striping across a few SSDs for your log to compare? I see you wrote 





> At least with ZFS V15, I would say I won about 20% striped logs vs mirrored.


 but I found the same effect between mirrored and single SSD (non-stripe).


----------



## Sebulon (Mar 11, 2012)

OK, explain this to me...

I've spent two days now trying to fault this non-redundant pool, and it just won't fail.

First, with the Deneva as log, I started with the same client that did the performance tests. I basically re-ran the performance tests, but I reset the server during every transfer; at that point the client just paused and waited for the server to come back up again (it rebooted without any issues), then resumed the transfers and lived happily ever after. After all three transfers were complete, I scrubbed the pool and, to my surprise, there were no errors.

`# mount 10.10.0.12:/export/perftest /mnt/tank/perftest`
`# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m`
(Reset)
`# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m`
(Reset)
`# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m`
(Reset)
`# umount /mnt/tank/perftest`

`# mount -o async 10.10.0.12:/export/perftest /mnt/tank/perftest`
`# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB.bin bs=1m`
(Reset)
`# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-2.bin bs=1m`
(Reset)
`# dd if=/mnt/ram/rand2GB.bin of=/mnt/tank/perftest/rand2GB-3.bin bs=1m`
(Reset)
`# umount /mnt/tank/perftest`

At that time, I realised I needed a sure-fire way of faulting it so that I could have something to compare with, so I switched the Deneva for a Vertex, which shouldn't work as a SLOG since it lacks the battery backing the Deneva has.

So the pool now looks like this:

```
pool: pool2
 state: ONLINE
 scan: scrub repaired 0 in 0h0m with 0 errors on Fri Mar  9 16:12:29 2012
config:

	NAME         STATE     READ WRITE CKSUM
	pool2        ONLINE       0     0     0
	  gpt/log1   ONLINE       0     0     0
	  gpt/disk0  ONLINE       0     0     0
	logs
	  gpt/disk1  ONLINE       0     0     0

errors: No known data errors
```

I redid the battery of transfers, but it still passed. All is still well. Why?

I then gave NFS access to our test ESXi's, mounted it as a datastore, started a Storage vMotion of one of the guests and reset the server during that migration... Still OK. Why?

So from within a guest, I started a buildworld process and reset the server; the guest just paused, the server came back up A-OK, and then the guest resumed as if nothing had happened. Why?

I mean, I'm actually trying to do wrong here; how hard can it be?

Suggestions are most welcome!

/Sebulon


----------



## danbi (Mar 12, 2012)

You are trying to make ZFS lose data? Something it is designed not to do.

About the only way you can achieve this is if you have VERY faulty hardware.


----------



## AndyUKG (Mar 12, 2012)

Hi,

  regarding trying to cause a ZFS data error by resetting the ZFS server or attached storage. You shouldn't need a battery backed disk so long as the disks do not ignore cache flush commands sent from ZFS.
Do you have any reason to believe that the Vertex won't do cache flushes when requested?

thanks Andy.


----------



## Sebulon (Mar 13, 2012)

@danbi

Really? Ahoy, Captain Obvious!

@AndyUKG

And the operative word in that sentence is "should". I mean, what good does a cache flush command do when there's no power to flush its caches with?

How come everyone else, including Sun/Oracle, uses battery backing for their logs? Well, in most cases it's about battery-backed RAM, which would be useless otherwise, but if a consumer-grade SSD would do the same job, even without battery backing, then I'm having a hard time understanding why they don't just use that instead. Imagine the savings for Oracle if they ditched the Zeuses for ordinary Vertexes, and got better performance at the same time.

I had thought that you definitely needed battery backing to maintain a consistent ZIL. Say you build a database NAS, or a VMware datastore, export it over NFS, and once you've gone into production that database/datastore has grown to 50TB; then you have a power outage, which does happen to everyone from time to time. What happens then?

For everyone who wants to build mission-critical systems based on FreeBSD and ZFS, is it really enough to rely on a non-battery-backed SSD?

/Sebulon


----------



## danbi (Mar 14, 2012)

Sometimes, the obvious is the answer, no?

You need large capacitors for FLASH storage in order to prevent the FLASH drive from being *destroyed* by a power failure.

Obeying cache sync is a different thing. If your drive obeys cache sync, then ZFS can be confident that critical data is already on stable storage, and thus can fulfill its promise to keep your data safe.

If you have faulty hardware that claims to, but does not, support cache sync, you will get corrupt data and ZFS can't help you.

All these things are obvious, of course 

PS: By the way, at any rate, in any scenario, battery-backed RAM for the ZIL is way, way faster and higher-performing than any SSD could be, now or in the future. Perhaps this is why people are using it at the higher end.


----------



## AndyUKG (Mar 14, 2012)

Sebulon said:
			
		

> @AndyUKG
> 
> And the operative word in that sentence is "should" I mean, what good does a cache flush command do when thereÂ´s no power to flush itÂ´s caches with?



Well, when there's no power, it isn't flushed and the ZFS write transaction isn't complete; end of story. ZFS is still in a consistent state, but doesn't have any transactions relating to the last flush that didn't happen. Same idea as a database.
As you said, battery backup is a must for RAM disks or RAM write-back cache. It's not technically required for disks (spinning or flash).



			
				Sebulon said:
			
		

> How come everyone else, including SUN/Oracle uses battery-backing for their logs? Well, in most cases itÂ´s about battery-backed RAM, which would be useless otherwise, but if a consumer-grade SSD would do the same job, even without battery-backing, then IÂ´m having a hard time understanding why they donÂ´t just use that instead. Imagine the savings for Oracle if they ditched the ZEUSÂ´s for ordinary VertexÂ´s, and getting better performance at the same time.
> 
> I have thought that you definitely needed battery-backing to maintain a consistant ZIL. Say you build a database NAS, or a VMWare datastore, export over NFS and once youÂ´ve gone into production and that database/datastore has grown to 50TB- then you have a power-outage, which does happen to everyone from time to time. What happens then?
> 
> ...



I haven't read up enough on it to know why Oracle use what they use; I guess reliability. I'd still expect the Vertex to work correctly most of the time, but if in 1 in 1000 or 10000 power outages it doesn't store the data correctly, that's enough for Oracle to choose a different, more reliable technology. It also explains why you can't intentionally create a data error by turning the power off on your test rig...

cheers Andy.


----------



## peetaur (Mar 14, 2012)

Instead of dd, use gdd with the conv=sync option. Normally the "sync" doesn't happen until the end of the file. Without sync happening, the client never discards uncleanly sent data, so it is ready to resend whatever is lost when the server reappears.

Or you could try mounting NFS with the "intr", "timeout" and "deadthresh" options, so the client drops when the server resets.


```
export PAGER=less
```
(how I hate how the FreeBSD version of "more" quits when it hits the end of the document...)

```
man mount_nfs
```


```
deadthresh=<value>
        Set the ``dead server threshold'' to the specified number
        of round trip timeout intervals before a ``server not
        responding'' message is displayed.

intr    Make the mount interruptible, which implies that file
        system calls that are delayed due to an unresponsive
        server will fail with EINTR when a termination signal is
        posted for the process.

timeout=<value>
        Set the initial retransmit timeout to the specified
        value.  May be useful for fine tuning UDP mounts over
        internetworks with high packet loss rates or an over-
        loaded server.  Try increasing the interval if nfsstat(1)
        shows high retransmit rates while the file system is
        active or reducing the value if there is a low retransmit
        rate but long response delay observed.  (Normally, the
        dumbtimer option should be specified when using this
        option to manually tune the timeout interval.)
```
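Putting those mount_nfs(8) options together, the client mount for the reset test might look something like this (a sketch; the threshold and timeout values are arbitrary guesses, not tuned recommendations):

```shell
# Interruptible NFS mount with a low dead-server threshold, so the
# client notices quickly (and can be interrupted) when the server resets
mount -t nfs -o intr,deadthresh=2,timeout=300 \
    10.10.0.12:/export/perftest /mnt/tank/perftest
```

With intr set, a hung dd against the dead server can be killed with a signal instead of blocking forever.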


----------



## TheBang (Mar 15, 2012)

Sebulon, if I could ask, where did you purchase the Deneva 2 R Series?  I've been looking to get one for a ZIL SLOG, but no one seems to have them in stock.  Thanks!


----------



## Sebulon (Mar 19, 2012)

@TheBang

Of course you can

Correct, no one has them currently. We asked Dustin, a reseller in Sweden, to order one in from the US, so it was a special order. But if more people start asking, they might order in more to have in stock.

/Sebulon


----------



## ghandalf (Mar 30, 2012)

Hi,

This is a very nice and informative thread!

I have one question: did you ever try over-provisioning the SSDs and retesting them? For the SLOG device, there is no need for more than 8-10GB. With over-provisioning, you could maybe gain some improvement in write throughput. If you still have the OCZ Deneva, I would be very interested in the over-provisioned performance!

Maybe, you could read these reviews:
http://www.storagereview.com/smart_storage_systems_xceedstor_500s_enterprise_ssd_review
http://www.storagereview.com/intel_ssd_520_enterprise_review
In both reviews, they test the performance with and without over-provisioning, and the influence is enormous.

Regards ghandalf


----------



## Sebulon (Apr 2, 2012)

@ghandalf

Nice tip, thanks!

It didn't make any difference in performance compared to what I had the last time I benchmarked it (67MB/s), but perhaps it'll help maintain it. I ran dd with zeros over the whole drive, repartitioned it with the same start boundary but made only a 48GB partition for the SLOG and left the rest empty. Why 48GB? I provisioned it for 4x10GbE (a guy can dream, right?).

/Sebulon


----------



## ghandalf (Apr 3, 2012)

Sebulon said:
			
		

> @ghandalf
> 
> Nice tip, thanks!
> 
> ...



Hi,

How did you do the over-provisioning? I read that it is not enough to make only a smaller partition. I read some articles about OP that describe how to do it with Linux, but unfortunately they are in German. You can do this with hdparm, but I don't know if hdparm is available in FreeBSD!


```
root@ubuntu-10-10:~# hdparm -N /dev/sdb

/dev/sdb:
 max sectors   = 312581808/312581808, HPA is disabled
root@ubuntu-10-10:~#
```
Here you can see that HPA (host protected area) is disabled.

With this command, you can enable it:

```
root@ubuntu-10-10:~# hdparm -Np281323627 /dev/sdb

/dev/sdb:
 setting max visible sectors to 281323627 (permanent)
Use of -Nnnnnn is VERY DANGEROUS.
You have requested reducing the apparent size of the drive.
This is a BAD idea, and can easily destroy all of the drive's contents.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
Program aborted.
root@ubuntu-10-10:~# hdparm -Np281323627 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb:
 setting max visible sectors to 281323627 (permanent)
 max sectors   = 281323627/312581808, HPA is enabled
```

Real enterprise SSDs use approx. 28% OP, and in the benchmarks they use up to 90% OP.

Maybe you can retest the SSD?! :e

Regards ghandalf


----------



## Sebulon (Apr 4, 2012)

@ghandalf

OMG that has to be the coolest command-flag EVER!

Okay, yeah, I understand; just partitioning less might not do the trick. I'll try installing the drive in a Linux box and run hdparm from there. Just a question about the hdparm command: if I wanted to have only 48GB of usable space on it afterwards, would this command be correct?

`# hdparm -Np49152000 --yes-i-know-what-i-am-doing /dev/sdX`

/Sebulon


----------



## ghandalf (Apr 4, 2012)

@Sebulon,

I think it is calculated this way:

48GB -> bytes = 51539607552 bytes

You need sectors:
51539607552 bytes / 512 bytes/sector = 100663296 sectors.

BUT: I don't know if the SSD has 512-byte or 4096-byte sectors.
When you issue the command:

```
hdparm -N /dev/sdX
```
You will see how many sectors you have.
An example calculation:
A 160GB Intel 320 SSD has 312581808 sectors.
So 312581808 sectors * 512 bytes/sector = 160041885696 bytes => 149.05 GiB usable space!

You should also note the max sector count, so that you can reset the drive to factory defaults later.

I really hope that there is a gain in performance! :beergrin 

Regards ghandalf


----------



## TheBang (Apr 5, 2012)

According to OCZ technical support, doing it the way Sebulon initially did, creating a small partition and leaving the rest unpartitioned, is called "manual over-provisioning" and provides the same benefits as doing it the HPA way:

http://www.ocztechnologyforum.com/f...over-provision&p=622788&viewfull=1#post622788
http://www.ocztechnologyforum.com/f...How-to-increase-overprovisioning-on-SF-drives
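In FreeBSD terms, that manual variant is just a single small partition on an otherwise empty drive; a sketch (the device name da0, the 1m alignment, and the log1 label are assumptions, matching the 2048-sector start boundary seen in the gpart output earlier):

```shell
# Destroy any old partition table, create a fresh GPT, and add one
# 48 GB partition; the rest of the SSD is deliberately left
# unpartitioned so the controller can use it as spare area
gpart destroy -F da0
gpart create -s gpt da0
gpart add -t freebsd-zfs -a 1m -s 48g -l log1 da0
```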


----------



## t1066 (Apr 21, 2012)

Just found the following article. Basically, it says that sync writing to a separate ZIL is done at a queue depth of 1: ZFS sends sync write requests to the log drive one at a time and waits for each write to finish before sending out another. (It also works in a round-robin way, similar to how cache devices work; hence, striped log devices would not help.) So the relevant figure is the IOPS at queue depth 1, not the maximum IOPS.


----------



## Sebulon (May 2, 2012)

@ghandalf+TheBang

Somewhat delayed, here is the result: it didn't make any difference. Just wanted you to know that.

I still have my SLOG HPA'd down to 48GB, and even though there wasn't any difference in performance, I'm going to keep it there, since I don't need more space on that disk anyway. Nice tip though; it was definitely worth a shot.

/Sebulon


----------



## peetaur (Jun 14, 2012)

Have you tried this RAM based SSD? http://www.stec-inc.com/product/zeusram.php

I heard on IRC that even ESXi, the most horrible case possible, goes at least 200 MB/s over a 10Gbps network with sync enabled and no server hacks to disable O_SYNC.


----------



## Sebulon (Jun 14, 2012)

@peetaur

I have, actually. We were fortunate to have a Sun 7300 with two STEC Zeus IOPS installed in one of its JBODs for our VMware storage. Later, when we were about to decommission it, I pulled one out and installed it in the same server where I have tested the rest of the drives. If you read the high score back in post #1, it was actually bested by quite a lot, both locally and remotely.

Perhaps they should have a sticker saying _"Results may vary"_?

But there may be quite big differences in HW and networking that could affect the outcome. I have performed all my tests with 1GbE, whereas you mentioned 10GbE. I'm guessing that the Zeus performs very differently depending on the controller and driver used, due to its special nature.

/Sebulon


----------



## peetaur (Jun 14, 2012)

The Zeus IOPS is a flash array based SSD... I'm talking about a RAM based one.


----------



## Sebulon (Jun 14, 2012)

@peetaur

Oh sorry, I must have read you wrong. In that case no, I have not tested with a STEC RAM-based SSD.

I did however buy an ACARD 5.25" SATA-II DDR2 RAM SSD (ANS-9010BA), and it worked horribly, haha, it was really crappy. After starting a write to the device with dd or similar, the messages log was flooded with write DMA errors and then the drive vanished from the OS.

/Sebulon


----------



## peetaur (Jun 14, 2012)

Sebulon said:

> @peetaur
> ...I have not tested with a STEC RAM based SSD
> 
> ...ACARD 5.25" SATA-II SSD RAM DDR2 - ANS-9010BA, and it worked horribly, ...DMA errors and then it vanished...



I kind of think you should test your ACARD RAM with memtest, or return it and get a different one.

And I can't wait to hear your test results with the Zeus RAM-based SSD. I am tempted to buy one for my VM datastore server, but first I'm testing some NFS kernel tuning. See: http://lists.freebsd.org/pipermail/freebsd-fs/2012-March/013994.html

So far, I've achieved 79 MB/s read, 75.8 MB/s write, and 51 MB/s sync write from the guest OS. With the defaults, it was under 40 MB/s. But in all cases it was above 100 MB/s without virtual machines (all sequential, with dd).

And unlike in the mailing list post, my NFS client doesn't fail, so I don't need a Linux client. Possibly this is because I increased the memory buffers, as in this guide: https://calomel.org/network_performance.html

Soon I'll test that on a 10 Gbps link, just need to move some hardware around.


----------



## Sebulon (Jun 14, 2012)

peetaur said:

> And I can't wait to hear your test results with the Zeus RAM-based SSD.



No problem, if you're paying for it.  Because I don't have anywhere near that dough, I can tell you for sure. In case you thought differently, the money to test all this comes out of my own pocket (except for the Zeus of course, got lucky with that one), so saving that up is going to take a while, haha!

The next disk I'm eager to get my hands on is the OCZ Vertex 4 (either 128 or 256GB, not sure), which sadly comes without a supercap, but I'd like to test it anyway to see what kind of performance to expect from the next-generation Deneva.

Now, it's not at all certain that the Deneva 3 is going to have the same specs as the Vertex 4, but that's what they did with the previous generation: they let the Vertex 3 live in the real world for a while, and when all the childhood diseases were cured, they took the Vertex 3's HW and FW, added a supercap and called it the Deneva 2. That's probably what they'll do with the Deneva 3 as well.

/Sebulon


----------



## kvolaa (Dec 17, 2012)

*Best cheap profi SSD for ZIL*

Hi to all,

Do you know the OWC Mercury Pro 6G SSDs? They're the only supplier I know of that shows their enterprise SSD prices publicly. They sell them through their own web shop (maybe exclusively).

I'm using an OWC Mercury Pro 6G 50GB as a ZIL. You can get one for $400. It's fantastic.
It does over 60MB/s of sequential 4KB writes (qd=1), with an endurance rating of 730 TB. Over-provisioning (OP) is set to 28%; of course, you can use 'hdparm -N' to set it higher and get even longer endurance. It has capacitors and a 7-year warranty.

So for a ZIL it really is a dream (and relatively cheap). 

It's there http://eshop.macsales.com/shop/SSD/OWC/Mercury_6G/Enterprise

Reviews are there:
http://thessdreview.com/our-reviews...and-lsi-combine-for-a-great-enterprise-entry/
http://www.storagereview.com/owc_mercury_enterprise_pro_6g_ssd_review

Another good drive (with the same price tag as the OWC Mercury Pro) is the Intel 710. It has capacitors too, but it is slower than the Mercury and has roughly 1/3 of its endurance (the 100GB model is rated for 500 TB; the 50GB Mercury for 730 TB).
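
As a sanity check on those endurance figures, dividing the rated endurance by the capacity gives the implied number of full-drive writes (plain arithmetic on the numbers quoted above, decimal units throughout):

```shell
# Implied full-drive writes: rated endurance (TB) * 1000 / capacity (GB).
echo "Mercury 50GB:    $(( 730 * 1000 / 50 )) full-drive writes"    # 14600
echo "Intel 710 100GB: $(( 500 * 1000 / 100 )) full-drive writes"   # 5000
```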

In my opinion it's the best option if you can't use a DDRdrive or ZeusRAM, i.e. a RAM-based SSD. With OP it can last for years. Mine is set to 20GB (the second Mercury drive I have is still being capability-tested, so maybe I can raise the OP, shrinking it to a 10GB drive, and get even more endurance).

You can't stripe the ZIL for higher speed, because of qd=1, but it is possible (and often done) for longer endurance.

In tests you must use dd(1) with the conv=sync flag, so the writes behave like ZIL writes. Look at the ZFS source code; it's free (the best documentation).

Right now our production use of ZFS is primarily for PostgreSQL Plus databases, i.e. caching zpools for them. I can't test any NFS numbers at the moment. But PostgreSQL shines on ZFS!

But right now I'm building a new filer for our ESXi and RHEV (Red Hat KVM hypervisor) machines. We plan to use VDI next year, so we are testing cheap technology (relative to ... Oracle ... gold is cheaper).
It's a dual Xeon E5-2600-class machine with 256GB of RAM (RAM is cheap and we plan to use deduplication heavily). Mainboard and case from Supermicro. The processors are E5-2650s, 8 cores each (maybe overkill, but dedup, compression, ... and we have the same model in our VM hosts too, so we want one type).
A bunch of LSI92xx SAS2 HBAs and a dual-port 10GbE card (the board itself has quad Intel 1GbE).
Only the boot, system and swap drives (SSDs) are internal.

All the other drives I want in external (Supermicro) JBOD cases (for ease of servicing, server replacement and so on). Lots of SSDs, Mercury Pro 6G 100-400GB; it's the best and relatively cheap.

Why JBOD? Because I hate RAIDs. Hehe, not really, but classic HW RAIDs, especially RAID5/6, are completely dead now that ZFS is here. Because we plan to buy SATA drives to stay on budget, there may be SAS/SATA expanders too (so I can use more JBOD cases in the future without running out of SAS2/SATA HBAs).

As the SAN we are using SAS2 switches, a nice little-known thing: "from DAS (direct attached storage) to SAN". It's better, much cheaper and much faster than 8G FC, 10GbE or FCoE. From LSI, the LSISAS6160 switch: dual (HA config) multipath 24Gbit/s connections. Super. Cheap. Flexible. Simple. No additional protocol layers in the way (no encapsulation/translation like SCSI commands/FC/eth/switch/eth/FC/SCSI commands; simply SAS/switch/SAS, with the commands flowing straight over the wires). Try it.
The only con is the 2m cable range (10m cables exist, but they must be active and are more expensive).

So, I don't have recent first-hand experience with the practical daily use of a ZFS-based NAS (NFS, iSCSI, CIFS) in the enterprise; our ESXi hosts are fed through SAS2 directly. So one of my planned tests is using a SAS2 HBA as a target, i.e. building my own ZFS-based SAS2 storage (a SAS2 "RAID storage box") and testing ZFS speeds.
As the operating system: maybe Linux (preferably RHEL), Solaris|OpenSolaris|OpenIndiana|..., or FreeBSD, whichever wins the competition in speed and reliability.

I also want to test FCoE, iSCSI, NFS and CIFS, to benchmark all of them against the SAS2 fabric/network.
So next time I hope I can present some numbers, especially NFS and iSCSI speeds.

Cheers


----------



## TheBang (Dec 18, 2012)

Thanks for the pointer to the OWC drive.  It looks like it has pretty good specs for a ZIL SLOG device (especially the all-important supercap).  Nice to have an alternative out there.  We've been using the Deneva 2 R Series, which has worked pretty well, and was readily available for a while.  It seems to be difficult to source again now though.

However, for the future, I think we will be standardizing on the Intel SSD DC S3700.  It is the true successor to the Intel 710, and from the early reviews it looks like one of the best, most consistent, best-performing and most affordable enterprise SSDs out there.

http://www.anandtech.com/show/6433/intel-ssd-dc-s3700-200gb-review

It has the power-loss protection necessary for the ZIL, very good sequential write speed, consistent performance over its life, a 5-year warranty, and it's only $235 MSRP for a 100 GB drive.  Sounds like a winner to me.  This should be the drive of choice for affordable SATA SSD ZIL SLOG devices.


----------



## Sebulon (Dec 18, 2012)

@kvolaa

Hi, thanks for sharing! The specs seem rather nice. I see it's using more or less the same SandForce controller as the Deneva 2, so performance should be comparable. And thanks for the tip about "conv=sync". I'll make sure to use that in future benchmarking.

Cool to hear you talking about RHEV as well. Are you paying RH for RHEV-M or are you running oVirt? I've done some benchmarking from inside different guests with VirtIO HW and I've experienced some kind of logical boundary on writes. Both hosts and storage are connected via 2x1GbE LACP, and when the hosts themselves are doing storage-related tasks, they can easily saturate that connection at 2Gbps read and write. But when I installed a Ubuntu guest and ran bonnie++, write IO was less than 1Gbps, while read IO was a full 2Gbps. Please share any tests you make from inside a guest, since that is the experience a "customer" receives from the system as a whole.

@TheBang

That drive has been on my radar as well; it looks very promising. I don't remember finding any benchmark of 4k write IOPS at QD=1, or write latency, for it though... Either go with that, or maybe wait for a Deneva 3.

If one is worried about the budget, maybe a Vertex 4 without disabling cache flushing would also work.

/Sebulon


----------



## jrm@ (Dec 18, 2012)

*IOR benchmarks*


```
% ./IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Tue Dec 18 16:09:32 2012
Command line used: ./IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/
Machine: FreeBSD awarnach.mathstat.dal.ca

Summary:
        api                = MPIIO (version=2, subversion=2)
        test filename      = /mnt/archives/
        access             = file-per-process
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 2 GiB
        aggregate filesize = 2 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          60.95      60.95       60.95      0.00      60.95      60.95       60.95      0.00  33.59947   EXCEL
read           87.60      87.60       87.60      0.00      87.60      87.60       87.60      0.00  23.37930   EXCEL

Max Write: 60.95 MiB/sec (63.91 MB/sec)
Max Read:  87.60 MiB/sec (91.85 MB/sec)

Run finished: Tue Dec 18 16:10:29 2012
```

Hardware/Configuration

Asus RS300-E7-PS4 1U Server 
E3-1230V2 Xeon CPU
32GB Memory
4 x Intel 60GB SSD (520 Series)
LSI 9205-8e SAS Controller
Supermicro SC847E16-RJB0D1
10 x WD30EFRX 3TB Hard Drive

FreeBSD 9.3-RC3 is installed on a zfs mirror using two of the SSDs.
One SSD is used for a ZIL (will eventually mirror the ZIL as well) and the other SSD is used for L2ARC.
I created a raidz3 zpool with nine of the 3TB drives.
The file system I used to test with was created with:
`# zfs create -o sharenfs="root,network 192.168.0.0,mask 255.255.255.0" -o atime=off -o compression=on -o setuid=off storage/archives` (note that zfs create takes a dataset name, without the leading slash)

The NFS share was mounted on an 8.3 system.  IOR complained without the nolockd option.
`# mount_nfs -o nolockd 192.168.101:/storage/archives /mnt/archives`

I used IOR 2.10.3.
To compile IOR I had to add these lines to src/C/Makefile.config

```
####################
# FreeBSD SETTINGS #
####################
CC.FreeBSD = mpicc
CCFLAGS.FreeBSD = -g -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
LDFLAGS.FreeBSD = -lmpich
HDF5_DIR.FreeBSD =
NCMPI_DIR.FreeBSD =
```
and I had to modify src/C/utilities.c to replace #include <sys/statfs.h> with

```
#include <sys/param.h>
#include <sys/mount.h>
```


----------



## Sebulon (Dec 19, 2012)

@jrm

Thank you for these numbers! I'm wondering about your choice of benchmarking tool: is IOR any "better" or "worse" than the more commonly used bonnie++? What does the network between server and client look like? Any jumbo frames, lagg?

Another fun thing I usually do while benchmarking is watching gstat to see how stressed the SLOG gets. And how was the pool set up with regard to partition alignment and ashift?

/Sebulon


----------



## jrm@ (Dec 19, 2012)

Hello @Sebulon;

The reason I used IOR was that a system administrator I know set up a similar storage system, ran his benchmarks with IOR, and I wanted to compare results.  I think he mentioned he went with IOR because of the MPI support, but I have little experience with either benchmark program.  The differences with his setup are that he is mounting from a GNU/Linux box, and on the storage side he chose FreeBSD 8.3 / NFSD (v3) with async and no ZIL.  Here are his benchmark results.


```
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Mon Dec 10 16:35:30 2012
Command line used: IOR-gcc -a MPIIO -t 1M -b 2G -i 1 -F -o /scratch1/test
Machine: Linux

Summary:
        api                = MPIIO (version=1, subversion=2)
        test filename      = /scratch1/test
        access             = file-per-process
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 2 GiB
        aggregate filesize = 2 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)
---------  ---------  ---------  ----------   -------  --------- ---------  ----------   -------  --------
write          95.26      95.26       95.26      0.00      95.26 95.26       95.26      0.00  21.50000   EXCEL
read           75.19      75.19       75.19      0.00      75.19 75.19       75.19      0.00  27.23828   EXCEL

Max Write: 95.26 MiB/sec (99.88 MB/sec)
Max Read:  75.19 MiB/sec (78.84 MB/sec)

Run finished: Mon Dec 10 16:36:19 2012
```

I haven't tinkered with the network yet.  Right now there is a single gigabit line going through two switches.  The Asus RS300-E7-PS4 has four gigabit ports, so I plan to play with the network.  I'll post more benchmark results before/after network tweaks with IOR and bonnie++.

I did pay attention to alignment when I set things up (both the storage pool and the pool the OS is installed on).  Here is how I created the pool.


```
DISKS=""
for i in `seq 0 8`; do DISKS="$DISKS da$i"; done

# Label each data disk so the pool survives device renumbering.
for I in ${DISKS}; do
        NUM=$( echo ${I} | tr -c -d '0-9' )
        glabel create storage_disk${NUM} /dev/${I}
done
glabel create spare_drive0 /dev/da9

# Wrap one member in a 4096-byte-sector gnop so zpool create picks ashift=12.
gnop create -S 4096 /dev/label/storage_disk0

zpool create storage raidz3 /dev/label/storage_disk0.nop /dev/label/storage_disk1 ... /dev/label/storage_disk8
zpool export storage
gnop destroy /dev/label/storage_disk0.nop
zpool import -d /dev/label storage
```

I see that the ashift for the pool is indeed 12, but, now that you have caused me to take another look, I see the ashift for the ZIL is 9.  Hopefully fixing the alignment on the ZIL will give a performance bump.  I don't think I paid attention to alignment on the L2ARC either.


----------



## jrm@ (Dec 20, 2012)

I 4k aligned the ZIL and L2ARC and the performance, according to IOR, actually dropped a little.


```
zpool remove storage label/zil
zpool remove storage label/l2arc
gnop create -S 4096 label/zil
gnop create -S 4096 label/l2arc
zpool add storage log label/zil.nop
zpool add storage cache label/l2arc.nop
zpool export storage
gnop destroy label/zil.nop
gnop destroy label/l2arc.nop
zpool import storage

zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        storage                  ONLINE       0     0     0
          raidz3-0               ONLINE       0     0     0
            label/storage_disk0  ONLINE       0     0     0
            label/storage_disk1  ONLINE       0     0     0
            label/storage_disk2  ONLINE       0     0     0
            label/storage_disk3  ONLINE       0     0     0
            label/storage_disk4  ONLINE       0     0     0
            label/storage_disk5  ONLINE       0     0     0
            label/storage_disk6  ONLINE       0     0     0
            label/storage_disk7  ONLINE       0     0     0
            label/storage_disk8  ONLINE       0     0     0
        logs
          label/zil              ONLINE       0     0     0
        cache
          label/l2arc            ONLINE       0     0     0

errors: No known data errors
```


```
% IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/            
IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Thu Dec 20 00:54:57 2012
Command line used: ./IOR -a MPIIO -t 1M -b 2G -i 1 -F -o /mnt/archives/
Machine: FreeBSD awarnach.mathstat.dal.ca

Summary:
        api                = MPIIO (version=2, subversion=2)
        test filename      = /mnt/archives/
        access             = file-per-process
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 2 GiB
        aggregate filesize = 2 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)   Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)   Std Dev  Mean (s)  
---------  ---------  ---------  ----------   -------  ---------  ---------  ----------   -------  --------
write          57.83      57.83       57.83      0.00      57.83      57.83       57.83      0.00  35.41387   EXCEL
read           86.74      86.74       86.74      0.00      86.74      86.74       86.74      0.00  23.61122   EXCEL

Max Write: 57.83 MiB/sec (60.64 MB/sec)
Max Read:  86.74 MiB/sec (90.95 MB/sec)

Run finished: Thu Dec 20 00:55:57 2012
```

Am I missing something?  Why would the performance go down compared to the results when the SSDs for the ZIL and L2ARC weren't 4k aligned?


----------



## Sebulon (Dec 20, 2012)

@jrm

The results always fluctuate a little, so what I do is run all tests three times and post the middle one as the score. So your results were more or less the same.

Your colleague's benchmark is rather invalid since it was performed asynchronously; it's good to see as a "best case", but it only reflects the theoretical speed of the network as such.

/Sebulon


----------



## jrm@ (Dec 20, 2012)

I actually did run the tests several times: all the results were within a fraction of a MB/s, and the speeds were consistently slower after the 4K alignment of the ZIL and L2ARC SSDs.  Perhaps a confounding variable is that the pool had almost nothing on it when I ran the benchmarks before, and about 200GB when I ran them after updating the alignment.


----------



## Sebulon (Oct 1, 2013)

Time for an update on a new addition to our ZIL family: the Intel DC S3700 200GB MLC!


```
[B][U]HW[/U][/B]
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Xeon X3470 2.93GHz
4x  8GB 1333MHZ DDR3 ECC RDIMM
10x ST4000DM000-1F2168
2x  Intel S3700 200GB
1x  OCZ Deneva 2 R-series 200GB
1x  OCZ Vertex 3 240GB

[B][U]SW[/U][/B]
[CMD="#"]uname -a[/CMD]
FreeBSD server 9.1-RELEASE-p7 FreeBSD 9.1-RELEASE-p7 #0 r255487M: Thu Sep 12 08:26:57 CEST 2013     root@server:/usr/obj/usr/src/sys/server  amd64
[CMD="#"]zpool get version pool2[/CMD]
NAME   PROPERTY  VALUE    SOURCE
pool1  version   28       default
[CMD="#"]zpool status[/CMD]
  pool: pool1
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Sep 29 03:30:10 2013
config:

	NAME          STATE     READ WRITE CKSUM
	pool1         ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    gpt/usb1  ONLINE       0     0     0
	    gpt/usb2  ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
  scan: resilvered 0 in 0h40m with 0 errors on Tue Oct  1 11:25:30 2013
config:

	NAME             STATE     READ WRITE CKSUM
	pool2            ONLINE       0     0     0
	  raidz2-0       ONLINE       0     0     0
	    gpt/rack1-1  ONLINE       0     0     0
	    gpt/rack1-2  ONLINE       0     0     0
	    gpt/rack1-3  ONLINE       0     0     0
	    gpt/rack1-4  ONLINE       0     0     0
	    gpt/rack1-5  ONLINE       0     0     0
	    gpt/rack2-1  ONLINE       0     0     0
	    gpt/rack2-2  ONLINE       0     0     0
	    gpt/rack2-3  ONLINE       0     0     0
	    gpt/rack2-4  ONLINE       0     0     0
	    gpt/rack2-5  ONLINE       0     0     0
	logs
	  mirror-1       ONLINE       0     0     0
	    gpt/log1     ONLINE       0     0     0
	    gpt/log2     ONLINE       0     0     0
	cache
	  gpt/cache1     ONLINE       0     0     0
	  gpt/cache2     ONLINE       0     0     0
[CMD="#"]zdb | grep ashift[/CMD]
            ashift: 12
            ashift: 12
            ashift: 12
[CMD="#"]camcontrol devlist[/CMD]
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 0 lun 0 (da0,pass0)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 1 lun 0 (da1,pass1)
<ATA INTEL SSDSC2BA20 0265>        at scbus0 target 3 lun 0 (da2,pass2)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 4 lun 0 (da3,pass3)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 5 lun 0 (da4,pass4)
<ATA ST4000DM000-1F21 CC52>        at scbus0 target 6 lun 0 (da5,pass5)
<ATA INTEL SSDSC2BA20 0265>        at scbus0 target 7 lun 0 (da6,pass6)
<ATA ST4000DM000-1F21 CC52>        at scbus1 target 0 lun 0 (da7,pass7)
<ATA ST4000DM000-1F21 CC52>        at scbus1 target 1 lun 0 (da8,pass8)
<ATA ST4000DM000-1F21 CC52>        at scbus1 target 2 lun 0 (da9,pass9)
<ATA OCZ-VERTEX3 2.15>             at scbus1 target 3 lun 0 (da10,pass10)
<ATA ST4000DM000-1F21 CC52>        at scbus1 target 4 lun 0 (da11,pass11)
<ATA ST4000DM000-1F21 CC52>        at scbus1 target 5 lun 0 (da12,pass12)
<ATA D2RSTK251M11-020 E>           at scbus1 target 6 lun 0 (da13,pass13)
<Kingston DataTraveler SE9 PMAP>   at scbus6 target 0 lun 0 (da14,pass14)
<Kingston DataTraveler SE9 PMAP>   at scbus7 target 0 lun 0 (pass15,da15)
[CMD="#"]gpart show -l[/CMD]
=>        34  7814037101  da0  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack2-1  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da1  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack2-2  (3.7T)
  7814037128           7       - free -  (3.5k)

=>       34  390721901  da2  GPT  (186G)
         34       2014       - free -  (1M)
       2048  100663296    1  log2  (48G)
  100665344   33554432    2  swap2  (16G)
  134219776  256502159       - free -  (122G)

=>        34  7814037101  da3  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack2-3  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da4  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack2-4  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da5  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack2-5  (3.7T)
  7814037128           7       - free -  (3.5k)

=>       34  390721901  da6  GPT  (186G)
         34       2014       - free -  (1M)
       2048  100663296    1  log1  (48G)
  100665344   33554432    2  swap1  (16G)
  134219776  256502159       - free -  (122G)

=>        34  7814037101  da7  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack1-3  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da8  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack1-4  (3.7T)
  7814037128           7       - free -  (3.5k)

=>        34  7814037101  da9  GPT  (3.7T)
          34        2014       - free -  (1M)
        2048  7814035080    1  rack1-5  (3.7T)
  7814037128           7       - free -  (3.5k)

=>       34  468862061  da10  GPT  (223G)
         34       2014        - free -  (1M)
       2048  468860040     1  cache1  (223G)
  468862088          7        - free -  (3.5k)

=>        34  7814037101  da11  GPT  (3.7T)
          34        2014        - free -  (1M)
        2048  7814035080     1  rack1-1  (3.7T)
  7814037128           7        - free -  (3.5k)

=>        34  7814037101  da12  GPT  (3.7T)
          34        2014        - free -  (1M)
        2048  7814035080     1  rack1-2  (3.7T)
  7814037128           7        - free -  (3.5k)

=>       34  390721901  da13  GPT  (186G)
         34       2014        - free -  (1M)
       2048  390719880     1  cache2  (186G)
  390721928          7        - free -  (3.5k)

=>      34  15356093  da14  GPT  (7.3G)
        34       128     1  (null)  (64k)
       162      1886        - free -  (943k)
      2048  14680064     2  usb1  (7.0G)
  14682112    674015        - free -  (329M)

=>      34  15470525  da15  GPT  (7.4G)
        34       128     1  (null)  (64k)
       162      1886        - free -  (943k)
      2048  14680064     2  usb2  (7.0G)
  14682112    788447        - free -  (385M)
```


```
[B][U]LOCAL WRITES[/U][/B]
128k)  295MB/s
4k)    58MB/s
```


```
Connection speed test from a KVM virtual machine running FreeBSD-9.2-RELEASE.
[CMD="#"]iperf -c 10.10.0.12[/CMD]
------------------------------------------------------------
Client connecting to 10.10.0.12, TCP port 5001
TCP window size: 32.5 KByte (default)
------------------------------------------------------------
[  3] local 10.10.0.10 port 63461 connected with 10.10.0.12 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   952 Mbits/sec
```


```
[B][U]OVER NFS[/U][/B]
sync)   68528896 bytes/sec (65 MB/s)
async)  69435545 bytes/sec (66 MB/s)
```

The tests over NFS were performed like this:

```
[CMD="#"]dd if=/dev/random of=rndfile bs=1m count=2048[/CMD]
[CMD="#"]dd if=rndfile of=/dev/zero bs=1m[/CMD]
1766810021 (1684 MB/s)

[B]sync)[/B]
[CMD="#"]mount 10.10.0.12:/export/perftest /mnt/perftest[/CMD]
[CMD="#"]dd if=rndfile of=/mnt/perftest/rndfile bs=1m[/CMD]
[CMD="#"]dd if=rndfile of=/mnt/perftest/rndfile bs=1m[/CMD]
[CMD="#"]dd if=rndfile of=/mnt/perftest/rndfile bs=1m[/CMD]
[CMD="#"]umount /mnt/perftest[/CMD]
[B]async)[/B]
[CMD="#"]mount -o async 10.10.0.12:/export/perftest /mnt/perftest[/CMD]
[CMD="#"]dd if=rndfile of=/mnt/perftest/rndfile bs=1m[/CMD]
[CMD="#"]dd if=rndfile of=/mnt/perftest/rndfile bs=1m[/CMD]
[CMD="#"]dd if=rndfile of=/mnt/perftest/rndfile bs=1m[/CMD]
[CMD="#"]umount /mnt/perftest[/CMD]
```
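
The MB/s values quoted under OVER NFS are just dd's bytes/sec output converted to binary megabytes, e.g.:

```shell
# Convert dd's bytes/sec figure into the MB/s number quoted above.
BYTES_PER_SEC=68528896
echo "$(( BYTES_PER_SEC / 1024 / 1024 )) MB/s"   # prints "65 MB/s"
```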

So what a guest VM can expect to get is still about 60MB/s. The high score has been updated.

/Sebulon


----------



## usdmatt (Oct 1, 2013)

Has anyone ever tried playing with the settings (disclaimer: less 'settings' than 'fixed source code constants') mentioned in these messages?

http://lists.freebsd.org/pipermail/freebsd-fs/2012-March/013994.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-June/017519.html

I assume there's a good reason these are set the way they are, but these users appeared to get fairly substantial performance gains and it looks like Solaris/ESXi/Linux use bigger defaults (or at least support mounting NFS exports with larger values).


----------



## RusDyr (Oct 2, 2013)

65 MB/s over NFS is still too low for that configuration.

And it's much better to test with iozone or sysbench instead of dd; they automatically test different sync/async write/read/rewrite patterns.


----------

