# Can a ZFS share be faster than local SSD?



## boris_net (Nov 10, 2012)

Hi all,

I have spent a lot of time reading through the forum, especially the threads on ZFS performance (as well as the tweaks and challenges involved in getting the most out of a ZFS config).

Unless I am wrong, I have not seen the following scenario covered and was wondering if it could work.

I would like to build 1 server with 10Ge NIC to serve 10Ge clients. In case you ask for the reason for more than 1G, it is really to do realtime video editing.
I am wondering if a ZFS network share on such a server could offer a transfer rate close to what an SSD would offer locally in the client? If so, I would give the client a RAID-0 of SSDs and mount an iSCSI volume from the ZFS share on the server for the data.

I am well aware that a 10Ge solution is expensive at the moment, but for argument's sake, let's ignore this cost for a minute and consider a 10Ge NIC with TOE to let the CPU+RAM handle the load associated with ZFS.

Would you expect to get a ZFS array capable of offering around 500MBytes/s or more over a 10G network?
If so, what would be the main factor in achieving such a speed (I know the answer could be all, but I really want to get a good sense of what would help the most):
- choice of motherboard
- choice of HDD 
- choice of controller
- choice of CPU
- amount of RAM
- choice of ZFS raidz type
- choice of ZFS array structure
- ZFS tweaking
- would iSCSI be possible to use without degradation

Let me know your thoughts. I have not found good performance figures for ZFS over the network, but if you have pointers, I will happily go and read your sources.

Thanks in advance,


----------



## Terry_Kennedy (Nov 10, 2012)

boris_net said:

> Would you expect to get a ZFS array capable of offering around 500MBytes/s or more over a 10G network?


You can definitely achieve 500Mbyte/sec locally - I have some relatively un-optimized ZFS pools that will handle continuous I/O at that speed. iozone plot


> If so, what would be the main factor in achieving such a speed (I know the answer could be all, but I really want to get a good sense of what would help the most):


First, make sure the systems can communicate with each other at well over the file transfer speed you need. Use some of the TCP benchmarking ports to test this, since they don't do any disk I/O, only network. This is a good place to start tuning your TCP stack as well.
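
To illustrate the idea of a network-only test, here is a minimal Python sketch that pushes data from memory over a single TCP connection and reports what the receiver saw, with no disk involved. It is not one of the benchmarking ports mentioned above; the function name `measure_tcp_throughput` and its parameters are made up for this example, and it runs both ends on the loopback interface only so that it is self-contained. For a real measurement the receiver and sender would run on the server and the client.

```python
import socket
import threading
import time


def measure_tcp_throughput(host="127.0.0.1", total_bytes=64 * 1024 * 1024,
                           chunk=1024 * 1024):
    """Send total_bytes from memory over one TCP connection.

    Returns the rate in MB/s (decimal) as timed on the receiving side.
    Hypothetical helper for illustration; both ends run in this process.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, 0))          # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}

    def receiver():
        conn, _ = server.accept()
        received = 0
        start = time.perf_counter()
        while True:
            data = conn.recv(chunk)
            if not data:            # sender closed the connection
                break
            received += len(data)
        result["seconds"] = time.perf_counter() - start
        result["bytes"] = received
        conn.close()

    thread = threading.Thread(target=receiver)
    thread.start()

    payload = b"\0" * chunk         # data comes from RAM, never from disk
    with socket.create_connection((host, port)) as client:
        sent = 0
        while sent < total_bytes:
            client.sendall(payload)
            sent += chunk

    thread.join()
    server.close()
    return result["bytes"] / result["seconds"] / 1e6


if __name__ == "__main__":
    print(f"{measure_tcp_throughput():.0f} MB/s over loopback")
```

A loopback figure only shows what the TCP stack and CPU can do; the number that matters is the one measured between the two machines over the actual NICs and switch.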

Next, do some local benchmarking with the sizes of files you expect to be using. As above, if you can't get the sustained performance you need with local I/O, net I/O isn't going to be any faster. Tune as appropriate.
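
As a rough sketch of that kind of local test (not a replacement for a proper tool such as iozone), the Python below times one sequential write and one sequential read of a file of a chosen size. The function name `sequential_throughput` and its defaults are assumptions for this example. It writes random data so that a compressed dataset does not inflate the result, and the read pass will largely be served from cache unless the file is much bigger than RAM.

```python
import os
import tempfile
import time


def sequential_throughput(directory, size_bytes=64 * 1024 * 1024,
                          block=1024 * 1024):
    """Time a sequential write, then a sequential read, of one temp file.

    Returns (write_MB_per_s, read_MB_per_s) in decimal megabytes.
    Hypothetical helper for illustration only.
    """
    buf = os.urandom(block)                     # incompressible payload
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        written = 0
        with os.fdopen(fd, "wb") as f:
            while written < size_bytes:
                f.write(buf)
                written += block
            f.flush()
            os.fsync(f.fileno())                # include the flush to disk
        write_rate = written / (time.perf_counter() - start) / 1e6

        start = time.perf_counter()
        read = 0
        with open(path, "rb") as f:
            while True:
                data = f.read(block)
                if not data:
                    break
                read += len(data)
        read_rate = read / (time.perf_counter() - start) / 1e6
    finally:
        os.unlink(path)                         # always remove the test file
    return write_rate, read_rate


if __name__ == "__main__":
    w, r = sequential_throughput(tempfile.gettempdir())
    print(f"write {w:.0f} MB/s, read {r:.0f} MB/s")
```

To test a pool, pass a directory on that pool instead of the system temp directory, and set `size_bytes` to match the video files you expect to work with.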

Last, you'll need to determine what the protocol with the lowest overhead is for your environment - NFS, SAMBA, etc. Once you find the best protocol, then you can experiment with various protocol-specific tuning options.


----------



## boris_net (Nov 11, 2012)

Thanks Terry, any guidance on the hardware you use to get this kind of transfer rate at least locally?


----------



## Terry_Kennedy (Nov 11, 2012)

boris_net said:

> Thanks Terry, any guidance on the hardware you use to get this kind of transfer rate at least locally?


Massive overkill...

X8DTH-iF motherboard
2 x E5520 CPU
48GB ECC RAM
3Ware 9650-16ML controller w/ BBU (exporting individual units)
256GB PCIe SSD
16 x WD2003FYYS (RE4 2TB)

Pic 1
Pic 2

Burst transfer rate (for around 10 seconds) is > 4GB/sec.


----------



## bbzz (Nov 11, 2012)

Sexy. 
What's the case?
And how's the noise/heat?


----------



## Terry_Kennedy (Nov 11, 2012)

bbzz said:

> Sexy.


Thanks! I try to build things that surpass commercial products, otherwise there's not much point in doing it myself. 


> What's the case?


Semi-custom version of the CI Design NSR-316


> And how's the noise/heat?


Not bad under normal operating conditions. They're in a rack with a bunch of loud stuff, so I can't easily get a dB reading for you. You certainly wouldn't want it in a home theatre room. The only really hot parts of it are the exhaust fans on the power supply. They run a bit hotter than I'd like, but choices in that form factor are somewhat limited. The power supply is a redundant hot-swap assembly from 3Y Power, and it does run within their environmental specs.

On power-up / reboot, all the fans run at full speed for about 10 seconds until the BMC firmware takes over and asserts PWM control. That's a safety feature to prevent thermal runaway in the case of a wedged BMC or hard system fault. During that time period, it sounds like a jet taking off.

Here is a static snapshot of the monitoring data I collect from each of these servers.


----------



## ziyanm (Nov 11, 2012)

Hi Terry,

What do you use to capture, store and graph the sensor data? I want to implement something similar for a couple of servers.


----------



## Terry_Kennedy (Nov 11, 2012)

ziyanm said:

> What do you use to capture, store and graph the sensor data? I want to implement something similar for a couple of servers.


It is based on a heavily-modified version of the utility in the contrib area of the sysutils/ipmitool port. Most of the differences are in how the script formats the data it receives, plus the use of SMC's own Java-based SMCIPMITool to collect the power supply data (since our ipmitool port doesn't decode PMBus). If you want to build your own rather than use the samples in contrib, you'll probably want:

```
ipmitool -I open sensor
ipmitool -I open sel list
```
And if you're using a Supermicro board connected to a compatible PMBus power supply, you'll also need to know the offset registers for your particular power supply. On my systems, the command is:

```
java -jar /usr/local/bin/SMCIPMITool.jar aaa.bbb.ccc.ddd username password pminfo 7 b0
```
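
For anyone building their own collector, here is a minimal Python sketch of the parsing step for the pipe-separated table that `ipmitool -I open sensor` prints. It is not the modified contrib script described above; the function name `parse_sensor_table` is made up, and the `SAMPLE` text is an invented illustration of the column layout, not output from these servers.

```python
# Invented sample in the layout of `ipmitool sensor` output:
# name | value | unit | status | thresholds...
SAMPLE = """\
CPU1 Temp        | 45.000     | degrees C  | ok    | na        | na        | na        | 85.000    | 90.000    | 95.000
FAN1             | 5400.000   | RPM        | ok    | 300.000   | 500.000   | 700.000   | na        | na        | na
PS Status        | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na
"""


def parse_sensor_table(text):
    """Return {sensor name: (numeric value, unit)} for numeric sensors.

    Rows whose value is not a plain number (discrete sensors, 'na')
    are skipped. Hypothetical helper for illustration only.
    """
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue                    # not a sensor row
        name, value, unit = fields[0], fields[1], fields[2]
        try:
            readings[name] = (float(value), unit)
        except ValueError:
            continue                    # e.g. '0x1' or 'na'
    return readings


if __name__ == "__main__":
    for name, (value, unit) in parse_sensor_table(SAMPLE).items():
        print(f"{name}: {value} {unit}")
```

In a real collector the text would come from running the command itself, for example with `subprocess.run(["ipmitool", "-I", "open", "sensor"], capture_output=True, text=True).stdout`, and the parsed values would be timestamped and stored for graphing.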


----------



## boris_net (Nov 11, 2012)

Thanks a lot Terry, it does not look like that much overkill to me, especially since you are getting great performance!
Looking at your components list, am I correct it is SATA-II only (HDD+Controller)?
Do you use your SSD for ZFS? For cache? I am asking as I see a reasonable amount of memory, just curious to see if the SSD (if used for ZFS) is used that much on such a setup...

As you said I will make sure to get high speed locally and start on GigEth and once I know I can max out the GigEth consistently, I will get a 10GE networking solution.

Thanks again for sharing your experience !


----------



## Terry_Kennedy (Nov 12, 2012)

boris_net said:

> Looking at your components list, am I correct it is SATA-II only (HDD+Controller)?


Yes. The 16 RE4s are attached to the 3Ware 9650 controller, which then exports them as individual units to FreeBSD.

The 2 320GB 2.5" drives and the DVD drive are connected to the motherboard's SATA ports. All of FreeBSD is on those 2 320GB drives using whole-disk gmirror and UFS2 filesystems.



> Do you use your SSD for ZFS? For cache? I am asking as I see a reasonable amount of memory, just curious to see if the SSD (if used for ZFS) is used that much on such a setup...


The SSD is currently dedicated as a ZIL device for the ZFS pool. That's massive overkill, as 32GB would be fine for that. Since the box is so much faster than GigE speed, I didn't bother tuning it past 500MByte/sec.
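
To put those speeds side by side, a quick back-of-the-envelope calculation in Python. It uses raw line rates only (bits divided by 8, decimal megabytes) and ignores Ethernet, IP and TCP overhead, so real-world ceilings are somewhat lower; the function name is made up for this example.

```python
def link_capacity_mbytes_per_s(gbits_per_s):
    """Raw line rate in MB/s (decimal), ignoring all protocol overhead."""
    return gbits_per_s * 1e9 / 8 / 1e6


TARGET_MB_PER_S = 500.0  # the figure discussed in this thread

for name, gbits in (("GigE", 1), ("10GbE", 10)):
    capacity = link_capacity_mbytes_per_s(gbits)
    share = TARGET_MB_PER_S / capacity
    print(f"{name}: {capacity:.0f} MB/s raw, 500 MB/s is {share:.0%} of the link")

# GigE: 125 MB/s raw, 500 MB/s is 400% of the link
# 10GbE: 1250 MB/s raw, 500 MB/s is 40% of the link
```

So a pool that sustains 500 MB/s locally is already about four times what a single GigE link can carry, while on 10GbE it would use well under half of the raw line rate.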


----------

