# CPU considerations for 240 drives



## gkontos (Aug 31, 2012)

We are currently designing petabyte-scale storage consisting of dual-processor SuperMicro servers connected to several JBOD chassis.

Having solved most of the issues in the design phase, I am now concerned about the CPU horsepower such a solution would require. I am looking at something capable of driving around 240 disk drives.

Do you think that two Intel E5-2630 processors (6 cores, 2.3 GHz, 15 MB cache) would be sufficient?

Any input is highly appreciated!


----------



## Slurp (Sep 1, 2012)

Can't help you with that, but it would be cool if you wrote about your experiences once you've done it.


----------



## Uniballer (Sep 1, 2012)

I don't think you provided enough information for anyone to give you a good answer.

If the total needed throughput is 1 MB/second with a maximum 1 minute response time then I would say your CPU choice is total overkill.  If it is close to the peak throughput of the drives (say ~150MB/second * 240 = ~36GB/second) then how can it possibly be enough?
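The upper bound in the paragraph above is plain arithmetic: per-drive streaming rate times drive count. A minimal Python sketch of that calculation follows, also showing what share of the theoretical peak a given workload would need. The 150 MB/s per-drive figure is the rough number from the post, not a measurement, and the workload rates in the loop are only illustrative.

```python
# Best-case aggregate disk bandwidth, using the rough figures from the post.
# Assumes every drive streams sequentially at full speed in parallel, which
# real workloads (and RAID parity overhead) will not achieve.

PER_DRIVE_MB_S = 150   # approximate sequential rate of one drive
DRIVES = 240


def aggregate_disk_mb_s(drives: int, per_drive_mb_s: float) -> float:
    """Upper bound on streaming bandwidth if all drives run flat out."""
    return drives * per_drive_mb_s


def fraction_of_peak(required_mb_s: float, drives: int, per_drive_mb_s: float) -> float:
    """Share of the theoretical disk bandwidth that a workload needs."""
    return required_mb_s / aggregate_disk_mb_s(drives, per_drive_mb_s)


if __name__ == "__main__":
    peak = aggregate_disk_mb_s(DRIVES, PER_DRIVE_MB_S)
    print(f"theoretical peak: {peak:,.0f} MB/s (~{peak / 1000:.0f} GB/s)")
    # Illustrative workloads: 1 MB/s, one e-SATA stream, a saturated 10 Gb/s link.
    for required in (1, 150, 1250):
        share = fraction_of_peak(required, DRIVES, PER_DRIVE_MB_S)
        print(f"{required:>5} MB/s needs {share:.2%} of peak")
```

The point of the ratio is Uniballer's: the same hardware is either overkill or hopeless depending on which end of that range the real workload sits.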

If this is intended to be a Samba or NFS server, how will the data get in and out of the box? If it is an SQL server, then a very complex analysis would be required.


----------



## gkontos (Sep 1, 2012)

Uniballer said:

> I don't think you provided enough information for anyone to give you a good answer.



240 drives is the maximum we can fit in 3 JBOD chassis. Each chassis will hold 80 drives in one pool made of 8 striped RAIDZ2 vdevs of 10 drives each.
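The layout above can be checked with a few lines of Python: total drive count, and how many drives' worth of space is left for data once RAIDZ2 takes two drives of parity per vdev. `DRIVE_TB` is a hypothetical drive size since the post does not give one, and ZFS metadata, padding and reserved space are ignored, so treat the capacity figure as an optimistic upper bound.

```python
# Drive-count and rough capacity arithmetic for the layout described above:
# 3 JBOD chassis, each holding one pool of 8 striped 10-drive RAIDZ2 vdevs.
# DRIVE_TB is hypothetical; ZFS overheads are ignored, so real usable
# capacity will be lower.

CHASSIS = 3
VDEVS_PER_POOL = 8
DRIVES_PER_VDEV = 10
RAIDZ2_PARITY = 2
DRIVE_TB = 3.0  # hypothetical; substitute the drives actually purchased


def total_drives(chassis: int, vdevs: int, per_vdev: int) -> int:
    """Physical drives across all chassis."""
    return chassis * vdevs * per_vdev


def data_drives(chassis: int, vdevs: int, per_vdev: int, parity: int) -> int:
    """Drives' worth of space left for data after RAIDZ parity."""
    return chassis * vdevs * (per_vdev - parity)


if __name__ == "__main__":
    raw = total_drives(CHASSIS, VDEVS_PER_POOL, DRIVES_PER_VDEV)
    data = data_drives(CHASSIS, VDEVS_PER_POOL, DRIVES_PER_VDEV, RAIDZ2_PARITY)
    print(f"{raw} drives total, {data} drives' worth of data capacity")
    print(f"~{data * DRIVE_TB:.0f} TB before ZFS overhead at {DRIVE_TB} TB/drive")
    # Drive size that would give 1 PB (1000 TB) of data capacity with this layout.
    print(f"~{1000 / data:.1f} TB per drive for 1 PB")
```

With 192 data drives out of 240, a fifth of the raw capacity goes to parity, which is the price of surviving two failures per vdev.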



			
Uniballer said:

> If the total needed throughput is 1 MB/second with a maximum 1 minute response time then I would say your CPU choice is total overkill.  If it is close to the peak throughput of the drives (say ~150MB/second * 240 = ~36GB/second) then how can it possibly be enough?



I don't know; that is why I am asking. Do you have a formula that could help?



			
Uniballer said:

> If this is intended to be a Samba or NFS server, how will the data get in and out of the box? If it is an SQL server, then a very complex analysis would be required.



The storage will be used as a remote DR backup for several small businesses that already have 6-7 TB of local ZFS storage. Their first synchronization will be done by physically transporting the data. After that, we will receive incremental snapshots daily.

Throughput is a bit tricky here. For normal daily incremental backups we are limited by their Internet uplink, which in most cases doesn't go above 2 Mb/s. We, on the other hand, are currently limited to 100 Mb/s in the DC.

However, as I mentioned before, the initial synchronization will take place in the DC, where it can reach the theoretical e-SATA speed of 150 MB/s.
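To put those figures side by side, here is a small Python sketch that estimates how long a 7 TB initial sync takes at 150 MB/s and how much data a saturated 2 Mb/s uplink or 100 Mb/s DC link can carry per day. It uses decimal units and ignores protocol overhead, so real numbers will be somewhat worse; the 7 TB value is simply the upper end of the range in the post.

```python
# Back-of-the-envelope transfer estimates for the figures in the post.
# Decimal units throughout (1 MB = 10**6 bytes, 1 Mb = 10**6 bits);
# protocol overhead is ignored, so real numbers will be somewhat worse.

SECONDS_PER_DAY = 86_400


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


def max_daily_bytes(link_bits_per_s: float) -> float:
    """Upper bound on data a link can carry in 24 hours if fully saturated."""
    return link_bits_per_s / 8 * SECONDS_PER_DAY


if __name__ == "__main__":
    initial_sync = 7e12   # 7 TB, upper end of the 6-7 TB local pools
    esata_rate = 150e6    # ~150 MB/s e-SATA rate
    client_uplink = 2e6   # 2 Mb/s typical client uplink
    dc_link = 100e6       # 100 Mb/s datacenter link

    print(f"initial sync: {transfer_hours(initial_sync, esata_rate):.1f} h")
    print(f"max daily incremental per client: {max_daily_bytes(client_uplink) / 1e9:.1f} GB")
    print(f"max daily intake at the DC: {max_daily_bytes(dc_link) / 1e12:.2f} TB")
```

Roughly 13 hours per initial sync, at most about 21.6 GB of incrementals per client per day, and about 1.08 TB per day into the DC: none of these rates is demanding for a modern CPU.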

I hope the information is sufficient this time.


----------



## Uniballer (Sep 1, 2012)

With the throughput requirements you stated (approximately 11 MB/second when your Internet connection is saturated, 150 MB/second for the initial synchronization), I have a hard time believing that your CPU choice is going to be the bottleneck. But I would caution you to make sure you are building a system that can be managed to deal with failures, etc. I certainly don't have the experience with ZFS to guide you there.

Check out these threads if you haven't already:

Backup solution for ginormous ZFS pool? (Terry Kennedy notes he is getting ~500MB/second throughput from his ZFS pool)

Improving ZFS Resilver time (extrapolate the times to petabyte levels and you may be in trouble)

zfs: configure many disks


----------



## gkontos (Sep 1, 2012)

The design is meant to simulate HAST, but in a direct-attached storage scenario. The system can sustain one server failure per 3 JBOD chassis.

Resilver times have been calculated at a maximum of 12 hours per drive failure, during which no data will be transferred to or from the storage.
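A 12-hour resilver budget implies a minimum sustained rewrite rate for the replacement drive. The Python sketch below computes it for a few hypothetical drive sizes, assuming the drive is full (the worst case). ZFS resilvers only allocated blocks, so a partly full drive moves proportionally less data, but fragmentation can make the achieved rate well below sequential speed.

```python
# Sustained rate a resilver must average to finish inside a time budget.
# Worst case assumed: the replaced drive is full (fill=1.0). Drive sizes
# below are hypothetical; the post does not state them.

def required_resilver_mb_s(drive_tb: float, budget_hours: float, fill: float = 1.0) -> float:
    """MB/s needed to rewrite `fill` of a `drive_tb` TB drive in `budget_hours`."""
    bytes_to_write = drive_tb * 1e12 * fill
    return bytes_to_write / (budget_hours * 3600) / 1e6


if __name__ == "__main__":
    for drive_tb in (2.0, 3.0, 4.0):
        rate = required_resilver_mb_s(drive_tb, budget_hours=12)
        print(f"{drive_tb:.0f} TB drive in 12 h: {rate:.1f} MB/s sustained")
```

For a full 3 TB drive that is about 69 MB/s averaged over the whole resilver, which is worth comparing against resilver rates actually observed on a test pool.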

The 150 MB/s figure is a practical limit, but not one we can design around, because in the future we plan to replicate the data to "cold storage" over a 10 Gb fiber link.

So it really comes down to CPU choice. What I am really missing here is a formula that will let me estimate the CPU load.


----------



## Uniballer (Sep 1, 2012)

Just to clarify, if you are hoping to saturate a 10Gbps link then you will need to achieve perhaps 1.2GB/second throughput.
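The conversions used in this thread (100 Mb/s to roughly 11 MB/s, 10 Gb/s to roughly 1.2 GB/s) come from dividing the bit rate by 8 and allowing for protocol overhead. A minimal Python sketch follows; the 0.9 efficiency factor is an assumed, somewhat conservative allowance for Ethernet/IP/TCP framing, not a measured value.

```python
# Convert a link's line rate (bits/s) into payload MB/s (decimal).
# EFFICIENCY is an assumption standing in for framing and protocol overhead.

EFFICIENCY = 0.9


def link_mb_per_s(bits_per_s: float, efficiency: float = 1.0) -> float:
    """Payload MB/s carried by a link of the given bit rate."""
    return bits_per_s / 8 * efficiency / 1e6


if __name__ == "__main__":
    links = {"client uplink 2 Mb/s": 2e6, "DC link 100 Mb/s": 100e6, "fiber 10 Gb/s": 10e9}
    for name, bps in links.items():
        raw = link_mb_per_s(bps)
        usable = link_mb_per_s(bps, EFFICIENCY)
        print(f"{name}: {raw:,.2f} MB/s raw, ~{usable:,.2f} MB/s usable")
```

A 10 Gb/s link tops out at 1250 MB/s on the wire, so anything from about 1.1 to 1.2 GB/s of payload is effectively saturating it.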


----------



## gkontos (Sep 1, 2012)

Uniballer said:

> Just to clarify, if you are hoping to saturate a 10Gbps link then you will need to achieve perhaps 1.2GB/second throughput.



I would be happy with something close to 1 GB/s, which I have already achieved using misc/mbuffer.

Again, this would not run around the clock, but on particular schedules that take into account several factors, such as load average and queued processes.
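One simple way to make a scheduled replication job respect load average is to check it before starting. The Python sketch below gates on the 5-minute load average normalised per CPU; `replication_allowed` and its 0.7 threshold are hypothetical choices, and checking queued processes or jobs would need site-specific logic that is not shown here. `os.getloadavg()` is only available on Unix-like systems such as FreeBSD.

```python
import os


def replication_allowed(max_load_per_cpu: float = 0.7) -> bool:
    """Return True if the 5-minute load average, normalised per CPU, is
    below max_load_per_cpu (a hypothetical threshold)."""
    _, load5, _ = os.getloadavg()   # (1-, 5-, 15-minute averages)
    cpus = os.cpu_count() or 1      # cpu_count() may return None
    return load5 / cpus < max_load_per_cpu


if __name__ == "__main__":
    if replication_allowed():
        print("load is low enough; start the scheduled replication")
    else:
        print("system busy; postpone replication")
```

Using the 5-minute rather than the 1-minute average avoids postponing a transfer because of a short spike.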


----------



## Uniballer (Sep 4, 2012)

I have not heard of anyone getting this much throughput reading from a ZFS pool and sending it out over the network.  What does your vendor say?  I would think everything would have to be optimal (motherboard, CPU, RAM, disk controllers, network interfaces, distribution of disks over controllers, etc.) to even come close.  Any chance of getting a commitment from your vendor that you can upgrade the CPUs for the difference in cost if you can't achieve the needed throughput?  That would take some of the sting out if your choice isn't quite enough.

Of course, the system at the "cold storage" facility will need as much (or more) capability to write the data as fast as it can be read from your main system.

How much throughput have you gotten to/from the ZFS pool in your test setup?  What hardware did you use?


----------



## ralphbsz (Sep 11, 2012)

The hardware is capable of that. I've seen similar x86 machines run as file servers (from a SAS-attached local file system out to the network) at 5 GB/s, while running a parity-based software RAID stack and a file system. That was on older hardware a few years ago; today's hardware is probably somewhat faster.

BUT: to do that, you need to look at all your bottlenecks. Let's start with the disks and JBODs. How are they connected? SAS links to the host? How many SAS cables? 6 Gbit SAS? What type of HBA? What is the PCIe bandwidth? (In most cases today, SAS HBAs are limited by their PCIe slots; only with PCIe gen 3 is this beginning to be balanced.)
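The questions above amount to listing every stage in the data path and taking the slowest one. A minimal Python sketch of that reasoning follows. The link rates are theoretical payload maxima after line encoding (6 Gb/s SAS and PCIe 2.0 use 8b/10b), but the topology (one 4-lane SAS cable and one PCIe 2.0 x8 HBA per JBOD, a single 10 GbE port) is hypothetical, and CPU, memory bandwidth and the software stack are deliberately left out even though they may well dominate.

```python
# Find the narrowest stage in a simple disk-to-network data path.
# Rates are theoretical payload maxima after line encoding. PCIe 3.0
# (8 GT/s, 128b/130b) would give roughly 985 MB/s per lane instead.

SAS2_LANE_MB_S = 6e9 * 8 / 10 / 8 / 1e6    # 600 MB/s per 6 Gb/s SAS lane
PCIE2_LANE_MB_S = 5e9 * 8 / 10 / 8 / 1e6   # 500 MB/s per lane, per direction


def bottleneck(stages: dict) -> tuple:
    """Return (name, MB/s) of the slowest stage; the path can go no faster."""
    name = min(stages, key=stages.get)
    return name, stages[name]


if __name__ == "__main__":
    jbods = 3  # hypothetical topology: one cable and one HBA per JBOD
    stages = {
        "disks": 240 * 150,                             # rough per-drive rate from the thread
        "SAS cables": jbods * 4 * SAS2_LANE_MB_S,       # one x4 wide port per JBOD
        "HBA PCIe slots": jbods * 8 * PCIE2_LANE_MB_S,  # one x8 HBA per JBOD
        "network": 10e9 / 8 / 1e6,                      # one 10 GbE port, raw line rate
    }
    for name, mb_s in stages.items():
        print(f"{name:>15}: {mb_s:>8,.0f} MB/s")
    print("bottleneck:", bottleneck(stages))
```

In this made-up configuration the single 10 GbE port limits the path long before the disks, SAS cabling or PCIe slots do; changing the topology just means editing the dictionary.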

Next question: What type of Intel CPU are you using, and what is the memory bandwidth?  Remember, your data will have to go in and out of memory several times, so memory bandwidth will be extremely important.  I'm not an expert on CPUs and memory.

Then you need some IO device. It seems to me that the best results are achieved with InfiniBand cards these days; if using Ethernet, you need to make sure that your protocol can use RDMA. Otherwise your CPU is going to spend much of its effort playing stupid networking games.

But the elephant in the room is the software stack.  I have no idea whether this can be accomplished with FreeBSD, ZFS, and whatever file server you are intending to use (Samba?  NFS server?  WebDAV?).  

From a performance point of view, there is no good formula.  There are so many bottlenecks, raw CPU performance will be the least of your problems.  With really good software and expert tuning, your system as specified could probably do 10x more than the 1 GB/s you want, so it is likely that it will just work, even with a less-than perfect software stack.

With 240 disks, you will obviously need some really solid RAID solution. Remember, the expected lifetime of disks is such that you should expect a disk failure every few weeks or months (a specified MTBF of 1M hours is ~114 years; with 240 disks you expect roughly 2 failures a year, and the reality is probably 3x or 10x worse). Furthermore, with such large disks and today's error rates, you can unfortunately expect many resilvering operations to hit a read error. If you want to store data sets this large with good reliability, you really need a RAID code that can handle two faults (the common case being complete failure of one drive, followed by a read error while resilvering). And given that there will be multiple failures per year, you probably want automated ways of handling failures and of contacting field service for drive replacement. RAID and automated RAID management are probably far more work than performance tuning.
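Both reliability arguments above can be reproduced numerically. The Python sketch below computes expected annual failures from MTBF, and the probability of at least one unrecoverable read error (URE) while reading the 9 surviving drives of a 10-drive RAIDZ2 vdev, using a Poisson approximation. The 3 TB drive size is hypothetical, the vdev is assumed full, and the URE rates are the "1 per 10^14 bits" and "1 per 10^15 bits" figures commonly printed on desktop and enterprise spec sheets; real drives often do better than their spec.

```python
import math

HOURS_PER_YEAR = 8760


def expected_failures_per_year(drives: int, mtbf_hours: float) -> float:
    """Mean failures per year, assuming a constant failure rate."""
    return drives * HOURS_PER_YEAR / mtbf_hours


def p_read_error_during_resilver(surviving_drives: int, drive_tb: float,
                                 ure_per_bit: float, fill: float = 1.0) -> float:
    """Chance of at least one URE while reading the surviving drives,
    treating errors as independent per bit (Poisson approximation)."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8 * fill
    return 1 - math.exp(-bits_read * ure_per_bit)


if __name__ == "__main__":
    print(f"1M h MTBF = {1e6 / HOURS_PER_YEAR:.0f} years per drive")
    print(f"240 drives: {expected_failures_per_year(240, 1e6):.1f} failures/year")
    # A 10-drive RAIDZ2 vdev with one dead drive reads the 9 survivors.
    for ure in (1e-14, 1e-15):
        p = p_read_error_during_resilver(9, 3.0, ure)
        print(f"URE 1 in {1 / ure:.0e} bits: P(read error) = {p:.0%}")
```

Under these assumptions a full-vdev resilver has a high chance of hitting at least one read error at the desktop-class rate, which is exactly the case the second parity drive in RAIDZ2 is there to cover.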


----------

