# So. Might be time to go tapeless. Massive ZFS deployment?



## zebra (Jan 2, 2014)

Hi all. First time poster, long time lurker.

I'm after the wisdom of the crowd. I've been a big tape user for large HSM-based file servers for a long time, using high-end HSM filesystems such as DMF and SAM-QFS for peta-scale storage and data-integrity requirements. I currently manage around 3 PB of data. Here's the thing, though: 3 PB isn't that much any more (it fits in a couple of racks of disk, given the many 84-drive-bay solutions from the likes of Dell, EMC and DDN), and filesystem data-integrity validation has come a long, long way, as has filesystem recoverability. As a consequence of some of the obvious usability issues that disk --> slower disk --> tape HSM solutions create, I'm considering moving to an all-disk world for my primary file server technology, and keeping monsters such as SAM-FS/QFS on the side for the "deep archive" vaulting of massive, massive data.

I work in an industry where a few terabytes a day between friends is kind of normal in terms of user-generated data, so scale is important. So, if money were of no consequence here, in an ideal world I'd be considering something like this:

- Build a whomping massive IB- or FC-connected (or SAS Gen3, maybe!) array with a FreeBSD/Solaris 11.1/OmniOS distribution controlling it from a ZFS perspective. Share it out, quota it: CIFS, NFS, whatever.
- Build a _secondary_ whomping massive IB- or FC-connected (or SAS Gen3) array running an identical distribution and "send sideways" ZFS snapshots daily as a data integrity and protection mechanism.
- Cron `zpool scrub`s weekly in slow/non-peak time frames to try and maintain on-disk consistency.
- Consider RAID-Z2 or RAID-Z3, given the significant number of spindles involved. I dare say I'll also consider a LOG/ZIL device, since some of the workload commonly deals with sync/async NFS writes and reads.
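For what it's worth, the "send sideways daily, scrub weekly" part of the plan sketches out to a few lines of shell and cron. This is only an illustration: the pool names (`tank`, `backup`), the host name `backup-head`, and the date-based snapshot naming are all invented, and it assumes an initial full send has already seeded the secondary.

```shell
#!/bin/sh
# Daily incremental replication, primary -> secondary (hypothetical names).
TODAY=$(date +%Y%m%d)
YESTERDAY=$(date -v-1d +%Y%m%d)   # BSD date; GNU date: date -d yesterday +%Y%m%d

# Recursive snapshot, then send only the delta since yesterday's snapshot.
zfs snapshot -r "tank@${TODAY}"
zfs send -R -i "tank@${YESTERDAY}" "tank@${TODAY}" | \
    ssh backup-head zfs recv -duF backup

# crontab on the primary: replicate nightly, scrub in the weekend lull.
# 0 1 * * *   /root/bin/zfs-replicate.sh
# 0 2 * * 6   /sbin/zpool scrub tank
```

The `-R -i` pair keeps the whole dataset tree in sync from a common snapshot; `recv -F` rolls the far side back to that common snapshot first, so stray writes on the secondary don't break the chain.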
I'm just interested in people's thoughts about the idea of _just_ using disk based systems for big, big data. I come from a background where many of my associates would say 





> What are you, bats#$#t insane? Where is your tape DR?


 but I have a niggling feeling that I should use ZFS where its strength lies (massive, massive raw file serving) and use things like SAM where *its* strength lies (behaving like a giant vault), rather than trying to shoe-horn HSM solutions like SAM and DMF into a come-one-come-all generic file server, with usability woes as a result.

I'd love to hear thoughts/considerations from any and all of you.

z


----------



## phoenix (Jan 3, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*

We're using a similar setup to what you want, only on a smaller scale, and currently only for backups, not direct usage.  Three separate servers (24 drives, 24 drives, 45 drives) doing remote rsync-based backups for 120-odd servers each night, then sending snapshots over to an off-site box with 90 drives.  ZFS on all four boxes, using standard SATA drives, with SSDs for the OS/L2ARC.  Works quite nicely, even with dedupe enabled (although the two boxes with dedupe disabled run a lot better).

I've read about lots of people online doing similar setups for NFS/SMB, and the odd person using iSCSI.  All with massive numbers of disks attached.  So far, very few issues.

One thing to note:  you can't pause/resume a scrub.  If you stop it and then start it again, it starts from the very beginning.  So you need to plan for the scrub to run into normal hours, throttle it if needed, and make sure there's enough IOPS to spare.
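For the throttling part, FreeBSD of this era exposes a couple of sysctl knobs that slow a scrub down while the pool is busy. A sketch only: the values here are made up, and the knob names and defaults vary between releases, so check `sysctl -a | grep -i scrub` on your box first.

```shell
# Ticks of delay inserted between scrub I/Os while the pool is busy.
sysctl vfs.zfs.scrub_delay=8
# How long the pool must be idle before the scrub runs at full speed.
sysctl vfs.zfs.scan_idle=100

# Watch progress.  Note that `zpool scrub -s tank` stops the scrub
# entirely -- a later scrub starts over from the beginning.
zpool status tank
```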

And, you always have the possibility of writing data out to tape from the second box.


----------



## zebra (Jan 3, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*



			
phoenix said:

> And, you always have the possibility of writing data out to tape from the second box.



True that. One thing I'd really like to be able to do is `zfs send` my snapshots to an archiving filesystem like SAM-QFS. My admin guys are telling me, however, that for a `zfs send` to work (i.e. for the `zfs recv` to actually take the snapshot), it must "plant" itself on a ZFS filesystem on the other side. Do you know if that's an actual limitation, or is it all in our heads?

We ideally want to `zfs send` snapshots (`-i`) over to the secondary array, or into SAM-FS via 10GbE NFS, and let them live in deep cold archive rather than on a secondary standard disk array. I'm just not sure if it's possible to pipe a ZFS snapshot to a filesystem tech OTHER than ZFS. I.e. does it just write a POSIX-compliant sparse file, or does it need to _live_ on a ZFS-formatted block device?

Thoughts?

z


----------



## jalla (Jan 3, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*



			
zebra said:

> We ideally want to zfs send snapshots (-i) over to the secondary array or into SAM-FS via 10GbE NFS and let them live in deep cold archive, rather than a secondary standard disk array. Just not sure if it's possible to pipe the ZFS snapshot to a filesystem tech OTHER than ZFS. I.e - does it just write a POSIX compliant sparse file, or does it *need* to live on a ZFS formatted block device?


It sends only the disk blocks that have changed since a previous common snapshot, so sending snapshots to anything other than ZFS doesn't make sense.

And BTW, SAS is the way to go, as FC is near end-of-life as a disk technology.


----------



## phoenix (Jan 3, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*



			
zebra said:

> phoenix said:
> 
> 
> 
> ...



You can redirect the output of `zfs send` to a file instead of (or in addition to) piping it through SSH to a remote `zfs recv`. That file can later be piped into `zfs recv` to restore the snapshot/filesystem.

However, if there is a single bit flipped anywhere in that file, the entire send is corrupted and unusable.  So, if you're dealing with multi-TB snapshots, you really don't want to write them out as files unless the filesystem they're written to includes some form of parity-/error-correction.
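If you do end up parking streams as files, one cheap mitigation is to store a checksum alongside each one and refuse to restore unless it still matches. A sketch (the dataset and path names are invented); note this only *detects* the corruption, it can't repair it.

```shell
# Park an incremental stream as a file, with a checksum alongside it.
zfs send -i tank/data@sun tank/data@mon > /archive/data_sun-mon.zfs
sha256sum /archive/data_sun-mon.zfs > /archive/data_sun-mon.zfs.sha256

# Much later: verify the file before trusting it, then restore.
sha256sum -c /archive/data_sun-mon.zfs.sha256 && \
    zfs recv tank/restored < /archive/data_sun-mon.zfs
```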

IOW, your safest setup would still be two separate ZFS servers (one for "live" data, one for backups) with a separate long-term archive solution.  That way, you just send snapshots between the two ZFS servers, ZFS protects the data on both ends and in transit, and then you migrate data off to long-term storage as needed.

Alternatively, you can just use one ZFS server to host the live data, and then migrate it off to long-term storage as per normal, without the use of `zfs send` anywhere.


----------



## zebra (Jan 3, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*



			
jalla said:

> And, btw SAS is the way to go as FC is near end-of-life as a disk technology.



I smell a holy war.

I'm not so convinced FC is dead to rights, or even dying, in the big kids' space. All the new big-end Hitachi and EMC arrays are still FC8 or FC16. What will they do next iteration? Who knows. Maybe you're right; maybe SAS Gen3 12 Gbit/s signalling is the future. That said, I sense Infiniband has some bigger part to play yet.

Thinking about the disk/technology and controllers I'll use, at this point...

z


----------



## jalla (Jan 4, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*



			
zebra said:

> I smell a holy war.



To me it's not about religion, but it's similarly inexplicable. It's called the market.


----------



## zebra (Jan 5, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*

Heh. 

The market is a frantic and fickle thing. I guess from that standpoint you're absolutely correct. The market pushes commodity, and commodity is SAS now.

z


----------



## ralphbsz (Jan 5, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*

Side note:



			
zebra said:

> Build a whomping massive IB or FC connected (or SAS Gen3 maybe!) connected array next to a FreeBSD/Solaris 11.1/OmniOS distribution controlling it from a ZFS perspective.



Are you suggesting: (a) getting RAID boxes (you mention 84-disk enclosures from DDN/EMC/etc), configuring each of them with internal RAID, and then running ZFS over a few dozen of those boxes, or (b) getting 84-disk JBOD enclosures (meaning the host OS sees each individual spindle as a block device), and then running ZFS over about a thousand drives?  If it's the latter, I wonder about the scalability limits of the host OS and the file system.  Personally, I have never seen a single computer (a single OS instance) use more than about 400 disk drives (1400 block devices with multi path) in a single file system.  Larger systems (thousands to tens of thousands of drives in a single file system) are usually done with cluster file systems.

About the connectivity: What comes out of the disk drives themselves today is either SATA or SAS.  If you are using disk arrays, I would pick IB or SAS over FC any day, since FC seems to be losing the market-share battle.



> I'm just interested in people's thoughts about the idea of _just_ using disk based systems for big, big data.



In the end, this is not so much a technical question and more a psychological or philosophical one.  With today's disk prices, is it possible to forego tape for DR, backup and archive, even for big data systems (petabytes)?  Sure.  Is it cost-effective?  Depends on the workload, how much archiving you need to do, and how much tape media you will therefore need.  But there is a primal simplicity to tape: You can take a bunch of tape cartridges, stack them into a moving box, put them into your trunk, and drive them to a far-away location (like an abandoned mine), and know that this particular set of data is safe, even against the moral equivalent of `rm -Rf` (the moral equivalent today is the command to destroy all snapshots or something like that).  With tape, offline really means offline.

Before you go down this route, contact your tape vendor (AFAIK there are really only two vendors left; everyone else repackages IBM and STK), ask them for the best tape can do, and look at their road map.  I keep hearing that tape's capacity and speed are making it more competitive, and that will become particularly true in the near future, when disk capacities start hitting some of the physics limits while tape still has years to go.

Personally, for my home machine, I haven't touched a tape in a decade, and don't intend to.  But my data sets are three orders of magnitude smaller than yours.  Google just (unintentionally) released an interesting testimonial for tape: they published some very pretty pictures of some of their new data centers, with the beautifully painted cooling water pipes, the fronts of the servers all lit by blue LEDs, and aisle after aisle of... drum roll... tape cartridges.


----------



## Terry_Kennedy (Jan 7, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*



			
ralphbsz said:

> Personally, for my home machine, I haven't touched a tape in a decade, and don't intend to.  But my data sets are three orders of magnitude smaller than yours.  Google just (unintentionally) released an interesting testimonial for tape: they published some very pretty pictures of some of their new data centers, with the beautifully painted cooling water pipes, the fronts of the servers all lit by blue LEDs, and aisle after aisle of... drum roll... tape cartridges.


I'm running a set of four 32 TB boxes (FreeBSD 8.4, ZFS) and a Dell TL4000 (rebadged IBM) LTO library (currently LTO-4, 44 slots). I posted a write-up about it here (look at the "backups" section in particular). That write-up also links to a reasonably large discussion back here about the pros/cons of tape vs. block replication vs. ZFS send/receive, etc. Here is a picture of the whole setup in my spare dining room.

Note that this build is nearly four years old at this point. If I were starting over, I'd go with SAS instead of SATA and use external disk-only expansion bays, rather than duplicating the CPUs/memory/etc. between multiple host boxes.


----------



## throAU (Jan 10, 2014)

*Re: So. Might be time to go tapeless. Massive ZFS deployment*

My only concern with no tape would be for archival purposes. Are you going to keep snapshots indefinitely?  Do you have any retention requirements for legal purposes? For regular backup and recovery I'd see no problem at all being tapeless, but for archiving, keeping old archive-only data on disk is expensive, and a waste of spindles/RU/power/cooling that could be better used for performance on live or recent data. Push that data to tape, take it out of the live environment, and lock it in a safe, a bomb shelter, or on another continent.  If the data is on tape and locked in a safe, it is a lot harder to accidentally remove a snapshot, destroy the pool it is on, etc.

But yes, I agree disk-to-disk backup for the first stage makes sense; as noted above, though, I'd strongly consider doing periodic snapshots to tape from your backup environment.
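If you keep the disk-side window short and push the long tail to tape, expiring old snapshots is easy to script. A sketch only: the dataset name and the 90-day cutoff are invented, and it prints rather than destroys by default.

```shell
#!/bin/sh
# Expire snapshots of backup/tank older than 90 days (hypothetical policy).
CUTOFF=$(( $(date +%s) - 90 * 86400 ))

# -p prints the creation time as epoch seconds, -H drops the header line.
zfs list -Hp -t snapshot -o name,creation -r backup/tank | \
while read -r snap created; do
    if [ "$created" -lt "$CUTOFF" ]; then
        echo "would destroy: $snap"   # swap for: zfs destroy "$snap"
    fi
done
```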


----------

