# NSA data storage



## abhay4589 (Dec 5, 2013)

http://arstechnica.com/tech-policy/...-5-billion-cellphone-location-records-per-day

According to this article, the NSA stores 5 billion cellphone location records per day, around 27 TB of data. That is insane. But my question is different: how would you store that much data in a sane and manageable way? With FreeBSD, would we be able to store it in the first place? Something like Teradata, maybe?


----------



## serverhamster (Dec 11, 2013)

I'm not sure there is a sane and manageable way. Nearline storage, probably? This is a prime example of how money could be better spent. I suppose they also keep backups, which would double the data. Something like what is used for the Wayback Machine is actually smaller: according to Wikipedia, the Wayback Machine stored 100 TB each month in 2009. It will probably be more now, but the NSA accumulates that much in under four days (assuming they don't take backups).


----------



## drhowarddrfine (Dec 11, 2013)

abhay4589 said:
			
		

> http://arstechnica.com/tech-policy/...-5-billion-cellphone-location-records-per-day
> With FreeBSD will we be able to store it in the first place? Something like teradata may be?


Of course it can. Don't ask me how, because I've never done it, but I have every confidence it can. Hadoop (Apache's take on Google's MapReduce) runs on Linux, and FreeBSD is at least as capable a Unix, so that's one point of evidence. Netflix uses FreeBSD to serve all their movies, so that's another. I'm sure the government uses some form of Unix to do this, so FreeBSD can, too.


----------



## ShelLuser (Dec 11, 2013)

Now that's an intriguing question!

Up front: I've never been involved with anything like this, and my guess is as good as yours. But having said that, I'm convinced this would be possible, and relatively easily too. For starters you could put the whole thing into a database which, depending on the system, could distribute the data across several nodes.

But to give a better (simpler) example, let's say we have a _lot_ of files which we want to store on the file system itself for easier access, and the sheer size forces us to distribute the data. One option would be a SAN solution (iSCSI comes to mind) and then having one system access all those SAN devices.

If you then set up a structure comparable to the one used in a caching environment (multiple directories which each contain a specific selection of the data) and divide it across the several SAN devices by mounting them on certain parts of the "caching tree", you'd already have something quite usable.
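To make that concrete, here is a minimal sketch of such a "caching tree", where a hash of each record ID picks both the mount point and a bucket directory. The mount names are invented for illustration:

```python
import hashlib

# Hypothetical iSCSI/SAN mount points; the names are made up.
MOUNTS = ["/mnt/san0", "/mnt/san1", "/mnt/san2", "/mnt/san3"]

def bucket_path(record_id: str) -> str:
    """Map a record ID to mount/bucket/record, cache-tree style.

    The hash spreads records evenly over the mounts, and the
    two-hex-digit bucket keeps any one directory from growing huge.
    """
    digest = hashlib.sha256(record_id.encode()).hexdigest()
    mount = MOUNTS[int(digest[:8], 16) % len(MOUNTS)]
    bucket = digest[8:10]  # 256 buckets per mount
    return f"{mount}/{bucket}/{record_id}"

print(bucket_path("imsi-310150123456789"))
```

Lookups are then a pure function of the ID, so no central index is needed to find a file.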


----------



## serverhamster (Dec 11, 2013)

It explains why Snowden quit his job as a systems administrator. Imagine having to replace hundreds of hard drives each day.

I remember reading that Google doesn't care about losing drives with search results, because the web crawlers will find the data again anyway. The NSA probably doesn't take that risk. Start calculating the number of RAID-Z3 vdevs in that pool!
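That calculation can actually be sketched. The drive size and vdev width below are pure assumptions, just to get an order of magnitude for one year of retention:

```python
import math

# All assumptions: 27 TB/day ingest, one year of retention,
# 4 TB drives in 9-wide RAID-Z3 vdevs (6 data + 3 parity disks).
tb_per_day = 27
days = 365
drive_tb = 4
vdev_width = 9
data_disks = vdev_width - 3  # RAID-Z3 means triple parity

total_tb = tb_per_day * days                 # 9855 TB
usable_per_vdev_tb = data_disks * drive_tb   # 24 TB per vdev
vdevs = math.ceil(total_tb / usable_per_vdev_tb)
drives = vdevs * vdev_width

print(vdevs, drives)  # 411 vdevs, 3699 drives
```

Even before backups, that's a few thousand spindles per year of retention, which puts the drive-swapping remark above in perspective.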

Or they could just have used some random cloud provider. Quite a number of those offer unlimited storage.


----------



## Crivens (Dec 11, 2013)

serverhamster said:
			
		

> Or they could just have used some random cloud provider. Quite a number of those offer unlimited storage.



If true, this would be interesting to know. They cannot dump all the traffic to that provider into their info dump, because it would then be dumped again and again. :beergrin:

I do not think they will do backups in that sense; the information will be cross-referenced a lot and thus copied around anyway (if the information morsel is smaller than the location ID). Self-healing data structures/graphs look like an interesting research area. I can promise I would charge a lot of money for that (and take ages to deliver, too. At least for that customer).


----------



## throAU (Dec 18, 2013)

27 TB of data per day really isn't that much for a government agency with the NSA's budget.

I was chatting with one of the IT nerds from our public transport department here in Perth, Western Australia. They have (well, had; this was a couple of years ago now) a 1.3 PB array that they use to store 30 days' worth of video from the public transport network (train platforms and the like, for legal reasons). Thirty days at 27 TB per day is not much more than half of that. And Perth is probably the most isolated major city in the world, with a population of only two million for the entire state.
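The comparison is easy to check:

```python
# Back-of-the-envelope check of the figures above.
perth_array_tb = 1300   # 1.3 PB video archive
nsa_tb_per_day = 27

nsa_30_days_tb = 30 * nsa_tb_per_day
print(nsa_30_days_tb)                             # 810 TB
print(round(nsa_30_days_tb / perth_array_tb, 2))  # 0.62
```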

The NSA no doubt has a big contract with someone like EMC or NetApp to handle it. I suspect their storage operates at a larger scale than replacing individual drives: they'd have redundancy across shelves/racks and just not bother replacing individual failed drives, but repair or replace an entire shelf/rack as a unit when it goes off-line.


----------



## beatgammit (Dec 20, 2013)

I'm not an expert in this field, but Hadoop is in ports (devel/hadoop), so it shouldn't be too difficult.

Another (non-Hadoop) approach would be a distributed database like Cassandra, MongoDB, etc. on a ton of servers with huge ZFS arrays, plus a start-up script that adds each machine to the cluster. This probably wouldn't be too bad to maintain: with S.M.A.R.T. monitoring, all an admin would have to do is swap drives/servers when they die. With RAID-Z3 and multi-server redundancy (say, every record stored on three servers), replacing drives can be done in bulk.
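A sketch of the "every record on three servers" part, using rendezvous (highest-random-weight) hashing so that adding or losing a node only remaps that node's share of the records. The node names are invented for illustration; a real Cassandra or MongoDB cluster does this placement for you:

```python
import hashlib

# Hypothetical cluster members.
SERVERS = [f"node{i:02d}" for i in range(12)]

def replica_set(record_key: str, copies: int = 3) -> list[str]:
    """Rank all servers by a per-record hash; keep the top `copies`."""
    ranked = sorted(
        SERVERS,
        key=lambda s: hashlib.sha256(f"{s}:{record_key}".encode()).digest(),
    )
    return ranked[:copies]

print(replica_set("location-record-0001"))
```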

I want to set up a similar (although _much_ smaller) system in my home and host an ownCloud (or similar) instance.


----------



## Nukama (Dec 20, 2013)

Ceph could be an easier way to achieve redundant data storage, since the placement of each placement group can be pinned to a failure domain (rack, datacenter, planet).
So moving from RAI*D* to RAI*N* should mitigate the risk of failing *d*isks and *n*odes.
And storage is expandable as you grow, by adding disks or nodes.
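For illustration, the rack-level failure domain would be expressed as a CRUSH rule along these lines (classic crushtool syntax; the names are placeholders):

```
# Keep each object's replicas in different racks.
rule replicated_racks {
    ruleset 1
    type replicated
    min_size 2
    max_size 3
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```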


----------

