# ECC or non-ECC



## poorandunlucky (Nov 23, 2017)

Discuss.


----------



## Phishfry (Nov 23, 2017)

Why would I need ECC memory on an HTPC? So your question is broad.

ECC if you care about your data. How about that for a short answer.

On my security camera server I have ECC. On my laptop I don't (and can't).

It doesn't make me feel any different.


----------



## poorandunlucky (Nov 23, 2017)

Phishfry said:


> Why would I need ECC memory on an HTPC? So your question is broad.
> 
> ECC if you care about your data. How about that for a short answer.
> 
> ...



Well, I imagine data integrity comes at a price... either a dollar amount, a performance metric, a risk, peace of mind, etc...

Even when needed, is it worth it?  Is it worth the dollar amount?

Is the overhead, the potential bottleneck, and the possible performance cost worth it?  In which scenarios do you think it's worth it?

In an ideal world, should all RAM be ECC?

In what scenarios should ECC be most considered, or weighted for?  You say for your security camera; is that a good example of when ECC should be used?  Wouldn't fast hard drives be better?  idk...

It's a broad question indeed, but it's been a long time since I've asked it, and a lot of technological changes happened...

And on that note, are there generational boundaries where you should weigh in favor of ECC, like maybe older systems pre-something should use ECC more than more modern systems, again, idk... that's why I'm asking...


----------



## ralphbsz (Nov 23, 2017)

poorandunlucky said:


> Well, I imagine data integrity comes at a price... either a dollar amount, a performance metric, a risk, peace of mind, etc...


The price is a dollar amount.  I don't think there is a performance penalty, if you pay enough.  The highest available memory bandwidth is probably found on high-dollar server machines, which all have ECC.

Risk and peace of mind are not prices.



> Even when needed, is it worth it?  Is it worth the dollar amount?


Depends.  How much money do you lose every time the computer crashes and you have to wait a minute or ten for a reboot?  How much money do you lose if memory errors (which do exist but are rare) silently corrupt your data?  That heavily depends on the usage.  For a laptop that's on my lap and used to browse the web and send e-mail, a crash means I get to get up and pour another glass of wine, and continue working a minute later.  My laptop has nearly no data stored on it, so the risk of corruption is very low.  On the other hand, for a server at a bank, which is needed to operate thousands of ATMs, and where the data is the content of the customers' bank accounts, the answer is different.



> In what scenarios should ECC be most considered, or weighted for?  You say for your security camera; is that a good example of when ECC should be used?  Wouldn't fast hard drives be better?  idk...


ECC does not compete with fast hard drives.  It competes with RAID and good storage systems (such as ZFS): both make your computer more reliable, and make loss of data or corruption of data less likely.

One factor that goes into it is the amount of RAM.  Most laptops or consumer computers sold today seem to have in the neighborhood of 4 to 16 GB of RAM.  Many enterprise servers have hundreds of GB, and 1TB of RAM is starting to be seen in production regularly.  On one hand, that means that the cost of ECC becomes much higher; on the other hand, it means that the utility of ECC is also much higher (much more data in RAM that makes a big target for corruption).

For an amateur or home user, it is indeed a tradeoff: You can invest $300 into a better motherboard and more expensive DIMMs and get ECC, or into a second hard disk, and get RAID.  Which one is a better deal?  I don't know.

Would I spend an extra $100 or $200 to get ECC on my user interface device (I use Mac laptops): Absolutely, if I could.  Unfortunately, the only laptops with ECC seem to be very large and impractical portable workstations (Lenovo and Dell make them), which cost several thousand $ more than a sensible laptop.  Not a useful discussion to have, for lack of options.

On a home server, it is a more interesting question.  Personally, I use RAID but not ECC at home, but I know that I'm very biased (being a professional storage person).  And when I bought my server a few years ago, I was interested in very small physical size and low power consumption; I don't think ECC in a micro-ATX form factor is even a thing.  If I could get ECC the next time I upgrade, I would probably do it.

For an enterprise user, the answer is nearly always ECC for server-class machines; and if it isn't, it's because people have thought about it carefully and have made the tradeoff.  For compute engines, it's more mixed; I've seen large clusters with inexpensive non-ECC machines in them (but then carefully managed so crashes don't take the whole cluster down).


----------



## poorandunlucky (Nov 23, 2017)

ralphbsz said:


> ECC does not compete with fast hard drives.  It competes with RAID and good storage systems (such as ZFS): both make your computer more reliable, and make loss of data or corruption of data less likely.



I don't see how ECC compares with RAID or mirrors...  Like one is for stored data, whereas the other is for computed data... or is it not possible for the computer to store corrupted computed data?


----------



## Phishfry (Nov 23, 2017)

The files have to get to disk somehow. This somehow is via memory.

My point about my security cam server versus my HTPC is that I could stand to lose a frame of OTA TV, but on something like my security camera server I need every frame to be there.


----------



## poorandunlucky (Nov 23, 2017)

I just never had to deal with corrupted data before, not that I know of, or noticed...  I don't really know what it's like, or what I'm/we're talking about, by the same token...

Sometimes you know things but they don't really make sense until you're able to observe them...

What about Zn shields?  Do you think one should mind gamma rays and radiation when it comes to data integrity?


----------



## SirDice (Nov 23, 2017)

poorandunlucky said:


> I don't see how ECC compares with RAID or mirrors.


ECC for memory works pretty much the same way as RAID 5 does for disks. It's slightly different, but the idea is the same. You can lose one memory chip and there will be enough redundancy in the rest of the chips to keep the data intact.
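
To make the RAID-5 analogy concrete: ECC DIMMs typically use a SECDED Hamming-style code (8 check bits guarding each 64-bit word) that corrects any single-bit flip in hardware. Here is a toy Hamming(7,4) version in Python -- illustrative only; the real code words and chip layout differ:

```python
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits.
# Real ECC DIMMs use a wider SECDED code over 64-bit words,
# but the principle -- correct any single flipped bit -- is the same.

def encode(d):
    """Encode 4 data bits into a 7-bit codeword (positions: p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Decode a 7-bit codeword, correcting at most one flipped bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 * 1 + s2 * 2 + s3 * 4   # 1-based position of the bad bit; 0 = clean
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1              # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]       # recover d1..d4

word = [1, 0, 1, 1]
code = encode(word)
code[4] ^= 1                  # simulate a cosmic-ray bit flip
assert decode(code) == word   # the flip is corrected transparently
```

Flipping any one of the seven bits still decodes to the original data; flip two and you get a wrong "correction", which is why real modules spend an extra check bit on double-error *detection*.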



poorandunlucky said:


> I just never had to deal with corrupted data before, not that I know of, or noticed... I don't really know what it's like, or what I'm/we're talking about, by the same token...


Believe me when I say you will never want to deal with it. It's a royal pain trying to recover from it and there's never a guarantee the data you recovered is not corrupted in some very subtle way (a few flipped bits here and there).

Back in the olden days on the Amiga there was this virus called "Lamer Exterminator". It was a royal pain if you were infected. It was one of the first that was memory resident (it was still active after a reboot). If it was active it hid itself when you tried to read the boot sector. But the worst part of it was that it randomly filled tracks on disk with the "LAMER" text. One track at a time, at random intervals. So at first you don't notice it's active. Then you get more and more weird disk errors. Until you realize half your disk was silently overwritten by this monster and there was no way to recover from it anymore.

Same with memory errors. They can silently corrupt files in memory before the data is written to disk. And it can take a while for you to notice those files are corrupt. Now imagine you diligently backup your data every day. Your backups will be corrupted too because you're backing up bad files. If this goes unnoticed long enough all your backups will be worthless too. And then you get hired to sort this mess out. Expensive, tedious, exercise which could have been avoided if they spent just a little bit more money initially.
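
One practical defence against that backup-rot scenario is to store a manifest of content hashes next to each backup generation and verify it before old backups rotate out, so corrupted files are caught while a good copy still exists. A minimal sketch (the file set and manifest format here are made up for illustration):

```python
# Keep a manifest of content hashes with each backup generation, so
# silently corrupted files are caught before they poison every later
# backup. Hypothetical in-memory "files"; real use would hash on disk.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def make_manifest(files):
    """Map each file name to the hash of its contents at backup time."""
    return {name: sha256(data) for name, data in files.items()}

def verify(files, manifest):
    """Return the names of files whose current hash no longer matches."""
    return [n for n, d in files.items() if sha256(d) != manifest[n]]

files = {"vacation.jpg": b"\xff\xd8...jpeg bytes...", "notes.txt": b"hello"}
manifest = make_manifest(files)

files["vacation.jpg"] = b"\xff\xd8...jpEg bytes..."  # one silent corruption
assert verify(files, manifest) == ["vacation.jpg"]   # caught before rotation
```

This only detects rot, it doesn't prevent it; but detection is exactly what is missing in the scenario above, where every generation of backups is quietly poisoned.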


----------



## poorandunlucky (Nov 23, 2017)

SirDice said:


> ECC for memory works pretty much the same way as RAID 5 does for disks. It's slightly different, but the idea is the same. You can lose one memory chip and there will be enough redundancy in the rest of the chips to keep the data intact.
> 
> 
> Believe me when I say you will never want to deal with it. It's a royal pain trying to recover from it and there's never a guarantee the data you recovered is not corrupted in some very subtle way (a few flipped bits here and there).
> ...



Well, more and more people handle sometimes irreplaceable data on their computers...  You may not think that someone's vacation pictures from Cuba are irreplaceable data, but maybe to them, those pictures mean the world, and they figure they're safe in the computer, and I don't know, they rotate them or perform some sort of operation on them, and the files all end up streaked by bad bits, or worse, unreadable...

Sure it's not like it affects millions of people, but I'm pretty sure those pictures were probably worth $40 extra to that person... not to mention that they have to replace their non-ECC RAM anyway...

I was starting to think ECC was just for database servers, but you're making me realize that today, there's a database server (often even more than one) on all computers out there, and that once the data's gone... it's gone.  So I think I'm going to change my answer from "Depends..." to "ECC"...


----------



## Phishfry (Nov 23, 2017)

ralphbsz said:


> I don't think ECC in a micro-ATX form factor is even a thing.


Oh yea it is. My server board is a Gigabyte MATX, and Supermicro makes plenty of boards in the MicroATX form factor that take ECC.
There are even ITX boards that take ECC, from SuperMicro and ASRock Rack that I know of.


----------



## OlivierW (Nov 25, 2017)

ralphbsz said:


> And when I bought my server a few years ago, I was interested in very small physical size and low power consumption; I don't think ECC in a micro-ATX form factor is even a thing.  If I could get ECC the next time I upgrade, I would probably do it.



Exactly what I needed, and I bought HP N40L microservers, with ECC RAM.


----------



## _martin (Nov 25, 2017)

ECC, definitely. I do care about my data.

It's a pity it isn't widely popular on desktop boards. Especially in 2017, when there's no problem overpaying for a smartphone.


----------



## poorandunlucky (Nov 25, 2017)

_martin said:


> ECC, definitely. I do care about my data.
> 
> It's a pity it isn't widely popular on desktop boards. Especially in 2017, when there's no problem overpaying for a smartphone.



Honestly, I wonder why Non-ECC is even a thing...  I think all memory should be ECC...

I think there's cost-cutting in favor of efficiency, and there's cost-cutting in the favor of greed (just being cheap).  I think efficiency is a good thing, whereas being greedy or cheap should be punishable by death.

Life is already shitty enough; people who make a point of making it worse should meet their makers at the gallows.


----------



## _martin (Nov 26, 2017)

poorandunlucky said:


> Honestly, I wonder why Non-ECC is even a thing... I think all memory should be ECC...



I completely agree. Personally I think it's also a historic thing - it was more expensive to produce these modules before. 

But just a thought: most FS operations go through RAM, and you have no way of knowing if "data in" == "data out" .. Scary ..

My job is managing big corporate servers, from small RX Itanium servers, through blade servers, to high-end Itanium Superdomes. Of course all of these have ECC RAM. In two to three years' time you can find single-bit errors on a considerable number of RAM modules. Imagine you were using non-ECC RAM -- that's like playing Russian roulette with your data.


----------



## ralphbsz (Nov 26, 2017)

_martin said:


> I completely agree. Personally I think it's also a historic thing - it was more expensive to produce these modules before.


It is still more expensive; you end up using a few percent more RAM (in the sense of gates and silicon area, perhaps not chips or DIMMs).  That costs money.  For large computer users (Google, Facebook, the US government, ...) that is a tradeoff; but those large users can make informed choices.  Where I agree with you: consumer computers should all have ECC, because (a) the extra cost is minimal compared to the price elasticity of consumers, and (b) end users are not capable of making those informed choices.



> But just a thought: most FS operations go through RAM, and you have no way of knowing if "data in" == "data out" ...
> In two to three years' time you can find single-bit errors on a considerable number of RAM modules. Imagine you were using non-ECC RAM -- that's like playing Russian roulette with your data.


It's not quite that bad.  In large systems, file system data is written to disk pretty quickly, so at least the write path (application -> disk) is only vulnerable to memory errors for about 30s or less.  The read path for cache hits is obviously still a problem.  And some file systems (for example ZFS, but others too) protect the data with checksums, and the checksums are also kept in memory for the data in memory.  Obviously, this is not perfect: during the calculation of the checksum the data is still unprotected, but in a well-designed system, the checksum is calculated first and checked last, to keep that window as small as possible.  Some kernel software products even keep light-weight checksums of other long-lived data structures in memory, to protect them against bit rot.  Again, this is not perfect (the cost of protecting every data structure would be too high, it would amount to implementing ECC in software), but it covers a very large fraction of the memory in use.
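
The "checksum early, verify late" pattern described above can be sketched in a few lines. This is not ZFS code -- ZFS keeps per-block fletcher or SHA-256 checksums in its block pointer tree -- it is just the shape of the idea, with a hypothetical `ProtectedBlock` type:

```python
# Sketch of the "checksum early, verify late" pattern: the checksum is
# computed as soon as the data exists and only verified right before use,
# keeping the unprotected window as small as possible.
import zlib

class ProtectedBlock:
    def __init__(self, data: bytes):
        # Checksum computed as early as possible: from here on, any bit
        # flip in `data` while it sits in RAM is detectable.
        self.data = data
        self.crc = zlib.crc32(data)

    def read(self) -> bytes:
        # Verified as late as possible, just before the data is consumed.
        if zlib.crc32(self.data) != self.crc:
            raise IOError("in-memory corruption detected (checksum mismatch)")
        return self.data

blk = ProtectedBlock(b"important payload")
assert blk.read() == b"important payload"

# Simulate a single-bit flip in the cached data:
damaged = bytearray(blk.data)
damaged[0] ^= 0x01
blk.data = bytes(damaged)
try:
    blk.read()
except IOError:
    pass  # detected -- but not corrected: detection is all a checksum buys
```

Note the last line of the comment: unlike ECC, a checksum can only tell you the data is bad; recovering it still requires redundancy (a mirror, parity, or a backup).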

Still, if I had to specify a server, I would go for ECC if reasonably possible.  Thank you, Phishfry, for pointing out some MicroATX boards with ECC; it might be time to do a hardware upgrade.


----------



## Brian Cully (Nov 28, 2017)

ralphbsz said:


> ECC does not compete with fast hard drives. It competes with RAID and good storage systems (such as ZFS): both make your computer more reliable, and make loss of data or corruption of data less likely.



ECC is highly encouraged with ZFS, so it's not at all in competition. Memory is one of the few weak links in the chain, as a memory corruption will screw with your on-disk checksums and lead to corruption, potentially of the entire pool, no matter how much redundancy you have. It's potentially far worse to have memory corruption with ZFS than a non-checksummed filesystem.


----------



## Snurg (Nov 28, 2017)

SirDice said:


> Believe me when I say you will never want to deal with it. It's a royal pain trying to recover from it and there's never a guarantee the data you recovered is not corrupted in some very subtle way (a few flipped bits here and there).
> [...] And it can take a while for you to notice those files are corrupt. Now imagine you diligently backup your data every day. Your backups will be corrupted too because you're backing up bad files. If this goes unnoticed long enough all your backups will be worthless too.


I was in that situation and had to sort that mess out, to save at least part of valuable data.
After that I came to the conclusion that I no longer want this kind of hidden data rot, decided "Never again", sold my non-ECC desktop PCs, and bought some cheap used workstations with buffered ECC DDR3, which is the cheapest RAM in $/GB when bought used in bulk.
My laptop has no ECC, so I equipped it with 14800 RAM modules from a first-source manufacturer, clocked as 10600, making sure it's well underclocked. It seems quite reliable, but who really knows.



_martin said:


> I completely agree. Personally I think it's also a historic thing - it was more expensive to produce these modules before.


Not really. Up to the mid-late 1980s there was a ninth socket (parity) for each byte row. When memory modules came up then, they first were commonly ECC also in consumer grade.
But the cheapo mind slowly took over. Some 8088 PC clones already had a dip switch to deactivate parity checking, so people could save 1/9th of the memory cost.
More and more memory modules were sold with the 9th chip missing. In the late 1990s it was almost impossible to find consumer grade PCs that were still able to use ECC/parity protected memory.
And the data rot age began...


----------



## poorandunlucky (Nov 30, 2017)

Snurg said:


> Not really. Up to the mid-late 1980s there was a ninth socket (parity) for each byte row. When memory modules came up then, they first were commonly ECC also in consumer grade.
> But the cheapo mind slowly took over. Some 8088 PC clones already had a dip switch to deactivate parity checking, so people could save 1/9th of the memory cost.
> More and more memory modules were sold with the 9th chip missing. In the late 1990s it was almost impossible to find consumer grade PCs that were still able to use ECC/parity protected memory.
> And the data rot age began...



... That's awful...  For a single chip...  All that for a single, miserable, chip...  it's not like they don't have the design for it, either, or have to draw a special ECC card... they draw the non-ECC from the ECC one...

That's so low...


----------



## ralphbsz (Nov 30, 2017)

In computers with large memory (lots of enterprise machines ship with 256 or 512GB these days, and 1TB is not uncommon), the cost of memory is a driving factor.  Now take that and multiply it across a large cluster (with thousands of machines, which for the likes of Google and Facebook and supercomputers and non-existing agencies would be a small cluster).  In that situation, a customer can make the deliberate tradeoff that his machines will crash occasionally (very rarely; memory errors are actually not common), and generate wrong results (also very rarely).  The wrong results will usually be caught by the toolchain (since usually they create corruption which is detected by the next processing stage), and crashes can be handled transparently by rerunning jobs on the remaining machines.  So the performance loss due to crashes/reruns is very small, and the cost saving is many percent, which works out to millions (it is not uncommon for single customers to buy $10M or $100M clusters).  Economically, this may be a win.  It may also be a loss, if the cost of worrying about wrong results is worse.  If I remember right, Google uses ECC for all machines, even data processing; but I know some large analytics customers deliberately do not use ECC.

For an end-user with one machine, and without a well-developed data processing chain, cluster management, and job scheduling tools, the answer is obviously very different.  There, the memory saving is a few dollars, and the cost of one crash is high, and of one data corruption very high (it might mean days of downtime).


----------



## _martin (Nov 30, 2017)

ralphbsz said:


> In that situation, a customer can make the deliberate tradeoff that his machines will crash occasionally (very rarely; memory errors are actually not common), and generate wrong results (also very rarely).  The wrong results will usually be caught by the toolchain (since usually they create corruption which is detected by the next processing stage), and crashes can be handled transparently by rerunning jobs on the remaining machines.



I disagree; this is not how business is done where I work (note: that doesn't mean it isn't done like this somewhere). 

Silent corruption is bad. You need to trust your data (trust, but audit). Even the weirdest solution architect would not sacrifice ECC RAM to save a small amount of money. We have some PROD HANA boxes with 12TB of RAM (SuperdomeX). The cost of these boxes is so huge you don't think about saving spare change on non-ECC RAM. 

But even the price of entry-level servers is high enough not to consider non-ECC RAM. And even if you scale it to a few hundred servers, it's not that huge a saving. If you need this amount (and more) of servers, you are in business -- money makes money. And if you are a small starting company -- you can't afford to have silent data corruption go undetected. The fines you would (probably) have to pay to a customer would ruin your business. 
And that's all business ..

Personally I don't want a ZFS server storing my personal data with non-ECC RAM. I bought 32GB ECC RAM (8GB DDR3-1600MHz Kingston ECC CL11 w TS Intel) in 2013 for 303.60 EUR. Unfortunately I don't know how much it was for non-ECC ones, I'm guessing around 150EUR maybe. So not that huge difference overall.


----------



## drhowarddrfine (Nov 30, 2017)

If one wants to search for it, codinghorror.com has an article about this and Google published one, too. Both had quite a bit of detail.


----------



## Snurg (Nov 30, 2017)

poorandunlucky said:


> ...they draw the non-ECC from the ECC one...



I see; I was oversimplifying in my historic lookback for brevity. The topic is more complex, so let me tell a bit more of the history.

DRAM data security is not a new issue, and this is one of the reasons why mission-critical embedded hardware often is static (SRAM, static processors etc).

Memory safety always had high esteem in "serious" computing. Before the advent of the IBM PC, microcomputers were effectively toys for enthusiasts. Maybe except the S-100-bus-based CP/M and MP/M systems, which ran the killer app "WordStar" and were much used professionally even though belittled by the "real computing" mainframe world.
And all these 8-bit systems had _no_ memory parity checking.

Memory safety in the form of parity checking was introduced into the microcomputing world by the IBM PC.
This was a big reliability leap, which was _very important because hard disks began to become a mass-market article at the same time_, when miniaturization reached the 5-inch full-height form factor.
The first PC hard disks were a whopping 10 megabytes large and blazing fast (80 ms average access time) compared to the then industry-standard 8" 1.2MB diskette.
In comparison to diskettes, which were typically specified for around 40h of operating lifetime but in reality often lasted much less, this was a revolution.
People were accustomed to regularly replacing much-used diskettes every few weeks to months. A hard disk had a practically unlimited lifetime in comparison.
Before the advent of hard disks, microcomputer users were thus accustomed to regular diskette failures, and not much backup education was necessary.

Apple products, for example, were considered as hobbyists' toys back then. Apple is an example of a company who began using ECC memories quite late, albeit afaik only for their professional-grade products.
The "real computing scene", i.e. Big Blue, DEC, Amdahl, NCR etc, who served professionals, who depend on their data to last more than a few months, in contrast, had used error correction on RAM and disk already back in the 1950s.

--

The codinghorror article drhowarddrfine mentions is a good example of the mindset that led to discarding RAM data-integrity checking whenever that was deemed economical.
It begins by showing Google's first handmade servers as an example of commercial use of non-ECC-protected RAM, when they were still a startup.
But think about it... their business is to rebuild their data continuously. Small mishaps will practically go undetected and go away by themselves. So I think this is a typically misleading example to use when arguing against ECC.

Google's 2009 study is quite good, but they also see only a part of the whole picture. There are many more neglected reasons why soft and hard errors happen that I rarely see discussed, if at all.

I still remember the discussion around 1980 when, in the course of the transition from 16 kbit to 64 kbit DRAMs, it became common to cover the chips with a particle-shielding coating: it had turned out that the structures, then still many micrometers wide and thousands of times bigger than today's cutting edge, had become so small that _a single alpha particle_ could flip bits in the DRAMs if it hit the right spot at the right time.
And you just cannot shield against the trillions of cosmic particles that hit Earth every second. The most powerful cosmic particles, though subatomic in size, carry the kinetic energy of a well-pitched baseball because they travel at almost the speed of light. It is really hard to imagine the immensely powerful micro-wreckage when such a particle hits an atomic nucleus in, say, a memory cell.
And keep in mind that the radiation is not constant; there are outbursts and spikes like solar flares.

It is a well-observable fact in huge server farms that the reboot rate rises sharply when exceptionally strong flares happen (still far from a Carrington event).

And now imagine, your computer happens to get hit by a cosmic particle shower which flips, say 10, 100 or 1000 bits of your gigabytes.
If you have ECC memories, the risk that you will suffer data rot is relatively low. Your computer might indicate a spurious increase of corrected errors or even reboot, that will probably be all.
But if you have no ECC, you could end up with very nasty scenarios, without noticing the actual incident at all.

I just have to look at my own data rot case.
I started to notice more and more corrupted files. Here and there, just random. Just a bit. It was easy to recognize in text files. Images displayed either distorted, or even crashed the viewer etc.
That was very disturbing because I knew those files had once been intact. I checked my backup DVDs and got the impression that the main damage must have happened at some time, or over some interval, quite a long time ago.
At the time in question my main PC (consumer grade, no ECC) was running Linux with the ext3 filesystem (I spent a few years on Linux because my Symbios hardware prevented me from booting FreeBSD).
I don't know what caused the errors. RAM? Bus? Other things? The RAM error issue can be mitigated to a high degree by using ECC. The disk error issue likewise.

So I finally changed to ECC hardware and ZFS and hope I won't experience such thing again.
It really felt like a cosmic particle storm had turned my whole data into a Swiss cheese full of holes. The scariest thing was how late I realized it.


----------



## rigoletto@ (Nov 30, 2017)

There is also a difference in price depending on who is buying the memory. Home users and small businesses usually pay a lot more for memory (and everything else) than big businesses, which do not need to rely on local distribution.

If you need a large quantity of memory sticks, you can call the factory in China directly and buy from them. It will cost a lot less, but you will need to handle all the import-related work (or have someone handle it for you), and know the right contacts abroad so you don't fall for a scam.

Just take a look at the prices on something like eBay or AliExpress, and those (re-)sellers are already making money on it.


----------



## Snurg (Nov 30, 2017)

Luckily as a small user I am not in the situation that I must use the most brand-new hardware. My computing needs can be perfectly fulfilled with used workstations.
Many of these are of much higher quality than any brand new consumer PC and their performance is comparable.

What I also like is that registered DDR3 ECC memory (sold in bulk on eBay by sellers specializing in refurbished stuff) costs about one third of what even used non-ECC consumer-grade memory costs!
And you can put many more modules into most workstations than into any consumer PC!

So this is my recommendation to those FreeBSD friends who have similar needs.


----------



## Terry_Kennedy (Dec 26, 2017)

poorandunlucky said:


> ... That's awful...  For a single chip...  All that for a single, miserable, chip...  it's not like they don't have the design for it, either, or have to draw a special ECC card... they draw the non-ECC from the ECC one...


It is worse than that. 30-pin SIMMs (the 9-chip ones) often came with "logic parity" (*fake* parity), where the SIMM computed the expected parity value based on the data that was in the 8 memory chips (which may not have been the data that the system actually thought it was storing). There was enough of this that a number of BIOS companies skipped the parity test on power-up and instead just zeroed all of memory (they needed to write a predictable pattern to all of memory on the off chance that there was real parity memory in there, in which case the system would get an NMI if it read from uninitialized memory that happened to have the wrong parity).
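
The difference between real and "logic" parity is easy to demonstrate: real parity stores a ninth bit at write time, so a later flip in the data no longer matches it, while logic parity generates the parity bit on the fly from whatever the chips currently hold, so the check can never fail. A hypothetical sketch:

```python
# Real parity vs "logic" (fake) parity, as found on some 30-pin SIMMs.
# The class names and NMI-as-exception model are illustrative only.

def parity(byte: int) -> int:
    """Even-parity bit over the 8 data bits."""
    return bin(byte).count("1") & 1

class RealParitySIMM:
    def write(self, byte):
        self.byte, self.p = byte, parity(byte)  # ninth bit stored at write time

    def read(self):
        if parity(self.byte) != self.p:
            raise RuntimeError("NMI: parity error")  # later flip is detected
        return self.byte

class LogicParitySIMM:
    def write(self, byte):
        self.byte = byte                        # no ninth chip, nothing stored

    def read(self):
        # "Parity" is generated on the fly from the (possibly bad) data,
        # so it always agrees with itself and the check can never fail.
        assert parity(self.byte) == parity(self.byte)
        return self.byte

real, fake = RealParitySIMM(), LogicParitySIMM()
real.write(0b10110010); fake.write(0b10110010)
real.byte ^= 0b00000100; fake.byte ^= 0b00000100  # same bit flip in both
fake.read()                                       # corruption sails through
try:
    real.read()
except RuntimeError:
    pass                                          # real parity catches it
```

Which is exactly why a BIOS could not rely on the parity signal to prove the memory was good: a logic-parity SIMM reports "all fine" no matter what it holds.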

Things got somewhat better with 72-pin modules, as there were now 32 bits of data per module. A parity module would have 36 bits, one parity bit per byte. But when installed in pairs, that gave you 8 parity bits for 64 bits of memory. ECC became practical.

However, there were other issues to confuse buyers - FPM (traditional DRAM) vs EDO, parity/ECC vs non, and worst of all, gold leads vs tin leads. Tin leads were used in piercing (pointy-pin) sockets, and gold leads were used in non-piercing (wiping-finger) sockets. Putting a gold module in a piercing socket damaged the socket, as the pins "stubbed their toes" on the gold pins (while gold is soft, it is a very thin layer on top of copper, which isn't).


----------



## Terry_Kennedy (Dec 26, 2017)

Snurg said:


> Memory safety always had high esteem in "serious" computing. Before the advent of the IBM PC, microcomputers were effectively toys for enthusiasts. Maybe except the S-100-bus-based CP/M and MP/M systems, which ran the killer app "WordStar" and were much used professionally even though belittled by the "real computing" mainframe world. And all these 8-bit systems had _no_ memory parity checking.


Not true. I was designing and my company was selling MP/M systems with 512KB of parity-protected 70ns static RAM. And if you think it is still a toy system, it ran 8 copies of WordStar simultaneously when the average S-100 system had troubles running one. Clarke's 2010 was typeset (original US hardcover) after being input on one of my 8-terminal MP/M systems by Fisher Composition's Arkville, NY office.

Enough for now - I have to run to dinner...


----------



## vermaden (Dec 27, 2017)

If Your platform supports ECC, use it; if not, then You will not be able to use it ... it's that simple.


----------



## Snurg (Dec 27, 2017)

Terry_Kennedy said:


> Not true. I was designing and my company was selling MP/M systems with 512KB of parity-protected 70ns static RAM. And if you think it is still a toy system, it ran 8 copies of WordStar simultaneously when the average S-100 system had troubles running one. Clarke's 2010 was typeset (original US hardcover) after being input on one of my 8-terminal MP/M systems by Fisher Composition's Arkville, NY office.


I have serious doubts that this is true.

Clarke's 2010 was released in 1982.
At that time a 6 MHz Z80 was the top-end processor for CP/M, but most computers ran at 4 MHz. (Cromemco introduced its first 10 MHz 16-bit MP/M 68k hardware no earlier than 1982.)

70 ns SRAMs (Cache RAMs) were introduced around 1982, and were very, very expensive.
They saw widespread use not before the onset of the 386 boards in 1987.
I heavily doubt it makes sense to use such fast and expensive SRAMs as main memory for 2, 4 or at most 6 MHz processors.

Imagine the sheer size of memory boards holding almost 300 16 kbit SRAM chips in large DIL packages.
In 1980 memory cost around $10 per kilobyte for DRAM and about $40 per kilobyte for SRAM.
Using SRAM instead of DRAM back then would mean the extra cost for a 512 KB system was about $20,000 (four times as many of the more expensive and larger chips, at least 5-6 sq. ft. more PCB, more casing, a bigger power supply, ...).
That is a lot when you consider that a typical office system (computer, terminal, daisywheel printer, OS, software) cost about $10-15k at the time.
In short, cost is the reason why practically nobody used SRAM as main memory in large installations.
And Cromemco, a technology leader back then, afaik _never_ produced parity-protected memory boards.
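For what it's worth, the cost figures above check out as an order-of-magnitude estimate. A quick back-of-envelope sketch, using only the numbers quoted in this post (16 kbit SRAM chips, $10/KB DRAM vs. $40/KB SRAM); the 9-bits-per-byte layout is my assumption for a parity-protected board:

```python
# Back-of-envelope check of the SRAM-vs-DRAM cost argument.
# All dollar figures are the ca.-1980 prices quoted above; the
# 9-bits-per-byte parity layout is an assumption.

size_kb = 512                      # target memory size in KB
dram_per_kb = 10                   # $ per KB of DRAM
sram_per_kb = 40                   # $ per KB of SRAM

chip_bits = 16 * 1024              # one 16 kbit SRAM chip
total_bits = size_kb * 1024 * 9    # 8 data bits + 1 parity bit per byte
chips = total_bits // chip_bits

extra_chip_cost = size_kb * (sram_per_kb - dram_per_kb)

print(chips)            # 288 -- "almost 300" chips, as stated
print(extra_chip_cost)  # 15360 -- ~$15k in chips alone, before PCB/PSU/casing
```

So the chip-cost delta alone is about $15k, and ~$20k for the whole system once boards, casing and power supply are included is plausible.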

As I cannot remember having seen any memory board for 8-bit computers with parity in 40 years, I would very much appreciate it if you could show me some proof, for example pics, schematics, or the like.

As the 8-bit era ended over 35 years ago, I thus guess you were just confusing some things.
Otherwise I would kindly ask you for more technical detail, as I am sure you can tell things from that time that would interest a lot of the forum's readers.


P.S. Regarding WordStar:
It was predestined for multi-user operation, as it is an application that idles most of the time.
So it was quite possible to run multiple WordStar terminals on a 2 MHz Z80 system.
Of course it would get laggy if all 8 users printed at the same time, for example.
But people knew this and adjusted their workflows accordingly.


----------



## ralphbsz (Dec 28, 2017)

Fast SRAM chips were available much earlier.  We had them on CAMAC-to-Unibus converters called "MBD2" in the early 80s; those things were able to run at 40MHz (they were sold at 20 or 25MHz, but we upgraded them by optimizing the circuits, which were completely wire-wrapped), so that must have been 25ns memories (or somewhat faster). Now I'm not sure of the density of those chips, but it wasn't very high.

I cannot remember whether 8-bit Z80 computers with parity existed or not; I only started using 8-bit machines in the era of 4K DRAM memory chips (4 kbit each, so 64 KByte of memory required 128 chips, which is why many computers shipped with 16 or 32 KByte).  The 16K chips were easier to live with, and the first 64K chips were a godsend; for the first time we could put over 64 KByte onto a single Eurocard.

P.S.: The first paragraph above talks about the speed of SRAM; the second paragraph about density of DRAM.  Sorry if that causes confusion.


----------



## Snurg (Dec 28, 2017)

ralphbsz
Yes, overclocking was way easier back then.
If a 50% higher clock didn't work, it was usually just a matter of finding the murky chip that was too close to its minimum spec. (OK, I am exaggerating a bit, but it was totally different from nowadays...)

Iirc 16k DRAMs were introduced in 1976 or '77, but 4k DRAMs were sold in computers until about 1980. Their speed was as low as 350 ns. The 64k DRAMs, introduced around 1980, were substantially faster, too, up to 150 ns. Around 1984 came 256 kbit parts, down to ~100 ns; around 1988 then 1 Mbit, down to 60 ns. (tCAS -> data ready)

I know because memory hardware was one of my special interests back then.


----------



## recluce (Dec 28, 2017)

For me, it is ECC wherever possible. Even my new laptop will come with ECC memory, and I would never consider building/buying/using a server without it. It is also a good argument for Ryzen CPUs, which mostly support ECC (I believe only Ryzen 3 does not).


----------



## Terry_Kennedy (Jan 5, 2018)

Snurg said:


> I have serious doubt that this is true.


I had been working for Expertype in NYC doing support on the Computer Composition International phototypesetter front-end systems (Data General Novas). Expertype and CCI came to some agreement where Expertype would market and support CCI systems in the NY region. One of the first customers was Fisher Composition, just up Park Ave from Expertype. After a round of layoffs at Expertype, I was approached by Steve Fisher to design a "front-end for the front-end", as additional terminal licenses on the CCI hardware were incredibly expensive.

The first system I did for Fisher was an Intertec Superbrain which replaced a primitive entry system with a 1-line display that punched on paper tape (which was then fed to the tape reader on the CCI Nova). The Superbrain allowed local editing and more than a line of text, and the output from the Superbrain went directly into the CCI Nova, initially by pretending to be a fast paper tape reader (since there was no support for paper tape on the Superbrain).

Fisher wanted something much more customizable, so I proposed a multi-user (MP/M) system with 8 Ann Arbor Ambassador terminals. The system communicated with the terminals at 38.4Kbaud - 4 times the "industry standard" of 9600. And Rob Barnaby implemented some esoteric WordStar features for us - in addition to relinquishing the CPU to the MP/M scheduler instead of spinning in an idle loop, a lot of the "optional performance improvement feature" stuff like line insert with control of scroll direction were added to support the very advanced features of the Ambassador terminals.

The chassis was a rackmount 22-slot TEI chassis with a constant current power supply, with a custom front cover which replaced the power on/off toggle and reset button with an ACE cylinder lock. We upgraded the chassis with Torin fans. The CPU and some of the supporting cards were from Ithaca Audio, with a custom CPU daughterboard of my own design to provide a line-time-clock and (somewhat) High Precision Event Timer for MP/M. I went with static memory because of its great compatibility - an Altair dynamic memory board wouldn't work in an IMSAI system and vice versa - only front panels were less compatible with foreign systems. I put the bank select and parity logic on the card edge and left side areas, leaving the rest of the card for the memory chips. Parity errors lit up LEDs on the memory card as well as sending an interrupt via my LTC / HPET CPU daughterboard. The system would report the card and bank of the error to the maintenance terminal (A Volker Craig VC404 - that's the TTL one, not the VC4404 ("chat") which was microprocessor-controlled)

Everything that could possibly be interrupt-driven was. I re-used a lot of that technology on later single-user systems. You might be interested in https://www.glaver.org/cpmstuff/pc8cpm.txt - that's the manual for the CP/M we shipped for single-user boxes. Note the TOD support and the background, interrupt-driven print spooler (which used threading support we added to CP/M, since it had to be able to do filesystem reads independently of any running user program).


> Clarkes's 2010 was released 1982.


This was not the first book input through my systems - "The Islamic Bomb" by Weissman & Krosney (1981) was the first. The second was "Licence Renewed" by Gardner (the first authorized James Bond continuation novel).

If you still don't believe me, go contact Dave Burstein on LinkedIn - https://www.linkedin.com/in/daveburstein - he's the actual source of the "handles 8 WordStar users faster than most computers handle 1" quote as well as being the mover who delivered the system in question to Fisher Composition in Arkville.


----------



## tingo (Jan 7, 2018)

I enjoyed this post - thanks!


----------



## Snurg (Jan 8, 2018)

Terry_Kennedy 
Thank you very much for your very enjoyable post, too!
It vividly brings back memories from long ago.

The Superbrain! An all-in-one computer, ready to be placed onto the desktop! No more separate units like CPU, drives, terminal.
But if I remember correctly, when I saw it back then I disliked that its display's dot matrix was very coarse. I mean, you could clearly see every character being composed of seven or so scan lines, like the IBM logo split into horizontal stripes.

And as the whole business was far more expensive than today, and far less standardized, there was much custom manufacturing, as ralphbsz's also very interesting report confirms.
If I understand correctly, you basically designed a custom system for Fisher, well adapted for publishing, which they sold as a complete hw/sw package to publishing customers.
But it was no mainstream system like Cromemco, NorthStar, Altos, etc., whose components were advertised in BYTE and of which I do not remember having seen more than 8 chips per byte.

The reliability issues you mentioned regarding the fickle DRAM timing, with its potential to cause instability and mangle data, are a good reason to use SRAM, especially when high-value work is involved.
To be honest, I was curious which chips you used (I guess 6167 or the like). I guess you then put 64k (or at most 128k if tightly packed) onto a memory board and had 8 or 4 memory cards in that 22-slot monster rack, maybe filling only every second slot to maintain good ventilation?

Anyway, it is a bit sad that there seem to be no photos etc. They would vividly document what design quality is possible when there is no need to sell individual components as cheaply as possible.
Thus, despite the fact that I believe you, I can only say that I never saw parity-protected memory systems on 8-bit machines. I can say I know from an anecdotal source that some apparently existed in high-end, small-scale-manufactured systems, but I cannot prove it.

*Anyway, your story and the ECC topic point to the crucial questions:*
How cheap is sensible?
How much quality and safety can we afford, and how much do we want to sacrifice?

Highly topical, in view of Meltdown and Spectre, as these things reportedly originated in attempts to cut back on safety in order to lower manufacturing costs.
In the end, the extra cost per chip would have been minor, considering the whole system cost.

But isn't it a crazy thought that, for example, a nuclear plant could blow up causing trillions in damages, only because safety logic was omitted from a processor, logic that would have increased the whole system cost by, say, $10 at most?


----------



## ralphbsz (Jan 8, 2018)

And that's why nuclear power plants (or life safety applications) don't rely on a single computer, much less on a consumer-grade one, for critical functions.


----------



## Snurg (Jan 8, 2018)

I remember the Apollo missions: they usually had three computers, and if one deviated from the other two, it was deactivated and repaired. This happened quite a few times.

By the way, regarding nuclear incidents, the main causes are not failing computers, but human error, often in conjunction with usage of safety override mechanisms (intentional and unintentional ones). 
Chernobyl is an example of this. The old shift leader pressured the very young reactor operator to override the emergency shutdown program of the SKALA computer, and what happened then is well known.

The counterfeit-parts problem has also already led to some incidents in the nuclear industry.
And I think it's hard to sell counterfeit modules with bad memory chips if there is parity checking.
But even ECC can be counterfeited using parity-generator chips.


----------



## Datapanic (Jan 8, 2018)

It's as simple as this:  If you are running FreeBSD with ZFS, then you WILL use ECC memory or else regret the BIT ROT that comes up later.


----------



## tingo (Jan 8, 2018)

Datapanic said:


> It's as simple as this:  If you are running FreeBSD with ZFS, then you WILL use ECC memory or else regret the BIT ROT that comes up later.


That is your opinion, and you are entitled to have one.

However, it is not necessarily a fact, nor "a truth" or anything like that. I (and probably others) have many years of experience showing that ZFS works nicely and trouble-free even without ECC memory. There is nothing "special" inside ZFS that makes it require / depend on ECC memory any more than any other filesystem. One of the people behind ZFS has stated that in writing (you can google it if you are interested).
ECC memory is nice, but it is certainly not required.


----------



## ralphbsz (Jan 8, 2018)

I think what Datapanic is trying to say (*) is: If you use ZFS and configure it to have both checksums and RAID redundancy, the biggest source of bit rot in a computer (which is the disk drives) is eliminated.  Thereby main memory bit rot becomes the dominant source of corruption, which ECC memory cures.  Now, whether your particular usage needs to worry about bit rot at that level or not, and whether using ECC is a good cost/benefit tradeoff for your particular usage pattern is still a different question.  For certain uses, not having ECC, and even not having checksums in your file system is perfectly acceptable.

(* Footnote: I'm deliberately putting things in Datapanic's mouth here, to try to get to a synthesis of opinions.)
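To make the checksum argument concrete, here is a minimal sketch of end-to-end checksumming in the style ZFS uses. This is an illustration only, not ZFS's actual on-disk format (ZFS stores fletcher4 or sha256 checksums in the parent block pointer, not beside the data):

```python
# A block is stored together with its checksum; on read, the checksum is
# recomputed and compared. Any bit rot between write and read is detected.
import hashlib

def write_block(data: bytes):
    return data, hashlib.sha256(data).digest()

def read_block(data: bytes, checksum: bytes) -> bytes:
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data

block, csum = write_block(b"important payroll record")

# Simulate one flipped bit at rest (disk) or in flight (bus/controller).
corrupt = bytes([block[0] ^ 0x01]) + block[1:]

try:
    read_block(corrupt, csum)
except IOError as e:
    print(e)   # detected; with RAID redundancy ZFS could also repair it
```

Note the gap this leaves: the checksum is computed over whatever is in RAM at write time, so a bit that flips *before* checksumming is faithfully checksummed and stored as garbage. That residual window is exactly what ECC addresses.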


----------



## poorandunlucky (Jan 19, 2018)

Datapanic said:


> It's as simple as this:  If you are running FreeBSD with ZFS, then you WILL use ECC memory or else regret the BIT ROT that comes up later.



Would you mind elaborating a bit?


----------



## SlySven (Jan 19, 2018)

Might I make the observation that for some _other_ Operating Systems, the robustness of the core memory is not going to be the _most_ significant cause of data/bit rot or other catastrophic failure?


----------



## LVLouisCyphre (Feb 25, 2020)

Use ECC whenever possible.  I have a pair of HP Microserver G7 N54Ls.  I made sure they had ECC for ZFS.  Anything that's a memory hog should have ECC.  There are a few horror stories on the FreeNAS forum of people losing their RAIDZ2 simply by not using ECC.  Fortunately I found out that the HP MS G7s use the same memory as a Lenovo Thinkserver TS430.  Just buy a pair of 8 GB ECC UDIMMs for one of those and you're set.


----------



## LVLouisCyphre (Feb 25, 2020)

poorandunlucky said:


> Would you mind elaborating a bit?


It's _well_ documented on the FreeNAS forum. If you don't use ECC with ZFS which is a memory hog, you are putting your data in a noose awaiting for the chair to be kicked out from under it.


----------



## shkhln (Feb 25, 2020)

LVLouisCyphre said:


> It's _well_ documented on the FreeNAS forum. If you don't use ECC with ZFS which is a memory hog, you are putting your data in a noose awaiting for the chair to be kicked out from under it.



Anything the FreeNAS forum has to say on ECC is complete bullshit. (It's the birthplace of non-ECC-mem-will-break-your-entire-ZFS-pool FUD.)


----------



## PMc (Feb 25, 2020)

shkhln said:


> Anything the FreeNAS forum has to say on ECC is complete bullshit. (It's the birthplace of non-ECC-mem-will-break-your-entire-ZFS-pool FUD.)


Thanks. I was wondering how this came up. Probably too many people running a computer who shouldn't be.


----------



## shkhln (Feb 25, 2020)

Well, there was (is?) a certain _moderator_ named Cyberjock (cyberj0ck) with a very strong opinion on ZFS reliability in a non-ECC scenario. He's mildly infamous for that reason.


----------



## 6502 (Feb 25, 2020)

Unfortunately, it is not easy to find an ECC motherboard. In addition, in some cases a board may claim ECC support but not actually implement it. How to test? There is no easy way. Some manufacturer would have to produce special RAM modules with an "ECC failure mode" switch, to verify that ECC actually catches the errors.


----------



## Vadim_Mkk (Feb 25, 2020)

*RAM*
It’s not surprising that Sun’s documentation said you needed ECC RAM to use ZFS well. Sun sold high-end servers. But according to Matt Ahrens, “*ZFS on a system without ECC is no more dangerous than any other filesystem on a system with ECC.” ZFS’ built-in error correction compensates for most but not all memory-induced errors*. The generic arguments in favor of ECC RAM are still valid, of course. A machine with non-ECC memory can suffer memory corruption, and it’s possible for some of those errors to get to disk. That would happen regardless of the filesystem you’re using, however, and ZFS checksums offer a hope of identifying the problem. *If you’re running a high-availability service, you want ECC for all of the usual reasons*. But *your ZFS laptop or home movie server will function just fine with normal RAM.* 
© FreeBSD Mastery: Storage Essentials by  Michael W Lucas & Allan Jude


----------



## ralphbsz (Feb 25, 2020)

The answer is not black and white, it is complicated. What is the biggest danger to data? To data becoming inaccessible temporarily or permanently?

We used to think it was disk failure. That was actually somewhat true. We fixed that with RAID, and many other technologies. We fixed many other things too, for example accidental deletion; one attempt is Microsoft Windows asking "are you sure" if you type "del *.*": that was a good starting point. Another smaller cause of data loss is corruption of memory content. There are many ways to combat that. One is to use ECC, which protects against single bit errors (alpha particles, cosmic ray air showers). Another is to use checksums on in-memory data structures. Yet another is software design that prevents things like wild pointers and accidental memory overwrites.

The answer to "does ZFS need ECC" is multi-faceted. Compared to what? ZFS on a non-ECC system is no less safe than UFS or ext4 on a non-ECC system. It is actually considerably safer (because of checksums, both at rest and in flight). On the other hand, with ZFS in normal installations having solved the disk reliability problem (with RAID and on-disk checksums), the next biggest hardware source of data problems is memory, so it is important to fix.

In reality, the largest danger to data is humans. Accidental deletion and mis-administration are by far worse than anything ECC can fix. Instead of arguing over whether ECC is important or not, we should all sit down with the man pages for the ZFS commands and learn how it works.


----------



## 6502 (Feb 25, 2020)

My first desktop with 386sx16 had ECC. I think at that time all RAM modules (SIMM) had ECC.


----------



## SirDice (Feb 26, 2020)

6502 said:


> I think at that time all RAM modules (SIMM) had ECC.


Not all of them. Some 30 pin SIMMs had parity, which is not the same as ECC. ECC on SIMMs was only on some of the 72 pin SIMMs, at a much later date.
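The difference is worth spelling out: a parity bit can only *detect* a single-bit error, while ECC (a Hamming-style code) can *locate and correct* it. A minimal sketch using the textbook Hamming(7,4) code; real ECC DIMMs use SECDED codes over 64-bit words, but the principle is the same:

```python
# Parity vs. ECC in miniature: parity detects a single flipped bit but
# cannot say which one; Hamming(7,4) pinpoints and corrects it.

def parity(bits):
    return sum(bits) % 2               # even parity over the data bits

def hamming74_encode(d):               # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                  # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                  # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                  # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    # The syndrome is the binary position of the flipped bit (0 = no error).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    c = list(c)
    if pos:
        c[pos - 1] ^= 1                # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]    # the recovered data bits

data = [1, 0, 1, 1]
code = hamming74_encode(data)
code[4] ^= 1                               # a cosmic ray flips one bit
print(hamming74_correct(code) == data)     # True: ECC corrected it

word = [1, 0, 1, 1]
stored = parity(word)
word[2] ^= 1                               # same flip, parity-only memory
print(parity(word) != stored)              # True: an error is seen, but
                                           # parity cannot say which bit
```

This is also why a parity SIMM can only raise an NMI and halt the machine, while an ECC system fixes the bit and keeps running.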


----------



## zader (Feb 26, 2020)

It is kind of funny, b/c ZFS the "program" really doesn't care what kind of memory is in the machine. ECC RAM adds an extra/optional layer of corruption protection between when the processor writes information to RAM and when that information is dumped from RAM; the only real (non-technical) difference is that you get 1 extra check with ECC RAM before the data is passed off to ZFS to be written.

Does it help? Sure, it's nice to have. Is it required? Nope. ZFS is self-contained and doesn't care what the OS or RAM reports; it does all of that internally. Is it worth 30 messages of debate? lol, nope. I would guess most people that use ECC RAM only do so because they are using a Xeon. I forget which episode of BSD Now it was, but AJ and BH discussed it in depth; you could check the show notes if you really want to deep-dive into it.


----------



## PMc (Feb 26, 2020)

Well, there is some logic in it: ZFS keeps a huge amount of filesystem data in memory for long times, and AFAIK it is not checksummed while being there. So if a bitflip occurs, it may hit that data, and when it is later written to disk, the flaw will not be detected and all mirrored copies will be written the same. Then, if it happens that this bitflip indeed concerns the toplevel master node (or whatever the thing is called) of a pool, then data might be written out damaged in a way that the pool structure becomes unintelligible. Then it would be necessary to analyze that pool and fix it manually (there are companies offering such a service, if you won't try it by yourself).
This is different to other filesystems. There fsck could repair such errors. And the bitflip would rather hit cached application data, so the application data might be silently wrong, which might be just as bad.

What about the probability? Assuming that a cosmic ray triggering a bitflip happens about once a year (at sea level) somewhere in 16 GB of RAM, it would hit the master node (let's assume 100 kB out of those 16 GB) only about once every 160,000 years.

Nevertheless, I do use ECC where I can, and I recommend it for 24/7 operation.
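That estimate can be made explicit. A sketch using only the assumptions stated above (the one-flip-per-year rate itself is a rough, much-debated figure):

```python
# Expected interval between cosmic-ray bitflips landing in a small
# critical region, given a flip rate for the whole of RAM.

ram_bytes = 16 * 10**9         # 16 GB of RAM
critical_bytes = 100 * 10**3   # ~100 kB of top-level pool metadata
flips_per_year = 1             # assumed rate over the whole 16 GB

p_hit = critical_bytes / ram_bytes        # chance a given flip lands there
years_between_hits = 1 / (flips_per_year * p_hit)
print(round(years_between_hits))          # 160000 -- on the order of 10^5 years
```

Of course this scales linearly: ten flips a year, or a megabyte of critical structures, each cut the interval by a factor of ten.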


----------



## ralphbsz (Feb 26, 2020)

PMc said:


> Well, there is some logic in it: ZFS keeps a huge amount of filesystem data in memory for long times,


So do other file systems. Normally, file systems fill all memory with disk cache. ZFS's caching is not particularly aggressive, nor does it keep unwritten data (write cache) in memory particularly long.



> and AFAIK it is not checksummed while being there.


I don't know. I should read the slides from the talk that Kirk McKusick gave a few days ago about how ZFS is implemented on FreeBSD. I think that all data and metadata in the memory cache is checksumed.



> Then, if it happens that this bitflip indeed concerns the toplevel master node (or whatever the thing is called) of a pool, then data might be written out damaged in a way that the pool structure becomes unintelligible.


Yes. But one of the design principles of ZFS is that it never overwrites anything on disk. So if a metadata block is written damaged, the old copy of it is still on disk too, which makes the problem restorable and fixable for a long time (until the old data needs to be overwritten due to space concerns). I don't know how hard it is to fix such a problem though.



> This is different to other filesystems. There fsck could repair such errors.


Fsck can repair some errors, but not all. A completely shredded metadata block (like inode or directory content) is often not repairable.


> And the bitflip would rather hit cached application data, so then the application data might be silently wrong, which might be just as bad.


Traditional file systems also have metadata (superblocks, inodes, directories, indirect blocks), and they also cache those in memory.



> Nevertheless, I do use ECC where I can, and I recommend it for 24/7 operation.


I agree, using ECC is good, but not at any cost. You are still better off with ZFS on non-ECC than with other file systems on non-ECC.


----------



## LVLouisCyphre (Mar 7, 2020)

shkhln said:


> Anything the FreeNAS forum has to say on ECC is complete bullshit. (It's the birthplace of non-ECC-mem-will-break-your-entire-ZFS-pool FUD.)


As others have commented here, that's entirely subjective if you have the option of ECC or no ECC.  However, I'd rather pay the extra for ECC memory, if it's within my budget, for my servers running ZFS, for the additional level of data integrity.   Plus some servers _*only*_ support ECC memory, such as my Lenovo TS430.  If ECC weren't that big of a deal, then why do many servers require it at the hardware level?

ZFS only cares about available memory; an ECC requirement is a system-board issue, so that's a moot point.  The issue is the probability of memory corruption before ZFS writes, if your system has the option of ECC or no ECC.  I'm not one to gamble with my data.


----------



## shkhln (Mar 7, 2020)

I don't care what _you_ personally use. Just please stop scaring hobbyists into choosing ext4-based Linux NAS solutions. This is completely counterproductive.


----------



## Ofloo (Mar 7, 2020)

To me it seems simple: if you have the option to go with ECC, choose ECC. ECC is always better.


----------



## LVLouisCyphre (Mar 7, 2020)

shkhln said:


> I don't care what _you_ personally use. Just please stop scaring hobbyists into choosing ext4-based Linux NAS solutions. This is completely counterproductive.


This is a *FreeBSD* forum; not Lummux. Since you're here on your high penguin, you should be ban hammered.


----------



## shkhln (Mar 7, 2020)

Oh my… You are welcome to report me.


----------



## LVLouisCyphre (Mar 7, 2020)

shkhln said:


> Just please stop scaring hobbyists into choosing ext4-based Linux NAS solutions. This is completely counterproductive.


I don't even know WTF you're talking about because I don't use Lummux.  Speak BSD or fasthalt.


shkhln said:


> Oh my… You are welcome to report me.


Already done.  I also commented in the feedback forum because I've seen people who don't follow the rules posting here that should be gone.  If they need an enforcer for it, I'm volunteering.


----------



## shkhln (Mar 7, 2020)

LVLouisCyphre said:


> If they need an enforcer for it, I'm volunteering.



You'll ban 3/4 of the forum members. It's not a coincidence that the moderators are the most chill people here.


----------



## ralphbsz (Mar 7, 2020)

Ofloo said:


> To me it seems simple: if you have the option to go with ECC, choose ECC. ECC is always better.


If you have to make a tradeoff between ECC and an extra disk or better disk, the extra disk MIGHT be a better investment. It depends on how much disk redundancy you already have. But I agree, the extra cost of ECC DIMMs (if you already have an ECC-capable motherboard) is not very high, so in general it is a good idea.

If you are running ZFS, the best investment is to buy a sufficiently large number of disks to use a sufficient RAID code (the more fault tolerant, like Z2 or Z3, the better). The second best investment is to buy high-quality disks (enterprise grade). The third best investment is to buy or rent hardware for remote backups, to protect against destruction of the primary server (either by hardware or operator error). Perhaps the priority of "second" and "third" could be switched. And ECC is next.


----------



## Crivens (Mar 7, 2020)

shkhln said:


> You'll ban 3/4 of the forum members. It's not a coincidence that the moderators are the most chill people here.


Someone called? Splendid! 
I do believe that someone has been misreading things. Someone does not know how to handle things. Someone only doubles down. All this does not need to be the same someone.
Now you all pop a cold one, put up your feet and get mellow. Or else...

(Yes, the mods here are chill. Like a sleeping bear in his cave. You don't want to poke it _too hard_.)


----------



## shkhln (Mar 7, 2020)

Perhaps a thread closure is in order. There is hardly anything left to explore.


----------



## Ofloo (Mar 8, 2020)

ralphbsz said:


> If you have to make a tradeoff between ECC and an extra disk or better disk, the extra disk MIGHT be a better investment. It depends on how much disk redundancy you already have. But I agree, the extra cost of ECC DIMMs (if you already have an ECC-capable motherboard) is not very high, so in general it is a good idea.
> 
> If you are running ZFS, the best investment is to buy a sufficiently large number of disks to use a sufficient RAID code (the more fault tolerant, like Z2 or Z3, the better). The second best investment is to buy high-quality disks (enterprise grade). The third best investment is to buy or rent hardware for remote backups, to protect against destruction of the primary server (either by hardware or operator error). Perhaps the priority of "second" and "third" could be switched. And ECC is next.



Maybe I'm too simple-minded, but it's not because you get better memory that you should spend less on disks. Also, you can gradually add more disks. Disks get cheaper a lot faster than RAM, so buying over a bigger time frame will let you invest more and get better disks over time.

What you need to think about is a good motherboard that allows you to expand, a good CPU, good memory, and a good disk controller/backplane; disks come later.


----------



## zader (Mar 8, 2020)

Spec-wise, here are a couple of examples.

If you're set up for racks, aka don't mind noise, this is a killer bang for the buck for a 12-bay (works 100% with FreeBSD); I actually have one keeping my feet warm. (If they haven't changed it, mine also came with a 4-port 10 Gb Intel card; not sure if that was a mistake, though.)








Supermicro 2U SAS3 FreeNAS 20-Core 64GB 12x Trays IT MODE Sp - www.theserverstore.com
				




Or, if you want to build your own: it's more money, but this isn't too bad. Granted, Linus is not the best resource for ZFS/storage systems, but the hardware selection is OK.




_View: https://www.youtube.com/watch?v=FAy9N1vX76o_


----------



## Mirror176 (Apr 12, 2020)

I'd think this just comes down to priority. If your data is important, you want error correction and redundancy (immediately available or not). Some tasks place a higher importance on uptime than on data integrity.
  If data is important, RAID does not come before backups. RAID is good at maintaining uptime, and at joining multiple disks into one storage area for the additional space. RAID does not defend against accidental deletions/overwrites, viruses, or software/OS/hardware issues; backups may, if an originally good copy makes it out to the backup without the corruption/damage/mistake. Data integrity then has to hold up in those backups until they are needed. Checksums, or better yet digital signatures, are useful to make sure, while multiple backups help in case one does have 'any' problem. Many communication buses have their own integrity checks (AHCI, TCP/IP, etc.) that the data passes through, but those only watch it during transit.
  Bad RAM 'could' lead to corruption of ZFS structures or of the data within the files, as the garbage may go in checksummed. It could also impact the checksumming itself. As a bigger picture, it may also corrupt the data a program was going to pass to disk well before ZFS ever saw it, and ZFS cannot protect against 'garbage in'. The likelihood of each is best guessed from the quantity of RAM used for each of those tasks and the chance that any given amount of RAM will be hit by corruption; using less RAM makes it less likely the problem impacts your workload at all.
  Without ECC, filesystem checksums, etc., you won't know you have a problem unless it causes a program to crash/misbehave, causes a checksum failure that detects it, or causes structural/visible corruption in the data when you look at it later. A JPEG or MP3 file may be more obviously broken when reviewed later, but a WAV or bitmap may easily have corruption whose value drift is a small enough change that it goes unnoticed even after review. ECC RAM + communication bus checksums + ZFS still doesn't cover the rest of the memory in the computer, such as the cache in your CPU. You are also left to wonder whether your disk controller has non-bus data-integrity checks too. In the end, none of that gets you past a buggy program.
  If it is that important, you can always perform a task more than once and check that the outcomes match. Doing that across multiple pieces of hardware helps answer the question of whether there was a hardware issue, as long as each piece doesn't have the same 'bug' engineered into it. The SETI@home project takes the results of 3 users performing a computation and, if they don't all match, can have more calculation attempts performed.
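The run-it-several-times idea is classic triple modular redundancy, the same scheme as the voting computers and the SETI@home validation mentioned in this thread. A minimal sketch, where the three results are assumed to come from independent machines:

```python
# Majority voting over redundant computations: a single corrupted
# result is outvoted, and the disagreeing unit is identified.
from collections import Counter

def vote(results):
    winner, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority -- recompute or add more units")
    faulty = [i for i, r in enumerate(results) if r != winner]
    return winner, faulty

# Unit 1 suffers a bitflip and returns a wrong answer.
answer, bad_units = vote([42, 41, 42])
print(answer)     # 42 -- the corrupted result is outvoted
print(bad_units)  # [1] -- unit 1 is flagged for "repair"
```

This only works if the failures are independent; a systematic bug shared by all units produces a unanimous wrong answer.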
  I've had bad RAM in my desktop cause different kinds of corruption, including causing a ZFS filesystem to have issues; the RAM was VERY unstable, and months of use were still necessary to reach a point where ZFS filesystem issues could be detected, despite random bits of data erroring out multiple times over two hours according to memory testers. Most times it showed up as program crashes or OS crashes. One hour of 'uptime' would take multiple tries to reach. I've also had disks overheat and occasionally cause small bits of corruption, which the disk internally identified as unreadable sectors (rewrites caused no sector reallocations, as the sectors would be seen as working on write). I had another disk in the past that, as it failed, would pass corrupt data with no warning, but as it was a scratchpad for a video project I was actively working on, it was easy to see the data glitches that occurred.
  I personally do not have ECC in my desktop, much as I'd like to; higher-priced hardware with lower performance, on a device that doesn't make me money, hasn't seemed wise at the time of shopping. I have been using ZFS for years. It's nice to think problems don't happen silently at no added $ cost, but I have taken a performance hit to use ZFS; my FreeBSD box handles low memory VERY poorly, slowing down as if swap were in use even when swap is disabled, with other non-disk slowdowns, long pauses, and system freezes, when I would expect the system to say 'RAM too low; closing a program', but it just won't. ZFS + scattered data (fragmentation?) has shown large performance issues too. ZFS by design currently has no way to reoptimize poorly arranged data on its own, and since it won't rewrite data to fix that and I don't want to backup/restore disks regularly to work around it, I usually copy a folder whose data I expect to be affected, delete the old one, and rename the new to the old name, which gets a LARGE performance bump back. I have also had to go back to a backup due to corruption causing a reboot during a scrub (and a running scrub restarts on reboot). With no reliable way to fix or live with ZFS data issues (I consider that previous issue to be beyond a 'checksum failed; replacing data' kind of fix), backups are in order, or my data will be unusable someday.
So is ECC a good idea... Yes. Mandatory... Not for my use. Will it eliminate all data-corruption problems when combined with ZFS... No, but more of them could be prevented (and, more likely still, other runtime issues can be too). Backups have saved my data from damage that ZFS, ECC, and RAID with mirroring/parity would simply have lost. And program bugs have caused me far more grief than RAM or hard drives ever did short of outright failure (counting intermittent issues as a failure when they aren't rare).
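The copy-delete-rename workaround mentioned above can be sketched in a few shell commands. This is only an illustration, demonstrated here on a scratch directory; in real use the target would be a directory on the ZFS dataset whose reads have slowed down, and the paths are placeholders:

```shell
# Sketch of the copy-delete-rename defragmentation workaround.
# Demonstrated on a throwaway directory so it is safe to run anywhere.
dir=$(mktemp -d)/slowdir
mkdir -p "$dir"
echo "payload" > "$dir/file"       # stand-in for the fragmented data

cp -Rp "$dir" "$dir.new"           # rewrite every file; ZFS allocates fresh blocks
rm -rf "$dir"                      # drop the fragmented original
mv "$dir.new" "$dir"               # swap the freshly written copy into place

cat "$dir/file"                    # the data survives the rewrite
```

The rewrite works because ZFS is copy-on-write: newly written files get newly allocated (and usually better-clustered) blocks, while the old scattered blocks are freed when the original is deleted.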


----------



## shkhln (Apr 13, 2020)

Mirror176 said:


> My FreeBSD box handles low memory VERY poorly: the filesystem slows down as though swap were in use even when swap is disabled, and there are other non-disk slowdowns, long pauses, and system freezes where I would expect the system to say 'RAM too low; closing a program', but it just won't.



See https://forums.freebsd.org/threads/out-of-memory.69755/page-2#post-439537.


----------



## PMc (Apr 13, 2020)

This is indeed incredible - it shows the mindset of "desktop" users (and why it is so difficult to deal with them).
(Thanks shkhln for pointing at this.)



Mirror176 said:


> I would expect the system to say 'RAM too low; closing a program', but it just won't.



There are shell-scripts. Shell scripts can be used as a programming language, with branches, loops and decisions, where each command is actually a program to be run. When we start to unexpectedly "close" arbitrary programs, it is the same as ripping instructions out of the script, arbitrarily, at runtime.
You can have the same effect by installing defective memory or a broken harddisk. Or by shooting yourself in the foot for fun.

If the system ever gets into an OOM situation, you must reboot anyway, because all the scripted tools are no longer in a defined state afterwards. That is why one uses ample swap space, much more than might ever be needed: to avoid the reboot.
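Adding generous swap later is cheap on FreeBSD. As a sketch, following the swap-file mechanism from the FreeBSD Handbook (path and size are placeholders, and the commands need root):

```
# Create a 4 GB swap file (path and size are examples):
#   dd if=/dev/zero of=/usr/swap0 bs=1m count=4096
#   chmod 0600 /usr/swap0
# Then enable it at boot with this /etc/fstab line:
md99  none  swap  sw,file=/usr/swap0,late  0  0
```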

For the OS there is no difference between a "program" and a process. Anything that runs is a process.
And the system has no means to "close a program". The only thing it can do is kill a process, violently and unexpectedly. And that is the worst thing that can happen. It is better to create a system panic, reboot, and do things again from the start.

It is similar with cars. If you buy a car, you expect to just get in and drive away. But if you happen to drive something that is built for professional work, say a truck or a caterpillar, then you are expected to adjust it to the workload before running it. Same here: if you want to run a professional OS that can do real work, you are expected to adjust it to the workload (or pay a maintenance shop to do that for you, as you would with your car).


----------



## shkhln (Apr 18, 2020)

For what it's worth, I'm perfectly fine with Firefox being killed on OOM. It's practically impossible to estimate RAM requirements for that thing anyway and the alternative is a complete lockup.



PMc said:


> And the system has no means to "close a program". The only thing it can do is kill a process, violently und unexpected. And that is the worst thing that can happen. It is better to create a system panic and reboot and do the things from start again.



Meh. This is no different from a crash on segfault. Programs definitely should be written to handle restarts.


----------



## PMc (Apr 18, 2020)

shkhln said:


> For what it's worth, I'm perfectly fine with Firefox being killed on OOM. It's practically impossible to estimate RAM requirements for that thing anyway and the alternative is a complete lockup.
> Meh. This is no different from a crash on segfault. Programs definitely should be written to handle restarts.



That's why I think this is basically a server-versus-desktop conflict. Firefox only runs on desktops. OTOH, I just cannot write shell scripts so that they could handle the crash of any called function. I have, for instance, some shell scripts that handle the backup of database redo logs (for point-in-time recovery). These are different from a normal backup in that one must keep the logs in sequence and never lose a single one of them. So I do this twice, to two different backup volumes, and there is a lot of interlocking, as the machine may be busy and the backups may be delayed for an indefinite time, while yet another script must know precisely which logs have been successfully backed up and can be deleted. That fabric will re-initialize after a complete crash, but I am not sure what it does when an arbitrary component reports a wrong status.
For the same reason, if a single Postgres worker fails, it propagates into a full database crash. So probably, if you have to guarantee transactional integrity, there is no other way.

And no, I do not have segfaults on a server. I do see segfaults on the desktop, e.g. from cups (which gives me a clear opinion about that ultra-crap), but on a 24/7 server any segfault should be analyzed for its root cause.

And, as I just noticed: even my Firefox seems to behave properly these days. I can't remember it having failed for many weeks. Let's have a look:

```
$ /usr/local/sbin/pkg info | grep firefox
firefox-esr-68.4.2,1
$ ps axlw | grep firefox
1100  1245  1204   0  21  0 3531752 832108 select   Ss    -    407:23.97 firefox
1100  1247  1245   0  20  0 3273584 699068 select   S     -    462:20.97 /usr/local/lib/firefox/firefox -contentproc -childID 1 -is
1100  1248  1245   0  82  0 3457580 837980 -        R     -   1030:30.98 /usr/local/lib/firefox/firefox -contentproc -childID 2 -is
1100  1249  1245   0  20  0 2462732  66572 select   S     -      1:50.30 /usr/local/lib/firefox/firefox -contentproc -childID 3 -is
$ top -b
last pid: 38885;  load averages:  0.99,  1.11,  1.07  up 6+19:28:48    20:48:01
116 processes:  2 running, 114 sleeping

Mem: 788M Active, 1284M Inact, 381M Laundry, 5066M Wired, 775M Buf, 326M Free
ARC: 2389M Total, 1727M MFU, 262M MRU, 1666K Anon, 63M Header, 336M Other
     1568M Compressed, 3436M Uncompressed, 2.19:1 Ratio
Swap: 10G Total, 1733M Used, 8507M Free, 16% Inuse
```

Doesn't look bad. And *surprise*, not even any ZFS tuning is currently in place. Only my usual recommendation: reduce the number of content worker processes (within Firefox) to something that makes sense, e.g. one per 4 GB of RAM. By default it is the number of CPUs, and that is usually too much.
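If I read my own advice back as a config sketch: it maps to the `dom.ipc.processCount` preference, which can be pinned in a `user.js` file in the Firefox profile directory (the value 4 below is just an illustration, assuming a box with around 16 GB of RAM):

```
// user.js in the Firefox profile directory: cap content worker processes.
// 4 is an example value; by default Firefox uses roughly one per CPU.
user_pref("dom.ipc.processCount", 4);
```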


----------



## Mjölnir (Aug 6, 2020)

recluce said:


> For me, it is ECC where ever possible. Even my new laptop will come with ECC memory, I would never consider building/buying/using a server without it. Also a good argument for Ryzen CPUs, which mostly support ECC (I believe only Ryzen 3 does not).


If anyone knows of subnotebooks below 1.5 kg that support ECC, please drop me a note.
EDIT: Just do not use that crap Firefox. It's evil. Period.


----------



## Jose (Aug 7, 2020)

Figure it out for yourself: https://github.com/Smerity/bitflipped

Personally, I would much rather spend extra money on ECC RAM than on "fast" RAM.
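As a sketch of the premise behind the linked repo (a single flipped bit in memory silently turns one value into another), here is a small shell example that lists every printable ASCII character one bit flip away from the letter 'c':

```shell
# For the character 'c' (ASCII 99), flip each of its eight bits in turn
# and collect the results that are still printable ASCII.
c=c
code=$(printf %d "'$c")        # ASCII code of the character
neighbors=
for b in 1 2 4 8 16 32 64 128; do
  f=$((code ^ b))              # flip exactly one bit
  if [ "$f" -ge 33 ] && [ "$f" -le 126 ]; then   # keep printable ASCII only
    neighbors="$neighbors$(printf '%b' "\\$(printf '%03o' "$f")")"
  fi
done
echo "$neighbors"              # prints: bagksC#
```

So a one-bit error can quietly turn 'c' into 'b', 'a', 'g', 'k', 's', 'C', or '#'; applied to a hostname in RAM, that is exactly the bitsquatting effect the repo demonstrates.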


----------

