Has anyone run into problems related to Hetzner's servers that use non-ECC RAM (which is most of them)? Reading stuff like [1] has made me wary of using non-ECC RAM for servers, but a lot of people seem to be getting by without it, so I'm not sure what to make of that.
Isn't AWS well known for being unreliable, though? One postmortem claims that their average AWS instance dies after 200 days [1]. People use it, but within architectures that plan for the unreliability of any individual instance. In that case I wouldn't expect anyone to care about ECC, either, since it's quite a different use case from wanting a reliable dedicated server. Although it's interesting that Google uses ECC anyway, despite having a massively redundant server cluster.
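To put the 200-day figure in perspective, here is a rough back-of-the-envelope sketch. It assumes instance lifetimes are exponentially distributed with a mean of 200 days and that instances fail independently; both are simplifying assumptions of mine, not something the postmortem states, and the function names are made up for illustration.

```python
import math

MEAN_LIFETIME_DAYS = 200.0  # figure quoted in the comment above


def p_fails_within(days: float, mean_lifetime_days: float = MEAN_LIFETIME_DAYS) -> float:
    """Chance that a single instance dies within `days`.

    Assumes an exponential lifetime (constant failure rate), which is
    an illustrative assumption only.
    """
    return 1.0 - math.exp(-days / mean_lifetime_days)


def expected_failures_per_day(n_instances: int,
                              mean_lifetime_days: float = MEAN_LIFETIME_DAYS) -> float:
    """Average number of instance deaths per day across a fleet."""
    return n_instances / mean_lifetime_days


if __name__ == "__main__":
    print(f"one instance, 30 days: {p_fails_within(30):.1%}")
    print(f"fleet of 100, per day: {expected_failures_per_day(100):.2f} failures")
```

Under those assumptions a fleet of 100 instances loses one roughly every two days, which is why such architectures treat the failure of any single instance as routine.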
Scary attitude. Lack of ECC in servers has bitten me personally at work, to the extent that I even use ECC in all of my home desktop computers. Heck, I've been looking for laptops with ECC, unsuccessfully so far.
Anything can happen, and it's all up to luck. And I don't want to rely on luck.
Although I guess it's ok if all you're doing is serving somewhat unimportant static resources. That said, I think AWS servers do have ECC, but I've got no proof.
The same person who claimed that also claimed that ECC RAM can be twice as expensive, which is far from typical -- it's quite common to see differences of less than 15%. Maybe he was confusing ECC with FB-DIMM?
The RAM itself isn't much more expensive, but if you're trying to build a cheap Intel computer then it will get much more expensive with ECC, since there are no cheap Intel CPU/mobos with ECC.
(My team owns product management for the Amazon EC2 instance platform)
I wanted to clear up all the confusion on this topic and am cross-posting on other threads where this has come up recently.
All the hardware underlying Amazon EC2 uses ECC memory. In our experience ECC is a necessary requirement for server infrastructure. We will be updating our detail pages/FAQs with that information.
It really depends on your application. If it's fault tolerant, then ECC is irrelevant. If even a single random error will throw everything down the drain, then you'll obviously want every precaution.
Most web applications don't need that kind of protection, simply because if the page load screws up, well, the user will just hit refresh and keep going.
Well, there's a good chance the same misbehavior or error will happen again and again, until the server is rebooted. Memory errors are not necessarily transient.
You're right, I agree. And again, I'll stress that it depends on the circumstances.
Running a dedicated server with ECC memory costs $$$. Is that client worth it or not? That's something only you can determine, based on your circumstances.
[1] Google estimates that 8% of DIMMs per year will produce an ECC-catchable error: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
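As a rough illustration of what the 8% figure means for a whole machine, the sketch below computes the chance that at least one DIMM in a server sees such an error in a year. It assumes each DIMM is affected independently with probability 0.08, which is a simplification of mine (real errors tend to cluster on particular DIMMs and machines), and the function name is made up for illustration.

```python
P_PER_DIMM_PER_YEAR = 0.08  # figure quoted in footnote [1]


def p_any_dimm_affected(n_dimms: int, p_per_dimm: float = P_PER_DIMM_PER_YEAR) -> float:
    """Chance that at least one of `n_dimms` is affected within a year.

    Assumes DIMMs are affected independently, which is an illustrative
    assumption only.
    """
    return 1.0 - (1.0 - p_per_dimm) ** n_dimms


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} DIMMs: {p_any_dimm_affected(n):.1%}")
```

Under that independence assumption, a server with 4 DIMMs has roughly a 28% chance per year of at least one affected DIMM. On non-ECC hardware those errors would go undetected rather than being corrected.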