Monday, July 27, 2009

ZFS and ECC RAM

I started this thread over on the ZFS discuss list.

My question dug into why the ZFS bigwigs always so strongly recommend ECC RAM. Was it simply for the added security of preventing a few corrupted files (because non-ECC RAM will likely flip a few bits in the lifetime of your computer)?

Or... was there something more spooky going on such that something catastrophic (losing an entire RAID-Z pool) is possible if your RAM has a bit error at a particularly bad time?
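
To make the milder of those two scenarios concrete, here is a toy Python sketch. It is not ZFS code: SHA-256 merely stands in for a block checksum, and `flip_bit` and `checksum` are names I made up for illustration. It only shows why a checksum cannot catch a bit that flipped in RAM before the checksum was computed; it says nothing either way about whether a whole pool can be lost.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def checksum(data: bytes) -> str:
    # Stand-in for a filesystem block checksum (assumption: any strong hash works here).
    return hashlib.sha256(data).hexdigest()


block = b"important file contents" * 4

# Case 1: the bit flips AFTER the checksum was stored (say, on disk).
stored_sum = checksum(block)
damaged_on_disk = flip_bit(block, 13)
detected = checksum(damaged_on_disk) != stored_sum

# Case 2: the bit flips in RAM BEFORE the checksum is computed.
# The checksum is taken over the already-damaged data, so it verifies cleanly.
damaged_in_ram = flip_bit(block, 13)
stored_sum_2 = checksum(damaged_in_ram)
undetected = damaged_in_ram != block and checksum(damaged_in_ram) == stored_sum_2

print("flip after checksum detected:", detected)
print("flip before checksum goes unnoticed:", undetected)
```

In the second case the corrupted block and its checksum agree with each other, so nothing downstream has any reason to complain.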

This is important to know, because when spec'ing out a new server, you need the facts in order to make proper cost/benefit tradeoffs. This decision should be no different from deciding whether to get the latest and greatest hard drive (risky, since it has no track record) or a known-good older-generation drive that has less capacity and performance but a good record. If non-ECC RAM means I risk losing the pool, then I'll fork out the extra $$$!

It's great to see that ZFS has such a vibrant community that my simple question received so many answers. In this day and age the health of the community behind the software you are using is more important than the health of the software itself! Lucene also has a great community, though I'm biased!

My thread also indicates one of the challenges with open-source: sometimes you can't get a "definitive" answer to questions like this. Many people chimed in with "opinions", on both sides, but if I tally up the votes, and take into account the number of posts (rough measure of "authority") behind each vote, more people say "a bit error will just corrupt files/directories" than "a bit error can lose the pool".

The discussion also pointed out this very important issue, which is to create a way to roll back ZFS to a prior known-good state. It's the closest ZFS will get to providing something like fsck, I think. Sort of spooky that ZFS doesn't already have that. I hope by the time I need it, if ever, this issue is done!

1 comment:

  1. "ZFS protected data is safe"

    But this statement assumes that "ZFS" is what is actually processing the data.
    When a memory bit used by the ZFS driver is changed unintentionally,
    the driver no longer functions as the ZFS driver was designed. It is
    functioning as something else.

    Let's suppose that a bit flip occurred in the zfs scrub result data.
    The scrub result was good, but it was reported to the user as bad because
    of the bit flip. So the user replaced the HDD that was reported as
    problematic with a new one, which can trigger loss of all the data in a
    RAID 5 configuration.
