Monday, July 27, 2009

ZFS and ECC RAM

I started this thread over on the ZFS discuss list.

My question dug into why the ZFS bigwigs always so strongly recommend ECC RAM. Was it simply for the added safety of preventing a few corrupted files (because non-ECC RAM will likely flip a few bits over the lifetime of your computer)?

Or... was there something more spooky going on such that something catastrophic (losing an entire RAID-Z pool) is possible if your RAM has a bit error at a particularly bad time?

This is important to know, because when spec'ing out a new server, you need the facts in order to make proper cost/benefit tradeoffs. This decision should be no different from choosing between the latest and greatest hard drive (risky, since it has no track record) and a known-good older-generation drive that has less capacity and performance but a solid reputation. If non-ECC RAM means I risk losing the pool, then I'll fork out the extra $$$!

It's great to see that ZFS has such a vibrant community that my simple question received so many answers. In this day and age the health of the community behind the software you are using is more important than the health of the software itself! Lucene also has a great community, though I'm biased!

My thread also illustrates one of the challenges with open source: sometimes you can't get a "definitive" answer to questions like this. Many people chimed in with opinions on both sides, but if I tally up the votes, weighting each by the poster's number of posts (a rough measure of "authority"), more people say "a bit error will just corrupt files/directories" than "a bit error can lose the pool".

The discussion also pointed out this very important open issue, which asks for a way to roll back ZFS to a prior known-good state. It's the closest ZFS will get to providing something like fsck, I think. Sort of spooky that ZFS doesn't already have that. I hope that by the time I need it, if ever, this issue is done!

Saturday, July 18, 2009

WDTLER and WDIDLE3

Western Digital states that the Caviar GP drives are not recommended for RAID arrays, and that instead you should get their enterprise RE-4 drive. But there's a $100 price difference between the two right now! ($230 vs $330 at Newegg). So I decided to risk it and build my RAIDZ array with the GP drives. Check back in a couple of years to see if I have any regrets!

In building the array I discovered two very important fixes I needed to make to the drives, in order to make them behave more like the RE-4 drives.

First was to enable Time-Limited Error Recovery (TLER). This tells the drive NOT to make ridiculous efforts to recover a sector it's having trouble reading, and instead to quickly report back that the sector could not be read. See, if the drive takes too long to answer a read request, the RAID layer will assume the drive has gone kaput and boot it from the array. Enabling TLER prevents this from happening, letting the RAID layer handle the error instead. Use the WDTLER utility to do this.

Second, the GP drives have a feature called Intellipark, which parks the drive's heads (moves them off the platters) to reduce the aerodynamic drag on the motor that spins the platters (every little power saving counts!). You can hear it clearly when it kicks in: there's a slight clicking sound when the heads park. When you need to use the drive again, there's a clear delay and another clicking sound while the heads unpark.

While nice in theory, it's unfortunately rather frustrating in practice. See, modern OSes use write caching to gather up a bunch of writes in RAM and only actually write to the hard drives in bulk, every 10-30 seconds. The GP's idle timer is 8 seconds by default (a rather poorly chosen default), so the drive incessantly parks and unparks as random services write a few bytes here and there. Eventually, too many such cycles (I've read in forums that 300,000 is the spec'd limit) cause wear & tear and increase the chance of failure. This thread on the Linux Kernel mailing list gives some details. While this is a problem even in non-RAID settings, it's exacerbated by RAID because now you have N drives parking and unparking in sequence.

Fortunately, there's another utility called WDIDLE3 that lets you increase the timer (to a max of 25.5 seconds, which I don't think is enough) or disable it entirely, which is what I did.
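
If you want to check whether Intellipark is actually eating into that load/unload budget (and whether the WDIDLE3 change took), the SMART Load_Cycle_Count attribute (ID 193) is the number to watch. Here's a minimal sketch using smartctl from smartmontools, assuming you run it somewhere that can actually query SMART on these drives (see my gripe below about OpenSolaris's SATA SMART support); the device path is just a placeholder:

#!/usr/bin/env python
# Sketch: print the SMART Load_Cycle_Count (attribute 193) for one drive, using
# smartctl from smartmontools. Run it now and again an hour later; a sane idle
# timer setting should barely move this number. The device path is a placeholder.
import subprocess

DEVICE = "/dev/rdsk/c7t0d0"   # hypothetical device name; adjust for your system

def smart_attribute(device, name):
    p = subprocess.Popen(["smartctl", "-A", device], stdout=subprocess.PIPE)
    out, _ = p.communicate()
    for line in out.decode().splitlines():
        fields = line.split()
        # attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == name:
            return int(fields[9])
    return None

print("%s Load_Cycle_Count: %s" % (DEVICE, smart_attribute(DEVICE, "Load_Cycle_Count")))

If the count climbs by more than a handful per hour while the machine is mostly idle, the parking timer is still biting.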

If you don't run Windows and thus cannot directly run these EXEs, one simple workaround is to slipstream them into the Ultimate Boot CD as described here. Those instructions are for WDTLER specifically, but you can simply slip in WDIDLE3 at the same time. Keep the resulting CD accessible, since you'll likely need to run it again if you ever have to replace a drive in your array!

As best I can tell, Western Digital does not officially support these utilities, so use them at your own risk. They both worked fine for me, on OpenSolaris, but your mileage may vary!

Newegg vs Amazon

I'm using 6 of the 2 TB Western Digital Caviar GP drives in my new build, in a RAID-Z array. Despite the horror stories online, eg the many customer reviews at Newegg describing drives that died quickly, mine are working great, even under the sizable stress tests I've been running.

Except: one of my drives keeps reallocating sectors. I can see this in its SMART diagnostics (5 reallocated sectors as of two weeks ago, 14 as of yesterday). This isn't normal (eg, the other 5 drives have 0 reallocated sectors), so I'll be keeping an eye on it and at some point might ask WD for a warranty replacement. I wonder if there's an accepted "policy" on how many reallocated sectors is too many? This reminds me of the numerous "how many dead pixels are too many" discussions for new LCD monitors.
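
Since I'll be babysitting this attribute for a while, a tiny script helps. This is only a sketch: the device names are hypothetical, and it assumes smartctl (smartmontools) can talk to your drives. It reads Reallocated_Sector_Ct (SMART attribute 5) for each drive and flags anything non-zero:

#!/usr/bin/env python
# Sketch: report Reallocated_Sector_Ct (SMART attribute 5) for each drive in the
# array and flag any that are non-zero. Device names are placeholders.
import subprocess

DEVICES = ["/dev/rdsk/c7t%dd0" % i for i in range(6)]   # hypothetical names

def reallocated_sectors(device):
    p = subprocess.Popen(["smartctl", "-A", device], stdout=subprocess.PIPE)
    out, _ = p.communicate()
    for line in out.decode().splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])   # RAW_VALUE column
    return None

for dev in DEVICES:
    count = reallocated_sectors(dev)
    note = "  <-- keep an eye on this one" if count else ""
    print("%s: %s reallocated sectors%s" % (dev, count, note))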

Of course, I don't lose any data because of this; ZFS's RAIDZ simply corrects the error for me.

I bought 3 of the drives from Newegg and 3 from Amazon. If I were more patient I would have spread the purchases out over time as well. In general you should buy your drives across space and time, to minimize the chance of "correlated failures": if you buy all your drives from the same place, they were likely manufactured in the same batch, which means any defect in the production of that batch makes it more likely you'd lose 2 or more drives at once, destroying all your data.
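
To put a rough number on the "correlated failures" point, here's a toy calculation with entirely made-up probabilities. Both scenarios have the same average per-drive failure rate; the only difference is whether the batch risk is shared across all 6 drives or rolled independently for each drive:

#!/usr/bin/env python
# Toy numbers only: how a shared "bad batch" correlates drive failures.

N = 6              # drives in the array
P_GOOD = 0.03      # hypothetical yearly failure probability for a good drive
P_BAD = 0.30       # hypothetical yearly failure probability in a defective batch
P_BAD_BATCH = 0.10 # hypothetical chance that any given batch is defective

def p_two_or_more(p, n=N):
    """P(at least 2 of n drives fail) when each fails independently with probability p."""
    p_zero = (1 - p) ** n
    p_one = n * p * (1 - p) ** (n - 1)
    return 1 - p_zero - p_one

# Drives bought across places/times: each drive's batch is an independent gamble.
p_per_drive = (1 - P_BAD_BATCH) * P_GOOD + P_BAD_BATCH * P_BAD
spread_out = p_two_or_more(p_per_drive)

# All drives from one batch: the whole array shares a single batch gamble.
same_batch = (1 - P_BAD_BATCH) * p_two_or_more(P_GOOD) + P_BAD_BATCH * p_two_or_more(P_BAD)

print("P(2+ drives fail this year), drives spread across batches: %.3f" % spread_out)
print("P(2+ drives fail this year), all drives from one batch:    %.3f" % same_batch)

With these toy numbers, the all-one-batch case comes out noticeably more likely to cost you two drives at once, which is exactly the scenario RAID-Z can't save you from.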

Newegg, it turns out, does a poor job shipping hard drives. They simply wrap them in bubble wrap and tape it up, sometimes packing 2 drives together inside the same wrap. What they don't realize is that, with rough handling from UPS, those bubbles pop one by one during transit, so that by the time the drives arrive there is zero protection (no bubbles left) along at least one edge. It's rather shocking, because Newegg is otherwise excellent. I've read several posts in the user comments noting exactly what I just said, yet Newegg hasn't improved. It's a bad sign when a company stops listening to its customers.

In contrast, Amazon (whose price matched Newegg's) packed each drive into its own dedicated foam packing and box. Fabulous!

[EDIT Jul 28, 2009: I just received one more drive from Amazon, and they have unfortunately taken a turn for the worse! They now ship in a similar fashion to Newegg, wrapping the drive in minimal bubble wrap that pops during transit. They also take the wasteful step of putting a box within a box, which I don't think adds much protection to the drive. This drive will be my "hot spare", so if/once it gets swapped into the array, I'll try to remember to watch for reallocated sectors and any other problems. Sigh.]

The one drive I see failing was in fact from Newegg (I kept track of the serial numbers); it's entirely possible that Newegg's poor packaging plus rough handling from UPS led to this drive's problems.

Losing one drive in a RAID array is quite terrifying, because until the replacement drive is resilvered you're running with no safety margin: if you lose another drive, you've lost all your data. RAID-Z is not a replacement for good backups. It's best to have a spare drive on hand; you can even install it and tell ZFS to keep it as a hot spare, meaning that if any drive drops out of the array, ZFS will immediately start resilvering onto the spare. Or you could create a RAID-Z2 array, which has two drives' worth of redundancy, but then you've "lost" two drives' worth of storage!
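
For reference, wiring up a hot spare boils down to `zpool add <pool> spare <device>`, and `zpool status -x` gives you a cheap health check you can run on a schedule. Here's a sketch, scripted in Python for convenience; the pool name ("tank") and device name are hypothetical:

#!/usr/bin/env python
# Sketch: add a hot spare to a pool, then check pool health. Pool and device
# names are placeholders; substitute your own.
import subprocess

POOL = "tank"
SPARE = "c7t6d0"

def run(args):
    p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = p.communicate()
    return out.decode()

# One-time setup, equivalent to: zpool add tank spare c7t6d0
# (uncomment to actually do it)
# run(["zpool", "add", POOL, "spare", SPARE])

# Periodic check: `zpool status -x <pool>` reports "healthy" when all is well.
status = run(["zpool", "status", "-x", POOL])
if "healthy" in status:
    print("pool %s looks healthy" % POOL)
else:
    print("WARNING, pool %s needs attention:\n%s" % (POOL, status))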

Friday, July 17, 2009

OpenSolaris challenges

Whenever I encounter someone who's overly ecstatic about some new technology or gizmo or something, I quickly say "tell me what's wrong with it". If they can't think of anything, then I can't trust their opinion.

Nothing is perfect. There are always tradeoffs to be made. Only once you are properly informed with the facts, clearly seeing the goods and the bads, minus all the hype, can you finally make a good decision.

If you are passionate about something, and you use it day in and day out, then you ought to have a big list of the things that bother you most about it. Next time you see someone loving their iPhone, try asking them what's wrong with it.

Unfortunately, hype, "popular opinion", "conventional wisdom", "everybody's doing it", etc. drive so many decisions these days. Not long ago, when you bought a house, everyone pushed you toward newfangled mortgages like ARMs and interest-only loans instead of the boring old-fashioned 30-year fixed-rate mortgage. Alan Greenspan was giving speech after speech praising the "innovation" in the financial services industry. Look where that got us!

I came across this quote recently: "If you find yourself in the majority then it's time to switch sides". I've been realizing lately how true that is.

So in this spirit of presenting a balanced picture, here are some of the challenges I've hit with OpenSolaris:
  • It took practically an Act of God to switch from a dynamically assigned (DHCP) IP address to a static one. I ran the nice GUI administration tool, made the change, and at first all seemed good. But then on my next reboot, apparently a bunch of services failed to start. After much futzing, it was only when I uninstalled VirtualBox that things finally worked (I think VirtualBox's virtual network adapter somehow conflicted). I now have a static IP!
  • There is apparently no SMART support for SATA drives, which is stunning. These days, as drives become more and more complex, we need access to their diagnostics. I rely on SMART to monitor the health, temperatures, remapped sectors, etc. of my drives.
  • The 1-wire File System has not been ported to OpenSolaris. I have a network of 1-wire devices in my house to monitor temperatures, eg, outdoors, in the kid's bedroom, the attic, etc. I'm still working on this one... there seem to be some problems talking to libusb. I may end up simply running a tiny Linux PC (the Fit PC 2 looks cute) instead, for such random services.


Tuesday, July 14, 2009

Sun's ZFS filesystem

I've been test driving Sun's (now Oracle's!) OpenSolaris (2009.06) and ZFS filesystem as my home filer and general development machine.

I'm impressed!

ZFS provides some incredible features. For example, taking a snapshot of your entire filesystem is wicked fast. This gives you a "point in time" copy of all files that you can keep around for as long as you want. It's very space efficient, because the snapshot only consumes disk space once a file is changed (to preserve the old copy).
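
For example, a snapshot is a single command. Here's a tiny sketch, scripted in Python just for convenience, with a made-up filesystem name:

#!/usr/bin/env python
# Sketch: take a snapshot of one filesystem and list all snapshots. The
# filesystem name is hypothetical.
import subprocess

subprocess.check_call(["zfs", "snapshot", "tank/home@before-experiment"])
subprocess.check_call(["zfs", "list", "-t", "snapshot"])
# later, if needed:  zfs rollback tank/home@before-experiment
# or clean up with:  zfs destroy tank/home@before-experiment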

From the snapshot, which is read-only, you can then make a clone that's read-write. This effectively lets you fork your filesystem, which is amazing. Sun builds on this by providing "boot environments", which let you clone your world, boot into it, do all kinds of reckless things, and if you don't like the results, switch back to your safe current world, no harm done. I used to leave my home filers pretty much untouched once I started using them, for fear of screwing something up. Now with boot environments I can freely experiment away.

I have a great many Lucene source code checkouts, to try out ideas, apply patches, etc., and by using ZFS cloning I can now create a new checkout and apply a patch in only a few seconds. It's also very space efficient, because only the changed files in the new checkout consume disk space. Since I'm using an Intel X25 SSD as my primary storage, space efficiency is important. The machine uses Intel's Core i7 920 CPU, which has fabulous concurrency and runs the Lucene unit tests 3X faster than my old machine. This all nets out to wonderful productivity gains.
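
Here's roughly what that workflow looks like, again as a sketch; the filesystem names and the patch path are all made up:

#!/usr/bin/env python
# Sketch of the clone-per-checkout trick: snapshot a pristine checkout, clone it
# into a new writable filesystem, then apply a patch inside the clone. All the
# filesystem names and the patch path are hypothetical.
import subprocess

def run(*args):
    subprocess.check_call(list(args))

run("zfs", "snapshot", "tank/lucene-trunk@pristine")
run("zfs", "clone", "tank/lucene-trunk@pristine", "tank/lucene-idea1")
# The clone shares all unchanged blocks with the original checkout, so this is
# nearly instant and initially consumes almost no extra space.
run("sh", "-c", "cd /tank/lucene-idea1 && patch -p0 < /path/to/some.patch")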

ZFS also nicely decouples the raw storage (the "pool") from the filesystems that draw on that storage. For the secondary storage I set up a RAID-Z pool (like RAID-5, but it fixes the "write hole" problem) using 6 of the Western Digital Caviar Green (GP) 2TB drives. Be sure to use the WDTLER utility if you use these drives in a RAID array. This gives me 9TB of usable space to play with; from there I've created many filesystems that all share the pool.
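
Here's a sketch of that setup, with a hypothetical pool name and placeholder cXtYdZ device names (use whatever `format` reports on your machine):

#!/usr/bin/env python
# Sketch of the pool/filesystem split: one RAID-Z pool built from 6 disks, then
# several filesystems carved out of it. Names are placeholders.
import subprocess

DISKS = ["c7t%dd0" % i for i in range(6)]   # hypothetical device names

def run(*args):
    subprocess.check_call(list(args))

run("zpool", "create", "tank", "raidz", *DISKS)       # 5 disks of data + 1 of parity
run("zfs", "create", "tank/photos")                   # filesystems share the pool's free space
run("zfs", "create", "tank/backups")
run("zfs", "set", "compression=on", "tank/backups")   # properties are per-filesystem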

Performance is excellent: copying a 1TB directory on the RAID-Z pool to another directory on the same pool averages 100 MB/sec.

I also just read this morning that ZFS will add de-duping at the block level, thus making it even more space efficient.

ZFS can provide these features because it has a write-once, copy-on-write core: a live block is never overwritten in place; changed data is written to newly allocated blocks (old blocks are reused only after they've been freed). Lucene has the same core approach: no file in the index is ever overwritten. Lucene's transactional semantics derive directly from this as well (though Lucene can't "fork" an index... maybe someday!).
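
Here's a toy illustration of the write-once idea (definitely not real ZFS internals!): "blocks" are only ever appended, the live view is just a root pointer, and a snapshot is nothing more than holding on to an old root:

# Toy illustration (not real ZFS internals!) of the write-once idea.
class CowStore:
    def __init__(self):
        self.blocks = []   # append-only; existing entries are never mutated
        self.root = {}     # filename -> block index (the live view)

    def write(self, name, data):
        self.blocks.append(data)          # write a brand-new block
        new_root = dict(self.root)        # build a new root; the old root is untouched
        new_root[name] = len(self.blocks) - 1
        self.root = new_root

    def read(self, name, root=None):
        root = self.root if root is None else root
        return self.blocks[root[name]]

store = CowStore()
store.write("hello.txt", "version 1")
snapshot = store.root                     # a "snapshot" is just the old root pointer
store.write("hello.txt", "version 2")
print(store.read("hello.txt"))            # -> version 2 (live view)
print(store.read("hello.txt", snapshot))  # -> version 1 (snapshot still sees the old block)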

Bye bye Linux, hello Solaris! I only hope this innovation continues now that Oracle has acquired Sun.