Wednesday, June 30, 2010

Our house was hit by lightning

Believe it or not, our house was struck by lightning! It happened a few days ago, as a cold front swept over, bringing with it some intense but short-lived thunderstorms.

I was working on my computer when I heard the loud POP of a spark behind me, in the utility closet. At the same time my computer went blank, and there was an insanely loud thunderclap. My poor son was on the toilet at the time and told me he was so startled that he jumped high in the air and almost fell in! Fortunately, none of us were hurt.

The strike destroyed our central 16-port gigabit ethernet switch, 3 out of 4 LAN ports on my FIOS NAT box, a couple of power supplies, and one netcam. It also fried the device I use to read the electrical (charging, inverting) data from the solar panels in my back yard, but the solar panels themselves, including the thick copper ground wires designed to "guide" lightning into the ground and away from the house, were all fine, as were the charger, inverter, and batteries. My 1-Wire network, which I use to measure various indoor & outdoor temperatures, is still dead. My wife's computer immediately shut down and rebooted, several times (spooky), but was apparently unharmed. My computer seemed to lose both ethernet ports, but after much rebooting and testing with plug-in ethernet cards, they came back to life.

A large tree branch in our neighbor's yard fell down; the neighbors across the street called the fire department; yet another neighbor saw bright sparks in his basement and also lost a bunch of electronics.

Almost certainly this was not a direct strike for us; otherwise things would have been vaporized instead of simply dead. Instead, the sudden, immense electromagnetic field created at the point of the direct strike radiates outward, creating the local equivalent of an EMP bomb. This field then induces high voltage and current in any wires it crosses; the closer you are to the direct strike, and the longer your wires are, the more damaging the induced voltage and current is. In my case, apparently, the extensive network of ethernet wires in my house caused most of the damage. This is a good reason to use WiFi!

I will now buy myself something to try to prevent this from happening again.

Lightning is crazy stuff. The process lightning goes through in seeking the path through which it will dump insane amounts of current is fascinating. National Geographic has a great facts page; for example, talking on your land-line telephone is the leading cause of lightning injuries inside the home. I suspect we may have been hit by positive lightning, because the strike came out of the blue, well before the storm itself seemed to arrive.

Lightning strikes are apparently rather common; in just my immediate family this has now happened twice to me, and once each to my brother, father and grandparents!

Tuesday, June 29, 2010

Lucene in Action 2nd Edition is done!

Lucene in Action, 2nd Edition, is finally done: the eBook is available now, and the print book should be released on July 8th!

The source code that goes along with the book is freely available and free to use (Apache Software License 2.0), and there are two free chapters (Chapter 1 and Chapter 3). There is also a free green paper excerpted from the book, Hot Backups with Lucene, as well as the section describing CLucene, the C/C++ port of Lucene.

Writing is the best way to learn something -- because of this book I've learned all sorts of nooks and crannies in Lucene that I otherwise would not have explored for quite some time.

Monday, June 21, 2010

Beware String.substring's memory usage

I'm trying to build up a test index for testing Lucene's sort performance, to track down a regression in String sorting performance between 3.x and 4.0, apparently from our packed ints cutover.

To do this, I want to use the unique title values from Wikipedia's full database export.

So I made a simple task in Lucene's contrib/benchmark framework to hold onto the first 1M titles it hits. Titles tend to be small, say a worst-case average of 100 characters per document, so at Java's 2 bytes per character the worst-case RAM would be ~200 MB or so, right?

Wrong!

It turns out, in Java, when you call String's substring method, the resulting String returned to you keeps a reference to the original String, so the original String can never be GC'd if you hold onto the substring. Java can do this "optimization" because Strings are immutable.

For me, this "optimization" is a disaster: the title is obtained by getting the substring of a large string (derived from a line-doc file) that holds the full body text as well! Instead of ~200 characters per unique title I was looking at ~25K characters! Ugh.

Fortunately, the workaround is simple -- use the String constructor that takes another String. This forces a private copy.
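For example, here's a minimal sketch of the problem and the fix (the helper and variable names are made up for illustration; this is how String.substring behaved in the JDKs of the day):

    // One huge line from the line-doc file: the title plus the full body text.
    String line = readNextLineDoc();               // hypothetical helper

    // Shares the huge backing char[] with line -- holding onto this pins
    // the entire line in memory.
    String title = line.substring(0, titleEnd);

    // Workaround: the String(String) constructor makes a private copy, so
    // the huge original line can be garbage collected.
    titles.add(new String(title));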

I imagine for many cases this "optimization" is very worthwhile. If you have a large original string, and pull many substrings from it, and then discard all of those substrings and the original string, you should see nice gains from this "optimization".

There is a longstanding bug open for this; likely it will never be fixed. Really, GC should be empowered to discard the original string and keep only the substring. Or perhaps substring should have some heuristics as to when it's dangerous to keep the reference to the original String.

Sunday, June 20, 2010

Geek Dad

We've had this great family ritual, for a few years now: at the start of every summer we pitch a tent in our back yard! We leave it there all summer, and the kids have great fun, with neighbors and friends, playing in it, hiding from the rain, etc.

We also pick a few nights to sleep in the tent, which feels almost as real as camping, yet you have the delightful freedom of taking a short walk back to the house if you forgot something.

So, last night we slept in the tent. But, this year I brought our new iPad with us. The kids took turns playing a few games; one of their favorites is Gomi HD, a game I also love for its not-so-subtle message that we humans are screwing up the planet and it's up to the kids to fix it. Then we watched Shrek 2, which I had previously encoded for the iPad (using Handbrake). Finally, the kids, one by one, fell asleep.

Then, in the middle of the night, I woke up and heard rain hitting the tent. I was worried that it could turn into a big storm (in past years our tent has been literally flattened by passing thunderstorms, once while we were inside!), so I turned on the iPad, loaded the local NEXRAD radar, and confirmed that it was in fact just a small passing cell: phew!

Finally, I woke up and was ready to get up, so I used the iPad again, this time to start the coffee maker. I have a simple Web server, written in Python, that exposes an HTML interface to the various lights and appliances controlled via Insteon, including the coffee maker. It's vital that I minimize the time from when I first stand up in the morning to when I consume my coffee!

Yes, I'm a geek Dad.

Monday, June 14, 2010

Lucene and fadvise/madvise

While indexing, Lucene periodically merges multiple segments in the index into a single larger segment. This keeps the number of segments relatively contained (important for search performance), and also reclaims disk space for any deleted docs on those segments.

However, it has a well known problem: the merging process evicts pages from the OS's buffer cache. The eviction is ~2X the size of the merge, or ~3X if you are using the compound file format.

If the machine is dedicated to indexing, this usually isn't a problem; but on a machine that's also searching, this can be catastrophic as it "unwarms" your warmed reader. Users will suddenly experience long delays when searching. And because a large merge can take hours, this can mean hours of suddenly poor search performance.

So why hasn't this known issue been fixed yet? Because Java, unfortunately, does not expose access to the low-level APIs (posix_fadvise, posix_madvise) that would let us fix this. It's not even clear whether NIO.2 (in Java 7) will expose these.

On the Lucene dev list we've long assumed that these OS-level functions should fix the issue, if only we could access them.

So I decided to make a quick and dirty test to confirm this, using a small JNI extension.
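The Java half of such an extension is tiny; roughly something like this (the class and library names here are made up, and the matching C code just forwards the call to posix_fadvise):

    import java.io.FileDescriptor;
    import java.io.IOException;

    // Hypothetical JNI wrapper: the C side simply calls
    // posix_fadvise(fd, offset, len, advice) and returns its result.
    public class NativePosixUtil {
      static {
        System.loadLibrary("nativeposixutil");  // libnativeposixutil.so, built from the C stub
      }

      // Advice values mirror POSIX_FADV_* on Linux.
      public static final int SEQUENTIAL = 2;
      public static final int DONTNEED = 4;
      public static final int NOREUSE = 5;

      public static native int posixFadvise(FileDescriptor fd, long offset,
                                            long len, int advice) throws IOException;
    }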

I created a big-ish (~7.7G) multi-segment Wikipedia index, and then ran a set of ~2900 queries against this index, over and over, letting it warm up the buffer cache. Looking at /proc/meminfo (on Linux) I can see that the queries require ~1.4GB of hot RAM in the buffer cache (this is a CentOS Linux box with 3G RAM; the index is on a "normal" SATA hard drive). Finally, in a separate JRE, I opened an IndexWriter and called optimize on the index.

I ran this on trunk (4.0-dev), first, and confirmed that after a short while, the search performance indeed plummets (by a factor of ~35), as expected. RAM is much faster than hard drives!

Next, I modified Lucene to call posix_fadvise with the NOREUSE flag; from the man page, this flag looks perfect:

Specifies that the application expects to access the specified data once and then not reuse it thereafter.

I re-ran the test and... nothing changed! Exactly the same slowdown. So I did some digging, and found Linux's source code for posix_fadvise. If you look closely you'll see that NOREUSE is a no-op! Ugh.

This is really quite awful. Besides Lucene, I can imagine a number of other apps that really should use this flag. For example, when mencoder slowly reads a 50 GB bluray movie, and writes a 5 GB H.264 file, you don't want any of those bytes to pollute your buffer cache. Same thing for rsync, backup programs, software up-to-date checkers, desktop search tools, etc. Of all the flags, this one seems like the most important to get right! It's possible other OSs do the right thing; I haven't tested.

So what to do?

One approach is to forcefully free the pages, using the DONTNEED flag. This will drop the specified pages from the buffer cache. But there's a serious problem: the search process is using certain pages in these files! So you must only drop those pages that the merge process, alone, had pulled in. You can use the mincore function to query for those pages that are already cached, so you know which ones not to drop. A neat patch for rsync took exactly this approach. The problem is that mincore provides only a snapshot, so you'd have to call it many times while merging to try to minimize discarding pages that had been recently cached for searching.
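In rough pseudocode the idea looks like this (assuming JNI wrappers for mincore and posix_fadvise along the lines of the sketch above; mincore reports, page by page, whether a range of a file is resident in the buffer cache):

    final long PAGE_SIZE = 4096;  // typical Linux page size

    // Snapshot which pages of the file are already cached (by searches)
    // before the merge touches it.
    boolean[] cachedBefore = NativePosixUtil.mincore(fd, 0, fileLength);

    // ... the merge reads and writes the file ...

    boolean[] cachedAfter = NativePosixUtil.mincore(fd, 0, fileLength);
    for (int page = 0; page < cachedAfter.length; page++) {
      if (cachedAfter[page] && !cachedBefore[page]) {
        // Probably pulled in by the merge, not by searches: drop it.
        NativePosixUtil.posixFadvise(fd, page * PAGE_SIZE, PAGE_SIZE,
                                     NativePosixUtil.DONTNEED);
      }
    }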

We should not have to resort to such silly hacks!

Another approach is to switch to memory-mapped IO, using Lucene's MMapDirectory, and then use madvise. The SEQUENTIAL option looks promising from the man page:

Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)

Looking through the Linux sources it looks like the SEQUENTIAL option is at least not a no-op; that setting has some influence over how pages are evicted.

So I tested that, but, alas, the search performance still plummets. No go!

Yet another approach is to bypass all OS caching entirely, only when merging, by using the Linux-specific O_DIRECT flag. Merge performance will definitely take a hit, since the OS is no longer doing readahead nor write caching, and every single IO request must hit the disk while you wait, but for many apps this could be a good tradeoff.

So I created a prototype Directory implementation, a variant of DirectNIOFSDirectory (currently a patch on LUCENE-2056), that opened all files (input and output) with O_DIRECT (using JNI). It's a little messy because all IO must be "aligned" by certain rules (I followed the rules for 2.6.* kernels).
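To give a flavor of that bookkeeping, here is a simplified sketch (not the actual patch) of how a read request has to be aligned:

    // O_DIRECT requires file offsets, transfer lengths, and the buffer itself
    // to be aligned; 512 bytes is the usual granularity on 2.6.* kernels,
    // though the true requirement depends on the filesystem and device.
    private static final int ALIGN = 512;

    static long alignDown(long v) { return v & ~(long) (ALIGN - 1); }
    static long alignUp(long v)   { return (v + ALIGN - 1) & ~(long) (ALIGN - 1); }

    // To serve a read of len bytes at file position pos, issue one larger,
    // aligned read and copy out just the slice the caller asked for:
    static void readAligned(long pos, int len) {
      long alignedPos = alignDown(pos);
      long alignedLen = alignUp((pos - alignedPos) + len);
      // ... read alignedLen bytes at alignedPos into an ALIGN-aligned buffer,
      // then return bytes [pos - alignedPos, pos - alignedPos + len) ...
    }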

Finally, this solution worked great! Search performance was unchanged all through the optimize call, including building the CFS at the end. Third time's a charm!

However, the optimize call slowed down from 1336 to 1680 seconds (26% slower). This could likely be reduced by further increasing the buffer sizes (I used a 1 MB buffer for each IndexInput and IndexOutput, which is already large), or possibly by creating our own readahead / write cache scheme.

We really should not have to resort to such silly hacks!

Sunday, June 6, 2010

Finding the lead in your house

It's known that exposure to lead, even in tiny amounts, causes loss of intelligence and other nasty problems in children.

We have now phased out lead paint, leaded gasoline, and lead (and other dangerous elements) in electronics. But, surprisingly, it was only relatively recently (2008), with the passage of the Consumer Product Safety Improvement Act, that we finally set limits on the amount of lead in consumer items, particularly kids' toys. As of June 2010, the legal limit for lead in kids' toys is 300 ppm (parts per million), and it will drop to 100 ppm by August 2011. The legal limit for cadmium in paint on kids' toys is 75 ppm.

Our house is filled with all sorts of toys, many of which we've picked up as hand-me-downs from various sources. I've long wondered whether these toys have lead, cadmium, etc...

So, I decided to test them myself, using an XRF analyzer. This amazing handheld device emits a small directed X-ray beam out the front, causing elements to fluoresce with specific spectral signatures. The device detects the strength of these signatures and reports back to you the breakdown of elements in the sample, either in parts per million (ppm) or in percentage (1% = 10,000 ppm!).

In addition to lead, the device reliably detects other elements like cadmium, bromine (used in brominated flame retardants), chlorine, arsenic, mercury, tin, antimony, chromium, and many others!

There are some limitations. For example, all of the door knobs in my house tested high (1-2%) for lead; however, it turns out they are nickel plated, and the XRF analyzer was seeing through this layer to the lead beneath. Likely this lead would never transfer to our hands, unless the nickel wears through.

Another limitation is that the device detects elements, not their molecular form. For example, certain forms of chromium, like hexavalent chromium, are toxic, while other forms, like trivalent chromium, are useful and necessary in the human body.

Finally, just because a material contains a given element doesn't mean that element would ever find its way into a child's body. For example, lead bound in plastic is likely difficult to dislodge.

For all these reasons, just because the analyzer detects certain elements in a given item does not mean the item could cause harm. Think of the analyzer as a simple fact-finder: it reports the raw elements it sees. What action you then choose to take is your decision. However, the precautionary principle applies here: if an item does have measurable amounts of lead or cadmium, why risk exposing your family to it?

While the devices are still insanely expensive, they have come down in price enough to make rental just barely within reach for the end consumer. I rented mine (a Niton XL3T) for a few hundred dollars a day from Ashtead Technology, and split the cost with a neighbor (she tested by night and I tested by day!).

So I walked around my house, testing everything, and the results were a real eye-opener! A number of ordinary looking things had lead, sometimes at levels far higher than the 300 ppm legal limit, plus other possibly harmful elements:



The green rain coat has 6,537 ppm lead; the handle on the lacrosse stick has 14,700 ppm lead; the basketball has 3,320 ppm lead and 322 ppm arsenic; the lunchbox has 677 ppm lead; the doctor's bag has 2,511 ppm antimony and 55 ppm arsenic. Many smaller toys also have lead:



The red measuring spoon has 1,651 ppm lead; the blue dinosaur has 767 ppm lead and 91 ppm cadmium; the red car has 860 ppm lead; the red plate has 3,268 ppm lead; the blue train has 271 ppm lead; the little green ABC has 1,015 ppm lead. The three legos were surprising, having between 1,245 and 2,427 ppm lead -- they are not actually legos; they are a copycat brand (Mega Bloks). All of our "real" legos tested fine.

Other toys have high cadmium:



The little slinky has 527 ppm cadmium and 1,365 ppm lead; the salt shaker has 22,634 ppm cadmium; the ice cream scoop has 1,188 ppm cadmium. Here were a bunch of other small toys that had varying amounts of lead:



Small toys are especially spooky since babies are more likely to put them in their mouth. These pictures show only about 1/3rd of the things I found with lead! Here are some other surprising discoveries:

  • Old Christmas ornaments often have very high lead. I had one silver ball that had ~1% arsenic and ~20% lead. Worse, it was well worn, which means the arsenic and lead had rubbed off onto people's hands over the years.

  • Car keys have lead (~6,700 ppm for my 2 keys)! Don't let your babies play with them.

  • Nearly every wire has high lead content in the insulation; Christmas lights had especially high lead.

  • The soft (usually black) plastic/rubber kids' bike handles, and also tricycles, are often very high in lead.

  • An old recliner had 29,500 ppm lead on the foot rest and 20,000 ppm lead on the arm rests.

  • Our door knobs, cabinet knobs, faucets, and spouts, had ~1-2% lead.

  • We had a jar of kids' vitamins. The vitamins tested OK, but the jar (it had a dark brown tint) had 140 ppm lead.

  • One of our garden hoses had 5,255 ppm lead; not great because we had used this hose to water the plants in our edible garden.

  • Our waffle iron had 1,143 ppm lead on the cooking surface (likely, though, this was the metal under the teflon).

  • My supposedly lead-free solder (it was not cheap!) in fact contains 0.5% lead.


Here are some "rough" patterns:

  • The older the toy, the more likely it is to have high lead, cadmium, etc.

  • Newer things that are "burnable" (fleece, bedding, futons, mattresses, beds, plush chairs, etc.) often have very high (tens of thousands of ppm) levels of bromine.

  • Colors like yellow, orange, and red often have lead in their pigment.

  • Beware glasses that have paint on them! We had one glass that was 8% lead -- 266 times the legal limit. Colored glass (where the color is embedded into the glass) is also more likely to be leaded.


I'm not the only consumer playing with XRF analyzers: a recent recall of 12M Shrek glasses by McDonald's was initiated by at least two people who discovered high levels of cadmium in the paint on the glasses using XRF analyzers.

Saturday, June 5, 2010

Lucene's PulsingCodec on "Primary Key" Fields

Update Aug, 2014: the pulsing approach described here works well and has now been incorporated into Lucene's default postings format, so there's really no need to use PulsingPostingsFormat yourself unless you are using a custom postings format that doesn't do its own pulsing.

Flexible indexing in Lucene (now available on trunk, which will eventually be the next major release, 4.0) enables apps to use custom codecs to write/read the postings (fields, terms, docs, positions, payloads).

By default, Lucene uses the StandardCodec, which writes and reads in nearly the same format as the current stable branch (3.x). Details for a given term are stored in terms dictionary files, while the docs and positions where that term occurs are stored in separate files.

But there is an experimental codec, PulsingCodec, which implements the pulsing optimization described in a paper by Doug Cutting and Jan Pedersen. The idea is to inline the docs/positions/payloads data into the terms dictionary for low frequency terms, so that you save 1 disk seek when retrieving document(s) for that term.

The PulsingCodec wraps another fallback Codec that you provide; this allows the pulsing to be dynamic, per term. For each term, if its frequency (the number of documents it appears in) does not exceed a threshold you provide (default 1), then that term's postings are inlined into the terms dictionary; otherwise, the term is forwarded (pulsed) to the wrapped codec. This means PulsingCodec should be helpful for ordinary text fields, which obey Zipf's Law, as many terms will be rare-ish.
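In rough pseudocode, the per-term decision at index-write time looks like this (not the actual implementation):

    if (docFreq <= freqCutoff) {   // freqCutoff defaults to 1
      // Inline this term's docs/positions/payloads directly into its terms
      // dictionary entry: reading the postings later costs one less seek.
    } else {
      // "Pulse" the term to the wrapped codec, which writes the postings to
      // the usual separate postings files.
    }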

PulsingCodec should really shine on "primary key" fields, where each term occurs in exactly one document, and batch lookups (for example because the app performs deletes, updates and/or lookups) are common.

I created a simple performance test to confirm this.

The test first creates an optimized index with 10M docs, where each doc has a single field with a randomly generated unique term, and then performs term -> doc lookup for N (parameter) random terms. It's a self-contained test (source code is here).
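The lookup half of the test is just a loop of TermQuery searches, roughly like this (the "id" field name and the randomExistingKey helper are made up, and the reader/searcher setup details differ a bit between 3.x and trunk):

    IndexSearcher searcher = new IndexSearcher(reader);  // reader is open on the 10M-doc index
    Random random = new Random(17);

    long start = System.nanoTime();
    for (int i = 0; i < numLookups; i++) {
      String key = randomExistingKey(random);            // hypothetical helper
      TopDocs hits = searcher.search(new TermQuery(new Term("id", key)), 1);
      assert hits.totalHits == 1;                        // a primary key matches exactly one doc
    }
    double msecPerLookup = (System.nanoTime() - start) / 1e6 / numLookups;
    System.out.println("avg lookup: " + msecPerLookup + " msec");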

It's important to flush your OS's IO cache before running the test; otherwise you can't measure the reduced number of seeks. On recent Linux kernels, just run echo 1 > /proc/sys/vm/drop_caches. That said, in a real production usage, the IO cache will typically (legitimately) help you, and pulsing should make more efficient use of the IO cache since the postings data is contiguously stored.

To measure the speedup from using PulsingCodec on a primary key field, as well as the impact of the OS's IO cache, I ran the above test on an increasing number of random term lookups (always flushing the OS's IO cache first):



The results are compelling! When performing a small number of term lookups relative to the total number of terms on a cold OS IO cache, which is likely the more common case in a real application, pulsing shows a ~45-50% speedup, as expected, since it requires 1/2 the seeks.

As the number of random term lookups increases, PulsingCodec's gains decrease, because more and more of the lookups are hitting the OS's IO cache and thus avoiding the seek (the machine I ran the test on had plenty of RAM to cache the entire index). It's interesting that PulsingCodec still shows ~15% gain once the lookups are mostly cached; likely this is because PulsingCodec saves the deref cost of finding the postings in the frq file.

Pulsing also makes the index a bit smaller (211 MB vs 231 MB), because it saves one vLong pointer per term. For the test, the index with pulsing had a 0 byte frq file since all postings were inlined into the terms dict. There is no prx file because I index the field with setOmitTermFreqAndPositions(true).

Note that the test case simply uses PulsingCodec for all fields; if you'd like per-field control you should use the PerFieldCodecWrapper. However, because PulsingCodec is dynamic (per term), it is likely a good default for all fields.

Another way to speed up primary key lookups through Lucene is to store your index on a solid-state disk, where seeks are much less costly than they are on spinning magnets (though, still several orders of magnitude more costly than RAM). Or better yet, do both!