I'm building a test index to measure Lucene's sort performance, so I can track down a regression in String sorting between 3.x and 4.0, apparently caused by our packed ints cutover.
To do this, I want to use the unique title values from Wikipedia's full database export.
So I made a simple task in Lucene's contrib/benchmark framework to hold onto the first 1M titles it hits. Titles tend to be small, say 100 characters each in the worst case, and a Java char takes 2 bytes, so worst case RAM would be ~200 MB or so, right?
Not so. It turns out that in Java, when you call String's substring method, the String returned to you shares the original String's backing character array, so the original's full contents can never be GC'd as long as you hold onto the substring. Java can do this "optimization" because Strings are immutable.
For me, this "optimization" is a disaster: the title is obtained by taking a substring of a large string (a line from a line-doc file) that holds the full body text as well! Instead of ~100 characters (~200 bytes) per unique title I was holding onto ~25K characters! Ugh.
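A quick back-of-the-envelope check shows how far off the estimate was. This is only a sketch: it assumes 2 bytes per char, ignores per-object overhead, and takes the ~25K characters figure at face value; the class and method names are made up for illustration.

```java
class RamEstimate {
    // Assumption: each char costs 2 bytes (UTF-16), object headers ignored.
    static final long BYTES_PER_CHAR = 2L;

    // Rough bytes needed to hold `titles` strings of `charsPerTitle` chars each.
    static long bytesFor(long titles, long charsPerTitle) {
        return titles * charsPerTitle * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        long expected = bytesFor(1_000_000L, 100L);    // what I thought I was keeping
        long pinned = bytesFor(1_000_000L, 25_000L);   // what the substrings kept alive
        System.out.println("expected: " + expected / 1_000_000L + " MB");   // expected: 200 MB
        System.out.println("pinned: " + pinned / 1_000_000_000L + " GB");   // pinned: 50 GB
    }
}
```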
Fortunately, the workaround is simple -- use the String constructor that takes another String. This forces a private copy.
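Here is a minimal sketch of the workaround. The tab-separated line layout (title first) and the helper name detachedTitle are assumptions for illustration, not the actual benchmark code.

```java
import java.util.HashSet;
import java.util.Set;

class TitleCopy {
    // Hypothetical helper: returns the text before the first tab as a
    // String that does not depend on the (much larger) input line.
    static String detachedTitle(String line) {
        int tab = line.indexOf('\t');
        String title = tab < 0 ? line : line.substring(0, tab);
        // new String(String) always yields a distinct String object. On JVMs
        // where substring shares its parent's char array, this is what makes
        // a private, title-sized copy so the full line can be collected.
        return new String(title);
    }

    public static void main(String[] args) {
        Set<String> titles = new HashSet<>();
        String line = "Some Title\t2010-01-01\ta very long body ...";
        titles.add(detachedTitle(line));
        System.out.println(titles); // [Some Title]
    }
}
```

The extra copy costs one small allocation per title, which is a fine trade when the strings are kept for a long time and the parent line is large.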
I imagine for many cases this "optimization" is very worthwhile. If you have a large original string, and pull many substrings from it, and then discard all of those substrings and the original string, you should see nice gains from this "optimization".
There is a long-standing bug open for this; likely it will never be fixed. Really, GC should be empowered to discard the original string and keep only the substring. Or perhaps substring should have some heuristics as to when it's dangerous to keep the reference to the original String.