This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.
In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, holds all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.
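
To make the ord idea concrete, here is a toy sketch, not Lucene's actual code. A sorted array stands in for the FST, a term's position in that array is its ordinal, and the per-term metadata lives in a separate array indexed by that ordinal. The class OrdTermsSketch, its methods, and the use of a document frequency as the metadata are all made up for illustration.

```java
import java.util.Arrays;

/** Toy model only: a sorted array plays the role of the FST. */
final class OrdTermsSketch {
    private final String[] terms;   // sorted; array index == ordinal
    private final int[] docFreqs;   // metadata kept outside the "dictionary", indexed by ordinal

    OrdTermsSketch(String[] sortedTerms, int[] docFreqs) {
        if (sortedTerms.length != docFreqs.length) {
            throw new IllegalArgumentException("one metadata entry per term is required");
        }
        this.terms = sortedTerms.clone();
        this.docFreqs = docFreqs.clone();
    }

    /** Term to ordinal; returns -1 if the term is absent. */
    long ord(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? i : -1;
    }

    /** Ordinal to term. */
    String termAt(long ord) {
        return terms[Math.toIntExact(ord)];
    }

    /** Metadata lookup by ordinal; the dictionary itself only had to map term to ord. */
    int docFreq(long ord) {
        return docFreqs[Math.toIntExact(ord)];
    }

    public static void main(String[] args) {
        OrdTermsSketch dict = new OrdTermsSketch(
            new String[] {"apple", "banana", "cherry"},
            new int[] {3, 1, 7});
        long ord = dict.ord("banana");
        System.out.println(ord + " " + dict.termAt(ord) + " " + dict.docFreq(ord));
    }
}
```

The point of the split is that the dictionary only needs to output one small number per term, and everything else can sit in a flat side structure.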
Like the default BlockTree terms dictionary, these new terms dictionaries accept any PostingsBaseFormat so you can separately plug in whichever format you want to encode/decode the postings.
Han also improved the PostingsBaseFormat API so that there is now a cleaner separation of how terms and their metadata are encoded vs. how postings are encoded; PostingsWriterBase.encodeTerm and PostingsReaderBase.decodeTerm now handle encoding and decoding any term metadata required by the postings format, abstracting away how the long[]/byte[] were persisted by the terms dictionary. Previously this line was annoyingly blurry.
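
Here is a rough sketch of that separation, under heavy simplification. The postings side turns its private per-term state into opaque longs and bytes, and the terms dictionary side stores and returns them without knowing what they mean. Everything here is hypothetical: the TermState fields, the OpaqueMetadata holder, the map-backed TermsDictionary, and the one-argument encodeTerm/decodeTerm methods are stand-ins, and the real Lucene methods take more arguments.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: names and fields are illustrative, not Lucene's API. */
final class TermMetadataSketch {

    /** Hypothetical per-term state that only the postings format understands. */
    static final class TermState {
        final long docStartFP;  // file pointer into a postings file
        final int skipOffset;   // some small extra value the format needs
        TermState(long docStartFP, int skipOffset) {
            this.docStartFP = docStartFP;
            this.skipOffset = skipOffset;
        }
    }

    /** What the terms dictionary sees: numbers and bytes with no meaning attached. */
    static final class OpaqueMetadata {
        final long[] longs;
        final byte[] bytes;
        OpaqueMetadata(long[] longs, byte[] bytes) {
            this.longs = longs;
            this.bytes = bytes;
        }
    }

    /** Postings side: flatten the state into longs plus a byte blob. */
    static OpaqueMetadata encodeTerm(TermState state) {
        long[] longs = {state.docStartFP};
        byte[] bytes = ByteBuffer.allocate(Integer.BYTES).putInt(state.skipOffset).array();
        return new OpaqueMetadata(longs, bytes);
    }

    /** Postings side: rebuild the state from whatever the dictionary handed back. */
    static TermState decodeTerm(OpaqueMetadata meta) {
        int skipOffset = ByteBuffer.wrap(meta.bytes).getInt();
        return new TermState(meta.longs[0], skipOffset);
    }

    /** Dictionary side: persists the metadata per term without interpreting it. */
    static final class TermsDictionary {
        private final Map<String, OpaqueMetadata> entries = new LinkedHashMap<>();
        void add(String term, OpaqueMetadata meta) { entries.put(term, meta); }
        OpaqueMetadata lookup(String term) { return entries.get(term); }
    }

    public static void main(String[] args) {
        TermsDictionary dict = new TermsDictionary();
        dict.add("lucene", encodeTerm(new TermState(1024L, 42)));
        TermState state = decodeTerm(dict.lookup("lucene"));
        System.out.println(state.docStartFP + " " + state.skipOffset);
    }
}
```

Because the dictionary never looks inside the longs and bytes, you could swap the map for an FST, or swap the postings encoding, without touching the other side.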
Unfortunately, while primary key lookups are substantially faster, other queries, e.g. WildcardQuery, are slower; see LUCENE-3069 for details. Fortunately, using PerFieldPostingsFormat, you are free to pick and choose which fields (e.g. your "id" field) should use the new terms dictionary.
For now this feature is trunk-only (eventually Lucene 5.0).
Thank you Han and thank you Google!