This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.
In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, holds all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.
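To make the ord support concrete, here is a toy sketch (plain Java, not Lucene's actual implementation) of a terms dictionary where a term's ordinal is just its position in sorted order, mirroring what TermsEnum.ord() and TermsEnum.seekExact(long ord) expose:

```java
import java.util.Arrays;

// Toy stand-in for an ord-capable terms dictionary: terms are kept in
// sorted order and a term's ordinal is its index in that order.
public class OrdTermsDict {
    private final String[] sortedTerms;

    public OrdTermsDict(String... terms) {
        sortedTerms = terms.clone();
        Arrays.sort(sortedTerms);
    }

    // Analogous to TermsEnum.ord(): ordinal of a term, or -1 if absent.
    public long ord(String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1;
    }

    // Analogous to TermsEnum.seekExact(long ord): term at a given ordinal.
    public String seekExact(long ord) {
        return sortedTerms[(int) ord];
    }
}
```

Because ords are dense (0, 1, 2, ...), per-term metadata can live in plain arrays indexed by ord, outside the FST, which is how keeping only ords inside the FST makes it more compact.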
Like the default BlockTree terms dictionary, these new terms dictionaries accept any PostingsBaseFormat, so you can separately plug in whichever format you want to encode/decode the postings.
Han also improved the PostingsBaseFormat API so that there is now a cleaner separation of how terms and their metadata are encoded vs. how postings are encoded; PostingsWriterBase.encodeTerm and PostingsReaderBase.decodeTerm now handle encoding and decoding any term metadata required by the postings format, abstracting away how the long[]/byte[] were persisted by the terms dictionary. Previously this line was annoyingly blurry.
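The idea behind this split can be sketched as a toy round-trip (the field names and types here are illustrative assumptions, not Lucene's real encodeTerm/decodeTerm signatures): the postings format alone decides what its per-term metadata means and how it packs into a long[] plus byte[], while the terms dictionary only persists those opaque arrays.

```java
import java.nio.ByteBuffer;

// Toy per-term metadata owned by a postings format.  The terms
// dictionary never interprets these values; it only stores the
// long[]/byte[] the postings format produces.
class TermMeta {
    long docStartFP;  // e.g. file pointer into the postings file (illustrative)
    int extra;        // some other postings-format-specific value (illustrative)

    // Pack metadata: numeric values go into the long[], the rest into bytes.
    static void encodeTerm(TermMeta meta, long[] longs, ByteBuffer bytes) {
        longs[0] = meta.docStartFP;
        bytes.putInt(meta.extra);
    }

    // Unpack metadata from the opaque arrays the terms dictionary stored.
    static TermMeta decodeTerm(long[] longs, ByteBuffer bytes) {
        TermMeta meta = new TermMeta();
        meta.docStartFP = longs[0];
        meta.extra = bytes.getInt();
        return meta;
    }
}
```

Any terms dictionary (BlockTree, FST-based, ...) can store these arrays however it likes, since decoding is entirely the postings format's job.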
Unfortunately, while the performance for primary key lookups is substantially faster, other queries, e.g. WildcardQuery, are slower; see LUCENE-3069 for details. Fortunately, using PerFieldPostingsFormat, you are free to pick and choose which fields (e.g. your "id" field) should use the new terms dictionary.
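The per-field routing amounts to a default-plus-overrides lookup; here is a minimal self-contained sketch of that idea (a stand-in, not Lucene's PerFieldPostingsFormat API — in real code you would subclass the current Codec and override its per-field postings format hook):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of per-field format selection: every field uses the
// default format unless explicitly routed elsewhere, e.g. sending the
// "id" field to an FST-based terms dictionary while other fields keep
// the default BlockTree.
public class PerFieldRouter {
    private final String defaultFormat;
    private final Map<String, String> overrides = new HashMap<>();

    public PerFieldRouter(String defaultFormat) {
        this.defaultFormat = defaultFormat;
    }

    public PerFieldRouter route(String field, String format) {
        overrides.put(field, format);
        return this;
    }

    // Plays the role of PerFieldPostingsFormat's per-field lookup.
    public String formatForField(String field) {
        return overrides.getOrDefault(field, defaultFormat);
    }
}
```

This way the primary-key field gets the fast in-memory terms dictionary while wildcard-heavy fields stay on BlockTree.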
For now this feature is trunk-only (eventually Lucene 5.0).
Thank you Han and thank you Google!
Interesting feature! Would it be possible to share the list of terms between different indexes? There would be a single entry in the terms vector, but one would need a relationship with your index/docId.
Hi Marcio,
I'm not sure what you're asking? Can you give more detail?
Sharing a term index across indices (or segments) is tricky because there is important per-segment metadata (file pointers to the docs, freqs, positions/payloads/offsets files) that is still per-index/segment.
Hi Michael, thank you for the feedback.
When working with multiple indexes, there can be common fields between the indexes. How can we share them without creating a new file for each index?
I think that the terms could be shared between indexes.
Sample:
Term0 -> Segment0 (index) -> [docId0, docId1]
-> Segment1 (index) -> [docId0, docId3]
Term1 -> Segment0 (index) -> [docId0]
-> Segment1 (index) -> [docId2]
...
Thanks,
Marcio
OK, I understand.
This would be tricky in Lucene, since every segment is "write once"; e.g., with this sharing, when another segment is written, you would need a malleable data structure to go and update the terms to point to metadata for the new segment. E.g., our FST impl is not "updateable"; this is just a limitation of our impl. and not FSTs in general.
Maybe you could build something on top of Lucene to maintain this...
Hi Michael,
It would really have to change much.
Thanks for your advice.
Marcio
Hi Michael, I am an M.Tech student doing my dissertation on information retrieval. I read your article at http://people.apache.org/~mikemccand/lucenebench/indexing.html, but I am not able to download index.java, so please re-upload the code, which will be helpful for my dissertation. Thanks in advance.
Hi Mrugendra,
All the source code for the Lucene nightly performance tests is here: https://code.google.com/a/apache-extras.org/p/luceneutil
But that code is not very approachable; it's probably better to start with Lucene's "demo" module? That code is meant to be the "hello world" of using Lucene.
Thank you sir. What are the steps to run indexer.java? [I am confused because the code is implemented in Python and Java.] I went through the readme.txt but it's a little bit difficult for me to understand. Is it possible to run it on Windows 8 or Linux Mint 15 with Eclipse Kepler?
[If you don't mind then please give me your mail id.]
Indexer itself is a low-level tool that takes a zillion command-line arguments. But normally it's "spawned" via the Python scripts, which give higher-level access.
Still, I don't think this is the best way to get started using Lucene; this code is rather messy because it's trying to test so many different Lucene features in a single tool. Why not start with Lucene's demo module (IndexFiles.java, SearchFiles.java) instead?
Sir, my research topic is to optimize the index file [reduce index file size, RAM usage, and CPU utilization, and create an index with payloads to improve search speed].
1. I am using Lucene to index PDF files [indexing the file name and content]. After applying StandardAnalyzer, the Lucene index file size is 11 MB for 1.77 GB, against the Windows 8 windows.edb file size of 42 MB for 1.77 GB [tested in a Windows desktop environment].
2. But sir, how do I apply lemmatization with StandardAnalyzer to reduce the index file size and ADD PAYLOAD during indexing?
3. Sir, where can I find the test benchmark?
For a detailed discussion, sir, can you give me your mail id? [My mail ID: rahevar.mrugendra@gmail.com]
Hi Mrugendra,
I think it's best if you email the Lucene user's list with these questions (java-user@lucene.apache.org)?