
Plucene vs. Ferret

Switching from Plucene to Ferret for full-text search yielded huge performance improvements in both memory usage and execution time.

I set up search for an email list a year or two ago. The original search used Plucene, a Perl port of the well-known Apache Lucene search library. Performance was never great, about 15 sec to first results when I set it up, but over time it degraded to 60+ sec.

In the original search CGI program I set up, much of the time was spent looking up the document IDs returned by Plucene in .mhonarc.db to get the message text. The .mhonarc.db file contains all the data in the archive in the form of Perl code, so just loading the file was slow. More recently, though, most of the time was spent searching the Plucene index itself. The problem was memory consumption and thrashing: many searches would run out of memory (I've only got 160MB on that box) before finishing. That's in spite of "optimizing" the search index regularly.

I've had good experiences with Ferret, a Ruby port of Apache Lucene, so I ported my search CGI program to it. Now peak memory usage is 8MB, and first-hit response time is just 7 sec -- and that's on my eight-year-old server. Subsequent searches take an almost snappy 1.5 sec.
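
For the curious, the Ferret side of a CGI like this is only a few lines. Here's a minimal sketch, not my actual script; the index path, the query, and the :subject field are placeholders for whatever your index actually stores:

require 'ferret'

# Open the existing index on disk (path is a placeholder).
index = Ferret::Index::Index.new(:path => '/path/to/index')

# search_each yields the internal id and score of each hit;
# index[id] returns a lazy document whose stored fields are
# only loaded when accessed.
index.search_each('ruby AND ferret', :limit => 10) do |id, score|
  doc = index[id]
  printf("%5.2f  %s\n", score, doc[:subject])
end

That lazy field loading presumably helps keep the footprint small: only the stored fields of the ten hits ever get pulled into Ruby.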

So what was taking so long in Plucene? I'm not completely sure. A quick look at the call tree with the Perl profiler Devel::DProf indicates that Plucene::Index::TermInfosReader::_read_index was the deepest long-running function on the stack that hadn't returned when the process ran out of memory at 100MB. But print statements revealed that the call returned long before the excessive memory consumption started. Hitting Ctrl-C under the Perl debugger points to these lines of Plucene::Index::SegmentReader:

89  $self->freq_stream(
90       [ unpack "(w)*", read_file("$self->{directory}/$segment.frq") ]);

The file is only 1.6MB, nothing extraordinary. But unpack "(w)*" expands every BER-compressed integer in that file into a full Perl scalar in one big list, and at a couple dozen bytes of overhead per scalar, a million or so of them could plausibly account for much of the bloat.
