Lucene – the de-facto OEM search engine
Brian Remmington, Software Consultant, 19 May 2009
Ixxus has positioned itself as an expert in building solutions around content management systems (CMS), and is now a partner of three CMS vendors in the UK: Alfresco, FatWire, and Percussion. Together, these products cater for a broad segment of the CMS-buying public: Alfresco is open-source and particularly good at document management; FatWire is a comprehensive web content management (WCM) system that uses a “fried” (aka “coupled” or “dynamic”) approach to page delivery; and Percussion is another strong WCM product that uses a “baked” (aka “decoupled” or “static”) approach to page delivery.
As this illustrates, these products vary quite considerably when it comes to their approach to the problems of managing and delivering content, but they do share a couple of features. The first of these is that they are all implemented using Java technology as opposed to, for example, .NET technology. The second is that they all use the open-source search engine, Lucene.
Now, when you think about it, this is quite remarkable. The ability to find content in a CMS is one of its most important features, and all of these CMS vendors have looked at Lucene and decided that they can entrust it with that feature. And they are not alone. Lucene is embedded within several other CMS products from vendors such as Escenic, Day, and Clickability. Because of this, I thought I’d provide a brief, 1000-metre overview of Lucene’s basic concepts. Note that this is not intended for developers, but rather for “normal” people (no offence) who have an interest in learning a little background about the search element of content management systems.
At the very highest level, Lucene takes textual data, breaks it down into “terms”, and stores these terms on disk in an index. An index in this case is similar in purpose to the index that you’d find at the back of a book: for a given word it indicates which pages that word appears on. In Lucene terminology, the page is a “document”, the page number is the unique document identifier, and the word is a “term”. In Lucene, however, a document isn’t just a block of text like a page in a book. Each document comprises a sequence of “fields”, and each field has a name and a value. If we consider an email as a document, for example, the field names may be “subject”, “sender”, “recipient”, “time sent”, and “body”. The value for each field is a sequence of terms (words) – even the value of the “time sent” field is treated as text by Lucene.
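For anyone who does like to see ideas written down as code, the book-index analogy can be sketched in a few lines of Python. This is only a toy model held in memory, not Lucene's own API or storage format; the two emails, the `documents` dictionary and the `build_index` function are all made up for illustration.

```python
from collections import defaultdict

# Two hypothetical emails. Each is a "document" made up of named fields,
# and the dictionary key plays the role of the unique document identifier.
documents = {
    1: {
        "subject": "Ixxus partner news",
        "sender": "alice@example.com",
        "body": "Alfresco and FatWire both embed Lucene",
    },
    2: {
        "subject": "Lunch on Friday",
        "sender": "bob@example.com",
        "body": "Ixxus team lunch is on Friday",
    },
}


def build_index(docs):
    """Map each (field name, term) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_name, value in fields.items():
            # Crude term splitting: lowercase, then break on whitespace.
            for term in value.lower().split():
                index[(field_name, term)].add(doc_id)
    return index


index = build_index(documents)
print(sorted(index[("subject", "ixxus")]))  # [1]
print(sorted(index[("body", "ixxus")]))     # [2]
```

Note that the same word is indexed separately for each field, which is why “ixxus” finds a different email depending on whether you look in the subject or the body.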
When a new document is passed to Lucene to be added to its index, it is passed through something called an “analyzer”. This is responsible for a couple of tasks: firstly to break the field value up into the separate terms (called “tokenising”); and secondly to filter the resulting terms. Breaking the field value text into terms is fairly simple, typically breaking on whitespace or hyphens. For example, the text “The black-and-white cat walked over to the beautifully-coloured painting” would be broken down into the terms “the”, “black”, “and”, “white”, “cat”, “walked”, “over”, “to”, “the”, “beautifully”, “coloured”, and “painting”.
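The tokenising step on its own can be sketched as follows. This is a simplified stand-in rather than one of Lucene's real tokenisers: the `tokenise` function is hypothetical, and it lowercases the text as well as splitting it, purely so that the output matches the list of terms given above.

```python
import re


def tokenise(text):
    """Lowercase the text and split it on runs of whitespace or hyphens."""
    pieces = re.split(r"[\s\-]+", text.lower())
    # Drop any empty strings left over from leading or trailing separators.
    return [piece for piece in pieces if piece]


text = "The black-and-white cat walked over to the beautifully-coloured painting"
print(tokenise(text))
# ['the', 'black', 'and', 'white', 'cat', 'walked', 'over', 'to',
#  'the', 'beautifully', 'coloured', 'painting']
```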
Once the text has been tokenised, the filters then have a chance to run. Filters are often more complex. For example, terms like “the”, “and”, “over”, and “to” are so common as to be useless from a search perspective. It would be the job of a filter to remove them before the remaining terms are stored in the index. It’s also possible that terms like “walked”, “beautifully”, “coloured”, and “painting” would be better represented in the index as “walk”, “beautiful”, “colour”, and “paint” instead, so that this document can be found if one of these shorter terms is used in the search. This is called “stemming”, and is also the responsibility of a filter if desired. Going a step further, if one wanted to support synonym search then a filter could be used to add new terms such as “pretty” when it comes across the term “beautiful”. As you can see, filters can be both complex and language-specific. There is a range of tokenisers and filters available for Lucene, although it’s fair to say that it doesn’t have the broadest range of language support.
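A chain of filters like the ones just described might be sketched like this. Everything here is an assumption made for illustration: the stop-word list, the three-suffix “stemmer” and the one-entry synonym table are deliberately tiny, and real stemmers are far more sophisticated than chopping off a suffix.

```python
# Illustrative data only: real stop-word lists and synonym tables are much larger.
STOP_WORDS = {"the", "and", "over", "to"}
SUFFIXES = ("ing", "ed", "ly")
SYNONYMS = {"beautiful": ["pretty"]}


def remove_stop_words(terms):
    """Drop terms that are too common to be useful in a search."""
    return [term for term in terms if term not in STOP_WORDS]


def stem(term):
    """Toy stemmer: strip one known suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def stem_all(terms):
    return [stem(term) for term in terms]


def add_synonyms(terms):
    """Emit each term followed by any synonyms listed for it."""
    result = []
    for term in terms:
        result.append(term)
        result.extend(SYNONYMS.get(term, []))
    return result


def apply_filters(terms, filters):
    """Run the terms through each filter in order."""
    for term_filter in filters:
        terms = term_filter(terms)
    return terms


tokens = ["the", "black", "and", "white", "cat", "walked", "over", "to",
          "the", "beautifully", "coloured", "painting"]
filtered = apply_filters(tokens, [remove_stop_words, stem_all, add_synonyms])
print(filtered)
# ['black', 'white', 'cat', 'walk', 'beautiful', 'pretty', 'colour', 'paint']
```

The order of the filters matters: the synonym filter runs after stemming here, so it sees “beautiful” rather than “beautifully”.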
Once both the tokeniser and the filters have run, Lucene has a list of terms for each field along with statistics about these terms, such as how many times they occurred and at what positions in the text they occurred. It stores all this information in the index on disk in a way that makes it very quick to find answers (results) to questions (searches) such as “which documents contain the word ‘ixxus’ in their ‘subject’ field?”. As you might expect given its popularity, Lucene is blisteringly fast at carrying out searches.
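To give a feel for why such a question is quick to answer, here is a small sketch of an index that keeps the position of every occurrence of a term. It is an in-memory toy with made-up data, and the `index_field` and `search` functions are hypothetical; Lucene's real on-disk format is considerably more elaborate.

```python
from collections import defaultdict


def index_field(index, doc_id, field_name, terms):
    """Record the position of every occurrence of each term in this field."""
    for position, term in enumerate(terms):
        index[(field_name, term)][doc_id].append(position)


def search(index, field_name, term):
    """Return {document id: number of occurrences} for the given field and term."""
    postings = index.get((field_name, term), {})
    return {doc_id: len(positions) for doc_id, positions in postings.items()}


# (field name, term) -> document id -> list of positions
index = defaultdict(lambda: defaultdict(list))
index_field(index, 1, "subject", ["ixxus", "joins", "ixxus", "partners"])
index_field(index, 2, "subject", ["lunch", "friday"])
index_field(index, 2, "body", ["ixxus", "lunch"])

# "Which documents contain the word 'ixxus' in their 'subject' field?"
print(search(index, "subject", "ixxus"))  # {1: 2}
print(index[("subject", "ixxus")][1])     # [0, 2]
```

Answering the question is a single lookup by field and term, rather than a scan through the text of every document.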
There are many high-quality open-source products out there that are widely used, but relatively few have gained the pre-eminence in their particular fields that Lucene has in search. It is a truly impressive search engine.





