Thanks to Erik Hatcher for his excellent introduction to Apache Lucene. Lucene is a full-featured text search engine developed by Doug Cutting. Lucene powers the search capabilities of major sites such as Nutch, jGuru, and TheServerSide. Lucene is extremely scalable: it can handle 1000 search queries per second and has been tested on a 100 million page demo system. Erik provided an overview of the core concepts and classes, along with examples demonstrating basic use as well as different indexing and search strategies. Since Lucene operates on pure text, he also discussed the handling of PDF, Microsoft Word, HTML, and other document formats. In response to audience questions, Erik delved into the index file structure as well as the mechanics of querying and indexing concurrently. The meeting ran 45 minutes over thanks to many inquisitive questions!
Thanks to the New England Software Symposium for arranging Erik's visit!
Lucene is a highly scalable and fast search engine API. Lucene is so good, in fact, that it is being used at the heart of a new open-source Google killer. This presentation will take an outside-in approach to Lucene, highlighting several real-world uses of it and then digging into its internals to learn what makes it tick. One of the beauties of Lucene is that it is very easy to use, yet has significant power. If you are not familiar with this Jakarta gem, you are missing out. Come see what you've been missing and put Lucene in action right away. The session will begin with several case studies of high-profile sites leveraging Lucene and a discussion of what makes them tick. These case studies demonstrate that Lucene is plenty powerful enough for your search needs, yet developer cleverness in how it is used is what adds value. Lucene's straightforward API then takes the stage, including specifics on indexing, searching, updating, and techniques to parallelize them. Digging even deeper, it is imperative to understand Lucene's analysis process in detail. Textual analysis can include stemming, stop word removal, synonym injection, and much more. The majority of user questions on Lucene involve a misunderstanding of the analysis process and what that means for searching; this session answers these questions.
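To make the analysis step a little more concrete, here is a minimal sketch in plain Java. It does not use Lucene's API at all: the class name AnalysisSketch, the tiny stop word list, and the crude suffix-stripping rules are all assumptions made up for illustration. It only shows the general idea that text is turned into normalized terms before indexing, and that a query has to go through the same steps if it is to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: this is NOT Lucene's Analyzer API.
class AnalysisSketch {

    // A tiny, made-up stop word list; real lists are much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "is", "in"));

    // Lowercase, split on non-alphanumerics, drop stop words, apply a naive stem.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // Deliberately crude suffix stripping; real stemmers have many more rules.
    static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        // Terms as they would be stored at indexing time.
        List<String> indexed = analyze("The Indexing of documents");
        System.out.println(indexed);

        // A raw query term does not match the stored terms...
        String rawQuery = "Documents";
        System.out.println(indexed.contains(rawQuery));

        // ...but the same term run through the same analysis does.
        System.out.println(indexed.contains(analyze(rawQuery).get(0)));
    }
}
```

Running main prints the stored terms [index, document], then false for the raw query term and true for the analyzed one. That mismatch between what was typed and what was actually indexed is the kind of confusion the abstract refers to.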
Erik Hatcher is the co-author of the premier book on Ant, Java Development with Ant, and author of the just-released Lucene in Action. He is an active Ant project developer and maintains jGuru's Ant FAQ and Forum.