Searching in Lucene is as fast and simple as indexing; the power of this functionality is astonishing, as chapters 3 and 5 will show you. For now, let's look at Searcher, a command-line program that we'll use to search the index created by Indexer. (Keep in mind that our Searcher serves only to demonstrate the use of Lucene's search API. Your search application could also take the form of a web or desktop application with a GUI, an EJB, and so on.)
In the previous section, we indexed a directory of text files. The index, in this example, resides in a directory of its own on the file system. We instructed Indexer to create a Lucene index in a build/index directory, relative to the directory from which we invoked Indexer. As you saw in listing 1.1, this index contains the indexed files and their absolute paths. Now we need to use Lucene to search that index in order to find files that contain a specific piece of text. For instance, we may want to find all files that contain the keyword java or lucene, or we may want to find files that include the phrase "system requirements."
The Searcher program complements Indexer and provides command-line searching capability. Listing 1.2 shows Searcher in its entirety. It takes two command-line arguments:
The path to the index created with Indexer
A query to use to search the index
| Listing 1.2 Searcher: searches a Lucene index for a query passed as an argument |

Searcher, like its Indexer sibling, has only a few lines of code dealing with Lucene. A couple of special things occur in the search method:
(1) We use Lucene’s IndexSearcher and FSDirectory classes to open our index for searching.
(2) We use QueryParser to parse a human-readable query into Lucene’s Query class.
(3) Searching returns hits in the form of a Hits object.
(4) Note that the Hits object contains only references to the underlying documents. In other words, instead of being loaded immediately upon search, matches are loaded from the index in a lazy fashion, only when requested with the hits.doc(int) call.
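The lazy loading described in note (4) can be illustrated with a small stand-alone model. This is only a sketch in plain, modern Java and is not Lucene's Hits class: the names LazyResults, loader, and loadCount are hypothetical, and the loader function merely stands in for the step that reads a stored document from the index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Toy model of lazily loaded search results; NOT Lucene's Hits class. */
class LazyResults {
    private final int[] docIds;                 // the result holds references only
    private final IntFunction<String> loader;   // hypothetical "read document from index" step
    private final Map<Integer, String> loaded = new HashMap<>();
    private int loadCount = 0;                  // how many documents were actually fetched

    LazyResults(int[] docIds, IntFunction<String> loader) {
        this.docIds = docIds.clone();
        this.loader = loader;
    }

    /** Number of matches; answering this loads nothing. */
    int length() {
        return docIds.length;
    }

    /** Loads the n-th match on first request and caches it for later calls. */
    String doc(int n) {
        int id = docIds[n];
        String document = loaded.get(id);
        if (document == null) {
            document = loader.apply(id);
            loaded.put(id, document);
            loadCount++;
        }
        return document;
    }

    int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        // Made-up document ids and paths, for illustration only.
        LazyResults results = new LazyResults(new int[] {4, 7, 9},
                id -> "/docs/file" + id + ".txt");
        System.out.println(results.length() + " matches, loaded: " + results.loadCount());
        System.out.println(results.doc(1));
        System.out.println("loaded: " + results.loadCount());
    }
}
```

In this model, asking for the number of matches costs nothing, and only the documents that are actually requested are fetched. That is the property that lets a search report many matches cheaply while displaying just a few of them.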
Let’s run Searcher and find some documents in our index using the query 'lucene':
% java lia.meetlucene.Searcher build/index 'lucene'
Found 6 document(s) (in 66 milliseconds) that matched query 'lucene':
/lucene/README.txt
/lucene/src/jsp/README.txt
/lucene/BUILD.txt
/lucene/todo.txt
/lucene/LICENSE.txt
/lucene/CHANGES.txt
The output shows that 6 of the 13 documents we indexed with Indexer contain the word lucene and that the search took a meager 66 milliseconds. Because Indexer stores files’ absolute paths in the index, Searcher can print them out. It’s worth noting that storing the file path as a field was our decision and appropriate in this case, but from Lucene’s perspective it’s arbitrary meta-data attached to indexed documents.
Of course, you can use more sophisticated queries, such as 'lucene AND doug'
or 'lucene AND NOT slow' or '+lucene +book', and so on. Chapters 3, 5, and 6
cover all different aspects of searching, including Lucene’s query syntax.
The Searcher class is a simplistic demo of Lucene’s search features. As such, it
only dumps matches to the standard output. However, Searcher has one more
trick up its sleeve. Imagine that you need to find files that contain a certain keyword
or phrase, and then you want to process the matching files in some way. To
keep things simple, let’s imagine that you want to list each matching file using
the ls UNIX command, perhaps to see the file size, permission bits, or owner. By
having matching document paths written unadorned to the standard output,
and having the statistical output written to standard error, you can use the nifty
UNIX xargs utility to process the matched files, as shown here:
% java lia.meetlucene.Searcher build/index 'lucene AND NOT slow' | xargs ls -l
Found 6 document(s) (in 131 milliseconds) that matched query 'lucene AND NOT slow':
-rw-r--r-- 1 erik staff 4215 10 Sep 21:51 /lucene/BUILD.txt
-rw-r--r-- 1 erik staff 17889 28 Dec 10:53 /lucene/CHANGES.txt
-rw-r--r-- 1 erik staff 2670 4 Nov 2001 /lucene/LICENSE.txt
-rw-r--r-- 1 erik staff 683 4 Nov 2001 /lucene/README.txt
-rw-r--r-- 1 erik staff 370 26 Jan 2002 /lucene/src/jsp/README.txt
-rw-r--r-- 1 erik staff 943 18 Sep 21:27 /lucene/todo.txt
In this example, we chose the Boolean query 'lucene AND NOT slow', which finds all files that contain the word lucene and don’t contain the word slow. This query took 131 milliseconds and found 6 matching files. We piped Searcher’s output to the xargs command, which in turn used the ls -l command to list each matching file. In a similar fashion, the matched files could be copied, concatenated, emailed, or dumped to standard output.
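The split between standard output and standard error that makes this pipeline work can be sketched in a few lines of plain Java. The class and method names below (PipeFriendlyReport, report) are hypothetical and this is not the book's Searcher code; the sketch only shows match paths going to one stream and the summary line going to another.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

/** Sketch of pipe-friendly reporting; class and method names are hypothetical. */
class PipeFriendlyReport {

    /** Writes the summary line to 'err' and one unadorned path per line to 'out'. */
    static void report(List<String> paths, String query, long millis,
                       PrintStream out, PrintStream err) {
        // The statistics go to the error stream, so they never enter the pipe.
        err.println("Found " + paths.size() + " document(s) (in " + millis
                + " milliseconds) that matched query '" + query + "':");
        // The paths go to the output stream, one per line, ready for xargs.
        for (String path : paths) {
            out.println(path);
        }
    }

    public static void main(String[] args) {
        // Sample values taken from the run shown in the text above.
        List<String> paths = Arrays.asList("/lucene/BUILD.txt", "/lucene/todo.txt");
        report(paths, "lucene AND NOT slow", 131, System.out, System.err);
    }
}
```

Because only the paths reach standard output, a downstream command such as xargs sees a clean list of filenames, while the summary still appears on the terminal through standard error.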
Our example indexing and searching applications demonstrate Lucene in a
lot of its glory. Its API usage is simple and unobtrusive. The bulk of the code (and
this applies to all applications interacting with Lucene) is plumbing relating to
the business purpose—in this case, Indexer’s file system crawler that looks for
text files and Searcher’s code that prints matched filenames based on a query to
the standard output. But don’t let this fact, or the conciseness of the examples,
tempt you into complacence: There is a lot going on under the covers of Lucene,
and we’ve used quite a few best practices that come from experience. To effectively
leverage Lucene, it’s important to understand more about how it works
and how to extend it when the need arises. The remainder of this book is dedicated
to giving you these missing pieces.
Written by Otis Gospodnetic and Erik Hatcher and reproduced from "Lucene in Action" by permission of Manning Publications Co. ISBN 1932394281, copyright 2004. All rights reserved. See http://www.manning.com for more information.
Created: March 27, 2003
Revised: January 24, 2005
URL: http://webreference.com/programming/lucene/1