Transcript: Lucene-Demo
Lucene-Demo
Brian Nisonger
Intro
No details about Implementation/Theory
See Treehouse Wiki - Lucene for additional info
Set of Java classes
Not an end-to-end solution
Designed to allow rapid development of IR tools
Index
The first step is to take a set of text documents and build an Index
Demo: IndexFiles on Pongo
Two major classes
Analyzer
Used to Tokenize data
More on this later
IndexWriter
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
Index Writer
IndexWriter creates an index of documents
First argument is the directory where the index is built/found
Second argument is the Analyzer used to tokenize documents
Third argument determines whether a new index should be created (a commented sketch follows below)
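A minimal sketch of this call with each argument commented (assumes the Lucene 1.x string-path constructor shown earlier; the directory name "index" is only an example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import java.io.IOException;

public class CreateIndex
{
    public static void main(String[] args) throws IOException
    {
        // Arg 1: directory where the index is built/found (hypothetical path)
        // Arg 2: the Analyzer used to tokenize documents
        // Arg 3: true = create a new index, false = add to an existing one
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        // ... add documents here (see the following slides) ...

        writer.close();
    }
}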
Analyzer
Standard Analyzer
Porter Stemming w/ Stop Words
Krovetz Stemmer - Example
package org.apache.lucene.analysis;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.KStemFilter;
import java.io.Reader;

// Analyzer that lowercases/tokenizes the input and runs each token
// through the Krovetz stemmer (KStemFilter).
public class KStemAnalyzer extends Analyzer
{
    public final TokenStream tokenStream(String fieldName, Reader reader)
    {
        return new KStemFilter(new LowerCaseTokenizer(reader));
    }
}
Analyzer-II
Snowball Stemmer (a usage sketch follows below)
A stemming language created by Porter, used to build stemmers
Multilingual analyzers/stemmers
Porter2
Fully integrated with Lucene 1.9.1
MyAnalyzer (home-built)
Demo
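A rough sketch of plugging in a Snowball stemmer (assumes the contrib SnowballAnalyzer class shipped with Lucene's snowball package; the language name "English" and the index path are only examples):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.IndexWriter;
import java.io.IOException;

public class SnowballIndex
{
    public static void main(String[] args) throws IOException
    {
        // "English" selects the Snowball (Porter2) English stemmer;
        // other language names work the same way.
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English");

        // Used exactly like StandardAnalyzer when building the index.
        IndexWriter writer = new IndexWriter("index", analyzer, true);
        writer.close();
    }
}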
Adding Documents
The next step after creating an index is to add documents
writer.addDocument(FileDocument.Document(file));
Remember we already determined how the document will be tokenized
Fields
Can split a document into parts such as document title, body, date created, or paragraphs (see the sketch below)
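A hedged sketch of splitting a document into fields (the field names are only examples; the Field.Store/Field.Index flags are from the Lucene 1.9-era API):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldExample
{
    public static Document makeDoc(String title, String body, String dateCreated)
    {
        Document doc = new Document();
        // Stored and tokenized: searchable and retrievable.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // Stored as-is: good for dates and identifiers.
        doc.add(new Field("created", dateCreated, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Tokenized by the Analyzer chosen earlier, but not stored.
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}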
Adding Documents-II
Assigns Token/doc ID
For why this is important see Lucene - Treehouse Wiki
Create some type of loop to add all the documents (a sketch follows below)
This is the actual creation of the Index; before, we merely set the Index parameters
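A minimal sketch of such a loop, assuming the demo's FileDocument helper and a flat directory of text files (both directory names are hypothetical):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.demo.FileDocument;
import org.apache.lucene.index.IndexWriter;
import java.io.File;
import java.io.IOException;

public class AddAllDocs
{
    public static void main(String[] args) throws IOException
    {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        // Loop over every file in the document directory and add it;
        // this is where the index actually gets populated.
        File[] files = new File("docs").listFiles();
        for (int i = 0; i < files.length; i++)
        {
            writer.addDocument(FileDocument.Document(files[i]));
        }

        writer.optimize();  // merge segments (next slide)
        writer.close();
    }
}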
Finalizing Index Creation
After that, the Index is optimized with writer.optimize();
Merges index segments, etc.
The Index is closed with writer.close();
Searching an Index
Open Index
Create Searcher
IndexReader reader = IndexReader.open(index);
Searcher searcher = new IndexSearcher(reader);
Assign Analyzer
Use the same Analyzer used to create the Index (Why?)
Searching an Index-II
Parse/Create query
Query query = QueryParser.parse(line, field, analyzer);
Takes a line, looks for a particular field, and runs it through an analyzer to create the query
Determine which documents are matches
Hits hits = searcher.search(query);
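Putting the searching steps together, a sketch that assumes the 1.9-era static QueryParser.parse shown above; the index path, field name, and query string are only examples:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class SearchIndex
{
    public static void main(String[] args) throws Exception
    {
        // Open the index built earlier and wrap it in a searcher.
        IndexReader reader = IndexReader.open("index");
        Searcher searcher = new IndexSearcher(reader);

        // Same Analyzer as at index time, so query terms are
        // tokenized/stemmed the same way as the indexed terms.
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Parse the query string against the "contents" field.
        Query query = QueryParser.parse("information retrieval", "contents", analyzer);

        // Hits is the collection of matching documents.
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching documents");

        searcher.close();
        reader.close();
    }
}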
Retrieving Documents
Hits creates a collection of documents
Using a loop we can reference each doc (a sketch follows after this slide)
Document doc = hits.doc(i);
This allows us to get info about the document
Name of document, date it was created, words in document
Relevancy Score (TF/IDF)
Demo
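A sketch of that loop, continuing from the search example above (the "path" field name matches what the demo's FileDocument stores, but treat it as an assumption):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import java.io.IOException;

public class PrintHits
{
    // Walk the Hits collection and print per-document info.
    static void printHits(Hits hits) throws IOException
    {
        for (int i = 0; i < hits.length(); i++)
        {
            Document doc = hits.doc(i);      // i-th matching document
            String name = doc.get("path");   // stored field, e.g. the file name
            float score = hits.score(i);     // relevancy score (TF/IDF based)
            System.out.println(name + "  score=" + score);
        }
    }
}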
Finishing Searching
Return list of documents
Close Reader
Other Functions
Spans (Example from http://lucene.apache.org/java/docs/api/index.html)
Useful for Phrasal matching
Allows for Passage Retrieval (a sketch follows below)
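A hedged sketch of the span queries referred to above (assumes the org.apache.lucene.search.spans classes; the field name and terms are examples):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class SpanExample
{
    public static void main(String[] args) throws Exception
    {
        IndexReader reader = IndexReader.open("index");

        // Match "information" and "retrieval" within 3 positions of each
        // other, in order, which gives a loose phrasal match.
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("contents", "information")),
            new SpanTermQuery(new Term("contents", "retrieval"))
        };
        SpanNearQuery near = new SpanNearQuery(clauses, 3, true);

        // Spans expose the matching positions, which is what makes
        // passage retrieval possible.
        Spans spans = near.getSpans(reader);
        while (spans.next())
        {
            System.out.println("doc " + spans.doc()
                + " positions " + spans.start() + "-" + spans.end());
        }

        reader.close();
    }
}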
Questions?
Any Questions, comments, jokes, opinions??
I said “Good Day”
The END