Transcript: Lucene-Demo
Lucene-Demo
Brian Nisonger
Intro
No details about Implementation/Theory
See Treehouse Wiki - Lucene for additional info
Set of Java classes
Not an end-to-end solution
Designed to allow rapid development of IR tools
Index
The first step is to take a set of text documents and build an Index
Demo: IndexFiles on Pongo
Two major classes
Analyzer
Used to Tokenize data
More on this later
IndexWriter
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
Index Writer
IndexWriter creates an index of documents
First argument is the directory where the index is built/found
Second argument is the Analyzer used to tokenize documents
Third argument determines whether a new index should be created (a commented sketch follows below)
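A minimal sketch of this call with each argument commented (assumes the Lucene 1.x string-path constructor shown earlier; the directory name "index" is only an example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import java.io.IOException;

public class CreateIndex
{
    public static void main(String[] args) throws IOException
    {
        // Arg 1: directory where the index is built/found (hypothetical path)
        // Arg 2: the Analyzer used to tokenize documents
        // Arg 3: true = create a new index, false = add to an existing one
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        // ... add documents here (see the following slides) ...

        writer.close();
    }
}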
Analyzer
Standard Analyzer
Porter Stemming w/ Stop Words
Krovetz Stemmer - Example
package org.apache.lucene.analysis;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.KStemFilter;
import java.io.Reader;

// Analyzer that lowercases/tokenizes the input and runs each token
// through the Krovetz stemmer (KStemFilter).
public class KStemAnalyzer extends Analyzer
{
    public final TokenStream tokenStream(String fieldName, Reader reader)
    {
        return new KStemFilter(new LowerCaseTokenizer(reader));
    }
}
Analyzer-II
Snowball Stemmer (a usage sketch follows below)
A stemming language created by Porter, used to build stemmers
Multilingual analyzers/stemmers
Porter2
Fully integrated with Lucene 1.9.1
MyAnalyzer (home-built)
Demo
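A rough sketch of plugging in a Snowball stemmer (assumes the contrib SnowballAnalyzer class shipped with Lucene's snowball package; the language name "English" and the index path are only examples):

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.IndexWriter;
import java.io.IOException;

public class SnowballIndex
{
    public static void main(String[] args) throws IOException
    {
        // "English" selects the Snowball (Porter2) English stemmer;
        // other language names work the same way.
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English");

        // Used exactly like StandardAnalyzer when building the index.
        IndexWriter writer = new IndexWriter("index", analyzer, true);
        writer.close();
    }
}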
Adding Documents
The next step after creating an index is to add documents
writer.addDocument(FileDocument.Document(file));
Remember we already determined how the document will be tokenized
Fields
Can split a document into parts such as document title, body, date created, or paragraphs (see the sketch below)
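A hedged sketch of splitting a document into fields (the field names are only examples; the Field.Store/Field.Index flags are from the Lucene 1.9-era API):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldExample
{
    public static Document makeDoc(String title, String body, String dateCreated)
    {
        Document doc = new Document();
        // Stored and tokenized: searchable and retrievable.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // Stored as-is: good for dates and identifiers.
        doc.add(new Field("created", dateCreated, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Tokenized by the Analyzer chosen earlier, but not stored.
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }
}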
Adding Documents-II
Assigns Token/doc ID
For why this is important see Lucene - Treehouse Wiki
Create some type of loop to add all the documents (a sketch follows below)
This is the actual creation of the Index; before, we merely set the Index parameters
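A minimal sketch of such a loop, assuming the demo's FileDocument helper and a flat directory of text files (both directory names are hypothetical):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.demo.FileDocument;
import org.apache.lucene.index.IndexWriter;
import java.io.File;
import java.io.IOException;

public class AddAllDocs
{
    public static void main(String[] args) throws IOException
    {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        // Loop over every file in the document directory and add it;
        // this is where the index actually gets populated.
        File[] files = new File("docs").listFiles();
        for (int i = 0; i < files.length; i++)
        {
            writer.addDocument(FileDocument.Document(files[i]));
        }

        writer.optimize();  // merge segments (next slide)
        writer.close();
    }
}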
Finalizing Index Creation
After that, the Index is optimized with writer.optimize();
Merges index segments, etc.
The Index is closed with writer.close();
Searching an Index
Open Index
Create Searcher
IndexReader reader = IndexReader.open(index);
Searcher searcher = new IndexSearcher(reader);
Assign Analyzer
Use the same Analyzer used to create the Index (Why?)
Searching an Index-II
Parse/Create query
Query query = QueryParser.parse(line, field, analyzer);
Takes a line, looks for a particular field, and runs it through an analyzer to create the query
Determine which documents are matches
Hits hits = searcher.search(query);
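Putting the searching steps together, a sketch that assumes the 1.9-era static QueryParser.parse shown above; the index path, field name, and query string are only examples:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class SearchIndex
{
    public static void main(String[] args) throws Exception
    {
        // Open the index built earlier and wrap it in a searcher.
        IndexReader reader = IndexReader.open("index");
        Searcher searcher = new IndexSearcher(reader);

        // Same Analyzer as at index time, so query terms are
        // tokenized/stemmed the same way as the indexed terms.
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Parse the query string against the "contents" field.
        Query query = QueryParser.parse("information retrieval", "contents", analyzer);

        // Hits is the collection of matching documents.
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching documents");

        searcher.close();
        reader.close();
    }
}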
Retrieving Documents
Hits creates a collection of documents
Using a loop we can reference each doc (a sketch follows after this slide)
Document doc = hits.doc(i);
This allows us to get info about the document
Name of document, date it was created, words in document
Relevancy Score (TF/IDF)
Demo
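A sketch of that loop, continuing from the search example above (the "path" field name matches what the demo's FileDocument stores, but treat it as an assumption):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import java.io.IOException;

public class PrintHits
{
    // Walk the Hits collection and print per-document info.
    static void printHits(Hits hits) throws IOException
    {
        for (int i = 0; i < hits.length(); i++)
        {
            Document doc = hits.doc(i);      // i-th matching document
            String name = doc.get("path");   // stored field, e.g. the file name
            float score = hits.score(i);     // relevancy score (TF/IDF based)
            System.out.println(name + "  score=" + score);
        }
    }
}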
Finishing Searching
Return list of documents
Close Reader
Other Functions
Spans (Example from http://lucene.apache.org/java/docs/api/index.html)
Useful for Phrasal matching
Allows for Passage Retrieval (a sketch follows below)
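A hedged sketch of the span queries referred to above (assumes the org.apache.lucene.search.spans classes; the field name and terms are examples):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class SpanExample
{
    public static void main(String[] args) throws Exception
    {
        IndexReader reader = IndexReader.open("index");

        // Match "information" and "retrieval" within 3 positions of each
        // other, in order, which gives a loose phrasal match.
        SpanQuery[] clauses = new SpanQuery[] {
            new SpanTermQuery(new Term("contents", "information")),
            new SpanTermQuery(new Term("contents", "retrieval"))
        };
        SpanNearQuery near = new SpanNearQuery(clauses, 3, true);

        // Spans expose the matching positions, which is what makes
        // passage retrieval possible.
        Spans spans = near.getSpans(reader);
        while (spans.next())
        {
            System.out.println("doc " + spans.doc()
                + " positions " + spans.start() + "-" + spans.end());
        }

        reader.close();
    }
}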
Questions?
Any Questions, comments, jokes, opinions??
I said “Good Day”
The END