Transcript indexing
IR Indexing
Thanks to
B. Arms
SIMS
Baldi, Frasconi, Smyth
Manning, Raghavan, Schutze
What we have covered
• What is IR
• Evaluation
• Tokenization, terms and properties of text
• Web crawling; robots.txt
• Vector model of documents
  – Bag of words
• Queries
• Measures of similarity
• This presentation
  – Indexing
  – Inverted files
Last time
• Vector models of documents and queries
  – Bag of words model
• Similarity measures
  – Text similarity typically used
  – Similarity is a measure of relevance (and ranking)
  – Query is a small document
  – Match query to document
• All stored and indexed before a query is matched.
Summary: What’s the point of using
vector spaces?
• A well-formed algebraic space for retrieval
• Key: A user’s query can be viewed as a (very) short document.
• Query becomes a vector in the same space as the docs.
• Can measure each doc’s proximity to it.
• Natural measure of scores/ranking – no longer Boolean.
  – Queries are expressed as bags of words
• Other similarity measures: see
  http://www.lans.ece.utexas.edu/~strehl/diss/node52.html for a survey
Solr/Lucene
[Diagram: a typical Web search engine – Web → Crawler → Indexer → Index; Users → Interface → Query Engine → Index – annotated with the tools that fill these roles (Solr/Lucene, Nutch, YouSeer, LucidWorks).]
Indexing
[Diagram: a typical Web search engine – Web → Crawler → Indexer → Index; Users → Interface → Query Engine → Index.]
Why indexing?
• For efficient searching of documents with unstructured text (not databases), using terms as queries
  – Databases can still be searched with an index
    • Sphinx
  – Online sequential text search (grep)
    • Small collection
    • Text volatile
  – Data structures – indexes
    • Large, semi-stable document collection
    • Efficient search
Unstructured vs structured data
• What’s available?
– Web
– Organizations
– You
• Unstructured usually means “text”
• Structured usually means “databases”
• Semistructured somewhere in between
Unstructured (text) vs. structured (database) data companies in 1996
[Bar chart comparing the data volume and market cap of unstructured (text) vs. structured (database) data companies in 1996; scale 0–160.]
Unstructured (text) vs. structured (database) data companies in 2006
[Bar chart comparing the data volume and market cap of unstructured (text) vs. structured (database) data companies in 2006; scale 0–160.]
IR vs. databases: Structured vs. unstructured data
• Structured data tends to refer to information in “tables”

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g.,
  Salary < 60000 AND Manager = Smith.
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries including operators
– More sophisticated “concept” queries e.g.,
• find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified
zones such as the Title and Bullets
• Facilitates “semi-structured” search such as
– Title contains data AND Bullets contain search
… to say nothing of linguistic structure
More sophisticated semistructured search
• Title is about Object Oriented Programming
AND Author something like stro*rup
• where * is the wild-card operator
• Issues:
– how do you process “about”?
– how do you rank results?
• The focus of XML search.
Clustering and classification
• Given a set of docs, group them into
clusters based on their contents.
• Given a set of topics, plus a new doc D,
decide which topic(s) D belongs to.
• Not discussed in this course
The web and its challenges
• Unusual and diverse documents
• Unusual and diverse users, queries,
information needs
• Beyond terms, exploit ideas from social
networks
– link analysis, clickstreams ...
• How do search engines work? And how
can we make them better?
More sophisticated information retrieval not covered here
• Cross-language information retrieval
• Question answering
• Summarization
• Text mining
• …
Search Subsystem
[Flow diagram: query → parse query → query tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → lookup in the Index database → Boolean operations* → retrieved document set → ranking* → ranked document set → relevant document set. *Indicates optional operation.]
Lexical Analysis: Tokens
What is a token?
Free text indexing
A token is a group of characters, extracted from the
input string, that has some collective significance,
e.g., a complete word.
Usually, tokens are strings of letters, digits or other
specified characters, separated by punctuation,
spaces, etc.
Decisions in Building Inverted Files: What is the Term/Token?
• Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
• Is there a controlled vocabulary? If so, what words are included? Stemming?
• List of stopwords.
• Rules to decide the beginning and end of words, e.g., spaces or punctuation.
• Character sequences not to be indexed, e.g., sequences of numbers.
Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g.:
Regular expressions
  If inverted files are stored in lexicographic order, comp* can be processed efficiently (see the sketch after this slide);
  *comp cannot be processed efficiently.
Boolean terms
  If A and B are search terms:
  A OR B can be processed by comparing two moderate-sized lists;
  (NOT A) OR (NOT B) requires two very large lists.
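As a concrete illustration of the remark about comp*, here is a minimal Java sketch (class and variable names are illustrative, not from any library) of prefix lookup over a term dictionary kept in lexicographic order in a TreeMap: all terms matching comp* form one contiguous key range, whereas *comp would force a scan of the whole dictionary.

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal sketch: prefix queries such as comp* are cheap when the term
// dictionary is kept in lexicographic order (here a TreeMap), because the
// matching terms form one contiguous range of keys.
public class PrefixLookup {
    public static SortedMap<String, List<Integer>> prefixRange(
            TreeMap<String, List<Integer>> dictionary, String prefix) {
        // Keys in [prefix, prefix + '\uffff') start with the prefix.
        return dictionary.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> dict = new TreeMap<>();
        dict.put("company",  List.of(1, 4));
        dict.put("compute",  List.of(2));
        dict.put("computer", List.of(2, 3));
        dict.put("country",  List.of(1, 2));
        // comp* -> {company, compute, computer}; *comp would need a full scan.
        System.out.println(prefixRange(dict, "comp").keySet());
    }
}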
Major Indexing Methods
• Inverted index
– effective for very large collections of documents
– associates lexical items to their occurrences in the
collection
• Positional index
• Non-positional indexes
– Block-sort
• Suffix trees and arrays
– Faster for phrase searches; harder to build and maintain
• Signature files
– Word oriented index structures based on hashing
(usually not used for large texts)
Sparse Vectors
• Terms (tokens) as vectors
• Vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4.
• However, most documents and queries do
not contain most words, so vectors are
sparse (i.e. most entries are 0).
• Need efficient methods for storing and
computing with sparse vectors.
Sparse Vectors as Lists
• Store vectors as linked lists of non-zero-weight tokens paired with a weight.
  – Space proportional to number of unique terms (n) in document.
  – Requires linear search of the list to find (or change) the weight of a specific term.
  – Requires quadratic time in worst case to compute vector for a document:
    \sum_{i=1}^{n} i = \frac{n(n+1)}{2} = O(n^2)
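A minimal Java sketch of this list representation (the class and method names are illustrative): each non-zero dimension is a (token, weight) pair, and every lookup or update is a linear scan, which is what makes construction quadratic in the worst case.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a sparse vector stored as a list of (token, weight) pairs.
// Finding or updating a weight is a linear scan, so building a document
// vector token by token is O(n^2) in the worst case.
public class SparseVectorList {
    private static class Entry {
        final String token;
        double weight;
        Entry(String token, double weight) { this.token = token; this.weight = weight; }
    }

    private final List<Entry> entries = new ArrayList<>();

    public void addOccurrence(String token) {
        for (Entry e : entries) {            // O(n) search for the token
            if (e.token.equals(token)) { e.weight += 1.0; return; }
        }
        entries.add(new Entry(token, 1.0));  // new non-zero dimension
    }

    public double weight(String token) {
        for (Entry e : entries) {
            if (e.token.equals(token)) return e.weight;
        }
        return 0.0;                          // absent terms have weight 0
    }
}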
Sparse Vectors as Trees
• Index tokens in a document in a balanced binary tree or trie, with weights stored with the tokens at the leaves.
[Diagram: a balanced binary tree with internal comparison nodes (< film, < memory, < variable) and leaves holding token weights: bit 2, film 1, memory 1, variable 2.]
Sparse Vectors as Trees (cont.)
• Space overhead for tree structure: ~2n nodes.
• O(log n) time to find or update weight of a
specific term.
• O(n log n) time to construct vector.
• Need software package to support such data
structures.
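A corresponding minimal sketch using a tree: Java's TreeMap (a red-black tree) stands in here for the balanced binary tree or trie described above; the names are illustrative.

import java.util.TreeMap;

// Minimal sketch of a sparse vector kept in a balanced tree (Java's TreeMap
// is a red-black tree): O(log n) lookup/update, O(n log n) to build.
public class SparseVectorTree {
    private final TreeMap<String, Double> weights = new TreeMap<>();

    public void addOccurrence(String token) {
        weights.merge(token, 1.0, Double::sum);   // O(log n) update
    }

    public double weight(String token) {
        return weights.getOrDefault(token, 0.0);  // O(log n) lookup
    }
}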
Sparse Vectors as Hash Tables
• Hashing:
– well-defined procedure or mathematical function
that converts a large, possibly variable-sized
amount of data into a small datum, usually a
single integer that may serve as an index to an
array
• Store tokens in hash table, with token string as key
and weight as value.
– Storage overhead for hash table ~1.5n
– Table must fit in main memory.
– Constant time to find or update weight of a specific
token (ignoring collisions).
– O(n) time to construct vector (ignoring collisions).
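And the hash-table variant, a minimal sketch assuming, as above, that the whole table fits in main memory (names are illustrative):

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a sparse vector stored in a hash table: the token string
// is the key and its weight the value, giving (amortized) constant-time
// lookup and update, and O(n) construction for a document of n tokens.
public class SparseVectorHash {
    private final Map<String, Double> weights = new HashMap<>();

    public void addOccurrence(String token) {
        weights.merge(token, 1.0, Double::sum);   // constant time on average
    }

    public double weight(String token) {
        return weights.getOrDefault(token, 0.0);
    }
}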
Implementation Based on
Inverted Files
• In practice, document vectors are not stored
directly; an inverted organization provides much
better efficiency.
• The keyword-to-document index can be
implemented as a hash table, a sorted array, or a
tree-based data structure (trie, B-tree).
• Critical issue is logarithmic or constant-time
access to token information.
Index File Types - Lucene
• Field File
– Field infos: .fnm
– Stored Fields
• field index: .fdx
• field data: .fdt
• Term Dictionary file
– Term infos: .tis
– Term info index: .tii
• Frequency file: .frq
• Position file: .prx
Organization of Inverted Files
[Diagram: how Lucene organizes an inverted file across three files, all pointing into the Document Files.
 – Term Dictionary: for each term, a prefix length, suffix, field number, and document frequency; terms are prefix-compressed (e.g., "cat", then "cathy" stored as shared prefix 3 + suffix "hy"; "dog", then "doom" stored as shared prefix 2 + suffix "om").
 – Term Frequency (posting) file: for each term (implicit), a list of (DocDelta, Freq?) entries.
 – Term Position file (inverted index): for each term (implicit) and document Doc1 … DocN, the positions of its occurrences.]
Efficiency Criteria
Storage
Inverted files are big, typically 10% to 100% the size of the
collection of documents.
Update performance
It must be possible, with a reasonable amount of computation, to:
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance
Retrieval must be fast enough to satisfy users and not use
excessive resources.
Document File
The documents file stores the documents that are being
indexed. The documents may be:
• primary documents, e.g., electronic journal articles
• surrogates, e.g., catalog records or abstracts
Postings File
Merging inverted lists is the most computationally intensive task
in many information retrieval systems.
Since inverted lists may be long, it is important to match postings
efficiently.
Usually, the inverted lists will be held on disk and paged into
memory for matching. Therefore algorithms for matching
postings process the lists sequentially.
For efficient matching, the inverted lists should all be sorted in the
same sequence.
Inverted lists are commonly cached to minimize disk accesses.
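A minimal Java sketch of this sequential matching (illustrative names; postings are represented simply as lists of document IDs sorted in the same order): the two lists are walked in a single pass, as they would be when paged in from disk, and the intersection implements a Boolean AND.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of matching two postings lists. Both lists are assumed to
// be sorted by document ID, so they can be merged in one sequential pass.
public class PostingsMerge {
    // Returns the document IDs present in both lists (Boolean AND).
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { result.add(da); i++; j++; }
            else if (da < db) i++;   // advance the list with the smaller docID
            else j++;
        }
        return result;
    }
}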
Document File
The storage of the document file may be:
Central (monolithic) - all documents stored together on a single
server (e.g., library catalog)
Distributed database - all documents managed together but
stored on several servers (e.g., Medline, Westlaw, Dialog)
Highly distributed - documents are stored on independently
managed servers (e.g., Web)
Each requires: a document ID, which is a unique identifier that
can be used by the inverted file system to refer to the document,
and a location counter, which can be used to specify location
within a document.
Documents File for Web Search System
For web search systems:
• A document is a web page.
• The documents file is the web.
• The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores contextual information, which will be discussed in a later lecture.)
Term Frequency (Postings File)
The postings file stores the elements of a sparse matrix, the term
assignment matrix.
It is stored as a separate inverted list for each term, i.e., a list
corresponding to each term in the index file.
Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
Each list consists of one or many individual postings.
Length of Postings File
For a common term there may be very large numbers of
postings for a given term.
Example:
  1,000,000,000 documents
  1,000,000 distinct words
  average length 1,000 words per document
  => 10^12 postings
By Zipf's law, the 10th-ranking word occurs approximately
  (10^12 / 10) / 10 = 10^10 times
Inverted Index
Primary data structure for text indexes
• Basically two elements:
– (Vocabulary, Occurrences)
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the terms/tokens in the collection
– For each term/token, list all the docs it occurs in.
• Possibly location in document (Lucene stores the positions)
– Compress to reduce redundancy in the data structure
• Also reduces I/O and storage required
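A minimal Java sketch of these basic steps (the tokenizer is deliberately crude and the class name is illustrative): build a dictionary of terms and, for each term, the sorted list of document IDs it occurs in, using the two sample documents that appear on the following slides.

import java.util.*;

// Minimal sketch of the basic steps above: tokenize each document, and for
// every term record the sorted set of document IDs it occurs in.
public class SimpleInvertedIndex {
    private final Map<String, TreeSet<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        // Crude tokenizer: lower-case, split on non-letters.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> postings(String term) {
        return index.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        SimpleInvertedIndex idx = new SimpleInvertedIndex();
        idx.addDocument(1, "Now is the time for all good men to come to the aid of their country");
        idx.addDocument(2, "It was a dark and stormy night in the country manor. The time was past midnight");
        System.out.println(idx.postings("country"));   // prints [1, 2]
    }
}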
Inverted Indexes
We have seen “vector files”. An inverted file is a vector file “inverted” so that rows become columns and columns become rows.

Document–term matrix (rows = docs, columns = terms):

  docs  t1  t2  t3
  D1     1   0   1
  D2     1   0   0
  D3     0   1   1
  D4     1   0   0
  D5     1   1   1
  D6     1   1   0
  D7     0   1   0
  D8     0   1   0
  D9     0   0   1
  D10    0   1   1

Inverted file (rows = terms, columns = docs):

  Terms  D1  D2  D3  D4  D5  D6  D7  D8  D9  D10
  t1      1   1   0   1   1   1   0   0   0   0
  t2      0   0   1   0   1   1   1   1   0   1
  t3      1   0   1   0   1   0   0   0   1   1
How Are Inverted Files Created
• Documents are parsed one document at a time to extract tokens/terms. These are saved with the Document ID (DID): <term, DID>

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

  Term      Doc #        Term      Doc #
  now       1            it        2
  is        1            was       2
  the       1            a         2
  time      1            dark      2
  for       1            and       2
  all       1            stormy    2
  good      1            night     2
  men       1            in        2
  to        1            the       2
  come      1            country   2
  to        1            manor     2
  the       1            the       2
  aid       1            time      2
  of        1            was       2
  their     1            past      2
  country   1            midnight  2
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically and in document order. The unsorted <term, DID> list from the previous slide becomes:

  Term      Doc #        Term      Doc #
  a         2            of        1
  aid       1            past      2
  all       1            stormy    2
  and       2            the       1
  come      1            the       1
  country   1            the       2
  country   2            the       2
  dark      2            their     1
  for       1            time      1
  good      1            time      2
  in        2            to        1
  is        1            to        1
  it        2            was       2
  manor     2            was       2
  men       1
  midnight  2
  night     2
  now       1
How Inverted Files are Created
• Multiple term/token entries for a single document are merged.
• Within-document term frequency information is compiled.
• Result: <term, DID, tf>, e.g., <the, 1, 2>

The sorted <term, DID> list from the previous slide is merged into:

  Term      Doc #  Freq
  a         2      1
  aid       1      1
  all       1      1
  and       2      1
  come      1      1
  country   1      1
  country   2      1
  dark      2      1
  for       1      1
  good      1      1
  in        2      1
  is        1      1
  it        2      1
  manor     2      1
  men       1      1
  midnight  2      1
  night     2      1
  now       1      1
  of        1      1
  past      2      1
  stormy    2      1
  the       1      2
  the       2      2
  their     1      1
  time      1      1
  time      2      1
  to        1      2
  was       2      2
How Inverted Files are Created
• Then the inverted file can be split into
– A Dictionary file
• File of unique terms
• Fit in memory if possible
and
– A Postings file
• File of what documents the term/token is in and how often.
• Sometimes where the term is in the document.
• Stored on disk
• Worst case O(n), where n is the total number of tokens.
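A minimal sketch of this split (illustrative names; a real system would write the postings sequentially to disk rather than hold them in a list): the dictionary maps each term to its document frequency and to an offset into a flat postings array of (docID, tf) pairs.

import java.util.*;

// Minimal sketch of splitting an inverted file into a small dictionary
// (term -> document frequency and offset) that can be held in memory, and a
// flat postings array of (docID, tf) pairs that would normally live on disk.
public class DictionaryAndPostings {
    public static class DictEntry {
        public final int df;      // number of documents containing the term
        public final int offset;  // where the term's postings start
        DictEntry(int df, int offset) { this.df = df; this.offset = offset; }
    }

    public final Map<String, DictEntry> dictionary = new HashMap<>();
    public final List<int[]> postings = new ArrayList<>();  // each entry: {docId, tf}

    // sortedIndex maps term -> (docId -> tf), already in term order.
    public DictionaryAndPostings(SortedMap<String, SortedMap<Integer, Integer>> sortedIndex) {
        for (Map.Entry<String, SortedMap<Integer, Integer>> e : sortedIndex.entrySet()) {
            dictionary.put(e.getKey(), new DictEntry(e.getValue().size(), postings.size()));
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                postings.add(new int[] { p.getKey(), p.getValue() });
            }
        }
    }
}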
Dictionary and Posting Files
The merged <term, Doc #, freq> list is split into a Dictionary file and a Posting file:

  Dictionary                        Posting
  Term      N docs  Tot Freq        Doc #  Freq
  a         1       1               2      1
  aid       1       1               1      1
  all       1       1               1      1
  and       1       1               2      1
  come      1       1               1      1
  country   2       2               1      1
                                    2      1
  dark      1       1               2      1
  for       1       1               1      1
  good      1       1               1      1
  in        1       1               2      1
  is        1       1               1      1
  it        1       1               2      1
  manor     1       1               2      1
  men       1       1               1      1
  midnight  1       1               2      1
  night     1       1               2      1
  now       1       1               1      1
  of        1       1               1      1
  past      1       1               2      1
  stormy    1       1               2      1
  the       2       4               1      2
                                    2      2
  their     1       1               1      1
  time      2       2               1      1
                                    2      1
  to        1       2               1      2
  was       1       2               2      2
Inverted indexes
• Permit fast search for individual terms
• For each term/token, you get a list consisting of:
– document ID (DID)
– frequency of term in doc (optional, implied in Lucene)
– position of term in doc (optional, Lucene)
– <term,DID,tf,position>
– <term,(DIDi,tf,positionij),…>
– Lucene:
• <positionij,…> (term and DID are implied from other files)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
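A tiny worked example of the Boolean AND above, using the posting lists from the two sample documents (country -> {1, 2}, manor -> {2}); the AND is simply a set intersection.

import java.util.*;

// Tiny worked example of the Boolean AND above, using the posting lists from
// the two sample documents: country -> {1, 2}, manor -> {2}.
public class BooleanAndExample {
    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("country", new TreeSet<>(List.of(1, 2)));
        postings.put("manor",   new TreeSet<>(List.of(2)));

        Set<Integer> result = new TreeSet<>(postings.get("country"));
        result.retainAll(postings.get("manor"));   // Boolean AND = set intersection
        System.out.println("country AND manor -> " + result);   // prints [2]
    }
}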
How Inverted Files are Used
(Dictionary and Posting files as on the previous slide.)

Query: “time” AND “dark”
  Dictionary: “time” occurs in 2 docs -> posting file gives doc IDs 1 and 2
  Dictionary: “dark” occurs in 1 doc  -> posting file gives doc ID 2
Therefore, only doc 2 satisfies the query.
Inverted index
• Associates a posting list with each term
– POSTING LIST example
• a (d1, 1)
• …
• the (d1,2) (d2,2)
• Replace term frequency(tf) with tfidf
– Lucene only uses tf, idf added at query time
• Compress index and put hash links
– Hashing (a type of compression) speeds up
the access
• Match query to index and rank
Position in inverted file posting
– POSTING LIST example
• now (1, 1)
•…
• time (1,4) (2,13)
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Change weight
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.
(The sorted <term, Doc #> list is merged into the <term, Doc #, freq> table shown earlier.)
Time Complexity of Indexing
• Complexity of creating vector and indexing a
document of n terms is O(n).
• So building an index of m such documents is O(m n).
• Computing vector lengths is also O(m n), which is
also the complexity of reading in the corpus.
Retrieval with an Inverted Index
• Terms that are not in both the query and the document do not affect cosine similarity.
– Product of term weights is zero and does not contribute
to the dot product.
• Usually the query is fairly short, and therefore its
vector is extremely sparse.
• Use inverted index to find the limited set of
documents that contain at least one of the query
words.
• Retrieval time O(log M) due to hashing where M is
the size of the document collection.
Inverted Query Retrieval
Efficiency
• Assume that, on average, a query word appears in B documents:
  Q = q1, q2, …, qN
  q1 -> D11 … D1B
  q2 -> D21 … D2B
  …
  qN -> DN1 … DNB
• Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval that examines all M documents, O(|V| M), because |Q| << |V| and B << M. (|V| = size of the vocabulary)
Processing the Query
• Incrementally compute cosine similarity of
each indexed document as query words are
processed one by one.
• To accumulate a total score for each retrieved
document, store retrieved documents in a
hashtable, where DocumentReference is the
key and the partial accumulated score is the
value.
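A minimal Java sketch of this accumulator scheme (illustrative names; term weights are assumed to be precomputed, e.g., tf-idf): for each query term, its postings are walked and the partial scores are summed into a hash table keyed by document ID.

import java.util.*;

// Minimal sketch of incremental cosine scoring with an accumulator: for each
// query term, walk its postings and add the partial score into a hash table
// keyed by document ID.
public class ScoreAccumulator {
    // postings: term -> (docId -> term weight in that document)
    public static Map<Integer, Double> score(Map<String, Map<Integer, Double>> postings,
                                             Map<String, Double> queryWeights) {
        Map<Integer, Double> accumulator = new HashMap<>();
        for (Map.Entry<String, Double> q : queryWeights.entrySet()) {
            Map<Integer, Double> list = postings.get(q.getKey());
            if (list == null) continue;                  // term not in the collection
            for (Map.Entry<Integer, Double> p : list.entrySet()) {
                accumulator.merge(p.getKey(), q.getValue() * p.getValue(), Double::sum);
            }
        }
        // The accumulated dot products would still be divided by the document
        // lengths to obtain the final cosine scores.
        return accumulator;
    }
}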
Index Files
On disk
If an index is held on disk, search time is dominated by the
number of disk accesses.
In memory
Suppose that an index has 1,000,000 distinct terms.
Each index entry consists of the term, some basic statistics and
a pointer to the inverted list, average 100 characters.
Size of index is 100 megabytes, which can easily be held in
memory of a dedicated computer.
Index File Structures: Linear
Index
Advantages
Can be searched quickly, e.g., by binary search, O(log M)
Good for sequential processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages
Index must be rebuilt if an extra term is added
Documents File for Web Search System
For Web search systems:
• A document is a Web page.
• The documents file is the Web.
• The document ID is the URL of the document.
Indexes are built using a Web crawler, which retrieves each page on the Web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores special information.)
Time Complexity of Indexing
• Complexity of creating vector and indexing a document of n tokens is O(n).
• So indexing m such documents is O(m n).
• Computing token IDFs for a vocabulary V is O(|V|).
• Computing vector lengths is also O(m n).
• Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.
Index on disk vs. memory
• Most retrieval systems keep the dictionary
in memory and the postings on disk
• Web search engines frequently keep both in
memory
– massive memory requirement
– feasible for large web service installations
– less so for commercial usage where query loads
are lighter
Indexing in the real world
• Typically, don’t have all documents sitting on a
local filesystem
– Documents need to be crawled
– Could be dispersed over a WAN with varying
connectivity
– Must schedule distributed spiders/indexers
– Could be (secure content) in
• Databases
• Content management applications
• Email applications
Content residing in applications
• Mail systems/groupware, content management
contain the most “valuable” documents
• http often not the most efficient way of fetching
these documents - native API fetching
– Specialized, repository-specific connectors
– These connectors also facilitate document viewing
when a search result is selected for viewing
Secure documents
• Each document is accessible to a subset of users
– Usually implemented through some form of Access
Control Lists (ACLs)
• Search users are authenticated
• Query should retrieve a document only if user can
access it
– So if there are docs matching your search but you’re
not privy to them, “Sorry no results found”
– E.g., as a lowly employee in the company, I get “No
results” for the query “salary roster”
• Solr has this with post filtering
Users in groups, docs from groups
• Index the ACLs and filter results by them
[Diagram: a Users × Documents matrix of 0/1 entries – 0 if the user can’t read the doc, 1 otherwise.]
• Often, user membership in an ACL group is verified at query time – a slowdown
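A minimal sketch of this kind of ACL filtering (illustrative names only, not Solr's actual post-filtering API): each document carries the set of groups allowed to read it, and a result is dropped when the querying user belongs to none of them.

import java.util.*;

// Minimal sketch of ACL post-filtering: the index stores, for each document,
// the set of groups allowed to read it; results are dropped if the querying
// user is in none of them.
public class AclFilter {
    public static List<Integer> filter(List<Integer> rankedDocs,
                                       Map<Integer, Set<String>> docAcls,
                                       Set<String> userGroups) {
        List<Integer> visible = new ArrayList<>();
        for (int doc : rankedDocs) {
            Set<String> acl = docAcls.getOrDefault(doc, Set.of());
            if (!Collections.disjoint(acl, userGroups)) {  // user is in an allowed group
                visible.add(doc);
            }
        }
        return visible;   // if empty: "Sorry, no results found"
    }
}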
Compound documents
• What if a doc consisted of components
– Each component has its own ACL.
• Your search should return a doc only if your query matches one of its components that you have access to.
• More generally: doc assembled from computations
on components
– e.g., in Lotus databases or in content management
systems
• How do you index such docs?
No good answers …
“Rich” documents
• (How) Do we index images?
• Researchers have devised Query based on Image
Content (QBIC) systems
– “show me a picture similar to this orange circle”
– Then use vector space retrieval
• In practice, image search is usually based on metadata such as the file name, e.g., monalisa.jpg
• New approaches exploit social tagging
– E.g., flickr.com
Passage/sentence retrieval
• Suppose we want to retrieve not an entire
document matching a query, but only a
passage/sentence - say, in a very long
document
• Can index passages/sentences as mini-documents – what should the index units be?
• This is the subject of XML search
Indexing Subsystem
[Flow diagram: documents (text) → assign document IDs → break into tokens → tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → term weighting* → terms with weights → Index database, together with document numbers and *field numbers. *Indicates optional operation.]
Search Subsystem
[Flow diagram: query → parse query → query tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → lookup in the Index database → Boolean operations* → retrieved document set → ranking* → ranked document set → relevant document set. *Indicates optional operation.]
Lucene’s index (conceptual)
[Diagram: an Index contains Documents; each Document contains Fields; each Field is a Name/Value pair.]
Create a Lucene index (step 1)
• Create a Lucene document and add fields

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public void createDoc(String title, String body)
{
    Document doc = new Document();
    doc.add(Field.Text("Title", title));
    doc.add(Field.Text("Body", body));
}
Create a Lucene index (step 2)
• Create an Analyser
– Options
• WhitespaceAnalyzer
– divides text at whitespace
• SimpleAnalyzer
– divides text at non-letters
– convert to lower case
• StopAnalyzer
– SimpleAnalyzer
– removes stop words
• StandardAnalyzer
– good for most European Languages
– removes stop words
– convert to lower case
Create a Lucene index (step 3)
• Create an index writer, add the Lucene document into the index

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void writeDoc(Document doc, String idxPath)
{
    try {
        IndexWriter writer =
            new IndexWriter(idxPath, new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
    }
    catch (IOException exp) { System.out.println("I/O Error!"); }
}
Lucene Index – Behind the Scenes
• Inverted Index (Inverted File)

Doc 1: Penn State Football …
Doc 2: Football players … State

Posting table:
  posting id   word       occurrences (doc, offset)
  1            football   (Doc 1, 3) (Doc 1, 67) (Doc 2, 1)
  2            penn       (Doc 1, 1)
  3            players    (Doc 2, 2)
  4            state      (Doc 1, 1) (Doc 2, 13)
Posting table
• Posting table is a fast look-up mechanism
– Key: word
– Value: posting id, satellite data (#df, offset, …)
• Lucene implements the posting table with Java’s hash table
– Objectified from java.util.Hashtable
– Hash function depends on the JVM
• hc2 = hc1 * 31 + nextChar
• Posting table usage
– Indexing: insertion (new terms), update (existing terms)
– Searching: lookup, and construct document vector
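A minimal sketch of such a posting table in plain Java (illustrative names, not Lucene's internal classes): the word is the key of a java.util.Hashtable and the value holds its postings, so indexing inserts or updates entries and searching is a single lookup.

import java.util.*;

// Minimal sketch of a posting table as a Java hash table: the key is the
// word, the value holds its postings (docId -> list of offsets).
public class PostingTable {
    public static class Postings {
        public final Map<String, List<Integer>> offsetsByDoc = new LinkedHashMap<>();
    }

    private final Hashtable<String, Postings> table = new Hashtable<>();

    // Indexing: insert a new term or update an existing one.
    public void add(String word, String docId, int offset) {
        table.computeIfAbsent(word, w -> new Postings())
             .offsetsByDoc.computeIfAbsent(docId, d -> new ArrayList<>())
             .add(offset);
    }

    // Searching: constant-time (on average) lookup by word.
    public Postings lookup(String word) {
        return table.get(word);
    }
}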
Search Lucene’s index (step 1)
• Construct a query (automatic)

import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void formQuery(String ques) throws Exception   // parse() may throw ParseException
{
    Query query = QueryParser.parse(
        ques, "body", new StandardAnalyzer()
    );
}
Search Lucene’s index (step 1)
• Types of query:
– Boolean: [IST441 Giles] [IST441 OR Giles]
[java AND NOT SUN]
– wildcard: [nu?ch] [nutc*]
– phrase: [“JAVA TOMCAT”]
– proximity: [“lucene nutch” ~10]
– fuzzy: [roam~] matches roams and foam
– date range
– …
Search Lucene’s index (step 2)
• Search the index
import org.apache.lucene.document.Document;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
public void searchIdx(String idxPath)
{
Directory fsDir=FSDirectory.getDirectory(idxPath,false);
IndexSearcher is=new IndexSearcher(fsDir);
Hits hits = is.search(query);
}
Search Lucene’s index (step 3)
• Display the results
for (int i=0;i<hits.length();i++)
{
Document doc=hits.doc(i);
//show your results
}
Example: WestLaw
http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search
service (started 1975; ranking added 1992)
• About 7 terabytes of data; 700,000 users
• Majority of users still use boolean queries
• Example query:
– What is the statute of limitations in cases involving the federal
tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• Long, precise queries; proximity operators; incrementally
developed; not like web search
Lucene workflow
[Diagram: a Document with fields (super_name: Spider-Man, name: Peter Parker, category: superhero, powers: agility, spider-sense) is added to the Lucene Index through IndexWriter.addDocument(); a Query (powers:agility) is run through IndexSearcher.search(), which returns Hits (the matching docs).]
1. Get the Lucene jar file
2. Write indexing code to get data and create Document objects
3. Write code to create query objects
4. Write code to use/display results
What we covered
• Indexing modes
• Inverted files
– Storage and access advantages
– Posting files
• Examples from Lucene