Transcript eserc-mg4j

MG4J – Managing GigaBytes for Java
Ida Mele
Overview
• Document
–
–
–
–
Document
DocumentCollection
FileSetDocumentCollection
DocumentFactory
• Index
• Query
– HttpQueryServer
– QueryEngine
Ida Mele
MG4J
1
Document
• Indexing in MG4J is centered around documents.
• Interface in: it.unimi.dsi.mg4j.document
• The object, which is the instance of the class Document,
represents a single document that can be indexed.
• Different documents have different number and type of
fields. Example:
• E-mail: from, to, data, subject, body
• HTML page: title, url, body
Ida Mele
MG4J
2
Document
Ida Mele
MG4J
3
Document Collection
• Interface in: it.unimi.dsi.mg4j.document
• DocumentCollection is a randomly addressable lists
of documents.
Ida Mele
MG4J
4
FileSetDocumentCollection
• Class in: it.unimi.dsi.mg4j.document.
• The main method of FileSetDocumentCollection
allows to build and serialize a set of documents
specified by their filenames.
Ida Mele
MG4J
5
Document Factory
• Interface in: it.unimi.dsi.mg4j.document.
• The factory turns a pure stream of bytes (file) into a
document made by several fields (title and text).
Ida Mele
MG4J
6
Standard MG4J Document Factories
•
•
•
•
•
•
•
•
•
CompositeDocumentFactory
HtmlDocumentFactory
IdentityDocumentFactory
MailDocumentFactory
PdfDocumentFactory
ReplicatedDocumentFactory
PropertyBasedDocumentFactory
TRECHeaderDocumentFactory
ZipDocumentCollection.ZipFactory
Ida Mele
MG4J
7
Query
• To query the index we can use the main method of
the class Query. We can submit queries by using:
• command line,
• web browser.
Ida Mele
MG4J
8
Query Engine
• The query engine receives the query and returns the
ranked list of results.
Ida Mele
MG4J
9
HttpQueryServer
• Simple web server for query processing.
Ida Mele
MG4J
10
Indexing and Querying the collection of actors
• For our exercise we will use mg4j-4.0.
• TECHNICAL REQUIREMENTS:
• UNIX Operating System
• Java (>=6)
• Download the collections and the libraries available at:
http://www.dis.uniroma1.it/~mele/teaching_20122013.html
Ida Mele
MG4J
11
Set Classapth
• Download and extract: actors.zip.
• Download and extract: lib_mg4j.zip.
• Download the file: set-classpath.sh.
• Edit the first line of the file set-classpath.sh: replace
your_directory with the path of the folder containing all the
.jar files (lib_mg4j folder).
• Set the CLASSPATH: source set-classpath.sh
Ida Mele
MG4J
12
Building the collection of documents (1)
• Help:
java it.unimi.dsi.mg4j.document.FileSetDocumentCollection –help
• Create the collection:
find actors -name *.html | java
it.unimi.dsi.mg4j.document.FileSetDocumentCollection -f
HtmlDocumentFactory -p encoding=UTF-8 actors.collection
find returns the list of files, one per line. This list is
provided as input to the main method of the
FileSetDocumentCollection.
Ida Mele
MG4J
13
Building the collection of documents (2)
• We need also to specify a factory (the -f option) and the
encoding as a property.
• The name of the collection is actors.collection.
• The collection does not contain the files, but only their
names.
• Deleting or modifying files of actors directory may
cause inconsistence in the collection.
Ida Mele
MG4J
14
Building the index
• Help: java it.unimi.dsi.mg4j.tool.IndexBuilder --help
• Create the index:
java it.unimi.dsi.mg4j.tool.IndexBuilder --downcase -S
actors.collection mycollection
•
•
•
--downcase: this option forces all the terms to be downcased.
-S : specifies that we are producing an index for the specified
collection. If the option is omitted, Index expects to index a
document sequence read from standard input.
• mycollection: basename of the index.
Use –Xmx for allocating more memory to Java:
java –Xmx512M it.unimi.dsi.mg4j.tool.IndexBuilder --downcase -S
actors.collection mycollection
Ida Mele
MG4J
15
Index files (1)
• mycollection-{text,title}.terms: contain the terms of the
dictionary. One term per line.
more mycollection-text.terms
• mycollection-{text,title}.stats: contain statistics.
more mycollection-text.stats
• mycollection-{text,title}.properties: contain global
information.
more mycollection-text.properties
Ida Mele
MG4J
16
Index files (2)
• mycollection-{text,title}.frequencies: for each term, there
is the number of documents with the term (-code).
• mycollection-{text,title}.globcounts: for each term, there
is the number of occurrence of the term (-code).
• mycollection-{text,title}.offset: for each term, there is the
offset (-code).
Ida Mele
MG4J
17
Index files (3)
• mycollection-{title,text}.sizes: contain the list of the
document sizes. The document size is the number of
words contained in each document (- code).
• mycollection-{text,title}.batch<i>: temporary files with
sub-indices (-code). Use the option --keep-batches to
not delete temporary files.
• mycollection-{text,title}.index: contain the index (-code).
Ida Mele
MG4J
18
Web Server
• Help:
java it.unimi.dsi.mg4j.query.Query --help
• Querying the index:
java it.unimi.dsi.mg4j.query.Query -h -i
it.unimi.dsi.mg4j.query.FileSystemItem -c
actors.collection mycollection-text mycollectiontitle
• Command line: {text, title} > Gerini
• Web browser: http://localhost:4242/Query
Ida Mele
MG4J
19
Query (1)
• Search one word. The result is the set of documents
that contain the specified word.
• Ex: attore
• AND: more than one term separated by whitespace or
AND or &. The result is the set of documents that
contain all the specified words.
• Ex: claudia & pandolfi
• Ex: claudia pandolfi
Ida Mele
MG4J
20
Query (2)
• OR: more than one term separated by OR or |. The
result is the set of documents that contain any of the
given words.
• Ex: pandolfi | gerini
• NOT: the operator NOT or ! is used for negation.
• Ex: claudia & !pandolfi
• Parentheses: the parentheses are used to enforce
priority in complex queries.
• Ex: claudia & (pandolfi | gerini)
Ida Mele
MG4J
21
Query (3)
• Proximity restriction: the words must appear within a
limited portion of the document.
• Ex: (claudia spettacolo)~6
• Wildcard (*): wildcard queries can be submitted
appending * at the end of a term.
• Ex: att*
Ida Mele
MG4J
22
Query (4)
• Phrase: using “ ” we can look for documents that
contain the exact phrase.
• Ex: “mitico attore pugliese”
• Index specifiers: prefixing a query with the name of an
index followed by : you can restrict the search to that
index.
• Ex: title:gerini
Ida Mele
MG4J
23
Sophisticated queries (1)
• MG4J provides sophisticated query tuning.
• To use this features, we must use the command line
interface.
• > $ --- to get some help on the available options.
• > $mode --- to choose the kind of results.
Ex. > $mode short
• > $score --- to choose the scorer.
Ex. > $score VignaScorer
Ida Mele
MG4J
24
Sophisticated queries (2)
• >$mplex on --- when multiplexing is on, each query is
multiplexed to all indices. When a scorer is used, it is a
good idea to use multiplexing.
• >$weigth --- to change the weight of the indices. This is
useful when multiplexing is on.
Ex. >$weigth text:1 title:3
Ida Mele
MG4J
25
Scorer (1)
• Scorer are important for ranking the documents result of a query.
Default: BM25Scorer and VignaScorer
• ConstantScorer. Each document has a constant score (default is 0).
>$score ConstantScorer
• CountScorer. It is the product between the number of occurrences
of the term in the document and the weight assigned to the index.
>$score CountScorer
Ida Mele
MG4J
26
Scorer (2)
• TfIdfScorer. Implements TF/IDF.
TF is the term frequency of the term t for the document d:
c/l; where c is the number of occurrences of t in d and l is
the length of d.
IDF is the inverse document frequency of the term t in the
collection: log(N/f); where N is the number of documents
in the collection and f is the number of documents where t
appears.
>$score TfIdfScorer
Ida Mele
MG4J
27
Scorer (3)
• DocumentRankScorer. Scores of documents are
stored in a text file.
>$score DocumentRankScorer nameFile
Ida Mele
MG4J
28
Virtual fields (1)
• A virtual field produces pieces of text that are to be referred
to other documents.
• Referrer: the document that has a link to another document.
• Referee: the document pointed by the Referrer.
• Intuitively the Referrer gives us information about the
Referee. Hence, the Referrer produces in a virtual field a
number of fragments of text, each referring to a certain
Referee.
• The content of a virtual field is a list of pairs made by the
piece of text (called virtual fragment) and by some string
that is aimed at representing the Referee (called the
document spec).
Ida Mele
MG4J
29
Virtual fields (2)
• In the case of the HTMLDocumentFactory, the anchor
field is the list of all anchors contained in the document:
• the document spec is a URL (as specified in the
href attribute);
• the virtual fragment is the content of the anchor
element.
Ida Mele
MG4J
30
Virtual fields (3)
• Create the list of URL of the documents in the collection.
java it.unimi.dsi.mg4j.tool.ScanMetadata -S actors.collection u actors.urls
• Create the document resolver. It is able to map the document
spec produced by some document factory into actual references to
documents in the collection.
Given a document spec, the resolver will decide whether the spec
really refers to a document in the collection or not, and in the first
case it will find out to which document the spec refers to.
java it.unimi.dsi.mg4j.tool.URLMPHVirtualDocumentResolver o actors.urls actors-anchor.resolver
Ida Mele
MG4J
31
Virtual fields (4)
• Building the index:
java it.unimi.dsi.mg4j.tool.IndexBuilder -a -v anchor:actorsanchor.resolver --downcase -S actors.collection mycollection
• Querying the index:
java it.unimi.dsi.mg4j.query.Query -h -n -i FileSystemItem -c
actors.collection mycollection-text mycollection-title
mycollection-anchor
{text, title, anchor} > $score CountScorer
{text, title, anchor} > Gerini
Ida Mele
MG4J
32
Virtual gap (1)
• All the virtual fragments that refer to a given document of
the collection are like a single text, called the virtual text.
• Virtual fragments coming from different anchors are
concatenated, and this may produce false positive
results.
• Ex: anchor:(claudia AND attrice) produces as result a
list of documents that contain both the words, but not
necessarily in the same anchor.
Ida Mele
MG4J
33
Virtual gap (2)
• To avoid such kinds of false positives, we can use virtual
gaps.
• The virtual gap is a positive integer, representing the
virtual space left between different virtual fragments.
• For example, if the virtual gap is 64 (the default), anchors
are concatenated by leaving 64 “empty words” between
subsequent fragments.
>anchor:(claudia AND attrice)~64
Ida Mele
MG4J
34
Virtual gap (3)
• In the indexing phase, it is possible to specify a different
virtual gap.
• For example:
java it.unimi.dsi.mg4j.tool.IndexBuilder -a -g
anchor:100 -v anchor:actors-anchor.resolver -downcase -S actors.collection mycollection
Ida Mele
MG4J
35
Term map (1)
• A simple representation of a dictionary is the term list: a
text file containing the whole dictionary, one term per line,
in index order (the file .terms).
• A more efficient representation is based on a monotone
minimal perfect hash function: it is a very compact data
structure that is able to answer correctly to the question
"What is the index of the term XXX?”.
• You can build such a function from a sorted term list using:
java it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction
titles.mph mycollection-title.terms
Ida Mele
MG4J
36
Term map (2)
• Monotone minimal perfect functions have a serious limit:
they can answer correctly to the question "What is the
index of the term XXX?", but only if the term appears in
the dictionary.
• To solve this problem, we can use a signed function.
• For terms not in the dictionary, the function will answer
with a special value (-1) that means "the word is not in the
dictionary”:
java it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction
titles.mph mycollection-title.terms
Ida Mele
MG4J
37
Term map (3)
• Wildcard searches require the use of a prefix map.
• A prefix map is able to answer correctly to questions like
"What are the indices of terms starting with the characters
YYY?".
• If terms are lexicographically sorted, the answer is a pair
of integers: the first and the last index of terms satisfying
the property.
• We can build a prefix map by using:
java it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki -o
mycollection-title.terms mycollection-title.dict
Ida Mele
MG4J
38
Exercise
• Indexing and querying the other collections
(htmlDIS and cities), available at:
http://www.dis.uniroma1.it/~mele/teaching_20122013.html
Ida Mele
MG4J
39