Lemur Indri Search Engine

Yatish Hegde
03/03/2010
Background
• Open source text search engine
• Combines language modeling and inference networks
• Inquery query language
• API: accessible from C++, Java, C# and PHP
• Html, xml, txt, trectext, trecweb, ppt, doc*, ppt*
Resources
• Website: http://lemurproject.org
• Tutorials: http://sourceforge.net/apps/trac/lemur/wiki
• Forum: http://sourceforge.net/projects/lemur/forums
How to get started?
• Cygwin: http://cygwin.com (select the “perl”, “vi editor” and “make” packages during installation)
• Lemur Toolkit: http://sourceforge.net/projects/lemur/develop
• TREC Eval: http://trec.nist.gov/trec_eval/
Installing Lemur
Inside the Lemur directory:
• ./configure
• make
• make install
• Build index: IndriBuildIndex
• Run query: IndriRunQuery
Building Index
• IndriBuildIndex <parameterFile>
• Parameter file:
<parameters>
  <index>/home/lemur/testindex</index>
  <memory>1G</memory>
  <corpus>
    <path>/home/lemur/testdata/firstCorpus</path>
    <class>trectext</class>
  </corpus>
  <corpus>
    <path>/home/lemur/testdata/secondCorpus</path>
    <class>trecweb</class>
  </corpus>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
  <field>
    <name>p</name>
  </field>
</parameters>
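When the list of corpora changes often, the parameter file can be generated rather than edited by hand. The following is a minimal Python sketch, assuming only the elements shown on this slide (index, memory, corpus, stemmer, field). The function name build_index_params and its arguments are hypothetical and are not part of the Lemur toolkit.

```python
import xml.etree.ElementTree as ET

def build_index_params(index_path, corpora, memory="1G",
                       stemmer="krovetz", fields=("p",)):
    """Return an IndriBuildIndex parameter file as an XML string.

    corpora is a list of (path, document_class) pairs.
    This helper is hypothetical; it only mirrors the slide's example.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    ET.SubElement(root, "memory").text = memory
    for path, doc_class in corpora:
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = doc_class
    stem = ET.SubElement(root, "stemmer")
    ET.SubElement(stem, "name").text = stemmer
    for name in fields:
        field = ET.SubElement(root, "field")
        ET.SubElement(field, "name").text = name
    return ET.tostring(root, encoding="unicode")

params_xml = build_index_params(
    "/home/lemur/testindex",
    [("/home/lemur/testdata/firstCorpus", "trectext"),
     ("/home/lemur/testdata/secondCorpus", "trecweb")],
)
print(params_xml)
```

The printed string can be saved to a file and passed to IndriBuildIndex as <parameterFile>.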
Running Query
• IndriRunQuery <queryFile> <stopwordFile> <queryOptions>
• Query File
<parameters>
  <query>
    <number>701</number>
    <text>oil industry history</text>
  </query>
</parameters>
• Stop Word File
<parameters>
  <stopper>
    <word>the</word>
  </stopper>
</parameters>
• Query Options File
<parameters>
  <trecFormat>true</trecFormat>
  <index>/path/to/index</index>
  <count>1000</count>
</parameters>
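With trecFormat set to true, each result line follows the standard TREC run layout: query number, the literal Q0, document ID, rank, score and run tag. A minimal Python sketch for reading such output is given below. The function name parse_trec_run and the sample lines (document IDs, scores, run tag) are made up for illustration and are not real IndriRunQuery output.

```python
from collections import defaultdict

def parse_trec_run(lines):
    """Group TREC-format result lines by query number.

    Each line is expected to hold six whitespace-separated columns:
    query number, the literal Q0, document ID, rank, score, run tag.
    """
    results = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _q0, doc_id, rank, score, _tag = parts
        results[query_id].append((int(rank), doc_id, float(score)))
    for ranked in results.values():
        ranked.sort()  # order each query's documents by rank
    return dict(results)

# Illustrative lines only; IDs, scores and run tag are invented.
sample_output = [
    "701 Q0 DOC-A 1 -4.83 indri",
    "701 Q0 DOC-B 2 -4.95 indri",
    "702 Q0 DOC-C 1 -5.10 indri",
]
run = parse_trec_run(sample_output)
print(run["701"][0])
```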
Converting Topic File into Query File
• Topic File
<top>
<num> Number: 301
<title> International Organized Crime
<desc> Description:
Identify organizations that participate in international criminal
activity, the activity, and, if possible, collaborating organizations
and the countries involved.
<narr> Narrative:
A relevant document must as a minimum identify the organization and the
type of illegal activity (e.g., Columbian cartel exporting cocaine).
Vague references to international drug trade without identification of
the organization(s) involved would not be relevant.
</top>
Converting Topic File into Query File
Perl Program:
• ./topicToQuery.pl [-t] [-d] <inputFile> <outputFile>
• ./topicToQuery.pl -h
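The source of topicToQuery.pl is not shown here, so the following Python sketch is not that script. It only illustrates one way such a conversion could work, assuming topics in the <top> layout shown earlier. The names topics_to_queries and to_query_file and the arguments use_title and use_desc are hypothetical. Punctuation is stripped on the assumption that plain keyword queries are wanted.

```python
import re

def _field(block, pattern):
    """Return the first captured group for pattern, or an empty string."""
    match = re.search(pattern, block, flags=re.S)
    return match.group(1) if match else ""

def _clean(text):
    """Keep letters and digits only, collapsing everything else to spaces."""
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()

def topics_to_queries(topic_text, use_title=True, use_desc=False):
    """Extract (number, query text) pairs from TREC <top> blocks."""
    queries = []
    for block in re.findall(r"<top>(.*?)</top>", topic_text, flags=re.S):
        number = _field(block, r"<num>\s*Number:\s*(\d+)")
        pieces = []
        if use_title:
            pieces.append(_field(block, r"<title>(.*?)(?=<desc>|$)"))
        if use_desc:
            pieces.append(
                _field(block, r"<desc>\s*Description:(.*?)(?=<narr>|$)"))
        queries.append((number, _clean(" ".join(pieces))))
    return queries

def to_query_file(queries):
    """Render the pairs in the <parameters>/<query> layout shown earlier."""
    lines = ["<parameters>"]
    for number, text in queries:
        lines += ["<query>", f"<number>{number}</number>",
                  f"<text>{text}</text>", "</query>"]
    lines.append("</parameters>")
    return "\n".join(lines)

sample_topic = """<top>
<num> Number: 301
<title> International Organized Crime
<desc> Description:
Identify organizations that participate in international criminal
activity, the activity, and, if possible, collaborating organizations
and the countries involved.
</top>"""

print(to_query_file(topics_to_queries(sample_topic)))
```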
TREC Eval
• make
• trec_eval -q -c -M1000 official_qrels query_results
• More Documentation: http://trecvid.nist.gov/trecvid.tools/trec_eval_video/README
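Mean average precision (MAP) is one of the measures trec_eval reports. To make the number less of a black box, here is a small Python sketch of the textbook definition. It is not trec_eval's own code, and the qrels and run data are invented for illustration. Averaging over every query in the qrels is meant to mirror the -c option.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant IDs."""
    if not relevant_doc_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_doc_ids)

def mean_average_precision(run, qrels):
    """Mean of per-query average precision over every query in qrels.

    Queries with no retrieved documents contribute a score of 0.
    """
    if not qrels:
        return 0.0
    total = sum(average_precision(run.get(query_id, []), relevant)
                for query_id, relevant in qrels.items())
    return total / len(qrels)

# Invented judgments and rankings, for illustration only.
qrels = {"301": {"DOC-A", "DOC-C"}, "302": {"DOC-X"}}
run = {"301": ["DOC-A", "DOC-B", "DOC-C"], "302": ["DOC-Y", "DOC-X"]}
print(mean_average_precision(run, qrels))
```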
Lemur Search UI
• User Interface: http://sourceforge.net/apps/trac/lemur/wiki/The%20Lemur%20CGI%20Application
• How does it look? http://sewell.syr.edu/lemur/lemur.cgi
Indri Query Language
• #combine(white house)
• #1(white house)
• #5(white house)
• #band(white house)
• #band(oil fields) #1(white house)
<parameters>
  <query>
    <number>301</number>
    <text>
      #combine(Identify organizations that participate in
        #max(#1(international criminal activity) international criminal activity)
        the activity and if possible collaborating organizations and the countries involved)
    </text>
  </query>
</parameters>
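Nested operators are easy to mistype by hand, so structured queries can be assembled with small string helpers. The Python sketch below covers three of the operators listed above: #combine merges the beliefs of its arguments, #N is an ordered window (#1 being an exact phrase), and #band is a Boolean AND. All function names are hypothetical. This is plain string building, not a Lemur API.

```python
def combine(*parts):
    """#combine(...): merge the beliefs of all parts into one score."""
    return "#combine(" + " ".join(parts) + ")"

def ordered_window(width, *terms):
    """#N(...): terms must occur in this order; #1 is an exact phrase."""
    return f"#{width}(" + " ".join(terms) + ")"

def boolean_and(*terms):
    """#band(...): every term must be present in the document."""
    return "#band(" + " ".join(terms) + ")"

def query_entry(number, text):
    """Wrap a query string in the <query> element used in the query file."""
    return f"<query><number>{number}</number><text>{text}</text></query>"

# Score the exact phrase together with its individual words.
phrase = ordered_window(1, "white", "house")
query = combine(phrase, "white", "house")
print(query_entry(701, query))
```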
Contact
If you have questions -
Yatish Hegde: [email protected]
• Thank You