
Extensible Information Retrieval
with Apache Nutch
Aaron Elkiss
16-Feb-2006
Why use Nutch?
• Front-end to large collections of
documents
• Demonstrate research without writing
lots of extra code
Outline
• Nutch - information retrieval
– Pros & Cons
– Crawling the Local Filesystem
– How Nutch Works
– Indexing a Database
– Query Filters: Searching with Nutch
Nutch
• Open source search engine
• Written in Java
• Built on top of Apache Lucene
Advantages of Nutch
• Scalable
– Index local host or entire Internet
• Portable
– Runs anywhere with Java
• Flexible
– Plugin system + API
• Code pretty easy to read & work with
• Better than implementing it yourself!
Disadvantages of Nutch
• Documentation still somewhat lacking
• Not yet fully mature
• No GUI
• Odd Tomcat setup
• Several “gotchas”
Crawling the Local Filesystem
• Step 1: Create list of files to index
file_list:
/data0/projects/clairlib/CLAIR/aleClairlib.pl
/data0/projects/clairlib/CLAIR/buildALE.pl
/data0/projects/clairlib/CLAIR/get_cosine_example.pl
/data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
/data0/projects/clairlib/CLAIR/makeCorpus.pl
/data0/projects/clairlib/CLAIR/normalize_cosines.pl
/data0/projects/clairlib/CLAIR/queryALE.pl
/data0/projects/clairlib/CLAIR/testCluster.pl
/data0/projects/clairlib/CLAIR/testCorpusDownload.pl
/data0/projects/clairlib/CLAIR/testDocument.pl
/data0/projects/clairlib/CLAIR/testDocumentPair.pl
/data0/projects/clairlib/CLAIR/testIP.pl
/data0/projects/clairlib/CLAIR/testUtil.pl
/data0/projects/clairlib/CLAIR/testWebSearch.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
Crawling the Local Filesystem
• Step 2: Edit Configuration
– crawl-urlfilter.txt
• Very restrictive by default
• Must allow file: URLs
crawl-urlfilter.txt default
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.
crawl-urlfilter.txt
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# allow everything else
+.
Crawling the Local Filesystem
• Step 3: Edit Configuration
– nutch-site.xml (overrides nutch-default.xml)
• Enable protocol-file plugin and parse plugins
<nutch-conf>
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
</nutch-conf>
Crawling the Local Filesystem
• Step 4: Run the crawl
– bin/nutch crawl myurls
• Step 5: Start Tomcat
– GOTCHA: must start in the crawl directory!
– Or edit WEB-INF/classes/nutch-site.xml
<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
</property>
</nutch-conf>
Modifying the Results Page
• Just customize search.jsp!
• For example, display external ‘citations’
link instead of ‘anchors’
(<a href="../explain.jsp?<%=id%>&query=<%=URLEncoder.encode(queryString)%>">
<i18n:message key="explain"/></a>)
(<a href="http://oriole.eecs.umich.edu/cgi-bin/citations.pl?<%=url%>">citations</a>)
<%-- (<a href="../anchors.jsp?<%=id%>"><i18n:message key="anchors"/></a>) --%>
How Nutch Works
• Protocol plugin:
URL → Protocol.getProtocolOutput → Content
• Content:
– byte[] content
– String contentType
– URL url
– Properties metadata
How Nutch Works
• Parsing plugins:
URL → Protocol.getProtocolOutput → Content → Parser.getParse → Parse
• Content:
– byte[] content
– String contentType
– URL url
– Properties metadata
• Parse:
– String text
– ParseData data
• ParseData:
– Properties metadata
– Outlink[] outlinks
– String title
– ParseStatus status
Indexing a Database
• Need to write a new plugin
• Luckily, the interface is pretty simple
• Much less tightly coupled than full-text search inside the database
Indexing a Database
• Approach
– Get the text out
– Generate a 1:1 mapping from URLs to
documents in the database
Indexing a Database
• Protocol plugin
– Replaces default ‘http’ plugin
– Converts http request to database request
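As a sketch, the conversion can be illustrated with simplified stand-ins for Nutch's Protocol and Content types (the real Nutch interfaces differ in detail, and the db:// URL scheme, document-id layout, and fetchFromDatabase helper below are all hypothetical):

```java
import java.util.Properties;

// Simplified stand-ins for Nutch's Content and Protocol types (not the real API).
class Content {
    byte[] content;
    String contentType;
    String url;
    Properties metadata;

    Content(byte[] content, String contentType, String url, Properties metadata) {
        this.content = content;
        this.contentType = contentType;
        this.url = url;
        this.metadata = metadata;
    }
}

interface Protocol {
    Content getProtocolOutput(String url);
}

// A database-backed protocol: map a URL like db://docs/42 to a row lookup
// instead of an HTTP fetch.
class DatabaseProtocol implements Protocol {
    public Content getProtocolOutput(String url) {
        // Pull the document id out of the URL (everything after the last '/').
        String id = url.substring(url.lastIndexOf('/') + 1);
        // A real plugin would run a JDBC query here, e.g.
        //   SELECT body FROM documents WHERE id = ?
        String body = fetchFromDatabase(id);
        Properties meta = new Properties();
        meta.setProperty("docid", id);
        return new Content(body.getBytes(), "text/plain", url, meta);
    }

    // Placeholder standing in for the actual JDBC call.
    private String fetchFromDatabase(String id) {
        return "document text for id " + id;
    }
}
```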
Indexing a Database
• Parse plugin
– Replaces text or HTML parser
– Protocol plugin already gets the text and metadata, so the parser has little to do
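A sketch with simplified stand-ins for Nutch's Parse and ParseData types (not the real API) shows how little is left to do here: the parser just copies the text through and fills in minimal metadata:

```java
import java.util.Properties;

// Simplified stand-ins for Nutch's Parse and ParseData types (not the real API).
class ParseData {
    String title;
    Properties metadata;

    ParseData(String title, Properties metadata) {
        this.title = title;
        this.metadata = metadata;
    }
}

class Parse {
    String text;
    ParseData data;

    Parse(String text, ParseData data) {
        this.text = text;
        this.data = data;
    }
}

// The database protocol plugin already delivered clean text and metadata,
// so this parser only copies them through.
class DatabaseParser {
    Parse getParse(byte[] content, Properties metadata) {
        String text = new String(content);
        String title = metadata.getProperty("title", "(untitled)");
        return new Parse(text, new ParseData(title, metadata));
    }
}
```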
Indexing a Database
• Configuration - plugin.xml
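A plugin's plugin.xml declares its jar and the extension point it implements. A minimal, hypothetical descriptor for a database protocol plugin might look like this (the id, class, and jar names are made up; only the overall structure follows Nutch's plugin descriptor format):

```xml
<plugin id="protocol-db" name="Database Protocol Plugin"
        version="0.0.1" provider-name="example.org">
  <runtime>
    <!-- Jar built from the plugin source; the name is hypothetical -->
    <library name="protocol-db.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Register the implementation class at Nutch's protocol extension point -->
  <extension id="org.example.nutch.protocol.db"
             name="DatabaseProtocol"
             point="org.apache.nutch.protocol.Protocol">
    <implementation id="DatabaseProtocol"
                    class="org.example.nutch.protocol.db.DatabaseProtocol"/>
  </extension>
</plugin>
```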
Indexing a Database
• Configuration - nutch-site.xml
– Add correct plugin
• Make sure Nutch can find plugin
– $NUTCH_HOME/plugins
Improving the Plugin
• Configuration via XML
• Determine which database to use for
what URLs
• Automatically ‘crawl’ database
• Pass unknown URLs to default plugin
Searching with Nutch
• Parse query - NutchAnalysis
• Filter query - QueryFilters
• Pass to Lucene - IndexSearcher
– Optimization/caching: LuceneQueryOptimizer
– Translate hits from Lucene back to Nutch
Query Filter
• Nutch Query → QueryFilter.filter() → Lucene Query
Date Query Filter
• Date query filter restricts by date
Basic Query Filter
• Boosts weight of particular fields
• Manipulates phrases
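The field-boosting idea can be sketched with a simplified stand-in that emits a Lucene-style query string (the real QueryFilter builds a Lucene BooleanQuery object; the FieldBoostFilter name, field, and boost below are illustrative):

```java
// Simplified stand-in for a Nutch query filter. The real QueryFilter.filter()
// translates an org.apache.nutch.searcher.Query into a Lucene BooleanQuery;
// here we emit a Lucene-style query string instead.
class FieldBoostFilter {
    private final String field;
    private final float boost;

    FieldBoostFilter(String field, float boost) {
        this.field = field;
        this.boost = boost;
    }

    // Turn each query term into a required, boosted clause on the field.
    String filter(String[] terms) {
        StringBuilder lucene = new StringBuilder();
        for (String term : terms) {
            if (lucene.length() > 0) {
                lucene.append(' ');
            }
            lucene.append('+').append(field).append(':').append(term)
                  .append('^').append(boost);
        }
        return lucene.toString();
    }
}
```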
Additional Query Filters
• Could implement relevance feedback in
this framework
• Manual relevance feedback
– could add morelike:somedocument
operator
• Automatic relevance feedback - extend
BasicQueryFilter
Additional Capabilities
• Distributed searching
– Nutch Distributed File System
• MapReduce a la Google
• More
Nutch Distributed Filesystem
• Write-once
• Stream-oriented (append-only,
sequential read)
• Distributed, transparent, replicated,
fault-tolerant
• Distribute index and content
MapReduce
• Distributed processing technique
• Idea from functional programming
Map
• Apply same operation to several data items
• Example (Python):
def getDocument(docid):
    """ fetch document with given docid from database """
    # do some stuff ...
    return document

docids = [1, 2, 3, 4, 5]
documents = map(getDocument, docids)
• Mapping for individual items is independent, so it can be distributed!
Reduce
• Combine results of map operation
• Simple example - sum of squares
measurements = [4, 2, 6, 9]

def sum(x, y):
    return x + y

def square(x):
    return x ** 2

result = reduce(sum, map(square, measurements))
MapReduce in Nutch
• Can use to distribute crawling, indexing,
etc
Conclusions
• Nutch is:
– featureful
– flexible
– extensible
– scalable
• Get started with Nutch:
http://lucene.apache.org/nutch
• Sample plugins and code samples:
http://umich.edu/~aelkiss/nutch