Introduction to Nutch
Zhao Dongsheng
2008.9.29
Summary
What's Nutch
Nutch's architecture
How to use Nutch
About the first homework
What's Nutch
Written in Java
Open-source project
An application for building a search engine (SE)
Runs behind the search on many web sites
What's Nutch
Lucene and Nutch
Nutch grew out of Lucene
Both are open-source projects
Both are written in Java
But Lucene is a Java library for text indexing and search
Nutch is an application
Nutch uses Lucene for indexing
Nutch's architecture
Nutch's core components
Fetcher
Requests web pages
Parses and extracts links
Web DB
Page DB
Used for fetch scheduling
Link DB
Stores the link graph
Store anchor text with each link
Link-analysis and Anchor text indexing
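To make the Link DB idea concrete, here is a minimal in-memory sketch of a link graph that keeps the anchor text with each link. The class and method names (LinkDbSketch, addLink, anchorsFor) are hypothetical and chosen for illustration; this is not Nutch's own code or storage format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a link database; not Nutch's actual classes.
class LinkDbSketch {
    // One incoming link: the page it comes from and the anchor text it uses.
    static final class Inlink {
        final String fromUrl;
        final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    // Target URL -> all links pointing at it.
    private final Map<String, List<Inlink>> inlinks = new LinkedHashMap<>();

    void addLink(String fromUrl, String toUrl, String anchor) {
        inlinks.computeIfAbsent(toUrl, k -> new ArrayList<>())
               .add(new Inlink(fromUrl, anchor));
    }

    // Number of incoming links, a crude input for link analysis.
    int inlinkCount(String url) {
        return inlinks.getOrDefault(url, Collections.<Inlink>emptyList()).size();
    }

    // Anchor texts that other pages use to describe this URL.
    List<String> anchorsFor(String url) {
        List<String> result = new ArrayList<>();
        for (Inlink in : inlinks.getOrDefault(url, Collections.<Inlink>emptyList())) {
            result.add(in.anchor);
        }
        return result;
    }

    public static void main(String[] args) {
        LinkDbSketch db = new LinkDbSketch();
        db.addLink("http://a.example/", "http://www.pku.edu.cn/", "Peking University");
        db.addLink("http://b.example/", "http://www.pku.edu.cn/", "PKU home page");
        System.out.println(db.inlinkCount("http://www.pku.edu.cn/"));
        System.out.println(db.anchorsFor("http://www.pku.edu.cn/"));
    }
}
```

Indexing the collected anchor texts together with the target page lets a page be found by the words other sites use to describe it.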
Nutch's core components (cont.)
Indexer
Creates inverted index
Uses Lucene
Searcher
Finds relevant docs quickly
Ranks the docs
Summarizing
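The indexer and searcher roles above can be illustrated with a toy inverted index. This is a sketch under simplifying assumptions: plain Java collections instead of Lucene, whitespace/punctuation tokenization, and ranking by raw term count. The names InvertedIndexSketch, addDocument and search are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index; Nutch itself delegates this work to Lucene.
class InvertedIndexSketch {
    // term -> (document id -> number of occurrences of the term)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Indexer side: split the text into terms and record each occurrence.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Searcher side: look up one term and rank documents by occurrence count,
    // breaking ties by ascending document id.
    List<Integer> search(String term) {
        Map<Integer, Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return new ArrayList<>();
        }
        List<Integer> docs = new ArrayList<>(hits.keySet());
        docs.sort((a, b) -> {
            int byCount = Integer.compare(hits.get(b), hits.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return docs;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Nutch uses Lucene");
        index.addDocument(2, "Lucene is a library; Lucene indexes text");
        System.out.println(index.search("lucene")); // prints [2, 1]
    }
}
```

The lookup touches only the postings of the query term, which is why an inverted index can find matching documents quickly without scanning every page.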
Functions Nutch supports
Politeness when crawling
Duplicate removal
PageRank analysis
Distributed searching
Summarizing
......
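As an illustration of duplicate removal, the sketch below drops pages whose content is byte-for-byte identical to a page already seen, by remembering an MD5 digest of each page. It only shows the general idea of exact-duplicate detection; it is not a description of how Nutch implements it, and the names DedupSketch, md5Hex and addIfNew are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Exact-duplicate detection by content digest (illustrative only).
class DedupSketch {
    private final Set<String> seenDigests = new HashSet<>();

    // Hex-encoded MD5 digest of the page content.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the content is new, false if identical content was seen before.
    boolean addIfNew(String content) {
        return seenDigests.add(md5Hex(content));
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.addIfNew("<html>same page</html>")); // prints true
        System.out.println(dedup.addIfNew("<html>same page</html>")); // prints false
    }
}
```

Storing digests instead of full pages keeps the memory cost per page small, at the price of catching only exact copies, not near-duplicates.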
Nutch's Technical Goals
Fetch several billion pages per month
Maintain an index of these pages
Search that index up to 1000 times per second
Provide very high quality search results
Operate at minimal cost
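To get a feeling for the fetch goal, a rough conversion to pages per second helps. The figures below are assumptions for illustration: one billion pages and a 30-day month.

```latex
\[
\frac{10^{9}\ \text{pages}}{30 \times 86\,400\ \text{s}}
  = \frac{10^{9}}{2.592 \times 10^{6}}\ \text{pages/s}
  \approx 386\ \text{pages/s}
\]
```

"Several billion" pages per month therefore means a sustained rate of more than a thousand fetches per second, which is why the fetcher has to run many threads on many machines.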
Source code & API
Source Dirs
analysis crawl html plugin scoring segment tools
fetcher indexer net parse protocol searcher ...
crawl/Crawl.java
fetcher/Fetcher.java
How to use Nutch
Download & unpack
Nutch requires a JVM
Set environment variables
Configure
Specify root URLs
Specify URL filters
Optionally specify
Number of threads
Levels to crawl
Fetch delay
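The "levels to crawl" setting can be pictured as the number of fetch rounds: the first round fetches the root URLs, and every later round fetches the links discovered in the round before. The sketch below simulates this on a tiny in-memory "web" (a map from URL to outgoing links); the class DepthCrawlSketch and its crawl method are hypothetical and no real HTTP requests are made.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Depth-limited crawl over a fake, in-memory web (illustrative only).
class DepthCrawlSketch {
    // Returns the URLs fetched after at most `depth` rounds, in fetch order.
    static Set<String> crawl(Map<String, List<String>> web, List<String> roots, int depth) {
        Set<String> fetched = new LinkedHashSet<>();
        List<String> frontier = new ArrayList<>(roots);
        for (int level = 0; level < depth && !frontier.isEmpty(); level++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                if (!fetched.add(url)) {
                    continue; // already fetched in an earlier round
                }
                // "Fetching" is a map lookup here; a real fetcher would issue an
                // HTTP request and wait for the fetch delay between requests.
                next.addAll(web.getOrDefault(url, Collections.<String>emptyList()));
            }
            frontier = next;
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("root", Arrays.asList("a", "b"));
        web.put("a", Arrays.asList("c"));
        web.put("b", Arrays.asList("root"));
        web.put("c", Arrays.asList("d"));
        System.out.println(crawl(web, Arrays.asList("root"), 2)); // prints [root, a, b]
    }
}
```

Each extra level can multiply the number of pages to fetch, so a small depth is a sensible choice for a first test crawl.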
How to use Nutch (cont.)
Root URLs Example
http://www.pku.edu.cn
URL Filter Example
crawl-urlfilter.txt
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
+^http://([a-z0-9]*\.)*pku.edu.cn/
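The filter file is read top to bottom: each line is a regular expression prefixed with "+" (accept) or "-" (reject), the first pattern that matches a URL decides, and a URL that matches no pattern is ignored. The sketch below reproduces that rule with java.util.regex; the class UrlFilterSketch is a hypothetical illustration, not Nutch's own filter class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// First-match-wins URL filter in the style of crawl-urlfilter.txt (illustrative only).
class UrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each rule line starts with '+' (accept) or '-' (reject), followed by a regex.
    UrlFilterSketch(String... lines) {
        for (String line : lines) {
            rules.add(new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    // The rules from the slide, written as Java string literals.
    static UrlFilterSketch pkuExample() {
        return new UrlFilterSketch(
            "-^(file|ftp|mailto):",
            "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*pku.edu.cn/");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = pkuExample();
        System.out.println(filter.accepts("http://www.pku.edu.cn/"));           // prints true
        System.out.println(filter.accepts("http://www.pku.edu.cn/logo.gif"));   // prints false
        System.out.println(filter.accepts("http://www.pku.edu.cn/search?q=1")); // prints false
    }
}
```

Because order matters, the "-" rules for unwanted suffixes and query characters must come before the "+" rule for the domain; otherwise the domain rule would accept those URLs first.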
How to use Nutch (cont.)
Run Nutch
Just a command line
bin/nutch crawl myurl.txt -dir mycrawl -depth 4 >& crawl.log
Use Tomcat to try out the search!
Home page
Search result
Score Explanation
Anchor texts with a link
About the first Homework
About web crawling
Get familiar with Nutch & Java
Fetch blogs, BBS, etc.?
Need your advice!
Q&A
Thanks!