
Introduction to Nutch
Zhao Dongsheng
2008.9.29
Summary




What's Nutch
Nutch's architecture
How to use Nutch
About the first homework
What's Nutch




Written in Java
Open-source project
An application for building a search engine (SE)
Runs behind the search on many web sites
What's Nutch







Lucene and Nutch
Nutch grew out of Lucene
Both are open-source projects
Both are written in Java
But Lucene is a Java library for text indexing and search
Nutch is an application
Nutch uses Lucene for indexing
Nutch's architecture
Nutch's core components

Fetcher



Requests web pages
Parses and extracts links
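The "parses and extracts links" step can be sketched roughly as follows. This is a minimal illustration, not Nutch's actual parser: the class name LinkExtractorSketch and the regular expression are assumptions made for this example, and a real HTML parser handles many more cases.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: illustrates link extraction from a fetched page.
class LinkExtractorSketch {
    // Matches <a ... href="...">anchor text</a>; real-world HTML needs a proper parser.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the link targets found in the page, resolved against the page URL.
    // Throws IllegalArgumentException if an href is not a valid URI.
    static List<String> extractLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            links.add(base.resolve(m.group(1).trim()).toString());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about.html\">About</a> "
                    + "<a href='http://www.pku.edu.cn/'>PKU</a></p>";
        for (String link : extractLinks("http://example.org/index.html", html)) {
            System.out.println(link);
        }
    }
}
```

The extracted links are what a crawler would hand on to its databases for later fetch rounds.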
Web DB

Page DB


Used for fetch scheduling
Link DB



Stores the link graph
Stores anchor text with each link
Supports link analysis and anchor-text indexing
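One way to picture what the Link DB holds is a map from each page to its incoming links, each carrying its anchor text. The sketch below is an in-memory stand-in under that assumption; the class and method names are hypothetical, and Nutch's on-disk format is different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a link database (not Nutch's storage format).
class LinkDbSketch {
    // One edge of the link graph: the linking page plus the anchor text it used.
    static final class Inlink {
        final String fromUrl;
        final String anchorText;

        Inlink(String fromUrl, String anchorText) {
            this.fromUrl = fromUrl;
            this.anchorText = anchorText;
        }
    }

    // Target URL -> all links pointing at it.
    private final Map<String, List<Inlink>> inlinksByTarget = new HashMap<>();

    void addLink(String fromUrl, String toUrl, String anchorText) {
        inlinksByTarget.computeIfAbsent(toUrl, k -> new ArrayList<>())
                       .add(new Inlink(fromUrl, anchorText));
    }

    // Number of incoming links, a basic input for link analysis.
    int inlinkCount(String url) {
        return inlinksByTarget.getOrDefault(url, Collections.<Inlink>emptyList()).size();
    }

    // Anchor texts of incoming links, which can be indexed along with the target page.
    List<String> anchorTexts(String url) {
        List<String> texts = new ArrayList<>();
        for (Inlink in : inlinksByTarget.getOrDefault(url, Collections.<Inlink>emptyList())) {
            texts.add(in.anchorText);
        }
        return texts;
    }
}
```

Keying the map by target URL makes both uses from the slide cheap: counting inlinks for link analysis and collecting anchor text for indexing.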
Nutch's core components (cont.)

Indexer



Creates inverted index
Uses Lucene
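To make "inverted index" concrete: it maps each term to the documents that contain it. The sketch below is plain Java and does not use Lucene's API; the class name and the simple tokenizer are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy index: term -> sorted set of document ids containing that term.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits the text into lowercase alphanumeric terms and records docId for each.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns a copy of the ids of documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : new TreeSet<Integer>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Nutch uses Lucene for indexing");
        index.addDocument(2, "Lucene is a Java library");
        System.out.println(index.lookup("lucene"));
    }
}
```

A lookup touches only the entry for one term, which is why a searcher can answer queries without scanning every document.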
Searcher



Finds relevant docs quickly
Ranks the docs
Summarizing
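The "find, then rank" behaviour of the Searcher can be illustrated with a deliberately crude score: the number of times a query term occurs in a document. This is an assumption for illustration, not the scoring Nutch or Lucene actually uses, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy searcher: ranks documents by how often they contain the query terms.
class SearcherSketch {
    // Counts occurrences of any query term in the text (a crude relevance score).
    static int score(String text, String query) {
        String[] queryTerms = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        int hits = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            for (String term : queryTerms) {
                if (word.equals(term)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    // Returns the ids of matching documents, highest score first.
    // The sort is stable, so ties keep the iteration order of the input map.
    static List<String> search(Map<String, String> docs, String query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (score(e.getValue(), query) > 0) {
                hits.add(e.getKey());
            }
        }
        hits.sort((a, b) -> Integer.compare(score(docs.get(b), query), score(docs.get(a), query)));
        return hits;
    }
}
```

A production searcher would read candidates from the inverted index instead of scanning every document, and would use a richer scoring formula.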
Functions Nutch supports






Politeness when crawling
Duplicate removal
PageRank analysis
Distributed searching
Summarizing
......
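As an example of one function from this list, duplicate removal can be sketched by hashing page content and keeping only the first URL seen for each hash. This is a minimal sketch assuming exact-content duplicates and an MD5 digest; the class and method names are hypothetical, and it does not catch near-duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: drops pages whose content is byte-for-byte identical to an earlier page.
class DedupSketch {
    // Hex-encoded MD5 digest of the content, used as a content signature.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Keeps the first URL (in the map's iteration order) for each distinct content signature.
    static List<String> uniqueUrls(Map<String, String> contentByUrl) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : contentByUrl.entrySet()) {
            if (seen.add(md5Hex(e.getValue()))) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

Comparing fixed-size signatures avoids comparing full page bodies against each other.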
Nutch's Technical Goals





Fetch several billion pages per month
Maintain an index of these pages
Search that index up to 1000 times per second
Provide very high quality search results
Operate at minimal cost
Source code & API

Source Dirs



analysis crawl html plugin scoring segment tools
fetcher indexer net parse protocol searcher ...
crawl/Crawl.java
fetcher/Fetcher.java
How to use Nutch

Download & unpack



Nutch requires a JVM
Set environment variables
Configure



Specify root URLs
Specify URL filters
Optionally specify



Number of threads
Levels to crawl
Fetch delay
How to use Nutch (cont.)

Root URLs Example


http://www.pku.edu.cn
URL Filter Example





crawl-urlfilter.txt
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
+^http://([a-z0-9]*\.)*pku.edu.cn/
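How such a filter file is applied can be sketched as follows: each line is a sign (+ to accept, - to reject) followed by a regular expression, the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected. This is a simplified illustration under those assumptions; UrlFilterSketch is a hypothetical class, not Nutch's own filter implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical first-match-wins URL filter, modelled on the +/- rule lines shown above.
class UrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each line must be non-empty and start with '+' or '-', followed by a regular expression.
    UrlFilterSketch(String... lines) {
        for (String line : lines) {
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // The first rule whose pattern occurs in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(
            "-^(file|ftp|mailto):",
            "-\\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe)$",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*pku.edu.cn/");
        System.out.println(filter.accepts("http://www.pku.edu.cn/index.html"));
        System.out.println(filter.accepts("http://www.pku.edu.cn/logo.gif"));
    }
}
```

Because the first match wins, the order of the lines matters: the reject rules come before the final accept rule so that images and query URLs on pku.edu.cn are still skipped.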
How to use Nutch (cont.)

Run Nutch



Just a command line
bin/nutch crawl myurl.txt -dir mycrawl -depth 4 >& crawl.log
Use Tomcat to try out the search interface!
Home page
Search result
Score Explanation
Anchor texts with a link
About the first Homework




About web crawling
Get familiar with Nutch & Java
Fetch blogs, BBS forums, etc.?
Need your advice!
Q&A
Thanks!