Search engine


WWW servers and search engines
2004, 劉震昌
Web browser and server

Web browser (client): a tool for reading HTML documents
Web server (e.g., running IIS)
Click a link -> send request -> find document -> return HTML document -> display
Where is the web server?
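
The request/response cycle on this slide can be sketched with Python's standard library. This is only an illustration under stated assumptions: the handler name `PageHandler`, the helper `fetch_once`, and the page content are made up for the example, and the server runs on the local machine rather than on IIS.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page content; any HTML document would do.
PAGE = b"<html><body><h1>Hello from the server</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Server side: "find document" and "return HTML document".
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_once():
    """Start a local server, send one request as a client, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: "send request" (what a browser does after a click).
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/index.html")
        response = conn.getresponse()
        status, body = response.status, response.read()
        conn.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch_once()` returns the status code and the HTML bytes; a real browser would then render those bytes ("display").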
Probing the Internet (cont.)

tracert, ping
Packet: the unit of data transmission on a network
[Figure: packets travel from the source through routers to the destination, www.yahoo.com.tw]
Probing the Internet (How do you know you are on the Internet?)

ping www.yahoo.com.tw
Pinging rc.tpe.yahoo.com [202.1.237.23] with 32 bytes of data:
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Ping statistics for 202.1.237.23:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 4ms, Maximum = 5ms, Average = 4ms
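
The statistics that ping prints can be recomputed from the reply lines. The sketch below is a hedged illustration: the function name `summarize_ping` is made up, the input is the four replies from this slide, and the average is truncated to an integer because that matches the figures shown here.

```python
import re

# The four reply lines from the slide.
PING_OUTPUT = """\
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
"""


def summarize_ping(text, sent=4):
    """Recompute loss and round-trip statistics from ping reply lines."""
    # Windows prints "time=4ms", or "time<1ms" for very fast replies.
    times = [int(ms) for ms in re.findall(r"time[=<](\d+)ms", text)]
    received = len(times)
    summary = {
        "sent": sent,
        "received": received,
        "loss_percent": round(100 * (sent - received) / sent),
    }
    if times:
        summary["min_ms"] = min(times)
        summary["max_ms"] = max(times)
        summary["avg_ms"] = sum(times) // len(times)  # integer average
    return summary
```

For the replies above, the summary gives a minimum of 4 ms, a maximum of 5 ms, an average of 4 ms, and 0% loss, which agrees with the ping statistics on the slide.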
The route from source to destination

tracert www.yahoo.com.tw
Tracing route to rc.tpe.yahoo.com [202.1.237.23]
over a maximum of 30 hops:
 1   <1 ms   <1 ms   <1 ms  gateway.lan20.csie.ncnu.edu.tw [163.22.20.254]
 2   <1 ms   <1 ms   <1 ms  ip253.puli01.ncnu.edu.tw [163.22.1.253]
 3   <1 ms   <1 ms   <1 ms  ip090.puli255-64-203.ncnu.edu.tw [203.64.255.90]
 4    1 ms    1 ms    1 ms  140.128.251.38
 5   17 ms   74 ms    2 ms  tc-tanet-gw01.router.hinet.net [211.22.189.186]
 6    2 ms    1 ms    1 ms  211.22.189.190
 7    1 ms    1 ms    1 ms  tc-c12r2.router.hinet.net [211.22.189.74]
 8    4 ms    4 ms    4 ms  tp-s2-c12r2.router.hinet.net [210.65.200.30]
 9    4 ms    4 ms    4 ms  tp-s2-c6r8.router.hinet.net [211.22.35.181]
10    9 ms    5 ms    6 ms  211.22.41.89
11    5 ms    5 ms    5 ms  rc.tpe.yahoo.com [202.1.237.23]
Trace complete.
Lab#5



Try ping and tracert against www.google.com.tw
Record your results in a text file
Email it to me with the subject: Lab5 <student ID>
How do you host a site (WWW, FTP, …) on a dynamic IP?

DHCP (Dynamic Host Configuration Protocol)

DHCP explanation
IP: 163.22.123.111
IP: 163.22.123.123
.
.
.
If we want to communicate with such a host, what is its IP or domain name?
1. Run your own DNS (domain name server)
2. Dynamically register the IP and domain name: www.no-ip.com
[Figure: a host with the dynamic IP 163.22.123.111 registers the mapping between its IP and the domain name Kamiry.no-ip.com with the DNS server at www.no-ip.com]
Reference: No-IP user documentation
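
The idea behind option 2 can be sketched as a table that maps a fixed domain name to whatever IP the host currently holds. This is a toy model only: the class `DynamicDnsRegistry` and its methods are hypothetical and are not the No-IP API; the host name and addresses are the ones used on this slide.

```python
class DynamicDnsRegistry:
    """Toy stand-in for a dynamic DNS service: domain name -> current IP."""

    def __init__(self):
        self._records = {}

    def register(self, hostname, ip):
        # Domain names are case-insensitive, so normalize before storing.
        # Registering again simply overwrites the old address.
        self._records[hostname.lower()] = ip

    def resolve(self, hostname):
        # Return the most recently registered IP, or None if unknown.
        return self._records.get(hostname.lower())


registry = DynamicDnsRegistry()
registry.register("Kamiry.no-ip.com", "163.22.123.111")

# Later, DHCP hands the host a different address; the host registers again,
# so visitors who only know the domain name still reach it.
registry.register("Kamiry.no-ip.com", "163.22.123.123")
```

After the second registration, `registry.resolve("kamiry.no-ip.com")` returns the new address, while the domain name that visitors use stays the same.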
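Installing IIS (Internet Information Server)

Found on the Windows CD
Installation instructions
IIS configuration
Microsoft IIS is very widespread and has many security vulnerabilities, so a non-Microsoft WWW server can be used instead
E.g., Apache, analogx, …
Reference documents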
HW#3




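Set up a WWW server on your own computer
Email the server's domain name to me
Put your personal web page on your own computer
The server must be running during the hours specified by the teaching assistant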
Searching the Web
Ref: Chapter 13 in
“Modern Information Retrieval”
Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Outline


Measuring the Web
Methods for searching the Web


Search engines
Web directories
Searching the Web



The WWW started in 1989
The textual data alone is estimated to be on the order of one terabyte
Goal: how to efficiently manage, retrieve, and filter information from the Web?
Challenges





Distributed data
  - Data spans many computers interconnected without a predefined topology
High percentage of volatile data
  - 40% of the Web changes every month
Large volume
Unstructured and redundant data
  - 30% of Web pages are (near) duplicates
Heterogeneous data
  - Different languages
Measuring the Web
[Figure: Web servers, addressed by URLs, make up the WWW within the Internet]
1998: 3M (3 million) Web servers
No. of servers = 1/10 of the no. of computers on the Internet
Measuring the Web (cont.)





1998
5 KB per Web page on average
300M Web pages (300 million)
300M * 5 KB = 1.5 terabytes
Growing at a rate of 20M pages per month
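
The size estimate can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1000 bytes, 1 TB = 1000^4 bytes) and that the per-page figure is in kilobytes; the monthly growth in gigabytes is derived from the slide's numbers and is not stated in the original.

```python
PAGES_1998 = 300_000_000          # 300M Web pages
AVG_PAGE_BYTES = 5 * 1000         # 5 KB per page on average (decimal KB assumed)
GROWTH_PAGES_PER_MONTH = 20_000_000

# Total text volume: 300M pages * 5 KB each.
total_bytes = PAGES_1998 * AVG_PAGE_BYTES
total_terabytes = total_bytes / 1000**4

# Derived figure: new data added per month at the stated growth rate.
monthly_growth_gb = GROWTH_PAGES_PER_MONTH * AVG_PAGE_BYTES / 1000**3

print(total_terabytes, "TB in total")
print(monthly_growth_gb, "GB added per month")
```

Under these assumptions the total comes to 1.5 TB, and the growth rate corresponds to about 100 GB of new pages per month.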
Growth of the Web
[Figure: growth of the Web, 1996 to 1998; x-axis: year; y-axis: Web pages and Web sites, with ticks at 100, 200, and 300 million]
Methods for searching the Web

Search engines
  - Index the Web documents as a full-text database
  - AltaVista, Google, …
Web directories (portal site directories)
  - Classify selected Web documents by subject
  - Yahoo!
Search engines concept

Model the Web as a database
All queries must be answered without accessing the Web pages
[Figure: user -> queries -> database]
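
Answering queries without touching the Web pages means looking words up in a local index that was built beforehand. A minimal sketch of such an inverted index follows; the function names `build_index` and `search` and the example.com pages are hypothetical, and real engines use far more elaborate tokenization and storage.

```python
from collections import defaultdict

# Hypothetical page texts, as if a crawler had already fetched them.
PAGES = {
    "http://example.com/a": "web search engines index the web",
    "http://example.com/b": "web directories classify pages by subject",
    "http://example.com/c": "search engines rank pages",
}


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word (an AND query)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(PAGES)
```

`search(INDEX, "search engines")` returns pages a and c using only the index; the pages themselves are never fetched at query time.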
Search engines (cont.)

AltaVista (www.altavista.com)




20 multi-processor machines
130 Gb of RAM each
Over 500 Gb of disk space each
The query engine uses 75% of the resources
The top search engines


Foreign
  - Google (www.google.com)
  - www.yahoo.com
  - www.altavista.com
  - Inktomi (www.inktomi.com)
  - Statistics on search engines
      - www.searchenginewatch.com
      - http://imt.net/~notess/search
Taiwan
  - Yahoo!/Kimo uses Google
  - Openfind (www.openfind.com.tw) (Prof. 吳昇, National Chung Cheng University)
  - Yam (www.yam.com.tw)
Search engines (cont.)

Centralized crawler-indexer architecture
[Figure: components of the architecture: users, User Interface, Query Engine, Index database, Indexer, Crawler, Web]
User Interface


Query interface
  - Keywords
  - Boolean operators
Answer interface
  - Rank the searched pages
      - Statistics about the term occurrence within the document
      - Popularity
      - Hyperlink information
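
One of the ranking signals listed for the answer interface, term occurrence within the document, can be sketched as a simple term-frequency score. This is an illustration only: the function `rank_by_term_frequency` and the sample pages are hypothetical, and it ignores the popularity and hyperlink signals that real engines combine with it.

```python
from collections import Counter

# Hypothetical page texts keyed by URL.
PAGES = {
    "http://example.com/a": "web search web crawler web",
    "http://example.com/b": "web directory",
    "http://example.com/c": "crawler crawler",
}


def rank_by_term_frequency(pages, query):
    """Order pages by how often the query terms occur in them."""
    terms = query.lower().split()
    scores = {}
    for url, text in pages.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in terms)
        if score > 0:  # pages with no matching term are not returned
            scores[url] = score
    # Highest score first; ties are broken by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

For the query "web crawler", page a scores 4, page c scores 2, and page b scores 1, so they are returned in that order.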
Crawler



Also called robots, spiders, wanderers, walkers, and knowbots
In spite of their names, crawlers run on a local system and send requests to remote Web servers
Method: start with a set of URLs, and from there extract other URLs
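
The crawling method can be sketched as a breadth-first traversal that starts from seed URLs and follows the links extracted from each page. To keep the example self-contained it crawls a small in-memory "Web" instead of sending real requests; `LinkExtractor`, `crawl`, `FAKE_WEB`, and the example.com URLs are all hypothetical names for this illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    `fetch` maps a URL to its HTML text, or None for an invalid link.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # invalid link: nothing to index or extract
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny stand-in for the Web: URL -> HTML.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/missing.html">gone</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}
```

`crawl(["http://example.com/"], FAKE_WEB.get)` visits the home page, then a.html and b.html, and skips the link to the missing page. A real crawler would replace `FAKE_WEB.get` with an HTTP fetch and add politeness rules.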
Crawler (cont.)

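Because of the way the Web is traversed, the index of a search engine can be thought of as analogous to the stars in the sky: pages were fetched at different times, so some of what the index shows may no longer exist
Invalid links in search engines vary from 2% to 9%
The fastest crawlers in 1998 could traverse up to 10M Web pages per day
300M pages / 10M pages per day = 30 days to traverse the Web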
Web directories




Classify the Web pages by categories
Directories are hierarchical taxonomies
that classify human knowledge
Yahoo! has close to 1M pages classified
How to classify pages?




Pages have to be submitted to the Web directories
Classification is done manually by a few people
Automatic classification is not yet mature
Not every page is classified
Some Web directories
Web directory     URL
Yahoo!            www.yahoo.com
LookSmart         www.looksmart.com
Lycos Subjects    a2z.lycos.com
eBLAST            www.eblast.com
NewHoo            www.newhoo.com
Magellan          www.mckinley.com
Netscape          www.netscape.com
Snap              www.snap.com
Web sites (K) / Categories: 750, 300, 50, 125, 100, 60, 24, 23
Lab about search engine

Today 1:00~3:00
Final typing test


10/20
Students who do not meet the standard lose 10 points from their semester total