Origins of the Internet


Introduction to Computer Networks
2004, 劉震昌
Review of Lab#2 and
Homework#1


“Lab” means “Laboratory”, not “Label”.
Algorithm steps must be executed in order.
You cannot skip any step at your own discretion.
Why?
Please write your homework subject correctly.
No late homework.
Outline


Origins of the Internet
Origins of the WWW (World Wide Web)
HTML (Hypertext Markup Language) guide
Searching the Web
Search engine (Web browser)
Web directories
Origins of the Internet
Ref: Chap. 2 of Comer’s book
Origins of the Internet

In 1969, the US DoD’s ARPA (Advanced Research Projects Agency) built the ARPANET
Only 4 nodes
Decentralized system
Data transmission
Reference website
Origins of the Internet (cont.)

TCP/IP was developed in 1974 and became a standard in 1983
TCP (Transmission Control Protocol)
IP (Internet Protocol)
The importance of network communication protocols
Growth of ARPANET --> Internet


Internetworking
No organization owns or controls it
Growth of the Internet
[Figure: number of computers on the Internet over time, plotted on a log scale; 1M = 1,000,000]
Units of measure: http://www.spes.tpc.edu.tw/handouts/B_Basic/ref.htm
Almost exponential growth
Recently fueled by the WWW and economic activity
IP Service


Where is your computer on the Internet?
Current Internet (IPv4)
32 bits to represent an IP address
Ex. 163.22.20.129
What is your computer’s IP address? Run ipconfig
[Figure: example hosts 163.22.20.129, 163.22.22.119, 163.22.20.118]
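To make the "32 bits" concrete, here is a minimal Python sketch that packs the dotted-decimal example 163.22.20.129 into a single 32-bit integer and back. The function names are illustrative, not part of any standard tool.

```python
def ipv4_to_int(dotted):
    """Pack a dotted-decimal IPv4 address into one 32-bit integer."""
    octets = [int(part) for part in dotted.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {dotted!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # each octet fills the next 8 bits
    return value


def int_to_ipv4(value):
    """Unpack a 32-bit integer back into dotted-decimal notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


address = ipv4_to_int("163.22.20.129")
print(f"{address:032b}")     # the 32 bits behind 163.22.20.129
print(int_to_ipv4(address))  # 163.22.20.129
```

Each of the four decimal numbers is one byte, which is why no part can exceed 255.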
Address Resolution Protocol
(ARP)


An IP address is an abstraction; physical network hardware does not know how to locate a computer from its IP address
Techniques for resolving an IP address to a hardware address
table look-up
closed-form computation
message exchange
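The first two techniques can be sketched in a few lines of Python. This is only an illustration: the hardware addresses in the table are invented, and the closed-form rule assumes the hardware address was configured to equal the low-order octet of the IP address. Message exchange, where the host asks the network who owns an address, is not shown.

```python
# Hypothetical bindings for hosts on one physical network; the hardware
# addresses below are invented for illustration.
ARP_TABLE = {
    "163.22.20.129": "02:00:00:00:00:81",
    "163.22.20.118": "02:00:00:00:00:76",
}


def resolve_by_table(ip_address):
    """Table look-up: search a stored list of (IP, hardware) bindings."""
    return ARP_TABLE.get(ip_address)  # None when no binding is stored


def resolve_by_computation(ip_address):
    """Closed-form computation, assuming the hardware address was
    configured to equal the low-order octet of the IP address."""
    return int(ip_address.split(".")[-1]) & 0xFF


print(resolve_by_table("163.22.20.129"))
print(resolve_by_computation("163.22.20.129"))
```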
Computers on the Net

Every Internet host has a unique IP address, but IP addresses are hard to remember, so hosts also have host names
e.g., arbor.watson.ibm.com is 9.2.13.20 and arbor.ee.ntu.edu.tw is 140.112.21.236
Try: nslookup
Domain Name Server
A host name has to be converted into an IP address
Domain Name Servers (DNS)
contain a database (look-up table) mapping host names to IP addresses
there are many domain name servers
“.com”, “.gov”, “.edu”, “.tw”
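The look-up table idea can be sketched as a Python dictionary. This is a toy model, not how a real name server is implemented: the two entries are the host names quoted in these notes, and the function names are made up for the example.

```python
# A toy name server: the database is a look-up table from host name to IP
# address. The two entries are the examples quoted in these notes.
HOST_TABLE = {
    "arbor.watson.ibm.com": "9.2.13.20",
    "arbor.ee.ntu.edu.tw": "140.112.21.236",
}


def lookup(host_name):
    """Return the IP address for host_name, or None if it is unknown."""
    return HOST_TABLE.get(host_name.lower().rstrip("."))


def top_level_domain(host_name):
    """Return the last label of the name, such as 'com' or 'tw'."""
    return host_name.rstrip(".").rsplit(".", 1)[-1].lower()


print(lookup("arbor.ee.ntu.edu.tw"))
print(top_level_domain("arbor.ee.ntu.edu.tw"))
```

On a machine with network access, Python's `socket.gethostbyname` asks the real name servers the same question that nslookup does.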
Lab#3

Use the commands


ipconfig
nslookup
Internet application

telnet: a terminal emulation program for TCP/IP networks such as the Internet
telnet 163.22.22.119
ftp (file transfer protocol): 163.22.22.119
(Run telnet server)
Origins of WWW
Ref: Chap. 32 of Comer’s book
Outline




Origins of WWW (World Wide Web)
Web browser
HTML (Hyper-Text Markup Language)
HTTP (Hyper-Text Transfer Protocol)
Origins of WWW

World Wide Web (WWW)
Proposed in 1989 by Tim Berners-Lee at CERN (the European particle physics laboratory)
A large-scale, online repository of information
The W3C (World Wide Web Consortium) now develops interoperable technologies (specifications, guidelines, software, and tools) for the Web
Origins of WWW (cont.)

Data format: HTML (HyperText Markup Language)
Allows hypertext links (URL: Uniform Resource Locator) to other documents on the Web
protocol://computer_name:port/document_name
Protocol: HTTP (HyperText Transfer Protocol)
Data exchange standard on the Web
A common format and transfer protocol for data exchange
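As a small illustration of the URL pattern, the sketch below uses Python's standard `urllib.parse.urlsplit` to break a URL into protocol, computer name, port, and document name. The host name is taken from these notes; the port and `index.html` are assumed for the example.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into the four parts named in the URL pattern."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "computer_name": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "document_name": parts.path.lstrip("/"),
    }


print(split_url("http://www.csie.ncnu.edu.tw:80/index.html"))
```

When the port is left out, the browser falls back to the protocol's default, which is 80 for HTTP.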

Origins of WWW (cont.)
[Figure: the WWW, addressed by URLs, layered on top of the Internet]
The WWW is like one large database distributed across the Internet
Web browser

A tool for reading HTML documents
Client (Web browser): click a link, send request, display the returned document
Server (Web server, e.g. running IIS): find document, return HTML document
Connection terminated after receiving all items
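The request and reply cycle can be tried on one machine. The sketch below is an illustration under assumptions, not how a production browser or IIS works: a tiny server from Python's standard library answers one request on the loopback address, and a hand-written client sends an HTTP/1.0 request and reads until the server closes the connection. The page content and the names `PageHandler`, `fetch`, and `demo` are made up for the example.

```python
import socket
import socketserver
import threading
from http.server import BaseHTTPRequestHandler

PAGE = b"<HTML><HEAD><TITLE>Demo</TITLE></HEAD><BODY>Hello</BODY></HTML>"


class PageHandler(BaseHTTPRequestHandler):
    """Server side: find the document and return it as HTML."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(host, port, path):
    """Client side: send one request, then read until the server closes."""
    with socket.create_connection((host, port), timeout=5) as conn:
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        conn.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # connection terminated: everything has arrived
                break
            chunks.append(data)
    head, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n")[0].decode("ascii")
    return status_line, body


def demo():
    """Serve one request on a free loopback port and fetch it."""
    server = socketserver.TCPServer(("127.0.0.1", 0), PageHandler)
    server.timeout = 5  # give up if no request arrives
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    try:
        return fetch("127.0.0.1", server.server_address[1], "/index.html")
    finally:
        worker.join()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The client knows the document is complete only because the server closes the connection, which matches the last step on the slide.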
Web browser (cont.)

Text mode browser: lynx


lynx http://www.csie.ncnu.edu.tw
Graphics mode browser



NCSA (National Center for Supercomputing Applications) Mosaic by Marc Andreessen
Netscape
IE
Web browser (cont.)
Browser architecture
Document representation



Hypertext: textual information
Hypermedia: additional information, such as images and graphics
HyperXXXX: an abstract idea
A set of documents, in which a document can contain pointers to other documents
Page: a hypermedia document on the Web
Hypertext Markup Language
(HTML)

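Markup language: describes a hypertext document with general display guidelines rather than detailed formatting instructions
[Figure: one HTML document shown in different browsers; the display results may differ]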
HTML



Text file + tags
Tags: formatting the document
<Tagname>…text…</Tagname>
HTML layout
<HTML>
  <HEAD>
    <TITLE>
      ….title of the text….
    </TITLE>
  </HEAD>
  <BODY>
    …body of the document…
  </BODY>
</HTML>
*Good indentation makes the document easier for humans to read and edit
HTML layout (cont.)
<HTML><HEAD><TITLE>….title of the text….
</TITLE></HEAD><BODY>…body of the document…
</BODY></HTML>
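The two layouts describe the same document, because line breaks and indentation between tags do not change the tag structure. A short sketch with Python's standard `html.parser` can check this; the class and variable names are made up for the example, and the sample text stands in for the placeholders on the slide.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record each opening and closing tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def tag_sequence(document):
    """Return the list of tags found in an HTML document."""
    collector = TagCollector()
    collector.feed(document)
    collector.close()
    return collector.tags


INDENTED = """<HTML>
  <HEAD>
    <TITLE>title of the text</TITLE>
  </HEAD>
  <BODY>body of the document</BODY>
</HTML>"""

ONE_LINE = ("<HTML><HEAD><TITLE>title of the text</TITLE></HEAD>"
            "<BODY>body of the document</BODY></HTML>")

print(tag_sequence(INDENTED) == tag_sequence(ONE_LINE))
```

The parser reports tag names in lower case, which also shows that `<HTML>` and `<html>` are treated as the same tag.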
HTML examples




Example1
Example2
Example3: embedding images
Example4: hypertext link (anchor)



<a> ….anything…</a>
Any item can have a hypertext link
Lab#4 in the afternoon

http://www.csie.nctu.edu.tw/~jglee/teacher/content.htm
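An anchor becomes a link once it carries an `href` attribute naming the target URL. The sketch below, again using Python's standard `html.parser`, pulls the target and the link text out of a sample line. The sample text "department page" is invented; the URL is the department address used in these notes.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (target, link text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # target of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(document):
    """Return every (target, link text) pair in the document."""
    collector = LinkCollector()
    collector.feed(document)
    collector.close()
    return collector.links


SAMPLE = '<P>Visit the <A HREF="http://www.csie.ncnu.edu.tw">department page</A>.</P>'
print(extract_links(SAMPLE))
```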
HTTP documents



See
http://ftp.ics.uci.edu/pub/ietf/http/
HTTP/1.0, RFC 1945, 1996
HTTP/1.1, RFC 2068, 1997
Searching the Web
Ref: Chapter 13 in
“Modern Information Retrieval”
Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Outline


Measuring the Web
Methods for searching the Web


Search engines
Web directories
Searching the Web



The WWW started in 1989
The textual data alone is estimated to be on the order of one terabyte
Goal: how to efficiently manage, retrieve, and filter information from the Web?
Challenges

Distributed data: data spans many computers interconnected without a predefined topology
High percentage of volatile data: 40% of the Web changes every month
Large volume
Unstructured and redundant data: 30% of Web pages are (near) duplicates
Heterogeneous data: different languages
Measuring the Web
[Figure: Web servers, addressed by URLs, form the WWW on top of the Internet]
*1998: 3M (3 million) Web servers
No. of servers = 1/10 of the no. of computers on the Internet
Measuring the Web (cont.)





In 1998:
5 KB per Web page on average
300M Web pages (300 million)
300M * 5 KB = 1.5 terabytes
Growing at a rate of 20M pages per month
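The estimate can be checked with a few lines of Python. This is back-of-the-envelope arithmetic only: it takes 1 KB as 1,000 bytes and 1 TB as 10**12 bytes, and the 12-month projection assumes the 20M pages per month rate stays constant, which the notes do not claim.

```python
PAGES = 300_000_000            # 300M Web pages in 1998
BYTES_PER_PAGE = 5_000         # 5 KB per page, taking 1 KB as 1,000 bytes
GROWTH_PER_MONTH = 20_000_000  # 20M new pages per month


def text_volume_terabytes(pages, bytes_per_page=BYTES_PER_PAGE):
    """Total text volume in terabytes (1 TB taken as 10**12 bytes)."""
    return pages * bytes_per_page / 10**12


def pages_after(months, start=PAGES, growth=GROWTH_PER_MONTH):
    """Page count after some months of constant (linear) growth."""
    return start + growth * months


print(text_volume_terabytes(PAGES))            # 1.5
print(text_volume_terabytes(pages_after(12)))  # 2.7
```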
Growth of the Web
[Figure: Web pages and Web sites, in millions (axis ticks 100, 200, 300), by year, 1996 to 1998]
Methods for searching the Web

Search engines
Index the Web documents as a full-text database
AltaVista, Google, …
Web directories
Classify selected Web documents by subject
Yahoo!
Search engines
Model the Web as a database
All queries must be answered without accessing the Web pages
[Figure: user queries are answered from the database]
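One common way to answer queries without touching the pages again is an inverted index, which maps each word to the pages that contain it. The sketch below is a minimal illustration with three invented pages; it is not the design of any particular search engine, and all names in it are assumptions.

```python
from collections import defaultdict

# Hypothetical pages standing in for crawled Web documents.
PAGES = {
    "http://a.example/": "origins of the internet",
    "http://b.example/": "origins of the world wide web",
    "http://c.example/": "searching the web with search engines",
}


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Answer an AND query from the index alone, without refetching pages."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(PAGES)
print(search(INDEX, "origins web"))
```

The index is built once, ahead of time; each query then costs only a few set look-ups and intersections.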
Search engines (cont.)

AltaVista (www.altavista.com)




20 multi-processor machines
130 Gb of RAM each
Over 500 Gb of disk space each
75% of the resources are used by the query engine
The top search engines

International





Google ( www.google.com )
www.yahoo.com
www.altavista.com
Inktomi ( www.inktomi.com )
Statistics on search engines



www.searchenginewatch.com
http://imt.net/~notess/search
Taiwan



Yahoo!/Kimo uses Google
Openfind ( www.openfind.com.tw ) (Prof. 吳昇, National Chung Cheng University)
Yam ( www.yam.com.tw )
Search engines (cont.)

Centralized crawler-indexer architecture
[Figure: crawler-indexer architecture with components: users, User Interface, Query Engine, Index database, Indexer, Crawler, Web]
User Interface

Query interface



Keywords
Boolean operators
Answer interface

Rank the retrieved pages by:
Statistics about term occurrence within the document
Popularity
Hyperlink information
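The first ranking signal, term occurrence, can be sketched as a simple count of how often the query words appear in each page. This is a deliberately naive illustration with invented page texts; real engines combine such counts with popularity and hyperlink information, which are not modelled here.

```python
def rank_by_term_frequency(pages, query):
    """Order matching pages by how often the query words occur in them."""
    words = query.lower().split()
    scores = {}
    for url, text in pages.items():
        tokens = text.lower().split()
        score = sum(tokens.count(word) for word in words)
        if score > 0:
            scores[url] = score
    # Highest score first; ties are broken by URL so the order is stable.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical page texts, invented for illustration.
DOCS = {
    "http://a.example/": "web search engines index the web",
    "http://b.example/": "web directories classify pages",
    "http://c.example/": "origins of the internet",
}

print(rank_by_term_frequency(DOCS, "web"))
```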
Crawler



Also called robots, spiders, wanderers, walkers, and knowbots
In spite of the name, a crawler does not move across the Web; it runs on a local system and sends requests to remote Web servers
Method: start with a set of URLs, and extract other URLs from the pages they point to
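The crawling method can be sketched as a breadth-first traversal. To keep the example self-contained, the "Web" here is a small in-memory dictionary of invented URLs, and `fetch_links` is a stand-in for downloading a page and extracting its links; a real crawler would send HTTP requests instead.

```python
from collections import deque

# A hypothetical in-memory Web: each URL maps to the URLs its page links to.
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": [],
    "http://d.example/": ["http://missing.example/"],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its URLs."""
    return WEB.get(url)  # None models an invalid link


def crawl(seeds, fetch, limit=1000):
    """Visit the seed URLs, then every URL extracted from visited pages."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        links = fetch(url)
        if links is None:  # invalid link: nothing to index or extract
            continue
        visited.append(url)
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(["http://a.example/"], fetch_links))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.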
Crawler (cont.)

Because of how the Web is traversed, the index of a search engine can be thought of as analogous to the stars in the sky: pages were visited at different times and may have changed or disappeared since
Invalid links in search engines vary from 2% to 9%
The fastest current crawlers can traverse up to 10M Web pages per day
300M pages / 10M pages per day = 30 days
Web directories




Classify the Web pages by categories
Directories are hierarchical taxonomies
that classify human knowledge
Yahoo! has close to 1M pages classified
How to classify pages?




Pages have to be submitted to the Web directories
Classification is done manually by a few people
Automatic classification is not yet mature
Not every page is classified
Some Web directories
Web directory    URL                 Web sites (K)
Yahoo!           www.yahoo.com       750
LookSmart        www.looksmart.com   300
Lycos Subjects   a2z.lycos.com       50
eBLAST           www.eblast.com      125
NewHoo           www.newhoo.com      100
Magellan         www.mckinley.com    60
Netscape         www.netscape.com
Snap             www.snap.com
Categories: 24, 23 (rows not recoverable from the transcript)
The power of search engines
I have found a homepage that contains the solutions to the C textbook!!!
Whoever finds the homepage and emails me first will get a bonus point…