The Internet


Data Communications and
Computer Networks: A
Business User’s Approach
Chapter 11
The Internet
1
This time
Move up the OSI hierarchy
• Internet
• Apps
• Protocols
– XXXP
2
The Internet Model
3
Introduction
Today’s Internet is a vast collection of thousands of
networks and their attached devices.
The Internet began as the Arpanet during the 1960s.
One high-speed backbone connected several university,
government, and research sites.
The backbone was capable of supporting 56 Kbps
transmission speeds and eventually became financed by the
National Science Foundation (NSF).
4
Old NSFnet backbone & connecting midlevel and campus networks
5
Brief History of the Internet (1)
• 1964 - Packet switching network paper by Rand
Corporation
• 1969 - The DOD Advanced Research Projects
Agency creates an experimental network called
ARPANET
• 1972 - First e-mail programs are introduced
• 1980s - ARPANET splits into two networks:
ARPANET and MILNET
• 1984 - Arpanet shut down and Internet resulted
• 1987 - NSFnet Network service Center (NNSC)
6
Brief History of the Internet (2)
• 1993 - InterNIC formed, replacing NNSC
• 1993 - CERN releases the World Wide Web
(WWW), developed by Tim Berners-Lee
• 1993-1994 - The graphical web browsers Mosaic
and Netscape Navigator are introduced
• 1995 - NSF ends all support of the backbone, and
the Internet becomes commercially supported
• 1996-present - Internet access increases rapidly
among home, education and business users
7
Brief History of the Internet (3)
• Internet Growth in Nodes
– 1969 - only 4
– 1983 - approximately 500
– 1989 - approximately 80,000
– 1997 - over 16 million
– Now - over 370 million
8
Internet Growth
• http://www.netsizer.com/
• Hosts vs nodes
Hosts – users connected to the internet
130 M (2001)
• Nodes are all connected devices
9
Internet Services
The Internet provides many types of services, including
several very common ones:
• File transfer protocol (FTP)
• Remote login (Telnet)
• Internet telephony
• Electronic mail
• World Wide Web
• Streaming Video and Audio
10
File Transfer Protocol (FTP)
Used to transfer files across the Internet.
User can upload or download a file.
The URL for an FTP site begins with ftp://…
The three most common ways to access an FTP site are:
1. Through a browser
2. Using a canned FTP program
3. Issuing FTP commands at a text-based command prompt.
11
Remote Login (Telnet)
Allows a user to remotely login to a distant computer site.
User usually needs a login and password for the remote computer
site.
User saves money on long distance telephone charges.
12
Internet Telephony
The transfer of voice signals using a packet switched network
and the IP protocol.
Also known as packet voice, voice over packet, voice over
the Internet, and voice over Internet Protocol (VoIP).
VoIP can be internal to a company or can be external using
the Internet.
VoIP consumes many resources and may not always work
well, but can be cost effective in certain situations.
13
Internet Telephony (VoIP)
Three basic ways to make a telephone call using VoIP:
1. PC to PC using sound cards and headsets (or speakers and
microphone)
2. PC to telephone (need a gateway to convert IP addresses to
telephone numbers)
3. Telephone to telephone (need gateways)
14
Internet Telephony (VoIP)
Three functions necessary to support voice over IP:
1. Voice must be digitized (PCM, 64 Kbps, fairly standard)
2. 64 Kbps voice must be compressed (many standards here: ITU-T G.729A, used by
AT&T, Lucent, others; G.723.1, used by Microsoft and Intel)
3. Once the voice is compressed, the data must be transmitted.
Many different ways to do this.
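A quick arithmetic check of the 64 Kbps figure in step 1. The sampling rate and sample size are the standard telephony PCM values (8,000 samples per second, 8 bits per sample); the slide does not state them, so treat them as assumptions. The 8 Kbps rate used for G.729A is likewise an assumption added for illustration.

```python
# Assumed standard telephony PCM parameters (not stated on the slide).
SAMPLES_PER_SECOND = 8000
BITS_PER_SAMPLE = 8

# Uncompressed PCM voice rate in bits per second.
pcm_bps = SAMPLES_PER_SECOND * BITS_PER_SAMPLE

# Assumed bit rate of a G.729A voice stream, for comparison only.
G729A_BPS = 8000
compression_ratio = pcm_bps / G729A_BPS

print(f"PCM: {pcm_bps} bps, compression ratio vs G.729A: {compression_ratio:.0f}:1")
```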
15
Internet Telephony (VoIP)
How can we transport compressed voice?
Streaming audio, such as Real Time Streaming Protocol
(RTSP) and Microsoft’s Active Streaming Format (ASF)
Resource Reservation Protocol (RSVP) - carries a specific
QoS through the network, reserving bandwidth at every node.
Operates at the transport layer.
Internet Stream Protocol version 2 (ST2) - an experimental
resource reservation protocol that operates at same layer as IP
16
Electronic Mail
E-mail programs can create, send, receive, and store e-mails,
as well as reply to, forward, and attach non-text files.
Multipurpose Internet Mail Extension (MIME) is used to send
e-mail attachments.
Simple Mail Transfer Protocol (SMTP) is used to transmit e-mail messages (uses TCP port 25).
An e-mail daemon is always running, waiting to perform its function.
Post Office Protocol version 3 (POP3) and Internet Message
Access Protocol (IMAP) are used to hold and later retrieve email messages.
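A minimal sketch of how MIME wraps a non-text attachment, using Python's standard email package. The addresses, subject, and file name are hypothetical, and nothing is actually sent; handing the message to an SMTP server on port 25 would be a separate step.

```python
from email.message import EmailMessage

# Build a plain-text message (all header values are made up).
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Quarterly report"
msg.set_content("The report is attached.")

# Adding a binary attachment turns the message into a MIME multipart message.
msg.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="report.bin",
)

print(msg.get_content_type())
```

The text body and the attachment travel as separate MIME parts inside one message, which is what lets a mail program show one and save the other.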
17
eMail
Consists of 2 parts:
User Agent: Allows users to create, edit, store and forward
messages
Message Transfer Agent: Prepares and transfers email
message
18
Electronic Mail Holders
Post Office Protocol version 3 (POP3) and Internet Message
Access Protocol (IMAP) are used to hold and later retrieve email messages.
POP allows you to save messages in your e-mail box.
IMAP allows you to view only the message headings without
downloading everything. It also permits mailboxes, search, etc.
19
Listservs
A popular software program used to create and manage
Internet mailing lists.
When an individual sends an e-mail to a listserv, the listserv
sends a copy of the message to all listserv members.
Listservs can be useful business tools for individuals trying to
follow a particular area of study.
20
Usenet
A voluntary set of rules for passing messages and maintaining
newsgroups.
A newsgroup is the Internet equivalent of an electronic
bulletin board system.
Thousands of Usenet groups exist on virtually any topic.
21
Streaming Audio and Video
The continuous download of a compressed audio or video
file, which can be heard or viewed on the user’s workstation.
Real-time Protocol (RTP) and Real Time Streaming Protocol
(RTSP) support streaming audio and video.
Streaming audio and video consume a large amount of
network resources.
22
World Wide Web
The World Wide Web (WWW) is an immense collection of
web pages and other resources that can be downloaded across
the Internet and displayed on a workstation via a web
browser.
Browser is the user agent.
The most popular service on the Internet.
Basic web pages are created with the HyperText Markup
Language (HTML).
23
World Wide Web
While HTML is the language to display a web page,
HyperText Transfer Protocol (HTTP) is the protocol to
transfer a web page.
Many extensions to HTML have been created. Dynamic
HTML is a very popular extension to HTML.
Common examples of dynamic HTML include mouse-over
techniques, live positioning of elements (layers), data
binding, and cascading style sheets.
24
World Wide Web – XML
Extensible Markup Language (XML) is a description for how
to create a document - both the definition of the document
and the contents of the document.
The syntax of XML is fairly similar to HTML.
You can define your own tags, such as <CUSTOMER> which
have their own, unique properties.
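A small illustration of a user-defined tag, parsed with Python's standard XML library. The document contents, the NAME and CITY child tags, and the id attribute are all invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document that defines its own <CUSTOMER> tag.
document = """
<CUSTOMER id="42">
  <NAME>Pat Smith</NAME>
  <CITY>State College</CITY>
</CUSTOMER>
"""

customer = ET.fromstring(document)

# The tag name and its properties are whatever the document author chose.
name = customer.find("NAME").text
city = customer.find("CITY").text
print(customer.tag, customer.get("id"), name, city)
```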
25
e-Commerce and e-government
The buying and selling of goods and services via the internet.
Government transactions via the Internet.
e-commerce major areas:
1. e-retailing
2. Electronic Data Interchange (EDI)
3. Micro-marketing
4. Electronic security
5. Web services
26
[Diagram: Security of Data, Privacy of Data, Business Policies, Transaction Processing Integrity]
27
Security of Data
• How secure is the data maintained by the
business?
– Personal/business entity data
– data stored by a web site that is used by a
trading partner to make transaction decision
• How secure is the data as it is transmitted to
and from this business?
28
Business Policies
• What are the business policies and practices
of this business?
– billing and payment policies
– shipping policy
– return policy
– tax collection
– additional policy information
29
Transaction Processing Integrity
• What procedures are in place to ensure that
the transactions are handled as disclosed?
– How does the company ensure that it does not
lose orders placed?
– How does the company ensure that it accurately
processes bills and account information?
– What controls exist to ensure that the company
accurately posts payment in a timely fashion?
– Does the company have controls in place to
ensure that it ships the right inventory items and
quantities?
30
Privacy of Data
• What is the privacy policy of the business?
• What information does it keep?
• How will the information collected be used by the
business?
• Will this business share or sell customer data
without the customer’s permission or knowledge?
• What ensures that the company’s privacy policies
are observed and practiced on a continuous basis?
31
Security Assurance Systems ensure
that...
• The transacting parties are authenticated as who they claim to be - a security issue
• Electronic data are protected from
unauthorized disclosure - a security issue
32
Electronic Data Interchange...
• is the electronic exchange of business
documents between trading partners using a
standardized format.
• Traditional EDI
– High start-up costs
– Used primarily by large firms
– Generally, even large firms could only connect
with 20% of their trading partners
33
Cookies and State Information
A cookie is data created by a web server that is stored on the
hard drive of a user’s workstation.
This state information is used to track a user’s activity and to
predict future needs.
Information on previous viewing habits stored in a cookie can
also be used by other web sites to provide customized
content.
Many consider cookies to be an invasion of privacy.
www.cookiecentral.com
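A minimal sketch of cookie state, using Python's standard http.cookies module. The cookie name and value are hypothetical; the point is only that the server creates the data, the browser stores it, and the browser hands it back on a later visit.

```python
from http.cookies import SimpleCookie

# Server side: create a cookie recording what the user looked at.
server_cookie = SimpleCookie()
server_cookie["last_viewed"] = "networking-books"
server_cookie["last_viewed"]["path"] = "/"

# This string is what would follow "Set-Cookie:" in the HTTP response.
header_value = server_cookie["last_viewed"].OutputString()
print(header_value)

# Later visit: the browser sends the stored value back, and the
# server parses it to recover the state information.
returned = SimpleCookie()
returned.load("last_viewed=networking-books")
print(returned["last_viewed"].value)
```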
34
Cookie Control
Delete cookies after they are set
Accept no cookies, or only restricted cookies
Change permissions
www.cookiecentral.com
35
Intranets and Extranets
An intranet is a TCP/IP network inside a company that allows
employees to access the company’s information resources
through an Internet-like interface.
When an intranet is extended outside the corporate walls to
include suppliers, customers, or other external agents, the
intranet becomes an extranet.
36
Internet Protocols
To support the Internet and all its services, many protocols are
necessary.
Some of the protocols that we will look at:
• Internet Protocol (IP)
• Transmission Control Protocol (TCP)
• Address Resolution Protocol (ARP)
• Domain Name System (DNS)
37
Internet Protocols
Recall that the Internet with all its protocols follows the
Internet model.
An application, such as e-mail, resides at the highest layer.
A transport protocol, such as TCP, resides at the transport
layer.
The Internet Protocol (IP) resides at the Internet or network
layer.
A particular medium and its framing reside at the interface
layer.
38
The Internet Model
39
Network Layer
• Responsible for creating, maintaining
and ending network connections.
• Transfers a data packet from node to
node within the network.
• Message routing
• Billing
• Accounting
40
Transport Layer
• Provides an end-to-end, error-free
network connection.
• Makes sure the data arrives at the
destination exactly as it left the source.
• Makes sure all information is accounted
for:
– Missing information
– Duplicated information
41
The Internet Protocol (IP)
IP prepares a packet called a datagram for transmission across
the Internet.
The IP header is encapsulated onto a transport data packet.
The IP packet is then passed to the next layer where further
network information is encapsulated onto it.
42
Progression of a datagram packet from one network
to another
43
The Internet Protocol (IP)
Using IP, a subnet router:
Makes routing decision based on the destination address.
May have to fragment the datagram into smaller datagrams
(very rare) using Fragment Offset.
May determine that the current datagram has been hopping
around the network too long and delete it, using the Time to Live (TTL) field.
44
Format of the IP Datagram
45
The Transmission Control Protocol
(TCP)
The TCP layer creates a connection between sender and
receiver using port numbers.
The port number identifies a particular application on a
particular device (IP address).
ftp: 20
smtp: 25
http: 80
TCP can multiplex multiple connections (using port numbers)
over a single IP line.
46
The Transmission Control Protocol (TCP)
The TCP layer can ensure that the receiver is not overrun with
data (end-to-end flow control) using the Window field.
TCP can perform end-to-end error detection (Checksum) and recover from errors by retransmission.
TCP allows for the sending of high priority data (Urgent
Pointer).
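A sketch of the 16-bit one's-complement checksum used in the TCP Checksum field. It is computed here over an arbitrary byte string; a real TCP checksum also covers a pseudo-header and the TCP header itself, which this example leaves out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (odd lengths are zero-padded)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


segment = b"hello world!"
checksum = internet_checksum(segment)

# The receiver recomputes over data plus checksum; 0 means no error detected.
verified = internet_checksum(segment + checksum.to_bytes(2, "big")) == 0
print(hex(checksum), verified)
```

A checksum can only detect damage. The recovery comes from the sender retransmitting a segment that was never acknowledged.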
47
Fields of the TCP Header
48
Internet Control Message Protocol (ICMP)
ICMP, which is used by routers and nodes, performs the error
reporting for the Internet Protocol.
ICMP reports errors such as invalid IP address, invalid port
address, and the packet has hopped too many times.
49
Ping (Packet Internet Groper)
ping command
50
Ping – TCP/IP Troubleshooting
• Ping is the primary tool for troubleshooting IP-level connectivity. Type
ping -? at a command prompt to see a complete list of available
command-line options. Ping allows you to specify the size of packets
to use (the default is 32 bytes), how many to send, whether to record
the route used, what Time To Live (TTL) value to use, and whether to
set the "don't fragment" flag.
• When a ping command is issued, the utility sends an ICMP Echo
Request to a destination IP address. Try pinging the IP address of the
target host to see if it responds. If that succeeds, try pinging the target
host using a host name. Ping first attempts to resolve the name to an
address through a DNS server, then a WINS server (if one is
configured), then attempts a local broadcast. When using DNS for
name resolution, if the name entered is not a fully qualified domain
name, the DNS name resolver appends the computer's domain name or
names to generate a fully qualified domain name.
• If pinging by address succeeds but pinging by name fails, the problem
usually lies in name resolution, not network connectivity. Note that
name resolution might fail if you do not use a fully qualified domain
name for a remote name. These requests fail because the DNS name
resolver is appending the local domain suffixes to a name that resides
elsewhere in the domain hierarchy.
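The suffix behavior described above can be sketched as follows. This is a simplified model, not the actual resolver logic: it assumes a name counts as fully qualified only when it ends with a dot, and the suffix list is hypothetical.

```python
def candidate_names(name, search_suffixes):
    """Return the fully qualified names a resolver would try for name."""
    if name.endswith("."):
        return [name]  # already fully qualified, use as-is
    return [f"{name}.{suffix}." for suffix in search_suffixes]


# A short local name gets the local domain appended, which is what we want.
print(candidate_names("ist", ["psu.edu"]))

# A remote name without the trailing dot also gets the local suffix
# appended, producing a name that does not exist.
print(candidate_names("www.example.com", ["psu.edu"]))
```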
51
tracert command
tracert – trace route
52
How the TRACERT command works
• The TRACERT diagnostic utility determines the route taken to a destination
by sending Internet Control Message Protocol (ICMP) echo packets with
varying IP Time-To-Live (TTL) values to the destination. Each router along
the path is required to decrement the TTL on a packet by at least 1 before
forwarding it, so the TTL is effectively a hop count. When the TTL on a packet
reaches 0, the router should send an ICMP Time Exceeded message back to the
source computer.
• TRACERT determines the route by sending the first echo packet with a TTL of
1 and incrementing the TTL by 1 on each subsequent transmission until the
target responds or the maximum TTL is reached. The route is determined by
examining the ICMP Time Exceeded messages sent back by intermediate
routers. Note that some routers silently drop packets with expired TTLs and
are invisible to TRACERT.
• TRACERT prints out an ordered list of the routers in the path that returned the
ICMP Time Exceeded message. If the -d switch is used (telling TRACERT not
to perform a DNS lookup on each IP address), the IP address of the near-side
interface of the routers is reported.
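The TTL mechanism can be simulated without touching the network. In this sketch the path is a hypothetical list of routers, each flagged by whether it answers with Time Exceeded or drops the packet silently; no real ICMP packets are sent.

```python
def trace_route(path, destination, max_ttl=30):
    """Simulate TRACERT over path, a list of (router_address, replies) pairs."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        for router, replies in path:
            remaining -= 1  # each router decrements the TTL by 1
            if remaining == 0:
                # TTL expired here: report the router, or "*" if it stays silent.
                hops.append((ttl, router if replies else "*"))
                break
        else:
            # The packet survived every router, so the target itself answers.
            hops.append((ttl, destination))
            return hops
    return hops


# Hypothetical three-router path; the second router is silent.
path = [("10.0.0.1", True), ("192.0.2.1", False), ("198.51.100.1", True)]
for ttl, responder in trace_route(path, "203.0.113.9"):
    print(ttl, responder)
```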
53
User Datagram Protocol (UDP)
A transport layer protocol used in place of TCP.
Where TCP supports a connection-oriented application, UDP
is used with connectionless applications.
UDP also encapsulates a header onto an application packet,
but the header is much simpler than that of TCP.
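To show how simple the UDP header is, this sketch packs its four 16-bit fields (source port, destination port, length, checksum) with Python's struct module. The port numbers and payload are arbitrary, and the checksum is left at 0, which in IPv4 means none was computed.

```python
import struct

UDP_HEADER_BYTES = 8  # four 16-bit fields


def udp_header(src_port, dst_port, payload, checksum=0):
    """Pack a UDP header; the length field covers header plus payload."""
    length = UDP_HEADER_BYTES + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)


header = udp_header(5353, 53, b"query")
print(len(header), struct.unpack("!HHHH", header))
```

For comparison, a TCP header is at least 20 bytes and carries sequence numbers, acknowledgments, flags, and a window size.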
54
Address Resolution Protocol (ARP)
When an IP packet has traversed the Internet and encounters
the destination LAN, how does the packet find the destination
workstation?
Even though the destination workstation may have an IP
address, a LAN does not use IP addresses to deliver frames.
A LAN uses the MAC layer address.
ARP translates an IP address into a MAC layer address so a
frame can be delivered to the proper workstation.
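A toy model of the translation ARP performs. The slide does not describe the mechanics, so the broadcast-and-cache behavior sketched here is a simplified description of how ARP generally works, and the addresses are made up.

```python
# Hypothetical workstations on the LAN: IP address -> MAC address.
lan_hosts = {
    "192.168.1.10": "00:1a:2b:3c:4d:5e",
    "192.168.1.11": "00:1a:2b:3c:4d:5f",
}

arp_cache = {}


def resolve_mac(ip):
    """Return the MAC address for ip, asking the LAN only on a cache miss."""
    if ip not in arp_cache:
        # Stands in for broadcasting "who has this IP?" and getting a reply.
        mac = lan_hosts.get(ip)
        if mac is None:
            return None  # nobody on the LAN answered
        arp_cache[ip] = mac
    return arp_cache[ip]


print(resolve_mac("192.168.1.10"))
```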
55
Tunneling Protocols
The Internet is not normally a secure system.
If a person wants to use the Internet to access a corporate
computer system, how can a secure connection be created?
One possible technique is by creating a virtual private
network (VPN).
A VPN creates a secure connection through the Internet by
using a tunneling protocol.
56
Every workstation attached to the
Internet needs:
• Its IP address
• Its subnet mask (more on this later)
• The IP address of a router
• The IP address of a name server
57
BOOTP (you don’t have an IP address?)
Thin client workstations do not have a disk drive, and their
ROM does not contain the previous four pieces of
information.
How do we tell the machine this information? BOOTP
(Bootstrap protocol).
There are two types of BOOTP operations:
REQUEST – A workstation asks a server for the information (source
IP address = all 0s, destination IP address = all 1s).
REPLY – The server returns the information to the workstation.
58
59
Dynamic Host Configuration Protocol
(DHCP)
BOOTP is not dynamic (when a client requests its IP address,
it is retrieved from a static table).
DHCP is a dynamic extension of BOOTP.
When a DHCP client issues an IP request, the DHCP server
looks in its static table. If no entry exists, the server selects
an IP address from an available pool.
60
Dynamic Host Configuration Protocol
(DHCP)
The address assigned by the DHCP server is temporary.
Part of the agreement includes a specific period of time.
If no time period specified, the default is one hour.
DHCP clients may negotiate for a renewal before the time
period expires.
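The DHCP behavior on the last two slides can be sketched as a small class: check the static table first, fall back to the pool, and attach a lease time that defaults to one hour. The class, its method names, and the addresses are hypothetical; reclaiming expired leases is left out to keep the example short.

```python
DEFAULT_LEASE_SECONDS = 3600  # one hour, the default given on the slide


class DhcpServer:
    def __init__(self, static_table, pool):
        self.static_table = dict(static_table)  # MAC -> fixed IP
        self.free = list(pool)                  # addresses available to lease
        self.leases = {}                        # MAC -> (IP, expiry time)

    def request(self, mac, now, lease_seconds=DEFAULT_LEASE_SECONDS):
        """Return (ip, expiry) for this client, or None if the pool is empty."""
        if mac in self.static_table:
            ip = self.static_table[mac]
        elif mac in self.leases:
            ip = self.leases[mac][0]  # renewal keeps the same address
        elif self.free:
            ip = self.free.pop(0)
        else:
            return None
        self.leases[mac] = (ip, now + lease_seconds)
        return self.leases[mac]


server = DhcpServer({"aa:aa": "192.168.1.2"}, ["192.168.1.100", "192.168.1.101"])
print(server.request("aa:aa", now=0))     # found in the static table
print(server.request("bb:bb", now=0))     # taken from the pool
print(server.request("bb:bb", now=1800))  # renewal before the lease expires
```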
61
Network Address Translation (NAT)
NAT protocol lets a router represent an entire local area
network to the Internet as a single IP address.
Thus all traffic leaving this LAN appears to
originate from a single global IP address.
All traffic coming into this LAN uses this global IP address.
This security feature allows a LAN to hide all the workstation
IP addresses from the Internet.
62
NAT
Since the outside world cannot see into the LAN, you do not
need to use registered IP addresses on the inside LAN.
We can use the following blocks of addresses for private use:
•10.0.0.0 – 10.255.255.255
•172.16.0.0 – 172.31.255.255
•192.168.0.0 – 192.168.255.255
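The three private blocks can be checked with Python's standard ipaddress module. The prefix lengths (/8, /12, /16) are the CIDR form of the ranges listed above; the function name is just for illustration.

```python
import ipaddress

# The three private-use blocks from the slide, written in CIDR notation.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_private(address):
    """True if address falls inside one of the three private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)


print(is_private("172.31.255.255"), is_private("172.32.0.0"))
```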
63
NAT
When a user on the inside sends a packet to the outside, the NAT
interface changes the user’s inside address to the global IP
address. This change is stored in a cache.
When the response comes back, the NAT looks in the cache
and switches the addresses back.
If there is no cache entry, the packet is dropped, unless the NAT has a
service table of fixed IP address mappings. This service table
allows packets to originate from the outside.
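A sketch of the cache and service table. One assumption goes beyond the slide: to tell replies apart when many inside machines share one global address, this model also rewrites the port number, as NAT devices commonly do. The class, port range, and addresses are hypothetical.

```python
class Nat:
    def __init__(self, global_ip, service_table=None):
        self.global_ip = global_ip
        # Fixed mappings: global port -> (inside IP, inside port).
        self.service_table = dict(service_table or {})
        self.cache = {}          # global port -> (inside IP, inside port)
        self.next_port = 40000   # arbitrary starting port for new mappings

    def outbound(self, inside_ip, inside_port):
        """Rewrite an outgoing packet's source and remember the change."""
        key = (inside_ip, inside_port)
        for port, cached in self.cache.items():
            if cached == key:
                return self.global_ip, port
        port = self.next_port
        self.next_port += 1
        self.cache[port] = key
        return self.global_ip, port

    def inbound(self, global_port):
        """Return the inside destination, or None if the packet is dropped."""
        if global_port in self.cache:
            return self.cache[global_port]
        return self.service_table.get(global_port)


nat = Nat("203.0.113.5", service_table={80: ("192.168.1.20", 8080)})
print(nat.outbound("192.168.1.10", 5000))  # inside address replaced by global one
print(nat.inbound(40000))                  # reply switched back using the cache
print(nat.inbound(12345))                  # no entry anywhere: dropped
print(nat.inbound(80))                     # allowed in by the service table
```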
64
Locating a Document on the Internet
Every document on the Internet has a uniform
resource locator (URL) (not necessarily unique) and
an IP address (not necessarily unique).
All URLs consist of four parts:
1. Service type
2. Host or domain name
3. Directory or subdirectory information
4. Filename
65
The Parts of a Uniform Resource Locator (URL)
http://psu.edu/stuff
• http: service type
• edu: top level domain – type of organization; often followed by a country code, e.g. .uk
• psu: mid level domain – name of organization
• Top and mid levels are determined by assignment boards
• stuff, www.psu.edu: domains generated by organization
66
The Parts of a Uniform Resource Locator (URL)
New domains:
.biz
.zzz
.xxx
.dog
Who controls this?
http://www.icann.org/
67
68
Locating a Document on the Internet
When a user, running a web browser, enters a URL, how is
the URL translated into an IP address?
The Domain Name System (DNS) is a large, distributed
database of host names and IP addresses.
The tracert command performs this lookup for you when given a host name.
The first operation performed by DNS is to query a local
database for URL/IP address information.
If the local server does not recognize the address, the server at
the next level will be queried.
69
Locating a Document on the Internet
Eventually the root server for URL/IP addresses will be
queried.
If the root server has the answer, the results are returned.
If the root server recognizes the domain name but not the
extension in front of the domain name, the root server will
query the server at the domain name’s location.
When the domain’s server returns the results, they are passed
back through the chain of servers (and their caches).
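The chain of lookups and caches can be modeled very roughly as a list of tables, queried from the local server outward. This is a deliberate simplification: real DNS servers exchange referrals and time out cached entries, none of which is shown. The only real data is the psu.edu address quoted elsewhere in these slides.

```python
def resolve(name, servers):
    """Query each server table in order; cache the answer on the way back."""
    for depth, table in enumerate(servers):
        if name in table:
            address = table[name]
            # Pass the result back through the servers that were asked first.
            for cache in servers[:depth]:
                cache[name] = address
            return address
    return None


local_server = {}
next_level_server = {}
domain_server = {"psu.edu": "128.118.141.56"}
chain = [local_server, next_level_server, domain_server]

print(resolve("psu.edu", chain))   # answered by the last server in the chain
print("psu.edu" in local_server)   # now cached locally for the next query
```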
70
IP Addresses
Every device connected to the Internet has a 32-bit IP (IPv4)
address associated with it. 2^32 = total addresses?
Think of the IP address as a logical address (possibly
temporary), while the 48-bit address on every NIC is the
physical, or permanent address.
Computers, networks and routers use the 32-bit binary
address, but a more readable form is the dotted decimal
notation.
71
IP Addresses
For example, the 32-bit binary address
10000000 10011100 00001110 00000111 (4 octets)
translates to
128.156.14.7 (called dotted decimal notation)
Range of each octet is 0-255 (2^8 = 256 values)
There are basically four types of IP addresses:
Classes A, B, C and D.
A particular class address has a unique network address size
and a unique host address size.
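The binary-to-dotted-decimal translation in the example above can be reproduced in a few lines. The function name is just for illustration.

```python
def to_dotted_decimal(bits):
    """Convert four space-separated binary octets to dotted decimal notation."""
    octets = bits.split()
    return ".".join(str(int(octet, 2)) for octet in octets)


binary_address = "10000000 10011100 00001110 00000111"
print(to_dotted_decimal(binary_address))
```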
72
Four Basic Forms of an IP 32-bit Address
What is psu’s IP address?
Ping: psu.edu 128.118.141.56
Ping ist.psu.edu?
73
IP Addresses
When you examine the first decimal value in the dotted
decimal notation:
All Class A addresses are in the range 0 - 127
All Class B addresses are in the range 128 - 191
All Class C addresses are in the range 192 - 223
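The class of an address follows directly from its first decimal value. The A, B, and C ranges are the ones listed above; the Class D (224 - 239) and Class E (240 - 255) ranges are not on this slide and are added here for completeness.

```python
def address_class(address):
    """Return the class letter of a dotted decimal IPv4 address."""
    first = int(address.split(".")[0])
    if first <= 127:
        return "A"
    if first <= 191:
        return "B"
    if first <= 223:
        return "C"
    if first <= 239:
        return "D"  # not listed on the slide
    return "E"      # not listed on the slide


print(address_class("128.118.141.56"))
```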
74
IP Subnet Masking
Sometimes you have a large number of IP addresses to manage.
By using subnet masking, you can break the host ID portion
of the address into a subnet ID and host ID.
Each subnet supports a number of other hosts.
For example, the subnet mask 255.255.255.0 applied to a
class B address will break the host ID (normally 16 bits) into
an 8-bit subnet ID and an 8-bit host ID.
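A worked version of this example, applying the mask 255.255.255.0 to a class B address. The address is the psu.edu one quoted on an earlier slide; treating it as subnetted this way is an assumption made only for the arithmetic.

```python
import ipaddress

address = int(ipaddress.ip_address("128.118.141.56"))  # a class B address
mask = int(ipaddress.ip_address("255.255.255.0"))

# Class B: the first 16 bits are the network ID, the last 16 the host ID.
# The mask splits that 16-bit host ID into an 8-bit subnet ID and 8-bit host ID.
subnet_id = ((address & mask) >> 8) & 0xFF
host_id = address & ~mask & 0xFFFFFFFF
subnet_address = str(ipaddress.ip_address(address & mask))

print(subnet_id, host_id, subnet_address)
```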
75
76
The Future of the Internet
Various Internet committees are constantly working on new
and improved protocols.
Examples include:
• Internet Printing Protocol
• Internet fax
• Extensions to FTP
• Common Name Resolution Protocol
• WWW Distributed Authoring and Versioning
• Web Services
77
IPv6
http://www.ipv6.org/
The next version of the Internet Protocol.
Main features include:
• Simpler header
• 128-bit IP addresses: 2^128 = (2^10)^12 x 2^8 ≈ (10^3)^12 x 2^8 ≈ 2.6 x 10^38 (the exact value is about 3.4 x 10^38)
• Priority levels and quality of service parameters
• No fragmentation (datagram is big!)
78
Fields in the IPv6 Header
79
Internet2
http://www.internet2.edu/
A new form of the Internet is being developed by a number of
businesses and universities.
Internet2 will support very high speed data streams (Gigs).
Applications might include:
• Digital library services
• Tele-immersion
• Virtual laboratories
80
The Internet In Action:
A Company Creates a VPN
A fictitious company wants to allow 3500 of its workers to
work from home.
If all 3500 users used a dial-in service, the telephone costs
would be very high.
81
82
The Internet In Action: A Company
Creates a VPN
Instead, the company will require each user to access the
Internet via their local Internet service provider.
This local access will help keep telephone costs low.
Then, once on the Internet, the company will provide
software to support virtual private networks.
The virtual private networks will create secure connections
from the users’ homes into the corporate computer system.
83
84
Your old web pages!!!
Internet Archive
www.archive.org
• Founded in 1996 by Brewster Kahle.
• Maintains many, many TBs of Internet data, including
snapshots of
– World Wide Web
– Usenet
– Gopher
– FTP archives
• Goals:
– Accumulate and preserve digital information for the long term that
would otherwise be lost.
– Provide access to researchers, journalists, historians and others.
85
Bow-tie Theory of the Web
200 million URLs (over a billion links) explored - Broder et al., WWW9 ’00
86
How Big is the Publicly Indexable Web?
• Feb’99: estimate 16 million total web servers
reduces to about 2.8 million servers for the
publicly indexable web
• Average number of pages per site was 289
• Estimated total number of pages on the web about
800 million
• Current estimate – 3 to 5 billion pages
From a random sample of IP addresses (address space 256^4 or about 4.3
billion)
87
Volume of Information on Web - Feb ’99
• Mean page size was 19k (median 4k)
• Total amount of data: about 15 terabytes of pages
• About 6 terabytes after removing comments, extra
whitespace, and HTML tags
• About 63 images per server, mean image size 15k
(median 6k)
• About 180 million images on the publicly
indexable web, about 3 terabytes of image data
88
What’s on the web?
89
Distribution of the content of WWW Information
• %’s of manually classified homepage of first 2,500
randomly found web servers
• 83% of sites commercial
– Off scale for this chart
• Percentage of sites in areas like science, health, and
government relatively small
– Would be feasible and very valuable to create
specialized search services that are very
comprehensive and up to date
• 65% of sites have a majority of pages in English
90
Web Search Techniques
- 85% of users use search engines to locate information
(GVU survey)
- Several search engines consistently rank in the top 10
sites accessed on the web
• Full-text indexes
• Hierarchical directories
• Specialized or niche search services
• What’s related (Alexa/Netscape)
• Collaborative filtering
• Notification systems
• Softbots
91
Search Engines
• Lots: over 3000? - 20 make up 98% of all searches done
on the web
• Business models are often not just search!
• AltaVista (summer, 1998):
– Indexes about 0.8 Tb (index about 30% of the size of
the grabbed data)
– Every word indexed
– About 37 million queries on weekdays
– Mean response time of 0.6 seconds
– About 20 64-bit machines
• 10 CPU, 625 MHz, 12Gb RAM, 300 Gb RAID
(each)
• Google (spring, 2000):
– 2500 PCs, buy 30 a day, discard them when they break
92
93
Search Engine Architecture
• Web crawler that crawls the web and harvests data – html, text,
etc.
• Indexer that indexes some of the crawled pages
• Query engine that queries the index and presents results
• Query interface
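The indexer and query engine can be illustrated with a tiny inverted index. The pages dictionary stands in for what a crawler would have harvested; the URLs and text are invented, and real engines add ranking, stemming, and far more compact storage.

```python
from collections import defaultdict


def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def query(index, terms):
    """Query engine: return the URLs that contain every query term."""
    sets = [index.get(term.lower(), set()) for term in terms.split()]
    return set.intersection(*sets) if sets else set()


# Stand-in for crawler output (hypothetical pages).
pages = {
    "http://a.example/": "Internet protocols and TCP",
    "http://b.example/": "TCP checksum",
    "http://c.example/": "search engines",
}
index = build_index(pages)
print(sorted(query(index, "tcp checksum")))
```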
94
A Typical Web Search Engine
[Diagram components: Users, Interface, Query Engine, Index, Indexer, Crawler, Web]
95
Ways to compare search engines
• Relevance ranking
• Coverage (comments once seen in the press)
– “If you can’t find it using XXX search, it’s probably not out there”
– “HotBot is the first search robot capable of indexing and searching the
entire web”
• Recency (comment once seen in the press)
– “[With XXX] you can find new information just about as quickly as it's
available on the Web”
• Functionality (e.g. query syntax)
• Speed
• Availability
• Usability
• Time/ability to satisfy user requests
96
Ranking Options
Special factors
• Conventional methods (e.g., tf.idf) were developed for
homogeneous collections, e.g., items of similar length
• Some items are deliberately constructed to distort indexing
Options
• Vector space ranking with corrections for document length
• Extra weighting for specific fields, e.g., title, anchors, etc.
• Link structure, e.g., Google's PageRank, Kleinberg's Hubs
and Authorities
97
Google (Page, Brin)
• 2nd Generation Search Engine!
• Makes greater use of HTML structure and the graph
formed by hyperlinks between pages
• PageRank
– Iteratively uses information about the number of pages pointing to
a page in order to estimate the popularity of a page
– Links from more popular pages count more
• Uses the text in links to a page
– Link descriptions may describe a page better than the page itself
• Yahoo’s search engine
www.google.com
98
99
PageRank and Google
• Prestige of a page is proportional to
sum of prestige of citing pages
• Standard bibliometric measure of
influence
• Simulate a random walk on the
Web to precompute prestige of all
pages
• Sort keyword-matched responses
by decreasing prestige
[Figure: pages p1, p2 and p3 each link to p4; follow random outlink from page]
p4 ∝ p1 + p2 + p3
I.e., p = Ep
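A minimal power-iteration sketch of the idea that prestige flows along links. The damping factor of 0.85 and the extra link from p4 back to p1 are assumptions added so the example converges on a small graph; this is an illustration of the technique, not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # A page with no outlinks spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share  # follow a random outlink
        rank = new_rank
    return rank


# p1, p2 and p3 all cite p4; the link from p4 to p1 is an added assumption.
links = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Because three pages cite p4, it ends up with the highest prestige, and p1 inherits most of that prestige through the single link from p4.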
100
Google Architecture
• Perl with C/C++
• Linux
• Module-based architecture
• Multi-machine
• Multi-thread
[Diagram components: Crawler, Store Server, URL Server, Anchors, Repository,
URL Resolver, Indexer, Lexicon, Links, Doc Index, Barrels, Sorter, Pagerank, Searchers]
101
Metasearch Engines or Tools
[Diagram: Information Need → Query → Search Engine #1, #2, #3, etc. → Fusion Policy → Result Set]
• Single search engine coverage is low, maximum of 16%
– Querying multiple can significantly improve coverage
• Query is sent to several search engines simultaneously
– Policies?
• Results are fused by a fusion policy
– Similar, but slightly different from an ordering policy
• Fusion at many levels
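One possible fusion policy, sketched for illustration: score each URL by the sum of its reciprocal ranks across the engines, then sort. The slide does not name a specific policy, so this choice and the sample result lists are assumptions.

```python
def fuse(result_lists):
    """Merge ranked result lists from several engines into one ranking."""
    scores = {}
    for results in result_lists:
        for position, url in enumerate(results):
            # A result near the top of any engine's list earns more credit.
            scores[url] = scores.get(url, 0.0) + 1.0 / (position + 1)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


engine_1 = ["a", "b", "c"]
engine_2 = ["b", "a"]
engine_3 = ["b", "d"]
print(fuse([engine_1, engine_2, engine_3]))
```

The fused list contains results that no single engine returned on its own, which is how querying several engines improves coverage.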
102
http://www.beaucoup.com/
103
Search Engine Coverage - 11 engines Feb ‘99
• Combined coverage
with respect to each
other
• With respect to each
other compared to total
web size
• Combined coverage - 42%
104
Search Engines Sizes
searchenginewatch.com
105
searchenginewatch.com
106
Covered
• Protocols – XXXP
• URL and DNS
• IP addresses
• Search engines
107