ir_spring06_lec08


Basic WWW Technologies
2.1 Web Documents.
2.2 Resource Identifiers: URI, URL, and URN.
2.3 Protocols.
2.4 Log Files.
2.5 Search Engines.
What Is the World Wide Web?
The world wide web (web) is a network of
information resources. The web relies on three
mechanisms to make these resources readily
available to the widest possible audience:
1. A uniform naming scheme for locating resources
on the web (e.g., URIs).
2. Protocols, for access to named resources over
the web (e.g., HTTP).
3. Hypertext, for easy navigation among resources
(e.g., HTML).
2
Internet vs. Web
Internet:
• Internet is a more general term
• Includes physical aspect of underlying networks
and mechanisms such as email, FTP, HTTP…
Web:
• Associated with information stored on the
Internet
• The term also refers to a broader class of networks, e.g. the web
of English literature
Both Internet and web are networks
3
Essential Components of WWW
Resources:
• Conceptual mappings to concrete or abstract entities, which do not
change in the short term
• ex: ICS website (web pages and other kinds of files)
Resource identifiers (hyperlinks):
• Strings of characters representing generalized addresses that may
contain instructions for accessing the identified resource
• http://www.ics.uci.edu is used to identify the ICS homepage
Transfer protocols:
• Conventions that regulate the communication between a browser
(web user agent) and a server
4
Standard Generalized Markup
Language (SGML)
• Based on GML (generalized markup language),
developed by IBM in the 1960s
• An international standard (ISO 8879:1986)
defines how descriptive markup should be
embedded in a document
• Gave birth to the extensible markup language
(XML), W3C recommendation in 1998
5
SGML Components
SGML documents have three parts:
• Declaration: specifies which characters and delimiters
may appear in the application
• DTD/ style sheet: defines the syntax of markup
constructs
• Document instance: actual text (with the tag) of the
documents
More info can be found at:
http://www.w3.org/MarkUp/SGML
6
DTD Example One
<!ELEMENT UL - - (LI)+>
• ELEMENT is a keyword that introduces a new
element type unordered list (UL)
• The two hyphens indicate that both the start tag
<UL> and the end tag </UL> for this element
type are required
• The content model (LI)+ means that the element must contain one
or more list items (LI) between the two tags
7
DTD Example Two
<!ELEMENT IMG - O EMPTY>
• The element type being declared is IMG
• The hyphen and the following "O" indicate
that the end tag can be omitted
• Together with the content model "EMPTY",
this is strengthened to the rule that the end
tag must be omitted. (no closing tag)
8
HTML Background
• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the
Mosaic browser developed at NCSA.
• The Web depends on Web page authors and
vendors sharing the same conventions for
HTML. This has motivated joint work on
specifications for HTML.
• HTML standards are organized by W3C :
http://www.w3.org/MarkUp/
9
HTML Functionalities
HTML gives authors the means to:
• Publish online documents with headings, text, tables,
lists, photos, etc
– Include spread-sheets, video clips, sound clips, and other
applications directly in their documents
• Link information via hypertext links, at the click of a
button
• Design forms for conducting transactions with remote
services, for use in searching for information, making
reservations, ordering products, etc
10
HTML Versions
• HTML 4.01 is a revision of the HTML 4.0 Recommendation first
released on 18th December 1997.
– HTML 4.01 Specification:
http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
• HTML 4.0 was first released as a W3C Recommendation on 18
December 1997
• HTML 3.2 was W3C's first Recommendation for HTML which
represented the consensus on HTML features for 1996
• HTML 2.0 (RFC 1866) was developed by the IETF's HTML
Working Group, which set the standard for core HTML
features based upon current practice in 1994.
11
Sample Webpage
12
Sample Webpage HTML
Structure
<HTML>
<HEAD>
<TITLE>The title of the webpage</TITLE>
</HEAD>
<BODY> <P>Body of the webpage
</BODY>
</HTML>
13
HTML Structure
• An HTML document is divided into a head section
(here, between <HEAD> and </HEAD>) and a body
(here, between <BODY> and </BODY>)
• The title of the document appears in the head (along
with other information about the document)
• The content of the document appears in the body. The
body in this example contains just one paragraph,
marked up with <P>
14
HTML Hyperlink
<a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource
to another
• It has two ends, called anchors, and a direction
• Starts at the "source" anchor and points to the
"destination" anchor, which may be any Web
resource (e.g., an image, a video clip, a sound
bite, a program, an HTML document)
15
Resource Identifiers
URI: Uniform Resource Identifiers
• URL: Uniform Resource Locators
• URN: Uniform Resource Names
16
Introduction to URIs
Every resource available on the Web has an
address that may be encoded by a URI
URIs typically consist of three pieces:
• The naming scheme of the mechanism used
to access the resource. (HTTP, FTP)
• The name of the machine hosting the
resource
• The name of the resource itself, given as a
path
17
URI Example
http://www.w3.org/TR
• There is a document available via the HTTP
protocol
• Residing on the machines hosting www.w3.org
• Accessible via the path "/TR"
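These three pieces can be pulled apart programmatically; a minimal sketch using Python's standard urllib.parse on the URI above:

# Split a URI into its scheme, host, and path components (Python standard library).
from urllib.parse import urlparse

parts = urlparse("http://www.w3.org/TR")
print(parts.scheme)    # 'http'        -> naming scheme / access mechanism
print(parts.netloc)    # 'www.w3.org'  -> machine hosting the resource
print(parts.path)      # '/TR'         -> name of the resource, given as a path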
18
Protocols
Describe how messages are encoded and
exchanged
Different Layering Architectures
• ISO OSI 7-Layer Architecture
• TCP/IP 4-Layer Architecture
19
ISO OSI Layering Architecture
20
ISO’s Design Principles
• A layer should be created where a different level
of abstraction is needed
• Each layer should perform a well-defined
function
• The layer boundaries should be chosen to
minimize information flow across the interfaces
• The number of layers should be large enough
that distinct functions need not be thrown
together in the same layer, and small enough
that the architecture does not become unwieldy
21
TCP/IP Layering Architecture
22
TCP/IP Layering Architecture
• A simplified model that provides end-to-end reliable connections
• The network layer
– Hosts drop packets into this layer; the layer routes them towards their destination
– Only promises best-effort ("try my best") delivery
• The transport layer
– Reliable byte-oriented stream
23
Hypertext Transfer Protocol (HTTP)
• A protocol used to carry WWW traffic between a browser and a
server over a connection-oriented transport (TCP)
• An application-level protocol that runs on top of TCP, one of the
Internet's transport-layer protocols
• HTTP communication is established via a TCP connection,
typically to server port 80
24
GET Method in HTTP
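A minimal sketch of a GET request issued over a TCP connection to server port 80, using Python's standard http.client (the host and path are only examples):

# Issue an HTTP GET over a TCP connection to server port 80 (Python standard library).
import http.client

conn = http.client.HTTPConnection("www.w3.org", 80)   # open TCP connection to port 80
conn.request("GET", "/TR")                             # GET method + path of the resource
resp = conn.getresponse()
print(resp.status, resp.reason)                        # status code returned by the server
conn.close()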
25
Domain Name System
DNS (domain name system): mapping from
domain names to IP addresses
IPv4:
• IPv4 was initially deployed on January 1, 1983 and
is still the most commonly used version.
• 32-bit addresses, written as four decimal numbers
separated by dots, ranging from 0.0.0.0 to
255.255.255.255.
IPv6:
• Revision of IPv4 with 128-bit addresses
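A minimal sketch of the name-to-address lookup using Python's standard socket module (the hostname is just the example site used earlier in the lecture):

# Resolve a domain name to an IPv4 address (Python standard library).
import socket

ip = socket.gethostbyname("www.ics.uci.edu")   # returns a dotted-decimal IPv4 string
print(ip)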
26
Top Level Domains (TLD)
Top level domain names, .com, .edu, .gov and ISO
3166 country codes
There are three types of top-level domains:
• Generic domains were created for use by the Internet
public
• Country code domains were created to be used by
individual countries
• The .arpa (Address and Routing Parameter Area) domain is
designated to be used exclusively for Internet-infrastructure purposes
27
Registrars
• Domain names ending with .aero, .biz,
.com, .coop, .info, .museum, .name, .net,
.org, or .pro can be registered through
many different companies (known as
"registrars") that compete with one another
• InterNIC at http://internic.net
• Registrars Directory:
http://www.internic.net/regist.html
28
Server Log Files
Server Transfer Log: transactions between a
browser and server are logged
• IP address and the time of the request
• Method of the request (GET, HEAD, POST…)
• Status code, the response from the server
• Size in bytes of the transaction
Referrer Log: where the request originated
Agent Log: browser software making the request (spider)
Error Log: request resulted in errors (404)
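As an illustration of these fields, a small sketch that parses one transfer-log line in the common Apache log format (the sample line below is made up):

# Parse one server transfer-log line (common log format) into the fields above.
import re

line = '127.0.0.1 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
pattern = (r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
           r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)')
m = re.match(pattern, line)
if m:
    print(m.group('ip'), m.group('time'), m.group('method'),
          m.group('status'), m.group('size'))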
29
Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search
engines
• Which keywords were searched
• How many clicks/page views a page received
• Error reports, like broken links
30
Server Log Analysis
31
Search Engines
According to Pew Internet Project Report
(2002), search engines are the most popular
way to locate information online
• About 33 million U.S. Internet users query on
search engines on a typical day.
• More than 80% have used search engines
Search Engines are measured by coverage and
recency
32
Coverage
Overlap analysis used for estimating the
size of the indexable web
• W: set of webpages
• Wa, Wb: pages crawled by two independent
engines a and b
• P(Wa), P(Wb): probabilities that a page was
crawled by a or b
• P(Wa)=|Wa| / |W|
• P(Wb)=|Wb| / |W|
33
Overlap Analysis
• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb) = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
P(Wa ∩ Wb) = P(Wa) * P(Wb)
• P(Wa ∩ Wb | Wb) = P(Wa) * P(Wb) / P(Wb) = P(Wa) = |Wa| / |W|
• So the measured overlap fraction |Wa ∩ Wb| / |Wb| estimates P(Wa),
giving |W| ≈ |Wa| * |Wb| / |Wa ∩ Wb|
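A small worked sketch with made-up numbers (not figures from the actual studies): if engine a indexed 200 million pages, engine b indexed 150 million, and 50 million pages are in both indexes, the estimate works out as follows.

# Capture-recapture style estimate of the indexable web size (illustrative numbers only).
size_a = 200e6          # |Wa|
size_b = 150e6          # |Wb|
overlap = 50e6          # |Wa ∩ Wb|

p_a = overlap / size_b          # overlap fraction estimates P(Wa) = |Wa| / |W|
total = size_a / p_a            # |W| ≈ |Wa| * |Wb| / |Wa ∩ Wb|
print(int(total))               # about 600,000,000 pages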
34
Overlap Analysis
Using |W| = |Wa|/ P(Wa), the researchers
found:
• Web had at least 320 million pages in
1997
• 60% of web was covered by six major
engines
• Maximum coverage of a single engine was
1/3 of the web
35
How to Improve the Coverage?
• Meta-search engine: dispatches the user query to several
engines at the same time, then collects and merges the results
into one list for the user.
• Any suggestions?
36
Web Crawler
• A crawler is a program that picks up a
page and follows all the links on that page
• Crawler = Spider
• Types of crawler:
– Breadth First
– Depth First
37
Breadth First Crawlers
Use breadth-first search (BFS) algorithm
• Get all links from the starting page and add them to a queue
• Pick the first link from the queue, get all links on that page
and add them to the queue
• Repeat the above step until the queue is empty (see the sketch below)
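A minimal sketch of this procedure in Python; the link extraction is a crude regular expression rather than a real HTML parser, robots.txt handling is omitted, and the seed URL is only an example.

# Breadth-first crawler sketch: a FIFO queue of frontier URLs and a set of visited pages.
import re
import urllib.request
from collections import deque

def fetch_links(url):
    # Download a page and return the absolute http links found in it (very naive).
    try:
        html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
    except Exception:
        return []
    return re.findall(r'href="(http[^"]+)"', html)

def crawl_bfs(seed, max_pages=20):
    queue = deque([seed])             # FIFO frontier
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()         # take the oldest discovered link first
        if url in visited:
            continue
        visited.add(url)
        queue.extend(fetch_links(url))
    return visited

print(crawl_bfs("http://www.ics.uci.edu"))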
38
Breadth First Crawlers
39
Depth First Crawlers
Use the depth-first search (DFS) algorithm
• Get the first non-visited link from the start page
• Visit the link and get its first non-visited link
• Repeat the above step until there are no non-visited links left
• Go back to the next non-visited link in the previous level and
repeat the second step (a sketch follows below)
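The depth-first behavior can be obtained from the breadth-first sketch above by replacing the FIFO queue with a LIFO stack; a minimal sketch reusing the fetch_links helper from that sketch:

# Depth-first variant: the frontier is a LIFO stack, so the crawler keeps
# following the most recently discovered link, backtracking when a page
# yields no new links (fetch_links is the helper from the breadth-first sketch).
def crawl_dfs(seed, max_pages=20):
    stack = [seed]                    # LIFO frontier
    visited = set()
    while stack and len(visited) < max_pages:
        url = stack.pop()             # take the most recently added link first
        if url in visited:
            continue
        visited.add(url)
        stack.extend(fetch_links(url))
    return visited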
40
Depth First Crawlers
41
WEB GRAPHS
Internet/Web as Graphs
Graph of the physical layer, with routers, computers, etc. as
nodes and physical connections as edges
It is limited: it does not capture the graphical connections
associated with the information on the Internet
Web graph: nodes represent web pages and edges are
associated with hyperlinks
43
Web Graph
http://www.touchgraph.com/TGGoogleBrowser.html
44
Web Graph Considerations
Edges can be directed or undirected
Graph is highly dynamic
Nodes and edges are added/deleted often
Content of existing nodes is also subject to
change
Pages and hyperlinks created on the fly
Apart from primary connected component
there are also smaller disconnected
components
45
Why the Web Graph?
Example of a large, dynamic, and distributed graph
Possibly similar to other complex graphs in
social, biological and other systems
Reflects how humans organize information
(relevance, ranking) and their societies
Efficient navigation algorithms
Study behavior of users as they traverse
the web graph (e-commerce)
46
Statistics of Interest
Size and connectivity of the graph
Number of connected components
Distribution of pages per site
Distribution of incoming and outgoing
connections per site
Average and maximal length of the shortest
path between any two vertices (diameter)
47
Properties of Web Graphs
Connectivity follows a power law distribution
The graph is sparse
|E| = O(n), or at least o(n²)
Average number of hyperlinks per page roughly
a constant
A small world graph
48
Power Law Size
Simple estimates suggest over a billion
nodes
Distribution of site sizes, measured by the number of pages,
follows a power law distribution
Observed over several orders of magnitude, with an
exponent γ in the 1.6-1.9 range
49
Power Law Connectivity
Distribution of the number of connections per node
follows a power law distribution
A study at Notre Dame University reported
γ = 2.45 for the outdegree distribution
γ = 2.1 for the indegree distribution
Random graphs, in contrast, have a Poisson degree
distribution (for large n), which decays exponentially fast
to 0 as k increases towards its maximum value n-1
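For reference, the two standard forms being contrasted here (general formulas, not numbers from the study): a power-law degree distribution behaves as
P(k) ∝ k^(-γ)
while a random (Erdős–Rényi) graph with edge probability p has the binomial, approximately Poisson, degree distribution
P(k) ≈ e^(-λ) λ^k / k!  with  λ = p(n-1)
whose tail decays faster than any power of k.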
50
Power Law Distribution Examples
http://www.pnas.org/cgi/reprint/99/8/5207.pdf
51
Examples of networks with
Power Law Distribution
Internet at the router and interdomain level
Citation network
Collaboration network of actors
Networks associated with metabolic
pathways
Networks formed by interacting genes and
proteins
Network of nervous system connections in C.
elegans
52
Small World Networks
It is a ‘small world’
Millions of people. Yet, separated by “six
degrees” of acquaintance relationships
Popularized by Milgram’s famous experiment
Mathematically
Diameter of graph is small (log N) as compared
to overall size
Property seems interesting given the 'sparse' nature
of the graph, but …
This property is ‘natural’ in ‘pure’ random graphs
53
The small world of WWW
Empirical study of the web graph reveals the small-world property
Average distance (d) in a simulated web:
d = 0.35 + 2.06 log(n)
e.g., for n = 10^9, d ≈ 19
Graph generated using power-law model
Diameter properties inferred from sampling
Calculation of max. diameter computationally
demanding for large values of n
54
Implications for Web
Logarithmic scaling of diameter makes
future growth of web manageable
A 10-fold increase in web pages results in only about 2
additional 'clicks', but …
Users may not take shortest path, may use
bookmarks or just get distracted on the way
Therefore search engines play a crucial role
55
Some theoretical considerations
Classes of small-world networks
Scale-free: Power-law distribution of connectivity
over entire range
Broad-scale: Power-law over “broad range” + abrupt
cut-off
Single-scale: Connectivity distribution decays
exponentially
56
Power Law of PageRank
Assess importance of a page relative to a
query and rank pages accordingly
Importance measured by indegree
Not reliable since it is entirely local
PageRank – proportion of time a random
surfer would spend on that page at steady
state
A random first order Markov surfer at each
time step travels from one page to another
57
PageRank contd
The PageRank r(v) of page v is the steady-state
distribution obtained by solving the system
of linear equations
r(v) = Σ_{u ∈ pa[v]} r(u) / ch[u]
where pa[v] = the set of parent (in-linking) nodes of v
and ch[u] = the outdegree of u
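A minimal power-iteration sketch of this steady-state computation on a tiny made-up three-page graph; note that the full PageRank algorithm also adds a damping/teleportation term, which is omitted here.

# Power-iteration sketch of the steady-state equation above on a tiny example graph.
# Graph (made up): A -> B, A -> C, B -> C, C -> A
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

rank = {v: 1.0 / len(out_links) for v in out_links}   # start from the uniform distribution

for _ in range(100):
    new_rank = {v: 0.0 for v in out_links}
    for u, children in out_links.items():
        share = rank[u] / len(children)     # r(u) / ch[u]
        for v in children:
            new_rank[v] += share            # accumulate over parents u of v
    rank = new_rank

print(rank)   # roughly {'A': 0.4, 'B': 0.2, 'C': 0.4}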
58
Examples
Log Plot of PageRank Distribution of Brown Domain
(*.brown.edu)
G. Pandurangan, P. Raghavan, E. Upfal, "Using PageRank to Characterize Web Structure", COCOON 2002
59
Bow-tie Structure of Web
A large-scale study (AltaVista crawls)
reveals interesting properties of the web
Study of 200 million nodes & 1.5 billion links
Small-world property not applicable to entire
web
Some parts unreachable
Others have long paths
Power-law connectivity holds though
Page indegree (γ = 2.1), outdegree (γ = 2.72)
60
Bow-tie Components
Strongly Connected
Component (SCC)
Core with small-world property
Upstream (IN)
Core can’t reach IN
Downstream (OUT)
OUT can’t reach core
Disconnected (Tendrils)
61
Component Properties
Each component is roughly the same size,
~50 million nodes
Tendrils not connected to SCC
But reachable from IN and can reach OUT
Tubes: directed paths IN->Tendrils->OUT
Disconnected components
Maximal and average diameter is infinite
62
Empirical Numbers for Bow-tie
Maximal minimal (?) diameter
28 for SCC, 500 for entire graph
Probability of a path between any 2 nodes
~1 quarter (0.24)
Average length
16 (directed path exists), 7 (undirected)
Shortest directed path between 2 nodes in
SCC: 16-20 links on average
63
Models for the Web Graph
Stochastic models that can explain, or at least
partially reproduce, properties of the web
graph
The model should follow the power law
distribution properties
Represent the connectivity of the web
Maintain the small world property
64
Web Page Growth
Empirical studies observe a power law
distribution of site sizes
Size includes size of the Web, number of IP
addresses, number of servers, average size
of a page etc
A generative model has been proposed to
account for this distribution
65
Component One of the
Generative Model
The first component of this model is that
“ sites have short-term size fluctuations up or
down that are proportional to the size of the
site “
A site with 100,000 pages may gain or lose
a few hundred pages in a day whereas the
effect is rare for a site with only 100 pages
66
Component Two of the
Generative Model
There is an overall growth rate a so that the
size S(t) satisfies
S(t+1) = a (1 + η_t β) S(t)
where
- η_t is the realization of a ±1 Bernoulli
random variable at time t with probability 0.5
- β is the absolute rate of the daily
fluctuations
67
Component Two of the
Generative Model contd
After T steps
S(T) = a^T · Π_{t=1..T} (1 + η_t β) · S(0)
so that
log S(T) = T log a + Σ_{t=1..T} log(1 + η_t β) + log S(0)
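A minimal simulation sketch of this multiplicative growth process (the values of a, β, T, the initial size, and the number of sites are arbitrary); the log-sizes it produces should look roughly normal, as argued on the next slides.

# Simulate S(t+1) = a * (1 + eta_t * beta) * S(t) for many independent sites.
import math
import random

a, beta, T = 1.001, 0.05, 500      # illustrative parameter values only
sizes = []
for _ in range(2000):              # 2,000 simulated sites
    s = 100.0                      # initial size S(0)
    for _ in range(T):
        eta = random.choice([-1, 1])        # +/-1 Bernoulli fluctuation
        s = a * (1 + eta * beta) * s
    sizes.append(s)

logs = [math.log(s) for s in sizes]
mean = sum(logs) / len(logs)
var = sum((x - mean) ** 2 for x in logs) / len(logs)
print(mean, var)   # the log-sizes cluster around a normal(mean, var) shape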
68
Theoretical Considerations
Assuming the η_t are independent, by the central limit
theorem it is clear that for large values of
T, log S(T) is normally distributed
The central limit theorem states that given a distribution
with a mean μ and variance σ2, the sampling
distribution of the mean approaches a normal
distribution with a mean (μ) and a variance σ2/N as N,
the sample size, increases.
http://davidmlane.com/hyperstat/A14043.html
69
Theoretical Considerations
contd
log S(T) can also be associated with a binomial
distribution counting the number of times η_t = +1
Hence S(T) has a log-normal distribution
The probability density and cumulative distribution
functions for the log normal distribution
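For reference, the standard log-normal formulas referred to here (with μ and σ² the mean and variance of log S) are:
pdf:  f(x) = 1 / (x σ √(2π)) · exp( -(ln x - μ)² / (2σ²) ),  for x > 0
cdf:  F(x) = Φ( (ln x - μ) / σ ),  where Φ is the standard normal CDF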
70
Modified Model
The model can be modified to obey a power law
distribution
It is modified to include the following in order
to obey the power law:
A wide distribution of growth rates across
different sites, and/or
The fact that sites have different ages
71
Capturing Power Law Property
In order to capture the power law property, it is
sufficient to consider that
Web sites are being continuously created
Web sites grow at a constant rate a during a growth
period after which their size remains approximately
constant
The periods of growth follow an exponential distribution
This gives the relation λ = 0.8a between the rate λ
of the exponential distribution and the growth
rate a when the power law exponent is γ = 1.8
72
Lattice Perturbation (LP)
Models
Some Terms
“Organized Networks” (a.k.a Mafia)
Each node has same degree k and neighborhoods
are entirely local
Probability of edge (a,b) = 1 if dist(a,b) = 1, and 0 otherwise
Note: We are talking about graphs that can
be mapped to a Cartesian plane
73
Terms (Cont’d)
Organized Networks
Are ‘cliquish’ (Subgraph that is fully connected)
in local neighborhood
Probability of edges across neighborhoods is
almost nonexistent (p = 0 for fully organized)
"Disorganized" Networks
'Long-range' edges exist
Completely disorganized <=> fully random
(Erdős model): p = 1
74
Semi-organized (SO) Networks
Probability for long-range edge is between
zero and one
Clustered at local level (cliquish)
But have long-range links as well
Leads to networks that
Are locally cliquish
And have short path
lengths
75
Creating SO Networks
Step 1:
Take a regular network (e.g. lattice)
Step 2:
Shake it up (perturbation)
Step 2 in detail:
For each vertex, pick a local edge
'Rewire' the edge into a long-range edge with a
probability (p)
p=0: organized, p=1: disorganized (see the sketch below)
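A minimal sketch of this rewiring step on a ring lattice (a small Watts–Strogatz-style construction; the values of n, k, and p are only illustrative):

# Ring lattice with n nodes, each connected to its k nearest neighbors,
# then each local edge is rewired to a random long-range edge with probability p.
import random

def semi_organized_network(n=20, k=4, p=0.1):
    edges = set()
    for v in range(n):
        for j in range(1, k // 2 + 1):            # k/2 neighbors on each side
            edges.add((v, (v + j) % n))           # local (short-range) edge
    rewired = set()
    for (u, v) in list(edges):
        if random.random() < p:                   # rewire with probability p
            w = random.randrange(n)
            if w != u and (u, w) not in edges and (w, u) not in edges:
                rewired.add((u, w))               # long-range replacement edge
                continue
        rewired.add((u, v))                       # otherwise keep the local edge
    return rewired

print(len(semi_organized_network()))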
76
Statistics of SO Networks
Average Diameter (d): Average distance
between two nodes
Average Clique Fraction (c)
Given a vertex v, k(v): neighbors of v
Max edges among k(v) = k(k-1)/2
Clique Fraction (cv): (Edges present) / (Max)
Average clique fraction: average over all nodes
Measures: Degree to which “my friends are
friends of each other”
77
Statistics (Cont’d)
Statistics of common networks:

Network      n         k     d      c
Actors       225,226   61    3.65   0.79
Power grid   4,941     2.67  18.7   0.08
C. elegans   282       14    2.65   0.28

Large k = large c?
Small c = large d?
78
Other Properties
For graph to be sparse but connected:
n >> k >> log(n) >>1
As p --> 0 (organized)
d ~= n/(2k) >> 1, c ~= 3/4
Highly clustered & d grows linearly with n
As p --> 1 (disorganized)
d ~= log(n)/log(k) , c ~= k/n << 1
Poorly clustered & d grows logarithmically with
n
79
Effect of ‘Shaking it up’
Small shake (p close to zero)
High cliquishness AND short path lengths
Larger shake (p increased further from 0)
d drops rapidly (increased small world
phenomenon)
c remains constant (transition to small world
almost undetectable at local level)
Effect of long-range link:
Addition: non-linear decrease of d
Removal: small linear decrease of c
80
LP and The Web
LP has severe limitations
No concept of short or long links in Web
A page in USA and another in Europe can be joined
by one hyperlink
Edge rewiring doesn’t produce power-law
connectivity!
Degree distribution bounded & strongly concentrated
around mean value
Therefore, we need other models …
81