The Web - DidaWiki
Information Retrieval
Web Search
Goal of a Search Engine
Retrieve docs that are “relevant” for the user query
Doc: Word or PDF file, web page, email, blog, e-book, ...
Query: the “bag of words” paradigm
Relevant ?!?
Two main difficulties
The Web:
Extracting “significant data” is difficult !!
Size: more than tens of billions of pages
Language and encodings: hundreds…
Distributed authorship: SPAM, format-less,…
Dynamic: in one year 35% survive, 20% untouched
The User:
Matching “user needs” is difficult !!
Query composition: short (2.5 terms avg) and imprecise
Query results: 85% users look at just one result-page
Several needs: Informational, Navigational, Transactional
Evolution of Search Engines
First generation -- use only on-page, web-text data
1995-1997: AltaVista, Excite, Lycos, etc.
Second generation -- use off-page, web-graph data
Word frequency and language
Link (or connectivity) analysis
Anchor-text (How people refer to a page)
1998: Google
Third generation -- answer “the need behind the query”
Focus on “user need”, rather than on query
Integrate multiple data-sources
Click-through data
Google, Yahoo, MSN, ASK, …
Fourth generation -- Information Supply
[Andrei Broder, VP emerging search tech, Yahoo! Research]
This is a search engine!!!
Two new approaches
Sponsored search (AdWords): ads driven by the search keywords (and the profile of the user issuing them)
Context match (AdSense): ads driven by the content of a web page (and the profile of the user reaching that page)
Information Retrieval
The structure of a Search Engine
The structure
[Block diagram] The Crawler, driven by a Control module, fills the Page archive; a Page analyzer processes the archived pages; the Indexer builds the text index and auxiliary structures; at query time, the Query resolver and the Ranker use these structures to answer the query.
Information Retrieval
The Web Graph
The Web’s Characteristics
Size
About 1 trillion pages available (Google, 7/2008)
5-40 KB per page => hundreds of terabytes
Size grows every day!!
Change
8% new pages and 25% new links every week
Lifetime of a page is about 10 days
The Bow Tie
Some definitions
Weakly connected components (WCC)
Set of nodes such that from any node you can reach any other node via an undirected path.
Strongly connected components (SCC)
Set of nodes such that from any node you can reach any other node via a directed path.
On observing the Web graph
We do not know which percentage of it we know
The only way to discover the graph structure of the
web as hypertext is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and
death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Internet graph: V = routers, E = communication links
“Cosine” graph (undirected, weighted): V = static web pages, E = tf-idf distance between pages
Query-Log graph (bipartite, weighted): V = queries and URLs, E = (q,u) where u is a result for q, and has been clicked by some user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1
The In-degree distribution
In-degree follows a power-law distribution (AltaVista crawl 1999, WebBase crawl 2001):
Pr[in-degree(u) = k] ∝ 1/k^α, with α ≈ 2.1
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
Similarity: pages that are close in lexicographic order tend to share many outgoing links
A Picture of the Web Graph
[Plot of the Web adjacency matrix under URL-sorting: 21 million pages, 150 million links, from the Berkeley and Stanford domains; a dot at (i,j) marks a link from page i to page j]
Information Retrieval
Crawling
Spidering
24h a day, 7 days a week, “walking” over a graph
Recall that the Web graph is
A directed graph G = (N, E) with the Bow-Tie structure seen before
N changes (insert, delete): >> 50 * 10^9 nodes
E changes (insert, delete): > 10 links per node
Crawling Issues
How to crawl?
How much to crawl? How much to index?
Coverage: How big is the Web? How much do we cover?
Relative Coverage: How much do competitors have?
How often to crawl?
Quality: “Best” pages first
Efficiency: Avoid duplication (or near duplication)
Etiquette: Robots.txt, Server load concerns (Minimize load)
Freshness: How much has changed?
How to parallelize the process
Sec. 20.2
Crawling picture
[Diagram] Seed pages initialize the URL frontier; URLs are crawled and parsed, and their out-links feed the frontier; the rest of the Web remains unseen.
Sec. 20.1.1
Updated crawling picture
[Diagram] As above (seed pages, URL frontier, URLs crawled and parsed, unseen Web), but with multiple crawling threads consuming the frontier in parallel.
Sec. 20.2.1
Robots.txt
Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
www.robotstxt.org/wc/norobots.html
The website announces its requests on what can(not) be crawled by placing a file of restrictions at URL/robots.txt
Sec. 20.2.1
Robots.txt example
No robot should visit any URL starting with
"/yoursite/temp/", except the robot called
“searchengine":
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
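Python's standard urllib.robotparser module can evaluate such a policy; a minimal sketch, assuming the example file above were served by a hypothetical host www.example.com (the lines are parsed directly, so no network access is needed):

from urllib import robotparser

# Parse the example policy above directly; the host name is hypothetical.
rp = robotparser.RobotFileParser(url="http://www.example.com/robots.txt")
rp.parse([
    "User-agent: *",
    "Disallow: /yoursite/temp/",
    "",
    "User-agent: searchengine",
    "Disallow:",
])

# Any generic robot is kept out of /yoursite/temp/ ...
print(rp.can_fetch("*", "http://www.example.com/yoursite/temp/page.html"))             # False
# ... while the robot called "searchengine" may crawl everything.
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True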
Sec. 20.2.1
Processing steps in crawling
Pick a URL from the frontier (which one? this is a policy choice)
Fetch the document at the URL
Parse the document and extract its links to other docs (URLs)
For each extracted URL:
Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
Check if it is already in the frontier (duplicate URL elimination)
Check if the URL's content has already been seen (duplicate content elimination)
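A minimal sketch of this loop, assuming hypothetical fetch, extract_links and url_filter helpers supplied by the caller; it is a skeleton of the steps above, not a production crawler:

from collections import deque

def crawl(seed_urls, fetch, extract_links, url_filter, max_pages=1000):
    """Skeleton of the frontier loop; fetch/extract_links/url_filter are assumed helpers."""
    frontier = deque(seed_urls)          # URLs still to be crawled
    seen_urls = set(seed_urls)           # duplicate-URL elimination
    seen_content = set()                 # duplicate-content elimination (fingerprints)
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()         # 1. pick a URL from the frontier
        doc = fetch(url)                 # 2. fetch the document at that URL
        if doc is None:
            continue
        fp = hash(doc)                   # crude stand-in for a content fingerprint
        if fp in seen_content:           # 3. content already seen? skip it
            continue
        seen_content.add(fp)
        crawled += 1
        for link in extract_links(doc, base_url=url):        # 4. parse, extract links
            if url_filter(link) and link not in seen_urls:    # 5. URL filter + dup URL elim
                seen_urls.add(link)
                frontier.append(link)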
Sec. 20.2.1
Basic crawl architecture
[Block diagram] URL frontier → Fetch (DNS resolution, WWW) → Parse → “Content seen?” test against the doc fingerprints (FP’s) → URL filter (robots filters) → Duplicate URL elimination against the URL set → back into the URL frontier.
Page selection
Given a page P, define how “good” P is.
Several metrics:
BFS, DFS, Random
Popularity driven (PageRank, full vs partial)
Topic driven or focused crawling
Combined
BFS
“…BFS-order discovers the highest quality pages during
the early stages of the crawl”
328 million URLs in the testbed
[Najork 01]
This page is a new one ?
Check whether the URL has been parsed or downloaded before
After 20 million pages, we have “seen” over 200 million URLs
Each URL is at least 100 bytes on average
Overall, about 20 GB of URLs
Options: compress URLs in main memory, or use disk
Bloom Filter (Archive)
Disk access with caching (Mercator, AltaVista)
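A minimal Bloom-filter sketch for this URL-seen test; the bit-array size, the number of hash functions, and the use of MD5 are illustrative choices, not the parameters used by Archive:

import hashlib

class BloomFilter:
    """Space-efficient 'seen this URL?' test: may give false positives, never false negatives."""
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions from keyed MD5 hashes of the URL.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
if "http://www.di.unipi.it/" not in seen:
    seen.add("http://www.di.unipi.it/")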
Parallel Crawlers
Web is too big to be crawled by a single crawler, work
should be divided avoiding duplication
Dynamic assignment
Central coordinator dynamically assigns URLs to crawlers
Links are given to Central coordinator
Static assignment
Web is statically partitioned and assigned to crawlers
Crawler only crawls its part of the web
Two problems
Load balancing the #URLs assigned to each downloader:
Static schemes based on hosts may fail (e.g., www.geocities.com/…. vs. www.di.unipi.it/)
A simple static scheme: let D be the number of downloaders; hash(URL) maps a URL to {0,...,D-1}; downloader x fetches the URLs U such that hash(U) = x
Dynamic “relocation” schemes may be complicated
Managing the fault tolerance:
What about the death of a downloader? D → D-1, a new hash function is needed!!!
What about a new downloader? D → D+1, a new hash function is needed!!!
A nice technique: Consistent Hashing
A tool for:
Spidering
Web Cache
P2P
Routers Load Balance
Distributed FS
Item and servers mapped to unit circle
Item K assigned to first server N such
that ID(N) ≥ ID(K)
What if a downloader goes down?
What if a new downloader appears?
Each server gets replicated log S times
[monotone] adding a new server moves items only from one old server to the new one
[balance] the probability that an item goes to a given server is ≤ const/S
[load] any server gets ≤ (I/S) log S items w.h.p. (I = number of items)
[scale] you can replicate each server more times to smooth the load
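A toy consistent-hashing sketch; the SHA-1-based hashing, the 32-bit circle, and the number of replicas per downloader are illustrative assumptions:

import bisect
import hashlib

def _h(key):
    """Map a string to a point on the circle (here: a 32-bit integer)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHash:
    def __init__(self, downloaders, replicas=100):
        # Each downloader is replicated `replicas` times on the circle.
        self.ring = sorted((_h(f"{d}#{r}"), d) for d in downloaders for r in range(replicas))
        self.points = [p for p, _ in self.ring]

    def assign(self, url):
        # A URL goes to the first downloader clockwise from its hash value.
        i = bisect.bisect_left(self.points, _h(url)) % len(self.ring)
        return self.ring[i][1]

ch = ConsistentHash(["downloader-0", "downloader-1", "downloader-2"])
print(ch.assign("http://www.di.unipi.it/"))
# Removing or adding a downloader only moves the URLs adjacent to its points on the circle.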
Examples: Open Source
Nutch, also used by WikiSearch
http://www.nutch.org
Heritrix, used by Archive.org
http://archive-crawler.sourceforge.net/index.html
Consistent Hashing: Amazon’s Dynamo
Sec. 20.4
Connectivity Server
Support for fast queries on the web graph
Which URLs point to a given URL?
Which URLs does a given URL point to?
Stores mappings in memory from
URL to outlinks, URL to inlinks
Sec. 20.4
Currently the best
WebGraph – a set of algorithms and a Java implementation
Fundamental goal – maintain node
adjacency lists in memory
For this, compressing the adjacency
lists is the critical component
Sec. 20.4
Adjacency lists
The set of neighbors of a node
Assume each URL represented by an
integer
4 billion pages => 32 bits per node
and 64 bits per hyperlink
Sec. 20.4
Adjacency list compression
Properties exploited in
compression:
Similarity (between lists)
Locality (many links from a page go to lexicographically nearby pages)
Use gap encodings in sorted lists: the distribution of gap values is skewed, giving about 3 bits/link
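For instance, a sorted adjacency list can be turned into gaps and then coded with a variable-length integer code; a small sketch using a simple variable-byte code (not the actual WebGraph codes):

def to_gaps(sorted_neighbors):
    """[17, 22, 23, 40] -> [17, 5, 1, 17]: gaps are small thanks to locality."""
    return [sorted_neighbors[0]] + [b - a for a, b in zip(sorted_neighbors, sorted_neighbors[1:])]

def vbyte_encode(gaps):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte of a gap."""
    out = bytearray()
    for g in gaps:
        chunks = []
        while True:
            chunks.append(g & 0x7F)
            g >>= 7
            if g == 0:
                break
        chunks.reverse()
        chunks[-1] |= 0x80          # flag the final byte of this gap
        out.extend(chunks)
    return bytes(out)

print(vbyte_encode(to_gaps([17, 22, 23, 40])))   # 4 neighbors stored in 4 bytes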
Sec. 20.4
Main ideas
Consider lexicographically ordered list of all
URLs, e.g.,
www.stanford.edu/alchemy
www.stanford.edu/biology
www.stanford.edu/biology/plant
www.stanford.edu/biology/plant/copyright
www.stanford.edu/biology/plant/people
www.stanford.edu/chemistry
Sec. 20.4
Copy lists
Each of these URLs has an adjacency list
Main idea: due to templates, the adjacency list of a node is similar to one of the 7 preceding URLs in the lexicographic ordering (the reference window has size 7)
Express the adjacency list in terms of one of these
E.g., consider these adjacency lists
1: 1, 2, 4, 8, 16, 32, 64
2: 1, 4, 9, 16, 25, 36, 49, 64
3: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
4: 1, 4, 8, 16, 25, 36, 49, 64
Encode list 4 as (-2) (copy from the list two positions back), copy-bits 11011111 (drop the 9), and add the extra node 8
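A sketch of how such a reference (“copy list”) encoding could be decoded; the record format (reference list, one copy bit per element, extra nodes) is simplified for illustration and is not WebGraph's on-disk format:

def decode_copy_list(reference_list, copy_bits, extras):
    """Rebuild an adjacency list from a reference list, one copy bit per
    reference element, and the extra nodes to be merged in."""
    kept = [v for v, bit in zip(reference_list, copy_bits) if bit == "1"]
    return sorted(set(kept) | set(extras))

# List 2 above, copy bits 11011111 (drop the 9), plus the extra node 8:
print(decode_copy_list([1, 4, 9, 16, 25, 36, 49, 64], "11011111", [8]))
# -> [1, 4, 8, 16, 25, 36, 49, 64], i.e. list 4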
Sec. 20.4
Extra nodes and binary arrays
Several tricks:
Use RLE over the binary arrays
Use a succinct encoding for the intervals created by the extra nodes
Use special integer codes for the remaining integers (ζ codes: good for integers drawn from a power law)
Sec. 20.4
Main advantages
Adjacency queries can be answered very
efficiently
To fetch out-neighbors, trace back the chain
of prototypes
This chain is typically short in practice (since
similarity is mostly intra-host)
Can also explicitly limit the length of the
chain during encoding
Easy to implement one-pass algorithm
Sec. 19.6
Duplicate documents
The web is full of duplicated content
Strict dup-detection = exact match
Not as common
Many cases of near duplicates
E.g., Last modified date is the only
difference between two page copies
Sec. 19.6
Duplicate/Near-Duplicate Detection
Duplication: Exact match can be detected with
fingerprints
Near-Duplication: Approximate match
Overview
Compute syntactic similarity with an edit-distance measure
Use a similarity threshold to detect near-duplicates
E.g., Similarity > 80% => documents are “near duplicates”
Sec. 19.6
Computing Similarity
Approach:
Shingles (Word N-Grams)
a rose is a rose is a rose →
a_rose_is_a
rose_is_a_rose
is_a_rose_is
a_rose_is_a
Similarity Measure between two docs:
Set of shingles + Set intersection
Documents => sets of 64-bit fingerprints
[Diagram] Doc → (shingling) → multiset of shingles → (fingerprinting) → multiset of fingerprints
Efficient shingle management via fingerprints:
• Use Karp-Rabin fingerprints
• Use 64-bit fingerprints
• Prob[collision] << 1
Similarity of Documents
Doc A → set of shingle fingerprints S_A; Doc B → set of shingle fingerprints S_B
• Jaccard measure – similarity of S_A, S_B, which are sets of integers:
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
• Claim: A & B are near-duplicates if sim(S_A, S_B) is close to 1
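A small sketch of shingling plus the Jaccard measure on word q-grams, ignoring multiplicities (the tokenization is deliberately naive):

def shingles(text, q=4):
    """Set of word q-grams ('shingles') of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(sa, sb):
    """sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|."""
    return len(sa & sb) / len(sa | sb)

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
print(round(jaccard(shingles(A, q=1), shingles(B, q=1)), 3))
# 0.6, matching the "disregarding multiplicity, q=1" value in the example below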
Remarks
Multiplicities of q-grams – could retain or ignore
trade-off efficiency with precision
Shingle size: q ∈ [4 … 10]
Short shingles: increase the similarity of unrelated documents
With q = 1, sim(S_A, S_B) = 1 whenever A is a permutation of B
Need larger q to be sensitive to permutation changes
Long shingles: small random changes have larger impact
Similarity Measure
Similarity is non-transitive, non-metric
But dissimilarity 1-sim(SA,SB) is a metric
[Ukkonen 92] – relate q-gram & edit-distance
Example
A = “a rose is a rose is a rose”
B = “a rose is a flower which is a rose”
Preserving multiplicity
q=1 sim(SA,SB) = 0.7
SA = {a, a, a, is, is, rose, rose, rose}
SB = {a, a, a, is, is, rose, rose, flower, which}
q=2 sim(SA,SB) = 0.5
q=3 sim(SA,SB) = 0.3
Disregarding multiplicity
q=1 sim(SA,SB) = 0.6
q=2 sim(SA,SB) = 0.5
q=3 sim(SA,SB) = 0.4285
Sec. 19.6
Efficiency: Sketches
Create a “sketch vector” (of size ~200)
for each document
Docs that share ≥ t (say 80%) of the elements in their sketches are near-duplicates
For doc D, sketch_D[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting)
Let π_i be a random permutation of 0..2^m
Pick MIN {π_i(f(s))} over all shingles s in D
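A minimal min-hash sketch along these lines; the random permutations π_i are simulated by keyed hash functions (blake2b here), an illustrative stand-in for true permutations and for Karp-Rabin fingerprints:

import hashlib

def fingerprint(shingle):
    """64-bit fingerprint f(s) of a shingle."""
    return int(hashlib.blake2b(shingle.encode(), digest_size=8).hexdigest(), 16)

def permute(value, i):
    """Pseudo-random 'permutation' pi_i of the 64-bit space, keyed by i."""
    data = value.to_bytes(8, "big") + i.to_bytes(4, "big")
    return int(hashlib.blake2b(data, digest_size=8).hexdigest(), 16)

def sketch(shingle_set, k=200):
    """sketch[i] = MIN over all shingles s of pi_i(f(s))."""
    fps = [fingerprint(s) for s in shingle_set]
    return [min(permute(v, i) for v in fps) for i in range(k)]

def estimated_similarity(sk1, sk2):
    """Fraction of agreeing components estimates the Jaccard coefficient."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)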
Sec. 19.6
Computing Sketch[i] for Doc1
[Diagram] Start from the 64-bit fingerprints f(shingles) of Document 1 on the number line 0..2^64; permute them with π_i; pick the min value.
Sec. 19.6
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Diagram] Apply the same permutation π_i to the fingerprints of Document 1 and of Document 2 on the line 0..2^64, and compare MIN(π(f(A))) with MIN(π(f(B))): are these equal?
Test this for 200 random permutations: π_1, π_2, … π_200
Sec. 19.6
Notice that…
[Diagram: fingerprints of Document 1 and Document 2 on the line 0..2^64]
The two minima are equal iff the shingle with the MIN value in the union of Doc1 and Doc2 also lies in their intersection
Claim: this happens with probability
Size_of_intersection / Size_of_union
Sec. 19.6
All signature pairs
This is an efficient method for estimating
the similarity (Jaccard coefficient) for one
pair of documents.
But we have to estimate N^2 similarities, where N is the number of web pages.
Still slow
One solution: locality sensitive hashing (LSH)
Another solution: sorting
Information Retrieval
Link-based Ranking
(2nd generation)
Query-independent ordering
First generation: using link counts as simple
measures of popularity.
Undirected popularity:
Each page gets a score given by the number of its in-links plus the number of its out-links (e.g., 3+2=5).
Directed popularity:
Score of a page = number of its in-links (e.g., 3).
Easy to SPAM
Second generation: PageRank
Each link has its own importance!!
PageRank is
independent of the query
many interpretations…
Basic Intuition…
What about nodes
with no in/out links?
Google’s Pagerank
Random jump + principal eigenvector:
r = [ α T + (1-α) e e^T ] × r
where
T_{i,j} = 1/#out(j) if j ∈ B(i), and 0 otherwise
B(i): set of pages linking to i
#out(j): number of outgoing links from j
e: vector with all components equal to 1/√N
Three different interpretations
Graph (intuitive interpretation)
Matrix (easy for computation)
Co-citation
Eigenvector computation or a linear system solution
Markov Chain (useful to prove convergence)
a sort of Usage Simulation
“In the steady state” each page
has a long-term visit rate
- use this as the page’s score.
[Diagram] With probability 1-α jump to any node; with probability α follow one of the out-links to a neighbor.
Pagerank: use in Search Engines
Preprocessing:
Given the graph, build the matrix α T + (1-α) e e^T
Compute its principal eigenvector r
r[i] is the pagerank of page i
We are interested in the relative order
Query processing:
Retrieve the pages containing the query terms
Rank them by their Pagerank
The final order is query-independent
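A sketch of the preprocessing step by power iteration; the damping factor 0.85 is a common convention, and dangling nodes are handled by spreading their mass uniformly, which is one usual choice rather than something stated in these slides:

def pagerank(out_links, alpha=0.85, iters=50):
    """Power iteration for r = [alpha*T + (1-alpha)*e*e^T] r.
    out_links: dict node -> list of nodes it points to (all targets must be keys)."""
    nodes = list(out_links)
    N = len(nodes)
    r = {u: 1.0 / N for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - alpha) / N for u in nodes}   # the (1-alpha) random-jump mass
        for u in nodes:
            targets = out_links[u]
            if targets:                               # spread r[u] along u's out-links
                share = alpha * r[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:                                     # dangling node: spread uniformly
                for v in nodes:
                    new[v] += alpha * r[u] / N
        r = new
    return r

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))     # pages by decreasing Pagerank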
HITS: Hypertext Induced Topic Search
Calculating HITS
It is query-dependent
Produces two scores per page:
Authority score: a good authority page for a
topic is pointed to by many good hubs for
that topic.
Hub score: A good hub page for a topic
points to many authoritative pages for that
topic.
Authority and Hub scores
[Example graph] Pages 2, 3 and 4 point to page 1, and page 1 points to pages 5, 6 and 7:
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
HITS: Link Analysis Computation
a = A^T h and h = A a, hence a = A^T A a and h = A A^T h
where
a: vector of Authority scores
h: vector of Hub scores
A: adjacency matrix, with a_{i,j} = 1 if i → j
Thus h is an eigenvector of A A^T and a is an eigenvector of A^T A (both symmetric matrices)
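A sketch of the iterative HITS computation on a small adjacency structure; normalizing by the L2 norm after each round is one standard choice:

import math

def hits(adj, iters=50):
    """adj: dict x -> list of pages x points to. Returns (authority, hub) score dicts."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    a = {u: 1.0 for u in nodes}
    h = {u: 1.0 for u in nodes}
    for _ in range(iters):
        a = {u: sum(h[x] for x in nodes if u in adj.get(x, ())) for u in nodes}  # a = A^T h
        h = {u: sum(a[v] for v in adj.get(u, ())) for u in nodes}                # h = A a
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {u: v / na for u, v in a.items()}
        h = {u: v / nh for u, v in h.items()}
    return a, h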
Weighting links
Weight more if the query occurs in the
neighborhood of the link (e.g. anchor text).
The basic updates
h(x) = Σ_{x→y} a(y), a(x) = Σ_{y→x} h(y)
become, with weights,
h(x) = Σ_{x→y} w(x,y) · a(y), a(x) = Σ_{y→x} w(x,y) · h(y)