Transcript Lecture 7

Web Mining
Two Key Problems
 PageRank
 Web Content Mining

PageRank
Intuition: solve the recursive equation: “a page is important if important pages link to it.”
 Technically: importance = the principal eigenvector of the stochastic matrix of the Web.
 A few fixups are needed.
Stochastic Matrix of the Web



 Enumerate pages. Page i corresponds to row and column i.
 M[i,j] = 1/n if page j links to n pages, including page i; 0 if j does not link to i.
 M[i,j] is the probability we’ll next be at page i if we are now at page j.

Example: suppose page j links to 3 pages, including page i. Then column j has 1/3 in row i (and in the rows of the other two pages that j links to).
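
Below is a minimal Python sketch of building such a matrix, assuming the link structure is given as a dict from each page to the list of pages it links to; the stochastic_matrix helper and the dict representation are illustrative, not part of the lecture.

# A minimal sketch, assuming the Web is a dict mapping each page to its out-links.
import numpy as np

def stochastic_matrix(links, pages):
    n = len(pages)
    index = {p: k for k, p in enumerate(pages)}
    M = np.zeros((n, n))
    for j in pages:                      # column j describes page j's out-links
        out = links.get(j, [])
        for i in out:                    # each linked page i gets 1/|out| in M[i, j]
            M[index[i], index[j]] = 1.0 / len(out)
    return M

# The three-page example used later (Yahoo links to itself and Amazon,
# Amazon to Yahoo and M'soft, M'soft to Amazon):
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
print(stochastic_matrix(links, ["y", "a", "m"]))   # columns sum to 1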
Random Walks on the Web
 Suppose v is a vector whose i-th component is the probability that we are at page i at a certain time.
 If we follow a link from i at random, the probability distribution for the page we are then at is given by the vector M v.

Random Walks --- (2)
Starting from any vector v, the limit M(M(…M(M v)…)) is the distribution of page visits during a random walk.
 Intuition: pages are important in proportion to how often a random walker would visit them.
 The math: limiting distribution = principal eigenvector of M = PageRank.

Example: The Web in 1839
Three pages: Yahoo (y), Amazon (a), M’soft (m). Yahoo links to itself and Amazon, Amazon links to Yahoo and M’soft, and M’soft links to Amazon.

      y    a    m
y    1/2  1/2   0
a    1/2   0    1
m     0   1/2   0
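
As a sanity check on the “principal eigenvector” claim, here is a minimal sketch that computes it directly with numpy for the three-page matrix above; this is only feasible for tiny examples, not the Web.

# A minimal sketch: PageRank as the principal eigenvector of M (tiny example only).
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)
v = eigenvectors[:, np.argmax(eigenvalues.real)].real
v = v / v.sum() * 3            # scale so total importance is 3 (one unit per page)
print(v)                       # approximately [1.2, 1.2, 0.6] = [6/5, 6/5, 3/5]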
Simulating a Random Walk
Start with the vector v = [1,1,…,1], representing the idea that each Web page is given one unit of importance.
 Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.
 Limit exists, but about 50 iterations is sufficient to estimate the final distribution.
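
A minimal sketch of this simulation for the three-page example; the 50-iteration cap follows the slide.

# A minimal sketch of the random-walk simulation (power iteration).
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

v = np.ones(3)                 # one unit of importance per page
for _ in range(50):            # about 50 iterations is enough here
    v = M @ v
print(v)                       # converges to [6/5, 6/5, 3/5]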

Example

Equations v = M v:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

  y       1     1     5/4    9/8           6/5
  a   =   1    3/2     1    11/8    ...    6/5
  m       1    1/2    3/4    1/2           3/5
Solving The Equations
Because there are no constant terms, these 3 equations in 3 unknowns do not have a unique solution.
 Add in the fact that y + a + m = 3 to solve.
 In Web-sized examples, we cannot solve by Gaussian elimination; we need to use relaxation (= iterative solution).
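
A minimal sketch of the small-example solve, assuming we keep two of the equations and replace the third with the constraint y + a + m = 3.

# A minimal sketch: replace one dependent equation with y + a + m = 3 and solve.
import numpy as np

# y - y/2 - a/2 = 0,  a - y/2 - m = 0,  y + a + m = 3
A = np.array([[ 0.5, -0.5,  0.0],
              [-0.5,  1.0, -1.0],
              [ 1.0,  1.0,  1.0]])
b = np.array([0.0, 0.0, 3.0])
print(np.linalg.solve(A, b))   # [1.2, 1.2, 0.6] = [6/5, 6/5, 3/5]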

Real-World Problems

Some pages are “dead ends” (have no links out).
   Such a page causes importance to leak out.
 Other (groups of) pages are spider traps (all out-links are within the group).
   Eventually spider traps absorb all importance.
Microsoft Becomes Dead End
M’soft now has no out-links, so its column is all 0’s.

      y    a    m
y    1/2  1/2   0
a    1/2   0    0
m     0   1/2   0
Example

Equations v = M v:
  y = y/2 + a/2
  a = y/2
  m = a/2

  y       1     1     3/4    5/8          0
  a   =   1    1/2    1/2    3/8    ...   0
  m       1    1/2    1/4    1/4          0
M’soft Becomes Spider Trap
M’soft now links only to itself, so its column has a 1 in row m.

      y    a    m
y    1/2  1/2   0
a    1/2   0    0
m     0   1/2   1
Example

Equations v = M v:
  y = y/2 + a/2
  a = y/2
  m = a/2 + m

  y       1     1     3/4    5/8          0
  a   =   1    1/2    1/2    3/8    ...   0
  m       1    3/2    7/4     2           3
Google Solution to Traps, Etc.
“Tax” each page a fixed percentage at each iteration.
 Add the same constant to all pages.
 Models a random walk with a fixed probability of going to a random place next.

Example: Previous with 20% Tax

Equations v = 0.8(M v) + 0.2:
  y = 0.8(y/2 + a/2) + 0.2
  a = 0.8(y/2) + 0.2
  m = 0.8(a/2 + m) + 0.2

  y       1    1.00    0.84    0.776            7/11
  a   =   1    0.60    0.60    0.536    ...     5/11
  m       1    1.40    1.56    1.688           21/11
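
A minimal sketch of the taxed iteration for the spider-trap example above, with beta = 0.8 (a 20% tax).

# A minimal sketch of v = 0.8 (M v) + 0.2 on the spider-trap matrix.
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])   # M'soft links only to itself
beta = 0.8

v = np.ones(3)
for _ in range(50):
    v = beta * (M @ v) + (1 - beta)   # every page also receives the same constant
print(v)   # approaches [7/11, 5/11, 21/11]; the trap no longer absorbs everything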
General Case
In this example, because there are no dead ends, the total importance remains at 3.
 In examples with dead ends, some importance leaks out, but the total remains finite.

Solving the Equations
Because there are constant terms, we can expect to solve small examples by Gaussian elimination.
 Web-sized examples still need to be solved by relaxation.

Speeding Convergence
Newton-like prediction of where components of the principal eigenvector are heading.
 Take advantage of locality in the Web.
 Each technique can reduce the number of iterations by 50%.
 Important --- PageRank takes time!
Web Content Mining

The Web is perhaps the single largest data source in the world.
 Much of Web (content) mining is about
   data/information extraction from semi-structured objects and free text, and
   integration of the extracted data/information.
 Due to the heterogeneity and lack of structure, mining and integration are challenging tasks.
Wrapper induction

Using machine learning to generate extraction rules:
   The user marks the target items in a few training pages.
   The system learns extraction rules from these pages.
   The rules are applied to extract target items from other pages.
 Many wrapper induction systems, e.g.,
   WIEN (Kushmerick et al., IJCAI-97),
   Softmealy (Hsu and Dung, 1998),
   Stalker (Muslea et al., Agents-99),
   BWI (Freitag and McCallum, AAAI-00),
   WL2 (Cohen et al., WWW-02),
   IDE (Liu and Zhai, WISE-05),
   Thresher (Hogue and Karger, WWW-05).
Stalker: A wrapper induction system (Muslea et al., Agents-99)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We want to extract the area code.
 Start rules:
   R1: SkipTo(()
   R2: SkipTo(-<b>)
 End rules:
   R3: SkipTo())
   R4: SkipTo(</b>)
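
A minimal sketch of how such a learned rule list might be applied; the skip_to and extract helpers are illustrative assumptions, not Stalker's actual implementation.

# A minimal sketch of applying SkipTo start/end rules (illustrative helpers only).

def skip_to(text, landmark, start=0):
    """Position just after the first occurrence of landmark, or None if absent."""
    pos = text.find(landmark, start)
    return None if pos < 0 else pos + len(landmark)

def extract(text, start_rules, end_rules):
    begin = None
    for rule in start_rules:          # ordered list: earlier rules take priority
        begin = skip_to(text, rule)
        if begin is not None:
            break
    if begin is None:
        return None
    for rule in end_rules:            # the end rule marks where the item stops
        end = text.find(rule, begin)
        if end >= 0:
            return text[begin:end]
    return None

e1 = "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515"
e2 = "90 Colfax, <b>Palms</b>, Phone (800) 508-1570"
print(extract(e2, ["(", "-<b>"], [")", "</b>"]))   # '800' via R1/R3
print(extract(e1, ["(", "-<b>"], [")", "</b>"]))   # '800' via R2/R4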
Learning extraction rules

Stalker uses sequential covering to learn extraction rules for each target item.
   In each iteration, it learns a perfect rule that covers as many positive items as possible without covering any negative items.
   Once a positive item is covered by a rule, the whole example is removed.
   The algorithm ends when all the positive items are covered. The result is an ordered list of all learned rules.
Rule induction through an example
Training examples:
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We learn the start rule for the area code.
 Assume the algorithm starts with E2. It creates three initial candidate rules with the first prefix symbol and two wildcards:
   R1: SkipTo(()
   R2: SkipTo(Punctuation)
   R3: SkipTo(Anything)
 R1 is perfect. It covers two positive examples but no negative example.
Rule induction (cont …)
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

 R1 covers E2 and E4, which are removed. E1 and E3 need additional rules.
 Three candidates are created:
   R4: SkipTo(<b>)
   R5: SkipTo(HtmlTag)
   R6: SkipTo(Anything)
 None is good. Refinement is needed.
 Stalker chooses R4 to refine, i.e., to add additional symbols to specialize it.
 It will find R7: SkipTo(-<b>), which is perfect.
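
A minimal sketch of the initial candidate-generation and “perfect rule” check from the walkthrough above; the tokenization, wildcard classes, and scoring are simplifying assumptions rather than Stalker's actual algorithm.

# A minimal sketch of generating and checking initial SkipTo candidates
# (tokenization, wildcards, and scoring are simplified assumptions).

WILDCARDS = {
    "Punctuation": lambda t: t in "(),:.-",
    "HtmlTag": lambda t: t.startswith("<") and t.endswith(">"),
    "Anything": lambda t: True,
}

def matches(token, landmark):
    return landmark(token) if callable(landmark) else token == landmark

def apply_start_rule(tokens, landmark):
    """Index just after the first matching token, or None if the rule does not apply."""
    for i, token in enumerate(tokens):
        if matches(token, landmark):
            return i + 1
    return None

def initial_candidates(tokens, item_index):
    """Candidates from the token right before the target plus its wildcard classes."""
    prefix = tokens[item_index - 1]
    cands = [("SkipTo(%s)" % prefix, prefix)]
    for name, pred in WILDCARDS.items():
        if pred(prefix):
            cands.append(("SkipTo(%s)" % name, pred))
    return cands

def coverage(landmark, examples):
    """Number of examples landed on exactly; 0 if any landing is in the wrong place."""
    covered = 0
    for tokens, item_index in examples:
        pos = apply_start_rule(tokens, landmark)
        if pos is None:
            continue                    # rule does not apply: neither positive nor negative
        if pos != item_index:
            return 0                    # covers a negative example
        covered += 1
    return covered

# E2 and E4 tokenized, with the index of the area-code token.
e2 = (["90", "Colfax", ",", "<b>", "Palms", "</b>", ",", "Phone", "(", "800", ")", "508-1570"], 9)
e4 = (["403", "La", "Tijera", ",", "<b>", "Watts", "</b>", ",", "Phone", ":", "(", "310", ")", "798-0008"], 11)

for name, landmark in initial_candidates(*e2):      # seed example E2
    print(name, "perfectly covers", coverage(landmark, [e2, e4]), "examples")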
Limitations of Supervised Learning
Manual labeling is labor intensive and time consuming, especially if one wants to extract data from a huge number of sites.
 Wrapper maintenance is very costly if Web sites change frequently:
   It is necessary to detect when a wrapper stops working properly.
   Any change may make existing extraction rules invalid.
   Re-learning is needed, and most likely manual re-labeling as well.
The RoadRunner System (Crescenzi et al., VLDB-01)
 Given a set of positive examples (multiple sample pages), each containing one or more data records.
 From these pages, generate a wrapper as a union-free regular expression (i.e., no disjunction).
The approach:
 To start, a sample page is taken as the wrapper.
 The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.

Compare with wrapper induction

No manual labeling, but a set of positive pages of the same template is needed,
   which is not necessary for a page with multiple data records;
   the wrapper is for pages, not for data records.
 A Web page can have many pieces of irrelevant information.

Issues of automatic extraction:
 Hard to handle disjunctions.
 Hard to generate attribute names for the extracted data.
 Extracted data from multiple sites need integration, manual or automatic.
Relation Extraction

Assumptions:
   No single source contains all the tuples.
   Each tuple appears on many web pages.
   Components of a tuple appear “close” together, e.g.,
     Foundation, by Isaac Asimov
     Isaac Asimov’s masterpiece, the <em>Foundation</em> trilogy
 There are repeated patterns in the way tuples are represented on web pages.
Naïve approach

Study a few websites and come up with a set of patterns, e.g., regular expressions:
   letter = [A-Za-z. ]
   title = letter{5,40}
   author = letter{10,30}
   <b>(title)</b> by (author)
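
The same pattern, written as a Python regular expression (a minimal sketch; the character class and length bounds are taken from the slide).

# A minimal sketch of the slide's pattern as a Python regular expression.
import re

LETTER = r"[A-Za-z. ]"
TITLE = rf"({LETTER}{{5,40}})"
AUTHOR = rf"({LETTER}{{10,30}})"
PATTERN = re.compile(rf"<b>{TITLE}</b> by {AUTHOR}")

html = "<b>Foundation and Empire</b> by Isaac Asimov, 1952."
m = PATTERN.search(html)
if m:
    print(m.group(1), "|", m.group(2))   # Foundation and Empire | Isaac Asimov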
Problems with naïve approach

A pattern that works on one web page might produce nonsense when applied to another,
   so patterns need to be page-specific, or at least site-specific.
 It is impossible for a human to exhaustively enumerate patterns for every relevant website,
   which will result in low coverage.
Better approach (Brin)

Exploit the duality between patterns and tuples:
   Find tuples that match a set of patterns.
   Find patterns that match a lot of tuples.
 DIPRE (Dual Iterative Pattern Relation Extraction)

[Diagram: patterns and tuples form a cycle; patterns match tuples, and tuples generate patterns.]
DIPRE Algorithm
1. R ← SampleTuples
   e.g., a small set of <title, author> pairs
2. O ← FindOccurrences(R)
   Occurrences of tuples on web pages
   Keep some surrounding context
3. P ← GenPatterns(O)
   Look for patterns in the way tuples occur
   Make sure patterns are not too general!
4. R ← MatchingTuples(P)
5. Return or go back to Step 2
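
A minimal sketch of the DIPRE loop over a tiny in-memory corpus; the corpus, the (prefix, middle, suffix) pattern representation, and the helper names are illustrative assumptions, not Brin's original system (which also records URL prefixes and rejects overly general patterns).

# A minimal sketch of the DIPRE loop over a tiny in-memory corpus.
# The corpus, the (prefix, middle, suffix) pattern form, and the helpers are
# illustrative assumptions only.
import re

corpus = [
    "Our pick: <b>Foundation</b> by Isaac Asimov.",
    "Our pick: <b>Dune</b> by Frank Herbert.",
    "A fan page about spice, sandworms and other things.",
]

def find_occurrences(tuples, docs):
    """Contexts (prefix, middle, suffix) where a known (title, author) pair occurs."""
    occurrences = []
    for title, author in tuples:
        for doc in docs:
            for m in re.finditer(re.escape(title) + r"(.{0,10}?)" + re.escape(author), doc):
                prefix = doc[max(0, m.start() - 10):m.start()]   # keep a little context
                suffix = doc[m.end():m.end() + 10]
                occurrences.append((prefix, m.group(1), suffix))
    return occurrences

def gen_patterns(occurrences):
    """Turn each context into a regex; a real system must prune overly general ones."""
    return {re.escape(p) + r"(.+?)" + re.escape(mid) + r"(.+?)" + re.escape(s)
            for p, mid, s in occurrences}

def matching_tuples(patterns, docs):
    found = set()
    for pattern in patterns:
        for doc in docs:
            for m in re.finditer(pattern, doc):
                found.add((m.group(1), m.group(2)))
    return found

R = {("Foundation", "Isaac Asimov")}        # 1. a small seed set
for _ in range(2):                          # 5. iterate
    O = find_occurrences(R, corpus)         # 2. occurrences with surrounding context
    P = gen_patterns(O)                     # 3. patterns from the contexts
    R = R | matching_tuples(P, corpus)      # 4. new tuples matched by the patterns
print(R)                                    # also finds ("Dune", "Frank Herbert")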
Web query interface integration

Many integration tasks, e.g.,
   Integrating Web query interfaces (search forms)
   Integrating extracted data
   Integrating textual information
   Integrating ontologies (taxonomy)
   …
 We only introduce integration of query interfaces.
   Many web sites provide forms to query the deep web.
   Applications: meta-search and meta-query.
Global Query Interface
[Figure: query forms from united.com, airtravel.com, delta.com, and hotwire.com combined into a single global query interface.]
Synonym Discovery (He and Chang, KDD-04)

Discover synonym attributes, e.g., Author – Writer, Subject – Category.

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Holistic Model Discovery (across all schemas at once):
  {author, writer, name}   {subject, category}

  S1: author, title, subject, ISBN
  S2: writer, title, category, format
  S3: name, title, keyword, binding

vs.

Pairwise Attribute Correspondence:
  S1.author ↔ S3.name
  S1.subject ↔ S2.category
Schema matching as correlation mining
Across many sources:
 Synonym attributes are negatively correlated:
   synonym attributes are semantic alternatives,
   thus they rarely co-occur in query interfaces.
 Grouping attributes are positively correlated:
   grouping attributes semantically complement each other,
   thus they often co-occur in query interfaces.
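
A minimal sketch of co-occurrence counting over the three toy schemas above. The simple “observed minus expected” score is an illustrative stand-in; He and Chang (KDD-04) define their own correlation measures, and with only three sources every cross-schema pair looks negatively correlated.

# A minimal sketch of correlation mining over the toy schemas above.
# The score is a plain "observed minus expected" co-occurrence rate, only an
# illustration of the idea; the paper uses its own correlation measures.
from itertools import combinations

schemas = [
    {"author", "title", "subject", "ISBN"},
    {"writer", "title", "category", "format"},
    {"name", "title", "keyword", "binding"},
]
n = len(schemas)
attributes = sorted(set().union(*schemas))

def correlation(a, b):
    pa = sum(a in s for s in schemas) / n
    pb = sum(b in s for s in schemas) / n
    pab = sum(a in s and b in s for s in schemas) / n
    return pab - pa * pb

# Negatively correlated pairs (never co-occur) are synonym candidates; with only
# three sources this is very noisy, so real systems use many interfaces.
for a, b in combinations(attributes, 2):
    if correlation(a, b) < 0:
        print("synonym candidate:", a, "/", b)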