
Database and Knowledge Discovery:
from Market Basket Data to Web PageRank
Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
What is a Database?

A Database:
- A very large, integrated collection of data.
- Models a real-world “enterprise”
  - Entities, e.g., students, courses
  - Relationships, e.g., Zhang San is taking the Database System course.
What is a Database Management System?

A Database Management System (DBMS) is:
- A software system designed to store, manage, and facilitate queries to databases.

Popular DBMSs:
- Oracle
- IBM DB2
- Microsoft SQL Server

Database System = Databases + DBMS
DBMSs make a big industry

The world's largest software companies by Forbes Global 2000
(http://www.forbes.com/lists/2009/18/global-09_The-Global2000_IndName_17.html):
- IBM
- Microsoft
- Oracle
- Google
- Softbank
- SAP
- Accenture
- Computer Sciences Corporation
- Yahoo!
- Cap Gemini
Typical Applications Supported by Database Systems

Online Transaction Processing (OLTP)
- Recording sales data in supermarkets
- Booking flight tickets
- Electronic banking

Online Analytical Processing (OLAP) and Data Warehousing
- Business reporting for sales data
- Customer Relationship Management (CRM)
Is the WWW a DBMS?

The Web = Surface Web + Deep Web
- Surface Web: simply the HTML pages
  - Accessed by “search”: pose keywords in a search box.
- Deep Web: content hidden behind HTML forms
  - Accessed by “query”: fill in query forms.
- “Search” as “Simple Search”; “Query” as “Advanced Search”
“Search” vs. “Query”

Search is structure-free.
- The keywords “database systems” can appear anyplace in an HTML page.

Query is structure-aware.
- Say, we require that the keywords “database systems” appear only in the “TITLE” field.
- i.e., we assume there is an underlying STRUCTURE (of a book).
What is a “STRUCTURE”?

Referring to the C programming language:

struct BOOK {
    char TITLE[256];
    char AUTHOR[256];
    float PRICE;
    int YEAR;
};

In this course, we study database management systems that focus on processing structured data.
An Anecdote about “Search”:
Google’s founders came from the Database Group at Stanford.
A Historical Perspective (1)

Relational Data Model, by Edgar Codd, 1970.
- Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387.
- Turing Award, 1981.

System R, by IBM, started in 1974
- Structured Query Language (SQL)

INGRES, by Berkeley, started in 1974
- POSTGRES, Mariposa, C-Store
A Historical Perspective (2)

Database Transaction Processing, mainly by Jim Gray.
- J. Gray, A. Reuter. Transaction Processing: Concepts and Techniques. 1993.
- Turing Award, 1998.

Object-Relational DBMS, 1990s.
- Stonebraker, Michael with Moore, Dorothy. Object-Relational DBMSs: The Next Great Wave. 1996.
- Postgres (UC Berkeley), PostgreSQL.
- IBM's DB2, Oracle Database, and Microsoft SQL Server
Describing Data: Data Models

- A data model is a collection of concepts for describing data.
- A schema is a description of a particular collection of data, using a given data model.
- The relational data model is the most widely used model today.
  - Main concept: relation, basically a table with rows and columns.
  - Every relation has a schema, which describes the columns, or fields (their names, types, constraints, etc.).
Schema in the Relational Data Model

A relation schema is a TEMPLATE of the corresponding relation.

Definition of the Students schema:
Students (sid: string, name: string, login: string, age: integer, gpa: real)

Table 1. An Instance of the Students Relation

sid    | name  | login      | age | gpa
-------|-------|------------|-----|----
53666  | Jones | jones@cs   | 18  | 3.4
53688  | Smith | smith@ee   | 18  | 3.2
53650  | Smith | smith@math | 19  | 3.8
Queries in a Relational DBMS

Specified in a non-procedural way:
- Users only specify what data they need;
- A DBMS takes care to evaluate queries as efficiently as possible.

A non-procedural query language:
- SQL: Structured Query Language

Basic form of a SQL query:

SELECT target-list
FROM   relation-list
WHERE  qualification
A Simple SQL Example

At an airport, a gate agent clicks on a form to request the passenger list for a flight.

Passenger(pid: string, name: string, flight: integer)

SELECT name
FROM   Passenger
WHERE  flight = 510275
Concurrent execution of user programs

Why?
- Utilize CPU while waiting for disk I/O
  - (database programs make heavy use of disk)
- Avoid short programs waiting behind long ones
  - e.g., ATM withdrawal while the bank manager sums balances across all accounts

courtesy of Joe Hellerstein
Concurrent execution

Interleaving actions of different user programs can lead to inconsistency.

Example:
- Bill transfers $100 from savings to checking:
  Savings -= 100; Checking += 100
- Meanwhile, Bill’s wife requests account info.

Bad interleaving:
1. Savings -= 100
2. Print balances
3. Checking += 100
The printout is missing $100!

courtesy of Joe Hellerstein
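The bad interleaving above can be replayed deterministically in a few lines. This is a toy sketch, not a real DBMS: the starting balances of $500 are an assumption, and the schedule is hard-coded rather than produced by actual concurrent threads.

```python
# Replay the bad schedule step by step (hypothetical balances).
savings, checking = 500, 500  # assumed starting balances

# Step 1 (Bill's transfer, first half): debit savings.
savings -= 100
# Step 2 (wife's request runs in between): read both balances.
printout = savings + checking  # sees 400 + 500 = 900
# Step 3 (Bill's transfer, second half): credit checking.
checking += 100

print(printout)            # 900 -- the $100 "in flight" is missing
print(savings + checking)  # 1000 -- the final state is consistent
```

The anomaly is purely one of ordering: the final database state is correct, but the reader observed an inconsistent intermediate state.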
Concurrency Control

DBMS ensures such problems don’t arise.
- Users can pretend they are using a single-user system.
STRUCTURE OF A DBMS
The Market-Basket Model

- A large set of items, e.g., things sold in a supermarket.
- A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.
Support

- Simplest question: find sets of items that appear “frequently” in the baskets.
- Support for itemset I = the number of baskets containing all items in I.
  - Sometimes given as a percentage.
- Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.
Example: Frequent Itemsets

Items = {milk, coke, pepsi, beer, juice}.
Support threshold = 3 baskets.

B1 = {m, c, b}       B2 = {m, p, j}
B3 = {m, b}          B4 = {c, j}
B5 = {m, p, b}       B6 = {m, c, b, j}
B7 = {c, b, j}       B8 = {b, c}

Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}.
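The slide's answer can be checked by brute force. A minimal sketch (the variable names are mine, not from the slides) that counts support for every singleton and pair over the eight baskets:

```python
from itertools import combinations

# The eight baskets from the slide (m = milk, c = coke, p = pepsi,
# b = beer, j = juice).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # support threshold

def support(itemset):
    # Number of baskets containing every item of the itemset.
    return sum(itemset <= basket for basket in baskets)

items = set().union(*baskets)
frequent = [frozenset(c)
            for k in (1, 2)
            for c in combinations(sorted(items), k)
            if support(set(c)) >= s]

print(sorted("".join(sorted(f)) for f in frequent))
# ['b', 'bc', 'bm', 'c', 'cj', 'j', 'm']
```

This matches the slide: the frequent singletons are {m}, {c}, {b}, {j} and the frequent pairs are {m,b}, {b,c}, {c,j}.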
Applications – (1)

- Items = products; baskets = sets of products someone bought in one trip to the store.
- Example application: given that many people buy beer and diapers together:
  - Run a sale on diapers; raise the price of beer.
  - Only useful if many buy diapers & beer.
Applications – (2)

- Baskets = Web pages; items = words.
- Unusual words appearing together in a large number of documents, e.g., “Brad” and “Angelina,” may indicate an interesting relationship.
Association Rules

- If-then rules about the contents of baskets.
- {i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik then it is likely to contain j.”
- The confidence of this association rule is the probability of j given i1, …, ik.
Example: Confidence

B1 = {m, c, b}       B2 = {m, p, j}
B3 = {m, b}          B4 = {c, j}
B5 = {m, p, b}       B6 = {m, c, b, j}
B7 = {c, b, j}       B8 = {b, c}

An association rule: {m, b} → c.
- Four baskets contain both m and b (B1, B3, B5, B6); of these, two (B1, B6) also contain c.
- Confidence = 2/4 = 50%.
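The confidence computation above is just a ratio of two supports, which is easy to verify directly (a small sketch over the same eight baskets; names are mine):

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    # Number of baskets containing every item of the itemset.
    return sum(itemset <= basket for basket in baskets)

# Confidence of {m, b} -> c: support({m, b, c}) / support({m, b}).
lhs = {"m", "b"}
conf = support(lhs | {"c"}) / support(lhs)
print(conf)  # 0.5
```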
Finding Association Rules

Question: “Find all association rules with support ≥ s and confidence ≥ c.”
- Note: the “support” of an association rule is the support of the set of items on the left.

Hard part: finding the frequent itemsets.
- Note: if {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent.”
A-Priori Algorithm – (1)

- A two-pass approach called a-priori limits the need for main memory.
- Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
  - Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
A-Priori Algorithm – (2)

Pass 1: Read baskets and count in main memory the occurrences of each item.
- Requires only memory proportional to #items.
- Items that appear at least s times are the frequent items.
A-Priori Algorithm – (3)

Pass 2: Read baskets again and count in main memory only those pairs both of whose members were found in Pass 1 to be frequent.
- Requires memory proportional to the square of the number of frequent items (for counts), plus a list of the frequent items (so you know what must be counted).
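The two passes above can be sketched in a few lines of Python (a toy in-memory version; in the real setting each pass would stream the baskets from disk, and the function name is my own):

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: count individual items; memory ~ number of distinct items.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, n in item_counts.items() if n >= s}

    # Pass 2: count only pairs whose members BOTH survived Pass 1.
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(frequent_items & basket)
        pair_counts.update(combinations(candidates, 2))
    return {pair for pair, n in pair_counts.items() if n >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
freq_pairs = apriori_pairs(baskets, 3)
print(sorted(freq_pairs))  # [('b', 'c'), ('b', 'm'), ('c', 'j')]
```

On the slide's running example, item p is infrequent, so pairs containing p are never even counted in Pass 2 — that pruning is exactly what monotonicity buys.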
Picture of A-Priori

[Figure: main-memory layout of the two passes — Pass 1 holds the item counts; Pass 2 holds the list of frequent items plus the counts of pairs of frequent items.]
Page Rank: Ranking web pages

- Web pages are not equally “important”
  - www.joe-schmoe.com vs. www.stanford.edu
- Inlinks as votes
  - www.stanford.edu has 23,400 inlinks
  - www.joe-schmoe.com has 1 inlink
- Are all inlinks equal?
  - Recursive question!
Simple recursive formulation

- Each link’s vote is proportional to the importance of its source page
- If page P with importance x has n outlinks, each link gets x/n votes
- Page P’s own importance is the sum of the votes on its inlinks
Simple “flow” model

The web in 1839: three pages — Yahoo (y), Amazon (a), and M’soft (m). Yahoo links to itself and to Amazon; Amazon links to Yahoo and to M’soft; M’soft links to Amazon.

y = y/2 + a/2
a = y/2 + m
m = a/2
Solving the flow equations

- 3 equations, 3 unknowns, no constants
  - No unique solution
  - All solutions equivalent modulo scale factor
- Additional constraint forces uniqueness:
  - y + a + m = 1
  - y = 2/5, a = 2/5, m = 1/5
- The Gaussian elimination method works for small examples, but we need a better method for large graphs
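The stated solution can be checked exactly with rational arithmetic (a quick sketch using Python's fractions module; the variable names mirror the slide):

```python
from fractions import Fraction as F

y, a, m = F(2, 5), F(2, 5), F(1, 5)  # the claimed solution

# The three flow equations plus the normalization constraint.
assert y == y / 2 + a / 2
assert a == y / 2 + m
assert m == a / 2
assert y + a + m == 1
print("y, a, m =", y, a, m)
```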
Matrix formulation

- Matrix M has one row and one column for each web page
- Suppose page j has n outlinks
  - If j → i, then Mij = 1/n
  - Else Mij = 0
- M is a column-stochastic matrix
  - Columns sum to 1
- Suppose r is a vector with one entry per web page
  - ri is the importance score of page i
  - Call it the rank vector
  - |r| = 1
Example

Suppose page j links to 3 pages, including i. Then column j of M has 1/3 in each of those 3 rows (including row i) and 0 elsewhere, so entry i of Mr receives a contribution of rj/3 from page j.
Eigenvector formulation

- The flow equations can be written r = Mr
- So the rank vector is an eigenvector of the stochastic web matrix
  - In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
Example

For the Yahoo/Amazon/M’soft graph:

        y    a    m
  y    1/2  1/2   0
  a    1/2   0    1
  m     0   1/2   0

r = Mr:

  [y]   [1/2  1/2  0] [y]
  [a] = [1/2   0   1] [a]
  [m]   [ 0   1/2  0] [m]

which is exactly y = y/2 + a/2, a = y/2 + m, m = a/2.
Power Iteration method

- Simple iterative scheme (aka relaxation)
- Suppose there are N web pages
- Initialize: r0 = [1/N, …, 1/N]T
- Iterate: rk+1 = M rk
- Stop when |rk+1 − rk|1 < ε
  - |x|1 = Σ1≤i≤N |xi| is the L1 norm
  - Can use any other vector norm, e.g., Euclidean
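The scheme above can be sketched directly for the 3-page example (a toy implementation; the matrix is the one from the flow equations, and the tolerance eps is an arbitrary choice of mine):

```python
# Pages ordered y, a, m; matrix taken from the flow equations.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
N = len(M)
eps = 1e-12  # L1 stopping tolerance (assumed)

r = [1.0 / N] * N  # r0 = [1/N, ..., 1/N]
while True:
    # One step of r_{k+1} = M r_k.
    r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
    diff = sum(abs(a - b) for a, b in zip(r_next, r))  # L1 norm
    r = r_next
    if diff < eps:
        break

print([round(x, 6) for x in r])  # ~ [0.4, 0.4, 0.2], i.e., [2/5, 2/5, 1/5]
```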
Power Iteration Example

        y    a    m
  y    1/2  1/2   0
  a    1/2   0    1
  m     0   1/2   0

  [y]   [1/3]   [1/3]   [5/12]   [3/8]            [2/5]
  [a] = [1/3] → [1/2] → [1/3 ] → [11/24] → … →  [2/5]
  [m]   [1/3]   [1/6]   [1/4 ]   [1/6]            [1/5]
Random Walk Interpretation

- Imagine a random web surfer
  - At any time t, the surfer is on some page P
  - At time t+1, the surfer follows an outlink from P uniformly at random
  - Ends up on some page Q linked from P
  - Process repeats indefinitely
- Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  - p(t) is a probability distribution on pages
The stationary distribution

- Where is the surfer at time t+1?
  - Follows a link uniformly at random
  - p(t+1) = M p(t)
- Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t)
  - Then p(t) is called a stationary distribution for the random walk
- Our rank vector r satisfies r = Mr
  - So it is a stationary distribution for the random surfer
Existence and Uniqueness
A central result from the theory of random walks (aka
Markov processes):
For graphs that satisfy certain conditions, the
stationary distribution is unique and
eventually will be reached no matter what the
initial probability distribution at time t = 0.
Spider traps

- A group of pages is a spider trap if there are no links from within the group to outside the group
  - The random surfer gets trapped
- Spider traps violate the conditions needed for the random walk theorem
Microsoft becomes a spider trap

M’soft now links only to itself:

        y    a    m
  y    1/2  1/2   0
  a    1/2   0    0
  m     0   1/2   1

  [y]   [1]   [1]     [3/4]   [5/8]         [0]
  [a] = [1] → [1/2] → [1/2] → [3/8] → … → [0]
  [m]   [1]   [3/2]   [7/4]   [2]           [3]
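The drain into the trap can be seen numerically with the same power-iteration loop (a short sketch; the 100-iteration cutoff is an arbitrary choice of mine):

```python
# The spider-trap matrix from the slide (pages ordered y, a, m;
# M'soft links only to itself).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

r = [1.0, 1.0, 1.0]  # the slide's starting vector
for _ in range(100):
    r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]

print([round(x, 4) for x in r])  # ~ [0, 0, 3]: m soaks up all importance
```

Note that the total importance (3) is preserved at every step, because the columns of M still sum to 1; it just all ends up at the trap.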
Random teleports

- The Google solution for spider traps
- At each time step, the random surfer has two options:
  - With probability β, follow a link at random
  - With probability 1−β, jump to some page uniformly at random
  - Common values for β are in the range 0.8 to 0.9
- The surfer will teleport out of a spider trap within a few time steps
Random teleports (β = 0.8)

Each outlink of weight 1/2 is followed with probability 0.8 (contributing 0.8 · 1/2), and with probability 0.2 the surfer teleports to one of the 3 pages uniformly (contributing 0.2 · 1/3 to every page):

        [1/2  1/2  0]         [1/3  1/3  1/3]
  0.8 * [1/2   0   0] + 0.2 * [1/3  1/3  1/3]
        [ 0   1/2  1]         [1/3  1/3  1/3]

        y     a      m
  y    7/15  7/15   1/15
= a    7/15  1/15   1/15
  m    1/15  7/15  13/15
Random teleports (β = 0.8)

            [1/2  1/2  0]         [1/3  1/3  1/3]   [7/15  7/15   1/15]
  A = 0.8 * [1/2   0   0] + 0.2 * [1/3  1/3  1/3] = [7/15  1/15   1/15]
            [ 0   1/2  1]         [1/3  1/3  1/3]   [1/15  7/15  13/15]

  [y]   [1]   [1.00]   [0.84]   [0.776]         [ 7/11]
  [a] = [1] → [0.60] → [0.60] → [0.536] → … → [ 5/11]
  [m]   [1]   [1.40]   [1.56]   [1.688]         [21/11]
Matrix formulation

- Suppose there are N pages
- Consider a page j, with set of outlinks O(j)
- We have Mij = 1/|O(j)| when j → i, and Mij = 0 otherwise
- The random teleport is equivalent to:
  - adding a teleport link from j to every page with probability (1−β)/N
  - reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
  - Equivalent: tax each page a fraction (1−β) of its score and redistribute it evenly
Page Rank

- Construct the N×N matrix A as follows:
  - Aij = β Mij + (1−β)/N
- Verify that A is a stochastic matrix
- The page rank vector r is the principal eigenvector of this matrix
  - satisfying r = Ar
- Equivalently, r is the stationary distribution of the random walk with teleports
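The whole construction can be sketched end to end on the spider-trap example (a toy implementation; here r is normalized to sum to 1, so the limit is the earlier [7/11, 5/11, 21/11] divided by 3, and the 200-iteration cutoff is my own choice):

```python
beta = 0.8
# Link matrix with the spider trap (pages ordered y, a, m;
# M'soft links only to itself).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
N = 3

# A_ij = beta * M_ij + (1 - beta) / N
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

# Power iteration on A.
r = [1.0 / N] * N
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]

print([round(x, 4) for x in r])  # ~ [7/33, 5/33, 21/33]
```

Unlike the pure spider-trap iteration, the trap page m no longer absorbs everything: the teleport tax keeps y and a at positive rank.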