C20.0046: Database Management Systems
Lecture #26
M.P. Johnson
Stern School of Business, NYU
Spring, 2005
Agenda

- Last time: OLAP & Data Warehouses
- Data Mining
- Websearch
- Etc.
Goals after today:
1. Be aware of what problems DM solves
2. What the algorithms are
3. Supervised v. unsupervised learning
4. Understand how to find frequent sets and why
New topic: Data Mining

- Situation: data rich but knowledge poor
  - Terabytes and terabytes of data
  - Searching for needles in haystacks
- Can collect lots of data
  - credit card transactions
  - bar codes
  - club cards
  - webserver logs
  - call centers
  - 311 calls in NYC
  - cameras
Lots of data

- Can store this data
  - DBMSs, data warehousing
- Can query it
  - SQL
  - DW/OLAP queries
  - "Find the things such that…"
  - "Find the totals for each of…"
- Can get answers to specific questions, but what does it all mean?
But how to learn from this data?

- What kinds of people will buy my widgets?
- Whom should I send my direct-mail literature to?
- How can I segment my market?
  - http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
- Whom should we approve for a gold card?
- Who gets approved for car insurance?
  - And what should we charge?
Knowledge Discovery

- Goal: Extract interesting/actionable knowledge from large collections of data
  - Finding rules
  - Finding patterns
  - Classifying instances
- KD ~= DM ~= Business Analytics
- Business Intelligence: semi-intelligent reporting/OLAP
Data Mining at the intersection of disciplines

- DBMS
  - query processing, data warehousing, OLAP
- Statistics
  - mathematical modeling, regression
- Visualization
  - 3d representation
- AI/machine learning
  - neural networks, decision trees, etc.
- Except the algorithms must really scale
ML/DM: two basic kinds

1. Supervised learning
   - Classification
   - Regression
2. Unsupervised learning
   - Clustering
   - Or: find something interesting
Supervised learning

- Situation: you have many particular instances
  - Transactions
  - Customers
  - Credit applications
- Have various fields describing them
- But one other property you're interested in:
  - E.g., credit-worthiness
- Goal: infer this dependent property from the other data
Supervised learning

- Supervised learning starts with training data
  - Many instances including the dependent property
  - And ideally, the answers are "correct"
- Use the training data to build a model
  - Many different algorithms
- Given the model, you can now determine the dependent property for new instances
  - Categorical property → "classification"
  - Numerical property → "regression"
k-Nearest Neighbor

- Very simple algorithm
  - Sometimes works
- Map training data points in a space
- Given any two points, can compute the distance between them
  - Euclidean
- Given an unlabeled point, label it this way:
  - Find the k nearest points
  - Let them vote on the label
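
A minimal sketch of that distance-and-vote scheme (not from the slides; the training-data format and k=3 are my assumptions):

from collections import Counter
import math

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, label) pairs -- an assumed format
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])   # the k nearest points vote
    return votes.most_common(1)[0][0]

train = [((1, 1), "good"), ((1, 2), "good"), ((8, 9), "bad"), ((9, 8), "bad")]
print(knn_classify(train, (2, 1)))  # -> "good"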
Neural Networks (skip?)

- "Hill climbing"
- Based on connections between neurons in the brain
- Simple NN:
  1. input layer with 3 nodes
  2. hidden layer with 2 nodes
  3. output layer with 1 node
- Each node points to each node in the next level
- Each node has some activation level and something like a critical mass
- Draw picture
  - What kind of graph is this?
Neural Networks (skip?)

- Values passed into the input nodes represent the problem instance
- Given the weighted sum of its inputs, a neuron sends out a pulse only if the sum is greater than its threshold
- Values output by the hidden nodes are sent to the output node
- If the weighted sum going into the output node is high enough, it outputs 1, otherwise 0
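
A small illustrative sketch of that forward pass for the 3-2-1 threshold network above; the weights and thresholds are made-up numbers, not anything from the lecture:

def fires(inputs, weights, threshold):
    # a node emits 1 only if the weighted sum of its inputs exceeds its threshold
    return 1 if sum(i * w for i, w in zip(inputs, weights)) > threshold else 0

def forward(x):
    h1 = fires(x, [0.5, -0.2, 0.8], threshold=0.3)     # hidden node 1
    h2 = fires(x, [0.1, 0.9, -0.4], threshold=0.5)     # hidden node 2
    return fires([h1, h2], [0.7, 0.7], threshold=0.6)  # output node

print(forward([1, 0, 1]))  # classify one 3-field problem instance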
NN applications (skip?)

- Plausible application: we have data about potential customers
  - party registration
  - married or not
  - gender
  - income level
- Or: have credit applicant information
  - employment
  - income
  - home ownership
  - bankruptcy
- Should we give a credit card to him?
How NNs work (skip?)

- Hope: plug in a customer → out comes whether we should market toward him
- How does it get the right answer?
  - Initially, all weights are random!
- But: we assume we have data for lots of people whom we know to be either interested in our products or not (let's say)
  - We have data for both kinds
- So: when we plug in one of these customers, we know what the right answer is supposed to be
How NNs work (skip?)

- Can use the Backpropagation algorithm:
  - For each known problem instance, plug it in and look at the answer
  - If the answer is wrong, change the edge weights in one way; o.w., change them the opposite way (details omitted)
  - Repeat…
- The more iterations we do, the more the NN learns our known data
- With enough confidence, can apply the NN to unknown customer data to learn whether to market toward them
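
The slide above omits the weight-update details, so this is only a perceptron-style stand-in (not real backpropagation) to show the shape of the "if wrong, nudge the weights; repeat" loop; every number here is invented:

def train(examples, weights, rate=0.1, epochs=50):
    # examples: list of (inputs, correct_label) pairs with labels 0 or 1
    for _ in range(epochs):
        for x, target in examples:
            out = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0
            err = target - out                        # 0 if right, +/-1 if wrong
            weights = [w + rate * err * xi for w, xi in zip(weights, x)]
    return weights

print(train([([1, 0, 1], 1), ([0, 1, 0], 0)], weights=[0.0, 0.0, 0.0]))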
LBS example (skip?)

- Investments
  - Goal: maximize return on investments
  - Buy/sell the right securities at the right time
- Lots of time-series data for different properties for different stocks
  - returns, market signals
  - pick the right ones
  - react
- Soln: create an NN for each stock
  - retrain weekly
Decision Trees

- Another use of (rooted) trees
  - Trees, but not BSTs
- Each node ~ one attribute
  - Its children ~ possible values of that attribute
- E.g.: each node ~ some field on a credit app
  - Each path from root to leaf is one rule
  - If these fields have these values, then make this decision
Decision Trees

- Details:
  - For a binary property, two out-edges, but there may be more
  - For a continuous property (income), divide the values into discrete ranges
  - A property may appear more than once
- Example: top node: history of bankruptcy?
  - If yes, REJECT
  - If no, then: employed?
    - If no, … (maybe look for a high monthly housing payment)
    - If yes, …
- Particular algorithms: ID3, CART, etc.
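
Just to make the example concrete, here is that little tree hard-coded as nested ifs; in practice an algorithm such as ID3 or CART would learn the structure from data, and the field names and the 2000 cutoff below are invented:

def credit_decision(app):
    if app["bankruptcy"]:                    # top node: history of bankruptcy?
        return "REJECT"
    if not app["employed"]:                  # next node: employed?
        # for the unemployed, look at the monthly housing payment
        return "REJECT" if app["housing_payment"] > 2000 else "ACCEPT"
    return "ACCEPT"

print(credit_decision({"bankruptcy": False, "employed": True, "housing_payment": 900}))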
Naïve Bayes Classifier

- Bayes' Theorem: Pr(B|A) = Pr(B,A)/Pr(A) = (Pr(A|B)*Pr(B))/Pr(A)
- Used in many spam filters
  - W means: "the msg has the words W1, W2, …, Wn"
  - S means: "it's spam"
- Goal: given a new msg with certain words, is this spam?
  - Pr(S|W) = Pr(S,W)/Pr(W) = (Pr(W|S)*Pr(S))/Pr(W)
  - Is Pr(S|W) > 50%?
Naïve Bayes Classifier

- This is supervised learning, so we first have a training phase
- Training phase:
  - Look at lots of spam messages and my non-spam messages
  - For each word Wi, compute Pr(Wi)
  - For each word Wi, compute Pr(Wi|S)
  - Compute Pr(S)
  - That's it!
- Now, we wait for email to arrive…
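
A rough sketch of that training phase (assuming messages are already tokenized into word lists; the add-one smoothing is my addition, to avoid zero probabilities for rare words):

from collections import Counter

def train(spam_msgs, ham_msgs):
    # estimate Pr(S), Pr(Wi), and Pr(Wi|S) by counting messages containing each word
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    p_spam = n_spam / (n_spam + n_ham)                                       # Pr(S)
    in_spam = Counter(w for m in spam_msgs for w in set(m))
    in_all = Counter(w for m in spam_msgs + ham_msgs for w in set(m))
    p_w      = {w: (in_all[w] + 1) / (n_spam + n_ham + 2) for w in in_all}   # Pr(Wi)
    p_w_spam = {w: (in_spam[w] + 1) / (n_spam + 2) for w in in_all}          # Pr(Wi|S)
    return p_spam, p_w, p_w_spam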
Naïve Bayes Classifier

- When a new msg with words W = W1…Wn arrives, we compute:
  - Pr(S|W) = (Pr(W|S)*Pr(S))/Pr(W)
- What are Pr(W) and Pr(W|S)?
- Assuming the words are independent (obviously false), we have:
  - Pr(W) = Pr(W1)*Pr(W2)*…*Pr(Wn)
  - Pr(W|S) = Pr(W1|S)*Pr(W2|S)*…*Pr(Wn|S)
- Each number here, we have precomputed!
  - Except for new words…
- To decide the spam status of a message, then, we just do the math!
- Very simple, but works surprisingly well in practice
  - Really simple: can write it in a page of Perl code
  - See also Paul Graham: http://www.paulgraham.com/spam.html
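
A companion sketch to the training code above: score one incoming message under the independence assumption and flag it if Pr(S|W) > 50%, skipping new words as the slide notes:

def is_spam(msg_words, p_spam, p_w, p_w_spam):
    pr_w = pr_w_given_s = 1.0
    for w in set(msg_words):
        if w not in p_w:                     # new word: no estimate, skip it
            continue
        pr_w *= p_w[w]                       # Pr(W)   = Pr(W1)*...*Pr(Wn)
        pr_w_given_s *= p_w_spam[w]          # Pr(W|S) = Pr(W1|S)*...*Pr(Wn|S)
    return pr_w_given_s * p_spam / pr_w > 0.5    # is Pr(S|W) > 50% ?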
Naïve Bayes Classifier
From: "mrs bulama shettima" <[email protected]
Date: September 11, 2004 4:32:14 AM EDT
To: [email protected]
Subject: SALAM.
SALAM.
My name is Hajia Amina Shettima Mohammed Bulama the eldest
wife of Shettima Mohammed Bulama who was the est. while
managing director of the Bank of the North Nig, Plc.
I am contacting you in a benevolent spirit; utmost
confidence and trust to enable us provide a solution to a
money transfer of $12M that is presentlt the last resort of
my family.My husband has just been retired from office and
has also been placed under survelience of which he can not
travel out of the country for now due to allegations
levelled against him while in office as the Managing
director of the bank of the north Nig. Plc of which he is
facing litigations right now.
You may be quite surprised at my sudden contact to you but
do not despair,I got your contact from a business site on
the internet and following the information I gathered about
you, I was convinced that you could be of assistance to me.
So, decided to contact you at once due to the urgency
required for us to immediately transfer the said funds out
of the country.During the time my husband was with the bank
he made several loans to various oil firms and inreturn he
was given various sums as gratifications.The prominent
amongst the deals was monies that emanated from funds set
aside for the importation of various ballot equipmemts for
the just concluded april general elections . If you are
conversant with world news,you would understand better.
During this period my husband was able to make some good
money for himself and kept in a private security finance
accounts here as persopnal effects.Out of the money my
husband made, he left the sum of US$12M (Twelve Mllion
United States Dollars) which was kept in a Private security
firm here in Nigeria and he kept it in the vaults of this
firm as personal effects.
The reason is because no names were used to lodge in the
consignment containing the funds. Instead, he used PERSONAL
IDENTIFICATION NUMBERS (PIN)and declared the contents as
Bearer Bonds and Treasury Bills. Also the firm issued him
with a certificate of deposit of the consignment. Note that
I have these information in my custody.Right now, my
husband has asked me to negotiate with a foreigner who
would assist us lay claim to the consignment, as the
eldest wife of my husband,I believe that I owe the entire
family an obligation to ensure that the US$12M is
successfully transferred abroad for investment purposes.
With the present situation, I cannot do it all by myself.
It is based on this that I am making this contact with you.
I have done a thorough homework and fine-tuned the best way
to create you as the beneficiary to the consignment
containing the funds and effect the transfer of the
consignment accordingly. It is rest assured that the
modalities I have resolved to finalize the entire project
guarantees our safety and the successful transfer of the
funds. So, you will be absolutely right when you say that
this project is risk free and viable.
If you are capable and willing to assist,contact me at once
via this email for more details.
Believe me, there is no one else we can trust again. All my
husband's friends have deserted us after exploiting us on
the pretence of trying to help my husband. As it is said,
it is at the time of problems that you know your true
friends.So long as you keep everything to yourself, we
would definitely have no problems. For your assistance, I
am ready to give you as much as 25% of the total funds
after confirmation in your possession and invest a
reasonable percentage into any viable business you may
suggest. Please, I need your assistance to make this happen
and please do not undermine it because it will also be a
source of up liftment to you also. You have absolutely
nothing to loose in assisting us instead,you have so much
to gain.
I will appreciate if you can contact me at my private
email:[email protected] once.
Awaiting your urgent and positive response.
Thanks and Allah be with you.
Hajia (Mrs) Amina Shettima Bulama.
Judging the results

- (Relatively) easy to create a model that does very well on the training data
- What matters: it should do well on future, unseen data
- Common problem: overfitting
  - Model is too close to the training data
  - Needs to be pruned
- Common approach: cross-validation
  - Divide the training data into 10 parts
  - Train on 9/10, test on 1/10
  - Do this all 10 ways
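
A minimal sketch of that 10-fold procedure; build_model and accuracy are hypothetical placeholders for whatever learner and metric are in use:

def cross_validate(data, build_model, accuracy, k=10):
    folds = [data[i::k] for i in range(k)]          # divide the data into 10 parts
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = build_model(train)                  # train on 9/10
        scores.append(accuracy(model, test))        # test on the held-out 1/10
    return sum(scores) / k                          # average over all 10 ways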
Association applications

- Spam filtering
- Network intrusion detection
- Trading strategies
- Political marketing:
  - Mail-order tulip bulbs → Conservative party voters
  - NYTimes last year…
New topic: Clustering

- Unsupervised learning
  - Divide items up into clusters of the same type
  - What are the different types of customers we have?
  - What types of webpages are there?
- Each item becomes a point in the space
  - One dimension for every property
  - Possible plotting for webpages: a dimension for each word
  - Value in dimension d is (# occurrences of d) / (# of all word occurrences)
- k-means
k-means

- Simple clustering algorithm
  - Want to partition data points into sets of "similar" instances
  - Plot in n-space
  - → partition points into clumps in space
- Like an unsupervised analog to k-NN
  - But iterative
k-means

- Alg:
  - Choose initial means (centers)
  - Do {
      assign each point to the closest mean
      recompute the means of the clumps
    } until change in means < e
- Visualization:
  - http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
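
A runnable sketch of that loop, under a few assumptions of mine (2-D points, Euclidean distance, initial means taken from the first k points):

import math

def kmeans(points, k, eps=1e-6):
    means = points[:k]                               # initial means (centers)
    while True:
        clumps = [[] for _ in range(k)]
        for p in points:                             # assign each point to the closest mean
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clumps[i].append(p)
        new_means = [tuple(sum(c) / len(clump) for c in zip(*clump)) if clump else means[i]
                     for i, clump in enumerate(clumps)]   # recompute means of clumps
        shift = max(math.dist(m, n) for m, n in zip(means, new_means))
        means = new_means
        if shift < eps:                              # until change in means < e
            return means, clumps

print(kmeans([(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)], k=2)[0])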
Amazon's "Page You Made"

- How does this work?
New topic: Frequent Sets

- Find sets of items that go together
- Intuition: market-basket data
  - Suppose you're Wal-Mart, with 460TB of data
    - http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
  - Might not know customer history (but maybe: club cards!)
  - One thing you do know: groupings of items
    - Each set of items rung up in an individual purchase
- Famous (claimed) example: beer and diapers are positively correlated
  - Intuition: babies at home, not at bars
  - → put chips in between
Finding Frequent Sets

- Situation: one table
  - Baskets(id, item)
  - One row for each purchase of each item
  - Alternate interpretation: words on webpages
- Goal: given support s, find sets of items that appear together in >= s baskets
- For pairs, obvious attempt:

SELECT B1.item, B2.item, COUNT(B1.id)
FROM   Baskets B1, Baskets B2
WHERE  B1.id = B2.id AND B1.item < B2.item
GROUP BY B1.item, B2.item
HAVING COUNT(B1.id) >= s;

- What's wrong?
A Priori Property

- a priori ~ prior to
  - Prior to investigation
  - Deductive / from reason / non-empirical
- As opposed to a posteriori ~ post
  - After investigation/experience
- Logical observation: if support(X) >= s, then for every individual x in X, support(x) >= s
  - → any item y with support < s can be ignored
A Priori Property

- E.g.: You sell 10,000 items, and the store records 1,000,000 baskets, with an avg of 20 items each
  - → 20,000,000 rows in Baskets
  - # pairs of items from 1,000,000 baskets = C(20,2)*1,000,000 = 190,000,000 pairs!
- One idea:
  1. Find the support-s items
  2. Run the old query on them
A Priori Property

- Suppose we're looking for sup-10,000 pairs
  - Each item in the pair must be sup-10,000
- Counting argument: have only 20,000,000 individual items purchased
  - Can have at most 2000 = 20,000,000/10,000 popular items
  - Eliminated 4/5 of the item types!
    - Actually, probably much more
    - Will also lower the average basket size
- Suppose we now have 500,000 baskets with avg basket size = 10
  - # pairs = 500,000*C(10,2) = 22,500,000 rows
A Priori Algorithm

- But: a frequent itemset may have > 2 items
- Idea: build frequent sets (not just pairs) iteratively
- Alg:
  - First, find all frequent items
    - These are the size-1 frequent sets
  - Next, for each size-k frequent set:
    - For each frequent item:
      - Check whether the union is frequent
      - If so, add it to the collection of size-(k+1) frequent sets
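
A small sketch of that iterative idea (not an optimized A Priori implementation); it assumes baskets are sets of items and s is the minimum number of supporting baskets:

def frequent_sets(baskets, s):
    def support(itemset):
        return sum(1 for b in baskets if itemset <= b)   # baskets containing the set

    items = {x for b in baskets for x in b}
    freq_items = {x for x in items if support({x}) >= s}      # size-1 frequent sets
    current = [frozenset({x}) for x in freq_items]
    all_frequent = list(current)
    while current:
        bigger = set()
        for fs in current:                   # each size-k frequent set
            for x in freq_items - fs:        # each frequent item not already in it
                union = fs | {x}
                if support(union) >= s:
                    bigger.add(union)        # a size-(k+1) frequent set
        current = list(bigger)
        all_frequent.extend(current)
    return all_frequent

baskets = [{"beer", "diapers", "chips"}, {"beer", "diapers"}, {"milk", "chips"}]
print(frequent_sets(baskets, s=2))   # includes frozenset({'beer', 'diapers'})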
Mining for rules

- Frequent sets tell us which things go together
- But they don't tell us causality
- Sometimes want to say: a causes b
  - Or at least: the presence of a makes b likely
- E.g.: razor purchase → future razorblade purchases
  - but not vice versa
  - → make razors cheap
Association rules

Here's some data – what implies what?

Xact  Custid  Date     Item   Qty
111   201     5/1/99   Pen    2
111   201     5/1/99   Ink    1
111   201     5/1/99   Milk   3
111   201     5/1/99   Juice  6
112   105     6/3/99   Pen    1
112   105     6/3/99   Ink    1
112   105     6/3/99   Milk   1
113   106     5/10/99  Pen    1
113   106     5/10/99  Milk   1
114   201     6/1/99   Pen    2
114   201     6/1/99   Ink    2
114   201     6/1/99   Juice  4
114   201     6/1/99   Water  1
Association rules

- Candidates:
  - Pen → ink
  - Ink → pen
  - Water → juice
  - Etc.
- How to pick?
  - Two main measures
Judging association rules

- Support
  - Support(X → Y) = support(X ∪ Y) = Pr(X ∪ Y)
  - Does the rule matter?
- Confidence
  - Conf(X → Y) = support(X → Y) / support(X) = Pr(Y|X)
  - Should we believe the rule?
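
A quick sketch of the two measures over basket data; the helper functions are mine, and the tiny example baskets are just the four transactions from the pen/ink table on the next slides:

def support(itemset, baskets):
    # fraction of all baskets that contain every item in the set
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(x, y, baskets):
    # Conf(X -> Y) = support(X union Y) / support(X) = Pr(Y|X)
    return support(x | y, baskets) / support(x, baskets)

baskets = [{"pen", "ink", "milk", "juice"}, {"pen", "ink", "milk"},
           {"pen", "milk"}, {"pen", "ink", "juice", "water"}]
print(support({"pen", "ink"}, baskets))        # 0.75
print(confidence({"pen"}, {"ink"}, baskets))   # 0.75
print(confidence({"ink"}, {"pen"}, baskets))   # 1.0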
Association rules

What are the supports and confidences?

Xact  Custid  Date     Item   Qty
111   201     5/1/99   Pen    2
111   201     5/1/99   Ink    1
111   201     5/1/99   Milk   3
111   201     5/1/99   Juice  6
112   105     6/3/99   Pen    1
112   105     6/3/99   Ink    1
112   105     6/3/99   Milk   1
113   106     5/10/99  Pen    1
113   106     5/10/99  Milk   1
114   201     6/1/99   Pen    2
114   201     6/1/99   Ink    2
114   201     6/1/99   Juice  4
114   201     6/1/99   Water  1
- Pen → ink
- Ink → pen
- Water → juice

Which matter?
Discovering association rules

- Association rules only matter if their support and confidence are both high enough
  - User specifies the minimum allowed for each
- First, support:
  - High support(X → Y) → high support(X ∪ Y)
  - → X ∪ Y is a frequent set
- So, to find a good association rule:
  1. Generate a frequent set Z
  2. Divide it into subsets X, Y all possible ways
     - Checking whether conf(X → Y) is high enough
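
A sketch of that generate-and-split step, reusing support()/confidence() and the example baskets from the earlier sketch; min_conf stands in for the user-specified confidence threshold:

from itertools import combinations

def rules_from(z, baskets, min_conf):
    # split the frequent set z into X -> Y every possible way; keep the confident rules
    z = frozenset(z)
    rules = []
    for r in range(1, len(z)):
        for x in map(frozenset, combinations(z, r)):
            y = z - x
            if confidence(x, y, baskets) >= min_conf:
                rules.append((set(x), set(y)))
    return rules

print(rules_from({"pen", "ink"}, baskets, min_conf=0.9))   # [({'ink'}, {'pen'})]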
Association rules

Suppose {pen, ink} is frequent

Xact  Custid  Date     Item   Qty
111   201     5/1/99   Pen    2
111   201     5/1/99   Ink    1
111   201     5/1/99   Milk   3
111   201     5/1/99   Juice  6
112   105     6/3/99   Pen    1
112   105     6/3/99   Ink    1
112   105     6/3/99   Milk   1
113   106     5/10/99  Pen    1
113   106     5/10/99  Milk   1
114   201     6/1/99   Pen    2
114   201     6/1/99   Ink    2
114   201     6/1/99   Juice  4
114   201     6/1/99   Water  1
Divide into subsets both ways:
- Pen → ink
- Ink → pen

Which do we choose?
Other kinds of baskets

- Here, the basket was a single transaction (purchase)
- But could have other baskets
  - All purchases from each customer
  - All purchases on the first day of each month
  - Etc.
Frequent set/association applications

- Store/website/catalog layout
- Page You Made
- Direct marketing
- Fraud detection
Mining v. warehousing

- Warehousing: let the user search, group by interesting properties
  - "Give me the sales of A4s by year and dealer, for these colors"
  - User tries to learn from the results which properties are important/interesting
  - What's driving sales?
- Mining: tell the user what the interesting properties are
  - How can I increase sales?
Social/political concerns

- Privacy
- TIA
- Sensitive data
- Allow mining but not queries
- Opt-in / opt-out
- "Don't be evil."
For more info

- See Dhar & Stein: Seven Methods for Transforming Corporate Data into Business Intelligence (1997)
  - Drawn on above
  - A few years old, but very accessible
- http://www.kdnuggets.com/
- Data mining courses offered here…
Future

- RAID
- Websearch
- Proj5 due
  - Print by 10:30 and turn in on time (at 11)
  - If not, email…
- Final Exam: next Thursday, 5/5, 10-11:50am
  - Info is up