C20.0046: Database Management Systems
Lecture #26
M.P. Johnson
Stern School of Business, NYU
Spring, 2005
Agenda
Last time:
OLAP & Data Warehouses
Data Mining
Websearch
Etc.
Goals after today:
1. Be aware of what problems DM solves
2. Know what the algorithms are
3. Understand supervised v. unsupervised learning
4. Understand how to find frequent sets and why
New topic: Data Mining
Situation: data rich but knowledge poor
Terabytes and terabytes of data
Searching for needles in haystacks
Can collect lots of data
credit card xacts
bar codes
club cards
webserver logs
call centers
311 calls in NYC
cameras
Lots of data
Can store this data
DBMSs, data warehousing
Can query it
SQL
DW/OLAP queries
Find the things such that…
Find the totals for each of…
Can get answers to specific questions, but what does it all mean?
But how to learn from this data?
What kinds of people will buy my widgets?
Whom should I send my direct-mail literature to?
How can I segment my market?
http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
Whom should we approve for a gold card?
Who gets approved for car insurance?
And what should we charge?
Knowledge Discovery
Goal: Extract interesting/actionable
knowledge from large collections of data
Finding rules
Finding patterns
Classifying instances
KD ~= DM ~= Business Analytics
Business Intelligence: semi-intelligent
reporting/OLAP
Data Mining at intersection of disciplines
DBMS: query processing, data warehousing, OLAP
Statistics: mathematical modeling, regression
Visualization: 3-D representation
AI/machine learning: neural networks, decision trees, etc.
Except the algorithms must really scale
ML/DM: two basic kinds
1. Supervised learning: classification, regression
2. Unsupervised learning: clustering (or: find something interesting)
Supervised learning
Situation: you have many particular instances
Transactions, customers, credit applications
Have various fields describing them
But one other property you’re interested in:
E.g., credit-worthiness
Goal: infer this dependent property from the other data
Supervised learning
Supervised learning starts with training data
Many instances, including the dependent property
Use the training data to build a model
Many different algorithms
Given the model, you can now determine the dependent property for new instances
And ideally, the answers are “correct”
Categorical property → “classification”
Numerical property → “regression”
k-Nearest Neighbor
Very simple algorithm
Map training data points in a space
Given any two points, can compute the distance between them
E.g., Euclidean
Given an unlabeled point, label this way:
Find the k nearest points
Let them vote on the label
Sometimes works
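To make the voting step concrete, here is a minimal Python sketch of k-NN classification as described above (toy data, Euclidean distance, simple majority vote; not from the slides):

import math
from collections import Counter

def knn_label(training, point, k=3):
    # training: list of (point, label) pairs already mapped into the space
    nearest = sorted(training, key=lambda pl: math.dist(pl[0], point))[:k]  # the k nearest points
    votes = Counter(label for _, label in nearest)                          # let them vote on the label
    return votes.most_common(1)[0][0]

# toy example: the three nearest neighbors of (4, 4) mostly say "yes"
print(knn_label([((0, 0), "no"), ((1, 1), "no"), ((5, 5), "yes"), ((6, 5), "yes")], (4, 4)))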
Neural Networks (skip?)
“Hill climbing”
Based on connections between neurons in the brain
Simple NN:
1. input layer with 3 nodes
2. hidden layer with 2 nodes
3. output layer with 1 node
Each node points to each node in the next level
Each node has some activation level and something like a critical mass (a threshold)
Draw picture
What kind of graph is this?
Neural Networks (skip?)
Values passed into the input nodes represent the problem instance
Given the weighted sum of its inputs, a neuron sends out a pulse only if the sum is greater than its threshold
Values output by the hidden nodes are sent to the output node
If the weighted sum going into the output node is high enough, it outputs 1; otherwise 0
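A minimal sketch of that forward pass for the 3-2-1 network, with made-up weights and thresholds (in practice these would be learned, e.g. by backpropagation as described below):

def unit(inputs, weights, threshold):
    # a node fires (outputs 1) only if its weighted input sum exceeds its threshold
    return 1 if sum(x * w for x, w in zip(inputs, weights)) > threshold else 0

def network(x1, x2, x3):
    # hypothetical weights and thresholds, for illustration only
    h1 = unit([x1, x2, x3], [0.5, -0.2, 0.8], threshold=0.6)   # hidden node 1
    h2 = unit([x1, x2, x3], [0.3, 0.9, -0.4], threshold=0.5)   # hidden node 2
    return unit([h1, h2], [0.7, 0.7], threshold=1.0)           # output node

print(network(1, 0, 1))   # 1 for "yes", 0 for "no"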
NN applications (skip?)
plausible application: we have data about potential
customers
party registration
married or not
gender
income level
or: have credit applicant information
employment
income
home ownership
bankruptcy
should we give a credit card to him?
How NNs work (skip?)
Hope: plug in a customer → out comes whether we should market toward him
How does it get the right answer?
Initially, all weights are random!
But: we assume we have data for lots of people
which we know to be either interested in our
products or not (let’s say)
we have data for both kinds
So: when we plug in one of these customers, we
know what the right answer is supposed to be
How NNs work (skip?)
Can use the backpropagation algorithm:
For each known problem instance, plug it in and look at the answer
If the answer is wrong, change the edge weights in one way; otherwise, change them the opposite way (details omitted)
Repeat…
the more iterations we do, the more the NN learns
our known data
with enough confidence, can apply NN to unknown
customer data to learn whether to market toward
them
LBS example (skip?)
Investments
Goal: maximize return on investments
Lots of time-series data for different properties of different stocks
Buy/sell the right securities at the right time
Market signals: pick the right ones, react
Solution: create an NN for each stock
Retrain weekly
Decision Trees
Another use of (rooted) trees
Trees but not BSTs
Each node ~ one attribute
Its children ~ possible values of that
attribute
E.g.: each node ~ some field on a credit app
Each path from root to leaf is one rule
If these fields have these values then make
this decision
Decision Trees
Details:
For a binary property, two out-edges, but there may be more
For a continuous property (income), divide the values into discrete ranges
A property may appear more than once
Example: top node: history of bankruptcy?
If yes, REJECT
If no, then: employed?
If no, … (maybe look for a high monthly housing payment)
If yes, …
Particular algorithms: ID3, CART, etc.
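A minimal sketch of the credit-application tree above as nested ifs; each root-to-leaf path is one rule. The field names, the housing-payment cutoff, and the branches the slide leaves elided are hypothetical:

def credit_decision(app):
    if app["bankruptcy"]:                  # top node: history of bankruptcy?
        return "REJECT"
    if not app["employed"]:                # next node: employed?
        if app["monthly_housing"] > 2000:  # hypothetical cutoff on a continuous property
            return "REJECT"
        return "APPROVE"
    return "APPROVE"

print(credit_decision({"bankruptcy": False, "employed": True, "monthly_housing": 1200}))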
Naïve Bayes Classifier
Bayes’ Theorem: Pr(B|A) = Pr(B,A)/Pr(A) = Pr(A|B)*Pr(B)/Pr(A)
Used in many spam filters
W means: “the msg has the words W1, W2, …, Wn”
S means: “it’s spam”
Goal: given a new msg with certain words, is this spam?
Pr(S|W) = Pr(S,W)/Pr(W) = Pr(W|S)*Pr(S)/Pr(W)
Is Pr(S|W) > 50%?
Naïve Bayes Classifier
This is supervised learning, so we first have a
training phase
Training phase:
Look at lots of spam messages and my non-spam
messages
For each word Wi, compute Pr(Wi)
For each word Wi, compute Pr(Wi|S)
Compute Pr(S)
That’s it!
Now, we wait for email to arrive…
Naïve Bayes Classifier
When a new msg with words W = W1…Wn arrives, we compute:
Pr(S|W) = (Pr(W|S)*Pr(S))/Pr(W)
What’s Pr(W) and Pr(W|S)?
Assuming words are independent (obviously false), we have:
Pr(W) = Pr(W1)*Pr(W2)*…*Pr(Wn)
Pr(W|S) = Pr(W1|S)*Pr(W2|S)*…*Pr(Wn|S)
Each number here, we have precomputed!
Except for new words…
To decide spam status of message, then, we just do the math!
Very simple, but works surprisingly well in practice
Really simple: can write in a page of Perl code
See also Paul Graham: http://www.paulgraham.com/spam.html
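A minimal sketch of the computation above, assuming Pr(S), Pr(Wi), and Pr(Wi|S) have already been estimated in the training phase; the probabilities below are made up for illustration:

def pr_spam(words, pr_s, pr_w, pr_w_given_s):
    # Pr(S|W) = Pr(W|S)*Pr(S) / Pr(W), treating words as independent
    num, den = pr_s, 1.0
    for w in words:
        if w not in pr_w:        # new word: one simple policy is to ignore it
            continue
        num *= pr_w_given_s[w]
        den *= pr_w[w]
    return num / den

# toy training estimates (made up)
pr_w = {"salam": 0.05, "funds": 0.10}
pr_w_given_s = {"salam": 0.08, "funds": 0.12}
score = pr_spam(["salam", "funds"], pr_s=0.4, pr_w=pr_w, pr_w_given_s=pr_w_given_s)
print(score, score > 0.5)   # 0.768 -> classified as spam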
Naïve Bayes Classifier
From: "mrs bulama shettima" <[email protected]
Date: September 11, 2004 4:32:14 AM EDT
To: [email protected]
Subject: SALAM.
SALAM.
My name is Hajia Amina Shettima Mohammed Bulama the eldest
wife of Shettima Mohammed Bulama who was the est. while
managing director of the Bank of the North Nig, Plc.
I am contacting you in a benevolent spirit; utmost
confidence and trust to enable us provide a solution to a
money transfer of $12M that is presentlt the last resort of
my family.My husband has just been retired from office and
has also been placed under survelience of which he can not
travel out of the country for now due to allegations
levelled against him while in office as the Managing
director of the bank of the north Nig. Plc of which he is
facing litigations right now.
You may be quite surprised at my sudden contact to you but
do not despair,I got your contact from a business site on
the internet and following the information I gathered about
you, I was convinced that you could be of assistance to me.
So, decided to contact you at once due to the urgency
required for us to immediately transfer the said funds out
of the country.During the time my husband was with the bank
he made several loans to various oil firms and inreturn he
was given various sums as gratifications.The prominent
amongst the deals was monies that emanated from funds set
aside for the importation of various ballot equipmemts for
the just concluded april general elections . If you are
conversant with world news,you would understand better.
During this period my husband was able to make some good
money for himself and kept in a private security finance
accounts here as persopnal effects.Out of the money my
husband made, he left the sum of US$12M (Twelve Mllion
United States Dollars) which was kept in a Private security
firm here in Nigeria and he kept it in the vaults of this
firm as personal effects.
The reason is because no names were used to lodge in the
consignment containing the funds. Instead, he used PERSONAL
IDENTIFICATION NUMBERS (PIN)and declared the contents as
Bearer Bonds and Treasury Bills. Also the firm issued him
with a certificate of deposit of the consignment. Note that
I have these information in my custody.Right now, my
husband has asked me to negotiate with a foreigner who
would assist us lay claim to the consignment, as the
eldest wife of my husband,I believe that I owe the entire
family an obligation to ensure that the US$12M is
successfully transferred abroad for investment purposes.
With the present situation, I cannot do it all by myself.
It is based on this that I am making this contact with you.
I have done a thorough homework and fine-tuned the best way
to create you as the beneficiary to the consignment
containing the funds and effect the transfer of the
consignment accordingly. It is rest assured that the
modalities I have resolved to finalize the entire project
guarantees our safety and the successful transfer of the
funds. So, you will be absolutely right when you say that
this project is risk free and viable.
If you are capable and willing to assist,contact me at once
via this email for more details.
Believe me, there is no one else we can trust again. All my
husband's friends have deserted us after exploiting us on
the pretence of trying to help my husband. As it is said,
it is at the time of problems that you know your true
friends.So long as you keep everything to yourself, we
would definitely have no problems. For your assistance, I
am ready to give you as much as 25% of the total funds
after confirmation in your possession and invest a
reasonable percentage into any viable business you may
suggest. Please, I need your assistance to make this happen
and please do not undermine it because it will also be a
source of up liftment to you also. You have absolutely
nothing to loose in assisting us instead,you have so much
to gain.
I will appreciate if you can contact me at my private
email:[email protected] once.
Awaiting your urgent and positive response.
Thanks and Allah be with you.
Hajia (Mrs) Amina Shettima Bulama.
Judging the results
(Relatively) easy to create a model that does very
well on the training data
What matters: should do well on future, unseen data
Common problem: overfitting
Model is too close to the training data
Needs to be pruned
Common approach: cross-validation
Divide training data into 10 parts
Train on 9/10, test on 1/10
Do this all 10 ways
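A minimal sketch of that 10-fold procedure, assuming some train(rows) -> model and accuracy(model, rows) functions exist for whatever learner is being judged:

def cross_validate(data, train, accuracy, k=10):
    folds = [data[i::k] for i in range(k)]    # divide training data into k parts
    scores = []
    for i in range(k):
        held_out = folds[i]
        rest = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(train(rest), held_out))   # train on 9/10, test on 1/10
    return sum(scores) / k                    # average over all k ways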
Association applications
Spam filtering
Network intrusion detection
Trading strategies
Political marketing:
Mail-order tulip bulbs → Conservative party voters
NYTimes last year…
New topic: Clustering
Unsupervised learning
divide items up into clusters of same type
What are the different types of customers we
have?
What types of webpages are there?
Each item becomes a point in the space
One dimension for every property
Possible plotting for webpages: one dimension for each word
Value in dimension d is (# occurrences of word d) / (total # of word occurrences)
k-means
k-means
Simple clustering algorithm
Want to partition data points into sets of
“similar” instances
Plot in n-space
partition points into clumps in space
Like an unsupervised analog of k-NN
But iterative
k-means
Alg:
Choose initial means (centers)
Do {
assign each point to the closest mean
recompute means of clumps
} until change in means < e
Visualization:
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
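A minimal Python sketch of that loop for points in the plane (Euclidean distance, random initial means; not from the slides):

import math, random

def kmeans(points, k, eps=1e-6):
    means = random.sample(points, k)                    # choose initial means (centers)
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                # assign each point to the closest mean
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]  # recompute means of clumps
        if max(math.dist(m, n) for m, n in zip(means, new_means)) < eps:
            return means, clusters                      # stop: change in means < e
        means = new_means

print(kmeans([(0, 0), (0, 1), (9, 9), (10, 10)], k=2))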
Amazon’s “Page You Made”
How does this work?
New topic: Frequent Sets
Find sets of items that go together
Intuition: market-basket data
Suppose you’re Wal-Mart, with 460TB of data
http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
Might not know customer history (but maybe: club cards!)
One thing you do know: groupings of items
Each set of items rung up in an individual purchase
Famous (claimed) example: beer and diapers are positively correlated
Intuition: babies at home, not at bars
So: put chips in between
Finding Frequent Sets
Situation: one table
Baskets(id, item)
One row for each purchase of each item
Alternative interpretation: words on webpages
Goal: given support s, find sets of items that appear together in >= s baskets
For pairs, obvious attempt:
SELECT B1.item, B2.item, COUNT(B1.id)
FROM Baskets B1, Baskets B2
WHERE B1.id = B2.id AND B1.item < B2.item
GROUP BY B1.item, B2.item
HAVING COUNT(B1.id) >= s;
What’s wrong?
A Priori Property
A priori ~ prior to
Prior to investigation
Deductive/from reason/non-empirical
As opposed to a posteriori ~ after
After investigation/experience
Logical observation: if support(X) >= s, then for every individual item x in X, support(x) >= s
So any item y with support < s can be ignored
A Priori Property
E.g.: You sell 10,000 items; the store’s records have 1,000,000 baskets, with an average of 20 items each
20,000,000 rows in Baskets
# pairs of items from 1,000,000 baskets = C(20,2)*1,000,000 = 190,000,000 pairs!
One idea:
1. Find the support-s items
2. Run the old query on them
A Priori Property
Suppose we’re looking for support-10,000 pairs
Each item in the pair must have support 10,000
Counting argument: only 20,000,000 individual items were purchased
Can have at most 2,000 = 20,000,000/10,000 popular items
Eliminated 4/5 of the item types!
Actually, probably much more
Will also lower the average basket size
Suppose we now have 500,000 baskets with average basket size = 10
# pairs = 500,000*C(10,2) = 22,500,000 pairs
A Priori Algorithm
But: a frequent itemset may have >2 items
Idea: build frequent sets (not just pairs)
iteratively
Alg:
First, find all frequent items
These are the size-1 frequent sets
Next, for each size-k frequent set
For each frequent item
Check whether the union is frequent
If so, add to collection of size-k+1 frequent sets
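A minimal sketch of that level-wise construction, assuming baskets is a list of item sets and s is the minimum support count:

def frequent_sets(baskets, s):
    def count(itemset):
        # number of baskets containing every item of itemset
        return sum(itemset <= basket for basket in baskets)

    items = {item for basket in baskets for item in basket}
    freq_items = {i for i in items if count({i}) >= s}          # the size-1 frequent sets
    levels = [{frozenset([i]) for i in freq_items}]
    while levels[-1]:
        bigger = set()
        for fs in levels[-1]:                                   # each size-k frequent set
            for i in freq_items:                                # each frequent item
                union = fs | {i}
                if len(union) > len(fs) and count(union) >= s:  # is the union frequent?
                    bigger.add(union)
        levels.append(bigger)                                   # size-(k+1) frequent sets
    return [fs for level in levels for fs in level]

print(frequent_sets([{"beer", "diapers", "chips"}, {"beer", "diapers"}, {"milk"}], s=2))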
Mining for rules
Frequent sets tell us which things go together
But they don’t tell us causality
Sometimes want to say: a causes b
Or at least: the presence of a makes b likely
E.g.: razor purchase → future razorblade purchases
But not vice versa
So: make razors cheap
Association rules
Here’s some data – what implies what?
Xact  Custid  Date     Item   Qty
111   201     5/1/99   Pen    2
111   201     5/1/99   Ink    1
111   201     5/1/99   Milk   3
111   201     5/1/99   Juice  6
112   105     6/3/99   Pen    1
112   105     6/3/99   Ink    1
112   105     6/3/99   Milk   1
113   106     5/10/99  Pen    1
113   106     5/10/99  Milk   1
114   201     6/1/99   Pen    2
114   201     6/1/99   Ink    2
114   201     6/1/99   Juice  4
114   201     6/1/99   Water  1
Association rules
Candidates:
Pen → ink
Ink → pen
Water → juice
Etc.
How to pick?
Two main measures
Judging association rules
Support
Support(X → Y) = support(X ∪ Y) = Pr(X ∪ Y)
Does the rule matter?
Confidence
Conf(X → Y) = support(X → Y) / support(X) = Pr(Y|X)
Should we believe the rule?
Association rules
What are the supports and confidences?
Xact  Custid  Date     Item   Qty
111   201     5/1/99   Pen    2
111   201     5/1/99   Ink    1
111   201     5/1/99   Milk   3
111   201     5/1/99   Juice  6
112   105     6/3/99   Pen    1
112   105     6/3/99   Ink    1
112   105     6/3/99   Milk   1
113   106     5/10/99  Pen    1
113   106     5/10/99  Milk   1
114   201     6/1/99   Pen    2
114   201     6/1/99   Ink    2
114   201     6/1/99   Juice  4
114   201     6/1/99   Water  1
Pen → ink
Ink → pen
Water → juice
Which matter?
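A minimal sketch that computes these from the data above (grouping the table by Xact into baskets):

baskets = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice", "Water"},
}

def support(itemset):
    # fraction of baskets containing every item of itemset
    return sum(itemset <= b for b in baskets.values()) / len(baskets)

def confidence(lhs, rhs):
    # Pr(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [({"Pen"}, {"Ink"}), ({"Ink"}, {"Pen"}), ({"Water"}, {"Juice"})]:
    print(lhs, "->", rhs, support(lhs | rhs), confidence(lhs, rhs))
# Pen -> Ink:     support 0.75, confidence 0.75
# Ink -> Pen:     support 0.75, confidence 1.0
# Water -> Juice: support 0.25, confidence 1.0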
Discovering association rules
Association rules only matter if their support and
confidence are both high enough
User specifies minimum allowed for each
First, support:
High support(X → Y) → high support(X ∪ Y)
X ∪ Y is a frequent set
So, to find a good association rule:
1. Generate a frequent set Z
2. Divide it into subsets X, Y in all possible ways, checking whether conf(X → Y) is high enough (sketched below)
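Continuing the earlier sketch (and reusing its confidence() function), step 2 can be coded roughly as:

from itertools import combinations

def rules_from(frequent_set, min_conf, confidence):
    # split Z into X -> Y all possible ways, keep the high-confidence rules
    rules = []
    Z = set(frequent_set)
    for r in range(1, len(Z)):
        for X in map(set, combinations(Z, r)):
            Y = Z - X
            if confidence(X, Y) >= min_conf:
                rules.append((X, Y))
    return rules

# e.g. rules_from({"Pen", "Ink"}, 0.9, confidence) keeps Ink -> Pen but not Pen -> Ink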
Association rules
Suppose {pen,ink} is frequent
Xact  Custid  Date     Item   Qty
111   201     5/1/99   Pen    2
111   201     5/1/99   Ink    1
111   201     5/1/99   Milk   3
111   201     5/1/99   Juice  6
112   105     6/3/99   Pen    1
112   105     6/3/99   Ink    1
112   105     6/3/99   Milk   1
113   106     5/10/99  Pen    1
113   106     5/10/99  Milk   1
114   201     6/1/99   Pen    2
114   201     6/1/99   Ink    2
114   201     6/1/99   Juice  4
114   201     6/1/99   Water  1
Divide into subsets both ways:
Pen → ink
Ink → pen
Which do we choose?
Other kinds of baskets
Here, the basket was a single transaction
(purchase)
But could have other baskets
All purchases from each customer
All purchases on the first day of each month
Etc.
Frequent set/association applications
Store/website/catalog layout
Page You Made
Direct marketing
Fraud detection
Mining v. warehousing
Warehousing: let user search, group by
interesting properties
Give me the sales of A4s by year and dealer, for
these colors
User tries to learn from results which properties
are important/interesting
What’s driving sales?
Mining: tell the user what the interesting
properties are
How can I increase sales?
Social/political concerns
Privacy
TIA
Sensitive data
Allow mining but not queries
Opt-in/opt-out
“Don’t be evil.”
For more info
See Dhar & Stein: Seven Methods for Transforming Corporate Data into Business Intelligence (1997)
Drawn on above
A few years old, but very accessible
http://www.kdnuggets.com/
Data mining courses offered here…
Future
RAID
Websearch
Proj5 due
Print by 10:30 and turn in on time (at 11)
If not, email…
Final Exam: next Thursday, 5/5, 10-11:50am
Info is up