Transcript ADB_DM1

These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Advanced Database Systems
F24DS2 / F29AT2
About Data Mining
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Data Mining has Many meanings
There are lots of things you can do with a database:
1. Access data via straightforward queries, to answer
straightforward questions about instances. E.g.:
• ``What is Ellen McArthur’s home phone number?”
• “What is the ISBN number of Eats, shoots and leaves,
by Lynne Truss?”,
• “What grade did Larry Page get for the Internet
module??”
• “Give me a list of all pages on the www that contain the
phrase “fried egg sandwich”.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Data Mining has Many meanings (cont…)
2. Generate simple reports about data via straightforward
queries, to answer questions about sets of instances. E.g.:
• ``How many of our customers are called “Trevor?””
• “Which of our books has been borrowed more times in
the last month than Eats, shoots and leaves,
by Lynne Truss?”,
• “Which student has the highest average marks?”
• “ What percentage of house owners also own a car?”.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Data Mining has Many meanings (cont…)
3. Generate complex/and/or comprehensive statistical
reports about the database as a whole, to summarise and
understand the data – this is what tends to be done in the
Analysis stage of Data Cleaning.. E.g.:
• For each field, generate a histogram of the values
• Run one or more clustering algorithms to find the
clusters in the data.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Data Mining has Many meanings (cont…)
4. Build Predictive models that may then be useful for
business or research. For example:
• Based on stock market data, we can construct a model
that attempts to predict tomorrow’s Dow Index closing
price, given the previous few days’ prices.
• Based on blood test data from past patients, we can
construct a model that attempts to predict whether or not
a patient is developing hepatitis.
• Based on historic data on vibrations, we can build a
model that tries to predict beforehand if an aircraft wing
is likely to fail.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Data Mining has Many meanings (cont…)
5. Discover INTERESTING and USEFUL rules that are
`hidden’ in the data. For example:
• An analysis of supermarket basket data will show a
surprising amount of baskets that contain both beer and
nappies..
• Analysis of crime records data may find that the violent
crimes rate in newcastle seems to reduce significantly
whenever the violent crimes rate in Sunderland
increases significantly. .
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
All Together
1.
2.
3.
4.
5.
Accessing
Reporting
Clustering/Histograms
Predictive models
Discovery of interesting/surprising things
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Data Mining
1.
2.
3.
4.
5.
Accessing
Reporting
Clustering/Histograms
Predictive models
Discovery of interesting/surprising things
When you hear the term `data mining’, it can mean any of
2, 3, 4 and 5. In business/industry, `2’ and `3’ are called
data mining. In academia we usually take data mining
to mean mainly `4’ and `5’
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Notes on 3/4
1.
2.
3.
4.
5.
Accessing
Reporting
Clustering/Histograms
Predictive models
Discovery of interesting/surprising things
These are the things that you would look at more closely in
a machine learning course. The predictive models are
things like neural networks, decision trees and rulesets
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
What DM means for us
1.
2.
3.
4.
5.
Accessing
Reporting
Clustering/Histograms
Predictive models
Discovery of interesting/surprising things
So, 3 and 4 are dealt with in another course. 5 could be an
entire MSc course on its own, but that is what DM
means for us.
. In particular, we take a small bite of it that is relevant to
practical discovery of interesting things in very large
DBs. We look at a fast algorithm that can discover
interesting rules in transaction databases, and that is a
component in several advanced commercial systems..
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
First, some important
motivational/explanatory notes
Why do we need something like `type 5’ data mining at all?
Couldn’t the `beer and nappies’ thing have been found
by types 2 or 3 DM?
The next slide shows a tiny `supermarket basket’ database. E.g.
Record 11 is a customer who bought eggs and glue only; record 12
Records a transaction where the basket contained only apples.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
ID apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream
1
2
3
1
1
4
5
6
7
8
9
10
11
12
13
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
14
15
16
17
18
19
20
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
1
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Numbers
Our example DB has 20 records of supermarket
transactions, from a supermarket that only sells 9
things
One month in a large supermarket with five stores
spread around a reasonably sized city might easily
yield a DB of 20,000,000 baskets, each containing
a set of products from a pool of around 1,000
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Rules
A `rule’ is something like this:
If a basket contains apples and cheese, then it
also contains beer
Any such rule has two associated measures:
1. confidence – when the `if’ part is true,
how often is the `then’ bit true?
2. coverage or support – how much of the
database contains the `if’ part?
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Example:
What is the confidence and coverage of:
If the basket contains beer and cheese, then it
also contains honey
2/20 of the records contain both beer and cheese, so coverage is 10%
Of these 2, 1 contains honey, so confidence is 50%
Is that interesting ? Is that useful ? What makes a rule interesting
or useful?
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Interesting/Useful rules
Statistically, anything that is interesting is something
that happens significantly more than you would
expect by chance.
E.g. basic statistical analysis of basket data may
show that 10% of baskets contain bread, and 4%
of baskets contain washing-up powder. I.e:
– There is a probability 0.1 that a basket contains bread.
– There is a probability 0.04 that a basket contains
washing-up powder.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Bread and washing up powder
What is the probability of a basket containing
both bread and washing-up powder? The
laws of probability say:
If these two things are independent, chance is
0.1 * 0.04 = 0.004
That is, we would expect 0.4% of baskets to
contain both bread and washing up powder
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Interesting means surprising
We therefore have a prior expectation that just 4 in
1,000 baskets should contain both bread and
washing up powder.
If we investigate, and discover that really it is 20 in
1,000 baskets, then we will be very surprised. It
tells us that:
– Something is going on in shoppers’ minds: bread and
washing-up powder are connected in some way.
– There may be ways to exploit this discovery … put the
powder and bread at opposite ends of the supermarket?
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Finding surprising rules
Suppose we ask `what is the most surprising rule in this database? This
would be, presumably, a rule whose accuracy is more different from its
expected accuracy than any others. But it also has to have a suitable
level of coverage, or else it may be just a statistical blip, and/or
unexploitable.
Looking only at rules of the form:
if basket contains X and Y, then it also contains Z
… our realistic numbers tell us that there may be around 500,000,000
distinct possible rules. For each of these we need to work out its
accuracy and coverage, by trawling through a database of around
20,000,000 basket records. … c 1016 operations …
Yes, it’s easy to use `type 2’ DM, say, to work out the confidence and
coverage of a given rule. But type 5 DM is all about searching through,
somehow, 500,000,000 (or usually immensely more) rules to sniff out
what may be the interesting ones.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Here are some interesting ones
in our mini basket DB:
• If a basket contains glue, then it also
contains either beer or eggs
confidence: 100% ; coverage 25%
• If a basket contains apples and dates, then it
also contains honey
confidence 100% ; coverage 20%
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
What this lecture was about
• The many different meanings of data
mining
• Warming up for the next lecture, via gentle
discussion on transaction databases, rules,
confidence, coverage, and what it takes for
a rule to be interesting.
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me
These slides are at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/
Next
A classic fast algorithm for finding useful
rules in large databases,
David Corne, room EM G.39, x 3410, [email protected] / any questions, feel free to contact me