Data Warehousing
Data Mining
(and machine learning)
DM Lecture 1: Overview of DM, and overview of the DM part of
the DM&ML module
Some of these slides are derivative of Nick Taylor’s slides used for
this module in previous years
David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Overview of My Lectures
All at:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
• Lecture 1: about data and data mining;
• Lectures 2 and 3: Basic and useful ways to
process and understand data
• Lectures 4, 5, 6, 7, 8: Details of useful
algorithms for finding knowledge from data;
• Lecture 9: overview of what else there is.
Module assessment
100% by coursework
Two main items of coursework, 50% each
Four small items of coursework, worth
nothing, but if you don’t do them
adequately you fail the module.
This Semester
PDW lectures on Mondays (machine learning)
DWC lectures on Thursdays (data mining)
Friday slot usually unused – we may use it,
and will let you know in advance
All coursework set by DWC
Coursework submission
ALL coursework must be submitted as follows
• as PDF
• by email to [email protected]
• the c/w is an attachment
• Subject line: DMML Coursework A
– (… or B, C, D, 1, 2)
• Body of the email includes your Name and your Course
(e.g. Joe Smith, BSc CS – Jill Brown, MSc AI)
DWC lectures and c/w, key dates
Thur Sep 17th: This lecture; handout C/W A
Thur Sep 24th: Lecture; handout C/W B
Thur Oct 1st: Lecture; handout Main C/W 1 (50%)
Thur Oct 8th: Lecture
Thur Oct 15th: Lecture
Thur Oct 22nd: Lecture; handout Main C/W 2 (50%)
Thur Oct 29th: NO LECTURE (hand in C/W A, B and 1 on Fri 30th)
Thur Nov 5th: NO LECTURE
Thur Nov 12th: Lecture; handout C/W C; C/W 1 vivas on Fri 13th
Thur Nov 19th: Lecture; handout C/W D
Thur Nov 26th: Lecture (hand in C/W C, D and 2 on Fri 27th)
Thur Dec 3rd: C/W 2 vivas
At last, the lecture
What some people think can be done
with data
Answer simple questions like:
• How many female clients do we have?
• How much paint did we sell in 2007?
• Which is the most profitable branch of our
supermarket?
• Which postcodes suffered the most dropped
calls in July?
that is so
Boring
More interesting things that can be
done with data
Answer difficult and valuable questions like:
• How can we predict ovarian cancer early enough to treat it
successfully?
• How can I make significant profit on the stock market next
month?
• Two different authors claim to have written this story –
how can we resolve the dispute?
• How can we get our customers to spend more money in the
store?
• Is this loan applicant a good credit risk?
• Is this sonar image a mine, or a rock?
• What other websites will this browser be interested in?
Data Mining - Definition & Goal
Definition
• Data Mining is the exploration and analysis of
large quantities of data in order to discover
meaningful patterns and rules
Goal
• To permit some other goal to be achieved or
performance to be improved through a better
understanding of the data
Some examples of large databases
Retail basket data: much commercial DM is done with this. In one
store, 18,000 baskets per month.
Tesco has >500 stores. Per year, about 100,000,000 baskets?
The Internet: >15,000,000,000 pages
Lots of datasets: UCI Machine Learning repository
How can we begin to understand and exploit such datasets? Especially
the big ones?
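The basket figure above can be checked with a line of arithmetic. The sketch below assumes that every store sees the same 18,000 baskets per month as the single store quoted, and that there are exactly 500 stores (the slide says ">500", so this is a lower bound under that assumption).

```python
# Rough estimate of yearly basket volume for a large retailer.
# Assumptions: each store matches the quoted 18,000 baskets per month,
# and there are exactly 500 stores (the slide only says ">500").
BASKETS_PER_STORE_PER_MONTH = 18_000
MONTHS_PER_YEAR = 12
STORES = 500

baskets_per_year = BASKETS_PER_STORE_PER_MONTH * MONTHS_PER_YEAR * STORES
print(f"{baskets_per_year:,} baskets per year")  # 108,000,000 baskets per year
```

So the figure of roughly 100,000,000 baskets per year is the right order of magnitude.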
Like this …
and this …
and this …
or this …
• see http://websom.hut.fi/websom/milliondemo/html/root.html
Data Mining - Basics
• Data Mining is the process of discovering patterns
and inferring associations in raw data
• Data Mining is a collection of techniques intended
to analyse small or large amounts of data
• There is no single Data Mining approach
• Data Mining can employ a range of techniques,
either individually or in combination with each
other
Data Mining – Why is it important?
• Data are being generated in enormous quantities
• Data are being collected over long periods of time
• Data are being kept for long periods of time
• Computing power is formidable and cheap
• A variety of Data Mining software is available
• All of these data contain ‘hidden knowledge’ –
facts, rules, patterns, that can be usefully exploited
if we can find them.
Data Mining – History
• The approach has roots going back more than 40 years
• In the 1960s Data Mining was called
statistical analysis, and the pioneers were
statistical software companies such as SPSS
• By the late 1980s these traditional techniques had
been augmented by new methods such as machine
induction, artificial neural networks, evolutionary
computing, etc.
Some basic terminology
Gender   weight   height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc …
Each row of the table (for example: Male, 52kg, 1.71m, 243, 13.7s)
is called a data instance or a record or just a line of data
Each column of the table is called a field or an attribute;
for example, the value of the Age field in the 4th record is 274
Usually we are interested in predicting the value of a
particular field, given the values of the other fields. What we
want to predict is called the class field, or the target class
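The terminology can be made concrete with a minimal sketch that holds the example table in code. The field names, the dropped units, and the choice of the 100m time as the class field are assumptions made here for illustration; the slides do not say which field is the target.

```python
# The example table as a list of records (data instances).
# Units are dropped so values are plain numbers: kg, m, months, seconds.
FIELDS = ["gender", "weight_kg", "height_m", "age_months", "time_100m_s"]

records = [
    ("Male", 52, 1.71, 243, 13.7),
    ("Male", 89, 1.92, 388, 22.3),
    ("Female", 48, 1.67, 219, 14.6),
    ("Male", 86, 1.96, 274, 9.58),
    ("Male", 80, 1.88, 260, 10.56),
]

def field_value(record, field):
    """Return the value of one field (attribute) in one record."""
    return record[FIELDS.index(field)]

def split_target(record, target_field):
    """Separate a record into its input fields and its class (target) field."""
    i = FIELDS.index(target_field)
    inputs = record[:i] + record[i + 1:]
    return inputs, record[i]

# The value of the Age field in the 4th record (index 3) is 274.
print(field_value(records[3], "age_months"))

# Treating the 100m time as the class field is an assumption for illustration.
inputs, target = split_target(records[3], "time_100m_s")
print(inputs, target)
```

Any other field could be chosen as the class field in the same way; the remaining fields then become the inputs.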
Some data-mining related projects that I am currently
working on (either myself, or with a PhD student or RA)
Predicting whether or not two textures will be considered similar
by humans.
Predicting which of two or more writers is the author of a given
piece of text (you will do some work on this)
Discovering which subsets of many thousands of genes play a role
in specific diseases (cancer, diabetes, etc)
(you will do a little work on this too)
Discovering technical trading rules for stock market trading
(you will do a little work on this too)
Which pair of textures is most similar?
A line of data …
[0.23 1.88 9.64 3.22 …] [7.1 1086.9 2.23 …] [0.76]
5,000 features for texture1 | 5,000 features for texture2 | %age of people who think they are similar
Who wrote text chunk 4?
0.4
0.3
0.2
0.2
0.2
0.15
0.2
0.15
0.001 0.002 0.6 …
0
0.1 0.5 …
0.001 0.002 0.5 …
0
0.002 0.6 …
AuthorA
AuthorA
AuthorB
?
Word usage ‘fingerprint’ of
a 1,000 word chunk of text
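The slide does not define how the fingerprint is computed. One plausible reading, assumed here, is the relative frequency of each word from a fixed vocabulary. The sketch below builds such fingerprints and guesses the author of an unlabelled chunk by nearest neighbour. The vocabulary, the toy chunks and the nearest-neighbour rule are all illustrative choices, not the method used in the module.

```python
from collections import Counter
import math

def fingerprint(text, vocabulary):
    """Relative frequency of each vocabulary word in a chunk of text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for an empty chunk
    return [counts[w] / total for w in vocabulary]

def distance(a, b):
    """Euclidean distance between two fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def guess_author(unknown, labelled):
    """Return the author of the labelled fingerprint closest to `unknown`."""
    return min(labelled, key=lambda pair: distance(unknown, pair[0]))[1]

# Hypothetical vocabulary and toy chunks (far shorter than 1,000 words).
vocabulary = ["the", "of", "and", "but"]
labelled = [
    (fingerprint("the cat and the dog and the bird", vocabulary), "AuthorA"),
    (fingerprint("of mice but of men but of gods", vocabulary), "AuthorB"),
]
unknown = fingerprint("the sun and the moon and the stars", vocabulary)
print(guess_author(unknown, labelled))
```

With real 1,000 word chunks the vocabulary would be much larger, and the labelled chunks would play the role of the AuthorA and AuthorB lines of data above.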
Did the Dow Jones go up or down in
the following week?
Down
Will the Dow Jones go up or down
tomorrow?
Data Mining – Two Major Types
• Directed (Farming) – Attempts to explain or categorise
some particular target field such as income, medical
disorder, genetic characteristic, etc.
• Undirected (Exploring) – Attempts to find patterns or
similarities among groups of records without the use of a
particular target field or collection of predefined classes
• Compare with Supervised and Unsupervised systems in
machine learning
Data Mining – Tasks
Classification - Example: high risk for cancer or not
Estimation - Example: household income
Prediction - Example: credit card balance transfer average
amount
Affinity Grouping - Example: people who buy X, often also
buy Y with a probability of Z
Clustering - similar to classification but no predefined
classes
Description and Profiling – Identifying characteristics
which explain behaviour - Example: “More men watch
football on TV than women”
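As an illustration of the Affinity Grouping task, the probability Z in "people who buy X often also buy Y" can be estimated as the fraction of baskets containing X that also contain Y. This is a minimal sketch on invented basket data; the function name and the items are hypothetical.

```python
def rule_confidence(baskets, x, y):
    """Estimate P(y in basket | x in basket) from a list of baskets (sets)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return None  # rule cannot be evaluated: nobody bought x
    with_both = [b for b in with_x if y in b]
    return len(with_both) / len(with_x)

# Toy basket data, invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "tea"},
]

z = rule_confidence(baskets, "bread", "butter")
print(f"People who buy bread also buy butter with probability {z:.2f}")
```

Here three baskets contain bread and two of those also contain butter, so Z is 2/3.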
Data Warehousing
• Note that Data Mining is very generic and can be used for
detecting patterns in almost any data
– Retail data
– Genomes
– Climate data
– Etc.
• Data Warehousing, on the other hand, is almost
exclusively used to describe the storage of data in the
commercial sector
What you should do this week
Browse the UCI Machine Learning repository
datasets and associated information; get
acquainted with data
Browse the statlib datasets archive, get acquainted
with that too.
And then …
Coursework A (0 marks, but you fail if
you don’t submit an adequate attempt)
Find three other dataset repositories as follows:
1. One that specialises in financial data
2. One that specialises in time series data
3. One that specialises in anything else.
For each of these three, tell me the URL, and write
one paragraph, ~100 words, in your own words,
describing the contents of this repository.
Submit on or before Friday October 30th
Au revoir
If time available …
Some slides about data warehousing; I don’t
consider this an essential part of this
module, but in case you want to know what
data warehousing is …
Data Warehousing - Definitions
“A subject-oriented, integrated, time-variant and nonvolatile collection of data
in support of management's decision making process”
W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No.
1, 1995 -- a very influential definition.
“A copy of transaction data, specifically structured for query and analysis”
Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”
Data Warehouse – why?
For organisational learning to take place data from
many sources must be gathered together over time
and organised in a consistent and useful way
Data Warehousing allows an organisation to
remember its data and what it has learned about its
data
Data Mining techniques make use of the data in a
Data Warehouse and subsequently add their results
to it
Data Warehouse - Contents
• A Data Warehouse is a copy of transaction data
specifically structured for querying, analysis and
reporting
• The data will normally have been transformed
when it was copied into the Data Warehouse
• The contents of a Data Warehouse, once acquired,
are fixed and cannot be updated or changed later
by the transaction system - but they can be added
to of course
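The three properties above (a copy of transaction data, transformed on the way in, then fixed but extendable) can be sketched as a toy append-only store. The class, the record layout and the pence-to-pounds transformation are all hypothetical; real warehouses are database systems, not in-memory lists.

```python
class Warehouse:
    """Toy append-only store: rows can be added but never changed."""

    def __init__(self, transform):
        self._rows = []
        self._transform = transform  # applied to each record as it is copied in

    def load(self, transactions):
        """Copy transaction records in, transforming each one on the way."""
        self._rows.extend(self._transform(t) for t in transactions)

    def rows(self):
        """Return a read-only snapshot of everything loaded so far."""
        return tuple(self._rows)

def to_warehouse_row(txn):
    # Hypothetical transformation: keep only analysis fields, pence -> pounds.
    return (txn["date"], txn["store"], txn["amount_pence"] / 100)

wh = Warehouse(to_warehouse_row)
wh.load([{"date": "2009-09-17", "store": "S1", "amount_pence": 1250, "till": 4}])
wh.load([{"date": "2009-09-18", "store": "S2", "amount_pence": 300, "till": 1}])
print(wh.rows())
```

There is deliberately no update or delete method: the only way to change the contents is to load more rows.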
Data Marts
• A Data Mart is a smaller, more focused
Data Warehouse – a mini-warehouse
• A Data Mart will normally reflect the
business rules of a specific business unit
within an enterprise – identifying data
relevant to that unit’s activities
From Data Warehousing to Machine
Learning, via Data Marts
The Big Challenge for Data Mining
• The largest challenge that a Data Miner may face
is the sheer volume of data in the Data Warehouse
• It is very important, then, that summary data also
be available to get the analysis started
• The sheer volume of data may mask the important
relationships in which the Data Miner is interested
• Being able to overcome the volume and interpret
the data is essential to successful Data Mining
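A small sketch of what "summary data" might look like: many individual rows reduced to a few per-store totals that an analysis can start from. The rows and the choice of summary (count and total sales per store) are invented for illustration.

```python
from collections import defaultdict

def summarise(rows):
    """Total and count of sales per store: a small summary of many rows."""
    totals = defaultdict(lambda: {"sales": 0.0, "baskets": 0})
    for store, amount in rows:
        totals[store]["sales"] += amount
        totals[store]["baskets"] += 1
    return dict(totals)

# Invented (store, basket value) rows standing in for warehouse contents.
rows = [("S1", 12.5), ("S2", 3.0), ("S1", 7.5), ("S2", 4.0), ("S1", 10.0)]
summary = summarise(rows)
for store, s in sorted(summary.items()):
    print(store, s["baskets"], s["sales"], s["sales"] / s["baskets"])
```

The summary has one entry per store however many rows there are, which is what makes it a practical starting point when the full volume is too large to inspect.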
What happens in practice …
Data Miners, both “farmers” and “explorers”, are
expected to utilise Data Warehouses to give
guidance and answer a limitless variety of
questions
The value of a Data Warehouse and Data Mining lies
in a new and changed appreciation of the meaning
of the data
There are limitations though - A Data Warehouse
cannot correct problems with its data, although it
may help to more clearly identify them