DMML1_1415 - Heriot

Data Mining and Machine Learning
Lecture 1: Why data is useful, and an overview of DMML
David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Overview of My Lectures
http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
There might be changes, so watch your email
Courseworks and deadlines will change slightly; give me a day or two
Lecture material will change a little, with one lecture to add
Module assessment
100% by coursework
Three main items of coursework:
CW 1: 30%
CW 2: 40%
CW 3: 30%
Two small items of coursework (A and B), worth 0%,
but if you don’t do them adequately you fail the module.
An extra part is added to each coursework for MSc students
Coursework submission
ALL coursework to be submitted as follows
• as PDF
• by email to [email protected]
• the c/w is an attachment
• Subject line: DMML Coursework A
– (… or B, 1, 2, 3)
• Body of the email includes your Name and your Course
(e.g. Joe Smith, BSc CS – Jill Brown, MSc AI)
At last, the lecture
What some people think can be done with data
Answer simple questions like:
• How many female clients do we have?
• How much paint did we sell in 2007?
• Which is the most profitable branch of our
supermarket?
• Which postcodes suffered the most dropped
calls in July?
that is so
More interesting things that can be done with data
Answer difficult and valuable questions like:
• How can we predict ovarian cancer early enough to treat it
successfully?
• How can I make significant profit on the stock market next
month?
• Two different authors claim to have written this story –
how can we resolve the dispute?
• How can we get our customers to spend more money in the
store?
• Is this loan applicant a good credit risk?
• Is this sonar image a mine, or a rock?
• What other websites will this browser be interested in?
Some competitions at
Data Mining - Definition & Goal
Definition
• Data Mining is the exploration and analysis of
(often) large quantities of data in order to discover
meaningful patterns and rules
Goal
• To permit some other goal to be achieved or
performance to be improved through a better
understanding of the data
Some examples of large databases
Retail basket data: much commercial DM is done with this. In one
store, 18,000 baskets per month
Tesco has >500 stores. Per year, about 100,000,000 baskets?
The Internet: more than ~20,000,000,000 pages
Lots of datasets: UCI Machine Learning repository
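The retail basket estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes every store handles the same 18,000 baskets per month as the single store quoted, and it uses 500 stores as a lower bound because the slide says ">500". The variable names are made up for illustration.

```python
# Figures quoted on the slide
baskets_per_store_per_month = 18_000
stores = 500  # lower bound; the slide says ">500"
months_per_year = 12

# Assumption: every store is as busy as the one store quoted
baskets_per_year = baskets_per_store_per_month * months_per_year * stores

print(f"{baskets_per_year:,}")  # 108,000,000
```

So the slide's figure of 100,000,000 baskets per year is the right order of magnitude under these assumptions.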
How can we begin to understand and exploit such datasets? Especially
the big ones?
Like this …
and this …
or this …
• see http://websom.hut.fi/websom/milliondemo/html/root.html
Or this
What on Earth is ‘big data’ anyway?
Data Mining & Machine Learning - Basics
• Data Mining is the process of discovering patterns and
inferring associations in raw data
• … a collection of techniques intended to analyse small or large
amounts of data
• … can employ a range of techniques, either individually or in
combination with each other
• Machine Learning is the same, but the term ML emphasises a
range of more sophisticated algorithms that try to learn
accurate predictive models of data
Data Mining – Why is it important?
• Data are being generated in enormous quantities
• Data are being collected over long periods of time
• Data are being kept for long periods of time
• Computing power is formidable and cheap
• A variety of Data Mining software is available
• All of these data contain ‘hidden knowledge’ –
facts, rules, patterns, that can be usefully exploited
if we can find them.
Some basic terminology
Gender   Weight   Height   Age in mths   100m time
Male     52kg     1.71m    243           13.7s
Male     89kg     1.92m    388           22.3s
Female   48kg     1.67m    219           14.6s
Male     86kg     1.96m    274           9.58s
Male     80kg     1.88m    260           10.56s
etc …
Each row of the table is called a data instance or a
record or just a line of data
Each column of the table is called a field or an attribute;
for example, the value of the Age field in the 4th record is 274
Usually we are interested in predicting the value of a
particular field, given the values of the other fields. What we
want to predict is called the class field, or the target class
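As a rough illustration of this terminology, the table from the slides can be written as a list of Python dictionaries, one per record. The key names, the choice to drop the units, and the choice of the 100m time as the class field are assumptions made here for the example; the slides do not say which field is the target.

```python
# Each dictionary is one record (data instance); each key is a field (attribute).
records = [
    {"gender": "Male",   "weight_kg": 52, "height_m": 1.71, "age_months": 243, "time_100m_s": 13.7},
    {"gender": "Male",   "weight_kg": 89, "height_m": 1.92, "age_months": 388, "time_100m_s": 22.3},
    {"gender": "Female", "weight_kg": 48, "height_m": 1.67, "age_months": 219, "time_100m_s": 14.6},
    {"gender": "Male",   "weight_kg": 86, "height_m": 1.96, "age_months": 274, "time_100m_s": 9.58},
    {"gender": "Male",   "weight_kg": 80, "height_m": 1.88, "age_months": 260, "time_100m_s": 10.56},
]

# Assumption for this sketch: the 100m time is the field we want to predict.
CLASS_FIELD = "time_100m_s"


def split_record(record, class_field=CLASS_FIELD):
    """Separate the input fields from the class field of one record."""
    inputs = {name: value for name, value in record.items() if name != class_field}
    return inputs, record[class_field]


fourth = records[3]          # the 4th record (lists are indexed from 0)
print(fourth["age_months"])  # 274

inputs, target = split_record(fourth)
print(inputs)   # the other fields, used to make the prediction
print(target)   # the value of the class field
```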
Some data-mining related projects that I am currently
working on (either myself, or with a PhD student or RA)
Analysing sonar images to detect underwater mines
Predicting which of two or more writers is the author of a given
piece of text
Discovering which subsets of many thousands of genes play a role
in specific diseases (cancer, diabetes, etc.)
Analysing the current Twitter timeline to detect immediate evidence
of an earthquake
Who wrote text chunk 4?
Chunk   Word usage                               Author
1       0.4   0.2    0.001   0.002   0.6   …     AuthorA
2       0.3   0.15   0       0.1     0.5   …     AuthorA
3       0.2   0.2    0.001   0.002   0.5   …     AuthorB
4       0.2   0.15   0       0.002   0.6   …     ?
Each row is the word usage ‘fingerprint’ of
a 1,000 word chunk of text
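One simple way to picture the fingerprint idea is sketched below: each chunk becomes a vector of relative word frequencies, and the unknown chunk is given the author of the closest labelled fingerprint. This is a nearest-neighbour sketch chosen for illustration, not necessarily the method used in the project; the vocabulary, the example texts and all function names are hypothetical.

```python
import math
from collections import Counter


def fingerprint(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in vocabulary]


def distance(a, b):
    """Euclidean distance between two fingerprints of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_author(unknown, labelled):
    """Return the author whose labelled fingerprint is closest to the unknown one."""
    closest = min(labelled, key=lambda pair: distance(unknown, pair[0]))
    return closest[1]


# Hypothetical vocabulary and tiny texts; real chunks would be ~1,000 words.
vocabulary = ["the", "and", "of"]
labelled = [
    (fingerprint("the cat and the dog and the bird", vocabulary), "AuthorA"),
    (fingerprint("a history of the art of the war of words", vocabulary), "AuthorB"),
]
unknown = fingerprint("the fox and the hound and the hare", vocabulary)

print(nearest_author(unknown, labelled))  # AuthorA
```

With real text, the vocabulary would typically be a set of common function words, and more than one labelled chunk per author would be needed before trusting the answer.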
Did the Dow Jones go up or down in
the following week?
Down
Will the Dow Jones go up or down
tomorrow?
Data Warehousing
• Note that Data Mining is very generic and can be used for
detecting patterns in almost any data
– Retail data
– Genomes
– Climate data
– Etc.
• Data Warehousing, on the other hand, is almost
exclusively used to describe the storage of data in the
commercial sector
What you should do this week
Browse the UCI Machine Learning repository
datasets and associated information; get
acquainted with data
Browse the StatLib datasets archive; get acquainted
with that too.
Browse the http://www.kaggle.com/ website - to
give you some idea of how hot data mining is
And then …
Coursework A (0 marks, but you fail if
you don’t submit an adequate attempt)
Find three other dataset repositories as follows:
1. One that specialises in sports data
2. One that specialises in time series data
3. One that specialises in anything else that is interesting.
For each of these three, tell me the URL, and write
one paragraph, ~100 words, in your own words,
describing the contents of this repository.
Submit on or before 23:59 Friday October 9th
What is a dataset “repository”?
• A collection of datasets, probably with an
overall theme
• Not a single dataset
• Not a big deal
If interested…
Some slides about data warehousing; I don’t
consider this an essential part of this
module, but in case you want to know what
data warehousing is …
Data Warehousing - Definitions
“A subject-oriented, integrated, time-variant and
nonvolatile collection of data in support of
management's decision making process”
W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1,
No. 1, 1995 -- a very influential definition.
“A copy of transaction data, specifically structured
for query and analysis”
Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”
Data Warehouse – why?
For organisational learning to take place, data from
many sources must be gathered together over time
and organised in a consistent and useful way
Data Warehousing allows an organisation to
remember its data and what it has learned about its
data
Data Mining techniques make use of the data in a
Data Warehouse and subsequently add their
results to it
Data Warehouse - Contents
• A Data Warehouse is a copy of transaction data
specifically structured for querying, analysis and
reporting
• The data will normally have been transformed
when it was copied into the Data Warehouse
• The contents of a Data Warehouse, once acquired,
are fixed and cannot be updated or changed later
by the transaction system, but they can of course
be added to
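The three points above (a copy of transaction data, transformed on the way in, then fixed but extendable) can be sketched as a tiny append-only store. Everything here is hypothetical: the `Warehouse` class, the `to_warehouse_format` transformation and the field names are invented for illustration and do not correspond to any particular data warehousing product.

```python
class Warehouse:
    """Hypothetical append-only store: rows can be added, never edited."""

    def __init__(self):
        self._rows = []

    def load(self, transaction_rows, transform):
        """Copy transaction rows in, transforming each one on the way."""
        self._rows.extend(transform(dict(row)) for row in transaction_rows)

    def rows(self):
        """Return copies, so callers cannot alter what is stored."""
        return [dict(row) for row in self._rows]


def to_warehouse_format(row):
    # Assumed transformation: convert pence to pounds and rename the field.
    return {"branch": row["branch"], "amount_gbp": row["amount_pence"] / 100}


warehouse = Warehouse()
transactions = [{"branch": "Leith", "amount_pence": 1250}]
warehouse.load(transactions, to_warehouse_format)

# A later change in the transaction system does not reach the warehouse.
transactions[0]["amount_pence"] = 0
print(warehouse.rows())  # [{'branch': 'Leith', 'amount_gbp': 12.5}]

# New data can still be added.
warehouse.load([{"branch": "Leith", "amount_pence": 300}], to_warehouse_format)
print(len(warehouse.rows()))  # 2
```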
Data Marts
• A Data Mart is a smaller, more focused
Data Warehouse – a mini-warehouse
• A Data Mart will normally reflect the
business rules of a specific business unit
within an enterprise – identifying data
relevant to that unit’s activities
From Data Warehousing to Machine Learning, via Data Marts
The Big Challenge for Data Mining
• The largest challenge that a Data Miner may face
is the sheer volume of data in the Data Warehouse
• It is very important, then, that summary data also
be available to get the analysis started
• The sheer volume of data may mask the important
relationships in which the Data Miner is interested
• Being able to overcome the volume and interpret
the data is essential to successful Data Mining
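The point about summary data can be illustrated with a small sketch that reduces many rows to a count, total and mean per group, which is often a sensible first look before any detailed mining. The transaction rows, field names and the `summarise` helper are hypothetical examples, not part of the module material.

```python
from statistics import mean

# Hypothetical transaction rows; a real warehouse would hold millions.
transactions = [
    {"branch": "North", "amount": 10.0},
    {"branch": "North", "amount": 20.0},
    {"branch": "South", "amount": 5.0},
]


def summarise(rows, group_field, value_field):
    """Group rows by one field and summarise another field per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_field], []).append(row[value_field])
    return {
        key: {"count": len(values), "total": sum(values), "mean": mean(values)}
        for key, values in groups.items()
    }


summary = summarise(transactions, "branch", "amount")
for branch, stats in summary.items():
    print(branch, stats)
```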
What happens in practice …
Data Miners, both “farmers” and “explorers”, are
expected to utilise Data Warehouses to give
guidance and answer a limitless variety of
questions
The value of a Data Warehouse and Data Mining lies
in a new and changed appreciation of the meaning
of the data
There are limitations, though: a Data Warehouse
cannot correct problems with its data, although it
may help to identify them more clearly