DMML1_overview

Data Mining
(and machine learning)
DM Lecture 1: Overview of DM, and overview of the DM part of
the DM&ML module
Many of these slides are highly derivative of Nick Taylor’s slides
used for this module in previous years
David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Overview of My Lectures
All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
• 25/9: Overview of DM (and of these 8 lectures)
• 02/10: Data Cleaning - usually a necessary first step for large amounts of data
• 09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
• 16/10: Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, much used in industry
• NO THURSDAY LECTURE October 23rd
• 30/10: Cluster Analysis and Clustering - simple algorithms that tell you much about the data
• NO THURSDAY LECTURE November 6th
• 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
• 20/11: Regression - the simplest algorithm for predicting data/class values
• 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future
Data Mining - Definition & Goal
Definition
• Data Mining is the exploration and analysis of
large quantities of data in order to discover
meaningful patterns and rules
Goal
• To permit some other goal to be achieved or
performance to be improved through a better
understanding of the data
Some examples of huge databases
Retail basket data: much commercial DM is done with this. In one
store, 18,000 baskets per month.
Tesco has >500 stores. Per year, 100,000,000 baskets?
The Internet: more than 15,000,000,000 pages
Lots of datasets: UCI Machine Learning repository
How can we begin to understand and exploit such datasets? Especially
the big ones?
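As a rough check on the basket estimate above, here is a back-of-envelope calculation. It assumes every store sees about the same 18,000 baskets per month as the single store quoted, and it takes 500 stores as a lower bound; both are simplifying assumptions, not figures from a real dataset.

```python
# Back-of-envelope estimate of yearly basket volume.
# Assumption: every store matches the single-store figure quoted on the slide.
baskets_per_store_per_month = 18_000
stores = 500            # the slide says ">500", so this is a lower bound
months_per_year = 12

baskets_per_year = baskets_per_store_per_month * months_per_year * stores
print(f"{baskets_per_year:,} baskets per year")  # 108,000,000
```

Under these assumptions the result is consistent with the order of magnitude given on the slide.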
Like this …
and this …
and this …
or this … (see http://www.cs.umd.edu/hcil/treemap-history/)
or this …
• see http://websom.hut.fi/websom/milliondemo/html/root.html
Data Mining - Basics
• Data Mining is the process of discovering patterns
and inferring associations in raw data
• Data Mining is a collection of powerful techniques
intended to analyse large amounts of data
• There is no single Data Mining approach
• Data Mining can employ a range of techniques,
either individually or in combination with each
other
Data Mining – Why is it important?
• Data are being generated in enormous quantities
• Data are being collected over long periods of time
• Data are being kept for long periods of time
• Computing power is formidable and cheap
• A variety of Data Mining software is available
Data Mining – History
• The approach has roots going back over 40 years
• In the early 1960s Data Mining was called
statistical analysis, and the pioneers were
statistical software companies such as SPSS
• By the late 1980s these traditional techniques had
been augmented by new methods such as machine
induction, artificial neural networks, evolutionary
computing, etc.
Data Mining – Two Major Types
• Directed (Farming) – Attempts to explain or categorise
some particular target field such as income, medical
disorder, genetic characteristic, etc.
• Undirected (Exploring) – Attempts to find patterns or
similarities among groups of records without the use of a
particular target field or collection of predefined classes
• Compare with Supervised and Unsupervised systems in
machine learning
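The contrast between the two types can be sketched on a toy dataset. Everything here is an illustrative assumption: the records are invented, `predict_income` is a hypothetical nearest-record lookup standing in for a directed method, and `group_by_gap` is a hypothetical one-dimensional grouping standing in for an undirected one.

```python
# Toy records: (age, income_band). Values are invented for illustration.
records = [(22, "low"), (25, "low"), (47, "high"), (52, "high"), (49, "high")]


def predict_income(age, labelled):
    """Directed: predict the target field (income_band) from the nearest labelled record."""
    nearest = min(labelled, key=lambda rec: abs(rec[0] - age))
    return nearest[1]


def group_by_gap(values, max_gap):
    """Undirected: group sorted values, starting a new group when the gap exceeds max_gap.

    No target field or predefined class is used.
    """
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


print(predict_income(30, records))
print(group_by_gap([age for age, _ in records], max_gap=10))
```

The directed function needs the income band of each record to work at all; the undirected one only looks at how the ages sit relative to each other.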
Data Mining – Tasks
• Classification - Example: high risk for cancer or not
• Estimation - Example: household income
• Prediction - Example: average credit card balance transfer
amount
• Affinity Grouping - Example: people who buy X often also
buy Y, with a probability of Z
• Clustering - similar to classification but with no predefined
classes
• Description and Profiling – Identifying characteristics
which explain behaviour - Example: “More men watch
football on TV than women”
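The affinity grouping example ("people who buy X often also buy Y with a probability of Z") can be made concrete with a small sketch. The baskets below are invented, and `rule_confidence` is a hypothetical helper name; it simply estimates Z as the fraction of baskets containing X that also contain Y.

```python
# Invented basket data for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def rule_confidence(baskets, x, y):
    """Fraction of baskets containing x that also contain y (an estimate of P(y | x))."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0  # x never bought, so there is no evidence for the rule
    return sum(1 for b in with_x if y in b) / len(with_x)


z = rule_confidence(baskets, "bread", "butter")
print(f"people who buy bread also buy butter with probability {z:.2f}")
```

On real basket data the same ratio would be computed over millions of baskets, which is why efficient algorithms for this task matter.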
Data Warehousing
• Note that Data Mining is very generic and can be used for
detecting patterns in almost any data
– Retail data
– Genomes
– Climate data
– Etc.
• Data Warehousing, on the other hand, is almost
exclusively used to describe the storage of data in the
commercial sector
Data Warehousing - Definitions
“A subject-oriented, integrated, time-variant and
nonvolatile collection of data in support of
management's decision making process”
W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No.
1, 1995 -- a very influential definition.
“A copy of transaction data, specifically structured
for query and analysis”
Ralph Kimball, from his 1996 book, “The Data Warehouse Toolkit”
Data Warehouse – why?
For organisational learning to take place, data from
many sources must be gathered together over time
and organised in a consistent and useful way
Data Warehousing allows an organisation to
remember its data and what it has learned about its
data
Data Mining techniques make use of the data in a
Data Warehouse and subsequently add their results
to it
Data Warehouse - Contents
• A Data Warehouse is a copy of transaction data
specifically structured for querying, analysis and
reporting
• The data will normally have been transformed
when it was copied into the Data Warehouse
• The contents of a Data Warehouse, once acquired,
are fixed and cannot be updated or changed later
by the transaction system, but they can of course
be added to
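The three properties above (a copy of transaction data, transformed on the way in, fixed once loaded but open to additions) can be illustrated with a minimal sketch. The `SaleFact` and `Warehouse` names, the field names, and the specific transformations are all assumptions made for illustration; real warehouses are database systems, not in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a loaded row cannot be modified afterwards
class SaleFact:
    store: str
    product: str
    amount_pence: int


class Warehouse:
    """Hypothetical append-only store: rows can be added but never updated."""

    def __init__(self):
        self._facts = []

    def load(self, transaction):
        """Transform a raw transaction dict into a consistent form and append it."""
        fact = SaleFact(
            store=transaction["store"].strip().upper(),
            product=transaction["product"].strip().lower(),
            amount_pence=round(float(transaction["amount"]) * 100),
        )
        self._facts.append(fact)

    def facts(self):
        """Return a read-only snapshot for querying and analysis."""
        return tuple(self._facts)


wh = Warehouse()
wh.load({"store": " edinburgh ", "product": "Milk", "amount": "0.89"})
wh.load({"store": "Glasgow", "product": "Bread ", "amount": "1.20"})
print(wh.facts())
```

The class deliberately offers no update or delete method, mirroring the idea that the transaction system can add to the warehouse but not change what is already there.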
Data Marts
• A Data Mart is a smaller, more focused
Data Warehouse – a mini-warehouse
• A Data Mart will normally reflect the
business rules of a specific business unit
within an enterprise – identifying data
relevant to that unit’s activities
From Data Warehousing to Machine
Learning, via Data Marts
The Big Challenge for Data Mining
• The largest challenge that a Data Miner may face
is the sheer volume of data in the Data Warehouse
• It is very important, then, that summary data also
be available to get the analysis started
• The sheer volume of data may mask the important
relationships in which the Data Miner is interested
• Being able to overcome the volume and interpret
the data is essential to successful Data Mining
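As one possible reading of "summary data", the sketch below reduces a list of values to a handful of descriptive statistics using Python's standard `statistics` module. The basket totals are invented, and the choice of statistics is an assumption about what a useful first summary might contain.

```python
import statistics

# Invented basket totals (in pounds) standing in for a much larger table.
basket_totals = [12.40, 3.99, 57.10, 8.25, 21.00, 4.50, 33.75]

# A few numbers that describe the data without listing every record.
summary = {
    "count": len(basket_totals),
    "min": min(basket_totals),
    "max": max(basket_totals),
    "mean": statistics.mean(basket_totals),
    "median": statistics.median(basket_totals),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A summary like this is small enough to inspect by eye, which makes it a reasonable starting point before running heavier analysis on the full data.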
What happens in practice …
Data Miners, both “farmers” and “explorers”, are
expected to utilise Data Warehouses to give
guidance and answer a limitless variety of
questions
The value of a Data Warehouse and Data Mining lies
in a new and changed appreciation of the meaning
of the data
There are limitations though - A Data Warehouse
cannot correct problems with its data, although it
may help to more clearly identify them
Which brings us to “data cleaning”,
next week …