01 Lecture slides


BIS4435
Lecture 10
Lecture: Data Mining
Dr. Nawaz Khan
School of Computing Science
E-mail: [email protected]
Reading Assignment
Core Text:
• GC DL materials on the WebCT: Unit 11
• Connolly, T. and Begg, C., 2002, Database Systems: A Practical Approach to Design, Implementation, and Management, Addison Wesley, Harlow, England
Additional Reading:
• Elmasri, R. and Navathe, S. B., 2004, Fundamentals of Database Systems, 4th Edition, Addison-Wesley, ISBN 0-321-12226-7: Chapter 27
• Berson, A. and Smith, S. J., 1997, Data Warehousing, Data Mining, and OLAP, McGraw-Hill, ISBN 0-07-006272-2: Chapters 17, 18
• Other resources on the Internet
BIS4229 – Industrial Data Management
Technologies
Data Mining
Outline
• DW & DM: differences
• The Definition
• Application areas
• Comparison with query and Web site analysis tools
• DM Process
• Applications, Models and Algorithms
• Summary
• Q&A
DW & DM: differences
[Figure: data warehouse architecture showing Operational Data, Data Transformation, Data Warehouse, Metadata, Data Mart, Access Tools, and Information Delivery System]
DW & DM: differences
• Both serve the same purpose: decision support
• A DW assembles, formats, and organises historical data to answer user queries as asked; the answers depend on the content of the DW
• A DW does not attempt to extract further information or predict trends and patterns from the data
• DM extracts previously unknown and useful information, and predicts trends and patterns
• DM can be performed on a DW and/or on traditional databases and files
The Definition
• DM is the process of extracting previously unknown, valid and actionable information from large sets of data
  • Unknown - look for things that are not intuitive
  • Valid - sound and useful
  • Actionable - can be translated into business advantage
Example:
Rule 1: people don't buy shares when the political situation is not stable
Rule 2: the share market is less active when people don't want to spend
Outcome statement 1, based on rules 1 and 2:
The share market is less active when the political situation is not stable
Outcome statement 2, based on rules 1 and 2:
People don't want to spend when the political situation is not stable
Application areas
• Direct Marketing
  • The ability to predict who is most likely to be interested in which products can save companies immense amounts in marketing expenditure
• Trend Analysis
  • Understanding trends in the marketplace is a strategic advantage, because it helps reduce costs and time to market
• Security
  • Fraud detection: data mining techniques can help discover which insurance claims, cellular phone calls, or credit card purchases are likely to be fraudulent
  • IDS (intrusion detection systems)
• Forecasting in Financial Markets
• Mining Online – WebKDD
  • Web sites today find themselves competing for customer loyalty. It costs customers little to switch to competitors
• Text Mining - intelligent document analysis
Comparison with query and Web site analysis tools
• Query Tools vs. DM Tools
  • Both allow the user to ask questions of a DBMS/DW to find out facts
  • Query tool - the user makes an assumption; the query is based on a hypothesis
  • Data mining tool - no assumption is made when posing the query (goal)
Example queries:
1. How many white shirts were sold in the north vs. the south? (query tool)
2. What are the most significant factors involved in high, medium, and low sales volumes of white shirts? (data mining tool)
  • Data mining tool - discovers relationships and hidden patterns that are not obvious
  • Trend - data mining is being integrated into query tools
Comparison with query and Web site analysis tools
• OLAP Tools vs. DM Tools
  • OLAP - designed to answer top-down queries
  • OLAP - provides multidimensional data analysis; data can be broken down and summarised
  • OLAP - query-driven, user-driven, verification-driven
  • Data mining - bottom-up, requires no assumption
  • Data mining - focuses on finding patterns
  • Data mining - data-driven, discovery-driven; identifies facts/conclusions based on the patterns discovered
  • For example, OLAP may tell a bookseller the total number of books it sold in a region during a quarter. Statistics can provide another dimension on these sales. Data mining, on the other hand, can reveal the patterns in these sales, i.e., the factors influencing them.
DM Technologies
(see Unit 20 - WebCT)
[Figure: technologies that feed into Data Mining: Database Management and Warehousing, Statistics, Parallel Processing, Machine Learning, Visualisation, and Decision Support]
DM Process - Overview
[Figure: DM process flow. Data sources -> selected data -> pre-processed data -> transformed data -> extracted data -> assimilated knowledge. Stages: business objectives, data preparation, DM, results analysis & knowledge assimilation]
• Mining data is only one step in the overall process
• Business objectives drive the entire process
• Data preparation requires the most effort
• The process is iterative, with many loop-backs over one or more steps
• It is a labour-intensive exercise, far from autonomous
DM Process – Data Preparation
• Data Selection
• Data Pre-processing
• Data Transformation

• Data selection - identify data sources and extract data for preliminary analysis in preparation for further mining
  • Process of choosing the data to analyse
    • decide the dependent variable - the data (field) to be analysed
    • decide the active variables - the data actively used in mining
  • decide the useful data dimensions
    • choose useful (descriptive) fields in each dimension
    • consider adding other useful dimensions
DM Process – Data Preparation
• Data Selection
• Data Pre-processing
• Data Transformation

• Data pre-processing - ensure the quality of the selected data
  • Data mining is at best as good as the data it represents
  • Data quality
    • redundant data
    • incorrect or inconsistent data
    • noisy data - outliers - values that are significantly out of line
      • bad outliers & good outliers
    • missing values - value not present or deleted
      • eliminate observations that have missing values (this loses information)
      • replace missing values
      • predict the value using a predictive model
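As a minimal sketch of the first two missing-value options above, the Python below either eliminates observations with missing values or replaces them. The `ages` list is made-up data, and filling with the mean of the known values is an assumption; the slide does not say which replacement value to use.

```python
from statistics import mean

# Hypothetical field with two missing values (None).
ages = [34, None, 29, 41, None, 38]


def drop_missing(values):
    """Eliminate observations that have missing values (loses information)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each missing value with the mean of the known values (assumed strategy)."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


dropped = drop_missing(ages)
imputed = impute_mean(ages)
print(dropped)
print(imputed)
```

Dropping keeps only four of the six observations, while imputing keeps all six at the cost of inventing two values, which is the trade-off the slide points at.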
DM Process – Data Preparation
• Data Selection
• Data Pre-processing
• Data Transformation

• Data transformation - pre-processed data is converted to an analytical data model
  • Data is refined to suit the input format required by the DM algorithms
  • Techniques for data conversion
    • simple calculation (SQL) to derive new data fields
    • data reduction: combine several existing variables into one new variable to reduce the total number of variables
    • scaling/normalisation: continuous values are brought to the same order of magnitude
    • discretisation: convert quantitative variables into categorical variables
    • one-of-N: convert a categorical variable to a numeric representation
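The last three conversion techniques can be sketched in a few lines of Python. This is only an illustration: min-max scaling to the range 0..1 is one assumed way to normalise, and the cut points and category names passed in are hypothetical.

```python
def min_max_scale(values):
    """Scale continuous values to the range 0..1 (assumed normalisation method)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretise(value, low_cut, high_cut):
    """Turn a quantitative value into one of three categories."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


def one_of_n(value, categories):
    """Encode a categorical value as a 0/1 vector with a single 1."""
    return [1 if value == c else 0 for c in categories]


scaled = min_max_scale([10, 20, 30])
band = discretise(15, low_cut=10, high_cut=20)
encoded = one_of_n("tawny", ["striped", "tawny", "brown", "grey"])
print(scaled, band, encoded)
```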
DM Process – Data Mining & Results Analysis
• DM - apply the selected DM algorithm(s) to the pre-processed data
  • Inseparable from results analysis, which is done by data and business analysts
  • The two are linked in an interactive process - DM definition
• Results analysis - depends on the application developed
  • Segmentation - changing the base variables may improve the result
  • Prediction - accuracy and input sensitivity analysis, overtraining
  • Association - iteration is required to discover actionable rules
DM Process – Knowledge Assimilation
• Close the loop
• Objective - take action according to the new, valid and actionable information discovered
• Challenges
  • present the discovery in a convincing, business-oriented way
  • formulate ways to best exploit the discovery
Applications, Models and Algorithms
Typical applications:
• Market management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
• Risk management: forecasting, customer retention, quality control, competitive analysis
• Fraud management: fraud detection
Models and techniques:
• Predictive modelling (classification): decision tree, memory-based learning, neural networks
• Segmentation (clustering): geometric, neural networks
• Link analysis: associations discovery (market basket analysis)
• Deviation detection: visualisation, statistics

• Predictive Modelling – Classification
  • Human learning experience - observations form a model of the essential, underlying characteristics of some phenomenon - generalisation ability
  • In DM, a predictive model can analyse a DB to determine some essential characteristics of the data and make predictions
Applications, Models and Algorithms
• Predictive Modelling – Classification
  • Supervised learning - the correct answers to some already solved cases must be given to the model before it can make predictions about new observations
  • The model is developed in 2 phases
    • Training - build a model based on a large proportion (90%) of the available data
    • Testing - try out the model on previously unseen data (10%) to determine its accuracy and performance characteristics
  • 2 types of predictive modelling
    • Classification - classify data into pre-defined classes
    • Value prediction - predict a continuous numeric value for a database record
  • Algorithms – decision trees, neural networks, rule induction
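The two-phase development above can be sketched as follows. This is a rough illustration under assumptions: the records are synthetic, the shuffle seed is arbitrary, and `threshold_model` is a hypothetical stand-in for a model that would normally be built from the training set.

```python
import random


def train_test_split(records, train_fraction=0.9, seed=0):
    """Shuffle the records and split them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(model, test_set):
    """Fraction of previously unseen records the model classifies correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)


# Synthetic solved cases: one numeric feature and a known class label.
records = [((x,), "high" if x >= 50 else "low") for x in range(100)]
train, test = train_test_split(records)


def threshold_model(features):
    """Hypothetical stand-in for a model built from the training data."""
    return "high" if features[0] >= 50 else "low"


print(len(train), len(test), accuracy(threshold_model, test))
```

Keeping the test records out of training is what makes the measured accuracy a fair estimate of performance on new observations.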
Applications, Models and Algorithms
• Segmentation – Clustering
  • Segmentation can discover homogeneous sub-populations - customer profiling/target marketing
  • Segmentation (clustering) - partition the DB into segments (clusters) of similar records; the segments (clusters) are the resulting groups of data records
  • Similarity is defined by a measure that depends on the distance of records from the centre of the cluster - Euclidean distance
    $A = (a_1, a_2, \dots, a_n)$, $B = (b_1, b_2, \dots, b_n)$
    $\mathrm{Dist}(A, B) = \left((a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2\right)^{1/2}$
  • Clustering is unsupervised learning - neither the types of clusters nor the number of clusters is given - the true discovery nature of DM
  • Algorithm – neural networks
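The Euclidean distance above translates directly into code. The sketch below also assigns a record to its nearest cluster centre; the centres are hypothetical and given by hand here, whereas a real clustering algorithm would have to discover them.

```python
import math


def euclidean(a, b):
    """Dist(A, B): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centre(record, centres):
    """Index of the cluster centre closest to the record."""
    return min(range(len(centres)), key=lambda i: euclidean(record, centres[i]))


centres = [(0, 0), (10, 10)]  # hypothetical cluster centres
print(euclidean((0, 0), (3, 4)))
print(nearest_centre((1, 1), centres))
```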
Applications, Models and Algorithms
• Link Analysis / Deviation Detection
  • Link analysis seeks to establish links between individual records or sets of records in the DB
    • Association discovery - market basket analysis - items within one transaction
    • Sequential pattern discovery - sequence information over time
  • Deviation detection - further investigates outliers
    • Applications - fraud detection
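A very small illustration of association discovery within single transactions is to count how often pairs of items appear in the same basket. The transactions below are made up, and plain pair counting is a simplification chosen for this sketch; it is not the specific algorithm used by any particular DM tool.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical market baskets, one list per transaction.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk"],
]
counts = pair_counts(transactions)
print(counts.most_common(1))
```

Pairs that co-occur often are candidates for association rules, which an analyst would then check for actionability.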
Applications, Models and Algorithms
• Decision Trees
  • Decision trees (IF - THEN), a commonly used machine learning algorithm, are powerful and popular tools for classification and prediction
  • They attempt to split the DB among the desired categories and identify important cluster features
  • Tree construction
    • choose an attribute (field) for testing - the root node of the tree
    • the number of values of the attribute - the branches from the root node
      – binary - yes/no type of questions
      – multiple - complex questions with more than two answers
  • Algorithms - ID3 (Iterative Dichotomiser 3), C4.5, C5.0, CART (Classification and Regression Trees), CHAID (Chi-squared Automatic Interaction Detection)
    • rank all features in terms of their effectiveness in partitioning the set of classifications - information gain
    • make the most effective feature the root node
    • recur on each branch
Applications, Models and Algorithms
• Decision Trees

Diet   Size   Colour   Habitat  Species
meat   large  striped  jungle   tiger
meat   large  tawny    jungle   lion
meat   small  striped  house    tabby
meat   small  brown    jungle   weasel
grass  large  striped  plains   zebra
grass  small  grey     plains   rabbit
grass  large  tawny    plains   antelope

• Optimal tree produced by ID3
  • root node - "Colour", most information gain
  • 4 branches - "striped", "tawny", "brown" & "grey"
  • recur on branches "striped" & "tawny"
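The choice of "Colour" as the root can be checked with a short calculation. The sketch below computes entropy-based information gain, the ranking measure ID3 uses, for each attribute of the seven-animal table; the function and variable names are illustrative, not taken from the lecture.

```python
import math
from collections import Counter

COLUMNS = ["Diet", "Size", "Colour", "Habitat", "Species"]
ROWS = [
    ("meat", "large", "striped", "jungle", "tiger"),
    ("meat", "large", "tawny", "jungle", "lion"),
    ("meat", "small", "striped", "house", "tabby"),
    ("meat", "small", "brown", "jungle", "weasel"),
    ("grass", "large", "striped", "plains", "zebra"),
    ("grass", "small", "grey", "plains", "rabbit"),
    ("grass", "large", "tawny", "plains", "antelope"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index, class_index=-1):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    base = entropy([row[class_index] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[class_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root:", best)
```

Running the same ranking on only the striped rows, or only the tawny rows, gives the attribute to test on each of those branches.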
Applications, Models and Algorithms
[Figure: decision tree produced by ID3]
Colour
  striped -> Habitat
    jungle -> tiger
    house -> tabby
    plains -> zebra
  tawny -> Diet
    meat -> lion
    grass -> antelope
  brown -> weasel
  grey -> rabbit
Applications, Models and Algorithms
• Neural Networks
  • An NN is used to simulate the operation of the brain
  • An NN consists of a large number of processors (neurons/nodes) and links (connections) - representing knowledge
  • An NN is trained with a large amount of data and rules about data relationships - memorise
  • A well-trained NN can learn associations and similarities – generalise
  • Supervised learning:
    • The NN is trained with sets of inputs and desired outputs
    • If the actual output differs from the desired output, the network adjusts its internal connection strengths (weights) to reduce the difference
    • This process continues until the network gets the I/O patterns correct or until an acceptable error rate is attained
  • Unsupervised learning - Self-Organising Map (SOM)
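The supervised learning loop described above can be illustrated with a single artificial neuron. This is a deliberately reduced sketch: a one-neuron perceptron stands in for a full network, and the logical AND training set, the learning rate and the epoch limit are all assumptions made for the example.

```python
def output(weights, bias, inputs):
    """Actual output of the neuron: 1 if the weighted sum is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Adjust the weights whenever the actual output differs from the desired output."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, desired in samples:
            difference = desired - output(weights, bias, inputs)
            if difference != 0:
                weights = [w + learning_rate * difference * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * difference
                errors += 1
        if errors == 0:  # every input/output pattern is now correct
            break
    return weights, bias


# Inputs paired with desired outputs (logical AND, chosen for illustration).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
predictions = [output(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
print(predictions)
```

Training stops as soon as a full pass produces no errors, mirroring the "until the network gets the I/O patterns correct" condition; a real network would more often stop at an acceptable error rate.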
Summary
• DW & DM: differences
• The definition
• Application areas
• Comparison with query and Web site analysis tools
• DM Process
  • Data preparation (60% of the whole time)
  • DM (~10% of the time)
• Applications, Models and Algorithms (decision trees, neural networks, etc.)
• Next week:
  • Revision