Data Mining A Tutorial

Part I Data Mining Fundamentals
Chapter 1 Data Mining: A First View
1.1 Data Mining: A Definition
Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Induction-based Learning
The process of forming general concept definitions by observing specific examples of concepts to be learned.
Knowledge Discovery in Databases (KDD)
The application of the scientific method to data mining. Data mining is one step of the KDD process.
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: the KDD process. Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
1.2 What Can Computers Learn?
Four Levels of Learning
• Facts
• Concepts
• Procedures (to be worked out)
• Principles
Concepts
Computers are good at learning
concepts. Concepts are the output of a
data mining session.
Three Concept Views
• Classical View (Crisp): old hands
–As a definition
• Probabilistic View (85%): with some experience
–DM rules with confidence
• Exemplar View (CBR): newcomers
• An illustrated example:
–good credit?
Supervised Learning
• Build a learner model using data instances of known origin.
• Use the model to determine the outcome of new instances of unknown origin.
Supervised Learning: A Decision Tree Example
Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy
Figure 1.1 A decision tree for the data in Table 1.1
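The slides do not say how the tree in Figure 1.1 was induced. As a rough sketch only, the code below assumes an ID3-style greedy procedure: at each node it splits on the attribute that leaves the lowest weighted entropy in the resulting subsets. The names `TRAINING`, `entropy`, `split`, `remaining_entropy`, and `build_tree` are illustrative and do not come from the source.

```python
from collections import Counter
from math import log2

ATTRS = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache"]

# Table 1.1: attribute values (in ATTRS order) and the known diagnosis.
TRAINING = [
    (("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep Throat"),
    (("No", "No", "No", "Yes", "Yes"), "Allergy"),
    (("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    (("Yes", "No", "Yes", "No", "No"), "Strep Throat"),
    (("No", "Yes", "No", "Yes", "No"), "Cold"),
    (("No", "No", "No", "Yes", "No"), "Allergy"),
    (("No", "No", "Yes", "No", "No"), "Strep Throat"),
    (("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    (("No", "Yes", "No", "Yes", "Yes"), "Cold"),
    (("Yes", "Yes", "No", "Yes", "Yes"), "Cold"),
]


def entropy(rows):
    """Shannon entropy (in bits) of the diagnosis labels in rows."""
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def split(rows, idx):
    """Group rows by the value of the attribute at position idx."""
    groups = {}
    for values, label in rows:
        groups.setdefault(values[idx], []).append((values, label))
    return groups


def remaining_entropy(rows, idx):
    """Weighted entropy of the subsets produced by splitting on idx."""
    return sum(len(g) / len(rows) * entropy(g) for g in split(rows, idx).values())


def build_tree(rows, attr_indexes):
    """Return a nested dict for a test node or a diagnosis string for a leaf."""
    labels = {label for _, label in rows}
    if len(labels) == 1:
        return labels.pop()
    if not attr_indexes:
        # No attributes left: fall back to the majority diagnosis.
        return Counter(label for _, label in rows).most_common(1)[0][0]
    best = min(attr_indexes, key=lambda i: remaining_entropy(rows, i))
    rest = [i for i in attr_indexes if i != best]
    return {ATTRS[best]: {value: build_tree(group, rest)
                          for value, group in split(rows, best).items()}}


tree = build_tree(TRAINING, list(range(len(ATTRS))))
print(tree)
```

Under this assumption the procedure tests Swollen Glands first and Fever second, which is the same structure as Figure 1.1. Other split criteria could yield a different tree on other data.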
Table 1.2 • Data Instances with an Unknown Classification
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ? (answer: strep throat)
12           Yes          Yes    No              No          Yes       ? (answer: cold)
13           No           No     No              No          Yes       ? (answer: allergy)
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
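The three production rules translate directly into code. The sketch below is one possible encoding (the function name `diagnose` and the dictionary `unknown` are illustrative), applied to the Swollen Glands and Fever values of patients 11 to 13 from Table 1.2.

```python
def diagnose(swollen_glands: str, fever: str) -> str:
    """Apply the three production rules; both arguments are "Yes" or "No"."""
    if swollen_glands == "Yes":
        return "Strep Throat"
    if fever == "Yes":
        return "Cold"
    return "Allergy"


# Table 1.2 instances: patient ID -> (Swollen Glands, Fever)
unknown = {11: ("Yes", "No"), 12: ("No", "Yes"), 13: ("No", "No")}

results = {pid: diagnose(glands, fever) for pid, (glands, fever) in unknown.items()}
for pid, diagnosis in results.items():
    print(pid, diagnosis)
```

The rules never look at Sore Throat, Congestion, or Headache, so those columns can be left out of the input.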
Unsupervised Clustering
A data mining method that builds
models from data without predefined
classes.
Table 1.3 • Acme Investors Incorporated
Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K
Three groups formed (Table 1.3 shows only part of the whole table):
G1. Margin Account = Yes and Age = 20–29 and Annual Income = 40–59K
accuracy = 80%, coverage = 0.5
G2. Account Type = Custodial and Favorite Recreation = Skiing and Annual Income = 40–59K
accuracy = 95%, coverage = 0.35
G3. Account Type = Joint and Trades/Month > 5 and Transaction Method = Online
accuracy = 82%, coverage = 0.65
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query?
• Shallow Knowledge (SQL)
• Multidimensional Knowledge (OLAP)
• Hidden Knowledge (DM)
• Deep Knowledge (human)
Data Mining vs. Data Query: An Example
• Use data query if you already know, or almost know, what you are looking for.
• Use data mining to find regularities in data that are not obvious.
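A data query answers a question you already know how to ask. As a small illustration, the sketch below loads four columns of the Table 1.3 rows into an in-memory SQLite database and asks for customers who trade online more than five times a month. The table name `customers` and its column names are assumptions made for this example, not part of the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, account_type TEXT, "
    "transaction_method TEXT, trades_per_month REAL)"
)

# Rows taken from Table 1.3 (only the columns needed for the query).
rows = [
    (1005, "Joint", "Online", 12.5),
    (1013, "Custodial", "Broker", 0.5),
    (1245, "Joint", "Online", 3.6),
    (2110, "Individual", "Broker", 22.3),
    (1001, "Individual", "Online", 5.0),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# The analyst already knows the condition of interest, so a query is enough.
cursor = conn.execute(
    "SELECT customer_id FROM customers "
    "WHERE transaction_method = 'Online' AND trades_per_month > 5 "
    "ORDER BY customer_id"
)
heavy_online = [row[0] for row in cursor]
print(heavy_online)
```

The query only retrieves what the WHERE clause spells out. Finding which combinations of attributes are worth asking about in the first place is the part left to data mining.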
1.4 Expert Systems or Data Mining?
Figure 14-2 Detailed architecture of an expert system
Expert System
A computer program that emulates the problem-solving skills of one or more human experts.
Knowledge Engineer
A person trained to interact with an expert in order to capture the expert's knowledge.
Data → Data Mining Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert → Knowledge Engineer → Expert System Building Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
Figure 1.2 Data mining vs. expert systems
1.5 A Simple Data Mining Process Model
Figure 1.3 A simple data mining process model (components shown: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application)
Assembling the Data
• The Data Warehouse
• Relational Databases and Flat Files
Mining the Data
Interpreting the Results
Result Application
1.6 Why Not Simple Search?
• Nearest Neighbor Classifier (i.e., CBA: add a new instance to a class based on similarity)
–Time consuming and entropy independent
• K-nearest Neighbor Classifier
–Form a class consisting of the K nearest neighbors
Assignment 4
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
A new instance: Patient ID = 14, Sore Throat = Yes, Fever = No, Swollen Glands = No, Congestion = No, Headache = No
Comparison with the training instances:
with one matched attribute: ID = 1, 9
with two matched attributes: ID = 2, 5, 10
with three matched attributes: ID = 3, 6, 7, 8
with four matched attributes: ID = 4
The nearest neighbor (ID = 4) gives strep throat? The correct diagnosis using the decision tree should be allergy.
Q: Try the K-nearest Neighbor Classifier
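One way to experiment with the question is sketched below. It assumes that similarity is the number of matching attribute values (as in the comparison above) and that ties in similarity are broken by the lower patient ID. The tie-breaking rule and the names `TRAINING`, `NEW`, `matches`, `k_nearest`, and `vote` are assumptions made for this example.

```python
from collections import Counter

# Table 1.1: patient ID -> ((Sore Throat, Fever, Swollen Glands,
#                            Congestion, Headache), Diagnosis)
TRAINING = {
    1: (("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep Throat"),
    2: (("No", "No", "No", "Yes", "Yes"), "Allergy"),
    3: (("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    4: (("Yes", "No", "Yes", "No", "No"), "Strep Throat"),
    5: (("No", "Yes", "No", "Yes", "No"), "Cold"),
    6: (("No", "No", "No", "Yes", "No"), "Allergy"),
    7: (("No", "No", "Yes", "No", "No"), "Strep Throat"),
    8: (("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    9: (("No", "Yes", "No", "Yes", "Yes"), "Cold"),
    10: (("Yes", "Yes", "No", "Yes", "Yes"), "Cold"),
}

# Patient 14 from the assignment.
NEW = ("Yes", "No", "No", "No", "No")


def matches(a, b):
    """Number of attribute positions where the two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def k_nearest(instance, k):
    """IDs of the k most similar training instances (ties: lower ID first)."""
    ranked = sorted(TRAINING,
                    key=lambda pid: (-matches(TRAINING[pid][0], instance), pid))
    return ranked[:k]


def vote(instance, k):
    """Count the diagnoses among the k nearest neighbors."""
    return Counter(TRAINING[pid][1] for pid in k_nearest(instance, k))


match_counts = {pid: matches(values, NEW) for pid, (values, _) in TRAINING.items()}
print(match_counts)
print(k_nearest(NEW, 1), vote(NEW, 1))
print(k_nearest(NEW, 5), vote(NEW, 5))
```

With K = 1 the only neighbor is patient 4, so the vote is strep throat. With K = 5 the neighbors are patients 4, 3, 6, 7, and 8, and the vote is split two to two between strep throat and allergy, so the choice of K and of a tie-breaking rule decides the outcome.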
1.7 Data Mining Applications
Customer Intrinsic Value
Figure 1.4 Intrinsic vs. actual customer value (scatter plot; axes: Intrinsic (Predicted) Value and Actual Value)