8392_S2b - Lyle School of Engineering


CSE 8392 SPRING 1999
DATA MINING: CORE TOPICS
Classification
Professor Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275
(214) 768-3087
fax: (214) 768-3085
email: [email protected]
www: http://www.seas.smu.edu/~mhd
January 1999
Classification
• “Classify a set of data based on their values in
certain attributes” (R[2],p868)
• Each grouping is a class
• Equivalence Class
• Given a set K = {k1, k2, k3, …, ky}, a mapping kx -> {A, B, C, D, E, …} assigns each element to one class; such a mapping partitions K
• Similar to estimation
Classification Examples
• Financial market trends (bull, bear)
• Images (raster, vector)
• Loan approval (yes, no)
• Medical diagnosis
• Detecting faults in industry applications
Basic Classification Techniques
(Kennedy,Ch3)
• Boundaries for decision regions
– Ex: Loan threshold
• Probability Techniques
– p(x|C)
– Figure 3-4, p 3-5, Kennedy
– May also use prior knowledge of class membership
probabilities P(C)
– Select class that maximizes P(C)p(x|C)
– The probability that x is in class C is proportional to the prior probability that any input is in C times the probability that an input in C has the observed value x
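A minimal Python sketch of the maximize-P(C)p(x|C) rule with a loan-threshold flavor; the priors and the Gaussian form of p(x|C) are invented for illustration, not taken from Kennedy:

```python
# Select the class that maximizes P(C) * p(x|C).
# Priors and Gaussian likelihood parameters are illustrative inventions.
import math

priors = {"approve": 0.7, "deny": 0.3}             # P(C)
params = {"approve": (60_000, 15_000),             # (mean, std) of income per class
          "deny":    (25_000, 10_000)}

def gaussian_pdf(x, mean, std):
    """p(x|C) modeled here as a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def classify(x):
    # Pick the class with the largest prior-times-likelihood product.
    return max(priors, key=lambda c: priors[c] * gaussian_pdf(x, *params[c]))

print(classify(48_000))  # -> 'approve' under these illustrative parameters
```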
Supervised Induction Techniques
• Output the probability of class membership based
on input values using some estimation technique
(Decision Trees, Neural Nets)
– Use a sample database as a training set
– Analyze training data
– Develop model using attributes of data
– Use these class descriptions on rest of data
• Note: May be many different class descriptions
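A minimal sketch of this train-then-apply workflow; the tiny loan dataset and the one-attribute majority-class model are invented for illustration:

```python
# Supervised induction in miniature: train on a sample, derive class
# descriptions, then apply them to the rest of the data.
from collections import Counter

records = [  # (income_bracket, class_label) -- invented data
    ("high", "approve"), ("high", "approve"), ("low", "deny"),
    ("low", "deny"), ("high", "deny"), ("low", "approve"),
    ("high", "approve"), ("low", "deny"),
]
training, rest = records[:6], records[6:]   # sample database as training set

# Develop a model from the attributes: majority class per attribute value.
votes = {}
for bracket, label in training:
    votes.setdefault(bracket, Counter())[label] += 1
model = {b: counts.most_common(1)[0][0] for b, counts in votes.items()}

# Use the class descriptions on the rest of the data.
for bracket, actual in rest:
    print(bracket, "->", model[bracket], "(actual:", actual + ")")
```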
Posterior Probabilities (Kennedy)
• P(C|x) - Probability input x belongs to class C
• Suppose m classes, look at: P(C1|x), … , P(Cm|x)
• Classification by assigning x to the class with the
highest posterior probability
• Look at training data and assign posterior
probabilities to example patterns
• May not work well with complex tuples in large
databases
• Fig 3-7, p3-9
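The posterior combines the prior P(C) and the class-conditional density p(x|C) from the previous slide via Bayes' rule (a standard identity, not spelled out on the slide):

$P(C_j \mid x) = \dfrac{P(C_j)\, p(x \mid C_j)}{\sum_{k=1}^{m} P(C_k)\, p(x \mid C_k)}$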
Linear Regression
• Linear mapping of input attributes to desired
output
• Error may exist: $y = w_0 + w_1 x_1 + \dots + w_n x_n + \mathrm{error}$
Here the $x_i$ are input attributes
• Least-Squares Minimization: Sum of the squared
error terms is minimized over database (training
set)
• Find weights: EQ3,EQ4 Kennedy, p 10-14
• May be used as baseline comparison approach
• May not work well with complex databases: not all data values may be known, and values may not be numeric
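A minimal least-squares sketch using NumPy's standard lstsq routine; the data values are invented:

```python
# Least-squares fit of y = w0 + w1*x1 + ... + wn*xn over a training set,
# minimizing the sum of squared error terms. Data values are invented.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])  # input attributes
y = np.array([5.1, 4.8, 7.2, 11.9])                              # desired outputs

A = np.column_stack([np.ones(len(X)), X])      # prepend 1s so w0 is the intercept
w, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print("weights:", w)                           # [w0, w1, w2]
print("prediction for x=(2.5, 1.5):", np.array([1.0, 2.5, 1.5]) @ w)
```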
Similarity Measures
• Describe each input tuple as vector
D1 = <d11, … , d1n>
• Define Sim(D1,D2) where:
– Normalized (0 = no similarity, 1 = identical)
– Usually assumes all values are numeric
• Represent each class with vector Ci
– May be determined as centroid of vectors from training
set
• Assign each tuple Dj to the class i for which Sim(Dj,Ci) is maximized (with the normalization above, 1 means identical)
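One concrete choice for Sim is cosine similarity (which lies in [0, 1] for non-negative vectors); a minimal sketch with centroid class vectors, all values invented:

```python
# Similarity-based classification: cosine similarity, class vectors taken
# as centroids of the training vectors. All vectors are invented examples.
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

training = {
    "classA": [[1.0, 0.1], [0.9, 0.2]],
    "classB": [[0.1, 1.0], [0.2, 0.8]],
}
class_vectors = {c: centroid(vs) for c, vs in training.items()}

D = [0.8, 0.3]  # tuple to classify
best = max(class_vectors, key=lambda c: cosine_sim(D, class_vectors[c]))
print(best)  # -> 'classA'
```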
K Nearest Neighbors (Kennedy)
• Store all input-output pairs in training set
• Define distance function
• When a tuple needs to be classified,
determine distance between it and all items
in the training set
• Fig 10-12, p10-38
• Base answer on K nearest items in training
set. Algorithm p 10-39
• Memory intensive & Slow
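A minimal K-nearest-neighbors sketch with Euclidean distance and a majority vote over the K nearest items (one common way to base the answer on them); the training pairs are invented:

```python
# KNN: store all training pairs, compute the distance from the query to
# every stored item, and vote over the K nearest. Data are invented.
import math
from collections import Counter

training = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"),
            ([5.0, 5.0], "B"), ([4.8, 5.2], "B")]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, k=3):
    # Distance to *all* items in the training set: this scan is what makes
    # the method memory intensive and slow on large training sets.
    nearest = sorted(training, key=lambda pair: euclidean(query, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify([1.1, 0.9]))  # -> 'A'
```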
Decision Trees (Kennedy, Section 10.4.10)
• Similar to Twenty Questions: Fig 8-5, p144,
Barquin
• Internal Nodes: Decision Points based on one
attribute
• Leaves: Identify classes
• Classification Process: Input tuple and move
through tree based on attribute values
• Difficult Part - Constructing tree (so that it is
efficient)
(Try playing twenty questions with a young child!)
• Training Set used to build tree
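A minimal sketch of the classification process over a hand-built tree; the loan-style tree and input tuple are invented:

```python
# An input tuple moves through the tree, testing one attribute at each
# internal node, until it reaches a leaf naming a class. Tree is invented.
tree = ("income",                      # internal node: decision on one attribute
        {"high": "approve",            # leaf: class
         "low":  ("employed",          # internal node
                  {"yes": "approve",
                   "no":  "deny"})})

def classify(node, tuple_):
    while isinstance(node, tuple):     # descend until a leaf is reached
        attribute, branches = node
        node = branches[tuple_[attribute]]
    return node

print(classify(tree, {"income": "low", "employed": "yes"}))  # -> 'approve'
```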
Decision Tree Issues
• Attributes
• Splits (Categorical, Discrete, Continuous)
• Ordering of attributes in tree
• Determining when to stop
• Must perform perfectly on training set
• Pruning of tree (remove branches)
Decision Tree Advantages/Disadvantages
• Advantages
– Easy to understand
– Efficient (time)
• Disadvantages
– May be difficult to use with continuous data
– Limited to problems whose classes can be formed by dividing the input space into subrectangles
– Not flexible (no automatic revisions if incorrect)
– No way to handle missing data
– May have overfitting
– Pruning combats overfitting (but it may induce other
errors)
ID-3 (R[2])
• Decision Tree learning system based on
information theory
• Attempts to minimize expected number of tests on
a tuple
• Formalizes the approach adults take to Twenty Questions!
• Picks attributes with highest information gain first
• Entropy: $I = -\sum_i p_i \ln(p_i)$
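A minimal sketch of entropy and information gain, the quantity ID-3 uses to pick the next attribute; the tiny dataset is invented:

```python
# Entropy of the class labels, and the information gain of splitting on a
# given attribute. ID-3 picks the attribute with the highest gain first.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, labels):
    total = len(rows)
    gain = entropy(labels)
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
labels = ["play", "stay", "play", "stay"]

best = max(rows[0], key=lambda a: information_gain(rows, a, labels))
print(best)  # -> 'outlook' (it perfectly separates the labels here)
```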
CART (Kennedy,p10-56)
• Builds binary decision tree
• Exhaustive search to determine best tree where
best defined by goodness of split :
$\Phi(s/t) = 2 P_L P_R \sum_{j=1}^{\#\text{classes}} \bigl| P(j \mid t_L) - P(j \mid t_R) \bigr|$
• Optimal splitting: $\Phi(s'/t) = \max_i \Phi(s_i/t)$
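A minimal sketch of the goodness-of-split measure above, evaluated over two invented candidate splits of the same node:

```python
# phi(s/t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|
# Higher is better; CART searches for the split maximizing this value.
from collections import Counter

def goodness(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    p_left, p_right = len(left_labels) / n, len(right_labels) / n
    cl, cr = Counter(left_labels), Counter(right_labels)
    diff = sum(abs(cl[j] / len(left_labels) - cr[j] / len(right_labels))
               for j in set(cl) | set(cr))
    return 2 * p_left * p_right * diff

# Two candidate binary splits of the same node; pick the max (optimal split).
splits = {
    "s1": (["A", "A", "A"], ["B", "B", "B"]),   # perfect separation
    "s2": (["A", "B", "A"], ["B", "A", "B"]),
}
best = max(splits, key=lambda s: goodness(*splits[s]))
print(best, goodness(*splits[best]))  # -> s1 1.0
```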
Neural Networks
• Determine “predictions using ‘neurons’
(computations) and their interconnections (weighted
inputs).” (p142,Barquin)
• Example: Fig 8-4, p143, Barquin
– Input values of attributes at left
– Weight associated with links between nodes
– Classification produced at output on right
• Neural Net and Decision Tree
Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees,
and Alessandro Zanasi, Discovering Data Mining From
Concept to Implementation, Prentice-Hall, 1998, p71,p74.
Neural Nets
• Number of processing layers between input and
output
• Each processing unit (node) connected to all in the
next layer
• Construct neural net:
– Determine network structure (modeler)
– Weights “learned” by applying the net to the training set
– Backpropagation used to adjust weights
– Desired output provided with training data
– Actual network output subtracted from desired output to produce an error
– Connection weights changed based on a minimization method called gradient descent
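A minimal gradient-descent sketch for a single sigmoid neuron, showing the desired-minus-actual error driving the connection-weight updates; the training pair and learning rate are invented (a real net would have hidden layers and full backpropagation):

```python
# One sigmoid neuron trained by gradient descent on squared error.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.1, -0.2]              # one weight per input link
inputs, desired = [1.0, 0.5], 1.0  # training pattern with its desired output
rate = 0.5                         # learning rate (invented)

for epoch in range(200):           # repeated passes over the training data
    actual = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    error = desired - actual                 # desired output minus actual output
    grad = error * actual * (1.0 - actual)   # gradient term from the sigmoid
    # Gradient descent: nudge each connection weight downhill on the error.
    weights = [w + rate * grad * x for w, x in zip(weights, inputs)]

print(round(actual, 3))            # approaches 1.0 as training proceeds
```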
Historical Note:
• Neural network weights are adjusted based on
whether or not the prediction is good.
• Earlier IR systems used “feedback” to adjust weights in document vectors based on precision/recall values
Neural Nets Advantages/Disadvantages
• Advantages
– Low classification error rates
– Robust in noisy environments
– Rules may be more concise if strong relationships in
attributes
– Provides high degree of accuracy
• Disadvantages
– Multiple passes over database - Very expensive
– Classification process not apparent - “Black Box”
• Embedded within graph structure and link weights
• May be difficult to generate rules
• Difficult to infuse domain knowledge in neural net
– May have overfitting
– May fail to converge
Rule Extraction (RX) Algorithm (R[4])
• Cluster
• Enumerate ruleset for network outputs
• Enumerate ruleset for network inputs
• Merge input and output rulesets
• Algorithm p958
• Neural Net Mining Example
– Lu, section 2.2
Neural Net Data Mining Research Issues
• Network training time
• Possible methods for incremental training of
network
• Use of domain experts
• Reduction of inputs to network
Bayesian Classification (Fayyad,Ch6)
• In Bayesian statistics, we measure the likelihood of the observed data y given each value of x, written f(y|x)
• Data mining goal: find the most probable set of class
descriptions
• NASA AutoClass(Fayyad,Ch6)
– Discover automatic classifications in data
– Like clustering
– No prior definition of classes
– Classes may be unknown to experts
AutoClass
• Records represented as vectors of values
• pdf: Gives “probability of observing an instance
possessing any particular attribute value vector.”
(p156,Cheeseman)
• Model: finite mixture distribution
Interclass mixture probability: $P(X_i \in C_j \mid V_c, T_c, S, I)$
• Two levels of search: maximum a posteriori (MAP) parameters, most probable density function
• Each model a product of independent probability
distributions of attribute subsets
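For reference, a finite mixture distribution has the standard form below; the class count J and per-class parameters $\theta_j$ are generic notation, not the paper's:

$p(X_i \mid \theta) = \sum_{j=1}^{J} P(X_i \in C_j)\; p(X_i \mid X_i \in C_j,\ \theta_j)$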
AutoClass Case Studies
• IRAS: Spectra Classification
– “Minimum” information vs. full access
– “Good” outputs are domain specific
• DNA Codes
– Results may extend beyond limits of database
• LandSat Pixels
– Parallelization
– Undocumented preprocessing
AutoClass Issues
• Interaction between domain experts and machine
– Essential for good results
– Experts and machine each have unique strengths
– Iterative process
• Ockham Factor
– If a particular parameter does not noticeably improve
model, reject models with this parameter
Classification Summary
• Watch out for preprocessing: key words:
– Calibration
– Corrected
– Averaged
– Normalized
– Adjusted
– Compressed
– Subsets
– Partitioned
– Representative
• Uncover biases in data collection
• Engage in full partnership with experts and obtain
domain specific results!