Aspects of Feature Selection for KDD
Friday, February 2, 2001
Presenter: Ajay Gavade
Paper #2: Liu and Motoda, Chapter 3
CIS 830: Advanced Topics in Artificial Intelligence
Kansas State University
Department of Computing and Information Sciences
Outline
• Categories of Feature Selection Algorithms
  – Feature ranking algorithms
  – Minimum subset algorithms
• Basic Feature Generation Schemes & Algorithms
  – How do we generate subsets?
  – Forward, backward, bidirectional, random
• Search Strategies & Algorithms
  – How do we systematically search for a good subset?
  – Informed & uninformed search
  – Complete search
  – Heuristic search
  – Nondeterministic search
• Evaluation Measures
  – How do we tell how good a candidate subset is?
  – Information gain, entropy
The Major Aspects of Feature Selection
• Search Directions (Feature Subset Generation)
• Search Strategies
• Evaluation Measures
• A particular feature selection method combines one choice along each of these aspects; hence each method can be represented as a point in this 3-D structure.
Major Categories of Feature Selection Algorithms
(From the Point of View of the Method's Output)
• Feature Ranking Algorithms
• These algorithms return a list of features ranked according to some evaluation measure; the ranking tells how important (relevant) each feature is relative to the others.
Major Categories of Feature Selection Algorithms
(From the Point of View of the Method's Output)
• Minimum Subset Algorithms
• These algorithms return a minimum feature subset, making no distinction among the features within that subset. They are used when we do not know the number of relevant features.
Basic Feature Generation Schemes
• Sequential Forward Generation (SFG)
• Starts with the empty set and adds features from the original set sequentially; features are added according to relevance.
• Generalizes to an N-step look-ahead form.
• The one-step look-ahead form is the most commonly used scheme because of its good efficiency.
• Either a minimum feature subset or a ranked list can be obtained, as in the sketch below.
• Can deal with noise in the data.
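As a rough illustration only, here is a minimal Python sketch of the one-step look-ahead form; the evaluate function and the target_score stopping threshold are placeholders for whatever evaluation measure and stopping criterion a particular method uses, not part of the original slides.

    def sequential_forward_generation(features, evaluate, target_score):
        """One-step look-ahead SFG: grow the subset by whichever single
        feature most improves the (caller-supplied) evaluation measure."""
        selected, ranked = set(), []
        while len(selected) < len(features):
            # Try adding each remaining feature and keep the best addition.
            best = max((f for f in features if f not in selected),
                       key=lambda f: evaluate(selected | {f}))
            selected.add(best)
            ranked.append(best)            # order of addition yields a ranked list
            if evaluate(selected) >= target_score:
                break                      # satisfactory subset reached
        return selected, ranked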
Basic Feature Generation Schemes
• Sequential Backward Generation (SBG)
• Starts with the full set and removes one feature at a time from the original set sequentially; the least relevant feature is removed at each step (see the sketch below).
• Tells nothing about the ranking of the relevant features that remain.
• Doesn't guarantee an absolute minimal subset.
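A matching sketch for sequential backward generation, again with a hypothetical evaluate measure and an assumed min_score threshold for deciding when the shrinking subset is still acceptable:

    def sequential_backward_generation(features, evaluate, min_score):
        """SBG: repeatedly drop the least relevant feature, i.e. the one
        whose removal leaves the highest-scoring remaining subset."""
        selected = set(features)
        while len(selected) > 1:
            candidate = max(selected, key=lambda f: evaluate(selected - {f}))
            if evaluate(selected - {candidate}) < min_score:
                break                      # removing anything more would degrade the subset
            selected.remove(candidate)     # note: no ranking information is produced
        return selected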
Basic Feature Generation Schemes
• Bidirectional Generation
• Runs SFG and SBG in parallel and stops when either algorithm finds a satisfactory subset.
• Optimizes speed when the number of relevant features is unknown.
Basic Feature Generation Schemes
• Random Generation
• Sequential generation algorithms are fast on average, but they cannot guarantee the absolute minimum valid set (the optimal feature subset): if they hit a local minimum (the best subset at the moment), they have no way to get out of it.
• The random generation scheme produces subsets at random, as in the sketch below. A good random number generator is required so that ideally every combination of features has a chance to occur, and occurs just once.
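A minimal sketch of the generation step, assuming Python's standard random module stands in for the "good random number generator"; tracking which combinations have already occurred is left to the caller.

    import random

    def random_subset(features, rng=random.Random(0)):
        """Draw one nonempty candidate subset uniformly at random, so every
        combination of features has the same chance of being produced."""
        features = list(features)
        while True:
            subset = {f for f in features if rng.random() < 0.5}
            if subset:                     # re-draw instead of returning the empty set
                return subset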
Search Strategies
• Exhaustive Search
• Exhaustive search is complete, since it covers all combinations of features; but a complete search need not be exhaustive.
• Depth-First Search
• This search goes down one branch entirely and then backtracks to another branch. It uses a stack data structure (explicit or implicit).
Depth-First Search
[Figure: depth-first traversal of the subset lattice for 3 features a, b, c — singletons {a}, {b}, {c}, pairs {a,b}, {a,c}, {b,c}, and {a,b,c}]
Search Strategies
• Breadth-First Search
• This search moves down layer by layer, checking all subsets with one feature, then all subsets with two features, and so on. It uses a queue data structure.
• Its space complexity makes it impractical in most cases.
Breadth-First Search
[Figure: breadth-first traversal of the subset lattice for 3 features a, b, c — singletons {a}, {b}, {c}, then pairs {a,b}, {a,c}, {b,c}, then {a,b,c}]
Search Strategies
• Complete Search
• Branch & Bound Search
• A variation of depth-first search that prunes branches against a bound, so it need not be exhaustive.
• If the evaluation measure is monotonic, this search is complete and guarantees the optimal subset (a minimal sketch follows).
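A minimal sketch of branch & bound over feature subsets, assuming a criterion function that is monotonic (deleting features never increases it); searching for the best subset of a given target size is an illustrative choice, not something fixed by the slides.

    def branch_and_bound(features, criterion, target_size):
        """Depth-first search that deletes features from the full set and
        prunes any branch whose criterion already falls at or below the best
        complete subset found so far (valid only for monotonic criteria)."""
        features = list(features)
        best = {"subset": None, "score": float("-inf")}

        def expand(subset, next_index):
            score = criterion(subset)
            if score <= best["score"]:
                return                     # bound: no descendant can do better
            if len(subset) == target_size:
                best["subset"], best["score"] = set(subset), score
                return
            # Delete one more feature; index order avoids revisiting subsets.
            for i in range(next_index, len(features)):
                expand(subset - {features[i]}, i + 1)

        expand(set(features), 0)
        return best["subset"], best["score"]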
Branch & Bound Search
[Figure: branch & bound search tree for 3 features a, b, c, starting from {a,b,c} with bound β = 12; node values (11, 13, 17, 12, 9, 15, 17, 1000) show the evaluation measure at each subset]
Heuristic Search
• Quick to find a solution (a subset of features)
• Finds a near-optimal solution
• More speed, with little loss of optimality
• Best-First Search
• Derived from breadth-first search: it expands its search space layer by layer and chooses the one best subset at each layer to expand.
• Beam Search: keeps the best few subsets at each layer rather than just one (see the sketch below).
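A minimal sketch combining the two ideas, with a hypothetical evaluate measure: beam_width=1 corresponds to the best-first scheme described above (one best subset expanded per layer), and larger widths give beam search.

    import heapq

    def beam_search(features, evaluate, beam_width=1):
        """Expand the subset lattice layer by layer, keeping only the
        beam_width best-scoring subsets of each layer for expansion."""
        features = list(features)
        frontier = [frozenset()]                   # layer 0: the empty subset
        best_subset, best_score = frozenset(), float("-inf")
        for _ in range(len(features)):
            # All one-feature extensions of the subsets kept from the last layer.
            candidates = {s | {f} for s in frontier for f in features if f not in s}
            if not candidates:
                break
            scored = [(evaluate(s), s) for s in candidates]
            frontier = [s for _, s in heapq.nlargest(beam_width, scored, key=lambda t: t[0])]
            layer_score, layer_subset = max(scored, key=lambda t: t[0])
            if layer_score > best_score:
                best_subset, best_score = layer_subset, layer_score
        return set(best_subset), best_score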
Best-First Search
[Figure: best-first search over the subset lattice for 3 features a, b, c; node values (1000, 17, 19, 18, 12, 13, 10, 20) show the evaluation measure used to pick the best subset at each layer]
Search Strategies
• Approximate Branch & Bound Search
• An extension of branch & bound search in which the bound is relaxed by some amount ε, which allows the algorithm to continue and reach an optimal subset. By changing ε, the monotonicity of the measure can be observed.
Approximate Branch & Bound Search
[Figure: approximate branch & bound search tree for 3 features a, b, c, starting from {a,b,c}; node values (13, 11, 12, 15, 9, 17, 17, 1000) show the evaluation measure at each subset]
Nondeterministic Search
• Avoids getting stuck in local minima
• Captures the interdependence of features
• RAND
• Keeps only the current best subset (a minimal sketch follows).
• If a sufficiently long running period is allowed and a good random function is used, it can find the optimal subset. The problem with this algorithm is that we do not know when we have reached the optimal subset; hence the stopping condition is a maximum number of loops allowed.
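A minimal sketch of the RAND loop, assuming a hypothetical is_valid measure (e.g., a consistency check) that accepts the full feature set; the max_loops stopping condition is the one described above.

    import random

    def rand_search(features, is_valid, max_loops, rng=random.Random(1)):
        """Keep only the current best (here: smallest valid) subset, drawing
        random candidates until the maximum number of loops is reached."""
        features = list(features)
        best = set(features)                       # start from the full set, assumed valid
        for _ in range(max_loops):
            size = rng.randint(1, len(features))
            candidate = set(rng.sample(features, size))
            if len(candidate) < len(best) and is_valid(candidate):
                best = candidate                   # a smaller valid subset becomes the current best
        return best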
Evaluation Measures
• What is Entropy?
• A Measure of Uncertainty
  – The quantity
    • Purity: how close a set of instances is to having just one label
    • Impurity (disorder): how close it is to total uncertainty over labels
  – The measure: entropy
    • Directly proportional to impurity, uncertainty, irregularity, surprise
    • Inversely proportional to purity, certainty, regularity, redundancy
• Example
  – For simplicity, assume H = {0, 1}, distributed according to Pr(y)
    • Can have (more than 2) discrete class labels
    • Continuous random variables: differential entropy
  – Optimal purity for y: either
    • Pr(y = 0) = 1, Pr(y = 1) = 0
    • Pr(y = 1) = 1, Pr(y = 0) = 0
• Entropy is 0 if all members of S belong to the same class.
[Figure: H(p) = Entropy(p) plotted against p+ = Pr(y = +), rising from 0 to a maximum of 1.0 at p+ = 0.5 and falling back to 0]
Entropy: Information Theoretic Definition
• Components
  – D: a set of examples {<x1, c(x1)>, <x2, c(x2)>, …, <xm, c(xm)>}
  – p+ = Pr(c(x) = +), p- = Pr(c(x) = -)
• Definition
  – H is defined over a probability density function p
  – D contains examples whose frequency of + and - labels indicates p+ and p- for the observed data
  – The entropy of D relative to c is:
    H(D) ≡ -p+ logb(p+) - p- logb(p-)
• If a target attribute can take on c different values, the entropy of S relative to this c-wise classification is defined as
    Entropy(S) ≡ -Σi=1..c pi log2(pi)
• where pi is the proportion of S belonging to class i.
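A small Python sketch of this definition (base 2, so the result is in bits):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy(S) = - sum_i p_i * log2(p_i), where p_i is the proportion
        of labels in class i."""
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    print(entropy(["+", "-", "+", "-"]))   # 1.0  (evenly mixed sample)
    print(entropy(["+", "+", "+", "+"]))   # -0.0 (pure sample, i.e. zero entropy)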
Entropy
– What is the least pure probability distribution?
  • Pr(y = 0) = 0.5, Pr(y = 1) = 0.5
  • Corresponds to maximum impurity/uncertainty/irregularity/surprise
  • Property of entropy: concave function (“concave downward”)
• Entropy is 1 when S contains an equal number of positive and negative examples.
• Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S.
• What Units is H Measured In?
  – Depends on the base b of the log (bits for b = 2, nats for b = e, etc.)
  – A single bit is required to encode each example in the worst case (p+ = 0.5)
  – If there is less uncertainty (e.g., p+ = 0.8), we can use less than 1 bit each
Information Gain
• A measure of the effectiveness of an attribute in classifying the training data.
• Measures the expected reduction in entropy caused by partitioning the examples according to the attribute.
• Measures the uncertainty removed by splitting on the value of attribute A.
• The information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is
    Gain(S, A) ≡ Entropy(S) - Σv∈values(A) (|Sv| / |S|) Entropy(Sv)
• where values(A) is the set of all possible values of A.
• Gain(S, A) is the information provided about the target function value, given the value of some attribute A.
• The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
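A small sketch of this definition for examples represented as Python dicts; the entropy helper repeats the sketch from the entropy slide so that the block runs on its own.

    from collections import Counter
    from math import log2

    def entropy(labels):                           # as sketched on the entropy slide
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    def information_gain(examples, attribute, target):
        """Gain(S, A) = Entropy(S) - sum over v in values(A) of
        (|S_v| / |S|) * Entropy(S_v)."""
        labels = [ex[target] for ex in examples]
        gain = entropy(labels)
        for v in {ex[attribute] for ex in examples}:
            partition = [ex[target] for ex in examples if ex[attribute] == v]
            gain -= (len(partition) / len(examples)) * entropy(partition)
        return gain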
An Illustrative Example
Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    Sunny     Hot          High      Light   No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Light   Yes
4    Rain      Mild         High      Light   Yes
5    Rain      Cool         Normal    Light   Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Light   No
9    Sunny     Cool         Normal    Light   Yes
10   Rain      Mild         Normal    Light   Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Light   Yes
14   Rain      Mild         High      Strong  No
• Prior (unconditioned) distribution: 9+, 5-
  – H(D) = -(9/14) lg(9/14) - (5/14) lg(5/14) = 0.94 bits
  – Splitting on Humidity: High → [3+, 4-], Normal → [6+, 1-]
    • H(D, Humidity = High) = -(3/7) lg(3/7) - (4/7) lg(4/7) = 0.985 bits
    • H(D, Humidity = Normal) = -(6/7) lg(6/7) - (1/7) lg(1/7) = 0.592 bits
    • Gain(D, Humidity) = 0.94 - [(7/14)(0.985) + (7/14)(0.592)] = 0.151 bits
  – Splitting on Wind: Light → [6+, 2-], Strong → [3+, 3-]
    • Similarly, Gain(D, Wind) = 0.94 - [(8/14)(0.811) + (6/14)(1.0)] = 0.048 bits
  – In general: Gain(D, A) ≡ H(D) - Σv∈values(A) (|Dv| / |D|) H(Dv)
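Using the entropy and information_gain sketches from the previous slides, the Humidity and Wind numbers above can be reproduced directly from the table (the attribute and label names here are just those of the example):

    # PlayTennis examples, reduced to the two attributes used on this slide.
    data = [
        {"Humidity": "High",   "Wind": "Light",  "Play": "No"},
        {"Humidity": "High",   "Wind": "Strong", "Play": "No"},
        {"Humidity": "High",   "Wind": "Light",  "Play": "Yes"},
        {"Humidity": "High",   "Wind": "Light",  "Play": "Yes"},
        {"Humidity": "Normal", "Wind": "Light",  "Play": "Yes"},
        {"Humidity": "Normal", "Wind": "Strong", "Play": "No"},
        {"Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
        {"Humidity": "High",   "Wind": "Light",  "Play": "No"},
        {"Humidity": "Normal", "Wind": "Light",  "Play": "Yes"},
        {"Humidity": "Normal", "Wind": "Light",  "Play": "Yes"},
        {"Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
        {"Humidity": "High",   "Wind": "Strong", "Play": "Yes"},
        {"Humidity": "Normal", "Wind": "Light",  "Play": "Yes"},
        {"Humidity": "High",   "Wind": "Strong", "Play": "No"},
    ]
    print(information_gain(data, "Humidity", "Play"))   # ≈ 0.15 bits, matching Gain(D, Humidity)
    print(information_gain(data, "Wind", "Play"))       # ≈ 0.048 bits, matching Gain(D, Wind)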
Attributes with Many Values
• Problem
  – If attribute has many values, Gain(•) will select it (why?)
  – Imagine using Date = 06/03/1996 as an attribute!
• One Approach: Use GainRatio instead of Gain
    Gain(D, A) ≡ H(D) - Σv∈values(A) (|Dv| / |D|) H(Dv)
    GainRatio(D, A) ≡ Gain(D, A) / SplitInformation(D, A)
    SplitInformation(D, A) ≡ -Σv∈values(A) (|Dv| / |D|) lg(|Dv| / |D|)
– SplitInformation grows with c = |values(A)|
– i.e., it penalizes attributes with more values
• e.g., suppose c1 = cDate = n and c2 = 2
• SplitInformation (A1) = lg(n), SplitInformation (A2) = 1
• If Gain(D, A1) = Gain(D, A2), GainRatio (D, A1) << GainRatio (D, A2)
– Thus, preference bias (for lower branch factor) expressed via GainRatio(•)
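A small sketch of GainRatio, building on the information_gain function from the earlier sketch; it assumes A takes at least two distinct values in the examples (otherwise SplitInformation would be zero).

    from math import log2

    def gain_ratio(examples, attribute, target):
        """GainRatio(D, A) = Gain(D, A) / SplitInformation(D, A); the
        denominator penalizes attributes with many values."""
        total = len(examples)
        split_information = 0.0
        for v in {ex[attribute] for ex in examples}:
            p = sum(1 for ex in examples if ex[attribute] == v) / total
            split_information -= p * log2(p)
        return information_gain(examples, attribute, target) / split_information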
Summary Points
• Search & Measure
• Search and measure play a dominant role in feature selection.
• Stopping criteria are usually determined by a particular combination of search & measure.
• Different feature selection methods arise from different combinations of search strategies & evaluation measures.
• Heuristic : Search :: Inductive Bias : Inductive Generalization
• Entropy and Information Gain
  – Goal: measure the uncertainty removed by splitting on a candidate attribute A
    • Calculating information gain (change in entropy)
    • Using information gain in the construction of a decision tree