GAs and Feature Weighting
Rebecca Fiebrink
MUMT 611
31 March 2005
Outline
• Genetic algorithms
– History of GAs
– How GAs work
• Simple model
• Variations
• Feature selection and weighting
– Feature selection
– Feature weighting
• Using GAs for feature weighting
– Siedlecki and Sklansky
– Punch
– The ACE project
Part 1: Genetic Algorithms
History of GAs
• 1859: Charles Darwin, Origin of Species
• 1950’s – 60’s: Computers simulate evolution
• 1960’s – 70’s: John Holland invents genetic
algorithms
– Adaptation in Natural and Artificial Systems, book
published 1975
– Context of “adaptive systems,” not just “optimization”
• 1980’s – Present: Further exploration, widespread adoption
– Kenneth De Jong
– Goldberg: Genetic Algorithms in Search, Optimization, and
Machine Learning (“classic” textbook published 1989)
• Related areas: evolutionary programming, genetic
programming
(Carnegie Mellon GA online FAQ, Cantu-Paz 2000)
How GAs work
• Problem is a search space
– Variable parameters are the dimensions of the space
– Problem has minima and maxima within this space
– Minima and maxima are hard to find
• Potential solutions describe points in this space
– It is “easy” to calculate “goodness” of any one solution and
compare it to other solutions
• Maintain a set of potential solutions
– Guess a set of initial solutions
– Combine these solutions with each other
– Keep “good” solutions and discard “bad” solutions
– Repeat combination and selection process for some time
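Below is a minimal sketch of this guess/combine/select loop in Python. The population size, chromosome length, bit-counting fitness() placeholder, and tournament() helper are illustrative assumptions, not part of the original slides.

import random

# Illustrative parameters (not from the slides)
POP_SIZE = 50
CHROM_LEN = 13
GENERATIONS = 100
MUTATION_RATE = 0.01

def fitness(chrom):
    # Placeholder "goodness" measure: count the 1-bits
    return sum(chrom)

def tournament(pop, k=3):
    # Keep "good" solutions: pick the best of k randomly chosen chromosomes
    return max(random.sample(pop, k), key=fitness)

def crossover(p1, p2):
    # Combine solutions: single-point crossover at a random locus
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom):
    # Flip each bit with a small probability
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in chrom]

# Guess a set of initial solutions
population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]

# Repeat the combination and selection process for some time
for _ in range(GENERATIONS):
    next_gen = []
    while len(next_gen) < POP_SIZE:
        child1, child2 = crossover(tournament(population), tournament(population))
        next_gen.extend([mutate(child1), mutate(child2)])
    population = next_gen[:POP_SIZE]

best = max(population, key=fitness)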
A simple GA
[Flowchart: Start → Evaluate Population → Terminate? (yes: Stop; no: Select Parents → Produce Offspring → Mutate Offspring → Evaluate Population)]
GA terminology
• Population: The set of all current potential solutions
• Deme: A subset of the population that interbreeds (might be the whole population)
• Chromosome: A solution structure (often binary); contains one or more genes
• Gene: A feature or parameter
• Allele: A specific value for a gene
• Phenotype: A member of the population, a real-valued solution vector
• Generation: A complete cycle of evaluation, selection, crossover, and mutation in a deme
• Locus: A particular position (bit) on the chromosome string
• Crossover: The process of combining two chromosomes; the simplest method is single-point crossover, where two chromosomes swap parts on either side of a random locus
• Mutation: A random change to a phenotype (i.e., changing an allele)
Crossover Illustrated
Parent 1: 1011010111101
Parent 2: 1101010010100
Single-point crossover after locus 6:
Child 1: 101101 0010100
Child 2: 110101 0111101
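A short sketch of single-point crossover that reproduces the example above; the function name and the bit-string representation are illustrative.

def single_point_crossover(parent1, parent2, locus):
    # Swap the tails of two equal-length bit strings after the given locus
    child1 = parent1[:locus] + parent2[locus:]
    child2 = parent2[:locus] + parent1[locus:]
    return child1, child2

p1, p2 = "1011010111101", "1101010010100"
c1, c2 = single_point_crossover(p1, p2, locus=6)
print(c1)  # 1011010010100 (Child 1 above)
print(c2)  # 1101010111101 (Child 2 above)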
GA Variations
• Variables:
– Population size
– Selection method
– Crossover method
– Mutation rate
– How to encode a solution in a chromosome
• No single choice is best for all problems (Wolpert & Macready 1997, cited in Cantu-Paz 2000).
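As an example of one of these design choices, here is a sketch of two common selection methods; the function names and the generic fitness callable are illustrative assumptions, not from the slides.

import random

def roulette_select(population, fitness):
    # Pick one chromosome with probability proportional to its (non-negative) fitness
    scores = [fitness(c) for c in population]
    pick = random.uniform(0, sum(scores))
    running = 0.0
    for chrom, score in zip(population, scores):
        running += score
        if running >= pick:
            return chrom
    return population[-1]

def tournament_select(population, fitness, k=3):
    # Pick the fittest of k randomly sampled chromosomes
    return max(random.sample(population, k), key=fitness)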
More variations: Parallel GAs
• Categories:
– Single-population Master/Slave
– Multiple population
– Fine-grained
– Hybrids
[Diagram: a master node distributing fitness evaluations to several slave nodes]
(Cantu-Paz 2000)
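A minimal sketch of the single-population master/slave scheme, assuming fitness evaluation is the expensive step; the use of Python's multiprocessing.Pool and the placeholder evaluate() function are assumptions for illustration.

import random
from multiprocessing import Pool

def evaluate(chromosome):
    # Expensive fitness evaluation, farmed out to slave (worker) processes
    return sum(chromosome)  # placeholder objective

if __name__ == "__main__":
    population = [[random.randint(0, 1) for _ in range(13)] for _ in range(50)]
    # The master keeps selection, crossover, and mutation;
    # only the fitness evaluations are distributed to the workers.
    with Pool(processes=4) as pool:
        scores = pool.map(evaluate, population)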
Applications of GAs
• NP-hard problems (e.g., Traveling Salesman)
• Airfoil design
• Noise control
• Fluid dynamics
• Circuit partitioning
• Image processing
• Liquid crystals
• Water networks
• Music
(Miettinen et al. 1999, Coley 1999)
Part 2: Feature Selection and Weighting
Features
• What are they?
– A classifier’s “handle” to the problem
– Numerical or binary values representing
independent or dependent attributes of each
instance to be classified
• What are they in music?
– Spectral centroid
– Number of instruments
– Composer
– Presence of banjo
– Beat histogram
– …
Feature Selection – Why?
• “Curse of dimensionality”:
– The size of the training set must grow
exponentially with the dimensionality of the
feature space
• The set of best features to use may vary
according to classification scheme, goal, or
data set
• We need a way to select only a good subset
of all possible features
– May or may not be interested in “best” subset
Feature Selection – How?
• Common approaches
– Dimensionality reduction (principal component analysis or
factor analysis)
– Experimentation: Choose a subset, train a classifier with
it, and evaluate its performance.
• Example: <centroid, length, avg_dynamic>
– A piece might have vector <3.422, 523, 98> (values)
– Try selections <1, 1, 0>, <0, 1, 1> … ? (1=yes, 0=no)
– Which works best?
• Drawbacks of experimentation:
– For n potential features, there are 2^n possible subsets: a huge search space!
– Evaluating each classifier takes time
– Choose vectors using sequential search, branch and bound, or GAs
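A minimal sketch of the selection-mask idea, using the <centroid, length, avg_dynamic> example from this slide; the helper name apply_selection and the exhaustive enumeration of masks are illustrative.

from itertools import product

features = ("centroid", "length", "avg_dynamic")
piece = (3.422, 523, 98)  # feature values for one piece, as in the example

def apply_selection(values, mask):
    # Keep only the features whose mask bit is 1
    return tuple(v for v, keep in zip(values, mask) if keep)

# All 2^n candidate subsets for the n = 3 features
for mask in product((0, 1), repeat=len(features)):
    subset = apply_selection(piece, mask)
    # ...train a classifier on each subset and compare performance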
Feature Weighting
• Among a group of selected (useful)
features, some may be more useful than
others
• Weight the features for optimal classifier
performance
• Experimental approach: Similar to feature
selection
– Example: Say <1, 0, 1> is optimal selection for
<centroid, length, avg_dynamic> feature set
• Try weights like <.5, 0, 1>, <1, 0, .234983>, … ?
• Practical and theoretical constraints of
feature selection are magnified
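One sketch of how such a weight vector might be applied: scale each feature before computing distances for a distance-based classifier. The weighted Euclidean form and the example values are assumptions for illustration.

import math

def weighted_distance(x, y, weights):
    # Euclidean distance with each feature difference scaled by its weight
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights)))

# <centroid, length, avg_dynamic> with length ignored (weight 0)
weights = (0.5, 0.0, 1.0)
d = weighted_distance((3.422, 523, 98), (2.910, 488, 71), weights)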
Part 3: Using GAs for Feature Weighting
GAs and Feature Weighting
• A chromosome is a vector of weights
– Each gene corresponds to a feature
weight
• E.g., <1, 0, 1, 0, 1, 1, 1> for selection
• E.g., <.3, 0, .89, 0, .39, .91, .03> for weighting
• The fitness of a chromosome is a measure of the performance obtained when an actual classifier is trained and validated using its weights
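A sketch of such a wrapper fitness function, assuming a hand-rolled 1-NN classifier scored by leave-one-out accuracy; the toy data, the distance form, and all names here are illustrative assumptions.

import math

def weighted_distance(x, y, weights):
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights)))

def fitness(weights, data, labels):
    # Leave-one-out accuracy of a 1-NN classifier under the given feature weights
    correct = 0
    for i, query in enumerate(data):
        nearest = min(
            (j for j in range(len(data)) if j != i),
            key=lambda j: weighted_distance(query, data[j], weights),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(data)

# Toy example: two features, the second weighted more heavily
data = [(0.1, 5.0), (0.2, 5.2), (0.9, 1.0), (1.0, 1.1)]
labels = ["a", "a", "b", "b"]
score = fitness((0.3, 0.9), data, labels)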
Siedlecki and Sklansky
• 1989: A note on genetic algorithms for large-scale feature selection
• Choose a feature subset for a kNN classifier
• Propose using GAs when the feature set contains more than 20 features
• Binary chromosomes (0/1 for each feature)
• Goal: seek the smallest/least costly subset of features for which classifier performance is above a certain level
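A sketch of a fitness function in the spirit of that goal (prefer the smallest subset among those meeting an accuracy target); the penalty form and threshold are illustrative assumptions, not the authors' exact formulation.

def subset_fitness(mask, accuracy, min_accuracy=0.95):
    # Favor small feature subsets, but only among those meeting the accuracy target
    n_selected = sum(mask)
    if accuracy < min_accuracy:
        return -1000 - n_selected  # heavy penalty for subsets that classify too poorly
    return -n_selected  # fewer features -> higher fitness

# Example: a 5-feature mask with 3 features selected that meets the target
score = subset_fitness([1, 0, 1, 1, 0], accuracy=0.96)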
Results
• Compared GAs with exhaustive search, branch and
bound, and (p,q)-search on sample problem
– Branch and bound and (p, q)-search did not come close enough (within 1% error) to finding a set that performed optimally
– Exhaustive search used 2^24 evaluations
– The GA found optimal or near-optimal solutions with 1000-3000 evaluations, a huge savings over both exhaustive search and branch and bound
• GA exhibits slightly more than linear increase of
time complexity with added features
• GA also outperforms other methods on a real
problem with 30 features
Punch et al.
• 1993: Further research on feature selection
and classification using genetic algorithms
• Approach: Use GAs for feature weighting for a kNN classifier: classifier “dimension warping”
• Each chromosome is a vector of real-valued
weights
– Mapped exponentially to range [.01, 10) or 0
• Also experimented with “hidden features”
representing combination (multiplication) of
2 other features
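A sketch of one way such an exponential gene-to-weight mapping could look, sending a gene in [0, 1] to a weight in [0.01, 10) and reserving small gene values for 0 (dropping the feature); the exact encoding is an assumption, not necessarily the one Punch et al. used.

def gene_to_weight(gene, zero_threshold=0.05):
    # Genes below the threshold drop the feature entirely
    if gene < zero_threshold:
        return 0.0
    # Map the remaining range exponentially: from 10**-2 = 0.01 up toward 10**1 = 10
    exponent = -2 + 3 * (gene - zero_threshold) / (1 - zero_threshold)
    return 10 ** exponent

weights = [gene_to_weight(g) for g in (0.02, 0.3, 0.95)]  # roughly [0.0, 0.06, 7.0]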
Results
• Feature weighting can outperform simple selection, especially on large, noisy data sets
– Best performance when feature weighting
follows binary selection
• 14 days to compute results
• Parallel version: nearly linear speedup when chromosome strings are passed to other processors for fitness evaluation
Recent work
• Minaei-Bidgoli, Kortemeyer, and
Punch, 2004: Optimizing classification
ensembles via a genetic algorithm for
a web-based educational system
– Incorporate GA feature weighting in a multiple-classifier (voting) system
– Results show over 10% improvement in
accuracy of the classifier ensemble when
GA optimization is used
MIR Application
• ACE project
– Accept arbitrary type and number of
features, arbitrary taxonomy, training and
testing data, and a target classification
time
– Classify data using best means available;
e.g., make use of multiple classifiers,
feature weighting
– Parallel GAs for feature weighting can
improve solution time and quality
Conclusions
• GAs are powerful and interesting tools, but they are
complex and not always well-understood
• Feature selection/weighting involves selecting from
a huge space of possibilities, but it is a necessary
task
• GAs are useful for tackling the feature weighting
problem
• Faster machines, parallel computing, and better
understanding of GA behavior can all benefit the
feature weighting problem
• This is relevant to MIR: we want to do good and
efficient classification!