
Non-Metric Methods
Shyh-Kang Jeng
Department of Electrical Engineering/
Graduate Institute of Communication/
Graduate Institute of Networking and
Multimedia, National Taiwan University
1
Non-Metric Descriptions
Nominal data
– Discrete
– Without a natural notion of similarity or even ordering
Property d-tuple
– With lists of attributes
– e.g., { red, shiny, sweet, small }
– i.e., color = red, texture = shiny,
taste = sweet, size = small
2
Non-Metric Descriptions
Strings of nominal attributes
– e.g., Base sequences in DNA segments
“AGCTTCAGATTCCA”
– Might themselves be the output of other component classifiers
– e.g., Chinese character recognizer and
a neural network for classifying
component brush strokes
3
Non-Metric Methods
Learn categories from non-metric
data
Represent structures in strings
Toward discrete problems addressed
by
– Rule based pattern recognition methods
– Syntactic pattern recognition methods
4
Decision Trees
5
Benefits of Decision Trees
Interpretability
Rapid classification
– Through a sequence of simple queries
Natural way to incorporate prior
knowledge from human experts
6
Interpretability
Conjunctions and disjunctions
For any particular test pattern
– e.g., properties: {taste, color, shape, size}
– x = {sweet, yellow, thin, medium}
– (color = yellow) AND (shape = thin)
For category description
– e.g., Apple = (green AND medium) OR
(red AND medium)
Rule reduction
– e.g., Apple = (medium AND NOT yellow)
7
Tree Construction
Given
– Set D of labeled training data
– Set of properties for discriminating
patterns
Goal
– Organize the tests into a tree
8
Tree Construction
Split samples progressively into
smaller subsets
Pure subset
– All samples have the same category
label
– Could terminate that portion of the tree
Subset with mixture of labels
– Decide either to stop or select another
property and grow the tree further
9
CART
Classification and regression trees
A general framework for decision
trees
General questions in CART
– Number of decision outcomes at a node
– Property tested at a node
– Declaration of leaf
– When and how to prune
– Decision of impure leaf node
– Handling of missing data
10
Branching Factor and
Binary Decisions
Branching factor (branching ratio) B
– Number of links descending from a node
Binary decisions
Every decision can be represented using just binary decisions
– e.g., a query of color (B = 3) can be replaced by two binary queries: color = green? color = yellow?
– Universal expressive power
11
Binary Trees
12
Geometrical Interpretation of Trees for Numerical Data
13
Fundamental Principle
Prefer decisions leading to a simple,
compact tree with few nodes
– A version of Occam’s razor
Seek a property query T at each node N
– Make the data reaching the immediate
descendent nodes as pure as possible
– i.e., achieve lowest impurity
Impurity i(N)
– Zero if all patterns bear the same label
– Large if the categories are equally represented
14
Entropy Impurity
(Information Impurity)
Most popular measure of impurity
i(N) = −Σ_j P(ω_j) log2 P(ω_j)
where P(ω_j) = P̂(x ∈ ω_j | N), the fraction of patterns at node N in category ω_j
15
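A minimal Python sketch (mine, not from the slides) of the entropy impurity; a node is assumed to be represented simply by the list of category labels of the training patterns that reach it.

import math
from collections import Counter

def entropy_impurity(labels):
    # i(N) = -sum_j P(w_j) log2 P(w_j), with P(w_j) estimated by label fractions at the node
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy_impurity(["w1", "w1", "w2", "w2"]))  # 1.0 bit: two equally represented categories
print(entropy_impurity(["w1", "w1", "w1"]))        # 0.0: a pure node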
Variance Impurity for
Two-Category Case
Particularly useful in the two-category case
i(N) = P(ω1) P(ω2)
16
Gini Impurity
Generalization of variance impurity
Applicable to two or more
categories
Expected error rate at node N
i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = (1/2) [ 1 − Σ_j P²(ω_j) ]
17
Misclassification Impurity
Minimum probability that a
training pattern would be
misclassified at N
Most strongly peaked at equal
probabilities
i(N) = 1 − max_j P(ω_j)
18
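For comparison, hedged sketches (mine) of the other impurity measures above, under the same assumption that a node is given by the labels of the patterns reaching it; the factor 1/2 in the Gini version follows the slide's formula, so that it reduces exactly to the two-category variance impurity.

from collections import Counter

def _probs(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini_impurity(labels):
    # i(N) = (1/2) [1 - sum_j P(w_j)^2]
    return 0.5 * (1.0 - sum(p * p for p in _probs(labels)))

def variance_impurity(labels):
    # two-category case: i(N) = P(w1) P(w2)
    p = _probs(labels)
    return p[0] * p[1] if len(p) == 2 else 0.0

def misclassification_impurity(labels):
    # i(N) = 1 - max_j P(w_j)
    return 1.0 - max(_probs(labels))

labels = ["w1"] * 90 + ["w2"] * 10
print(misclassification_impurity(labels))  # 0.1, as in the 90/10 example on a later slide
print(gini_impurity(labels))               # 0.09 = P(w1) P(w2)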
Impurity for Two-Category Case
*Adjusted in scale and offset for comparison
19
Heuristic to Choose Query
Choose the query value s to maximize the drop in impurity
Δi(s) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)
where N_L and N_R are the left and right descendent nodes and P_L is the fraction of patterns at N that goes to N_L
If entropy impurity is used, the impurity reduction corresponds to an information gain
The reduction of entropy impurity due to a split cannot be greater than 1 bit
20
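A small sketch (mine) of the impurity drop for a candidate binary split; with the entropy impurity the returned value is the information gain of the split.

import math
from collections import Counter

def entropy_impurity(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def impurity_drop(left, right, impurity=entropy_impurity):
    # delta i(s) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R)
    parent = left + right
    p_left = len(left) / len(parent)
    return impurity(parent) - p_left * impurity(left) - (1 - p_left) * impurity(right)

# A split that separates the two categories perfectly recovers the full 1 bit.
print(impurity_drop(["w1", "w1"], ["w2", "w2"]))  # 1.0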
Finding Extrema
Nominal attributes
– Perform extensive or exhaustive search
over all possible subsets of the training
set
Real-valued attributes
– Use gradient descent algorithms to find
a splitting hyperplane
– As a one-dimensional optimization
problem for binary trees
21
Tie Breaking
Nominal data
– Choose randomly
Real-valued data
– Assume a split lying in the gap x_l < x_s < x_u
– Choose either the middle point or the weighted average x_s = (1 − P) x_l + P x_u
– P is the probability a pattern goes to the "left" under the decision
Computational simplicity may be a
determining factor
22
Greedy Method
Get a local optimum at each node
No assurance that successive locally
optimal decisions lead to the global
optimum
No guarantee that we will have the
smallest tree
For reasonable impurity measures and learning methods
– We often continue to split further to reach the lowest possible impurity at the leaves
23
Favoring Gini Impurity to
Misclassification Impurity
Example: 90 patterns in ω1 and 10 in ω2
Misclassification impurity: 0.1
Suppose no split guarantees an ω2 majority in either of the two descendent nodes
Misclassification impurity remains at 0.1 for all such splits
An attractive split: 70 ω1, 0 ω2 to the right and 20 ω1, 10 ω2 to the left
Gini impurity shows that this is a good split
24
Twoing Criterion
For multiclass binary tree creation
Find "supercategories" C1 and C2
C1 = {ω_i1, ω_i2, …, ω_ik}, C2 = C − C1
Compute Δi(s, C1) as though it corresponded to a standard two-class problem
Find the s*(C1) that maximizes the change Δi, and then the supercategory C1*
25
Practical Considerations
Choice of impurity function rarely
affects the final classifier and its
accuracy
Stopping criterion and pruning
methods are more important in
determining final accuracy
26
Multiway Splits
Δi(s) = i(N) − Σ_{k=1..B} P_k i(N_k)
To avoid favoring decisions with large B, use the gain ratio impurity
Δi_B(s) = Δi(s) / ( −Σ_{k=1..B} P_k log2 P_k )
27
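A hedged sketch of the gain ratio normalization for a B-way split, following the formula above; subsets is assumed to be a list of label lists, one per descendent node.

import math
from collections import Counter

def entropy_impurity(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(subsets):
    # delta i_B(s) = delta i(s) / ( -sum_k P_k log2 P_k )
    parent = [lab for subset in subsets for lab in subset]
    n = len(parent)
    probs = [len(s) / n for s in subsets]
    delta_i = entropy_impurity(parent) - sum(p * entropy_impurity(s)
                                             for p, s in zip(probs, subsets))
    split_info = -sum(p * math.log2(p) for p in probs if p > 0)
    return delta_i / split_info if split_info else 0.0

# A B=4 split into singletons gains 1 bit but carries 2 bits of split information.
print(gain_ratio([["w1"], ["w1"], ["w2"], ["w2"]]))  # 0.5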
Importance of Stopping Criteria
Fully grown trees are typically overfit
Extreme case: each leaf corresponds to a single training point
– The full tree is merely a look-up table
– It will not generalize well in noisy problems
If splitting is stopped too early
– Error on the training data is not sufficiently low
– Performance may suffer
28
Stopping by Checking
Validation Error
Using a subset of the data (e.g.,
90%) for training and the remaining
(10%) as a validation set
Continue splitting until the error on
the validation data is minimized
29
Stopping by Setting a Threshold
Stop if max_s Δi(s) < β
Benefits
– Use all training data
– Leaf can lie in different levels
Fundamental drawback
– Difficult to determine the threshold
An alternative simple method
– Stop when a node represents fewer than some threshold number of points, or less than a fixed percentage of the total training set
30
Stopping by Checking a
Global Criterion
Stop when a global criterion is minimized
α · size + Σ_{leaf nodes} i(N)
Minimum description length
– The criterion combines complexity (tree size) and uncertainty (summed impurity)
31
Stopping Using Statistical Tests
A split s sends Pn patterns to the left and (1 − P)n to the right
χ² = Σ_{i=1,2} (n_iL − P n_i)² / (P n_i)
where n_i is the number of category-ω_i patterns at the node and n_iL the number of them sent to the left
If the most significant split does not yield a χ² exceeding the chosen threshold, stop splitting
32
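A small sketch (assumptions mine) of the chi-squared statistic above for a binary split of a two-category node: n_total[i] counts the category-i patterns at the node, n_left[i] those sent to the left, and P is the overall fraction sent left.

def chi_squared_split(n_left, n_total):
    # chi^2 = sum_i (n_iL - P n_i)^2 / (P n_i)
    P = sum(n_left) / sum(n_total)  # fraction of all patterns sent to the left
    return sum((nl - P * nt) ** 2 / (P * nt) for nl, nt in zip(n_left, n_total))

# A completely uninformative split (same left fraction for both categories) gives 0.
print(chi_squared_split([45, 5], [90, 10]))   # 0.0
# A split that sends most of the w2 patterns left looks far more significant.
print(chi_squared_split([10, 9], [90, 10]))   # about 29.5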
Horizon Effect
Determination of optimal split at a
node is not influenced by decisions at
its descendent nodes
A stopping condition may be met too
early for overall optimal recognition
accuracy
– Biases toward trees in which the
greatest impurity is near the root node
33
Pruning
Grow a tree fully first
All pairs of neighboring leaf nodes
are considered for elimination
– If the elimination yields a satisfactory
(small) increase in impurity, the
common antecedent node is declared a
leaf
– merging or joining
34
Rule Pruning
Each leaf has an associated rule
Some of the rules can be simplified if a series of decisions is redundant
Can improve generalization and
interpretability
Allows us to distinguish between
contexts in which the node is used
35
Example 1: A Simple Tree
36
Example 1: A Simple Tree
Impurity of the root node:
i(N_root) = −Σ_{i=1,2} P(ω_i) log2 P(ω_i) = 1.0
Consider splits of the form "x_i ≤ x_is"
Check exhaustively the n − 1 positions for x1 and x2, respectively
37
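A sketch (names and helpers mine) of the exhaustive search over split points described above, for one real-valued feature: it tries the n − 1 thresholds between consecutive sorted values and keeps the one with the largest entropy-impurity drop.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # exhaustively try splits of the form "x <= x_s" at the gaps between sorted values
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    n = len(xs)
    best_drop, best_s = float("-inf"), None
    for k in range(1, n):
        if xs[k] == xs[k - 1]:
            continue  # identical values: no threshold can separate them
        drop = entropy(ys) - (k / n) * entropy(ys[:k]) - ((n - k) / n) * entropy(ys[k:])
        if drop > best_drop:
            best_drop, best_s = drop, 0.5 * (xs[k - 1] + xs[k])
    return best_s, best_drop

print(best_threshold([0.1, 0.4, 0.6, 0.9], ["w1", "w1", "w2", "w2"]))  # (0.5, 1.0)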
Example 1: A Simple Tree
38
Example 1: A Simple Tree
39
Computation Complexity
Training
– Root node
Sorting: O(dn log n)
Entropy computation: O(n)+(n-1)O(d)
Total: O(dn log n)
– Level 1 node
Average case: O(dn log (n/2))
– Total number of levels: O(log n)
– Total average complexity: O(dn (log n)²)
Recall and classification
– O(log n)
40
Feature Choice
41
Feature Choice
42
Multivariate Decision Trees
43
Multivariate Decision Trees Using
General Linear Decisions
44
Priors and Costs
Priors
– Weight samples to correct for the
prior frequencies
Costs
– Cost matrix λ_ij
– Incorporate costs into the impurity, e.g.,
  i(N) = Σ_{i,j} λ_ij P(ω_i) P(ω_j)
45
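A toy sketch of folding a cost matrix into a Gini-style impurity as in the formula above; lam[i][j] is taken here to be the cost λ_ij associated with categories ω_i and ω_j (the indexing convention is my assumption).

def cost_weighted_impurity(probs, lam):
    # i(N) = sum_{i,j} lambda_ij P(w_i) P(w_j)
    return sum(lam[i][j] * probs[i] * probs[j]
               for i in range(len(probs)) for j in range(len(probs)))

# With zero-one costs this reduces to the (unhalved) Gini impurity 1 - sum_j P^2.
zero_one = [[0, 1], [1, 0]]
print(cost_weighted_impurity([0.9, 0.1], zero_one))  # 0.18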
Training and Classification
with Deficient Patterns
Training
– Proceed as usual
– Calculate impurities at a node using
only the attribute information present
Classification
– Use traditional (“primary”) decision
whenever possible
– Use surrogate splits when test pattern is
missing some features
– Or use virtual values
46
Example 2: Surrogate Splits and
Missing Attributes
47
Example 2: Surrogate Splits and
Missing Attributes
48
Algorithm ID3
Interactive dichotomizer
For use with nominal (unordered)
inputs only
– Real-valued variables are handled by
bins
Gain ratio impurity is used
Continues until all nodes are pure or
there are no more variables
Pruning can be incorporated
49
Algorithm C4.5
Successor and refinement of ID3
Real-valued variables are treated as
in CART
Gain ratio impurity is used
Use pruning based on statistical
significance of splits
50
Treating Deficient Patterns in C4.5
Branching factor B at node N
Follow all B possible answers down to the descendent nodes and ultimately to B leaf nodes
Final decision is based on the labels
of the B leaf nodes, weighted by the
decision probabilities at N
51
Example of C4.5Rules
IF [ (0.04 x1 + 0.16 x2 ≤ 0.11)
     AND (0.27 x1 + 0.44 x2 ≤ 0.02)
     AND (0.96 x1 + 1.77 x2 ≤ 0.45)
     AND (5.43 x1 + 13.33 x2 ≤ 6.03) ]
THEN x ∈ ω1
⇒ (after rule pruning)
IF [ (0.04 x1 + 0.16 x2 ≤ 0.11)
     AND (5.43 x1 + 13.33 x2 ≤ 6.03) ]
THEN x ∈ ω1
52
Basic Concepts for
String Recognition
Strings
– “AGCTTCGAATC”
Characters and alphabet
– { A, G, C, T }
Words
– x = “AGCTTC”
Texts
Factor
– “GCT” in “AGCTTC”
53
String Problems of Importance for
Pattern Recognition
String matching
– e.g., classify a book by key words
Edit distance
String matching with errors
– e.g., classify texts with misspelled words
String matching with the "don't-care" symbol
– e.g., classify a DNA sequence according to whether it contains the pattern for a protein, when some DNA segments are inert or have no function
54
General String Matching Problem
55
Naïve String Matching
initialize A, x, text, n ← length[text], m ← length[x]
s ← 0
while s ≤ n − m
  if x[1…m] = text[s + 1 … s + m]
    then print "pattern occurs at shift" s
  s ← s + 1
return
end
56
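A direct Python rendering (mine) of the naive matcher above, using 0-based indexing in place of the pseudocode's 1-based indexing.

def naive_string_match(x, text):
    # report every shift s at which pattern x occurs in text
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):
        if text[s:s + m] == x:
            shifts.append(s)
    return shifts

print(naive_string_match("GCT", "AGCTTCGCT"))  # [1, 6]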
Boyer-Moore String Matching
initialize A, x, text, n ← length[text], m ← length[x]
F(x) ← last-occurrence function
G(x) ← good-suffix function
s ← 0
while s ≤ n − m
  j ← m
  while j > 0 and x[j] = text[s + j]
    do j ← j − 1
  if j = 0
    then print "pattern occurs at shift" s
         s ← s + 1
    else s ← s + max[ j + 1 − G(x[j + 1 … m]), j − F(text[s + j]) ]
return
end
57
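A hedged sketch of only the last-occurrence (bad-character) part of Boyer-Moore; the good-suffix function G is omitted for brevity, so this is closer to the Horspool simplification than to the full algorithm above, but it shows the right-to-left scan and the larger shifts.

def boyer_moore_bad_char(x, text):
    n, m = len(text), len(x)
    last = {c: i for i, c in enumerate(x)}  # F: rightmost position of each character in x
    shifts, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and x[j] == text[s + j]:
            j -= 1                           # scan the pattern from right to left
        if j < 0:
            shifts.append(s)                 # full match at shift s
            s += 1
        else:
            # bad-character shift; full Boyer-Moore also takes the max with the good-suffix shift
            s += max(1, j - last.get(text[s + j], -1))
    return shifts

print(boyer_moore_bad_char("GCT", "AGCTTCGCT"))  # [1, 6]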
Boyer-Moore String Matching
58
Subset-Superset Problem
Search for several strings
– Some short strings are factors of long
ones
Example
– Find “beat”, “eat”, “be”
– From “when_chris_beats_the_drum”
Return the longest target strings only
59
Edit Distance
Minimum number of fundamental
operations required to transform
string x to string y
Fundamental operations
– Substitution
– Insertion
– Deletion
60
Edit Distance Computation
initialize A, x, y, m ← length[x], n ← length[y]
C[0, 0] ← 0
i ← 0
do i ← i + 1
  C[i, 0] ← i
until i = m
j ← 0
do j ← j + 1
  C[0, j] ← j
until j = n
61
Edit Distance Computation
i  0, j  0
do i  i  1
do j  j  1
C[i, j ]  min[ C[i  1, j ]  1,
C[i, j  1]  1,
C[i  1, j  1]  1   (x[i ], y[ j ])]
until j  n
until i  m
return C[m, n]
end
62
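A compact Python version (mine) of the dynamic program above; C[i][j] holds the edit distance between the first i characters of x and the first j characters of y.

def edit_distance(x, y):
    # minimum number of substitutions, insertions, and deletions turning x into y
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i                          # delete all i characters
    for j in range(1, n + 1):
        C[0][j] = j                          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = 1 if x[i - 1] == y[j - 1] else 0
            C[i][j] = min(C[i - 1][j] + 1,             # deletion
                          C[i][j - 1] + 1,             # insertion
                          C[i - 1][j - 1] + 1 - same)  # substitution (free if characters match)
    return C[m][n]

print(edit_distance("AGCTTC", "AGCTC"))  # 1 (one deletion)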
Edit Distance Computation
63
String Matching with Errors
Given a pattern x and text
Find the shift for which the edit
distance between x and a factor of
text is minimum
Apply an algorithm similar to the one that computes the edit distance between two strings, but use a cost matrix E
Cost matrix E
– E[i, j] = min[ C(x[1…i], y[1…j]) ]
– E[0, j] = 0
64
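A sketch (mine) of the variant above: the same recurrence as the edit distance, but E[0][j] = 0 so that a match of x may begin at any position of the text, and the best shift is read off the minimum of the last row.

def min_edit_match(x, text):
    # smallest edit distance between x and any factor of text
    m, n = len(x), len(text)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        E[i][0] = i                          # first column as in the plain edit distance
    # E[0][j] stays 0: the match may start anywhere in the text
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = 1 if x[i - 1] == text[j - 1] else 0
            E[i][j] = min(E[i - 1][j] + 1,
                          E[i][j - 1] + 1,
                          E[i - 1][j - 1] + 1 - same)
    return min(E[m])                         # best alignment ending anywhere in the text

print(min_edit_match("GCTT", "AGCTACGT"))  # 1: the factor "GCTA" is one substitution away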
String Matching with Errors
65
String Matching with
“Don’t-Care” Symbols
Can modify the naïve algorithm
Extending Boyer-Moore algorithm is
difficult and inefficient
Check Bibliography for better
methods
66
Grammars
Structures of hierarchical rules for generating strings
Examples
– "The history book clearly describes several wars"
– Telephone numbers
Provide constraints for a full system that
uses a statistical recognizer as a
component
– e.g., Mathematical equation recognizer
– ASR for phone numbers
67
Grammars
Symbols
– Alphabet A
– Null string
Variables I
Root symbol
– Taken from a set S
Production rules P
Grammar G={A, I, S, P}
Language L(G)
68
Example Language 1
A = {a, b, c},  S = {S},  I = {A, B, C}
P = {
  p1: S → aSBA OR aBA
  p2: AB → BA
  p3: bB → bb
  p4: bA → bc
  p5: cA → cc
  p6: aB → ab
}
69
Example Language 1
Derivation of abc:    root S →(p1) aBA →(p6) abA →(p4) abc
Derivation of aabbcc: root S →(p1) aSBA →(p1) aaBABA →(p6) aabABA →(p2) aabBAA →(p3) aabbAA →(p4) aabbcA →(p5) aabbcc
L(G) = { a^n b^n c^n | n ≥ 1 }
70
Example Language 2
A = { the, history, book, sold, over, 1000, copies, … }
I = { ⟨noun⟩, ⟨verb⟩, ⟨noun phrase⟩, ⟨verb phrase⟩, ⟨adjective⟩, ⟨adverb⟩, ⟨adverbial phrase⟩ }
S = ⟨sentence⟩
71
Example Language 2
P = {
  ⟨sentence⟩ → ⟨noun phrase⟩⟨verb phrase⟩
  ⟨noun phrase⟩ → ⟨adjective⟩⟨noun phrase⟩
  ⟨verb phrase⟩ → ⟨verb phrase⟩⟨adverbial phrase⟩
  ⟨noun⟩ → book OR theorem OR …
  ⟨verb⟩ → describes OR buys OR holds OR …
  ⟨adverb⟩ → over OR …
}
"Squishy green dreams hop heuristically"
72
Derivation Tree
73
Types of String Grammars
Type 0: Free or unrestricted
Type 1: Context-sensitive
– αIβ → αxβ
Type 2: Context-free
– I → x
Type 3: Finite state or regular
– α → zβ OR α → z
Grammars of type i include all grammars of type i + 1
74
Chomsky Normal Form (CNF)
CNF
– A → BC and A → z
For every context-free grammar G,
there is another G’ in CNF such that
L(G)=L(G’)
75
Example 3
To pronounce any number between 1 and 999,999
A = { one, two, …, ten, eleven, …, twenty, thirty, …, ninety, hundred, thousand }
I = { digits6, digits3, digits2, digit1, teens, tys }
S = digits6
76
Example 3
P = {
  digits6 → digits3 thousand digits3 OR digits3 thousand OR digits3
  digits3 → digit1 hundred digits2
  digits3 → digit1 hundred OR digits2
  digits2 → teens OR tys OR tys digit1 OR digit1
  digit1 → one OR two OR … OR nine
  teens → ten OR eleven OR … OR nineteen
  tys → twenty OR thirty OR … OR ninety
}
*Type 2 grammar
77
Example 3
78
Recognition Using Grammars
Classify a test sentence x according to the grammar by which it was generated
The grammar is among c different
ones: G1, G2, …, Gc
Parsing
– Find a derivation in G that leads to x
79
Bottom-Up Parsing Algorithm
initialize G = {A, I, S, P} in CNF, x = x1 x2 … xn
i ← 0
do i ← i + 1
  V_i1 ← { A | A → x_i }
until i = n
j ← 1
do j ← j + 1
  i ← 0
  do i ← i + 1
80
Bottom-Up Parsing Algorithm
    V_ij ← ∅
    k ← 0
    do k ← k + 1
      V_ij ← V_ij ∪ { A | A → BC ∈ P, B ∈ V_ik and C ∈ V_{i+k, j−k} }
    until k = j − 1
  until i = n − j + 1
until j = n
if S ∈ V_1n then print "parse of" x "successful in G"
return
end
81
Example
A = { a, b }
I = { A, B, C }
S = S
P = {
  p1: S → AB OR BC
  p2: A → BA OR a
  p3: B → CC OR b
  p4: C → AB OR a
}
82
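A sketch of the bottom-up (CYK-style) parser above in Python, with 0-based indexing; the CNF grammar is assumed to be given as a list of terminal rules A → a and a list of binary rules A → BC, and the test grammar is the one from the Example slide.

def cyk_parse(x, terminal_rules, binary_rules, start="S"):
    n = len(x)
    # V[i][j] = variables that derive the substring of length j+1 starting at position i
    V = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(x):
        V[i][0] = {A for (A, a) in terminal_rules if a == ch}
    for length in range(2, n + 1):                 # substring length
        for i in range(n - length + 1):            # start position
            for k in range(1, length):             # split point
                for (A, B, C) in binary_rules:
                    if B in V[i][k - 1] and C in V[i + k][length - k - 1]:
                        V[i][length - 1].add(A)
    return start in V[0][n - 1]

# Grammar of the example: S -> AB | BC, A -> BA | a, B -> CC | b, C -> AB | a
terminals = [("A", "a"), ("B", "b"), ("C", "a")]
binaries = [("S", "A", "B"), ("S", "B", "C"),
            ("A", "B", "A"), ("B", "C", "C"), ("C", "A", "B")]
print(cyk_parse("baaba", terminals, binaries))  # True: "baaba" is generated by this grammar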
Example
83
Example
84
Example
85
Finite State Machine
G:
S → the A
A → mouse B OR cow B
…
86
Grammatical Inference
Learn a grammar from example
sentences it generates
Differences from statistical methods
– Usually an infinite number of grammars
consistent with the training data
– Often cannot recover the source grammar uniquely
87
Main Techniques
Use both positive and negative
instances, D+ and D-, respectively
– Generalization for multicategory cases
Impose conditions and constraints
– Alphabet contains only those symbols
appearing in training sentences
– Every production rule is used
– Seek the “simplest” grammar that
explains the training sentences
88
Grammatical Inference (Overview)
initialize D+, D-
n+ ← |D+|
S ← S
A ← set of characters in D+
initialize G (as simple as possible)
i ← 0
do i ← i + 1
  read x_i+ from D+
  if x_i+ cannot be parsed by G
89
Grammatical Inference (Overview)
  then do propose additional productions for P and variables for I
     accept updates if G parses x_i+ but no sentences in D-
until i = n+
eliminate redundant productions
return G = {A, I, S, P}
end
90
Example 4: Grammatical Inference
D+={ a, aaa, aaab, aab }
D- = { ab, abc, abb, aabb }
A = { a, b }
I={A}
S=S
P = { S → A }
G0 = { A, I, S, P }
91
Example 4: Grammatical Inference
i   xi+    P                          P produces D-?
1   a      S→A, A→a                   No
2   aaa    S→A, A→a, A→aA             No
3   aaab   S→A, A→a, A→aA, A→ab       Yes: ab is in D-
92
Example 4: Grammatical Inference
i   xi+    P                          P produces D-?
3   aaab   S→A, A→a, A→aA, A→aab      No
4   aab    S→A, A→a, A→aA, A→aab      No
93
Rule-Based Methods
For classes characterized by general
relationships among entities, rather
than by instances per se
Integral to expert systems
Modest use in pattern recognition
Focus on if-then rules for
representing and learning such
relationships
94
Example: Arch
95
IF-THEN Rules
Example:
– IF Swims(x) AND HasScales(x) THEN
Fish(x)
Predicates
– e.g., Man(.), HasTeeth(.),
AreMarried(.,.), Touch(.,.),
Supports(.,.,.)
Variables
– e.g., IF Eats(x,y) AND HasFlesh(x)
THEN Carnivore(y)
96
Propositional Logic and
First-Order Logic
Propositional logic
– e.g., IF Male(Bill) AND
IsMarried(Bill) THEN
IsHusband(Bill)
First-order logic examples
– IF Male(x) AND IsMarried(x,y) THEN
IsHusband(x)
– IF Parent(x,y) AND Parent(y,z) THEN
GrandParent(x,z)
– IF Spouse(x,y) THEN Spouse(y,x)
97
Functions and Terms
Example
– IF Male(x) AND (Age(x) < 16) THEN
Boy(x)
98
Applications in
Pattern Classification
Example
– IF IsBlock(x) AND IsBlock(y) AND
IsBlock(z) AND Touch(x,y) AND
Touch(x,z) AND NotTouch(y,z) AND
Supports(x,y,z) THEN Arch(x,y,z)
Algorithms to implement predicates
or functions may be difficult
99
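A toy sketch of how the Arch rule above might be coded with predicates as plain Python functions; the Block representation and the Touch/Supports tests are simplified placeholders of my own, only meant to show the IF-THEN structure.

from dataclasses import dataclass

@dataclass
class Block:
    x: float      # horizontal centre
    y: float      # vertical centre (larger = higher)
    width: float

def touch(a, b):
    # crude test: blocks touch if their horizontal extents overlap or abut
    return abs(a.x - b.x) <= (a.width + b.width) / 2

def supports(x, y, z):
    # crude test: x rests above y and z and touches both
    return x.y > y.y and x.y > z.y and touch(x, y) and touch(x, z)

def arch(x, y, z):
    # IF Touch(x,y) AND Touch(x,z) AND NOT Touch(y,z) AND Supports(x,y,z) THEN Arch(x,y,z)
    return touch(x, y) and touch(x, z) and not touch(y, z) and supports(x, y, z)

lintel = Block(x=1.0, y=2.0, width=3.0)
left = Block(x=0.0, y=1.0, width=1.0)
right = Block(x=2.0, y=1.0, width=1.0)
print(arch(lintel, left, right))  # True for this toy configuration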
Learning Rules
Train a decision tree and then simplify the
tree to extract rules
Infer rules via grammatical inference
Sequential covering learning
– Given sets of positive and negative examples
– Learns a rule, deletes the examples that the rule explains, and iterates
– Leads to a disjunctive set of rules that cover the training data
– Simplifies the results by standard logical
methods
100
Learning Rules
101