Transcript cs.uvm.edu

Fuzzy Interpretation of Discretized Intervals
Dr. Xindong Wu
IEEE TRANSACTIONS ON FUZZY SYSTEM
VOL. 7, NO. 6, DECEMBER 1999
Presented by Peter Duval
Summary Slide
•
•
•
•
•
•
•
•
Overview
Problem
Solution
Related Techniques
Algorithms Design in HCV
Experimental Results
Conclusion
Exam Questions
Concepts Review
• Induction: Generalize rules from training data
• Deduction: Apply generalized rules to testing data
• Three possible results of Deduction:
– Single match
– No match
– Multiple match
Concepts Review
• Discretization of Continuous domains
– Continuous numerical domains can be
discretized into intervals
– The discretized intervals can be treated as
nominal values
Concepts Review
• Using Information Gain Heuristic for
Discretization:
(employed by HCV)
– x = (xi + xi+1)/2 for (i = 1, …, n-1)
– x is a possible cut point if xi and xi+1 are of different
classes
– Use IGH to find best x
– Recursively split on left and right
– Stop recursive splitting when some criteria is met
Overview
Training
Data
Discretizaion
induction
rules
No match
Testing
Data
Deduction Single match
Multiple match
Fuzzy Borders
Problem
• Discretization of continuous domains does
not always fit accurate interpretation!
• Recall, using Info Gain, --a kind of
heuristic measure applying in training data,
cannot accurately fit “data in real world”.
• Example
Problem
• Heuristic 1(e.g. Information Gain)
young
old
18
35
49
49.49
• Heuristic 2(e.g. Gain Ratio)
young
18
35
old
50
49.49
Problem
• Suppose after
induction, we just get
one rule:
• If (age=old) then
Class=MORE_EXPERIENCE
According to Heuristic 2,
Instance(age=49.49) No
match!
Solution
• More safe way to describe age=49.49 is to say:
To some degree, it is young; To some degree, it
is old.
• Rather than using one assertion that definitely
tells it is young or old.
• Thus, to some degree, it can get its rule and
classification result other than no match.
– No matchSingle match or multiple match with some
degree
• This is so-called fuzzy match!
Solution
• “Fuzziness is a type of deterministic uncertainty.
It describes the event class ambiguity.”
• “Fuzziness works when there are the outcomes
that belong to several event classes at the same
time but to different degrees.”
• “Fuzziness measures the degree to which an
event occurs.”
– Jim Bezdek, Didier Dubois, Bart osko, Henri Prade
Solution
• “to some degree”?
– Membership function describes “degree”
– Membership function tells you to what degree, an
event belongs to one class.
– Membership function calculates this degree.
• Three widely used membership functions are
employed by HCV.
– Linear
– Polynomial
– Arctan
Solution
S: is user-specified
parameter.
• Linear membership function
l
sl
xleft
e.g.
0.1 indicates the
interval spreads
out into adjacent
intervals for 10%
of its original
length at each end.
xright
k = 1/2sl; a = -kxleft + ½; b = kxright + ½
linleft(x) = kx + a
linright(x) = -kx + b
lin(x) = MAX(0, MIN(1,linleft(x),linright(x)))
Solution
• Polynomial Membership Function—using
more smooth curve function instead of
linear function.
• Arctan Membership Function
• Experimental results shows that no
significant difference between three kinds
of functions—so Polynomial Membership
Function is chosen.
Solution
polyside(x) = asidex3 + bsidex2 + csidex + dside
aside = 1/(4(ls)3)
bside = -3asidexside
side {left,right}
cside = 3aside(xside2 - (ls)2)
dside = -a(xside3 -3xside(ls)2 + 2(ls)3)
poly(x) =
To what degree,
x belongs to one
interval
polyleft(x),
polyright(x),
1,
0,
if xleft -ls  x  xleft + ls
if xright -ls  x  xright +ls
if xleft +ls  x  xright -ls
otherwise
Related Techniques
– No match
• Largest Class
– Assign all no match examples to the largest class, the
default class
– Multiple match
• Largest Rule
– Assign examples to the rules which cover the largest
number of examples
• Estimate of Probability
– Fuzzy borders can bring multiple match--conflicts, so
hybrid method is desired for the whole process
Related Techniques
• Estimate of Probability
The probability of
e belongs to
class ci
# of e.g.s in training
set covered by conj
Conj1 and Conj2 are two
rules supporting
e belongs to Ci
Algorithms Design in HCV
• HCV(Large)
– No match: Largest Class
– Multiple match: Largest Rule
• HCV(Fuzzy)
– No match: Fuzzy Match
– Multiple match: Fuzzy Match
• HCV(Hybrid)
– No match: Fuzzy Match
– Multiple match: Estimate of Probability
Experimental Results
• Data:
– 17 datasets from UCI Machine Learning Repository
– Why select these:
1) Numerical data
2) Situations where no rules clearly apply
• Test conditions
– 68 parameters in HCV are all default except
deduction strategy
– Parameters for C4.5 and NewID are adopted as the
one recommended by respective inventors
Experimental Results
Dataset
HCV
HCV (large)
(hybrid)
HCV
C4.5
C4.5
(fuzzy)
(R 8)
(R 5)
NewID
Anneal
98.00%
93.00%
93.00%
95.00%
93.00%
81.00%
Bupa
57.60%
55.90%
55.90%
71.20%
61.00%
73.00%
Cleveland 2
78.00%
68.10%
73.60%
71.40%
76.90%
67.00%
Cleveland 5
54.90%
56.00%
52.70%
51.60%
56.00%
47.30%
D CRX
82.50%
72.50%
82.00%
83.00%
80.00%
79.00%
Glass (w/out ID)
72.30%
60.00%
60.00%
71.50%
64.60%
66.00%
Hungarian 2
86.30%
85.00%
85.00%
81.20%
80.00%
78.00%
Hypothroid
97.80%
86.30%
96.30%
99.40%
99.40%
92.00%
Imports 85
62.70%
59.30%
61.00%
61.00%
67.80%
61.00%
Ionosphere
88.00%
81.20%
81.20%
86.30%
85.50%
82.00%
Labor Neg
76.50%
76.50%
76.50%
82.40%
82.40%
65.00%
Pima
73.90%
69.10%
69.10%
73.50%
75.50%
73.00%
Swiss 2
96.90%
96.90%
96.90%
96.90%
96.90%
97.00%
Swiss 5
28.10%
25.00%
28.10%
40.60%
31.20%
22.00%
Va 2
78.90%
78.90%
78.90%
77.50%
70.40%
77.00%
Va 5
28.20%
25.40%
29.60%
31.00%
26.80%
20.00%
Wine
90.40%
76.90%
76.90%
90.40%
90.00%
90.40%
Experimental Results
• Predictive accuracy
– HCV (hybrid) outperforms others in
– HCV (large)
– HCV (fuzzy)
– C4.5 (R 8)
– C4.5 (R 5)
– NewID
9 datasets
3 datasets
2 datasets
7 datasets
6 datasets
3 datasets
– HCV (hybrid)clearly and significantly outperforms other
interpretation techniques (in HCV) for datasets with
numerical data in “no match” and “multiple match” cases.
• C4.5 and NewID are included for reference, not for
extensive comparison.
Conclusion
• Fuzziness is strongly domain dependent, HCV
allows users to specify their own intervals and
fuzzy functions.
– An important direction to take with specific domains
• Fuzzy Borders design combined with probability
estimation achieve better results in terms of
predictive accuracy.
– Applicable to other machine learning and data
mining algorithms
The End
• Get the paper here:
– http://library.uvm.edu/articles/computer.html
– Login and use ieeexplore to get .pdf.
• Based on slides by Gong Chen.
Exam Questions
• Q1:When doing deduction on real world data, what are the
three possible cases for each test example?
– Single match
– No match
– Multiple match
• Q2: Of the three cases during deduction, which ones do the
HCV hybrid interpretation algorithm use fuzzy borders to
classify?
– No match
• Q3: In the Hybrid interpretation algorithm used in HCV,
– when are sharp borders set up?
• “Sharp borders are set up as usual during induction”
– when are fuzzy border defined?
• In deduction, “only in the no match case, fuzzy borders are set up in
order to find a rule which is closest to the test example in question”
Citations from Google Scholar
•
•
•
•
•
•
[PS] Building Intelligent Learning Database Systems
X Wu - AI Magazine, 2000 - cs.uvm.edu
Page 1. Building Intelligent Learning Database Systems Xindong Wu Department
of Mathematical and Computer Sciences Colorado School ...
Cited by 4 - View as HTML - Web Search - BL Direct
Enhancement of Power System Data Debugging using GSA-Based Data-Mining Technique - group of 2 »
SJ Huang, JM Lin - IEEE Trans. on Power Systems, 2002 - ieeexplore.ieee.org
Page 1. 1022 IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 17, NO. 4, NOVEMBER 2002
Enhancement of Power System Data Debugging Using GSA-Based Data-Mining Technique ...
Cited by 2 - Web Search - BL Direct
Discretization for Data Mining - group of 2 »
Y Yang, GI Webb - Encyclopedia of data warehousing and mining. Idea Group …, 2005 - cs.uvm.edu
Page 1. Discretization for Data Mining Ying Yang* Department of Computer
Science University of Vermont Burlington, VT 05405 USA Tel ...
Cited by 1 - View as HTML - Web Search
Trading off between Misclassification, Recognition and Generalization in Data Mining with Continuous … - group of 2 »
D Wang, T Dillon, E Chang - LECTURE NOTES IN COMPUTER SCIENCE, 2002 - Springer
Page 1. T. Hendtlass and M. Ali (Eds.): IEA/AIE 2002, LNAI 2358, pp.
303-313, 2002. Springer-Verlag Berlin Heidelberg 2002 Trading ...
Web Search - BL Direct
A Fuzzy Case Retrieval Approach Based on SQL for Implementing Electronic Catalogs - group of 2 »
L Portinale, S Montani - LECTURE NOTES IN COMPUTER SCIENCE, 2002 - Springer
Page 1. A Fuzzy Case Retrieval Approach Based on SQL for Implementing Electronic
Catalogs Luigi Portinale and Stefania Montani Dipartimento ...
Web Search - BL Direct
Data Mining for Constructing Ellipsoidal Fuzzy Classifier with Various Input Features Using GRBF … - group of 5 »
D Wang, T Dillon, E Chang - Proceedings of the 2002 IEEE International Conference on …, 2002 - ieeexplore.ieee.org
Page 1. Data Mining for Constructing Ellipsoidal Fuzzy Classifier with
Various Input Features Using GRBF Neural Networks Dianhui ...
Web Search
Citations from Citeseer
This paper is cited in the following contexts:
• Building Intelligent Learning Database Systems - Wu
(2000) (1 citation) (Correct) ....change a single match case to a
multiple match, and a no match case to a single or even multiple
match. Deduction with fuzzy borders of discretized intervals is
called fuzzy matching. In the multiple match case, we can take the
interval with the greatest degree as the value s discrete value. [31]
describes an implementation of the fuzzy matching techniques.
5 Practical Issues When building practical ILDB systems, we need to
face the following important problems, in addition to noise handling
and dealing with both numerical and nominal data, which have
received wide attention in the ....
X. Wu, Fuzzy Interpretation of Discretized Intervals, IEEE
Transactions on Fuzzy Systems, 7(1999), 6: 753--759.
Google Web Search
•
38 Hits, 19 Shown:
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
–
Fuzzy interpretation of discretized intervals - Fuzzy Systems ...
Citations: Fuzzy Interpretation of Discretized Intervals - Wu ...
Building Intelligent Learning Database Systems - Wu (ResearchIndex)
CS 331/295 Final
Using Evolutionary Algorithms as Instance Selection for Data ...
AI Magazine: Building Intelligent Learning Database Systems
DADES DEL SUMARI DE IEEE TRANSACTIONS ON FUZZY SYSTEMS
Contents of related papers Vol.12 No.4
Contents of related papers Vol.12 No.4 - [ Translate this page ]
Author Guidelines for 8
Enhancement of power system data debugging using gsa-based data ...
A Fuzzy Case Retrieval Approach Based on SQL for Implementing ...
Trading off between Misclassification, Recognition and ...
IEEE Transactions on Fuzzy Systems, 1999; 7 (6)
Building Intelligent Learning Database Systems - Wu (ResearchIndex)
Sumarios electrónicos de revistas [Biblioteca de la Universidad de ... - [ Translate this page ]
Building Intelligent Learning Database Systems (eBizSearch)