28SpCS157BL18 - Department of Computer Science


Data Mining - Knowledge Presentation: The ID3 Algorithm
Prof. Sin-Min Lee
Department of Computer Science
Data Mining Tasks
Applying discovered rules, patterns, and behaviors to make predictions on new data
– Classification
– Estimation
Understanding the groupings, trends, and characteristics of your customers
– Segmentation
Visualizing the (Euclidean) spatial relationships, trends, and patterns in your data
– Description
Stages of the Data Mining Process
1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data, e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting attributes of the data, e.g., "date acquired" is probably not useful for clustering celestial objects, as in Skycat.
4. Pattern extraction and discovery. This is the stage that is often thought of as "data mining" and is where we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is useful, or even true! Judgment is necessary before following your software's conclusions.
Clustering Example: Clusters of Galaxies
Skycat clustered 2 x 10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum.
The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe.

Clustering Example: Cholera Outbreak in London
Decision trees are an alternative way of structuring rule information.

[Figure: a decision tree for the weather data. The root tests "outlook": the "sunny" branch leads to a "humidity" test (high -> N, normal -> P), the "overcast" branch leads directly to P, and the "rain" branch leads to a "windy" test (true -> N, false -> P).]
A classification rule based on the tree:
if outlook = overcast then P
if outlook = sunny & humidity = normal then P
if outlook = rain & windy = false then P
[Figure: part of the PlayTennis decision tree drawn with Yes/No labels: Outlook at the root; the Sunny branch tests Humidity (High -> No, Normal -> Yes); Overcast and Rain are the other branches.]
Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
Top-Down Induction of Decision Trees (ID3)
1. A <- the "best" decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For each value of A, create a new descendant.
4. Sort the training examples to the leaf nodes according to the attribute value of the branch.
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
Which Attribute is "best"?
[Figure: two candidate splits of the sample [29+,35-]:
A1 = ?   True -> [21+, 5-],   False -> [8+, 30-]
A2 = ?   True -> [18+, 33-],  False -> [11+, 2-]]
Entropy
S is a sample of training examples.
p+ is the proportion of positive examples in S.
p- is the proportion of negative examples in S.
Entropy measures the impurity of S:
Entropy(S) = -p+ log2 p+ - p- log2 p-
Entropy
Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
Why? Information theory: an optimal-length code assigns -log2 p bits to a message having probability p.
So the expected number of bits needed to encode the class (+ or -) of a random member of S is:
-p+ log2 p+ - p- log2 p-
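As a quick check of this formula, here is a minimal Python sketch; the function name entropy and its count-based interface are our own choices, not from the slides:

    import math

    def entropy(pos, neg):
        """Two-class entropy from counts of positive and negative examples."""
        total = pos + neg
        result = 0.0
        for count in (pos, neg):
            if count:                      # treat 0 * log2(0) as 0
                p = count / total
                result -= p * math.log2(p)
        return result

    print(entropy(29, 35))   # ~0.99 for the [29+, 35-] sample used below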
Information Gain
• Gain(S,A): the expected reduction in entropy due to sorting S on attribute A.
Gain(S,A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv|/|S|) Entropy(Sv)
Entropy([29+,35-]) = -29/64 log2 29/64 - 35/64 log2 35/64 = 0.99
[Figure: the same two candidate splits as above: A1 True -> [21+,5-], False -> [8+,30-]; A2 True -> [18+,33-], False -> [11+,2-].]
Information Gain
Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(S,A1) = Entropy(S) - (26/64)*Entropy([21+,5-]) - (38/64)*Entropy([8+,30-]) = 0.27

Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(S,A2) = Entropy(S) - (51/64)*Entropy([18+,33-]) - (13/64)*Entropy([11+,2-]) = 0.12
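The same comparison can be scripted. The sketch below recomputes Gain(S, A1) and Gain(S, A2) from the count pairs above; entropy is redefined so the snippet stands alone, and the helper names are ours:

    import math

    def entropy(pos, neg):
        total = pos + neg
        return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

    def gain(parent, children):
        """parent and children are (pos, neg) count pairs."""
        n = sum(parent)
        remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
        return entropy(*parent) - remainder

    print(gain((29, 35), [(21, 5), (8, 30)]))   # A1: ~0.27
    print(gain((29, 35), [(18, 33), (11, 2)]))  # A2: ~0.12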
Training Examples

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Selecting the Next Attribute
S = [9+,5-], E = 0.940

Humidity:  High -> [3+,4-], E = 0.985;   Normal -> [6+,1-], E = 0.592
Wind:      Weak -> [6+,2-], E = 0.811;   Strong -> [3+,3-], E = 1.000

Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Gain(S, Wind)     = 0.940 - (8/14)*0.811 - (6/14)*1.0   = 0.048
Selecting the Next Attribute
S = [9+,5-], E = 0.940

Outlook:  Sunny -> [2+,3-], E = 0.971;   Overcast -> [4+,0-], E = 0.0;   Rain -> [3+,2-], E = 0.971

Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
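These gains can be reproduced programmatically. The sketch below hard-codes the 14 training examples from the table above (the data layout and helper names are our own) and confirms that Outlook has the largest gain:

    import math
    from collections import Counter, defaultdict

    # (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14.
    DATA = [
        ("Sunny", "Hot", "High", "Weak", "No"),
        ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),
        ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"),
        ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),
        ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),
        ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),
        ("Rain", "Mild", "High", "Strong", "No"),
    ]
    ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, attr_index):
        groups = defaultdict(list)
        for row in rows:
            groups[row[attr_index]].append(row[-1])
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        return entropy([r[-1] for r in rows]) - remainder

    for i, name in enumerate(ATTRS):
        print(name, round(gain(DATA, i), 3))
    # Outlook ~0.247, Temperature ~0.029, Humidity ~0.15, Wind ~0.048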
ID3 Algorithm
[D1,D2,…,D14], [9+,5-]: Outlook is placed at the root.
Sunny:    Ssunny = [D1,D2,D8,D9,D11], [2+,3-]  -> ?
Overcast: [D3,D7,D12,D13], [4+,0-]             -> Yes
Rain:     [D4,D5,D6,D10,D14], [3+,2-]          -> ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.)    = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
The final decision tree:

Outlook
  Sunny    -> Humidity
                High   -> No   [D1,D2,D8]
                Normal -> Yes  [D9,D11]
  Overcast -> Yes  [D3,D7,D12,D13]
  Rain     -> Wind
                Strong -> No   [D6,D14]
                Weak   -> Yes  [D4,D5,D10]
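For completeness, the finished tree can also be written out directly as a tiny classifier. This is a hand transcription of the tree above (not something generated by the slides), with attribute values passed as plain strings:

    def classify(outlook, humidity, wind):
        """Hand-coded version of the final PlayTennis tree."""
        if outlook == "Overcast":
            return "Yes"
        if outlook == "Sunny":
            return "Yes" if humidity == "Normal" else "No"
        if outlook == "Rain":
            return "Yes" if wind == "Weak" else "No"
        raise ValueError("unknown outlook: " + outlook)

    print(classify("Sunny", "High", "Weak"))    # No  (matches D1)
    print(classify("Rain", "High", "Strong"))   # No  (matches D14)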
The ID3 Algorithm
Given
• a set of disjoint target classes {C1, C2, …, Ck},
• a set of training data, S, containing objects of more than one class.
Let T be any test on a single attribute of the data, with O1, O2, …, On representing the possible outcomes of applying T to any object x (written as T(x)).
T produces a partition {S1, S2, …, Sn} of S such that Si = { x | T(x) = Oi }.
[Figure: the test T splits S into subsets S1, S2, …, Sn, one for each outcome O1, O2, …, On.]
Proceed recursively to replace each Si with a decision tree.
Crucial factor: selecting the tests.
In making this decision, Quinlan employs the notion of uncertainty (entropy from information theory):
M = {m1, m2, …, mn}       the set of messages
p(mi)                     the probability of the message mi being received
I(mi) = -log p(mi)        the amount of information of message mi
U(M) = Σi p(mi) I(mi)     the uncertainty of the set M
Quinlan’s assumptions:
• A correct decision tree for S will classify objects in the same proportion as their representation in S.
• Given a case to classify, a test can be regarded as the source of a message about that case.
Let Ni be the number of cases in S that belong to a class Ci; then
p(c ∈ Ci) = Ni / |S|
The uncertainty, U(S), measures the average amount of information needed to determine the class of a random case c ∈ S.
The uncertainty measure after S has been partitioned by the test T is
U_T(S) = Σi (|Si| / |S|) U(Si)
Select the test T that gains the most information, i.e., for which
G_S(T) = U(S) - U_T(S)
is maximal.
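In code, Quinlan's criterion is the same computation as the information gain used earlier, just phrased for an arbitrary number of classes and an arbitrary test T. The sketch below uses base-2 logarithms (bits) and function names of our own choosing:

    import math
    from collections import Counter

    def uncertainty(labels):
        """U(S): average information needed to identify the class of a case in S."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def gain_of_test(labels, outcomes):
        """G_S(T) = U(S) - U_T(S); outcomes[i] is the outcome of T on case i."""
        n = len(labels)
        partition = {}
        for label, outcome in zip(labels, outcomes):
            partition.setdefault(outcome, []).append(label)
        u_t = sum(len(part) / n * uncertainty(part) for part in partition.values())
        return uncertainty(labels) - u_t

    # A test whose first outcome isolates class c1 gains a full bit here:
    print(gain_of_test(["c1", "c1", "c2", "c3"], ["O1", "O1", "O2", "O2"]))  # 1.0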
Evaluation of ID3
• The ID3 algorithm tends to favor tests with a large number of outcomes over tests with a smaller number.
• Its computational complexity depends on the cost of choosing the next test to branch on.
• It was adapted to deal with noisy and incomplete data.
• It is a feasible alternative to knowledge elicitation if sufficient data of the right kind are available.
• However, this method is not incremental.
Further modifications were introduced in C4.5, e.g.:
• pruning the decision tree in order to avoid overfitting
• a better test-selection heuristic
Search Space and Search Trees
• The search space is a logical space composed of
  – nodes, which are search states
  – links, which are all legal connections between search states
    • e.g., in chess there is no link between states where White castles having previously moved the King.
  – it is always just an abstraction
  – think of search algorithms as trying to navigate this extremely complex space
Search Trees
• Search trees do not summarise all possible searches
  – instead, each is an abstraction of one possible search
• The root is the null state
• Edges represent one choice
  – e.g., to set the value of A first
• Child nodes represent extensions
  – children give all possible choices
• Leaf nodes are solutions/failures
• Example in SAT
  – the algorithm detects failure early
  – it need not pick the same variables in every branch
[Figure: a search tree for a SAT instance over the assignments ABC, ABc, AbC, Abc, aBC, abC, abc. From the empty state (), the search first chooses A: the branch a leads to (a), whose extensions (a B) and (a b) are both impossible; the branch A leads to (A), where choosing C gives (AC) while (Ac) is impossible, and from (AC) the extension (ABC) is impossible but (AbC) is a solution.]
Definition
• A tree-shaped structure that represents a set of decisions. These decisions are used as a basis for predictions.
• Decision trees represent rules for classifying datasets. Useful knowledge can be extracted by this classification.

DT Structure
• Node types
  – Decision nodes: each specifies some test to be carried out on a single attribute value.
    • Each outcome is assigned to a branch that connects to a leaf node or another decision node.
  – Leaf nodes: each indicates the classification of an example.
An Example
Growing a Decision Tree
Iterative Dichotomiser 3 (ID3) Algorithm
• Invented by J. Ross Quinlan in 1975
• Based on a greedy search algorithm.
ID3 Cont.
• The goal is to create the best possible tree that works on all available data.
  – Each example strictly belongs to one class or the other.
  – We need to select the attribute that best classifies the examples, i.e., the attribute whose split leaves the smallest entropy over all the examples.
  – The lower the entropy, the higher the information gain. We desire high IG.
Entropy
• A quantitative measurement of the homogeneity of a set of examples.
• It tells us how well an attribute separates the training examples according to their target classification.

Entropy cont.
• Given a set S with only positive or negative examples (the 2-class case):
  Entropy(S) = -Pp log2 Pp - Pn log2 Pn
  where Pp = proportion of positive examples and Pn = proportion of negative examples.
Entropy cont.
• Ex. Given 25 examples with 15 positive and 10 negative:
  Entropy(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) = 0.97
• If Entropy(S) = 0, all members of S belong to strictly one class.
• If Entropy(S) = 1 (the maximum value), members are split equally between the two classes.
In General…
• If an attribute takes more than two values,
  entropy(S) = - Σ_{i=1}^{n} p_i log2(p_i)
  where n is the number of values.
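A one-function sketch of this n-class entropy (base-2 logarithms, class labels supplied as a plain list; the helper name is ours):

    import math
    from collections import Counter

    def entropy(labels):
        """-sum_i p_i * log2(p_i) over the class proportions in labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    # Three classes with 2, 3 and 5 examples -> about 1.49 bits.
    print(entropy(["a"] * 2 + ["b"] * 3 + ["c"] * 5))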
Information Gain
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv|/|S|) Entropy(Sv)
where A is an attribute of S,
Values(A) is the set of possible values of A,
v is a particular value in Values(A), and
Sv is the subset of S whose examples take value v for attribute A.
Actual Algorithm

ID3 (Examples, Target_Attribute, Attributes)
  Create a root node for the tree.
  If all examples are positive, return the single-node tree Root, with label = +.
  If all examples are negative, return the single-node tree Root, with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root,
    with label = the most common value of the target attribute in the examples.
  Otherwise begin:
    A = the attribute that best classifies the examples.
    Decision tree attribute for Root = A.
    For each possible value, vi, of A:
      Add a new tree branch below Root, corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have the value vi for A.
      If Examples(vi) is empty:
        Below this new branch add a leaf node with label = the most common
        target value in the examples.
      Else:
        Below this new branch add the subtree
        ID3 (Examples(vi), Target_Attribute, Attributes - {A}).
  End
  Return Root
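A runnable Python rendering of this pseudocode is sketched below. It is our own transcription, so details such as tie-breaking follow the pseudocode as literally as possible; each example is assumed to be a dict mapping attribute names to values, with the target attribute stored under its own key:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def best_attribute(examples, target, attributes):
        """The attribute with the highest information gain on these examples."""
        base = entropy([ex[target] for ex in examples])
        def gain(attr):
            remainder = 0.0
            for v in set(ex[attr] for ex in examples):
                subset = [ex[target] for ex in examples if ex[attr] == v]
                remainder += len(subset) / len(examples) * entropy(subset)
            return base - remainder
        return max(attributes, key=gain)

    def id3(examples, target, attributes):
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:            # all examples in one class
            return labels[0]
        if not attributes:                   # no predicting attributes left
            return Counter(labels).most_common(1)[0][0]
        a = best_attribute(examples, target, attributes)
        tree = {a: {}}
        for v in set(ex[a] for ex in examples):
            # v is drawn from the examples, so Examples(v) is never empty here;
            # the pseudocode's "most common value" leaf covers the general case.
            subset = [ex for ex in examples if ex[a] == v]
            tree[a][v] = id3(subset, target, [x for x in attributes if x != a])
        return tree

Applied to the 14 PlayTennis examples from the earlier table (attributes Outlook, Temperature, Humidity and Wind, target PlayTennis), this should return the nested-dictionary form of the tree shown earlier: Outlook at the root, Humidity under Sunny and Wind under Rain.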
Independent Attributes                            Dependent
Outlook    Temperature  Humidity  Windy    Play
sunny      85           85        FALSE    Don't play
sunny      80           90        TRUE     Don't play
overcast   83           78        FALSE    Play
rain       70           95        FALSE    Play
rain       68           80        FALSE    Play
rain       65           70        TRUE     Don't play
overcast   64           65        TRUE     Play
sunny      72           95        FALSE    Don't play
sunny      69           70        FALSE    Play
rain       75           80        FALSE    Play
sunny      75           70        TRUE     Play
overcast   72           90        TRUE     Play
overcast   81           75        FALSE    Play
rain       71           80        TRUE     Don't play

We choose Play as our dependent attribute.
Don't play = Negative
Play = Positive
Splitting the full 14-example table on each attribute (the Entropy column is the weighted contribution (|Sv|/|S|) * Entropy(Sv) of each branch):

Outlook    play  don't play  total  Entropy
sunny      2     3           5      0.34676807
overcast   4     0           4      0
rain       3     2           5      0.34676807
total                               0.69353614

Temp       play  don't play  total  Entropy
>70        5     4           9      0.63712032
<=70       4     1           5      0.25783146
total                               0.89495179

Humidity   play  don't play  total  Entropy
>70        6     4           10     0.69353614
<=70       3     1           4      0.23179375
total                               0.92532989

Windy      play  don't play  total  Entropy
TRUE       3     3           6      0.42857143
FALSE      6     2           8      0.4635875
total                               0.89215893
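The weighted entropies in these tables can be checked mechanically. The sketch below hard-codes the 14 rows from the table above and applies the same >70 / <=70 split to Temperature and Humidity; the thresholds follow the slides, while the code and names are our own:

    import math
    from collections import Counter

    # (Outlook, Temperature, Humidity, Windy, Play) rows from the table above.
    ROWS = [
        ("sunny", 85, 85, False, "Don't play"), ("sunny", 80, 90, True, "Don't play"),
        ("overcast", 83, 78, False, "Play"),    ("rain", 70, 95, False, "Play"),
        ("rain", 68, 80, False, "Play"),        ("rain", 65, 70, True, "Don't play"),
        ("overcast", 64, 65, True, "Play"),     ("sunny", 72, 95, False, "Don't play"),
        ("sunny", 69, 70, False, "Play"),       ("rain", 75, 80, False, "Play"),
        ("sunny", 75, 70, True, "Play"),        ("overcast", 72, 90, True, "Play"),
        ("overcast", 81, 75, False, "Play"),    ("rain", 71, 80, True, "Don't play"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def weighted_entropy(key):
        """Sum over branches of (|Sv|/|S|) * Entropy(Sv), as in the tables above."""
        branches = {}
        for row in ROWS:
            branches.setdefault(key(row), []).append(row[-1])
        return sum(len(b) / len(ROWS) * entropy(b) for b in branches.values())

    print(weighted_entropy(lambda r: r[0]))       # Outlook  ~0.694
    print(weighted_entropy(lambda r: r[1] > 70))  # Temp     ~0.895
    print(weighted_entropy(lambda r: r[2] > 70))  # Humidity ~0.925
    print(weighted_entropy(lambda r: r[3]))       # Windy    ~0.892

Outlook has the smallest total weighted entropy, and therefore the largest information gain, so it becomes the root of the tree; that is exactly what the sunny, rain, and overcast sub-tables that follow assume.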
Independent Attributes (Outlook = sunny)          Dependent
Outlook    Temperature  Humidity  Windy    Play
sunny      85           85        FALSE    Don't play
sunny      80           90        TRUE     Don't play
sunny      72           95        FALSE    Don't play
sunny      69           70        FALSE    Play
sunny      75           70        TRUE     Play

Temp       play  don't play  total  Entropy
>70        1     3           4      0.6490225
<=70       1     0           1      0
total                               0.6490225

Humidity   play  don't play  total  Entropy
>70        0     3           3      0
<=70       2     0           2      0
total                               0

Windy      play  don't play  total  Entropy
TRUE       1     1           2      0.4
FALSE      1     2           3      0.5509775
total                               0.9509775

Humidity gives a total weighted entropy of 0, so it is chosen for the sunny branch.
Independent Attributes (Outlook = rain)           Dependent
Outlook    Temperature  Humidity  Windy    Play
rain       70           95        FALSE    Play
rain       68           80        FALSE    Play
rain       65           70        TRUE     Don't play
rain       75           80        FALSE    Play
rain       71           80        TRUE     Don't play

Temp       play  don't play  total  Entropy
>70        1     1           2      0.4
<=70       2     1           3      0.5509775
total                               0.9509775

Humidity   play  don't play  total  Entropy
>70        3     1           4      0.6490225
<=70       0     1           1      0
total                               0.6490225

Windy      play  don't play  total  Entropy
TRUE       0     2           2      0
FALSE      3     0           3      0
total                               0

Windy gives a total weighted entropy of 0, so it is chosen for the rain branch.
Independent Attributes (Outlook = overcast)       Dependent
Outlook    Temperature  Humidity  Windy    Play
overcast   83           78        FALSE    Play
overcast   64           65        TRUE     Play
overcast   72           90        TRUE     Play
overcast   81           75        FALSE    Play

Temp       play  don't play  total  Entropy
>70        3     0           3      0
<=70       1     0           1      0
total                               0

Humidity   play  don't play  total  Entropy
>70        3     0           3      0
<=70       1     0           1      0
total                               0

Windy      play  don't play  total  Entropy
TRUE       2     0           2      0
FALSE      2     0           2      0
total                               0

Every overcast example is Play, so this branch becomes a Yes leaf.
ID3 Summary
• Step 1: Take all unused attributes and compute their entropy with respect to the training samples.
• Step 2: Choose the attribute with the smallest entropy (equivalently, the highest information gain).
• Step 3: Make a node containing that attribute.

Growth stops when:
• every attribute already exists along the path through the tree, or
• the training examples associated with a leaf all have the same target attribute value.
References
• An Overview of Data Mining Techniques
  – http://www.thearling.com/text/dmtechniques/dmtechniques.htm
• Decision tree
  – http://en.wikipedia.org/wiki/Decision_tree
• Decision Trees
  – http://dms.irb.hr/tutorial/tut_dtrees.php