
Data Mining, Decision Trees
and Earthquake Prediction
Professor Sin-Min Lee
What is Data Mining?
• The process of automatically finding relationships and patterns in, and extracting meaning from, enormous amounts of data
• Also called “knowledge discovery”
Objective
• Extracting hidden, or not easily recognizable, knowledge from large data… know the past
• Predicting what is likely to happen if a particular type of event occurs… predict the future
Application
• Marketing example
– Sending direct mail to randomly chosen
people
– Database of recipients’ attribute data (e.g.
gender, marital status, # of children, etc) is
available
– How can this company increase the response
rate of direct mail?
Application (Cont’d)
• Figure out the patterns and relationships of attributes that the people who responded have in common
• Helps decide what kind of group of people the company should target
• Data mining helps analyze large amounts of data and make decisions… but how exactly does it work?
• One method that is commonly used is the decision tree
Decision Tree
• One of many methods to perform data mining - particularly classification
• Divides the dataset into multiple groups by evaluating attributes
• A decision tree can be explained as a series of nested if-then-else statements (see the sketch below)
• The decision tree is one of the most popular classification algorithms in current use in data mining
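To make the nested if-then-else view concrete, here is a minimal sketch in Python, using the direct-mail example above; the attribute names, thresholds, and structure are invented for illustration and are not taken from the slides.

```python
def will_respond(gender, marital_status, num_children):
    """Hypothetical decision tree for the direct-mail example,
    written as nested if-then-else statements."""
    if marital_status == "Married":        # root node tests one attribute
        if num_children >= 2:
            return "respond"               # leaf node (class label)
        else:
            return "no response"           # leaf node
    else:
        if gender == "F":
            return "respond"               # leaf node
        else:
            return "no response"           # leaf node

print(will_respond("F", "Single", 0))      # -> respond
```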
Decision Tree (Cont’d)
• Each non-leaf node has an associated predicate that tests an attribute of the data
• Each leaf node represents a class, or category
• To classify a data record, start from the root node and traverse down the tree by testing predicates and taking branches
Example of Decision Tree
What is a Decision Tree?
• 20-Questions Example
– Progressive Yes-No Decisions Until an
Answer is Obtained
• 20-Questions Machine at Linens &
Things
• Key to the Phylum – classification tool
– Carl Linnaeus, Swedish Botanist, 1730’s
– Classifies known species:
• Kingdoms, Phyla, Classes, Orders, Families,
Genera, and Species
What is a Decision Tree?
[Figure: example decision tree. The root node tests Body Temperature; the Cold-Blooded branch leads directly to the leaf Non-Mammal. Internal nodes on the Warm-Blooded branch test further attributes (Hibernates, Four-Legged), and each leaf node gives the class Mammal or Non-Mammal.]
What are Decision Trees Used
For?
How to Use a Decision Tree
Start from the root of the tree.
[Figure: the model tree and a test record. The root tests Refund (Yes → NO); if Refund = No, test MarSt (Married → NO; Single, Divorced → test TaxInc: < 80K → NO, >= 80K → YES). Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? Applying the model to test data in this way is called deduction; a sketch of the traversal follows.]
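A minimal sketch of this deduction step, representing the tree above as nested dictionaries and traversing it for the test record; the data structure is an illustration, not the slides' own representation.

```python
# Decision tree from the slide, as nested dicts: an internal node maps an
# attribute name to {branch value: subtree}; a leaf is just a class label.
tree = {"Refund": {
    "Yes": "NO",
    "No": {"MarSt": {
        "Married": "NO",
        "Single": {"TaxInc": {"< 80K": "NO", ">= 80K": "YES"}},
        "Divorced": {"TaxInc": {"< 80K": "NO", ">= 80K": "YES"}},
    }},
}}

def classify(node, record):
    """Traverse from the root, testing one attribute per internal node."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        if attribute == "TaxInc":                     # continuous attribute
            branch = "< 80K" if record["TaxInc"] < 80 else ">= 80K"
        else:
            branch = record[attribute]
        node = branches[branch]
    return node

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, test_record))   # -> NO (predicted class for Cheat)
```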
How to Make a Decision Tree
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes

Induction over this training data yields the model, a decision tree (the “Splitting Attributes” label in the figure points at its internal test nodes):
Refund (Yes → NO; No → test MarSt); MarSt (Married → NO; Single, Divorced → test TaxInc); TaxInc (< 80K → NO; >= 80K → YES).

Model: Decision Tree
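As a hedged sketch of how such a tree could be induced in practice (the slides do not prescribe a library), the training table above can be fed to scikit-learn's DecisionTreeClassifier, assuming scikit-learn and pandas are installed; the column names and one-hot encoding are my own choices for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data from the slide (Taxable Income in thousands).
data = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(data[["Refund", "MarSt", "TaxInc"]])  # one-hot encode categoricals
y = data["Cheat"]

model = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(model, feature_names=list(X.columns)))  # text view of the tree
```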
Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (a sketch follows the figure below)
[Figure: the full training set (Tid 1-10, the same table shown above) reaches the root node, so Dt initially contains all ten records and the predicted class is still “?”.]
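A minimal sketch of the general procedure in Python; the splitting step simply takes whichever attribute is supplied next, since the slides defer attribute selection to the purity measures discussed later, and a majority-class fallback is added as an assumption for when no attributes remain.

```python
from collections import Counter

def hunt(records, attributes, default_class=None):
    """Recursively grow a decision tree following Hunt's algorithm.
    `records` is a list of (attribute_dict, class_label) pairs."""
    if not records:                          # empty Dt -> leaf with default class
        return default_class
    classes = [label for _, label in records]
    if len(set(classes)) == 1:               # all records in the same class -> leaf
        return classes[0]
    if not attributes:                       # no test left -> majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    majority = Counter(classes).most_common(1)[0][0]
    tree = {attr: {}}
    for value in {rec[attr] for rec, _ in records}:   # split Dt on attr's values
        subset = [(rec, label) for rec, label in records if rec[attr] == value]
        tree[attr][value] = hunt(subset, rest, default_class=majority)
    return tree

records = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
           ({"Refund": "No",  "MarSt": "Married"}, "No"),
           ({"Refund": "No",  "MarSt": "Single"}, "Yes")]
print(hunt(records, ["Refund", "MarSt"]))
```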
Hunt’s Algorithm
[Figure: Hunt's algorithm growing the tree step by step on the Tid 1-10 training data shown above.
Step 1: a single leaf predicting the default class, Don’t Cheat.
Step 2: split on Refund: Yes → Don’t Cheat; No → Don’t Cheat.
Step 3: on the Refund = No branch, split on Marital Status: Married → Don’t Cheat; Single, Divorced → Cheat.
Step 4: on the Single, Divorced branch, split on Taxable Income: < 80K → Don’t Cheat; >= 80K → Cheat.]
Measure of Purity: Gini
• Gini index for a given node t:

  Gini(t) = 1 - Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0.0) when all records belong to one class, implying most interesting information

• Examples (counts of classes C1 and C2 at a node):
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500
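A small sketch verifying the Gini values above, assuming the formula as stated:

```python
def gini(counts):
    """Gini index for a node, given the class counts at that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1}, C2={c2}: Gini={gini([c1, c2]):.3f}")
# -> 0.000, 0.278, 0.444, 0.500, matching the table above
```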
Advantages of Decision Trees
• Simple to understand and interpret
• Require little data preparation
• Able to handle both numerical and categorical data
• Perform well with large data in a short time
• The conditions along a path are easily explained by Boolean logic
Advantages of Decision Trees (Cont’d)
• Easy to visualize the process of classification
  – Can easily tell why a record is classified into a particular category: just trace the path to the leaf, and it explains the reason
• Simple, fast processing
  – Once the tree is built, just traverse down the tree to classify the data
Decision Tree is for…
• Classifying datasets in which
  – The predicates return discrete values
  – No attribute has the same value for every record
CMT catalog: Shallow earthquakes, 1976-2005
INDIAN PLATE MOVES NORTH
COLLIDING WITH EURASIA
Gordon & Stein, 1992
COMPLEX
PLATE
BOUNDARY
ZONE IN
SOUTHEAST
ASIA
Northward motion of
India deforms all of
the region
Many small plates
(microplates) and
blocks
Molnar & Tapponier, 1977
India subducts
beneath Burma
microplate
at about 50
mm/yr
Earthquakes
occur at plate
interface along
the Sumatra arc
(Sunda trench)
[Figures: NOAA]
IN DEEP OCEAN tsunami has long wavelength, travels fast,
small amplitude - doesn’t affect ships
AS IT APPROACHES SHORE, it slows. Since energy is
conserved, amplitude builds up - very damaging
TSUNAMI WARNING
Deep ocean buoys can measure
wave heights, verify tsunami and
reduce false alarms
Because seismic waves travel much
faster (km/s) than tsunamis, rapid
analysis of seismograms can identify
earthquakes likely to cause major
tsunamis and predict when waves will
arrive
HOWEVER, HARD TO PREDICT EARTHQUAKES
recurrence is highly variable
(Sieh et al., 1989)
Extend earthquake history with geologic records (paleoseismology)
M > 7: mean 132 yr, σ = 105 yr
Estimated probability in 30 yrs: 7-51%
EARTHQUAKE RECURRENCE
AT SUBDUCTION ZONES IS
COMPLICATED
In many subduction zones, thrust
earthquakes have patterns in
space and time. Large
earthquakes occurred in the
Nankai trough area of Japan
approximately every 125 years
since 1498 with similar fault areas
In some cases entire region
seems to have slipped at once; in
others slip was divided into
several events over a few years.
Repeatability suggests that a
segment that has not slipped for
some time is a gap due for an
earthquake, but it’s hard to use
this concept well because of
variability
[Figure annotations: GAP? NOTHING YET (Ando, 1975)]
1985 MEXICO EARTHQUAKE
• SEPTEMBER 19,
1985
• M8.1
• A SUBDUCTION
ZONE QUAKE
• ALTHOUGH
LARGER THAN
USUAL, THE
EARTHQUAKE WAS
NOT A “SURPRISE”
• A GOOD, MODERN
BUILDING CODE
HAD BEEN
ADOPTED AND
IMPLEMENTED
1985 MEXICO EARTHQUAKE
• EPICENTER LOCATED 240 KM FROM MEXICO CITY
• 400 BUILDINGS COLLAPSED IN OLD LAKE BED ZONE OF MEXICO CITY
• SOIL-STRUCTURE RESONANCE IN OLD LAKE BED ZONE WAS A MAJOR FACTOR
1985 MEXICO EARTHQUAKE:
ESSENTIAL STRUCTURES-SCHOOLS
1985 MEXICO EARTHQUAKE:
STEEL FRAME BUILDING
1985 MEXICO EARTHQUAKE:
POUNDING
1985 MEXICO EARTHQUAKE:
NUEVA LEON APARTMENT
BUILDINGS
1985 MEXICO EARTHQUAKE:
SEARCH AND RESCUE
• Definition
• Characteristics
• Project: California Earthquake Prediction (CAEP)
Characteristics (cont.)
• 2. Locality: the information transferred by a neuron is limited to its nearby neurons.
• CAEP: short-term earthquake prediction is highly influenced by the local geologic structure.
Characteristics (cont.)
• 3. Weighted sum and activation function with nonlinearity: the input signal is weighted at the synaptic connection by a connection weight (see the sketch below).
• CAEP: nearby locations are weighted within each activation function.
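As a minimal sketch of the weighted sum plus nonlinear activation described above (the inputs, weights, and choice of sigmoid are invented for illustration):

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a nonlinear (sigmoid) activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))   # sigmoid nonlinearity

# e.g. signals from three nearby locations, each with its own connection weight
print(neuron_output(inputs=[0.2, 0.7, 0.1], weights=[0.5, 1.5, -0.3]))
```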
Characteristics (cont.)
• 4. Plasticity: connection weights change according to the information fed to the neuron and the internal state. This plasticity of the connection weights leads to learning and self-organization, and it realizes adaptability to a continuously varying environment.
• CAEP: calculate the stress at the point of interest according to the seismic-wave history in the surrounding area.
Characteristics (cont.)
• 5. Generalization: a neural network constructs its own view of the world by inferring an optimal action on the basis of previously learned events, by interpolation and extrapolation.
• CAEP: obtain a view of one area from past experience by pattern representation and prediction.
Basic Function of CSEP
• Neuron: list of
locations along San
Andreas Fault, and
two of the
associated faults—
Hayward and
Calaveras.
Basic Function of CSEP (cont.)
• Neuron’s parameters: magnitude, date, latitude,
longitude, depth, location, ground water,
observation, etc.
Learning
• Learning is essential for unknown environments,
– i.e., when designer lacks omniscience
• Learning is useful as a system construction
method,
– i.e., expose the agent to reality rather than trying to
write it down
• Learning modifies the agent's decision
mechanisms to improve performance
Learning agents
Learning element
• Design of a learning element is affected by
– Which components of the performance element are to
be learned
– What feedback is available to learn these
components
– What representation is used for the components
• Type of feedback:
– Supervised learning: correct answers for each
example
– Unsupervised learning: correct answers not given
– Reinforcement learning: occasional rewards
Inductive learning
• Simplest form: learn a function from examples
  – f is the target function
  – An example is a pair (x, f(x))
  – Problem: find a hypothesis h such that h ≈ f, given a training set of examples
• (This is a highly simplified model of real learning: it ignores prior knowledge.)
Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
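For concreteness, one training example with these attributes might be represented as follows; the particular values are invented for illustration, not taken from the slides' example table.

```python
# One hypothetical training example for the restaurant-waiting problem:
# the ten attribute values plus the target classification WillWait.
example = {
    "Alternate": True, "Bar": False, "Fri/Sat": False, "Hungry": True,
    "Patrons": "Some", "Price": "$$", "Raining": False,
    "Reservation": True, "Type": "Thai", "WaitEstimate": "0-10",
    "WillWait": True,   # the target attribute f(x)
}
print(example["Patrons"], "->", example["WillWait"])
```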
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table
  [Table of 12 example situations omitted in this transcript]
• Classification of examples is positive (T) or negative (F)
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to wait:
Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf
• Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
• Prefer to find more compact decision trees
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
• E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
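A quick check of that count under the 2^(2^n) formula:

```python
n = 6
num_trees = 2 ** (2 ** n)          # 2^(2^n) distinct Boolean functions of n inputs
print(num_trees)                    # 18446744073709551616
assert num_trees == 18_446_744_073_709_551_616
```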
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
• Each attribute can be in (positive), in (negative), or out
  ⇒ 3^n distinct conjunctive hypotheses
• More expressive hypothesis space
  – increases chance that the target function can be expressed
  – increases number of hypotheses consistent with the training set
  ⇒ may get worse predictions
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of
(sub)tree
Choosing an attribute
• Idea: a good attribute splits the examples into subsets
that are (ideally) "all positive" or "all negative"
• Patrons? is a better choice
Using information theory
• To implement Choose-Attribute in the DTL
algorithm
• Information content (entropy):

  I(P(v1), …, P(vn)) = Σ_i -P(vi) log2 P(vi)

• For a training set containing p positive examples and n negative examples:

  I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Information gain
• A chosen attribute A divides the training set E into subsets E1, …, Ev according to their values for A, where A has v distinct values:

  remainder(A) = Σ_{i=1..v} ((p_i + n_i)/(p + n)) · I(p_i/(p_i + n_i), n_i/(p_i + n_i))

• Information gain (IG), or the reduction in entropy from the attribute test:

  IG(A) = I(p/(p+n), n/(p+n)) - remainder(A)

• Choose the attribute with the largest IG
Information gain
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
Consider the attributes Patrons and Type (and others too):

  IG(Patrons) = 1 - [ (2/12) I(0, 1) + (4/12) I(1, 0) + (6/12) I(2/6, 4/6) ] = 0.541 bits
  IG(Type) = 1 - [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0 bits

Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
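A short check of these two numbers, repeating the entropy sketch from above so the snippet stands alone; the (p_i, n_i) counts per attribute value are the ones used in the formulas just shown.

```python
import math

def entropy(p, n):
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

def information_gain(split, p=6, n=6):
    """`split` is a list of (p_i, n_i) counts, one pair per attribute value."""
    remainder = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in split)
    return entropy(p, n) - remainder

print(information_gain([(0, 2), (4, 0), (2, 4)]))           # Patrons -> ~0.541 bits
print(information_gain([(1, 1), (1, 1), (2, 2), (2, 2)]))   # Type    -> 0.0 bits
```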
Example contd.
• Decision tree learned from the 12 examples:
• Substantially simpler than the “true” tree: a more complex hypothesis isn’t justified by the small amount of data