Presentation
Download
Report
Transcript Presentation
Knowledge Mining and Soil Mapping
using Maximum Likelihood Classifier
with Gaussian Mixture Models
ECE539 final project
Instructor: Yu Hen Hu
Fall 2005
Jian Liu
12/13/2005
Overview
This study deals with data mining from soil
survey maps and soil mapping with mined
soil-landscape knowledge.
Soil – landscape models
Soil is a product of the interaction of surrounding
environments
“soil-landscape model” (Hudson, 1992)
Soil can be predicated given the environments
Environmental variables
Environmental factors affecting soil formation:
o Bedrock geology
o Elevation (DEM)
o Slope gradient
1st derivative along the steepest slope
o Profile curvature
2nd derivative along the steepest slope
o Planform curvature
2nd derivative perpendicular to contour lines
Previous Approaches & Problems
Fuzzy system (Zhu 2001)
Elicits knowledge from a soil scientist and
represents it with arbitrary curves
Assumes independence of each environmental
variable
ANN (Zhu 2000; Behrens 2005; Scull 2005 )
Black box knowledge representation
High dimensional matrix is hard to comprehend
Decision trees (Bui, 1999; Qi et.al. 2003)
Knowledge extracted is crisp (typical case), no
information about gradation
Proposal – Knowledge Representation
GMM representation is more suitable because:
Probability representation well captures the physical
gradation of the phenomenon
The interactions between environmental variables are
taken into account by the multivariate Gaussian
distribution
p( x)
1
exp ( x ) T 1 ( x )
2
(2 ) d | |
1
Mixture model gives a great potential of capturing the
real distribution
Physically a soil type may have multiple instances.
c
p ( x | ) p( x | (i ), i ) p (i )
i 1
Proposal – Maximum Likelihood Classifier
Maximum likelihood
P(A|Class1) = 0.8
P(A|Class2) = 0.5
A then is classified into class1 based on
“Maximum likelihood”
Naturally evaluates the composite effect
environmental variables have on the probability of
soil formation
Algorithm
Training procedure:
Standardize feature dimensions of training set
For each geology group in the training data
For each soil type in the geology group
Fit a GMM using EM algorithm (# of mixtures is preset,
k-means is used to initialize the cluster centers)
Testing procedure:
Standardize feature dimensions of testing set
For each sample point
For each class in the corresponding geology group
Calculate the corresponding likelihood based on GMM
The point is classified to the class with the maximum likelihood
Case Study
Training set
elevation
slope gradient
profile curvature
planform curvature
geology
soil map
Testing set
elevation
…
geology
soil map
Evaluation of the GMM representation
The GMM representations well capture the gradation
of soil on the landscape, which complies well with
expert knowledge
e.g. Council at footslope
e.g. Elbaville at backslope
Training accuracy & testing accuracy
Overall, 80% classification accuracy against testing data
Increasing number of mixtures leads to higher classification
accuracy
at an expense of exponentially increasing storage and
computational load
classification accuracy (%)
geology area 1
geology area 2
# of mixtures
training
testing
training
testing
1
70.04
68.07
79.80
77.13
2
76.66
74.50
78.99
76.84
4
81.51
79.27
80.03
75.55
8
83.17
80.12
84.07
79.23
Classification Accuracy vs. # of Mixtures
in geology area 2
100
90
90
80
classification accuracy (%)
classification accuracy (%)
in geology area 1
100
70
60
training accuracy
50
testing accuracy
40
30
20
10
80
70
60
training accuracy
50
testing accuracy
40
30
20
10
0
0
0
2
4
6
# of mixtures
8
10
0
2
4
6
# of mixtures
8
10
Mapping accuracy based on field data
64 points are correctly classified out of 83 field sample points
(77%), higher than traditional manual based soil survey (usually
60%)
Classification result using 8 mixtures
(the dark blue areas are not mapped)
More comments
Standardization of feature dimensions is very
effective, -- improves mapping accuracy from 55% to
80%
Preprocessing techniques such as data cleaning
required by decision tree is not critical to ML because
the ML classifier is not as sensitive to training errors
as long as they are not of a huge amount.
Conclusion
GMM is suitable to represent soil-landscape
knowledge
ML classifier with GMMs is promising for soil
knowledge mining and soil mapping
Future improvement?
Reduce the storage and computational load so that
bigger number of mixtures can be used to improve
classification accuracy
Use diagonal matrix to replace full covariance
matrix (after applying de-correlation to the
features)?