Data Preparation
國立雲林科技大學
National Yunlin University of Science and Technology
Exploiting data preparation to enhance
mining and knowledge discovery
Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Balaji Rajagopalan and Mark W. Isken
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART C:
APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001
Intelligent Database Systems Lab
N.Y.U.S.T.
I.M.
Outline
Motivation
Objective
Introduction
Data Preparation
Research Method
Results
Motivation
Organizations want to use their data for mining and
knowledge discovery
Organizational data is often not amenable to mining in its natural form
Objective
Show that data enhancement, through the introduction of new
attributes along with judicious aggregation of
existing attributes,
results in higher-quality knowledge discovery
Examine the differential impact of data enhancement on the
performance of different mining algorithms
Introduction
Exponential growth of information has resulted in a
tremendous volume of data for knowledge
workers.
Knowledge management solution
Knowledge repository
Knowledge sharing
Knowledge discovery
Data Preparation
Present a framework based on prior research in
knowledge discovery
Data quality
Data characteristics
Data preparation
Research Method
A data set from a large tertiary care hospital in
the United States was used
The method is presented under the following topics:
A. Problem Domain
B. Data
C. Clustering Algorithms for Knowledge Discovery
D. Entropy-Based Metrics for Cluster Quality
Assessment
E. Rule Extraction Metrics
Problem Domain
Allocation of inpatient beds
The harder part is applying quantitative resource
allocation to a manageable set of patient types
Quantitative resource:
the sequence of hospital units visited and the corresponding
lengths of stay
Patient type:
a group of patients consuming a similar level of hospital
resources
Problem Domain
This is referred to as the patient classification
problem
Trade-off: too few vs. too many patient types
The key is to identify the set of patient types
Data
Inpatient obstetrical and gynecological
(OB/GYN) patient flow
The data set contains numerous fields, including:
demographics
physician information
ICD9-CM diagnostic and procedure codes
diagnosis-related groups (DRGs)
Data
Almost 500 DRGs are defined
DRGs in the range 353-384 are related to OB/GYN
These DRGs are grouped into five DRG types
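The aggregation step can be illustrated with a minimal sketch. Only the 353-384 range and the count of five types come from the slide; the group boundaries and the labels `type_1` to `type_5` are hypothetical placeholders, since the slide does not list the actual five DRG types.

```python
from typing import List, Optional, Tuple

# DRG range the slide identifies as OB/GYN-related (353..384 inclusive).
OBGYN_DRG_RANGE = range(353, 385)

# HYPOTHETICAL grouping: boundaries and labels are placeholders only,
# not the grouping used in the paper.
PLACEHOLDER_GROUPS: List[Tuple[int, int, str]] = [
    (353, 359, "type_1"),
    (360, 365, "type_2"),
    (366, 371, "type_3"),
    (372, 377, "type_4"),
    (378, 384, "type_5"),
]


def drg_type(drg: int,
             groups: List[Tuple[int, int, str]] = PLACEHOLDER_GROUPS) -> Optional[str]:
    """Map an OB/GYN DRG code to an aggregate DRG type, or None if out of range."""
    if drg not in OBGYN_DRG_RANGE:
        return None
    for low, high, label in groups:
        if low <= drg <= high:
            return label
    return None
```

Passing the grouping as a parameter keeps the domain knowledge (which DRGs belong together) separate from the lookup logic, so the real grouping could be substituted without changing the function.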
Clustering Algorithms for Knowledge
Discovery
K-means and Kohonen self-organizing maps
Similarity
Euclidean distance function
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
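As a rough illustration, the Euclidean distance used as the similarity measure, and the way a centroid-based method such as K-means would use it to assign a case to a cluster, might look like the sketch below. The function names are illustrative assumptions, and this is not the implementation used in the study.

```python
import math
from typing import Sequence


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the summed squared differences over n attributes."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def nearest_centroid(point: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid closest to `point` (the K-means assignment step)."""
    distances = [euclidean_distance(point, c) for c in centroids]
    return distances.index(min(distances))
```

Because the distance sums raw squared differences, attributes on larger scales dominate the result, which is one reason the choice and preparation of attributes matters for clustering.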
Entropy-Based Metrics for Cluster
Quality Assessment
Entropy
$E_j = \sum_i p_{ij} \log_2 \frac{1}{p_{ij}}$
Let $n_{ij}$ be the number of cases having a
DRG type of $i$ in cluster $j$
$p_{ij} = n_{ij} / \sum_l n_{lj}$
Weighted Entropy
using cluster size as the weight,
calculate a weighted average entropy measure for
a cluster solution
Purity, let
$P_j = \max_i p_{ij}$
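The three metrics can be sketched as follows, taking for each cluster j a list of counts n_ij (one per DRG type i). Weighting each cluster's entropy by its share of all cases is an assumption here; the slide only says that cluster size is used as the weight.

```python
import math
from typing import List


def cluster_entropy(counts: List[int]) -> float:
    """E_j = sum_i p_ij * log2(1 / p_ij), where p_ij = n_ij / sum_l n_lj."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts:
        if n > 0:  # terms with p_ij = 0 contribute nothing
            p = n / total
            entropy += p * math.log2(1 / p)
    return entropy


def cluster_purity(counts: List[int]) -> float:
    """P_j = max_i p_ij, the share of the dominant DRG type in the cluster."""
    total = sum(counts)
    return max(counts) / total if total else 0.0


def weighted_entropy(table: List[List[int]]) -> float:
    """Average of E_j over clusters, weighted by cluster size (assumed scheme)."""
    grand_total = sum(sum(counts) for counts in table)
    if grand_total == 0:
        return 0.0
    return sum(sum(counts) / grand_total * cluster_entropy(counts)
               for counts in table)
```

Lower entropy and higher purity both indicate that a cluster is dominated by a single DRG type.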
Rule Extraction Metrics
We expect most of the extracted rules to show a high degree of
resonance with our domain knowledge
Results
The results detail the data enhancements relevant to this
study
A. Data Preparation : Basics
B. Mining and Knowledge Discovery
C. Differential Impact Based on Clustering Method
D. Usefulness of Knowledge Discovered
E. Limitations
F. Implications for Research and Practice
Data Preparation : Basics
The data set included fields that represent the path of hospital
units visited and the associated lengths of stay along that path
Data Preparation : Basics
Three data sets are considered in order
to illustrate the impact of data preparation
ED1
Eight numeric variables
Data Preparation : Basics
ED2
Both DRG and CCS were designed to serve as
aggregate measures of hospital resource
consumption
In addition to the variables in ED1, ED2 adds five nominal variables
Data Preparation : Basics
ED3
In addition to the variables in ED2, ED3 contains two binary
variables
whether or not the patient gave birth during the visit
whether or not the patient gave birth via C-section
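The nesting of the three data sets (ED1 within ED2 within ED3) can be sketched as successive projections of one patient record. Only the variable counts (eight numeric, five nominal, two binary) come from the slides; every field name below is a hypothetical placeholder.

```python
from typing import Any, Dict, List

# HYPOTHETICAL field names: the slides give only the variable counts.
ED1_FIELDS: List[str] = [f"num_{k}" for k in range(1, 9)]   # eight numeric
ED2_EXTRA: List[str] = [f"nom_{k}" for k in range(1, 6)]    # five nominal
ED3_EXTRA: List[str] = ["gave_birth", "c_section"]          # two binary


def project(record: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Keep only the named fields of a record."""
    return {name: record[name] for name in fields}


def build_data_sets(record: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return the ED1, ED2 and ED3 views of a single patient record."""
    ed2_fields = ED1_FIELDS + ED2_EXTRA
    ed3_fields = ed2_fields + ED3_EXTRA
    return {
        "ED1": project(record, ED1_FIELDS),
        "ED2": project(record, ed2_fields),
        "ED3": project(record, ed3_fields),
    }
```

Building each data set as a superset of the previous one makes it possible to attribute any change in cluster quality to the variables added at that step.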
Mining and Knowledge Discovery
Differential Impact Based on Clustering
Method
Usefulness of Knowledge Discovered
Limitations
The results may not be exactly applicable in every case
Only two data mining algorithms were examined
K-means and Kohonen self-organizing maps
The study is illustrative, not exhaustive
Domain knowledge played a critical role in the
data preparation process
Implications for Research and Practice
Provides empirical evidence demonstrating the
impact of data preparation on mining and
knowledge discovery
Engage in a comparative investigation of
multiple algorithms
Personal opinion
…