Data Preparation

National Yunlin University of Science and Technology (國立雲林科技大學)
Exploiting Data Preparation to Enhance Mining and Knowledge Discovery
Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Balaji Rajagopalan, Mark W. Isken
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART C:
APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001
Intelligent Database Systems Lab
N.Y.U.S.T.
I.M.
Outline

Motivation
Objective
Introduction
Data Preparation
Research Method
Results
Motivation

Organizational data is used for mining and knowledge discovery
In its natural form, this data is not amenable to mining
Objective

Show that data enhancement, through the introduction of new attributes along with judicious aggregation of existing attributes:
  results in higher-quality knowledge discovery
  has a differential impact on the performance of different mining algorithms
Introduction

The exponential growth of information confronts knowledge workers with a tremendous volume of data
Knowledge management solutions:
  Knowledge repository
  Knowledge sharing
  Knowledge discovery
Data Preparation

Presents a framework based on prior research in knowledge discovery:
  Data quality
  Data characteristics
  Data preparation
Research Method

A data set from a large tertiary care hospital in the United States was used
Topics covered:
  A. Problem Domain
  B. Data
  C. Clustering Algorithms for Knowledge Discovery
  D. Entropy-Based Metrics for Cluster Quality Assessment
  E. Rule Extraction Metrics
Problem Domain

Allocation of inpatient beds
The difficult part is carrying out quantitative resource allocation over a manageable set of patient types
Quantitative resource use:
  the sequence of hospital units visited and the corresponding lengths of stay
Patient type:
  a group of patients consuming a similar level of hospital resources
Problem Domain

We refer to this as the patient classification problem
Too few vs. too many patient types
The key is to identify the set of patient types
Data

Inpatient obstetrical and gynecological (OB/GYN) patient flow
The data contain numerous fields:
  demographics
  physician information
  ICD-9-CM diagnostic codes
  procedure codes
  diagnosis-related groups (DRGs)
Data

Almost 500 DRGs are defined
DRGs in the range 353-384 are related to OB/GYN
These DRGs are grouped into five DRG types
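
The grouping step can be sketched as a lookup from DRG code to DRG type. The slides do not list the five DRG types or their boundaries, so the group names and code ranges below (HYPOTHETICAL_DRG_TYPES) are placeholders chosen only to make the example run; only the 353-384 range comes from the slide.

```python
# DRGs 353-384 inclusive are the OB/GYN-related codes named on the slide.
OBGYN_DRG_RANGE = range(353, 385)

# HYPOTHETICAL grouping: the five labels and their boundaries are placeholders,
# not the grouping used in the study.
HYPOTHETICAL_DRG_TYPES = {
    "type_A": range(353, 360),
    "type_B": range(360, 370),
    "type_C": range(370, 376),
    "type_D": range(376, 381),
    "type_E": range(381, 385),
}


def build_lookup(groups):
    """Invert {type_name: codes} into {code: type_name}, checking for errors."""
    lookup = {}
    for name, codes in groups.items():
        for code in codes:
            if code not in OBGYN_DRG_RANGE:
                raise ValueError(f"DRG {code} is outside the OB/GYN range")
            if code in lookup:
                raise ValueError(f"DRG {code} is assigned to two types")
            lookup[code] = name
    return lookup


def drg_type(code, lookup):
    """Return the DRG type for a code, or None for codes that are not grouped."""
    return lookup.get(code)


lookup = build_lookup(HYPOTHETICAL_DRG_TYPES)
```

Aggregating 32 individual codes into five types is an instance of the "judicious aggregation of existing attributes" named in the objective: it gives the clustering step a coarser nominal attribute to work with.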
Clustering Algorithms for Knowledge Discovery

K-means and Kohonen self-organizing maps
Similarity
  Euclidean distance function:
  $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
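
To make the role of the distance function concrete, here is a minimal sketch of the Euclidean distance and of a plain K-means loop built on it. It is an illustration under stated assumptions, not the implementation used in the study: the function names are made up for this example, and the initial centroids are supplied by the caller rather than chosen at random.

```python
import math


def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def assign_to_nearest(points, centroids):
    """For each point, return the index of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of its members.

    A centroid with no members is left where it was.
    """
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            updated.append(list(old))
        else:
            updated.append([sum(col) / len(members) for col in zip(*members)])
    return updated


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = assign_to_nearest(points, centroids)
    for _ in range(max_iter):
        centroids = update_centroids(points, labels, centroids)
        new_labels = assign_to_nearest(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


labels, centers = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]])
```

Because the distance is computed on raw coordinates, variables with large numeric ranges dominate the result, which is one reason the choice and scaling of attributes matters for clustering.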
Entropy-Based Metrics for Cluster Quality Assessment

Entropy
  Let $n_{ij}$ be the number of cases having a DRG type of $i$ in cluster $j$, and let
  $$p_{ij} = \frac{n_{ij}}{\sum_{l} n_{lj}}$$
  $$E_j = \sum_{i} p_{ij} \log_2\left(\frac{1}{p_{ij}}\right)$$
Weighted entropy
  uses cluster size to calculate a weighted average entropy measure for a cluster solution
Purity
  let $P_j = \max_i p_{ij}$
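
The three metrics can be sketched directly from the definitions above. In this sketch each cluster is given as a list of counts n_ij, one per DRG type. The weighting used in weighted_entropy (cluster size divided by the total number of cases) is an assumption; the slide only says that cluster size is used to form a weighted average.

```python
import math


def proportions(counts):
    """counts[i] = n_ij for one cluster j; returns p_ij = n_ij / sum_l n_lj."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cluster is empty")
    return [n / total for n in counts]


def entropy(counts):
    """E_j = sum_i p_ij * log2(1 / p_ij); terms with p_ij == 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in proportions(counts) if p > 0)


def purity(counts):
    """P_j = max_i p_ij, the share of the dominant DRG type in the cluster."""
    return max(proportions(counts))


def weighted_entropy(table):
    """Average of E_j over clusters, weighted by cluster size (assumed weights).

    table[j] is the list of n_ij for cluster j; empty clusters are skipped.
    """
    sizes = [sum(counts) for counts in table]
    total = sum(sizes)
    return sum(
        size / total * entropy(counts)
        for size, counts in zip(sizes, table)
        if size > 0
    )


# One evenly mixed cluster and one pure cluster, each with 10 cases.
solution = [[5, 5], [10, 0]]
overall = weighted_entropy(solution)
```

Lower entropy and higher purity both indicate that a cluster is dominated by a single DRG type, so a good cluster solution has a low weighted entropy.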
Rule Extraction Metrics

We expect most of the extracted rules to show a high degree of resonance with our domain knowledge
Results

Details the data enhancements relevant to this study
  A. Data Preparation: Basics
  B. Mining and Knowledge Discovery
  C. Differential Impact Based on Clustering Method
  D. Usefulness of Knowledge Discovered
  E. Limitations
  F. Implications for Research and Practice
Data Preparation: Basics

The data set included fields that represent the path through the hospital units and the associated lengths of stay along that path
Data Preparation: Basics

Three data sets are considered in order to illustrate the impact of data preparation
ED1
  Eight numeric variables
Data Preparation: Basics

ED2
  Both DRG and CCS were designed to serve as aggregate measures of hospital resource consumption
  In addition to the variables in ED1, ED2 adds five nominal variables
Data Preparation: Basics

ED3
  In addition to the variables in ED2, ED3 contains two binary variables:
    whether or not the patient gave birth during the visit
    whether or not the patient gave birth via C-section
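
The progression from ED1 to ED3 can be sketched as three feature builders, each extending the previous one. Everything named here is hypothetical: the field names, the category lists, and the use of one-hot encoding for nominal variables are assumptions made for illustration, since the slides do not say how the nominal variables were encoded for clustering.

```python
def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector over a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]


def build_ed1(record, numeric_fields):
    """ED1: numeric variables only."""
    return [float(record[f]) for f in numeric_fields]


def build_ed2(record, numeric_fields, nominal_fields):
    """ED2: ED1 plus nominal variables (one-hot encoding is an assumption).

    nominal_fields maps each field name to its list of categories.
    """
    features = build_ed1(record, numeric_fields)
    for field, categories in nominal_fields.items():
        features.extend(one_hot(record[field], categories))
    return features


def build_ed3(record, numeric_fields, nominal_fields, binary_fields):
    """ED3: ED2 plus binary indicator variables."""
    features = build_ed2(record, numeric_fields, nominal_fields)
    features.extend(1.0 if record[f] else 0.0 for f in binary_fields)
    return features


# Hypothetical field names and values, shortened to keep the example small.
numeric = ["los_unit_1", "los_unit_2"]
nominal = {"drg_type": ["A", "B"]}
binary = ["gave_birth", "c_section"]
patient = {
    "los_unit_1": 2,
    "los_unit_2": 0.5,
    "drg_type": "B",
    "gave_birth": True,
    "c_section": False,
}

ed1 = build_ed1(patient, numeric)
ed2 = build_ed2(patient, numeric, nominal)
ed3 = build_ed3(patient, numeric, nominal, binary)
```

Building each data set on top of the previous one keeps the three versions comparable, so any change in cluster quality can be attributed to the added attributes.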
Mining and Knowledge Discovery
Differential Impact Based on Clustering Method
Usefulness of Knowledge Discovered
Limitations

The findings may not be exactly applicable in every case
Only two data mining algorithms were examined
  K-means and Kohonen self-organizing maps
  illustrative, not exhaustive
Domain knowledge played a critical role in the data preparation process
Implications for Research and Practice

Provides empirical evidence demonstrating the impact of data preparation on mining and knowledge discovery
Engage in a comparative investigation of multiple algorithms
Personal opinion

…