Data Mining and Knowledge Discovery - Web
The Knowledge Discovery Process;
Data Preparation & Preprocessing
Bamshad Mobasher
DePaul University
The Knowledge Discovery Process
- The KDD Process
Steps in the KDD Process
Learning the application domain
Translate the business problem into a data mining problem
Gathering and integrating data
Cleaning and preprocessing data
may be the most resource-intensive part
Reducing and selecting data
find useful features, dimensionality reduction, etc.
Choosing functions of data mining
summarization, classification, regression, association, clustering, etc.
Choosing the mining algorithm(s)
Data mining: discover patterns of interest; construct models
Evaluate patterns and models
Interpretation: analysis of results
visualization, alteration, removing redundant patterns, querying
Use of discovered knowledge
Data Mining: What Kind of Data?
Structured Databases
relational, object-relational, etc.
can use SQL to perform parts of the process
e.g., SELECT category, COUNT(*) FROM Items WHERE
type = 'video' GROUP BY category
Data Mining: What Kind of Data?
Flat Files
most common data source
can be text (or HTML) or binary
may contain transactions, statistical data, measurements, etc.
Transactional databases
set of records each with a transaction id, time stamp, and a set of items
may have an associated “description” file for the items
typical source of data used in market basket analysis
Data Mining: What Kind of Data?
Other Types of Databases
legacy databases
multimedia databases (usually very high-dimensional)
spatial databases (containing geographical information, such as maps or
satellite imaging data)
Time series and temporal data (time-dependent information, such as stock market
data; usually very dynamic)
World Wide Web
basically a large, heterogeneous, distributed database
need for new or additional tools and techniques
information retrieval, filtering and extraction
agents to assist in browsing and filtering
Web content, usage, and structure (linkage) mining tools
Data Mining: What Kind of Data?
Data Warehouses
a data warehouse is a repository of data collected from multiple data sources
(often heterogeneous) so that analysis can be done under the same unified schema
data from the different sources are loaded, cleaned, transformed and integrated
allows for interactive analysis and decision-making
usually modeled by a multi-dimensional data structure (data cube) to facilitate
decision-making
aggregate values can be pre-computed and stored along many dimensions
each dimension of the data cube contains a hierarchy of values for one attribute
data cubes are well suited for fast interactive querying and analysis of data at
different conceptual levels, known as On-Line Analytical Processing (OLAP)
OLAP operations allow the navigation of data at different levels of abstraction,
such as drill-down, roll-up, slice, dice, etc.
DM Tasks: Classification
Classifying observations/instances into different “given” classes
(i.e., classification is “supervised”)
example, classifying credit applicants as low, medium, or high risk
normally use a training set where all objects are already associated with known
class labels
classification algorithm learns from the training set and builds a model
the model is used to classify new objects
Suitable data mining tools
Decision Tree algorithms (CHAID, C5.0, and CART)
Memory-based reasoning
Neural Network
Example (Hypothetical Video Store)
build a model of users based on their rental history (returning on time, payments, etc.)
observe the current customer's rental and payment history
decide whether to charge a deposit to the current customer
DM Tasks: Prediction
Same as classification
except that instances are classified according to some predicted or estimated future value
In prediction, historical data is used to build a (predictive)
model that explains the current observed behavior
Model can then be applied to new instances to predict future behavior or
forecast the future value of some missing attribute
Examples
predicting the size of balance transfer if the prospect accepts the offer
predicting the load on a Web server in a particular time period
Suitable data mining tools
Association rule discovery
Memory-based reasoning
Decision Trees
Neural Networks
DM Tasks: Affinity Grouping
Determine what items often go together (usually in transactional
databases)
Often Referred to as Market Basket Analysis
used in retail for planning arrangement on shelves
used for identifying cross-selling opportunities
“should” be used to determine best link structure for a Web site
Examples
people who buy milk and beer also tend to buy diapers
people who access pages A and B are likely to place an online order
Suitable data mining tools
association rule discovery
clustering
Nearest Neighbor analysis (memory-based reasoning)
DM Tasks: Clustering
Like classification, clustering is the organization of data into
classes
however, class labels are unknown and it is up to the clustering algorithm to
discover acceptable classes
also called unsupervised classification, because the classification is not dictated
by given class labels
There are many clustering approaches
all based on the principle of maximizing the similarity between objects in the
same class (intra-class similarity) and minimizing the similarity between
objects of different classes (inter-class similarity)
Examples:
doing market segmentation of customers based on buying patterns and
demographic attributes
grouping user transactions on a Web site based on their navigational patterns
Characterization & Discrimination
Data characterization is a summarization of general features of
objects in a target class
The data relevant to a target class are retrieved by a database query and run
through a summarization module to extract the essence of the data at different
levels of abstraction
example: characterize the video store’s customers who regularly rent more than
30 movies a year
Data discrimination is used to compare the general features of
objects between two classes
comparing relevant features of objects in a target class with those in a
contrasting class
example: compare the general characteristics of the customers who rented
more than 30 movies in the last year with those who rented less than 5
The techniques used for data discrimination are very similar to the techniques
used for data characterization with the exception that data discrimination
results include comparative measures
Example: Moviegoer Database
SELECT moviegoers.name, moviegoers.sex, moviegoers.age,
sources.source, movies.name
FROM movies, sources, moviegoers
WHERE sources.source_ID = moviegoers.source_ID AND
movies.movie_ID = moviegoers.movie_ID
ORDER BY moviegoers.name;
moviegoers.name  sex  age  source      movies.name
Amy              f    27   Oberlin     Independence Day
Andrew           m    25   Oberlin     12 Monkeys
Andy             m    34   Oberlin     The Birdcage
Anne             f    30   Oberlin     Trainspotting
Ansje            f    25   Oberlin     I Shot Andy Warhol
Beth             f    30   Oberlin     Chain Reaction
Bob              m    51   Pinewoods   Schindler's List
Brian            m    23   Oberlin     Super Cop
Candy            f    29   Oberlin     Eddie
Cara             f    25   Oberlin     Phenomenon
Cathy            f    39   Mt. Auburn  The Birdcage
Charles          m    25   Oberlin     Kingpin
Curt             m    30   MRJ         T2 Judgment Day
David            m    40   MRJ         Independence Day
Erica            f    23   Mt. Auburn  Trainspotting
Example: Moviegoer Database
Classification
determine sex based on age, source, and movies seen
determine source based on sex, age, and movies seen
determine most recent movie based on past movies, age, sex, and source
Prediction or Estimation
for prediction, we need a continuous variable (e.g., "age")
predict age as a function of source, sex, and past movies
if we had a “rating” field for each moviegoer, we could predict the rating a new
moviegoer gives to a movie based on age, sex, past movies, etc.
Clustering
find groupings of movies that are often seen by the same people
find groupings of people that tend to see the same movies
clustering might reveal relationships that are not necessarily recorded in the
data (e.g., we may find a cluster that is dominated by people with young
children; or a cluster of movies that correspond to a particular genre)
Example: Moviegoer Database
Affinity Grouping
market basket analysis (MBA): “which movies go together?”
need to create “transactions” for each moviegoer containing movies seen by that
moviegoer:
name    TID  Transaction
Amy     001  {Independence Day, Trainspotting}
Andrew  002  {12 Monkeys, The Birdcage, Trainspotting, Phenomenon}
Andy    003  {Super Cop, Independence Day, Kingpin}
Anne    004  {Trainspotting, Schindler's List}
...     ...  ...
may result in association rules such as:
{“Phenomenon”, “The Birdcage”} ==> {“Trainspotting”}
{“Trainspotting”, “The Birdcage”} ==> {sex = “f”}
Sequence Analysis
similar to MBA, but order in which items appear in the pattern is important
e.g., people who rent “The Birdcage” during a visit tend to rent “Trainspotting” in the
next visit.
The Knowledge Discovery Process
- The KDD Process
Data Preprocessing
Why do we need to prepare the data?
In real world applications data can be inconsistent, incomplete and/or noisy
Data entry, data transmission, or data collection problems
Discrepancy in naming conventions
Duplicated records
Incomplete data
Contradictions in data
What happens when the data cannot be trusted?
Can the decision be trusted? Decision making is jeopardized
Better chance to discover useful knowledge when data is clean
Data Preprocessing
Data Preprocessing
Data Cleaning
Data Integration
Data Transformation
e.g., -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48
Data Reduction
Data Cleaning
Real-world application data can be incomplete, noisy, and
inconsistent
No recorded values for some attributes
Not considered at time of entry
Random errors
Irrelevant records or fields
Data cleaning attempts to:
Fill in missing values
Smooth out noisy data
Correct inconsistencies
Remove irrelevant data
Dealing with Missing Values
Data is not always available (missing attribute values in records)
equipment malfunction
deleted due to inconsistency or misunderstanding
not considered important at time of data gathering
Handling Missing Data
Ignore the record with missing values;
Fill in the missing values manually;
Use a global constant to fill in missing values (NULL, unknown, etc.);
Use the attribute mean to fill in missing values of that attribute;
Use the attribute mean for all samples belonging to the same class to fill in the
missing values;
Infer the most probable value to fill in the missing value
may need to use methods such as Bayesian classification or decision trees
to automatically infer missing attribute values
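As a rough illustration of several of these strategies, here is a small pandas sketch; the DataFrame and its columns are hypothetical, not part of the slides:

import pandas as pd

# hypothetical records with missing "income" values
df = pd.DataFrame({
    "class":  ["a", "a", "b", "b", "a"],
    "income": [50.0, None, 42.0, None, 61.0],
})

dropped = df.dropna()                                # ignore the record
constant = df["income"].fillna(-1)                   # global constant
overall = df["income"].fillna(df["income"].mean())   # attribute mean
by_class = df["income"].fillna(                      # class-wise mean
    df.groupby("class")["income"].transform("mean"))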
Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise
Original Data for “price” (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
Binning
Partition into equi-depth bins:
Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34
Smoothing by bin means (each value in a bin is replaced by the mean value of the bin):
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
Smoothing by bin boundaries (the min and max values in each bin are identified as the
boundaries; each value is replaced with the closest boundary value):
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
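A minimal Python sketch of equi-depth binning with smoothing by bin means and by bin boundaries, reproducing the "price" example above (the variable names are my own):

# the sorted "price" values from above
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# smoothing by bin means: every value becomes its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: every value snaps to the nearer of
# the bin's min or max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]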
Smoothing Noisy Data
Other Methods
Clustering: similar values are organized into groups (clusters). Values falling
outside of clusters may be considered "outliers" and may be candidates for
elimination.
Regression: fit the data to a function. Linear regression finds the best line to
fit two variables. Multiple regression can handle multiple variables. The values
given by the function are used instead of the original values.
Smoothing Noisy Data - Example
Want to smooth "Temperature" by bin means with bins of size 3:
1. First sort the values of the attribute (keep track of the ID or key so
that the transformed values can be replaced in the original table).
2. Divide the data into bins of size 3 (or less in the case of the last bin).
3. Convert the values in each bin to the mean value for that bin.
4. Put the resulting values into the original table.
ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Sorted by Temperature and partitioned into bins of size 3:

ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5
Smoothing Noisy Data - Example
Sorted values:

ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5

Smoothed by bin means:

ID  Temperature  Bin
7   64           Bin1
6   64           Bin1
5   64           Bin1
9   70           Bin2
4   70           Bin2
10  70           Bin2
8   73           Bin3
12  73           Bin3
11  73           Bin3
14  79           Bin4
2   79           Bin4
13  79           Bin4
3   84           Bin5
1   84           Bin5
The value of every record in each bin is changed to the mean value for that
bin. If it is necessary to keep the values as integers, the bin means are
rounded to the nearest integer.
Smoothing Noisy Data - Example
The final table with the new values for the Temperature attribute.
ID  Outlook   Temperature  Humidity  Windy
1   sunny     84           85        FALSE
2   sunny     79           90        TRUE
3   overcast  84           78        FALSE
4   rain      70           96        FALSE
5   rain      64           80        FALSE
6   rain      64           70        TRUE
7   overcast  64           65        TRUE
8   sunny     73           95        FALSE
9   sunny     70           70        FALSE
10  rain      70           80        FALSE
11  sunny     73           70        TRUE
12  overcast  73           90        TRUE
13  overcast  79           75        FALSE
14  rain      79           80        TRUE
Data Integration
Data analysis may require a combination of data from multiple
sources into a coherent data store
Challenges in Data Integration:
Schema integration: CID = C_number = Cust-id = cust#
Semantic heterogeneity
Data value conflicts (different representations or scales, etc.)
Synchronization (especially important in Web usage mining)
Redundant attributes (an attribute is redundant if it can be derived from other
attributes); redundancies may be identified via correlation analysis (see the
sketch after this list):
Correlation measure: Pr(A,B) / (Pr(A) · Pr(B))
= 1: independent
> 1: positive correlation
< 1: negative correlation
Meta-data is often necessary for successful data integration
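The correlation measure above can be computed directly from records. A minimal Python sketch, assuming binary presence/absence attributes and a hypothetical set of transactions:

# hypothetical market-basket style records
transactions = [{"A", "B"}, {"A"}, {"A", "B"}, {"B"}, {"A", "B"}, set()]
n = len(transactions)

p_a = sum("A" in t for t in transactions) / n
p_b = sum("B" in t for t in transactions) / n
p_ab = sum({"A", "B"} <= t for t in transactions) / n

# = 1: independent; > 1: positive correlation; < 1: negative correlation
corr = p_ab / (p_a * p_b)
print(corr)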
Normalization
Min-max normalization: linear transformation from v to v’
v’ = [(v - min)/(max - min)] x (newmax - newmin) + newmin
Ex: transform $30000 in [10000..45000] into [0..1] ==>
[(30000 - 10000) / 35000] x (1 - 0) + 0 = 0.571
z-score normalization: normalization of v into v’ based on
attribute value mean and standard deviation
v’ = (v - Mean) / StandardDeviation
Normalization by decimal scaling
moves the decimal point of v by j positions, where j is the minimum number
of positions such that the maximum absolute value falls in [0..1]:
v' = v / 10^j
Ex: if v ranges between -56 and 9976, j = 4 ==> v' ranges between -0.0056
and 0.9976
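A small Python sketch of the three normalization methods above; the function names are my own, and the decimal-scaling helper assumes a nonzero maximum absolute value:

import math

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # linear transformation of v from [vmin, vmax] into [new_min, new_max]
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, stdev):
    return (v - mean) / stdev

def decimal_scaling(values):
    # smallest j such that all values divided by 10^j fall within [-1, 1]
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values]

print(min_max(30000, 10000, 45000))   # 0.571...
print(z_score(85, 80.3, 9.84))        # 0.48 (the Humidity example below)
print(decimal_scaling([-56, 9976]))   # [-0.0056, 0.9976]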
Normalization: Example
z-score normalization: v’ = (v - Mean) / Stdev
Example: normalizing the “Humidity” attribute:
Mean = 80.3, Stdev = 9.84

Humidity  Humidity (z-score)
85        0.48
90        0.99
78        -0.23
96        1.60
80        -0.03
70        -1.05
65        -1.55
95        1.49
70        -1.05
80        -0.03
70        -1.05
90        0.99
75        -0.54
80        -0.03
Normalization: Example II
Min-Max normalization on an employee database
max distance for salary: 100000-19000 = 81000
max distance for age: 52-27 = 25
New min for age and salary = 0; new max for age and salary = 1
Original:

ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

Normalized:

ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32
Data Reduction
Data is often too large; reducing data can improve performance
Data reduction consists of reducing the representation of the data
set while producing the same (or almost the same) results
Data reduction includes:
Data cube aggregation
Dimensionality reduction
Discretization
Numerosity reduction
Regression
Histograms
Clustering
Sampling
Data Cube Aggregation
Reduce the data to the concept level needed in the analysis
Use the smallest (most detailed) level necessary to solve the problem
Queries regarding aggregated information should be answered
using data cube when possible
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select only the necessary attributes.
The goal is to find a minimum set of attributes such that the resulting
probability distribution of data classes is as close as possible to the original
distribution obtained using all attributes.
Exponential number of possibilities.
Use heuristics: select local ‘best’ (or most pertinent) attribute,
e.g., using information gain, etc.
step-wise forward selection: {} → {A1} → {A1, A3} → {A1, A3, A5}
step-wise backward elimination: {A1, A2, A3, A4, A5} → {A1, A3, A4, A5}
→ {A1, A3, A5}
combining forward selection and backward elimination
decision-tree induction
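A rough Python sketch of step-wise forward selection, using cross-validated accuracy as the local "best attribute" heuristic; the dataset and the choice of a decision-tree classifier (via scikit-learn) are placeholders, not prescribed by the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []
best_score = 0.0

while remaining:
    # score each candidate attribute when added to the current subset
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:   # stop when no attribute improves the score
        break
    selected.append(f)
    remaining.remove(f)
    best_score = score

print(selected, best_score)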
Decision Tree Induction
Numerosity Reduction
Reduction via histograms:
Divide data into buckets and store
representation of buckets (sum, count, etc.)
Reduction via clustering
Partition data into clusters based on
“closeness” in space
Retain representatives of clusters (centroids)
and outliers
Reduction via sampling
Will the patterns in the sample represent the
patterns in the data?
Random sampling can produce poor results
Stratified sample (stratum = group based on
attribute value)
Sampling Techniques
[Figure: raw data compared with a cluster/stratified sample]
Discretization
3 Types of attributes
nominal - values from an unordered set (also “categorical” attributes)
ordinal - values from an ordered set
continuous - real numbers (but sometimes also integer values)
Discretization is used to reduce the number of values for a given
continuous attribute
usually done by dividing the range of the attribute into intervals
interval labels are then used to replace actual data values
Some data mining algorithms only accept categorical attributes and cannot
handle a range of continuous attribute values
Discretization can also be used to generate concept hierarchies
reduce the data by collecting and replacing low level concepts (e.g., numeric
values for “age”) by higher level concepts (e.g., “young”, “middle aged”, “old”)
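A minimal sketch of interval-based discretization with pandas; the bin edges are taken from the Humidity example that follows:

import pandas as pd

humidity = pd.Series([85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80])

# bin edges follow the example: Low = 60-69, Normal = 70-79, High = 80+
labels = pd.cut(humidity, bins=[59, 69, 79, humidity.max()],
                labels=["Low", "Normal", "High"])
print(labels.value_counts())   # High: 8, Normal: 5, Low: 1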
Discretization - Example
Example: discretizing the "Humidity" attribute using 3 bins.
Binning: Low = 60-69, Normal = 70-79, High = 80+

Humidity  Humidity (discretized)
85        High
90        High
78        Normal
96        High
80        High
70        Normal
65        Low
95        High
70        Normal
80        High
70        Normal
90        High
75        Normal
80        High
Converting Categorical Attributes to
Numerical Attributes
ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Create separate columns for each value of a categorical attribute (e.g., 3 values
for the Outlook attribute and 2 values for the Windy attribute). There is no
change to the numerical attributes.

Attributes:
Outlook (overcast, rain, sunny)
Temperature real
Humidity real
Windy (true, false)

Standard Spreadsheet Format:

OutLook   OutLook  OutLook  Temp  Humidity  Windy  Windy
overcast  rain     sunny                    TRUE   FALSE
0         0        1        85    85        0      1
0         0        1        80    90        1      0
1         0        0        83    78        0      1
0         1        0        70    96        0      1
0         1        0        68    80        0      1
0         1        0        65    70        1      0
1         0        0        58    65        1      0
...       ...      ...      ...   ...       ...    ...
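A minimal pandas sketch of this conversion; the small DataFrame mirrors the first few rows of the weather table:

import pandas as pd

# first few rows of the weather table
df = pd.DataFrame({
    "Outlook":     ["sunny", "sunny", "overcast", "rain"],
    "Temperature": [85, 80, 83, 70],
    "Humidity":    [85, 90, 78, 96],
    "Windy":       [False, True, False, False],
})

# one 0/1 column per value of each categorical attribute;
# the numerical attributes pass through unchanged
encoded = pd.get_dummies(df, columns=["Outlook", "Windy"], dtype=int)
print(encoded)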
Visualizing Patterns
Example: Cross Tabulation
ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

           Outlook = sunny  Outlook = rain  Outlook = overcast
Windy            2                2                 2
Not Windy        3                3                 2

[Bar chart of the Windy vs. Not Windy counts for each Outlook value]
Evaluating Models
To train and evaluate models, data are often divided into three
sets: the training set, the test set, and the evaluation set
Training Set
is used to build the initial model
may need to “enrich the data” to get enough of the special cases
Test Set
is used to adjust the initial model
models can be tweaked to be less idiosyncratic to the training data and can be
adapted into a more general model
the idea is to prevent "over-training" (i.e., finding patterns where none exist)
Evaluation Set
is used to evaluate the model performance
Training Sets
The training set will be used to train the models
most important consideration: need to cover the full range of values for all the
features that the model might encounter
good idea to have several examples for each value of a categorical feature and
for a range of values for numerical features
Data Enrichment
training set should have sufficient number of examples of rare events
a random sample of available data is not sufficient since common examples
will swamp the rare examples
training set needs to over-sample the rare cases so that the model can learn the
features for rare events instead of predicting everything to be common outputs
Test and Evaluation Sets
Reading too much into the training set (overfitting)
common problem with most data mining algorithms
resulting model works well on the training set but performs miserably on
unseen data
test set can be used to “tweak” the initial model, and to remove unnecessary
inputs or features
Evaluation Set for Final Performance Evaluation
the test set cannot be used to evaluate model performance on future unseen data
the error rate on the evaluation set is a good predictor of the error rate on unseen data
Insufficient data to divide into three disjoint sets?
In such cases, validation techniques can play a major role
Cross Validation
Bootstrap Validation
Cross Validation
Cross validation is a heuristic that works as follows
randomly divide the data into n folds, each with approximately the same
number of records
create n models using the same algorithms and training parameters; each model
is trained with n-1 folds of the data and tested on the remaining fold
can be used to find the best algorithm and its optimal training parameter
Steps in Cross Validation
1. Divide the available data into a training set and an evaluation set
2. Split the training data into n folds
3. Select an architecture (algorithm) and training parameters
4. Train and test n models
5. Repeat step 2 to 4 using different architectures and parameters
6. Select the best model
7. Use all the training data to train the model
8. Assess the final model using the evaluation set
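A minimal scikit-learn sketch of steps 2-4 above; the dataset and classifier are placeholders:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# split the training data into n folds; each model trains on n-1 folds
# and is tested on the remaining fold
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=folds)

# compare mean scores across architectures/parameters to select the best model
print(scores.mean(), scores.std())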
Bootstrap Validation
Based on the statistical procedure of sampling with replacement
data set of n instances is sampled n times (with replacement) to give another data
set of n instances
since some elements will be repeated, there will be elements in the original data
set that are not picked
these remaining instances are used as the test set
How many instances in the test set?
Probability of not getting picked in one sampling = 1 - 1/n
Pr(not getting picked in n samplings) = (1 - 1/n)^n ≈ e^(-1) ≈ 0.368
so, for a large data set, the test set will contain about 36.8% of the instances
to compensate for the smaller training sample (63.2%), the test-set error rate is
combined with the re-substitution error on the training set:
e = 0.632 × e_test + 0.368 × e_train
Bootstrap validation increases variance that can occur in each fold
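A rough Python sketch of one bootstrap round and the 0.632 error estimate described above; the dataset and classifier are placeholders:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

# sample n instances with replacement for training; instances never
# picked (about 36.8% for large n) form the test set
train_idx = rng.integers(0, n, size=n)
test_idx = np.setdiff1d(np.arange(n), train_idx)

model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
e_test = 1 - model.score(X[test_idx], y[test_idx])
e_train = 1 - model.score(X[train_idx], y[train_idx])

e = 0.632 * e_test + 0.368 * e_train
print(e)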
Measuring Effectiveness
Predictive models are measured based on the accuracy of their
predictions on unseen data
Classification and Prediction Tasks
accuracy measured in terms of error rate (usually % of records classified incorrectly)
error rate on a pre-classified evaluation set estimates the real error rate
can also use cross-validation methods discussed before
Estimation Effectiveness
difference between predicted scores and the actual results (from evaluation set)
the accuracy of the model is measured in terms of variance (i.e., average of the
squared differences)
to be able to express this in terms of the original units, compute the standard
deviation (i.e., square root of the variance)
e = sqrt( [ (p1 - a1)^2 + ... + (pn - an)^2 ] / n )
Ordinal or Nominal Outputs
When the output field is ordinal or nominal (e.g., in two-class
prediction), we use the classification table (the so-called
confusion matrix) to evaluate the resulting model
Example
              Predicted = T   Predicted = F   Total
Actual = T         18               3          21
Actual = F          2              15          17
Total              20              18          38
Overall correct classification rate = (18 + 15) / 38 = 87%
Given T, correct classification rate = 18 / 20 = 90%
Given F, correct classification rate = 15 / 18 = 83%
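A small NumPy sketch that reproduces the rates computed above from the confusion matrix:

import numpy as np

# rows = actual class, columns = predicted class, in the order [T, F]
cm = np.array([[18, 3],
               [2, 15]])

overall = np.trace(cm) / cm.sum()       # (18 + 15) / 38 = 0.87
given = np.diag(cm) / cm.sum(axis=0)    # 18/20 = 0.90, 15/18 = 0.83
print(overall, given)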
Measuring Effectiveness
Market Basket Analysis
MBA may be used for estimation or prediction (e.g., “people who buy milk and
diapers tend to also buy beer”)
confidence: percentage of the time "beer" occurs in transactions that contain
"milk" and "diapers" (i.e., conditional probability)
support: percentage of the time “milk”, “diaper”, and “beer” occurred together in
the whole training set (i.e., prior probability)
Distance-Based Techniques
clustering and memory-based reasoning: a measure of distance is used to evaluate
the “closeness” or “similarity” of a point to cluster centers or a neighborhood
regression: line of best fit minimizes the sum of the distances between the line
and the observations
often for numeric variables, can use Euclidean distance measure (square root of
the sum of the squares of the difference along each dimension)
Measuring Effectiveness: Lift Charts
usually used for classification, but can be adapted to other methods
measure change in conditional probability of a target class when going from the
general population (full test set) to a biased sample:
lift = Pr(class_t | sample) / Pr(class_t | population)
Example:
suppose the expected response rate to a direct mailing campaign is 5% in the training set
use a classifier to assign a "yes" or "no" value to the target class "predicted to respond"
the "yes" group will contain a higher proportion of actual responders than the test set
suppose the "yes" group (our biased sample) contains 50% actual responders
this gives lift of 10 = 0.5 / 0.05
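A minimal Python sketch of this lift computation from a list of (predicted, responded) pairs; the data here is hypothetical, not the campaign figures above:

# (predicted, actually responded) pairs for a scored test set
results = [("yes", True), ("yes", True), ("yes", False),
           ("no", False), ("no", True), ("no", False)]

p_population = sum(r for _, r in results) / len(results)
yes_group = [r for p, r in results if p == "yes"]
p_sample = sum(yes_group) / len(yes_group)

lift = p_sample / p_population   # here 0.667 / 0.5 = 1.33
print(lift)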
What if the lift sample is too small?
need to increase the sample size
trade-off between lift and sample size
[Chart: number of respondents and lift as functions of sample size]