Steven F. Ashby Center for Applied Scientific Computing


Data Mining: Data
Lecture 3
TIES445 Data mining
Nov-Dec 2007
Sami Äyrämö
These slides are additional material for TIES445
Data quality



GIGO – Garbage In, Garbage Out
– Effectiveness of DM exercise depends on the quality
of data
Data quality concerns
– individual measurements (records and fields)
– collections of observations
Sources of error are infinite
– Human error (e.g., keyboard error)
– Instrumentation failure
Inaccurate or imprecise
– Inadequate specification of measurement or data
collection process
Quality of individual measurements

Bias
– the difference between the mean of the repeated
measurements and the true value

Precision
– variability of the repeated measurements (NOTE: precision
is not the number of digits in record)

Accuracy
– small bias and high precision (e.g., small variance)
– e.g., repeated measurements of someone’s height may be
precise (reliable) but inaccurate (not valid) if (s)he is
wearing shoes (we are not measuring the right thing)

True value (does it even exist?)
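
The bias and precision definitions above can be illustrated with a small sketch. The measurement values and the "true" barefoot height below are made up for illustration; in practice the true value is usually unknown, which is exactly the question the last bullet raises.

```python
import statistics

# Hypothetical data: five repeated height measurements (cm) of one person
# wearing shoes. The true barefoot height is an assumed value.
true_height = 170.0
measurements = [172.9, 173.1, 173.0, 172.8, 173.2]

# Bias: mean of the repeated measurements minus the true value.
bias = statistics.fmean(measurements) - true_height

# Precision: variability of the repeated measurements
# (population variance here; a small variance means high precision).
precision_variance = statistics.pvariance(measurements)

print(f"bias = {bias:.2f} cm, variance = {precision_variance:.3f} cm^2")
```

Here the variance is small (the measurements are precise), but the bias is large because the shoes add a systematic offset, so the measurements are not accurate.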
Quality of collections of data: bias

Distorted (biased) samples
– mismatch between the sample population and the population
of interest (selection bias)
e.g., calculating an average age of students in Jyväskylä when the
sample is restricted to female students

– a sample may be selected through a chain of selection steps
e.g., candidates for bank loans: 1) potential customers are contacted,
2) some reply, some do not, 3) of those who replied some are
creditworthy, some are not, 4) those who take out a loan are followed,
5) some are good customers, some are not,…

– populations are not static (population drift)
e.g., customers' shopping behaviour may change over time

A biased sample leads to inconsistent estimates of population
parameters
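
The selection bias example above (average age estimated from female students only) can be sketched as follows. The student records are invented for illustration and do not describe any real population.

```python
import statistics

# Hypothetical student records: (gender, age). Made-up values.
students = [("F", 21), ("F", 22), ("F", 20), ("M", 25), ("M", 27), ("M", 23)]

# Mean age in the population of interest (all students).
population_mean = statistics.fmean([age for _, age in students])

# Mean age in the distorted sample (restricted to female students).
female_mean = statistics.fmean([age for g, age in students if g == "F"])

# The gap does not shrink when more female students are sampled,
# which is why the estimate is called inconsistent.
selection_gap = female_mean - population_mean

print(population_mean, female_mean, selection_gap)
```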
Quality of collections of data: incomplete data

Incomplete data: missing or empty values
– Missing value: Information is not collected
e.g., people decline to answer a question (age, weight, position,…)
– Empty value: Information does not exist
A form may have conditional parts: e.g., the expiry date of a driver’s
license cannot be filled out by children
– Determining whether any value is ”empty” or ”missing”
requires domain knowledge
If the discriminating information is not provided, both empty and
missing values are treated as (and called) missing
– Fundamental question for data mining task: ”Why are the
data incomplete?”
– Note: A distorted (biased) sample is actually a special case
of incomplete data
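
The distinction between empty and missing values can be sketched with the driver's license example. The function name, the field layout and the minimum licensing age of 18 are assumptions made for illustration; the real rule is domain knowledge that depends on the jurisdiction.

```python
from typing import Optional

# Assumed minimum age for holding a license (hypothetical domain rule).
MIN_LICENSE_AGE = 18


def classify_license_expiry(age: Optional[int], expiry: Optional[str]) -> str:
    """Label the license-expiry field as 'present', 'empty' or 'missing'."""
    if expiry is not None:
        return "present"
    if age is not None and age < MIN_LICENSE_AGE:
        # A child cannot hold a license: the information does not exist.
        return "empty"
    # Without discriminating information (unknown age, or an adult who may
    # or may not hold a license) the value is treated as missing.
    return "missing"


print(classify_license_expiry(10, None))    # child
print(classify_license_expiry(40, None))    # adult, field left blank
print(classify_license_expiry(None, None))  # age unknown
```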