Data Engineering
A Story of Data-Related Issues
You receive an email from a medical researcher concerning a
project that you are eager to work on.
Hi,
I've attached the data file that I mentioned in my previous email.
Each line contains the information for a single patient and
consists of five fields.
We want to predict the last field using the other fields.
I don't have time to provide any more information about the data
since I'm going out of town for a couple of days, but hopefully
that won't slow you down too much.
Thanks and see you in a couple of days.
Despite some misgivings, you proceed to analyze the data. The
first few rows of the file are as follows:
Nothing looks strange. You put your doubts aside and start the analysis.
Two days later you arrive for the meeting, and while waiting for others to
arrive, you strike up a conversation with a statistician who is working on the
project.
Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven't had much time for analysis, but I do have a few
interesting results.
Statistician: Amazing. There were so many data issues with this set of patients
that I couldn't do much.
Data Miner: Oh? I didn't hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to predict. It's
common knowledge among people who analyze this type of data that results
are better if you work with the log of the values, but I didn't discover this until
later. Was it mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what happened to field 4? It's supposed to
be measured on a scale from 1 to 10, with 0 indicating a missing value, but
because of a data entry error, all 10's were changed into 0's.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you
probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Statistician: Anyway, given all those problems, I'm surprised you were able to
accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong
predictor of field 5. I'm surprised that this wasn't noticed before.
Statistician: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh, no! I just remembered. We assigned ID numbers after we sorted
the records based on field 5. There is a strong connection, but it's meaningless.
Sorry.
Lesson: Get to know your data!
Formally: What is Data?
• Collection of data objects and their
attributes
• An attribute is a property or
characteristic of an object
– Examples: eye color of a
person, temperature, etc.
– Attribute is also known as
variable, field, characteristic, or
feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or
instance
• Example (rows are objects, columns are attributes):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Employee Age and ID Number
• Two attributes of an employee are ID and age.
– Both can be represented as integers.
– However, while it is reasonable to talk about the average age of an
employee, it makes no sense to talk about the average employee ID.
– The only valid operation for employee IDs is to test whether they are equal.
– There is no hint of this limitation, however, when integers are used to
represent the employee ID attribute.
• Knowing the type of an attribute is important because it tells us which
properties of the measured values are consistent with the underlying properties
of the attribute, and therefore, it allows us to avoid foolish actions, such as
computing the average employee ID.
Types of Attributes
• There are different types of attributes
– Nominal
  • Examples: ID numbers, eye color, zip codes
– Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
    grades, height in {tall, medium, short}
– Interval
  • Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
  • Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
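As a minimal sketch, the four properties can be written down as a lookup table that guards operations such as the mean. The names `ALLOWED_OPERATIONS`, `supports`, and `safe_mean`, and the sample values, are illustrative assumptions, not part of the slides.

```python
from statistics import mean

# Properties each attribute type possesses, as listed on the slide.
ALLOWED_OPERATIONS = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}


def supports(attribute_type: str, operation: str) -> bool:
    """Return True if the attribute type possesses the given property."""
    return operation in ALLOWED_OPERATIONS[attribute_type]


def safe_mean(values, attribute_type: str):
    """Compute the mean only when addition is meaningful for this type."""
    if not supports(attribute_type, "addition"):
        raise TypeError(f"mean is not meaningful for {attribute_type} attributes")
    return mean(values)


# Hypothetical values: both attributes are stored as integers,
# but only age supports averaging.
ages = [34, 45, 29]                # ratio attribute
employee_ids = [1001, 1002, 1003]  # nominal attribute

print(safe_mean(ages, "ratio"))
print(supports("nominal", "order"))
```

Calling `safe_mean(employee_ids, "nominal")` raises a `TypeError`, which is the "foolish action" the employee ID example warns about.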
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a finite
number of digits.
– Continuous attributes are typically represented as floating-point variables.
Typically, nominal and ordinal attributes are binary or discrete, while interval
and ratio attributes are continuous. However, count attributes, which are
discrete, are also ratio attributes.
Asymmetric Attributes
• For asymmetric attributes, only presence -- a non-zero attribute
value -- is regarded as important.
• E.g. Transaction data
– “Bread”, “Coke”, etc. are in fact (asymmetric) attributes and only
their presence (i.e. value 1 or true) is important.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
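A minimal sketch of how the five transactions above could be turned into asymmetric binary attributes, one per item, where 1 marks presence. The variable names are illustrative assumptions.

```python
# The five transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Every distinct item becomes one binary attribute (column).
items = sorted(set().union(*transactions.values()))

# 1 = item present in the transaction, 0 = absent.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, row in binary_rows.items():
    print(tid, row)
```

Only the 1s carry information here: two transactions that both lack Beer are not thereby similar.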
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Document Data
• Each document becomes a ‘term’ (word) vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the
corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
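A minimal sketch of building such a term vector by counting words. The tokenization rule (lowercase alphabetic runs), the shortened vocabulary, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball", "score", "game"]
document = (
    "The coach praised the team: play the ball, score, "
    "and win the game. Team first!"
)

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each position in the vector is one attribute, so a collection of documents becomes a data matrix with one row per document.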
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are
the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data with Relationships among Objects
• Examples: Generic graph and HTML Links
[Figure: a generic graph]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
Web search engines collect and process Web pages to extract their contents.
It is well known, however, that the links to and from each page provide a great deal of
information about the relevance of a Web page to a query, and thus, must also be taken into
consideration.
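A minimal sketch of treating Web pages as a graph: links are extracted with the standard-library `html.parser` module and stored as an adjacency list. The three pages and their contents are hypothetical, and counting incoming links is only a crude stand-in for link-based relevance, not a description of how any search engine ranks pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def outgoing_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical pages: page name -> HTML content.
pages = {
    "index.html": '<a href="papers.html">Papers</a> <a href="people.html">People</a>',
    "papers.html": '<a href="index.html">Home</a>',
    "people.html": '<a href="index.html">Home</a> <a href="papers.html">Papers</a>',
}

# Adjacency list: each page (node) maps to the pages it links to (edges).
graph = {page: outgoing_links(html) for page, html in pages.items()}

# Count incoming links per page.
in_degree = {page: 0 for page in pages}
for source, targets in graph.items():
    for target in targets:
        if target in in_degree:
            in_degree[target] += 1

print(graph)
print(in_degree)
```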
Data with Objects That Are Graphs
E.g. Chemical Data
• Benzene Molecule: C6H6
Substructure mining:
Which substructures occur
frequently in a set of
compounds?
Ascertain whether the
presence of any of these
substructures is associated
with the presence or absence
of certain chemical properties,
such as melting point or heat
of formation.
Data Quality
1. What are the kinds of data quality problems?
2. How can we detect problems with the data?
3. What can we do about these problems?
• Examples of data quality problems:
– Data collection errors
– Noise
– Outliers
– Missing values
– Duplicate data
We will study (2) and (3) in detail later, after classification.
Outliers
• Outliers are data objects with characteristics that are considerably
different from those of most other data objects in the data set
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
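A minimal sketch of the first three strategies, assuming missing values are stored as `None`. The records are hypothetical, and filling with the mean of the observed values is just one simple way to estimate a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 40, "weight": None},
    {"age": 31, "weight": 64.0},
]


def eliminate(records):
    """Eliminate data objects that have any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def estimate_with_mean(records, attribute):
    """Estimate missing values of one attribute by the mean of the observed ones."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in records
    ]


def mean_ignoring_missing(records, attribute):
    """Ignore missing values during the analysis (here: computing a mean)."""
    return mean(r[attribute] for r in records if r[attribute] is not None)


print(len(eliminate(records)))
print(estimate_with_mean(records, "age"))
print(mean_ignoring_missing(records, "weight"))
```

Elimination is the simplest option but discards two of the four objects here, which shows why it can be costly when many objects have at least one missing value.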
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Sometimes "less is more" and this is the case with aggregation,
the combining of two or more objects into a single object.
• One way to aggregate transactions for this data set is to
replace all the transactions of a single store with a single
storewide transaction.
– This reduces the number of data objects, which is now equal
to the number of stores.
• How are the values of each attribute combined across all the
records of a group (a store, for instance)?
– Some quantitative attributes, e.g. price, are typically aggregated by
taking a sum or an average.
– Other attributes, e.g. item or date, are omitted.
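A minimal sketch of storewide aggregation: price is summed and averaged per store, while item and date are dropped. The store names, items, dates, and prices are hypothetical, as is the extra transaction count kept per store.

```python
from collections import defaultdict

# Hypothetical transaction records.
transactions = [
    {"store": "A", "item": "Watch", "date": "2004-09-06", "price": 20.0},
    {"store": "A", "item": "Battery", "date": "2004-09-06", "price": 5.0},
    {"store": "B", "item": "Shoes", "date": "2004-09-06", "price": 12.5},
    {"store": "B", "item": "Socks", "date": "2004-09-07", "price": 7.5},
    {"store": "B", "item": "Belt", "date": "2004-09-07", "price": 10.0},
]


def aggregate_by_store(transactions):
    """Replace all transactions of a store with one storewide record."""
    prices = defaultdict(list)
    for t in transactions:
        # Only the quantitative attribute is kept; item and date are omitted.
        prices[t["store"]].append(t["price"])
    return {
        store: {
            "total_price": sum(values),
            "average_price": sum(values) / len(values),
            "transactions": len(values),
        }
        for store, values in prices.items()
    }


storewide = aggregate_by_store(transactions)
for store, summary in storewide.items():
    print(store, summary)
```

Five data objects become two, one per store.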
Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and
the final data analysis.
• Sampling is used in data mining because processing the entire set of data of
interest is too expensive or time consuming.
• The key principle for effective sampling is the following:
– using a sample will work almost as well as using the entire data set,
if the sample is representative.
– A sample is representative if it has approximately the same property
(of interest) as the original set of data.
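A minimal sketch of simple random sampling without replacement, checking whether the sample preserves one property of interest (the mean). The synthetic population, the seed, and the sample sizes are assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population of 8000 values (hypothetical data).
population = [rng.gauss(50.0, 10.0) for _ in range(8000)]


def simple_random_sample(data, size, rng):
    """Draw `size` objects without replacement, each equally likely."""
    return rng.sample(data, size)


population_mean = mean(population)
for size in (2000, 500):
    sample = simple_random_sample(population, size, rng)
    print(size, round(mean(sample), 2), round(population_mean, 2))
```

The sample means should stay close to the population mean; how close depends on the sample size and on the spread of the data.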
Sample Size
[Figure: the same data set shown with 8000, 2000, and 500 points]
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the
space that it occupies
• For classification, this can mean that there are not enough data objects to
allow the creation of a model that reliably assigns a class to all possible
objects.
• Definitions of density and distance between points, which are critical for
clustering and outlier detection, become less meaningful.
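One common way to illustrate why distance becomes less meaningful is to measure how much the farthest point differs from the nearest one as the number of dimensions grows. The sketch below does this for uniformly random points; the function name `relative_contrast`, the point count, and the seed are assumptions.

```python
import math
import random


def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    points = [
        [rng.random() for _ in range(dimensions)] for _ in range(n_points)
    ]
    query = [rng.random() for _ in range(dimensions)]
    distances = [math.dist(query, point) for point in points]
    return (max(distances) - min(distances)) / min(distances)


for dimensions in (2, 10, 100, 500):
    print(dimensions, round(relative_contrast(dimensions), 3))
```

As the dimensionality rises, the nearest and farthest neighbors end up at nearly the same distance, so the notion of a "close" point loses its discriminating power.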
Feature Subset Selection
• Redundant features
– duplicate much or all of the information contained in one or more
other attributes
– Example: purchase price of a product and the amount of sales tax
paid
• Irrelevant features
– contain no information that is useful for the data mining task at
hand
– Example: students' ID is often irrelevant to the task of predicting
students' GPA
Feature Subset Selection
• Techniques:
– Brute-force approach:
• Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches:
• Features are selected before data mining algorithm is run
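A minimal sketch of two of these techniques. The brute-force part only enumerates the candidate subsets (there are 2^n - 1 non-empty ones for n features). The filter part ranks features by absolute Pearson correlation with the target, which is one possible filter criterion chosen here for illustration. The feature names and values are hypothetical.

```python
import math
from itertools import combinations


def all_feature_subsets(features):
    """Brute force: yield every non-empty subset of the features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def filter_select(columns, target, k):
    """Filter approach: keep the k features most correlated with the target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical student data.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "student_id": [17, 3, 42, 8, 25],
    "attendance": [2, 1, 4, 3, 5],
}
gpa = [2.0, 2.4, 3.0, 3.4, 4.0]

print(len(list(all_feature_subsets(list(columns)))))
print(filter_select(columns, gpa, k=2))
```

The filter runs before any data mining algorithm and is cheap, whereas the brute-force enumeration grows exponentially with the number of features.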
Discretization and Binarization
• Some data mining algorithms, especially certain classification
algorithms, require that the data be in the form of categorical
attributes.
• Algorithms that find association patterns require that the data be
in the form of binary attributes.
• Thus it is often necessary to transform a continuous attribute into
a categorical attribute (discretization), and both continuous and
discrete attributes may need to be transformed into one or more
binary attributes (binarization).
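A minimal sketch of both transformations, using the taxable incomes (in thousands) from the record data table. Equal-width binning and one binary attribute per category are simple choices assumed here; the function names are illustrative.

```python
def equal_width_bins(values, n_bins):
    """Discretization: map each value to one of n_bins equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = []
    for value in values:
        index = int((value - low) / width) if width > 0 else 0
        # The maximum value falls on the upper edge; keep it in the last bin.
        bins.append(min(index, n_bins - 1))
    return bins


def one_hot(categories):
    """Binarization: one binary attribute per distinct category."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


# Taxable income in thousands, Tid 1 to 10.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

bins = equal_width_bins(incomes, 4)
binary = one_hot(bins)

print(bins)
print(binary[0], binary[6])
```

With four bins of width 40 starting at 60, most incomes land in the first two bins and the 220K record sits alone in the last one, which shows how equal-width binning reacts to an extreme value.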