Data Preparation for Data Mining by Yuenho Leung (4/13)


Data Preparation for Data Mining
Prepared by: Yuenho Leung
What to Do before Data Preparation
Before the data preparation stage, you should already have:
Understood the domain of your problem
Planned the solutions and approaches you are going to apply
Collected as much data as possible, including incomplete data
Data Representation Format
The first step of data preparation is to convert the raw data
into a row-and-column format, stored in a form such as:
XML
Access
SQL
Validate Data
To validate data, you need to:
Check the value by data type.
Check the range of the variable
Compare the values with other instances (rows).
Check columns by their relationships.
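The first two checks in the list above can be sketched in Python as follows. This is only an illustration: the field names "month" and "date" and the sample rows are assumptions made for the example, not part of the original slides.

```python
from datetime import datetime

def is_valid_month(value):
    """Type check, then range check: a month must be an integer from 1 to 12."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        return False
    return 1 <= value <= 12

def is_valid_date(text):
    """Type/format check: the text must parse as a real YYYY/M/D calendar date."""
    try:
        datetime.strptime(text, "%Y/%m/%d")
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical rows, invented for this example.
rows = [
    {"month": 7, "date": "1975/7/10"},      # passes both checks
    {"month": 13, "date": "1975/9/5"},      # month is out of range
    {"month": "Jul", "date": "1975/2/30"},  # wrong type, impossible date
]

# Indexes of the rows that fail at least one check.
bad_rows = [i for i, row in enumerate(rows)
            if not (is_valid_month(row["month"]) and is_valid_date(row["date"]))]
print(bad_rows)
```

Checks like these only flag suspicious rows. Deciding whether to repair or remove them is a separate step.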
Validate Data (cont)
To validate this table, you can check the relationship among the city
name, zip code, and area code.
If you get the data from a normalized database, you can skip this
step.
CusID  Name  Address        City        Zip    Phone
1      Alan  1800 Bon Ave.  Elk Grove   95758  916-333-4444
2      Tom   600 Bender Rd  Sacramento  95412  916-112-2345
3      Sam   300 Tent St    San Jose    95112  408-345-2134
Validate Data (cont)
From this table, you can tell the third instance is wrong. Why?
Because every other row records a small earthquake, while the row for
1975/10/20 records a magnitude of 5.10, which is not small.
Date        Time      Latitude  Longitude  Magnitude
1975/7/10   00:41:23  37.1811   -122.0521  1.32
1975/9/5    00:41:23  34.1653   -122.2348  1.54
1975/10/20  00:41:23  31.1873   -122.0512  5.10
1975/11/18  00:41:23  57.1845   -122.2148  2.02
1975/12/30  00:41:23  57.2373   -122.0328  0.50
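Comparing a value with the other instances can be automated in many ways. The sketch below uses a median-absolute-deviation rule, which is a technique chosen for this illustration and is not named in the slides. The threshold of 3.5 is a common rule of thumb and is also an assumption.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the indexes of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # the values are too uniform to define a scale
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Magnitude column of the earthquake table.
magnitudes = [1.32, 1.54, 5.10, 2.02, 0.50]
suspects = flag_outliers(magnitudes)
print(suspects)  # indexes (0-based) of the rows worth a second look
```

With only five rows the result is indicative at best. A flagged row is a candidate for inspection, not proof of an error.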
Validate Data (cont)
Fixing individual errors from each instance is not the main purpose
of data validation.
The main purpose is to find the cause of errors.
If you know the cause of the errors, you might be able to figure out
the pattern of the errors and then fix all of them globally.
For example,
We want to mine the pattern of wind speed from data generated by 5
sensors. We find that 20% of the speed measurements are obviously
wrong. Therefore, we check whether the sensors are working normally.
If we find that a broken sensor always displays readings 10% higher
than the correct readings, we should fix its measurements by
dividing each one by 1.10. (Subtracting 10% of the reading would
overcorrect slightly.)
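A global fix of this kind might look like the sketch below. It assumes the fault is purely multiplicative, so a reading that is 10% too high equals 1.10 times the true value and dividing by 1.10 recovers it. The function name and the sample readings are hypothetical.

```python
def correct_reading(reading, bias=0.10):
    """Undo a multiplicative bias, where reading = true_value * (1 + bias)."""
    return reading / (1.0 + bias)

# Hypothetical readings from the faulty sensor, in arbitrary speed units.
faulty_readings = [11.0, 22.0, 5.5]
corrected = [correct_reading(r) for r in faulty_readings]
print(corrected)

# Subtracting 10% of the reading is not the same correction:
# 11.0 * 0.9 is about 9.9, while the true value in this example is 10.0.
```

The correction should be applied only to measurements from the sensor known to be faulty, which requires the data to record which sensor produced each reading.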
Dealing with Missing and Empty Value
There is no automated technique for differentiating
between missing and empty values:
Example:
CusID  Name  Sandwich  Sauce
1      Alan  Turkey    Sweet Onion
2      Tom   Ham
3      Sam   Beef      Thousand Island
You cannot tell whether:
• Tom didn’t want any sauce, or
• The salesperson forgot to enter the sauce’s name.
Dealing with Missing and Incorrect Value
If you know the value is incorrect or missing, you can:
Ignore the instance that contains the value (not recommended), or
Assign a value based on a reasonable estimate, or
Use a default value
Dealing with Missing and Incorrect Value (cont)
Example of reasonable estimate
CusID  Name  Address        City        Zip    Phone
1      Alan  1800 Bon Ave.  Elk Grove   95758  916-333-4444
2      Tom   600 Bender Rd  Sacramento  95412  916-112-2345
3      Sam                  ???         ???    408-345-2134
From the area code 408, you may guess the city is San Jose
because San Jose accounts for over 50% of the phone numbers with
this area code.
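One way to turn this kind of guess into a repeatable rule is to take, for each area code, the city seen most often among the complete records. The sketch below is a minimal illustration: the reference records, field names, and the "city_estimated" marker are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_city_lookup(records):
    """Map each area code to the city most often recorded with it."""
    counts = defaultdict(Counter)
    for rec in records:
        if rec.get("city"):
            counts[rec["phone"][:3]][rec["city"]] += 1
    return {code: cities.most_common(1)[0][0] for code, cities in counts.items()}

def fill_missing_city(records, lookup):
    """Return copies of the records with empty cities estimated from the phone."""
    filled = []
    for rec in records:
        rec = dict(rec)  # copy, so the raw data stays untouched
        if not rec.get("city"):
            guess = lookup.get(rec["phone"][:3])
            if guess is not None:
                rec["city"] = guess
                rec["city_estimated"] = True  # keep track of what was guessed
        filled.append(rec)
    return filled

# Hypothetical complete records, invented for this example.
known = [
    {"city": "San Jose", "phone": "408-555-0101"},
    {"city": "San Jose", "phone": "408-555-0102"},
    {"city": "Santa Clara", "phone": "408-555-0103"},
    {"city": "Sacramento", "phone": "916-555-0104"},
]
customers = [{"name": "Sam", "city": None, "phone": "408-345-2134"}]

result = fill_missing_city(customers, build_city_lookup(known))
print(result[0]["city"])
```

Marking estimated values lets later stages treat them differently from observed ones.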
Dealing with Missing and Incorrect Value (cont)
Example (cont)
CusID  Name  Address        City        Zip    Phone
1      Alan  1800 Bon Ave.  Elk Grove   95758  916-333-4444
2      Tom   600 Bender Rd  Sacramento  95412  916-112-2345
3      Sam                  San Jose    ???    408-345-2134
You would guess the missing zip code is 95110, because
95110 is the center of San Jose.
Reduce the Number of Variables
More variables generate more relationships, and more data
points are required.
We are not only interested in the pattern of each variable.
We are interested in the pattern of relationships among
variables.
With 10 variables, the 1st variable has to be compared with
9 neighbors, the 2nd with 8, and so on. The result
is 9 x 8 x 7 x 6… which is 362,880 relationships.
With 13 variables, the same product (12 x 11 x 10…) is about 479 million relationships.
With 15 variables, it is about 87 billion relationships.
Therefore, when preparing data sets, try to minimize the
number of variables.
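The growth described above can be checked by evaluating the product directly. The sketch below takes the slide's counting rule as given, that is, the product (n - 1) x (n - 2) x … x 1 for n variables, and simply computes it. The function name is hypothetical.

```python
from math import factorial

def relationship_count(n_variables):
    """Evaluate (n - 1) x (n - 2) x ... x 1, the product used in the slide."""
    return factorial(n_variables - 1)

for n in (10, 13, 15):
    print(n, f"{relationship_count(n):,}")
# 10 variables -> 362,880
# 13 variables -> 479,001,600
# 15 variables -> 87,178,291,200
```

Under this rule, adding the n-th variable multiplies the count by n - 1, which is why removing even a few variables shrinks the problem so sharply.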
Reduce the Number of Variables (cont)
There is no general strategy for reducing the number of variables.
Before selecting a variable set, you must fully understand
the role of each variable in the model.
Define Variable Range
Correct range – a variable range that contains only the
valid values of the variable.
Example: Correct range of month is 1 – 12
Any data not in this range must be either repaired or
removed from the dataset.
Project required range – a variable range we want to
analyze according to the project statement.
Example: For summer sales, the project required range
for month is 7 – 9.
Our goal is to find the pattern of data in this range.
However, data not in this range may be required by the
model.
Define Variable Range (cont)
In the following table, ‘B’ stands for business; Sam is a
company’s name. ‘G’ is outside the correct range.
However, the data miner guesses that it stands for “girl,” so
he replaces ‘G’ with ‘F’. If he wants to mine people’s
shopping behavior, the input will be the ‘M’ and ‘F’ records.
CusID  Name   Address         City        Zip    Phone         Gender
1      April  1800 Bon Ave.   Elk Grove   95758  916-333-4444  G
2      Tom    600 Bender Rd   Sacramento  95412  916-112-2345  M
3      Sam    200 Tend St     San Jose    95112  408-345-2134  B
4      May    237 Hello Blvd  San Jose    9510   408-999-1111  F
Define Variable Range
Example of variable range:
You want to mine the shopping behavior of customers younger
than 40 years old. In the age column, you find that the customers
are between 20 and 150 years old. Therefore, you select all records
with ages between 20 and 40 as your input.
This approach is wrong. Nobody in the world is over 130 years old,
so you can conclude that the records with ages above 130 are
wrong. However, your input should also contain the records
with ages between 40 and 130. Why? Because the density and
distribution of these ages relate directly to the records with ages
below 40.
Conclusion: Your input should cover ages 20 to 130.
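The distinction between the two ranges in this example can be sketched as follows. The record layout, the function names, and the sample ages are hypothetical. The bounds 20, 40, and 130 come from the example above.

```python
MAX_PLAUSIBLE_AGE = 130  # upper bound of the correct range in the example
PROJECT_AGE_LIMIT = 40   # the project studies customers younger than this

def select_input(records, low=20, high=MAX_PLAUSIBLE_AGE):
    """Keep every record inside the correct range, not just the project range."""
    return [r for r in records if low <= r["age"] <= high]

def in_project_range(record, limit=PROJECT_AGE_LIMIT):
    """True for the records the project actually wants to draw conclusions about."""
    return record["age"] < limit

# Hypothetical customer records.
customers = [{"age": a} for a in (25, 39, 40, 75, 130, 150)]

model_input = select_input(customers)                    # drops only age 150
targets = [r for r in model_input if in_project_range(r)]
print(len(model_input), len(targets))
```

The model sees all plausible ages, while the analysis of results focuses on the project required range.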
Choose a Sample
Data miners do not always use the entire data collection.
Instead they usually choose a sample set randomly to
speed up the mining process.
The sample size we pick should depend on:
Number of records available
Distribution and density of the data
Number of variables
Project required range of variables
And more…
This sounds difficult, but there is a strategy for building a sample
dataset…
Choose a Sample (cont)
Strategy for building a sample dataset:
1. Select 10 instances randomly and put them into your
sample set.
2. Create a distribution curve that represents the sample set.
3. Add another 10 random instances to your sample set.
4. Create a distribution curve that represents the new sample
set.
5. Compare the new curve with the previous curve. Do they
look almost the same? If no, go back to step 3. If yes,
stop; that is your sample set.
***A sample distribution curve is on the next slide.
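The steps above can be sketched for a single numeric variable as shown below. Several details are assumptions made for the illustration: the "distribution curve" is approximated by a 10-bin histogram, "look almost the same" is replaced by a maximum per-bin difference of 0.05, and the population is synthetic.

```python
import random

def histogram(values, low, high, bins=10):
    """Relative frequency of the values in equal-width bins over [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        idx = min(int((v - low) / width), bins - 1)  # put the maximum in the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]

def choose_sample(population, step=10, tolerance=0.05, seed=0):
    """Grow a random sample by `step` instances until its histogram stabilizes."""
    pool = list(population)
    if not pool:
        return []
    random.Random(seed).shuffle(pool)  # prefixes of a shuffled list are random draws
    low, high = min(pool), max(pool)
    sample = pool[:step]               # step 1
    if low == high:
        return sample                  # a constant variable has nothing to compare
    previous = histogram(sample, low, high)               # step 2
    while len(sample) < len(pool):
        sample = pool[:len(sample) + step]                # step 3
        current = histogram(sample, low, high)            # step 4
        if max(abs(a - b) for a, b in zip(current, previous)) <= tolerance:
            break                                         # step 5: curves match
        previous = current
    return sample

# Synthetic population for demonstration only.
rng = random.Random(42)
population = [rng.gauss(50, 15) for _ in range(5000)]
sample = choose_sample(population)
print(len(sample))
```

Because each step adds only 10 instances, successive histograms of a large sample differ very little, so a simple threshold can stop before the sample is truly representative. A stricter tolerance or a larger step makes the comparison more demanding.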
Choose a Sample (cont)
• The solid line represents the current sample set.
• The dotted line represents the previous sample set.
Do they look alike?
Reference
• Dorian Pyle, Data Preparation for Data Mining, 1999
Thank you for your attention!