Transcript slides
Data Preprocessing
Relational Databases - Normalization
Denormalization
Data Preprocessing
Missing Data
Missing values and the 3VL approach
Problems with 3VL approach
Special Values
Remember:
Relational Databases
Model entities and relationships
Entities are the things in the real world
Information about employees and the
department they work for
Employee and department are entities
Relationships are the links between these
entities
Employee works for a department
Relation or Tables
Relation: a table of data
• table = relation (set theory, based on predicate logic)

employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  SystemAdministrator   128
Columns and Rows
Each column or attribute describes some piece of data that each record in the table has
Each row in a table represents a record
Rows, records, or tuples
Keys
A superkey is a column (or a set of columns) that can be used to identify a row in a table
There are different possible keys
A key is a minimal superkey
Candidate keys
We choose from the candidate keys the primary key
The primary key is used to identify a single row (record)
Foreign keys represent links between tables
Keys
primary key: employeeID
foreign key: departmentID

employeeID  name     job                   departmentID
7513        Nora     Programmer            128
9842        Ben      DBA                   42
6651        Alex     Programmer            128
9006        Claudia  SystemAdministrator   128
Functional Dependencies
If there is a functional dependency between columns A and B in a given table, which may be written
A → B
then the value of column A determines the value of column B
employeeID functionally determines the name
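Whether a functional dependency holds in a given table can be checked mechanically: A → B holds exactly when no two rows agree on A but differ on B. A minimal sketch (the sample rows are illustrative):

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    rows: list of dicts; lhs, rhs: lists of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[b] for b in rhs)
        if key in seen and seen[key] != value:
            return False  # two rows agree on lhs but differ on rhs
        seen[key] = value
    return True

employees = [
    {"employeeID": 7513, "name": "Nora", "job": "Programmer"},
    {"employeeID": 9842, "name": "Ben",  "job": "DBA"},
    {"employeeID": 6651, "name": "Alex", "job": "Programmer"},
]

print(holds_fd(employees, ["employeeID"], ["name"]))  # True
```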
Schema
Database schema
Structure or design of the database
Database without any data in it
employee(employeeID,name,job,departmentID)
Design
Minimize redundancy
Redundancy: data is repeated in different rows
employee(employeeID,name,job,departmentID,departmentName)

employeeID  name     job                   departmentID  departmentName
7513        Nora     Programmer            128           Research and Development
9842        Ben      DBA                   42            Finance
6651        Alex     Programmer            128           Research and Development
9006        Claudia  System-Administrator  128           Research and Development
Reduce redundancy
employee(employeeID,name,job,departmentID,departmentName)
becomes
employee(employeeID,name,job,departmentID)
department(departmentID,name)
Insert Anomalies
Insert data into a flawed table
The data does not match what is already in the table
It is not obvious which of the rows in the database is correct
Deletion Anomalies
Delete data from a flawed schema
When we delete all the employees of department 128, we no longer have any record that department 128 exists
Update Anomalies
Change data in a flawed schema
We do not change the data for every row correctly
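The deletion anomaly can be reproduced directly. In this sketch (schema and data are illustrative, run on SQLite), the department name lives only in the employee rows, so deleting the last employee of department 42 erases all record of that department:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE employee (
    employeeID INTEGER, name TEXT, job TEXT,
    departmentID INTEGER, departmentName TEXT)""")
con.execute("INSERT INTO employee VALUES (9842, 'Ben', 'DBA', 42, 'Finance')")
con.execute("INSERT INTO employee VALUES "
            "(7513, 'Nora', 'Programmer', 128, 'Research and Development')")

# Deleting the only employee of department 42...
con.execute("DELETE FROM employee WHERE employeeID = 9842")

# ...also deletes all knowledge that department 42 ('Finance') exists.
rows = con.execute(
    "SELECT DISTINCT departmentID, departmentName FROM employee").fetchall()
print(rows)  # [(128, 'Research and Development')]
```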
Null Values
Avoid schema designs that have large
numbers of empty attributes
Normalization
Remove design flaws from a database
Normal forms, which are a set of rules
describing what we should and should not
do in our table structures
Breaking tables into smaller tables that
form a better design
Normal Forms
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
First Normal Form (1NF)
Each attribute or column value must be atomic
Each attribute must contain a single value

employeeID  name     job                   departmentID  skills
7513        Nora     Programmer            128           C, Perl, Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB, Java
9006        Claudia  System-Administrator  128           NT, Linux
1NF

employeeID  name     job                   departmentID  skill
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux
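Bringing a table with a multi-valued skills attribute into 1NF means emitting one row per (employee, skill) pair. A sketch over a subset of the slide's data:

```python
# Non-1NF rows: the skills attribute holds a list of values.
raw = [
    (7513, "Nora", "Programmer", 128, ["C", "Perl", "Java"]),
    (9842, "Ben", "DBA", 42, ["DB2"]),
]

# 1NF: one row per (employee, skill) pair; every attribute is atomic.
flat = [(eid, name, job, dept, skill)
        for (eid, name, job, dept, skills) in raw
        for skill in skills]

for row in flat:
    print(row)  # 4 rows in total
```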
Second Normal Form (2NF)
All attributes that are not part of the primary key are fully dependent on the primary key
Each non-key attribute must be fully functionally dependent on the key
The table is already in 1NF
2NF?

employeeID  name     job                   departmentID  skill
7513        Nora     Programmer            128           C
7513        Nora     Programmer            128           Perl
7513        Nora     Programmer            128           Java
9842        Ben      DBA                   42            DB2
6651        Alex     Programmer            128           VB
6651        Alex     Programmer            128           Java
9006        Claudia  System-Administrator  128           NT
9006        Claudia  System-Administrator  128           Linux

employee(employeeID,name,job,departmentID,skill)
Functional dependencies
The primary key is (employeeID, skill)
employeeID → name, job, departmentID
name, job, and departmentID are partially functionally dependent on the primary key
They are not fully functionally dependent on the primary key
2NF
Decompose the table into tables in which all the non-key attributes are fully functionally dependent on the key
Breaking the table into two tables:
employee(employeeID,name,job,departmentID)
employeeSkills(employeeID,skill)
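The decomposition is just two projections of the 1NF table; duplicates disappear because relations are sets. A sketch over a subset of the slide's data:

```python
rows = [
    (7513, "Nora", "Programmer", 128, "C"),
    (7513, "Nora", "Programmer", 128, "Perl"),
    (7513, "Nora", "Programmer", 128, "Java"),
    (9842, "Ben", "DBA", 42, "DB2"),
]

# employee(employeeID, name, job, departmentID): project away skill.
employee = {(eid, name, job, dept) for (eid, name, job, dept, _) in rows}

# employeeSkills(employeeID, skill): project onto the key attributes.
employee_skills = {(eid, skill) for (eid, _, _, _, skill) in rows}

print(sorted(employee))         # 2 rows: Nora's data is stored only once
print(sorted(employee_skills))  # 4 rows
```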
Third Normal Form (3NF)
Remove all transitive dependencies
Be in 2NF

employeeID  name    job         departmentID  departmentName
7513        Nora    Programmer  128           Research
9842        Ben     DBA         42            Finance
6651        Ajay    Programmer  128           Research
9006        Candy   SYS         128           Research

employee(employeeID,name,job,departmentID,departmentName)
employeeID → name, job, departmentID, departmentName
departmentID → departmentName

Transitive dependency:
employeeID → departmentID and departmentID → departmentName,
so employeeID → departmentName transitively
3NF
Remove the transitive dependency
Decompose into multiple tables:
employee(employeeID,name,job,departmentID)
department(departmentID,departmentName)
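This decomposition is lossless: joining the two tables on departmentID reconstructs the original denormalized table. A sketch over the slide's data:

```python
employee = [
    (7513, "Nora", "Programmer", 128),
    (9842, "Ben", "DBA", 42),
]
department = {128: "Research", 42: "Finance"}

# Natural join on departmentID rebuilds the denormalized rows.
joined = [(eid, name, job, dept, department[dept])
          for (eid, name, job, dept) in employee]

print(joined)
```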
3NF
For every functional dependency, either:
The left side of the functional dependency is a superkey (that is, a key that is not necessarily minimal)
or
The right side of the functional dependency is part of a key of the table
Boyce-Codd Normal Form (BCNF)
All attributes must be functionally determined by a superkey
Full normalization means lots of logically separate relations
Lots of logically separate relations means a lot of physically separate files
Lots of physically separate files means a lot of I/O
Difficulties in finding dimensions for a dimensional schema, star schema (dimension tables, fact table)
What is Denormalization?
Normalizing a relational variable R means replacing R by a set of projections R1, R2, ..., Rn such that R is equal to the join of R1, R2, ..., Rn
Denormalizing the relational variables means replacing them by their join R
Reduce redundancy: each projection R1, R2, ..., Rn is at the highest possible level of normalization
Increase redundancy: ensure that R is at a lower level of normalization than R1, R2, ..., Rn
Problems
Once we start to denormalize, it is not clear when to stop
Dimensional Schema
Array cells are often empty
The more dimensions, the more empty cells
Empty cell = missing information
How do we treat information that is not present?
How does the system support information that
• is unknown
• has not been captured
• is not applicable
• ...
Solution?
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
noisy: containing errors or outliers
• e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
Data warehouse needs consistent integration of quality
data
Data extraction, cleaning, and transformation
comprises the majority of the work of building a
data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
“Data cleaning is the number one problem in data
warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
not register history or changes of the data
Missing data may need to be inferred
Missing Values
The approach to the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL)
null corresponds to UNK, for "unknown"
Is 3VL a mistake?
Boolean Operators

AND | t  u  f
----+---------
 t  | t  u  f
 u  | u  u  f
 f  | f  f  f

OR  | t  u  f
----+---------
 t  | t  t  t
 u  | t  u  u
 f  | t  u  f

NOT
 t | f
 u | u
 f | t

A scalar comparison in which either of the compared operands is UNK evaluates to the unknown truth value (u)
MAYBE
Another important Boolean operator is MAYBE

MAYBE
 t | f
 u | t
 f | f
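The truth tables above can be encoded directly. A minimal sketch using Python's None as the unknown truth value u (the function names are illustrative):

```python
def and3(p, q):
    # Kleene 3VL AND: false dominates; otherwise unknown propagates.
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def or3(p, q):
    # Kleene 3VL OR: true dominates; otherwise unknown propagates.
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def not3(p):
    return None if p is None else (not p)

def maybe(p):
    # MAYBE(p) is true iff p evaluates to the unknown truth value.
    return p is None

print(or3(None, not3(None)))  # u OR NOT(u) = u, not t!
```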
Example
Consider the query "Get employees who may be, but are not definitely known to be, programmers born before January 18, 1971, with salary less than €40,000"

EMP WHERE MAYBE ( JOB = ‘PROGRAMMER’ AND
                  DOB < DATE (‘1971-1-18’) AND
                  SALARY < 40000 )

Without MAYBE, we assume the existence of another operator called IS_UNK, which takes a single scalar operand and returns true if the operand evaluates to UNK, otherwise false

EMP WHERE ( JOB = ‘PROGRAMMER’ OR IS_UNK (JOB) )
    AND ( DOB < DATE (‘1971-1-18’) OR IS_UNK (DOB) )
    AND ( SALARY < 40000 OR IS_UNK (SALARY) )
    AND NOT ( JOB = ‘PROGRAMMER’ AND
              DOB < DATE (‘1971-1-18’) AND
              SALARY < 40000 )
Numeric expression
WEIGHT * 454
If WEIGHT is UNK, then the result is also UNK
Any numeric expression is considered to evaluate to UNK if any operand of that expression is itself UNK
Anomalies
WEIGHT - WEIGHT = UNK (not 0)
WEIGHT / 0 = UNK (no "zero divide" error)
UNK is not u (unk)
UNK (the value-unknown null) and u (unk) (the unknown truth value) are not the same thing
u is a value; UNK is not a value at all!
Suppose X is BOOLEAN
It has three values: t (true), f (false), u (unk)
If X is u, X is known to be u (unk)
If X is UNK, X is not known!
Some 3VL Consequences
The comparison x = x does not necessarily give true
In 3VL, x is not equal to itself if x happens to be UNK
The Boolean expression p OR NOT(p) does not necessarily give true
unk OR NOT(unk) = unk
Example
Get all suppliers in Porto and take the union with all suppliers not in Porto
We do not get all suppliers!
We need to add "maybe in Porto"
In 2VL, p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL
While two cases may exhaust the full range of possibilities in the real world, the database does not contain the real world; instead it contains only knowledge about the real world
Some 3VL Consequences
The expression r JOIN r does not necessarily give r
A=B and B=C together do not imply A=C
...
Many equivalences that are valid in 2VL break down in 3VL
We will get wrong answers
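The Porto example can be reproduced with SQL's nulls (SQLite, illustrative data): a supplier whose city is unknown satisfies neither "in Porto" nor "not in Porto", because both comparisons evaluate to the unknown truth value, so the union misses it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE supplier (sid INTEGER, city TEXT)")
con.executemany("INSERT INTO supplier VALUES (?, ?)",
                [(1, "Porto"), (2, "Lisboa"), (3, None)])

in_porto = con.execute(
    "SELECT sid FROM supplier WHERE city = 'Porto'").fetchall()
not_in_porto = con.execute(
    "SELECT sid FROM supplier WHERE city <> 'Porto'").fetchall()

# Supplier 3 appears in neither result: p OR NOT(p) is not true for nulls.
print(in_porto + not_in_porto)  # [(1,), (2,)] -- supplier 3 is lost
```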
Special Values
Drop the idea of null, UNK/unk, and 3VL
Use special values instead to represent missing information
Special values are used in the real world
In the real world we might use the special value "?" to denote the hours worked by a certain employee if the actual value is unknown
Special Values
General idea:
Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used
The special value must be of the applicable type; the attribute's type is then not just integers, say, but integers plus whatever the special value is
The approach is not very elegant, but it avoids the 3VL problems, because it stays in 2VL
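A sketch of the special-value approach, with "?" as an assumed sentinel for unknown hours worked (names and data are illustrative): all logic stays two-valued, but every computation must check for the sentinel explicitly.

```python
UNKNOWN = "?"  # special value, distinct from all regular values

hours_worked = {"Nora": 38, "Ben": UNKNOWN, "Alex": 40}

# Ordinary 2VL comparison: the special value is just another value,
# so x == x is always true -- no 3VL surprises.
assert hours_worked["Ben"] == hours_worked["Ben"]

# But every computation must handle the special value explicitly.
known = [h for h in hours_worked.values() if h != UNKNOWN]
print(sum(known))  # 78: total over the known values only
```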
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
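The automatic fill-in strategies can be sketched as follows (illustrative data; None marks a missing value):

```python
data = [
    {"class": "A", "income": 30.0},
    {"class": "A", "income": None},
    {"class": "B", "income": 50.0},
    {"class": "B", "income": 70.0},
]

# Strategy 1: fill with the overall attribute mean.
known = [r["income"] for r in data if r["income"] is not None]
mean = sum(known) / len(known)  # (30 + 50 + 70) / 3 = 50.0

# Strategy 2 (smarter): fill with the mean of the same class.
def class_mean(cls):
    vals = [r["income"] for r in data
            if r["class"] == cls and r["income"] is not None]
    return sum(vals) / len(vals)

filled = [dict(r, income=r["income"] if r["income"] is not None
               else class_mean(r["class"])) for r in data]
print(filled[1]["income"])  # class-A mean: 30.0
```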
Relational Databases - Normalization
Denormalization
Data Preprocessing
Missing Data
Missing values and the 3VL approach
Problems with 3VL approach
Special Values
Next..
Data Preprocessing
Visual inspection
Noise Reduction
Data Reduction
Data Discretization
Data Integration