28SpCS157BL14TRC4NFASRppt

Download Report

Transcript 28SpCS157BL14TRC4NFASRppt

Tuple Relational
Calculus and 4th NF
Association Rules
CS 157B
Prof. Sin-Min Lee
• Unary Relational Operations
• Relational Algebra Operations from
Set Theory
• Binary relational Operations
• Additional Relational Operations
• The Tuple Relational Calculus
•
Unary Relational
Operations
The relational algebra is a set of
operators that take and return relations
• Unary operations take one relation,
and return one relation:
• SELECT operation
• PROJECT operation
• Sequences of unary operations
• RENAME operation
The SELECT Operation
• It selects a subset of tuples from a
relation that satisfy a SELECT
condition.
• The SELECT operation is denoted by:
 <Selection condition> (R)
• The selection condition is a Boolean
expression specified on the attributes
of relation R
The SELECT Operation
• The degree of the relation resulting
from a SELECT operation is the same as
that of R
• For any selection-condition c, we have
|  <c> (R) |  | R |
• The SELECT operation is commutative,
that is,
 <c1> ( <c2> ( R )) =  <c2> ( <c1> ( R ))
The PROJECT Operation
• The PROJECT operation , selects certain
columns from a relation and discards the
columns that are not in the PROJECT list
• The PROJECT operation is denoted by:
 <attribute list> ( R )
Where attribute-list is a list of attributes
from the relation R
The PROJECT Operation
• The degree of the relation resulting
from a PROJECT operation is equal to
the number of attributes in attribute-list.
• If only non-key attributes appear in an
attribute-list, duplicate tuples are likely
to occur. The PROJECT operation,
however, removes any duplicate tuples –
this is called duplicate elimination
• The PROJECT operation is not
commutative
Results of SELECT and PROJECT operations.
(a) (DNO=4 AND SALARY>25000) OR (DNO=5 AND
SLARY>30000)(EMPLOYEE).
(b)  LNAME, FNAME, SALARY (EMPLOYEE)
(c)  SEX, SALARY(EMPLOYEE).
Relational Algebra
Operations
Two relations R and R are said to be
1
2
union compatible if they have the same
degree and all their attributes
(correspondingly) have the same
domain.
• The UNION, INTERSECTION, and SET
DIFFERENCE operations are applicable
on union compatible relations
• The resulting relation has the same
attribute names as the first relation
The UNION operation
• The result of UNION operation on two
relations, R1 and R2, is a relation, R3,
that includes all tuples that are either in
R1, or in R2, or in both R1 and R2.
• The UNION operation is denoted by:
R3 = R 1  R2
• The UNION operation eliminates
duplicate tuples
Results of the UNION operation
RESULT = RESULT1  RESULT2.
The INTERSECTION
operation
• The result of INTERSECTION operation
on two relations, R1 and R2, is a relation,
R3, that includes all tuples that are in
both R1 and R2.
• The INTERSECTION operation is
denoted by:
R3 = R 1  R2
• The both UNION and INTERSECTION
operations are commutative and
associative operations
The SET DIFFERENCE
Operation
• The result of SET DIFFERENCE
operation on two relations, R1 and R2, is
a relation, R3, that includes all tuples
that are in R1 but not in R2.
• The SET DIFFERENCE operation is
denoted by:
R3 = R 1 – R2
• The SET DIFFERENCE (or MINUS)
operation is not commutative
(a)
(c)
(e)
Two union-compatible relations.
(b) STUDENT 
INSTRUCTOR.
STUDENT  INSTRUCTOR. (d) STUDENT – INSTRUCTOR.
INSTRUCTOR – STUDENT
The CARTESIAN PRODUCT
Operation
• This operation (also known as CROSS
PRODUCT or CROSS JOIN) denoted by:
R3 = R 1  R2
• The resulting relation, R3, includes all
combined tuples from two relations R1
and R2
• Degree (R3) = Degree (R1) + Degree (R2)
• |R3| = |R1|  |R2|
The CARTESIAN PRODUCT (CROSS
PRODUCT)
The CARTESIAN PRODUCT (CROSS
PRODUCT)
Binary Relational
Operations
• An important operation for any
relational database is the JOIN
operation, because it enables us to
combine related tuples from two
relations into single tuple
• The JOIN operation is denoted by:
R3 = R1 ⋈ <join condition> R2
• The degree of resulting relation is
degree(R1) + degree(R2)
The JOIN Operation
• The difference between CARTESIAN
PRODUCT and JOIN is that the resulting
relation from JOIN consists only those
tuples that satisfy the join condition
• The JOIN operation is equivalent to
CARTESIAN PRODUCT and then SELECT
operation on the result of CARTESIAN
PRODUCT operation, if the selectcondition is the same as the join
condition
Result of the JOIN operation DEPT_MGR 
DEPARTMENT JOIN MGRSSN=SSN
EMPLOYEE .
The EQUIJOIN Operation
• If the JOIN operation has equality
comparison only (that is, = operation),
then it is called an EQUIJOIN operation
• In the resulting relation on an EQUIJOIN
operation, we always have one or more
pairs of attributes that have identical
values in every tuples
The NATURAL JOIN
Operation
• In EQUIJOIN operation, if the two
attributes in the join condition have the
same name, then in the resulting
relation we will have two identical
columns. In order to avoid this problem,
we define the NATURAL JOIN operation
• The NATURAL JOIN operation is
denoted by:
R3 = R1 *<attribute list> R2
• In R3 only one of the duplicate attributes
from the list are kept
(a) PROJ_DEPT  PROJECT * DEPT.
(b) DEPT_LOCS  DEPARTMENT *
DEPT_LOCATIONS.
A Complete Set of Relational
Algebra
• The set of relational algebra operations
{σ, π, , -, x }
Is a complete set. That is, any of the other
relational algebra operations can be
expressed as a sequence of operations
from this set.
For example:
R  S = (R  S) – ((R – S)  (S – R))
The DIVISION Operation
•
The DIVISION operation is useful for
some queries. For example, “Retrieve
the name of employees who work on
all the projects that ‘John Smith’
works on.”
•
The Division operation is denoted by:
R3 = R1 ÷ R2
•
In general, attributes in R2 are a
subset of attributes in R1
(a) Dividing SSN_PNOS by SMITH_PNOS.
(b) T  R ÷ S.
•
Additional Relational
Operations
Statistical queries cannot be specified
in the basic relational algebra
operation
•
We define aggregate functions that
can be applied over numeric values
such as, SUM, AVERAGE, MAXIMUM,
MINIMUM, and COUNT
•
The aggregate function is denoted by
<grouping attributes> δ
<function list>
(R)
where aggregation (grouping) is based
on the <attributes-list>. If not present,
just 1 group
•
The AGGREGATE FUNCTION
operation.
The Tuple Relational
Calculus
•
The relational algebra is procedural
•
The relational calculus is
nonprocedural
•
Any query that can be specified in the
basic relational algebra can also be
specified in relational calculus, and
vice versa (their expressive power are
identical)
•
A relational query language is said to
be relationally complete if any query
can be expressed in that language
Tuple Variables and Range
Relations
•
The tuple relational calculus is based on
specifying a number of tuple variables
•
The range of each tuple variable is a relation
•
A tuple relational calculus query is denoted by
{t | COND(t)}, where t is a tuple variable and
COND(t) is a conditional expression involving t.
For example,
{t | EMPLOYEE(t) AND t.SALARY>50000}
•
Expressions and Formulas
in TRC
A general expression is of the form
{t1.A1,…, t n.A
t m)}
n
| COND(t1,…, t n, t
n+1,
…,
•
Each COND is a condition or formula
of the tuple relational calculus
•
A condition/formula is made up of one
or more atoms connected via the
logical operators AND, OR, and NOT
The Existential and Universal
Quantifiers
•
Two quantifier can appear in a tuple
relational calculus formula
•
•
•
The universal quantifier, ∀
The existential quantifier, ∃
A tuple variable, t, is bound if it is
quantified, that is, appears in an (∃t)
or (∀t) clause, otherwise it is free
The Existential and Universal
Quantifiers
•
It is possible to transform a universal
quantifier into an existential quantifier,
and vice versa
•
For example:
(∀x)(P(x)) ≡ NOT (∃x) (NOT (P(x))
(∃x)(P(x)) ≡ NOT (∀x) (NOT(P(x))
(∀x)(P(x) AND (Q(x))) ≡ NOT(∃x)(NOT(P(x))
OR(NOT(Q(x))))
Safe and Unsafe Expressions
•
In relational calculus, an expression is
said to be safe if it guaranteed to yield
a finite number of tuples as its result
•
The expression {t | NOT
(EMPLOYEE(t))} is unsafe, because it
yields all tuples in the universe that
are not EMPLOYEE tuples
•
In a safe expression all resulting
values are from the domain of the
expression
Multivalued
Dependencies (MVDs)
• Let R be a relation schema and let   R
and   R. The multivalued dependency
  
holds on R if in any legal relation r(R), for all
pairs for tuples t1 and t2 in r such that t1[] =
t2 [], there exist tuples t3 and t4 in r such
that:
t1[] = t2 [] = t3 [] = t4 []
t3[]
= t1 []
t3[R – ] = t2[R – ]
t4 []
= t2[]
t [R – ] = t [R – ]
MVD (Cont.)
• Tabular representation of 


Multivalued Dependency
in 4N
• A table with a multivalued
dependency is one where the
existence of more than one
independent many-to-many
relationships in a table causes
redundancy; and it is this
redundancy which is removed
by fourth normal form.
Example
Restaurant
Elite Pizza
Elite Pizza
A1 Pizza
A1 Pizza
Pizza
Delivery
Variety
Area
Thin Crust Capital
City
Stuffed
Capital
Crust
City
Thick
Springfield
Crust
Thick
Shelbyvill
Example (cont’)
• It is 1N, 2N, 3N and BCNF
• But because the varieties of pizza a
restaurant offers are independent
from the areas to which the
restaurant delivers, there is
redundancy in the table:
• for example, we are told three times
that A1 Pizza offers Stuffed Crust,
and if A1 Pizza start producing
Cheese Crust pizzas then we will
need to add multiple records, one for
The Fix
Restaurant
Elite Pizza
Elite Pizza
A1 Pizza
A1 Pizza
A1 Pizza
Pizza
Variety
Thin Crust
Delivery
Area
Capital
City
Stuffed
Capital
Crust
City
Thick Crust Springfield
Thick Crust Shelbyville
Thick Crust Capital
Restaur
ant
Elite
Pizza
Elite
Pizza
A1
Pizza
A1
Pizza
Pizza
Variety
Thin
Crust
Stuffed
Crust
Thick
Crust
Stuffed
Crust
Restaur
ant
Elite
Pizza
A1 Pizza
Delivery
Area
Capital
City
Springfiel
d
A little thought
Q:
What if the pizza varieties
offered by a restaurant
sometimes did vary from one
delivery area to another?
A:
Then the original three-column
table would satisfy 4NF
Course
Book
Lecturer
AHA
Silberschatz John D
AHA
Nederpelt
AHA
Silberschatz William M
AHA
Nederpelt
AHA
Silberschatz Christian G
AHA
Nederpelt
OSO
Silberschatz John D
OSO
Silberschatz William M
John D
William M
Christian G
A.
1. {course}  {book}
2. {course}  {lecturer}
Conclusions
• Databases with multivalued
dependencies exhibit
redundancy.
• In database normalization,
fourth normal form requires that
databases have no multivalued
dependencies.
Extra
• Transitivity, Reflexivity,
Complementation, Replication,
Augmentation and a few more
are all properties of 4th normal
form
X ->> Y is trivial if
(a) Y  X or
(b) Y U X = R
Multivalued
Dependencies
• There are database schemas in BCNF that do not
seem to be sufficiently normalized
• Consider a database
classes(course, teacher, book)
such that (c,t,b)  classes means that t is qualified
to teach c, and b is a required textbook for c
• The database is supposed to list for each course
the set of teachers any one of which can be the
course’s instructor, and the set of books, all of
which are required for the course (no matter who
teaches it).
Dependencies
course
database
database
database
database
database
database
operating systems
operating systems
operating systems
operating systems
teacher
Avi
Avi
Hank
Hank
Sudarshan
Sudarshan
Avi
Avi
Jim
Jim
classes
book
DB Concepts
Ullman
DB Concepts
Ullman
DB Concepts
Ullman
OS Concepts
Shaw
OS Concepts
Shaw
• There are no non-trivial functional dependencies
and therefore the relation is in BCNF
• Insertion anomalies – i.e., if Sara is a new teacher
that can teach database, two tuples need to be
inserted
(database, Sara, DB Concepts)
(database, Sara, Ullman)
Dependencies
• Therefore, it is better to
decompose classes into:
course
teacher
database
database
database
operating systems
operating systems
Avi
Hank
Sudarshan
Avi
Jim
teaches
course
book
database
database
operating systems
operating systems
DB Concepts
Ullman
OS Concepts
Shaw
text
We shall see that these two relations are in Fourth Normal
Form (4NF)
What are Association
Rules?
- Techniques used to detect
associations or relationships
between elements in large data
sets.
- Show value conditions that
occur frequently in a data set.
- Association Rules are basically
if-then rules supported by data
- Application of this is called
Market Basket Analysis
What are Association
Rules?
• Finding frequent patterns,
correlations, or associations
among a set of items or objects
in transaction databases,
relational databases, or other
information repositories
Applications of
Association Rules
•
•
•
•
•
Market Basket Analysis
Classification
Clustering
Cross-marketing
Loss-leader Analysis
Characteristics of Association Rules
• Consists of a set of items, the
rule body, leading to another
item, the rule head
XY
• Association Rules relate the
rule body X to the rule head Y
Itemsets
• itemset = a set of items
X = {apple, orange} is an itemset
• k-itemset = a set of k items
X = {apple, orange, banana} = 3itemset
Rules
• Let I = {i1, i2, …, im} be a set of
items
• Let transaction t be a set of
items where
t I
• Let T be the Transaction
Database or set of transactions
where
T = {t1, t2, …, tn}.
Rules
• Transaction t contains itemset
X which belongs to I, a set of
items
• Formal association rule is
defined as:
X  Y, where X, Y  I, and X Y
=
Support
• Support is the percentage of
transactions that support an
association rule.
• It is the percentage of transactions
that contain product X and product
Y
Support = Probability(X  Y).
Confidence
• Confidence is the strength or
reliability of an association rule
• It is the probability that if there is
X in a transaction, then Y will also
be present.
Confidence = Support (X,Y) / Support
(X) or
Confidence = Probability (Y | X)
Thresholds
• Minimum Support Threshold and
Minimum Confidence Threshold
• The association rule is more
valuable if they satisfy these
minimum values
• The higher the percentage, the
more useful the data is
Uses of Association
Rules?
•
•
•
•
•
•
Retail Shopping
Credit Card transactions
Online purchases
Medical patient histories
Banking services
Insurance claims
Market Basket Analysis
• Modeling technique based on
theory that if you buy one group of
items, you are also likely to buy
another group of items
• In consumer behavior, most
purchases are bought on impulse
• Market Basket Analysis seeks to
find a relationship between
purchases
Uses of Market Basket
Analysis
• Identify unexpected shopping
patterns
• Targeted marketing towards
specific types of people
• Predicting customer response
rates to marketing campaigns
• Distinguish between profitable
and unprofitable customers
Maximizing Profitability
- Arrangement of
items in a store
- Planning of specific
sales during times of
the year
- Pricing Policy of
certain goods
Example
-A market has 100 transactions
-15 transactions contain product
X
-Of the 15 transactions, 5
transactions contain product Y
Support = 5/100 or 5%
Confidence = 5/15 or 33.3%
Example
• Milk + Bread => Butter
X = Milk + Bread
Y = Butter
“Milk + Bread” is the Rule Body
and “Butter” is the Rule Head
Example
-
5% of all transactions will contain
combination of Milk, Bread, and Butter.
-If customers buy Milk and Bread, there is a
33.3% possibility that they will buy Butter
Support = 5%
Confidence = 33.3%
Mining Algorithms
• Analyze a set of data, looking
for patterns or trends
• They differ in the strategy and
data structure used
• Their efficiency and memory
requirements also differ
Apriori Algorithm
• Most classic and well-known
algorithm for association rules
• Useful in databases that contain
transactions
• Principle: Any subset of a
frequent itemset must be
frequent
if {A,B} is a frequent itemset,
{A} and {B} must also be frequent
Apriori Algorithm
Method
• Count 1-itemsets, then the 2itemsets, then the 3-itemsets and
so on
• When counting k-itemsets, only
consider those itemsets where
all subsets of length k have been
determined to be frequent in the
previous step
• Non frequent itemsets are pruned
Apriori Algorithm
Diagram
null
A
B
C
D
E
AB
AC
AD
AE
BC
BD
BE
CD
CE
DE
ABC
ABD
ABE
ACD
ACE
ADE
BCD
BCE
BDE
CDE
ABCD
Pruned
supersets
ABCE
ABDE
ABCDE
ACDE
BCDE