Discovering Functional Dependencies in Relational Database

Download Report

Transcript Discovering Functional Dependencies in Relational Database

Discovering Functional Dependencies in
Relational Databases Using Data Mining
Techniques
Abrar Fawaz AlAbed-AlHaq
Kent State University
[email protected]
October 28, 2011
4/9/2015
1
Outline
 Introduction.
 Problem Statement.
 Data Mining Techniques.
 Discovering Functional Dependencies: Example.
 Experimental Results.
 Conclusion.
 References.
4/9/2015
2
Introduction
 Nowadays there is a fast growing amount of data that are collected
and stored in large databases. As a result, the databases may contain
redundant or inconsistent data.
 An important concept in relational schema design is that of a
functional dependency. Functional dependencies (FD) plays key role
in the design of relational databases, FD is a property of the semantic
or meaning of the attributes. It helps in simplifying the structure of
databases. Discovering functional dependencies (FDs) from an
existing relational instance is an important technique in data mining
and database design.
4/9/2015
3
Introduction (cont.)
 Functional dependencies are relationships between attributes of a
database relation; a functional dependency states that the value of an
attribute is uniquely determined by the value of some other
attributes.
4/9/2015
4
Outline
 Introduction.
 Problem Statement.
 Data Mining Techniques.
 Discovering Functional Dependencies: Example.
 Experimental Results.
 Conclusion.
 References.
4/9/2015
5
Problem Statement
 The problem is to find all functional dependencies among attribute in
a relation database. The early methods for discovering functional
dependencies is based on repeatedly sorting and comparing tuples to
determine whether or not these tuples meet FD definition, so this
approach does not utilize the discovered FDs as knowledge to obtain
new knowledge. As a result this approach is highly sensitive to the
number of tuples and attributes, it is not practical for large database.
Using FD_Mine and TANE algorithms, by using these algorithms,
there is no need to sort on any attribute or compare any value.
4/9/2015
6
Outline
 Introduction.
 Problem Statement.
 Data Mining Techniques.
 Discovering Functional Dependencies: Example.
 Experimental Results.
 Conclusion.
 References.
4/9/2015
7
Data Mining Techniques
 Data mining is the process of producing useful knowledge and
information, it uses an analysis tools to discover patterns and a
relationships in large data that could be use. We can benefit from
data mining algorithm for discovering functional dependencies in
large databases to get useful information. Data mining have several
techniques that we can use it for discovering functional dependencies
such as Apriori algorithm.
 We have two algorithms here:
 FD_Mine
 TANE
4/9/2015
8
The FD_Mine Algorithm
 The mechanism of the FD_Mine is as follow. It uses a level-wise
search, where results from level k are used to explore level k +1.
First, at level 1, all FDs X →Y where X and Y are single attributes
are found and stored in FD_SET F1. The set of candidates that are
considered at this level is denoted L1. F1 and L1 are used to generate
the candidates Xi Xj of L2. At level 2, all FDs of the form Xi Xj → Y
are found and stored in FD_SET F2, F1, F2, L1 and L2 are used to
generate the candidates of L3, and so on, until the candidates at level
Ln-1 have been checked or no candidates remain.
4/9/2015
9
The TANE Algorithm
 TANE searches the set containment lattice in a levelwise manner. A
level LI is the collection of attributes sets of size I. TANE starts with
L1 = {{A} | A R}, and computes L2 from L1, L3 from L2 and so on
according to the information obtained during the algorithm. To
computing partitions; in the beginning, partitions with respect to the
singleton attribute sets are computed straight from relation r. A
partition ∏{A}, where ∏ denoted partition, is computed from the
column r[A] as follows. First, the values of the column are replaced
with integers 1, 2, 3… so the same values are replaced by same
integers and different values with different integers. Then the value
t[A] is the identifier of the equivalence class [t]{A} of ∏{A}, and ∏ {A}
is then easy to construct.
4/9/2015
10
Outline
 Introduction.
 Problem Statement.
 Data Mining Techniques.
 Discovering Functional Dependencies: Example.
 Experimental Results.
 Conclusion.
 References.
4/9/2015
11
Discovering functional dependencies: Example
 Suppose that FD_Mine is applied to database D, as shown in
Table, with R = {A, B, C, D, E}.
4/9/2015
A
B
C
D
E
t1
0
0
0
2
0
t2
0
1
0
2
0
t3
0
2
0
2
2
t4
0
3
1
2
0
t5
4
1
1
1
4
t6
4
3
1
1
2
t7
0
0
1
2
0
12
Discovering functional dependencies: Example
 Next table Summarizes the actions of FD_Mine. In iteration 1, since
|ПA| = |ПAD| = 2, Closure’(A) is set to D, and A→D is deduced. In
same way, D→A is discovered, so the equivalence A↔D is obtained.
As a result, we only need to combine A, B, C, and E to generate the
next level candidates {AB, AC, AE, BC, BE, CE}. At the same time,
the nontrivial closure of each generated candidate is computed. For
example, the closure’(AB)=closure’(A) U closure’(B) = {D} U φ =
{D}. In iteration 2, for candidate AB, only AB→C and AB→E need to
be checked, because R–{A, B}–Closure’(AB)={A, B, C, D, E}–{A,
B}–{D}={C, E}. Since |ПAB| = |ПABE| = 6, then AB→E is obtained. In
the same way, at this level, BE→A and CE→A are also discovered, so
the equivalence AB↔BE is obtained. As a result, we only need to
combine AB, AC, AE, BC, and CE to form the level 3 candidates,
which are {ABC, ACE}. Since CE→A, ACE is pruned by pruning rule
4. Since AB→E, then ABC→E. Since A↔D, then ABC→D, so ABC
is a key, and ABC is also pruned by pruning rule 2. No other candidate
remains, so the algorithm halts.
4/9/2015
13
Discovering functional dependencies: Example
Yao, H., Hamilton, H., and Butz, C. (2002), FD_Mine: Discovering Functional Dependencies in a Database Using Equivalences, Canada.
14
Discovering functional dependencies: Example
 The result after removal is shown in table (b).
A
B
C
D
E
A
B
C
E
t1
0
0
0
2
0
t1
0
0
0
0
t2
0
1
0
2
0
t2
0
1
0
0
t3
0
2
0
2
2
t3
0
2
0
2
t4
0
3
1
2
0
t4
0
3
1
0
t5
4
1
1
1
4
t5
4
1
1
4
t6
4
3
1
1
2
t6
4
3
1
2
t7
0
0
1
2
0
t7
0
0
1
0
(a) Before
4/9/2015
(b) After
15
Outline
 Introduction.
 Problem Statement.
 Data Mining Techniques.
 Discovering functional dependencies: Example.
 Experimental Results.
 Conclusion.
 References.
4/9/2015
16
Experimental Results
 FD_Mine was applied to fifteen datasets, obtained from the
UCI Machine Learning Repository and the results were
compared to TANE. TANE was selected for comparison
because it establishes the theoretical framework for the
problem. For the dataset given in Table, Figure (a) shows the
semi-lattice for FD_Mine, and Figure (b) shows that for TANE.
Each node represents a combination of attributes. If an edge is
shown between nodes X and XY, then X → Y needs to be
checked. Hence, the number of edges is the number of FDs that
need to be checked. Both semi-lattices shown in Figure(a) have
fewer edges than the lattice. In addition, the semi-lattice for
FD_Mine has fewer edges than that for TANE.
4/9/2015
17
Experimental Results
Figure ( a ) FD_Mine
Figure ( b ) TANE
Yao, H., Hamilton, H., and Butz, C. (2002), FD_Mine: Discovering Functional Dependencies in a Database Using Equivalences, Canada.
18
Experimental Results
 Table 5.1 compares the number of FDs that are checked
on data by FD_Mine and TANE for 15 UCI datasets.
Figure 5.2 shows more detailed results for the Imports-85
dataset. At levels 1 through 5, both algorithms check
approximately the same number of FDs, but at levels 6
through 11, FD_Mine checks fewer FDs than TANE,
because it prunes more unnecessary candidates than
TANE by using the equivalences and FDs discovered at
previous levels. For more results.
4/9/2015
19
Experimental Results
Yao, H., Hamilton, H., and Butz, C. (2002), FD_Mine: Discovering Functional Dependencies in a Database Using Equivalences, Canada.
20
Outline
 Introduction.
 Problem Statement.
 Data Mining Techniques.
 Discovering functional dependencies: Example.
 Experimental Results.
 Conclusion.
 References.
4/9/2015
21
Conclusion
 Discovering Functional Dependency (FD) is employed to extract the
Functional Dependency from a datasets. This addresses two issues;
one related to the discovering process of the FDs from the datasets
and the second issues is to measure the discovering Functional
Dependency (FDs) algorithm.
 The empirical comparison on 15 UCI datasets between FD_Mine
and TANE shows FD_Mine examined fewer FDs than TANE,
because it prunes the unnecessary candidates than TANE by using
the equivalences and FDs discovered at previous levels. The result
based on the experiments on 15 UCI datasets show that the FD_Mine
algorithm can prune more candidates than TANE algorithm and it
also show that FD_Mine can discovering FDs more than TANE
algorithm.
4/9/2015
22
Outline
 Introduction.
 Problem Statement.
 Data Mining Techniques.
 Discovering functional dependencies: Example.
 Experimental Results.
 Conclusion.
 References.
4/9/2015
23
References
[1] Yao, H., Hamilton, H., and Butz, C. (2002), FD_Mine: Discovering Functional
Dependencies in a Database Using Equivalences, Canada.
[2] Wyss, C., Giannella, C., and Robertson, E. (2001), FastFDs: A Heuristic-Driven,
Depth-First Algorithm for Mining Functional Dependencies from Relation
Instances, U.S.A.
[3] Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H., (1999), TANE: An
Efficient Algorithm for discovering Functional and Approximate Dependencies,
Computing Journal, V.42, No.20, pp.100-107.
[4] Elmasri, R. and Navathe, S. (2004), Fundamentals of Database Systems, Addison
Wesley, Fourth edition.
[5] Mannila, H. (2000), Theoretical Frameworks for Data Mining, SIGKDD
Explorations, V.1, No.2, pp.30-32.
[6] Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H. (2000), Efficient
Discovery of Functional and Approximate Dependencies Using Partitions,
University of Helsinki, Finland.
Thank You
4/9/2015
24