Mining Approximate Functional Dependencies from Databases Based on Minimal Cover and Equivalent Classes


European Journal of Scientific Research
ISSN 1450-216X, Vol. 33, No. 2 (2009), pp. 338-346
EuroJournals Publishing, Inc. 2009
http://www.eurojournals.com/ejsr.htm

Jalal Atoum
Computer Science Department, PSUT, Amman, Jordan
E-mail: atoum@psut.edu.jo
Tel: 962-777-485656

Abstract

Data Mining (DM) is the process of extracting interesting and previously unknown knowledge from data. Approximate Functional Dependencies (AFDs) mined from database relations represent potentially interesting patterns and have proven useful for various tasks, such as feature selection for classification, query optimization, and query rewriting. The discovery of AFDs nevertheless remains under-explored and poses a special set of challenges, including defining the right interestingness measures for AFDs, employing effective pruning strategies, and traversing the search space of the attribute lattice efficiently. In this paper, we present a new algorithm for finding approximate functional dependencies in large relational databases, based on the approximation measure g3. The algorithm draws on concepts from relational database design theory, specifically attribute equivalences and the minimal cover. It achieves a large improvement in performance over a modified version of the TANE algorithm.

Keywords: Data Mining, Approximate Functional Dependencies, Equivalent Classes, Minimal Cover

1. Introduction

The primary motivations for mining functional dependencies (FDs) from databases are the discovery of useful patterns in data and of interesting relations between variables in large databases. In some cases, an FD may fail to hold because of only a few tuples; such an FD can be thought to hold approximately. For example, Language → Nationality may approximately hold.
Approximate functional dependencies (AFDs) capture valuable knowledge about the structure of a relation instance, and discovering them can surface domain expertise hidden in a database. AFDs arise in many databases where dependencies between attributes are expected but some tuples contain errors or represent exceptions to the rule. The discovery of unexpected but meaningful approximate dependencies is an exciting and practical goal in many data mining applications. For instance, restating an example from [3]: an AFD in a database of chemical compounds relating various structural attributes to carcinogenicity could provide valuable hints to biochemists about potential causes of cancer (but cannot be taken as fact without further analysis by domain specialists).

Applications of AFDs include: predicting missing attribute values in relational tables (QPIAD [14]), query optimization (CORDS [4]) by maintaining correct selectivity estimates, query rewriting (AIMQ [9], QPIAD [14], QUIC [13]), and database normalization for better performance and more efficient storage design. The discovery of AFDs is costly for several reasons: the pruning strategies used for exact FDs are not applicable to AFDs; for databases with a large number of attributes the search space grows dramatically; and the methods for determining whether a dependency holds are themselves expensive [6].

In this paper, we propose a new algorithm for discovering AFDs from static databases based on the approximation measure g3 [8]. The algorithm also employs concepts from relational database theory, specifically attribute equivalences and the minimal cover of a set of FDs, and aims at minimizing the time requirements of AFD discovery. We compare its results with a modified version of the well-known TANE algorithm [3].

2. Previous Research

In recent years, a new research direction has emerged involving the mining of FDs. Researchers have addressed the problem of finding all FDs that hold in a given relation instance [3, 5, 6, 8, 11]. AFD discovery research consists of three primary parts: (1) defining an approximation measure for AFDs, (2) developing methods for applying AFDs to pre-existing problems, and (3) developing algorithms for efficiently computing AFDs. Huhtala et al. [3] address the last part with TANE, an algorithm for discovering all AFDs that hold in a relation instance.
TANE uses an approximation measure, g3, proposed in [8] to decide when an AFD is deemed to hold (g3 is defined in Section 4). AFD discovery has also been considered in [1, 7, 9]. Kivinen and Mannila [8] define several measures for the error of a dependency and derive bounds for discovering dependencies with errors; g3 is one of their measures. The use of partitions to describe and define functional and approximate dependencies was suggested in [1].

3. Functional Dependencies

Given a relation R, a set of attributes X of R is said to functionally determine another set of attributes Y of R (written X → Y) if and only if each X value is associated with precisely one Y value. An FD denoted X → A is a constraint between two sets of attributes X and A that are subsets of some relation schema R. It requires that for all possible tuples t1 and t2 of R, if t1[X] = t2[X] then t1[A] = t2[A]; that is, the value of the A component of any tuple of R depends on, or is determined by, the value of its X component.

3.1. Functional Dependencies and Equivalent Classes

To discover the set of FDs satisfied by a relation instance, we use the partition method, which divides the tuples of the instance into groups according to the distinct values of each column (attribute). For each attribute, the number of groups equals the number of distinct values of that attribute; each group is called an equivalence class. For instance, in the relation instance shown in Table 1, attribute A has value "1" only in tuples one and two, so these tuples form one equivalence class: [1]_A = [2]_A = {1, 2} (we use tuple identifiers to denote tuples). Similarly, attribute A has value 2 in tuples 3, 4, 5 and value 3 in tuples 6, 7, 8.
Hence the partition of the instance with respect to attribute A consists of three equivalence classes: π_A = {{1, 2}, {3, 4, 5}, {6, 7, 8}}.
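As an illustration (a minimal Python sketch, not the paper's implementation), the equivalence-class partition of a relation instance with respect to an attribute set can be computed by grouping tuple identifiers on their attribute values. The table literal below reproduces Table 1 from the paper, with None marking an empty C cell:

```python
def partition(rows, attrs):
    """Group tuple identifiers by their values on `attrs` (an equivalence-class partition)."""
    groups = {}
    for tid, row in rows.items():
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, set()).add(tid)
    # Sort classes by their smallest tuple id, only for readable output.
    return sorted(groups.values(), key=min)

# Table 1 from the paper; None marks an empty C cell.
table1 = {
    1: {"A": 1, "B": "a", "E": 2, "C": "$",  "D": "Flower"},
    2: {"A": 1, "B": "A", "E": 2, "C": None, "D": "Tulip"},
    3: {"A": 2, "B": "A", "E": 0, "C": "$",  "D": "Daffodil"},
    4: {"A": 2, "B": "A", "E": 0, "C": "$",  "D": "Flower"},
    5: {"A": 2, "B": "B", "E": 0, "C": None, "D": "Lily"},
    6: {"A": 3, "B": "B", "E": 1, "C": "$",  "D": "Orchid"},
    7: {"A": 3, "B": "C", "E": 1, "C": None, "D": "Flower"},
    8: {"A": 3, "B": "C", "E": 1, "C": "#",  "D": "Rose"},
}

print(partition(table1, ["A"]))  # [{1, 2}, {3, 4, 5}, {6, 7, 8}]
```

Running `partition(table1, ["B", "C"])` reproduces the π_{B,C} partition given in the text.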

The partition with respect to the combined attributes {B, C}, for example, is π_{B,C} = {{1}, {2}, {3, 4}, {5}, {6}, {7}, {8}}.

Table 1: Relation Instance

  Tuple ID   A   B   E   C   D
      1      1   a   2   $   Flower
      2      1   A   2       Tulip
      3      2   A   0   $   Daffodil
      4      2   A   0   $   Flower
      5      2   B   0       Lily
      6      3   B   1   $   Orchid
      7      3   C   1       Flower
      8      3   C   1   #   Rose

The concept of partition refinement yields functional dependencies almost directly. A partition π refines another partition π' if every equivalence class in π is a subset of some equivalence class of π'. An FD X → Y holds if and only if π_X refines π_Y. In our example, attribute A has the equivalence classes {{1, 2}, {3, 4, 5}, {6, 7, 8}}, and attribute E has the equivalence classes {{1, 2}, {3, 4, 5}, {6, 7, 8}}. Since π_A refines π_E (here the two partitions are identical), we can conclude that A → E holds on this instance (Table 1).

3.2. Minimal Cover

The concept of a minimal cover is useful for eliminating unnecessary FDs, so that only a minimal set of dependencies needs to be considered. A set of functional dependencies F is a minimal cover iff:
1. Every functional dependency in F is of the form X → A, where A is a single attribute.
2. For no X → A in F is F − {X → A} equivalent to F.
3. For no X → A in F and proper subset Y ⊂ X is (F − {X → A}) ∪ {Y → A} equivalent to F.

Example: {A → C, A → B} is a minimal cover of {AB → C, A → B}.

4. Approximate Functional Dependencies

In some relations, an FD may not hold for all of the tuples; such an FD can be thought to hold approximately. For example, for cars, Make is determined by Model via an expected dependency: given that Model = 323, we know that Make = Mazda with high probability, but there is also a small chance that Make = BMW. This expected, or approximate, FD is written Model → Make.
A standard definition of an approximate dependency X → A is based on the minimum number of rows that must be removed from the relation instance r for X → A to hold:

    g3(X → A) = 1 − max{ |s| : s ⊆ r and X → A holds in s } / |r|

The measure g3 has a natural interpretation as the fraction of rows with exceptions or errors affecting the dependency. Given an error threshold ε, 0 ≤ ε ≤ 1, we say that X → A is an approximate dependency if and only if g3(X → A) is at most ε [2].

An alternative way of computing g3, used in our proposed algorithm, is as follows [3]. Each equivalence class c of π_X is the union of one or more equivalence classes c1, c2, ... of π_{X ∪ {A}}, and the rows in all but one of the ci must be removed for X → A to hold. The minimum number of rows to remove from c is thus the size of c minus the size of the largest ci. Summing over all equivalence classes c of π_X gives the total number of tuples to remove. Thus we have:

    g3(X → A) = 1 − ( Σ_{c ∈ π_X} max{ |c'| : c' ∈ π_{X ∪ {A}} and c' ⊆ c } ) / |r|
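The alternative computation of g3 can be sketched in Python as follows (an illustrative sketch, not the paper's code). The helper `partition` groups tuple identifiers by their values on a list of attributes, and the table literal reproduces the A, B, E columns of Table 1:

```python
def partition(rows, attrs):
    """Equivalence classes of tuple ids sharing the same values on `attrs`."""
    groups = {}
    for tid, row in rows.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(tid)
    return list(groups.values())

def g3(rows, lhs, rhs):
    """g3(lhs -> rhs): fraction of rows that must be removed for the FD to hold exactly."""
    pi_x = partition(rows, lhs)
    pi_xa = partition(rows, lhs + [rhs])
    # For each class c of pi_x, keep only its largest sub-class in pi_xa.
    kept = sum(max(len(c2) for c2 in pi_xa if c2 <= c) for c in pi_x)
    return 1 - kept / len(rows)

# A, B, E columns of Table 1.
table1 = {
    1: {"A": 1, "B": "a", "E": 2}, 2: {"A": 1, "B": "A", "E": 2},
    3: {"A": 2, "B": "A", "E": 0}, 4: {"A": 2, "B": "A", "E": 0},
    5: {"A": 2, "B": "B", "E": 0}, 6: {"A": 3, "B": "B", "E": 1},
    7: {"A": 3, "B": "C", "E": 1}, 8: {"A": 3, "B": "C", "E": 1},
}
print(g3(table1, ["A"], "B"))  # 0.375, matching the worked example in the paper
```

Note that `g3(table1, ["A"], "E")` returns 0.0, confirming that A → E holds exactly.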

For instance, in our example, to test whether A → B holds we compare the partitions π_A = {{1, 2}, {3, 4, 5}, {6, 7, 8}} and π_B = {{1}, {2, 3, 4}, {5, 6}, {7, 8}}. The equivalence class {1, 2} of π_A does not refine any class of π_B, and similarly for the other classes of π_A; therefore A → B does not hold. However, A → B may hold with some error g3 if we remove some tuples from the relation. Following the alternative computation of g3, we first find π_{A ∪ {B}} = {{1}, {2}, {3, 4}, {5}, {6}, {7, 8}}. The class {1, 2} of π_A is the union of {1} and {2} from π_{A ∪ {B}}, whose largest member has size 1. The class {3, 4, 5} of π_A is the union of {3, 4} and {5}, whose largest member has size 2. Finally, the class {6, 7, 8} of π_A is the union of {6} and {7, 8}, whose largest member has size 2. Hence g3(A → B) = 1 − (1 + 2 + 2)/8 = 0.375; in other words, at least three of the eight tuples in Table 1 must be removed for A → B to hold. The FD A → B is thus said to hold approximately on the relation of Table 1 with error rate ε = 0.375.

This process of discovering AFDs is repeated for all attributes and all of their combinations (the candidate set). For instance, for a relation with five attributes (A, B, C, D, E) the candidate set is {φ, A, B, C, D, E, AB, AC, AD, AE, BC, BD, BE, CD, CE, DE, ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE, ABCD, ABCE, ABDE, ACDE, BCDE, ABCDE}, for a total of 32 (i.e., 2^5) combinations. The candidate attribute sets of the relation are represented as a lattice, as shown in Figure 1. Each node of Figure 1 represents a candidate attribute set, and an edge between two nodes such as E and DE indicates that the AFD E → D needs to be checked.
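The candidate lattice can be enumerated level by level, where level l holds all attribute sets of size l. A small illustrative sketch for the five-attribute example:

```python
from itertools import combinations

attributes = ["A", "B", "C", "D", "E"]

# Level l holds all attribute sets of size l; level 0 is the empty set.
lattice = [list(combinations(attributes, l)) for l in range(len(attributes) + 1)]

level_sizes = [len(level) for level in lattice]
print(level_sizes)       # [1, 5, 10, 10, 5, 1] -- binomial coefficients C(5, l)
print(sum(level_sizes))  # 32 candidate sets in total, i.e. 2**5
```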
Hence, with respect to the number of attributes, all known algorithms for this task have worst-case running times that are exponential [8].

Figure 1: Lattice for the Attributes of the Relation in Table 1.

5. Modified TANE Algorithm

The original TANE algorithm [3] finds all non-trivial FDs by searching the lattice in a levelwise manner. A level L_l is the collection of attribute sets of size l that can potentially be used to construct dependencies from the lattice. The algorithm starts with level L_1 = {{A} : A ∈ R} and computes L_2 from L_1, L_3 from L_2, and so on.

This algorithm employs C(X), the collection of rhs candidates of a set X ⊆ R, formally defined as C(X) = {A ∈ X : X \ {A} → A does not hold} ∪ (R \ X). Furthermore, C+(X), the collection of rhs+ candidates of a set X ⊆ R, is defined as C+(X) = {A ∈ R : for all B ∈ X, X \ {A, B} → B does not hold}. The original TANE algorithm is given below:

Algorithm TANE: levelwise search of dependencies.
1  L_0 := {∅}
2  C+(∅) := R
3  L_1 := {{A} : A ∈ R}
4  l := 1
5  while L_l ≠ ∅
6      COMPUTE-DEPENDENCIES(L_l)
7      PRUNE(L_l)
8      L_{l+1} := GENERATE-NEXT-LEVEL(L_l)
9      l := l + 1

The specification of the procedure COMPUTE-DEPENDENCIES is:

Procedure COMPUTE-DEPENDENCIES(L_l)
1  for each X ∈ L_l do
2      C+(X) := ∩_{A ∈ X} C+(X \ {A})
3  for each X ∈ L_l do
4      for each A ∈ X ∩ C+(X) do
5          if X \ {A} → A is valid then
6              output X \ {A} → A
7              remove A from C+(X)
8              remove all B in R \ X from C+(X)

The specification of the procedure PRUNE is:

Procedure PRUNE(L_l)
1  for each X ∈ L_l do
2      if C+(X) = ∅ then
3          delete X from L_l
4      if X is a (super)key then
5          for each A ∈ C+(X) \ X do
6              if A ∈ ∩_{B ∈ X} C+((X ∪ {A}) \ {B}) then
7                  output X → A
8          delete X from L_l

GENERATE-NEXT-LEVEL is specified by

    L_{l+1} = {X : |X| = l + 1 and every Y ⊂ X with |Y| = l is in L_l}

TANE was modified to compute all approximate dependencies X → A with g3(X → A) ≤ ε, for a given threshold value ε [3]. The key modification is the change of the validity test on line 5 of COMPUTE-DEPENDENCIES to:

5          if g3(X \ {A} → A) ≤ ε then

In addition, line 8 of COMPUTE-DEPENDENCIES is replaced by:

8          if X \ {A} → A holds exactly then
9              remove all B in R \ X from C+(X)
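A heavily simplified levelwise AFD miner can be sketched in Python as follows. This is an assumption-laden illustration, not TANE itself: it omits the C+ based pruning entirely, so it tests every edge of the lattice and may report non-minimal dependencies. The table literal reproduces the A, B, E columns of Table 1:

```python
from itertools import combinations

def partition(rows, attrs):
    groups = {}
    for tid, row in rows.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(tid)
    return list(groups.values())

def g3(rows, lhs, rhs):
    pi_x = partition(rows, list(lhs))
    pi_xa = partition(rows, list(lhs) + [rhs])
    kept = sum(max(len(c2) for c2 in pi_xa if c2 <= c) for c in pi_x)
    return 1 - kept / len(rows)

def mine_afds(rows, attributes, eps):
    """Levelwise lattice scan without pruning: tests X \\ {A} -> A for every set X."""
    found = []
    for l in range(1, len(attributes) + 1):
        for x in combinations(attributes, l):
            for a in x:
                lhs = tuple(b for b in x if b != a)
                if g3(rows, lhs, a) <= eps:
                    found.append((lhs, a))
    return found

# A, B, E columns of Table 1.
table1 = {
    1: {"A": 1, "B": "a", "E": 2}, 2: {"A": 1, "B": "A", "E": 2},
    3: {"A": 2, "B": "A", "E": 0}, 4: {"A": 2, "B": "A", "E": 0},
    5: {"A": 2, "B": "B", "E": 0}, 6: {"A": 3, "B": "B", "E": 1},
    7: {"A": 3, "B": "C", "E": 1}, 8: {"A": 3, "B": "C", "E": 1},
}
```

With ε = 0 the exact dependencies A → E and E → A are reported; with ε = 0.4 the approximate dependency A → B (g3 = 0.375) is reported as well.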

5.1. Time Complexity of the Modified TANE Algorithm

Consider a relation R with |R| attributes and |r| tuples. The time complexity of the modified TANE algorithm depends on the number of tuples |r|, on the number of sets in all levels of the candidate attribute lattice, s = O(2^|R|), and on the number of keys, K = O(2^|R| / |R|^(1/2)). According to [3], the modified TANE algorithm has total time complexity O(s(|r| + |R|^2) + K|R|^3).

6. Suggested Work

In this paper, we suggest an algorithm, called Approximate discovery of Functional Dependencies using Minimal Cover and Equivalent Classes (AFDMCEC), that discovers all AFDs of a database with approximation error at most ε. The algorithm reduces the number of attributes and AFDs to be checked by incorporating two concepts from relational database design theory. The first is an incremental computation of a minimal cover of the AFDs during each phase of discovery, which minimizes the number of AFDs to be checked. The second is the computation of attribute equivalences based on their non-trivial closures: for each pair of attributes whose closures are found to be equal, we remove one of them from the candidate set and record the fact that the two attributes are approximately equivalent (↔). This reduces the number of attributes to be checked during each phase of the proposed algorithm. Figure 2 presents the main procedure of the AFDMCEC algorithm.

Figure 2: The Main Procedure of the AFDMCEC Algorithm

AFDMCEC Algorithm
Input: dataset D with attributes X1, X2, ..., Xn; ε: error threshold, 0 ≤ ε ≤ 1
Output: minimal ApproximateFD_Set, candidate set for the next level, EQ_Set
1.
Initialization Step
       Set R = attributes (X1, X2, ..., Xn)
       Nrows = number of rows in the database
       Set FD_Set = φ; ApproximateFD_Set = φ
       Set EQ_Set = φ
       Set Candidate_Set = {X1, X2, ..., Xn}
2. While Candidate_Set ≠ φ Do
       For all Xi ∈ Candidate_Set Do
           ApproximateFD_Set = ComputeMinimalApproximate_FD(Xi)
       GenerateNextLevelCandidates(Candidate_Set)
3. Display ApproximateFD_Set

The main procedure of the AFDMCEC algorithm calls ComputeMinimalApproximate_FD(Xi) for each Xi in Candidate_Set, as shown in Figure 3. For each attribute Y ∈ R − Xi, if g3(Xi → Y) ≤ ε then Xi → Y is added to ApproximateFD_Set; and if the approximate closure of Xi is the same as the approximate closure of Y, then Y → Xi is added to ApproximateFD_Set, Xi ↔ Y is added to EQ_Set, and Y is removed from Candidate_Set. Finally, Figure 4 presents the GenerateNextLevelCandidates procedure.
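The closure-based equivalence test used above can be illustrated with a standard attribute-closure routine (a minimal sketch under the usual textbook definition, not the paper's code; the FD set below is hypothetical):

```python
def closure(attrs, fds):
    """Attribute-set closure under a set of FDs.
    fds: iterable of (lhs, rhs) pairs, lhs a frozenset of attributes, rhs a single attribute.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

# Hypothetical FD set: A -> B, B -> A, B -> C.
fds = [(frozenset("A"), "B"), (frozenset("B"), "A"), (frozenset("B"), "C")]
print(sorted(closure({"A"}, fds)))                     # ['A', 'B', 'C']
print(closure({"A"}, fds) == closure({"B"}, fds))      # True: A and B are equivalent
```

When two single attributes have equal closures, as A and B do here, one of them can be dropped from the candidate set, which is exactly the reduction AFDMCEC exploits.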

Figure 3: ComputeMinimalApproximate_FD Procedure

Procedure ComputeMinimalApproximate_FD(Xi)
    TempList = φ
    For each Y ∈ R − Xi Do
        M = π_Xi
        N = π_{Xi ∪ {Y}}
        For all T ∈ M Do                      // each equivalence class of π_Xi
            Max = 0
            For all S ∈ N Do                  // sub-classes of T in π_{Xi ∪ {Y}}
                If S ⊆ T and Max < Len(S) then Max = Len(S)
            Add Max to TempList
        J = 0
        For I = 1 to Len(TempList) Do
            J = J + TempList(I)
        Result = 1 − J / Nrows
        If Result ≤ ε Then
            Add Xi → Y to ApproximateFD_Set
            If approximate_closure(Xi) = approximate_closure(Y) then
                Add Y → Xi to ApproximateFD_Set
                Add Xi to closure'[Y]
                Add Xi ↔ Y to EQ_Set
                Remove Y from Candidate_Set

Figure 4: GenerateNextLevelCandidates Procedure

Procedure GenerateNextLevelCandidates(Candidate_Set)
    For each Xi ∈ Candidate_Set do
        For each Xj ∈ Candidate_Set do
            If (Xi[1] = Xj[1], ..., Xi[k−2] = Xj[k−2] and Xi[k−1] < Xj[k−1]) then
                Set Xij = Xi join Xj
                If Xij ∈ TempList then delete Xij
                else compute the partition π_Xij of Xij

6.1. Time Complexity of the AFDMCEC Algorithm

Initially, the proposed algorithm scans the whole table of |r| tuples to find all equivalence classes, at a cost of |r|. The main body of the AFDMCEC algorithm is a loop that iterates |R| times. Within each iteration of this loop, there is a call to each of the following procedures:
1. ComputeMinimalApproximate_FD(): each call takes |R| iterations, and each of these iterations scans all candidates at that level, of which there are at most s = 2^|R|; hence the total cost of this step is s·|R|.
2. GenerateNextLevelCandidates(): two nested loops of |R| iterations each, for a total of |R|^2.

Therefore, the total time complexity of the AFDMCEC algorithm is O(|r| + |R|(s|R| + |R|^2)) = O(|r| + s|R|^2 + |R|^3).
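The prefix-join test in GenerateNextLevelCandidates (two size-k candidates are joined when they agree on their first k−1 attributes) can be sketched as follows; this is an illustrative reconstruction, not the paper's code:

```python
def generate_next_level(level):
    """Join pairs of size-k candidates sharing a (k-1)-prefix into size-(k+1) candidates.
    `level` must be a sorted list of sorted attribute tuples, all of the same size.
    """
    next_level = []
    for i, xi in enumerate(level):
        for xj in level[i + 1:]:
            # Prefix-join condition: equal on the first k-1 attributes, last attributes ordered.
            if xi[:-1] == xj[:-1] and xi[-1] < xj[-1]:
                next_level.append(xi + (xj[-1],))
    return next_level

level1 = [("A",), ("B",), ("C",)]
print(generate_next_level(level1))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Applying it once more yields [('A', 'B', 'C')]: each candidate is generated exactly once, which is why the condition orders the last attributes.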

7. Experimental Analysis

Running both algorithms (modified TANE and AFDMCEC) on the UCI datasets [12] generated the same set of AFDs. Table 2 shows the actual times required by the modified TANE algorithm and by the AFDMCEC algorithm on these UCI datasets, which vary in number of attributes and tuples, for different threshold values ε.

Table 2: Actual time requirements (minutes) at all levels for both algorithms (modified TANE and AFDMCEC) on some UCI datasets, for different thresholds ε.

                      ε = 0.0            ε = 0.05           ε = 0.25           ε = 0.5
  Database          AFDMCEC  ModTane   AFDMCEC  ModTane   AFDMCEC  ModTane   AFDMCEC  ModTane
  Abalone             4.04     4.53      1.02     1.25      0.30     0.42      0.12     0.14
  Balance-scale       0.026    0.03      0.01     0.025     0.009    0.012     0.006    0.009
  Breast-cancer      60.25    64.02     16.10    18.25      3.24     5.40      1.30     2.50
  Bridge            350.00   387.25     90.10    98.30     12.15    15.80      5.25     7.50
  Chess               0.33     0.47      0.12     0.20      0.02     0.065     0.015    0.025
  Echocardiogram     21.00    23.25      7.20     9.12      1.42     9.12      0.24     1.05
  Glass               2.01     2.30      0.81     0.95      0.12     0.62      0.09     0.62
  Iris                0.022    0.025     0.006    0.025     0.002    0.009     0.001    0.006
  Nursery            10.20    11.40      3.11     4.25      0.95     1.25      0.20     0.55
  Machine             6.85     7.20      2.05     3.17      0.82     0.94      0.35     0.46

As Table 2 shows, the same AFDs are found more efficiently by our proposed algorithm than by the modified version of TANE. This is a consequence of detecting more equivalent classes and, consequently, more equivalent attributes; more equivalent attributes mean fewer AFDs need to be checked for satisfaction. Furthermore, we observe that the higher the threshold ε, the lower the time requirements of both algorithms. This is because a higher threshold ε permits a higher error rate, so fewer tuples need to be removed in the computation of g3.

8.
Time Complexity Comparisons

Table 3 presents the time complexity comparison, computed from the expressions derived earlier, for the AFDMCEC algorithm and the modified TANE algorithm.

Table 3: Time Complexity Comparison Based on T(n) for Both Algorithms

  Database         # Attributes   # Tuples   Modified TANE:             AFDMCEC:
                                             s(|r| + |R|^2) + K|R|^3    |r| + s|R|^2 + |R|^3
  Abalone               9           4,177          2,304,512                 46,378
  Balance-scale         5             625             22,588                  1,550
  Breast-cancer        11             699          2,501,246                249,838
  Bridge               13             108          7,260,882              1,386,753
  Chess                 7          28,056          3,614,034                 34,671
  Echocardiogram       13             132          7,457,490              1,386,777
  Glass                11             214          1,507,966                249,353
  Iris                  5             150              7,388                  1,075
  Nursery               9          12,960          6,801,408                 55,161
  Machine              10             209            640,233                103,609

9. Conclusions

We have suggested a new algorithm for discovering AFDs from large relational databases, based on the approximation measure g3 and employing the concepts of attribute equivalence and the minimal (canonical) cover of FDs. The aim of the algorithm is to reduce time requirements compared with a modification of the well-known TANE algorithm. Our analysis shows that the AFDMCEC algorithm outperforms the modified version of TANE. Furthermore, simulation results for both algorithms show that as the threshold ε increases, both algorithms perform much better, with dramatic decreases in time requirements. Higher threshold values also cause more AFDs to be discovered from the database; in that case, however, most of the discovered AFDs are useless in terms of valuable knowledge, since they have high error rates.

References

[1] Dalkilic, M. M., Gucht, D. V., and Robertson, E. L., 1997. "CE: The Classifier-Estimator Framework for Data Mining". In Proceedings of the 7th IFIP 2.6 Working Conference on Database Semantics (DS-7), Leysin, Switzerland, Oct. 1997. Chapman and Hall.
[2] Giannella, C. and Robertson, E., 2004. "On Approximation Measures for Functional Dependencies". Information Systems, 29(6):483-507.
[3] Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H., 1999. "TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies". The Computer Journal, 42(2):100-111.
[4] Ilyas, I. F., Markl, V., Haas, P., Brown, P., and Aboulnaga, A., 2004. "CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies". In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 647-658, New York, NY, USA.
[5] Jiang, N. and Gruenwald, L., 2006. "Research Issues in Data Stream Association Rule Mining". SIGMOD Record, 35(1).
[6] Kalavagattu, A. K., 2008.
Mining Approximate Dependencies as Condensed Representations of Association Rules. Master's thesis, Arizona State University.
[7] Kramer, S. and Pfahringer, B., 1996. "Efficient Search of Strong Partial Determinations". In E. Simoudis, J. Han, and U. Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD '96), pages 371-378, Portland, OR, Aug. 1996. AAAI Press.
[8] Kivinen, J. and Mannila, H., 1995. "Approximate Inference of Functional Dependencies from Relations". Theoretical Computer Science, 149:129-149.
[9] Nambiar, U. and Kambhampati, S., 2006. "Answering Imprecise Queries over Autonomous Web Databases". In ICDE, page 45.
[10] Novelli, N. and Cicchetti, R., 2001. "Fun: An Efficient Algorithm for Mining Functional and Embedded Dependencies". In Proceedings of the 8th International Conference on Database Theory (ICDT), pages 189-203.
[11] Perugini, S. and Ramakrishnan, N., 2006. "Mining Web Functional Dependencies for Flexible Information Access". Journal of the American Society for Information Science and Technology.
[12] UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[13] Wolf, G., Khatri, H., Chen, Y., and Kambhampati, S., 2007. "QUIC: A System for Handling Imprecision & Incompleteness in Autonomous Databases (demo)". In CIDR, pages 263-268.
[14] Wolf, G., Khatri, H., Chokshi, B., Fan, J., Chen, Y., and Kambhampati, S., 2007. "Query Processing over Incomplete Autonomous Databases". In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 651-662. VLDB Endowment.