The fuzzy data mining generalized association rules for quantitative values


Tzung-Pei Hong* and Kuei-Ying Lin
Graduate School of Information Engineering, I-Shou University, Kaohsiung, 84008, Taiwan, R.O.C.
{tphong, isu.edu.tw

Abstract

Due to the increasing use of very large databases and data warehouses, mining useful information and helpful knowledge from transactions is evolving into an important research area. Most conventional data-mining algorithms identify the relationships among transactions using binary values and find rules at a single concept level. Transactions with quantitative values and items with hierarchical relations are, however, commonly seen in real-world applications. In this paper, we introduce the problem of mining generalized association rules for quantitative values. We propose a fuzzy generalized rule mining algorithm for extracting implicit knowledge from transactions stored as quantitative values. Given a set of transactions and a predefined taxonomy, we want to find fuzzy generalized association rules in which the quantitative items may come from any level of the taxonomy. Each item uses only the linguistic term with the maximum cardinality in later mining processes, thus making the number of fuzzy regions to be processed the same as the number of original items. The algorithm can therefore focus on the most important linguistic terms and reduce its time complexity. The proposed algorithm combines a fuzzy transaction data-mining algorithm with an algorithm for mining generalized association rules. This paper is related to fuzzy set concepts, fuzzy data mining algorithms, taxonomies and generalized association rules.

Keywords: fuzzy data mining, fuzzy set, generalized association rule, quantitative value, taxonomy

* Corresponding author

1. Introduction

As information technology (IT) has progressed rapidly, the capability of storing and managing data in databases has become important. Although IT development facilitates data processing and lowers the cost of storage media, extracting implicitly available information to aid decision making has become a new and challenging task. Efficient mechanisms for mining information and knowledge from large databases have thus been vigorously designed. As a result, data mining, first proposed by Agrawal et al. in 1993 [1], has become a central field of study in both databases and artificial intelligence.

Finding association rules from transaction databases is the most commonly seen task in data mining [1][2][3][7][11][12][13][15][16][26]. It discovers relationships among items such that the presence of some items in a transaction implies the presence of other items. Association rules are widely applicable. For example, they can inform supermarket managers of which items (products) customers would like to buy together, and are thus useful for planning and marketing activities [1][2][6]. More concretely, assuming that the association rule "if a customer buys milk, then he is likely to buy bread" is mined out, the supermarket manager can then

place the milk near the bread area to encourage customers to purchase both at once. Additionally, the manager can promote the milk and the bread by packing them together on sale so as to earn good profits for the supermarket.

Most previous studies have mainly shown how binary-valued transaction data may be handled. Transaction data in real-world applications, however, usually consist of quantitative values, so designing a sophisticated data-mining algorithm able to deal with quantitative data presents a challenge to workers in this research field.

In the past, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules in transaction data [1-3]. They also proposed a method [26] for mining association rules from data sets with quantitative and categorical attributes. Their method first determines the number of partitions for each quantitative attribute, and then maps all possible values of each attribute into a set of consecutive integers. Other methods have also been proposed to handle numeric attributes and to derive association rules. Fukuda et al. introduced the optimized association-rule problem and permitted association rules to contain single uninstantiated conditions on the left-hand side [13]. They also proposed schemes to determine the conditions such that the confidence or support

values of the rules are maximized. However, their schemes were only suitable for a single optimal region. Rastogi and Shim thus extended the problem to more than one optimal region, and showed that the problem was NP-hard even for the case of one uninstantiated numeric attribute [23][24].

Fuzzy set theory is being used more and more frequently in intelligent systems because of its simplicity and similarity to human reasoning [20]. Several fuzzy learning algorithms for inducing rules from given sets of data have been designed and used to good effect in specific domains [4-5, 6, 10, 14, 17-19, 25, 28-29]. Strategies based on decision trees were proposed in [8-9, 21-22, 25, 30-31], and Wang et al. proposed a fuzzy version space learning strategy for managing vague information [28]. Hong et al. also proposed a fuzzy mining algorithm for managing quantitative data [16].

Basically, our approach applies fuzzy concepts together with taxonomy information to discover useful fuzzy generalized association rules among quantitative values. The goal of data mining is to discover important associations among items such that the presence of some items in a transaction implies the presence of some other items. To achieve this purpose, we propose a method that combines the FTDA algorithm [16] with the mining of generalized association rules [27]. Hong et al. proposed an algorithm that integrates the

concepts of fuzzy sets and the Apriori algorithm to find interesting itemsets and fuzzy association rules from transaction data; the algorithm is called FTDA (fuzzy transaction data-mining algorithm). It is specifically capable of transforming quantitative values in transactions into linguistic terms, then filtering them, picking them up and associating them using a modified Apriori algorithm. The FTDA algorithm can thus consider the implicit information of quantitative values in the transactions and can infer useful association rules from them.

Ramakrishnan Srikant and Rakesh Agrawal introduced the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy (is-a hierarchy) on the items, the task is to find generalized association rules between items at any level of the taxonomy.

The remaining parts of this paper are organized as follows. Mining generalized association rules for quantitative values at taxonomy is introduced in Section 2. Fuzzy-set concepts are reviewed in Section 3. Notation used in this paper is defined in Section 4. A new fuzzy data mining generalized association rules algorithm is proposed in Section 5. An example to illustrate the proposed algorithm is given in Section 6. Discussion and conclusions are finally stated in Section 7.

2. Mining generalized association rules at taxonomy

Our algorithm combines the fuzzy transaction data-mining algorithm (the FTDA algorithm) proposed by Tzung-Pei Hong et al. [16] with the generalized association rules proposed by Ramakrishnan Srikant and Rakesh Agrawal [27]. It draws on fuzzy sets, fuzzy data mining, generalized association rules and a predefined taxonomy. The concepts of generalized association rules and predefined taxonomies are explained in this section.

The problem of mining generalized association rules can be stated informally as follows: given a set of transactions and a predefined taxonomy, we want to find association rules where the items may be from any level of the taxonomy.

[Figure 1: predefined taxonomy. Clothes: Outerwear (Jackets, Ski Pants), Shirts. Footwear: Shoes, Hiking Boots.]

In most cases, taxonomies (is-a hierarchies) over the items are available. An example of a predefined taxonomy is shown in Figure 1: this taxonomy says that Jacket is-a Outerwear, Ski Pants is-a Outerwear, Outerwear is-a Clothes, etc. Users are interested in generating rules that span different levels of the taxonomy. For example, we may infer a rule that people who buy Outerwear tend to buy Hiking Boots from the fact that people bought Jackets with Hiking Boots and Ski Pants with Hiking Boots. However, the support for the rule Outerwear → Hiking Boots may not be the sum

of the supports for the rules Jackets → Hiking Boots and Ski Pants → Hiking Boots, since some people may have bought Jackets, Ski Pants and Hiking Boots in the same transaction. Also, Outerwear → Hiking Boots may be a valid rule, while Jackets → Hiking Boots and Clothes → Hiking Boots may not be. The former may not have minimum support, and the latter may not have minimum confidence.

Previous work on quantifying the usefulness or interest of a rule focused on how much the support of a rule exceeded the expected support based on the supports of the antecedent and consequent. Piatetsky-Shapiro argues that a rule X → Y is not interesting if support(X → Y) ≈ support(X) × support(Y). This measure, however, did not prune many rules, and many of the rules found were redundant. The information in the taxonomy is therefore used to derive a new interest measure that prunes out 40% to 60% of the rules as redundant. Consider the rule:

Clothes → Footwear (8% support, 75% confidence)

If Clothes is a parent of Shirts, and about half of the sales of Clothes are Shirts, we would expect the rule:

Shirts → Footwear (4% support, 75% confidence)

The second rule can be considered redundant since it does not convey any additional information and is less general than the first rule. The notion of interest is captured by saying that we only want to find rules whose support is more than R times the expected value or whose confidence is more than R times the expected value, for some user-specified constant R. We say that a rule is interesting if it has no ancestors or if it is R-interesting with respect to its close ancestors among its interesting ancestors.
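The expected-value reasoning in the Clothes/Shirts example can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that only the antecedent is specialized, so the expected support is the ancestor rule's support scaled by the child's share of the parent item and the expected confidence is unchanged.

```python
def expected_from_ancestor(anc_support, anc_conf, share):
    """Expected (support, confidence) of a rule whose antecedent is a child
    of the ancestor rule's antecedent.

    share = support(child item) / support(parent item); assumed formula that
    matches the Clothes/Shirts example (8% * 0.5 = 4%, confidence unchanged).
    """
    return anc_support * share, anc_conf


def is_r_interesting(support, conf, exp_support, exp_conf, r):
    """True if support or confidence is more than r times its expected value."""
    return support > r * exp_support or conf > r * exp_conf


# Clothes -> Footwear: 8% support, 75% confidence; Shirts are half of Clothes.
exp_sup, exp_conf = expected_from_ancestor(0.08, 0.75, 0.5)

# Shirts -> Footwear observed at exactly the expected values is redundant.
print(is_r_interesting(0.04, 0.75, exp_sup, exp_conf, r=1.1))
```

With a hypothetical observed support of 6% instead of 4%, the same call would report the rule as R-interesting for R = 1.1.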

3. Review of Fuzzy Set Concepts

Fuzzy set theory was first proposed by Zadeh and Goguen in 1965 [32]. Fuzzy set theory is primarily concerned with quantifying and reasoning using natural language in which words can have ambiguous meanings. This can be thought of as an extension of traditional crisp sets, in which each element must either be in or not in a set.

Formally, the process by which individuals from a universal set X are determined to be either members or non-members of a crisp set can be defined by a characteristic or discrimination function [32]. For a given crisp set A, this function assigns a value $\mu_A(x)$ to every $x \in X$ such that

$$\mu_A(x) = \begin{cases} 1 & \text{if and only if } x \in A, \\ 0 & \text{if and only if } x \notin A. \end{cases}$$

Thus, the function maps elements of the universal set to the set containing 0 and 1. This kind of function can be generalized such that the values assigned to the elements of the universal set fall within specified ranges, referred to as the membership grades of these elements in the set. Larger values denote higher degrees

of set membership. Such a function is called the membership function, $\mu_A(x)$, by which a fuzzy set A is usually defined. This function is represented by

$$\mu_A : X \to [0, 1],$$

where [0, 1] denotes the interval of real numbers from 0 to 1, inclusive. The function can also be generalized to any real interval instead of [0, 1].

A special notation is often used in the literature to represent fuzzy sets. Assume that $x_1$ to $x_n$ are the elements in fuzzy set A, and $\mu_1$ to $\mu_n$ are, respectively, their grades of membership in A. A is then usually represented as follows:

$$A = \mu_1 / x_1 + \mu_2 / x_2 + \cdots + \mu_n / x_n.$$

An α-cut of a fuzzy set A is a crisp set $A_\alpha$ that contains all elements in the universal set X with membership grades in A greater than or equal to a specified value of α. This definition can be written as:

$$A_\alpha = \{ x \in X \mid \mu_A(x) \ge \alpha \}.$$
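As a small illustration of the membership-grade notation and the α-cut defined above, the following Python sketch stores a fuzzy set as a mapping from elements to grades. The set `A`, its grades and the helper names are hypothetical values chosen for illustration, not data from this paper.

```python
def alpha_cut(fuzzy_set, alpha):
    """Return the crisp set {x | mu_A(x) >= alpha}."""
    return {x for x, grade in fuzzy_set.items() if grade >= alpha}


def to_notation(fuzzy_set):
    """Render the set in the mu_1/x_1 + ... + mu_n/x_n notation."""
    return " + ".join(f"{grade}/{x}" for x, grade in fuzzy_set.items())


# Hypothetical fuzzy set A over four elements.
A = {"x1": 0.2, "x2": 0.7, "x3": 1.0, "x4": 0.5}

print(to_notation(A))
print(sorted(alpha_cut(A, 0.5)))
```

Raising α shrinks the cut: with α = 1.0 only the elements with full membership remain.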

The scalar cardinality of a fuzzy set A defined on a finite universal set X is the summation of the membership grades of all the elements of X in A. Thus,

$$|A| = \sum_{x \in X} \mu_A(x).$$

Among operations on fuzzy sets are the basic and commonly used complementation, union and intersection, as proposed by Zadeh.

(1) The complementation of a fuzzy set A is denoted by $\neg A$, and the membership function of $\neg A$ is given by:

$$\mu_{\neg A}(x) = 1 - \mu_A(x), \quad \forall x \in X.$$

(2) The intersection of two fuzzy sets A and B is denoted by $A \cap B$, and the membership function of $A \cap B$ is given by:

$$\mu_{A \cap B}(x) = \min\{\mu_A(x), \mu_B(x)\}, \quad \forall x \in X.$$

(3) The union of fuzzy sets A and B is denoted by $A \cup B$, and the membership function of $A \cup B$ is given by:

$$\mu_{A \cup B}(x) = \max\{\mu_A(x), \mu_B(x)\}.$$

The above concepts will be used in our proposed algorithm to mine fuzzy association rules at a predefined taxonomy.

4. Notation

The following notation is used in our proposed algorithm:

n: the total number of transaction data;
$m_k$: the total number of attributes at level k;
l: the total number of levels in the predefined taxonomy;
$D_i$: the i-th transaction datum, 1 ≤ i ≤ n;
$I_j^k$: the j-th group name at level k, 1 ≤ j ≤ $m_k$, 1 ≤ k ≤ l;
h: the number of fuzzy regions for $I_j^k$;
$R_{jc}^k$: the c-th fuzzy region of $I_j^k$, 1 ≤ c ≤ h;
$v_{ij}^k$: the quantitative value of $I_j^k$ at level k in the transaction $D_i$;
$f_{ij}^k$: the fuzzy set converted from $v_{ij}^k$;
$f_{ijc}^k$: the membership value of $v_{ij}^k$ in region $R_{jc}^k$;

$count_{jc}^k$: the summation of $f_{ijc}^k$, i = 1 to n;
$max\text{-}count_j^k$: the maximum count value among the $count_{jc}^k$ values, c = 1 to h;
$max\text{-}R_j^k$: the fuzzy region of $I_j^k$ with $max\text{-}count_j^k$;
α: the predefined minimum support level;
λ: the predefined minimum confidence value;
$C_r$: the set of candidate itemsets with r attributes (items);
$L_r$: the set of large itemsets with r attributes (items).

5. The fuzzy data mining generalized association rule algorithm

The proposed fuzzy mining algorithm first uses membership functions to transform each quantitative value into a fuzzy set in linguistic terms. The algorithm then calculates the scalar cardinalities of all linguistic terms in the transaction data. Each attribute uses only the linguistic term with the maximum cardinality in later mining processes, thus keeping the number of items the same as that of the original attributes. The algorithm is therefore focused on the most important linguistic terms, which reduces its time complexity. The mining process using fuzzy counts is then performed to find fuzzy association rules. Details of the proposed mining algorithm

are described below.

The mining fuzzy generalized rule algorithm:

INPUT: A body of n transaction data, each with m attribute values, a predefined taxonomy, a set of membership functions, a predefined minimum support value α, and a predefined confidence value λ.

OUTPUT: A set of fuzzy association rules.

STEP 1: Group the items with the predefined taxonomy in each transaction datum $D_i$, and calculate the total item amount in each group. Let the amount of the j-th group name at level k in transaction $D_i$ be denoted $v_{ij}^k$.

STEP 2: Transform the quantitative value $v_{ij}^k$ of each transaction datum $D_i$, i = 1 to n, for each appearing encoded group name $I_j^k$, into a fuzzy set $f_{ij}^k$ represented as

$$\left( \frac{f_{ij1}^k}{R_{j1}^k} + \frac{f_{ij2}^k}{R_{j2}^k} + \cdots + \frac{f_{ijh}^k}{R_{jh}^k} \right)$$

using the given membership functions, where h is the number of fuzzy regions for $I_j^k$, $R_{jc}^k$ is the c-th fuzzy region of $I_j^k$, 1 ≤ c ≤ h, and $f_{ijc}^k$ is the fuzzy membership value of $v_{ij}^k$ in region $R_{jc}^k$.

STEP 3: Calculate the scalar cardinality of each fuzzy region $R_{jc}^k$ in the transaction data:

$$count_{jc}^k = \sum_{i=1}^{n} f_{ijc}^k.$$

STEP 4: Find $max\text{-}count_j^k = \max_{c=1}^{h} (count_{jc}^k)$, for j = 1 to $m_k$. Let $max\text{-}R_j^k$ be the region with $max\text{-}count_j^k$ for item $I_j^k$. The region $max\text{-}R_j^k$ will be used to represent the fuzzy characteristic of this item $I_j^k$ in later mining processes.

STEP 5: Check whether the value $max\text{-}count_j^k$ of a region $max\text{-}R_j^k$, j = 1 to $m_k$, is larger than or equal to the predefined minimum support value α. If the count of a region $max\text{-}R_j^k$ is equal to or greater than the minimum support value, put the region in the large 1-itemsets ($L_1$). That is,

$$L_1 = \{ max\text{-}R_j^k \mid max\text{-}count_j^k \ge \alpha, \ 1 \le j \le m_k \}.$$

STEP 6: Generate the candidate set $C_2$ from $L_1$. The two items in each 2-itemset in $C_2$ must not have an ancestor or descendant relation in the taxonomy. All the possible 2-itemsets are collected as $C_2$.

STEP 7: Do the following substeps for each newly formed candidate 2-itemset s with items ($s_1$, $s_2$) in $C_2$.

(a) Calculate the fuzzy value of s in each transaction datum $D_i$ as $f_{is} = f_{is_1} \wedge f_{is_2}$, where $f_{is_j}$ is the membership value of $D_i$ in region $s_j$. If the minimum operator is used for the intersection, then $f_{is} = \min(f_{is_1}, f_{is_2})$.

(b) Calculate the scalar cardinality of s in the transaction data as:

$$count_s = \sum_{i=1}^{n} f_{is}.$$

(c) If $count_s$ is larger than or equal to the predefined minimum support value α, put s in $L_2$.

STEP 8: Set r = 2, where r is used to represent the number of items kept in the current large itemsets.

STEP 9: If $L_2$ is null, then there is no rule in these transaction data; otherwise, do the next step.

STEP 10: Generate the candidate set $C_{r+1}$ from $L_r$ in a way similar to that in the Apriori algorithm [4]. The items in a candidate itemset must not have an ancestor or descendant relation. That is, the algorithm first joins $L_r$ and $L_r$, assuming that r-1 items in the two itemsets are the same and the other one is different. It then keeps in $C_{r+1}$ the itemsets which have all their sub-itemsets of r items existing in $L_r$.

STEP 11: Do the following substeps for each newly formed (r+1)-itemset s with items ($s_1$, $s_2$, ..., $s_{r+1}$) in $C_{r+1}$:

(a) Calculate the fuzzy value of s in each transaction datum $D_i$ as $f_{is} = f_{is_1} \wedge f_{is_2} \wedge \cdots \wedge f_{is_{r+1}}$, where $f_{is_j}$ is the membership value of $D_i$ in region $s_j$. If the minimum operator is used for the intersection, then

$$f_{is} = \min_{j=1}^{r+1} f_{is_j}.$$

(b) Calculate the scalar cardinality of s in the transaction data as:

$$count_s = \sum_{i=1}^{n} f_{is}.$$

(c) If $count_s$ is larger than or equal to the predefined minimum support value α, put s in $L_{r+1}$.

STEP 12: If $L_{r+1}$ is null, then do the next step; otherwise, set r = r+1 and repeat STEPs 10 to 12.

STEP 13: Construct the association rules for all large q-itemsets s with items ($s_1$, $s_2$, ..., $s_q$), q ≥ 2, using the following substeps:

(a) Form all possible association rules as follows:

$$s_1 \wedge \cdots \wedge s_{k-1} \wedge s_{k+1} \wedge \cdots \wedge s_q \to s_k, \quad k = 1 \text{ to } q.$$

(b) Calculate the confidence values of all association rules using the formula:

$$\frac{\sum_{i=1}^{n} f_{is}}{\sum_{i=1}^{n} \left( f_{is_1} \wedge \cdots \wedge f_{is_{k-1}} \wedge f_{is_{k+1}} \wedge \cdots \wedge f_{is_q} \right)}.$$

STEP 14: Keep the rules whose confidence values are larger than or equal to the predefined confidence threshold λ.

STEP 15: Output the interesting rules. A rule is pruned as redundant if another kept rule has the same linguistic terms and differs from it only in that an item of its antecedent or consequent is replaced by an ancestor of that item in the taxonomy.
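STEPs 1 to 14 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the Low/Middle/High membership functions with breakpoints 1, 6 and 11 are assumed (the text does not give them numerically), candidates are enumerated exhaustively instead of with the Apriori join, and the redundancy pruning of STEP 15 is omitted. The data are the six transactions of the example in Section 6; because of the assumed membership functions, the counts need not coincide with the tables given there.

```python
from itertools import combinations


def ancestors(item, parent):
    """All ancestors of `item` in a child -> parent taxonomy map."""
    found = set()
    while item in parent:
        item = parent[item]
        found.add(item)
    return found


def membership(v):
    """Assumed Low/Middle/High functions with breakpoints 1, 6 and 11."""
    return {
        "Low": min(1.0, max(0.0, (6 - v) / 5)),
        "Middle": max(0.0, 1 - abs(v - 6) / 5),
        "High": min(1.0, max(0.0, (v - 6) / 5)),
    }


def mine(transactions, parent, min_support, min_conf):
    # STEP 1: add each ancestor group with the summed amount of its items.
    grouped = []
    for t in transactions:
        g = dict(t)
        for item, amount in t.items():
            for a in ancestors(item, parent):
                g[a] = g.get(a, 0) + amount
        grouped.append(g)
    # STEP 2: fuzzify every amount; absent items get no membership at all.
    fuzzy = [{item: membership(v) for item, v in g.items()} for g in grouped]

    def mu(f, region):
        item, term = region
        return f.get(item, {}).get(term, 0.0)

    def count(itemset):
        # Scalar cardinality, with the minimum operator as intersection.
        return sum(min(mu(f, r) for r in itemset) for f in fuzzy)

    # STEPs 3-5: keep the max-count region per item, then filter by support.
    level = {}
    for item in {i for f in fuzzy for i in f}:
        term = max(("Low", "Middle", "High"),
                   key=lambda t: count([(item, t)]))
        c = count([(item, term)])
        if c >= min_support:
            level[frozenset([(item, term)])] = c
    large = dict(level)

    def related(a, b):
        return a in ancestors(b, parent) or b in ancestors(a, parent)

    # STEPs 6-12: grow itemsets, skipping ancestor/descendant pairs.
    size = 2
    while level:
        regions = sorted({r for s in level for r in s})
        level_next = {}
        for combo in combinations(regions, size):
            if any(related(a[0], b[0]) for a, b in combinations(combo, 2)):
                continue
            if all(frozenset(sub) in level
                   for sub in combinations(combo, size - 1)):
                c = count(combo)
                if c >= min_support:
                    level_next[frozenset(combo)] = c
        large.update(level_next)
        level, size = level_next, size + 1

    # STEPs 13-14: confidence = count(s) / count(antecedent).
    rules = []
    for s, c in large.items():
        for consequent in s if len(s) > 1 else ():
            antecedent = s - {consequent}
            conf = c / large[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, consequent, c, conf))
    return large, rules


PARENT = {"milk": "drink", "tea": "drink", "drink": "food", "bread": "food",
          "jacket": "cloth", "T-shirt": "cloth"}
DATA = [
    {"milk": 3, "bread": 4, "T-shirt": 2},
    {"tea": 3, "bread": 7, "jacket": 7},
    {"tea": 2, "bread": 10, "T-shirt": 5},
    {"bread": 9, "T-shirt": 10},
    {"milk": 7, "jacket": 8},
    {"tea": 2, "bread": 8, "jacket": 10},
]
large, rules = mine(DATA, PARENT, min_support=1.7, min_conf=0.8)
for antecedent, consequent, support, conf in rules:
    print(sorted(antecedent), "->", consequent, round(support, 2), round(conf, 2))
```

Keeping a single region per item means every itemset contains at most one linguistic term per item, which is what keeps the candidate space the same size as in the crisp case.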

After STEP 15, the rules output can serve as meta-knowledge concerning the given transactions.

6. Example

In this section, an example is given to illustrate the proposed generalized fuzzy data mining algorithm for quantitative values. This simple example shows how the proposed algorithm can be used to generate generalized association rules from quantitative transactions using a predefined taxonomy. The data set, including six transactions, is shown in Table 1.

Table 1. Six transactions in this example

Transaction ID   Items
1                (milk, 3) (bread, 4) (T-shirt, 2)
2                (tea, 3) (bread, 7) (jacket, 7)
3                (tea, 2) (bread, 10) (T-shirt, 5)
4                (bread, 9) (T-shirt, 10)
5                (milk, 7) (jacket, 8)
6                (tea, 2) (bread, 8) (jacket, 10)

Each transaction includes a transaction ID and some items bought. Each item is represented by a tuple (item name, item amount). For example, the fourth transaction consists of nine units of bread and ten units of T-shirt. Assume the predefined taxonomy is as shown in Figure 2.

[Taxonomy figure: Food → (Drink, bread); Cloth → (jacket, T-shirt)]

Figure 2. The predefined taxonomy in this example

In Figure 2, Food, Drink and Cloth are group names. Food can be classified into two classes: Drink and bread. Drink can be further classified into milk and tea, and Cloth can be classified into jacket and T-shirt. We can use codes to represent items and group names. For example, code A represents milk and B represents tea. Results are shown in Figure 3.

[Figure 3: the predefined taxonomy using codes in this example. T2 → (T1, C); T1 → (A, B); T3 → (D, E)]

Also assume that the fuzzy membership functions are the same for all the items and are as shown in Figure 4.

[Membership-function plot: membership value (up to 1) against number, with the regions Low, Middle and High]

Figure 4. The membership functions used in this example

In this example, amounts are represented by three fuzzy regions: Low, Middle and High. Thus, three fuzzy membership values are produced for each item amount according to the predefined membership functions. For the transaction data in Table 1, the proposed fuzzy mining algorithm proceeds as follows.

STEP 1: All the items in the transactions are first grouped. Take the items in transaction 1 as an example. The item (A, 3) is grouped into (T1, 3); (A, 3) and (C, 4) are grouped into (T2, 7); (E, 2) is grouped into (T3, 2). Results for all the transaction data are shown in Table 2.

Table 2. The set of items and grouped items

Transaction ID   Items                     Grouped items
1                (A, 3) (C, 4) (E, 2)      (T1, 3) (T2, 7) (T3, 2)
2                (B, 3) (C, 7) (D, 7)      (T1, 3) (T2, 10) (T3, 7)
3                (B, 2) (C, 10) (E, 5)     (T1, 2) (T2, 12) (T3, 5)
4                (C, 9) (E, 10)            (T1, 0) (T2, 9) (T3, 10)
5                (A, 7) (D, 8)             (T1, 7) (T2, 7) (T3, 8)
6                (B, 2) (C, 8) (D, 10)     (T1, 2) (T2, 10) (T3, 10)

STEP 2: The quantitative values of the items are represented using fuzzy sets.

Take the first item in transaction T4 as an example. The amount 7 of D is converted into a fuzzy set (0.0/Low + 0.6/Middle + 0.4/High) using the given membership functions (Figure 4). This step is repeated for the other items, and the results are shown in Table 3, where the notation item.term is called a fuzzy region.

Table 3. Fuzzy sets transformed from the data in Table 2 (only the fuzzy regions of each set survive in this copy; the membership values themselves are missing)

TID   Fuzzy regions
1     (A.Low + A.Middle) (C.Low + C.Middle) (E.Low + E.Middle) (T1.Low + T1.Middle) (T2.Middle + T2.High) (T3.Low + T3.Middle)
2     (B.Low + B.Middle) (C.Middle + C.High) (D.Middle + D.High) (T1.Low + T1.Middle) (T2.Middle + T2.High) (T3.Middle + T3.High)
3     (B.Low + B.Middle) (C.Middle + C.High) (E.Low + E.Middle) (T1.Low + T1.Middle) (T2.High) (T3.Low + T3.Middle)
4     (C.Middle + C.High) (E.Middle + E.High) (T2.Middle + T2.High) (T3.Middle + T3.High)
5     (A.Middle + A.High) (D.Middle + D.High) (T1.Middle + T1.High) (T2.Middle + T2.High) (T3.Middle + T3.High)
6     (B.Low + B.Middle) (C.Middle + C.High) (D.Middle + D.High) (T1.Low + T1.Middle) (T2.Middle + T2.High) (T3.Middle + T3.High)

STEP 3: The scalar cardinality of each fuzzy region in the transactions is calculated as the count value. Take the fuzzy region C.High as an example. Its scalar cardinality, the sum of its membership values over the six transactions, is 1.4. This step is repeated for the other regions, and the results are shown in Table 4.

Table 4. The counts of the item regions

Item       Count   Item       Count   Item       Count   Item        Count
A.Low      0.6     C.Low      0.4     E.Low      0.4     T2.Low      0.6
A.Middle   1.2     C.Middle   3.2     E.Middle   1.4     T2.Middle   1.2
A.High     0.2     C.High     1.2     E.High     2.4     T2.High

B.Low      2.2     D.Low      0.0     T1.Low     0.4     T3.Low      0.6
B.Middle   0.8     D.Middle   1.6     T1.Middle  2.4     T3.Middle   2.2
B.High     0.0     D.High     1.4     T1.High    1.2     T3.High     0.2

STEP 4: The fuzzy region with the highest count among the three possible regions for each item is found. Take the item A as an example. The count is 0.6 for Low, 1.2 for Middle, and 0.2 for High. Since the count for Middle is the highest among the three counts, the region Middle is used to represent the item A in later mining processes. This step is repeated for the other items: Low is chosen for B and T1, Middle is chosen for C, D, E and T3, and High is chosen for T2.

STEP 5: The count of each region kept in STEP 4 is checked against the predefined minimum support value α. Assume that in this example α is set at 1.7. Since the count values of B.Low, C.Middle, T1.Low, T2.High and T3.Middle are all larger than 1.7, these items are put in L1 (Table 5).

Table 5. The set of large 1-itemsets in this example

Itemset      Count
B.Low        2.2
C.Middle     3.2
T1.Low       2.8
T2.High      3.0
T3.Middle    2.8

STEP 6: The candidate set C2 is generated from L1, and the items in each itemset of C2 must not have an

ancestor or descendant relation in the taxonomy, as follows: (B.Low, C.Middle), (B.Low, T3.Middle), (C.Middle, T1.Low), (C.Middle, T3.Middle), (T1.Low, T3.Middle), (T2.High, T3.Middle).

STEP 7: The following substeps are done for each newly formed candidate 2-itemset in C2:

(a) The fuzzy membership values of each transaction datum for the candidate 2-itemsets are calculated. Here, the minimum operator is used for intersection. Take (B.Low, C.Middle) as an example. The derived membership value for transaction 2 is calculated as min(0.6, 0.8) = 0.6. The results for the other transactions are shown in Table 6. The results for the other 2-itemsets can be derived in a similar fashion.

Table 6. The membership values for B.Low ∧ C.Middle (columns: TID, B.Low, C.Middle, B.Low ∧ C.Middle; the cell values are missing in this copy)

(b) The scalar cardinality (count) of each candidate 2-itemset in C2 is calculated. Results for this example are shown in Table 7.

Table 7. The fuzzy counts of the itemsets in C2

Itemset                  Count
(B.Low, C.Middle)        2.0
(B.Low, T3.Middle)       1.6
(C.Middle, T1.Low)       2.6
(C.Middle, T3.Middle)    2.2
(T1.Low, T3.Middle)      1.8
(T2.High, T3.Middle)     2.0

(c) Since only the count of (B.Low, T3.Middle) is smaller than the predefined minimum support value 1.7, the other 2-itemsets are kept in L2.

STEP 8: r is set at 2, where r is used to keep the number of items in the current itemsets.

STEP 9: Since L2 is not null, the next step is done.

STEP 10: The candidate set C3 is generated from L2, and only the itemset (C.Middle, T1.Low, T3.Middle) is formed.

STEP 11: The following substeps are done for each newly formed candidate 3-itemset in C3:

(a) The fuzzy membership values of each transaction datum for the candidate 3-itemsets are calculated. Here, the minimum operator is used for intersection. The results are shown in Table 8.

Table 8. The membership values for C.Middle ∧ T1.Low ∧ T3.Middle (columns: TID, C.Middle, T1.Low, T3.Middle, C.Middle ∧ T1.Low ∧ T3.Middle; the cell values are missing in this copy)

(b) The scalar cardinality (count) of each candidate 3-itemset in C3 is calculated. Its scalar cardinality, the sum of the intersection values over the six transactions, is 1.8.

(c) Since the count of (C.Middle, T1.Low, T3.Middle) is larger than the predefined minimum support value 1.7, it is kept in L3.

STEP 12: Since L3 is not null, r = r + 1 = 3 and STEP 10 is done again. The same process is then executed to find large 4-itemsets. In this example, no large 4-itemsets are found. STEP 13 is then executed to find the association rules.

STEP 13: The association rules are constructed for each large itemset using the following substeps.

(a) The possible association rules for the itemsets found are formed as follows:

If B = Low, then C = Middle;
If C = Middle, then B = Low;
If C = Middle, then T1 = Low;

If T1 = Low, then C = Middle;
If C = Middle, then T3 = Middle;
If T3 = Middle, then C = Middle;
If T1 = Low, then T3 = Middle;
If T3 = Middle, then T1 = Low;
If T2 = High, then T3 = Middle;
If T3 = Middle, then T2 = High;
If C = Middle and T1 = Low, then T3 = Middle;
If C = Middle and T3 = Middle, then T1 = Low;
If T1 = Low and T3 = Middle, then C = Middle.

(b) The confidence values of the above thirteen possible association rules are calculated. Take the first association rule as an example. The count of B.Low ∧ C.Middle is 2.0 and the count of B.Low is 2.2. The confidence factor for the association rule "If B = Low, then C = Middle" is then calculated as:

$$\frac{\sum_{i=1}^{6} (B.Low \wedge C.Middle)}{\sum_{i=1}^{6} (B.Low)} = \frac{2.0}{2.2} = 0.91.$$

Results for the other rules are shown below.

If C = Middle, then B = Low has a confidence factor of 0.625;
If C = Middle, then T1 = Low has a confidence factor of 0.81;
If T1 = Low, then C = Middle has a confidence factor of 0.93;
If C = Middle, then T3 = Middle has a confidence factor of 0.69;
If T3 = Middle, then C = Middle has a confidence factor of 0.79;
If T1 = Low, then T3 = Middle has a confidence factor of 0.64;
If T3 = Middle, then T1 = Low has a confidence factor of 0.64;
If T2 = High, then T3 = Middle has a confidence factor of 0.67;
If T3 = Middle, then T2 = High has a confidence factor of 0.71;
If C = Middle and T1 = Low, then T3 = Middle has a confidence factor of 0.69;
If C = Middle and T3 = Middle, then T1 = Low has a confidence factor of 0.82;
If T1 = Low and T3 = Middle, then C = Middle has a confidence factor of 0.9.

STEP 14: The confidence values of the possible association rules are checked against the predefined confidence threshold λ. Assume the given confidence threshold λ is set at 0.8. The six rules kept are the following:

B.Low → C.Middle [sup=2.0, conf=0.91];
C.Middle → T1.Low [sup=2.6, conf=0.81];
T1.Low → C.Middle [sup=2.6, conf=0.93];
T3.Middle → C.Middle [sup=2.2, conf=0.79];
C.Middle and T3.Middle → T1.Low [sup=1.8, conf=0.82];
T1.Low and T3.Middle → C.Middle [sup=1.8, conf=0.9].
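The confidence factors of the ten two-item rules listed above can be recomputed from the counts in Tables 5 and 7, as in the following Python sketch. The dictionary layout and the function name are hypothetical; the numbers are copied from those two tables, and the three-item rules are left out for brevity.

```python
# Counts of the large 1-itemsets (Table 5).
count1 = {"B.Low": 2.2, "C.Middle": 3.2, "T1.Low": 2.8,
          "T2.High": 3.0, "T3.Middle": 2.8}

# Counts of the large 2-itemsets (Table 7, without (B.Low, T3.Middle)).
count2 = {("B.Low", "C.Middle"): 2.0, ("C.Middle", "T1.Low"): 2.6,
          ("C.Middle", "T3.Middle"): 2.2, ("T1.Low", "T3.Middle"): 1.8,
          ("T2.High", "T3.Middle"): 2.0}


def two_item_confidences(count1, count2):
    """Confidence of a -> b is count(a and b) / count(a), in both directions."""
    conf = {}
    for (a, b), c in count2.items():
        conf[(a, b)] = c / count1[a]
        conf[(b, a)] = c / count1[b]
    return conf


for (a, b), value in two_item_confidences(count1, count2).items():
    print(f"If {a}, then {b}: {value:.3f}")
```

Each 2-itemset yields two rules because either item can serve as the consequent, which is why five itemsets produce ten rules.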

STEP 15: Consider the two rules B.Low → C.Middle [sup=2.0, conf=0.91] and T1.Low → C.Middle [sup=2.6, conf=0.93]. T1 is a parent of B, and the two rules have the same linguistic term and the same consequent C. The rule B.Low → C.Middle [sup=2.0, conf=0.91] is therefore redundant and is pruned in this step. The following five rules are thus output to users:

C.Middle → T1.Low [sup=2.6, conf=0.81];
T1.Low → C.Middle [sup=2.6, conf=0.93];
T3.Middle → C.Middle [sup=2.2, conf=0.79];
C.Middle and T3.Middle → T1.Low [sup=1.8, conf=0.82];
T1.Low and T3.Middle → C.Middle [sup=1.8, conf=0.9].

7. Discussion and Conclusions

In this paper, we have proposed a fuzzy data mining generalized association rules algorithm, which can process transaction data with quantitative values and discover interesting patterns among them. The rules thus mined exhibit quantitative regularity at taxonomy and can be used to provide suggestions to appropriate supervisors. When compared with traditional crisp-set mining methods for quantitative data, our approach can obtain smooth mining results due to the fuzzy membership characteristics. Also, when compared with fuzzy mining methods that take all the fuzzy regions into consideration, our method achieves a good time complexity since only the most important fuzzy term is used for each item. A trade-off thus exists

between the rule completeness and the time complexity.

Although the proposed method works well in data mining for quantitative values, it is just a beginning. There is still much work to be done in this field. In the future, we will first extend our proposed algorithm to solve the above two problems. In addition, our method assumes that the membership functions are known in advance. In [17-19], we also proposed some fuzzy learning methods to automatically derive the membership functions. Therefore, we will attempt to dynamically adjust the membership functions in the proposed mining algorithm to avoid the bottleneck of membership function acquisition. We will also attempt to design specific data-mining models for various problem domains.

References

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large database," The 1993 ACM SIGMOD Conference, Washington DC, USA,
[2] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, 1993, pp.
[3] R. Agrawal, R. Srikant and Q. Vu, "Mining association rules with item constraints," The Third International Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August
[4] A. F. Blishun, "Fuzzy learning models in expert systems," Fuzzy Sets and Systems, Vol. 22, 1987, pp.
[5] L. M. de Campos and S. Moral, "Learning rules for a fuzzy inference model," Fuzzy Sets and Systems, Vol. 59, 1993, pp.
[6] R. L. P. Chang and T. Pavlidis, "Fuzzy decision tree algorithms," IEEE Transactions on Systems, Man and Cybernetics, Vol. 7, 1977, pp.
[7] M. S. Chen, J. Han and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6,
[8] C. Clair, C. Liu and N. Pissinou, "Attribute weighting: a method of applying domain knowledge in the decision tree process," The Seventh International Conference on Information and Knowledge Management, 1998, pp.
[9] P. Clark and T. Niblett, "The CN2 induction algorithm," Machine Learning, Vol. 3, 1989, pp.
[10] M. Delgado and A. Gonzalez, "An inductive learning procedure to identify fuzzy systems," Fuzzy Sets and Systems, Vol. 55, 1993, pp.
[11] A. Famili, W. M. Shen, R. Weber and E. Simoudis, "Data preprocessing and intelligent data analysis," Intelligent Data Analysis, Vol. 1, No. 1,
[12] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus, "Knowledge discovery in databases: an overview," The AAAI Workshop on Knowledge Discovery in Databases, 1991, pp.
[13] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, "Mining optimized association rules for numeric attributes," The ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 1996, pp.
[14] A. Gonzalez, "A learning methodology in uncertain and imprecise environments," International Journal of Intelligent Systems, Vol. 10, 1995, pp.
[15] J. Han and Y. Fu, "Discovery of multiple-level association rules from large database," The International Conference on Very Large Databases,
[16] T. P. Hong, C. S. Kuo and S. C. Chi, "A data mining algorithm for transaction data with quantitative values," Intelligent Data Analysis, Vol. 3, No. 5, 1999, pp.
[17] T. P. Hong and J. B. Chen, "Finding relevant attributes and membership functions," Fuzzy Sets and Systems, Vol. 103, No. 3, 1999, pp.
[18] T. P. Hong and J. B. Chen, "Processing individual fuzzy attributes for fuzzy rule induction," Fuzzy Sets and Systems, Vol. 112, No. 1, 2000, pp.
[19] T. P. Hong and C. Y. Lee, "Induction of fuzzy rules and membership functions from training examples," Fuzzy Sets and Systems, Vol. 84, 1996, pp.
[20] A. Kandel, Fuzzy Expert Systems, CRC Press, Boca Raton, 1992, pp.
[21] J. R. Quinlan, "Decision tree as probabilistic classifier," The Fourth International Machine Learning Workshop, Morgan Kaufmann, San Mateo, CA, 1987, pp.
[22] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA,
[23] R. Rastogi and K. Shim, "Mining optimized association rules with categorical and numeric attributes," The 14th IEEE International Conference on Data Engineering, Orlando, 1998, pp.
[24] R. Rastogi and K. Shim, "Mining optimized support rules for numeric attributes," The 15th IEEE International Conference on Data Engineering, Sydney, Australia, 1999, pp.
[25] J. Rives, "FID3: fuzzy induction decision tree," The First International Symposium on Uncertainty, Modeling and Analysis, 1990, pp.
[26] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," The 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, June 1996, pp.
[27] R. Srikant and R. Agrawal, "Mining generalized association rules," The International Conference on Very Large Databases,
[28] C. H. Wang, T. P. Hong and S. S. Tseng, "Inductive learning from fuzzy examples," The Fifth IEEE International Conference on Fuzzy Systems, New Orleans, 1996, pp.
[29] C. H. Wang, J. F. Liu, T. P. Hong and S. S. Tseng, "A fuzzy inductive learning strategy for modular rules," Fuzzy Sets and Systems, Vol. 103, No. 1, 1999, pp.
[30] R. Weber, "Fuzzy-ID3: a class of methods for automatic knowledge acquisition," The Second International Conference on Fuzzy Logic and Neural Networks, Iizuka, Japan, 1992, pp.
[31] Y. Yuan and M. J. Shaw, "Induction of fuzzy decision trees," Fuzzy Sets and Systems, Vol. 69, 1995, pp.
[32] L. A. Zadeh, "Fuzzy sets," Information and Control, Vol. 8, No. 3, 1965, pp.
