Semantics Oriented Association Rules


Eric Louie
IBM Almaden Research Center
650 Harry Road, San Jose, CA

T. Y. Lin
Department of Mathematics and Computer Science
San Jose State University, San Jose, CA
tylin@cs.sjsu.edu

/02/$ IEEE 956

Abstract - It is well known that relational theory carries very little semantics. To mine deeper semantics, additional modeling is necessary. In fact, some pure association rules are found to exist even in randomly generated data. In this paper, we consider relational databases in which every attribute value has some additional information, such as price, fuzzy degree, neighborhood, or security compartments and levels. Two types of additions are considered: one is structure added, the other is value added. Somewhat surprisingly, the additional cost of semantics checking is found to be well compensated by the pruning of non-semantic rules.

1. INTRODUCTION

In relational theory, attribute domains are Cantor sets; the interactions among real world objects are forgotten. For data mining, additional modeling of attribute domains is needed to organize the deeper semantics of the data. We will term such an addition to the existing data model semantic added modeling. In this paper, we consider two aspects: one is structure added, the other value added.

2. MOTIVATION - ASSOCIATION RULES IN RANDOM DATA

In the experiment [4], totally discrete data are randomly generated. Somewhat surprisingly, we find some associations of length 2 with substantial support (though they did not meet our artificially high support requirement). This computation implies that frequency itself may not be an adequate criterion for meaningful patterns; see Table 3 in the Experiment Reports, Section 7. So semantic modeling seems necessary for database mining.

3. SEMANTIC ADDED MODELING

What would be the correct mathematical structure to capture the semantics of real world objects? This question has many ad hoc answers; we decided to consult history. The model theory of first order logic uses a Cantor set together with relational structures and functions to model the real world. We will follow it; attribute domains are assumed to have all such structures. Previously we explored the simplest structure, namely, one binary relation added to each attribute domain [5], [6], [9], [10], [12], [11], [7]; altogether, they induce finitely many binary relations on the universe (of entities). In [3], we considered one real valued function for each domain. In this paper, we combine the two; namely, on each domain we have one binary relation and one function.

4. STRUCTURE ADDED DATA MODEL - BINARY RELATIONS

We examine the case in which each attribute domain is assumed to have one binary relation. Its corresponding geometric concept is called a binary neighborhood system (BNS). In the case of an equivalence relation, a BNS is a partition. The binary relation on each attribute domain in turn induces a binary relation on the universe. So on the universe, there are finitely many binary relations. We will examine the impact of such added structure on data mining.

4.1 Crisp/Fuzzy binary neighborhood systems

A binary relation (BR) is a subset $B \subseteq V \times U$. It defines, for each $p \in V$, a set $B_p = \{u \in U \mid (p, u) \in B\}$, called the elementary (basic or binary) neighborhood at $p$. A binary neighborhood system (BNS) denotes either the map $B: p \to B_p$ or the family $\{B_p \mid p \in V\}$. The map $B$ has also been called a binary granulation (BG). The set $V$ together with a BNS is called a BNS-space on $U$, or simply a BNS-space if $V$ and $U$ are the same.

Proposition. BNS, BR and BG are equivalent to each other.

The induced equivalence relation: Note that the BG, $B: p \in V \to B_p \in 2^U$, induces a partition as follows: the collection of complete inverse images $B^{-1}(B_p)$ forms a partition on $V$, and hence an equivalence relation on $V$. We use $E_B$ to denote this equivalence relation. We may drop the subscript if $B$ is understood.

Fuzzifications: Binary relations and neighborhood systems can be fuzzified; in other words, instead of being a subset of $V \times U$, the relation could be a fuzzy subset (a membership function $F_B: V \times U \to [0, 1]$).

4.2 Structure Added Data Models

A traditional relation instance can be viewed as a knowledge representation that maps each entity to a tuple of attribute values. Table 1 illustrates the notion of a relation instance on the universe U = {u1, u2, u3, u4, u5}.

U       (S#    STATUS    CITY)
u1  →   (S1    TWENTY    C1)
u2  →   …
u3  →   (S3    TEN       …)
u4  →   (S4    TWENTY    C1)
u5  →   (S5    THIRTY    C3)

Table 1. An Information Table; arrows and parentheses will be suppressed

In a geographical attribute domain, one can use a binary relation to capture the "near" semantics. So on the CITY attribute, we assume a binary relation holds in the domain; see Table 2.

CITY    CITY
C1      C1
C1      C2
C2      C1
…
C3      C3
C2      C3

Table 2. "near" binary relation

In a numerical attribute, such as STATUS, we have the order binary relation "≤". Next, we express both B and "≤" in BNS format.

Definition. The 3-tuple $(U, A_j, Dom(A_j), j = 1, 2, \ldots, n)$ is called a structure added data model, where $U$ is the universe and the $A_j$ are the attributes. For the example in Table 1, the structure added model is (U, B, {C1, C2, C3}, O, {TEN, TWENTY, THIRTY}).

4.3. The Impact of Added Structure to Data Mining

Note that attributes can be regarded as projections. The CITY attribute is a map, denoted by CITY again, $CITY: U \to Dom(CITY)$, which, in each tuple of Table 1, assigns the element in the first column to the element in the last column. The inverse map $CITY^{-1}$ induces a BNS on $U$. Each binary relation, say B, induces an equivalence relation E; in this case the neighborhood is an equivalence class. We should have similar results for "≤"; we denote it by O (order binary relation).

In mining such a data model, the first concern is the cost of checking the added structure, so experiments were conducted in [4]. Somewhat surprisingly, the cost is well compensated by the saving.
There is a cost in checking the continuity of association rules; however, the pruning of non-continuous rules saves time in computing the long rules. One beauty of continuity is that compositions of continuous rules are also continuous, so the only cost is at the length …
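The constructions of Section 4 (elementary neighborhoods, the induced partition, the pull-back of a BNS to the universe, and the inclusion test behind continuous rules) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relation, the "near" pairs and the helper names (`neighborhoods`, `induced_partition`, `pull_back`, `is_continuous`) are hypothetical and are not taken from the experiments in [4]. NEIGH(c) is assumed here to be the set of entities whose attribute value lies in the elementary neighborhood of c.

```python
from collections import defaultdict

# Hypothetical toy relation: entity -> attribute value.
STATUS = {"u1": "TWENTY", "u2": "TEN", "u3": "TEN", "u4": "TWENTY", "u5": "THIRTY"}
CITY = {"u1": "C1", "u2": "C2", "u3": "C2", "u4": "C1", "u5": "C3"}

# Hypothetical "near" relation on the CITY domain, given as a set of pairs.
NEAR = {("C1", "C1"), ("C1", "C2"), ("C2", "C1"), ("C2", "C2"),
        ("C2", "C3"), ("C3", "C2"), ("C3", "C3")}

# Order relation "<=" on the STATUS domain, also as a set of pairs.
RANK = {"TEN": 10, "TWENTY": 20, "THIRTY": 30}
ORDER = {(p, u) for p in RANK for u in RANK if RANK[p] <= RANK[u]}


def neighborhoods(pairs):
    """Binary relation -> BNS: B_p = {u | (p, u) in pairs}.

    Points that are related to nothing get no entry in this sketch.
    """
    bns = defaultdict(set)
    for p, u in pairs:
        bns[p].add(u)
    return {p: frozenset(n) for p, n in bns.items()}


def induced_partition(bns):
    """Blocks of points that share the same neighborhood (the classes of E_B)."""
    blocks = defaultdict(set)
    for p, n in bns.items():
        blocks[n].add(p)
    return {frozenset(b) for b in blocks.values()}


def pull_back(bns, attribute):
    """Lift a BNS on an attribute domain to the universe through the attribute map."""
    return {c: frozenset(u for u, v in attribute.items() if v in n)
            for c, n in bns.items()}


def is_continuous(neigh_c, neigh_d):
    """c -> d is treated as continuous when NEIGH(c) is included in NEIGH(d)."""
    return neigh_c <= neigh_d


city_neigh = pull_back(neighborhoods(NEAR), CITY)
status_neigh = pull_back(neighborhoods(ORDER), STATUS)

print(is_continuous(status_neigh["THIRTY"], city_neigh["C3"]))  # True
print(is_continuous(status_neigh["TWENTY"], city_neigh["C1"]))  # False
```

Because the test is a plain set inclusion, a candidate pair that fails it can be discarded before any longer rule containing it is generated, which is the pruning effect described above.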

Table 4 is generated with some embedded semantics; that is, some associations are embedded in the algorithm that generates the test data. From the table it is clear that finding pure association rules is expensive. In Table 5, some neighborhoods (one binary relation) are generated. Based on such a structure, the cost of finding continuous association rules is greatly reduced. On the artificial data, the BNS theory proves promising. The next step is to test it on real world applications.

5. VALUE ADDED DATA MODEL

In this section we consider the case in which function values are part of the model.

5.1. Value Added Granular Data Model

Definition. The 4-tuple $(U, A_j, X_j, Dom(A_j), j = 1, 2, \ldots, n)$ is a Value Added Granular Data Model, or VA-Granular Data Model, where for each $A_j$ a value added function is defined on the domain $Dom(A_j)$:

$X_j : Dom(A_j) \to M$,

where $M$ is either a Cantor set $2^U$, the security lattice SC, the real numbers, or $[0, 1]$.

Proposition 1. If $M$ is $2^U$, then $X_j$ is a binary neighborhood system, which is equivalent to a binary relation.

Proposition 2. If $M$ is SC, the security lattice, then $X_j$ is a classification and the granular data model is an MLS data model.

Proposition 3. If $M$ is the real numbers, then $X_j$ is a random variable. A random variable is not a variable that varies randomly; it is merely a function whose numerical values are determined by chance. See [3] for connecting the mathematics to intuition.

Proposition 4. If $M$ is $[0, 1]$, then $X_j$ could be a grade of fuzziness. In this case the granular data model is a fuzzy database.

5.2. The Impact of Added Structure Model to Data Mining

The difference between this section and the last is that the values of the function participate in the computation. For example, the existence of a real valued function implies the existence of a neighborhood system on an attribute domain D (a topological space). However, the imposed constraints go beyond the structure of D: we use the real values.
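One way to see how a real valued function yields a neighborhood system on an attribute domain is to collect, for each point, the points whose values lie within a fixed distance. The sketch below is a hedged illustration of that idea only: the radius `eps`, the sample prices and the function names are assumptions introduced here, not part of the model in [3].

```python
def value_neighborhoods(x, eps):
    """N(p) = {q | |x[q] - x[p]| <= eps} for every point p of the domain."""
    return {p: frozenset(q for q in x if abs(x[q] - x[p]) <= eps) for p in x}


def is_partition(bns):
    """True when any two neighborhoods are either equal or disjoint."""
    hoods = set(bns.values())
    return all(a == b or not (a & b) for a in hoods for b in hoods)


# Hypothetical prices in cents, attached to the values of an ITEM attribute.
price = {"pen": 100, "ink": 120, "pad": 140, "lamp": 900}

bns = value_neighborhoods(price, eps=25)
print(sorted(bns["ink"]))   # ['ink', 'pad', 'pen']
print(is_partition(bns))    # False
```

The neighborhoods of "pen" and "ink" overlap without being equal, so this system is a genuine BNS rather than a partition; the corresponding binary relation is reflexive and symmetric but not transitive.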
Tables 5 and 6 show that computing VA-associations is quite expensive if we use values alone. We note that by assigning the nearest or smallest neighborhood at each point, we obtain a BNS. In the next project, we will use this BNS, called the nearest neighborhood system, together with the values.

6. PATTERNS PRESERVING STRUCTURES

We collect some generalized standard patterns [5]. Let A and B be two attributes of a relation with additional semantics. Let c, d be two values of A and B respectively. Let NEIGH(c), NEIGH(d) be the respective elementary granules. It is clear that c = NAME(NEIGH(c)) and d = NAME(NEIGH(d)). Let Card(·) be the cardinal number of a set.

Structure added association rules

1. A formula $c \to d$ is a continuous or semantic decision rule if the inclusion $NEIGH(c) \subseteq NEIGH(d)$ is continuous.
2. A formula $A \to B$ is a continuous or semantic universal decision rule iff $\forall c \in A\ \exists d \in B$ such that $NEIGH(c) \subseteq NEIGH(d)$. This rule is equivalent to extension functional dependence.
3. A formula $c \to d$ is a robust continuous/semantic decision rule if $NEIGH(c) \subseteq NEIGH(d)$ and $Card(NEIGH(c)) \ge$ threshold [5].
4. A formula $c \to d$ is a soft continuous/semantic decision rule (strong rule) if NEIGH(c) is softly included in NEIGH(d) [12].
5. Weak association rule: A pair (c, d) is an association rule if $Card(NEIGH(c) \cap NEIGH(d)) \ge$ threshold.
6. A pair (c, d) is said to be in a relation (or database) if it is a sub-tuple of a tuple that belongs to the relation (or database).
7. A pair (c, d) in a given relation is one-way ($c \to d$) continuous (or semantic) if for every $x \in B_c$ there is at least one $y \in B_d$ such that (x, y) is in the given relation.
8. A pair (c, d) in a given relation is two-way continuous (or semantic) if ($c \to d$) and ($d \to c$) are both continuous.

9. Semantic association rule: A pair (c, d) is a semantic association rule iff the pair is an association rule and two-way continuous.
10. Soft association rule: A pair (c, d) is a soft association rule if $Card(NEIGH(c) \cap NEIGH(d)) \ge$ threshold [5], [6].

Value added association rules

11. In-the-average association rule: Two attributes $A_i$ and $A_j$ are associated in the average if $|E(X_i) - E(X_j)|$ …, where $E(\cdot)$ is the expected value and $|\cdot|$ is the absolute value.
12. Fuzzy decision rule: A formula $c \to d$ is a fuzzy decision rule if $E_c \subseteq E_d$ and X(c) … X(d), where $E_c$ and $E_d$ are the equivalence classes of c and d. In other words, c and d are the names of the equivalence classes $E_c$ and $E_d$.
13. Security leak rule: A formula $c \to d$ is a security leak decision rule if $E_c \subset E_d$ and X(c) … X(d).
14. Value added association rule (VA-association rule) [3]

14.1 Sum-version: A granule (sub-tuple) $b = (b_1 \cap b_2 \cap \ldots \cap b_q)$ is a q-VA-association rule if $Sum(b) \ge s_q$, where

$$Sum(b) = \sum_{j} x_o^j \, p(x_o^j) = \sum_{j=1}^{q} f^j(b_j) \cdot |b| / |U|, \qquad x_o^j = f^j(b_j).$$

14.2 Min-version: A granule (sub-tuple) $b = (b_1 \cap b_2 \cap \ldots \cap b_q)$ is a q-VA-association rule if $Min(b) \ge s_q$, where

$$Min(b) = \min_{j} x_o^j \, p(x_o^j) = \min_{j=1}^{q} f^j(b_j) \cdot |b| / |U|.$$

14.3 Max-version: A granule (sub-tuple) $b = (b_1 \cap b_2 \cap \ldots \cap b_q)$ is a q-VA-association rule if $Max(b) \ge s_q$, where

$$Max(b) = \max_{j} x_o^j \, p(x_o^j) = \max_{j=1}^{q} f^j(b_j) \cdot |b| / |U|.$$

14.4 Traditional: The Max- and Min-versions are the traditional one iff the profit function is the constant 1.

Recall that we are concerned only with supports, so association rules have no direction. Since we are using granules, a q-association rule is equivalent to a q-large granule; the former q means the length of the tuple, the latter q means the length of the intersection. The frequency of an itemset is the cardinal number of a granule (= a finite intersection of elementary granules). In general, there are no apriori criteria for the value added case.
However, if we require the thresholds to increase with the lengths, then there are apriori criteria: q-large implies all sub-tuples are (q − i)-large, where i ≥ 0. The value added granular data model allows us to import much of probability theory into data mining. We list a sample here; more work will be reported in the near future.

7. EXPERIMENT REPORTS

Here are the acronyms: q = length; c = candidates; a = association rules; s = support count; δ = time needed to generate the next rows (in seconds).

7.1. Random Data

Table 3 gives the results of finding association rules on randomly generated data:

1. The relation has … rows and 16 columns; we require the support to be … items.
2. The number of distinct attribute values is limited to 10; there are real world medical data that meet this constraint.

Table 3. Randomly generated data

7.2. Structure Added Computing

Tables 4 and 5 are the same as Table 3 except that the number of distinct attribute values is limited to …

Table 4: Finding pure association rules is expensive

Table 5: Finding semantic rules is inexpensive

7.3. Value Added Computing

Table 6 gives the results of finding association rules based on data with a real valued function:

1. The relation has 500 rows and 8 columns; we require the weights to be greater than …
2. The number of distinct attribute values is limited to …
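To make the value added measures of rule 14 in Section 6 concrete, the following sketch computes the Sum-, Min- and Max-versions for one sub-tuple. It is an illustration under stated assumptions, not the code used for Table 6: the rows, the profit values and the name `va_measures` are hypothetical, and each measure is taken to be the aggregated profit of the sub-tuple's values multiplied by the support |b|/|U|.

```python
from fractions import Fraction


def va_measures(rows, value, subtuple):
    """Sum/Min/Max value added measures of one sub-tuple (granule).

    rows     -- list of dicts, attribute -> attribute value (the universe U)
    value    -- dict, (attribute, attribute value) -> profit (the value added function)
    subtuple -- dict, attribute -> attribute value (the granule b)
    """
    # The granule is the intersection of the elementary granules of its values.
    granule = [r for r in rows if all(r[a] == v for a, v in subtuple.items())]
    support = Fraction(len(granule), len(rows))      # |b| / |U|
    weights = [value[(a, v)] for a, v in subtuple.items()]
    return {
        "sum": sum(weights) * support,
        "min": min(weights) * support,
        "max": max(weights) * support,
    }


# Hypothetical data: four entities, two attributes, made-up profits.
rows = [
    {"item": "pen", "store": "north"},
    {"item": "pen", "store": "north"},
    {"item": "pen", "store": "south"},
    {"item": "ink", "store": "north"},
]
value = {("item", "pen"): 2, ("item", "ink"): 5,
         ("store", "north"): 1, ("store", "south"): 3}

m = va_measures(rows, value, {"item": "pen", "store": "north"})
print(m["sum"], m["min"], m["max"])  # 3/2 1/2 1

# With the constant profit 1, Min and Max reduce to the ordinary support.
unit = {key: 1 for key in value}
t = va_measures(rows, unit, {"item": "pen", "store": "north"})
print(t["min"], t["max"])  # 1/2 1/2
```

A sub-tuple would then count as a q-VA-association rule when the chosen measure reaches the threshold for its length, for example `m["sum"] >= s_q` for a hypothetical threshold `s_q`.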

8. CONCLUSIONS

The advantages of data mining by granular computing are:

1. It is fast. In mining classical relations, granular computing is faster than Apriori [13], [14] because the "database scans" are replaced by bit operations.
2. The use of granular computing extends to "real world" databases (semantically richer relations); its cost is well compensated by pruning. Such extra semantics may be usable for analyzing unexpected, peculiar rules [15].
3. Granular structure is the mathematical structure of the real world. So this method is mining directly on the real world, not on its representations.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," in Proceedings of the ACM-SIGMOD International Conference on Management of Data, Washington, DC, June 1993.
[2] P. Halmos, Measure Theory, Van Nostrand, 1950.
[3] T. Y. Lin, Y. Y. Yao, and E. Louie, "Value Added Association Rules," 6th Pacific-Asia Conference, Taipei, Taiwan, May 6-8, 2002.
[4] T. Y. Lin, Y. Y. Yao, and E. Louie, "Association Rules with Additional Semantics Modeled by Binary Relations," in: Rough Set Theory and Granular Computing, Shusaku Tsumoto, Masahiro Inuiguchi and Shoji Hirano (eds), Physica-Verlag, to appear.
[5] T. Y. Lin, "Data Mining and Machine Oriented Modeling: A Granular Computing Approach," Journal of Applied Intelligence, Kluwer, Vol. 13, No. 2, September/October 2000.
[6] T. Y. Lin, "Data Mining: Granular Computing Approach," in: Methodologies for Knowledge Discovery and Data Mining, Lecture Notes in Artificial Intelligence 1574, Third Pacific-Asia Conference, Beijing, April 26-28, 1999.
[7] T. Y. Lin, "Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems," in: Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds), Physica-Verlag, 1998.
[8] T. Y. Lin, "Rough Set Theory in Very Large Databases," Symposium on Modeling, Analysis and Simulation, CESA'96 IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2.
[9] T. Y. Lin, "Neighborhood Systems and Approximation in Database and Knowledge Base Systems," Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems, Poster Session, October 12-15.
[10] T. Y. Lin, "Neighborhood Systems and Relational Database," Abstract, Proceedings of CSC '88, February 1988.
[11] T. Y. Lin and M. Hadjimichael, "Non-classificatory Generalization in Data Mining," Proceedings of the Fourth Workshop on Rough Sets, Fuzzy Sets and Machine Discovery, Tokyo, Japan, November 8-10, 1996.
[12] T. Y. Lin and Y. Y. Yao, "Mining Soft Rules Using Rough Sets and Neighborhoods," in: Symposium on Modeling, Analysis and Simulation, IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2.
[13] Eric Louie and T. Y. Lin, "Finding Association Rules using Fast Bit Computation: Machine-Oriented Modeling," in: Proceedings of the 12th International Symposium ISMIS 2000, Charlotte, North Carolina, Oct 11-14, Lecture Notes in A
[14] E. Louie and T. Y. Lin, "A Data Mining Approach using Machine Oriented Modeling: Finding Association Rules using Canonical Names," in: Proceedings of the 14th Annual International Symposium Aerospace/Defense Sensing, Simulation, and Controls, SPIE Vol. 4057, Orlando, April 24-28, 2000.
[15] Balaji Padmanabhan and Alexander Tuzhilin, "Finding Unexpected Patterns in Data," in: Data Mining and Granular Computing, T. Y. Lin, Y. Y. Yao and L. Zadeh (eds), Physica-Verlag, to appear.
