Association Rules with Additional Semantics Modeled by Binary Relations


T. Y. Lin (1) and Eric Louie (2)

(1) Department of Mathematics and Computer Science, San Jose State University, San Jose, California, tylin@cs.sjsu.edu
(2) IBM Almaden Research Center, 650 Harry Road, San Jose, CA, ewlouie@almaden.ibm.com

In: M. Inuiguchi et al. (eds.), Rough Set Theory and Granular Computing, Springer-Verlag Berlin Heidelberg 2003

Abstract. This paper continues the study of mining patterns from real-world data. Association rules that respect the semantics modeled by binary relations are called binary semantic association rules. Our experiments indicate that semantic computation is necessary, efficient, and fruitful. It is necessary because the support of length-2 candidates is quite high even in randomly generated data. It is efficient because the semantic constraints need to be checked only at length 2. It is fruitful because the additional cost is well compensated by the savings from pruning away (non-semantic) association rules.

Keywords: Binary relation, clustered (semantic) association rules

1 Introduction

Relational theory assumes everything is a classical Cantor set. In other words, the interactions among real-world objects are "forgotten" in relational modeling. However, in practical database processing, some additional semantics in the attribute domains are often employed. For example, in numerical attributes, the order of numbers is often used in SQL statements. In geographical attributes, relationships such as "near" or "in the same area" are often used in data processing by human operators. These additional semantics therefore exist implicitly in the stored database. The natural question is: Can such semantics be modeled mathematically? Fortunately, the model theory of first-order logic provides some answers. Model theory uses relational structures to model the real world. By taking different kinds of relational structures, we can capture different levels of semantics.

1. Classical relational theory: Each attribute domain is discrete, that is, no interactions among entities or attribute values are modeled. The relational structure consists of identity equivalence relations.

2. Binary granular relational theory: Attribute values in each domain interact, are related, or are granulated by a binary relation, for example, the order in a numerical attribute.
   (a) A binary relation (br) $B \subseteq U \times U$ defines a map $p \in U \to B_p$, where $B_p = \{u \mid (p, u) \in B\}$. This map, or the collection $\{B_p\}$, is called a binary neighborhood system (BNS). $B_p$ is called the elementary neighborhood or granule of $p$.
   (b) Conversely, a BNS defines a binary relation $B = \bigcup_p \{p\} \times B_p$.
   (c) If the binary relation $B$ is an equivalence relation $E$, the elementary neighborhood $B_p$ is the elementary set $[p]_E$ (the equivalence class containing $p$) [16].

In this paper, we focus on binary granulated data. In other words, the term "additional semantics" means that each attribute domain is clustered or granulated by a binary relation.

2 Databases with Additional Semantics

2.1 Machine Oriented Relational Models and Rough Structures

A relation is a knowledge representation that maps each entity to a tuple of attribute values. Table 1 illustrates a knowledge representation of the universe $V = \{v_1, v_2, v_3, v_4, v_5\}$. In this view, an attribute can be regarded as a projection that maps entities to attribute values. For example, in Table 1 the CITY attribute is the map $f: V \to Dom(CITY)$, which assigns, at every tuple, the element in the first column to the element in the last column. The family of complete inverse images $f^{-1}(y)$ forms a partition (equivalence relation). So each column (attribute) defines an equivalence relation. Table 1 gives rise to 4 named equivalence relations. Pawlak called the pair consisting of $V$ and a finite family of equivalence relations a knowledge base. Since the term "knowledge base" often carries other meanings, we have called it, after naming all granules, a rough granular structure, or rough structure, which is a special form of binary granular structure [5].

Formally, a binary granular structure consists of a 4-tuple $(V, U, B, C)$, where $V$ is called the object space, $U$ the data space ($V$ and $U$ could be the same set), $B = \{B^i, i = 1, 2, \ldots, n\}$ is a finite set of crisp/fuzzy binary relations, and $C$ is a finite set of elementary concept spaces.
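The correspondence between a binary relation and its binary neighborhood system can be sketched in a few lines of Python. This is only an illustration: the function names `neighborhoods` and `relation_from_bns` and the example relation (the order "<=" on {1, 2, 3}) are assumptions made here, not taken from the paper.

```python
def neighborhoods(universe, relation):
    """Return the BNS {p: B_p}, where B_p = {u | (p, u) in relation}."""
    bns = {p: set() for p in universe}
    for p, u in relation:
        bns[p].add(u)
    return bns


def relation_from_bns(bns):
    """Rebuild the binary relation as the union over p of {p} x B_p."""
    return {(p, u) for p, granule in bns.items() for u in granule}


# Hypothetical example: the order "<=" on a small numerical domain.
U = [1, 2, 3]
LE = {(a, b) for a in U for b in U if a <= b}

BNS = neighborhoods(U, LE)
for p in U:
    print(p, sorted(BNS[p]))

# Going back from the neighborhood system recovers the same relation.
print(relation_from_bns(BNS) == LE)
```

Storing the neighborhoods as a dictionary of sets keeps both directions of the correspondence, items (a) and (b), one line each.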

An elementary concept space consists of the meaningful names of the elementary neighborhoods (subsets) $B^i_p = \{u \mid (u, p) \in B^i\}$, $\forall p \in U$. In traditional language, an elementary concept space is an attribute domain; the name of an elementary neighborhood represents an elementary concept, which is traditionally referred to as an attribute value. When $V$ and $U$ are identical, i.e., $V = U$, and the binary relations are equivalence relations, i.e., $B = E$, then the triple $(U, E, C) = (U, U, E, C)$ is called a rough granular structure, or simply a rough structure.

Proposition 1. A relation instance is equivalent to a rough granular structure. Attribute names and attribute values are the meaningful names of equivalence relations and elementary granules (equivalence classes), respectively.

Such a view of a relation instance has been called machine oriented modeling; see Tables 1 and 2.

V         (S#   SNAME   STATUS   CITY)
v1  -->   (S1   Smith   TWENTY   C1)
v2  -->   (S2   Jones   TEN      C2)
v3  -->   (S3   Blake   TEN      C2)
v4  -->   (S4   Clark   TWENTY   C1)
v5  -->   (S5   Adams   THIRTY   C3)
Table 1. Information Table of Suppliers; arrows and parentheses will be suppressed

Equiv. Class   Elementary Granule (encoded label)   Attribute value (meaningful name)
*              S#(*)                                *
*              SNAME(*)                             *
v1, v4         STATUS(10010)                        TWENTY
v2, v3         STATUS(01100)                        TEN
v5             STATUS(00001)                        THIRTY
v1, v4         CITY(10010)                          C1
v2, v3         CITY(01100)                          C2
v5             CITY(00001)                          C3
Table 2. Rough granular structure and machine oriented model
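The encoded labels of Table 2 can be derived mechanically from Table 1. The following sketch is an illustration under assumptions: the names `ROWS`, `COLUMN`, and `encoded_granules` are hypothetical, and the city value C2 for v2 and v3 is inferred from the surrounding text rather than read directly from the table.

```python
# Table 1 as (entity, S#, SNAME, STATUS, CITY).
# Assumption: v2 and v3 are located in city C2.
ROWS = [
    ("v1", "S1", "Smith", "TWENTY", "C1"),
    ("v2", "S2", "Jones", "TEN", "C2"),
    ("v3", "S3", "Blake", "TEN", "C2"),
    ("v4", "S4", "Clark", "TWENTY", "C1"),
    ("v5", "S5", "Adams", "THIRTY", "C3"),
]
COLUMN = {"S#": 1, "SNAME": 2, "STATUS": 3, "CITY": 4}


def encoded_granules(rows, attribute):
    """Map each attribute value to the bit string of its equivalence class.

    Bit k (counted from the left) is 1 when the k-th entity carries the value.
    """
    index = COLUMN[attribute]
    labels = {}
    for position, row in enumerate(rows):
        bits = labels.setdefault(row[index], ["0"] * len(rows))
        bits[position] = "1"
    return {value: "".join(bits) for value, bits in labels.items()}


STATUS = encoded_granules(ROWS, "STATUS")
CITY = encoded_granules(ROWS, "CITY")
for name, labels in (("STATUS", STATUS), ("CITY", CITY)):
    for value, bits in labels.items():
        print(f"{name}({bits})  {value}")
```

Each attribute value becomes the name of one bit string, which is the sense in which attribute values are "meaningful names" of elementary granules.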

2.2 Relations with Additional Semantics using Binary Relations

Let $B_{CITY}$ be the binary relation displayed in Table 3. Then we have the elementary neighborhoods $B^{C1}_{CITY} = \{C1, C2\}$, $B^{C2}_{CITY} = \{C1, C2, C3\}$, $B^{C3}_{CITY} = \{C2, C3\}$.

CITY   CITY
C1     C1
C1     C2
C2     C1
C2     C2
C2     C3
C3     C2
C3     C3
Table 3. "near"-binary relation

Table 1, with the additional semantics defined in Table 3, defines 4 named binary relations on U. The first 3 are equivalence relations, and the last one is the binary relation $B$ induced from the "near"-binary relation on the domain of the CITY attribute. The binary relation $B$ can easily be expressed by its BNS:

$B_{v_1} = \{v_1, v_2, v_3, v_4\}$,
$B_{v_2} = \{v_1, v_2, v_3, v_4, v_5\}$,
$B_{v_3} = \{v_1, v_2, v_3, v_4, v_5\}$,
$B_{v_4} = \{v_1, v_2, v_3, v_4\}$,
$B_{v_5} = \{v_2, v_3, v_5\}$.

3 Re-formulating Data Mining

What is data mining? The common answer is essentially "to find the patterns in data." This is not entirely accurate. For example, we would not be interested in a rule such as "all data are represented by 5 characters," because this is a pattern of the knowledge representation, not of the real world. To show that a pattern discovered in a knowledge representation is indeed a pattern of the real world is a difficult problem; we would need to show that the equivalent pattern also exists in other knowledge representations. So we take the following alternative: Find the patterns in the mathematical model of the real world.

For relational databases, the mathematical model of the real world is the rough structure; see Table 2. If we conduct data mining in such a structure, a discovered pattern is automatically a pattern of the real world. In this paper, we extend this approach from relational theory to databases with additional semantics.

4 Mining Semantically

In databases with additional semantics, attribute values are semantically related, so in processing any logical formula, e.g., decision or association rules, it is important to check the semantics. Since we use the notion of neighborhood systems (a generalization of topological spaces), rules or patterns that respect the semantics will be termed continuous rules or patterns. We collect some standard continuous patterns [4].

Let A and B be two attributes of a relation with additional semantics. Let c, d be two values of A and B respectively. Let NEIGH(c), NEIGH(d) be the respective elementary granules. It is clear that c = NAME(NEIGH(c)) and d = NAME(NEIGH(d)). Let Card(X) be the cardinal number of a set X.

1. A formula $c \to d$ is a continuous (semantic) decision rule if $NEIGH(c) \subseteq NEIGH(d)$ continuously.
2. A formula $A \to B$ is a continuous (semantic) universal decision rule (extensional functional dependency) iff $\forall c \in A\ \exists d \in B$ such that $NEIGH(c) \subseteq NEIGH(d)$.
3. A formula $c \to d$ is a robust continuous (semantic) decision rule if $NEIGH(c) \subseteq NEIGH(d)$ and $Card(P_c) \geq$ threshold [9].
4. A formula $c \to d$ is a soft continuous (semantic) decision rule (strong rule) if NEIGH(c) is softly included in NEIGH(d), $NEIGH(c) \mathrel{\tilde{\subseteq}} NEIGH(d)$ [15].
5. Continuous (semantic) association rule: see the next section.
6. Weak association rule: A pair (c, d) is an association rule if $Card(NEIGH(c) \cap NEIGH(d)) \geq$ threshold.

We will illustrate the continuous (semantic) decision rules only; see the next section for association rules.

$c \to d$ is a continuous (semantic) decision rule if, whenever an attribute value in NEIGH(c) appears in a tuple, an attribute value in NEIGH(d) also appears in that tuple. So to check "If STATUS = TEN, then CITY = C2," one needs to scan through the two columns in Table 1 and check whether "TEN" ("TEN" = NEIGH("TEN")) is continuously associated with NEIGH(C2). In the machine oriented model, the same fact can be checked by the inclusion of two elementary granules, namely,

"TEN" $\cap$ NEIGH(C2) = STATUS(01100) $\cap$ CITY(11111) = (01100) $\cap$ (11111) = (01100) = STATUS(01100) = "TEN"
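The inclusion check above amounts to a single bitwise AND. The sketch below illustrates it on the supplier data; the names `SUPPLIERS`, `NEAR`, `to_bits`, and `is_included` are hypothetical, and the city C2 for v2 and v3 is an assumption read off the neighborhoods $B_{v_1}, \ldots, B_{v_5}$ listed in Section 2.2.

```python
# Table 1 restricted to STATUS and CITY.
# Assumption: v2 and v3 are located in city C2.
SUPPLIERS = {
    "v1": ("TWENTY", "C1"),
    "v2": ("TEN", "C2"),
    "v3": ("TEN", "C2"),
    "v4": ("TWENTY", "C1"),
    "v5": ("THIRTY", "C3"),
}
# The "near" relation of Table 3, written as elementary neighborhoods.
NEAR = {"C1": {"C1", "C2"}, "C2": {"C1", "C2", "C3"}, "C3": {"C2", "C3"}}
ENTITIES = sorted(SUPPLIERS)


def to_bits(members):
    """Encode a set of entities as an integer bit vector, v1 leftmost."""
    width = len(ENTITIES)
    return sum(1 << (width - 1 - i) for i, v in enumerate(ENTITIES) if v in members)


def status_granule(value):
    """Entities whose STATUS equals the given value (an equivalence class)."""
    return {v for v, (status, _) in SUPPLIERS.items() if status == value}


def city_neighborhood(value):
    """Entities whose CITY is near the given city (an induced neighborhood)."""
    return {v for v, (_, city) in SUPPLIERS.items() if city in NEAR[value]}


def is_included(c_bits, d_bits):
    """True when the granule c_bits is a subset of the granule d_bits."""
    return (c_bits & d_bits) == c_bits


TEN = to_bits(status_granule("TEN"))
NEIGH_C2 = to_bits(city_neighborhood("C2"))
print(format(TEN, "05b"), format(NEIGH_C2, "05b"), is_included(TEN, NEIGH_C2))
```

Representing granules as machine words means one AND and one comparison replace a scan over the two columns, which is the point of the machine oriented model.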

5 Semantic Association Rules

We will call any pattern or rule that respects the semantics a continuous (semantic) pattern or continuous (semantic) rule. Let c and d be two attribute values in a relation.

Definition. Continuous (semantic) association rules
1. A pair (c, d) is said to be in a relation or database if it is a sub-tuple of a tuple that belongs to the relation or database.
2. A pair (c, d) in a given relation is one-way ($c \to d$) continuous (semantic) if for every $x \in B_c$ there is at least one $y \in B_d$ such that (x, y) is in the given relation.
3. A pair (c, d) in a given relation is two-way continuous (or semantic) if ($c \to d$) and ($d \to c$) are both continuous.
4. Continuous (semantic) association rule: A pair (c, d) is a continuous (semantic) association rule iff the pair is an association rule and two-way continuous.
5. Two continuous pairs, $(c_1, c_2)$ and $(c_2, c_3)$, compose into a continuous pair $(c_1, c_3)$. In particular, the composition of continuous association rules is a continuous association rule.
6. Soft association rule: A pair (c, d) is a soft association rule if $Card(NEIGH(c) \cap NEIGH(d)) \geq$ threshold [4], [5].

Here are some of our experimental results; see Tables 4, 5, and 6. Some comments on the algorithms and data: The table has rows and 16 columns; we require the support to be items. The algorithm is restricted to 10 megabytes of main memory, so it is reasonably scalable. It checks one-way continuity. In each results table, the first column gives the length of the combinations (of candidates). A q-combination (combination of length q) that exists in the database is called a q-itemset. The second column is the number of all possible combinations; a q-combination is a join of two (q - 1)-association rules. The third column is the support count; the fourth is the number of association rules. A q-association rule is a q-combination that meets the support requirement. The fifth column is the time needed to generate the results of the next row.

5.1 Randomly generated data

In this experiment, the data is generated randomly, so the individual data items are totally independent. Yet we still find some highly supported itemsets of length 2. This computation implies that frequency by itself is not an "adequate criterion" for meaningful patterns. In other words, digging deeper into the semantics seems necessary.
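The definition of continuous association rules in Section 5 can be made concrete with a small sketch showing how a frequent pair may still be pruned by the semantic check. Everything in it is hypothetical: the two-column relation `ROWS`, the neighborhoods `NEIGH_A` and `NEIGH_B`, the support threshold, and the function names are illustrative assumptions and do not reproduce the experiments.

```python
def support(rows, c, d):
    """Number of tuples in which the pair (c, d) occurs."""
    return sum(1 for row in rows if row == (c, d))


def one_way_continuous(rows, neigh_c, neigh_d):
    """Every x in B_c co-occurs with at least one y in B_d."""
    present = set(rows)
    return all(any((x, y) in present for y in neigh_d) for x in neigh_c)


def two_way_continuous(rows, neigh_c, neigh_d):
    """Both (c -> d) and (d -> c) are one-way continuous."""
    flipped = [(y, x) for x, y in rows]
    return (one_way_continuous(rows, neigh_c, neigh_d)
            and one_way_continuous(flipped, neigh_d, neigh_c))


def is_continuous_association_rule(rows, c, d, neigh_a, neigh_b, min_support):
    """A pair that meets the support threshold and is two-way continuous."""
    if support(rows, c, d) < min_support:
        return False
    return two_way_continuous(rows, neigh_a[c], neigh_b[d])


# Hypothetical relation over two attributes A and B.
ROWS = [(1, 10), (1, 10), (2, 10), (2, 10), (2, 11), (3, 12)]
# Hypothetical neighborhoods: each value is "near" its numeric neighbors.
NEIGH_A = {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}}
NEIGH_B = {10: {10, 11}, 11: {10, 11, 12}, 12: {11, 12}}

for c, d in [(1, 10), (2, 10), (3, 12)]:
    rule = is_continuous_association_rule(ROWS, c, d, NEIGH_A, NEIGH_B, min_support=2)
    print((c, d), support(ROWS, c, d), rule)
```

In this toy relation the pair (2, 10) meets the support threshold but fails the continuity check, because the neighbor 3 of the value 2 never co-occurs with a neighbor of 10; it is the kind of frequent but non-semantic pair that the check prunes.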

Table 4. Even randomly generated data still has length 2 rules
[Columns: Length, Cand, Supp, Rules, δ-time, Comment. Surviving comment entries: # of candidates: 99 1-combinations; # of association rules: 62 1-combinations; # of supported: 44 2-combinations; # of association rules: 0 in the final rows.]

5.2 Semantically generated data

In this experiment, an association rule of length 16 is embedded in the randomly generated data. The data is generated as follows: in generating the first column, the algorithm randomly generates a data item; if this item is the selected one, then, based on the assigned probability, the algorithm either chooses the selected item in the next column or randomly generates another item. The selected rule is randomly embedded in the data. Instead of the traditional apriori algorithm, we use granular computing [13] to find the association rules. The result is in Table 5.

5.3 Semantic/continuous association rules - Data with neighborhoods

In this experiment, an interval (neighborhood) is selected systematically for each element. The interval slides up and down like a sine curve. The details are not important, but the selection reflects that some semantics is imposed on the random data. Note that compositions of continuous association rules of length 2 are continuous, so continuity checking is unnecessary once length 2 is passed. The pruning occurs only at length 2; this is reflected in the experiments. Compare the last rows of Table 5 and Table 6 to see the time saved.

6 Conclusion

Here are our observations:
1. A classical relation is a knowledge representation, while a granular structure is the mathematical model of the real world.
2. Granular computing, then, mines the patterns from the real world, not from its representation.

Table 5. "Semantically" generated data - association rule is expensive
[Columns: Length, Cand, Supp, Rules, δ-time, Comment. Length 1 has 152 candidates. Surviving Rules entries are listed below.]

Length   Rules
1        16
2        120
3        560
4        1820
5        4368
6        8008
7
8
9
10       8008
11       4368
12       1820
13       560
14       120
15       16
16       1

Table 6. Data with neighborhoods - semantic rule cost is inexpensive
[Columns: Length, Cand, Supp, Rules, δ-time, Comment. Length 1 has 152 candidates. Surviving Rules entries are listed below.]

Length   Rules
1        16
2        54
3        118
4        171
5        166
6        106
7        43
8        10
9        1
10       0

3. Granular computing was shown to be faster than traditional data mining [13], [14]. Now we apply it to databases with additional semantics.
4. In granular computing, the cost of checking additional semantics is well compensated by pruning away non-semantic rules.

References

1. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.

2. W. Chu and Q. Chen, "Neighborhood and associative query answering," Journal of Intelligent Information Systems, vol 1.
3. K. Engesser, "Some connections between topological and Modal Logic," Mathematical Logic Quarterly, 41, 49-64.
4. T. Y. Lin, "Data Mining and Machine Oriented Modeling: A Granular Computing Approach," Journal of Applied Intelligence, Kluwer, Vol. 13, No 2, September/October, 2000.
5. T. Y. Lin, "Data Mining: Granular Computing Approach." In: Methodologies for Knowledge Discovery and Data Mining, Lecture Notes in Artificial Intelligence 1574, Third Pacific-Asia Conference, Beijing, April 26-28, 1999.
6. T. Y. Lin, "Granular Computing: Fuzzy Logic and Rough Sets." In: Computing with words in information/intelligent systems, L.A. Zadeh and J. Kacprzyk (eds), Springer-Verlag.
7. T. Y. Lin, "Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems." In: Rough Sets In Knowledge Discovery, A. Skowron and L. Polkowski (eds), Springer-Verlag, 1998.
8. T. Y. Lin, "Granular Computing on Binary Relations II: Rough Set Representations and Belief Functions." In: Rough Sets In Knowledge Discovery, A. Skowron and L. Polkowski (eds), Springer-Verlag, 1998.
9. T. Y. Lin, "Rough Set Theory in Very Large Databases," Symposium on Modeling, Analysis and Simulation, CESA'96 IMACS Multi Conference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2.
10. T. Y. Lin, "Neighborhood Systems and Approximation in Database and Knowledge Base Systems," Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems, Poster Session, October 12-15.
11. T. Y. Lin, "Topological Data Models and Approximate Retrieval and Reasoning," in: Proceedings of 1989 ACM Seventeenth Annual Computer Science Conference, February 21-23, Louisville, Kentucky, 1989.
12. T. Y. Lin, "Neighborhood Systems and Relational Database." Abstract, Proceedings of CSC '88, February, 1988.
13. Eric Louie and T. Y. Lin, "Finding Association Rules using Fast Bit Computation: Machine-Oriented Modeling." In: Proceeding of 12th International Symposium ISMIS2000, Charlotte, North Carolina, Oct 11-14, Lecture Notes in AI.
14. T. Y. Lin and E. Louie, "A Data Mining Approach using Machine Oriented Modeling: Finding Association Rules using Canonical Names." In: Proceeding of 14th Annual International Symposium Aerospace/Defense Sensing, Simulation, and Controls, SPIE Vol 4057, Orlando, April 24-28, 2000.
15. T. Y. Lin and Y. Y. Yao, "Mining Soft Rules Using Rough Sets and Neighborhoods." In: Symposium on Modeling, Analysis and Simulation, IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2.
16. Z. Pawlak, Rough sets. Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers.
17. W. Sierpinski and C. Krieger, General Topology, University of Toronto Press, 1952.
