Modeling the Real World for Data Mining: Granular Computing Approach

T. Y. Lin
Department of Mathematics and Computer Science, San Jose State University, San Jose, California 95192-0103
and
Berkeley Initiative in Soft Computing, University of California, Berkeley, California 94720
E-mail: tylin@cs.sjsu.edu, tylin@cs.berkeley.edu

Abstract

To each object in an object space, a (possibly empty) family of granules (crisp/fuzzy subsets) of the data space is assigned; we call this assignment a granulation. It is a mild generalization of the neighborhood system of (pre-)topological spaces. If each family has at most one granule, the granulation defines a binary relation. Interestingly, if the granulation is defined by general relations, the data space is a (world) model in first order logic. A knowledge representation of such a world model, that is, an assignment of a uniquely meaningful name (attribute value) to each neighborhood, is called a granular data model. If the granulations are defined by equivalence relations, the model is the classical relational model. Intuitively, a granular data model is a real world data model; note that granules may overlap, so attribute values may not be independent. In other words, attribute domains are more than Cantor sets; intuitively they are real world models (sets). Depending on the structures and representations, the model can be useful in fuzzy logic or data mining. The focus of this paper is on data mining; in fact, semantically rich rules are mined. Its performance is measured; it is some twenty times faster than the traditional Apriori.

1 Introduction

Relation theory is designed to model the real world over a long duration. To accommodate all instances, it assumes that the universe and the attribute domains are all Cantor sets. In other words, the interactions among entities are forgotten in relational modeling. To have a better approximation, we need an appropriate model. In logic, a real world is modeled by a Cantor set (of entities) with relational structure.
As a first step, we decide to consider binary relational structures. Interestingly, this agrees with Zadeh's notion of information granulation [5]. In applications, we reach two seemingly unrelated topics: data mining and fuzzy control. In this paper, however, we focus on data mining. Some impressive experimental results have been achieved. In the classical case, we are 24 times faster than the traditional Apriori.

2 Granulations and Neighborhood Systems

In [5], Zadeh defines (rephrased): an information granulation is a collection of granules, with a granule being a clump of objects (points) which are drawn towards an object. In other words, each object is associated with a family of clumps. This is essentially the notion of Fréchet (V) space [?] or neighborhood systems [9]. In this paper, a fuzzy set is uniquely defined by its membership function [15]. It is a w-sofset if we use the language of [7].

A crisp/fuzzy neighborhood system (F/NS) is defined as follows: to each object p we associate an (empty, finite, or infinite) family of crisp/fuzzy subsets, called clumps. The mathematical system defined by these families is called a crisp/fuzzy neighborhood system, or simply a neighborhood system, and the clumps associated to p are called fundamental neighborhoods of p. Note that if there is at most one fundamental neighborhood at each point, then the neighborhood system is defined by a binary relation; see Section 4.
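As a minimal sketch of the crisp case of this definition, a neighborhood system can be stored as a mapping from each object to its family of clumps. The names `is_binary` and `to_relation`, and the toy data, are hypothetical and chosen only for illustration. The sketch checks the "at most one fundamental neighborhood at each point" condition and, when it holds, recovers the corresponding binary relation as a set of pairs.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

Obj = Hashable
NeighborhoodSystem = Dict[Obj, List[FrozenSet[Obj]]]


def is_binary(ns: NeighborhoodSystem) -> bool:
    """True if every object has at most one fundamental neighborhood."""
    return all(len(family) <= 1 for family in ns.values())


def to_relation(ns: NeighborhoodSystem) -> Set[Tuple[Obj, Obj]]:
    """Binary relation {(p, u)}: u lies in the single neighborhood of p."""
    if not is_binary(ns):
        raise ValueError("some object has more than one neighborhood")
    return {(p, u) for p, family in ns.items() for nbhd in family for u in nbhd}


# Toy data (an assumption for illustration): object 3 has an empty family.
ns = {
    1: [frozenset({1, 2})],
    2: [frozenset({2})],
    3: [],
}
relation = to_relation(ns)
print(sorted(relation))  # [(1, 1), (1, 2), (2, 2)]
```

An object with an empty family simply contributes no pairs, which matches the "possibly empty" case allowed by the definition.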
3 Representations of neighborhood systems

Weighted sum veristic constraints

3.1 Multiple valued representations

We will illustrate the idea by examples. Let U be the universe and consider a family of fuzzy sets of U that covers U. This family is a fuzzy neighborhood system, and each member of the cover is a fuzzy neighborhood of any point in it. The members that contain a point p form the fuzzy neighborhood system at p. The association that assigns to each object the set of names of its neighborhoods is a multiple valued representation of the universe.

3.2 Fuzzy representations

Since the neighborhood system is fuzzy, we will take the weighted average of the multiple values. Let us consider formal expressions, namely formal linear combinations of the names whose coefficients are real numbers. Mathematically, the collection of all such expressions is a vector space. Each vector is called a formal word. For a point p and a fuzzy set of the family, the grade of membership of p in that set is called the weight of p in the set. Based on the weights, we form a formal word representation, which assigns to each point p the formal expression whose coefficients are the weights of p. This expression is called the formal word representation of p; it is Zadeh's veristic constraint [?]. Table 1 consists of all such formal expressions; it is a vector-valued representation of the universe. Each expression represents a certain weighted sum of attribute values.

4 Binary granulation and Partitions

A partition is a collection of pair-wise disjoint subsets whose union is the whole universe U. This is the simplest granulation. Its algebraic counterpart is an equivalence relation, so a natural generalization is a binary relation. We should comment that an obvious geometric generalization of a partition is a covering. Unfortunately, a covering is not the geometric equivalent of a binary relation. The equivalent one is the more elaborate notion called the binary neighborhood system, which is the subject covered next. Intuitively, it is a cover with centers, which is more than a simple notion of cover. The notion of the center plays an essential role in this paper.
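The distinction between a partition and a mere covering can be illustrated with a short sketch. The helper names `is_covering`, `is_partition`, and `equivalence_classes`, as well as the toy universe, are assumptions made for illustration. The classes of an equivalence relation form a partition, while overlapping subsets can still cover the universe without being a partition.

```python
from itertools import combinations
from typing import FrozenSet, Hashable, Iterable, Set, Tuple


def is_covering(family: Iterable[Iterable[Hashable]], universe: Set[Hashable]) -> bool:
    """The union of the family is the whole universe."""
    return set().union(*family) == set(universe)


def is_partition(family: Iterable[Iterable[Hashable]], universe: Set[Hashable]) -> bool:
    """Pair-wise disjoint subsets whose union is the universe."""
    blocks = [set(b) for b in family]
    disjoint = all(a.isdisjoint(b) for a, b in combinations(blocks, 2))
    return disjoint and is_covering(blocks, universe)


def equivalence_classes(
    relation: Set[Tuple[Hashable, Hashable]], universe: Set[Hashable]
) -> Set[FrozenSet[Hashable]]:
    """Classes of an equivalence relation given as a set of pairs."""
    return {frozenset(u for u in universe if (p, u) in relation) for p in universe}


universe = {1, 2, 3, 4}
# Equivalence relation: same parity.
same_parity = {(a, b) for a in universe for b in universe if a % 2 == b % 2}
classes = equivalence_classes(same_parity, universe)
print(is_partition(classes, universe))  # True

# Overlapping subsets: a covering, but not a partition.
cover = [{1, 2}, {2, 3, 4}]
print(is_covering(cover, universe), is_partition(cover, universe))  # True False
```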
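4.1 Binary granulation, relations and neighborhood systems

A crisp/fuzzy binary relation (BR or FBR) is a crisp/fuzzy subset of V × U, the product of the object space V and the data space U, whose membership function maps V × U to M, where M is the membership space; that is, M is either the unit interval [0, 1] or the binary values {0, 1}. For each object p, the relation defines a crisp/fuzzy set on U, called the binary (or elementary) neighborhood of p, whose membership function is obtained by fixing p in the membership function of the relation. The collection of all crisp/fuzzy sets on U is denoted by FZ(U). The map from V to FZ(U) that assigns to each object its binary neighborhood is called a crisp/fuzzy binary granulation, and the set of all binary neighborhoods is called a crisp/fuzzy binary neighborhood system.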
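The passage from a fuzzy binary relation to its binary neighborhoods can be sketched as follows. This is a minimal illustration that stores the membership function as a dictionary on pairs with grades in [0, 1]; the crisp case is obtained by using only the grades 0 and 1. The function names and the toy data are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple

# Membership function of a fuzzy binary relation: (object, data point) -> grade.
Membership = Dict[Tuple[Hashable, Hashable], float]
FuzzySet = Dict[Hashable, float]


def binary_neighborhood(relation: Membership, p: Hashable) -> FuzzySet:
    """Fuzzy set u -> grade of (p, u); pairs with grade 0 are omitted."""
    return {u: grade for (q, u), grade in relation.items() if q == p and grade > 0}


def binary_granulation(relation: Membership, objects: Iterable[Hashable]) -> Dict[Hashable, FuzzySet]:
    """Map each object p to its binary (elementary) neighborhood."""
    return {p: binary_neighborhood(relation, p) for p in objects}


# Toy fuzzy relation (hypothetical data); missing pairs have grade 0.
relation = {
    ("a", "x"): 1.0,
    ("a", "y"): 0.5,
    ("b", "y"): 0.25,
    ("c", "x"): 0.0,
}
granulation = binary_granulation(relation, ["a", "b", "c"])
print(granulation)
# {'a': {'x': 1.0, 'y': 0.5}, 'b': {'y': 0.25}, 'c': {}}
```

Each object receives exactly one (possibly empty) neighborhood, which is the "at most one granule per object" situation that characterizes a binary granulation.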
Proposition. Binary relations, binary granulations, and binary neighborhood systems are equivalent to each other and will be used interchangeably; see [5].

A subset is a definable set if it is a union of equivalence classes. Accordingly, a subset is called a definable neighborhood if it is a union of elementary neighborhoods. If the definable neighborhood contains the elementary neighborhood of p, it is a definable neighborhood of p.

4.2 Induced partitions

The binary granulation is a map, so it induces a partition (or equivalence relation) on the object space by the collection of complete inverse images: two objects belong to the same class exactly when they have the same binary neighborhood.

5 Real world Model and Data Mining

5.1 Granular Data Model - Real world relational theory

A crisp/fuzzy granular data model consists of a 3-tuple: an object space V, a data space U (V and U could be the same set), and a finite family of crisp/fuzzy binary granulations (neighborhood systems or binary relations). If V = U and the family, then denoted by E, is a finite family of equivalence relations, then (U, E) is called a rough data model; it was called a knowledge base in [10] [5] [6] [4]. We will not use that term here since it conflicts with standard usage. The notion of knowledge representation is essentially naming the granular data model, that is, assigning meaningful names to the binary relations (attributes) and their binary neighborhoods (attribute values) [?].

We will illustrate the idea by examples. In the case of the rough data model (Table 6), its representation is an ordinary relation (Table 2). Suppose there are additional semantics, such as conflicts of interest among agents, represented in the second column of Table ??; they induce the equivalence relation in the third column.

[Table: entries Smith, Jones, Blake, Clark, Adams, Peterson, Ewing, Johnson, Pike, Meyers]

[Table: columns Equiv. Class; Elementary Granule (encoded label); Attribute value (meaningful name). S#: TEN, TWENTY, THIRTY, FORTY, EIGHTY, NINETY]

1. In the rough data model, the universe is partitioned into equivalence classes. So we consider the composition p → [p] → NAME([p]), where [p] is the equivalence class of p.
2. In the granular data model, we consider the analogous composition, which maps each object to its unique binary neighborhood and then to a meaningful name.

5.2 Data mining = Granular Computing

The granular data model uses granules as its attribute values, so any logical formula is translated into a set theoretical formula of granules. However, we should note that attribute values are semantically related, so elementary granules of a column (a binary relation) may overlap. So, in processing any logical formula based on attribute values, it is important that one checks the continuity (namely, sees whether it respects the semantics). Such checking is implicitly included in the computing of granules.

[Table: columns Binary neighborhood; Center; Binary neighborhood (meaningful name). S#: centers 10, 20, 30, 40, 80, 90 with names TEN, TWENTY, THIRTY, FORTY, EIGHTY, NINETY]

[Table: columns Binary neighborhood on V; Center (induced partition)]

We collect some generalized standard patterns [2]. Consider two attributes of a relation-with-additional-semantics, one value of each attribute, and the respective elementary granules of these two values. It is clear that each value is the NAME of its elementary granule. Let Card( ) be the cardinal number of a set.

1. Continuous decision rule: a formula "if the first value, then the second value" is a continuous decision rule if the first value implies the second continuously.
2. Continuous universal decision rule: a formula is a continuous universal decision rule (extensional function dependence) iff for every value of the first attribute there is a value of the second attribute such that the corresponding formula is a continuous decision rule.
3. Robust continuous decision rule: a formula is a robust continuous decision rule if it is a continuous decision rule and the Card of the granule of the first value is at least a given threshold.
4. Soft continuous decision rule [8]: a formula is a soft continuous decision rule (strong rule) if the first granule is softly included in the second.
5. Continuous association rule: a pair of values is an association rule if the Card of the intersection of the two granules is at least a given threshold.

We will illustrate the continuous decision rules only; we skip the rest. A formula is a continuous decision rule if, whenever an attribute value in the NEIGH( ) of the first value appears in a tuple, an attribute value in the NEIGH( ) of the second value also appears. So, to check such an if-then formula, one needs to scan through the two columns in Table ?? and check whether the neighborhood of the first value is continuously associated with the neighborhood of the second. In the machine oriented model, the same fact can be checked by the inclusion of the two elementary granules.

5.3 Some performance data

We collect some results on the performance of finding association rules. The relation consists of 128K (= 131072) rows and 16 columns; the required support is 8192, and the memory is 10 megabytes; see Table 7 [3]. The programs for Apriori, AprioriTid, and AprioriHybrid are our honest implementations of the algorithms in [?] [1]. In the implementations, we use a buffer scheme to speed up read/write for all algorithms.

6 Conclusions

In this conclusion, we reflect on our overall approach. In several of our papers, we have literally taken Zadeh's intuitive description of clumps as a formal mathematical notion of granulation. It is essentially a mild generalization of binary relations and neighborhood systems in (pre-)topological spaces [12, 9, ?, 5]. By giving a meaningful name to each granule, we have a representation theory. It extends the classical relational model based on Cantor sets to a real world data model based on real world set theory (neighborhood system space). It is worth noting that in the crisp world the representation is locally multi-valued, while in the fuzzy world we can use weights to combine these names linearly (a weighted average) and form formal words; this turns it into a single-valued representation, namely a formal word table; see Section 3.2. Using Zadeh's terminology, such formal word representations are veristic constraints [?]. A formal word table is a generalization of an information table. So, by applying the table processing techniques of rough set methodology to formal word tables, we expect some useful applications to fuzzy logic control. Our study suggests that granular computing is a reasonable notion.
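The granule computations of Sections 5.2 and 5.3 can be sketched with bitmaps. This is a minimal illustration under stated assumptions, not the implementation measured in [3]: each attribute value is represented by an integer whose bit i is set when row i carries that value, inclusion of granules is tested with bit operations, and the Card of an intersection is a bit count. The function names and the toy columns are hypothetical, and the sketch covers only the classical case in which the values of a column do not overlap.

```python
from typing import Dict, Hashable, List


def granules(column: List[Hashable]) -> Dict[Hashable, int]:
    """Map each attribute value to a bitmap of the rows where it occurs."""
    result: Dict[Hashable, int] = {}
    for row, value in enumerate(column):
        result[value] = result.get(value, 0) | (1 << row)
    return result


def is_included(g1: int, g2: int) -> bool:
    """Granule g1 is a subset of granule g2 (no bit of g1 lies outside g2)."""
    return g1 & ~g2 == 0


def support(g1: int, g2: int) -> int:
    """Card of the intersection: number of rows carrying both values."""
    return bin(g1 & g2).count("1")


# Toy relation with two columns (hypothetical data).
col_a = ["TEN", "TEN", "TWENTY", "THIRTY", "TEN"]
col_b = ["x", "x", "y", "y", "x"]
ga, gb = granules(col_a), granules(col_b)

print(is_included(ga["TEN"], gb["x"]))     # True: every TEN row has x
print(is_included(gb["y"], ga["TWENTY"]))  # False: row 3 has y with THIRTY
print(support(ga["TEN"], gb["x"]) >= 3)    # True for a threshold of 3
```

Once the bitmaps are built in a single pass over each column, checking a decision rule or counting the support of a pair requires no further scan of the table.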
At this point, its essential ingredients are (1) a representation theory of granular structure, which will be useful in data mining, and (2) a formal word representation of input/output spaces, which is potentially useful to fuzzy logic control. In oversimplified terms, the two applications are computing with words. Finally, we would like to say a few words on the computational performance: in classical data mining, granular computing is faster than Apriori [3] because the database scans are replaced by bit operations. In this paper, we extend the use of granular computing to semantically richer models. Such extra semantics can be used to analyze unexpected rules [11]. Granular computing is fast; it seems a promising approach to data mining.

References

[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules. In: Proceedings of 20th VLDB Conference, Santiago, Chile, 1994.
[2] T. Y. Lin, Data Mining and Machine Oriented Modeling: A Granular Computing Approach. Journal of Applied Intelligence, Kluwer, Vol. 13, No. 2, September/October 2000, pp. 113-124.
[3] Eric Louie and T. Y. Lin, Finding Association Rules using Fast Bit Computation: Machine-Oriented Modeling. In: Proceedings of 12th International Symposium, ISMIS 2000, Charlotte, North Carolina, Oct 11-14, 2000. Lecture Notes in AI 1932, 486-494.
[4] T. Y. Lin, Granular Computing: Fuzzy Logic and Rough Sets. In: Computing with Words in Information/Intelligent Systems, L. A. Zadeh and J. Kacprzyk (eds), Springer-Verlag, 183-200, 1999.
[5] T. Y. Lin, Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems. In: Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds), Springer-Verlag, 1998, 107-121.
[6] T. Y. Lin, Granular Computing on Binary Relations II: Rough Set Representations and Belief Functions. In: Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds), Springer-Verlag, 1998, 121-140.
[7] T. Y. Lin, A Set Theory for Soft Computing.
In: Proceedings of 1996 IEEE International Conference on Fuzzy Systems, New Orleans, Louisiana, September 8-11, 1140-1146, 1996.
[8] T. Y. Lin and Y. Y. Yao, Mining Soft Rules Using Rough Sets and Neighborhoods. In: Symposium on Modeling, Analysis and Simulation, IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2, 1095-1100.
[9] T. Y. Lin, Neighborhood Systems and Relational Database. In: Proceedings of 1988 ACM Sixteenth Annual Computer Science Conference, February 23-25, 1988, 725.
[10] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.
[11] Balaji Padmanabhan and Alexander Tuzhilin, Finding Unexpected Patterns in Data. In: Data Mining and Granular Computing, T. Y. Lin, Y. Y. Yao and L. Zadeh (eds), Physica-Verlag, to appear.
[12] W. Sierpinski and C. Krieger, General Topology. University of Toronto Press, 1956.
[13] Lotfi Zadeh, The Key Roles of Information Granulation and Fuzzy Logic in Human Reasoning. In: 1996 IEEE International Conference on Fuzzy Systems, September 8-11, 1, 1996.
[14] W. Ziarko, Variable Precision Rough Set Model. Journal of Computer and System Sciences, Vol. 46, No. 1, February, Academic Press, 1993, pp. 38-59.
[15] H. Zimmermann, Fuzzy Set Theory and its Applications, Second Ed. Kluwer Academic Publishers, 1991.

[Table 7: columns Length of combination; # of Candidates; Association rules; Granule (Full Computation); Granule (Partial); Apriori; AprioriHybrid; AprioriTid]