
Data Mining and Machine Oriented Modeling: A Granular Computing Approach

T.Y. Lin 1,2

1 Department of Mathematics and Computer Science, San Jose State University, San Jose, California, tylin@cs.sjsu.edu
2 Berkeley Initiative in Soft Computing, Department of Electrical Engineering and Computer Science, University of California, Berkeley, California, tylin@cs.berkely.edu

Abstract. From the processing point of view, data mining is the machine derivation of properties that are interesting to humans from stored data. Hence, the notion of machine oriented data modeling is explored: an attribute value, in a relational model, is a meaningful label (a property) of a set of entities (a granule). A model that uses these granules themselves (their bit patterns or lists of members) as attribute values is called a machine oriented data model. The model provides good database compaction and a good data mining environment. For moderate size databases, finding association rules, decision rules, etc., can be reduced to easy computation of set theoretical operations on granules. In the second part, these notions are extended to real world objects, where the universe is granulated (clustered) into granules by binary relations. Data modeling and mining with such additional semantics are formulated and investigated. In such models, data mining is essentially a machine "calculus" of granules: granular computing.

1 Introduction

What is data mining? We will explore it from the processing point of view. Roughly, data mining is the reverse of database processing. Database processing is mainly concerned with organizing and storing massive data according to their known semantics, for example, various normal forms. Data mining, on the other hand, is mainly concerned with discovering and extracting previously unknown semantics of stored data. Discovering and extracting are machine derivations of interesting properties, called patterns, from the mathematical structure of the stored data.
What would be the proper primitives for machine processing? In database theory, attribute values are used as primitives to describe entities. We term such a set of descriptions a knowledge representation. Attribute values are meaningful primitives (properties of entities) to humans. To a machine, however, they are merely bits and bytes; human intuition provides no special aid to the processing. In fact, attribute values are often cumbersome to process, because they are semantically interrelated. Ideally, all primitives should be independent of each other. So we take the entities as primitives, just the opposite of database processing. An attribute value is regarded as a name or label of a set of entities (a granule). This leads to the consideration of using these granules themselves, or more precisely, their bit patterns or lists of members, as labels. Such labels are termed canonical names or canonical labels; they are attribute values encoded with machine semantics. A relational model using canonical labels as its attribute values is called a machine oriented data model. The model provides a compact representation of a database. It reduces some classical data mining methods, such as finding association rules, decision rules, etc., to simple set theoretical operations.

This paper is divided into two parts. The first part is a machine oriented relational theory. It is a theory of equivalence relations, an extended rough set theory. Data mining in this model is machine processing of elementary sets (equivalence classes). In the second part, the modeling is extended to real world objects. The universe, consisting of interrelated objects, is more than a set; it is granulated (clustered) by some binary relations. The granules are called elementary neighborhoods [9], an extension of elementary sets. A "relational" theory with such additional semantics is not new; it has been formulated and examined for approximate retrieval, e.g., [16, 3, 5, 7, 19]. Data mining in this extended theory is machine processing of elementary neighborhoods. The computational theory that handles such granulated spaces is called granular computing, a new field inspired by Zadeh [25] and labeled by this author [20]. From the computational point of view, data mining is one form of granular computing.
Part I Machine Oriented Relational Theory

In this part, we re-develop relational database theory from the data mining point of view. It is an extensional database theory ([6], p. 90). As in classical relational theory, the universe of entities and the attribute domains are all classical sets. Roughly, the data are discrete, not clustered.

2 Single Column Representations and Partitions

In this section, we give a detailed illustration of the simplest relational model. Let $V$ be the universe, which is a set of entities. Let $C$, called the elementary concept space, be a set of elementary concepts (attribute values).

2.1 Single Column Representations

A map from the universe $V$ to the collection $C$ of elementary concepts, $A: V \to C$, is called a single column (knowledge) representation. We will be interested only in the attribute values currently in use. In other words, we assume $A$ is an onto map. Such a $C$ is called an active domain by database theorists ([21], p. 11); we may also denote $C$ by $\mathrm{ADom}(A)$. Intuitively, each element in $C$ is a label of a certain property of existing entities; it represents an elementary concept.

Partitions and Quotient Sets. Let $c$ be an element in $C$. The inverse image of $c$ under $A$, in symbols $A^{-1}(c)$, is the set of all those entities whose image is $c$; i.e.,

$A^{-1}(c) = \{u \mid A(u) = c\}$

It is clear that these inverse images $A^{-1}(c)$, $\forall c \in C$, form a partition $P_A$ on $V$. Note that a partition induces an equivalence relation and vice versa. By abuse of notation, we will use $P_A$ to denote the equivalence relation too. Each $A^{-1}(c)$ is an equivalence class. The collection of all equivalence classes constitutes a set, called the quotient set and denoted by $V/A$.

Canonical Representations. An equivalence class plays two roles, one as an element of the quotient set $V/A$, another as a subset of the universe $V$. We can regard the element as the canonical name or label of the subset. In other words, the quotient set is a set that consists of canonical names. We will use $\mathrm{CNAME}(\cdot)$ to denote the canonical name. So the map

$u \to [u] \to \mathrm{CNAME}([u])$

is called the single column canonical representation, where, as usual, $[u]$ denotes the equivalence class containing $u$. This representation will be denoted by CNAME too, that is, $\mathrm{CNAME}(u) = \mathrm{CNAME}([u])$. The graph $(u, \mathrm{CNAME}(u))$ is called the single column canonical information table. Note that as far as computer systems are concerned, both canonical names and meaningful names are bits and bytes.

Examples. We will illustrate the idea by examples.
Let the universe $V = \{ID_1, ID_2, \ldots, ID_9\}$ be a set of 9 restaurant owners, and let the attribute values be the locations of their restaurants. For comparison, we combine two single column representations into one "table"; see Table 1 and Table 2. For humans, the meaningful names convey that some restaurants are located in West Wood, West LA, and Brent Wood, while the canonical names add no information. For machines, however, both are bits and bytes; either choice gives rise to the same mathematical structure of the stored data. In fact, canonical names reveal the machine semantics and may speed up machine processing. In Table 1, ordinary subset notation is used, while in Table 2 we use bit representations; a bit is on if and only if the corresponding object belongs to the subset.

Put Table 1 here.

Put Table 2 here.

2.2 Single Column Machine Oriented Relational Models

Let us first summarize the previous discussion in a theorem.

Theorem.
1. There is a one-to-one correspondence between single column canonical representations and partitions.
2. Each attribute value is defined by one and only one equivalence class.
3. A logical formula of attribute values is defined by, and only by, a set theoretical relationship among equivalence classes.

We refer to Items 2 and 3 of the Theorem as the machine semantics of attribute values and of logical formulas, respectively. One should note that this theorem remains valid when we consider a collection of partitions.

Next, let us consider the following factorization of the representation $A$:

$V \to V/A \to C$

Recall that the quotient set is a set of canonical names, so the first map, which maps each entity to the canonical name of its equivalence class, is the canonical representation. The second map is called the naming map; it sends each canonical name to a meaningful name.

First, we note that $A$ is a single column relational data model. We factor it into a pair of maps. The first map represents the universe of entities by labels encoded with machine semantics; data mining will focus on such encoded labels. The second map translates the encoded labels into terms that humans understand; its primary use is to output the discovered patterns. We will call the pair of maps, or the triple $(V, V/A, C)$, a machine oriented relational model. In table representations, since the first map is implicit in the encoded labels, we only need to display two columns; see the examples below. Perhaps we should note that the triple has been called a granular structure in our earlier papers (Section 8.1).

Examples. In Table 1 or Table 2, there are six elements in the column of canonical names. However, there are only three distinct ones in either table. So we have the condensed forms of machine oriented models:

Put Table 3 here.

The next two representations condense all the information in Tables 1 and 2 into a very compact form: the named bit and list representations of a partition.

Put Table 5 here.

Put Table 4 here.
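The named list and bit representations of a partition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three location names come from the text, but the assignment of owners to locations is hypothetical (the tables are not reproduced here), and the function names `named_partition` and `bit_pattern` are introduced purely for this sketch.

```python
from collections import defaultdict

# Hypothetical single column representation A: V -> C.
# The location names come from the text; which owner sits where is an
# assumption made only for illustration.
LOCATION = {
    1: "West Wood", 2: "West Wood", 3: "West Wood",
    4: "West LA", 5: "West LA",
    6: "Brent Wood", 7: "Brent Wood", 8: "Brent Wood", 9: "Brent Wood",
}


def named_partition(column):
    """Group entities by attribute value: meaningful name -> equivalence class (list form)."""
    classes = defaultdict(list)
    for entity in sorted(column):
        classes[column[entity]].append(entity)
    return dict(classes)


def bit_pattern(members, universe):
    """Bit form of a class: position i is '1' iff the i-th entity belongs to it."""
    members = set(members)
    return "".join("1" if u in members else "0" for u in universe)


universe = sorted(LOCATION)
list_model = named_partition(LOCATION)          # named list representation
bit_model = {name: bit_pattern(cls, universe)   # named bit representation
             for name, cls in list_model.items()}
```

Each entry of `list_model` or `bit_model` pairs a meaningful name with its canonical name, so the dictionary itself plays the role of the naming map in the factorization $V \to V/A \to C$.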

3 Multiple Column Representations and Information Tables

It is clear that the results of the previous section can easily be generalized to multiple column knowledge representations. The graph of such a representation is called an information table; see the Appendix. We shall illustrate the notion by examples.

3.1 Examples

Let $V$ be a set of 9 restaurant owners. Its elementary concept spaces are denoted by $\mathrm{ADom}_{attribute}$; see Section 2.1. It has three single column representations:

1. TYPE: $V \to \mathrm{ADom}_{TYPE}$
2. LOCATION: $V \to \mathrm{ADom}_{LOCATION}$
3. PRICE: $V \to \mathrm{ADom}_{PRICE}$

Information Table: These three representations form a multiple column knowledge representation. Its graph is the following information table, Table 6.

Put Table 6 here.

Bit Table: Each single column representation induces a partition. If we label each equivalence class by its bit pattern, we have the canonical information table, Table 7; we skip the list representations.

Put Table 7 here.

4 Machine Oriented Relational Models

These three attributes induce three partitions; we will treat them as one multiple partition. Pawlak called it a knowledge base [22]. Roughly, a machine oriented model is a multiple partition in which each equivalence class is given a meaningful name. Tables 6 and 7 can be condensed to the following named multiple partition:

Put Table 8 here.

It is obvious what a list model looks like; we skip the details.

5 Data Mining on Relational Databases

We will formulate some classical data mining notions in our models. Let $c$, $d$ be attribute values of a relational database. Let $P$, $Q$ be the equivalence classes corresponding to $c$ and $d$. In other words, $c = \mathrm{NAME}(P)$ and $d = \mathrm{NAME}(Q)$. Let $\mathrm{Card}(\cdot)$ be the cardinal number of a set.

1. Association rule: A pair $(c, d)$ is an association rule if $\mathrm{Card}(P \cap Q) \geq$ threshold [1].
2. Decision rule: A formula $c \to d$ is a decision rule if $P \subseteq Q$ [22].
3. Robust decision rule: A formula $c \to d$ is a robust decision rule if $P \subseteq Q$ and $\mathrm{Card}(P) \geq$ threshold [12].
4. Soft decision rule (strong rule): A formula $c \to d$ is a soft decision rule (strong rule) if $\mathrm{Card}(P \setminus (P \cap Q)) \leq$ threshold [27].

5.1 Discovering Decision Rules

Let us examine how the following decision rule can be discovered from Table 6. The rule can be expressed in several formats:

1. "if-then" format: "If cuisine TYPE is American, then the PRICE is inexpensive,"
2. logic formula: "American $\to$ inexpensive," or
3. set theoretical formula: "American $\subseteq$ inexpensive."

In the classical model, we scan through the TYPE and PRICE columns of Table 6 to check whether the attribute value "American" is consistently associated only with "inexpensive." In the machine oriented approach, we only need to verify the inclusion of two equivalence classes. This can be readily verified by bit operations if the database is of moderate size. For example, if the database has one million rows, each bit pattern occupies $2^{20}/32 = 2^{15}$ words (32 bits per word), so the check takes about 32K word operations, which is considerably less than one database access. In this particular example, we only need two assembly instructions, "and" and "compare":

"American" $\cap$ "inexpensive" = TYPE( ) $\cap$ PRICE( ) = TYPE( ) = "American"

Note that the attribute names (TYPE, PRICE) only refer to the partitions; they have no effect on the bit patterns. From this simple example, one can see that the machine oriented model seems to be a more desirable approach; the details will appear in future papers.
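The "and" plus "compare" check can be sketched with Python integers used as bit patterns. This is a minimal illustration, not the author's implementation: the membership of the two classes is hypothetical (Table 6 is not reproduced here), and the helper names `to_bits`, `is_decision_rule` and `words_needed` are assumptions introduced for this sketch.

```python
def to_bits(members):
    """Encode a set of 1-based entity ids as an integer bit pattern."""
    bits = 0
    for u in members:
        bits |= 1 << (u - 1)
    return bits


def is_decision_rule(p_bits, q_bits):
    """c -> d holds iff P is a subset of Q, i.e. (P AND Q) compares equal to P."""
    return (p_bits & q_bits) == p_bits


def words_needed(rows, word_bits=32):
    """Number of machine words one bit pattern occupies (ceiling division)."""
    return -(-rows // word_bits)


# Hypothetical equivalence classes, for illustration only.
american = to_bits({1, 2})
inexpensive = to_bits({1, 2, 7, 8})

rule_holds = is_decision_rule(american, inexpensive)     # American -> inexpensive
converse_holds = is_decision_rule(inexpensive, american)  # inexpensive -> American
words_for_million_rows = words_needed(2 ** 20)            # 2^20 / 32 = 2^15 words
```

With these assumed classes the rule holds in one direction only, and a one million row pattern fits in $2^{15} = 32768$ words, matching the 32K figure in the text.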

5.2 Discovering Association Rules

There is a large literature on association rules; we defer the comparative study of our approach against various well known algorithms, e.g., [1, 2], to future papers [8]. In this section, we illustrate, by simple examples, that for moderate size databases we have a viable approach. Based on the model in Section 4, to find out whether "French" and "expensive" form an association rule (with support 3), we only need to compute their intersection. Let $\mathrm{Card}(\cdot)$ denote the cardinal number of a set.

$\mathrm{Card}(\text{"French"} \cap \text{"expensive"}) = \mathrm{Card}(\mathrm{TYPE}(\ ) \cap \mathrm{PRICE}(\ )) = \mathrm{Card}((\ ) \,\&\&\, (\ )) = \mathrm{Card}((\ )) = 4 \geq 3$.

Based on our modeling, finding association rules is reduced to computing intersections. When the database is small, bit patterns are useful representations. When the database is larger, however, we may need a good balance between list and bit representations. Roughly, we need a clever way to take bit-intersections (using bit patterns to compute the intersections).

First, let us translate some terminology. A 1-itemset is a label of an equivalence class; by abuse of language, we shall refer to it as an equivalence class. A 2-itemset is the intersection of two equivalence classes; we will abbreviate it as a 2-intersection. In general, a k-itemset is a k-intersection. A k-itemset is large if its k-intersection is large.

To find all the association rules, we first find all the large 1-itemsets by pure counting. Next, we find all large 2-itemsets by computing the 2-intersections of large 1-itemsets. In general, we find the large k-itemsets by computing the k-intersections of large (k-1)-itemsets. However, we do not want to compute a k-intersection unless all its (k-1)-sub-intersections are large. The idea is similar to Apriori, but slightly different. As soon as the cardinal numbers of the k-intersections get smaller, we may shift from the bit representation to the list representation of a k-intersection; see forthcoming papers.
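The level-wise search described above can be sketched as follows. This is a rough sketch under assumptions, not the algorithm of [8]: the class memberships are hypothetical (chosen so that "French" and "expensive" share four owners, as in the text), bit patterns are plain Python integers, and `large_itemsets` is a name introduced for this example. The switch from bit to list representations is not modeled.

```python
from itertools import combinations


def to_bits(members):
    """Encode a set of 1-based entity ids as an integer bit pattern."""
    bits = 0
    for u in members:
        bits |= 1 << (u - 1)
    return bits


def card(bits):
    """Cardinal number of the set encoded by a bit pattern."""
    return bin(bits).count("1")


def large_itemsets(classes, threshold):
    """Map every large itemset (a frozenset of names) to its k-intersection."""
    # Large 1-itemsets are found by pure counting.
    current = {frozenset([name]): bits
               for name, bits in classes.items() if card(bits) >= threshold}
    result = dict(current)
    k = 2
    while current:
        following = {}
        for a, b in combinations(current, 2):
            candidate = a | b
            if len(candidate) != k or candidate in following:
                continue
            # Skip unless all (k-1)-sub-intersections are large.
            if any(frozenset(s) not in current
                   for s in combinations(candidate, k - 1)):
                continue
            bits = current[a] & current[b]  # the k-intersection
            if card(bits) >= threshold:
                following[candidate] = bits
        result.update(following)
        current = following
        k += 1
    return result


# Hypothetical named partitions for TYPE and PRICE (illustration only).
classes = {
    "American": to_bits({1, 2}),
    "French": to_bits({3, 4, 5, 6}),
    "Chinese": to_bits({7, 8, 9}),
    "inexpensive": to_bits({1, 2, 7, 8}),
    "expensive": to_bits({3, 4, 5, 6, 9}),
}
found = large_itemsets(classes, threshold=3)
```

Two classes of the same attribute are disjoint, so their intersection is empty and they are pruned by the threshold without any special handling.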
Part II Machine Oriented Models for Real World Data

In database processing, though the relational model is very effective, its mathematical structure does not adequately reflect the semantics of real world data. To capture some of these semantics in data mining, the relational model needs to be extended. In relational theory, the universe of entities is assumed to be a classical set. In other words, there is no interaction among entities. We do know, however, that there are interactions among real world objects: there are similarities among events, distances in space, hierarchies in company positions, etc. What should be the proper extra structures? In formal logic, Tarski imposes a relational structure on each world model. In fuzzy theory, Zadeh implicitly imposes a granular structure; see Section 6. Since both structures are "generated" by crisp or fuzzy binary relations, in this part we will assume that the universe is not a classical set, as relational theory postulates, but a set with granulation (clustering) imposed by crisp binary relations. Data modeling and mining for such a universe are the main focal points of this part.

6 Granulation and Binary Relations

Let us quote the following from [10]: according to Lotfi Zadeh [25], "information granulation involves partitioning a class of objects (points) into granules, with a granule being a clump of objects (points) which are drawn together by indistinguishability, similarity or functionality." By observing some technical points, we translate it into a formal definition, called granular structure [9]. The structure consists of constraints imposed by some forms of crisp or fuzzy binary relations. In this part, we will focus on a subset of it.

6.1 Binary Relations

In [9], we formulate the theory in two universes; in this paper, however, we will be interested only in a single universe $V$, the object space. Let $B \subseteq V \times V$ be a binary relation on $V$. For each object $p \in V$, we associate a subset $B_p = \{u \mid p\,B\,u\}$, called the elementary B-neighborhood of $p$, or simply an elementary neighborhood. The collection $B = \{B_p \mid \forall p \in V\}$ is called a binary B-neighborhood system (BNS). The association defines a map $B: V \to 2^V : p \to B_p$, called a binary B-granulation or simply a granulation; it is clear that the map $B$ and the set $\{B_p\}$ determine each other. Conversely, suppose a binary neighborhood system $\{B_p\}$ is given; then a binary relation can easily be defined by $B = \{(p, u) \mid u \in B_p\}$.
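The passage between a binary relation and its neighborhood system can be sketched directly from the two formulas $B_p = \{u \mid p\,B\,u\}$ and $B = \{(p, u) \mid u \in B_p\}$. The small relation below is hypothetical, and the function names `neighborhoods` and `relation_of` are assumptions introduced for this illustration.

```python
def neighborhoods(relation, universe):
    """Binary neighborhood system: p -> B_p = {u | p B u} for every p in the universe."""
    system = {p: set() for p in universe}
    for p, u in relation:
        system[p].add(u)
    return system


def relation_of(system):
    """Recover the binary relation B = {(p, u) | u in B_p} from the system."""
    return {(p, u) for p, nbhd in system.items() for u in nbhd}


# A small hypothetical relation on V = {1, 2, 3}, for illustration only.
V = {1, 2, 3}
B = {(1, 2), (1, 3), (2, 2), (3, 1)}

system = neighborhoods(B, V)   # {1: {2, 3}, 2: {2}, 3: {1}}
round_trip = relation_of(system)
```

Going from the relation to the system and back returns the original relation, which is the content of the one-to-one correspondence.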

So we conclude this subsection with a

Proposition. There is a one-to-one correspondence between binary neighborhood systems, binary granulations and binary relations.

Since binary relations, binary neighborhood systems and binary granulations are essentially the same concept, we will treat them as synonyms and use them interchangeably. In fact, by abuse of notation, we have used the same notation $B$ for all of them. If the binary relation $B$ is an equivalence relation $E$, then the binary granulation is a partition. Each equivalence class is the elementary E-neighborhood of its members.

6.2 Neighborhood System Spaces

The pair $(V, B)$ is called a binary neighborhood system space (BNS-space). In this paper, we may simply refer to it as a neighborhood system space (NS-space). Note that, strictly speaking, a binary neighborhood system space is a neighborhood space, but not vice versa [11, 9, 10]. An NS-space is a space with multilevel or multiple granulations, while a BNS-space has only one single level of granulation. An NS-space is a pre-topological space; it is a variant of the Frechet (V)-space [23]. In the case that $B$ is an equivalence relation $E$, $(V, E)$ is a clopen topological space [13]. Let $B'$ be another binary relation.

Definition.
1. A subset $X$ is called a definable B-neighborhood if $X$ is a union of elementary neighborhoods of $B$.
2. A subset $X_p$ is called a definable B-neighborhood of $p$ if, further, the union contains the elementary B-neighborhood of $p$.
3. The set of all definable B-neighborhoods at $p$ is denoted by $BS(p)$; in a BNS-space, there is at most one elementary neighborhood $B_p$ in $BS(p)$ at each $p$. The set of all definable B-neighborhoods is denoted by $BS(U)$.
4. Let $X$ be a subset of $V$. $\mathrm{NEIGH}(X) = \bigcup_{p \in X} B_p$ is called the elementary B-neighborhood of $X$; note that it is a definable neighborhood.
5. $B'$ strongly depends on $B$, denoted by $B \Rightarrow B'$, iff every elementary $B'$-neighborhood is a definable B-neighborhood.
6. If $B \Rightarrow B'$, we will say $B$ is definably finer than $B'$, or $B'$ is definably coarser than $B$.

Strong dependence is an elaborate extension of the refinement of equivalence relations. The obvious extension, which says every $B'$-neighborhood is a subset of a B-neighborhood, does not have the desirable properties of "functional" dependency (or the knowledge dependency of Pawlak). We require that every elementary $B'$-neighborhood be a union of B-neighborhoods.

We recall some specific binary neighborhood systems from [9].

Definition.

1. $(V, B)$ is serial, if $\forall p$, $B_p$ is non-empty;
2. $(V, B)$ is reflexive, if $\forall p$, $p \in B_p$;
3. $(V, B)$ is symmetric, if $\forall p, \forall q$, $q \in B_p \Rightarrow p \in B_q$;
4. $(V, B)$ is transitive, if $\forall p, \forall q, \forall r$, $q \in B_p$ and $r \in B_q \Rightarrow r \in B_p$;
5. $(V, B)$ is Euclidean, if $q \in B_p$ and $r \in B_p \Rightarrow r \in B_q$;
6. $(V, B)$ is clopen, if $B$ is reflexive, symmetric, and transitive [13].

7 Single Column Granular Representations

Suppose we are given a universe $V$, a binary neighborhood system $B$, and an elementary concept space $C$ (an active domain of attribute values; see Section 2.1). Then the 3-tuple $(V, B, C)$ is called a granular structure; see Section 8.1. Let us consider the map

$GN: p \to B_p \to \mathrm{NAME}(B_p)$,

where the first map is the granulation $B$ (Section 6.1), and the second map is a naming map. $GN$ is called a single column granular representation. Its graph $(p, GN(p))$ is a single column granular table. As before, we will call it the canonical single column granular table, or simply the canonical granular table, if we use the canonical names. Note that the first map $B$ induces a partition on $V$ (two objects are equivalent if they have the same elementary neighborhood); we will denote it by $P_B$. It is clear that $GN$ can be factored through $V/P_B$:

$p \to [p] \to B_p \to \mathrm{NAME}(B_p)$.

Note that in the relational case, the middle map is an identity.

7.1 Examples

Let $V$ be the set of restaurant owners as given in Table 1. We will suppress the ID from $ID_i$, so the set of restaurant owners is $V = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$. Further, $V$ has a new "attribute" defined as follows: each restaurant owner is associated with a group of major investors in his restaurant; the investors are members of $V$. Each group has a registered name, such as the bronze, silver, gold, or platinum group. Note that an American restaurant is too expensive for its owner to be a major investor, so the group associated with $ID_1$ or $ID_2$ does not include the owners. Technically, a group is an elementary neighborhood. Here are the lists of these groups of investors.

1. $B_1 = B_2 = \{3, 4, 5, 6, 7, 8, 9\}$,
2. $B_3 = \{1, 2, 3\}$,
3. $B_4 = B_5 = \{4, 5\}$,
4. $B_6 = B_7 = B_8 = B_9 = \{6, 7, 8, 9\}$.

We name each neighborhood as follows:

1. $\mathrm{NAME}(B_1) = \mathrm{NAME}(B_2) =$ platinum,
2. $\mathrm{NAME}(B_3) =$ bronze,
3. $\mathrm{NAME}(B_4) = \mathrm{NAME}(B_5) =$ silver,
4. $\mathrm{NAME}(B_6) = \mathrm{NAME}(B_7) = \mathrm{NAME}(B_8) = \mathrm{NAME}(B_9) =$ gold.

7.2 Single Column Granular Table

Let $C = \{$bronze, silver, gold, platinum$\}$ and consider the single column granular table for INVESTORS; see Table 9. Note that $B$ induces a binary relation $B_C$ on $C$:

Definition. Let $c = \mathrm{NAME}(P)$ and $d = \mathrm{NAME}(Q)$ be two elements in $C$. Then $c\,B_C\,d$ iff $\exists p \in P$ and $\exists q \in Q$ such that $p\,B\,q$.

Table 10 is such a binary relation for the INVESTORS attribute.

Put Table 9 here.

Put Table 10 here.

7.3 Single Column Machine Oriented Granular Models

In relational theory, a single column machine oriented model is a named partition; each elementary set (equivalence class) is named. For real world data, it is a named binary granulation (binary neighborhood system); each elementary neighborhood is named. One can represent it in a table format; see Table 11 or Table 12, where we have grouped the owners together if their neighborhoods are the same. In fact, the grouping is the partition $P_B$; see the beginning paragraphs of Section 7. Since the members of an elementary neighborhood are explicitly listed, there is no need to display the semantic relations.

Put Table 11 here.

Put Table 12 here.

8 Multiple Column Granular Representations

It is rather easy to generalize the single column representation theory to multiple column representations, so this section will be rather brief and formal.

8.1 Granular Structures

A binary granular structure consists of a 3-tuple $(V, B^j, C^j, j = 1, 2, \ldots, n)$ where

1. $V$ is the universe, called the object space.
2. Each $B^j$ is a binary neighborhood system (a binary relation), $j = 1, 2, \ldots, n$.
3. Each $B^j$ consists of elementary neighborhoods $B^j_p$, $\forall p \in V$.
4. Each elementary neighborhood is given a meaningful name, that is, $C^j_p = \mathrm{NAME}(B^j_p)$, $j = 1, 2, \ldots, n$ and $p \in V$.
5. $C^j$ is the elementary concept space that consists of all the names of elementary neighborhoods in $B^j$, $j = 1, 2, \ldots, n$. It is also referred to as an active domain, namely $C^j = \mathrm{ADom}_{B^j}$; see Section 2.1.

A collection of single column granular representations forms one multiple column granular representation, or simply a granular representation. Its graph will be called a multiple column granular table, or simply a granular table; it was called an extended information table in [10]. Perhaps, once again, we should caution the reader that, unlike the case of relational theory, the entries in a granular table are not semantically independent within their respective domains; see Table 9.

8.2 Continuous Functions

Let us consider the following relation, Table 13, which is derived from Table 9 and Table 6. If we had forgotten the semantic relation on the active domain of INVESTORS, we would think that there is an extensional functional dependency INVESTORS $\to$ TYPE in Table 13. However, bronze and platinum, for example, are $B_C$-related, but their images, American and Chinese, are not related. So the map INVESTORS $\to$ TYPE does not respect the semantic relation, and we will not treat it as a functional dependency. We require that a functional dependency respect such semantic relations. So we define:

Definition.
1. A map $F$, which is defined on a neighborhood of $p$, is continuous at $p$ if $F(\mathrm{NEIGH}(p)) \subseteq \mathrm{NEIGH}(q)$, where $q = F(p)$.
2. $C^j$ is continuously functionally dependent on $C^h$ if there is a map $F: C^j \to C^h$ that is continuous at every point $p \in C^j$.

The only continuous functional dependency in Table 13 is TYPE $\to$ INVESTORS.
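The semantic relation on the INVESTORS domain can be recomputed from the investor groups listed in Section 7.1. The sketch below is an illustration under one literal reading of the definition of $B_C$ (with $P$ and $Q$ taken as the elementary neighborhoods named $c$ and $d$); since Table 10 is not reproduced here, the result is not checked against it. The function names are assumptions introduced for this example.

```python
# Investor groups from Section 7.1: owner p -> elementary neighborhood B_p.
B = {
    1: {3, 4, 5, 6, 7, 8, 9}, 2: {3, 4, 5, 6, 7, 8, 9},
    3: {1, 2, 3},
    4: {4, 5}, 5: {4, 5},
    6: {6, 7, 8, 9}, 7: {6, 7, 8, 9}, 8: {6, 7, 8, 9}, 9: {6, 7, 8, 9},
}

# Registered group names from Section 7.1: owner p -> NAME(B_p).
NAME = {1: "platinum", 2: "platinum", 3: "bronze", 4: "silver", 5: "silver",
        6: "gold", 7: "gold", 8: "gold", 9: "gold"}


def is_serial(system):
    """Every elementary neighborhood is non-empty."""
    return all(system[p] for p in system)


def is_reflexive(system):
    """Every object belongs to its own elementary neighborhood."""
    return all(p in system[p] for p in system)


def induced_relation(system, name):
    """c B_C d iff some p in P and some q in Q satisfy p B q (q in B_p),
    where P and Q are the elementary neighborhoods named c and d."""
    named = {name[p]: system[p] for p in system}
    return {(c, d)
            for c, P in named.items()
            for d, Q in named.items()
            if any(q in system[p] for p in P for q in Q)}


B_C = induced_relation(B, NAME)
```

Under this reading, `("bronze", "platinum")` belongs to `B_C`, in line with the remark that bronze and platinum are $B_C$-related. The system is serial but not reflexive, because owners 1 and 2 are not members of their own investor groups.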
Put Table 13 here.

9 Machine Oriented Granular Models

A multiple column machine oriented model is a collection of single column machine oriented models. In other words, it is a collection of named binary granulations (binary neighborhood systems); each elementary neighborhood is named. The machine oriented model for Table 13 is simply the union of Table 12 and part of Table 8; see Table 14.

Put Table 14 here.

10 Data Mining on Clustered Data

We will extend many classical notions of various rules to granular tables. Recall that each elementary concept space (active domain) is an NS-space; see Section 6.2. Let $(V, B^j, C^j, j = 1, 2, \ldots, n)$ be a granular structure. The elementary neighborhood $B^j_p$ of $p \in V$ will be denoted by $\mathrm{NEIGH}_{B^j}(p)$. Note that there is an induced binary relation on $C^j$; see the Definition in Section 7.2. So there is an elementary neighborhood for each element in $C^j$. Write $c = \mathrm{NAME}(P)$ and $d = \mathrm{NAME}(Q)$, where $P$ and $Q$ are elementary neighborhoods in $B^1$ and $B^2$, respectively. Note that $\mathrm{NEIGH}_{B^j}(P)$, or simply $\mathrm{NEIGH}(P)$ if $B^j$ is understood, means the union of $\mathrm{NEIGH}_{B^j}(p)$ over all $p \in P$. Also note that $\mathrm{NEIGH}(c)$ is an elementary neighborhood in the elementary concept space $C^j$.

1. Soft association rule: A pair $(c, d)$ is a soft association rule if $\mathrm{Card}(\mathrm{NEIGH}(P) \cap \mathrm{NEIGH}(Q)) \geq$ threshold.
2. Soft decision rule: A formula $c \to d$ is a soft decision rule if $P \subseteq \mathrm{NEIGH}(Q)$ [18].
3. Continuous decision rule: A formula $c \to d$ is a continuous decision rule if $P \subseteq Q$ and $\mathrm{NEIGH}(c) \subseteq \mathrm{NEIGH}(d)$.
4. Softly robust continuous decision rule: A formula $c \to d$ is a softly robust continuous decision rule if $\mathrm{NEIGH}(c) \subseteq \mathrm{NEIGH}(d)$ and $\mathrm{Card}(\mathrm{NEIGH}(P) \cap \mathrm{NEIGH}(Q)) \geq$ threshold [12].
5. (Softly robust) high level continuous decision rule: Suppose $P$ and $Q$ are two definably coarser granular structures of $B^1$ and $B^2$. A formula $c \to d$ is a (softly robust) high level continuous decision rule if $\mathrm{NEIGH}(P) \subseteq \mathrm{NEIGH}(Q)$ (and $\mathrm{Card}(\mathrm{NEIGH}(P) \cap \mathrm{NEIGH}(Q)) \geq$ threshold) [14, 4, 9].

Some applications will be reported in future papers.

11 Conclusion

We started with two notions,

1. data mining is the machine derivation of properties interesting to humans from the underlying mathematical structure of the stored data, and
2. the universe of real world objects is granulated (clustered),

and set forth to develop data models suitable for mining real world data.
Machine oriented data models, in which attribute values are encoded with machine semantics (knowledge), are introduced. The model effectively provides the machine with the information necessary for mining various forms of rules. Data mining in such models is reduced to set theoretical operations on granules, that is, to a machine calculus of granules: granular computing.

For relational theory, granules are equivalence classes; the computation is efficient and faster than the usual approaches. Applications are on the way; they will be reported soon. For clustered data, substantial research is still needed. Currently, the semantic relations on the attribute domains are supplied by humans (a concept hierarchy [4, 17] is a special case). Some automation in building such semantic relations is needed for large scale applications. We will report our exploration in future papers.

12 Appendix - Information Tables and Relations

The syntax of information tables is very similar to that of relations in relational databases. Entities are also represented by tuples of attribute values. However, the representation may not be faithful; namely, entities and tuples may not be in one-to-one correspondence.

An information table is a 4-tuple $(V, A, Dom, \rho)$, where

1. $V = \{u, v, \ldots\}$ is a set of entities.
2. $A$ is a set of attributes $\{A_1, A_2, \ldots, A_n\}$.
3. $dom(A_i)$ is the set of values of attribute $A_i$, and $Dom = dom(A_1) \times dom(A_2) \times \ldots \times dom(A_n)$.
4. $\rho: V \times A \to Dom$, called the description function, is a map such that $\rho(u, A_i)$ is in $dom(A_i)$ for all $u$ in $V$ and $A_i$ in $A$.

The description function induces a set of maps

$t = \rho(u, \cdot): A \to Dom$.

Each image forms a tuple:

$t = (\rho(u, A_1), \rho(u, A_2), \ldots, \rho(u, A_i), \ldots, \rho(u, A_n))$

Note that the tuple $t$ is associated with the object $u$, but not necessarily uniquely. In an information table, two distinct objects could have the same tuple representation, which is not permissible in relational databases.

A decision table is an information table $(V, A, Dom, \rho)$ in which the attribute set $A = C \cup D$ is a union of two non-empty sets, $C$ and $D$, of attributes. The elements in $C$ are called conditional attributes. The elements in $D$ are called decision attributes. Each row is a decision rule.

The notion of a relation in relational theory consists of

1. $V = \{x, y, \ldots\}$ is an implicit set of entities, which does not appear in the formal model.
2. $A$ is a set of attributes $\{A_1, A_2, \ldots, A_n\}$.
3. $dom(A_i)$ is the set of values of attribute $A_i$, and $Dom = dom(A_1) \cup dom(A_2) \cup \cdots \cup dom(A_n)$.
4. Implicitly, to each entity $u$ we associate a mapping $t_u : A \to Dom$, where $t_u(A_i) \in dom(A_i)$ for each $A_i \in A$.

A relation consists of the mappings $t_u : A \to Dom$. Informally, one can view a relation as a table consisting of rows of elements. Each row represents an entity uniquely.

References

1. R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," in Proceedings of ACM-SIGMOD International Conference on Management of Data, Washington, DC, June.
2. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in Proceedings of 20th VLDB Conference, Santiago, Chile.
3. S. Bairamian, Goal Search in Relational Databases, Thesis, California State University, Northridge.
4. Y. D. Cai, N. Cercone, and J. Han, "Attribute-oriented induction in relational databases," in Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
5. W. Chu and Q. Chen, "Neighborhood and associative query answering," Journal of Intelligent Information Systems, vol. 1.
6. C. J. Date, Introduction to Database Systems, 3rd and 6th editions, Addison-Wesley, Reading, Massachusetts, 1981.
7. T. Gaasterland, Generating Cooperative Answers in Deductive Databases, Dissertation, University of Maryland, College Park, Maryland.
8. T. Y. Lin and Eric Louie, "Finding Association Rules by Computing Bits," Data Mining and Knowledge Discovery: Theory, Tools, and Technology II (or29), April 2000, Orlando, Florida, USA.
9. T. Y. Lin, "Granular Computing of Binary Relations I: Data Mining and Neighborhood Systems," in Rough Sets and Knowledge Discovery, edited by Polkowski and Skowron, Physica-Verlag.
10. T. Y. Lin, "Granular Computing of Binary Relations II: Rough Set Representations and Belief Functions," in Rough Sets and Knowledge Discovery, edited by Polkowski and Skowron, Physica-Verlag.
11. T. Y. Lin, "Neighborhood Systems - A Qualitative Theory for Fuzzy and Rough Sets," in Advances in Machine Intelligence and Soft Computing, Volume IV, edited by Paul Wang.
12. T. Y. Lin, "Rough Set Theory in Very Large Databases," in Proceedings of Symposium on Modeling, Analysis and Simulation, IMACS Multi Conference (Computational Engineering in Systems Applications), Lille, France, July 9-12, Vol. 2 of 2, 1996.
13. T. Y. Lin, "Topological and Fuzzy Rough Sets," in Decision Support by Experience - Application of the Rough Sets Theory, edited by R. Slowinski, Kluwer Academic Publishers.
14. T. Y. Lin, "Neighborhood Systems and Approximation in Database and Knowledge Base Systems," in Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems, Poster Session, October 12-15, 1989.
15. T. Y. Lin, "Neighborhood Systems and Relational Database," in Proceedings of 1988 ACM Sixteenth Annual Computer Science Conference, February 23-25, 1988.
16. T. Y. Lin, "Topological Data Models and Approximate Retrieval and Reasoning," in Proceedings of 1989 ACM Seventeenth Annual Computer Science Conference, February 21-23, Louisville, Kentucky, 1989.
17. T. Y. Lin and M. Hadjimichael, "Non-Classificatory Generalization in Data Mining," in Proceedings of the Fourth International Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, November 6-8, 1996, Tokyo, Japan.
18. T. Y. Lin and Y. Y. Yao, "Mining Soft Rules Using Rough Sets and Neighborhoods," in Proceedings of Symposium on Modeling, Analysis and Simulation, CESA'96 IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, 1996, Vol. 2 of 2.
19. B. Michael and T. Y. Lin, "Neighborhoods, Rough Sets, and Query Relaxation," in Rough Sets and Data Mining: Analysis of Imprecise Data, edited by T. Y. Lin and N. Cercone, Kluwer Academic Publisher. (Final version of the paper presented in Workshop on Rough Sets and Database Mining, March 2.)
20. L. A. Zadeh, "Some Reflections on Soft Computing, Granular Computing and Their Roles in the Conception, Design and Utilization of Information/Intelligent Systems," in Granular Computing: Fuzzy Sets, Fuzzy Logic and Applications to Information/Intelligent Systems, edited by T. Y. Lin, Y. Y. Yao, and L. Zadeh, Physica-Verlag, to appear.
21. D. Maier, The Theory of Relational Databases, Computer Science Press, 1983 (6th printing 1988).
22. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic, Dordrecht.
23. W. Sierpinski and C. Krieger, General Topology, University of Toronto Press.
24. L. A. Zadeh, "Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic," Fuzzy Sets and Systems, 90.
25. Lotfi Zadeh, "The Key Roles of Information Granulation and Fuzzy Logic in Human Reasoning," in Proceedings of 1996 IEEE International Conference on Fuzzy Systems, September 8-11, 1996.
26. L. A. Zadeh, "Fuzzy Sets and Information Granularity," in Advances in Fuzzy Set Theory and Applications, edited by M. Gupta, R. Ragade, and R. Yager, North-Holland, Amsterdam, 3-18.
27. W. Ziarko, R. Golan, and D. Edwards, "An Application of DataLogic/R Knowledge Discovery Tool to Identify Strong Predictive Rules in Stock Market Data," in Proceedings of AAAI-93 Workshop on Knowledge Discovery in Databases, Washington, DC.

This article was processed using the LaTeX macro package with LLNCS style.

Restaurant owners   LOCATIONs    ...   Restaurant owners   Canonical Names
ID 1                West Wood    ...   ID 1                {ID 1, ID 2, ID 3}
ID 2                West Wood    ...   ID 2                {ID 1, ID 2, ID 3}
ID 3                West Wood    ...   ID 3                {ID 1, ID 2, ID 3}
ID 4                West LA      ...   ID 4                {ID 4, ID 5}
ID 5                West LA      ...   ID 5                {ID 4, ID 5}
ID 6                Brent Wood   ...   ID 6                {ID 6, ID 7, ID 8, ID 9}
ID 7                Brent Wood   ...   ID 7                {ID 6, ID 7, ID 8, ID 9}
ID 8                Brent Wood   ...   ID 8                {ID 6, ID 7, ID 8, ID 9}
ID 9                Brent Wood   ...   ID 9                {ID 6, ID 7, ID 8, ID 9}

Table 1. Two Tables of Single Column Information Tables

Restaurant owners   LOCATIONs    ...   Restaurant owners   Canonical Names
ID 1                West Wood    ...   ID 1                B( )
ID 2                West Wood    ...   ID 2                B( )
ID 3                West Wood    ...   ID 3                B( )
ID 4                West LA      ...   ID 4                B( )
ID 5                West LA      ...   ID 5                B( )
ID 6                Brent Wood   ...   ID 6                B( )
ID 7                Brent Wood   ...   ID 7                B( )
ID 8                Brent Wood   ...   ID 8                B( )
ID 9                Brent Wood   ...   ID 9                B( )

Table 2. Bit Representation of a Single Column Information Table

Restaurant Owner Groups   Natural projection   Canonical names    Naming map   Meaningful names
                          (partition)          (encoded labels)                (attribute values)
ID 1, ID 2, ID 3          →                    B( )               →            West Wood
ID 4, ID 5                →                    B( )               →            West LA
ID 6, ID 7, ID 8, ID 9    →                    B( )               →            Brent Wood

Table 3. The First and Second Maps, a factorization of A
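The granulation shown in Tables 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `granulate` and `bit_pattern` are hypothetical, and the convention that the leftmost bit stands for ID 1 is an assumption of this sketch.

```python
# LOCATION column of Table 1, keyed by entity number (ID 1 .. ID 9).
LOCATION = {
    1: "West Wood", 2: "West Wood", 3: "West Wood",
    4: "West LA", 5: "West LA",
    6: "Brent Wood", 7: "Brent Wood", 8: "Brent Wood", 9: "Brent Wood",
}


def granulate(column):
    """Group entities by attribute value; each group is one granule
    (an equivalence class of the relation 'has the same value')."""
    granules = {}
    for entity in sorted(column):
        granules.setdefault(column[entity], []).append(entity)
    return granules


def bit_pattern(granule, universe_size):
    """Encode a granule as a bit string.

    Assumption of this sketch: the k-th character from the left is '1'
    exactly when entity k belongs to the granule.
    """
    members = set(granule)
    return "".join(
        "1" if k in members else "0" for k in range(1, universe_size + 1)
    )


granules = granulate(LOCATION)
bits = {name: bit_pattern(g, len(LOCATION)) for name, g in granules.items()}

for name, members in granules.items():
    # One line per granule: list form, bit form, meaningful name.
    print(members, bits[name], name)
```

Under these assumptions the list form of each granule matches the canonical names of Table 5 (for example `[4, 5]` for West LA), and the same granule in bit form is `000110000`.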

Canonical names    Meaningful names
(encoded labels)   (attribute values)
B( )               West Wood
B( )               West LA
B( )               Brent Wood

Table 4. Single Column Machine Oriented Relational Bit Model; condensed form

Canonical names    Meaningful names
(encoded labels)   (attribute values)
{1, 2, 3}          West Wood
{4, 5}             West LA
{6, 7, 8, 9}       Brent Wood

Table 5. Single Column Machine Oriented Relational List Model

RESTAURANT OWNER   TYPE       LOCATION     PRICE
ID 1               American   West Wood    inexpensive
ID 2               American   West Wood    inexpensive
ID 3               Chinese    West Wood    moderate
ID 4               Japanese   West LA      moderate
ID 5               Japanese   West LA      moderate
ID 6               French     Brent Wood   expensive
ID 7               French     Brent Wood   expensive
ID 8               French     Brent Wood   expensive
ID 9               French     Brent Wood   expensive

Table 6. A Relational Restaurant Database
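Once every attribute value of Table 6 is replaced by its granule, checking a candidate association rule reduces to intersecting granules. The sketch below is one possible illustration, not the paper's algorithm: the function names are hypothetical, Python integers are used as bit vectors (bit k-1 set for ID k, an assumption of this sketch), and "support" and "confidence" are used in their usual association-rule sense.

```python
# Rows of Table 6 as (TYPE, LOCATION, PRICE) for ID 1 .. ID 9.
ROWS = [
    ("American", "West Wood", "inexpensive"),
    ("American", "West Wood", "inexpensive"),
    ("Chinese", "West Wood", "moderate"),
    ("Japanese", "West LA", "moderate"),
    ("Japanese", "West LA", "moderate"),
    ("French", "Brent Wood", "expensive"),
    ("French", "Brent Wood", "expensive"),
    ("French", "Brent Wood", "expensive"),
    ("French", "Brent Wood", "expensive"),
]
ATTRIBUTES = ("TYPE", "LOCATION", "PRICE")


def build_granules(rows):
    """Map each (attribute, value) pair to an integer bitmask of its rows."""
    granules = {}
    for index, row in enumerate(rows):
        for attribute, value in zip(ATTRIBUTES, row):
            key = (attribute, value)
            granules[key] = granules.get(key, 0) | (1 << index)
    return granules


def support(*masks):
    """Count the entities lying in the intersection of all given granules."""
    common = masks[0]
    for mask in masks[1:]:
        common &= mask  # set intersection as a bitwise AND
    return bin(common).count("1")


def confidence(antecedent, consequent):
    """Fraction of the antecedent granule that also lies in the consequent."""
    return support(antecedent, consequent) / support(antecedent)


G = build_granules(ROWS)
french = G[("TYPE", "French")]
brent_wood = G[("LOCATION", "Brent Wood")]
west_wood = G[("LOCATION", "West Wood")]
inexpensive = G[("PRICE", "inexpensive")]

print(support(french, brent_wood))         # 4
print(confidence(french, brent_wood))      # 1.0
print(confidence(west_wood, inexpensive))  # 2 of the 3 West Wood rows
```

In this sketch no pass over the tuples is needed after the granules are built; each candidate rule costs one AND and one bit count per granule involved.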

RESTAURANT OWNER   CNAME(TYPE)   CNAME(LOCATION)   CNAME(PRICE)
ID 1               TYPE( )       LOCATION( )       PRICE( )
ID 2               TYPE( )       LOCATION( )       PRICE( )
ID 3               TYPE( )       LOCATION( )       PRICE( )
ID 4               TYPE( )       LOCATION( )       PRICE( )
ID 5               TYPE( )       LOCATION( )       PRICE( )
ID 6               TYPE( )       LOCATION( )       PRICE( )
ID 7               TYPE( )       LOCATION( )       PRICE( )
ID 8               TYPE( )       LOCATION( )       PRICE( )
ID 9               TYPE( )       LOCATION( )       PRICE( )

Table 7. Bit Representations of Relational Restaurant Database

Canonical names    Meaningful names
(encoded labels)   (attribute values)
TYPE( )            American
TYPE( )            Chinese
TYPE( )            Japanese
TYPE( )            French
LOCATION( )        West Wood
LOCATION( )        West LA
LOCATION( )        Brent Wood
PRICE( )           inexpensive
PRICE( )           moderate
PRICE( )           expensive

Table 8. Multiple Attributes Machine Oriented Relational Model

Objects   INVESTORS
ID 1      platinum
ID 2      platinum
ID 3      bronze
ID 4      silver
ID 5      silver
ID 6      gold
ID 7      gold
ID 8      gold
ID 9      gold

Table 9. Single Column Granular Table; entries are semantically interrelated; see next table

INVESTORS   INVESTORS
platinum    silver
platinum    gold
platinum    bronze
gold        gold
gold        platinum
silver      silver
silver      platinum
bronze      bronze
bronze      platinum

Table 10. Semantic Relation BC

Restaurant Owner Groups   Binary granulation   Canonical names         Meaningful names
                                               (encoded labels)        (attribute values)
ID 1, ID 2                →                    {3, 4, 5, 6, 7, 8, 9}   platinum
ID 3                      →                    {1, 2, 3}               bronze
ID 4, ID 5                →                    {4, 5}                  silver
ID 6, ID 7, ID 8, ID 9    →                    {6, 7, 8, 9}            gold

Table 11. Machine Oriented List Granular Model; canonical names spell out the semantic relation

Restaurant Owner Groups   Binary granulation   Canonical names    Meaningful names
                                               (encoded labels)   (attribute values)
ID 1, ID 2                →                    B( )               platinum
ID 3                      →                    B( )               bronze
ID 4, ID 5                →                    B( )               silver
ID 6, ID 7, ID 8, ID 9    →                    B( )               gold

Table 12. Machine Oriented Bit Granular Model
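The step from the list granules of Table 11 to the bit granules of Table 12 can be illustrated with a short Python sketch. It takes the canonical names of Table 11 as given input; the function names and the bit ordering (leftmost bit for ID 1) are assumptions of this sketch. It also shows that these granules, unlike the equivalence classes of the relational case, may overlap.

```python
from itertools import combinations

# List granules copied from the canonical names column of Table 11.
GRANULES = {
    "platinum": [3, 4, 5, 6, 7, 8, 9],
    "bronze": [1, 2, 3],
    "silver": [4, 5],
    "gold": [6, 7, 8, 9],
}
UNIVERSE_SIZE = 9  # entities ID 1 .. ID 9


def to_bits(granule, size=UNIVERSE_SIZE):
    """Encode a list granule as a bit string (leftmost bit = ID 1, assumed)."""
    members = set(granule)
    return "".join("1" if k in members else "0" for k in range(1, size + 1))


def overlapping_pairs(granules):
    """Return the pairs of labels whose granules share at least one entity."""
    return [
        (a, b)
        for a, b in combinations(granules, 2)
        if set(granules[a]) & set(granules[b])
    ]


bit_model = {label: to_bits(g) for label, g in GRANULES.items()}
overlaps = overlapping_pairs(GRANULES)

for label, pattern in bit_model.items():
    print(pattern, label)
print(overlaps)
```

With the Table 11 data, the platinum granule overlaps each of the other three, so the granules form a covering of the universe rather than a partition.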

RESTAURANT OWNER   TYPE       INVESTORS
ID 1               American   platinum
ID 2               American   platinum
ID 3               Chinese    bronze
ID 4               Japanese   silver
ID 5               Japanese   silver
ID 6               French     gold
ID 7               French     gold
ID 8               French     gold
ID 9               French     gold

Table 13. A Granular Restaurant Database; the INVESTORS attribute has a semantic relation, the TYPE attribute has no semantic relation

Restaurant Owner Groups   Canonical names    Meaningful names
                          (encoded labels)   (attribute values)
ID 1, ID 2                INVESTOR( )        platinum
ID 3                      INVESTOR( )        bronze
ID 4, ID 5                INVESTOR( )        silver
ID 6, ID 7, ID 8, ID 9    INVESTOR( )        gold
ID 1, ID 2                TYPE( )            American
ID 3                      TYPE( )            Chinese
ID 4, ID 5                TYPE( )            Japanese
ID 6, ID 7, ID 8, ID 9    TYPE( )            French

Table 14. Machine Oriented Granular Model
