Modeling the Real World for Data Mining: Granular Computing Approach


T. Y. Lin
Department of Mathematics and Computer Science, San Jose State University, San Jose, California 95192-0103
and Berkeley Initiative in Soft Computing, University of California, Berkeley, California 94720
E-mail: tylin@cs.sjsu.edu, tylin@cs.berkeley.edu

Abstract

To each object in an object space, a (possibly empty) family of granules (crisp/fuzzy subsets) of the data space is assigned; we call this assignment a granulation. It is a mild generalization of the neighborhood system of (pre-)topological spaces. If each family has at most one granule, the granulation defines a binary relation. Interestingly, if the granulation is defined by general relations, the data space is a (world) model in first order logic. A knowledge representation of such a world model, that is, an assignment of a uniquely meaningful name (attribute value) to each neighborhood, is called a granular data model. If the granulations are given by equivalence relations, the model is the classical relational model. Intuitively, it is a real world data model; note that granules may overlap, so attribute values may not be independent. In other words, attribute domains are more than Cantor sets; intuitively, they are real world models (sets). Depending on the structures and representations, the model can be useful in fuzzy logic or data mining. The focus of this paper is on data mining; in fact, semantically rich rules are mined. Its performance is measured; it is twenty-some times faster than traditional Apriori.

1. Introduction

Relational theory is designed to model the real world over a long duration. To accommodate all instances, it assumes that the universe and the attribute domains are all Cantor sets. In other words, the interactions among entities are forgotten in relational modeling. To have a better approximation, we need an appropriate model. In logic, a real world is modeled by a Cantor set (of entities) with a relational structure.
As a first step, we consider binary relational structures. Interestingly, this agrees with Zadeh's notion of information granulation [5]. In applications, we reach two seemingly unrelated topics: data mining and fuzzy control. In this paper, however, we focus on data mining. Some impressive experimental results have been achieved: in the classical case we are 24 times faster than the traditional Apriori.

2 Granulations and Neighborhood Systems

In [5], Zadeh defines (rephrased) information granulation as a collection of granules, with a granule being a clump of objects (points) which are drawn towards an object. In other words, each object is associated with a family of clumps. This is essentially the notion of Frechet (V) space [?] or neighborhood systems [9]. In this paper a fuzzy set is uniquely defined by its membership function [15]. It is a w-sofset if we use the language of [7].

A crisp/fuzzy neighborhood system (F/NS) is defined as follows: to each object $p$ we associate an (empty, finite, or infinite) family of crisp/fuzzy subsets, called clumps. The mathematical system defined by these families is called a crisp/fuzzy neighborhood system, or simply a neighborhood system, and the clumps associated to $p$ are called fundamental neighborhoods of $p$. Note that if there is at most one fundamental neighborhood at each point, then the neighborhood system is defined by a binary relation; see Section 4.
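The closing remark of this section can be made concrete with a small sketch. It assumes a crisp neighborhood system stored as a dictionary from objects to lists of clumps; the type alias `NeighborhoodSystem` and the function `as_binary_relation` are illustrative names, not notation from the paper.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Set, Tuple

# Hypothetical encoding: each object maps to its (possibly empty) list of
# fundamental neighborhoods, each neighborhood being a crisp set of objects.
NeighborhoodSystem = Dict[Hashable, List[FrozenSet[Hashable]]]


def as_binary_relation(
    ns: NeighborhoodSystem,
) -> Optional[Set[Tuple[Hashable, Hashable]]]:
    """Return the binary relation {(p, u)} defined by ns.

    Returns None when some object has more than one fundamental
    neighborhood, in which case ns is not given by a single relation.
    """
    relation: Set[Tuple[Hashable, Hashable]] = set()
    for p, clumps in ns.items():
        if len(clumps) > 1:
            return None
        for clump in clumps:
            relation.update((p, u) for u in clump)
    return relation


# At most one neighborhood per object: a binary relation is recovered.
ns_single: NeighborhoodSystem = {
    "a": [frozenset({"a", "b"})],
    "b": [frozenset({"b"})],
    "c": [],  # an empty family is allowed
}

# Object "a" has two neighborhoods: no single binary relation describes it.
ns_multi: NeighborhoodSystem = {
    "a": [frozenset({"a"}), frozenset({"a", "b"})],
    "b": [],
}
```

An object with an empty family simply contributes no pairs to the relation.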

3 Representations of neighborhood systems

[Table 1. Weighted sums (veristic constraints).]

3.1 Multiple valued representations

We will illustrate the idea by examples. Let $U$ be the universe and let $FN_1, \ldots, FN_n$ be a family of fuzzy sets of $U$ that covers $U$. The family is a fuzzy neighborhood system, and each member of the cover is a fuzzy neighborhood of any point in it. The members that contain a point $p$ form the fuzzy neighborhood system at $p$. The association that assigns to each object the set of names of its neighborhoods is a multiple valued representation of the universe.

3.2 Fuzzy representations

Since the neighborhood system is fuzzy, we take the weighted average of the multiple values. Let us consider formal expressions of the form $r_1 \cdot NAME(FN_1) + \cdots + r_n \cdot NAME(FN_n)$, where $r_1, \ldots, r_n$ are real numbers. Mathematically, the collection of all such expressions is a vector space. Each vector is called a formal word. Let $w_i(p)$ represent the grade of $p$ at $FN_i$, i.e., $w_i(p) = FN_i(p)$. We call $w_i(p)$ the weight of $p$ in $FN_i$. Based on the weights, we form a formal word representation $W$ defined by $W(p) = w_1(p) \cdot NAME(FN_1) + \cdots + w_n(p) \cdot NAME(FN_n)$. The expression $W(p)$ is called the formal word representation of $p$; it is Zadeh's veristic constraint [?]. Table 1 consists of all such formal expressions; it is a vector-valued representation of the universe. Each expression represents a certain weighted sum of attribute values.

4 Binary granulation and Partitions

A partition is a collection of pair-wise disjoint subsets whose union is $U$. This is the simplest granulation. Its algebraic counterpart is an equivalence relation, so a natural generalization is a binary relation. We should comment that an obvious geometric generalization of a partition is a covering. Unfortunately, a covering is not the geometric equivalent of a binary relation. The equivalent one is the more elaborate notion called the binary neighborhood system, which is the subject covered next. Intuitively, it is a cover with centers, which is more than a simple cover. The notion of the center plays an essential role in this paper.
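The difference between a plain covering and a "cover with centers" can be seen in a small sketch. It assumes crisp sets and a dictionary that maps each center to its neighborhood; the helper names (`is_covering`, `is_partition`, `underlying_cover`, `relation_of`) and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple


def is_covering(family: Iterable[Iterable[Hashable]], universe: Iterable[Hashable]) -> bool:
    """True if the union of the family is the whole universe."""
    covered: Set[Hashable] = set()
    for block in family:
        covered |= set(block)
    return covered == set(universe)


def is_partition(family: Iterable[Iterable[Hashable]], universe: Iterable[Hashable]) -> bool:
    """True if the family is pairwise disjoint and its union is the universe."""
    blocks = [set(b) for b in family]
    union: Set[Hashable] = set().union(*blocks)
    # Sizes add up exactly when no element lies in two blocks.
    return union == set(universe) and sum(len(b) for b in blocks) == len(union)


def underlying_cover(centered: Dict[Hashable, FrozenSet[Hashable]]) -> Set[FrozenSet[Hashable]]:
    """Forget the centers and keep only the family of neighborhoods."""
    return set(centered.values())


def relation_of(centered: Dict[Hashable, FrozenSet[Hashable]]) -> Set[Tuple[Hashable, Hashable]]:
    """Binary relation {(center, member)} determined by a cover with centers."""
    return {(p, u) for p, nbhd in centered.items() for u in nbhd}


# Two different assignments of centers over the same cover {{1, 2}, {2, 3}}.
centered_a = {1: frozenset({1, 2}), 2: frozenset({2, 3}), 3: frozenset({2, 3})}
centered_b = {1: frozenset({1, 2}), 2: frozenset({1, 2}), 3: frozenset({2, 3})}
```

Both assignments forget down to the same covering, yet they determine different binary relations, which is why the covering alone cannot stand in for the relation.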
4.1 Binary granulation, relations and neighborhood systems

A crisp/fuzzy binary relation (BR or FBR) $B$ is a crisp/fuzzy subset of $V \times U$ whose membership function is $M_B : V \times U \to M$, where $M$ is the membership space, that is, $M$ is either the unit interval $[0, 1]$ or the binary values $\{0, 1\}$. For each $p \in V$, it defines a crisp/fuzzy set $B_p$, called the binary (or elementary) neighborhood of $p$, whose membership function $M_p : U \to M$ is defined by $M_p(u) = M_B(p, u)$. The collection of all crisp/fuzzy sets on $U$ is denoted by FZ(U). The map $B : V \to FZ(U)$, $p \mapsto B_p$, is called a crisp/fuzzy binary granulation, and the set $\{B_p \mid p \in V\}$ a crisp/fuzzy binary neighborhood system.
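As a minimal sketch of this definition, the binary granulation can be computed by grouping a relation by its first component. The function names and the dictionary encodings below are assumptions made for illustration: a crisp relation is a set of pairs, and a fuzzy relation is a dictionary from pairs to grades in [0, 1].

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def binary_granulation(
    relation: Iterable[Tuple[Hashable, Hashable]],
    objects: Iterable[Hashable],
) -> Dict[Hashable, Set[Hashable]]:
    """Map each object p to its crisp binary neighborhood {u : (p, u) in relation}."""
    neighborhoods: Dict[Hashable, Set[Hashable]] = {p: set() for p in objects}
    for p, u in relation:
        neighborhoods.setdefault(p, set()).add(u)
    return neighborhoods


def fuzzy_binary_granulation(
    membership: Dict[Tuple[Hashable, Hashable], float],
    objects: Iterable[Hashable],
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Map each object p to the fuzzy set u -> grade of (p, u).

    Pairs with grade 0 are omitted, so a missing key means grade 0.
    """
    neighborhoods: Dict[Hashable, Dict[Hashable, float]] = {p: {} for p in objects}
    for (p, u), grade in membership.items():
        if grade > 0.0:
            neighborhoods.setdefault(p, {})[u] = grade
    return neighborhoods


# Toy data: three objects, with object 3 having an empty neighborhood.
crisp = {(1, "x"), (1, "y"), (2, "y")}
fuzzy = {(1, "x"): 0.8, (1, "y"): 0.0, (2, "y"): 1.0}
```

Objects that never occur as a first component keep an empty neighborhood, matching the "possibly empty" case allowed by the definition.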

Proposition. Binary relations, binary granulations, and binary neighborhood systems are equivalent to each other and will be used interchangeably; see [5].

A subset is a definable set if it is a union of equivalence classes. So a subset $X$ is called a definable neighborhood if $X$ is a union of elementary neighborhoods. If the definable neighborhood contains the elementary neighborhood of $p$, it is a definable neighborhood of $p$.

4.2 Induced partitions

The binary granulation is a map $B : V \to FZ(U)$; it induces a partition (or equivalence relation), denoted by $E_B$, on $V$ by the collection of complete inverse images $B^{-1}(B_p)$.

5 Real world Model and Data Mining

5.1 Granular Data Model: Real world relational theory

A crisp/fuzzy granular data model consists of a 3-tuple $(V, U, \mathcal{B})$, where $V$ is called the object space, $U$ is the data space ($V$ and $U$ could be the same set), and $\mathcal{B}$ is a finite family of crisp/fuzzy binary granulations (neighborhood systems or binary relations). If $V = U$ and $\mathcal{B}$, which will then be denoted by $E$, is a finite family of equivalence relations, then $(U, E)$ is called a rough data model; it was called a knowledge base in [10] [5] [6] [4]. We will not use that term here, since it conflicts with standard usage. The notion of knowledge representation is essentially naming the granular data model, that is, assigning meaningful names to the binary relations (attributes) and their binary neighborhoods (attribute values) [?].

We will illustrate the idea by examples. In the case of the rough data model (Table 6), its representation is an ordinary relation (Table 2). Suppose there are additional semantics, such as conflicts of interest among agents, represented in the second column of Table ??; this induces the equivalence relation in the third column.

[Table: equivalence classes, elementary granules (encoded labels), and attribute values (meaningful names) for attribute S#, with values TEN, TWENTY, THIRTY, FORTY, EIGHTY, NINTY. Names appearing in the table: Smith, Jones, Blake, Clark, Adams, Peterson, Ewing, Johnson, Pike, Meyers.]

1. In the rough data model, the universe is partitioned into equivalence classes. So we consider the composition $p \mapsto [p] \mapsto NAME([p])$, where $[p]$ is the equivalence class of $p$.
2. In the granular data model, we consider the composition $p \mapsto B_p \mapsto NAME(B_p)$, which maps each object to its unique binary neighborhood and then to a meaningful name.

5.2 Data mining = Granular Computing

The granular data model uses granules as its attribute values, so any logical formula is translated into a set theoretical formula of granules. However, we should note that attribute values are semantically related, so the elementary granules of a column (a binary relation) may overlap. So in processing any logical formula based on attribute values, it is important to check continuity (namely, whether the formula respects the semantics). Such checking is implicitly included in the computing of granules.

[Table: binary neighborhoods, their centers, and meaningful names for attribute S#; centers 10, 20, 30, 40, 80, 90 with names TEN, TWENTY, THIRTY, FORTY, EIGHTY, NINTY.]

[Table: binary neighborhoods on V and their centers (induced partition).]

We collect some generalized standard patterns [2]. Let $A_i$ and $A_j$ be two attributes of a relation-with-additional-semantics. Let $c$ and $d$ be two values of $A_i$ and $A_j$ respectively. Let $C$ and $D$ be the respective elementary granules. It is clear that $c$ = NAME($C$) and $d$ = NAME($D$). Let Card($X$) be the cardinal number of a set $X$.

1. Continuous decision rule: A formula $c \to d$ is a continuous decision rule if $C \subseteq D$ continuously.
2. Continuous universal decision rule: A formula $A_i \to A_j$ is a continuous universal decision rule (extensional functional dependence) iff $\forall c \; \exists d$ such that $c \to d$ is a continuous decision rule.
3. Robust continuous decision rule: A formula $c \to d$ is a robust continuous decision rule if $C \subseteq D$ and Card($C$) $\geq$ threshold.
4. Soft continuous decision rule [8]: A formula $c \to d$ is a soft continuous decision rule (strong rule) if $C$ is softly included in $D$.
5. Continuous association rule: A pair $(c, d)$ is an association rule if Card($C \cap D$) $\geq$ threshold.

We will illustrate the continuous decision rules only; we skip the rest. The formula $c \to d$ is a continuous decision rule if, whenever an attribute value in NEIGH($c$) appears in a tuple, an attribute value in NEIGH($d$) also appears.
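The patterns above reduce to set operations on granules, which the following sketch illustrates. It assumes each granule is stored as a bit vector (a Python integer whose bit r is set when row r belongs to the granule) and that the semantic neighborhoods have already been folded into the granules; all function names are hypothetical, and the soft rule is omitted because soft inclusion is not defined in this text.

```python
from typing import Iterable


def granule_mask(rows: Iterable[int]) -> int:
    """Encode a granule (a set of row numbers) as an integer bit vector."""
    mask = 0
    for r in rows:
        mask |= 1 << r
    return mask


def card(mask: int) -> int:
    """Cardinality of a granule: the number of set bits."""
    return bin(mask).count("1")


def is_decision_rule(c: int, d: int) -> bool:
    """c -> d holds when granule C is included in granule D."""
    return (c & d) == c


def is_robust_rule(c: int, d: int, threshold: int) -> bool:
    """Inclusion C in D together with Card(C) >= threshold."""
    return is_decision_rule(c, d) and card(c) >= threshold


def is_association_rule(c: int, d: int, threshold: int) -> bool:
    """(c, d) is an association rule when Card(C intersect D) >= threshold."""
    return card(c & d) >= threshold


# Toy granules over rows 0..5 (illustrative data, not taken from the paper).
C = granule_mask([0, 1, 2])
D = granule_mask([0, 1, 2, 5])
```

Each check is a bitwise AND followed by a comparison or a bit count, so no scan over the tuples of the relation is needed once the granules are built.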

To verify a rule of the form "if $c$ then $d$", one needs to scan through the two columns in Table ?? and check whether NEIGH($c$) is continuously associated with NEIGH($d$). In the machine oriented model, the same fact can be checked by the inclusion of two elementary granules, namely $C \subseteq D$.

5.3 Some performance data

We collect some results on the performance of finding association rules. The relation consists of 128K (= 131072) rows and 16 columns; the required support is 8192, and the memory is 10 megabytes; see Table 7 [3]. The programs for Apriori, AprioriTid and AprioriHybrid are our honest implementations of the algorithms in [?] [1]. In the implementation, we use a buffering scheme to speed up reads and writes for all algorithms.

6 Conclusions

In this conclusion we reflect on our overall approach. In several of our papers, we have literally taken Zadeh's intuitive description of clumps as a formal mathematical notion of granulation. It is essentially a mild generalization of binary relations and neighborhood systems in (pre-)topological spaces [12, 9, ?, 5]. By giving a meaningful name to each granule, we have a representation theory. It extends the classical relational model based on Cantor sets to a real world data model based on real world set theory (neighborhood system space). It is worth noting here that in the crisp world the representation is locally multi-valued; in the fuzzy world we can use weights to combine these names linearly (a weighted average) and form formal words. This turns it into a single-valued representation, namely a formal word table; see Section 3.2. Using Zadeh's terminology, such formal word representations are veristic constraints [?]. A formal word table is a generalization of an information table. So by applying the table processing techniques of rough set methodology to formal word tables, we expect some useful applications to fuzzy logic control. Our study suggests that granular computing is a reasonable notion.
At this point its essential ingredients are (1) a representation theory of granular structure, which will be useful in data mining, and (2) a formal word representation of input/output spaces, which is potentially useful to fuzzy logic control. In oversimplified terms, the two applications are computing with words. Finally, we would like to say a few words on the computational performance: in classical data mining, granular computing is faster than Apriori [3] because the database scans are replaced by bit operations. In this paper we extend the use of granular computing to semantically richer models. Such extra semantics can be used to analyze unexpected rules [11]. Granular computing is fast; it seems a promising approach to data mining.

[Table 7. Columns: length of combination; number of candidates; number of association rules; Granule (full computation); Granule (partial); Apriori; AprioriHybrid; AprioriTid.]

References

[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[2] T. Y. Lin, Data Mining and Machine Oriented Modeling: A Granular Computing Approach. Journal of Applied Intelligence, Kluwer, Vol. 13, No. 2, September/October 2000, pp. 113-124.
[3] Eric Louie and T. Y. Lin, Finding Association Rules using Fast Bit Computation: Machine-Oriented Modeling. In: Proceedings of the 12th International Symposium, ISMIS 2000, Charlotte, North Carolina, Oct 11-14, 2000. Lecture Notes in AI 1932, 486-494.
[4] T. Y. Lin, Granular Computing: Fuzzy Logic and Rough Sets. In: Computing with Words in Information/Intelligent Systems, L. A. Zadeh and J. Kacprzyk (eds), Springer-Verlag, 1999, 183-200.
[5] T. Y. Lin, Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems. In: Rough Sets In Knowledge Discovery, A. Skowron and L. Polkowski (eds), Springer-Verlag, 1998, 107-121.
[6] T. Y. Lin, Granular Computing on Binary Relations II: Rough Set Representations and Belief Functions. In: Rough Sets In Knowledge Discovery, A. Skowron and L. Polkowski (eds), Springer-Verlag, 1998, 121-140.
[7] T. Y. Lin, A Set Theory for Soft Computing. In: Proceedings of 1996 IEEE International Conference on Fuzzy Systems, New Orleans, Louisiana, September 8-11, 1996, 1140-1146.
[8] T. Y. Lin and Y. Y. Yao, Mining Soft Rules Using Rough Sets and Neighborhoods. In: Symposium on Modeling, Analysis and Simulation, IMACS Multiconference (Computational Engineering in Systems Applications), Lille, France, July 9-12, 1996, Vol. 2 of 2, 1095-1100.
[9] T. Y. Lin, Neighborhood Systems and Relational Database. In: Proceedings of 1988 ACM Sixteenth Annual Computer Science Conference, February 23-25, 1988, 725.
[10] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, 1991.
[11] Balaji Padmanabhan and Alexander Tuzhilin, Finding Unexpected Patterns in Data. In: Data Mining and Granular Computing, T. Y. Lin, Y. Y. Yao and L. Zadeh (eds), Physica-Verlag, to appear.
[12] W. Sierpinski and C. Krieger, General Topology. University of Toronto Press, 1956.
[13] Lotfi Zadeh, The Key Roles of Information Granulation and Fuzzy Logic in Human Reasoning. In: 1996 IEEE International Conference on Fuzzy Systems, September 8-11, 1996, 1.
[14] W. Ziarko, Variable Precision Rough Set Model. Journal of Computer and System Sciences, Vol. 46, No. 1, February 1993, Academic Press, pp. 38-59.
[15] H. Zimmermann, Fuzzy Set Theory and its Applications, Second Ed. Kluwer Academic Publishers, 1991.