SIMILARITY MEASURES FOR MULTIVALUED ATTRIBUTES FOR DATABASE CLUSTERING


TAEWAN RYU AND CHRISTOPH F. EICK
Department of Computer Science, University of Houston, Houston, Texas
{twryu, ...}

ABSTRACT: This paper introduces an approach to cope with the representational inappropriateness of the traditional flat file format for data sets from databases, specifically in database clustering. After analyzing the problems of the traditional flat file format for representing related information, a better representation scheme called the extended data set, which allows attributes of an object to have multiple values, is introduced, and it is demonstrated how this representation scheme can represent structural information in databases for clustering. A unified similarity measure framework for mixed types of multivalued and single-valued attributes is proposed. A query discovery system, MASSON, which takes each cluster as its input, is used to discover a set of queries that represent discriminant characteristic knowledge for each cluster.

INTRODUCTION

Many data analysis and data mining tools, such as clustering tools, inductive learning tools, and statistical analysis tools, assume that the data sets to be analyzed are represented as a single flat file (or table) in which an object is characterized by attributes that have a single value.

Person relation (ssn, name, age, sex):
  Johny, 43, M
  Andy, 2, F
  Post, 67, M
  Jenny, 35, F

Purchase relation (ssn, location, ptype, amount, date), with six purchase tuples whose locations are Warehouse, Grocery, Mall, Mall, Grocery, Mall; ptype (payment type): 1 for cash, 2 for credit, and 3 for check.

Joined result:
  name   age  sex  ptype  amount  location
  Johny  43   M    1      400     Mall
  Johny  43   M    2      70      Grocery
  Johny  43   M    3      200     Warehouse
  Andy   2    F    2      300     Mall
  Andy   2    F    3      100     Grocery
  Post   67   M    1      130     Mall
  Jenny  35   F    null   null    null

Figure 1.1: (a) an example of a personal relational database; the cardinality ratio between Person and Purchase is 1:n. (b) the table obtained by joining Person and Purchase.

Recently, many of these data analysis approaches are being applied to data sets that have been extracted from databases.
However, a database may consist of several related data sets (e.g., relations in the relational model), and the cardinality ratio of the relationships between data sets in such a database is frequently 1:n or n:m, which may cause significant problems when data that have been extracted from a database have to be converted into a flat file in order to apply the above-mentioned tools. The flat file format is not appropriate for representing related information that is commonly found in databases. (In this paper, we specifically focus on data sets from relational databases, although our approach can easily be extended to the object-oriented database model.) For example, suppose we have a relational database as depicted in Figure 1.1(a) that consists of Person and Purchase relations storing information about persons' purchases, and we want to categorize the persons that occur in the database into several groups that have similar characteristics. The attributes found in the Person relation alone are obviously not sufficient to achieve this goal, because many important characteristics of persons are found in other, related relations such as the Purchase relation, which stores each person's shopping history. This raises the question of how the two tables can be combined into a single flat file so that traditional clustering and/or machine learning algorithms can be applied to it. Although some systems (Thompson91, Ribeiro95) attempt to discover knowledge directly from structured domains, the most straightforward approach for generating a single flat file seems to be to join the related tables (Quinlan93). Figure 1.1(b) depicts the result of the natural join operation on the two relations in Figure 1.1(a), using ssn as the join attribute. The object Andy from the Person relation is represented by two different tuples in the joined table of Figure 1.1(b). The main problem with this representation is that many clustering algorithms or machine learning tools would consider each tuple a different object; that is, they would interpret the table as containing 7 person objects rather than 4 unique person objects. This representational discrepancy between a data set from a structured database and a data set in the traditional flat file format assumed by many data analysis approaches seems to have been overlooked. This paper proposes a knowledge discovery and data mining framework to deal with this limitation of the traditional flat file representation.
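To make the duplication concrete, the join of Figure 1.1 can be sketched in a few lines of Python. This is only a sketch: the ssn key values 1-4 are made up here (the paper does not give them), and the ptype/amount values follow the bags listed later in Figure 2.1.

```python
# Person tuples keyed by an illustrative ssn (the real keys are not given).
person = {
    1: ("Johny", 43, "M"),
    2: ("Andy", 2, "F"),
    3: ("Post", 67, "M"),
    4: ("Jenny", 35, "F"),
}
# Purchase tuples: (ssn, location, ptype, amount); ptype: 1 cash, 2 credit, 3 check.
purchase = [
    (1, "Mall", 1, 400),
    (1, "Grocery", 2, 70),
    (1, "Warehouse", 3, 200),
    (2, "Mall", 2, 300),
    (2, "Grocery", 3, 100),
    (3, "Mall", 1, 130),
]

# Join on ssn; like Figure 1.1(b), persons without purchases keep a null
# row, so this is effectively a left outer join rather than a strict
# natural join.
joined = []
for ssn, (name, age, sex) in person.items():
    rows = [p for p in purchase if p[0] == ssn]
    if rows:
        for _, location, ptype, amount in rows:
            joined.append((name, age, sex, ptype, amount, location))
    else:
        joined.append((name, age, sex, None, None, None))

print(len(joined))                          # 7 tuples in the joined table
print(len({row[0] for row in joined}))      # but only 4 unique persons
```

Running this reproduces the mismatch described above: a flat-file consumer sees 7 objects where the database contains 4.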
We specifically focus on the problems of structured database clustering and of discovering sets of queries that describe the characteristics of the objects in each cluster.

EXTENDED DATA SETS

The example in the previous section showed that a better representation scheme is needed to represent information for objects that are interrelated with other objects. One simple approach to cope with this problem would be to group all the related objects into a single object by applying aggregate operations (e.g., average) to replace the related values by a single value for the object. The problem with this approach is that the user has to make critical decisions (e.g., which aggregate function to use) beforehand; moreover, by applying aggregate functions, valuable information is frequently lost (e.g., how many purchases a person has made, or what the maximum purchase of a person was). Tversky (1977) gives more examples illustrating that data analysis techniques, such as clustering, can benefit significantly from considering set and group similarities.

  name   age  sex  p.ptype  p.amount      p.location
  Johny  43   M    {1,2,3}  {400,70,200}  {Mall, Grocery, Warehouse}
  Andy   2    F    {2,3}    {300,100}     {Mall, Grocery}
  Post   67   M    1        130           Mall
  Jenny  35   F    null     null          null

Figure 2.1: A converted table with a bag of values

We propose another approach to cope with this problem, which allows the attributes of an object to be multivalued. We call this generalization of the flat file format an extended data set. In an extended data set, objects are characterized through attributes that do not only have a single value, but rather a bag of values. Unlike a set, a bag allows for duplicate elements, but all the elements must come from the same domain. For example, the
bag {200,200,300,300,100} for the amount attribute might represent five purchases of 100, 200, 200, 300, and 300 dollars by a person. Figure 2.1 depicts an extended data set that has been constructed from the two relations in Figure 1.1(a). In this table, the related attributes (called structured attributes) p.ptype, p.amount, and p.location now contain path information (e.g., p stands for the Purchase relation) for clearer semantics and can have a bag of values. Basically, the related object groups in Figure 1.1(b) are combined into one unique object with a bag of values for the related attributes. In Figure 2.1, as well as throughout the paper, we use curly brackets to represent a set of values (e.g., {1,2,3}), we use null to denote the empty bag, and we just give the element itself if the bag has a single value. Most existing similarity-based clustering algorithms cannot deal with this data set representation, because the similarity metrics used in those algorithms expect an object to have a single value for an attribute, not a bag of values. Accordingly, our approach to discovering useful sets of queries through database clustering faces the following problems:

- How to generalize data mining techniques (e.g., the clustering algorithms in this paper) so that they can cope with multivalued attributes.
- How to discover a set of useful queries that describe the characteristics of the objects in each cluster.

We need more systematic and comprehensive approaches to measure group similarity (e.g., similarity between bags of values) for clustering and to discover a useful set of queries for each cluster.

GROUP SIMILARITY MEASURES FOR EXTENDED DATA SETS

In this paper, we broadly categorize attribute types into quantitative and qualitative types, introduce existing similarity measures for these two types, and generalize them to cope with extended data sets with mixed types.
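The conversion from the joined table of Figure 1.1(b) to the extended data set of Figure 2.1 amounts to grouping rows by object and collecting the structured attributes into bags. A minimal sketch, using Python lists as bags (since bags allow duplicates) and an empty list for the null bag:

```python
# Joined tuples from Figure 1.1(b): (name, age, sex, ptype, amount, location)
joined = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 2, "F", 2, 300, "Mall"),
    ("Andy", 2, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 130, "Mall"),
    ("Jenny", 35, "F", None, None, None),
]

# Collapse the 1:n rows into one object per person, turning the
# structured attributes into bags of values.
extended = {}
for name, age, sex, ptype, amount, location in joined:
    obj = extended.setdefault(name, {"age": age, "sex": sex,
                                     "p.ptype": [], "p.amount": [],
                                     "p.location": []})
    if ptype is not None:
        obj["p.ptype"].append(ptype)
        obj["p.amount"].append(amount)
        obj["p.location"].append(location)

print(extended["Johny"]["p.amount"])  # [400, 70, 200]
print(extended["Jenny"]["p.amount"])  # [] -- the empty ("null") bag
```

Note that no information is discarded in this step: the cardinality of each bag records how many purchases a person made, which an aggregate such as the average would have lost.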
Qualitative type

Tversky (1977) proposed his contrast model and ratio model, which generalize several set-theoretical similarity models proposed at that time. Tversky considers objects as sets of features instead of geometric points in a metric space. To illustrate his models, let a and b be two objects, and let A and B denote the sets of features associated with the objects a and b, respectively. Tversky proposed the following family of similarity measures, called the contrast model:

  S(a,b) = θf(A ∩ B) - αf(A - B) - βf(B - A), for some θ, α, β ≥ 0,

where f is usually the cardinality of the set. In the earlier models, the similarity between objects was determined only by their common features, or only by their distinctive features. In the contrast model, the similarity of a pair of objects is instead expressed as a linear combination, i.e., a weighted difference, of the measures of their common and distinctive features. The following family of similarity measures represents the ratio model:

  S(a,b) = f(A ∩ B) / [f(A ∩ B) + αf(A - B) + βf(B - A)], α, β ≥ 0.

In the ratio model, the similarity value is normalized to the range between 0 and 1. In Tversky's set-theoretic similarity models, a feature usually denotes a value of a binary or nominal attribute, but the notion can be extended to interval or ordinal types. Note that the sets in Tversky's models are crisp sets, not fuzzy sets. For the qualitative multivalued case, Tversky's set similarity can be used, since we can consider this case
as an attribute of an object having a group feature property (e.g., a set of feature values).

Quantitative type

One simple way to measure intergroup distance is to substitute the group means for the ith attribute of an object into the formulae for interobject measures such as Euclidean distance (Everitt93). The main problem with this group-mean approach is that it does not consider the cardinality of the quantitative elements in a group. Another approach, known as group average, can be used to measure intergroup similarity. In this approach, the between-group similarity is measured by taking the average of all the interobject measures over those pairs of objects in which the two objects of a pair are in different groups. For example, the average dissimilarity between groups A and B can be defined as

  d(A,B) = (1/n) Σ_{i=1}^{n} d(a,b)_i,

where n is the total number of object pairs and d(a,b)_i is the dissimilarity measure for the ith pair of objects a and b, a ∈ A, b ∈ B. In computing group similarity based on the group average, a decision may be required on whether we compute the average over every possible pair or over a subset of the possible pairs. For example, suppose we have the pair of value sets {20,5}:{4,5} and use the city block measure as the distance function. One way to compute a group average for this pair of value sets is to compute it from every possible pair, (|20-4| + |20-5| + |5-4| + |5-5|)/4, and the other is to compute it only from the corresponding pairs, (|5-4| + |20-5|)/2, after sorting each value set. In the latter approach, sorting may help reduce unnecessary computation, although it requires additional sorting time. For example, the total difference over every possible pair for the value sets {2,5} and {6,3} is 8, whereas the sum of the sorted individual value differences for the same sets, {2,5} and {3,6}, is 2. The example shows that computing similarity after sorting the value sets may result in a better similarity index between multivalued objects.
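Both kinds of measures can be sketched directly from the definitions above. This is our own illustrative code, not the authors' implementation, and the helper names are assumptions: tversky_ratio implements the ratio model with f as set cardinality, and the two distance functions implement the group average over every pair and over sorted corresponding pairs, using the city-block measure.

```python
def tversky_ratio(a, b, alpha=1.0, beta=1.0):
    """Tversky's ratio model on crisp feature sets, f = cardinality."""
    A, B = set(a), set(b)
    common = len(A & B)
    denom = common + alpha * len(A - B) + beta * len(B - A)
    return common / denom if denom else 1.0  # two empty sets: fully similar

def every_pair_distance(xs, ys):
    """Group average over every cross-group pair, city-block measure."""
    pairs = [(x, y) for x in xs for y in ys]
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

def sorted_pair_distance(xs, ys):
    """Group average over corresponding pairs after sorting each bag.
    If the bags differ in length, surplus values are ignored here
    (one of several possible conventions)."""
    xs, ys = sorted(xs), sorted(ys)
    n = min(len(xs), len(ys))
    return sum(abs(x - y) for x, y in zip(xs, ys)) / n

# The example from the text: {2,5} vs {6,3}
print(every_pair_distance([2, 5], [6, 3]))   # total 8 over 4 pairs -> 2.0
print(sorted_pair_distance([2, 5], [6, 3]))  # total 2 over 2 pairs -> 1.0
print(tversky_ratio({1, 2, 3}, {2, 3}))      # 2 / (2 + 1 + 0) ~= 0.667
```

The sorted-pair variant touches each value once instead of forming the full cross product, which is where the computational saving mentioned above comes from.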
We call the former the every-pair approach and the latter the sorted-pair approach. The group-average approach considers both the cardinality and the quantitative variance of the elements in a group when computing the similarity between two groups of values.

A FRAMEWORK FOR SIMILARITY MEASURES

A similarity measure proposed by Gower (1971) is particularly useful for data sets that contain a variety of attribute types. It is defined as:

  S(a,b) = Σ_{i=1}^{m} w_i s_i(a_i, b_i) / Σ_{i=1}^{m} w_i

In this formula, s_i(a_i, b_i) is the normalized similarity index, in the range between 0 and 1, between the objects a and b as measured by the function s_i for the ith attribute, and w_i is the weight for the ith attribute. The weight w_i can also be used as a mask, depending on the validity of the similarity comparison on the ith attribute, which may be unknown or irrelevant for the similarity computation for a pair of objects. We can extend Gower's similarity function to measure similarity for extended data sets with mixed types. The similarity function can consist of two subfunctions: a similarity over the l qualitative attributes and a similarity over the q quantitative attributes. We assume each attribute carries its type information, since a data analyst can easily provide the type information for attributes. The following formula represents the extended similarity function:

  S(a,b) = [Σ_{i=1}^{l} w_i s_l(a_i, b_i) + Σ_{j=1}^{q} w_j s_q(a_j, b_j)] / (Σ_{i=1}^{l} w_i + Σ_{j=1}^{q} w_j),

where m = l + q. The functions s_l(a,b) and s_q(a,b) are similarity functions for qualitative
attributes and quantitative attributes, respectively. For each type of similarity measure, the user makes the choice of a specific similarity measure and proper weights based on the attribute types and the application. For example, for the similarity function s_l(a,b), we can use Tversky's set similarity measure over the l qualitative attributes; for the similarity function s_q(a,b), we can use the group similarity function over the q quantitative attributes. Quantitative multivalued objects have an additional property beyond their quantitative values: a group feature property that includes cardinality information. Therefore, s_q(a,b) may consist of two subfunctions measuring group features and group quantity, s_q(a,b) = s_l(a,b) + s_g(a,b), where s_l(a,b) and s_g(a,b) can be Tversky's set similarity and the group-average similarity functions, respectively. The main objective of using Tversky's set similarity here is to give more weight to the common features of a pair of objects.

AN ARCHITECTURE FOR DATABASE CLUSTERING

The unified similarity measure requires basic information, such as the attribute types (i.e., qualitative or quantitative), weights, and the range values of quantitative attributes, before it can be applied. Figure 3.1 shows the architecture of the interactive database clustering environment we are currently developing.

Figure 3.1: Architecture of a Database Clustering Environment (components: a DBMS and data extraction tool producing the extended data set; a similarity measure tool drawing on a library of similarity measures, default choices, domain information, and type and weight information; a clustering tool producing a set of clusters; MASSON producing a set of discovered queries; and a user interface)

The data extraction tool generates an extended data set from a database based on user requirements. The similarity measure tool assists the user in constructing a similarity measure that is appropriate for his/her application.
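The unified measure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: it uses a Tversky-style set similarity for qualitative bags, the sorted-pair group average (normalized by the attribute's range) for quantitative bags, and a user-supplied schema of types, weights, and ranges; all names here are our own.

```python
def set_similarity(a, b):
    # Tversky's ratio model with alpha = beta = 1 (i.e., Jaccard similarity).
    A, B = set(a), set(b)
    union = len(A | B)
    return len(A & B) / union if union else 1.0

def group_similarity(xs, ys, value_range):
    # Sorted-pair group average with the city-block measure, mapped to a
    # similarity in [0,1] by normalizing with the attribute's range.
    xs, ys = sorted(xs), sorted(ys)
    n = min(len(xs), len(ys))
    d = sum(abs(x - y) for x, y in zip(xs, ys)) / n
    return 1.0 - d / value_range

def extended_gower(a, b, schema):
    """schema maps attribute -> (kind, weight, range); range is unused
    for qualitative attributes."""
    num = den = 0.0
    for attr, (kind, weight, rng) in schema.items():
        if not a[attr] or not b[attr]:
            continue  # empty (null) bag: mask the attribute, as with Gower's weights
        if kind == "qualitative":
            s = set_similarity(a[attr], b[attr])
        else:
            s = group_similarity(a[attr], b[attr], rng)
        num += weight * s
        den += weight
    return num / den if den else 0.0

# Johny vs Andy on the structured attributes of Figure 2.1:
johny = {"p.ptype": [1, 2, 3], "p.amount": [400, 70, 200]}
andy = {"p.ptype": [2, 3], "p.amount": [300, 100]}
schema = {"p.ptype": ("qualitative", 1.0, None),
          "p.amount": ("quantitative", 1.0, 330.0)}  # range of amounts, 70..400
print(round(extended_gower(johny, andy, schema), 3))
```

Skipping attributes with null bags mirrors Gower's use of the weight as a mask for comparisons that are invalid for a particular pair of objects.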
Relying on a library of similarity measures, it interactively guides the user through the construction process, inquiring about the types, weights, and other characteristics of the attributes, and offering alternatives and choices to the user if more than one similarity measure seems to be appropriate. In case the user cannot provide the necessary information, default assumptions are made and default choices are provided, and occasionally the necessary information is retrieved directly from the database. For example, as the default weights the unit vector can be used (i.e., all the weights are equally one), and as default similarity measures, Tversky's ratio model is used for qualitative types and Euclidean distance for quantitative types. The range value information (used to normalize the similarity index) for quantitative attributes can easily be retrieved from a given data set by scanning the column vectors of the quantitative attributes. The clustering tool takes the constructed similarity measure and the extended data set as its input and
applies a clustering algorithm chosen by the user, such as nearest-neighbor (Everitt93), to the extended data set. Finally, MASSON (Ryu96a) takes the objects (identified only by their object-ids) from each cluster and returns a set of discovered queries that describe the commonalities of the set of objects in the given cluster. MASSON is a query discovery system that uses database queries as a rule representation language (Ryu96b). MASSON discovers a set of discriminant queries (i.e., a set of queries that describes only the given set of objects in a cluster and no objects in the other clusters) in structured databases (Ryu98) using genetic programming (Koza90).

SUMMARY AND CONCLUSION

In this paper, we analyzed the problem of generating a single flat file to represent data sets that have been extracted from structured databases, and pointed out its representational inappropriateness for related information, a fact that has frequently been overlooked by recent data mining research. To overcome these difficulties, we introduced a better representation scheme, called the extended data set, which allows attributes of an object to have a bag of values, and discussed how existing similarity measures for single-valued attributes can be generalized to measure group similarity for extended data sets in clustering. We also proposed a unified framework for similarity measures that copes with extended data sets with mixed types by extending Gower's work. Once the target database is grouped into clusters with similar properties, the discriminant query discovery system MASSON can discover useful characteristic information for the set of objects that belong to a cluster. We claim that the proposed representation scheme is suitable for coping with related information and is more expressive than the traditional single flat file format. More importantly, the relationship information in a structured database is actually taken into account in the clustering process.

REFERENCES

Everitt, B.S. (1993).
Cluster Analysis, Edward Arnold; co-published by Halsted Press, an imprint of John Wiley & Sons Inc., 3rd edition.

Gower, J.C. (1971). A general coefficient of similarity and some of its properties, Biometrics 27.

Koza, John R. (1990). Genetic Programming: On the Programming of Computers by Means of Natural Selection, Cambridge, MA: The MIT Press.

Quinlan, J. (1993). C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.

Ribeiro, J.S., Kaufmann, K., and Kerschberg, L. (1995). Knowledge Discovery from Multiple Databases, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec.

Ryu, T.W. and Eick, C.F. (1996a). Deriving Queries from Results using Genetic Programming, In Proceedings of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon.

Ryu, T.W. and Eick, C.F. (1996b). MASSON: Discovering Commonalities in Collection of Objects using Genetic Programming, In Proceedings of the Genetic Programming 1996 Conference, Stanford University, San Francisco.

Ryu, T.W. and Eick, C.F. (1998). Automated Discovery of Discriminant Rules for a Group of Objects in Databases, In Conference on Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA, June 1998.

Thompson, K., and Langley, P. (1991). Concept formation in structured domains, In Concept Formation: Knowledge and Experience in Unsupervised Learning, Eds. Fisher, D.H., Pazzani, M., and Langley, P., Morgan Kaufmann.

Tversky, A. (1977). Features of similarity, Psychological Review, 84(4), July.
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationOntology based Model and Procedure Creation for Topic Analysis in Chinese Language
Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language Dong Han and Kilian Stoffel Information Management Institute, University of Neuchâtel PierreàMazel 7, CH2000 Neuchâtel,
More informationData Quality Control: Using High Performance Binning to Prevent Information Loss
SESUG Paper DM1732017 Data Quality Control: Using High Performance Binning to Prevent Information Loss ABSTRACT Deanna N SchreiberGregory, Henry M Jackson Foundation It is a wellknown fact that the
More informationSOCIAL MEDIA MINING. Data Mining Essentials
SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate
More informationSemiSupervised Clustering with Partial Background Information
SemiSupervised Clustering with Partial Background Information Jing Gao PangNing Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More informationData Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394
Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 20 Table of contents 1 Introduction 2 Data mining
More informationIntroduction to Data Mining
Introduction to JULY 2011 Afsaneh Yazdani What motivated? Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge What motivated? Data
More informationMachine Learning Techniques for Data Mining
Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already
More informationSlides for Data Mining by I. H. Witten and E. Frank
Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Schemeindependent, schemespecific Attribute discretization Unsupervised, supervised, error
More informationCLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi
CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the
More informationASurveyonClusteringTechniquesforMultiValuedDataSets
Global Journal of omputer Science and Technology: Software & Data Engineering Volume 16 Issue 1 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc.
More informationData mining, 4 cu Lecture 6:
582364 Data mining, 4 cu Lecture 6: Quantitative association rules Multilevel association rules Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Data mining, Spring 2010 (Slides adapted
More informationMachine Learning Chapter 2. Input
Machine Learning Chapter 2. Input 2 Input: Concepts, instances, attributes Terminology What s a concept? Classification, association, clustering, numeric prediction What s in an example? Relations, flat
More information5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS
5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS Association rules generated from mining data at multiple levels of abstraction are called multiple level or multi level association
More informationHorizontal Aggregations for Mining Relational Databases
Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,
More informationInternational Journal of Computer Engineering and Applications, ICCSTAR2016, Special Issue, May.16
The Survey Of Data Mining And Warehousing Architha.S, A.Kishore Kumar Department of Computer Engineering Department of computer engineering city engineering college VTU Bangalore, India ABSTRACT: Data
More informationData Clustering With Leaders and Subleaders Algorithm
IOSR Journal of Engineering (IOSRJEN) eissn: 22503021, pissn: 22788719, Volume 2, Issue 11 (November2012), PP 0107 Data Clustering With Leaders and Subleaders Algorithm Srinivasulu M 1,Kotilingswara
More informationData Mining By IK Unit 4. Unit 4
Unit 4 Data mining can be classified into two categories 1) Descriptive mining: describes concepts or taskrelevant data sets in concise, summarative, informative, discriminative forms 2) Predictive mining:
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More information3. Data Preprocessing. 3.1 Introduction
3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 033, Martin Ester 84 3.1 Introduction Motivation
More informationChapter 3: Data Mining:
Chapter 3: Data Mining: 3.1 What is Data Mining? Data Mining is the process of automatically discovering useful information in large repository. Why do we need Data mining? Conventional database systems
More informationEstimating Missing Attribute Values Using DynamicallyOrdered Attribute Trees
Estimating Missing Attribute Values Using DynamicallyOrdered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jingwang1@uiowa.edu W. Nick Street Management Sciences Department,
More informationISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 23217782 (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationRelative Unsupervised Discretization for Association Rule Mining
Relative Unsupervised Discretization for Association Rule Mining MarcusChristopher Ludl 1 and Gerhard Widmer 1,2 1 Austrian Research Institute for Artificial Intelligence, Vienna, 2 Department of Medical
More informationCOMP 465: Data Mining Classification Basics
Supervised vs. Unsupervised Learning COMP 465: Data Mining Classification Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Supervised
More informationClustering of Data with Mixed Attributes based on Unified Similarity Metric
Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1
More informationOutlier Detection Using Unsupervised and SemiSupervised Technique on High Dimensional Data
Outlier Detection Using Unsupervised and SemiSupervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University
More informationMining Conditional Cardinality Patterns for Data Warehouse Query Optimization
Mining Conditional Cardinality Patterns for Data Warehouse Query Optimization Miko laj Morzy 1 and Marcin Krystek 2 1 Institute of Computing Science Poznan University of Technology Piotrowo 2, 60965 Poznan,
More informationAn Approach to Intensional Query Answering at Multiple Abstraction Levels Using Data Mining Approaches
An Approach to Intensional Query Answering at Multiple Abstraction Levels Using Data Mining Approaches SukChung Yoon E. K. Park Dept. of Computer Science Dept. of Software Architecture Widener University
More informationData Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA
Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI
More informationLOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION TECHNIQUES
8 th International Conference on DEVELOPMENT AND APPLICATION SYSTEMS S u c e a v a, R o m a n i a, M a y 25 27, 2 0 0 6 LOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION
More informationTadeusz Morzy, Maciej Zakrzewicz
From: KDD98 Proceedings. Copyright 998, AAAI (www.aaai.org). All rights reserved. Group Bitmap Index: A Structure for Association Rules Retrieval Tadeusz Morzy, Maciej Zakrzewicz Institute of Computing
More informationOneShot Learning with a Hierarchical Nonparametric Bayesian Model
OneShot Learning with a Hierarchical Nonparametric Bayesian Model R. Salakhutdinov, J. Tenenbaum and A. Torralba MIT Technical Report, 2010 Presented by Esther Salazar Duke University June 10, 2011 E.
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More information2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.
Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss
More informationDATABASE DEVELOPMENT (H4)
IMIS HIGHER DIPLOMA QUALIFICATIONS DATABASE DEVELOPMENT (H4) Friday 3 rd June 2016 10:00hrs 13:00hrs DURATION: 3 HOURS Candidates should answer ALL the questions in Part A and THREE of the five questions
More informationClassification with Diffuse or Incomplete Information
Classification with Diffuse or Incomplete Information AMAURY CABALLERO, KANG YEN Florida International University Abstract. In many different fields like finance, business, pattern recognition, communication
More informationROUGH SETS THEORY AND UNCERTAINTY INTO INFORMATION SYSTEM
ROUGH SETS THEORY AND UNCERTAINTY INTO INFORMATION SYSTEM Pavel Jirava Institute of System Engineering and Informatics Faculty of Economics and Administration, University of Pardubice Abstract: This article
More informationGetting to Know Your Data
Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss
More informationManaging Changes to Schema of Data Sources in a Data Warehouse
Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2001 Proceedings Americas Conference on Information Systems (AMCIS) December 2001 Managing Changes to Schema of Data Sources in
More information2. Data Preprocessing
2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459
More informationFunctions as Conditionally Discoverable Relational Database Tables
Functions as Conditionally Discoverable Relational Database Tables A. Ondi and T. Hagan Securboration, Inc., Melbourne, FL, USA Abstract  It is beneficial for large enterprises to have an accurate and
More informationA Review on Cluster Based Approach in Data Mining
A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,
More informationEfficient SQLQuerying Method for Data Mining in Large Data Bases
Efficient SQLQuerying Method for Data Mining in Large Data Bases Nguyen Hung Son Institute of Mathematics Warsaw University Banacha 2, 02095, Warsaw, Poland Abstract Data mining can be understood as a
More informationData Analytics and Boolean Algebras
Data Analytics and Boolean Algebras Hans van Thiel November 28, 2012 c Muitovar 2012 KvK Amsterdam 34350608 Passeerdersstraat 76 1016 XZ Amsterdam The Netherlands T: + 31 20 6247137 E: hthiel@muitovar.com
More informationChange Analysis in Spatial Data by Combining Contouring Algorithms with Supervised Density Functions
Change Analysis in Spatial Data by Combining Contouring Algorithms with Supervised Density Functions Chun Sheng Chen 1, Vadeerat Rinsurongkawong 1, Christoph F. Eick 1, and Michael D. Twa 2 1 Department
More informationKeywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering
More informationFuzzy Partitioning with FID3.1
Fuzzy Partitioning with FID3.1 Cezary Z. Janikow Dept. of Mathematics and Computer Science University of Missouri St. Louis St. Louis, Missouri 63121 janikow@umsl.edu Maciej Fajfer Institute of Computing
More information2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t
Data Reduction  an Adaptation Technique for Mobile Environments A. Heuer, A. Lubinski Computer Science Dept., University of Rostock, Germany Keywords. Reduction. Mobile Database Systems, Data Abstract.
More informationConstructing XofN Attributes with a Genetic Algorithm
Constructing XofN Attributes with a Genetic Algorithm Otavio Larsen 1 Alex Freitas 2 Julio C. Nievola 1 1 Postgraduate Program in Applied Computer Science 2 Computing Laboratory Pontificia Universidade
More information