SIMILARITY MEASURES FOR MULTI-VALUED ATTRIBUTES FOR DATABASE CLUSTERING

TAE-WAN RYU AND CHRISTOPH F. EICK
Department of Computer Science, University of Houston, Houston, Texas, {twryu,

ABSTRACT: This paper introduces an approach to cope with the representational inappropriateness of the traditional flat file format for data sets extracted from databases, specifically in database clustering. After analyzing the problems of the traditional flat file format in representing related information, a better representation scheme called the extended data set, which allows attributes of an object to have multiple values, is introduced, and it is demonstrated how this representation scheme can represent structural information in databases for clustering. A unified similarity measure framework for mixed types of multi-valued and single-valued attributes is proposed. A query discovery system, MASSON, which takes each cluster as input, is used to discover a set of queries that represent discriminant characteristic knowledge for each cluster.

INTRODUCTION

Many data analysis and data mining tools, such as clustering tools, inductive learning tools, and statistical analysis tools, assume that the data sets to be analyzed are represented as a single flat file (or table) in which an object is characterized by attributes that have a single value.

Figure 1: (a) an example Personal relational database consisting of the relations Person (ssn, name, age, sex) and Purchase (ssn, location, ptype, amount, date); the cardinality ratio between Person and Purchase is 1:n. (b) the table obtained by joining Person and Purchase on ssn, with attributes name, age, sex, ptype, amount, and location. ptype (payment type): 1 for cash, 2 for credit, and 3 for check.

Recently, many of these data analysis approaches are being applied to data sets that have been extracted from databases. However, a database may consist of several related data sets (e.g., relations in the relational model), and the cardinality ratio of relationships between data sets in such a database is frequently 1:n or n:m, which may cause significant problems when data that have been extracted from a database have to be converted into a flat file in order to apply the above mentioned tools. The flat file format is not appropriate for representing the related information that is commonly found in databases. (In this paper, we specifically focus on data sets from relational databases, although our approach can be easily extended to the object-oriented database model.)

For example, suppose we have a relational database as depicted in Figure 1(a) that consists of Person and Purchase relations storing information about a person's purchases, and we want to categorize the persons that occur in the database into several groups with similar characteristics. It is obvious that the attributes found in the Person relation alone are not sufficient to achieve this goal, because many important characteristics of persons are found in other related relations, such as the Purchase relation that stores the shopping history of persons. This raises the question of how the two tables can be combined into a single flat file so that traditional clustering and/or machine learning algorithms can be applied to it. Although some systems (Thompson91, Ribeiro95) attempt to discover knowledge directly from structured domains, the most straightforward approach for generating a single flat file is to join the related tables (Quinlan93). Figure 1(b) depicts the result of the natural join of the two relations in Figure 1(a) using ssn as the join attribute. The object Andy in the Person relation is represented by two different tuples in the joined table in Figure 1(b). The main problem with this representation is that many clustering algorithms or machine learning tools would consider each tuple as a different object; that is, they would interpret the above table as containing 7 person objects rather than 4 unique person objects. This representational discrepancy between a data set from a structured database and a data set in the traditional flat file format assumed by many data analysis approaches seems to have been overlooked. This paper proposes a knowledge discovery and data mining framework to deal with this limitation of the traditional flat file representation. We specifically focus on the problems of structured database clustering and the discovery of a set of queries that describe the characteristics of the objects in each cluster.

EXTENDED DATA SETS

The example in the previous section motivated that a better representation scheme is needed to represent information for objects that are interrelated with other objects. One simple approach to cope with this problem would be to group all the related objects into a single object by applying aggregate operations (e.g., average) that replace related values with a single value for the object. The problem with this approach is that the user has to make critical decisions (e.g., which aggregate function to use) beforehand; moreover, by applying an aggregate function, valuable information is frequently lost (e.g., how many purchases a person has made, or what the maximum purchase of a person was). Tversky (1977) gives more examples illustrating that data analysis techniques, such as clustering, can benefit significantly from considering set and group similarities.

name   age  sex  p.ptype  p.amount      p.location
Johny  43   M    {1,2,3}  {400,70,200}  {Mall, Grocery, Warehouse}
Andy   2    F    {2,3}    {300,100}     {Mall, Grocery}
Post   67   M    1        30            Mall
Jenny  35   F    null     null          null

Figure 2: A converted table with a bag of values

We propose another approach that allows attributes of an object to be multi-valued to cope with this problem. We call this generalization of the flat file format an extended data set. In an extended data set, objects are characterized through attributes that do not only have a single value, but rather a bag of values. Unlike a set, a bag allows for duplicate elements, but the elements must be from the same domain. For example, the bag {200,200,300,300,100} for the amount attribute might represent five purchases of 100, 200, 200, 300, and 300 dollars by a person. Figure 2 depicts an extended data set that has been constructed from the two relations in Figure 1(a). In this table, the related attributes, called structured attributes, p.ptype, p.amount, and p.location now contain path information (e.g., p stands for the Purchase relation) for clearer semantics and can have a bag of values. Basically, the related object groups in Figure 1(b) are combined into one unique object with a bag of values for the related attributes. In Figure 2, as well as throughout the paper, we use curly brackets to represent a set of values (e.g., {1,2,3}), we use null to denote empty bags, and we just give the element itself if the bag has a single value.
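For concreteness, the following sketch shows one way such an extended data set could be built by collapsing the joined tuples of Figure 1(b) into one object per person, with bags for the Purchase attributes. This is illustrative Python, not the authors' implementation; the function and variable names are hypothetical, and the attribute values are taken from the figures as printed.

```python
# Joined tuples as in Figure 1(b): (name, age, sex, ptype, amount, location).
# Values are illustrative, mirroring the figures of the paper.
joined = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 2, "F", 2, 300, "Mall"),
    ("Andy", 2, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 30, "Mall"),
    ("Jenny", 35, "F", None, None, None),
]

def to_extended_data_set(rows):
    """Collapse joined tuples into one object per person, turning the
    attributes that come from the Purchase relation into bags.
    Lists are used for bags because, unlike sets, they allow duplicates."""
    objects = {}
    for name, age, sex, ptype, amount, location in rows:
        obj = objects.setdefault(name, {"age": age, "sex": sex,
                                        "p.ptype": [], "p.amount": [],
                                        "p.location": []})
        if ptype is not None:  # a null row contributes only empty bags
            obj["p.ptype"].append(ptype)
            obj["p.amount"].append(amount)
            obj["p.location"].append(location)
    return objects

extended = to_extended_data_set(joined)
print(extended["Andy"])
# {'age': 2, 'sex': 'F', 'p.ptype': [2, 3], 'p.amount': [300, 100],
#  'p.location': ['Mall', 'Grocery']}
```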

Most existing similarity-based clustering algorithms cannot deal with this data set representation, because the similarity metrics used in those algorithms expect an object to have a single value for an attribute, not a bag of values. Accordingly, our approach of discovering useful sets of queries through database clustering faces the following problems:

- How to generalize data mining techniques (e.g., clustering algorithms in this paper) so that they can cope with multi-valued attributes.
- How to discover a set of useful queries that describe the characteristics of the objects in each cluster.

We need more systematic and comprehensive approaches to measure group similarity (e.g., similarity between bags of values) for clustering and to discover a useful set of queries for each cluster.

GROUP SIMILARITY MEASURES FOR EXTENDED DATA SETS

In this paper, we broadly categorize attributes into quantitative and qualitative types, introduce existing similarity measures for these two types, and generalize them to cope with extended data sets with mixed types.

Qualitative type

Tversky (1977) proposed his contrast model and ratio model, which generalize several set-theoretical similarity models proposed at that time. Tversky considers objects as sets of features instead of geometric points in a metric space. To illustrate his models, let a and b be two objects, and let A and B denote the sets of features associated with the objects a and b, respectively. Tversky proposed the following family of similarity measures, called the contrast model:

S(a,b) = θf(A ∩ B) − αf(A − B) − βf(B − A),  for some θ, α, β ≥ 0,

where f is usually the cardinality of the set. In previous models, the similarity between objects was determined only by their common features, or only by their distinctive features. In the contrast model, the similarity of a pair of objects is expressed as a linear combination of the measures of the common and the distinctive features; that is, similarity is a weighted difference of the measures of the common and distinctive features. The following family of similarity measures represents the ratio model:

S(a,b) = f(A ∩ B) / [f(A ∩ B) + αf(A − B) + βf(B − A)],  α, β ≥ 0.

In the ratio model, the similarity value is normalized to the range 0 to 1. In Tversky's set-theoretic similarity models, a feature usually denotes a value of a binary or nominal attribute, but the models can be extended to interval or ordinal types. Note that the set in Tversky's model is a crisp set, not a fuzzy set. For the qualitative type of multi-valued attribute, Tversky's set similarity can be used, since in this case an attribute of an object has the group feature property (e.g., a set of feature values).
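The ratio model is straightforward to compute for bags of qualitative values. The following sketch, in Python, is hypothetical code rather than the authors' implementation; it treats the bags as crisp sets, as in Tversky's model, and assumes that two empty bags are maximally similar.

```python
def tversky_ratio(A, B, alpha=1.0, beta=1.0):
    """Tversky's ratio model for two feature sets A and B, normalized to [0, 1].
    With alpha = beta = 1 this reduces to the Jaccard coefficient."""
    A, B = set(A), set(B)   # qualitative bags are treated as crisp sets
    if not A and not B:
        return 1.0          # assumption: two empty bags are identical
    common = len(A & B)
    return common / (common + alpha * len(A - B) + beta * len(B - A))

# Example: the p.location bags of Johny and Andy from Figure 2.
print(tversky_ratio({"Mall", "Grocery", "Warehouse"}, {"Mall", "Grocery"}))  # 2/3
```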

Quantitative type

One simple way to measure inter-group distance is to substitute the group means for the ith attribute of an object into the formulae for inter-object measures such as Euclidean distance (Everitt93). The main problem of this group mean approach is that it does not consider the cardinality of the quantitative elements in a group. Another approach, known as group average, can be used to measure inter-group similarity. In this approach, the between-group similarity is measured by taking the average of all the inter-object measures over those pairs of objects in which the two objects of a pair are in different groups. For example, the average dissimilarity between groups A and B can be defined as

d(A,B) = (1/n) Σ_{i=1..n} d(a,b)_i,

where n is the total number of object pairs and d(a,b)_i is the dissimilarity measure for the ith pair of objects a and b, a ∈ A, b ∈ B. In computing group similarity based on the group average, a decision may be required on whether we compute the average over every possible pair or only over a subset of the possible pairs. For example, suppose we have the pair of value sets {20,5}:{4,5} and use the city block measure as the distance function. One way to compute a group average for this pair of value sets is to average over every possible pair, (|20−4| + |20−5| + |5−4| + |5−5|)/4; the other way is to average only over the corresponding pairs after sorting each value set, (|5−4| + |20−5|)/2. In the latter approach, sorting may help reduce unnecessary computation, although it requires additional sorting time. For example, the total difference over every possible pair for the value sets {2,5} and {6,3} is 8, whereas the sum of the sorted individual value differences for the same sets, {2,5} and {3,6}, is 2. This example shows that computing similarity after sorting the value sets may result in a better similarity index between multi-valued objects. We call the former the every-pair approach and the latter the sorted-pair approach. The group average approach considers both the cardinality and the quantitative variance of the elements in a group when computing the similarity between two groups of values.
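The two variants can be sketched as follows. This is illustrative Python rather than code from the paper; in particular, how the sorted-pair variant handles bags of unequal length is an assumption made here, since the paper does not specify it.

```python
def every_pair_average(xs, ys):
    """Group-average dissimilarity over every cross pair (city block on scalars)."""
    pairs = [(x, y) for x in xs for y in ys]
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

def sorted_pair_average(xs, ys):
    """Group-average dissimilarity over corresponding pairs after sorting.
    Assumption: when the bags differ in length, only the first min(len)
    sorted pairs are compared."""
    xs, ys = sorted(xs), sorted(ys)
    k = min(len(xs), len(ys))
    return sum(abs(x - y) for x, y in zip(xs[:k], ys[:k])) / k

print(every_pair_average([2, 5], [6, 3]))   # total 8 over 4 pairs -> 2.0
print(sorted_pair_average([2, 5], [6, 3]))  # total 2 over 2 pairs -> 1.0
```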

A FRAMEWORK FOR SIMILARITY MEASURES

A similarity measure proposed by Gower (1971) is particularly useful for data sets that contain a variety of attribute types. It is defined as:

S(a,b) = Σ_{i=1..m} w_i s_i(a_i, b_i) / Σ_{i=1..m} w_i

In this formula, s_i(a_i, b_i) is the normalized similarity index, in the range 0 to 1, between the objects a and b as measured by the function s_i for the ith attribute, and w_i is the weight of the ith attribute. The weight w_i can also be used as a mask, depending on the validity of the similarity comparison on the ith attribute, which may be unknown or irrelevant for the similarity computation for a pair of objects. We can extend Gower's similarity function to measure similarity for extended data sets with mixed types. The similarity function can consist of two sub-functions, a similarity for the l qualitative attributes and a similarity for the q quantitative attributes. We assume each attribute has type information, since a data analyst can easily provide the type information for attributes. The following formula represents the extended similarity function:

S(a,b) = [Σ_{i=1..l} w_i s_l(a_i, b_i) + Σ_{j=1..q} w_j s_q(a_j, b_j)] / (Σ_{i=1..l} w_i + Σ_{j=1..q} w_j),

where m = l + q. The functions s_l(a,b) and s_q(a,b) are similarity functions for qualitative attributes and quantitative attributes, respectively. For each type of similarity measure, the user chooses a specific similarity measure and proper weights based on the attribute types and the application. For example, for the similarity function s_l(a,b), we can use Tversky's set similarity measure for the l qualitative attributes. For the similarity function s_q(a,b), we can use the group similarity function for the q quantitative attributes. The quantitative type of multi-valued attribute has an additional property, the group feature property, which includes cardinality information as well as the quantitative property. Therefore, s_q(a,b) may consist of two sub-functions measuring group features and group quantity, s_q(a,b) = s_l(a,b) + s_g(a,b), where s_l(a,b) and s_g(a,b) can be Tversky's set similarity and the group average similarity function, respectively. The main objective of using Tversky's set similarity here is to give more weight to the common features of a pair of objects.
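One possible reading of the extended similarity function is sketched below, reusing tversky_ratio, sorted_pair_average, and the extended data set from the earlier sketches. The equal weighting of the set and group-average terms inside s_q, the range normalization, and the unit weights are assumptions made for this sketch, since the paper leaves these choices to the user.

```python
def quantitative_similarity(xs, ys, value_range):
    """Group similarity for two bags of numbers: a Tversky term on the value
    sets plus a group-average term, in the spirit of s_q = s_l + s_g.
    The 50/50 split and range normalization are assumptions of this sketch."""
    if not xs or not ys:
        return 0.0
    set_part = tversky_ratio(xs, ys)  # group feature property
    avg_part = 1.0 if value_range == 0 else \
        1.0 - sorted_pair_average(xs, ys) / value_range  # dissimilarity -> similarity
    return 0.5 * (set_part + avg_part)

def unified_similarity(obj_a, obj_b, qualitative, quantitative, ranges, weights=None):
    """Gower-style weighted combination over mixed qualitative/quantitative attributes."""
    weights = weights or {attr: 1.0 for attr in qualitative + quantitative}
    total, weight_sum = 0.0, 0.0
    for attr in qualitative:
        total += weights[attr] * tversky_ratio(obj_a[attr], obj_b[attr])
        weight_sum += weights[attr]
    for attr in quantitative:
        total += weights[attr] * quantitative_similarity(
            obj_a[attr], obj_b[attr], ranges[attr])
        weight_sum += weights[attr]
    return total / weight_sum

# Example on the extended data set built earlier; the amount range is taken
# from the column values shown in Figure 2.
s = unified_similarity(extended["Johny"], extended["Andy"],
                       qualitative=["p.ptype", "p.location"],
                       quantitative=["p.amount"],
                       ranges={"p.amount": 400 - 30})
print(round(s, 3))
```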

AN ARCHITECTURE FOR DATABASE CLUSTERING

The unified similarity measure requires basic information, such as the attribute types (i.e., qualitative or quantitative), weights, and the range values of quantitative attributes, before it can be applied. Figure 3 shows the architecture of an interactive database clustering environment we are currently developing.

Figure 3: Architecture of a Database Clustering Environment. The components are a data extraction tool, a similarity measure tool backed by a library of similarity measures, a clustering tool, the query discovery system MASSON, a DBMS, and a user interface; the extraction tool produces an extended data set, the similarity measure tool supplies the similarity measure (using type and weight information and default choices and domain information), the clustering tool produces a set of clusters, and MASSON produces a set of discovered queries.

The data extraction tool generates an extended data set from a database based on user requirements. The similarity measure tool assists the user in constructing a similarity measure that is appropriate for his/her application. Relying on a library of similarity measures, it interactively guides the user through the construction process, inquiring about types, weights, and other characteristics of attributes, and offering alternatives and choices to the user if more than one similarity measure seems to be appropriate. In the case that the user cannot provide the necessary information, default assumptions are made and default choices are provided, and occasionally the necessary information is retrieved directly from the database. For example, the unit vector (i.e., all weights equal to one) can be used as the default weight, and as default similarity measures, Tversky's ratio model is used for qualitative types and Euclidean distance is used for quantitative types. The range value information (used to normalize the similarity index) for quantitative attributes can easily be retrieved from a given data set by scanning the column vectors of the quantitative attributes. The clustering tool takes the constructed similarity measure and the extended data set as its input and applies a clustering algorithm chosen by the user, such as nearest-neighbor (Everitt93), to the extended data set. Finally, MASSON (Ryu96a) takes the objects of each cluster (identified only by their object-ids) and returns a set of discovered queries that describe the commonalities of the set of objects in the given cluster. MASSON is a query discovery system that uses database queries as a rule representation language (Ryu96b). MASSON discovers a set of discriminant queries (i.e., a set of queries that describes only the given set of objects in a cluster and no objects in other clusters) in structured databases (Ryu98) using genetic programming (Koza90).

SUMMARY AND CONCLUSION

In this paper, we analyzed the problem of generating a single flat file to represent data sets that have been extracted from structured databases, and pointed out its representational inappropriateness for related information, a fact that has frequently been overlooked by recent data mining research. To overcome these difficulties, we introduced a better representation scheme, called the extended data set, which allows attributes of an object to have a bag of values, and discussed how existing similarity measures for single-valued attributes can be generalized to measure group similarity for extended data sets in clustering. We also proposed a unified framework for similarity measures to cope with extended data sets with mixed types by extending Gower's work. Once the target database is grouped into clusters with similar properties, the discriminant query discovery system MASSON can discover useful characteristic information for the set of objects that belong to a cluster. We claim that the proposed representation scheme is suitable for coping with related information and that it is more expressive than the traditional single flat file format. More importantly, the relationship information in a structured database is actually considered in the clustering process.

REFERENCES

Everitt, B.S. (1993). Cluster Analysis, 3rd edition, Edward Arnold; copublished by Halsted Press, an imprint of John Wiley & Sons Inc.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties, Biometrics 27.
Koza, J.R. (1990). Genetic Programming: On the Programming of Computers by Means of Natural Selection, Cambridge, MA: The MIT Press.
Quinlan, J. (1993). C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.
Ribeiro, J.S., Kaufmann, K., and Kerschberg, L. (1995). Knowledge Discovery from Multiple Databases, In Proceedings of the 1st Int'l Conference on Knowledge Discovery and Data Mining, Montreal, Quebec.
Ryu, T.W. and Eick, C.F. (1996a). Deriving Queries from Results using Genetic Programming, In Proceedings of the 2nd Int'l Conference on Knowledge Discovery and Data Mining, Portland, Oregon.
Ryu, T.W. and Eick, C.F. (1996b). MASSON: Discovering Commonalities in Collection of Objects using Genetic Programming, In Proceedings of the Genetic Programming 1996 Conference, Stanford University, San Francisco.
Ryu, T.W. and Eick, C.F. (1998). Automated Discovery of Discriminant Rules for a Group of Objects in Databases, In Conference on Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA, June.
Thompson, K., and Langley, P. (1991). Concept formation in structured domains, In Concept Formation: Knowledge and Experience in Unsupervised Learning, Eds. Fisher, D.H., Pazzani, M., and Langley, P., Morgan Kaufmann.
Tversky, A. (1977). Features of similarity, Psychological Review, 84(4), July.
