SIMILARITY MEASURES FOR MULTIVALUED ATTRIBUTES FOR DATABASE CLUSTERING


TAEWAN RYU AND CHRISTOPH F. EICK
Department of Computer Science, University of Houston, Houston, Texas
{twryu,

ABSTRACT: This paper introduces an approach to cope with the representational inappropriateness of the traditional flat file format for data sets from databases, specifically in database clustering. After analyzing the problems of the traditional flat file format in representing related information, a better representation scheme, called an extended data set, which allows attributes of an object to have multiple values, is introduced, and it is demonstrated how this representation scheme can represent structural information in databases for clustering. A unified similarity measure framework for mixed types of multivalued and single-valued attributes is proposed. A query discovery system, MASSON, then takes the objects in each cluster and discovers a set of queries that represent discriminant characteristic knowledge for that cluster.

INTRODUCTION

Many data analysis and data mining tools, such as clustering tools, inductive learning tools, and statistical analysis tools, assume that data sets to be analyzed are represented as a single flat file (or table) in which an object is characterized by attributes that have a single value.

Person(ssn, name, age, sex): Johny (43, M), Andy (2, F), Post (67, M), Jenny (35, F)
Purchase(ssn, location, ptype, amount, date): six purchases at Warehouse, Grocery, Mall, Mall, Grocery, and Mall
ptype (payment type): 1 for cash, 2 for credit, and 3 for check
Joined result (name, age, sex, ptype, amount, location): Johny appears in three tuples (Mall, Grocery, Warehouse), Andy in two (Mall, Grocery), Post in one (Mall), and Jenny in one tuple with null values.

Figure 1: (a) an example Person relational database; the cardinality ratio between Person and Purchase is 1:n. (b) a joined table from Person and Purchase.

Recently, many of these data analysis approaches are being applied to data sets that have been extracted from databases.
However, a database may consist of several related data sets (e.g., relations in the relational model), and the cardinality ratio of relationships between data sets in such a database is frequently 1:n or n:m, which may cause significant problems when data extracted from a database have to be converted into a flat file in order to apply the above-mentioned tools. The flat file format is not appropriate for representing the related information that is commonly found in databases. In this paper, we specifically focus on data sets from relational databases, although our approach can easily be extended to the object-oriented database model.
For example, suppose we have a relational database, as depicted in Figure 1(a), that consists of Person and Purchase relations storing information about a person's purchases, and we want to categorize the persons that occur in the database into several groups with similar characteristics. It is obvious that the attributes found in the Person relation alone are not sufficient to achieve this goal, because many important characteristics of persons are found in other related relations, such as the Purchase relation that stores the shopping history of persons. This raises the question of how the two tables can be combined into a single flat file so that traditional clustering and/or machine learning algorithms can be applied to it. Although some systems (Thompson91, Ribeiro95) attempt to discover knowledge directly from structured domains, the most straightforward approach for generating a single flat file is to join the related tables (Quinlan93). Figure 1(b) depicts the result of the natural join of the two relations in Figure 1(a) using ssn as the join attribute. The object Andy in the Person relation is represented by two different tuples in the joined table in Figure 1(b). The main problem with this representation is that many clustering algorithms or machine learning tools would consider each tuple a different object; that is, they would interpret the table as containing 7 person objects rather than 4 unique person objects. This representational discrepancy between a data set from a structured database and a data set in the traditional flat file format assumed by many data analysis approaches seems to have been overlooked. This paper proposes a knowledge discovery and data mining framework to deal with this limitation of the traditional flat file representation.
We specifically focus on the problems of structured database clustering and the discovery of a set of queries that describe the characteristics of the objects in each cluster.

EXTENDED DATA SETS

The example in the previous section showed that a better representation scheme is needed to represent information for objects that are interrelated with other objects. One simple approach to this problem would be to group all the related objects into a single object by applying aggregate operations (e.g., average) to replace the related values with a single value for the object. The problem with this approach is that the user has to make critical decisions (e.g., which aggregate function to use) beforehand; moreover, by applying an aggregate function, valuable information is frequently lost (e.g., how many purchases a person has made, or what the maximum purchase of a person was). Tversky (1977) gives more examples that illustrate that data analysis techniques, such as clustering, can benefit significantly from considering set and group similarities.

Name   age  sex  p.ptype   p.amount       p.location
Johny  43   M    {1,2,3}   {400,70,200}   {Mall, Grocery, Warehouse}
Andy   2    F    {2,3}     {300,100}      {Mall, Grocery}
Post   67   M              30             Mall
Jenny  35   F    null      null           null

Figure 2: A converted table with a bag of values

We propose another approach, which allows attributes of an object to be multivalued, to cope with this problem. We call this generalization of the flat file format an extended data set. In an extended data set, objects are characterized through attributes that do not have only a single value, but rather a bag of values. A bag, unlike a set, allows duplicate elements, but the elements must be from the same domain. For example, the
following bag {200,200,300,300,100} for the amount attribute might represent five purchases of 100, 200, 200, 300 and 300 dollars by a person. Figure 2 depicts an extended data set that has been constructed from the two relations in Figure 1(a). In this table, the related attributes, called structured attributes, p.ptype, p.amount, and p.location, now contain path information (e.g., p stands for the Purchase relation) for clearer semantics and can have a bag of values. Basically, the related object groups in Figure 1(b) are combined into one unique object with a bag of values for the related attributes. In Figure 2, as well as throughout the paper, we use curly brackets to represent a set of values (e.g., {1,2,3}), we use null to denote an empty bag, and we give just the element itself if the bag has a single value. Most existing similarity-based clustering algorithms cannot deal with this data set representation, because the similarity metrics used in those algorithms expect an object to have a single value for an attribute, not a bag of values. Accordingly, our approach to discover a useful set of queries through database clustering faces the following problems: how to generalize data mining techniques (e.g., clustering algorithms in this paper) so that they can cope with multivalued attributes, and how to discover a set of useful queries that describe the characteristics of the objects in each cluster. We need more systematic and comprehensive approaches to measure group similarity (e.g., similarity between bags of values) for clustering and to discover a useful set of queries for each cluster.

GROUP SIMILARITY MEASURES FOR EXTENDED DATA SETS

In this paper, we broadly categorize attribute types into quantitative and qualitative types, introduce existing similarity measures based on these two types, and generalize them to cope with extended data sets with mixed types.
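Before turning to concrete measures, the conversion from a joined table like Figure 1(b) into an extended data set like Figure 2 can be sketched in Python. This is a minimal illustration, not the system's actual extraction tool; the function name and the exact tuple values are ours (ages and amounts are illustrative).

```python
from collections import defaultdict

# Hypothetical tuples from a joined Person/Purchase table, one row per purchase.
joined = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70,  "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy",  21, "F", 2, 300, "Mall"),
    ("Andy",  21, "F", 3, 100, "Grocery"),
]

def to_extended(tuples):
    """Collapse joined tuples into one object per person, with bags
    (lists, since bags allow duplicates) for the structured attributes."""
    objects = defaultdict(lambda: {"p.ptype": [], "p.amount": [], "p.location": []})
    for name, age, sex, ptype, amount, location in tuples:
        obj = objects[(name, age, sex)]          # one entry per unique person
        obj["p.ptype"].append(ptype)
        obj["p.amount"].append(amount)
        obj["p.location"].append(location)
    return dict(objects)

extended = to_extended(joined)
print(extended[("Johny", 43, "M")]["p.amount"])  # [400, 70, 200]
```

The five joined tuples collapse into two unique objects, mirroring how the seven tuples of Figure 1(b) collapse into the four objects of Figure 2.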
Qualitative type

Tversky (1977) proposed his contrast model and ratio model, which generalize several set-theoretical similarity models proposed at that time. Tversky considers objects as sets of features instead of geometric points in a metric space. To illustrate his models, let a and b be two objects, and let A and B denote the sets of features associated with the objects a and b, respectively. Tversky proposed the following family of similarity measures, called the contrast model:

S(a,b) = θf(A ∩ B) − αf(A − B) − βf(B − A), for some θ, α, β ≥ 0,

where f is usually the cardinality of the set. In earlier models, the similarity between objects was determined either only by their common features or only by their distinctive features. The contrast model instead expresses the similarity of a pair of objects as a linear combination (a weighted difference) of the measures of their common and their distinctive features. The following family of similarity measures represents the ratio model:

S(a,b) = f(A ∩ B) / [f(A ∩ B) + αf(A − B) + βf(B − A)], α, β ≥ 0

In the ratio model, the similarity value is normalized to the range of 0 and 1. In Tversky's set-theoretic similarity models, a feature usually denotes a value of a binary or nominal attribute, but it can be extended to interval or ordinal types. Note that the set in Tversky's model is a crisp set, not a fuzzy set. For the qualitative type of multivalued attributes, Tversky's set similarity can be used, since we can consider this case
as one where an attribute of an object has a group feature property (e.g., a set of feature values).

Quantitative type

One simple way to measure inter-group distance is to substitute group means for the ith attribute of an object in the formulae for inter-object measures such as Euclidean distance (Everitt93). The main problem with this group mean approach is that it does not consider the cardinality of the quantitative elements in a group. Another approach, known as group average, can be used to measure inter-group similarity. In this approach, the between-group similarity is measured by taking the average of all the inter-object measures over those pairs of objects in which each object of a pair is in a different group. For example, the average dissimilarity between groups A and B can be defined as

d(A,B) = (1/n) Σ_{i=1}^{n} d(a,b)_i,

where n is the total number of object pairs and d(a,b)_i is the dissimilarity measure for the ith pair of objects a and b, a ∈ A, b ∈ B. In computing group similarity based on the group average, a decision may be required on whether we compute the average over every possible pair or over a subset of the possible pairs. For example, suppose we have the pair of value sets {20,5}:{4,5} and use the city block measure as the distance function. One way to compute a group average for this pair of value sets is to compute it from every possible pair, (16+15+1+0)/4; the other is to compute it only from the corresponding pairs of distances, (1+15)/2, after sorting each value set. In the latter approach, sorting may help reduce unnecessary computation, although it requires additional sorting time. For example, the total difference over every possible pair for the value sets {2,5} and {6,3} is 8, while the total of the sorted individual value differences for the same sets, {2,5} and {3,6}, is 2. The example shows that computing similarity after sorting the value sets may result in a better similarity index between multivalued objects.
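The two averaging strategies just illustrated can be sketched as follows (a minimal sketch following the text's examples; the function names are ours, and the sorted variant assumes bags of equal cardinality for simplicity):

```python
def city_block(x, y):
    # City block (L1) distance between two scalar values.
    return abs(x - y)

def every_pair_avg(A, B, d=city_block):
    """Average distance over all |A|*|B| cross pairs of two bags."""
    pairs = [(a, b) for a in A for b in B]
    return sum(d(a, b) for a, b in pairs) / len(pairs)

def sorted_pair_avg(A, B, d=city_block):
    """Average distance over corresponding elements after sorting each bag."""
    A, B = sorted(A), sorted(B)
    return sum(d(a, b) for a, b in zip(A, B)) / len(A)

# The example from the text: {20,5} vs {4,5}
print(every_pair_avg([20, 5], [4, 5]))   # (16+15+1+0)/4 = 8.0
print(sorted_pair_avg([20, 5], [4, 5]))  # (1+15)/2 = 8.0
```

For the second example in the text, the every-pair total for {2,5} and {6,3} is 8, while the sorted-pair total is 2, so the sorted variant yields the smaller (tighter) dissimilarity.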
We call the former the every-pair approach and the latter the sorted-pair approach. The group average approach considers both the cardinality and the quantitative variance of the elements in a group when computing the similarity between two groups of values.

A FRAMEWORK FOR SIMILARITY MEASURES

A similarity measure proposed by Gower (1971) is particularly useful for data sets that contain a variety of attribute types. It is defined as:

S(a,b) = Σ_{i=1}^{m} w_i s_i(a_i,b_i) / Σ_{i=1}^{m} w_i

In this formula, s_i(a_i,b_i) is the normalized similarity index, in the range of 0 and 1, between the objects a and b as measured by the function s_i for the ith attribute, and w_i is the weight for the ith attribute. The weight w_i can also be used as a mask, depending on the validity of the similarity comparison on the ith attribute, which may be unknown or irrelevant for the similarity computation for a pair of objects. We can extend Gower's similarity function to measure similarity for extended data sets with mixed types. The similarity function can consist of two subfunctions: a similarity for the l qualitative attributes and a similarity for the q quantitative attributes. We assume each attribute has type information, since a data analyst can easily provide the type information for attributes. The following formula represents the extended similarity function:

S(a,b) = [Σ_{i=1}^{l} w_i s_l(a_i,b_i) + Σ_{j=1}^{q} w_j s_q(a_j,b_j)] / (Σ_{i=1}^{l} w_i + Σ_{j=1}^{q} w_j),

where m = l + q. The functions s_l(a,b) and s_q(a,b) are similarity functions for qualitative
attributes and quantitative attributes, respectively. For each type of similarity measure, the user makes the choice of a specific similarity measure and proper weights based on the attribute types and the application. For example, for the similarity function s_l(a,b), we can use Tversky's set similarity measure for the l qualitative attributes. For the similarity function s_q(a,b), we can use the group similarity function for the q quantitative attributes. The quantitative type of multivalued objects has an additional property, the group feature property, which includes cardinality information as well as the quantitative property. Therefore, s_q(a,b) may consist of two subfunctions to measure group features and group quantity, s_q(a,b) = s_l(a,b) + s_g(a,b), where the functions s_l(a,b) and s_g(a,b) can be Tversky's set similarity and the group average similarity function, respectively. The main objective of using Tversky's set similarity here is to give more weight to the common features of a pair of objects.

AN ARCHITECTURE FOR DATABASE CLUSTERING

The unified similarity measure requires basic information, such as attribute types (i.e., qualitative or quantitative), weights, and the range values of quantitative attributes, before it can be applied. Figure 3 shows the architecture of the interactive database clustering environment we are currently developing.

Figure 3: Architecture of a Database Clustering Environment. Its components are a DBMS, a data extraction tool that produces an extended data set, a similarity measure tool backed by a library of similarity measures, a clustering tool that produces a set of clusters, a user interface (supplying type and weight information as well as default choices and domain information), and MASSON, which produces a set of discovered queries.

The data extraction tool generates an extended data set from a database based on user requirements. The similarity measure tool assists the user in constructing a similarity measure that is appropriate for his/her application.
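To make the framework above concrete, a similarity measure such a tool might construct, combining Tversky's ratio model for qualitative attributes with an every-pair group average for quantitative ones, could look like the following sketch. All identifiers here are ours, the bags are treated as crisp sets on the qualitative side, and normalizing the quantitative distance by the attribute's range is one simple choice among several.

```python
def tversky_ratio(A, B, alpha=1.0, beta=1.0):
    """Tversky's ratio model on crisp sets of feature values, in [0, 1];
    with alpha = beta = 1 it reduces to the Jaccard index."""
    A, B = set(A), set(B)
    common = len(A & B)
    denom = common + alpha * len(A - B) + beta * len(B - A)
    return common / denom if denom else 1.0

def group_avg_sim(A, B, value_range):
    """Every-pair group-average similarity for quantitative bags,
    normalized by the attribute's range so it falls in [0, 1]."""
    pairs = [(a, b) for a in A for b in B]
    if not pairs or value_range == 0:
        return 1.0
    avg_dist = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return 1.0 - avg_dist / value_range

def unified_similarity(x, y, schema, weights=None):
    """Gower-style weighted combination; `schema` maps an attribute name to
    ('qual', None) or ('quant', range). Weights default to 1."""
    num = den = 0.0
    for attr, (kind, rng) in schema.items():
        w = (weights or {}).get(attr, 1.0)
        s = (tversky_ratio(x[attr], y[attr]) if kind == "qual"
             else group_avg_sim(x[attr], y[attr], rng))
        num += w * s
        den += w
    return num / den

# Two extended-data-set objects in the style of Figure 2 (values illustrative).
johny = {"p.ptype": [1, 2, 3], "p.amount": [400, 70, 200]}
andy  = {"p.ptype": [2, 3],    "p.amount": [300, 100]}
schema = {"p.ptype": ("qual", None), "p.amount": ("quant", 400.0)}
print(round(unified_similarity(johny, andy, schema), 3))
```

A real similarity measure tool would let the user swap in the contrast model, the sorted-pair average, or per-attribute weights without changing the surrounding combination formula.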
Relying on a library of similarity measures, the tool interactively guides the user through the construction process, inquiring about the types, weights, and other characteristics of attributes, and offering alternatives and choices to the user if more than one similarity measure seems appropriate. In case the user cannot provide the necessary information, default assumptions are made and default choices are provided, and occasionally the necessary information is retrieved directly from the database. For example, as the default weights, the unit vector (i.e., all weights equal to one) can be used, and as the default similarity measures, Tversky's ratio model is used for qualitative types and Euclidean distance for quantitative types. The range value information (used to normalize the similarity index) for quantitative attributes can easily be retrieved from a given data set by scanning the column vectors of the quantitative attributes. The clustering tool takes the constructed similarity measure and the extended data set as its input and
applies a clustering algorithm chosen by the user, such as nearest-neighbor (Everitt93), to the extended data set. Finally, MASSON (Ryu96a) takes the objects (identified only by object ids) from each cluster and returns a set of discovered queries that describe the commonalities of the set of objects in the given cluster. MASSON is a query discovery system that uses database queries as a rule representation language (Ryu96b). It discovers a set of discriminant queries (i.e., queries that describe only the given set of objects in a cluster and no objects in other clusters) in structured databases (Ryu98) using genetic programming (Koza90).

SUMMARY AND CONCLUSION

In this paper, we analyzed the problem of generating a single flat file to represent data sets that have been extracted from structured databases, and pointed out its representational inappropriateness for related information, a fact that has frequently been overlooked by recent data mining research. To overcome these difficulties, we introduced a better representation scheme, called the extended data set, which allows attributes of an object to have a bag of values, and discussed how existing similarity measures for single-valued attributes can be generalized to measure group similarity for extended data sets in clustering. We also proposed a unified framework for similarity measures to cope with extended data sets with mixed types by extending Gower's work. Once the target database is grouped into clusters with similar properties, the discriminant query discovery system MASSON can discover useful characteristic information for the set of objects that belong to a cluster. We claim that the proposed representation scheme is suitable for coping with related information and that it is more expressive than the traditional single flat file format. More importantly, the relationship information in a structured database is actually considered in the clustering process.

REFERENCES

Everitt, B.S. (1993).
Cluster Analysis, Edward Arnold, co-published by Halsted Press, an imprint of John Wiley & Sons Inc., 3rd edition.

Gower, J.C. (1971). A general coefficient of similarity and some of its properties, Biometrics 27.

Koza, J.R. (1990). Genetic Programming: On the Programming of Computers by Means of Natural Selection, Cambridge, MA: The MIT Press.

Quinlan, J. (1993). C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.

Ribeiro, J.S., Kaufmann, K., and Kerschberg, L. (1995). Knowledge Discovery from Multiple Databases, In Proc. of the 1st Int'l Conf. on Knowledge Discovery and Data Mining, Montreal, Quebec.

Ryu, T.W. and Eick, C.F. (1996a). Deriving Queries from Results using Genetic Programming, In Proceedings of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon.

Ryu, T.W. and Eick, C.F. (1996b). MASSON: Discovering Commonalities in Collection of Objects using Genetic Programming, In Proceedings of the Genetic Programming 1996 Conference, Stanford University, San Francisco.

Ryu, T.W. and Eick, C.F. (1998). Automated Discovery of Discriminant Rules for a Group of Objects in Databases, In Conference on Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA, June.

Thompson, K., and Langley, P. (1991). Concept formation in structured domains, In Concept Formation: Knowledge and Experience in Unsupervised Learning, Eds. Fisher, D.H., Pazzani, M., and Langley, P., Morgan Kaufmann.

Tversky, A. (1977). Features of similarity, Psychological Review, 84(4), July.
More informationFuzzyRough Feature Significance for Fuzzy Decision Trees
FuzzyRough Feature Significance for Fuzzy Decision Trees Richard Jensen and Qiang Shen Department of Computer Science, The University of Wales, Aberystwyth {rkj,qqs}@aber.ac.uk Abstract Crisp decision
More informationUNIT 3 DATABASE DESIGN
UNIT 3 DATABASE DESIGN Objective To study design guidelines for relational databases. To know about Functional dependencies. To have an understanding on First, Second, Third Normal forms To study about
More informationEnhanced Image Retrieval using Distributed Contrast Model
Enhanced Image Retrieval using Distributed Contrast Model Mohammed. A. Otair Faculty of Computer Sciences & Informatics Amman Arab University Amman, Jordan Abstract Recent researches about image retrieval
More informationUNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES
UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES Data PreprocessingData Cleaning, Integration, Transformation, Reduction, Discretization Concept HierarchiesConcept Description: Data Generalization And
More informationUsing Decision Boundary to Analyze Classifiers
Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision
More informationCHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED KNEAREST NEIGHBOR (MKNN) ALGORITHM
CHAPTER 4 STOCK PRICE PREDICTION USING MODIFIED KNEAREST NEIGHBOR (MKNN) ALGORITHM 4.1 Introduction Nowadays money investment in stock market gains major attention because of its dynamic nature. So the
More informationSupervised and Unsupervised Learning (II)
Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346  Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised
More informationFuzzy IfThen Rules. Fuzzy IfThen Rules. Adnan Yazıcı
Fuzzy IfThen Rules Adnan Yazıcı Dept. of Computer Engineering, Middle East Technical University Ankara/Turkey Fuzzy IfThen Rules There are two different kinds of fuzzy rules: Fuzzy mapping rules and
More informationANU MLSS 2010: Data Mining. Part 2: Association rule mining
ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements
More informationMarket basket analysis
Market basket analysis Find joint values of the variables X = (X 1,..., X p ) that appear most frequently in the data base. It is most often applied to binaryvalued data X j. In this context the observations
More informationAssociation Pattern Mining. Lijun Zhang
Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms
More informationData Mining Concepts
Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential
More informationFinding the boundaries of attributes domains of quantitative association rules using abstraction A Dynamic Approach
7th WSEAS International Conference on APPLIED COMPUTER SCIENCE, Venice, Italy, November 2123, 2007 52 Finding the boundaries of attributes domains of quantitative association rules using abstraction
More informationGenetically Enhanced Parametric Design for Performance Optimization
Genetically Enhanced Parametric Design for Performance Optimization Peter VON BUELOW Associate Professor, Dr. Ing University of Michigan Ann Arbor, USA pvbuelow@umich.edu Peter von Buelow received a BArch
More informationEfficiency of kmeans and KMedoids Algorithms for Clustering Arbitrary Data Points
Efficiency of kmeans and KMedoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai600106,
More informationData Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..
.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Mining Association Rules Definitions Market Baskets. Consider a set I = {i 1,...,i m }. We call the elements of I, items.
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 2 Sajjad Haider Spring 2010 1 Structured vs. NonStructured Data Most business databases contain structured data consisting of welldefined fields with numeric
More informationData Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes
0 Data Mining: Data What is Data? Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Collection of data objects and their attributes An attribute is a property or characteristic
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationOrganizing Your Data. Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013
Organizing Your Data Jenny Holcombe, PhD UT College of Medicine Nuts & Bolts Conference August 16, 3013 Learning Objectives Identify Different Types of Variables Appropriately Naming Variables Constructing
More informationER Model. Hi! Here in this lecture we are going to discuss about the ER Model.
ER Model Hi! Here in this lecture we are going to discuss about the ER Model. What is EntityRelationship Model? The entityrelationship model is useful because, as we will soon see, it facilitates communication
More informationThe Encoding Complexity of Network Coding
The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationAdvanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret
Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely
More informationClassifying Documents by Distributed P2P Clustering
Classifying Documents by Distributed P2P Clustering Martin Eisenhardt Wolfgang Müller Andreas Henrich Chair of Applied Computer Science I University of Bayreuth, Germany {eisenhardt mueller2 henrich}@unibayreuth.de
More informationData Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394
Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing
More informationKeywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.
Volume 6, Issue 5, May 016 ISSN: 77 18X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Fuzzy Logic in Online
More informationA NEW MILP APPROACH FOR THE FACILITY LAYOUT DESIGN PROBLEM WITH RECTANGULAR AND L/T SHAPED DEPARTMENTS
A NEW MILP APPROACH FOR THE FACILITY LAYOUT DESIGN PROBLEM WITH RECTANGULAR AND L/T SHAPED DEPARTMENTS Yossi Bukchin Michal Tzur Dept. of Industrial Engineering, Tel Aviv University, ISRAEL Abstract In
More informationsize, runs an existing induction algorithm on the rst subset to obtain a rst set of rules, and then processes each of the remaining data subsets at a
MultiLayer Incremental Induction Xindong Wu and William H.W. Lo School of Computer Science and Software Ebgineering Monash University 900 Dandenong Road Melbourne, VIC 3145, Australia Email: xindong@computer.org
More informationCSI5387: Data Mining Project
CSI5387: Data Mining Project Terri Oda April 14, 2008 1 Introduction Web pages have become more like applications that documents. Not only do they provide dynamic content, they also allow users to play
More informationDynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers
Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers A. Srivastava E. Han V. Kumar V. Singh Information Technology Lab Dept. of Computer Science Information Technology Lab Hitachi
More informationUnsupervised Learning
Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,
More informationMeshlization of Irregular Grid Resource Topologies by Heuristic SquarePacking Methods
Meshlization of Irregular Grid Resource Topologies by Heuristic SquarePacking Methods UeiRen Chen 1, ChinChi Wu 2, and Woei Lin 3 1 Department of Electronic Engineering, Hsiuping Institute of Technology
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,
More informationConcept Tree Based Clustering Visualization with Shaded Similarity Matrices
Syracuse University SURFACE School of Information Studies: Faculty Scholarship School of Information Studies (ischool) 122002 Concept Tree Based Clustering Visualization with Shaded Similarity Matrices
More informationLecture 7: Decision Trees
Lecture 7: Decision Trees Instructor: Outline 1 Geometric Perspective of Classification 2 Decision Trees Geometric Perspective of Classification Perspective of Classification Algorithmic Geometric Probabilistic...
More informationDiscovery of Multilevel Association Rules from Primitive Level Frequent Patterns Tree
Discovery of Multilevel Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania
More informationUNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania
UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of realvalued data is often used as a preprocessing
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of
More informationEnhancement of LempelZiv Algorithm to Estimate Randomness in a Dataset
, October 1921, 216, San Francisco, USA Enhancement of LempelZiv Algorithm to Estimate Randomness in a Dataset K. Koneru, C. Varol Abstract Experts and researchers always refer to the rate of error or
More informationDomain Specific Search Engine for Students
Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam
More informationFrequency Distributions
Displaying Data Frequency Distributions After collecting data, the first task for a researcher is to organize and summarize the data so that it is possible to get a general overview of the results. Remember,
More informationENTITIES IN THE OBJECTORIENTED DESIGN PROCESS MODEL
INTERNATIONAL DESIGN CONFERENCE  DESIGN 2000 Dubrovnik, May 2326, 2000. ENTITIES IN THE OBJECTORIENTED DESIGN PROCESS MODEL N. Pavković, D. Marjanović Keywords: object oriented methodology, design process
More informationClustering. Shishir K. Shah
Clustering Shishir K. Shah Acknowledgement: Notes by Profs. M. Pollefeys, R. Jin, B. Liu, Y. Ukrainitz, B. Sarel, D. Forsyth, M. Shah, K. Grauman, and S. K. Shah Clustering l Clustering is a technique
More informationEfficiently decodable insertion/deletion codes for highnoise and highrate regimes
Efficiently decodable insertion/deletion codes for highnoise and highrate regimes Venkatesan Guruswami Carnegie Mellon University Pittsburgh, PA 53 Email: guruswami@cmu.edu Ray Li Carnegie Mellon University
More informationTransforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm
Transforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm Expert Systems: Final (Research Paper) Project Daniel JosiahAkintonde December
More informationEvolving SQL Queries for Data Mining
Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper
More informationTIM 50  Business Information Systems
TIM 50  Business Information Systems Lecture 15 UC Santa Cruz Nov 10, 2016 Class Announcements n Database Assignment 2 posted n Due 11/22 The Database Approach to Data Management The Final Database Design
More information2 The IBM Data Governance Unified Process
2 The IBM Data Governance Unified Process The benefits of a commitment to a comprehensive enterprise Data Governance initiative are many and varied, and so are the challenges to achieving strong Data Governance.
More informationKrippendorff's Alphareliabilities for Unitizing a Continuum. Software Users Manual
Krippendorff's Alphareliabilities for Unitizing a Continuum Software Users Manual Date: 20161129 Written by Yann Mathet yann.mathet@unicaen.fr In consultation with Klaus Krippendorff kkrippendorff@asc.upenn.edu
More information