THE ENSEMBLE CONCEPTUAL CLUSTERING OF SYMBOLIC DATA FOR CUSTOMER LOYALTY ANALYSIS


Marcin Pełka
Wroclaw University of Economics, Faculty of Economics, Management and Tourism, Department of Econometrics and Computer Science (marcin.pelka@ue.wroc.pl)

KEYWORDS: symbolic data analysis, ensemble clustering, conceptual clustering, customer loyalty

1 Introduction

The ensemble approach, based on aggregating information provided by different models, has proved to be a very useful tool in supervised learning. Its main goal is to increase the accuracy and stability of classification. Recently the same techniques have been applied to cluster analysis, where combining a set of different clusterings can yield a better solution. Ensemble clustering means combining (aggregating) N base clustering results (models) P1, ..., PN into one model with P* clusters (see Fred and Jain 2005). The idea of the ensemble approach, that is, combining (aggregating) the results of many base models, can also be applied to cluster analysis of symbolic data. There are several proposals for applying the ensemble approach in the context of clustering: aggregating the results of different clustering algorithms, obtaining different partitions by resampling the data, applying different subsets of variables, and applying a given algorithm many times with different parameter values or different initializations.

2 Symbolic data

In classical multivariate data analysis, the basic units under analysis are usually single individuals described by a set of quantitative (for example numerical) and/or qualitative (also known as categorical) variables, each taking exactly one value. For example, a specific car can be described by its year of production, average fuel consumption, trunk capacity, color, etc. Data are often organized in a matrix or data array, where each cell contains the value of a variable for an individual.
Nevertheless, this kind of data representation is too restrictive to take into account the variability and/or uncertainty of the data. Whether the data are obtained by contemporaneous or temporal aggregation of individual observations to obtain descriptions of the entities of interest, or whether we are facing concepts as such, specified by experts or put in evidence by clustering, we are dealing with elements that can no longer be described within the usual quantitative or qualitative framework without a loss of information. Symbolic data analysis (SDA) provides a framework where the observed variability can be effectively considered in the data representation, and where methods can be developed that take it into account. To describe groups of individuals or concepts, variables may now assume other forms of realizations (see Bock, Diday 2000 for details).

A symbolic variable can be numerical (quantitative): single-valued (real or integer) if it takes one single value, multivalued if its values are finite subsets of the domain, or interval-valued if its values are intervals. A categorical variable can be single-valued (ordinal or not) when it takes a single category from a given finite domain, or multivalued if its values are finite subsets of the domain. A categorical modal variable is a multistate variable where, for each element, we are given a category set and, for each category, a frequency or probability that indicates how frequent or likely that category is for this element.

3 Ensemble conceptual clustering

There are two main approaches that can be applied in ensemble learning for symbolic interval-valued data (see Ghaemi et al. 2009; De Carvalho et al. 2012; Hornik 2005):
1. The clustering algorithm for multiple relational matrices proposed by De Carvalho et al. This approach is based on different distance matrices.
2. Clustering ensembles that apply consensus functions. There are five main consensus functions: hypergraph partitioning, the voting approach, mutual information, finite mixture models, and co-association based functions.
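As an illustration of one of the consensus functions listed above, the co-association idea (evidence accumulation, Fred and Jain 2005) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the method used in this paper: the function names `co_association` and `consensus_by_threshold`, the threshold of 0.5, and the toy partitions are all hypothetical choices made for illustration.

```python
import numpy as np

def co_association(partitions):
    """Fraction of base partitions in which each pair of objects shares a cluster."""
    partitions = np.asarray(partitions)            # shape: (N partitions, n objects)
    n_partitions, n_objects = partitions.shape
    co = np.zeros((n_objects, n_objects))
    for labels in partitions:
        # 1 where two objects carry the same label in this partition, else 0
        co += labels[:, None] == labels[None, :]
    return co / n_partitions

def consensus_by_threshold(co, threshold=0.5):
    """Link objects whose co-association exceeds the threshold (an assumed rule)
    and return the connected components as consensus cluster labels."""
    n = co.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] >= 0:
            continue                               # already placed in a component
        labels[start] = current
        stack = [start]
        while stack:                               # depth-first walk over linked objects
            i = stack.pop()
            for j in np.flatnonzero(co[i] > threshold):
                if labels[j] < 0:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three toy base partitions of five objects (hypothetical data).
base = [[0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1]]
co = co_association(base)
labels = consensus_by_threshold(co)
print(labels)
```

Cluster labels are arbitrary within each base partition, which is why the sketch compares objects pairwise instead of comparing label values across partitions.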
However, these approaches do not allow concepts to be produced as the output of a clustering ensemble. Since Michalski introduced conceptual clustering as a new branch of machine learning (Michalski 1980), this task has received increasing attention. In conceptual clustering it is not only the inherent structure of the data that drives cluster formation, but also the description language available to the learner. A concept is an abstraction or generalization from experience, or the result of a transformation of existing concepts. The concept reifies all of its actual or potential instances, whether these are things in the real world or other ideas.

To obtain concepts as the result of ensemble clustering, an adaptation of bagging can be used. Bagging, which stands for bootstrap aggregating, is one of the earliest, most intuitive and perhaps simplest ensemble-based algorithms, with surprisingly good performance (Breiman 1996). Diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn with replacement from the entire training dataset. Each training data subset is used to train a different classifier of the same type.

In clustering, there are the following adaptations of bagging for the classical data case (Hornik 2005; Leisch 1999; Dudoit and Fridlyand 2003):
1. Leisch's (1999) adaptation of bagging, where usually a k-means-like base method is used. Centres obtained from each clustering are used as the initial data set for some clustering method (e.g. hierarchical). Objects are assigned to the closest cluster centre. This kind of approach can be used to obtain concepts that describe clusters of symbolic objects. In the first stage, subsets of objects (drawn with replacement) are obtained. Then, for each subset, dynamic clustering for symbolic data (SCLUST) is applied, and the final cluster representatives (symbolic objects) are obtained at this step. These objects are then used as initial data for pyramidal/hierarchical clustering, and the final clustering (assertion objects) is obtained.
2. Dudoit and Fridlyand's (2003) proposal, where a k-means-like algorithm is used to cluster the entire data set and each of the subsets. Then a permutation of cluster labels is found that gives the best agreement between the cluster labels for the entire data set and for the subsets.
3. Hornik's (2005) proposal, where a clustering is applied to each subset. The final solution is obtained by minimizing the distance between the elements of the ensemble and the set of all possible ensemble clusterings.

4 Short example

A short example presents the main idea of the paper. The data set contains 20 artificial symbolic objects described by two interval-valued variables, obtained from the cluster.Gen function of the clusterSim package (see Table 1). 40 subsets, each containing 14 objects (drawn with replacement), were obtained. For each subset, dynamic clustering was run with the number of clusters drawn at random from the interval [2; 8].
Cluster representatives obtained at this stage were used as the initial data set for hierarchical clustering (see Bock, Diday 2000) in the SODAS 2.50 software, and assertion objects were obtained. Objects from the OOB (out-of-bag) data set were assigned to the closest final cluster. In the end a two-cluster structure was obtained (see Table 1).

5 Aim of the paper

The article proposes applying conceptual clustering in ensemble learning of symbolic data. An adaptation of Leisch's bagging is used. In the first stage, the data are divided into subsets (bags). For each of them, the dynamic clustering algorithm is applied with a different number of clusters, and cluster representatives are obtained. These representatives are then used as the initial data set for hierarchical clustering, which is a conceptual clustering method. Assertion objects are obtained as the final result of this step. The empirical part of the paper presents the results of ensemble clustering applied to customer loyalty data.
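The two-stage procedure described above can be sketched for interval-valued data as follows. This is only a rough illustration under loud assumptions, not the SODAS implementation: plain k-means on the stacked interval bounds stands in for the dynamic clustering (SCLUST) step, a naive centroid-linkage agglomeration stands in for the hierarchical conceptual clustering step, each final concept is described by the smallest interval covering its member prototypes, and every object (not only the out-of-bag ones) is assigned to the closest final cluster. All function names and the toy data are hypothetical; the defaults (40 bags of 14 objects, cluster number drawn from [2; 8], two final clusters) mirror the short example.

```python
import numpy as np

def kmeans_intervals(X, k, rng, n_iter=20):
    """Plain k-means on stacked bounds (assumed stand-in for SCLUST).
    X has shape (n, 2p): p lower bounds followed by p upper bounds."""
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return centres[np.unique(labels)]              # keep non-empty clusters only

def agglomerate(P, n_clusters):
    """Naive centroid-linkage agglomeration of prototypes P (assumed stand-in
    for the hierarchical step); returns lists of prototype indices."""
    groups = [[i] for i in range(len(P))]
    while len(groups) > n_clusters:
        cents = np.array([P[g].mean(axis=0) for g in groups])
        d = ((cents[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        np.fill_diagonal(d, np.inf)
        a, b = sorted(np.unravel_index(d.argmin(), d.shape))
        groups[a] += groups.pop(b)                 # merge the two closest groups
    return groups

def bagged_interval_clustering(lower, upper, n_bags=40, bag_size=14,
                               k_range=(2, 8), n_final=2, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([lower, upper])
    p = lower.shape[1]
    prototypes = []
    for _ in range(n_bags):
        bag = rng.choice(len(X), size=bag_size, replace=True)   # bootstrap subset
        k = int(rng.integers(k_range[0], k_range[1] + 1))       # random cluster number
        prototypes.append(kmeans_intervals(X[bag], k, rng))
    P = np.vstack(prototypes)
    groups = agglomerate(P, n_final)
    # Concept description: smallest interval covering the member prototypes.
    concepts = [(P[g][:, :p].min(axis=0), P[g][:, p:].max(axis=0)) for g in groups]
    centres = np.array([P[g].mean(axis=0) for g in groups])
    labels = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return labels, concepts

# Toy data: 20 interval objects in two well-separated groups (hypothetical).
rng = np.random.default_rng(1)
group_centres = np.repeat([[0.0, 0.0], [10.0, 10.0]], 10, axis=0)
mid = group_centres + rng.normal(scale=0.3, size=(20, 2))   # interval midpoints
half = rng.uniform(0.2, 0.6, size=(20, 2))                  # interval half-widths
lower, upper = mid - half, mid + half
labels, concepts = bagged_interval_clustering(lower, upper)
for lo, hi in concepts:
    print([f"[{l:.2f}; {u:.2f}]" for l, u in zip(lo, hi)])
```

Treating each interval as a point of stacked bounds keeps the sketch short; a dissimilarity designed for intervals (for example a Hausdorff-type distance) could be substituted in both stages without changing the overall structure.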

Table 1. Symbolic objects

Object no.   Variable V1   Variable V2
1.           [ ; ]         [ ; ]
2.           [ ; ]         [ ; ]
3.           [ ; ]         [ ; ]
4.           [ ; ]         [ ; ]
5.           [ ; ]         [ ; ]
6.           [ ; ]         [ ; ]
7.           [ ; ]         [ ; ]
8.           [ ; ]         [ ; ]
9.           [ ; ]         [ ; ]
10.          [ ; ]         [ ; ]
11.          [ ; ]         [ ; ]
12.          [ ; ]         [ ; ]
13.          [ ; ]         [ ; ]
14.          [ ; ]         [ ; ]
15.          [ ; ]         [ ; ]
16.          [ ; ]         [ ; ]
17.          [ ; ]         [ ; ]
18.          [ ; ]         [ ; ]
19.          [ ; ]         [ ; ]
20.          [ ; ]         [ ; ]

Source: own research.

Table 1. Examples of symbolic variables

             Variable V1   Variable V2
Cluster 1    [ ; ]         [ ; ]
Cluster 2    [ ; ]         [ ; ]

Source: own research.

References

BOCK, H.-H., DIDAY, E. (Eds.) (2000), Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Berlin-Heidelberg: Springer.
BREIMAN, L. (1996), Bagging predictors. Machine Learning, vol. 24, no. 2.
DUDOIT, S., FRIDLYAND, J. (2003), Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19 (9).
FRED, A.L.N., JAIN, A.K. (2005), Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27.
GHAEMI, R., SULAIMAN, N., IBRAHIM, H., MUSTAPHA, N. (2009), A survey: Clustering ensemble techniques [in:] Proceedings of World Academy of Science, Engineering and Technology, 38.
HARTIGAN, J.A., Clustering Algorithms. New York: Wiley.
HORNIK, K. (2005), A CLUE for cluster ensembles. Journal of Statistical Software, 14.
LEISCH, F. (1999), Bagged clustering. Adaptive Information Systems and Modeling in Economics and Management Science. Working Papers, SFB, 51.
MICHALSKI, R.S. (1980), Knowledge acquisition through conceptual clustering: A theoretical framework and algorithm for partitioning data into conjunctive concepts. International Journal of Policy Analysis and Information Systems, 4.
PEŁKA, M., Ensemble approach for clustering of interval-valued symbolic data. Statistics in Transition, 13 (2).


Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Ensemble Combination for Solving the Parameter Selection Problem in Image Segmentation

Ensemble Combination for Solving the Parameter Selection Problem in Image Segmentation Ensemble Combination for Solving the Parameter Selection Problem in Image Segmentation Pakaket Wattuya and Xiaoyi Jiang Department of Mathematics and Computer Science University of Münster, Germany {wattuya,xjiang}@math.uni-muenster.de

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Classification/Regression Trees and Random Forests

Classification/Regression Trees and Random Forests Classification/Regression Trees and Random Forests Fabio G. Cozman - fgcozman@usp.br November 6, 2018 Classification tree Consider binary class variable Y and features X 1,..., X n. Decide Ŷ after a series

More information

Color-Based Classification of Natural Rock Images Using Classifier Combinations

Color-Based Classification of Natural Rock Images Using Classifier Combinations Color-Based Classification of Natural Rock Images Using Classifier Combinations Leena Lepistö, Iivari Kunttu, and Ari Visa Tampere University of Technology, Institute of Signal Processing, P.O. Box 553,

More information

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing

More information

CSC411 Fall 2014 Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel

CSC411 Fall 2014 Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel CSC411 Fall 2014 Machine Learning & Data Mining Ensemble Methods Slides by Rich Zemel Ensemble methods Typical application: classi.ication Ensemble of classi.iers is a set of classi.iers whose individual

More information

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree

Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree World Applied Sciences Journal 21 (8): 1207-1212, 2013 ISSN 1818-4952 IDOSI Publications, 2013 DOI: 10.5829/idosi.wasj.2013.21.8.2913 Decision Making Procedure: Applications of IBM SPSS Cluster Analysis

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2014 Paper 321 A Scalable Supervised Subsemble Prediction Algorithm Stephanie Sapp Mark J. van der Laan

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

Model combination. Resampling techniques p.1/34

Model combination. Resampling techniques p.1/34 Model combination The winner-takes-all approach is intuitively the approach which should work the best. However recent results in machine learning show that the performance of the final model can be improved

More information

Hierarchical and Ensemble Clustering

Hierarchical and Ensemble Clustering Hierarchical and Ensemble Clustering Ke Chen Reading: [7.8-7., EA], [25.5, KPM], [Fred & Jain, 25] COMP24 Machine Learning Outline Introduction Cluster Distance Measures Agglomerative Algorithm Example

More information

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy

Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Data Cleaning and Prototyping Using K-Means to Enhance Classification Accuracy Lutfi Fanani 1 and Nurizal Dwi Priandani 2 1 Department of Computer Science, Brawijaya University, Malang, Indonesia. 2 Department

More information

Support Vector Machine Ensemble with Bagging

Support Vector Machine Ensemble with Bagging Support Vector Machine Ensemble with Bagging Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, and Sung-Yang Bang Department of Computer Science and Engineering Pohang University of Science and Technology

More information

A Method for Construction of Orthogonal Arrays 1

A Method for Construction of Orthogonal Arrays 1 Eighth International Workshop on Optimal Codes and Related Topics July 10-14, 2017, Sofia, Bulgaria pp. 49-54 A Method for Construction of Orthogonal Arrays 1 Iliya Bouyukliev iliyab@math.bas.bg Institute

More information

CS 559: Machine Learning Fundamentals and Applications 10 th Set of Notes

CS 559: Machine Learning Fundamentals and Applications 10 th Set of Notes 1 CS 559: Machine Learning Fundamentals and Applications 10 th Set of Notes Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Office: Lieb 215

More information

Random Forests and Boosting

Random Forests and Boosting Random Forests and Boosting Tree-based methods are simple and useful for interpretation. However they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.

More information

Lecture #17: Autoencoders and Random Forests with R. Mat Kallada Introduction to Data Mining with R

Lecture #17: Autoencoders and Random Forests with R. Mat Kallada Introduction to Data Mining with R Lecture #17: Autoencoders and Random Forests with R Mat Kallada Introduction to Data Mining with R Assignment 4 Posted last Sunday Due next Monday! Autoencoders in R Firstly, what is an autoencoder? Autoencoders

More information

Exporting symbolic objects to databases

Exporting symbolic objects to databases 3 Exporting symbolic objects to databases Donato Malerba, Floriana Esposito and Annalisa Appice 3.1 The method SO2DB is a SODAS module that exports a set of symbolic objects (SOs) to a relational database

More information

8.11 Multivariate regression trees (MRT)

8.11 Multivariate regression trees (MRT) Multivariate regression trees (MRT) 375 8.11 Multivariate regression trees (MRT) Univariate classification tree analysis (CT) refers to problems where a qualitative response variable is to be predicted

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

K-MEANS BASED CONSENSUS CLUSTERING (KCC) A FRAMEWORK FOR DATASETS

K-MEANS BASED CONSENSUS CLUSTERING (KCC) A FRAMEWORK FOR DATASETS K-MEANS BASED CONSENSUS CLUSTERING (KCC) A FRAMEWORK FOR DATASETS B Kalai Selvi PG Scholar, Department of CSE, Adhiyamaan College of Engineering, Hosur, Tamil Nadu, (India) ABSTRACT Data mining is the

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

2 Renata M.C.R. de Souza et al cluster with its own representation. The advantage of these adaptive distances is that the clustering algorithm is able

2 Renata M.C.R. de Souza et al cluster with its own representation. The advantage of these adaptive distances is that the clustering algorithm is able Dynamic Cluster Methods for Interval Data based on Mahalanobis Distances Renata M.C.R. de Souza 1,Francisco de A.T. de Carvalho 1, Camilo P. Tenório 1 and Yves Lechevallier 2 1 Centro de Informatica -

More information

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

More information

On the Consequence of Variation Measure in K- modes Clustering Algorithm

On the Consequence of Variation Measure in K- modes Clustering Algorithm ORIENTAL JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY An International Open Free Access, Peer Reviewed Research Journal Published By: Oriental Scientific Publishing Co., India. www.computerscijournal.org ISSN:

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Additional File 3 - ValWorkBench: an open source Java library for cluster validation, with applications to microarray data analysis

Additional File 3 - ValWorkBench: an open source Java library for cluster validation, with applications to microarray data analysis Additional File 3 - ValWorkBench: an open source Java library for cluster validation, with applications to microarray data analysis Raffaele Giancarlo 1, Davide Scaturro 1, Filippo Utro 2 1 Dipartimento

More information

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks SOMSN: An Effective Self Organizing Map for Clustering of Social Networks Fatemeh Ghaemmaghami Research Scholar, CSE and IT Dept. Shiraz University, Shiraz, Iran Reza Manouchehri Sarhadi Research Scholar,

More information

REDUNDANCY OF MULTISET TOPOLOGICAL SPACES

REDUNDANCY OF MULTISET TOPOLOGICAL SPACES Iranian Journal of Fuzzy Systems Vol. 14, No. 4, (2017) pp. 163-168 163 REDUNDANCY OF MULTISET TOPOLOGICAL SPACES A. GHAREEB Abstract. In this paper, we show the redundancies of multiset topological spaces.

More information

A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization

A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization A Modular k-nearest Neighbor Classification Method for Massively Parallel Text Categorization Hai Zhao and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University, 1954

More information