A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)


A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)

Eman Abdu, eha90@aol.com, Graduate Center, The City University of New York
Douglas Salane, dsalane@jjay.cuny.edu, Center for Cybercrime Studies, John Jay College of Criminal Justice, The City University of New York

Presentation Outline

Motivation and Intended Applications
Overview and Background
SCCADDS Algorithm
Experimental Evaluation and Comparative Analysis
Conclusion and Future Work

Problem

The goal is to cluster a data set of data objects, where each data object contains attributes (features). Each attribute is categorical, that is, it can take one value from a finite domain. The possible values of each categorical attribute are discrete nominal values with no pre-defined order or relationship.

Intended Applications

Cluster crime data: the FBI's National Incident-Based Reporting System (NIBRS)
Categorical attributes
High-dimensional data
Large and complex database (incident, offense, victim, property, offender, arrestee)

Clustering Categorical Data - Challenges

Lack of geometric interpretation of data objects
Difficulty in using traditional similarity measures such as the cosine measure or Euclidean distance
Difficulty in using other numeric techniques such as weighted averages and other statistical measures (e.g., standard deviation, variance)
High dimensionality

Current Techniques

Parameters that are difficult to tune
Not suitable for large data sets: time complexity is often quadratic in the number of data objects
Do not handle high-dimensional data

Solution

Categorical data: CLICKS, CACTUS, LIMBO, COOLCAT, STIRR, ROCK
High-dimensional data: spectral techniques (SVD, PCA, LSI)
Efficiency and simplicity: K-means, K-modes, K-representatives
SCCADDS sits at the intersection of these approaches.

Categorical Data in Binary Format

Each data object attribute is represented as a binary vector with multiple components; each component corresponds to one attribute domain value. K-means and spectral clustering algorithms can then be used on the binary records.

Record  color  shape   Red  Blue  Square  Circle
1       red    square  1    0     1       0
2       blue   circle  0    1     0       1
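The encoding above can be sketched in a few lines of Python. This is a minimal illustration of one-hot (binary) encoding for the slide's two-attribute example; the attribute domains and the helper `to_binary` are illustrative, not part of SCCADDS itself.

```python
import numpy as np

# The slide's example: two records over the attributes (color, shape).
records = [("red", "square"), ("blue", "circle")]

# One binary component per attribute domain value, in a fixed order.
domains = [("red", "blue"), ("square", "circle")]

def to_binary(record):
    """Encode one categorical record as a flat 0/1 vector."""
    vec = []
    for value, domain in zip(record, domains):
        vec.extend(1 if value == v else 0 for v in domain)
    return vec

binary = np.array([to_binary(r) for r in records])
# record 1 (red, square)  -> [1, 0, 1, 0]
# record 2 (blue, circle) -> [0, 1, 0, 1]
```

The resulting 0/1 matrix is exactly the table on the slide, and numeric algorithms such as k-means or spectral methods can operate on its rows.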

Singular Value Decomposition (SVD)

The singular value decomposition (SVD) of an m x n matrix A is the decomposition A = USV^t, where
U is an m x m orthogonal matrix,
V is an n x n orthogonal matrix, and
S is an m x n matrix with nonzero entries on the diagonal only.
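The definition can be checked numerically. Below is a small sketch using NumPy's `linalg.svd` on an arbitrary 3 x 2 matrix (the matrix values are made up for illustration).

```python
import numpy as np

# A small m x n matrix (m = 3, n = 2); the values are arbitrary.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# full_matrices=True gives U (m x m), Vt (n x n); s holds the
# min(m, n) diagonal entries of the m x n matrix S.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n S and confirm A = U S V^t up to round-off.
S = np.zeros_like(A)
S[:len(s), :len(s)] = np.diag(s)
assert np.allclose(A, U @ S @ Vt)

# U and V are orthogonal: U^t U = I and V^t V = I.
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Vt @ Vt.T, np.eye(2))
```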

Truncated SVD of a Term/Document Matrix (Latent Semantic Indexing)

Y is a matrix of m terms (rows) by n documents (columns). Its rank-k approximation is Y_k = U_k S_k V_k^t.

LSI

Terms and documents are projected into a reduced space of k dimensions via the matrices U_k and V_k. LSI allows comparisons
between terms (rows of U_k S_k),
between documents (rows of V_k S_k), and
between terms and documents (rows of U_k sqrt(S_k) and rows of V_k sqrt(S_k)).
LSI overcomes the problems of high dimensionality since all comparisons are performed in the reduced space.
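These reduced-space coordinates can be sketched as follows. The term/document counts below are made up for illustration; only the projection recipe (rows of U_k S_k for terms, rows of V_k S_k for documents) follows the slide.

```python
import numpy as np

# Toy term/document matrix Y: 4 terms (rows) x 3 documents (columns).
Y = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

k = 2
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

term_coords = Uk @ Sk   # rows: terms in the k-dimensional space
doc_coords = Vk @ Sk    # rows: documents in the k-dimensional space

# The rank-k approximation Y_k = U_k S_k V_k^t used by LSI.
Yk = Uk @ Sk @ Vk.T
```

Term-term or document-document similarities (e.g., cosine) are then computed on these k-dimensional rows instead of the full high-dimensional vectors.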

SCCADDS

1. Compute the attribute categories covariance matrix AA^t.
2. Find the k largest eigenvectors and eigenvalues, i.e., matrix U_k and the diagonal matrix (S_k)^2, where AA^t U_k = U_k (S_k)^2.
3. Project the data objects into the reduced space: A^t U_k.
4. Compare each projected data object with each row of U_k (S_k)^2, i.e., form the matrix (S_k)^2 (U_k)^t A, and assign each data object (column) to a cluster (row).
5. Merge the clusters (optional).
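The steps above can be sketched in NumPy. This is a minimal illustration of the slide's pipeline only, not the authors' implementation: it omits the data summaries and the optional merge step, and the assignment rule (argmax over rows) is an assumption about how "assign a data object (column) to a cluster (row)" is resolved.

```python
import numpy as np

def sccadds_sketch(A, k):
    """Sketch of the slide's steps. A is (attribute categories x data objects),
    with binary-encoded data objects as columns. Illustrative only."""
    # Step 1: attribute-categories similarity matrix AA^t.
    C = A @ A.T
    # Step 2: the k largest eigenvectors/eigenvalues of AA^t.
    vals, vecs = np.linalg.eigh(C)          # eigh returns ascending order
    idx = np.argsort(vals)[::-1][:k]
    Uk, Sk2 = vecs[:, idx], vals[idx]
    # Step 3: project data objects into the reduced space: U_k^t A (k x m).
    projected = Uk.T @ A
    # Step 4: score objects against cluster rows of (S_k)^2 (U_k)^t A and
    # assign each object (column) to the best-scoring cluster (row).
    scores = np.diag(Sk2) @ projected
    return scores.argmax(axis=0)
```

Identical binary records always receive the same label, since their projections coincide.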

Experimental Evaluation

Standard data sets: quality testing, comparative results
Synthetic data sets: clustering quality, comparative results, scalability (data set size, dimensions)

Standard Data Sets (UCI Machine Learning Repository)

Dataset              Description                              Records  Attributes  Known labels
Soybean              Soybean diseases                              47          21             4
Congressional Votes  Two parties: Democrats and Republicans       434          16             2
Mushroom             Mushroom species: poisonous and edible     8,124          22             2

SCCADDS Quality Testing - Standard Data Sets

Data Set             Data objects  Eigenvectors  Clusters in data  Clusters found  Accuracy
Soybean                        47             2                 4               4      100%
Congressional Votes           434             1                 2               2       88%
Mushroom                    8,124             1                 2               2       89%

SCCADDS Quality Testing - Standard Data Sets

[Charts: number of clusters and clustering accuracy versus number of eigenvectors for the Soybean, Congressional Votes, and Mushroom data sets.]

Comparative Results - Congressional Votes Data Set

Algorithm                 Accuracy  Entropy  Remarks
SCCADDS                        88%    0.45   Result for 2 clusters; higher accuracy may be achieved with more clusters
K-representatives              88%     -     Average of 10 executions of the algorithm using random partitioning
Traditional hierarchical       86%     -
ROCK                           79%    0.499
LIMBO                          87%     -
COOLCAT                        85%    0.498
CLICKS                    Not available for 2 clusters
Eigencluster                    -     0.48   An accuracy level of 96% is achieved with 13 clusters

Comparative Results - Mushroom Data Set

Algorithm          Accuracy  Remarks
SCCADDS                 89%  Accuracy level for 2 clusters; higher accuracy may be achieved with more clusters
K-representative        78%  Average of 10 executions of the algorithm using random partitioning
ROCK                    57%
LIMBO                   89%
COOLCAT                 73%
CLICKS             Not available for 2 clusters
Eigencluster            81%  An accuracy level of 97% is achieved with 19 clusters (Zaki et al., 2007)

Synthetic Data Sets - Quality Testing SCCADDS

Data Set  Attribute domain size  Attributes  Attributes in a rule  Clusters  Size   Noise ratio
DS1       10-20                  10          4                      5        1,000  0
DS2       10-20                  10          4                     10        5,000  0
DS3       10-20                  10          5                     10        5,000  2%
DS4       10-20                  10          5                     10        5,000  10%
DS5       10-20                  20          10                    10        5,000  10%

Synthetic Data Sets - Quality Testing SCCADDS

Synthetic Data Sets - Scalability Testing

[Chart: scalability versus number of attributes (dimensions).]

Synthetic Data Sets - Scalability Testing SCCADDS

[Chart: scalability versus data set size.]

Conclusion

SCCADDS is a spectral-based algorithm for clustering categorical data.
Uses spectral techniques and data summaries.
Linear in the number of data objects and as such scalable to large data sets (assuming the number of attributes and categories is not too large).
Produces quality clusterings, better than most clustering algorithms in its class.
Few parameters, reasonably easy to tune.
Not highly sensitive to the number of eigenvectors.

Future Work

Optimize the implementation of SCCADDS.
Test SCCADDS on NIBRS data sets with multiple segments.
Extend SCCADDS to do fuzzy clustering (each data object may belong to more than one cluster, with an intensity measure provided).

NIBRS

An incident-based reporting system that collects detailed information regarding crime incidents.
Contains data from 1994 to 2005, covering over 29 million incidents.
A large and complex database with direct and indirect relationships between data segments.

SCCADDS - Relation to LSI

SCCADDS uses the framework of principal component analysis to find the principal components for terms and documents.
Ding et al. (2004) showed that principal components are a relaxed solution of the cluster membership indicators in the k-means clustering algorithm.
SCCADDS uses the attribute categories covariance matrix.

Clustering Algorithms

Algorithms specifically designed for categorical data:
K-modes clustering algorithm (Huang, 1998)
K-representatives clustering algorithm (San et al., 2004)
CACTUS (Ganti et al., 1999)
CLICKS (Zaki et al., 2007)
ROCK (Guha et al., 2000)
COOLCAT (Barbara et al., 2002)
STIRR (Gibson et al., 1998)
LIMBO (Andritsos et al., 2004)

Spectral algorithms:
A Divide-and-Merge Methodology for Clustering (Eigencluster; Cheng, Kannan, Vempala and Wang, 2006)

Spectral-Based Methods

An effective technique for data reduction.
Has successfully been used in clustering large and high-dimensional data sets.
Theoretical foundations: graph partitioning (Shi and Malik, 2000), SSE (Ding and He, 2004), variance-based clustering (Drineas et al., 2004).
Not a greedy algorithm and not vulnerable to local optima.

Data Objects Matrix A

A is the matrix of data objects, where the data objects are the columns. For efficiency, we use the eigenvector decomposition of AA^t to arrive at the U (attribute values) and V (data points) matrices.

Mathematical relationships between the eigenvector decomposition and the SVD:
A = USV^t
AV = US and A^t U = VS
AA^t = USS^t U^t and A^t A = VSS^t V^t
AA^t U = USS^t

Note that AA^t is the attribute-values similarity matrix and A^t A is the document-document similarity matrix.
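These identities can be verified numerically on a toy matrix. The random 4 x 3 matrix below is purely illustrative; the checks mirror the relationships on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))          # toy matrix; data objects are the columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# A = U S V^t
assert np.allclose(A, U @ S @ Vt)
# AA^t = U S S^t U^t: the eigenvectors of AA^t are the left singular vectors
assert np.allclose(A @ A.T, U @ S @ S.T @ U.T)
# AA^t U = U S S^t: the eigenvalue equation, with eigenvalues s_i^2
assert np.allclose(A @ A.T @ U, U @ S @ S.T)
# A^t U = V S: recovers the data-object coordinates from U alone
assert np.allclose(A.T @ U, Vt.T @ S)
```

The last identity is the efficiency point on the slide: once U is obtained from the (small) attribute-values similarity matrix AA^t, the data-object projections follow without forming the (large) object-object matrix A^t A.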

Types of Clustering Algorithms

Partitional: the clusters are disjoint and not related to each other; each record belongs to only one cluster.
Hierarchical: the clusters form a tree structure (divisive vs. agglomerative hierarchical).
Fuzzy: each record is associated with a weight that defines the extent of its membership in each cluster.

SCCADDS (Spectral-Based Clustering Algorithm for Categorical Data Using Data Summaries)

Efficient: runs in time linear in the number of data objects, unlike most spectral-based clustering algorithms.
Scalable: uses the attribute-value similarity matrix instead of the data-objects similarity matrix and as such scales to very large data sets (number of records).
Produces quality clusterings, better than most algorithms in its class.
Easy to implement and tune.