Document Clustering using Concept Space and Cosine Similarity Measurement

2009 International Conference on Computer Technology and Development

Lailil Muflikhah
Department of Computer and Information Science, Universiti Teknologi Petronas; Brawijaya University
Bandar Seri Iskandar, Tronoh, Perak, Malaysia

Baharum Baharudin
Department of Computer and Information Science, Universiti Teknologi Petronas
Bandar Seri Iskandar, Tronoh, Perak, Malaysia

Abstract

Document clustering is an instance of data clustering, which is one of the data mining tasks and a form of unsupervised classification. It is often applied to very large data collections in order to partition them based on similarity. It was first used in Information Retrieval to improve the precision and recall of queries. Clustering is easiest when the data have a small number of attributes that contain the important items. Furthermore, document clustering is useful in information retrieval applications because it reduces processing time and yields high precision and recall. Therefore, we propose to integrate an information retrieval method with document clustering as a concept space approach. The method is known as the Latent Semantic Index (LSI) approach, which uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). The aim of this method is to reduce the matrix dimension by finding patterns in the document collection based on the co-occurrence of terms. Each method is applied to the term-document weights in the vector space model (VSM) for document clustering using the fuzzy c-means algorithm. Besides reducing the term-document matrix, this research also uses the cosine similarity measurement as a replacement for distance in fuzzy c-means. As a result, the performance of the proposed method is better than that of the existing method, with f-measure around .9 and entropy around .5.

Keywords: data mining; clustering; LSI; SVD; PCA; fuzzy c-means; euclidean distance; cosine similarity

I. INTRODUCTION

Data mining is a technique for extracting patterns from hidden information. It finds and describes structural patterns in a data collection as a tool for explaining the data and making predictions from it. Generally, data mining tasks are divided into two major categories: predictive tasks, which aim to predict the value of a particular attribute based on the values of other attributes, and descriptive tasks, which aim to derive patterns (correlations, trends, clusters, trajectories, and anomalies) [1].

Clustering is a method for automatically organizing a large data collection by partitioning the data set so that the objects in the same cluster are more similar to one another than to objects in other clusters. Document clustering concerns the organization of a large collection of text data. In the field of Information Retrieval (IR), document clustering is used to automatically group documents that belong to the same topic in order to support the user's browsing of retrieval results [2]. Some experimental evidence shows that IR applications can benefit from the use of document clustering [3]. Document clustering has long been used as a tool to improve the performance of retrieval and the navigation of large data collections.

Clustering methods can be classified into hierarchical methods and partitioning methods. Partitioning clustering can be divided into hard clustering and overlapping (fuzzy) clustering. Any given document may cover multiple subjects or categories. This issue motivates the use of a fuzzy clustering approach, which allows a document to appear in multiple clusters. This differs from hard clustering, in which a document belongs to exactly one cluster, which amounts to assuming well-defined boundaries among the clusters. Many researchers have studied document clustering with hard clustering methods.
One study shows that the Bisecting K-Means algorithm is better than the basic K-Means algorithm and Agglomerative Hierarchical Clustering when similarity is measured with the cosine formulation [4]. Other research reports better results when the grouping process is driven by the variance value [5], or by distance [6]. Furthermore, there are several applications of the Fuzzy C-Means algorithm: it has been applied to the clustering of symbolic data, where the resulting clusters had better quality than those of a hierarchical (hard clustering) method [7], and it has also been applied in text mining [8].

The problem in grouping or clustering documents is that a very large number of terms or words are represented in the full-text vector space. Much lexical matching at the term level is inaccurate. Some words have several meanings, and several words can share the same meaning, so matching returns irrelevant documents. It is difficult to judge which documents belong to the same cluster for a specific category without selecting the terms that are meaningful or that correlate with one another. Therefore, this research uses a concept from Information Retrieval, the Latent Semantic Index (LSI). In this method, the documents are projected onto a small subspace of the vector space and then clustered. This creates a new abstract vector space in which the important terms are captured [2].

In this paper, we first describe information retrieval using the LSI concept. Then we describe document similarity as the basic concept of clustering, together with one clustering algorithm applied to document clustering, Fuzzy C-Means. After that, we present the methodology used to implement document clustering with LSI and cosine similarity, followed by the experimental evaluation. Finally, the performance is analyzed and conclusions are drawn.

II. INFORMATION RETRIEVAL

Information retrieval is the task of finding the information that matches what the user wants. Unrelated documents may be retrieved simply because terms occur in them by accident, and related documents may be missed because no term in the document occurs in the query. Therefore, retrieval can be based on concepts rather than on terms, by first mapping items to a concept space and then ranking by similarity, as shown in Fig. 1 [2].

Figure 1. Using concepts for information retrieval

Fig. 1 shows a middle layer of concepts (c1 and c2) through which queries and documents are mapped, instead of relating documents and terms directly as in vector retrieval. In this way, a query on t3, which maps to concept c2, returns d2, d3 and d4 in the answer set because they relate to concept c2, without requiring that the documents contain term t3.

A. Latent Semantic Index (LSI)

Latent semantic indexing (LSI) was originally introduced to find patterns in a document collection in order to improve accuracy in Information Retrieval. It uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA) to decompose the original matrix A of the document representation and retains only the k largest singular values from the singular value matrix. Selecting only the largest singular values means keeping the corresponding columns in the two other matrices, U and V^T. The choice of k determines how many of the important concepts the ranking is based on. The assumption is that concepts with small singular values can be considered noise and thus ignored. LSI can therefore be described by how the sizes of the matrices involved are reduced when only the first k singular values are kept for computing the ranking, and by how terms and documents are positioned in the matrix [9]. Fig. 2 depicts how LSI obtains the pattern in a document collection.

Figure 2. Description of LSI in data collection

B. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a method that finds the patterns in a matrix and identifies which words and documents are similar to each other. From the term (t) x document (d) matrix A it creates new matrices U, S and V such that $A = U S V^T$, as illustrated in Fig. 3 [10].

Figure 3. The relative sizes of the three matrices when t > d

In Fig. 3, U has orthogonal, unit-length columns ($U^T U = I$), which are called the left singular vectors; V has orthogonal, unit-length columns ($V^T V = I$), which are called the right singular vectors; and S is a diagonal matrix (k x k) of singular values, where k is the rank of A ($k \le \min(t, d)$). In general, for $A = U S V^T$ the matrices U, S and V must all be of full rank. The amount of dimension reduction needs to be chosen correctly in order to represent the real structure in the data [2].

C. Principal Component Analysis (PCA)

Principal component analysis is a method for finding k principal axes, which form an orthonormal coordinate system that captures most of the variance in the data. Basically, PCA is obtained by applying Singular Value Decomposition (SVD) to the covariance matrix, using the eigenvectors and eigenvalues of the covariance matrix [11, 12].

III. DOCUMENT DISSIMILARITY

The dissimilarity of data objects (documents) is expressed by the distance between the document taken as the cluster center and the other documents.
The distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the equation

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \tag{1}$$

where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the k-th attributes (components) of x and y [1].
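As a small illustration of Eq. (1), the sketch below computes the distance between two documents represented as plain lists of term weights. The function name `euclidean_distance` and the sample vectors are illustrative assumptions, not taken from the paper.

```python
import math


def euclidean_distance(x, y):
    """Eq. (1): square root of the summed squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


# Hypothetical term-weight vectors for two documents (n = 3 dimensions).
doc_a = [3.0, 0.0, 4.0]
doc_b = [0.0, 0.0, 0.0]
print(euclidean_distance(doc_a, doc_b))  # 5.0
```

The value depends on vector length as well as direction, so a long document and a short document on the same topic can still end up far apart under this measure.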

In contrast, similarity between data objects (documents) corresponds to a small distance within one cluster. Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. A measure of similarity for document clustering is the cosine of the angle between two vectors, as in Eq. (2) [1]:

$$\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert} \tag{2}$$

where $d_i$ and $d_j$ are two different documents.

IV. FUZZY C-MEANS CLUSTERING

There are various fuzzy clustering algorithms. One simple fuzzy clustering technique is the fuzzy c-means algorithm (FCM) by Duda and Hart [13], which marked the birth of the fuzzy method. FCM is known to produce a reasonable partitioning of the original data in many cases (see [14]) and is very fast compared to some other approaches. In addition, FCM converges to a minimum or saddle point from any initialization; this may be either a local or the global minimum of the objective function [15]. In principle, the algorithm is based on minimization of the objective function J(X; U, V). Generally, the objective function is the sum of dissimilarities weighted by the membership degree ($\mu$), as shown in Eq. (3) [16]:

$$J(X; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m \, d^2(x_k, v_i) \tag{3}$$

where d is the distance, V is the set of cluster centers $v_i$, and X is the data (document-term matrix) with items $x_k$.

V. METHODOLOGY

The proposed method uses the LSI concept in order to obtain a small vector space. The detailed steps for document clustering are as follows:

1. Document preprocessing, which includes case folding, parsing, stop word removal and stemming.
2. Removing the terms whose global frequency is less than 2 and whose local frequency is more than half of the total number of documents.
3. Representing the full-text documents as a term-document matrix A (vector space model) using TF-IDF term weights.
4. Mapping the term-document matrix A to the document matrix V in concept space using the LSI approach. Since $A = U S V^T$ and $U^T U = I$ (a property of the matrix U), it follows that $V = A^T U S^{-1}$.
5. Applying the Fuzzy C-Means clustering algorithm to the document matrix V as the representation of the document collection, using the cosine similarity $\cos(d_i, d_j)$ of Eq. (2) as a replacement for the distance in the objective function of Fuzzy C-Means.

VI. PERFORMANCE EVALUATION

Two measurements are used to assess the quality of clustering: F-measure and entropy [17]. The basic idea comes from information retrieval. In this technique, each cluster is treated as if it were the result of a query and each class as if it were the desired set of documents for that query. The F-measure is built from the precision and recall of each cluster j and class i, defined as follows:

$$\mathrm{Precision}(i, j) = \frac{n_{ij}}{n_j}; \qquad \mathrm{Recall}(i, j) = \frac{n_{ij}}{n_i} \tag{4}$$

where $n_{ij}$ is the number of documents with class label i in cluster j, $n_i$ is the number of documents with class label i, and $n_j$ is the number of documents in cluster j. The F-measure of cluster j and class i is then obtained from the following equation:

$$F(i, j) = \frac{2 \cdot \mathrm{Precision}(i, j) \cdot \mathrm{Recall}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)} \tag{5}$$

A higher F-measure indicates higher clustering accuracy, covering both precision and recall. The other measurement, which relates to the internal quality of clustering, is entropy (E), formulated as:

$$E_j = -\sum_{i} P(i, j) \log P(i, j) \tag{6}$$

where P(i, j) is the probability that a document has class label i and is assigned to cluster j. The total entropy of the clusters is obtained by summing the entropies of each cluster, weighted by the size of each cluster:

$$E = \sum_{j} \frac{n_j}{n} E_j \tag{7}$$

where $n_j$ is the size of cluster j and n is the total number of documents in the corpus. The lower the entropy, the higher the internal quality of the clusters.

VII. EXPERIMENTAL RESULT

We evaluated the performance of the proposed method using data sets taken from 20 Newsgroups [18]. The data are organized into four groups according to data volume: Binary2, Multi5, Multi7 and Multi10. They consist of short news items on various topics, which are used as the class reference in the clustering process.
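The evaluation measures of Section VI, Eqs. (4) to (7), can be sketched from a class-by-cluster count table. This is a minimal sketch under stated assumptions: `counts[i][j]` holds $n_{ij}$, P(i, j) is taken as $n_{ij} / n_j$ (the share of cluster j that carries class label i), and the natural logarithm is used. The paper does not fix either choice, and all function names are illustrative.

```python
import math


def f_measure(n_ij, n_i, n_j):
    """Eqs. (4)-(5): harmonic mean of precision and recall for class i, cluster j."""
    if n_ij == 0:
        return 0.0
    precision = n_ij / n_j  # share of cluster j that belongs to class i
    recall = n_ij / n_i     # share of class i that landed in cluster j
    return 2 * precision * recall / (precision + recall)


def cluster_entropy(column):
    """Eq. (6) for one cluster; column[i] = n_ij, assuming P(i, j) = n_ij / n_j."""
    n_j = sum(column)
    return -sum((c / n_j) * math.log(c / n_j) for c in column if c > 0)


def total_entropy(counts):
    """Eq. (7): cluster entropies weighted by relative cluster size."""
    n = sum(sum(row) for row in counts)
    columns = list(zip(*counts))  # one tuple of n_ij values per cluster j
    return sum((sum(col) / n) * cluster_entropy(col) for col in columns if sum(col) > 0)


# Hypothetical counts: rows are classes, columns are clusters.
counts = [[8, 2],
          [1, 9]]
n_i = sum(counts[0])                 # documents in class 0
n_j = sum(row[0] for row in counts)  # documents in cluster 0
print(f_measure(counts[0][0], n_i, n_j))
print(total_entropy(counts))
```

A clustering that reproduces the classes exactly gives an F-measure of 1 for each matched pair and a total entropy of 0, which matches the reading that higher F-measure and lower entropy indicate better clusters.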
The Binary2 dataset contains 2 documents (25 terms) and documents in each topic. For the other datasets as well, there are documents in each topic. The Multi5 dataset contains 5 documents (8367 terms), and the Multi7 dataset contains 7 documents (866 terms). The Multi10 dataset contains documents (25627 terms). Each dataset is clustered separately based on its number of topics. The first step is document preprocessing, which reduces the volume and density of the data; the result after preprocessing is shown in Table I.
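The preprocessing and weighting steps (steps 1 and 3 of the methodology) can be sketched as follows. This is a minimal sketch under several assumptions that do not come from the paper: the stop-word list is a tiny illustrative one, stemming and the frequency filter of step 2 are omitted, and the weight is the common variant tf(t, d) * log(N / df(t)), since the paper does not state which TF-IDF formula it uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def preprocess(text):
    """Case folding, parsing into word tokens, stop-word removal (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (terms, A) with A[t][d] = tf(t, d) * log(N / df(t))."""
    per_doc = [Counter(preprocess(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for counts in per_doc for term in counts)
    terms = sorted(df)
    matrix = [
        [counts[t] * math.log(n_docs / df[t]) for counts in per_doc]
        for t in terms
    ]
    return terms, matrix


# Hypothetical mini-corpus.
docs = [
    "The space shuttle launch",
    "Space politics and the launch",
    "Baseball game",
]
terms, A = tfidf_matrix(docs)
for term, row in zip(terms, A):
    print(term, [round(w, 3) for w in row])
```

Rows correspond to terms and columns to documents, which matches the term (t) x document (d) orientation of the matrix A used in the SVD section.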

TABLE I. DESCRIPTION OF DATASET AFTER PREPROCESSING
(Columns: Dataset, Total Topics, Docs, Terms. The Docs and Terms values did not survive in this transcription.)

Binary2 (2 topics): talk.politics.mideast, talk.politics.misc
Multi5 (5 topics): comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast
Multi7 (7 topics): alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.electronics, talk.politics.guns
Multi10 (10 topics): alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns

After that, the term weights are represented in a matrix (vector space model) using the TF-IDF method. By applying the LSI method, the dimension of the term-document matrix is reduced, as shown in Table II. The size of the reduction depends on the selected k-rank, with the optimum condition in the interval [2...5].

TABLE II. REDUCTION DIMENSION OF TERM-DOCUMENT USING LSI
(Columns: Data set, Total Documents, Total Pattern-terms (SVD), Total Pattern-terms (PCA). Rows: Binary2, Multi5, Multi7, Multi10. Only the fragments "2 24" of the cell values survive in this transcription.)

The document collection is then clustered with the Fuzzy C-Means algorithm, with a fuzziness parameter (m), an error rate ($\varepsilon$) and a specific number of clusters (c) based on the number of topics or classes. After applying the LSI method (SVD and PCA), the distribution of the term-document data for the binary2 dataset and the position of the cluster centers at a certain k-rank are illustrated in Fig. 4. The two methods give different distributions of the dataset in clustering.

Figure 4. Dataset and cluster center distribution using LSI approach (two panels: SVD method of binary2 dataset, PCA method of binary2 dataset)

VIII. DISCUSSION

This means that there are k patterns in each document collection. In order to see the effect of the k-rank on cluster quality and to find the optimum condition (high performance), we apply these methods with various k-rank values. They are applied to the binary2 dataset using SVD and PCA with k=, 2, 3,, 5, and the results are shown in Fig. 5 and Fig. 6.

Figure 5. Performance of clustering for binary2 using SVD (precision, recall, f-measure and entropy plotted against k-rank)
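The clustering step configured above (fuzziness m, error tolerance, c clusters) can be sketched in plain Python. This is a sketch under stated assumptions, not the authors' implementation: the dissimilarity is taken as 1 - cos(x, v), the cluster centers are updated as membership-weighted means exactly as in standard FCM, and the defaults `m=2.0` and `eps=1e-5` are placeholders, since the paper's values are not legible in this transcription.

```python
import math
import random


def cosine_dissimilarity(x, v):
    """1 - cos(x, v): assumed form of the cosine-based replacement for distance."""
    dot = sum(a * b for a, b in zip(x, v))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def fuzzy_c_means(data, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Return (memberships, centers); memberships[k][i] is the degree of item k in cluster i."""
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([r / total for r in row])
    for _ in range(max_iter):
        # Centers: means of the data weighted by membership**m.
        centers = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[k] * data[k][j] for k in range(n)) / total for j in range(dim)]
            )
        # Memberships: inverse dissimilarity raised to 2 / (m - 1), normalised per item.
        new_u = []
        for k in range(n):
            d = [max(cosine_dissimilarity(data[k], centers[i]), 1e-12) for i in range(c)]
            inv = [(1.0 / di) ** (2.0 / (m - 1.0)) for di in d]
            s = sum(inv)
            new_u.append([x / s for x in inv])
        change = max(abs(new_u[k][i] - u[k][i]) for k in range(n) for i in range(c))
        u = new_u
        if change < eps:  # stop once memberships no longer move
            break
    return u, centers


# Hypothetical documents in a 2-dimensional concept space (two clear directions).
data = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.0, 1.0], [0.1, 0.9]]
memberships, centers = fuzzy_c_means(data, c=2)
for row in memberships:
    print([round(x, 3) for x in row])
```

Each row of the returned membership matrix sums to 1, so a document that touches several topics can receive a non-trivial degree in more than one cluster. For evaluation against class labels, one option is to assign each document to its highest-membership cluster.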

Figure 6. Performance of clustering for binary2 using PCA (precision, recall, f-measure and entropy plotted against k-rank)

The clustering performance of each LSI method reaches its optimum at a different k-rank. For binary2, the optimum for SVD is at k-rank=8, and for PCA it is at k-rank=6. The same procedure is applied to the other data sets: multi5, multi7 and multi10. Furthermore, Fig. 7 compares, for all data sets, the performance of document clustering without LSI and with LSI together with the cosine similarity measurement as a replacement for distance.

Figure 7. Performance comparison of document clustering (precision, recall, f-measure and entropy for binary2, multi5, multi7 and multi10, with and without cosine similarity)

Fig. 7 shows that the accuracy of document clustering without LSI, which uses distance, is very low, especially for large data volumes (multi7 and multi10). On the multi7 dataset it has precision=.464, recall=.324 and f-measure=.38, together with high entropy. The same happens on the multi10 dataset, which has the worst performance, with precision=.4, recall=.34 and f-measure=.372, again with high entropy. In contrast, when the LSI approach and cosine similarity as a replacement for distance are applied in the Fuzzy C-Means algorithm, the performance in terms of both external and internal cluster quality is very high. The multi7 dataset obtains f-measure (svd:.96; pca:.93) and entropy (svd:.597; pca:.69), and multi10 obtains f-measure (svd:.888; pca:.887) and entropy (svd:.7; pca:.732). The performance on the remaining data sets also increases, although not significantly.

IX. CONCLUSION

Document clustering can be carried out using concept space and cosine similarity. This approach significantly reduces the dimension of the term-document matrix according to the k-rank (the total number of patterns). The average performance is also very high, with f-measure about .9 and entropy about .5.
The improvement is significant when the method is applied to large data volumes (the multi7 and multi10 datasets), with an increase of more than 5%.

REFERENCES

1. Pang-Ning Tan, M. Steinbach, Vipin Kumar, Introduction to Data Mining. Pearson International ed. 2006: Pearson Education, Inc.
2. M.A. Hearst and J.O. Pedersen, Reexamining the cluster hypothesis. 1996: In Proceedings of SIGIR '96.
3. Jardine, N. and van Rijsbergen, C.J., The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval.
4. Steinbach, M., Karypis, G., Kumar, V., A Comparison of Document Clustering Techniques. 2000, University of Minnesota.
5. Savaresi, S.M., D.L. Boley, S. Bittanti and G. Gazzaniga, Cluster Selection in Divisive Clustering Algorithms.
6. Larose, D.T., An Introduction to Data Mining. Discovering Knowledge in Data. 2005: Wiley & Sons, Inc.
7. El-Sonbaty, Y. and Ismail, M.A., Fuzzy Clustering for Symbolic Data. IEEE Transactions on Fuzzy Systems.
8. Rodrigues, M.E.S.M. and Sacks, L., A Scalable Hierarchical Fuzzy Clustering Algorithm for Text Mining. In The 5th International Conference on Recent Advances in Soft Computing.
9. Aberer, K., EPFL-SSC, Laboratoire de systèmes d'informations répartis, Editor.
10. S. Deerwester, et al., Indexing by latent semantic analysis. Journal of American Society for Information Science and Technology, 1990. 41.
11. Smith, L., A Tutorial on Principal Component Analysis.
12. Shlens, J., A Tutorial on Principal Component Analysis.
13. Bezdek, J.C., Fuzzy Mathematics in Pattern Classification. 1973, Cornell University: Ithaca, New York.
14. Bezdek, J.C., Pattern Recognition with Fuzzy Objective Function Algorithms. 1981: Plenum Press.
15. Hathaway, R., Bezdek, J., and Tucker, W., An Improved Convergence Theory for the Fuzzy ISODATA Clustering Algorithms. Analysis of Fuzzy Information, (Boca Raton: CRC Press).
16. Miyamoto, S., Ichihashi, H., Honda, K., Algorithms for Fuzzy Clustering. Methods in c-Means Clustering with Applications. Studies in Fuzziness and Soft Computing. Osaka, Japan: Scientific Publishing Services Pvt. Ltd., Chennai, India.
17. Bjornar Larsen and Chinatsu Aone, Fast and Effective Text Mining Using Linear-time Document Clustering, in KDD: San Diego, California.
18. databases/2newgroups.html

Space and Cosine Similarity measures for Text Document Clustering Venkata Gopala Rao S. Bhanu Prasad A.

Space and Cosine Similarity measures for Text Document Clustering Venkata Gopala Rao S. Bhanu Prasad A. Space and Cosine Similarity measures for Text Document Clustering Venkata Gopala Rao S. Bhanu Prasad A. M.Tech Software Engineering, Associate Professor, Department of IT, Vardhaman College of Engineering,

More information

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important

More information

Cross-Instance Tuning of Unsupervised Document Clustering Algorithms

Cross-Instance Tuning of Unsupervised Document Clustering Algorithms Cross-Instance Tuning of Unsupervised Document Clustering Algorithms Damianos Karakos, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Johns Hopkins University Carey E. Priebe

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center

More information

Improved Clustering of Documents using K-means Algorithm

Improved Clustering of Documents using K-means Algorithm Improved Clustering of Documents using K-means Algorithm Merlin Jacob Department of Computer Science and Engineering Caarmel Engineering College, Perunadu Pathanamthitta, Kerala M G University, Kottayam

More information

Dimension Reduction CS534

Dimension Reduction CS534 Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072

More information

Clustering Documents in Large Text Corpora

Clustering Documents in Large Text Corpora Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering of text documents by implementation of K-means algorithms

Clustering of text documents by implementation of K-means algorithms Clustering of text documents by implementation of K-means algorithms Abstract Mr. Hardeep Singh Assistant Professor Department of Professional Studies Post Graduate Government College Sector 11, Chandigarh

More information

Concept Based Search Using LSI and Automatic Keyphrase Extraction

Concept Based Search Using LSI and Automatic Keyphrase Extraction Concept Based Search Using LSI and Automatic Keyphrase Extraction Ravina Rodrigues, Kavita Asnani Department of Information Technology (M.E.) Padre Conceição College of Engineering Verna, India {ravinarodrigues

More information

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION

HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION HARD, SOFT AND FUZZY C-MEANS CLUSTERING TECHNIQUES FOR TEXT CLASSIFICATION 1 M.S.Rekha, 2 S.G.Nawaz 1 PG SCALOR, CSE, SRI KRISHNADEVARAYA ENGINEERING COLLEGE, GOOTY 2 ASSOCIATE PROFESSOR, SRI KRISHNADEVARAYA

More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreacate it, it may

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Vector Space Models: Theory and Applications

Vector Space Models: Theory and Applications Vector Space Models: Theory and Applications Alexander Panchenko Centre de traitement automatique du langage (CENTAL) Université catholique de Louvain FLTR 2620 Introduction au traitement automatique du

More information

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Exploratory data analysis tasks Examine the data, in search of structures

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Clustering and Dimensionality Reduction. Stony Brook University CSE545, Fall 2017

Clustering and Dimensionality Reduction. Stony Brook University CSE545, Fall 2017 Clustering and Dimensionality Reduction Stony Brook University CSE545, Fall 2017 Goal: Generalize to new data Model New Data? Original Data Does the model accurately reflect new data? Supervised vs. Unsupervised

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Improving the Performance of K-Means Clustering For High Dimensional Data Set

Improving the Performance of K-Means Clustering For High Dimensional Data Set Improving the Performance of K-Means Clustering For High Dimensional Data Set P.Prabhu Assistant Professor in Information Technology DDE, Alagappa University Karaikudi, Tamilnadu, India N.Anbazhagan Associate

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search An introduction Introduction Text mining refers to data mining using text documents as data. Most text mining tasks use Information Retrieval (IR) methods

This is the published version:

This is the published version: Ren, Yongli, Ye, Yangdong and Li, Gang 2008, The density connectivity information bottleneck, in Proceedings of the 9th International Conference for Young Computer Scientists,

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

Unsupervised Learning

Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover

Bishnu Prasad Gautam and Dipesh Shrestha

Bishnu Prasad Gautam and Dipesh Shrestha In this paper we discuss a new model for document clustering which has been adapted using non-negative matrix factorization during our research. The key idea is

Mining Web Data. Lijun Zhang

Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

Learning Spatially Variant Dissimilarity (SVaD) Measures

Learning Spatially Variant Dissimilarity (SVaD) Measures Krishna Kummamuru IBM India Research Lab Block 1, IIT, Hauz Khas New Delhi 110016 INDIA kkummamu@in.ibm.com Raghu Krishnapuram IBM India Research

An Improved Clustering Method for Text Documents Using Neutrosophic Logic

An Improved Clustering Method for Text Documents Using Neutrosophic Logic Nadeem Akhtar, Mohammad Naved Qureshi and Mohd Vasim Ahamad 1 Introduction As a technique of Information Retrieval, we can consider

Methods for Intelligent Systems

Methods for Intelligent Systems Lecture Notes on Clustering (II) Davide Eynard eynard@elet.polimi.it Department of Electronics and Information Politecnico di Milano Davide Eynard - Lecture Notes on Clustering

Hierarchical Co-Clustering Based on Entropy Splitting

Hierarchical Co-Clustering Based on Entropy Splitting Wei Cheng 1, Xiang Zhang 2, Feng Pan 3, and Wei Wang 4 1 Department of Computer Science, University of North Carolina at Chapel Hill, 2 Department

Cluster Analysis: Agglomerate Hierarchical Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical

Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

Introduction to Information Retrieval

Introduction to Information Retrieval Mohsen Kamyar Fourth Annual Workshop of the Web Technology Laboratory, Bahman 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

CSE 494: Information Retrieval, Mining and Integration on the Internet

CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:

Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods

Cluster Analysis for Effective Information Retrieval through Cohesive Group of Cluster Methods Prof. S.N. Sawalkar 1, Ms. Sheetal Yamde 2 1Head Department of Computer Science and Engineering, Computer

Object and Action Detection from a Single Example

Object and Action Detection from a Single Example Peyman Milanfar* EE Department University of California, Santa Cruz *Joint work with Hae Jong Seo AFOSR Program Review, June 4-5, 29 Take a look at this:

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING

SECOND EDITION IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING With Algorithms for ENVI/IDL Morton J. Canty CRC Press Taylor & Francis Group Boca Raton London New York CRC

Keywords: clustering algorithms, unsupervised learning, cluster validity

Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

Image Processing. Image Features

Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California

Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California 90089-0781 {shli, danzig}@cs.usc.edu

Introduction to Mobile Robotics

Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

Latent Semantic Indexing

Latent Semantic Indexing Thanks to Ian Soboroff Information Retrieval 1 Issues: Vector Space Model Assumes terms are independent Some terms are likely to appear together synonyms, related words spelling

Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms

Engineering, Technology & Applied Science Research Vol. 8, No. 1, 2018, 2562-2567 2562 Empirical Analysis of Single and Multi Document Summarization using Clustering Algorithms Mrunal S. Bewoor Department

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

Evolutionary Clustering using Frequent Itemsets

Evolutionary Clustering using Frequent Itemsets by K. Ravi Shankar, G.V.R. Kiran, Vikram Pudi in SIGKDD Workshop on Novel Data Stream Pattern Mining Techniques (StreamKDD) Washington D.C, USA Report No:

Application of k-nearest Neighbor on Feature. Tuba Yavuz and H. Altay Guvenir. Bilkent University

Application of k-nearest Neighbor on Feature Projections Classifier to Text Categorization Tuba Yavuz and H. Altay Guvenir Department of Computer Engineering and Information Science Bilkent University 06533

ECM A Novel On-line, Evolving Clustering Method and Its Applications

ECM A Novel On-line, Evolving Clustering Method and Its Applications Qun Song 1 and Nikola Kasabov 2 1, 2 Department of Information Science, University of Otago P.O Box 56, Dunedin, New Zealand (E-mail:

Information Retrieval. (M&S Ch 15)

Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

Towards Understanding Latent Semantic Indexing. Second Reader: Dr. Mario Nascimento

Towards Understanding Latent Semantic Indexing Bin Cheng Supervisor: Dr. Eleni Stroulia Second Reader: Dr. Mario Nascimento TABLE OF CONTENTS ABSTRACT 1 INTRODUCTION 2 RELATED WORKS 2.1 TRADITIONAL

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1 Clustering images Set of Images [Goldberger et al.]

Chapter 7 UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION

UNSUPERVISED LEARNING TECHNIQUES FOR MAMMOGRAM CLASSIFICATION Supervised and unsupervised learning are the two prominent machine learning algorithms used in pattern recognition and classification. In this

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Information Retrieval INFO 4300 / CS 4300 Presenting Results Clustering Clustering Results Result lists often contain documents related to different aspects of the query topic. Clustering is used to

highest cosine coefficient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

Searching Information Servers Based on Customized Profiles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

An Improved Fuzzy K-Medoids Clustering Algorithm with Optimized Number of Clusters

An Improved Fuzzy K-Medoids Clustering Algorithm with Optimized Number of Clusters Akhtar Sabzi Department of Information Technology Qom University, Qom, Iran asabzii@gmail.com Yaghoub Farjami Department

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

Content-based Dimensionality Reduction for Recommender Systems

Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS Toomas Kirt Supervisor: Leo Võhandu Tallinn Technical University Toomas.Kirt@mail.ee Abstract: Key words: For the visualisation

Text clustering based on a divide and merge strategy

Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015) 825-832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

Visualization of Text Document Corpus

Informatica 29 (2005) 497-502 Visualization of Text Document Corpus Blaž Fortuna, Marko Grobelnik and Dunja Mladenić Jozef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia E-mail: {blaz.fortuna,

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:

Part I: Data Mining Foundations

Table of Contents 1. Introduction 1.1. What is the World Wide Web? 1.2. A Brief History of the Web and the Internet 1.3. Web Data Mining 1.3.1. What is Data Mining? 1.3.2. What is Web Mining?

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised

Collaborative Filtering based on User Trends

Collaborative Filtering based on User Trends Panagiotis Symeonidis, Alexandros Nanopoulos, Apostolos Papadopoulos, and Yannis Manolopoulos Aristotle University, Department of Informatics, Thessaloniki 54124,

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

On-Lib: An Application and Analysis of Fuzzy-Fast Query Searching and Clustering on Library Database

On-Lib: An Application and Analysis of Fuzzy-Fast Query Searching and Clustering on Library Database Ashritha K.P, Sudheer Shetty 4th Sem M.Tech, Dept. of CS&E, Sahyadri College of Engineering and Management,

Image Analysis, Classification and Change Detection in Remote Sensing

Image Analysis, Classification and Change Detection in Remote Sensing WITH ALGORITHMS FOR ENVI/IDL Morton J. Canty Taylor & Francis Taylor & Francis Group Boca Raton London New York CRC is an imprint

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

Facial Expression Recognition using Principal Component Analysis with Singular Value Decomposition

ISSN: 2321-7782 (Online) Volume 1, Issue 6, November 2013 International Journal of Advance Research in Computer Science and Management Studies Research Paper Available online at: www.ijarcsms.com Facial

CLASSIFICATION AND CHANGE DETECTION

IMAGE ANALYSIS, CLASSIFICATION AND CHANGE DETECTION IN REMOTE SENSING With Algorithms for ENVI/IDL and Python THIRD EDITION Morton J. Canty CRC Press Taylor & Francis Group Boca Raton London New York CRC

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS H.S Behera Department of Computer Science and Engineering, Veer Surendra Sai University

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS490W Text Clustering Luo Si Department of Computer Science Purdue University [Borrows slides from Chris Manning, Ray Mooney and Soumen Chakrabarti] Clustering Document clustering Motivations Document

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

Fuzzy C-means Clustering with Temporal-based Membership Function

Indian Journal of Science and Technology, Vol (S()), DOI:./ijst//viS/, December ISSN (Print) : - ISSN (Online) : - Fuzzy C-means Clustering with Temporal-based Membership Function Aseel Mousa * and Yuhanis

Data Distortion for Privacy Protection in a Terrorist Analysis System

Data Distortion for Privacy Protection in a Terrorist Analysis System Shuting Xu, Jun Zhang, Dianwei Han, and Jie Wang Department of Computer Science, University of Kentucky, Lexington KY 40506-0046, USA

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

Chuck Cartledge, PhD. 23 September 2017

Introduction Definitions Numerical data Hands-on Q&A Conclusion References Files Big Data: Data Analysis Boot Camp Agglomerative Clustering Chuck Cartledge, PhD 23 September 2017 1/30 Table of contents