A Modified Fuzzy Relational Clustering Approach for Sentence-Level Text

Proc. of the IEEE 2015 2nd International Conference on Electrical, Information and Communication Technology (EICT 2015), Khulna, Bangladesh, December 10-12, 2015. Note: this is the pre-print version.

A Modified Fuzzy Relational Clustering Approach for Sentence-Level Text

Sikder Tahsin Al-Amin, Mahade Hasan, and M. M. A. Hashem
Department of Computer Science and Engineering
Khulna University of Engineering and Technology
Khulna-9203, Bangladesh
stahsin.cse@gmail.com, mahade0@gmail.com, mma.hashem@outlook.com

Abstract: This paper proposes a fuzzy relational clustering (FRC) approach to find similar sentences in a set of sentences and to group them into clusters. To find similar sentences, FRC uses both word-to-word similarity and order similarity. For word-to-word similarity, FRC uses the Jiang and Conrath (JnC) similarity measure with the help of the WordNet database. Order similarity is calculated from the joint word set. Because a sentence may relate to more than one theme, FRC uses a fuzzy clustering approach, adopting the FRECCA algorithm for sentence clustering. The algorithm works by Expectation-Maximization, where the importance of a sentence is expressed by its PageRank score, which is treated as a likelihood. The PageRank scores and mixing coefficients are initialized with a uniform random number generation technique. Applying this method to a quotation dataset of different classes, we found that it is capable of identifying and grouping similar sentences into clusters. FRC was also applied to a news article dataset with admirable results.

Keywords: Fuzzy Clustering, Sentence Similarity, Relational Data.

I. INTRODUCTION

The amount of data being generated and stored is growing exponentially, which presents new opportunities and challenges for unlocking the information embedded within it. The modern field of data mining can be used to extract important knowledge from such data [1]. The present method focuses on text data only: FRC uses pre-processed text data of quotations, and it can further be applied to pre-processed text data from newspapers, websites, online blogs, and similar sources. One application is the following: on any given day many newspapers are published, and some news articles appear in all or most of them, written in different formats or language even though the underlying story is the same. The main objective is therefore to find the similarity between the sentences of such news articles.

Several works have been done on fuzzy clustering. The first was the Relational Fuzzy c-Means of Hathaway et al. [2]; ARCA [8] and fuzzy k-medoids [3] were proposed later. These algorithms have limitations. Relational Fuzzy c-Means can operate on relational data, but distances between data points cannot be measured directly. If the size of the dataset grows very large, the ARCA algorithm fails. K-medoids is affected by its poor initialization, which is done randomly.

A sentence is likely to be related to more than one theme or topic present within a document. However, because most sentence similarity measures do not represent sentences in a common metric space, a bag-of-words approach [5] or conventional fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering [6]. Hence there is a need for a fuzzy clustering algorithm that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects.
FRC finds similarities between sentences and puts them into clusters by calculating cluster membership values. As the approach is fuzzy, a sentence may reside in more than one cluster with some amount of membership. The membership values of a sentence over all clusters sum to 1, and a sentence is considered to belong to the cluster for which its membership value is largest.

This paper also improves the sentence similarity measurement. The sentence similarity method measures the similarity between two sentences based on both word-to-word similarity and order similarity. Word-to-word similarity measures the similarity between two words with the help of the lexical database WordNet [7], while order similarity accounts for the positioning of words within a sentence. In addition, the proposed method, Sentence Clustering Based on FRC (SBFRC), improves the initialization step of the algorithm: where the FRECCA algorithm initializes the membership values and PageRank values using a simple random number generation technique, SBFRC uses a uniform random number generation technique.
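As a small illustration of this fuzzy membership representation (each sentence's memberships over all clusters sum to 1, and the sentence is reported under its largest-membership cluster), the following Java sketch is given. It is our own illustrative code with assumed names, not part of the paper.

import java.util.Arrays;

// Illustrative sketch: fuzzy cluster memberships and the derived hard assignment.
public class MembershipSketch {

    // A sentence is considered to be in the cluster for which its membership value is largest.
    static int[] hardAssignment(double[][] mu) {
        int[] cluster = new int[mu.length];
        for (int i = 0; i < mu.length; i++) {
            int best = 0;
            for (int c = 1; c < mu[i].length; c++)
                if (mu[i][c] > mu[i][best]) best = c;
            cluster[i] = best;
        }
        return cluster;
    }

    public static void main(String[] args) {
        // Toy membership matrix for 3 sentences over 2 clusters; each row sums to 1.
        double[][] mu = { {0.9, 0.1}, {0.3, 0.7}, {0.55, 0.45} };
        for (double[] row : mu) {
            double sum = Arrays.stream(row).sum();
            System.out.printf("memberships %s sum to %.2f%n", Arrays.toString(row), sum);
        }
        System.out.println("hard assignment: " + Arrays.toString(hardAssignment(mu)));
    }
}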

The rest of the paper is organized as follows. Section 2 summarizes related works. Section 3 describes the methodology of SBFRC. Section 4 illustrates the experimental studies. Finally, Section 5 concludes the paper.

II. RELATED WORKS

Hathaway et al.'s Relational Fuzzy c-Means (RFCM) algorithm [2] is considered the first successful fuzzy relational clustering algorithm. Although RFCM operates on relational input data, it requires the relation expressed by these data to be Euclidean. Despite its success, this Euclidean requirement was considered limiting, and various alternatives have been proposed. For instance, the ARCA algorithm [8] uses an attribute-based representation; a limitation of this method is the high dimensionality caused by representing objects in terms of their similarity with all other objects. The k-medoid family is popular in clustering, and fuzzy versions of k-medoids have also been introduced [3]. Like k-means, k-medoids is highly sensitive to the initial selection of medoids, which is done randomly, and it often requires running the algorithm several times from different random initializations. Spectral clustering algorithms that can be applied to sentence clustering have been proposed by Zha [9] and Wang et al. [10], and Wang et al. have applied a closely related non-negative matrix factorization [11] technique to sentence clustering in the context of multi-document summarization.

Our proposed approach, Sentence Clustering Based on FRC (SBFRC), is based on the fuzzy relational clustering algorithm known as FRECCA [6]. Here, the cluster membership values of each node represent the degree to which the object (a sentence) belongs to each of the respective clusters, and the mixing coefficients represent the probability of an object being in a cluster [6].

III. SENTENCE-CLUSTERING BASED ON FRC (SBFRC)

To overcome the shortcomings of the above approaches, SBFRC is proposed in this paper to group similar sentences together in clusters. It consists of several steps, described as follows; Fig. 1 shows the flowchart of the steps of SBFRC.

Fig. 1. Flow chart of SBFRC

Step 1: The input set is produced by pre-processing a text document. The sentences are extracted from the paragraphs and the number of sentences is determined.

Step 2: The second step of SBFRC is to measure word-to-word similarity. Sentence similarity measures play an important role in text-related research. Existing methods for measuring sentence similarity have been adopted from approaches used for long text documents; these methods process sentences in a very high-dimensional space and are consequently inefficient and not adaptable to some application domains [12]. The proposed sentence similarity method derives the similarity between two sentences using both word-to-word similarity and order similarity, as illustrated in Fig. 2. A text is considered to be a sequence of words, and the words, along with their combination structure, convey a specific meaning. Unlike existing methods, the proposed method calculates both word-to-word similarity and order similarity to compute the final sentence similarity.

Fig. 2. Measuring Sentence Similarity

For word-to-word similarity, the method uses a knowledge-based measure: the WordNet [7] based measure of Jiang and Conrath [13]. The Jiang-Conrath measure is based on the idea that the degree to which two words are similar is proportional to the amount of information they share. The similarity between words w1 and w2 is defined as in equation (1):

sim(w1, w2) = 1 / ( IC(w1) + IC(w2) - 2 IC(lso(w1, w2)) )    (1)

where lso(w1, w2) is the word that is the deepest common ancestor of words w1 and w2, and IC(w) is the information content of word w, defined as IC(w) = -log p(w), where p(w) is the probability that word w appears in a large corpus [12].

Step 3: The third step of SBFRC is to measure order similarity. For each sentence, an order vector is derived from the joint word set [12], and the order similarity is calculated using the two order vectors.
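To illustrate Steps 2 and 3, the following Java sketch computes the Jiang-Conrath word similarity of equation (1) from information-content values, builds semantic and order vectors over the joint word set in the spirit of [12], and derives S_word and S_order; the two components are then combined in Step 4 below. It is an illustration only, not the authors' code: the cosine aggregation into S_word and the toy exact-match word similarity used in main are our assumptions, and a real implementation would obtain IC values and lowest common ancestors from WordNet.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

// Illustrative sketch of the word-to-word and order similarity components (not the authors' code).
public class SentenceSimilaritySketch {

    // Eq. (1): Jiang-Conrath similarity from information-content (IC) values.
    static double jcnSimilarity(double icW1, double icW2, double icLso) {
        double dist = icW1 + icW2 - 2.0 * icLso;     // JnC semantic distance
        return dist <= 0.0 ? 1.0 : 1.0 / dist;       // identical senses give maximal similarity
    }

    // Semantic vector over the joint word set: best similarity of each joint word to any word of the sentence.
    static double[] semanticVector(List<String> joint, List<String> sentence,
                                   BiFunction<String, String, Double> wordSim) {
        double[] v = new double[joint.size()];
        for (int i = 0; i < joint.size(); i++) {
            double best = 0.0;
            for (String w : sentence) best = Math.max(best, wordSim.apply(joint.get(i), w));
            v[i] = best;
        }
        return v;
    }

    // Order vector: 1-based position of each joint word in the sentence, 0 if absent.
    static double[] orderVector(List<String> joint, List<String> sentence) {
        double[] r = new double[joint.size()];
        for (int i = 0; i < joint.size(); i++) r[i] = sentence.indexOf(joint.get(i)) + 1;
        return r;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Order similarity from the two order vectors: 1 - |r1 - r2| / |r1 + r2|, as in [12].
    static double orderSimilarity(double[] r1, double[] r2) {
        double diff = 0, sum = 0;
        for (int i = 0; i < r1.length; i++) {
            diff += (r1[i] - r2[i]) * (r1[i] - r2[i]);
            sum  += (r1[i] + r2[i]) * (r1[i] + r2[i]);
        }
        return sum == 0 ? 1.0 : 1.0 - Math.sqrt(diff) / Math.sqrt(sum);
    }

    public static void main(String[] args) {
        System.out.printf("JnC with made-up IC values: %.3f%n", jcnSimilarity(7.2, 6.8, 5.9));
        List<String> s1 = Arrays.asList("nature", "is", "reckless", "of", "the", "individual");
        List<String> s2 = Arrays.asList("the", "individual", "is", "nothing", "to", "nature");
        Set<String> jointSet = new LinkedHashSet<>(s1);
        jointSet.addAll(s2);
        List<String> joint = new ArrayList<>(jointSet);
        // Toy word similarity: exact match only; a real system calls jcnSimilarity with WordNet-derived IC values.
        BiFunction<String, String, Double> wordSim = (a, b) -> a.equals(b) ? 1.0 : 0.0;
        double sWord  = cosine(semanticVector(joint, s1, wordSim), semanticVector(joint, s2, wordSim));
        double sOrder = orderSimilarity(orderVector(joint, s1), orderVector(joint, s2));
        double r = 0.8;                              // r > 0.5, as required in Step 4 / Eq. (2)
        System.out.printf("S_word=%.3f  S_order=%.3f  S=%.3f%n", sWord, sOrder, r * sWord + (1 - r) * sOrder);
    }
}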

Step 4: The fourth step is to calculate the overall similarity, as shown in Fig. 2. Word-to-word similarity represents the lexical similarity, while word order similarity provides information about the relationships between words: which words come before or after other words. The overall sentence similarity is derived by combining word-to-word similarity and order similarity, and the value is stored in the corresponding position of the similarity matrix. Overall sentence similarity is calculated using equation (2) [12]:

S = r S_word + (1 - r) S_order    (2)

where r <= 1 decides the relative contributions of word-to-word and word order information to the overall similarity computation [12]. Since word-to-word similarity is more important and plays a vital role in the overall sentence similarity, r should be a value greater than 0.5.

Step 5: The fifth step is to form the similarity matrix. The similarity method is applied to all possible pairs of sentences. After calculating the similarity values between all the sentences in a document, the values are stored in a matrix; if there are N sentences in the document, the matrix is of size N x N.

Step 6: The sixth step is to apply the PageRank algorithm [14], [15]. Unlike Gaussian mixture models, which use a likelihood function parameterized by the means and covariances of the mixture components, this method uses the PageRank score of an object within a cluster as a measure of its centrality to that cluster. These PageRank values are then treated as likelihoods. The PageRank value of object i in cluster c is calculated using equation (3):

P_i^c = (1 - d)/N + d Σ_j ( w_ji^c / Σ_k w_jk^c ) P_j^c    (3)

Here, P_i^c is the PageRank score of object i in cluster c, w_ij^c is the weight between objects i and j in cluster c (calculated as described below), and d is the damping factor. The damping factor affects the fuzziness of the clustering but generally does not affect the number of clusters, provided that its value is above approximately 0.8; in general, the higher the value of d, the harder the clustering, with cluster membership values being close to either zero or one. We have used a value of 0.8. After the PageRank scores are calculated, they are treated as likelihoods and used to calculate the cluster membership values, which are obtained using equation (4):

μ_i^c = π^c L_i^c / Σ_c' ( π^c' L_i^c' )    (4)

where π^c is the mixing coefficient for cluster c and L_i^c is the likelihood of object i in cluster c, obtained from the PageRank values. Membership values are normalized so that the memberships of an object sum to 1 over all clusters [6].

Step 7: The seventh step is the Expectation-Maximization (EM) step, which has two parts. The E-step calculates the membership values for each cluster. For each pair of sentences in a cluster, a weight is calculated using values from the similarity matrix and the membership values, as in equation (5):

w_ij^c = s_ij μ_i^c μ_j^c    (5)

In equation (5), w_ij^c is the weight between sentences i and j in cluster c, s_ij is the similarity between sentences i and j taken from the sentence similarity matrix, and μ_i^c and μ_j^c are the respective membership values of sentences i and j in cluster c [6]. With the help of this step, once the PageRank scores are calculated they are treated as likelihoods and the cluster membership values are calculated afterwards. In the M-step, the mixing coefficients are updated based on the membership values, and the procedure is repeated until convergence. The value of the mixing coefficient is the same for every sentence in a cluster, as in equation (6):

π^c = (1/N) Σ_i μ_i^c    (6)

where π^c is the mixing coefficient for cluster c, μ_i^c are the membership values calculated in the expectation step, and N is the total number of sentences.
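To make equations (3) to (6) concrete, the sketch below implements one EM pass in Java: per-cluster weights from equation (5), a power-iteration PageRank per cluster (equation (3)) treated as the likelihood, the membership update of equation (4), and the mixing-coefficient update of equation (6), with memberships and mixing coefficients initialized from uniform random numbers as proposed for SBFRC. It is a minimal sketch under our own naming and fixed iteration counts, not the authors' implementation.

import java.util.Random;

// Minimal illustrative sketch of the clustering iteration (not the authors' code).
public class FuzzyRelationalClusteringSketch {

    static final double DAMPING = 0.8;   // damping factor d used in the paper

    // One EM pass. sim is the N x N sentence similarity matrix, mu[i][c] the memberships,
    // pi[c] the mixing coefficients (updated in place). Returns the updated memberships.
    static double[][] emIteration(double[][] sim, double[][] mu, double[] pi) {
        int n = sim.length, k = pi.length;
        double[][] likelihood = new double[n][k];
        for (int c = 0; c < k; c++) {
            double[][] w = new double[n][n];                    // Eq. (5): w_ij^c = s_ij * mu_i^c * mu_j^c
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    w[i][j] = sim[i][j] * mu[i][c] * mu[j][c];
            double[] pr = pageRank(w);                          // Eq. (3): PageRank within cluster c
            for (int i = 0; i < n; i++) likelihood[i][c] = pr[i];   // treated as likelihoods
        }
        double[][] newMu = new double[n][k];
        for (int i = 0; i < n; i++) {                           // Eq. (4): membership update
            double z = 0;
            for (int c = 0; c < k; c++) { newMu[i][c] = pi[c] * likelihood[i][c]; z += newMu[i][c]; }
            for (int c = 0; c < k; c++) newMu[i][c] = (z > 0) ? newMu[i][c] / z : 1.0 / k;
        }
        for (int c = 0; c < k; c++) {                           // Eq. (6): M-step, mixing coefficients
            double s = 0;
            for (int i = 0; i < n; i++) s += newMu[i][c];
            pi[c] = s / n;
        }
        return newMu;
    }

    // Weighted PageRank by power iteration (a standard formulation, assumed here).
    static double[] pageRank(double[][] w) {
        int n = w.length;
        double[] rowSum = new double[n];
        for (int j = 0; j < n; j++)
            for (int m = 0; m < n; m++) rowSum[j] += w[j][m];
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);
        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double rank = 0;
                for (int j = 0; j < n; j++)
                    if (rowSum[j] > 0) rank += w[j][i] / rowSum[j] * pr[j];
                next[i] = (1 - DAMPING) / n + DAMPING * rank;
            }
            pr = next;
        }
        return pr;
    }

    // Uniformly distributed random values normalized to sum to 1 (SBFRC's uniform random initialization).
    static double[] randomSimplex(int k, Random rng) {
        double[] v = new double[k];
        double sum = 0;
        for (int c = 0; c < k; c++) { v[c] = rng.nextDouble(); sum += v[c]; }
        for (int c = 0; c < k; c++) v[c] /= sum;
        return v;
    }

    public static void main(String[] args) {
        double[][] sim = { {1.0, 0.8, 0.1, 0.1},
                           {0.8, 1.0, 0.2, 0.1},
                           {0.1, 0.2, 1.0, 0.9},
                           {0.1, 0.1, 0.9, 1.0} };              // toy 4 x 4 similarity matrix
        int k = 2;
        Random rng = new Random(7);
        double[][] mu = new double[sim.length][];
        for (int i = 0; i < sim.length; i++) mu[i] = randomSimplex(k, rng);  // memberships: uniform random, rows sum to 1
        double[] pi = randomSimplex(k, rng);                                  // mixing coefficients: uniform random
        for (int t = 0; t < 30; t++) mu = emIteration(sim, mu, pi);           // repeat E/M updates (convergence is Step 8)
        for (int i = 0; i < sim.length; i++)
            System.out.printf("sentence %d: memberships %.2f / %.2f%n", i + 1, mu[i][0], mu[i][1]);
    }
}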
Step 8: The eighth and last step is to find the convergence point of the clusters. Convergence is achieved when the memberships of the sentences in the clusters no longer change, or when the difference from the previous values is very small.

IV. EXPERIMENTAL ANALYSIS

A. Experimental Setup

SBFRC has been developed in Java on an Intel Core i-series processor at 2.30 GHz with .0 GB of RAM, running the Windows operating system. The WordNet database has been used for measuring word-to-word similarity. The quotation dataset in Table 1 [16] and the news article dataset in Table 3 [6] are partial listings of the full datasets.

Table 1: Quotation dataset

Knowledge Class
1. Our knowledge can only be finite, while our ignorance must necessarily be infinite.
2. Everybody gets so much common information all day long that they lose their common sense.
...
Marriage Class
11. A husband is what is left of a lover, after the nerve has been extracted.
12. Marriage has many pains, but celibacy has no pleasures.
...
Nature Class
21. I have called this principle by which each slight variation if useful is preserved by the term natural selection.
22. Nature is reckless of the individual. When she has points to carry, she carries them.
...
Peace Class
31. There is no such thing as inner peace; there is only nervousness and death.
32. Once you hear the details of victory, it is hard to distinguish it from a defeat.
...
Food Class
41. Food is an important part of a balanced diet.
42. To eat well in England you should have breakfast three times a day.
...

B. Experimental Results and Comparisons

Table 2 shows the results of applying our method, ARCA, Spectral Clustering, and k-medoids to the quotation dataset, evaluated using the external measures. Our method requires that an initial number of clusters be specified. This number was varied from 3 to 8, running repeated trials for each case, each trial commencing from a different random initialization of the membership values. In each case the same affinity matrix was used, with pairwise similarities calculated as described before. However, only three unique clusterings were found, each containing a different number of clusters, ranging from four to seven.

Table 2: Evaluation of quotation dataset: Purity, Entropy, Rand Index, and F-Measure of SBFRC, ARCA, Spectral Clustering, and K-Medoids for each number of clusters

To demonstrate how the algorithm may perform in more general text mining activities, the system is also applied to clustering sentences from a news article. Table 3 shows sentences from an article about President Barack Obama's presidency.

Table 3: News article dataset
1. President Barack Obama on Tuesday championed nuclear energy expansion as the latest way that feuding parties can move beyond the broken politics...
12. That mission, however, remains in doubt.
1. In Saginaw, Biden insisted the stimulus is working even as he acknowledged it's going to take us a while to get us out of this ditch.
...
2. It includes more direct and rapid response to criticism, more events at which the president speaks directly to the public without the filter of the media.
28. The intended narrative is one in which Obama hears people's frustrations and is working directly to end them.
29. There is little doubt the public is angry.
30. A CBS News poll in early February found eighty-one percent saying it's time to elect new people to Congress.

The result for the news article dataset is shown in Table 4.

Table 4: Results of news article dataset (sentences grouped by cluster)
Cluster 1: 3, , , 9, 10, 13, 1, 1, 1, 1, 19, 21, 2
Cluster 2: 2, , 12, 18, 20, 22, 2, 2, 2, 28, 29
Cluster 3: 1, 8, 23, 30

C. Discussions

Our method is evaluated using the performance measures Purity, Entropy [17], Rand Index, and F-Measure [18]. The algorithm is run repeatedly on the quotation dataset and the best results are chosen, since the method is unsupervised and there is no fixed output. As the four performance measures are not always consistent as to which algorithm achieves the best performance for a given number of clusters, the values for which SBFRC attains the best result are indicated in boldface in Table 2. For example, for the Rand Index our method achieves a value of 0.9, which is greater than that achieved by the other algorithms (0.8, 0.8, and 0.33), and for entropy the proposed method achieves a lower (better) value than the other methods, hence these values are shown in boldface. It can be seen from the table that the algorithm performs better for some cluster numbers than for others, as measured by the external cluster evaluation criteria.
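For reference, two of these external measures, purity and entropy, can be computed from hard cluster assignments and gold class labels as in the generic Java sketch below; this illustrates the standard definitions only and is not the evaluation code behind Table 2.

import java.util.Arrays;

// Illustrative sketch of two external cluster evaluation measures.
public class ClusterEvaluationSketch {

    // Purity: fraction of sentences assigned to the majority class of their cluster.
    static double purity(int[] cluster, int[] label, int k, int numClasses) {
        int n = cluster.length, correct = 0;
        for (int c = 0; c < k; c++) {
            int[] counts = new int[numClasses];
            for (int i = 0; i < n; i++) if (cluster[i] == c) counts[label[i]]++;
            correct += Arrays.stream(counts).max().orElse(0);
        }
        return (double) correct / n;
    }

    // Entropy: cluster-size-weighted entropy of the class distribution inside each cluster (lower is better).
    static double entropy(int[] cluster, int[] label, int k, int numClasses) {
        int n = cluster.length;
        double total = 0;
        for (int c = 0; c < k; c++) {
            int[] counts = new int[numClasses];
            int size = 0;
            for (int i = 0; i < n; i++) if (cluster[i] == c) { counts[label[i]]++; size++; }
            if (size == 0) continue;
            double h = 0;
            for (int count : counts) if (count > 0) {
                double p = (double) count / size;
                h -= p * (Math.log(p) / Math.log(2));
            }
            total += (double) size / n * h;
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy example: 6 sentences, 2 gold classes, hard clusters taken as argmax memberships.
        int[] cluster = {0, 0, 0, 1, 1, 1};
        int[] label   = {0, 0, 1, 1, 1, 0};
        System.out.printf("purity=%.3f entropy=%.3f%n",
                purity(cluster, label, 2, 2), entropy(cluster, label, 2, 2));
    }
}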
Looking at the result achieved on the news article dataset, it is observed that in the first cluster, sentence 1 has the highest PageRank score, so sentence 1 goes into cluster 1. The sentences 3, , , 9, 10, 13, 1, 1, 1, 19, 21, 2 are also in cluster 1; their PageRank scores are close to that of sentence 1, and these sentences have similar meanings. For the second cluster, sentence 2 has the highest PageRank, and close to it are the sentences 2, , 12, 18, 20, 22, 2, 2, 2, 28, 29. Most of them are similar in that they carry a negative sense (criticism, anger, frustration).

V. CONCLUSION

An obvious potential application of the algorithm is document classification and summarization. Like any other clustering algorithm, the performance of this method ultimately depends on the quality of the input set, and for sentence clustering the performance can be improved with better sentence similarity measures. Although the cluster number is provided initially, the algorithm appears to be able to converge to an appropriate number of clusters. The idea can be expanded into a hierarchical fuzzy relational clustering algorithm, and the proposed method can also be applied to a cloud trust system to measure its performance by analyzing user feedback comments.

REFERENCES

[1] G. M. Weiss and B. D. Davison, "Data Mining," in The Handbook of Technology Management, H. Bidgoli, Ed., John Wiley and Sons, 2010.
[2] R. J. Hathaway, J. W. Davenport, and J. C. Bezdek, "Relational Duals of the c-Means Clustering Algorithms," Pattern Recognition, vol. 22, no. 2, pp. 205-212, 1989.
[3] R. Krishnapuram, A. Joshi, and L. Yi, "A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering," Proc. IEEE Int'l Fuzzy Systems Conf., 1999.
[4] T. Geweniger, D. Zühlke, B. Hammer, and T. Villmann, "Median Fuzzy C-Means for Clustering Dissimilarity Data," Neurocomputing, vol. 73, nos. 7-9, 2010.
[5] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 2 August, accessed 2 September 2015.
[6] A. Skabar and K. Abdalgader, "Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 62-75, 2013.
[7] C. Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, 1998.
[8] P. Corsini, F. Lazzerini, and F. Marcelloni, "A New Fuzzy Relational Clustering Algorithm Based on the Fuzzy C-Means Algorithm," Soft Computing, vol. 9, pp. 439-447, 2005.
[9] H. Zha, "Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering," Proc. 25th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2002.
[10] D. Wang, T. Li, S. Zhu, and C. Ding, "Multi-Document Summarization via Sentence-Level Semantic Analysis and Symmetric Matrix Factorization," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2008.
[11] D. Lee and H. Seung, "Algorithms for Non-Negative Matrix Factorization," Advances in Neural Information Processing Systems, vol. 13, pp. 556-562, 2001.
[12] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea, and K. Crockett, "Sentence Similarity Based on Semantic Nets and Corpus Statistics," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 8, pp. 1138-1150, Aug. 2006.
[13] J. J. Jiang and D. W. Conrath, "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy," Proc. 10th Int'l Conf. Research in Computational Linguistics, 1997.
[14] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998.
[15] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 30 August, accessed 2 September 2015.
[16] Online resource, accessed 30 August 2015.
[17] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 1 September, accessed 2 September 2015.
[18] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 2 July, accessed 2 September 2015.
