A Study on K-Means Clustering in Text Mining Using Python


International Journal of Computer Systems (ISSN: 2394-1065), Volume 03, Issue 08, August, 2016
Available at http://www.ijcsonline.com/

Dr. (Ms). Ananthi Sheshasayee 1, Ms. G. Thailambal 2
1 Head and Associate Professor, Quaid-e-Milleth College for Women, Chennai, India
2 Research Scholar, SCSVMV University, Kancheepuram, India

Abstract
According to statistics, India has 195,248,950 Internet users, the second-largest number in the world. The total number of websites increased to 672,985,183 in 2013. Text mining is an emerging research area, as the amount of information on the web grows every day. Users do not know how the displayed documents are linked to the query they gave. Sometimes the documents are relevant, and many times they are irrelevant to the query typed by the user. Whether the results are appropriate or inappropriate depends on the clustering algorithm applied. Getting a proper results page from these websites is possible only through the process of clustering. Clustering is a fundamental process in many disciplines, and cluster analysis is used for grouping similar collections of patterns based on similarity factors. This paper discusses the tasks of text mining algorithms and clustering techniques. Of the different types of clustering algorithms available, the K-Means clustering algorithm is presented in detail along with its strengths and limitations. The paper also covers the distance computation measures used to identify similar objects to cluster, the applications of clustering, and the tools used for clustering in different applications. Related works on the K-Means clustering algorithm in text mining and other applications are presented, with the conclusion that the K-Means algorithm can be combined with other algorithms to get efficient results.

Keywords: Text Mining, Clustering Algorithm, K-Means Clustering, Python.
I. INTRODUCTION

Text mining retrieves information in the form of patterns from unstructured textual data in web repositories. Text mining is a variation on a field called data mining, which tries to find interesting patterns in large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text [8]. Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of the documents, or find patterns and trends across multiple documents. Thus, text mining has become an increasingly popular and essential theme in data mining [9].

II. TASKS OF TEXT MINING ALGORITHMS [7]

A. Text Categorization
Assigning documents to pre-defined categories. Many statistical approaches have been applied, such as regression models and Support Vector Machines.

B. Text Clustering
Finding groups of similar data objects based on a similarity function. The methods applied are categorized as hierarchical and partitioning.

C. Concept Mining
The task of discovering concepts, which combines categorization and clustering approaches to find concepts and their relations in text collections.

D. Information Retrieval
Retrieving information from a collection of available information resources, depending on the user's query.

E. Information Extraction
The task of automatically extracting structured information from unstructured or semi-structured documents.

III. CLUSTERING TECHNIQUES

Clustering is the grouping of data items with similar content. Examples include grouping similar text messages in e-mail, or similar content from different books.
Text clustering algorithms are classified into many types, namely distance-based algorithms, frequent sequence algorithms, feature selection and extraction algorithms, and density-based algorithms. A clustering algorithm discovers groups in a set of documents such that documents within a group are more similar to each other than to documents in other groups [2].
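Before documents can be compared for similarity, they need a numeric, structured form. The following is a minimal sketch of one common way to do that, assuming a simple bag-of-words (term count) representation. The function names `tokenize` and `to_vectors` and the tiny stop-word list are hypothetical choices for illustration, not something specified in this paper.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lower-case the text, split on whitespace, and drop stop words.

    Punctuation is not handled here; a real pipeline would strip it.
    """
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def to_vectors(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))  # all distinct terms, in a fixed order
    vectors = [[count[term] for term in vocab] for count in counts]
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "stock market news"]
vocab, vectors = to_vectors(docs)
```

Under this representation the first two documents share the term "sat", so their vectors overlap, while the third document has no term in common with either.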

Fig. 1 Documents before and after clustering (panels: Scattered Document, Clustered Document)

The following conditions help to increase the effectiveness of clustering [1]:

A. Similarity Measure: Only similar documents should be grouped together, but similarity is hard to define.
B. Dimension Reduction: The size of the data needs to be reduced, by removing irrelevant words from the text collection, to make operations more efficient.
C. Cluster Labels: Different clusters need separate, appropriate names so that they can be identified clearly.
D. Number of Clusters: The number of clusters has to be decided in advance, which is difficult when little information is available.
E. Overlapping of Clusters: The algorithm should accept overlapping clusters, since certain documents cover several topics.
F. Scalability: The algorithm should be usable irrespective of the size of the data.
G. Flexibility: The algorithm should work with different attributes, numbers of clusters, etc.

The clustering hypothesis is formulated as follows: given a suitable clustering of a collection, if the user is interested in a document d, then the user is also likely to be interested in the other members of the cluster containing d.

The parameters used by clustering algorithms are [3]:
(i) The number of clusters desired.
(ii) A minimum and maximum size of the cluster.
(iii) The control of overlap between clusters.
(iv) An arbitrarily chosen objective function to be optimized.
(v) A threshold value of the matching function below which an object will not be included in the cluster.

H. Distance Computation
Most cluster analysis methods are based on the similarity between objects, computed as the distance between each pair. The properties of distance are:
(i) Distance is never negative.
(ii) The distance from a point to itself is zero.
(iii) The distance from x to y is always the same as the distance from y to x.
(iv) The distance from point x to point y cannot be greater than the sum of the distance from x to any other point z and the distance from z to y.

Fig. 2 Key Tasks of Clustering. Document Representation: convert the documents into structured form. Definition of Similarity Measure: the similarity between two documents. Clustering Logic: determine which cluster each document is assigned to, based on the similarity measure.

A. Distance measures of Clusters

Euclidean distance: Attributes with the largest values dominate, so the attributes should be properly scaled.
$D(x, y) = \left( \sum_i (x_i - y_i)^2 \right)^{1/2}$ (1)

Manhattan distance: The largest-valued attributes dominate less than they do with Euclidean distance.
$D(x, y) = \sum_i |x_i - y_i|$ (2)

Chebychev distance: This is based on the maximum attribute difference.
$D(x, y) = \max_i |x_i - y_i|$ (3)

Categorical distance: Used when many attributes have categorical values with only a small number of possible values. Let N be the total number of categorical attributes.
$D(x, y) = (\text{number of attributes } i \text{ with } x_i \neq y_i) / N$ (4)
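The four distance measures above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and each function assumes x and y are equal-length sequences (numeric for the first three, arbitrary category labels for the last).

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared attribute differences (Eq. 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute attribute differences (Eq. 2)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute attribute difference (Eq. 3)."""
    return max(abs(a - b) for a, b in zip(x, y))


def categorical(x, y):
    """Fraction of categorical attributes on which x and y disagree (Eq. 4)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


p, q = [0, 0], [3, 4]
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))
print(categorical(["red", "small", "round"], ["red", "large", "round"]))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5, the Manhattan distance is 7, and the Chebychev distance is 4, which shows how the three measures weigh the larger attribute difference differently.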

I. Types of Clustering [5]

Partitional clustering
The given n data objects are partitioned into k partitions, each representing a cluster, where k <= n. The partitioned data should satisfy the following criteria:
(i) At least one data object should be in each cluster.
(ii) A data object should belong to only one cluster.
The widely used methods are iterative clustering, or reallocation clustering, in which data objects move from one cluster to another, and single-pass clustering, in which each data object is processed only once.

K-Means Clustering: The most widely used partitional clustering method is K-Means, which assigns each point to the cluster whose center, called the centroid, is nearest. The center is the average of all the points in the cluster: its coordinates are the arithmetic mean of each dimension separately over all the points in the cluster [6].

The Steps of K-Means:
Step 1: Choose the number of clusters k.
Step 2: Randomly generate k points as the initial cluster centers.
Step 3: Determine the Euclidean distance of each object to all centroids.
Step 4: Assign each point to the nearest centroid.
Step 5: Re-compute the new cluster centers.
Step 6: Repeat Steps 3 to 5 until convergence.

This algorithm aims to minimize the following function for k clusters and n data points:
$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \lVert x_i^{(j)} - c_j \rVert^2$ (5)
where j = 1 to k, i = 1 to n, and $\lVert x_i^{(j)} - c_j \rVert$ is the chosen Euclidean distance between a data point $x_i^{(j)}$ and its cluster center $c_j$.

K-Means still has some limitations: it cannot handle outliers, and it does not produce intermediate solutions. Nevertheless, the algorithm is traditionally used in most applications, since it is easy to implement and its time complexity is O(N) [10], where N is the number of objects to be grouped (for a fixed number of clusters and iterations). Table 1 contains the advantages of K-Means clustering.

Hierarchical Clustering
These methods either start with one cluster and split it into smaller and smaller clusters, or start with individual objects and merge similar clusters into larger and larger clusters, resulting in a tree of clusters.
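The merging (bottom-up) variant of hierarchical clustering can be illustrated with a short sketch. The code below assumes single-linkage merging, where the distance between two clusters is the distance between their closest members; the paper does not name a linkage criterion, so this choice and the function name `agglomerative` are assumptions made for illustration.

```python
import math


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain.

    Uses single linkage (closest pair of members), which is an assumption.
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters


points = [(0.0,), (1.0,), (10.0,), (11.0,), (25.0,)]
print(agglomerative(points, 3))
```

Recording each merge instead of stopping at k clusters would give the tree of clusters described above.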
Density Based clustering
For each data point in a cluster, at least a minimum number of points must exist within a given radius. Each cluster is a dense region of points surrounded by regions of low density.

Grid based clustering
The object space is divided into a grid according to the characteristics of the data. This method is not affected by data ordering and can deal with non-numeric data easily.

Model based clustering
This algorithm builds clusters with a high level of similarity within them and a low level of similarity between them. It works on the basis of mean values and minimizes the squared error function.

Table 1: Advantages of K-Means
Type of attributes the algorithm can handle: Numeric
Time complexity: Low
Data ordering dependency: Yes
Prior knowledge and user-defined parameters: Yes
Interpretability of results: Clusters
Ability to memorize results: Centroids

J. Clustering Implementation in Python
The following partial code is implemented in the Python language [22].
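As an illustration of the K-Means steps listed earlier, here is a minimal pure-Python sketch. It is not the listing from [22]; the function names `kmeans` and `sse`, the `init` and `seed` parameters, and the choice to draw the initial centers from the data points are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups; return (centers, labels)."""
    # Step 2: choose k starting centers. Sampling them from the data is an
    # assumption; explicit starting centers can be passed through `init`.
    rng = random.Random(seed)
    centers = [tuple(c) for c in (init if init is not None else rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: assign each point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # Step 6: stop once assignments no longer change
            break
        labels = new_labels
        # Step 5: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = kmeans(points, 2, init=[(1.0, 1.0), (8.0, 8.0)])
print(labels, sse(points, centers, labels))
```

The `sse` helper evaluates the squared-distance objective that K-Means tries to minimize, which makes it easy to compare runs started from different initial centers.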

Fig. 3 Sample Clustering Implementation using Python (original listing from [22] not shown)

IV. RELATED WORK OF K-MEANS CLUSTERING IN OTHER APPLICATIONS

Oyelade, O. J. et al. present the k-means clustering algorithm as a simple and efficient tool to monitor the progression of students' performance in higher institutions. They analyzed the students' results using cluster analysis and used standard statistical algorithms to arrange the score data according to level of performance [11].

Bader Aljaber et al. show that the use of citation contexts, when combined with the vocabulary in the full text of documents in High Energy Physics and Genomics, is a promising alternative means of capturing the critical topics covered by journal articles. The authors use a link-based clustering algorithm, which determines the similarity between documents from the number of co-citations. They use a bi-clustering algorithm and, at the end, apply the K-means algorithm to reduce the size of the bi-clusters by merging similar documents [12].

V. RELATED WORK OF TEXT MINING APPLICATIONS USING K-MEANS CLUSTERING ALGORITHM

Anil Kumar Pandey et al. use the k-means algorithm to cluster web documents to help researchers. The authors extract document features and apply the Apriori algorithm, which generates mutually exclusive frequent sets that are taken as the initial points of the k-means clustering algorithm. This displays highly related documents with the same features together [13].

Neetu Sharma et al. use the K-means algorithm and the Random Forest classifier in the WEKA tool and conclude that applying clustering before classification on the data file poach.arff from WORDNET optimized the performance [14].

VI. CONCLUSION

The performance of a clustering algorithm depends on the structure, the amount and the representativeness of the data. Some of the applications where clustering is widely used are discussed in this paper, which shows the importance of clustering in text mining. Many other clustering algorithms are available, each with pros and cons, and they can be combined to get better results.

REFERENCES

[1] Francis Musembi Kwale, A Critical Review of K-Means Text Clustering Algorithms, International Journal of Advanced Research in Computer Science, Volume 4, No. 9, ISSN No. 0976-5697.
[2] Dan Munteanu, Severin Bumbaru, A Survey Of Text Clustering Techniques Used For Web Mining, The Annals Of Dunarea De Jos University Of Galati, Fascicle III, ISSN 1221-454x.
[3] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London.
[4] Pushplata, Mr. Ram Chatterjee, An Analytical Assessment on Document Clustering, I.J. Computer Network and Information Security, 5, 63-71, DOI: 10.5815/ijcnis.2012.05.08.
[5] Ms. S. Prabha, Dr. K. Duraiswamy, Ms. M. Sharmila, Analysis of Different Clustering Techniques in Data and Text Mining, International Journal of Computer Science Engineering (IJCSE), Vol. 3 No. 02, ISSN: 2319-7323.
[6] Mrs. S. C. Punitha, Dr. M. Punithavalli, A Comparative Study to Find a Suitable Method for Text Document Clustering, International Journal of Computer Science & Information Technology, Vol. 3, No. 6.
[7] Mrs. Sayantani Ghosh, Mr. Sudipta Roy, and Prof. Samir K. Bandyopadhyay, A Tutorial Review On Text Mining Algorithms, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 1, Issue 4, ISSN: 2278-1021.
[8] Vishal Gupta, Gurpreet S. Lehal, A Survey of Text Mining Techniques and Applications, Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1.
[9] R. Sagayam, S. Srinivasan, S. Roshni, A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques, International Journal of Computational Engineering Research, Vol. 2, Issue 5, pp. 1443-1446.
[10] Comparative Study of Clustering Algorithms On Textual Databases, Thesis submitted to Technical University Ilmenau, Germany.
[11] O. J. Oyelade, O. O. Oladipupo, I. C. Obagbuwa, Application Of K-Means Clustering Algorithm For Prediction Of Students Academic Performance, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, Issue 1.
[12] Bader Aljaber, Nicola Stokes, James Bailey, Jian Pei, Document Clustering Of Scientific Texts using Citation Contexts, Information Retrieval, DOI 10.1007/s10791-009-9108-x, Springer Science+Business Media, LLC.
[13] Anil Kumar Pandey, T. Jaya Laxmi, Web Document Clustering for Finding Expertise in Research Area, BVICAM's International Journal of Information Technology, Vol. 1 No. 2, ISSN 0973-5658.
[14] Neetu Sharma, Dr. S. Niranjan, Optimization Of Word Sense Disambiguation Using Clustering In Weka, International Journal of Computer Technology & Applications, Vol 3 (4), 1598-1604, ISSN: 2229-6093.
[15] L.V. Bijuraj, Clustering and its Applications, Proceedings of National Conference on New Horizons in IT, ISBN 978-93-82338-79-6.
[16] https://code.google.com/p/sofia-ml
[17] http://nlp.fi.muni.cz/projekty/gensim
[18] http://mahout.apache.org
[19] http://radimrehurek.com/gensim
[20] http://carrotsearch.com/lingo3g
[21] http://graphlab.org
[22] Toby Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications. Sebastopol, CA: O'Reilly Media.