Parallel K-Means Clustering with Triangle Inequality


Rachel Krohn and Christer Karlsson
Mathematics and Computer Science Department, South Dakota School of Mines and Technology, Rapid City, SD, 57701, USA
© ISCA, CAINE 2016, September 26-28, 2016, Denver, Colorado, USA

Abstract

Clustering divides data objects into groups to minimize the variation within each group. This technique is widely used in data mining and other areas of computer science. K-means is a partitional clustering algorithm that produces a fixed number of clusters through an iterative process. The relative simplicity and obvious data parallelism of the K-means algorithm make it an excellent candidate for distributed-memory optimization, particularly as datasets grow beyond the size of a single machine. The triangle inequality, when applied to the K-means algorithm, allows unnecessary distance calculations between data objects and cluster centroids to be avoided. Various parallel implementations of the K-means algorithm exist, but no example comparing a standard parallel implementation to one utilizing the triangle inequality could be located. This paper seeks to fill this gap by presenting experimental results demonstrating the performance of both standard and improved parallel K-means implementations compared to a sequential implementation.

Keywords: K-means, clustering, data mining, parallel, triangle inequality

1 Introduction

The field of data mining seeks to extract useful information from large datasets drawn from real-world situations. As big data becomes more popular and data storage capabilities increase, demand for data mining analysis increases while datasets grow in size. One common data mining task is clustering, which partitions the data into discrete categories. The K-means algorithm clusters data objects into k separate clusters using some distance measure, with each data object belonging to the cluster with the nearest cluster centroid [6]. Although the K-means algorithm is simple to implement, growing datasets present unique challenges. Larger datasets require more time to analyze, and if the entire dataset does not fit into the memory of a single machine, costly disk I/O operations further slow the analysis process. One solution is to use parallel processing to speed up the analysis and, in the case of distributed-memory solutions, reduce disk I/O. This paper introduces the basic K-means algorithm and summarizes current parallel implementations before introducing a new flexible distributed-memory solution that utilizes the triangle inequality for improved speedup.

2 Basic K-Means Algorithm

K-means is a simple prototype-based, partitional clustering algorithm. Each cluster is represented by a prototype, or centroid, which is typically defined as the mean of all data points in that cluster. Because K-means is partitional, each data point will only belong to one cluster, and clusters will not be nested. The number of clusters, k, is specified by the user before the algorithm begins processing. Initial centroids can be selected in many different ways; the exact method does not impact the main algorithm. K-means then relies on an iterative procedure to refine the centroids, as described in Algorithm 1 [6].

Algorithm 1 Basic K-Means Algorithm
1: Select k points as initial centroids
2: repeat
3: Form k clusters by assigning each point to its closest centroid
4: Recompute the centroid of each cluster
5: until Centroids do not change

Initial centroid selection varies depending on the specific K-means implementation being used. Some common methods include randomly selecting k data points, or selecting k points with maximum separation distance.
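As a concrete illustration of Algorithm 1, a minimal sequential sketch in C follows. The flat-array data layout, the squared-distance helper, and all identifiers are our own illustration, not the authors' code.

```c
#include <stdlib.h>

/* Squared Euclidean distance between two d-dimensional points. Comparing
   squared distances selects the same nearest centroid without the sqrt. */
static double dist2(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int j = 0; j < d; j++) {
        double t = a[j] - b[j];
        s += t * t;
    }
    return s;
}

/* Algorithm 1: points holds n*d attributes, centroids holds k*d values and
   is pre-seeded with the initial centroids; assign[i] receives the final
   cluster of point i. */
void kmeans(const double *points, int n, int d, int k,
            double *centroids, int *assign) {
    double *sums  = malloc((size_t)k * d * sizeof *sums);
    int    *count = malloc((size_t)k * sizeof *count);
    int changed;

    for (int i = 0; i < n; i++) assign[i] = -1;
    do {
        changed = 0;
        for (int c = 0; c < k * d; c++) sums[c] = 0.0;
        for (int c = 0; c < k; c++) count[c] = 0;

        /* Step 3: assign each point to its closest centroid. */
        for (int i = 0; i < n; i++) {
            const double *x = &points[(size_t)i * d];
            int best = 0;
            double best_d = dist2(x, centroids, d);
            for (int c = 1; c < k; c++) {
                double dc = dist2(x, &centroids[(size_t)c * d], d);
                if (dc < best_d) { best_d = dc; best = c; }
            }
            if (best != assign[i]) { assign[i] = best; changed++; }
            for (int j = 0; j < d; j++) sums[(size_t)best * d + j] += x[j];
            count[best]++;
        }

        /* Step 4: recompute each centroid as the mean of its members. */
        for (int c = 0; c < k; c++)
            if (count[c] > 0)
                for (int j = 0; j < d; j++)
                    centroids[(size_t)c * d + j] = sums[(size_t)c * d + j] / count[c];
    } while (changed > 0);   /* Step 5: centroids stable once no point moves */

    free(sums);
    free(count);
}
```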
Initial centroids can impact the final clustering results, as poor initial centroids may produce a sub-optimal clustering. At each K-means iteration, all points are reassigned to the nearest centroid based on some distance metric. Note that finding the nearest centroid requires calculating the distance between the current data object and all cluster centroids. Euclidean distance is often used, since it is simple to implement and produces good results if the data is normalized. Based on the new cluster assignments, the centroids for each cluster are recalculated to reflect the new cluster membership. Often a cluster centroid is defined as the average of all data objects belonging to that cluster. The data object assignment and centroid adjustment process is repeated until the cluster centroids do not move, indicating a stable final clustering. In practice, particularly for large datasets, the number of data points that switch clusters is tracked at each iteration, and the algorithm terminates once this count falls below a certain threshold [6]. Since most centroid movement occurs in the first few iterations, stopping early does not greatly undermine the final results. There are also cases where a small number of points will repeatedly oscillate between two clusters; the threshold termination condition prevents this. One of the greatest advantages of the K-means algorithm is its simplicity and efficiency. Despite these advantages, K-means is not suitable for all datasets: it does not handle non-globular clusters, or clusters of differing sizes and densities, and other clustering algorithms exist for these situations. The algorithm is also not robust to outliers; detecting and removing outliers before running K-means is a viable solution.

3 Parallel K-Means Algorithm

Examining the K-means algorithm reveals an obvious data parallelism [7]. To reassign each point to the nearest cluster centroid, the distance from the point to every cluster centroid must be calculated; these distances are then compared to determine the nearest centroid. Because this process must be repeated for every data object in the set, and the results for one data object do not impact other data objects, it is appealing to parallelize the K-means algorithm to reduce overall execution time.

3.1 Existing Work

Following the theoretical development of a parallel K-means algorithm [7], various parallel implementations have been created. Some rely on shared memory [4] for simplicity, but these programs cannot handle extremely large datasets. Other researchers, desiring a distributed-memory solution, use MPI [1][8] or Apache Hadoop [9] for parallelization. More recent work focuses on GPU-based solutions [2].

Various algorithm modifications have been researched and implemented to improve the running time of the basic K-means algorithm. Improvements seek to reduce either the number of algorithm iterations or the number of distance calculations at each iteration; both strategies reduce the overall work performed by the K-means algorithm, thereby achieving speedup. Implementing the triangle inequality to reduce the number of Euclidean distance calculations greatly improves the scalability of the K-means algorithm [5]. Applying the triangle inequality to an object's previous cluster assignment and a table of inter-centroid distances eliminates unnecessary distance calculations. Further improvement is achieved by sorting the centroids in order of distance from the previous cluster centroid. The main advantage of this improvement strategy is its simplicity, as the triangle inequality does not alter the foundation of the basic K-means algorithm.
Another improvement strategy is to avoid looping over the entire dataset at each centroid adjustment iteration by removing data points close to the centroid from consideration [3]. Using statistical analysis, the dataset can be reduced to a subset of boundary points, which are points near the edges of a cluster. If centroids do not move significantly, only the boundary points must be reassigned, decreasing the cost of each algorithm iteration. This strategy produces speedup by reducing the number of passes over the entire dataset, which reduces the total number of distance calculations. Although the boundary point approach offers significant speedups, it is much more difficult to implement than the triangle inequality.

3.2 Implementation Details

OpenMPI is a library implementation of the Message Passing Interface for distributed-memory computing. Typically, OpenMPI provides greater control to the programmer than MapReduce, allowing for more complex and frequent communication between processes. For this reason, OpenMPI was selected as the platform for this implementation. Unfortunately, more frequent communication generally causes larger communication overhead, so careful algorithm design is required to minimize communication between processes. Once the desired number of MPI processes is initialized, the root process retrieves the data object attributes from the user-specified input file. The data is partitioned between the different processes; each process is then responsible for clustering only part of the entire dataset. k data objects are randomly selected to serve as the initial cluster centroids. The number of desired clusters and the initial centroids are disseminated to all processes before the main clustering procedure begins.

The K-means algorithm was converted to an MPI procedure to exploit data parallelism. Each process is responsible for a subset of the data objects, but knows the location of all cluster centroids. At each iteration, a process assigns each of its own data objects to the nearest cluster centroid. The attributes of all data objects assigned to each cluster are summed as they are assigned, and the size of each cluster is tracked. After all data objects are assigned, the processes exchange the local cluster sizes and attribute sums. Each process then computes updated centroid locations independently before repeating the data object assignment procedure. Using this strategy, our implementation produces the same results as the sequential K-means algorithm while limiting communication and reducing overall computation time. Because a few data objects may oscillate between cluster centroids, preventing the algorithm from terminating, our implementation utilizes a stopping threshold: once fewer than 1% of data objects change clusters, the program stops. Two results files are created, one giving the exact location of each cluster centroid and the other indicating the membership of each cluster. The communication pattern of one iteration is sketched below.
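A minimal sketch of this per-iteration communication pattern follows, assuming each rank already holds its local points in a flat array and a replicated copy of the centroids; dist2 is the squared-distance helper from the Section 2 sketch, and all identifiers are our own illustration rather than the authors' code.

```c
#include <mpi.h>
#include <stdlib.h>

/* One iteration of the distributed K-means loop. The caller initializes
   assign[] to -1 before the first call. Returns the global number of
   reassigned points; iteration stops once this falls below the threshold. */
int kmeans_iteration(const double *local_pts, int n_local, int d, int k,
                     double *centroids, int *assign) {
    double *sums  = calloc((size_t)k * d, sizeof *sums);
    int    *count = calloc((size_t)k, sizeof *count);
    int changed = 0;

    /* Assign each local point; accumulate attribute sums and cluster sizes. */
    for (int i = 0; i < n_local; i++) {
        const double *x = &local_pts[(size_t)i * d];
        int best = 0;
        double best_d = dist2(x, centroids, d);   /* helper from Section 2 sketch */
        for (int c = 1; c < k; c++) {
            double dc = dist2(x, &centroids[(size_t)c * d], d);
            if (dc < best_d) { best_d = dc; best = c; }
        }
        if (best != assign[i]) { assign[i] = best; changed++; }
        for (int j = 0; j < d; j++) sums[(size_t)best * d + j] += x[j];
        count[best]++;
    }

    /* Exchange local cluster sizes, attribute sums, and the change count;
       every rank then recomputes identical centroids independently. */
    MPI_Allreduce(MPI_IN_PLACE, sums, k * d, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, count, k, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, &changed, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    for (int c = 0; c < k; c++)
        if (count[c] > 0)
            for (int j = 0; j < d; j++)
                centroids[(size_t)c * d + j] = sums[(size_t)c * d + j] / count[c];

    free(sums);
    free(count);
    return changed;
}
```

Three collectives per iteration keep the communication volume proportional to k times d rather than to the dataset size, which is what limits the overhead as the data grows.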

4 The Triangle Inequality

The triangle inequality can be applied to the K-means algorithm to reduce the number of distance calculations required. By comparing the distance between a data object X_i and its current cluster centroid C_curr to the distance between C_curr and any potential cluster centroid C_pot, unnecessary distance calculations are eliminated. Taking d(a, b) as the Euclidean distance between points a and b, the triangle inequality states that

d(C_curr, C_pot) ≤ d(X_i, C_curr) + d(X_i, C_pot),

which means that

d(X_i, C_pot) ≥ d(C_curr, C_pot) - d(X_i, C_curr).

Considering the case d(C_curr, C_pot) ≥ 2 d(X_i, C_curr), the triangle inequality therefore allows us to conclude that d(X_i, C_pot) ≥ d(X_i, C_curr). This indicates that object X_i will not be assigned to cluster C_pot, and d(X_i, C_pot) does not need to be calculated. A prerequisite for these comparisons is a table of distances between all pairs of centroids, which must be recalculated following each centroid adjustment. Each process computes this table independently after receiving the new centroid location data. Because the number of cluster centroids is much smaller than the number of data objects, the cost of computing this table is negligible in relation to the number of distance calculations saved by this improvement. For our implementation the centroids are not sorted, because the extra speedup did not outweigh the additional complexity, particularly when k is small.
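A sketch of this pruning test during point assignment follows, assuming ccdist is the k-by-k table of (non-squared) inter-centroid Euclidean distances recomputed once per iteration and dist2 is the squared-distance helper from the earlier sketches; all identifiers are our own illustration.

```c
#include <math.h>
#include <stddef.h>

/* Assign point x given its current cluster curr. Any candidate C_pot with
   d(C_curr, C_pot) >= 2 * d(x, C_curr) is skipped: by the triangle
   inequality, d(x, C_pot) >= d(x, C_curr), so C_pot cannot be closer. */
int assign_with_triangle(const double *x, const double *centroids,
                         const double *ccdist, int k, int d, int curr) {
    double d_curr = sqrt(dist2(x, &centroids[(size_t)curr * d], d));
    double best_d = d_curr;
    int best = curr;
    for (int c = 0; c < k; c++) {
        if (c == curr) continue;
        if (ccdist[(size_t)curr * k + c] >= 2.0 * d_curr)
            continue;                 /* pruned: d(x, C_pot) never computed */
        double dc = sqrt(dist2(x, &centroids[(size_t)c * d], d));
        if (dc < best_d) { best_d = dc; best = c; }
    }
    return best;
}
```

Note that the pruning test needs true Euclidean distances rather than squared ones, since the inequality is stated for the metric itself.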
5 Experimental Results

To evaluate the value of parallelization, we conducted extensive experiments on the sequential, basic parallel, and optimized parallel K-means implementations. All experiments were performed on a cluster of computers; each machine has 16 GB of memory and eight cores running at 3.6 GHz. All timing measurements rely on the MPI_Wtime call. Since the goal of this paper is to compare the performance of the clustering algorithm, I/O time is ignored for these measurements. Sequential testing consisted of a single process on a single machine, while parallel programs used eight processes on two nodes. This configuration was selected to include inter-process communication both across and within network nodes. Each graph datapoint represents an average of several test runs.

The datasets used for all test runs were synthetically generated. Each dataset contains k equal-sized, well-separated, globular clusters, with theoretical cluster centers following a normal distribution. Within each cluster, individual data items also follow a normal distribution. Attributes are floating-point numbers ranging from zero to one with five digits of precision. If not specifically mentioned, the default dataset for each run consists of 5, items, each with 3 attributes, with k set to clusters. All individual test runs use the same initial centroids for consistency.

By comparing the outputs of the programs, we verified that all three versions of the K-means algorithm produced the same final clustering. Centroid locations and object cluster assignments were identical across all test runs, regardless of which program performed the clustering.
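Timing in these experiments relies on the MPI_Wtime call; a minimal sketch of one way to wrap the clustering phase follows, reusing the kmeans_iteration sketch from Section 3.2 and the 1% stopping threshold. The driver, its reporting of the slowest rank's time, and all identifiers are our own illustration.

```c
#include <mpi.h>
#include <stdio.h>

/* Time the clustering phase only (I/O excluded) and report the slowest
   rank's elapsed time from the root process. */
void timed_clustering(const double *local_pts, int n_local, long n_total,
                      int d, int k, double *centroids, int *assign) {
    double t0 = MPI_Wtime();
    while (kmeans_iteration(local_pts, n_local, d, k, centroids, assign)
           >= 0.01 * n_total)    /* stop once < 1% of objects change clusters */
        ;
    double elapsed = MPI_Wtime() - t0, slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("clustering time: %.3f s\n", slowest);
}
```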

[Figure 1: Effect of dataset size on runtime.]

Figure 1 shows the runtimes of all three program versions for datasets with varying numbers of objects, ranging from , to 2 million objects. As expected, the sequential program takes much longer to produce the same results as the parallelized versions. The K-means version with the triangle inequality optimization performs better than the standard parallel algorithm. The runtime of both parallel programs grows more slowly than the runtime of the sequential version, indicating that the parallel algorithm is scalable.

[Figure 2: Effect of number of clusters on runtime.]

The effect of k on runtime was also considered; Figure 2 shows these results. As with the number of data objects, the runtime of the sequential K-means algorithm grows quickly as k increases, while the runtime of the parallel programs increases more slowly. The speedup from the triangle inequality optimization is more apparent here, with the triangle inequality consistently reducing clustering runtime by half.

[Figure 3: Effect of number of attributes on runtime.]

Finally, the number of attributes of the data objects was adjusted and the impact on runtime examined. Figure 3 shows that once again, the parallel implementations significantly outperform the sequential K-means algorithm. The triangle inequality optimization appears to offer some benefit in these cases, although the improvement is not as significant as for larger numbers of clusters.

Overall, the experiments show that a parallel implementation of the K-means algorithm greatly reduces execution time compared to a sequential implementation, particularly as the problem size grows. The triangle inequality optimization does improve upon the standard parallel implementation, with the greatest benefit occurring for large numbers of clusters.

6 Future Work

Because the size of the test datasets in this study was limited by the need to time sequential execution, the benefits of the triangle inequality optimization do not appear significant when compared with the improvements achieved by a standard parallel implementation. Further testing with larger datasets and more attributes is required to better assess the effectiveness of the triangle inequality. Experiments using real-world data instead of synthetically generated objects are also necessary.

The standard K-means algorithm does not specify a method for selecting the initial centroids, so many implementations rely on randomly selected initial centroids. Because poor initial centroids can result in sub-optimal clustering results [6], clustering is often repeated so that the best result can be used. Additional research into initial centroid selection methods is required to address this problem. Other K-means algorithm improvements, such as boundary point tracking [3], may allow for even greater speedups.

7 Conclusion

In this paper, a survey of the basic K-means algorithm and current related work is presented as an introduction to clustering. Then, a distributed-memory implementation that exploits the data parallelism present in the K-means algorithm is discussed. The triangle inequality is also introduced as an optimization of the basic algorithm, including implementation details. Finally, experimental results are presented for all three program versions.

Our OpenMPI implementation of K-means clustering partitions the dataset between multiple processes to speed up program execution. The algorithm is designed to limit communication between nodes and maximize data parallelism. A second variation of this program utilizes the triangle inequality to prevent unnecessary Euclidean distance calculations. We evaluated our programs' performance through detailed experimentation. The results show that parallelization of the K-means algorithm does improve running time, and that the triangle inequality optimization offers further benefits. More testing with larger datasets is required to more clearly demonstrate the additional speedup gained through the triangle inequality optimization. To the best of our knowledge, a performance comparison between basic parallel K-means and a version implementing the triangle inequality does not exist in the literature. The K-means algorithm is a relatively simple and flexible clustering mechanism for many datasets. Exploiting the data parallelism present in this algorithm greatly reduces overall runtime, and the triangle inequality provides an opportunity for further optimization, particularly for large datasets.

References

[1] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD. Springer-Verlag, 2000.

[2] Reza Farivar, Daniel Rebolledo, Ellick Chan, and Roy Campbell. A parallel implementation of k-means clustering on GPUs. In Parallel and Distributed Processing Techniques and Applications. CSREA Press, 2008.

[3] Ruoming Jin, Anjan Goswami, and Gagan Agrawal. Fast and exact out-of-core and distributed k-means clustering. Knowledge and Information Systems, 10(1):17-40, 2006.

[4] Tayfun Kucukyilmaz. Parallel k-means algorithm for shared memory multiprocessors. Journal of Computer and Communications, 2(11):15-23, 2014.

[5] Jitendra Kumar, Richard T. Mills, Forrest M. Hoffman, and William W. Hargrove. Parallel k-means clustering for quantitative ecoregion delineation using large data sets. In Procedia Computer Science. Elsevier, 2011.

[6] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison Wesley, 1st edition, 2005.

[7] Jinlan Tian, Lin Zhu, Suqin Zhang, and Lu Liu. Improvement and parallelism of k-means clustering algorithm. Tsinghua Science and Technology, 10(3), 2005.

[8] Jing Zhang, Gongqing Wu, Xuegang Hu, Shiying Li, and Shuilong Hao. A parallel clustering algorithm with MPI - MKmeans. Journal of Computers, 8(1):10-17, 2013.

[9] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In Cloud Computing. Springer, 2009.
