Classifying Documents by Distributed P2P Clustering

Martin Eisenhardt, Wolfgang Müller, Andreas Henrich
Chair of Applied Computer Science I, University of Bayreuth, Germany
{eisenhardt mueller2

Abstract: Clustering documents into classes is an important task in many Information Retrieval (IR) systems. The resulting grouping enables a description of the contents of the document collection in terms of the classes the documents fall into. The compactness of such a description is even more desirable in cases where the document collection is spread across different computers and locations; document classes can then be used to describe each partial document collection in a conveniently short form that can easily be exchanged with other nodes on the network. Unfortunately, most clustering schemes cannot easily be distributed. Additionally, the costs of transferring all data to a central clustering service are prohibitive in large-scale systems. In this paper, we introduce an approach which is capable of classifying documents that are distributed across a Peer-to-Peer (P2P) network. We present measurements taken on a P2P network using synthetic and real-world data sets.

1 Motivation

In P2P IR systems and distributed IR systems, an important problem is the routing of queries, or the selection of sources which might contain relevant documents. Some approaches in this respect are based on compact representations of adjacent peers' document collections or of the documents maintained on the reachable source nodes. In this context, an interesting approach is to determine a consistent clustering of all documents in the network, and to acquire knowledge about which node contains how many documents falling into each cluster. For a given query, the querying node can then identify the most promising clusters through this compact representation of other peers' collections; the query can then be routed precisely to the nodes potentially containing relevant documents.

In the scenario above, an efficient algorithm is needed to determine the clustering of the global document collection. The present paper describes and assesses such an algorithm. The concrete motivation behind this approach stems from our research on large-scale P2P IR networks (cf. [EH02]). In such environments, the costs of transferring documents and/or data to a central clustering service become prohibitive; in some cases, it might even be infeasible to transfer the data at all, as Skillicorn points out (see [Ski01]). Therefore, a distributed clustering algorithm is needed.

2 Background

Our approach is based on two well-known algorithms from the areas of statistics and distributed systems. We use the k-means clustering algorithm to do the actual categorization of the documents. A probe/echo mechanism is used to distribute the task through the P2P network and to propagate the results back to the initiator of the clustering.

The k-means Clustering Algorithm

The k-means clustering algorithm is a non-hierarchical approach to clustering (see Algorithm 1 below). Its input parameters are a number κ of desired clusters, a set D of n-dimensional data objects d_i, and an initial set C of n-dimensional cluster centroids c_j, where d_i, c_j ∈ R^n; d_{i,l} and c_{j,l} denote the n individual vector components. The clustering algorithm follows a very straightforward approach: in each iteration, the distances of each d_i ∈ D to all c_j ∈ C are computed, and each d_i is assigned to the nearest c_j. Then, every centroid c_j is recalculated as the centroid of all data objects d_i assigned to it. There are a few mostly equivalent termination conditions for the k-means algorithm; one of the most widely used is that the algorithm terminates once the sum of the mean square errors (MSE) within the clusters no longer decreases from iteration to iteration.

Note that the function dist(d_i, c_j) can be chosen according to the data set being clustered. A common class of distance metrics are the Minkowski metrics

\[ \mathrm{dist}_{\mathrm{mink}}(d_i, c_j) = \Big( \sum_{l=1}^{n} |d_{i,l} - c_{j,l}|^m \Big)^{1/m} \]

where the parameter m ∈ N can be set to generate a specific metric, e.g. the Manhattan metric (m = 1) or the Euclidean metric (m = 2); the latter metric was used in our experiments. Other metrics (e.g. the cosine metric) are also possible. Various methods exist for the generation of initial cluster centroids, ranging from quite obvious ones (just take κ randomly chosen documents as the initial centroids) to sophisticated ones that lead to a faster convergence and termination of the algorithm (see [BF98]).

As Dhillon and Modha point out in [DM00], the k-means algorithm is inherently data parallel. Since it involves no hierarchical scheme, the only data that must be shared among the nodes are the cluster centroids; all locally refined centroids can then be merged at the node which initiated the clustering, and this node decides according to a termination condition whether another iteration of the k-means algorithm is necessary. The advantage of this approach is that transferring relatively few centroids is feasible, as opposed to transferring whole data sets. Additionally, the k-means algorithm is well suited for the clustering of documents, as is shown in [SC02].

Algorithm 1: The k-means clustering algorithm.

    Input:  D ⊂ R^n, the set of data points d_i ∈ D
    Input:  κ, the number of clusters the algorithm should calculate
    Output: C ⊂ R^n, the set of cluster centroids c_j ∈ C
    Data:   M ⊆ 2^D, the set of sets m_j, where each m_j contains the d_i closest to c_j

    begin
        MSE ← 0; MSE_old ← ∞
        repeat
            foreach m_j ∈ M do m_j ← ∅                     // initialize all clusters as empty sets
            foreach d_i ∈ D do                             // assign each object to its nearest cluster
                find the closest c_j ∈ C; m_j ← m_j ∪ {d_i}
            foreach c_j ∈ C do                             // recalculate cluster centroids
                c_j ← (1/|m_j|) Σ_{d_i ∈ m_j} d_i
            MSE_old ← MSE                                  // calculate cluster quality
            MSE ← (1/|C|) Σ_j Σ_{d_i ∈ m_j} (dist(d_i, c_j))^2
        until MSE_old − MSE < ε
    end
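The data-parallel structure is easy to see in code. The following is a minimal Python/NumPy sketch of one k-means iteration over a local data set (our illustration, not the authors' implementation; the Euclidean metric is used, as in the experiments, and the function name is ours). It returns the refined centroids together with the number of points assigned to each cluster, which is precisely the information a peer contributes in the distributed setting of Section 3.

import numpy as np

def kmeans_iteration(D, C):
    """One k-means iteration: assign each point in D (shape (|D|, n)) to its
    nearest centroid in C (shape (kappa, n)), then recompute the centroids.
    Returns the new centroids and the per-cluster point counts."""
    # dists[i, j] = Euclidean distance between d_i and c_j
    dists = np.linalg.norm(D[:, None, :] - C[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)       # index of the closest centroid per point
    new_C = C.copy()
    weights = np.zeros(len(C), dtype=int)
    for j in range(len(C)):
        members = D[nearest == j]            # m_j: the points closest to c_j
        weights[j] = len(members)
        if weights[j] > 0:                   # an empty cluster keeps its old centroid
            new_C[j] = members.mean(axis=0)
    return new_C, weights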

The Probe/Echo mechanism

The probe/echo mechanism was first described by E. J. H. Chang in [Cha82]. It is an algorithm for distributing a piece of information across a general graph, in our case a computer network. Every node should receive the information only once, to reduce the number of messages sent and the number of processing steps; to this end, the algorithm implicitly constructs a spanning tree of the graph. The message complexity on a graph G = (V, E) with |E| edges is 2|E|. The initiator of the probe/echo mechanism marks itself as both engaged and initiator and sends a PROBE message to all its neighbours. When a node receives its first PROBE, it marks itself as engaged and sends PROBEs to all adjacent nodes; subsequently received PROBEs do not cause the node to send any additional PROBEs to its neighbours. After a node has received PROBE or ECHO messages from all its neighbours, it marks itself as not engaged and sends an ECHO to the node it received its first PROBE from. When the initiator has received ECHOs from all its neighbours, the algorithm terminates.
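To illustrate the message flow, here is a minimal single-threaded simulation of the probe/echo mechanism (a sketch assuming reliable message delivery; the graph is given as an adjacency dict, and all names are our own). On a connected graph, every node is informed exactly once and the total number of messages sent is 2|E|.

from collections import deque

def probe_echo(adj, initiator):
    """Simulate probe/echo on an undirected graph (adjacency dict).
    Returns the total number of messages delivered."""
    engaged = {v: False for v in adj}
    received = {v: 0 for v in adj}            # messages heard from neighbours
    pred = {v: None for v in adj}             # node the first PROBE came from
    queue = deque()
    messages = 0

    engaged[initiator] = True
    for w in adj[initiator]:                  # the initiator probes all neighbours
        queue.append(("PROBE", initiator, w))
    while queue:
        kind, src, node = queue.popleft()
        messages += 1
        received[node] += 1
        if kind == "PROBE" and not engaged[node]:
            engaged[node] = True              # first PROBE: engage and flood on
            pred[node] = src
            for w in adj[node]:
                if w != src:
                    queue.append(("PROBE", node, w))
        if received[node] == len(adj[node]):  # heard from every neighbour
            engaged[node] = False
            if node != initiator:             # echo back along the spanning tree
                queue.append(("ECHO", node, pred[node]))
    return messages

For the triangle graph, probe_echo({0: [1, 2], 1: [0, 2], 2: [0, 1]}, 0) returns 6, i.e. 2|E| for the three edges.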

3 A Distributed Algorithm to Cluster Documents

Based on the two algorithms above, we have developed and implemented an algorithm for the distributed clustering of documents (see Algorithm 2 below). The initiator of the clustering guesses an initial set of cluster centroids and sends these centroids along with a PROBE to all its neighbouring peers. Every peer receiving a PROBE resends the PROBE to all its neighbours except the one it received the PROBE from (its predecessor); then, a k-means clustering of the locally available documents is performed. On receiving an ECHO from an adjacent peer, the peer merges the clustering C_p done by the remote peer with the result C_l of the local clustering; this is done by calculating a weighted mean, taking into account how many documents each peer grouped into the respective clusters (W_l and W_p map clusters to their associated weights). After having received either a PROBE or an ECHO from every adjacent peer, the peer sends an ECHO containing the merged cluster centroids and weights to the peer it received its first PROBE from. When the initiator has received ECHOs from all its adjacent peers, it has complete information on the current iteration of the k-means algorithm; this iteration is therefore complete and terminates.

Algorithm 2: Our distributed clustering algorithm.

    Input: Neighbours, the set of all neighbouring nodes
    Input: D ⊂ R^n, the data points d_i ∈ D on the local peer
    Data:  (C_p, W_p), the remote set of cluster centroids and associated weights
    Data:  (C_l, W_l), the local set of cluster centroids and associated weights

    begin
        receive (PROBE, C_p) or (ECHO, C_p, W_p) from p ∈ Neighbours
        if PROBE then
            if not engaged then                              // received first PROBE
                engaged ← true; received ← 1; pred ← p
                send (PROBE, C_p) to Neighbours \ {pred}
                (C_l, W_l) ← kmeans(C_p, D)                  // one local iteration of k-means
            else
                received ← received + 1                      // received additional PROBE
        if ECHO then
            received ← received + 1                          // received ECHO
            (C_l, W_l) ← mergeresults(C_l, W_l, C_p, W_p)    // merge local and received results
        if received = |Neighbours| then                      // termination condition fulfilled
            engaged ← false
            if initiator then terminate
            else send (ECHO, C_l, W_l) to pred
    end

Full k-means clustering typically requires more than one iteration. Thus, based on a termination condition as described in Section 2, the initiator decides whether the currently achieved clustering needs to be improved by a further iteration. In those subsequent iterations, the initiator sends the resulting cluster centroids C_l of the last iteration as the new guess to its neighbours.
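The mergeresults step in Algorithm 2 is a per-cluster weighted mean. A minimal sketch follows (the exact signature is our assumption; centroids are NumPy arrays of shape (kappa, n), weights are length-kappa count vectors):

import numpy as np

def mergeresults(C_l, W_l, C_p, W_p):
    """Merge a received clustering (C_p, W_p) into the local one (C_l, W_l):
    each centroid becomes the mean of the two candidates, weighted by the
    number of documents each side assigned to that cluster."""
    W = W_l + W_p                            # combined cluster weights
    merged = C_l.astype(float)               # copy; keeps the local guess by default
    for j in range(len(C_l)):
        if W[j] > 0:                         # clusters empty on both sides stay unchanged
            merged[j] = (W_l[j] * C_l[j] + W_p[j] * C_p[j]) / W[j]
    return merged, W

Since this weighted mean is associative, the order in which ECHOs arrive and are merged does not change the centroids the initiator ends up with.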

4 Experimental Setup

Hardware. For our experiments, we used the computers in our laboratory and two other computer laboratories on campus. The hardware used is quite heterogeneous, ranging from Intel Pentium 4 processors at various speeds (1.4 GHz to 2.0 GHz) to Athlon MPs at 1.5 GHz. The nodes were equipped with different amounts of main memory (256 MBytes to 1024 MBytes). The three sites (our laboratory and the two laboratories on campus) were interconnected by the university's Fast Ethernet backbone; each site had a dedicated Fast Ethernet switch to route local messages and to connect the site to the backbone.

Data Sets. We used synthetic data as well as real-world data. The synthetic data sets consist of uniformly distributed random numbers in [0; 1000]; we used data sets with up to 2^21 data points and up to 32 dimensions or features. Our real-world data set is an index of the Reuters news article collection. It contains 21,578 documents; n = 3,775 features remained after stopword elimination, elimination of extremely rare terms, and Porter stemming.

Measurements. Each measurement was performed on 2, 4, 8, 16, 32 and 48 peers. The measurements on the synthetic data sets were additionally made with only one peer running, to obtain a baseline against which the distributed case can be compared. The synthetic data sets were clustered into κ = 8, 32, 128 clusters, whereas the real-world data set from the Reuters news article collection was clustered into 57 clusters; this seems to be a reasonable value, since the collection has been classified by human experts into 57 classes with at least 20 documents each. For our experiments, we allowed each peer to communicate with 4 to 8 other randomly chosen peers. We took care to make inter-site connections as probable as intra-site connections.

[Figure 1: Time per iteration (log scale) over the number of peers. Left: one iteration of Algorithm 2 on synthetic data sets with 131,072 to 2,097,152 tuples of 32 dimensions. Right: clustering the Reuters collection (21,578 documents, 3,775 features, 57 clusters) for two topologies, with each peer connecting to 4-8 or to 3-6 others.]

5 Experimental Results

From the measurements described above, we picked two especially interesting results. As can be seen on the left side of Figure 1, the time for clustering the synthetic data sets diminishes with an increasing number of peers. The best results are obtained if the size of the local document collection on each peer as well as the number of desired clusters does not fall below a certain limit; this explains why the times for the two smaller data sets with 131,072 and 262,144 tuples increase when clustered on 48 peers instead of 32 peers. A similar observation has been made in [DM00]. The right side of Figure 1 contains the results obtained from measurements on the Reuters text collection. Clearly, our approach is capable of performing the clustering of such a collection.

Again, the times for one iteration are higher with 48 clustering peers than with 32 peers; in the former case, there are only about 450 documents per peer, and with a local document collection this small our algorithm shows only a limited amount of its potential. An interesting fact is that the network topology has an important impact on the measured times: when the connectivity is changed so that each peer only connects to 3 to 6 other peers, the measured times for one iteration decrease, especially when 16, 32, or 48 peers participate in the distributed clustering.

6 Conclusion

In this paper, we presented an algorithm for the distributed clustering of documents. We have shown that our algorithm is capable of handling real-world-sized problems and that distribution yields, even in our P2P setting, speedups in comparison to centralized processing. For data sets such as text (many features and many documents with respect to the number of desired clusters), the savings in transfer time alone justify our distributed approach. We find these results very encouraging for our setting of P2P IR, as, in addition to the documented speedup, the processor load on each peer is dramatically reduced with respect to a client/server scenario in which clients send data to a dedicated clustering service. It can thus safely be assumed that a distributed calculation of k-means clusters can be performed as a background task in a P2P IR network. As a consequence, the achieved clustering can be used for a compact representation of peer contents in the form of histograms denoting the number of documents per peer falling into each cluster. These representations can be distributed in the network and used as a basis for source selection and query routing in a P2P network.

References

[BF98] Paul S. Bradley and Usama M. Fayyad. Refining Initial Points for K-Means Clustering. In Proc. 15th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1998.

[Cha82] E. J. H. Chang. Echo Algorithms: Depth Parallel Operations on General Graphs. IEEE Transactions on Software Engineering, 8(4), 1982.

[DM00] Inderjit S. Dhillon and Dharmendra S. Modha. A Data-Clustering Algorithm on Distributed Memory Multiprocessors. In Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, Aug 15, 1999, San Diego, USA, volume 1759 of Lecture Notes in Computer Science. Springer, 2000.

[EH02] Martin Eisenhardt and Andreas Henrich. Problems and Solutions for P2P IR Systems (in German). In Sigrid Schubert, Bernd Reusch, and Norbert Jesse, editors, Informatik 2002, volume 19 of LNI. GI, Dortmund, 2002.

[SC02] Mark Sinka and David Corne. A Large Benchmark Dataset for Web Document Clustering. In Soft Computing Systems: Design, Management and Applications, volume 87 of Frontiers in Artificial Intelligence and Applications, 2002.

[Ski01] David B. Skillicorn. The Case for Datacentric Grids. External technical report, Dept. of Computing and Information Science, Queen's University, Kingston, Canada, Nov 2001.
