Classifying Documents by Distributed P2P Clustering

 Cynthia Caldwell
 9 months ago
 Views:
Transcription
1 Classifying Documents by Distributed P2P Clustering Martin Eisenhardt Wolfgang Müller Andreas Henrich Chair of Applied Computer Science I University of Bayreuth, Germany {eisenhardt mueller2 Abstract: Clustering documents into classes is an important task in many Information Retrieval (IR) systems. This achieved grouping enables a description of the contents of the document collection in terms of the classes the documents fall into. The compactness of such a description is even more desirable in cases where the document collection is spread across different computers and locations; document classes can then be used to describe each partial document collection in a conveniently short form that can easily be exchanged with other nodes on the network. Unfortunately, most clustering schemes cannot easily be distributed. Additionally, the costs of transferring all data to a central clustering service are prohibitive in largescale systems. In this paper, we introduce an approach which is capable of classifying documents that are distributed across a PeertoPeer (P2P) network. We present measurements taken on a P2P network using synthetic and realworld data sets. 1 Motivation In P2P IR systems and Distributed IR Systems, an important problem is the routing of queries or the selection of sources which might contain relevant documents. Some approaches in this respect are based on compact representations of adjacent peers document collections or the documents maintained on the reachable source nodes. In this context, an interesting approach is to determine a consistent clustering of all documents in the network, and to acquire knowledge about which node contains how many documents falling into each cluster. Then, for a given query, the querying node can identify the most promising clusters by that compact representation of other peers collections; the query can then be routed precisely to nodes potentially containing relevant documents. In the scenario above, an efficient algorithm is needed to determine the clustering of the global document collection. The present paper will describe and assess such an algorithm. The concrete motivation behind this approach stems from our research on largescale P2P IR networks (cf. [EH02]). In environments like this, the costs of transferring documents and/or data to a central clustering service become prohibitive; in some cases, it might even be infeasible to transfer the data, as Skillicorn points out (see [Ski01]). Therefore, a distributed clustering algorithm is needed.
2 2 Background Our approach is based on two wellknown algorithms from the areas of statistics and distributed systems. We use the kmeans clustering algorithm to do the actual categorization of the documents. A probe/echo mechanism is used to distribute the task through the P2P network and propagate the results back to the initiator of the clustering. The kmeans Clustering Algorithm The kmeans clustering algorithm is a nonhierarchical approach to clustering (see the box titled Algorithm 1). Its input parameters are a number κ of desired clusters, a set D of ndimensional data objects d i, and a initial set C of ndimensional cluster centroids c j where d i, c j R n ; c i,l and d j,l depict the n single vector elements. The clustering algorithm follows a very straightforward approach: in each iteration, the distances of each d i D to all c j C are computed. Each d i is allocated to the nearest c j. Then, every centroid c j is recalculated as the centroid of all data objects d i allocated to it. There are a few mostly equivalent termination conditions for the kmeans algorithm. One of the most widely used is that the algorithm terminates if the sum of all mean square errors (MSE) within each cluster does no longer decrease from iteration to iteration. Note that the function dist(d i, c j ) can be chosen according to the data set you are clustering: a common class of distance metrics are the Minkowski metrics dist mink (c i, d j ) = m d i,l c j,l m where the parameter m N can be set to generate a specific metric, l e.g. the Manhattan metric (m = 1) or the Euclidean metric (m = 2); the latter metric was used in our experiments. Other metrics (f.e. cosine metric) are also possible. Various methods exist for the generation of initial cluster centroids, ranging from quite obvious ones (just take C randomly chosen documents as the initial centroids) to sophisticated ones that lead to a faster convergation and termination of the algorithm (see [BF98]). As Modha and Dhillon point out in [DM00], the kmeans algorithm is inherently data parallel. Since it involves no hierarchical scheme, the only data that must be shared among the nodes are the cluster centroids; all locally refined centroids can then be merged at the node which initiated the clustering; this node decides according to a termination condition whether another iteration of the kmeans algorithm is necessary. The advantage of this approach is that the transfer of relatively few centroids is feasible as opposed to transferring whole data sets. Additionally, the kmeans algorithm is wellsuited for the clustering of documents, as is shown in [SC02]. The Probe/Echo mechanism The Probe/Echo mechanism was first described by E. J. H. Chang in [Cha82]. It is an algorithm to distribute a piece of information across a general graph, in our case a computer network. We want every node to receive the information only once to reduce the amount of messages sent and the number of processing steps. The algorithm implicitly constructs a spanning tree of the graph. The message complexity on a graph G = (V, E) with E edges is 2 E. The initiator of the probe/echo mechanism marks itself as both engaged and initiator and sends a PROBE message to all its neighbours. When a node receives the first PROBE,
3 Algorithm 1: The kmeans clustering algorithm. Input : D R n the set of data points d i D. Input : κ the number of clusters the algorithm should calculate. Output : C R n the set of cluster centroids c j C. Data : M 2 D the set of sets m j where each m j contains the d i closest to c j. begin MSE 0, MSE old ; end repeat foreach m j M do m j ; foreach d i D do Find closest c j C; m j m j {d i }; foreach c j C do c j 1 m j d i m j d i ; MSE old MSE; MSE 1 c j (dist(d i, c j )) 2 ; j d i m j until MSE old MSE < ε // initialize all clusters as empty sets // assign each object to its nearest cluster // recalculate cluster centroids // calculate cluster quality it marks itself as engaged and sends PROBEs to all adjacent nodes. All subsequent received PROBEs do not cause the node to send an additional PROBE to its neighbours. After a node has received PROBE or ECHO messages from all its neighbours it marks itself as not engaged and sends an ECHO to the node it received its first PROBE from. When the initiator has received ECHOs from all its neighbours, the algorithm terminates. 3 A Distributed Algorithm to Cluster Documents Based on the two algorithms above, we have developed and implemented an algorithm for the distributed clustering of documents (see box Algorithm 2). The initiator of the clustering guesses an initial set of cluster centroids and sends these centroids along with an PROBE to all its neighbouring peers. Every peer receiving a PROBE proceeds with resending the PROBE to all its neighbours except the one it received the PROBE from (the predecessor). Then, a kmeans clustering on the locally available documents is done. On receiving an ECHO from an adjacent peer, the peer merges the clustering done by the remote peer C p with the clustering resultant from the local clustering C l ; this is done by calculating a weighted mean taking into account how many documents each peer grouped into the respective clusters (W l and W p map clusters to their associated weights). After having received either a PROBE or an ECHO from every adjacent peer, the peer sends an ECHO containing the merged cluster centroids and the weighting to the peer it received the first PROBE from. When the initiator has received ECHOs from all its adjacent peers, it has
4 Algorithm 2: Our distributed clustering algorithm. Input : Neighbours the set of all neighbouring nodes. Input : D R n the data points d i D on the local peer. Data : (C p, W p ) remote set of cluster centroids and associated weights. Data : (C l, W l ) local set of cluster centroids and associated weights. begin receive (PROBE, C p ) or (ECHO, C p, W p ) from p Neighbours; if PROBE then if engaged then engaged true; received 1; pred p; // received first PROBE send (PROBE, C p ) to Neighbours\{pred}; (C l, W l ) kmeans(c p, D); // one local iteration of kmeans end else received received + 1; // received additional PROBE if ECHO then received received + 1; // received ECHO (C l, W l ) mergeresults(c l, W l, C p, W p ); // merge local and received results if received = Neighbours then engaged false; if initiator then terminate; else send (ECHO, C l, W l ) to pred; // termination condition fulfilled complete information on the current iteration of the kmeans algorithm. Therefore, this iteration is complete and terminates. Full kmeans clustering typically requires more than one iteration. Thus, based on a condition as described in section 2, the initiator decides whether the currently achieved clustering needs to be improved by a further iteration. In those subsequent iterations, the initiator sends the resultant cluster centroids C l of the last iteration as the new guess to its neighbours. 4 Experimental Setup Hardware For our experiments, we used the computers in our laboratory and two other computer laboratories on campus. The hardware used is quite heterogeneous, ranging from Intel Pentium 4 processors at various speeds ( 1.4GHz 2.0 GHz) to Athlon MPs at 1.5 GHz. The nodes were equipped with different amounts of main memory ( 256 MBytes 1024 MBytes). The three sites (our laboratory and the two laboratories on campus) were interconnected by the university s Fast Ethernet backbone; each site had a dedicated Fast Ethernet switch to route local messages and to connect the site to the backbone. Data Sets We used synthetic data as well as realworld data. The synthetic data sets consist of evenly distributed random numbers [0; 1000]. We used data sets with
5 Time (s) [log scale] clusters, 32 dimensions D = 131,072 D = 262,144 D = 524,288 D =1,048,576 D =2,097,152 Time (s) [log scale] ,578 Documents, 3775 Dimensions, 57 Clusters Each peer can connect to 4 to 8 others Each peer can connect to 3 to 6 others Number of Peers Number of Peers Figure 1: The left figure shows the times for one iteration of our Algorithm 2, measured on data sets with (131,0722,097,152) tupels with 32 dimensions. The right figure shows the times for clustering the Reuters collection (21,578 documents with 3,775 features) D 2 21 data points and 2 n 32 dimensions or features. Our realworld data set is an index of the Reuters news article collection. It contains 21,578 documents; n = 3, 775 features remained after stopword elimination, elimination of extremely rare terms, and Porter stemming. Measurements Each measurement was performed on 2, 4, 8, 16, 32 and 48 peers. The measurements on the synthetic data sets were additionally made with only one peer running, to achieve a baseline against which the distributed case can be compared. The synthetic data sets were clustered into κ = 8, 32, 128 clusters, whereas the realworld data set from the Reuters news article collection was clustered into 57 clusters, which seems to be a reasonable value, since the collection has been classified by human experts into 57 clusters with at least 20 documents each 1. For our experiments, we allowed each peer to communicate with 4 to 8 other randomly chosen peers. We took care to make intersites connections as probable as intrasite connections. 5 Experimental Results From the measurements as described above, we depicted two especially interesting results. As can be seen on the left side of figure 1, the time for clustering the synthetic data sets diminishes with an increasing number of peers. Best results can be obtained if the size of the local document collection on each peer as well as the number of desired clusters does not fall beyond a certain limit. This explains why the times for the two smaller data sets with 131,072 and 262,144 tupels increase when clustered on 48 peers instead of 32 peers. A similar observation has been made in [DM00]. Figure 1, right, contains the results obtained from measurements on the Reuters text collection. Clearly, our approach is capable to perform the clustering of such a collection. 1 See for more information.
6 Again, the times for one iteration are higher in the case of 48 clustering peers than they are in the case of 32 peers; in the first case, there are only documents per peer; with a document collection this small our algorithm shows only a limited amount of its potential. An interesting fact is that the network topology has an important impact on the times measured. When changing the connectivity each peer now only connects to 3 6 other peers the measured times for one iteration decrease, especially when 16, 32, or 48 peers participate in the distributed clusterings shown in the graph. 6 Conclusion In this paper, we presented an algorithm for the distributed clustering of documents. We have shown that our algorithm is capable of handling realworldsized problems and that distribution creates even in our P2P setting speed ups in comparison to centralized processing. For data sets such as text (many features, many documents with respect to the number of desired clusters), the savings in transfer time alone justify our distributed approach. We find these results very encouraging for our setting of P2P IR, as in addition to the documented speedup, the processor load in each peer is dramatically reduced with respect to a client/server scenario in which clients send data to a dedicated clustering service. It can thus be safely assumed that a distributed calculation of kmeans clusters can be performed as a background task in a P2P IR network. As a consequence, the achieved clustering can be used for a compact representation of peer contents in the form of histograms denoting the number of documents per peer falling into each cluster. These representations can be distributed in the network and used as a basis for source selection and query routing in a P2P network. References [BF98] Paul S. Bradley and Usama M. Fayyad. Refining initial points for KMeans clustering. In Proc. 15th International Conf. on Machine Learning, pages Morgan Kaufmann, San Francisco, CA, [Cha82] E. J. H. Chang. Echo Algorithms: Depth Parallel Operations on General Graphs. IEEE Transactions on Software Engineering, 8(4): , [DM00] Inderjit S. Dhillon and Dharmendra S. Modha. A DataClustering Algorithm on Distributed Memory Multiprocessors. In LargeScale Parallel Data Mining, Workshop on LargeScale Parallel KDD Systems, SIGKDD, Aug 15, 1999, San Diego, USA, volume 1759 of Lecture Notes in Computer Science, pages Springer, [EH02] Martin Eisenhardt and Andreas Henrich. Problems and Solutions for P2P IR Systems (in german). In Sigrid Schubert, Bernd Reusch, and Norbert Jesse, editors, Informatik 2002, volume 19 of LNI, Dortmund, GI. [SC02] Mark Sinka and David Corne. A Large Benchmark Dataset for Web Document Clustering. In Soft Computing Systems: Design, Management and Applications, Vol. 87 of Frontiers in Artificial Intelligence and Applications, pages , [Ski01] David B. Skillicorn. The Case for Datacentric Grids. External technical report, Dept. of Computing and Information Science, Queen s University, Kingston, Canada, Nov
Efficiency of kmeans and KMedoids Algorithms for Clustering Arbitrary Data Points
Efficiency of kmeans and KMedoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai600106,
More informationClustering Part 4 DBSCAN
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More informationBuilding a Concept Hierarchy from a Distance Matrix
Building a Concept Hierarchy from a Distance Matrix HuangCheng Kuo 1 and JenPeng Huang 2 1 Department of Computer Science and Information Engineering National Chiayi University, Taiwan 600 hckuo@mail.ncyu.edu.tw
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.
More informationSOMSN: An Effective Self Organizing Map for Clustering of Social Networks
SOMSN: An Effective Self Organizing Map for Clustering of Social Networks Fatemeh Ghaemmaghami Research Scholar, CSE and IT Dept. Shiraz University, Shiraz, Iran Reza Manouchehri Sarhadi Research Scholar,
More informationImproved Performance of Unsupervised Method by Renovated KMeans
Improved Performance of Unsupervised Method by Renovated P.Ashok Research Scholar, Bharathiar University, Coimbatore Tamilnadu, India. ashokcutee@gmail.com Dr.G.M Kadhar Nawaz Department of Computer Application
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIITDelhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIITDelhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationAssociation Rule Mining and Clustering
Association Rule Mining and Clustering Lecture Outline: Classification vs. Association Rule Mining vs. Clustering Association Rule Mining Clustering Types of Clusters Clustering Algorithms Hierarchical:
More informationAnalyzing Outlier Detection Techniques with Hybrid Method
Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,
More informationA Comparative Study of Data Mining Process Models (KDD, CRISPDM and SEMMA)
International Journal of Innovation and Scientific Research ISSN 23518014 Vol. 12 No. 1 Nov. 2014, pp. 217222 2014 Innovative Space of Scientific Research Journals http://www.ijisr.issrjournals.org/
More informationDocument Clustering: Comparison of Similarity Measures
Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation
More informationA Novel Template Matching Approach To SpeakerIndependent Arabic Spoken Digit Recognition
Special Session: Intelligent Knowledge Management A Novel Template Matching Approach To SpeakerIndependent Arabic Spoken Digit Recognition Jiping Sun 1, Jeremy Sun 1, Kacem Abida 2, and Fakhri Karray
More informationHierarchical Document Clustering
Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 WolfTilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationFuzzy Ant Clustering by Centroid Positioning
Fuzzy Ant Clustering by Centroid Positioning Parag M. Kanade and Lawrence O. Hall Computer Science & Engineering Dept University of South Florida, Tampa FL 33620 @csee.usf.edu Abstract We
More informationEnhancing Kmeans Clustering Algorithm with Improved Initial Center
Enhancing Kmeans Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of
More informationIteration Reduction K Means Clustering Algorithm
Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of
More informationPerformance impact of dynamic parallelism on different clustering algorithms
Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware Email: jdimarco@udel.edu, taufer@udel.edu
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationNew Approach for Kmean and Kmedoids Algorithm
New Approach for Kmean and Kmedoids Algorithm Abhishek Patel Department of Information & Technology, Parul Institute of Engineering & Technology, Vadodara, Gujarat, India Purnima Singh Department of
More informationSemantic Object Recognition in Digital Images
Semantic Object Recognition in Digital Images Semantic Object Recognition in Digital Images Falk Schmidsberger and Frieder Stolzenburg Hochschule Harz, Friedrichstr. 57 59 38855 Wernigerode, GERMANY {fschmidsberger,fstolzenburg}@hsharz.de
More informationAutomated ClusteringBased Workload Characterization
Automated ClusteringBased Worload Characterization Odysseas I. Pentaalos Daniel A. MenascŽ Yelena Yesha Code 930.5 Dept. of CS Dept. of EE and CS NASA GSFC Greenbelt MD 2077 George Mason University Fairfax
More informationIntroduction to Computer Science
DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at
More informationImproved Similarity Measure For Text Classification And Clustering
Improved Similarity Measure For Text Classification And Clustering Rahul Nalawade 1, Akash Samal 2, Kiran Avhad 3 1Computer Engineering Department, STES Sinhgad Academy Of Engineering,Pune 2Computer Engineering
More informationGeneral Instructions. Questions
CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These
More informationAutomatic Cluster Number Selection using a Split and Merge KMeans Approach
Automatic Cluster Number Selection using a Split and Merge KMeans Approach Markus Muhr Knowledge Relationship Discovery KnowCenter Graz Graz, Austria mmuhr@knowcenter.at Michael Granitzer) Institute
More informationA Vector Space Equalization Scheme for a Conceptbased Collaborative Information Retrieval System
A Vector Space Equalization Scheme for a Conceptbased Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 16031 Kamitomiokacho, Nagaokashi Niigata, 9402188 JAPAN
More informationAgglomerative clustering on vertically partitioned data
Agglomerative clustering on vertically partitioned data R.Senkamalavalli Research Scholar, Department of Computer Science and Engg., SCSVMV University, Enathur, Kanchipuram 631 561 sengu_cool@yahoo.com
More informationScalability of Efficient Parallel KMeans
Scalability of Efficient Parallel KMeans David Pettinger and Giuseppe Di Fatta School of Systems Engineering The University of Reading Whiteknights, Reading, Berkshire, RG6 6AY, UK {D.G.Pettinger,G.DiFatta}@reading.ac.uk
More informationFeature Extraction in Wireless Personal and Local Area Networks
Feature Extraction in Wireless Personal and Local Area Networks 29. October 2003, Singapore Institut für Praktische Informatik Johannes Kepler Universität Linz, Austria rene@soft.unilinz.ac.at < 1 > Content
More informationAnalysis of KMeans Clustering Based Image Segmentation
IOSR Journal of Electronics and Communication Engineering (IOSRJECE) eissn: 22782834,p ISSN: 22788735.Volume 10, Issue 2, Ver.1 (Mar  Apr.2015), PP 0106 www.iosrjournals.org Analysis of KMeans
More informationImproved MapReduce kmeans Clustering Algorithm with Combiner
2014 UKSimAMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce kmeans Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab M. Duwairi* and Khaleifah Al.jada'** * Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110,
More informationAn Improvement of CentroidBased Classification Algorithm for Text Classification
An Improvement of CentroidBased Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationA Two Stage Zone Regression Method for Global Characterization of a Project Database
A Two Stage Zone Regression Method for Global Characterization 1 Chapter I A Two Stage Zone Regression Method for Global Characterization of a Project Database J. J. Dolado, University of the Basque Country,
More informationClustering in Big Data Using KMeans Algorithm
Clustering in Big Data Using KMeans Algorithm Ajitesh Janaswamy, B.E Dept of CSE, BITS Pilani, Dubai Campus. ABSTRACT: Kmeans is the most widely used clustering algorithm due to its fairly straightforward
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationAnalytics and Visualization of Big Data
1 Analytics and Visualization of Big Data Fadel M. Megahed Lecture 15: Clustering (Discussion Class) Department of Industrial and Systems Engineering Spring 13 Refresher: Hierarchical Clustering 2 Key
More informationBandwidth Aware Routing Algorithms for NetworksonChip
1 Bandwidth Aware Routing Algorithms for NetworksonChip G. Longo a, S. Signorino a, M. Palesi a,, R. Holsmark b, S. Kumar b, and V. Catania a a Department of Computer Science and Telecommunications Engineering
More informationCOMPARISON OF DENSITYBASED CLUSTERING ALGORITHMS
COMPARISON OF DENSITYBASED CLUSTERING ALGORITHMS Mariam Rehman Lahore College for Women University Lahore, Pakistan mariam.rehman321@gmail.com Syed Atif Mehdi University of Management and Technology Lahore,
More informationFast Approximate Minimum Spanning Tree Algorithm Based on KMeans
Fast Approximate Minimum Spanning Tree Algorithm Based on KMeans Caiming Zhong 1,2,3, Mikko Malinen 2, Duoqian Miao 1,andPasiFränti 2 1 Department of Computer Science and Technology, Tongji University,
More informationUNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania
UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of realvalued data is often used as a preprocessing
More informationInternet Traffic Classification Using Machine Learning. Tanjila Ahmed Dec 6, 2017
Internet Traffic Classification Using Machine Learning Tanjila Ahmed Dec 6, 2017 Agenda 1. Introduction 2. Motivation 3. Methodology 4. Results 5. Conclusion 6. References Motivation Traffic classification
More informationAIIA shot boundary detection at TRECVID 2006
AIIA shot boundary detection at TRECVID 6 Z. Černeková, N. Nikolaidis and I. Pitas Artificial Intelligence and Information Analysis Laboratory Department of Informatics Aristotle University of Thessaloniki
More informationObjectbased Image Segmentation OBIS Tool. User s Manual. Version 1.50
P2P Tools Objectbased Image Segmentation OBIS Tool User s Manual Version 1.50 Principal Investigator: Dr. Xuefeng Chu Postdoctoral Research Associate: Dr. Jianli Zhang Graduate Research Assistants: Jun
More informationUSING SOFT COMPUTING TECHNIQUES TO INTEGRATE MULTIPLE KINDS OF ATTRIBUTES IN DATA MINING
USING SOFT COMPUTING TECHNIQUES TO INTEGRATE MULTIPLE KINDS OF ATTRIBUTES IN DATA MINING SARAH COPPOCK AND LAWRENCE MAZLACK Computer Science, University of Cincinnati, Cincinnati, Ohio 45220 USA Email:
More informationScalability of efficient parallel K Means
Scalability of efficient parallel K Means Conference or Workshop Item Published Version Pettinger, D. and Di Fatta, G. (2009) Scalability of efficient parallel K Means. In: 5th IEEE International Conference
More informationAvailable online at ScienceDirect. Procedia Computer Science 89 (2016 )
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 89 (2016 ) 341 348 Twelfth International MultiConference on Information Processing2016 (IMCIP2016) Parallel Approach
More informationCLASSIFICATION FOR SCALING METHODS IN DATA MINING
CLASSIFICATION FOR SCALING METHODS IN DATA MINING Eric Kyper, College of Business Administration, University of Rhode Island, Kingston, RI 02881 (401) 8747563, ekyper@mail.uri.edu Lutz Hamel, Department
More informationClustering Color/Intensity. Group together pixels of similar color/intensity.
Clustering Color/Intensity Group together pixels of similar color/intensity. Agglomerative Clustering Cluster = connected pixels with similar color. Optimal decomposition may be hard. For example, find
More informationCluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University
Cluster Evaluation and Expectation Maximization! adapted from: Doug Downey and Bryan Pardo, Northwestern University Kinds of Clustering Sequential Fast Cost Optimization Fixed number of clusters Hierarchical
More informationSpace Filling Curves and Hierarchical Basis. Klaus Speer
Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of
More informationParallel Kmeans Clustering. Ajay Padoor Chandramohan Fall 2012 CSE 633
Parallel Kmeans Clustering Ajay Padoor Chandramohan Fall 2012 CSE 633 Outline Problem description Implementation MPI Implementation OpenMP Test Results Conclusions Future work Problem Description Clustering
More informationUsing Decision Boundary to Analyze Classifiers
Using Decision Boundary to Analyze Classifiers Zhiyong Yan Congfu Xu College of Computer Science, Zhejiang University, Hangzhou, China yanzhiyong@zju.edu.cn Abstract In this paper we propose to use decision
More informationMixture Models and EM
Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering
More informationScienceDirect. Plan Restructuring in Multi Agent Planning
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 396 401 International Conference on Information and Communication Technologies (ICICT 2014) Plan Restructuring
More informationApplying Kohonen Network in Organising Unstructured Data for Talus Bone
212 Third International Conference on Theoretical and Mathematical Foundations of Computer Science Lecture Notes in Information Technology, Vol.38 Applying Kohonen Network in Organising Unstructured Data
More informationHIMIC : A Hierarchical Mixed Type Data Clustering Algorithm
HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm R. A. Ahmed B. Borah D. K. Bhattacharyya Department of Computer Science and Information Technology, Tezpur University, Napam, Tezpur784028,
More informationSYMBOLIC FEATURES IN NEURAL NETWORKS
SYMBOLIC FEATURES IN NEURAL NETWORKS Włodzisław Duch, Karol Grudziński and Grzegorz Stawski 1 Department of Computer Methods, Nicolaus Copernicus University ul. Grudziadzka 5, 87100 Toruń, Poland Abstract:
More informationFinding Clusters 1 / 60
Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. kmeans Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60
More informationLinear combinations of simple classifiers for the PASCAL challenge
Linear combinations of simple classifiers for the PASCAL challenge Nik A. Melchior and David Lee 16 721 Advanced Perception The Robotics Institute Carnegie Mellon University Email: melchior@cmu.edu, dlee1@andrew.cmu.edu
More informationIndex Terms: Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.
International Journal of Scientific & Engineering Research, Volume 5, Issue 10, October2014 559 DCCR: Document Clustering by Conceptual Relevance as a Factor of Unsupervised Learning Annaluri Sreenivasa
More informationDensitybased clustering algorithms DBSCAN and SNN
Densitybased clustering algorithms DBSCAN and SNN Version 1.0, 25.07.2005 Adriano Moreira, Maribel Y. Santos and Sofia Carneiro {adriano, maribel, sofia}@dsi.uminho.pt University of Minho  Portugal 1.
More informationUsing Google s PageRank Algorithm to Identify Important Attributes of Genes
Using Google s PageRank Algorithm to Identify Important Attributes of Genes Golam Morshed Osmani Ph.D. Student in Software Engineering Dept. of Computer Science North Dakota State Univesity Fargo, ND 58105
More informationUSING IMAGES PATTERN RECOGNITION AND NEURAL NETWORKS FOR COATING QUALITY ASSESSMENT Image processing for quality assessment
USING IMAGES PATTERN RECOGNITION AND NEURAL NETWORKS FOR COATING QUALITY ASSESSMENT Image processing for quality assessment L.M. CHANG and Y.A. ABDELRAZIG School of Civil Engineering, Purdue University,
More informationParallel Implementation of KMeans on MultiCore Processors
Parallel Implementation of KMeans on MultiCore Processors Fahim Ahmed M. Faculty of Science, Suez University, Suez, Egypt, ahmmedfahim@yahoo.com Abstract Nowadays, all most personal computers have multicore
More informationA Comparison Between the Silhouette Index and the DaviesBouldin Index in Labelling IDS Clusters
A Comparison Between the Silhouette Index and the DaviesBouldin Index in Labelling IDS Clusters Slobodan Petrović NISlab, Department of Computer Science and Media Technology, Gjøvik University College,
More informationTraffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers
Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane
More informationUniversity of Ghana Department of Computer Engineering School of Engineering Sciences College of Basic and Applied Sciences
University of Ghana Department of Computer Engineering School of Engineering Sciences College of Basic and Applied Sciences CPEN 405: Artificial Intelligence Lab 7 November 15, 2017 Unsupervised Learning
More informationEfficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper
More informationA grid representation for Distributed Virtual Environments
A grid representation for Distributed Virtual Environments Pedro Morillo, Marcos Fernández, Nuria Pelechano Instituto de Robótica, Universidad de Valencia. Polígono Coma S/N. Aptdo.Correos 22085, CP: 46071
More informationIntroducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values
Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine
More informationColour Segmentationbased Computation of Dense Optical Flow with Application to Video Object Segmentation
ÖGAI Journal 24/1 11 Colour Segmentationbased Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology
More informationHybrid Feature Selection for Modeling Intrusion Detection Systems
Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,
More informationIntroduction to Machine Learning. Xiaojin Zhu
Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
More informationParallel Fuzzy cmeans Cluster Analysis
Parallel Fuzzy cmeans Cluster Analysis Marta V. Modenesi; Myrian C. A. Costa, Alexandre G. Evsukoff and Nelson F. F. Ebecken COPPE/Federal University of Rio de Janeiro, P.O.Box 6856, 94597 Rio de Janeiro
More informationUnsupervised Learning
Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,
More informationHierarchical Clustering of Process Schemas
Hierarchical Clustering of Process Schemas Claudia Diamantini, Domenico Potena Dipartimento di Ingegneria Informatica, Gestionale e dell'automazione M. Panti, Università Politecnica delle Marche  via
More informationTransforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm
Transforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm Expert Systems: Final (Research Paper) Project Daniel JosiahAkintonde December
More informationParallel Algorithms K means Clustering
CSE 633: Parallel Algorithms Spring 2014 Parallel Algorithms K means Clustering Final Results By: Andreina Uzcategui Outline The problem Algorithm Description Parallel Algorithm Implementation(MPI) Test
More informationA Formal Approach to Score Normalization for Metasearch
A Formal Approach to Score Normalization for Metasearch R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003
More informationPerformance Analysis of Data Mining Classification Techniques
Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal
More informationUnsupervised Distributed Clustering
Unsupervised Distributed Clustering D. K. Tasoulis, M. N. Vrahatis, Department of Mathematics, University of Patras Artificial Intelligence Research Center (UPAIRC), University of Patras, GR 26110 Patras,
More informationSIMILARITY MEASURES FOR MULTIVALUED ATTRIBUTES FOR DATABASE CLUSTERING
SIMILARITY MEASURES FOR MULTIVALUED ATTRIBUTES FOR DATABASE CLUSTERING TAEWAN RYU AND CHRISTOPH F. EICK Department of Computer Science, University of Houston, Houston, Texas 772043475 {twryu, ceick}@cs.uh.edu
More informationA Composite Graph Model for Web Document and the MCS Technique
A Composite Graph Model for Web Document and the MCS Technique Kaushik K. Phukon Department of Computer Science, Gauhati University, Guwahati14,Assam, India kaushikphukon@gmail.com Abstract It has been
More informationClustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme
Clustering (Basic concepts and Algorithms) Entscheidungsunterstützungssysteme Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some
More informationLINUX. Benchmark problems have been calculated with dierent cluster con gurations. The results obtained from these experiments are compared to those
Parallel Computing on PC Clusters  An Alternative to Supercomputers for Industrial Applications Michael Eberl 1, Wolfgang Karl 1, Carsten Trinitis 1 and Andreas Blaszczyk 2 1 Technische Universitat Munchen
More informationClustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic
Clustering SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering is one of the fundamental and ubiquitous tasks in exploratory data analysis a first intuition about the
More informationAdvanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret
Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely
More informationOptimization CBIR using KMeans Clustering for Image Database
Juli Rejito et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 3 (), 0,79793 Optimization CBIR using KMeans Clustering for Database Juli Rejito*, Retantyo
More informationKTH ROYAL INSTITUTE OF TECHNOLOGY. Lecture 14 Machine Learning. Kmeans, knn
KTH ROYAL INSTITUTE OF TECHNOLOGY Lecture 14 Machine Learning. Kmeans, knn Contents Kmeans clustering KNearest Neighbour Power Systems Analysis An automated learning approach Understanding states in
More informationOptical Flow Estimation with CUDA. Mikhail Smirnov
Optical Flow Estimation with CUDA Mikhail Smirnov msmirnov@nvidia.com Document Change History Version Date Responsible Reason for Change Mikhail Smirnov Initial release Abstract Optical flow is the apparent
More informationAn Optimized Search Mechanism for Large Distributed Systems
An Optimized Search Mechanism for Large Distributed Systems Herwig Unger 1, Thomas Böhme, Markus Wulff 1, Gilbert Babin 3, and Peter Kropf 1 Fachbereich Informatik Universität Rostock D1051 Rostock, Germany
More informationFORMALIZED SOFTWARE DEVELOPMENT IN AN INDUSTRIAL ENVIRONMENT
FORMALIZED SOFTWARE DEVELOPMENT IN AN INDUSTRIAL ENVIRONMENT Otthein Herzog IBM Germany, Dept. 3100 P.O.Box 80 0880 D7000 STUTTGART, F. R. G. ABSTRACT tn the IBM Boeblingen Laboratory some software was
More informationPERFORMANCE ANALYSIS OF DATA MINING TECHNIQUES FOR REAL TIME APPLICATIONS
PERFORMANCE ANALYSIS OF DATA MINING TECHNIQUES FOR REAL TIME APPLICATIONS Pradeep Nagendra Hegde 1, Ayush Arunachalam 2, Madhusudan H C 2, Ngabo Muzusangabo Joseph 1, Shilpa Ankalaki 3, Dr. Jharna Majumdar
More informationIntroduction to Clustering
Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of
More informationAssociation Rule Mining. Introduction 46. Study core 46
Learning Unit 7 Association Rule Mining Introduction 46 Study core 46 1 Association Rule Mining: Motivation and Main Concepts 46 2 Apriori Algorithm 47 3 FPGrowth Algorithm 47 4 Assignment Bundle: Frequent
More informationClustering Algorithms for general similarity measures
Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative
More information