Determining the k in k-means with MapReduce
|
|
- Erika Blair
- 5 years ago
- Views:
Transcription
1 Algorithms for MapReduce and Beyond 2014 Determining the k in k-means with MapReduce Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard
2 Clustering & k-means Clustering K-means [Stuart P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28: , 1982.] 1982 (a great year!) But still largely used Drawbacks (amongst others): Local minimum K is a parameter! 2
3 Clustering & k-means Determine k: VERY difficult [Anil K Jain. Data Clustering : 50 Years Beyond K-Means. Pattern Recognition Letters, 2009] Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, jump method (based on information theory), Gap statistic,... O(k²) 3
4 G-means G-means [Greg Hamerly and Charles Elkan. Learning the k in kmeans. In Neural Information Processing Systems. MIT Press, 2003] K-means : points in each cluster are spherically distributed around the center Source: scikit-learn 4
5 G-means G-means [Greg Hamerly and Charles Elkan. Learning the k in kmeans. In Neural Information Processing Systems. MIT Press, 2003] K-means : points in each cluster are spherically distributed around the center normality test & recursion 5
6 G-means Dataset 6
7 G-means 1. Pick 2 centers 7
8 G-means 2. k-means 8
9 G-means 3. Project 9
10 G-means 3. Project 10
11 G-means Normal? No => iterate 4. Normality test 11
12 G-means 5. Recursion 12
13 MapReduce G-means Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage 13
14 MapReduce G-means Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage 14
15 MapReduce G-means 2. Reduce number of jobs PickInitialCenters while Not ClusteringCompleted do KMeans KMeansAndFindNewCenters TestClusters end while 15
16 MapReduce G-means 3. Maximize parallelism 4. Limit memory usage TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if 16
17 MapReduce G-means 3. Maximize parallelism 4. Limit memory usage (risk of crash) TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster ck e end if n e l t ot B 17
18 MapReduce G-means TestFewClusters TestClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if iner b m o c y r o In mem 18
19 MapReduce G-means TestFewClusters TestClusters Map(key, point) Map(key, point) #clusters > #reducers Find cluster Find cluster Find vector Find vector & Project point on vector Project point on vector Add projection to list Emit(cluster, projection) Estimated required memory < Java heap Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if 19
20 MapReduce G-means TestFewClusters TestClusters Map(key, point) Map(key, point) #clusters > #reducers Find cluster Find cluster Find vector Find vector & Project point on vector Project point on vector Add projection to list Emit(cluster, projection) Estimated required memory < Java heap Experimentally: Close() 64 do Bytes / point For each list Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if 20
21 Comparison MR multi-k-means MR G-means Speed all possible values of k in a single job Quality 21
22 Comparison Speed MR multi-k-means O(nk²) computations MR G-means O(nk) computations But: more iterations more dataset reads log (k) 2 Quality New centers added if and where needed But: tends to overestimate k! 22
23 Experimental results : Speed Hadoop Synthetic dataset 10M points in R10 Euclidean distance 8 machines 23
24 Experimental results : Quality k x ~1.5 kfound Within Cluster Sum of Square (less is better) MR G-means multi-k-means (with same k) Hadoop Synthetic dataset 10M points in R10 Euclidean distance 8 machines 24
25 Conclusions & future work... MapReduce algorithm to determine k Running time proportional to k Future: Overestimation of k Test on real data Test scalability Reduce I/O (using Spark) Consider skewed data Consider impact of machine failure 25
26 Thank you! 26
Determining the k in k-means with MapReduce
Determining the k in k-means with MapReduce Abstract Thibault Debatty Royal Military Academy, Brussels, Belgium thibault.debatty@rma.ac.be Wim Mees Royal Military Academy, Brussels, Belgium wim.mees@rma.ac.be
More informationpfk-means: A Parameter Free K-means Algorithm Omar Kettani*, *(Scientific Institute, Physics of the Earth Laboratory/Mohamed V- University, Rabat)
RESEARCH ARTICLE Abstract: pfk-means: A Parameter Free K-means Algorithm Omar Kettani*, *(Scientific Institute, Physics of the Earth Laboratory/Mohamed V- University, Rabat) K-means clustering is widely
More informationAn Efficient Clustering for Crime Analysis
An Efficient Clustering for Crime Analysis Malarvizhi S 1, Siddique Ibrahim 2 1 UG Scholar, Department of Computer Science and Engineering, Kumaraguru College Of Technology, Coimbatore, Tamilnadu, India
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationDynamic Clustering of Data with Modified K-Means Algorithm
2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq
More informationClustering. (Part 2)
Clustering (Part 2) 1 k-means clustering 2 General Observations on k-means clustering In essence, k-means clustering aims at minimizing cluster variance. It is typically used in Euclidean spaces and works
More informationTowards the world s fastest k-means algorithm
Greg Hamerly Associate Professor Computer Science Department Baylor University Joint work with Jonathan Drake May 15, 2014 Objective function and optimization Lloyd s algorithm 1 The k-means clustering
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationAnalysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark
Analysis of Extended Performance for clustering of Satellite Images Using Bigdata Platform Spark PL.Marichamy 1, M.Phil Research Scholar, Department of Computer Application, Alagappa University, Karaikudi,
More informationCOSC6376 Cloud Computing Homework 1 Tutorial
COSC6376 Cloud Computing Homework 1 Tutorial Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston Outline Homework1 Tutorial based on Netflix dataset Homework 1 K-means
More informationBeyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist
Beyond Online Aggregation: Parallel and Incremental Data Mining with MapReduce Joos-Hendrik Böse*, Artur Andrzejak, Mikael Högqvist *ICSI Berkeley Zuse Institut Berlin 4/26/2010 Joos-Hendrik Boese Slide
More informationMapReduce ML & Clustering Algorithms
MapReduce ML & Clustering Algorithms Reminder MapReduce: A trade-off between ease of use & possible parallelism Graph Algorithms Approaches: Reduce input size (filtering) Graph specific optimizations (Pregel
More informationMounica B, Aditya Srivastava, Md. Faisal Alam
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 3 ISSN : 2456-3307 Clustering of large datasets using Hadoop Ecosystem
More informationMapReduce Design Patterns
MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationk-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University
k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University Outline Unsupervised versus Supervised Learning Clustering Problem k-means Clustering Algorithm Visual
More informationDistances, Clustering! Rafael Irizarry!
Distances, Clustering! Rafael Irizarry! Heatmaps! Distance! Clustering organizes things that are close into groups! What does it mean for two genes to be close?! What does it mean for two samples to
More informationTurkit: Human Computation Algorithms on Mechanical Turk. Greg Little, Lydia B. Chilton, Max Goldman, Robert C. Miller
Turkit: Human Computation Algorithms on Mechanical Turk Greg Little, Lydia B. Chilton, Max Goldman, Robert C. Miller Human computation Algorithms Tasks that build on each other. Human computation Algorithms
More informationFAST K-MEANS ALGORITHM CLUSTERING
FAST K-MEANS ALGORITHM CLUSTERING Raied Salman 1,Vojislav Kecman, Qi Li, Robert Strack and Erick Test 1 Department of Computer Science, Virginia Commonwealth University, Virginia, 601 West Main Street,
More informationk-means Gaussian mixture model Maximize the likelihood exp(
k-means Gaussian miture model Maimize the likelihood Centers : c P( {, i c j,...,, c n },...c k, ) ep( i c j ) k-means P( i c j, ) ep( c i j ) Minimize i c j Sum of squared errors (SSE) criterion (k clusters
More informationData clustering & the k-means algorithm
April 27, 2016 Why clustering? Unsupervised Learning Underlying structure gain insight into data generate hypotheses detect anomalies identify features Natural classification e.g. biological organisms
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationFrequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management
Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management Kranti Patil 1, Jayashree Fegade 2, Diksha Chiramade 3, Srujan Patil 4, Pradnya A. Vikhar 5 1,2,3,4,5 KCES
More informationCLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,
More informationUniversity of Washington Department of Computer Science and Engineering / Department of Statistics
University of Washington Department of Computer Science and Engineering / Department of Statistics CSE 547 / Stat 548 Machine Learning (Statistics) for Big Data Homework 2 Winter 2014 Issued: Thursday,
More informationA Simple Cluster Validation Index with Maximal Coverage
A Simple Cluster Validation Index with Maximal Coverage Susanne Jauhiainen and Tommi Ka rkka inen Department of Mathematical Information Technology University of Jyvaskyla, Finland Abstract. Clustering
More informationPERFORMANCE ANALYSIS OF DATA MINING TECHNIQUES FOR REAL TIME APPLICATIONS
PERFORMANCE ANALYSIS OF DATA MINING TECHNIQUES FOR REAL TIME APPLICATIONS Pradeep Nagendra Hegde 1, Ayush Arunachalam 2, Madhusudan H C 2, Ngabo Muzusangabo Joseph 1, Shilpa Ankalaki 3, Dr. Jharna Majumdar
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More informationk-means Clustering David S. Rosenberg April 24, 2018 New York University
k-means Clustering David S. Rosenberg New York University April 24, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 April 24, 2018 1 / 19 Contents 1 k-means Clustering 2 k-means:
More informationMultivariate analyses in ecology. Cluster (part 2) Ordination (part 1 & 2)
Multivariate analyses in ecology Cluster (part 2) Ordination (part 1 & 2) 1 Exercise 9B - solut 2 Exercise 9B - solut 3 Exercise 9B - solut 4 Exercise 9B - solut 5 Multivariate analyses in ecology Cluster
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationDS-Means: Distributed Data Stream Clustering
DS-Means: Distributed Data Stream Clustering Alessio Guerrieri and Alberto Montresor University of Trento, Italy Abstract. This paper proposes DS-means, a novel algorithm for clustering distributed data
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationParallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case
1 / 39 Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Frédéric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationComparison of Cluster Validation Indices with Missing Data
Comparison of Cluster Validation Indices with Missing Data Marko Niemela,2, Sami A yra mo and Tommi a rkka inen - Faculty of Information Technology, University of yvaskyla O Box 35, FI-4004 yva skyla,
More informationAn Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve
An Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve Replicability of Cluster Assignments for Mapping Application Fouad Khan Central European University-Environmental
More informationAn Adaptive and Deterministic Method for Initializing the Lloyd-Max Algorithm
An Adaptive and Deterministic Method for Initializing the Lloyd-Max Algorithm Jared Vicory and M. Emre Celebi Department of Computer Science Louisiana State University, Shreveport, LA, USA ABSTRACT Gray-level
More informationAlgorithm Design for MapReduce
Algorithm Design for MapReduce CprE 419X, Spring 2014 Iowa State University Srikanta Tirthapura 2/4/14 CprE 419 X, Srikanta Tirthapura 1 Problem 0: Sum Find the sum and average of many input integers Write
More informationImproved Version of Kernelized Fuzzy C-Means using Credibility
50 Improved Version of Kernelized Fuzzy C-Means using Credibility Prabhjot Kaur Maharaja Surajmal Institute of Technology (MSIT) New Delhi, 110058, INDIA Abstract - Fuzzy c-means is a clustering algorithm
More informationFast Fuzzy Clustering of Infrared Images. 2. brfcm
Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.
More informationAnalyzing Big Data with Microsoft R
Analyzing Big Data with Microsoft R 20773; 3 days, Instructor-led Course Description The main purpose of the course is to give students the ability to use Microsoft R Server to create and run an analysis
More informationCS 345A Data Mining. MapReduce
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes
More informationA Comparative study of Clustering Algorithms using MapReduce in Hadoop
A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering
More informationMapReduce: Simplified Data Processing on Large Clusters 유연일민철기
MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,
More informationCS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab
CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material
More informationClustering Lecture 5: Mixture Model
Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics
More informationChapter 6 Continued: Partitioning Methods
Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal
More informationComparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique
International Journal of Soft Computing and Engineering (IJSCE) Comparing and Selecting Appropriate Measuring Parameters for K-means Clustering Technique Shreya Jain, Samta Gajbhiye Abstract Clustering
More informationFREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India
Volume 115 No. 7 2017, 105-110 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN Balaji.N 1,
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationObjective of clustering
Objective of clustering Discover structures and patterns in high-dimensional data. Group data with similar patterns together. This reduces the complexity and facilitates interpretation. Expression level
More informationOverview. Audience profile. At course completion. Course Outline. : 20773A: Analyzing Big Data with Microsoft R. Course Outline :: 20773A::
Module Title Duration : 20773A: Analyzing Big Data with Microsoft R : 3 days Overview The main purpose of the course is to give students the ability to use Microsoft R Server to create and run an analysis
More informationComparative Analysis of K means Clustering Sequentially And Parallely
Comparative Analysis of K means Clustering Sequentially And Parallely Kavya D S 1, Chaitra D Desai 2 1 M.tech, Computer Science and Engineering, REVA ITM, Bangalore, India 2 REVA ITM, Bangalore, India
More informationSBKMMA: Sorting Based K Means and Median Based Clustering Algorithm Using Multi Machine Technique for Big Data
International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ SBKMMA: Sorting Based K Means and Median Based Algorithm
More informationClustering. Chapter 10 in Introduction to statistical learning
Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What
More informationImplementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b
International Conference on Artificial Intelligence and Engineering Applications (AIEA 2016) Implementation of Parallel CASINO Algorithm Based on MapReduce Li Zhang a, Yijie Shi b State key laboratory
More informationApproximation Algorithms for NP-Hard Clustering Problems
Approximation Algorithms for NP-Hard Clustering Problems Ramgopal R. Mettu Dissertation Advisor: Greg Plaxton Department of Computer Science University of Texas at Austin What Is Clustering? The goal of
More informationLecture 10: Semantic Segmentation and Clustering
Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305
More informationBig Data Hadoop Stack
Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationHow to keep capacity predictions on target and cut CPU usage by 5x
How to keep capacity predictions on target and cut CPU usage by 5x Lessons from capacity planning a Java enterprise application Kansas City, Sep 27 2016 Stefano Doni stefano.doni@moviri.com @stef3a linkedin.com/in/stefanodoni
More informationComputational Social Choice in the Cloud
Computational Social Choice in the Cloud Theresa Csar, Martin Lackner, Emanuel Sallinger, Reinhard Pichler Technische Universität Wien Oxford University PPI, Stuttgart, March 2017 By Sam Johnston [CC BY-SA
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationOutline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins
MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationImage Compression with Competitive Networks and Pre-fixed Prototypes*
Image Compression with Competitive Networks and Pre-fixed Prototypes* Enrique Merida-Casermeiro^, Domingo Lopez-Rodriguez^, and Juan M. Ortiz-de-Lazcano-Lobato^ ^ Department of Applied Mathematics, University
More informationImproving Hadoop MapReduce Performance on Supercomputers with JVM Reuse
Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core
More informationRecitation 9. CS435: Introduction to Big Data. GTA: Bibek R. Shrestha March 23, 2018
Recitation 9 CS435: Introduction to Big Data GTA: Bibek R. Shrestha Email: cs435@cs.colostate.edu March 23, 2018 Today... Discussion on issues in Programming Assignment 2 2 How to program PA2: Part-A?
More informationApache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM
Apache Giraph: Facebook-scale graph processing infrastructure 3/31/2014 Avery Ching, Facebook GDM Motivation Apache Giraph Inspired by Google s Pregel but runs on Hadoop Think like a vertex Maximum value
More informationCHAPTER 3 TUMOR DETECTION BASED ON NEURO-FUZZY TECHNIQUE
32 CHAPTER 3 TUMOR DETECTION BASED ON NEURO-FUZZY TECHNIQUE 3.1 INTRODUCTION In this chapter we present the real time implementation of an artificial neural network based on fuzzy segmentation process
More informationk-means A classical clustering algorithm
k-means A classical clustering algorithm Devert Alexandre School of Software Engineering of USTC 30 November 2012 Slide 1/65 Table of Contents 1 Introduction 2 Visual demo Step by step Voronoi diagrams
More informationOutline. CS-562 Introduction to data analysis using Apache Spark
Outline Data flow vs. traditional network programming What is Apache Spark? Core things of Apache Spark RDD CS-562 Introduction to data analysis using Apache Spark Instructor: Vassilis Christophides T.A.:
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 03/02/2016 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationIterative random projections for high-dimensional data clustering
Iterative random projections for high-dimensional data clustering Ângelo Cardoso, Andreas Wichert INESC-ID Lisboa and Instituto Superior Técnico, Technical University of Lisbon Av. Prof. Dr. Aníbal Cavaco
More informationProject Design. Version May, Computer Science Department, Texas Christian University
Project Design Version 4.0 2 May, 2016 2015-2016 Computer Science Department, Texas Christian University Revision Signatures By signing the following document, the team member is acknowledging that he
More informationk-center Problems Joey Durham Graphs, Combinatorics and Convex Optimization Reading Group Summer 2008
k-center Problems Joey Durham Graphs, Combinatorics and Convex Optimization Reading Group Summer 2008 Outline General problem definition Several specific examples k-center, k-means, k-mediod Approximation
More informationUnsupervised Learning and Clustering
Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationHadoop: The Definitive Guide PDF
Hadoop: The Definitive Guide PDF Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, youâ ll learn how to build and maintain reliable, scalable, distributed
More informationJeffrey D. Ullman Stanford University
Jeffrey D. Ullman Stanford University for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must
More informationCOSC 6339 Big Data Analytics. Hadoop MapReduce Infrastructure: Pig, Hive, and Mahout. Edgar Gabriel Fall Pig
COSC 6339 Big Data Analytics Hadoop MapReduce Infrastructure: Pig, Hive, and Mahout Edgar Gabriel Fall 2018 Pig Pig is a platform for analyzing large data sets abstraction on top of Hadoop Provides high
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationAn Improvement of Centroid-Based Classification Algorithm for Text Classification
An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,
More informationClustering Algorithms. Margareta Ackerman
Clustering Algorithms Margareta Ackerman A sea of algorithms As we discussed last class, there are MANY clustering algorithms, and new ones are proposed all the time. They are very different from each
More informationMeta-Clustering. Parasaran Raman PhD Candidate School of Computing
Meta-Clustering Parasaran Raman PhD Candidate School of Computing What is Clustering? Goal: Group similar items together Unsupervised No labeling effort Popular choice for large-scale exploratory data
More information2/26/2017. For instance, consider running Word Count across 20 splits
Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:
More informationBig Data Using Hadoop
IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for
More informationDynamic Clustering in WSN
Dynamic Clustering in WSN Software Recommended: NetSim Standard v11.1 (32/64 bit), Visual Studio 2015/2017, MATLAB (32/64 bit) Project Download Link: https://github.com/netsim-tetcos/dynamic_clustering_project_v11.1/archive/master.zip
More informationAN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang
International Journal of Innovative Computing, Information and Control ICIC International c 2017 ISSN 1349-4198 Volume 13, Number 3, June 2017 pp. 1037 1046 AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA
More informationKapitel 4: Clustering
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.
More informationFINAL PROJECT #3: GEO-LOCATION CLUSTERING IN SPARK
CSE427S FINAL PROJECT #3: GEO-LOCATION CLUSTERING IN SPARK M. Neumann Due: NO EXTENSION FRI 4 MAY 2018 (MIDNIGHT) Project Goal In this project you and your group will interactively get to know SPARK and
More informationOverview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::
Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional
More informationDynamic local search algorithm for the clustering problem
UNIVERSITY OF JOENSUU DEPARTMENT OF COMPUTER SCIENCE Report Series A Dynamic local search algorithm for the clustering problem Ismo Kärkkäinen and Pasi Fränti Report A-2002-6 ACM I.4.2, I.5.3 UDK 519.712
More informationALTERNATIVE METHODS FOR CLUSTERING
ALTERNATIVE METHODS FOR CLUSTERING K-Means Algorithm Termination conditions Several possibilities, e.g., A fixed number of iterations Objects partition unchanged Centroid positions don t change Convergence
More informationMachine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves
Machine Learning A 708.064 11W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence I [2 P] a) [1 P] Give an example for a probability distribution P (A, B, C) that disproves
More informationAn Improved Parallel Scalable K-means++ Massive Data Clustering Algorithm Based on Cloud Computing
An Improved Parallel Scalable K-means++ Massive Data Clustering Algorithm Based on Cloud Computing Shuzhi Nie Abstract Clustering is one of the most effective algorithms in data analysis and management.
More information