Big Data 2016 - SONY Håkan Jonsson Vedran Sekara

Schedule
09:15-10:00  Cluster analysis, partition-based clustering
10:00-10:15  Break
10:15-12:00  Exercise 1: User segmentation based on app usage
12:00-13:15  Lunch break
13:15-13:45  Density-based clustering
13:45-15:00  Exercise 2: Spatial clustering with DBSCAN
15:00-16:00  Exercise 3 / Assignment: Your own dataset

Unsupervised Learning? The process of inferring, describing, and uncovering hidden structures in data for which we have no ground truth

Cluster Analysis Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Density-Based Methods


What is Cluster Analysis? Humans are skilled at dividing objects into groups. But how do we train a machine to do the same?

Applications
- Finance: map interdependence of companies
- Marketing: customer segmentation
- Information retrieval: document clustering
- Land use: identification of areas of similar land use
- City planning: identifying similar neighbourhoods according to house type, value, and geo-location

Preprocessing tool
- Summarization: a preprocessing step for other machine learning algorithms (regression, classification, or association tasks)
- Outlier detection: outliers are often viewed as points far away from any cluster

Quality: what is a good clustering?
A good clustering method will produce high-quality clusters:
- high intra-class similarity
- low inter-class similarity
The quality of a clustering method depends on:
- the applied similarity measure
- the implementation (some are based on randomness)
- its ability to discover specific types of latent structures

Measuring quality
Dissimilarity/similarity metrics
- Similarity is expressed in terms of a distance function
- The definition of distance differs for various types of data (interval-scaled, boolean, categorical, ordinal, ratio, vector, ...)
Quality of clustering
- There is no single universal quality function; it depends on the problem at hand and the data (gini, entropy, class-error, impurity, sum of squared errors, ...)
- It is difficult to define "similar enough" or "good enough" (depends on the problem at hand)
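
As a concrete illustration of one such quality function, the short sketch below computes the within-cluster sum of squared errors (SSE); lower is better. The arrays points, labels and centroids are assumptions for the example and not part of the slides.

import numpy as np

def sse(points, labels, centroids):
    # sum, over all points, of the squared distance to the centroid
    # of the cluster each point is assigned to
    return float(((points - centroids[labels]) ** 2).sum())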

Considerations when clustering
- Partitioning criteria: single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
- Separation of clusters: overlapping vs. non-overlapping (e.g. a document can belong to multiple classes)
- Similarity measure: distance-based vs. connectivity-based
- Dimensionality: look for clusters in the full D-dimensional space vs. in subspaces

Challenges
- Scalability: sample the data or cluster everything?
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, and combinations of these
- Constraints on clustering: the user may want to explicitly specify constraints
- Interpretability and usability: do the clusters look sane, and can we use them?
- Others: ability to deal with noisy data, ability to infer clusters with arbitrary shapes, ability to deal with high-dimensional data, incremental clustering and insensitivity to input order

Clustering approaches
- Partitioning approach: construct partitions and evaluate them by some criterion, e.g. sum of squared errors (typical methods: k-means, k-medoids, CLARANS)
- Hierarchical approach: create a hierarchical tree of the data using some distance criterion (typical methods: Diana, Agnes, BIRCH, CAMELEON)
- Density-based approach: cluster data according to connectivity and density constraints (typical methods: DBSCAN, OPTICS, DenClue)
- Grid-based approach: based on a multiple-level granularity structure (typical methods: STING, WaveCluster, CLIQUE)

Clustering approaches (II)
- Model-based approach: a model is hypothesised for each cluster, and points that fit the model are assigned to that cluster (typical methods: EM, SOM, COBWEB)
- Frequent pattern approach: based on the analysis of frequent patterns (typical methods: p-cluster)
- Link-based clustering: objects are linked together in various ways, e.g. in a network (typical methods: LinkClustering, SimRank, Modularity)

Why are there so many methods?

Why are there so many methods? Data!

Example 1 Well-separated Each point is closer to all points in its cluster than to any point in another cluster

Example 2 Center based Each point is closer to the center of its cluster than to the center of any other cluster

Example 3 Contiguity based Each point is closer to at least one point in its cluster than to any point in another cluster

Example 4 Density based Clusters are regions of high density separated by regions of low density

Example 5 Conceptual Points in a cluster share some general property that can be derived from the entire set of points

K-means
Objective
Given a dataset D with n points and a number of clusters k, partition the data into k clusters (by minimizing some error function)
Algorithm
1) Select k points as initial centroids
Repeat
  2) Form k clusters by assigning each point to its closest centroid
  3) Recompute the centroid of each cluster
Until centroids do not change
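
To make the loop above concrete, here is a minimal NumPy sketch of the same steps. It is an illustration only, not the Spark implementation used later; the dataset X (an n x d array) and the value of k are assumed to be given.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # 1) select k points as initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # 2) assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3) recompute the centroid of each cluster
        # (assumes no cluster becomes empty; a robust version would handle that)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop when centroids do not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids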

demo http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Comments on k-means
Strength
- Fairly fast: O(ntk), where n = # data points, k = # clusters, t = # iterations
Limitations
- Can get stuck at a local minimum
- We need to explicitly specify k
- Sensitive to noise, and especially to outliers
- Mainly applicable to continuous data
- Cannot detect clusters with strange shapes
Alternative versions addressing some of the issues
- k-medoids, k-modes

k-means with Spark MLlib

from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, k=2, maxIterations=10, runs=10, initializationMode="random")

# Evaluate the clustering by its within-set sum of squared errors
WSSSE = clusters.computeCost(parsedData)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
clusters.save(sc, "mymodelpath")
sameModel = KMeansModel.load(sc, "mymodelpath")

k-means with Spark MLlib

clusters = KMeans.train(parsedData, k=2, maxIterations=10, runs=10, initializationMode="random")

Parameters
- k: number of clusters
- maxIterations: maximum number of iterations to run
- epsilon: distance threshold within which we consider k-means to have converged
- runs: number of k-means runs, from which the best one is chosen
- initializationMode: how to pick cluster centres when initializing the model
- initialModel: an optional set of cluster centres used as the starting point

Exercise I: User segmentation using k-means
Question
Can we segment users into groups based on their application usage?
Representation (Bag-of-Words)
We are going to use a very simple representation of user behaviour, only taking into account the apps a user starts and disregarding everything else
Procedure
- Every app belongs to a category (Tools, Weather, Music, ...)
- Count the number of times a user has started an app within each category
- This creates one vector per user, where each index represents one category
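
A possible sketch of this procedure in PySpark, assuming a hypothetical RDD named events of (user, category) app-start records; the RDD name, field layout and the choice of k are assumptions, and the actual lab data may be shaped differently.

from pyspark.mllib.clustering import KMeans
from numpy import zeros

# fix an index per app category
categories = sorted(events.map(lambda uc: uc[1]).distinct().collect())
index = {c: i for i, c in enumerate(categories)}

def to_vector(category_counts):
    # turn [(category, count), ...] into one fixed-length vector
    v = zeros(len(categories))
    for category, count in category_counts:
        v[index[category]] = count
    return v

# count app starts per (user, category), then build one vector per user
user_vectors = (events.map(lambda uc: ((uc[0], uc[1]), 1))
                      .reduceByKey(lambda a, b: a + b)
                      .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
                      .groupByKey()
                      .mapValues(to_vector))

# segment users with k-means (k=5 is a free choice here)
model = KMeans.train(user_vectors.values(), k=5, maxIterations=10)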

Density-based clustering methods

Density-based clustering methods
Instead of relying on similarity (distance) to cluster data points, we are going to look at density
Advantages
- Can infer clusters of arbitrary shapes
- Can handle noise
- Are relatively fast (usually require only one scan of the data)
- We don't have to define how many clusters we want
Algorithms
- DBSCAN [Ester et al., KDD 1996]
- CLIQUE [Agrawal et al., KDD 1998]
- OPTICS [Ankerst et al., SIGMOD 1999]

Basic concepts
Instead of relying on similarity (distance) to cluster data points, we are going to look at density
Parameters
- epsilon: radius of the neighbourhood around a point
- minPoints: minimum number of points within that neighbourhood

Basic concepts (minPts = 3)
Core points
- A point (a) is a core point if at least minPts points are within a distance epsilon of it
Reachable points
- A point (b) is reachable from (a) if there is a path between them where all points on the path are core points
Noise
- Points not reachable from any other point are outliers (c)
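
These definitions translate almost directly into code; a tiny plain-Python sketch (the helper names and the use of 2D points are assumptions, not part of the slides):

from math import dist  # Python 3.8+

def neighbours(points, p, eps):
    # all points within distance eps of p (p itself included)
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    # a point is a core point if its eps-neighbourhood holds at least min_pts points
    return len(neighbours(points, p, eps)) >= min_pts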

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Utilizes the notion of reachability
- Defines a cluster as the maximal set of connected points
- Can discover clusters with arbitrary shape
Any caveats?
- Assumes that clusters have identical densities

DBSCAN Algorithm
1) Select an arbitrary data point, p
Repeat
  2) Find all points that are reachable from p (epsilon, minPts)
  3) If p is a core point, form a cluster; otherwise move on to the next data point
Until all points have been visited

demo http://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

DBSCAN in Spark
Unfortunately DBSCAN is not implemented in Spark yet; however, Spark-friendly implementations do exist on GitHub
If the number of data points per key is not excessively large, we can cache the data in memory and apply a non-distributed version of DBSCAN
Today we are going to use scikit-learn's implementation of DBSCAN: first group values by key, then apply DBSCAN to each key using map()

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=eps, min_samples=minPts)
clusterings = dataRDD.map(lambda kv: dbscan.fit(kv[1]))
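
A slightly fuller sketch of the group-by-key pattern described above, assuming a hypothetical pointsRDD of (user, (lat, lon)) records; the parameter values are placeholders, and scikit-learn marks noise points with the label -1.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_points(points, eps=0.001, min_pts=10):
    X = np.array(list(points))
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    return list(zip(X.tolist(), labels.tolist()))  # label -1 = noise

per_user_clusters = pointsRDD.groupByKey().mapValues(cluster_points)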

Points of interest (figures: latitude vs. longitude scatter plots of location traces)

Exercise II: Clustering using DBSCAN
Dataset
Geographic traces for multiple individuals
Purpose
Apply DBSCAN to each individual's traces to discover personal points of interest, i.e. often-visited locations (high density of points)
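
One way to turn the per-user DBSCAN labels into personal points of interest is to reduce every non-noise cluster to its mean position; a small sketch under the same assumptions as the DBSCAN code above:

import numpy as np

def points_of_interest(X, labels):
    # X: array of (lat, lon) points; labels: DBSCAN output, where -1 = noise
    return [X[labels == c].mean(axis=0) for c in set(labels) if c != -1]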

But first!

Assignment
Option I
Complete the optional exercises of Lab4-1 and Lab4-2
Option II
Analysis of your own dataset
- Exploration
- Cleaning & filtering
- Clustering
- Interpretation (do the clusters make sense?)

Clustering methods in Spark
- k-means
- streaming k-means (dynamic clustering)
- Gaussian mixture models
- Latent Dirichlet Allocation (LDA)
- Power Iteration Clustering (PIC)

Gaussian mixture models A Gaussian mixture model is a probabilistic model that assumes all data-points are generated from a mixture of a finite number of Gaussian distributions
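
A minimal usage sketch with Spark MLlib's GaussianMixture, assuming parsedData is an RDD of numeric vectors as in the k-means example earlier (k=3 is an arbitrary choice):

from pyspark.mllib.clustering import GaussianMixture

gmm = GaussianMixture.train(parsedData, k=3)
for i in range(3):
    # each component is a Gaussian with a weight, a mean and a covariance matrix
    print("weight = {}, mu = {}, sigma = {}".format(
        gmm.weights[i], gmm.gaussians[i].mu, gmm.gaussians[i].sigma))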

Latent Dirichlet Allocation
LDA is a generative model that allows sets of observations to be described by unknown groups, or topics
Example
LDA can look at the words written in documents and determine the mixture of topics. (Figure: a Wikipedia article about HMS Courageous, a Royal Navy ship, annotated with an inferred topic mixture of News [50%], Sports [40%], Science [10%].)
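
A minimal sketch of LDA in Spark MLlib; the important part is the corpus format, an RDD of [document id, vector of word counts] pairs (the tiny corpus below is made up for illustration):

from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

# three toy documents over a five-word vocabulary
corpus = sc.parallelize([
    [0, Vectors.dense([2.0, 1.0, 0.0, 0.0, 0.0])],
    [1, Vectors.dense([0.0, 0.0, 3.0, 1.0, 0.0])],
    [2, Vectors.dense([1.0, 0.0, 0.0, 2.0, 2.0])],
])

ldaModel = LDA.train(corpus, k=2)
# vocabSize x k matrix of (unnormalized) word weights per topic
print(ldaModel.topicsMatrix())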

Power Iteration Clustering
Power Iteration Clustering uses the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. It treats the clustering problem as a graph partitioning problem.
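
A minimal sketch of Power Iteration Clustering in Spark MLlib; the input graph is an RDD of (src id, dst id, similarity) tuples (the toy values below are made up):

from pyspark.mllib.clustering import PowerIterationClustering

similarities = sc.parallelize([
    (0, 1, 0.9), (0, 2, 0.8), (1, 2, 0.7),  # one tightly connected group
    (3, 4, 0.9), (4, 5, 0.8), (3, 5, 0.7),  # another group
    (2, 3, 0.1),                            # weak link between the groups
])

model = PowerIterationClustering.train(similarities, k=2, maxIterations=20)
for a in model.assignments().collect():
    print("point {} -> cluster {}".format(a.id, a.cluster))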

Assignment summary
Option I
Complete the optional exercises of Lab4-1 and Lab4-2
Option II
Analysis of your own dataset
- Exploration
- Cleaning & filtering
- Clustering
- Interpretation (do the clusters make sense?)
Data
If you don't have a dataset you can use nips12raw_str602.tgz (LDA): http://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz
Quiz
Will be mailed to you