Machine Learning - Clustering. CS102 Fall 2017


Big Data Tools and Techniques
- Basic Data Manipulation and Analysis: performing well-defined computations or asking well-defined questions ("queries")
- Data Mining: looking for patterns in data
- Machine Learning: using data to make inferences or predictions
- Data Visualization: graphical depiction of data
- Data Collection and Preparation

Machine Learning
Using data to make inferences or predictions
- Supervised machine learning
  - Set of labeled examples to learn from: training data
  - Develop model from training data
  - Use model to make inferences about new data
- Unsupervised machine learning
  - Unlabeled data, look for patterns or structure (similar to data mining)

Like classification, data items consist of values for a set of features (numeric or categorical)
- Medical patients
  Feature values: age, gender, symptom1-severity, symptom2-severity, test-result1, test-result2
- Web pages
  Feature values: URL domain, length, #images, heading 1, heading 2, ..., heading n
- Products
  Feature values: category, name, size, weight, price
Unlike classification, there is no label

Like K-nearest neighbors, for any pair of data items i1 and i2, a distance function distance(i1, i2) can be computed from their feature values
Example: Features - gender, profession, age, income, postal-code
  person1 = (male, teacher, 47, $25K, 94305)
  person2 = (female, teacher, 43, $28K, 94309)
  distance(person1, person2)
distance() can be defined as inverse of similarity()
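The slide leaves the definition of distance() open. As a minimal sketch, one possible choice is to average a per-feature difference: 0 or 1 for categorical features (match or mismatch) and a scaled absolute difference for numeric ones. The scaling constants AGE_RANGE and INCOME_RANGE, the decision to treat postal code as categorical, and the use of 1 - distance as the "inverse" for similarity are all assumptions made for illustration.

```python
# Hypothetical scaling constants (assumptions, not from the slides):
# they map numeric differences into the range [0, 1].
AGE_RANGE = 100.0
INCOME_RANGE = 100_000.0


def distance(p1, p2):
    """Average per-feature difference between two person records.

    Each record is (gender, profession, age, income, postal_code).
    The result lies in [0, 1]; 0 means identical on every feature.
    """
    gender1, prof1, age1, income1, zip1 = p1
    gender2, prof2, age2, income2, zip2 = p2
    parts = [
        0.0 if gender1 == gender2 else 1.0,          # categorical
        0.0 if prof1 == prof2 else 1.0,              # categorical
        min(abs(age1 - age2) / AGE_RANGE, 1.0),      # numeric, scaled
        min(abs(income1 - income2) / INCOME_RANGE, 1.0),
        0.0 if zip1 == zip2 else 1.0,                # treated as categorical
    ]
    return sum(parts) / len(parts)


def similarity(p1, p2):
    """One possible 'inverse' of distance, valid because distance is in [0, 1]."""
    return 1.0 - distance(p1, p2)


person1 = ("male", "teacher", 47, 25_000, "94305")
person2 = ("female", "teacher", 43, 28_000, "94309")
d = distance(person1, person2)
```

Other weightings are equally defensible. For instance, nearby postal codes could count as closer than distant ones, which this sketch ignores.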

GOAL: Given a set of data items, partition them into groups (= clusters) so that items within groups are close to each other based on distance function
- Sometimes number of clusters is pre-specified
- Typically clusters need not be same size

Some Uses for Clustering
- Classification!
  - Assign labels to clusters
  - New data items get the label of their cluster
- Identify similar items
  - For substitutes or recommendations
  - For de-duplication
- Anomaly (outlier) detection
  - Items that are far from any cluster

K-Means
Reminder: for any pair of data items i1 and i2 have distance(i1, i2)
- For a group of items, the mean value (centroid) of the group is the item i (in the group or not) that minimizes the sum of squared distances distance(i, i')^2 over all i' in the group (for numeric features with Euclidean distance, this is the ordinary average of each feature)
- Error for each item: distance d from the mean for its group; squared error is d^2
- Error for the entire clustering: sum of squared errors (SSE)
Remind you of anything?
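The centroid and SSE definitions can be sketched in a few lines, assuming every item is a tuple of numeric features and distance is Euclidean. The function names and the sample points are illustrative only.

```python
import math


def centroid(points):
    """Feature-wise average of a non-empty group of numeric points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(clusters):
    """Sum of squared Euclidean distances from each item to its group's mean."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) ** 2 for p in points)
    return total


# Illustrative data: two small groups in the plane.
group_a = [(1.0, 1.0), (3.0, 1.0)]       # mean (2, 1), each item at distance 1
group_b = [(10.0, 10.0), (10.0, 14.0)]   # mean (10, 12), each item at distance 2
total_error = sse([group_a, group_b])    # 1 + 1 + 4 + 4 = 10
```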

K-Means
Given set of data items and desired number of clusters k, K-means groups the items into k clusters minimizing the SSE
- Extremely difficult to compute efficiently
  - In fact, no efficient exact algorithm is known: the problem is NP-hard in general
- Most algorithms compute an approximate solution (might not be absolute lowest SSE)
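The slides do not say which approximate algorithm they have in mind. A common one is Lloyd's algorithm, sketched here assuming numeric tuples and Euclidean distance: pick k starting means, assign each item to its nearest mean, recompute the means, and repeat until nothing changes. The random starting means, the seed and max_iters parameters, and the handling of empty clusters are choices made for this sketch.

```python
import math
import random


def centroid(points):
    """Feature-wise average of a non-empty group of numeric points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, means):
    """Put each point into the cluster of its nearest mean."""
    clusters = [[] for _ in means]
    for p in points:
        nearest = min(range(len(means)), key=lambda j: math.dist(p, means[j]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm: returns (means, clusters). Not guaranteed optimal."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # start from k of the data items
    for _ in range(max_iters):
        clusters = assign(points, means)
        # Recompute each mean; an empty cluster keeps its previous mean.
        new_means = [centroid(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:  # assignments are stable: converged
            break
        means = new_means
    return means, assign(points, means)


# Illustrative data: two well-separated groups of three points each.
near = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
far = [(100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
means, clusters = k_means(near + far, k=2)
```

The result depends on the starting means, so on harder data it is common to run the procedure several times with different seeds and keep the clustering with the lowest SSE.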

European Cities: by geographic distance, then by temperature
- Distance = actual distance, k = 5
- Distance = actual distance, k = 8, with cluster means
- Distance = actual distance, k = 2, with cluster means
- Distance = actual distance, k = 30
- Distance = temperature, k = 5
- Distance = temperature, k = 8, with means
- Distance = temperature, k = 2
- Distance = temperature, k = 3
- Distance = temperature, k = 30

Some Uses for Clustering
- Classification
  - Assign labels to clusters
  - New data items get the label of their cluster
- Identify similar items
  - For substitutes or recommendations
  - For de-duplication
- Anomaly (outlier) detection
  - Items that are far from any cluster
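Two of these uses can be sketched together once cluster means are available: a new item receives the label of its nearest cluster, unless it is far from every cluster, in which case it is flagged as an anomaly. The means, labels, and outlier_threshold below are hypothetical values chosen for illustration, and Euclidean distance is assumed.

```python
import math


def nearest_mean(item, means):
    """Return (index, distance) of the cluster mean closest to item."""
    distances = [math.dist(item, m) for m in means]
    j = min(range(len(means)), key=distances.__getitem__)
    return j, distances[j]


def classify(item, means, labels, outlier_threshold):
    """Label item by its nearest cluster, or flag it if it is far from all of them."""
    j, d = nearest_mean(item, means)
    if d > outlier_threshold:
        return "anomaly"
    return labels[j]


# Hypothetical cluster means and human-assigned cluster labels.
means = [(1.0, 1.0), (10.0, 10.0)]
labels = ["low", "high"]

label_a = classify((2.0, 1.5), means, labels, outlier_threshold=3.0)
label_b = classify((9.0, 11.0), means, labels, outlier_threshold=3.0)
label_c = classify((50.0, 50.0), means, labels, outlier_threshold=3.0)
```

A single global threshold is the simplest option. A per-cluster threshold based on each cluster's spread is another reasonable design.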