Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

Similar documents
Data Mining. Jeff M. Phillips. January 8, 2014

Data Mining. Jeff M. Phillips. January 12, 2015 CS 5140 / CS 6140

Data Mining. Jeff M. Phillips. January 9, 2013

Network Traffic Measurements and Analysis

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

Outlier Pursuit: Robust PCA and Collaborative Filtering

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

Generalized Principal Component Analysis CVPR 2007

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Table Of Contents: xix Foreword to Second Edition

Contents. Foreword to Second Edition. Acknowledgments About the Authors

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

Data Preprocessing. Slides by: Shree Jaswal

CS325 Artificial Intelligence Ch. 20 Unsupervised Machine Learning

Clustering and Visualisation of Data

UNIT 2 Data Preprocessing

Python With Data Science

3. Data Preprocessing. 3.1 Introduction

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

2. Data Preprocessing

K236: Basis of Data Science

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Introduction to Data Mining

ECLT 5810 Clustering

University of Florida CISE department Gator Engineering. Data Preprocessing. Dr. Sanjay Ranka

Data Science Center Eindhoven. The Mathematics Behind Big Data. Alessandro Di Bucchianico

Question Bank. 4) It is the source of information later delivered to data marts.

Data Preprocessing. Data Preprocessing

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

9. Conclusions. 9.1 Definition KDD

Dimension Reduction CS534

Pre-Requisites: CS2510. NU Core Designations: AD

SCHEME OF COURSE WORK. Data Warehousing and Data mining

Unsupervised learning in Vision

Machine Learning - Clustering. CS102 Fall 2017

Image Analysis, Classification and Change Detection in Remote Sensing

ECLT 5810 Clustering

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Clustering and Dimensionality Reduction. Stony Brook University CSE545, Fall 2017

Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones

Rapid growth of massive datasets

ELEG Compressive Sensing and Sparse Signal Representations

Machine Learning - Regression. CS102 Fall 2017

Similarity Ranking in Large- Scale Bipartite Graphs

Data preprocessing Functional Programming and Intelligent Algorithms

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Specialist ICT Learning

MSA220 - Statistical Learning for Big Data

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Name of the lecturer Doç. Dr. Selma Ayşe ÖZEL

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

Modern Multidimensional Scaling

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

Elysium Technologies Private Limited::IEEE Final year Project

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning

Semi-supervised Learning

CLASSIFICATION AND CHANGE DETECTION

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Mining Web Data. Lijun Zhang


CSE 158. Web Mining and Recommender Systems. Midterm recap

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract

Customer Clustering using RFM analysis

Data Representation and Indexing

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

DATA STREAMS: MODELS AND ALGORITHMS

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Dta Mining and Data Warehousing

COMPUTER AND ROBOT VISION

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Data Mining and Analytics. Introduction

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

DYNAMMO: MINING AND SUMMARIZATION OF COEVOLVING SEQUENCES WITH MISSING VALUES

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Clustering: Classic Methods and Modern Views

Big-data Clustering: K-means vs K-indicators

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Sublinear Models for Streaming and/or Distributed Data

Using Existing Numerical Libraries on Spark

Unsupervised Learning

Mining Web Data. Lijun Zhang

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS

SOCIAL MEDIA MINING. Data Mining Essentials

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Transcription:

Data Mining CS 5140 / CS 6140 Jeff M. Phillips January 7, 2019

What is Data Mining?

What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics?

What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? How to think about data analytics.

What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? How to think about data analytics. Principals of converting from messy raw data to abstract representations. Algorithms of how to analyze data in abstract representations. Addressing challenges in scalability, error, and modeling.

Modeling versus Efficiency Two Intertwined (and often competing) Objectives: Model Data Correctly Process Data Efficiently Statistics Data Mining Algorithms Modeling Efficiency

Other Data Mining Courses Every university teaches data mining differently!

Other Data Mining Courses Every university teaches data mining differently! What flavor is offered in this class: Focus on techniques for very large scale data Broad coverage... with recent developments Formally and generally presented (proof sketches)... but useful in practice (e.g. internet companies) Probabilistic algorithms: connections to CS and Stat

Other Data Mining Courses Every university teaches data mining differently! What flavor is offered in this class: Focus on techniques for very large scale data Broad coverage... with recent developments Formally and generally presented (proof sketches)... but useful in practice (e.g. internet companies) Probabilistic algorithms: connections to CS and Stat no specific software tools / programming languages

Other Data Mining Courses Every university teaches data mining differently! What flavor is offered in this class: Focus on techniques for very large scale data Broad coverage... with recent developments Formally and generally presented (proof sketches)... but useful in practice (e.g. internet companies) Probabilistic algorithms: connections to CS and Stat no specific software tools / programming languages Maths: Linear Algebra, Probability, High-dimensional geometry

Classic (Old) View of Data Mining labeled data (X, y) regression classification prediction unlabeled data X dimensionality reduction clustering structure scalar outcome set outcome

Outline Statistical and Mathematical Principals: 1. Hashing, Concentration of Measure 2. Similarity (find duplicates and similar items) Structure in Data: 3. Clustering (aggregate close items) 4. Regression (linearity of high-d data, sparsity) 5. Dimensionality Reduction (PCA, embeddings) 7. Link Analysis (prominent structure in large graphs) Controlling for Noise and Uncertainty: 6. Noisy Data (anomalies in data, ethics, privacy)

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Raw Data to Abstract Representations How to measure similarity between data? Key idea: data point a quick brown fox jumped... 1 1 1 1 1 1 1 1 1 1 1 1 11 joe bob sue age income height 25 32 28 90K 45K 38K 1.85 1.52 1.61

Similarity Given a large set of data P. Given new point q, is q in P? Given a large set of data P. Given new point q, what is the closest point in P to q? q P

Clustering How to find groups of similar data. do we need a representative? can groups overlap? what is structure of data/distance?

Clustering How to find groups of similar data. do we need a representative? can groups overlap? what is structure of data/distance? Hierarchical clustering : When to combine groups? k-means clustering : k-median, k-center, k-means++ Graph clustering : modularity, spectral

Regression Consider a data set P R d, where d is BIG! Want to find linear (or polynomial) function that represents P. degree 3 fit 240 240 220 200 180 160 140 120 220 200 180 160 140 120 100 60 62 64 66 68 70 72 74 76 100 60 62 64 66 68 70 72 74 76

Regression Consider a data set P R d, where d is BIG! Want to find linear (or polynomial) function that represents P. degree 3 fit 240 240 220 200 180 160 140 120 220 200 180 160 140 120 100 60 62 64 66 68 70 72 74 76 100 60 62 64 66 68 70 72 74 76 Least Squares : Common easy approach (polynomial, high-dimensional) L 1 Regression : Sparser, generalizes better, Orthogonal Matching Pursuit Info Recovery : Compressed Sensing

Dimensionality Reduction Again consider a data set P R d, where d is BIG! Want to find linear subspace that represents P. a1 =(1.5, 0) S k V T k a3 =(1, 1) a2 =(2, 0.5) A k = U k

Dimensionality Reduction Again consider a data set P R d, where d is BIG! Want to find linear subspace that represents P. a1 =(1.5, 0) S k V T k a3 =(1, 1) a2 =(2, 0.5) A k = U k SVD : Linear Algebra basis for PCA Multidimensional Scaling : Fits sets of distances in R k with k small Matrix Sketching: Random Projections, Sampling, FD

Noisy Data What to do when data is noisy? Identify it : Find and remove outliers Model it : It may be real, affect answer Exploit it : Differential privacy, Ethics of Data Science

Link Analysis, Graphs How does Google Search work? Converts webpage links into directed graph. Markov Chains : Models movement in a graph PageRank : How to convert graph into important nodes MapReduce : How to scale up PageRank Communities : Other important nodes in graphs

Summaries Reducing massive data to small space. Want to retain as much as possible (not specific structure) error guarantees OnePass Sampling : Reservoir Sampling MinCount Hash : Sketching data, abstract features Density Approximation : Quantiles Matrix Sketching : Preprocessing complex data Spanners : graph approximations memory length m CPU word 2 [n] u ω(k, u) ω(p, u)

Themes What are course goals? Intuition for data analytics How to model data (convert to abstract data types) How to process data efficiently (balance models with algorithms)