Data Mining. Jeff M. Phillips. January 9, 2013

Similar documents
Data Mining. Jeff M. Phillips. January 8, 2014

Data Mining. Jeff M. Phillips. January 12, 2015 CS 5140 / CS 6140

Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

K236: Basis of Data Science

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

UNIT 2 Data Preprocessing

Introduction to Data Mining

CS325 Artificial Intelligence Ch. 20 Unsupervised Machine Learning

Data Preprocessing. Slides by: Shree Jaswal

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

Sublinear Models for Streaming and/or Distributed Data

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

ECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1

Elysium Technologies Private Limited::IEEE Final year Project

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

3. Data Preprocessing. 3.1 Introduction

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

2. Data Preprocessing

DSCI 575: Advanced Machine Learning. PageRank Winter 2018

DATA STREAMS: MODELS AND ALGORITHMS

Network Traffic Measurements and Analysis

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

Table Of Contents: xix Foreword to Second Edition

Contents. Foreword to Second Edition. Acknowledgments About the Authors

ECLT 5810 Data Preprocessing. Prof. Wai Lam

CSE P521: Applied Algorithms

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad

K-Means Clustering 3/3/17

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Analytics for. Transmission Expansion Planning. Andrés Ramos. January Estadística II. Transmission Expansion Planning GITI/GITT

Question Bank. 4) It is the source of information later delivered to data marts.

data-based banking customer analytics

CPSC 340: Machine Learning and Data Mining. Hierarchical Clustering Fall 2017

Mining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania

Contents. Part I Setting the Scene

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

The Data Science Process. Polong Lin Big Data University Leader & Data Scientist IBM

Context. File Organizations and Indexing. Cost Model for Analysis. Alternative File Organizations. Some Assumptions in the Analysis.

Machine Learning - Clustering. CS102 Fall 2017

Data Science Bootcamp Curriculum. NYC Data Science Academy

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Epilog: Further Topics

Distributed computing: index building and use

Outlier Pursuit: Robust PCA and Collaborative Filtering

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Python With Data Science

Data Mining Learning from Large Data Sets

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

A Geometric Analysis of Subspace Clustering with Outliers

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Generalized Principal Component Analysis CVPR 2007

CPSC 340: Machine Learning and Data Mining

Information Retrieval May 15. Web retrieval

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Specialist ICT Learning

Collaborative filtering based on a random walk model on a graph

CSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Information Retrieval Spring Web retrieval

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

Data preprocessing Functional Programming and Intelligent Algorithms

MapReduce: A Programming Model for Large-Scale Distributed Computation

Clustering Lecture 4: Density-based Methods

MapReduce. Stony Brook University CSE545, Fall 2016

数据挖掘 Introduction to Data Mining

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor


NeuroMem. A Neuromorphic Memory patented architecture. NeuroMem 1

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Data Foundations. Topic Objectives. and list subcategories of each. its properties. before producing a visualization. subsetting

DATA MINING II - 1DL460

Data Representation and Indexing

ECLT 5810 Clustering

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING

Search Engines. Information Retrieval in Practice

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Machine Learning - Regression. CS102 Fall 2017

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

The Future of High Performance Computing

Data Preprocessing. Komate AMPHAWAN

CAS CS 460/660 Introduction to Database Systems. File Organization and Indexing

Mining Web Data. Lijun Zhang

Oracle Big Data Connectors

Transcription:

Data Mining Jeff M. Phillips January 9, 2013

Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics?

Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? What you can recover from data and what you cannot recover. Algorithms for how to recover it efficiently.

Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? What you can recover from data and what you cannot recover. Algorithms for how to recover it efficiently. How to think about data analytics.

Modeling versus Efficiency Two Intertwined Objectives: Model Data Correctly Process Data Efficiently Statistics Machine Learning Data Mining Algorithms Modeling Efficiency

Outline Statistical Principals: 1. Understanding random effects Data and Distances: 2. Similarity (find duplicates and similar items) 3. Clustering (aggregate close items) Structure in Data: 3. Clustering (aggregate close items) 4. Regression (patterns in data) 5. Anomaly Detection (outliers in data) Controlling for Noise and Uncertainty: 5. Anomaly Detection (outliers in data) 6. Link Analysis (prominent structure in large graphs) 7. Summaries (concise representation)

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Raw Data to Abstract Representations How to measure similarity between data? Key idea: data point a quick brown fox jumped... 1 1 1 1 1 1 1 1 1 1 1 1 11 joe bob sue age income height 25 32 28 90K 45K 38K 1.85 1.52 1.61

Similarity Given a large set of data P. Given new point q, is q in P? Given a large set of data P. Given new point q, what is closest point in P to q? q P

Clustering How to find groups of similar data. do we need a representative? can groups overlap? what is structure of data/distance?

Clustering How to find groups of similar data. do we need a representative? can groups overlap? what is structure of data/distance? Hierarchical clustering : When to combine groups? k-means clustering : k-median, k-center, k-means++ Graph clustering : modularity, spectral

Regression Consider a data set P R d, where d is BIG! Want to find representation of P in some R k µ(p) Q R k so p i p j q i q j Q R k should capture most data in P

Regression Consider a data set P R d, where d is BIG! Want to find representation of P in some R k µ(p) Q R k so p i p j q i q j Q R k should capture most data in P L 2 Regression + PCA : Common easy approach Multidimensional Scaling : Fits in R k with k small Random Projections : Faster and easier (different bounds) L 1 Regression : Better, Orthogonal Matching Pursuit Special Topic : Compressed Sensing

Anomaly Detection What to do when data is noisy? Identify it : Find and remove outliers Model it : It may be real, affect answer Exploit it : Differential privacy (special topic)

Link Analysis How does Google Search work? Converts webpage links into directed graph. Markov Chains : Models movement in a graph PageRank : How to convert graph into important nodes MapReduce : How to scale up PageRank Communities : Other important nodes in graphs

Summaries Reducing massive data to small space. Want to retain as much as possible (not specific structure) error guarantees OnePass Sampling : Reservoir Sampling Density Approximation : Quantiles MinCount Hash : Sketching data, abstract features Spanners : graph approximations [...] :... on request... memory length m CPU word 2 [n] u ω(k, u) ω(p, u)

Themes What are course goals? Intuition for data analytics How to model data (convert to abstract data types) How to process data efficiently (balance models with algorithms)

Themes What are course goals? Intuition for data analytics How to model data (convert to abstract data types) How to process data efficiently (balance models with algorithms) Work Plan: 2-3 weeks each topic. Overview classic techniques Focus on modeling / efficiency tradeoff Special topics Short homework for each (analysis + with data) Course Project (1/2 grade). Focus on specific data set Deep exploration with technique Ongoing refinement of presentation + approach

Data Group Data Group Meeting Thursdays @ noonish in Graphics Annex http://datagroup.cs.utah.edu/dbgroup.php