Data Mining. Jeff M. Phillips. January 12, 2015 CS 5140 / CS 6140

Similar documents
Data Mining. Jeff M. Phillips. January 8, 2014

Data Mining. Jeff M. Phillips. January 9, 2013

Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

Introduction to Data Mining

CS 521 Data Mining Techniques Instructor: Abdullah Mueen

SCHEME OF COURSE WORK. Data Warehousing and Data mining

USC Viterbi School of Engineering

Lecture 27: Learning from relational data

STA141C: Big Data & High Performance Statistical Computing

Part A: Course Outline

K236: Basis of Data Science

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

MATH36032 Problem Solving by Computer. Data Science

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

COSC-589 Web Search and Sense-making Information Retrieval In the Big Data Era. Spring Instructor: Grace Hui Yang

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1394

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

UNIT 2 Data Preprocessing

Introduction to CS 4604

Dta Mining and Data Warehousing

Lecture 17: Hash Tables, Maps, Finish Linked Lists

Overlay and P2P Networks. Introduction. Prof. Sasu Tarkoma

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

61A Lecture 3. Friday, September 5

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Calculus (Math 1A) Lecture 1

CPSC 340: Machine Learning and Data Mining. Outlier Detection Fall 2018

CS 4800: Algorithms & Data. Lecture 1 January 10, 2017

CSE 158. Web Mining and Recommender Systems. Midterm recap

CPSC 340: Machine Learning and Data Mining

Data Preprocessing. Slides by: Shree Jaswal

CSE P521: Applied Algorithms

CS325 Artificial Intelligence Ch. 20 Unsupervised Machine Learning

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COMP Data Structures

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Question Bank. 4) It is the source of information later delivered to data marts.

General course information

DEPARTMENT OF COMPUTER SCIENCE

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Generalized Principal Component Analysis CVPR 2007

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

0.1 Welcome. 0.2 Insertion sort. Jessica Su (some portions copied from CLRS)

COMP Data Structures

Table Of Contents: xix Foreword to Second Edition

Data Science Bootcamp Curriculum. NYC Data Science Academy

3. Data Preprocessing. 3.1 Introduction

2. Data Preprocessing

Network Traffic Measurements and Analysis

Review: Memory, Disks, & Files. File Organizations and Indexing. Today: File Storage. Alternative File Organizations. Cost Model for Analysis

CMPSCI 711: More Advanced Algorithms

CS2013 Course Syllabus Spring 2017 Lecture: Friday 8:00 A.M. 9:40 A.M. Lab: Friday 9:40 A.M. 12:00 Noon

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Introduction to Cryptology ENEE 459E/CMSC 498R. Lecture 1 1/26/2017

Conformal Mapping and Fluid Mechanics

CPSC 340: Machine Learning and Data Mining. Hierarchical Clustering Fall 2017

ECS289: Scalable Machine Learning

In this course, you need to use Pearson etext. Go to "Pearson etext and Video Notes".

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

More on Arrays CS 16: Solving Problems with Computers I Lecture #13

Outlier Pursuit: Robust PCA and Collaborative Filtering

Bachelor of Science in Software Engineering (BSSE) Scheme of Studies ( )

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

CS 460/560 Introduction to Computational Robotics Fall 2017, Rutgers University. Course Logistics. Instructor: Jingjin Yu

61A Lecture 4. Monday, September 9

In this course, you need to use Pearson etext. Go to "Pearson etext and Video Notes".

CS 245: Database System Principles

Syllabus CSCI 405 Operating Systems Fall 2018

CS639: Data Management for Data Science. Lecture 1: Intro to Data Science and Course Overview. Theodoros Rekatsinas

Overlay and P2P Networks. Introduction. Prof. Sasu Tarkoma

Exercise 2. AMTH/CPSC 445a/545a - Fall Semester September 21, 2017

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Orange Coast College BUSINESS & COMPUTING DIVISION EVENING, WEEKEND & ONLINE COURSES. that fit your schedule and goals!

Sublinear Models for Streaming and/or Distributed Data

IMAGE ANALYSIS, CLASSIFICATION, and CHANGE DETECTION in REMOTE SENSING

Python With Data Science

Math 2280: Introduction to Differential Equations- Syllabus

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

COMP Analysis of Algorithms & Data Structures

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Supervised Learning: The Setup. Spring 2018

22 October, 2012 MVA ENS Cachan. Lecture 5: Introduction to generative models Iasonas Kokkinos

Pearson Edexcel Award

ECE 646 Cryptography and Computer Network Security. Course web page: Kris Gaj Research and teaching interests: Contact: ECE web page Courses ECE 646

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

CLASSIFICATION AND CHANGE DETECTION

CSE 373: Data Structures and Algorithms

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

COMP6237 Data Mining Introduction to Data Mining

Introduction to Data Structures

Homework # 4. Example: Age in years. Answer: Discrete, quantitative, ratio. a) Year that an event happened, e.g., 1917, 1950, 2000.

The University of Jordan. Accreditation & Quality Assurance Center. Curriculum for Doctorate Degree

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Transcription:

Data Mining CS 5140 / CS 6140 Jeff M. Phillips January 12, 2015

Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics?

Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? How to think about data analytics.

Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? How to think about data analytics. What you can recover from data and what you cannot recover. Algorithms for how to recover it efficiently.

Modeling versus Efficiency Two Intertwined (and often competing) Objectives: Model Data Correctly Process Data Efficiently Statistics Machine Learning Data Mining Algorithms Modeling Efficiency

Machine Learning Mondays + Wednesdays @ 3:00-4:20 in WEB L102 CS 5350 Machine Learning CS 6350 Machine Learning Vivek Srikumar

Machine Learning Mondays + Wednesdays @ 3:00-4:20 in WEB L102 CS 5350 Machine Learning CS 6350 Machine Learning Vivek Srikumar Classification: Given data labeled {True = +} or {False = -}, given new data, guess a label. More continuous optimization (DM more discrete)

Other Data Mining Courses Every university teaches data mining differently!

Other Data Mining Courses Every university teaches data mining differently! What flavor is offered in this class: Focus on techniques for very large scale data Broad coverage... with recent developments Formally and generally presented (proof sketches)... but useful in practice (e.g. internet companies) Probabilistic algorithms: connections to CS and Stat

Other Data Mining Courses Every university teaches data mining differently! What flavor is offered in this class: Focus on techniques for very large scale data Broad coverage... with recent developments Formally and generally presented (proof sketches)... but useful in practice (e.g. internet companies) Probabilistic algorithms: connections to CS and Stat no specific software tools / programming languages

Other Data Mining Courses Every university teaches data mining differently! What flavor is offered in this class: Focus on techniques for very large scale data Broad coverage... with recent developments Formally and generally presented (proof sketches)... but useful in practice (e.g. internet companies) Probabilistic algorithms: connections to CS and Stat no specific software tools / programming languages Maths: Linear Algebra, Probability, High-dimensional geometry

Outline Statistical Principals: 1. Understanding random effects Data and Distances: 2. Similarity (find duplicates and similar items) 3. Clustering (aggregate close items) Structure in Data: 3. Clustering (aggregate close items) 4. Regression (linearity of (high-d) data) 5. Noisy Data (anomalies in data) Controlling for Noise and Uncertainty: 5. Noisy Data (anomalies in data) 6. Link Analysis (prominent structure in large graphs)

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Statistical Principals What happens as data is generated with replacement {IP addresses, words in dictionary, edges in graph, hash table} When do items collide? When do you see all items? When is the distribution almost uniform?

Raw Data to Abstract Representations How to measure similarity between data? Key idea: data point a quick brown fox jumped... 1 1 1 1 1 1 1 1 1 1 1 1 11 joe bob sue age income height 25 32 28 90K 45K 38K 1.85 1.52 1.61

Similarity Given a large set of data P. Given new point q, is q in P? Given a large set of data P. Given new point q, what is closest point in P to q? q P

Clustering How to find groups of similar data. do we need a representative? can groups overlap? what is structure of data/distance?

Clustering How to find groups of similar data. do we need a representative? can groups overlap? what is structure of data/distance? Hierarchical clustering : When to combine groups? k-means clustering : k-median, k-center, k-means++ Graph clustering : modularity, spectral

Clustering Class Thursdays @ 10:45-12:05 in WEB 1248 CS 6955 Clustering Suresh Venkatasubramanian full semester on area we spend 3 lectures

Regression Consider a data set P R d, where d is BIG! Want to find representation of P in some R k µ(p) Q R k so p i p j q i q j Q R k should capture most data in P

Regression Consider a data set P R d, where d is BIG! Want to find representation of P in some R k µ(p) Q R k so p i p j q i q j Q R k should capture most data in P L 2 Regression + PCA : Common easy approach Multidimensional Scaling : Fits in R k with k small Random Projections : Faster and easier (different bounds) L 1 Regression : Better, Orthogonal Matching Pursuit Special Topic : Compressed Sensing

Matrix Sketching Fridays @ 1:45-3:00 in MEB 3147 (LCR) CS 7931 Data Mining Seminar CS 6961 Advanced Topic in Data Jeff Phillips full semester on area we spend 1 lectures can be taken for 1 credit

Noisy Data What to do when data is noisy? Identify it : Find and remove outliers Model it : It may be real, affect answer Exploit it : Differential privacy (ethics in data)

Link Analysis How does Google Search work? Converts webpage links into directed graph. Markov Chains : Models movement in a graph PageRank : How to convert graph into important nodes MapReduce : How to scale up PageRank Communities : Other important nodes in graphs

Summaries Reducing massive data to small space. Want to retain as much as possible (not specific structure) error guarantees OnePass Sampling : Reservoir Sampling Density Approximation : Quantiles MinCount Hash : Sketching data, abstract features Spanners : graph approximations [...] :... on request... memory length m CPU word 2 [n] u ω(k, u) ω(p, u)

Themes What are course goals? Intuition for data analytics How to model data (convert to abstract data types) How to process data efficiently (balance models with algorithms)

Themes What are course goals? Intuition for data analytics How to model data (convert to abstract data types) How to process data efficiently (balance models with algorithms) Work Plan: 2-3 weeks each topic. Overview classic techniques Focus on modeling / efficiency tradeoff Special topics Short homework for each (analysis + with data) (45% grade) 2 Quizes (10% grade) Course Project (45% grade). Focus on specific data set Deep exploration with technique Ongoing refinement of presentation + approach

On Homeworks Managed through Canvas (will be up by end of week) No restriction on programming language. Some designed for matlab, others better in python or C++. Programming assignments with not too many specifications. Bonus Questions!

Data Group Data Group Meeting Thursdays @ 12:15-1:30 in MEB 3147 (LCR) CS 7941 Data Reading Group requires one presentation if taken for credit http://datagroup.cs.utah.edu

http://www.utahdataanalyticssummit.com Thurs, Jan 22, 2015 9 AM 2 PM Register [free] at www.utahdataanalyticssummit.com