CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Instructor: Jure Leskovec
TAs: Aditya Parameswaran, Bahman Bahmani, Peyman Kazemian

Course website: http://cs246.stanford.edu
  Lecture slides (posted ~30 min before the lecture)
  Announcements, homeworks, solutions
  Readings!
Readings: the book Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman
  Free online: http://i.stanford.edu/~ullman/mmds.html

Send questions/clarifications to: cs246-win1011-staff@lists.stanford.edu
Course mailing list: cs246-win1011-all@lists.stanford.edu
  If you are auditing, send us email and we will subscribe you!
Office hours:
  Jure: Tuesdays 9-10am, Gates 418
  See course website for TA office hours

4 longer homeworks: 30%
  Theoretical and programming/data analysis questions
  All homeworks (even if empty) must be handed in
  Start early!
Short weekly quizzes: 20%
  Short e-quizzes on Gradiance
  No late days!
Final exam: 50%
It's going to be fun and hard work

Date   Out   In
1/5    HW1
1/19   HW2   HW1
2/2    HW3   HW2
2/16   HW4   HW3
3/2          HW4

No class:
  1/17: Martin Luther King Jr. Day
  2/21: Presidents' Day
2 recitations:
  Review of basic concepts
  Installing and working with Hadoop

Discovery of useful, possibly unexpected, patterns in data
Subsidiary issues:
  Data cleansing: detection of bogus data (e.g., age = 150)
  Entity resolution
  Visualization: something better than megabyte files of output
  Warehousing of data (for retrieval)

Databases: concentrate on large-scale (non-main-memory) data
AI (machine learning): concentrate on complex methods, usually small data
Statistics: concentrate on models

To a database person, data mining is an extreme form of analytic processing: queries that examine large amounts of data.
  Result is the data that answers the query.
To a statistician, data mining is the inference of models.
  Result is the parameters of the model.

Much of the course will be devoted to ways to do data mining on the Web:
  Mining to discover things about the Web
    E.g., PageRank, finding spam sites
  Mining data from the Web itself
    E.g., analysis of click streams, similar products at Amazon, making recommendations

Much of the course will be devoted to large-scale computing for data mining
Challenges:
  How to distribute computation?
  Distributed/parallel programming is hard
Map-reduce addresses all of the above
  Google's computational/data manipulation model
  Elegant way to work with big data
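To make the map-reduce model concrete, here is a minimal single-machine sketch of the classic word-count job. It only imitates the three stages (map, group by key, reduce) in plain Python; it is not Hadoop or Google's implementation, and the function names and sample documents are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: two tiny "documents".
counts = word_count(["big data is big", "data mining"])
print(counts)
```

In a real cluster the map calls run on different machines and the grouping step moves data over the network; the programmer still writes only the map and reduce functions.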

Association rules, frequent itemsets
PageRank and related measures of importance on the Web (link analysis)
  Spam detection
  Topic-specific search
Recommendation systems
  E.g., what should Amazon suggest you buy?
Large-scale machine learning methods
  SVMs, decision trees, ...

Min-hashing/Locality-Sensitive Hashing
  Finding similar Web pages
Clustering data
Extracting structured data (relations) from the Web
Managing Web advertisements
Mining data streams

Algorithms: dynamic programming, basic data structures
Basic probability (CS109 or Stat116): moments, typical distributions, regression, ...
Programming (CS107 or CS145): your choice, but C++/Java will be very useful
We provide some background, but the class will be fast paced

CS345a: Data Mining was split into 2 courses:
CS246: Mining Massive Datasets
  Methods-oriented course
  Homeworks (theory & programming)
  No massive class project
CS341: Advanced Topics in Data Mining
  Project-oriented class
  Lectures/readings related to the project
  Unlimited access to an Amazon EC2 cluster
  We intend to keep the class small
  Taking CS246 is basically essential

Lots of data is being collected and warehoused:
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions
Computers are cheap and powerful
Goal: provide better, customized services (e.g., in Customer Relationship Management)

Data collected and stored at enormous speeds (GB/hour):
  Remote sensors on a satellite
  Telescopes scanning the skies
  Microarrays generating gene expression data
  Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining helps scientists:
  In classifying and segmenting data
  In hypothesis formation

There is often information hidden in the data that is not readily evident
Human analysts take weeks to discover useful information
Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995-1999.]

Non-trivial extraction of implicit, previously unknown and useful information from data

A big data-mining risk is that you will discover patterns that are meaningless.
Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

A parapsychologist in the 1950s hypothesized that some people had Extra-Sensory Perception (ESP)
He devised an experiment where subjects were asked to guess 10 hidden cards: red or blue
He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right

He told these people they had ESP and called them in for another test of the same type
Alas, he discovered that almost all of them had lost their ESP
What did he conclude?
  He concluded that you shouldn't tell people they have ESP; it causes them to lose it
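The arithmetic behind the ESP story can be checked directly. The sketch below assumes each guess is an independent fair coin flip between red and blue; the pool of 1000 subjects is a hypothetical number chosen for illustration, not one given in the slides.

```python
from fractions import Fraction


def p_all_correct(n_cards, n_colors=2):
    """Probability of guessing every card right by pure chance."""
    return Fraction(1, n_colors ** n_cards)


def expected_lucky(n_subjects, n_cards):
    """Expected number of subjects who get every card right by chance."""
    return n_subjects * p_all_correct(n_cards)


# One round of 10 red/blue cards: chance of a perfect score is 1/1024,
# which matches "almost 1 in 1000".
p_first = p_all_correct(10)

# Hypothetical pool of 1000 subjects: about one "psychic" expected by luck.
expected_first = expected_lucky(1000, 10)

# Chance that a lucky subject repeats the perfect score in a second round.
p_repeat = p_all_correct(10)

print(p_first, float(expected_first), p_repeat)
```

A lucky guesser has the same 1/1024 chance in the retest, so "losing" ESP is exactly what chance predicts. This is Bonferroni's principle at work: test enough people and some will pass by luck alone.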

Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on:
  Scalability of number of features and instances
  Algorithms and architectures
  Automation for handling large, heterogeneous data
[Figure: data mining at the overlap of statistics/AI, machine learning/pattern recognition, and database systems]

Prediction methods: use some variables to predict unknown or future values of other variables.
Description methods: find human-interpretable patterns that describe the data.

Given a database of user preferences, predict the preference of a new user
Example: predict what new movies you will like based on:
  Your past preferences
  Others with similar past preferences
  Their preferences for the new movies
Example: predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)
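One simple way to turn "others with similar past preferences" into a prediction is a similarity-weighted average of their ratings. The sketch below assumes cosine similarity as the measure and uses made-up users, items, and ratings; it illustrates the idea only and is not the method the course prescribes.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two rating dicts; unrated items count as 0."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(target, others, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other in others:
        if item not in other:
            continue  # this user cannot tell us anything about the item
        weight = cosine(target, other)
        num += weight * other[item]
        den += weight
    return num / den if den > 0 else None


# Hypothetical ratings on a 1-5 scale.
you = {"movie_a": 5, "movie_b": 1}
others = [
    {"movie_a": 5, "movie_b": 1, "new_movie": 5},  # tastes like yours
    {"movie_a": 1, "movie_b": 5, "new_movie": 1},  # opposite tastes
]
prediction = predict(you, others, "new_movie")
print(prediction)
```

The like-minded user gets the larger weight, so the prediction lands closer to that user's rating than to the rating of the user with opposite tastes.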

Detect significant deviations from normal behavior
Applications:
  Credit card fraud detection
  Network intrusion detection
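A minimal way to express "deviation from normal behavior" is to learn a baseline from past observations and flag new values that fall too many standard deviations away. The sketch below assumes that approach; the transaction amounts and the threshold of 3 are illustrative choices, and real fraud or intrusion detectors use far richer models.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > threshold * sigma


# Hypothetical past card transactions (in dollars) for one customer.
history = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0, 26.0, 23.0]
baseline = fit_baseline(history)

# Score two new transactions against the learned baseline.
flags = [is_anomalous(amount, baseline) for amount in [28.0, 950.0]]
print(flags)
```

Fitting the baseline on past data only keeps an extreme new value from inflating the standard deviation it is judged against.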

Supermarket shelf management:
  Goal: to identify items that are bought together by sufficiently many customers.
  Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  A classic rule: if a customer buys diapers and milk, then he is likely to buy beer. So, don't be surprised if you find six-packs stacked next to diapers!

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
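The strength of the two rules can be checked against the five transactions with the usual support and confidence measures. The slide itself does not define these measures, so treat the definitions in the sketch as the standard ones rather than something stated in the source; the helper names are illustrative.

```python
# The five market baskets from the table (TID 1 to 5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset):
    """Fraction of all baskets that contain `itemset`."""
    return count(itemset) / len(transactions)


def confidence(lhs, rhs):
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    return count(lhs | rhs) / count(lhs)


milk_coke = confidence({"Milk"}, {"Coke"})             # 3 of 4 milk baskets
diaper_milk_beer = confidence({"Diaper", "Milk"}, {"Beer"})  # 2 of 3 baskets
print(support({"Milk", "Coke"}), milk_coke, diaper_milk_beer)
```

On this toy table, {Milk} --> {Coke} holds in 3 of the 4 baskets with milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 baskets with both diapers and milk.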

Process of semi-automatically analyzing large datasets to find patterns that are:
  Valid: hold on new data with some certainty
  Novel: non-obvious to the system
  Useful: should be possible to act on the item
  Understandable: humans should be able to interpret the pattern

Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
  Won over the (manual) knowledge engineering approach
  http://www.cs.columbia.edu/~sal/jam/project/ provides a good detailed description of the entire process

Major US bank: customer attrition prediction
  Segment customers based on financial behavior: 3 segments
  Build attrition models for each of the 3 segments
  40-50% of attritions were predicted: a factor of 18 increase

Targeted credit marketing: major US banks
  Find customer segments based on 13 months of credit balances
  Build another response model based on surveys
  Increased response 4 times, 2%

Scalability
Dimensionality
Complex and heterogeneous data
Data quality
Data ownership and distribution
Privacy preservation
Streaming data

Banking: loan/credit card approval
  Predict good customers based on old customers
Customer relationship management:
  Identify those who are likely to leave for a competitor
Targeted marketing:
  Identify likely responders to promotions
Fraud detection: telecommunications, finance
  From an online stream of events, identify fraudulent events
Manufacturing and production:
  Automatically adjust knobs when process parameters change

Medicine: disease outcome, effectiveness of treatments
  Analyze patient disease history: find relationships between diseases
Molecular/pharmaceutical: identify new drugs
Scientific data analysis: identify new galaxies by searching for sub-clusters
Web site/store design and promotion: find affinity of visitors to pages and modify layout