Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Similar documents
Introduction to Data Mining

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Big Data Analytics CSCI 4030

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Big Data Analytics CSCI 4030

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Big Data Analytics CSCI 4030

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Distributed computing: index building and use

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Data contains value and knowledge. 4/1/19 Tim Althoff, UW CS547: Machine Learning for Big Data,

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

CS60021: Scalable Data Mining. Sourangshu Bhattacharya

Distributed computing: index building and use

INTRODUCTION TO BIG DATA, DATA MINING, AND MACHINE LEARNING

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Mining of Massive Datasets

Chapter 4: Apache Spark

Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

Information Networks: PageRank

Fall 2017 ECEN Special Topics in Data Mining and Analysis

Streaming Algorithms. Stony Brook University CSE545, Fall 2016

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data Management Glossary

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Databases 2 (VU) ( / )

COMP 465 Special Topics: Data Mining

MapReduce: Algorithm Design for Relational Operations

Chapter 1, Introduction

Epilog: Further Topics

Locality-Sensitive Hashing

Managing and mining (streaming) sensor data

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

COMP6237 Data Mining Introduction to Data Mining

Big Data Architect.

Data Mining. Jeff M. Phillips. January 8, 2014

D B M G Data Base and Data Mining Group of Politecnico di Torino

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Data Partitioning and MapReduce

Question Bank. 4) It is the source of information later delivered to data marts.

.. Spring 2017 CSC 566 Advanced Data Mining Alexander Dekhtyar..

Cluster Computing Architecture. Intel Labs

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Data Mining. Jeff M. Phillips. January 9, 2013

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Introduction to Data Mining and Data Analytics

60-538: Information Retrieval

Predictive Indexing for Fast Search

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Jarek Szlichta

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Chapter 5: Stream Processing. Big Data Management and Analytics 193

CS6220: DATA MINING TECHNIQUES

Principles of Data Management. Lecture #14 (Spatial Data Management)

Introduction to Data Management CSE 344

Distinct Stream Counting

Graphs (Part II) Shannon Quinn

Data Mining. Jeff M. Phillips. January 12, 2015 CS 5140 / CS 6140

Intro to Artificial Intelligence

Data Science Bootcamp Curriculum. NYC Data Science Academy

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Part 1: Link Analysis & Page Rank

Machine Learning using MapReduce

Lecture Notes to Big Data Management and Analytics Winter Term 2018/2019 Stream Processing

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Data mining with sparse grids

Data Mining. Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA.

Divide & Recombine with Tessera: Analyzing Larger and More Complex Data. tessera.io

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Mining Data that Changes. 17 July 2015

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Database Systems CSE 414

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

CompSci 516: Database Systems

Applying SnapVX to Real-World Problems

Pre-Requisites: CS2510. NU Core Designations: AD

Evolution of Database Systems

Mining of Massive Datasets

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

CSE5243 INTRO. TO DATA MINING

Jeffrey D. Ullman Stanford University

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

Recommender Systems 6CCS3WSN-7CCSMWAL

Scalable Data Analysis (CIS )

Transcription:

Instructor: Dr. Mehmet Aktaş Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3 Data contains value and knowledge

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4 But to extract the knowledge data needs to be Stored Managed And ANALYZED this class Data Mining Big Data Predictive Analytics Data Science

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6 Given lots of data Discover patterns and models that are: Valid: hold on new data with some certainty Useful: should be possible to act on the item Unexpected: non-obvious to the system Understandable: humans should be able to interpret the pattern

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7 Descriptive methods Find human-interpretable patterns that describe the data Example: Clustering Predictive methods Use some variables to predict unknown or future values of other variables Example: Recommender systems

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8 A risk with Data mining is that an analyst can discover patterns that are meaningless Statisticians call it Bonferroni s principle: Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap

Example: We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day 10 9 people being tracked 1,000 days Each person stays in a hotel 1% of time (1 day out of 100) Hotels hold 100 people (so 10 5 hotels) If everyone behaves randomly (i.e., no terrorists) will the data mining detect anything suspicious? Expected number of suspicious pairs of people: 250,000 too many combinations to check we need to have some additional evidence to find suspicious pairs of people in some more efficient way J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10 Usage Quality Context Streaming Scalability

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11 Data mining overlaps with: Databases: Large-scale data, simple queries Machine learning: Small data, Complex models CS Theory: (Randomized) Algorithms Different cultures: To a DB person, data mining is an extreme form of analytic processing queries that examine large amounts of data Result is the query answer To a ML person, data-mining is the inference of models Result is the parameters of the model In this class we will do both! CS Theory Data Mining Database systems Machine Learning

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12 This class overlaps with machine learning, statistics, artificial intelligence, databases but more stress on Scalability (big data) Algorithms Computing architectures Automation for handling large data Statistics Data Mining Machine Learning Database systems

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13 We will learn to mine different types of data: Data is high dimensional Data is a graph Data is infinite/never-ending Data is labeled We will learn to use different models of computation: MapReduce Streams and online algorithms Single machine in-memory

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14 We will learn to solve real-world problems: Recommender systems Market Basket Analysis Spam detection Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising Decision Trees Association Rules Dimensional ity reduction Spam Detection Queries on streams Perceptron, knn Duplicate document detection