Machine Learning for Large-Scale Data Analysis and Decision Making (80-629-17A). Distributed Machine Learning. Week #9

Today: distributed computing for machine learning. Background on MapReduce/Hadoop and Spark, theory and practice. Note: most lectures so far used statistics concepts; today we'll turn to computer science.

Modern view of ML: understand massive quantities of data. Google (~4K searches/s), Twitter (~6K tweets/s), Amazon (100s of products sold/s) (source: internetlivestats.com), banks, insurance companies, etc., and even modestly-sized websites. Both large $n$ and large $p$: the computation scales up with the data (the $n \times p$ design matrix $X$). Often, fitting an ML model requires one or more operations that look at the whole dataset, e.g. linear regression: $w = (X^\top X)^{-1} X^\top Y$.

Issues with massive datasets: 1. storage; 2. computation.

Moore's Law [figures: https://en.wikipedia.org/wiki/Moore%27s_law]

Modern computation paradigms. Compute power is measured in floating-point operations per second (Flops); 1 Tera = 1,000 Giga. A smartphone delivers roughly 0.005 TFlops.
1. Single computers: large computers deliver on the order of 125 to 4,359 TFlops.
2. Specialized hardware, focused on a subset of operations: Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs), around 7.5 TFlops.
3. Distributed computation: around 100,000 TFlops (Folding@home).

Distributed computing. Faster computers can help, but what about a large number of slow computers working together? Divide the computation into small problems: 1. all the (slow) computers solve a small problem at the same time; 2. the solutions of the small problems are combined into the solution of the initial problem.

Simple example: you are tasked with counting the number of houses in Montreal. 1. Centralized (single computer): ask a marathon runner to jog around the city and count, or build a system to count houses from satellite imagery. 2. Distributed (many computers): ask 1,000 people to each count the houses in a small geographical area; once they are done, they report their results to your HQ.

Tool for distributed computing (for machine learning): Apache Spark (https://spark.apache.org/). It builds on MapReduce ideas, with more flexible computation graphs, high-level APIs, and MLlib.

Outline: MapReduce (fundamentals and a bag-of-words example), Spark (fundamentals and MLlib), live examples.

MapReduce: from Google engineers ("MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, 2004); its best-known open-source implementation is (Apache) Hadoop. Google built large-scale computation from commodity hardware. MapReduce exposes a specific distributed interface and is useful for algorithms that can be expressed using that interface.

MapReduce defines two types of tasks: A. Map: solve a subproblem; B. Reduce: combine the results of the map workers. [Figure: the initial problem is split across map workers, whose outputs are combined by reduce workers into the solution.]

TASK: create a document's bag-of-words representation. Input documents: "The black dog", "A black cat", "The blue cat".
A. Map: each worker emits (word, 1) pairs for its document: (The, 1), (black, 1), (dog, 1); (A, 1), (black, 1), (cat, 1); (The, 1), (blue, 1), (cat, 1).
Sort by key: one reduce worker receives (The, 1), (The, 1), (black, 1), (black, 1), (dog, 1), (cat, 1), (cat, 1); the remaining keys (A, blue) go to other workers.
B. Reduce: sum the counts per key: (The, 2), (black, 2), (dog, 1), (cat, 2).
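
To make the map / sort-by-key / reduce flow concrete, here is a minimal single-machine sketch in plain Python over the same three toy documents; the grouping step plays the role of MapReduce's shuffle/sort phase (this is an illustration, not the lecture's code):

    from collections import defaultdict

    documents = ["The black dog", "A black cat", "The blue cat"]

    # Map: emit a (word, 1) pair for every word of every document
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle/sort: group the emitted values by key
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: sum the counts for each key
    bag_of_words = {word: sum(counts) for word, counts in groups.items()}
    print(bag_of_words)  # {'The': 2, 'black': 2, 'dog': 1, 'A': 1, 'cat': 2, 'blue': 1}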

Some details: typically the number of subproblems is larger than the number of available machines, and the speed-up is roughly linear with respect to the number of machines. If a node crashes, its subproblem must be recomputed. Input/output: the data is read from disk at the beginning and the results are written to disk at the end.

MapReduce is quite versatile. When I was at Google the saying was: "If your problem cannot be framed as MapReduce, you haven't thought hard enough about it." A few examples of map-reduceable problems: sorting, filtering, distinct values, basic statistics, finding common friends, SQL-like queries, sentiment analysis.

MapReduce for machine learning: training linear regression. Reminder: there is a closed-form solution, $w = (X^\top X)^{-1} X^\top Y$, which can be rewritten in terms of sums over the examples: $w = \left(\sum_i x_i x_i^\top\right)^{-1} \left(\sum_i x_i y_i\right)$. Each term in the sums can be computed independently, so the map step assigns examples $x_0, x_1, \dots$ to different workers and the reduce step adds up their partial sums. Other models we studied also have a closed-form solution (e.g., Naive Bayes and LDA). Hyper-parameter search is another natural fit: for example, one worker fits a neural network with 2 hidden layers and 5 hidden units per layer while another fits one with 3 hidden layers and 10 hidden units.
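
As an illustration of this decomposition, here is a minimal single-machine sketch that computes the two sums chunk by chunk and then combines them, mimicking the map and reduce steps (the chunking, helper names, and toy data are assumptions for the example, not the lecture's code):

    from functools import reduce
    import numpy as np

    def map_chunk(chunk):
        # each "worker" returns the partial sums for its chunk of the data
        X, y = chunk
        return X.T @ X, X.T @ y

    def reduce_partials(a, b):
        # combine two partial results by summing them
        return a[0] + b[0], a[1] + b[1]

    def fit_linear_regression(chunks):
        XtX, Xty = reduce(reduce_partials, map(map_chunk, chunks))
        return np.linalg.solve(XtX, Xty)

    # toy usage: 100 examples, 3 features, split into 4 chunks
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
    chunks = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]
    print(fit_linear_regression(chunks))  # close to [1.0, -2.0, 0.5]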

Shortcomings of MapReduce: many models are fitted with iterative algorithms. Gradient descent: 1. compute the gradient for the current set of parameters; 2. update the parameters with the gradient. This is not ideal for MapReduce: it would require several rounds of MapReduce, and each time the data is read from and written to disk.

(Apache) Spark: advantages over MapReduce. 1. A less restrictive computation graph (a DAG instead of Map then Reduce), so intermediate results don't have to be written to disk between operations. 2. A richer set of transformations: map, filter, cartesian, union, intersection, distinct, etc. 3. In-memory processing.

Spark history: started in Berkeley's AMPLab (2009). Version 1.0 (2014), based on Resilient Distributed Datasets (RDDs); version 2.0 (June 2016); version 2.3 (February 2018). We will be using pyspark. Best (current) documentation: 1. Advanced Analytics with Spark, 2nd edition (2017); 2. the project docs: https://spark.apache.org/docs/latest/

[Figure source: https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html]

Resilient Distributed Datasets (RDDs): a data abstraction. An RDD is a collection of partitions, and partitions are the unit of distribution; operations on RDDs are (automatically) distributed. RDDs support two types of operations: 1. transformations, which transform a dataset and return a new one; 2. actions, which compute a result based on an RDD. These operations can be chained into complex execution flows.
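
A minimal pyspark sketch of the transformation/action distinction (assuming a local Spark installation; the data, lambda functions, and partition count are just for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    nums = sc.parallelize(range(10), numSlices=4)  # an RDD split into 4 partitions
    squares = nums.map(lambda x: x * x)            # transformation: lazy, returns a new RDD
    evens = squares.filter(lambda x: x % 2 == 0)   # another transformation, chained
    total = evens.reduce(lambda a, b: a + b)       # action: triggers the distributed computation
    print(total)  # 120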

DataFrames: an extra abstraction on top of RDDs. A DataFrame encodes rows as a set of columns, where each column has a defined type; this is useful for (pre-processed) machine learning datasets. The name is the same as R's data.frame or pandas.DataFrame, and it is a similar type of abstraction, but for distributed datasets. For our needs, two types of operations work on DataFrames: transformers and estimators.

Spark's "Hello World":

    data = spark.read.format("libsvm").load("hdfs://")
    model = LogisticRegression(regParam=0.01).fit(data)

The first line returns a DataFrame; LogisticRegression is an estimator.
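
For reference, a minimal self-contained variant with the required imports, using a tiny inline DataFrame in place of the HDFS file (the toy rows and values are illustrative, not the lecture's data):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("hello-mllib").getOrCreate()

    # a tiny DataFrame with the (label, features) schema that MLlib expects
    train = spark.createDataFrame(
        [(1.0, Vectors.dense(0.0, 1.1, 0.1)),
         (0.0, Vectors.dense(2.0, 1.0, -1.0)),
         (0.0, Vectors.dense(2.0, 1.3, 1.0)),
         (1.0, Vectors.dense(0.0, 1.2, -0.5))],
        ["label", "features"])

    lr = LogisticRegression(regParam=0.01, maxIter=10)  # estimator
    model = lr.fit(train)                               # fitting returns the model (a transformer)
    print(model.coefficients)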

Parallel gradient descent. Logistic regression: $y = \frac{1}{1 + \exp(-w_0 - w_1 x_1 - w_2 x_2 - \dots - w_p x_p)}$. There is no closed-form solution, but we can use gradients, $\frac{\partial\,\mathrm{Loss}(Y, X, w)}{\partial w_i}$. Loss functions are often decomposable: $\frac{\partial\,\mathrm{Loss}(Y, X, w)}{\partial w_i} = \sum_j \frac{\partial\,\mathrm{Loss}(Y_j, X_j, w)}{\partial w_i}$, so the per-example terms $\frac{\partial\,\mathrm{Loss}(Y_0, X_0, w)}{\partial w_i}, \frac{\partial\,\mathrm{Loss}(Y_1, X_1, w)}{\partial w_i}, \dots, \frac{\partial\,\mathrm{Loss}(Y_n, X_n, w)}{\partial w_i}$ can be computed on different workers and then summed.
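
A rough pyspark sketch of this idea (the toy data, step size, and iteration count are assumptions for the example; in practice MLlib's LogisticRegression handles the optimization for you):

    import numpy as np
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-gd").getOrCreate()
    sc = spark.sparkContext

    # toy data: an RDD of (label, feature-vector) pairs, split into 4 partitions
    rng = np.random.default_rng(0)
    true_w = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(1000, 3))
    y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
    points = sc.parallelize(list(zip(y, X)), 4)

    def point_gradient(w, y_j, x_j):
        # gradient of the logistic loss for a single example (y_j in {0, 1})
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x_j)))
        return (p - y_j) * x_j

    w = np.zeros(3)
    for _ in range(20):
        # map: per-example gradient terms; reduce: sum them across the cluster
        grad = points.map(lambda point: point_gradient(w, point[0], point[1])).reduce(add)
        w = w - 0.1 * grad  # gradient step on the driver; the new w ships with the next map
    print(w)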

Demo: fitting logistic regression [https://www.kaggle.com/c/criteo-display-ad-challenge]. Dataset: ad clicks given ad and user features; 45M observations, 39 features (13 continuous, 26 categorical). [Spark notebook: http://35.237.145.20:8123/notebooks/80-629-storage/notebooks/sparkexamples.ipynb]

Resources for 80-629 (think of your project): ask me for an account. You get notebook access with the following kernels available: python2, python3, pyspark. Your account is on a Google Cloud cluster.

Takeaways: distributed computing is useful for large-scale data and for faster computing. Current frameworks (e.g., Spark) offer easy access to popular ML models and algorithms. Useful speedups come from decomposing the computation into a number of identical smaller pieces. It still requires some engineering/coding.

Administrative matters. Final presentation: in front of the class or a poster session? Final exam: with or without documentation?