Matrix Computations and " Neural Networks in Spark

Similar documents
Distributed Computing with Spark and MapReduce

Scaled Machine Learning at Matroid

Distributed Machine Learning" on Spark

Distributed Computing with Spark

MLlib and Distributing the " Singular Value Decomposition. Reza Zadeh

Introduction to Apache Spark

Chapter 1 - The Spark Machine Learning Library

15.1 Data flow vs. traditional network programming

An Introduction to Apache Spark

MLI - An API for Distributed Machine Learning. Sarang Dev

Using Existing Numerical Libraries on Spark

Data Analytics and Machine Learning: From Node to Cluster

Big Data Infrastructures & Technologies

Integration of Machine Learning Library in Apache Apex

Rapid growth of massive datasets

Using Numerical Libraries on Spark

CS-541 Wireless Sensor Networks

DATA SCIENCE USING SPARK: AN INTRODUCTION

Outline. CS-562 Introduction to data analysis using Apache Spark

Apache SystemML Declarative Machine Learning

Accelerating Spark Workloads using GPUs

We consider the general additive objective function that we saw in previous lectures: n F (w; x i, y i ) i=1

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Machine Learning In A Snap. Thomas Parnell Research Staff Member IBM Research - Zurich

Scalable Machine Learning in R. with H2O

Processing of big data with Apache Spark

Machine Learning With Spark

Apache Spark 2.0. Matei

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models

Apache Spark MLlib. Machine Learning Library for a parallel computing framework. Review by Renat Bekbolatov (June 4, 2015) BSP

Lecture 11 Hadoop & Spark

Oracle Machine Learning Notebook

15.1 Optimization, scaling, and gradient descent in Spark

Linear Regression Optimization

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

MLlib. Distributed Machine Learning on. Evan Sparks. UC Berkeley

Specialist ICT Learning

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

A Tutorial on Apache Spark

L6: Introduction to Spark Spark

CS 179 Lecture 16. Logistic Regression & Parallel SGD

Deep Learning Frameworks with Spark and GPUs

Big Data Infrastructures & Technologies Hadoop Streaming Revisit.

Higher level data processing in Apache Spark

MapReduce, Hadoop and Spark. Bompotas Agorakis

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

Why do we need graph processing?

FatMan vs. LittleBoy: Scaling up Linear Algebraic Operations in Scale-out Data Platforms

GPU Acceleration for Machine Learning

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

Programming Systems for Big Data

An Introduction to Big Data Analysis using Spark

CERN openlab & IBM Research Workshop Trip Report

ECS289: Scalable Machine Learning

Optimization for Machine Learning

Practical Methodology. Lecture slides for Chapter 11 of Deep Learning Ian Goodfellow

Big Data processing: a framework suitable for Economists and Statisticians

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

BIDMach: Large-scale Learning with Zero Memory Allocation

Tutorial on Machine Learning Tools

Spark Overview. Professor Sasu Tarkoma.

Spark & Spark SQL. High- Speed In- Memory Analytics over Hadoop and Hive Data. Instructor: Duen Horng (Polo) Chau

Deep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers

Introduction to Spark

Fast, Interactive, Language-Integrated Cluster Computing

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

A Cloud System for Machine Learning Exploiting a Parallel Array DBMS

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Big Data Analytics. C. Distributed Computing Environments / C.2. Resilient Distributed Datasets: Apache Spark. Lars Schmidt-Thieme

M. Sc. (Artificial Intelligence and Machine Learning)

Machine Learning and SystemML. Nikolay Manchev Data Scientist Europe E-

Data Science Bootcamp Curriculum. NYC Data Science Academy

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

Parallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

ImageNet Classification with Deep Convolutional Neural Networks

Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Spark, Shark and Spark Streaming Introduction

A Brief Look at Optimization

STA141C: Big Data & High Performance Statistical Computing

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa

Cloud, Big Data & Linear Algebra

Apache Spark 2 X Cookbook Cloud Ready Recipes For Analytics And Data Science

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

C5##54&6*"6*1%2345*D&'*E2)2*F"4G)&"69

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.

Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms

Cloud Computing & Visualization

5 Machine Learning Abstractions and Numerical Optimization

2/26/2017. RDDs. RDDs are the primary abstraction in Spark RDDs are distributed collections of objects spread across the nodes of a clusters

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Chapter 4: Apache Spark

Deep Learning With Noise

Transcription:

Matrix Computations and " Neural Networks in Spark Reza Zadeh Paper: http://arxiv.org/abs/1509.02256 Joint work with many folks on paper. @Reza_Zadeh http://reza-zadeh.com

Training Neural Networks Datasets growing faster than processing speeds Solution is to parallelize on clusters and GPUs How do we do train neural networks on these things?

Outline Resilient Distributed Datasets and Spark Key idea behind mllib.linalg The Three Dimensions of Machine Learning Neural Networks in Spark 1.5 Local Linear Algebra Future of Deep Learning in Spark

Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How to split problem across nodes? Must consider network & data locality» How to deal with failures? (inevitable at scale)» Even worse: stragglers (node not failed, but slow)» Ethernet networking not fast» Have to write programs for each machine Rarely used in commodity datacenters

Spark Computing Engine Extends a programming language with a distributed collection data-structure» Resilient distributed datasets (RDD) Open source at Apache» Most active community in big data, with 75+ companies contributing Clean APIs in Java, Scala, Python, R

!! Resilient Distributed Datasets (RDDs) Main idea: Resilient Distributed Datasets» Immutable collections of objects, spread across cluster» Statically typed: RDD[T] has objects of type T val sc = new SparkContext()! val lines = sc.textfile("log.txt") // RDD[String]! // Transform using standard collection operations! val errors = lines.filter(_.startswith("error"))! val messages = errors.map(_.split( \t )(2))! messages.saveastextfile("errors.txt")! lazily evaluated kicks off a computation

MLlib: Available algorithms classification: logistic regression, linear SVM," naïve Bayes, least squares, classification tree, neural networks regression: generalized linear models (GLMs), regression tree collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) clustering: k-means decomposition: SVD, PCA optimization: stochastic gradient descent, L-BFGS

Key idea in mllib.linalg

Key Idea Distribute matrix across the cluster Ship computations involving matrix to cluster Keep vector operations local to driver

Simple Observation Matrices are often quadratically larger than vectors A: n x n (matrix) O(n 2 ) v: n x 1 (vector) O(n) " Even n = 1 million makes cluster necessary

Distributing Matrices How to distribute a matrix across machines?» By Entries (CoordinateMatrix)» By Rows (RowMatrix)» By Blocks (BlockMatrix) As of version 1.3 All of Linear Algebra to be rebuilt using these partitioning schemes

Distributing Matrices Even the simplest operations require thinking about communication e.g. multiplication How many different matrix multiplies needed?» At least one per pair of {Coordinate, Row, Block, LocalDense, LocalSparse} = 10» More because multiplies not commutative

Example mllib.linalg algorithms» Matrix Multiplication» Singular Value Decomposition (SVD)» QR decomposition» Optimization primitives» yours? Simple idea goes a long way

The Three Dimensions of Scalable Machine Learning (and Deep Learning)

ML Objectives Almost all machine learning objectives are optimized using this update w is a vector of dimension d" we re trying to find the best w via optimization

Scaling Dimensions 1) Data size: n 2) Number of models: hyperparameters 3) Model size: d

Neural Networks Data Scaling

Separable Updates Can be generalized for» Unconstrained optimization» Smooth or non-smooth» LBFGS, Conjugate Gradient, Accelerated Gradient methods,

Local Linear Algebra

Why local linear algebra? GPU and CPU hardware acceleration can be tricky on the JVM Forward and Backward propagation in Neural Networks and many other places need local acceleration

Comprehensive Benchmarks From our paper: http://arxiv.org/abs/1509.02256

Future of the Spark and Neural Networks

Coming for Neural Networks Future versions will include» Convolutional Neural Networks (Goal: 1.6)» Dropout, different layer types (Goal: 1.6)» Recurrent Neural Networks and LSTMs (Goal: 1.7)

Spark Community Most active open source community in big data 200+ developers, 50+ companies contributing 150 Contributors in past year 100 50 0 Giraph Storm

Continuing Growth Contributors per month to Spark source: ohloh.net

Conclusions

Spark and Research Spark has all its roots in research, so we hope to keep incorporating new ideas!

Conclusion Data flow engines are becoming an important platform for machine learning More info: spark.apache.org