Machine Learning In A Snap. Thomas Parnell, Research Staff Member, IBM Research - Zurich


What are GLMs? Generalized Linear Models span both regression and classification:
- Regression: Ridge Regression, Lasso Regression
- Classification: Support Vector Machines, Logistic Regression

Why are GLMs useful?
- Fast training: can scale to datasets with billions of examples and/or features.
- Less tuning: state-of-the-art algorithms for training linear models do not involve a step-size parameter.
- Interpretability: new data protection regulations in Europe (GDPR) give E.U. citizens the right to obtain an explanation of a decision reached by an algorithm.
- Widely used in industry: the Kaggle State of Data Science survey asked 16,000 data scientists and ML practitioners what tools and algorithms they use on a daily basis; 37.6% of respondents use Neural Networks, while 63.5% use Logistic Regression.

What is Snap Machine Learning? Snap ML: a new framework for fast training of GLMs.

Framework             Models              GPU Acceleration   Distributed Training   Sparse Data Support
scikit-learn          ML \ {DL}           No                 No                     Yes
Apache Spark MLlib*   ML \ {DL}           No                 Yes                    Yes
TensorFlow**          Deep Learning (DL)  Yes                Yes                    Limited
Snap ML               GLMs                Yes                Yes                    Yes

* The Apache Software Foundation (ASF) owns all Apache-related trademarks, service marks, and graphic logos on behalf of our Apache project communities, and the names of all Apache projects are trademarks of the ASF.
** TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

Multi-level Parallelism
- Level 1: parallelism across nodes connected via a network interface.
- Level 2: parallelism across GPUs within the same node, connected via an interconnect (e.g. NVLINK).
- Level 3: parallelism across the streaming multiprocessors of the GPU hardware.

GPU Acceleration
GLMs can be effectively trained using stochastic coordinate descent (SCD) [Shalev-Shwartz 2013]. Recently, asynchronous variants of SCD have been proposed that run on multi-core CPUs [Liu 2013] [Tran 2015] [Hsieh 2015]. Twice-parallel asynchronous SCD (TPA-SCD) is another recent variant, designed to run on GPUs. It assigns each coordinate update to a different block of threads that execute asynchronously; within each thread block, the update to the local model is computed using many tightly coupled threads.
Reference: T. Parnell, C. Dünner, K. Atasu, M. Sifalakis and H. Pozidis, "Tera-scale coordinate descent on GPUs," Future Generation Computer Systems, 2018.
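
To make the underlying primitive concrete, here is a minimal sequential SCD sketch for ridge regression in NumPy. This is our own didactic code, not the Snap ML or TPA-SCD implementation; it illustrates why the per-coordinate update is closed-form, so no step-size parameter needs tuning.

```python
# Minimal stochastic coordinate descent (SCD) for ridge regression:
# minimize 0.5*||y - Xw||^2 + 0.5*lam*||w||^2.
# Illustrative only; TPA-SCD parallelizes updates like these on the GPU.
import numpy as np

def scd_ridge(X, y, lam=1.0, epochs=10, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    r = y.copy()                      # residual r = y - X @ w (w starts at 0)
    col_sq = (X ** 2).sum(axis=0)     # precomputed ||x_j||^2
    for _ in range(epochs):
        for j in rng.permutation(d):  # visit coordinates in random order
            w_old = w[j]
            # closed-form minimizer over coordinate j: no step size to tune
            w[j] = X[:, j] @ (r + X[:, j] * w_old) / (col_sq[j] + lam)
            r -= X[:, j] * (w[j] - w_old)   # keep the residual in sync
    return w

# Example: recover weights from noisy linear data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.01 * rng.normal(size=200)
print(scd_ridge(X, y, lam=0.1))       # close to [1, 2, 3, 4, 5]
```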

Streaming Pipeline
When the dataset is too big to fit into aggregate GPU memory, we need to copy the training data onto the GPU in batches. The time required to copy the next batch of data onto the GPU can become a bottleneck. In Snap ML we use CUDA streams to implement a streaming pipeline: we copy the data for the next batch while the GPU is training on the current batch. With high-speed interconnects such as NVLINK, this allows us to completely hide the copy time behind the training time.
[Diagram: three-lane timeline (CPU, interconnect, GPU). While the GPU sorts the random numbers RN (i) and runs Train (i), the CPU runs RNG (i+1) and the interconnect carries Copy data (i+1) and Copy RN (i+1); the pattern repeats for batches i+1, i+2, ...]
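
To illustrate the copy/compute overlap, here is a minimal double-buffering sketch using CuPy streams. This is our own illustration, not Snap ML's internal implementation (which is C++/CUDA); train_on_chunk, the chunk sizes and the buffer handling are all placeholders.

```python
# Double-buffered streaming: copy chunk i+1 on a side stream while the
# default stream trains on chunk i. Illustrative sketch only.
import numpy as np
import cupy as cp

def train_on_chunk(chunk):
    """Placeholder for one training pass over a chunk (e.g. TPA-SCD updates)."""
    return float(chunk.sum())  # float() forces the compute stream to finish

n_chunks = 8
chunk_shape = (1 << 20,)
host_chunks = [np.random.rand(*chunk_shape).astype(np.float32)
               for _ in range(n_chunks)]

copy_stream = cp.cuda.Stream(non_blocking=True)
bufs = [cp.empty(chunk_shape, dtype=cp.float32) for _ in range(2)]

bufs[0].set(host_chunks[0])            # prime the pipeline (synchronous)
for i in range(n_chunks):
    if i + 1 < n_chunks:
        # Copy the next chunk into the other buffer on a separate stream
        # while the default stream trains on the current one. (True overlap
        # also requires page-locked host buffers.)
        bufs[(i + 1) % 2].set(host_chunks[i + 1], stream=copy_stream)
    train_on_chunk(bufs[i % 2])        # compute on the default stream
    copy_stream.synchronize()          # make sure the next buffer is ready
```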

- libglm: underlying C++/CUDA template library.
- snap-ml-local: small to medium-scale data; single-node deployment; multi-GPU support; scikit-learn compatible Python API.
- snap-ml-mpi: large-scale data; multi-node deployment in HPC environments; many-GPU support; Python API.
- snap-ml-spark: large-scale data; multi-node deployment in Apache Spark environments; many-GPU support; Python/Scala/Java* API.
*Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Example: snap-ml-local
Accelerate existing scikit-learn applications by changing only 2 lines of code, as sketched below.
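
The slide's code screenshot is not reproduced in the transcription. The following is a hedged sketch of what the two-line change typically looks like; the snap_ml import path and the use_gpu flag are assumptions and may differ across Snap ML versions.

```python
# Sketch of the "2-line change": swap the scikit-learn estimator for its
# Snap ML counterpart. Import path and use_gpu flag are hypothetical;
# consult the Snap ML documentation for the exact API.
import numpy as np
from sklearn.model_selection import train_test_split

# from sklearn.linear_model import LogisticRegression   # line 1: before
from snap_ml import LogisticRegression                   # line 1: after (hypothetical path)

X = np.random.rand(10000, 100).astype(np.float32)
y = (np.random.rand(10000) > 0.5).astype(np.float32)
X_tr, X_te, y_tr, y_te = train_test_split(X, y)

# clf = LogisticRegression()                             # line 2: before
clf = LogisticRegression(use_gpu=True)                   # line 2: after (hypothetical flag)
clf.fit(X_tr, y_tr)
print(clf.predict(X_te)[:10])
```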

Example: snap-ml-mpi
Describe the application using high-level Python code, then launch it on 4 nodes using mpirun (4 GPUs per node); see the sketch below.
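
Again, the slide's code is not captured in the transcription. Below is a hedged sketch of a snap-ml-mpi style application: the snap_ml_mpi module name and the estimator's parameters are assumptions, not the exact Snap ML API; only the overall shape (a Python script launched via mpirun) comes from the slide.

```python
# Hedged sketch of a snap-ml-mpi style application.
# Launch (illustrative): mpirun -np 4 -hostfile hosts python train_ctr.py
from sklearn.datasets import load_svmlight_file
from snap_ml_mpi import LogisticRegression           # hypothetical module path

# Each rank would typically load its own partition of the data;
# a single illustrative file is used here for brevity.
X, y = load_svmlight_file("criteo_part.svm")         # illustrative file name
clf = LogisticRegression(use_gpu=True, num_gpus=4)   # hypothetical flags
clf.fit(X, y)
print(clf.predict(X)[:10])
```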

Click-through Rate (CTR) Prediction
Can we train ML models to predict whether or not a user will click on an advert? This is the core business of many internet giants, and labelled training examples are being generated in real-time.

Dataset        Features   Examples     Size (SVM Light)
criteo-kaggle  1 million  45 million   40 Gigabytes
criteo-tb      1 million  4.2 billion  3 Terabytes
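
Since both datasets ship in SVM Light format, a standard way to load a slice of them for experimentation is scikit-learn's load_svmlight_file; the file name below is illustrative.

```python
# Load SVM Light formatted CTR data into a sparse matrix; the file name is
# illustrative. load_svmlight_file returns a scipy.sparse CSR matrix, which
# suits the highly sparse one-hot features typical of CTR data.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("criteo_sample.svm", n_features=1_000_000)
print(X.shape, X.nnz, y[:5])
```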

Single-node Performance
Dataset        Examples    Nodes                  Total GPUs  Network  CPU-GPU interface
criteo-kaggle  45 million  1x Power AC922 server  1x V100     n/a      NVLINK 2.0

Terabyte-scale Benchmark
Dataset    Examples     Nodes                  Total GPUs  Network     CPU-GPU interface
criteo-tb  4.2 billion  4x Power AC922 server  16x V100    InfiniBand  NVLINK 2.0

Conclusions
- Snap ML is a new framework for fast training of GLMs.
- Snap ML benefits from GPU acceleration and can be deployed in single-node and multi-node environments.
- The hierarchical structure of the framework makes it suitable for cloud-based deployments.
- It can leverage fast interconnects like NVLINK to achieve streaming, out-of-core performance.
- Snap ML significantly outperforms other software frameworks for training GLMs in both single-node and multi-node benchmarks.
- Snap ML can train a logistic regression classifier on the Criteo Terabyte Click Logs data in 1.5 minutes.
- Snap ML is available as part of IBM PowerAI (v5.2 onwards).

The Snap ML team: Celestine Dünner, Dimitrios Sarigiannis, Andreea Anghel, Nikolas Ioannou, Haris Pozidis, Thomas Parnell.
https://www.zurich.ibm.com/snapml/


Example: snap-ml-spark
Describe the application using high-level Python code, then launch it on a Spark cluster (1 GPU per Spark executor); see the sketch below.
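
As with the other examples, the slide's code is not captured. Here is a hedged sketch of a snap-ml-spark style application; the snap_ml_spark module name and the estimator's parameters are assumptions, not the exact Snap ML API. Loading libsvm data via Spark's built-in data source is standard PySpark.

```python
# Hedged sketch of a snap-ml-spark style application.
# Launch (illustrative): spark-submit --master yarn --num-executors 4 train_ctr.py
from pyspark.sql import SparkSession
from snap_ml_spark import LogisticRegression        # hypothetical module path

spark = SparkSession.builder.appName("snap-ml-ctr").getOrCreate()

# Load SVM Light data as a DataFrame of (label, features) rows.
df = spark.read.format("libsvm").load("criteo_sample.svm")

clf = LogisticRegression(max_iter=50, use_gpu=True)  # hypothetical flags
model = clf.fit(df)                                  # assumed Spark-ML-like API
model.transform(df).show(5)
```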

Out-of-core performance (PCIe Gen3)
Dataset    Examples     Nodes                     Total GPUs  Network  CPU-GPU interface
criteo-tb  200 million  1x Intel Xeon* Gold 6150  1x V100     n/a      PCIe Gen3

[Timeline: stream S1 interleaves Init (12ms) and Train chunk (i) on GPU (90ms); stream S2 copies chunk (i+1) onto the GPU (318ms). Each copy takes 318ms versus roughly 102ms of compute, so the pipeline is bound by the PCIe copy time at about 330ms per chunk.]

*Intel Xeon is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Out-of-core performance (NVLINK 2.0)
Dataset    Examples     Nodes                  Total GPUs  Network  CPU-GPU interface
criteo-tb  200 million  1x Power AC922 server  1x V100     n/a      NVLINK 2.0

[Timeline: stream S1 interleaves Init (3ms) and Train chunk (i) on GPU (90ms); stream S2 copies chunk (i+1) onto the GPU (55ms). Each copy takes only 55ms versus 93ms of compute, so the copy is completely hidden behind training and the pipeline runs compute-bound at 93ms per chunk.]