CERN openlab & IBM Research Workshop Trip Report

Jakob Blomer, Javier Cervantes, Pere Mato, Radu Popescu
2018-12-03

Workshop Organization

- One full day at IBM Research Zürich: ~25 participants from CERN, ~10 staff from IBM
- A second joint workshop on AI technologies is planned for 11 December at CERN
- Goal: identify ground for common research activities
- Morning session: presentations and open discussion on key technologies by IBM engineers
  - AI software kit for generalized linear models (Snap ML)
  - NVLink CPU-GPU interconnect
  - Near-memory and in-memory computing
- Afternoon session: split between quantum computing and storage technologies
  - Forecast of tape drive evolution
  - AI-based prediction of data popularity
  - Apache Crail: Spark for fast (NVM-like) storage

Artificial Intelligence Software Kit

- Increasing demand for AI from all sectors
- Synergies between hardware and software are crucial
- Three main research topics:
  - Software: open-source framework
  - Hardware optimization
  - Software-hardware integration

IBM PowerAI Platform

- Environment for data science as a service
- Makes deep learning and machine learning more accessible
- Built on open-source tools
- Accelerated IBM Power servers, optimized for:
  - Distributed Deep Learning (DDL)
  - Deep Learning Inference (DLI)
  - Scheduling work at the hardware level (distributed GPUs)
  - Machine learning 46x faster (same algorithm, different hardware)

Snap ML Framework

- Library for fast training of generalized linear models
- Supports only the most widely used models (based on the Kaggle 2017 survey)
- Benefits from optimized hardware architectures (IBM Power, NVIDIA GPUs)
- Aims to remove training time as a bottleneck

Snap ML Framework Features

- Distributed training, GPU acceleration, support for sparse data structures
- Three levels of parallelism:
  - Data parallelism across worker nodes in a cluster
  - Parallelism across heterogeneous compute units within one worker node
  - Multi-core parallelism within individual compute units
- Three APIs:
  - snap-ml-local: scikit-learn-like interface for training on a single machine (see the sketch after this list)
  - snap-ml-mpi: distributed training of ML models across a cluster of machines
  - snap-ml-spark: spark.ml-like interface, integration with pyspark applications

Source: https://arxiv.org/pdf/1803.06333.pdf
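A minimal sketch of what the scikit-learn-like snap-ml-local interface looks like. The import path and the use_gpu flag are assumptions taken from the paper; exact package and parameter names may differ between releases.

```python
# Sketch of snap-ml-local usage; package and parameter names are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from snap_ml import LogisticRegression  # assumed import path

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# use_gpu offloads training to an attached GPU (assumed parameter name)
clf = LogisticRegression(use_gpu=True, max_iter=100)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The point of the scikit-learn-like API is that an existing single-node workflow can switch to Snap ML by changing little more than the import.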

Snap ML + IBM Hardware: Single-Node Performance

- Scikit-learn: single-threaded, without GPU (dataset in CPU memory)
- TensorFlow: multi-threaded, one GPU (batch mode)
- Snap ML: multi-threaded, one GPU (dataset in GPU memory)
- The difference between TensorFlow and scikit-learn can be explained by the highly optimized C++ backend of scikit-learn for workloads that fit in memory, whereas TensorFlow processes data in batches
- The difference between TensorFlow and Snap ML was not clearly explained

Test system: cluster of 4 IBM Power Systems AC922 servers, each with 4 NVIDIA Tesla V100 GPUs attached via the NVLink 2.0 interface

Snap ML + IBM Hardware: Out-of-Core Performance

- The dataset does not fit into GPU memory
- The NVLink 2.0 bandwidth hides the data copy time behind kernel execution, effectively removing the CPU-GPU copy time from the critical path and resulting in a 3.5x speed-up

Test system: cluster of 4 IBM Power Systems AC922 servers, each with 4 NVIDIA Tesla V100 GPUs attached via the NVLink 2.0 interface

Snap ML + IBM Hardware: Tera-Scale Benchmark

- Click-through rate (CTR) prediction: a classification task with 2.3 TB of training data
- Snap ML: 1.53 minutes, including data loading, initialization, training and testing time
- 46x faster than the best previously reported result, obtained using TensorFlow

Test system: cluster of 4 IBM Power Systems AC922 servers, each with 4 NVIDIA Tesla V100 GPUs attached via the NVLink 2.0 interface

Near-Memory and In-Memory Computing

- Near-memory computing: a programmable FPGA between memory and CPU that allows manipulating memory controller behavior
  - E.g. dynamically adjusting the precision of floating point values
  - Gathering values from memory in a cache-line optimized layout
  - Following pointer chains, such as virtual function calls
- In-memory computing: use the physics of a phase-change memory chip for (analog) matrix-vector multiplication (see the sketch below)
  - Can speed up forward and backward propagation for deep learning
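An illustrative sketch of the operation in-memory hardware targets, not IBM's implementation: the matrix-vector product at the heart of forward and backward propagation, computed approximately. A crude software model is an exact product plus weight quantization and read noise; the noise and quantization levels below are arbitrary assumptions for illustration only.

```python
# Model an analog in-memory matrix-vector product as exact MVM plus
# quantization and read noise (numbers are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))   # layer weights (stored in the PCM array)
x = rng.standard_normal(512)          # layer input

y_exact = W @ x                       # digital reference result

levels = 2 ** 4                       # few conductance levels per cell (assumed)
W_analog = np.round(W * levels) / levels + 0.01 * rng.standard_normal(W.shape)
y_analog = W_analog @ x

rel_err = np.linalg.norm(y_analog - y_exact) / np.linalg.norm(y_exact)
print(f"relative error of the analog matrix-vector product: {rel_err:.3f}")
```

The attraction is that the analog array performs the whole product in one step in place, trading some precision for avoiding the data movement that dominates training cost.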

Data Storage: Tape Is (Still) Relevant!

- Bit density improvements in HDDs are flattening out; energy-assisted writing techniques are quite challenging
- The room for bit density growth on tape is much larger than on HDD, with a clear roadmap for the next two decades
- Software challenge: predict data temperature and optimize tape transfers
- Archival storage on optical drives has no well-defined roadmap
- But: IBM remains the only vendor of tape drive hardware

High Performance Distributed Data Store: Apache Crail

- Distributed storage middleware optimized for ephemeral data on fast devices
- Framework-agnostic (Spark, TensorFlow, ...)
- Plugins (APIs) for different file systems
- Uses kernel-bypass techniques
- Impressive benchmarks: a factor of 6 can be obtained by just changing the storage layer compared to off-the-shelf Spark (see the sketch below)
- Apache incubator project
- Requires conversion to a custom data format (to be further discussed)
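A hedged sketch of what "just changing the storage layer" could look like for a Spark job using the Crail shuffle plugin. The property value and the CrailShuffleManager class path are assumptions; the Crail documentation is authoritative.

```python
# Point Spark's shuffle at Crail without touching the application logic.
# The shuffle manager class name below is an assumption, not verified here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("crail-shuffle-demo")
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.crail.CrailShuffleManager")  # assumed class
    .getOrCreate()
)

# The application code itself stays unchanged
df = spark.range(1_000_000).withColumnRenamed("id", "value")
print(df.groupBy(df.value % 10).count().collect())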

Crail: Hardware Abstraction Layer

Crail: Deployment Modes

Summary

- IBM is looking for new clients to leverage its new products
- Strong focus on tailored hardware to optimize current ML and DL problems
- Innovative new technologies offering improvements of one order of magnitude
- Many interesting ideas and open questions; possible R&D topics:
  - Distributed ROOT analysis cluster: RDataFrame + Apache Crail + an RDataSource (if a custom format were needed)? (see the sketch below)
  - Distributed analysis on the C++ side (Snap ML internal kernels)
  - Snap ML as a complement to TMVA?
  - Smarter movement between tape and disk based on predicted data popularity
  - HEP data analysis + near-memory computing (FPGA access processor)
- CERN contact persons were identified to discuss some of these topics in more depth
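For context, a minimal sketch of the kind of RDataFrame analysis these R&D topics target. Tree, file and branch names are hypothetical, and a Crail-backed RDataSource does not exist today; it would have to be implemented as a custom data source.

```python
# Standard single-node RDataFrame analysis; a custom RDataSource could later
# feed the same code from a Crail-backed store. Names below are hypothetical.
import ROOT

df = ROOT.RDataFrame("Events", "data.root")

h = (
    df.Filter("nMuon >= 2")                    # hypothetical branch name
      .Define("pt_lead", "Muon_pt[0]")         # hypothetical branch name
      .Histo1D(("pt_lead", "Leading muon pT", 100, 0.0, 200.0), "pt_lead")
)
print("entries in histogram:", h.GetEntries())
```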

Backup slides

Most Widely Used Models

- Survey question: what data science methods are used at work?
- Logistic regression is the most commonly reported data science method used at work for all industries except military and security, where neural networks are used slightly more frequently

Source: https://www.kaggle.com/surveys/2017

Automated ML

- Machine learning involves several manual jobs: feature selection, model selection, hyperparameter optimization, model ensembling, ...
- The challenge is to automate all of these processes (a generic illustration follows below)
- IBM is working on a project where users can specify budget, time and architecture, among other parameters, as input
- No slides and limited relevant information available online
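An illustrative sketch only, not the IBM project described above: automating hyperparameter optimization under a fixed search budget with scikit-learn, where the "budget" is simply the number of sampled configurations.

```python
# Budget-limited hyperparameter search as a stand-in for one AutoML building block.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20,          # the search budget
    cv=3,
    random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_, "best CV score:", round(search.best_score_, 3))
```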