Zhang HPC Application R&D Manager, Inspur

Clifford Chapman
6 years ago

1 Zhang HPC Application R&D Manager, Inspur

2 Inspur-Nvidia GPU Joint Lab Introduction Caffe-MPI: Parallel CAFFE framework based on GPU cluster

3 Inspur-Nvidia GPU Joint Lab Introduction
Inspur-Nvidia GPU Joint Lab App Research Directions: Traditional HPC, Deep Learning
Field | Application | Clients | Speed-up ratio | Platform
Traditional HPC
Life Science | BLASTN | Beijing Institute of Genomics | 35X (kernel) | 1 GPU / 1 CPU core
 | ET | Institute of Biophysics, CAS | 48X | 1 GPU / 1 CPU core
CFD | LBM_LES | | 100X | 1 GPU / 1 CPU core
 | RNA | | 8X | 24 GPU nodes / 24 CPU nodes
Oil&gas | PSTM | BGP | 5X | 6 GPU nodes / 6 CPU nodes
 | Scandip | | 9X | 4 GPU + 2 CPU / 2 CPU
Deep Learning
 | Caffe | Qihoo | 12.5X | 16 GPU / 1 GPU
CSP | DNN | IFLYTEK | 13X | 16 GPU / 1 GPU
 | K-means | Qihoo | 35X | 1 GPU / 1 CPU core
 | Neural Network | Qihoo | 270X | 4 GPU / 1 CPU core

4 Application: DNN. Client: IFLYTEK. Performance: 16 GPU / 1 GPU = 13X. Deep learning for speech recognition: mobile phone, car, intelligent customer service, business travel query.

5 Application: neural network. Client: Qihoo. Performance: 4 GPU / 1 CPU core = 270X. Figure axis: Time (s).

6 ForwardBackward computing: 80% (data parallel). Weight computing: 16% (some parts can be parallelized). Net update: 4% (some parts can be parallelized). Caffe has many users and is very popular in China. Training on big data takes a long time with Caffe on a single GPU node. Caffe's ForwardBackward computing, weight computing, and net update can all be parallelized on a GPU cluster.
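As a rough sanity check on the 80% / 16% / 4% breakdown, Amdahl's law gives an upper bound on speed-up when only part of the runtime is parallelized. This is a minimal sketch and not something the slides present: the function name `amdahl_speedup` is hypothetical, and it assumes the non-parallelized fraction stays fully serial and that communication is free.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speed-up when only `parallel_fraction` of the runtime scales with `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumption: only ForwardBackward (80% of the time) is spread over 16 GPUs.
only_forward_backward = amdahl_speedup(0.80, 16)

# Assumption: weight computing (16%) is parallelized as well, so 96% scales.
with_weight_computing = amdahl_speedup(0.96, 16)

print(f"ForwardBackward only: at most {only_forward_backward:.1f}X")
print(f"ForwardBackward + weight computing: at most {with_weight_computing:.1f}X")
```

Under these assumptions the bound is about 4X when only ForwardBackward is parallelized, which suggests why the slide stresses that weight computing and net update also need to be parallelized to scale on 16 GPUs.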

7 What is Caffe-MPI? Developed by Inspur. Open source. Based on the Berkeley Vision and Learning Center (BVLC) single-GPU Caffe version. A GPU cluster version of Caffe. Supports training on 16+ GPUs.

8 Caffe-MPI is based on HPC technology. Hardware architecture: IB + GPU cluster + Lustre. Software architecture: MPI + Pthread + CUDA. Data parallel on the GPU cluster. GPU cluster configuration: GPU master node: multi GPUs; GPU slave node: multi GPUs; storage: Lustre; network: 56 Gb/s IB; software: Linux / CUDA 7.5 / MVAPICH2.

9 MPI Master-Slave model. Master process: multiple Pthread threads + CUDA threads. Slave process: CUDA threads. Reference: Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, ... More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.

10 Master process (process 0): three Pthread groups. (1) Read data and send data in parallel. (2) Weight computing and parameter update. (3) Parameter communication.

11 Slave process. CPU: receives training data from the master process; sends weight data (GPU-to-GPU); receives new net data (GPU-to-GPU). GPU: ForwardBackward computing. Slave node: the number of slave processes = the number of GPUs.
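The master-slave data flow on slides 9 to 11 can be sketched in a few lines. This is a simplified, single-process illustration and not Caffe-MPI code: the names `slave_gradient`, `master_step`, and `train` are hypothetical, the model is a toy one-parameter fit y ≈ w * x, and the loop is fully synchronous, whereas the slides describe asynchronous overlap and cite a stale synchronous parameter server.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair for the toy model y ≈ w * x


def slave_gradient(w: float, shard: Sequence[Sample]) -> float:
    """Stand-in for ForwardBackward on one GPU: mean gradient of 0.5 * (w*x - y)^2 over the shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def master_step(w: float, batch: Sequence[Sample], num_slaves: int, lr: float) -> float:
    """One synchronous iteration: split the data, gather gradients, update, return the new weight."""
    if len(batch) % num_slaves != 0:
        raise ValueError("batch size must be divisible by the number of slaves")
    shard_size = len(batch) // num_slaves
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_slaves)]
    grads = [slave_gradient(w, shard) for shard in shards]  # one slave process per GPU in the real system
    mean_grad = sum(grads) / num_slaves                      # master-side weight computing
    return w - lr * mean_grad                                # net update, then sent back to the slaves


def train(batch: Sequence[Sample], num_slaves: int, lr: float = 0.05, steps: int = 200) -> float:
    """Run `steps` synchronous iterations starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w = master_step(w, batch, num_slaves, lr)
    return w


toy_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated with w = 2
print(train(toy_batch, num_slaves=1), train(toy_batch, num_slaves=4))
```

With equal-sized shards, averaging the per-slave gradients gives the same update as one worker processing the whole batch, which is the property that makes this kind of data parallelism a drop-in replacement for single-GPU training.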

12 GPU parallel computing. Computing and communication run asynchronously in parallel. Communication optimization: GPU RDMA for weight data and net data between GPUs. Total Time = max(T(Read Data + Send Data), T(ForwardBackward Computing + Weight Computing and Net Update + Net Send))
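The total-time formula on this slide says that, with I/O and computation overlapped, an iteration costs as much as the slower of the two paths. A minimal sketch follows; the function and parameter names are hypothetical and the timings are made-up numbers chosen only to show both cases.

```python
def overlapped_iteration_time(read_data: float, send_data: float,
                              forward_backward: float, weight_and_update: float,
                              net_send: float) -> float:
    """Iteration time when the data path and the compute path run concurrently."""
    data_path = read_data + send_data
    compute_path = forward_backward + weight_and_update + net_send
    return max(data_path, compute_path)


def serial_iteration_time(read_data: float, send_data: float,
                          forward_backward: float, weight_and_update: float,
                          net_send: float) -> float:
    """Iteration time if every stage ran one after another, for comparison."""
    return read_data + send_data + forward_backward + weight_and_update + net_send


# Hypothetical timings in seconds: compute-bound case, then I/O-bound case.
compute_bound = overlapped_iteration_time(2.0, 1.0, 5.0, 1.0, 1.0)
io_bound = overlapped_iteration_time(6.0, 3.0, 5.0, 1.0, 1.0)
print(compute_bound, io_bound)
```

In the compute-bound case the data path is fully hidden behind computation; in the I/O-bound case adding GPUs would not help until reading and sending data gets faster.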

13 Speed-up ratio: 16 GPU / 1 GPU = 10.45X. Scalability efficiency: 65%

14 Speed-up ratio: 16 GPU / 1 GPU = 10.74X. Scalability efficiency: 67%

15 Performance speed-up from cuDNN = 21%. Speed-up ratio: 16 GPU / 1 GPU = 12.66X. Scalability efficiency: 79%. Figure: GoogLeNet (iterations= , batchsize=64); y-axis: Training Time (s); x-axis: The Number of GPU; series: Caffe-MPI, Caffe-MPI+cuDNN.

16 Training data is read from Lustre storage in parallel and sent to different GPUs. The GPU cluster is divided into many groups. Every group has a master node. Every master node reads and sends data in parallel using multiple processes + multiple threads. This can support large-scale GPU computing for a big training platform.
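The grouping idea can be illustrated by partitioning GPU ranks into fixed-size groups, each with one master that feeds its members. This is only a sketch under assumptions: the slides do not give the group size or how masters are chosen, so `assign_groups`, the `group_size` parameter, and the "first rank is the master" rule are all hypothetical.

```python
from typing import Dict, List


def assign_groups(num_gpus: int, group_size: int) -> Dict[int, List[int]]:
    """Map each group's master rank to the ranks it reads and sends data for (hypothetical layout)."""
    if num_gpus < 1 or group_size < 1:
        raise ValueError("num_gpus and group_size must be positive")
    groups: Dict[int, List[int]] = {}
    for start in range(0, num_gpus, group_size):
        members = list(range(start, min(start + group_size, num_gpus)))
        groups[members[0]] = members  # assumption: the first rank in a group acts as its master
    return groups


for master, members in assign_groups(num_gpus=16, group_size=4).items():
    print(f"master {master} feeds ranks {members}")
```

Because each master only serves its own group, the data-reading load per master stays constant as more groups are added, which is the scaling argument the slide makes.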

17 Speed-up ratio: 16 GPU / 1 GPU = 13X. Scalability efficiency: 81%
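The scalability efficiency figures on slides 13, 14, 15, and 17 are consistent with dividing the measured speed-up by the number of GPUs. The short check below assumes that definition (the slides do not state it explicitly); `scaling_efficiency` is a hypothetical helper name.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved: measured speed-up divided by GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return speedup / num_gpus


# Speed-ups on 16 GPUs as reported on each slide (slide number -> speed-up).
reported_speedups = {13: 10.45, 14: 10.74, 15: 12.66, 17: 13.0}

for slide, speedup in reported_speedups.items():
    efficiency = scaling_efficiency(speedup, 16)
    print(f"slide {slide}: {speedup}X on 16 GPUs -> {efficiency:.0%}")
```

Rounded to whole percentages, this reproduces the 65%, 67%, 79%, and 81% quoted on the slides.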

18 Next work: support cuDNN 4.0; MPI framework tuning; symmetric model. Caffe-MPI open-source version roadmap: Q2: Computing-Intensive Model, support 32+ GPU parallel. Q3: IO-Intensive Model, support 16+ GPU parallel. Q4: support half precision for Pascal GPU.

19 Conclusions. Caffe-MPI is based on an HPC technology architecture. Performance: 16 GPU / 1 GPU = 13X. Caffe-MPI can support 16+ GPUs to train on big data. Inspur will continue to open source new versions: a 32 GPU parallel version for the Computing-Intensive Model; a 16+ GPU parallel version for the IO-Intensive Model; support for half precision on Pascal GPU.
