S8765 Performance Optimization for Deep-Learning on the Latest POWER Systems



1 S8765 Performance Optimization for Deep-Learning on the Latest POWER Systems
Khoa Huynh, Senior Technical Staff Member (STSM), IBM
Jonathan Samn, Software Engineer, IBM

2 Evolving from Compute Systems to Cognitive Systems
- Not just about hardware design: it's about co-optimized software + hardware
- Dev ecosystem: industry alignment, partnerships, open frameworks, IBM software
- Processor generations: P8, P9, P10
- Open accelerator interfaces and accelerator roadmaps which just work for ML, DL, and AI
NDA until product announce


3 AI Infrastructure Stack
- Applications: segment specific (Finance, Retail, Healthcare)
- Cognitive APIs (e.g., Watson) and in-house APIs: Speech, Vision, NLP, Sentiment
- Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, SparkML
- Distributed Computing: Spark, MPI
- PowerAI
- Transform & Prep Data (ETL)
- Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs
- Accelerated Infrastructure: accelerated servers, storage
Think 2018 / DOC ID / Month XX, 2018 / 2018 IBM Corporation

4 PowerAI: Integrated & Supported AI Platform
- Higher productivity for data scientists
- Enable non-data scientists to use AI
- Developer ease-of-use tools
- Open source frameworks: supported distribution
- Faster training times via HW & SW performance optimizations

5 In IBM Cloud
- Available early April 2018
- Delivered via IBM Cloud Catalog
- Billed through IBM Cloud
- Supported by IBM and Nimbix
- PowerAI Version 5.0
- Native Distributed Deep Learning
- On-demand cloud provisioning
- Delivered on CentOS Linux


6 PowerAI in IBM Cloud: Ease of Use & Performance
PowerAI: developer ease-of-use tools; open source frameworks (supported distribution); Distributed Deep Learning; Large Model Support; faster training times via HW & SW performance optimizations
PowerAI in IBM Cloud:
- Leadership price performance
- Highly scalable
- Containerized and extensible
- Turn-key solution
- Powered by trusted partner Nimbix
- PowerAI Version 5.0


7 IBM's Latest Processor: POWER9
- POWER7 (45 nm), 1H10: Enterprise; 8 cores; SMT4; eDRAM L3 cache
- POWER7+ (32 nm), 2H12: Enterprise; 2.5x larger L3 cache; on-die acceleration; zero-power core idle state
- POWER8 family (22 nm), 1H14 to 2H16: Enterprise & Big Data Optimized; up to 12 cores; SMT8; CAPI acceleration; high bandwidth GPU attach
- POWER9 family (14 nm), 2H17 to 2H18+: Built for the Cognitive Era; only processor with NVLink, PCIe Gen 4 advanced IO interfaces and coherence; premier platform for accelerated computing; processor family with scale-up and scale-out optimized silicon

8 Systems Designed for AI
POWER9 with high-speed system memory, OpenCAPI, NVLink 2.0, and PCIe Gen4
- Fast & large memory system: 4-5X memory bandwidth, 2x more memory vs. Intel
- High performance cores: 4X threads per core vs. Intel
- Fastest accelerator interconnects: OpenCAPI / NVLink x vs. Intel

9 5x Faster Data Communication with Unique CPU-GPU NVLink High-Speed Connection
- Store large models in system memory
- Operate on one layer at a time
- Fast transfer via NVLink
Diagram: IBM AC922 Power System deep learning server (4-GPU config). Two POWER9 CPUs, each with 1 TB of memory at 170 GB/s, each attached to two V100 GPUs over NVLink at 150 GB/s.
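To make the "operate on one layer at a time" idea concrete, the sketch below estimates how long it takes to move one layer's data between system memory and the GPU at a given link bandwidth. Only the 150 GB/s NVLink figure comes from the slide. The 32 GB/s value for a PCIe Gen3 x16 link and the 2 GB layer size are assumptions chosen for illustration, and real transfers will not reach peak link rates.

```python
def transfer_seconds(size_gb: float, link_gb_per_s: float) -> float:
    """Idealized time to move size_gb gigabytes over a link at its peak rate."""
    if link_gb_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    return size_gb / link_gb_per_s


NVLINK_GB_S = 150.0     # CPU-GPU NVLink bandwidth quoted on the slide
PCIE_GEN3_GB_S = 32.0   # assumption: nominal PCIe Gen3 x16, both directions
LAYER_GB = 2.0          # hypothetical size of one layer's weights and activations

nvlink_time = transfer_seconds(LAYER_GB, NVLINK_GB_S)
pcie_time = transfer_seconds(LAYER_GB, PCIE_GEN3_GB_S)
ratio = pcie_time / nvlink_time

print(f"NVLink: {nvlink_time * 1000:.1f} ms per layer")
print(f"PCIe Gen3 x16 (assumed): {pcie_time * 1000:.1f} ms per layer")
print(f"ratio: {ratio:.2f}x")
```

Under these assumed figures the ratio comes out a little under 5x, which is in line with the headline claim, but the exact value depends on which PCIe figure is taken as the baseline.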


10 Accelerator Comparison
POWER8 with NVLink 1.0 (Pascal technology):
- P8 CPU with DDR4, connected to GPU graphics memory over NVLink at 80 GB/s
- 2 bricks per NVLink
- Duplex bandwidth
POWER9 with NVLink 2.0 (Volta technology):
- P9 CPU with DDR4, connected to GPU graphics memory over NVLink at 150 GB/s
- 3 bricks per NVLink
- Duplex bandwidth
POWER9 with NVLink 2.0 delivers 87.5% increased bandwidth over POWER8
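The 87.5% figure follows directly from the two NVLink bandwidths on the slide (80 GB/s and 150 GB/s). A minimal check of that arithmetic, with a helper name chosen here for illustration:

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, expressed in percent."""
    return (new - old) / old * 100.0


POWER8_NVLINK_GB_S = 80.0    # NVLink 1.0 figure from the slide
POWER9_NVLINK_GB_S = 150.0   # NVLink 2.0 figure from the slide

increase = percent_increase(POWER8_NVLINK_GB_S, POWER9_NVLINK_GB_S)
print(f"{increase}% more bandwidth")  # prints "87.5% more bandwidth"
```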


11 IBM POWER System Roadmap
- 2015, POWER S822LC: Power Systems introduction of design and cost optimized Linux-only servers; 2X x86 memory bandwidth; dedicated technical compute versions for acceleration
- 2016, POWER S822LC for HPC: introduced the first processor with NVLink from CPU; 2x memory bandwidth; air and water cooled versions
- 2017, AC922: remains the only processor with NVLink (now 2nd Gen) and introduces PCIe Gen 4; coherence for near direct access to system memory (2TB); co-optimization with deep learning frameworks
- Future: higher NVLink throughput; significant advancements on memory bandwidths; new memory architectures; more dense accelerated compute options
Unmatched track record of innovation delivery. A portfolio to invest in.

12 IBM Power System AC922
Realize unprecedented performance and application gains with POWER9 and NVLink 2.0
- 2 POWER9 CPUs and up to 4 Volta NVLink 2.0 GPUs in a versatile 2U Linux server
- PCIe Gen4 bus has double the I/O bandwidth of PCIe Gen3
- CPU (Turbo) / GPU (Boost) enabled for improved data center efficiency and performance to be maintained at high levels
High-level system overview:
- 2-socket, 2U packaging
- 40 P9 processor cores
- 4 NVIDIA Volta NVLink 2.0 GPUs
- 1 TB memory (16 x 64 GB DIMMs)
- 4 PCIe Gen4 slots
- 2x SFF (HDD/SSD), SATA, up to 7.7 TB storage
- Supports 1.6 TB and 3.2 TB NVMe adapters
- Redundant hot-swap power supplies and fans
- Default 3-year 9x5 warranty, 100% CRU

13 AC922 Configurations
4 GPUs - Air (4Q 17) / Water Cooled (2Q 18):
- Up to 4 GPUs, air/water cooled options
- 150 GB/s of bandwidth from CPU-GPU
- Coherent access to system memory
- PCIe Gen 4 and CAPI 2.0 to InfiniBand
- Water cooled options available in 2Q 18
6 GPUs - Water Cooled (2Q 18):
- Up to 6 GPUs, water cooled only
- 100 GB/s of bandwidth from CPU-GPU


14 Large Model Support (LMS)
- Traditional model support: limited memory on the GPU forces a trade-off in model size / data resolution (diagram: CPU with DDR4, connected to GPU graphics memory over PCIe)
- Large model support: use system memory and GPU memory to support more complex and higher resolution data (diagram: POWER CPU with DDR4, connected to GPU graphics memory over NVLink)
- Leveraging NVLink and coherence enables larger and more complex models
- Improves model accuracy with more images and higher resolution images

15 Large AI Models Train ~4 Times Faster
POWER9 servers with NVLink to GPUs vs. x86 servers with PCIe to GPUs
Chart: Caffe with LMS (Large Model Support), runtime of 1000 iterations, time (secs). Xeon x v4 w/ 4x V100 GPUs vs. Power AC922 w/ 4x V100 GPUs; the AC922 is 3.8x faster.
GoogleNet model on Enlarged ImageNet Dataset (2240x2240)

16 Distributed Deep Learning (DDL)
- Deep learning training takes days to weeks
- Limited scaling to multiple x86 servers
- PowerAI with DDL enables scaling to 100s of servers
Chart 1: 16 days down to 7 hours, 58x faster (1 system vs. 64 systems; ResNet-101, ImageNet-22K)
Chart 2: near ideal scaling to 256 GPUs; speedup vs. number of GPUs, ideal scaling vs. DDL actual scaling; 95% scaling with 256 GPUs (ResNet-50, ImageNet-1K)
Caffe with PowerAI DDL, running on Minsky (S822LC) Power System
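The scaling figures on this slide can be related to each other with two small formulas: speedup is the baseline runtime divided by the scaled runtime, and scaling efficiency is the speedup divided by the ideal (linear) speedup. The sketch below applies them to the slide's numbers; the function names are illustrative. Note that the rounded durations (16 days, 7 hours) give roughly 55x rather than the quoted 58x, presumably because the slide rounds the measured run times; that reading is an assumption.

```python
def speedup_from_runtimes(baseline_hours: float, scaled_hours: float) -> float:
    """Speedup of the scaled run relative to the baseline run."""
    return baseline_hours / scaled_hours


def scaling_efficiency(speedup: float, scale_factor: float) -> float:
    """Fraction of ideal linear speedup achieved when scaling by scale_factor."""
    return speedup / scale_factor


def speedup_at_efficiency(efficiency: float, scale_factor: float) -> float:
    """Speedup implied by a given scaling efficiency."""
    return efficiency * scale_factor


# ResNet-50 / ImageNet-1K chart: 95% scaling with 256 GPUs.
implied_speedup = speedup_at_efficiency(0.95, 256)
print(f"95% efficiency on 256 GPUs implies ~{implied_speedup:.0f}x over one GPU")

# ResNet-101 / ImageNet-22K chart: 1 system vs. 64 systems.
rounded_speedup = speedup_from_runtimes(16 * 24, 7)
print(f"16 days -> 7 hours from rounded durations: ~{rounded_speedup:.1f}x")
print(f"efficiency of the quoted 58x over 64 systems: {scaling_efficiency(58, 64):.1%}")
```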

17 InfiniBand EDR 100 Gb/s: PCIe Gen 4 versus PCIe Gen 3
Chart: The PCIe Gen 4 Difference (Gen3 to Gen4 EDR InfiniBand bandwidth test). Average Gbit/s vs. message size (bytes); series: Gen4 dual port bidirectional (Gb/s) and Gen3 dual port bidirectional (Gb/s).
~2x faster IB network connectivity enabled by PCIe Gen 4
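A back-of-the-envelope calculation suggests why a dual-port EDR adapter roughly doubles its throughput on PCIe Gen 4. The sketch below uses the published per-lane signaling rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b encoding. The x16 link width is an assumption, since the slide does not state it, and packet and protocol overheads are ignored.

```python
ENCODING_EFFICIENCY = 128 / 130                   # 128b/130b line encoding
GIGATRANSFERS_PER_LANE = {"gen3": 8.0, "gen4": 16.0}
EDR_PORT_GBIT = 100.0                             # one EDR port, one direction


def pcie_gbit_per_direction(generation: str, lanes: int = 16) -> float:
    """Raw PCIe payload rate in Gbit/s for one direction (lanes is an assumption)."""
    return GIGATRANSFERS_PER_LANE[generation] * lanes * ENCODING_EFFICIENCY


# Two EDR ports sending at line rate need 200 Gbit/s in each direction.
demand_per_direction = 2 * EDR_PORT_GBIT

for generation in ("gen3", "gen4"):
    link = pcie_gbit_per_direction(generation)
    deliverable = min(link, demand_per_direction)
    print(f"{generation}: link {link:.0f} Gbit/s per direction, "
          f"dual-port EDR limited to {deliverable:.0f} Gbit/s per direction")
```

With these assumptions the Gen3 x16 link, not the network, is the bottleneck for a dual-port adapter, while the Gen4 link has enough headroom for both ports.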


18 Deep-Learning (DL) Neural Network Models
- Most ML approaches (linear regression, decision trees, association rules, etc.) do not need GPUs
- Many ML frameworks (Python scikit-learn, Spark MLlib, etc.) do not support GPU options
- Lots of multi-dimensional matrix multiplication operations
Reference: An Analysis of Deep Neural Network Models for Practical Applications, A. Canziani et al., April 2017.

19 Deep-Learning Model Training - TensorFlow
Single node (no cluster); aggregate number of images processed per second (higher is better)
- With the same number of V100 GPUs, Power9 servers deliver better performance than Amazon P3 instances, while SoftLayer bare-metal servers with V100 GPUs deliver similar performance to Amazon P3 instances. This could be attributed to Power9 CPU optimizations and CPU-GPU NVLink support.
- With 4 x V100 GPUs, the Power9 server has higher performance than the Amazon P3 instance with 8 x V100 GPUs in single precision mode.
- The AWS P3 instance does not scale well beyond 4 x V100 GPUs in single precision mode, although it does scale well leveraging Tensor Cores in half precision (FP16) mode.
- TensorFlow 1.4 and 1.5 do not leverage the Tensor Cores in V100 GPUs very well, so the latest TensorFlow 1.6-dev build was used for optimal half precision (FP16) performance.
Charts: InceptionV3 and ResNet50. Configurations shown:
- SoftLayer (SL) bare-metal server: 1 and 2 x P100 GPUs
- SL VSI (16 vCPUs): 1 and 2 x P100 PCIe GPUs
- SL bare-metal server: 1 and 2 x V100 GPUs, single precision and FP16
- Power8 Minsky: 1, 2, and 4 x P100 GPUs
- Power9: 1, 2, and 4 x V100 GPUs, single precision and FP16
- AWS P3 instance: 1, 2, 4, and 8 x V100 GPUs, single precision and FP16
Notes:
- Input dataset: ImageNet (crop size = 224x224); batch size = 64 per GPU (for both InceptionV3 and ResNet50 neural net models).
- With V100 GPUs, independent distribution mode for model variables and gradients was used for optimal performance.
- Mixed precision (FP16/32) leverages Tensor Cores in V100 GPUs.
- SoftLayer bare-metal server has 48 logical CPU cores, while Power9 server and AWS P3 instance have 64 logical CPU cores.
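The metric on these charts, aggregate images processed per second, follows from the per-device batch size, the number of devices, and the time per training step. A minimal sketch of that relationship is below; the batch size of 64 per device is from the notes, while the 0.25 s step time is a hypothetical value, not a measurement from the slides.

```python
def aggregate_images_per_second(batch_per_device: int,
                                num_devices: int,
                                step_seconds: float) -> float:
    """Images processed per second across all devices in one synchronous step."""
    if step_seconds <= 0:
        raise ValueError("step time must be positive")
    return batch_per_device * num_devices / step_seconds


BATCH_PER_DEVICE = 64    # from the slide notes
STEP_SECONDS = 0.25      # hypothetical step time, for illustration only

for num_devices in (1, 2, 4):
    rate = aggregate_images_per_second(BATCH_PER_DEVICE, num_devices, STEP_SECONDS)
    print(f"{num_devices} device(s): {rate:.0f} images/s")
```

If the step time stays constant as devices are added, throughput scales linearly; any growth in step time from communication or input-pipeline stalls shows up as sub-linear scaling on the chart.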

20 Deep-Learning Model Training - InceptionV3 on TensorFlow
SoftLayer Bare Metal vs. Power8 Minsky vs. Power9 vs. Amazon P3 Instance
Single precision unless noted otherwise; single node (no cluster); aggregate number of images processed per second (higher is better)
- For InceptionV3 on TensorFlow, half precision (FP16) on the V100 GPUs uses Tensor Cores to achieve ~1.8X better performance than single precision. The larger performance gain (up to 4.4X) of FP16 on AWS P3 is due to the relatively low performance of the Deep Learning AMI with TensorFlow v1.4 used for single precision, compared to TensorFlow 1.6-dev used for half precision mode.
- Given the same number of V100 GPUs, SoftLayer servers generally deliver similar performance to Amazon P3 instances.
- Power9 delivers better performance than the Amazon P3 instance at the same number of V100 GPUs across the board, up to 1.58X in single precision mode.
- Only the Amazon P3 instance supports 8 x V100 GPUs at the present time.
Chart: images per second vs. number of GPUs. Series: SL BM w/ P100 GPUs; SL VSI w/ P100 GPUs; SL BM w/ V100 GPUs; SL BM w/ V100 GPUs (FP16); Power8 Minsky w/ P100 GPUs; Power9 w/ V100 GPUs; Power9 w/ V100 GPUs (FP16); AWS P3 w/ V100 GPUs; AWS P3 w/ V100 GPUs (FP16).
Notes:
- Input dataset: ImageNet (crop size = 224x224); batch size = 64 per GPU (for both InceptionV3 and ResNet50 neural net models).
- With V100 GPUs, independent distribution mode for model variables and gradients was used for optimal performance.
- Mixed precision (FP16/32) leverages Tensor Cores in V100 GPUs.
- SoftLayer bare-metal server has 48 logical CPU cores, while Power9 server and AWS P3 instance have 64 logical CPU cores.
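The notes describe mixed precision as FP16 inputs with a wider accumulator. The sketch below illustrates why the accumulator width matters, by emulating a dot product whose running sum is rounded either to FP16 or to FP32 after every addition. It uses only the standard library to emulate rounding and is not a model of how Tensor Cores are implemented; the vector length and values are arbitrary choices for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_inputs(a, b, round_accumulator):
    """Dot product with FP16 inputs; the running sum is rounded after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)   # product of two FP16 values
        acc = round_accumulator(acc + product)
    return acc


n = 4096
a = [0.1] * n
b = [1.0] * n

reference = n * to_fp16(0.1)                # sum of the FP16-rounded inputs
fp16_accumulated = dot_fp16_inputs(a, b, to_fp16)
fp32_accumulated = dot_fp16_inputs(a, b, to_fp32)

print(f"reference:        {reference}")
print(f"FP16 accumulator: {fp16_accumulated}")
print(f"FP32 accumulator: {fp32_accumulated}")
```

In this example the FP16 accumulator stops growing once the running sum is large enough that each new product falls below half of its rounding step, while the FP32 accumulator stays close to the reference.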

21 Deep-Learning Model Training - Power9 Server w/ V100 GPUs
Impact of Model Variable Distribution & Gradient Aggregation Modes
Single precision, single node (no cluster); aggregate number of images processed per second (higher is better)
- For TensorFlow, independent distribution mode (replicated_distributed) for model variables and gradient aggregation delivers much better performance for 4 GPUs (and higher) than the default parameter_server mode.
Chart: images per second vs. number of GPUs. Series: parameter_server, replicated, and independent modes, each for InceptionV3 and ResNet.
Notes:
- Input dataset: ImageNet (crop size = 224x224).
- For Caffe, highest batch sizes were used to fully exploit GPU memory.
- For TensorFlow, batch size = 64 per GPU.
- Mixed precision (16-bit input matrices, 32-bit accumulator) leverages Tensor Cores in V100 GPUs.

22 Deep-Learning Model Training - InceptionV3 on TensorFlow
Impact of Number of POWER9 CPU Threads on TensorFlow (number of vCPUs = number of logical CPU cores seen by the OS)
Two charts, each showing number of images processed per second vs. number of POWER9 CPU threads (no cluster, batch size = 64/GPU):
- Single precision
- Half precision (FP16)
Series in each chart: 1, 2, 3, and 4 GPUs, in independent and parameter_server modes.
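The x-axis of these charts is the number of logical CPUs the operating system exposes. As a small aid for reproducing that count on a given machine, the sketch below reads it with the Python standard library. How the thread count was varied for the charts is not stated on the slide, so this only shows how to inspect it.

```python
import os

# Logical CPUs visible to the operating system (cores x SMT threads per core).
logical_cpus = os.cpu_count()

# On Linux, the set of CPUs this process may run on can be smaller,
# for example when the job is pinned to a subset of hardware threads.
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = logical_cpus  # fallback where affinity is not exposed

print(f"logical CPUs seen by the OS: {logical_cpus}")
print(f"CPUs usable by this process: {usable_cpus}")
```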

23 THANK YOU!!!
