
1 Inside AI

2 HPC Advisory Council Lugano 2017. Scalable Systems for Distributed Deep Learning: Benchmarking, Performance Optimization and Architectures. Gaurav Kaul, Systems Architect, Intel Datacentre Group.

3 Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

4 Legal Notices & Disclaimers. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit . Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, the Intel logo, Pentium, Celeron, Atom, Core, Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Intel Corporation.

5 Agenda: Why optimization matters on modern architectures; Mapping deep learning to HPC hardware; Compute kernels for deep learning; Case study - optimizing Caffe and TensorFlow (key challenges, optimization techniques, performance data); Scaling on multinode (WIP).

6 Attainable GFLOP/s: Combined Amdahl's Law for Vector Multicores.

$$\text{Speedup} = \frac{1}{\text{Serial}_{\text{frac}} + \dfrac{1 - \text{Serial}_{\text{frac}}}{\text{NumCores}}} \times \frac{1}{\text{Scalar}_{\text{frac}} + \dfrac{1 - \text{Scalar}_{\text{frac}}}{\text{VectorLength}}}$$

Goal: reduce the serial fraction and the scalar fraction of the code. Ideal speedup: NumCores * VectorLength (requires zero scalar and zero serial work). Compute-bound performance: most kernels of ML codes are compute bound, i.e. raw FLOPS matter. Roofline model: GFLOP/s = min(Peak GFLOP/s, Stream BW * flops/byte). The roofline chart plots attainable GFLOP/s against compute intensity (flops/byte), with peak compute GFLOP/s with and without SIMD as the horizontal ceilings.
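A minimal sketch (not from the presentation) of how these two bounds can be evaluated; all numeric inputs below (serial/scalar fractions, peak GFLOP/s, STREAM bandwidth, arithmetic intensity) are made-up illustrative values.

```python
# Sketch: combined Amdahl's law for vector multicores and the roofline bound.
# The numbers used at the bottom are illustrative placeholders, not measurements.

def amdahl_vector_multicore(serial_frac, scalar_frac, num_cores, vector_length):
    """Speedup = core-scaling term * vector-scaling term."""
    core_term = 1.0 / (serial_frac + (1.0 - serial_frac) / num_cores)
    simd_term = 1.0 / (scalar_frac + (1.0 - scalar_frac) / vector_length)
    return core_term * simd_term

def roofline_gflops(peak_gflops, stream_bw_gbs, flops_per_byte):
    """Attainable GFLOP/s = min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, stream_bw_gbs * flops_per_byte)

if __name__ == "__main__":
    print(amdahl_vector_multicore(serial_frac=0.05, scalar_frac=0.10,
                                  num_cores=68, vector_length=16))
    print(roofline_gflops(peak_gflops=3000.0, stream_bw_gbs=400.0,
                          flops_per_byte=4.0))
```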

7 Deep Learning: Convolutional Neural Network. Convolution parameters: number of outputs/feature maps: 4; filter size: 3 x 3; stride: 2; pad_size (for corner cases): 1. The figure shows input feature maps convolved with a 3 x 3 filter at stride 2 with pad_size 1.
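The output feature-map size follows directly from these parameters; a small sketch of that arithmetic, assuming an input size of 224 (the input size is not given on the slide).

```python
# Output spatial size of a convolution layer from its parameters.
def conv_output_size(in_size, filter_size, stride, pad):
    return (in_size + 2 * pad - filter_size) // stride + 1

# Example with the slide's parameters; the 224 input size is an assumption.
print(conv_output_size(in_size=224, filter_size=3, stride=2, pad=1))  # -> 112
```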

8-14 (figure-only slides; no text transcribed)

15 Case Study - Optimizing Deep Learning Frameworks

16 TensorFlow: Google's 2nd-generation open-source machine learning framework, widely used across Google in many key apps (Search, Gmail, Photos, Translate, etc.). It is a general mathematical computing framework that can be used for deep neural networks, other machine learning frameworks, and HPC applications. The core system provides a set of key computational kernels and is extendable with user ops. The core is in C++; the Python front-end wrapper specifies and drives the computation. Multi-node support uses Google's gRPC protocol.

17 TensorFlow: Computation is a dataflow graph with tensors (Google's open-source machine learning framework). The example graph (from Jeff Dean's presentation): Input and Weights feed a MatMul node, Biases feed an Add node, followed by ReLU, and finally a cross-entropy (Xent) node that takes the Labels.
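A minimal TensorFlow 1.x-style sketch of the same graph; the tensor shapes and variable initializers are illustrative assumptions, not taken from the presentation.

```python
# Minimal dataflow graph: Input -> MatMul(Weights) -> Add(Biases) -> ReLU -> Xent(Labels)
import tensorflow as tf  # TensorFlow 1.x-style graph API

x = tf.placeholder(tf.float32, [None, 784], name="Input")        # shapes are assumptions
labels = tf.placeholder(tf.float32, [None, 10], name="Labels")
weights = tf.Variable(tf.random_normal([784, 10]), name="Weights")
biases = tf.Variable(tf.zeros([10]), name="Biases")

logits = tf.nn.relu(tf.add(tf.matmul(x, weights), biases))
xent = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```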

18 Performance Optimization on Modern Platforms: Hierarchical Parallelism. Coarse-grained / multi-node: domain decomposition. Fine-grained parallelism / within a node, per sub-domain: (1) multi-level domain decomposition (e.g. across layers); (2) data decomposition (layer parallelism). Scaling: improve load balancing; reduce synchronization events and all-to-all communication. Utilize all the cores (OpenMP, MPI, TBB): reduce synchronization events and serial code; improve load balancing. Vectorize/SIMD: unit-strided access per SIMD lane, high vector efficiency, data alignment. Efficient memory/cache use: blocking, data reuse, prefetching, memory allocation.

19 Example Challenge 1: Data Layout Has a Big Impact on Performance. Data layout impacts performance: use sequential access to avoid gather/scatter; keep the right iterations in the innermost loop to ensure high vector utilization; maximize data reuse, e.g. weights in a convolution layer. Converting to/from an optimized layout is sometimes less expensive than operating on the unoptimized layout, since some layouts are better optimized for certain operations than others.
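To make "optimized layout" concrete, here is a minimal NumPy sketch that converts a plain NCHW tensor to a channel-blocked layout similar in spirit to MKL's nChw16c and back; it assumes the channel count is a multiple of the block size and is not the actual MKL-DNN conversion code.

```python
import numpy as np

def nchw_to_blocked(x, block=16):
    """NCHW -> (N, C/block, H, W, block): channels blocked so the innermost
    dimension maps onto SIMD lanes with unit stride."""
    n, c, h, w = x.shape
    assert c % block == 0, "channel count must be a multiple of the block size"
    return x.reshape(n, c // block, block, h, w).transpose(0, 1, 3, 4, 2)

def blocked_to_nchw(x):
    """Inverse conversion back to plain NCHW."""
    n, cb, h, w, block = x.shape
    return x.transpose(0, 1, 4, 2, 3).reshape(n, cb * block, h, w)

x = np.random.rand(1, 64, 28, 28).astype(np.float32)
assert np.array_equal(blocked_to_nchw(nchw_to_blocked(x)), x)  # round-trip is lossless
```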

20 Example Challenge 2: Minimize Conversion Overhead. End-to-end optimization can reduce conversions: staying in the optimized layout as long as possible becomes one of the tuning goals; minimize the number of back-and-forth conversions; use graph optimization techniques. Example op sequence before optimization: Native-to-MKL layout, Convolution, MKL-to-Native layout, Max Pool, Native-to-MKL layout, Convolution, MKL-to-Native layout.
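A toy sketch of the graph-rewriting idea on a linearized op sequence: adjacent "from MKL layout" / "to MKL layout" pairs are dropped so data stays in the optimized layout between consecutive MKL ops. The op names and list representation are hypothetical, purely for illustration.

```python
# Toy pass over a linearized graph: remove back-to-back layout conversions.
def remove_redundant_conversions(ops):
    out = []
    for op in ops:
        # A "from_mkl" immediately followed by "to_mkl" is a no-op pair.
        if out and out[-1] == "from_mkl" and op == "to_mkl":
            out.pop()
            continue
        out.append(op)
    return out

graph = ["to_mkl", "conv", "from_mkl", "to_mkl", "maxpool",
         "from_mkl", "to_mkl", "conv", "from_mkl"]
print(remove_redundant_conversions(graph))
# -> ['to_mkl', 'conv', 'maxpool', 'conv', 'from_mkl']
```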

21 Example Challenge 3: Ensuring Enough Parallelism to Leverage All Cores. Maximize parallelism to use all cores efficiently: intra-op/layer parallelism within operators (OpenMP); inter-op parallelism across operators. Example: 1x1, 3x3 and 5x5 convolution branches execute in parallel and their outputs feed a concat node; within each convolution, tiles are convolved in parallel.
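In TensorFlow 1.x these two levels of parallelism are exposed through the session configuration; a small sketch follows, with thread counts that are illustrative and should be tuned per platform.

```python
import tensorflow as tf  # TensorFlow 1.x-style session configuration

config = tf.ConfigProto(
    intra_op_parallelism_threads=44,  # threads inside a single op (e.g. OpenMP in MKL kernels)
    inter_op_parallelism_threads=2,   # ops allowed to run concurrently (e.g. parallel branches)
)
with tf.Session(config=config) as sess:
    pass  # build and run the graph here with these parallelism settings
```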

22 Example Challenge 4: Optimizing the Data Layer. The data layer comprises 3 major ops: read data; decode data (e.g. JPEG decode, decompression); transform data. The result of read, decode and transform is the input to the DNN layers. Core assignment in the figure: a few cores (C0, C1) run Boost threads for the data layer while the remaining cores (C2 ... Cn-1) run OpenMP threads for the DNN. Reduce the number of cores dedicated to feeding the DNN: IO optimization (consider compression); decode (consider LMDB instead of JPEG); resizing/data processing (consider pre-processing); then vectorize and parallelize.
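A minimal sketch of the core-partitioning idea: a few dedicated threads run read/decode/transform and hand finished batches to the compute cores through a bounded queue. The read/decode/transform callables below are placeholders, not the actual Caffe or TensorFlow data layer.

```python
import queue
import threading

BATCHES = queue.Queue(maxsize=8)   # bounded hand-off buffer between IO and compute

def data_layer(read, decode, transform, num_batches):
    """Runs on a small number of dedicated threads: read -> decode -> transform."""
    for i in range(num_batches):
        BATCHES.put(transform(decode(read(i))))
    BATCHES.put(None)  # sentinel: no more data

def training_loop(run_dnn_step):
    """Runs on the remaining cores: consumes prepared batches."""
    while True:
        batch = BATCHES.get()
        if batch is None:
            break
        run_dnn_step(batch)

# Placeholder callables for illustration only.
io_thread = threading.Thread(
    target=data_layer, args=(lambda i: i, lambda x: x, lambda x: x, 100))
io_thread.start()
training_loop(lambda batch: None)
io_thread.join()
```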

23 Optimizing TensorFlow and Other DL Frameworks for Intel Architecture. Leverage high-performance compute libraries and tools, e.g. Intel Math Kernel Library, Intel Python, Intel Compiler. Data format/shape: the right format/shape for maximum performance (blocking, gather/scatter). Data layout: minimize the cost of data layout conversions. Parallelism: use all cores; eliminate serial sections and load imbalance. Other functions/primitives (un-optimized in libraries): optimize via compiler knobs, improve existing implementations. Memory allocation: unique characteristics and the ability to reuse buffers. Data layer optimizations: parallelization, vectorization, IO. Optimize hyperparameters: e.g. batch size for more parallelism (affects core utilization), learning rate and optimizer to ensure accuracy/convergence.
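One common way to apply the parallelism-related knobs above on Intel CPUs is through the OpenMP/MKL environment variables, set before the framework is imported; the values shown are illustrative starting points, not universal recommendations.

```python
import os

# Illustrative starting points; tune per platform and topology.
os.environ["OMP_NUM_THREADS"] = "44"                          # use the physical cores for compute
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"   # pin OpenMP threads to cores
os.environ["KMP_BLOCKTIME"] = "1"                             # release cores quickly between parallel regions

import tensorflow as tf  # import after the environment is configured
```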

24 Intel Machine Learning Scaling Library (MLSL). Some of the Intel MLSL features: built on top of MPI, and allows the use of other communication libraries; optimized to drive scalability of communication patterns; works across various interconnects: Intel Omni-Path Architecture, InfiniBand*, and Ethernet; common API to support deep learning frameworks (Caffe*, Theano*, Torch*, etc.). The diagram shows layers 1, 2, ..., N exchanging data during forward and back propagation using collectives such as Allreduce, Alltoall, Reduce-Scatter and Allgather.
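MLSL has its own API, but since it is built on top of MPI, the underlying allreduce pattern used for gradient aggregation in data-parallel back-propagation can be sketched with mpi4py; this shows only the communication pattern, not MLSL itself.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def allreduce_gradients(local_grads):
    """Sum each layer's gradient across workers, then average."""
    averaged = []
    for g in local_grads:
        total = np.empty_like(g)
        comm.Allreduce(g, total, op=MPI.SUM)   # every rank receives the global sum
        averaged.append(total / comm.Get_size())
    return averaged

# Illustrative use: one flattened gradient tensor per layer.
grads = [np.random.rand(1024).astype(np.float32) for _ in range(3)]
grads = allreduce_gradients(grads)
```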

25 Performance Status on Xeon (Broadwell, 2S, 22 cores). Table of sample topologies; columns: benchmark metric (higher is better), baseline training performance before any optimization, WW01 optimized inference performance, WW01 optimized training performance, and speedup over baseline. Rows: AlexNet-ConvNet, AlexNet-BVLC (dummy), AlexNet-BVLC (ImageNet), ConvNet-GoogleNet v1, GoogleNet v3, ResNet, RNN-Seq2Seq English-to-French translation (words/sec), and ConvNet-VGG (all others in images/sec); numeric values are not preserved in the transcription.

26 Performance Status on Xeon Phi (Knights Landing Bin1, 68 cores). Same table layout as the previous slide: benchmark metric (higher is better), baseline training performance before any optimization, WW01 optimized inference and training performance, and speedup over baseline, for AlexNet-ConvNet, AlexNet-BVLC (dummy), AlexNet-BVLC (ImageNet), ConvNet-GoogleNet v1, GoogleNet v3, ResNet, RNN-Seq2Seq English-to-French translation (words/sec), and ConvNet-VGG (all others in images/sec); numeric values are not preserved in the transcription.

27 AlexNet Optimization Progression: cumulative speedup over baseline on Broadwell and Knights Landing across successive optimization steps: baseline, MKL integration, thread optimization, compiler knobs tuning, matrix transpose/data transformations, memory allocations, and conversions optimization (the per-step speedup values in the chart are garbled in the transcription).

28 VGG Optimization Progression: cumulative speedup over baseline on Broadwell and Knights Landing across the same optimization steps: baseline, MKL integration, thread optimization, compiler knobs tuning, matrix transpose/data transformations, memory allocations, and conversions optimization (the per-step speedup values in the chart are garbled in the transcription).

29 Configuration details. Intel Xeon processor E5-2699 v4 (22 cores, 2.2 GHz), 128 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2. Intel Xeon Phi processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: flat mode), 96 GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2. AlexNet and VGG benchmarks:

30 Key Challenges in Deep Learning Training. Scaling on the order of petaflop performance (a la classical HPC). Inherent difficulty in reproducibility and provenance due to the black-box nature. Dependence on hyperparameters (batch size, momentum, initialization weights, etc.). Deep learning is data-inefficient and requires labelled data; unsupervised deep learning is still in its infancy.

31 Summary. Large performance cost of DL workloads when using unoptimized frameworks. Significant performance headroom from optimization on Xeon and Xeon Phi: close to 400x speedup in certain topologies on Caffe and TensorFlow. Traditional vectorization and parallelization strategies apply. Other unique performance challenges: hyperparameters, the data layer, inter/intra-layer parallelization, etc. Call to action: try the Intel-optimized frameworks available today; more to come soon.

32 Configuration details

2S Intel Xeon processor E5-2697A v4 on Apache* Spark* with MKL 2017: up to 18x performance increase compared to 2S E v2 + F2JBLAS machine learning training.
BASELINE: Intel Xeon processor E v2 (12 cores, 2.7 GHz), 256 GB memory, CentOS 6.6*, F2JBLAS; relative performance 1.0.
Intel Xeon processor E v2 Apache* Spark* cluster: 1 master + 8 workers, 10 Gbit/sec Ethernet fabric, each system with 2 processors, Intel Xeon processor E v2 (12 cores, 2.7 GHz), Hyper-Threading enabled, 256 GB RAM per system, 1x 240 GB SSD OS drive, 12x 3 TB HDD data drives per system, CentOS* 6.6, Linux el6.x86_64, Intel MKL 2017 build U1_ , Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200 GB for Spark* master and workers; relative performance up to 3.4x.
Intel Xeon processor E v3 Apache* Spark* cluster: 1 master + 8 workers, 10 Gbit/sec Ethernet fabric, each system with 2 processors, Intel Xeon processor E v3 (18 cores, 2.3 GHz), Hyper-Threading enabled, 256 GB RAM per system, 1x 480 GB SSD OS drive, 12x 4 TB HDD data drives per system, CentOS* 7.0, Linux el7.x86_64, Intel MKL 2017 build U1_ , Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200 GB for Spark* master and workers; relative performance up to 8.8x.
Intel Xeon processor E5-2697A v4 Apache* Spark* cluster: 1 master + 8 workers, 10 Gbit/sec Ethernet fabric, each system with 2 processors, Intel Xeon processor E5-2697A v4 (16 cores, 2.6 GHz), Hyper-Threading enabled, 256 GB RAM per system, 1x 800 GB SSD OS drive, GB SSDs data drives per system, CentOS* 6.7, Linux el6.x86_64, Intel MKL 2017 build U1_ , Cloudera Distribution for Hadoop (CDH) 5.7, Apache* Spark* standalone, OMP_NUM_THREADS=1 set in CDH*, total Java heap size of 200 GB for Spark* master and workers; relative performance up to 18x.
Machine learning algorithm used for all configurations: Alternating Least Squares (ALS).

Intel Xeon Phi processor 7250, GoogleNet V1 time-to-train scaling efficiency up to 97% on 32 nodes: 32 nodes of Intel Xeon Phi processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: flat mode), 96 GB DDR4 memory, Red Hat* Enterprise Linux 6.7, export OMP_NUM_THREADS=64 (the remaining 4 cores are used for driving communication), MKL 2017 Update 1, MPI: , Endeavor KNL bin1 nodes, export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2. Throughput is measured using the train command; data is pre-partitioned across all nodes in the cluster before training, and no data is transferred over the fabric while training. Scaling efficiency computed as: (single-node performance / (N * performance measured with N nodes)) * 100, where N = number of nodes. Intel Caffe: Intel internal version of Caffe. GoogLeNet V1: batch size 1536.

Intel Xeon Phi processor 7250: up to 400x performance increase with Intel Optimized Frameworks compared to baseline out-of-the-box performance.
BASELINE: Caffe out of the box, Intel Xeon Phi processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: cache mode), 96 GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, BVLC Caffe with OpenBLAS; relative performance 1.0.
NEW: Caffe, Intel Xeon Phi processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: cache mode), 96 GB memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2, Intel Caffe based on BVLC Caffe as of Jul 16, 2016, MKL GOLD UPDATE1; relative performance up to 400x. AlexNet used for both configurations, per the ImageNet classification with deep convolutional neural networks paper; batch size 256.

Intel Xeon Phi processor 7250, 32-node cluster with Intel Omni-Path fabric: up to 97% GoogleNet V1 time-to-train scaling efficiency. Caffe: Intel Xeon Phi processor 7250 (68 cores, 1.4 GHz, 16 GB MCDRAM: flat mode), 96 GB DDR4 memory, Red Hat* Enterprise Linux 6.7, Intel Caffe (not publicly available yet), export OMP_NUM_THREADS=64 (the remaining 4 cores are used for driving communication), MKL 2017 Update 1, MPI: , Endeavor KNL bin1 nodes, export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2. Throughput is measured using the train command; images are split across nodes and copied locally on each node at the beginning of training, so no IO happens over the fabric while training. GoogLeNetV1: batch size 1536.

Intel Xeon Phi processor Knights Mill: up to 4x estimated performance improvement over Intel Xeon Phi processor 7290.
BASELINE: Intel Xeon Phi processor 7290 (16 GB, 1.50 GHz, 72 cores) with 192 GB total memory on Red Hat Enterprise Linux* 6.7 kernel using MKL 11.3 Update 4; relative performance 1.0.
NEW: Intel Xeon Phi processor family Knights Mill; relative performance up to 4x.

Intel Arria FPGA energy efficiency on Caffe/AlexNet: up to 25 img/s/W with FP16 at 297 MHz. Vanilla AlexNet classification implementation; training parameters taken from the Caffe open-source framework; 224x224x3 input, 1000x1 output, FP16 with shared block-exponents; all compute layers (including fully connected) done on the FPGA except for Softmax; Arria FPGA, -1 speed grade, on Altera PCIe DevKit with x MHz; power measured through the on-board power monitor (FPGA power only); ACDS 16.1 internal builds + OpenCL SDK 16.1 internal build; compute machine is an HP Z620 workstation, Xeon E at 3.3 GHz with 32 GB RAM (the Xeon is not used for compute).

Knights Mill performance: results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: Source: Intel measured everything except Knights Mill, which is estimated, as of November 2016.

33 (figure-only slide; no text transcribed)
