Jacek Czaja, Machine Learning Engineer, AI Product Group

1 Jacek Czaja, Machine Learning Engineer, AI Product Group

2 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #

4 Artificial Intelligence, Machine Learning & Deep Learning

5 Deep Learning use cases
Transport: Automated driving
Health: Pneumonia detection (positive / negative) [1]
Agriculture: Robotics
Space: Lunar crater detection
Consumer: Speech/text search
Finance: Customer support; financial forecasting
Energy: Oil & gas search

6 Why Now?
Bigger Data: Image: 1000 KB / picture; Audio: 5000 KB / song; Video: 5,000,000 KB / movie
Better Hardware: Transistor density doubles every 18 months; Cost / GB in 1995: $; Cost / GB in 2017: $0.02
Smarter Algorithms: Advances in algorithm innovation, including neural networks, leading to better accuracy in training models

juanmerodio.com/en/wp-content/uploads/gold-data.

7 Sharing
Companies share algorithms and topologies. Their gold is: data, trained models, talent.

8 Visual Understanding Research @ Intel Labs China
Innovate in cutting-edge visual cognition & machine learning technologies for smart computing to enable novel usages and user experience.
Diagram labels: Face 2D/3D Analysis; Face & Emotion Recognition Engine; Efficient Deep Learning; DNN Design & Compression; Visual Recognition; Visual Parsing & Multimodal Analysis
Technologies: Face Analysis Technology; Multimodal Emotion Recognition; Efficient CNN Algorithm Design; DNN Model Compression; Automatic Image/Video Captioning; Visual Question & Answering

9 Machine Learning Types
Supervised: Teach desired behavior with labeled data and infer new data (labeled data -> classified data)
Unsupervised: Make inferences with unlabeled data and discover patterns (unlabeled data -> clustered data)
Semi-supervised: A combination of supervised and unsupervised learning (labeled and unlabeled data -> classified data)
Reinforcement: Act in an environment to maximize reward; build autonomous agents that learn

11 Training
data -> Forward Propagation -> output (person, cat, dog, bike)
output vs. expected (person, cat, dog, bike) -> penalty (error or cost) -> Back Propagation
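The training loop on this slide can be sketched in a few lines. This is a minimal illustration only, assuming a single linear layer with a softmax output and a cross-entropy penalty; the feature vector, learning rate, and all function names are hypothetical and do not come from the deck.

```python
import math

CLASSES = ["person", "cat", "dog", "bike"]  # labels shown on the slide

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, features):
    """Forward propagation: one linear layer followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)

def train_step(weights, features, expected, lr=0.5):
    """One forward pass, one penalty computation, one backward pass."""
    output = forward(weights, features)
    target = [1.0 if name == expected else 0.0 for name in CLASSES]
    # Penalty (cross-entropy): large when the expected class gets low probability.
    penalty = -math.log(output[CLASSES.index(expected)])
    # Back propagation: for softmax + cross-entropy the gradient with respect to
    # each score is (output - target); the chain rule multiplies by the input.
    for k, row in enumerate(weights):
        for j in range(len(row)):
            row[j] -= lr * (output[k] - target[k]) * features[j]
    return penalty

weights = [[0.0, 0.0, 0.0] for _ in CLASSES]  # 4 classes x 3 toy features
sample = [1.0, 0.5, -0.5]                     # hypothetical feature vector
losses = [train_step(weights, sample, "cat") for _ in range(20)]
```

Each call compares the output with the expected label, turns the mismatch into a penalty, and nudges the weights so the penalty shrinks on the next pass.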

12 Inference
data -> Forward Propagation -> output (person, cat, dog, bike)
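Inference reuses only the forward pass, with no penalty and no weight update. A minimal sketch, assuming the same single-layer softmax model; the weight values below are made up for illustration and stand in for a trained model.

```python
import math

CLASSES = ["person", "cat", "dog", "bike"]  # labels shown on the slide

def predict(weights, features):
    """Forward propagation only: return the most likely class and its probability."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(CLASSES)), key=lambda k: probs[k])
    return CLASSES[best], probs[best]

# Hypothetical weights standing in for a trained model (one row per class).
trained_weights = [
    [-0.2, -0.1, 0.1],   # person
    [0.6, 0.3, -0.3],    # cat
    [-0.2, -0.1, 0.1],   # dog
    [-0.2, -0.1, 0.1],   # bike
]
label, confidence = predict(trained_weights, [1.0, 0.5, -0.5])
```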

14 AIPG Nervana Deep Learning Portfolio
RESEARCH AND APPLICATION SUPPORT: Intel Brain Data Scientist Team; BDM & Direct Optimization Team. Research new AI usages and models. Develop POC with customers to apply AI methods. Enable customers to deploy products.
DEEP LEARNING PLATFORM: Nervana Deep Learning Studio; Titanium: HW mgmt.; Nervana Cloud; Intel branded. Data scientist and developer DL productivity tools. DL Cloud Service for POC, developers and academics. DL appliance for DLaaS.
ENABLING PRODUCT SOFTWARE: MKL-DNN, other math libraries; Frameworks; Nervana Graph; HW Transformers, Non-x86 libraries. Frameworks for developers. Back end APIs to Nervana Graph. Accelerate framework optimization on IA; open source. For framework developers & Intel. Multi-node optimizations. Extend to non-DC inference products and use cases.
SYSTEMS: Node & rack reference designs; Channel sales; Deep Learning Systems. Enable direct and end customers with Deep Learn System portfolio. Intel branded under investigation.
PRODUCTS: Datacenter; Edge, client, gateway. Comprehensive product portfolio. General purpose x86. Dedicated DL NPU accelerators.

15 AI gateway/edge
Intel Processors (all purpose): agile AI Platforms. Range of performance and power for widest variety of AI, gateway & edge workloads including deep learning inference.
Intel FPGA (flexible acceleration): Enhanced DL Inference. Acceleration for deep learning inference in real-time with higher efficiency, and wide range of workloads & configurations.
Mobileye's EyeQ-5 (ADAS): autonomous driving. Real-time fused camera/radar inference, path planning, road reconstruction in vehicle.
Movidius Myriad-X (low power vision): computer vision. Low power computer vision engine using deep learning inference in gateway and devices.
*Knights Mill (KNM); select = single-precision highly-parallel workloads generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth e.g. energy (reverse time migration), deep learning training, etc.
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

16 AI Datacenter
Intel Xeon Scalable Processors (all purpose): Known Compute for AI. Scalable performance for widest variety of AI & other datacenter workloads including breakthrough deep learning training & inference.
Intel FPGA (flexible acceleration): Enhanced DL Inference. Scalable acceleration for deep learning inference in real-time with higher efficiency, and wide range of workloads & configurations.
Intel Nervana Neural Network Processor (deep learning): Deep learning by design. Scalable acceleration with best performance for intensive deep learning training & inference, period.

17 Intel Xeon scalable processor
Scalable performance for widest variety of AI & other datacenter workloads including deep learning. Most agile AI platform.
Built-in ROI: Begin your AI journey today using existing, familiar infrastructure
Potent performance: Up to 2.2X deep learning training & inference perf vs. prior gen (note 1); 113X with SW optimizations (note 2)
Production-ready: Robust support for full range of AI deployments
Classic ML, Deep Learning, Reasoning, Emerging AI, Analytics, More
Notes 1, 2: Configuration details on slide: 18, 20, 24. Source: Intel measured as of November 2016

18 Performance Drivers for AI Workloads
Compute, Bandwidth, SW Optimizations

19 Up to 3.4x Integer Matrix Multiply Performance on Intel Xeon Platinum 8180 Processor
Enhanced matrix multiply performance on Intel Xeon Scalable Processors
Chart: Matrix Multiply Performance on Intel Xeon Platinum 8180 Processor compared to Intel Xeon Processor E v. GEMM performance (measured in GFLOPS) represented relative to a baseline of 1.0; higher is better.
Series: 1S Intel Xeon Processor E v4; 1S Intel Xeon Platinum 8180 Processor
Workloads: Single Precision Floating Point General Matrix Multiply, SGEMM (FP32); Integer General Matrix Multiply, IGEMM (INT8): 3.4
8bit IGEMM will be available in Intel Math Kernel Library (Intel MKL) 2018 Gold to be released by end of Q
Configuration Details on Slide: 24. Source: Intel measured as of June 2017
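To make the SGEMM versus IGEMM distinction concrete, the sketch below multiplies two small matrices once in floating point and once through 8-bit integers. It is an illustration of the idea only: the symmetric per-matrix scaling, the function names, and the sample values are assumptions, and this is not the Intel MKL API.

```python
def quantize(matrix, bits=8):
    """Map floats to signed integers using one scale per matrix (symmetric)."""
    limit = 2 ** (bits - 1) - 1                      # 127 for INT8
    peak = max(abs(v) for row in matrix for v in row) or 1.0
    scale = peak / limit
    return [[round(v / scale) for v in row] for row in matrix], scale

def matmul(a, b):
    """Plain triple-loop GEMM; Python ints stand in for wide integer accumulators."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

a = [[0.5, -1.0], [2.0, 0.25]]   # hypothetical sample data
b = [[1.0, 0.5], [-0.5, 2.0]]

reference = matmul(a, b)         # floating-point result
qa, scale_a = quantize(a)
qb, scale_b = quantize(b)
# Integer multiply-accumulate, then one rescale back to real values.
approx = [[v * scale_a * scale_b for v in row] for row in matmul(qa, qb)]
```

The integer path trades a small rounding error for narrower operands, which is where the throughput gain on wide vector units comes from.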

20 AI Performance Gen over Gen
Advance previous generation AI workload performance with Intel Xeon Scalable Processors
INFERENCE THROUGHPUT: Up to 2.4x higher Neon ResNet 18 inference throughput on Intel Xeon Platinum 8180 Processor compared to Intel Xeon Processor E v4 (batch size: 1)
TRAINING THROUGHPUT: Up to 2.2x higher Neon ResNet 18 training throughput on Intel Xeon Platinum 8180 Processor compared to Intel Xeon Processor E v4 (batch size: 256)
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
Configuration Details on Slide: 18, 20. Source: Intel measured as of June 2017

21 AI Performance Software + Hardware
Deliver significant AI performance with hardware and software optimizations on Intel Xeon Scalable Processors: Optimized Frameworks; Optimized Intel MKL Libraries
INFERENCE THROUGHPUT: Up to 138x higher Intel Optimized Caffe GoogleNet v1 with Intel MKL inference throughput on Intel Xeon Platinum 8180 Processor compared to Intel Xeon Processor E v3 with BVLC-Caffe
TRAINING THROUGHPUT: Up to 113x higher Intel Optimized Caffe AlexNet with Intel MKL training throughput on Intel Xeon Platinum 8180 Processor compared to Intel Xeon Processor E v3 with BVLC-Caffe
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
Batch size (using FP32): Caffe GoogleNet v1: 256; AlexNet: 256
Configuration Details on Slide: 18, 25. Source: Intel measured as of June 2017

22 Up to 2.4x Higher Inference Throughput on Intel Xeon Platinum 8180 Processor
Intel Xeon Platinum Processor delivers Inference throughput performance across different frameworks
Chart: Inference throughput shown in images/second. Series: 2S Intel Xeon Processor E5-2699v4, 22C, 2.2GHz; 2S Intel Xeon Platinum 8180 Processor, 28C, 2.5GHz. Frameworks: Caffe, TensorFlow, MXNet, Neon.
Topologies: AlexNet BS = ; GoogLeNet v1 BS = ; ResNet-50 BS = 1024; VGG-19 BS = ; AlexNet ConvNet BS = ; GoogLeNet ConvNet BS = ; VGG ConvNet BS = ; AlexNet BS = ; VGG-19 BS = 256; Inception V3 BS = 1024; ResNet-50 BS = ; AlexNet ConvNet BS = ; GoogLeNet v1 ConvNet BS = 1024; ResNet 18 BS = 1024
Inference throughput measured with FP32 instructions. Inference with INT8 will be higher. Additional optimizations may further improve performance.
Source: Intel measured as of June 2017

23 Intel Xeon Scalable Processor Multi-node Performance
Chart: ResNet-50 time to train (hours), weak scaling, minibatch 64. Series: SKX, SKX-8180*. Labeled value: 3.9
X axis: global minibatch, scaled across nodes: (1 node) (2 nodes) (4 nodes) (8 nodes) (16 nodes) (32 nodes) (64 nodes) (128 nodes) (256 nodes) (352 nodes) (470 nodes) (704 nodes); MB-32 per node, MB-24 per node, MB-16 per node
Source: Intel measured as of August
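Under weak scaling each node keeps its own minibatch, so the global minibatch grows with the node count. The arithmetic can be sketched as follows; the function name is hypothetical, and the pairing of the larger node counts with the smaller per-node minibatches is an assumption, since the chart does not survive in this transcript.

```python
def global_minibatch(nodes, per_node):
    """Weak scaling: total minibatch is the per-node minibatch times the node count."""
    return nodes * per_node

# Node counts and the per-node minibatch of 64 both appear on the slide.
small_runs = {n: global_minibatch(n, 64) for n in [1, 2, 4, 8, 16, 32, 64]}

# ASSUMPTION: one possible pairing of the larger node counts with the
# "MB-32 / MB-24 / MB-16 per node" labels; the slide text does not confirm it.
large_runs = {(n, mb): global_minibatch(n, mb)
              for n, mb in [(352, 32), (470, 24), (704, 16)]}
```

Shrinking the per-node minibatch as nodes are added is one way to keep the global minibatch from growing past what the training recipe tolerates.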

24 November

25 Crest family 2017
Deep learning by design. Scalable acceleration with best performance for intensive deep learning training & inference, period.
Custom hardware: Unprecedented compute density; large reduction in time-to-train
Blazing data access: 32 GB of in package memory via HBM2 technology; 8 Tera-bits/s of memory access speed
High-speed scalability: 12 bi-directional high-bandwidth links; seamless data transfer via interconnects

26 Intel Nervana Lake Crest NPU Architecture
Block diagram (floorplan not to scale), on an interposer:
12 Processing Clusters
4 HBM2 stacks, each with an HBM PHY and a Mem Ctrlr
12 ICL, 2 ICC
PCIe Controller & DMA, PCI Express x16
MGMT CPU; SPI, I2C, GPIO

27 FlexPoint Numerical Format Designed
Float16: 11 bit mantissa precision (-1024 to 1023); individual 5-bit exponents (diagram: DEC=8, DEC=7, DEC=8, DEC=6, DEC=7, DEC=8, DEC=7, DEC=6, DEC=8, one per element)
Flex16: 16 bit mantissa, 45% more precision than Float16 (-32,768 to 32,767); tensor-wide shared 5-bit exponent (diagram: DEC=8 for the whole tensor)
Flex16 accuracy on par with Float32 but with much smaller cores
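The shared-exponent idea can be sketched as follows: every element of a tensor is stored as a 16-bit integer mantissa, and a single exponent applies to the whole tensor. This is only an illustration under stated assumptions; the function names are hypothetical, the exponent is chosen from the largest magnitude in the tensor, and neither the 5-bit exponent limit nor the way real hardware manages the exponent is modeled.

```python
import math

MANTISSA_MIN, MANTISSA_MAX = -32768, 32767   # 16-bit signed range from the slide

def flex_encode(values):
    """Encode a tensor as integer mantissas plus one shared power-of-two exponent."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0] * len(values), 0
    # Smallest power-of-two step that keeps the largest value inside the range.
    exponent = math.ceil(math.log2(peak / MANTISSA_MAX))
    step = 2.0 ** exponent
    mantissas = [max(MANTISSA_MIN, min(MANTISSA_MAX, round(v / step)))
                 for v in values]
    return mantissas, exponent

def flex_decode(mantissas, exponent):
    """Recover approximate real values from mantissas and the shared exponent."""
    step = 2.0 ** exponent
    return [m * step for m in mantissas]

tensor = [0.75, -0.001, 0.25, -0.5]          # hypothetical sample values
mantissas, exponent = flex_encode(tensor)
restored = flex_decode(mantissas, exponent)
```

Because the exponent is stored once per tensor, the per-element arithmetic is plain integer arithmetic, which is consistent with the slide's point about smaller cores.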

29 Diversity in Deep Networks
Variety in Network Topology: Recurrent NNs common for NLP/ASR, DAG for GoogLeNet, networks with memory
But there are a few well defined building blocks: Convolutions, common for image recognition tasks; GEMMs for recurrent network layers, could be sparse; ReLU, tanh, softmax
Pictured: CNN - AlexNet; GoogLeNet; Recurrent NN

30 Naïve Convolution
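The slide's figure is not preserved here, so the following is a generic sketch of a direct (naïve) convolution rather than the code shown in the talk. It assumes stride 1, no padding, and nested-list tensors; the function name and sample data are hypothetical.

```python
def conv2d_naive(image, kernels):
    """Direct convolution, stride 1, no padding.

    image:   [in_channels][height][width]
    kernels: [out_channels][in_channels][k_height][k_width]
    returns: [out_channels][out_height][out_width]
    """
    in_ch, height, width = len(image), len(image[0]), len(image[0][0])
    k_h, k_w = len(kernels[0][0]), len(kernels[0][0][0])
    out_h, out_w = height - k_h + 1, width - k_w + 1
    output = [[[0.0] * out_w for _ in range(out_h)] for _ in kernels]
    for oc, kernel in enumerate(kernels):            # every output channel
        for y in range(out_h):                       # every output row
            for x in range(out_w):                   # every output column
                acc = 0.0
                for ic in range(in_ch):              # every input channel
                    for ky in range(k_h):            # every filter row
                        for kx in range(k_w):        # every filter column
                            acc += image[ic][y + ky][x + kx] * kernel[ic][ky][kx]
                output[oc][y][x] = acc
    return output

image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]   # 1 channel, 3x3
kernels = [[[[1.0, 0.0], [0.0, -1.0]]]]                          # 1 filter, 2x2
result = conv2d_naive(image, kernels)
```

The six nested loops revisit the same input and filter values many times with no attention to where they sit in memory, which is what the cache-friendly variant addresses.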

31 Cache Friendly Convolution
arxiv.org/pdf/ v1.pdf
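One common way to make convolution cache friendly is to block the loops so that a small tile of the output, the filters, and the input rows is reused before moving on, with the innermost loop running at unit stride. The sketch below shows that loop structure only; it is not the scheme from the referenced paper, the block sizes and names are hypothetical, and plain Python lists will not show the performance effect.

```python
def conv2d_blocked(image, kernels, oc_block=2, row_block=1):
    """Blocked convolution, stride 1, no padding; same layout as a direct version.

    image:   [in_channels][height][width]
    kernels: [out_channels][in_channels][k_height][k_width]
    """
    in_ch, height, width = len(image), len(image[0]), len(image[0][0])
    out_ch = len(kernels)
    k_h, k_w = len(kernels[0][0]), len(kernels[0][0][0])
    out_h, out_w = height - k_h + 1, width - k_w + 1
    output = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_ch)]
    # Outer loops walk over blocks of output channels and output rows.
    for oc0 in range(0, out_ch, oc_block):
        for y0 in range(0, out_h, row_block):
            for ic in range(in_ch):
                # Inside a block, the same input rows serve several filters.
                for oc in range(oc0, min(oc0 + oc_block, out_ch)):
                    plane = kernels[oc][ic]
                    for y in range(y0, min(y0 + row_block, out_h)):
                        out_row = output[oc][y]
                        for ky in range(k_h):
                            in_row = image[ic][y + ky]
                            k_row = plane[ky]
                            # Innermost loop over x: unit-stride access.
                            for x in range(out_w):
                                for kx in range(k_w):
                                    out_row[x] += in_row[x + kx] * k_row[kx]
    return output

image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]   # 1 channel, 3x3
kernels = [
    [[[1.0, 0.0], [0.0, -1.0]]],
    [[[0.0, 1.0], [1.0, 0.0]]],
    [[[1.0, 1.0], [1.0, 1.0]]],
]                                                                # 3 filters, 2x2
result = conv2d_blocked(image, kernels)
```

The arithmetic is identical to the direct form; only the order of the loops changes, so the result matches while the working set per block stays small.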

32 Performance Optimization on Modern Platforms
Hierarchical Parallelism
Coarse-Grained / multi-node: Domain decomposition. Scaling: improve load balancing; reduce synchronization events, all-to-all comms
Fine-Grained Parallelism / within node: Sub-domain: 1) Multi-level domain decomposition (ex. across layers) 2) Data decomposition (layer parallelism)
Utilize all the cores: OpenMP, MPI, TBB; reduce synchronization events, serial code; improve load balancing
Vectorize/SIMD: Unit strided access per SIMD lane; high vector efficiency; data alignment
Efficient memory/cache use: Blocking; data reuse; prefetching; memory allocation

33 Intel MKL and Intel MKL-DNN for Deep Learning
Stack: Deep Learning Frameworks on top of Intel Math Kernel Library (Intel MKL) and Intel MKL-DNN, running on Xeon, Xeon Phi, FPGA
Intel MKL: DNN primitives + wide variety of other math functions; C DNN APIs (C++ future); binary distribution; free community license, premium support available as part of Parallel Studio XE; broad usage DNN primitives, not specific to individual frameworks; quarterly update releases
Intel MKL-DNN: DNN primitives; C/C++ DNN APIs; open source DNN code*; Apache 2.0 license; multiple variants of DNN primitives as required for framework integrations; rapid development ahead of Intel MKL releases
* GEMM matrix multiply building blocks are binary

35 Intel Nervana Deep Learning Studio
Compress Innovation Cycle to Accelerate Time-to-Solution
What it is: A comprehensive software suite that lets groups of data scientists shorten the innovation cycle and develop custom, enterprise-grade deep learning solutions in record time. Available as part of Intel Nervana Cloud and Intel Nervana Deep Learning System.
Data types: Images, Video, Text, Speech, Tabular, Time series
Why it's important: Developing a deep learning solution is time consuming and expensive, because data scientists spend too much time wrangling data and manually executing hundreds of experiments to find the network topology and combination of parameters that yield a converged model for their use case.
Stack: Intel Nervana Deep Learning Studio; Deep Learning Frameworks: Neon (more coming soon); Intel Nervana Hardware
Users: Primary: Data scientists. Secondary: Software developers who take trained deep learning models and integrate them into their applications.
Learn More: intelnervana.com

36 High-Level Workflow
Data Scientist, with multiple interface options: ncloud Command Line Interface; Interactive Notebooks; User Interface
Label Dataset -> Import Dataset -> Build Model -> Train -> Deploy (Cloud/Server, Edge)
Model Library; Trained Model

38 Intel Nervana AI Academy
Intel Developer Zone for Artificial Intelligence
Deep Learning Frameworks, libraries and additional tools
Workshops, Webinars, Meet Ups & Remote Access
software.intel.com/ai/academy
intelnervana.com

40 [1] CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
