Akhilesh Kumar, Sailesh Kottapalli, Ian M Steiner, Bob Valentine, Israel Hirsh, Geetha Vedaraman, Lily P Looi, Mohamed Arafa, Andy Rudoff, Sreenivas

Size: px

Start display at page:

Download "Akhilesh Kumar, Sailesh Kottapalli, Ian M Steiner, Bob Valentine, Israel Hirsh, Geetha Vedaraman, Lily P Looi, Mohamed Arafa, Andy Rudoff, Sreenivas"

Alban Waters
5 years ago
Views:

1 Akhilesh Kumar, Sailesh Kottapalli, Ian M Steiner, Bob Valentine, Israel Hirsh, Geetha Vedaraman, Lily P Looi, Mohamed Arafa, Andy Rudoff, Sreenivas Mandava, Bahaa Fahim, Sujal A Vora Intel Corporation, 2018

2 Notices and Disclaimers This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. Intel technologies features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, the Intel logo, Intel Optane and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the united states and other countries. * Other names and brands may be claimed as the property of others Intel Corporation. 2

3 Outline Intel Xeon Scalable Processor Roadmap Focus Areas for Cascade Lake-SP Instruction Enhancement for AI/Deep Learning Inference Intel Optane DC Persistent Memory Side Channel Mitigations Wrap up 3

4 First Generation Intel Xeon Scalable Processor Introduced in July 2017 Skylake-SP core microarchitecture with data center specific enhancements Intel AVX-512 with 32 DP flops per cycle per core Data center optimized cache hierarchy 1MB L2 per core, non-inclusive L3 New Intel Mesh architecture Enhanced 6 channel memory subsystem 48 lanes of PCIe Gen3 with integrated DMA, NTB, and VMD devices New Intel Ultra Path Interconnect (Intel UPI) 6 Channels DDR4 DDR4 DDR4 DDR4 Core Core Core Core 2 or 3 UPI UPI Features Cores and Threads Per CPU Last-level Cache (LLC) QPI/UPI Speed (GT/s) Intel Xeon Scalable Processor Up to 28 cores and 56 threads Up to 38.5 MB (non-inclusive) Up to 3x 10.4 GT/s DDR4 DDR4 DDR4 Core Core Shared L3 UPI UPI PCIe* Lanes/ Controllers Memory Population Up to 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s) Up to 6 channels of up to 2 RDIMMs, LRDIMMs, or 3DS LRDIMMs 48 Lanes PCIe* 3.0 DMI3 Max Memory Speed Up to Foundation for Accelerating Data Center Innovations 4

5 Next Step in the Intel Xeon Scalable Processor Cascade Lake CPU is designed to be compatible with first-gen Intel Xeon Scalable platform Same core count, cache size, and I/O speeds as first-gen Process tuning, frequency push, targeted performance improvements Architectural improvements through targeted instruction set enhancements New platform capabilities with support for Intel Optane DC persistent memory Hardware enhancements for protection against side-channel methods Grantley Platform Intel Microarchitecture Codenamed Haswell Haswell 22nm Features Cores and Threads Last-level Cache UPI Speed (GT/s) PCIe* 3.0 Lanes Memory Speed Broadwell 14nm Purley Platform Intel Microarchitecture Codenamed Skylake New Microarchitecture Skylake- SP 14nm New Microarchitecture Cascade Lake CPU Cascade Lake-SP 14nm Up to 28 Cores and 56 Threads Up to 38.5 MB (non-inclusive) Up to 3x 10.4 GT/s Up to 48 lanes with 12 controllers Up to 6 up to 2666 MHz 5

6 6

7 Images/Second AI/Deep Learning Software Optimizations on first generation Intel Xeon Scalable Processor Higher is Better 5.4X(1) In 10 months since Intel Xeon Scalable Processor launch July 11, 2017 July 11, 2017 December 14, 2017 January 19, 2018 April 3, 2018 May 8, 2018 May 8, 2018 FP32 FP32 FP32 FP32 Int8 Int8 Int8 BS = 64 BS = 64 BS = 64 BS = 64 BS = 64 BS = 64 BS = 64, 4 instances 2S Intel Xeon Processor 2699 V4 Inference Throughput on Intel Caffe ResNet50 2S Intel Xeon Scalable Platinum 8180 Processor 2S Intel Xeon Scalable Platinum 8180 Processor 2S Intel Xeon Scalable Platinum 8180 Processor 2S Intel Xeon Scalable Platinum 8180 Processor 2S Intel Xeon Scalable Platinum 8180 Processor 2S Intel Xeon Scalable Platinum 8180 Processor (1) Up to 5.4X performance improvement with software optimizations on Caffe Resnet-50 in 10 months with 2 socket Intel Xeon Scalable Processor, Configuration Details 1, 2. Performance measurements were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Performance measurements were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: Source: Intel measured as of April

Neural Machine Translation Software Optimization on first generation Intel Xeon Scalable Processor Up to 14X(1) higher inference performance Configuration Details 3 4 Performance measurements were

8 Neural Machine Translation Software Optimization on first generation Intel Xeon Scalable Processor Up to 14X(1) higher inference performance Configuration Details 3 4 Performance measurements were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: Source: Intel measured as of May

9 Cascade Lake Vector Neural Network Instructions Vector Neural Network Instruction (VNNI) on Cascade Lake accelerates Deep Learning and AI inference workloads VNNI : A new set of Intel Advanced Vector Extension (Intel AVX-512) instructions 8-bit (int8) new instruction (VPDPBUSD) Fuses 3 instructions in inner convolution loop using int8 data type 16-bit (int16) new instruction (VPDPWSSD) Fuses 2 instructions in inner convolution loop using int16 data type 9

10 AI/DL Inference Enhancements on INT16 with VNNI Xeon Scalable AVX-512 INPUT INT16 INPUT INT16 vpmaddwd OUTPUT INT32 CONSTANT INT32 vpaddd OUTPUT INT32 Current AVX-512 instructions to perform INT16 convolutions: vpmaddwd, vpaddd New instructions for accelerating AI on Intel Xeon Scalable processors using int16 data INPUT INT16 INPUT INT16 Cascade Lake SP VNNI CONSTANT INT32 vpdpwssd OUTPUT INT32 VNNI instruction to accelerate INT16 convolutions: vpdpwssd 10

11 AI/DL Inference Enhancements on INT8 with VNNI Xeon Scalable AVX-512 INPUT INT8 INPUT INT8 vpmaddubsw OUTPUT INT16 CONSTANT INT16 vpmaddwd OUTPUT INT32 CONSTANT INT32 vpaddd OUTPUT INT32 Current AVX-512 instructions to perform INT8 convolutions: vpmaddubsw, vpmaddwd, vpaddd New instructions for accelerating AI on Intel Xeon Scalable processors using int8 data INPUT INT8 INPUT INT8 Cascade Lake SP VNNI CONSTANT INT32 vpdpbusd OUTPUT INT32 VNNI instruction to accelerate INT8 convolutions: vpdpbusd 11

12 Elements Processed/Cycle VNNI Per Core Throughput Vector Elements Processed per Cycle on Different Data Types FP32 Int16 Int8 FP32 VNNI Int16 VNNI Int8 vfmadd231ps vpmaddwd, vpaddd vpmaddubsw, vpmaddwd, vpaddd vfmadd231ps vpdpwssd vpdpbusd Input 32bit / Output 32bit Input 16bit / Output 32bit Input 8bit / Output 32bit Input 32bit / Output 32bit Input 16bit / Output 32bit Input 8bit / Output 32bit Intel Xeon Scalable Processor Codenamed Skylake FUTURE Intel Xeon Scalable Processor Codenamed CascadeLake Performance measurements were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: Source: Intel measured as of May

I n f e r e n c e T h r o u g h p u t ( i m a g e s / s e c ) Inference Throughput with VNNI I n t e l o p t i m i z a t i o n f o r C a f f e R e s N e t - 50 Estimated Throughput on Cascade Lake

13 I n f e r e n c e T h r o u g h p u t ( i m a g e s / s e c ) Inference Throughput with VNNI I n t e l o p t i m i z a t i o n f o r C a f f e R e s N e t - 50 Estimated Throughput on Cascade Lake with VNNI 5.4X i n t 8 o p t i m i z a t i o n s 2.8X F r a m e w o r k o p t i m i z a t i o n s Framework & Library support 1.0 F P 3 2 M K L - DNN Jul 17 Jan 18 Aug 18 Intel Xeon Scalable Processor 13

14 14

15 Growing Gap Between Memory Hierarchy Limitations to traditional architecture impede unified data management MEMORY DRAM HOT TIER Cost prohibitive for data intensive applications STORAGE SSD WARM TIER HDD/TAPE COLD TIER Performance impedes data intensive applications Media capability limits usage to cold tier 15

management MEMORY STORAGE DRAM HOT TIER SSD WARM

intensive applications Latency * : ~1000x Bandwidth

1x Capacity/$ * : ~40x Performance impedes data

16 Growing Gap Between Memory Hierarchy Limitations to traditional architecture impede unified data management MEMORY STORAGE DRAM HOT TIER SSD WARM TIER HDD/TAPE COLD TIER Cost prohibitive for data intensive applications Latency * : ~1000x Bandwidth * : ~0.1x Capacity/$ * : ~40x Performance impedes data intensive applications Media capability limits usage to cold tier * Actual performance and price may vary 16

TIER STORAGE Delivering efficient and scalable

17 Intel Innovations Address These Gaps MEMORY Improving memory capacity Improving SSD performance DRAM HOT TIER STORAGE Delivering efficient and scalable storage SSD WARM TIER Intel 3D Nand SSD HDD/TAPE COLD TIER 17

18 Big and Affordable Memory 128, 256, 512GB High Performance Storage DDR4 Pin Compatible Direct Load/Store Access Hardware Encryption Native Persistence High Reliability Supported on future Intel Xeon Scalable Processors (Cascade Lake)

Intel Optane DC Persistent Memory Hardware Interface Intel Xeon Cascade Lake IMC IMC Intel Optane DC Persistent Memory Module DDR4 electrical and physical interface with proprietary protocol

19 Intel Optane DC Persistent Memory Hardware Interface Intel Xeon Cascade Lake IMC IMC Intel Optane DC Persistent Memory Module DDR4 electrical and physical interface with proprietary protocol extensions Memory channel can be shared between DDR4 and Intel Optane DC persistent memory modules Enables systems to support greater than 3TB of system memory per CPU socket Cache line size accesses Idle latency close to DDR4 DIMMs Performance measurements were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: Source: Intel measured as of May

WPQ CPU CACHES Hardware Interface: Persistence Domain MOV Core L1 L1 L2 L3 CLWB + fence -or- CLFLUSHOPT + fence -or- CLFLUSH -or- NT stores + fence -or- WBINVD (kernel only) Custom

20 WPQ CPU CACHES Hardware Interface: Persistence Domain MOV Core L1 L1 L2 L3 CLWB + fence -or- CLFLUSHOPT + fence -or- CLFLUSH -or- NT stores + fence -or- WBINVD (kernel only) Custom Power fail protected domain indicated by ACPI property: CPU Cache Hierarchy WPQ ADR -or- WPQ Flush (kernel only) Minimum Required Power fail protected domain: Memory subsystem DIMM 20

21 The SNIA NVM Programming Model Mgmt. storage file memory Management UI Management Library Application Standard Raw Device Access Application Standard File API File System Generic NVDIMM Driver Standard File API Application Load/ Store pmem-aware File System DAX MMU Mappings USER SPACE KERNEL SPACE For more information software.intel.com/pmem 21

Performance tuned Open source & product neutral software.intel.

22 The Persistent Memory Development Kit - pmdk PMDK is a collection of libraries Developers pull only what they need Low level programming support Transaction APIs Fully validated Performance tuned Open source & product neutral software.intel.com/pmem Application Standard File API pmem-aware File System Persistent Memory PMDK Libraries MMU Mappings USER SPACE KERNEL SPACE

0 Results have been estimated based on tests conducted on pre-production systems, and provided to you for informational purposes.

23 Usage Example: High Performance Storage DRAM NAND NVMe SSD 9X More read transactions (Ops/sec) 11X More users Per system Vs. comparable server system with dram and nand nvme drives when using Apache* Cassandra-4.0 Results have been estimated based on tests conducted on pre-production systems, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to 23

24 Average Expected Latency Usage Example: Data Replication with Persistent Memory over Fabric initiator Average 4KB Write I/O Round Trip Time Comparison NVMe+NAND SSD vs. PMoF PCIe & NVMe Protocol Network/Fabric NVM Media Software FABRIC Target Replication to NAND Replication using PMoF Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to *Three 9s and five 9s availability assumes bi-weekly maintenance restarts. 24

25 25

26 Cascade Lake Mitigations for Side-Channel Methods Cascade Lake implements hardware mitigations against targeted side-channel methods Variant Side-Channel Method Mitigation on Cascade Lake Variant 1 Bounds Check Bypass OS/VMM Variant 2 Branch Target Injection Hardware + OS/VMM Variant 3 Rogue Data Cache Load Hardware Variant 3a Rogue System Register Read Firmware Variant 4 Speculative Store Bypass Firmware + OS/VMM or runtime L1 Terminal Fault Hardware Cascade Lake SP expected to provide higher performance over software mitigations available for existing products For additional information related to security updates and side channel methods on Intel products, please visit

28 Future Intel Xeon Scalable Processor (Codename: Cascade Lake-SP) Software Libraries and Optimizations AI/DL Enhancement through VNNI Side-Channel Analysis Mitigations Process Tuning, Frequency Boost, Targeted Performance Improvements Intel Xeon Scalable Platform Further Accelerating Data Center Innovations 28

30 Config Details for Skylake Inference throughput (March 2018) Config 1 Framework caffe caffe caffe caffe caffe caffe caffe caffe neon neon branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: (HEAD branch: master version: f6d01efbe93f 70726ea3796a4b89c a6341 version: f6d01efbe9 3f70726ea3796a4b 89c612365a6341 version: f6d01efbe93 version: f6d01efbe93 version: f6d01efbe93 f70726ea3796a4b89 f70726ea3796a4b89 f70726ea3796a4b89 c612365a6341 c612365a6341 c612365a6341 version: f6d01efbe93f7 0726ea3796a4b89c a6341 version: f6d01efbe93f7 0726ea3796a4b89c a6341 version: f6d01efbe9 3f70726ea3796a4b 89c612365a6341 version: f43cfa2e26f 9c84b0f42fcda50b6 b version: f43cfa2e26f 9c84b0f42fcda50b6 b Platform SKX_8180 SKX_8180 SKX_8180 SKX_8180 SKX_8180 SKX_8180 SKX_8180 SKX_8180 SKX_8180 SKX_8180 Sockets Processor Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Intel(R) Xeon(R) Platinum 8180 CPU Platinum 8180 CPU Platinum 8180 CPU Platinum 8180 CPU Platinum 8180 Platinum 8180 CPU Platinum 8180 CPU Platinum 8180 CPU Platinum 8180 Platinum 2.50GHz / 2.50GHz / 2.50GHz / 2.50GHz / GHz / GHz / 28 cores@ 2.50GHz / 28 cores@ 2.50GHz / 28 cores 2.50GHz / 28 cores 2.50GHz / 28 cores cores cores cores cores BIOS SE5C620.86B SE5C620.86B SE5C620.86B SE5C620.86B SE5C620.86B SE5C620.86B SE5C620.86B SE5C620.86B SE5C620.86B SE5C620.86B Enabled Cores Slots Total Memory GB GB GB GB GB GB GB GB GB GB Memory Configuration 2666 MHz 2666 MHz 2666 MHz Memory Comments Micron Micron Micron Micron Micron Micron Micron Micron Micron Micron Disks OS sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB CentOS Linux CentOS Linux Core Core sda 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc 5.5TB sda 744.1GB,sdb 1.5TB,sdc 5.5TB Ubuntu trusty Ubuntu trusty Hyper Threading ON ON ON ON ON ON ON ON ON ON Turbo ON ON ON ON ON ON ON ON ON ON Topology resnet_50_v1 resnet_50_v1 vgg16 vgg16 vgg16 inception_v3 inception_v3 inception_v3 resnet_50_v2 resnet_50_v2 Batchsize Dataset NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKL2017 MKL2017 Engine version: ae00102be50 6ed0fe2099c6557df2 aa88ad57ec1 version: ae00102be5 06ed0fe2099c6557 df2aa88ad57ec1 version: ae00102be5 version: ae00102be5 version: ae00102be5 06ed0fe2099c6557d 06ed0fe2099c6557d 06ed0fe2099c6557d f2aa88ad57ec1 f2aa88ad57ec1 f2aa88ad57ec1 version: ae00102be506 ed0fe2099c6557df2aa 88ad57ec1 version: ae00102be506 ed0fe2099c6557df2aa 88ad57ec1 version: ae00102be5 06ed0fe2099c6557 df2aa88ad57ec1 version: mklml_lnx_2 version: mklml_lnx_ IP Kernel version generic generic generic generic generic el7.x86_ el7.x86_ el7.x86_ el7.x86_ el7.x86_64 30

31 Config Details for Skylake Inference throughput (April 2018) Config 2 tensorflow tensorflow tensorflow tensorflow tensorflow tensorflow tensorflow tensorflow tensorflow mxnet mxnet mxnet mxnet mxnet mxnet branch: mast branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master branch: master er Framework version: fdb5 version: 024aecf41 version: 024aecf41 version: fdb5664 version: fdb56649 version: fbbc080d version: 9a0d0 version: fdb566 version: 024aecf4149 version: 024aecf4149 version: 024aecf414 version: 024aecf414 version: 024aecf414 version: 024aecf41 version: 024aecf c 4941e11eb643e e11eb643e c8173a c8173a dbc23eef4a cbc c e11eb643e29ceed 41e11eb643e29ceed 941e11eb643e29cee 941e11eb643e29cee 941e11eb643e29cee 4941e11eb643e e11eb643e a50826 ceed3e1c47a115a 9ceed3e1c47a acec8b1111d 26acec8b1111d9c cea8 3ffadf3537b10 a50826acec8b 3e1c47a115ad 3e1c47a115ad d3e1c47a115ad d3e1c47a115ad d3e1c47a115ad ceed3e1c47a115ad ceed3e1c47a115ad acec8b1111d d ad 9cd6f d6f 59b 03bb0006c4 1111d9cd6f 9cd6f Sockets Processor BIOS Intel(R) Xeon(R) Platinum GHz / 28 cores SE5C620.86B Intel(R) Xeon(R) Platinum GHz / 28 cores SE5C620.86B Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Platinum GHz / 28 cores Intel(R) Xeon(R) Intel(R) Xeon(R) Platinum 8180 Platinum GHz / GHz / 28 cores cores Intel(R) Xeon(R) Platinum GHz / 28 cores SE5C620.86B SE5C620.86B.00.0 SE5C620.86B.00.0 SE5C620.86B.00.0 SE5C620.86B.00.0 SE5C620.86B.00. SE5C620.86B.00.0 SE5C620.86B.00.0 SE5C620.86B.0 SE5C620.86B.0 SE5C620.86B SE5C620.86B SE5C620.86B Enabled Cores Slots Total Memory GB GB GB GB GB GB GB GB GB GB GB GB GB GB GB Memory Configuration 12slots / 32 GB 12slots / 32 GB / / Memory Comments Micron Micron Micron Micron Micron Micron Micron Micron Micron Micron Micron Micron Micron Micron Micron Disks OS sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc RS3WC080 HDD 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda RS3WC080 HDD 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda RS3WC080 HDD 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda RS3WC080 HDD 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda RS3WC080 HDD 744.1GB,sdb 1.5TB,sdc 5.5TB CentOS Linux Core sda RS3WC080 HDD 744.1GB,sdb 1.5TB,sdc 5.5TB Ubuntu xenial sda RS3WC080 HDD 744.1GB,sdb 1.5TB,sdc 5.5TB Ubuntu xenial sda RS3WC080 HDD 744.1GB,sdb 1.5TB,sdc 5.5TB Ubuntu xenial sda RS3WC080 HDD 744.1GB,sdb RS3WC080 HDD 1.5TB,sdc RS3WC080 HDD 5.5TB sda RS3WC080 HDD 744.1GB,sdb RS3WC080 HDD 1.5TB,sdc RS3WC080 HDD 5.5TB 12slots / 32 GB / 2666 MHz sda RS3WC080 HDD 744.1GB,sdb RS3WC080 HDD 1.5TB,sdc RS3WC080 HDD 5.5TB Ubuntu Ubuntu Ubuntuxenial xenial xenial Hyper Threading ON ON ON ON ON ON ON ON ON ON ON ON ON ON ON Turbo ON ON ON ON ON ON ON ON ON ON ON ON ON ON ON Topology resnet_50_v1 resnet_50_v1 resnet_50_v1 vgg16 vgg16 vgg16 inception_v3 inception_v3 inception_v3 resnet_50_v2 vgg16 vgg16 inception_v3 inception_v3 inception_v3 Batchsize ,64, , Instances/ Streams on 2 sockets Dataset NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer NoDataLayer Imagenet Imagenet NoDataLayer Imagenet NoDataLayer Imagenet MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN MKLDNN version: f521 version: e0bfcaa7f version: f5218ff4f version: f5218ff4f version: 283c4a8c version: f5218f version: f5218f version: e0bfcaa7fcb2 version: e0bfcaa7fcb2 version: e0bfcaa7fcb version: e0bfcaa7fcb version: e0bfcaa7fcb version: e0bfcaa7fc version: e0bfcaa7fc version: e0bfcaa7fc 8ff4fd2d16d1 Engine b1e1558f5f c b1e1558f5f c 1db807a729 1db807a729 Kernel version el7.x86_ generic 2b1e1558f5f c1db807a729 2b1e1558f5f c1db807a generic generic 2b1e1558f5f c1db807a generic b2b1e1558f5f c1db807a729 b2b1e1558f5f c1db807a generic generic cb2b1e1558f5f06 d2d16d13aada2e d2d16d13aada2e 24b4e1dea05fbc7 f4fd2d16d13aa f4fd2d16d13aa b2b1e1558f5f067 3aada2e632a 76933c1db807a7 632afd18f2514fe 632afd18f2514fe a7375bab da2e632afd18f da2e632afd18f 6933c1db807a e3 e3 ad1 2514fee3 2514fee generic generic fd18f2514fee el7.x86_ el7.x86_ el7.x86_ el7.x8 6_ el7.x8 6_ el7.x 86_64 31

32 Configuration details of Amazon EC2 C5.18xlarge 1 node systems Config 3 Benchmark Segment Benchmark type Benchmark Metric Framework Topology # of Nodes 1 Platform Sockets Processor BIOS Enabled Cores Platform Slots Total Memory Memory Configuration SSD OS Network Configurations HT Turbo Computer Type AI/ML Inference Sentence/Sec Official mxnet GNMT(sockeye) Amazon EC2 C5.18xlarge instance 2S Intel Xeon Platinum 8124M 3.00GHz (Skylake) N/A 18 cores / socket N/A N/A 144GB N/A EBS Optimized 200GB, Provisioned IOPS SSD Red Hat 7.2 (HVM) Amazon Elastic Network Adapter (ENA) Up to 10 Gbps of aggregate network bandwidth Installed Enhanced Networking with ENA on Centos Placed the all instances in the same placement ON ON Server 32

33 Configuration details of Amazon EC2 C5.18xlarge 1 node systems Config 4 Framework Version Topology Version mxnet mkldnn : f6649e329b23a1efdc40aaa25260d47b4195 GNMT: Batch size GNMT: Dataset, version GNMT: WMT 2017 ( MKLDNN MKL F5218ff4fd2d16d13aada2e632afd18f2514fee3 Version: parallel_studio_xe_2018_update1 update1_cluster_edition_online.tgz Compiler g++: gcc:

34 Configuration Details for Inference Throughput with VNNI Config 5 1x inference throughput improvement in July 2017: Tested by Intel as of July 11 th 2017: Platform: 2S Intel Xeon Platinum GHz (28 cores), HT disabled, turbo disabled, scaling governor set to performance via intel_pstate driver, 384GB DDR ECC RAM. CentOS Linux release (Core), Linux kernel el7.x86_64. SSD: Intel SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: ( revision f96b759f71b f690af267158b82b150b5c. Inference measured with caffe time --forward_only command, training measured with caffe time command. For ConvNet topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from (ResNet-50),and (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver , Intel MKL small libraries version Caffe run with numactl -l. 2.8x inference throughput improvement in January 2018: Tested by Intel as of Jan 19 th 2018 Processor :2 socket Intel(R) Xeon(R) Platinum GHz / 28 cores HT ON, Turbo ON Total Memory GB ( ). CentOS Linux Core, SSD sda 744.1GB,sdb 1.5TB,sdc 5.5TB, Deep Learning Framework Intel Optimization for caffe version:f6d01efbe93f70726ea3796a4b89c612365a6341 Topology::resnet_50_v1 BIOS:SE5C620.86B MKLDNN: version: ae00102be506ed0fe2099c6557df2aa88ad57ec1 NoDataLayer.. Datatype:FP32 Batchsize=64 Measured: imgs/sec vs Tested by Intel as of July 11 th 2017: Platform: 2S Intel Xeon Platinum GHz (28 cores), HT disabled, turbo disabled, scaling governor set to performance via intel_pstate driver, 384GB DDR ECC RAM. CentOS Linux release (Core), Linux kernel el7.x86_64. SSD: Intel SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: ( revision f96b759f71b f690af267158b82b150b5c. Inference measured with caffe time --forward_only command, training measured with caffe time command. For ConvNet topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from (ResNet- 50),and (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver , Intel MKL small libraries version Caffe run with numactl -l. 34

35 Configuration Details for Inference Throughput with VNNI Config 5 5.4x inference throughput improvement in August 2018: Tested by Intel as of measured July 26 th 2018 :2 socket Intel(R) Xeon(R) Platinum GHz / 28 cores HT ON, Turbo ON Total Memory GB ( ). CentOS Linux Core, kernel: el7.x86_64, SSD sda 744.1GB,sdb 1.5TB,sdc 5.5TB, Deep Learning Framework Intel Optimization for caffe version:a3d5b022fe026e9092fc7abc7654b1162ab9940d Topology::resnet_50_v1 BIOS:SE5C620.86B MKLDNN: version:464c268e544bae26f9b85a2acb9122c766a4c396 instances: 2 instances : NoDatasocket:2 (Results on Intel Xeon Scalable Processor were measured running multiple instances of the framework. Methodology described herelayer. Datatype: INT8 Batchsize=64 Measured: imgs/sec vs Tested by Intel as of July 11 th 2017:2S Intel Xeon Platinum GHz (28 cores), HT disabled, turbo disabled, scaling governor set to performance via intel_pstate driver, 384GB DDR ECC RAM. CentOS Linux release (Core), Linux kernel el7.x86_64. SSD: Intel SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: ( revision f96b759f71b f690af267158b82b150b5c. Inference measured with caffe time --forward_only command, training measured with caffe time command. For ConvNet topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from (ResNet-50). Intel C++ compiler ver , Intel MKL small libraries version Caffe run with numactl -l. 11X inference thoughput improvement with CascadeLake: Future Intel Xeon Scalable processor (codename Cascade Lake) results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance vs Tested by Intel as of July 11 th 2017: 2S Intel Xeon Platinum GHz (28 cores), HT disabled, turbo disabled, scaling governor set to performance via intel_pstate driver, 384GB DDR ECC RAM. CentOS Linux release (Core), Linux kernel el7.x86_64. SSD: Intel SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance. Caffe: ( revision f96b759f71b f690af267158b82b150b5c. Inference measured with caffe time --forward_only command, training measured with caffe time command. For ConvNet topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from (ResNet-50),. Intel C++ compiler ver , Intel MKL small libraries version Caffe run with numactl -l. 35

Data-Centric Innovation Summit NAVEEN RAO CORPORATE VICE PRESIDENT & GENERAL MANAGER ARTIFICIAL INTELLIGENCE PRODUCTS GROUP

Data-Centric Innovation Summit NAVEEN RAO CORPORATE VICE PRESIDENT & GENERAL MANAGER ARTIFICIAL INTELLIGENCE PRODUCTS GROUP Data center logic silicon Tam ~30% cagr Ai is exploding $8-10B Emerging as a