Mike Greenfield, Intel MultiCore 7 Workshop September 27 and 28, 2017 National Center for Atmospheric Research in Boulder, Colorado



2 Legal Disclaimers
Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer. For more complete information about performance and benchmark results, visit
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. For more information go to
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
No computer system can be absolutely secure.
Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K.
Intel, the Intel logo, Xeon, Intel vPro, Intel Xeon Phi, and Look Inside. are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.
Intel Corporation.

3 Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

4 Key Points for today's session
- Earth System Models/Centers exploiting Intel Architecture today
- Intel Roadmap Update: KNL, the Intel Xeon Scalable Processor family, AVX-512, SSF and 3D XPoint
- Selected notes on E2E Workload Optimization and Performance Portability
- Advancing the performance of Earth System Models on systems based upon Intel Architecture


5 Earth System Model Innovation on Intel Architecture
- Hetero IO Optimizations 84
- Climate at NERSC: 3/20 NESAP ESM projects; ESM exploited 8% of CORI (Source: R Gerber, ICAS 2017)
- Plus a huge portion of contemporary Operational Forecast and Climate Research Centers


6 Intel Scalable System Framework: A Holistic Solution for All HPC Needs
Small Clusters Through Supercomputers; Compute and Data-Centric Computing; Standards-Based Programmability; On-Premise and Cloud-Based
- Compute: Intel Xeon Processors, Intel Xeon Phi Processors, Intel FPGAs and Server Solutions
- Memory / Storage: Intel Solutions for Lustre*, Intel Optane Technology, 3D XPoint Technology, Intel SSDs
- Fabric: Intel Omni-Path Architecture, Intel Silicon Photonics, Intel Ethernet
- Software: Intel HPC Orchestrator, Intel Software Tools, Intel Cluster Ready Program, Intel Supported SDVis


7 How It Works
Innovative Technologies, Tight Integration and Co-Design, Reference Architecture
- Compute (Industry-Leading Compute): Intel Xeon Processors, Intel Xeon Phi Processors, Intel FPGAs
- Fabric (Fast, Cost-Effective Data Movement): Intel Omni-Path Architecture, Intel Silicon Photonics, Intel Ethernet
- Memory / Storage (Fast, Reliable Access to Data): Intel SSDs, 3D XPoint Technology, Intel Optane Technology, Intel Solutions for Lustre* software
- Software (Ease of Deployment and Management): Intel SW Defined Visualization, Intel HPC Orchestrator, Intel Software Tools, Intel Cluster Ready Program
Benefits: Compatibility, Bandwidth, Density, Latency, Power, Cost
*Other names and brands may be claimed as the property of others.


8 Intel Xeon Phi Processor Architecture
- Self-Boot Processor: binary-compatibility with Xeon, 3+ TFLOPS (DP, theoretical peak performance)
- On-package memory: 16GB, up to 490 GB/s STREAM TRIAD
- Platform memory: up to 384GB (6ch DDR4-2400 MHz)
Other Key Features:
- 2D Mesh Architecture
- Out-of-Order Cores
- 3X Single-Thread vs. KNC
- Intel AVX-512 Instructions
- Scatter/Gather Engine
- Integrated Fabric
Diagram labels: Processor Package with up to 36 tiles; each tile has a HUB, 1MB L2, and 2 VPUs per core; enhanced Intel Atom cores based on Silvermont Microarchitecture; 4 MCDRAM blocks; DDR4; OPA; x4 DMI2 to PCH; 36 lanes PCIe* Gen3 (x16, x16, x4); EDC (Embedded DRAM Controller); IMC (Integrated Memory Controller); IIO (Integrated I/O Controller)
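The "3+ TFLOPS (DP)" theoretical peak on this slide can be cross-checked with a short calculation. The sketch below is illustrative only. It assumes two cores per tile (so 36 tiles give 72 cores), that each of the 2 VPUs per core retires one 512-bit fused multiply-add per cycle, and it borrows the 72-core, 1.50 GHz figures that the next slide quotes for the Intel Xeon Phi 7290 baseline. The function name `peak_dp_gflops` is hypothetical.

```python
def peak_dp_gflops(cores, ghz, vpus_per_core=2, vector_bits=512, fma=True):
    """Theoretical double-precision peak in GFLOP/s.

    Assumption: every VPU issues one full-width vector operation per cycle,
    and an FMA counts as two floating-point operations.
    """
    dp_lanes = vector_bits // 64                  # 8 doubles in a 512-bit vector
    flops_per_cycle = vpus_per_core * dp_lanes * (2 if fma else 1)
    return cores * ghz * flops_per_cycle


# 36 tiles x 2 cores per tile (assumed) = 72 cores at 1.50 GHz
knl_7290_gflops = peak_dp_gflops(cores=72, ghz=1.5)
print(knl_7290_gflops / 1000.0, "TFLOP/s")
```

This is a paper number: sustained clocks under heavy vector load are typically lower than the nominal frequency, so measured throughput will fall below it.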

9 Delivering Performance for Deep Learning Workloads
Accelerate Hardware Capabilities: delivering hardware optimized for deep learning
- Available now: start training models today using Intel Xeon Phi
- Available late 2017: Intel Xeon Phi Processor Knights Mill, up to 4x performance over current processor for Deep Learning workloads*
Optimize Deep Learning Software: optimized via framework & library enhancements
- Directly optimized frameworks: optimizing these frameworks on Intel Xeon & Xeon Phi processors
- Libraries/Languages: Intel MKL, MKL-DNN, Nervana Graph; tuned for Intel processors, current & next generation
Align Developer Ecosystem: ensure Intel solutions are easy to use and readily available
- Training: benefit from expert-led trainings, hands-on workshops, exclusive remote access, and more
- Tools: gain access to the latest libraries, frameworks, tools, and technologies from Intel to accelerate your AI project
- Community: collaborate with industry luminaries, developers, students, and Intel engineers
*Intel Xeon Phi processor Knights Mill up to 4x estimated performance improvement over Intel Xeon Phi processor 7290. BASELINE: Intel Xeon Phi Processor 7290 (16GB, 1.50 GHz, 72 core) with 192 GB Total Memory on Red Hat Enterprise Linux* 6.7 kernel using MKL 11.3 Update 4, Relative performance 1.0. Knights Mill: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to

10 2-socket+ Intel Xeon Roadmap
- Thurley Platform (Intel Microarchitecture Codenamed Nehalem): Nehalem (45nm, new microarchitecture), Westmere (32nm)
- Romley Platform (Intel Microarchitecture Codenamed Sandy Bridge): Sandy Bridge (32nm, new microarchitecture), Ivy Bridge (22nm)
- Grantley Platform (Intel Microarchitecture Codenamed Haswell): Haswell (22nm, new microarchitecture), Broadwell (14nm)
- Purley Platform (Intel Microarchitecture Codenamed Skylake): Skylake (14nm, new microarchitecture), Cascade Lake
Brickland Platform is Ivy Bridge-EX, Haswell-EX, and Broadwell-EX.
Skylake microarchitecture delivers ~10% (geomean) IPC improvement vs. Broadwell.


11 Intel Xeon Scalable processors: The Foundation for Agile, Secure, Workload-Optimized Hybrid Cloud
- Top tier: up to 28 cores; 2, 4 & 8 socket support; up to DDR4 2666 MHz; up to 1.5 TB topline memory channel bandwidth; highest accelerator throughput; up to 3 UPI links
- Mainstream: up to 22 cores; 2 & 4 socket support; up to 3 UPI links; advanced reliability, availability and serviceability
- Efficient (moderate tasks): scalable performance at low power; standard RAS; Intel Turbo Boost Technology and Intel Hyper-Threading Technology for moderate workloads
- Entry (light tasks): entry scalable performance; hardware-enhanced security; standard RAS; entry performance, price sensitive, for light workloads


12 Typical 2-socket configuration
Intel Xeon E5 v4 (2016):
- Four DDR4 memory channels, up to 24 DIMMs
- Up to 80 PCIe lanes
- Two QPI links (up to 9.6 GT/s)
Intel Xeon Scalable, Purley (2017):
- Six DDR4 memory channels, up to 24 DIMMs
- Up to 96 PCIe lanes
- Two UPI links (up to 10.4 GT/s); up to 3 UPI links in 4S and 8S configurations
- Integrated Intel Omni-Path Architecture (Fabric)
Diagram labels: CPU to CPU over Intel QPI (E5 v4) or Intel UPI (Purley); DDR4 DIMMs; DMI and PCIe* uplinks to LBG**; 3x16 PCIe* and 1x100G Intel OP Fabric per CPU
** PCIe* uplink connection for Intel QuickAssist Technology and Intel Ethernet
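The move from four to six DDR4 channels can be put into rough numbers. The following sketch computes theoretical peak DRAM bandwidth per socket. It assumes a 64-bit (8-byte) data path per channel, DDR4-2666 on Purley (the "DDR4 2666" figure from the Xeon Scalable slide), and DDR4-2400 on Xeon E5 v4, which is an assumption not stated on this slide. The function name is hypothetical.

```python
def peak_dram_gb_per_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth per socket in GB/s (1 GB = 1e9 bytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


e5_v4_bw = peak_dram_gb_per_s(channels=4, mega_transfers_per_s=2400)   # assumed DDR4-2400
purley_bw = peak_dram_gb_per_s(channels=6, mega_transfers_per_s=2666)  # DDR4-2666

print(f"E5 v4:  {e5_v4_bw:.1f} GB/s per socket")
print(f"Purley: {purley_bw:.1f} GB/s per socket")
print(f"ratio:  {purley_bw / e5_v4_bw:.2f}x")
```

These are upper bounds from channel count and transfer rate alone; sustained bandwidth measured by a benchmark such as STREAM is lower.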

13 New Mesh Interconnect Architecture
Diagram: Intel Xeon Processor E7 family (24-core die, ring interconnect with QPI agents, PCIe, Home Agents and memory controllers) compared with Intel Xeon Scalable Processor (28-core die, 2D mesh of SKX tiles with 2x UPI x20, PCIe x16, DMI x4, CBDMA, on-package PCIe x16, 1x UPI x20, and DDR4 memory controllers)
Legend: CHA: Caching and Home Agent; SF: Snoop Filter; Last Level Cache; SKX: Skylake Server; UPI: Intel UltraPath Interconnect
Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies

14 Re-Architected L2 & L3 Cache Hierarchy
- Previous architectures: shared L3 (inclusive); L2 256KB private per core
- Intel Xeon Scalable Processor architecture: shared L3 (non-inclusive); L2 1MB private per core
On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):
- Shared-distributed: shared-distributed L3 is primary cache
- Private-local: private L2 becomes primary cache with shared L3 used as overflow cache
Shared L3 changed from inclusive to non-inclusive:
- Inclusive (prior architectures): L3 has copies of all lines in L2
- Non-inclusive (Skylake architecture): lines in L2 may not exist in L3
Skylake-SP cache hierarchy architected specifically for the data center use case

15 Intel Advanced Vector Extensions-512 (AVX-512)
- 512-bit wide vectors
- 32 operand registers
- 8 64b mask registers
- Embedded broadcast
- Embedded rounding
Microarchitecture, instruction set, and FLOPs per cycle:
- Skylake: Intel AVX-512 & FMA
- Haswell / Broadwell: Intel AVX2 & FMA
- Sandy Bridge: Intel AVX (256b), 16 SP / 8 DP FLOPs per cycle
- Nehalem: SSE (128b), 8 SP / 4 DP FLOPs per cycle
Intel AVX-512 Instruction Types:
- AVX-512-F: AVX-512 Foundation Instructions
- AVX-512-VL: Vector Length Orthogonality, the ability to operate on sub-512 vector sizes
- AVX-512-BW: 512-bit Byte/Word support
- AVX-512-DQ: Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
- AVX-512-CD: Conflict Detect, used in vectorizing loops with potential address conflicts
Powerful instruction set for data-parallel computation
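The per-cycle figures this slide gives for Sandy Bridge (16 SP / 8 DP) and Nehalem (8 SP / 4 DP) follow from vector width and the number of floating-point operations issued per lane each cycle. The sketch below reproduces them under the assumption that those cores issue one vector add and one vector multiply per cycle. The FMA rows are computed under a further assumption of two FMA units per core (four operations per lane per cycle); those results are derived here and are not read from the slide.

```python
def flops_per_cycle(vector_bits, element_bits, ops_per_lane_per_cycle):
    """Peak floating-point operations per core per cycle."""
    lanes = vector_bits // element_bits
    return lanes * ops_per_lane_per_cycle


# (name, vector width in bits, ops per lane per cycle)
# add + multiply ports -> 2 ops; two FMA units (assumed) -> 4 ops
designs = [
    ("Nehalem SSE", 128, 2),
    ("Sandy Bridge AVX", 256, 2),
    ("Haswell/Broadwell AVX2 + FMA (assumed 2 FMA units)", 256, 4),
    ("Skylake AVX-512 + FMA (assumed 2 FMA units)", 512, 4),
]

table = {
    name: (flops_per_cycle(bits, 32, ops), flops_per_cycle(bits, 64, ops))
    for name, bits, ops in designs
}

for name, (sp, dp) in table.items():
    print(f"{name}: {sp} SP, {dp} DP FLOPs/cycle")
```

Parts with a single 512-bit FMA unit would reach half of the Skylake figure computed here.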

16 Frequency Behavior While Running Intel AVX Code
- Cores running non-AVX, Intel AVX2 light/heavy, and Intel AVX-512 light/heavy code have different turbo frequency limits
- Frequency of each core is determined independently based on workload demand
Code type and all-core frequency limit:
- SSE, AVX2-Light (without FP & int-mul): Non-AVX All Core Turbo
- AVX2-Heavy (FP & int-mul), AVX512-Light (without FP & int-mul): AVX2 All Core Turbo
- AVX512-Heavy (FP & int-mul): AVX512 All Core Turbo
Chart labels: frequency levels Non-AVX_Turbo, AVX2_Turbo, AVX512_Turbo, Non-AVX_Base, AVX2_Base, AVX512_Base; mixed workloads with cores using AVX-512, cores using AVX2, and cores not using AVX

17 Performance and Efficiency with Intel AVX
Charts (LINPACK performance for SSE4.2, AVX, AVX2, AVX512): GFLOPs, system power (W) and frequency (GHz); GFLOPs/Watt normalized to SSE4.2; GFLOPs/GHz normalized to SSE4.2
Intel AVX-512 delivers significant performance and efficiency gains
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR per CPU, 1 DPC. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

18 Integrated Intel Omni-Path Architecture
Platform Benefits: Maximized I/O Density per Node
- Up to two additional PCIe x16 slots are available for maximizing I/O density (note 1)
- Significantly more I/O capacity for compute or storage nodes (note 1)
Diagram labels: Compute Node (GPUs, SKX-F, OPA HFI, IFP Cable, IFT Card); Storage Node or File System Server (SKX-F or SKX, OPA HFI); Intel Xeon Processor-F with HFI
SKUs with integrated Intel Omni-Path Architecture fabric (columns: Class, SKU, Cores, Base Non-AVX Speed (GHz), TDP (W)): Platinum x F, Platinum 8160F, Gold 6148F, Gold 6142F, Gold 6138F, Gold 6130F, Gold 6126F
Note 1: For illustrative purposes only. Assumes each CPU socket is configured with all 48 PCIe lanes routed to three x16 slots, or 96 total lanes for a 2S Purley platform. PCIe slot count and PCIe device support will vary by OEM platform, so check with your OEM for more details.


19 3D XPoint
1. 3D XPoint is the next generation non-volatile memory technology by Intel and Micron.
2. Intel SSDs with 3D XPoint media came to market in 2017 (Optane).
3. DDR4 socket compatible, Intel DIMMs based on 3D XPoint technology will be supported on next generation data center platform, code-named Purley.
4. On a 2S Xeon server or workstation, Intel DIMMs can offer up to 2X system memory capacity at significantly lower cost per GB than DRAM.
5. Intel DIMMs can deliver big memory benefits to existing OS and apps without any modification in the OS or apps.
6. Intel DIMMs will co-exist with conventional DDR4 DRAM DIMMs on same platform.
7. Intel has sampled DIMMs to select customers.
8. Intel DIMMs will be supported on a new version of Skylake in mid

20 Premise
- End to End Earth System Simulations
- Architecture Citations (conventional and novel)
- Kernel Extractions from End to End Earth System Simulations
Goal: Simulate more representative systems at greater fidelity, faster, consuming less power and with maximum developer productivity
Basic Principles:
- Balanced System
- Data Centric Approaches
- Decrease latency at every level of integration
- Minimize data movement and reformatting
- Exploit Industry Standards
- Respect and enhance customer IP
- Accommodate load imbalance
- Exploit and improve Cores, Vectors, Memory

21 Earth System Model Proxy
- How often is the profile spread across a moderate number of functions?
- How often can kernel speedups and kernel energy reductions translate into complete E2E workload speedups?
- The benefit depends on managing the balance of kernel speedup against data movement and other overheads
- Durability of the optimized code across the range of simulation use cases
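The question of how far a kernel speedup carries into the end-to-end workload is the classic Amdahl's law trade-off, extended with a term for overheads such as data movement. The sketch below is illustrative: the fractions and speedups are hypothetical inputs, not measurements from any model, and `e2e_speedup` is a name chosen for this example.

```python
def e2e_speedup(kernel_fraction, kernel_speedup, added_overhead=0.0):
    """End-to-end speedup when only part of the runtime is accelerated.

    kernel_fraction: share of the original runtime spent in the optimized kernel
    kernel_speedup:  speedup achieved on that kernel alone
    added_overhead:  extra cost (data movement, reformatting) introduced by the
                     optimization, as a fraction of the original runtime
    """
    new_time = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup + added_overhead
    return 1.0 / new_time


# Hypothetical flat profile: the hot kernel is only 30% of the runtime.
flat_profile = e2e_speedup(kernel_fraction=0.30, kernel_speedup=4.0)

# Same kernel, but the optimization adds 5% of runtime in data movement.
with_overhead = e2e_speedup(kernel_fraction=0.30, kernel_speedup=4.0, added_overhead=0.05)

print(f"4x kernel speedup, 30% of profile: {flat_profile:.2f}x end to end")
print(f"with 5% added data movement:       {with_overhead:.2f}x end to end")
```

With a profile spread across many functions, even a large kernel speedup yields a modest end-to-end gain, which is why the slide asks about profile shape first.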

22 Geometry Mapping
Grid of Physical System:
- Sizes may be dictated by physics or by collected sensor data
SW Instantiation of the model:
- Decomposition in ranks, threads and vector loops
- Possible cache blocking
Machine Geometry:
- Number of cores
- Vector SIMD width
- Cache sizes and levels
Adaptive work decomposition; load balance
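The mapping from grid to machine can be sketched as a three-level split: grid columns are divided across ranks, each rank's share across threads, and each thread's share into full SIMD vectors plus a remainder loop. This is a minimal illustration with hypothetical sizes and function names. It uses a static, near-even block split and does not implement the adaptive decomposition the slide mentions.

```python
def block_range(n, parts, index):
    """Half-open range [start, stop) of block `index` when n items are split
    into `parts` contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def decompose(n_columns, ranks, threads, simd_width):
    """Return (rank, thread, lo, hi, full_vectors, remainder) per thread."""
    plan = []
    for r in range(ranks):
        r0, r1 = block_range(n_columns, ranks, r)
        for t in range(threads):
            t0, t1 = block_range(r1 - r0, threads, t)
            lo, hi = r0 + t0, r0 + t1
            full_vectors, remainder = divmod(hi - lo, simd_width)
            plan.append((r, t, lo, hi, full_vectors, remainder))
    return plan


# Hypothetical case: 1000 grid columns, 3 ranks, 4 threads per rank,
# 8 doubles per 512-bit vector.
plan = decompose(n_columns=1000, ranks=3, threads=4, simd_width=8)
for entry in plan[:4]:
    print(entry)
```

The remainder column shows how much of each thread's work falls outside full vectors, which is one place where grid sizes dictated by physics interact with the machine's SIMD width.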



26 Current Best Known Methods
Methods, ordered by benefit: Vectorization, AVX-512, Hybridization, Resource, Blocking
- Search and eliminate scalability limiters (iterative process)
- 512 bit vector registers
- Masking Architecture
- Full Intel compiler and library support
- 4 way NUMA platform
- 18c/socket
- Huge memory and IO capacity
Blocking, App: C = A * B
    do j = 1, n
      do i = 1, m
        do k = 1, l
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        enddo
      enddo
    enddo
HBM (High Bandwidth Memory): DDRn Numa, HBMn Numa
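The blocking method named on this slide can be illustrated on the same C = A * B loop nest. The sketch below is written in Python for readability rather than in the slide's Fortran, and it is not the presenter's implementation. The block size is a hypothetical tuning parameter that would normally be chosen so the working set of a block fits in a cache level.

```python
def matmul_naive(a, b):
    """Same loop order as the slide: j outermost, then i, then k."""
    m, l, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            for k in range(l):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Cache-blocked variant: the three loops are tiled so each block of
    a, b and c is reused while it is still resident in cache."""
    m, l, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for jj in range(0, n, block):
        for ii in range(0, m, block):
            for kk in range(0, l, block):
                # min() handles edge blocks when sizes are not multiples of block
                for j in range(jj, min(jj + block, n)):
                    for i in range(ii, min(ii + block, m)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + block, l)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
c_naive = matmul_naive(a, b)
c_blocked = matmul_blocked(a, b, block=2)
print(c_blocked)
```

Both versions accumulate each c(i,j) in ascending k order, so they produce identical results; only the memory access pattern changes. In production code a tuned library routine would normally replace a hand-written loop nest.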

27 Position Paper on High Performance Computing Needs in Earth System Prediction (National Earth System Prediction Capability, Silver Spring, MD), April 28, 2017
Selected quotes:
- "We advocate for a shift in processor design to increase emphasis on memory bandwidth, so that Earth System models run more efficiently and better serve the public need."
- "The present high-water mark of 6% of peak performance achieved for a well designed weather and climate prediction model [14] falls short of what is needed to advance weather and climate prediction in the next decade."
References:
[1] Carman, Jessie, Thomas Clune, Francis Giraldo, Mark Govett, Brian Gross, Anke Kamrath, Tsengdar Lee, David McCarren, John Michalakes, Scott Sandgathe, Tim Whitcomb. Position paper on high performance computing needs in Earth system prediction. National Earth System Prediction Capability.
[14] Muller, Andreas, et al. Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA, submitted to the International Journal of High-Performance Computing Applications,


28 Tuning the implementation of the radiation scheme ACRANEB2
Per Berg and Jacob Weismann Poulsen, DMI, 2nd ESCAPE Dissemination and Training Workshop, September 2017
Case study for the refactoring of radiation kernels: explore portability vs performance and power impact
Key conclusions:
- Portable Competitive Performance
- SW Refactoring >> Porting impact for Perf and Energy
- Both CPU and GPU are performant (in both absolute and relative sense) in perf and energy when independently refactored.
Source: Tuning the implementation of the radiation scheme ACRANEB2, Per Berg and Jacob Weismann Poulsen, DMI, 2nd ESCAPE Dissemination and Training Workshop, Sept 2017. SKX-8180 measurements provided by Intel, Sept 2017.

29 Additional Insights from case study
- Optimizations were more durable across several generations of Xeon and KNL than across several generations of GPU.
- Speedups are less interesting and speak mostly to legacy; only the time to solution should be pursued and compared.
- GPU optimized version for the main loop currently requires 4x sloc-count (affects development and maintainability).
Source: Personal communications with Per Berg and Jacob Weismann Poulsen, DMI
Source: Tuning the implementation of the radiation scheme ACRANEB2, Per Berg and Jacob Weismann Poulsen, DMI, 2nd ESCAPE Dissemination and Training Workshop, Sept 2017
Setting Expectations for the Performance Portability between Companion Accelerator and Many Core Systems, John M Levesque, Cray, DOE Center of Excellence Performance Portability meeting, 2017
Quote:
- The best performance on the GPU does not perform well on KNL and state-of-the-art Xeon
- The best performance on KNL performs well on Xeon and okay on the GPU

30 Summary
- The Intel Xeon Scalable Processors: now available
- Part of the Intel Scalable System Framework; spans Processors, Memory, Storage, CFS, Fabric, Software
- Balanced System design critical for advancing ESM simulation capability; basic principles introduced
- ESM kernel optimizations greatly outpace generalized End to End workload improvements
- DMI ACRANEB2 SW refactoring shows value of SW refactoring and limits of performance portability
