Dr Christopher Dahnken. SSG DRD EMEA Datacenter


1 Dr Christopher Dahnken SSG DRD EMEA Datacenter

2 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #

3 Agenda
Xeon Phi: Knights Landing
Xeon: Sky Lake
Short outlook
Non-Volatile Memory/Optane

5 KNL Architecture Overview
ISA: Intel Xeon processor binary-compatible (w/ Broadwell).
On-package memory: up to 16GB, ~500 GB/s STREAM at launch.
Platform memory: up to 384GB (6ch MHz).
Fixed bottlenecks: 2D mesh architecture; out-of-order cores; 3x single-thread vs. KNC.
I/O: x4 DMI2 to PCH; 36 lanes PCIe* Gen3 (x16, x16, x4).
KNL package: tiles (up to 36); tile diagram labels: 2VPU, HUB, 1MB L2, 2VPU.
Enhanced Intel Atom cores based on Silvermont microarchitecture.
Legend: EDC (embedded DRAM controller), IMC (integrated memory controller), IIO (integrated I/O controller).

6 KNL Mesh Interconnect
Mesh of rings: every row and column is a (half) ring.
YX routing: go in Y, turn, go in X. Messages arbitrate at injection and on turn.
Coherent interconnect: MESIF protocol (F = Forward); distributed directory to filter snoops.
Three cluster modes: (1) All-to-All, (2) Quadrant, (3) Sub-NUMA Clustering.
Diagram labels: OPIO, PCIe, EDC, IIO, iMC, Misc.
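The YX routing rule on this slide (go in Y, turn, go in X) can be illustrated with a small sketch. The grid coordinates and the function name `yx_route` are assumptions made for illustration; the real mesh also arbitrates messages at injection and on the turn, which is not modelled here.

```python
def yx_route(src, dst):
    """Return the (x, y) tiles visited from src to dst, moving in Y first, then X.

    Coordinates are hypothetical grid positions, not real KNL tile IDs.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]

    # Phase 1: travel along the column (Y direction) until the target row.
    step = 1 if dst_y > y else -1
    while y != dst_y:
        y += step
        path.append((x, y))

    # Phase 2: turn once, then travel along the row (X direction).
    step = 1 if dst_x > x else -1
    while x != dst_x:
        x += step
        path.append((x, y))

    return path


if __name__ == "__main__":
    print(yx_route((0, 0), (2, 1)))
```

With this rule every message makes at most one turn, and the hop count equals the Manhattan distance between source and destination.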

7 Cluster Mode: All-to-All
Address uniformly hashed across all distributed directories.
No affinity between tile, directory and memory.
Lower performance mode compared to other modes. Mainly for fall-back.
Typical read L2 miss:
1. L2 miss encountered.
2. Send request to the distributed directory.
3. Miss in the directory. Forward to memory.
4. Memory sends the data to the requestor.
Diagram steps: 1) L2 miss, 2) Directory access, 3) Memory access, 4) Data return.
Diagram labels: OPIO, PCIe, EDC, IIO, iMC, Misc.

8 Cluster Mode: Quadrant
Chip divided into four virtual quadrants.
Address hashed to a directory in the same quadrant as the memory.
Affinity between the directory and memory.
Lower latency and higher BW than all-to-all. SW transparent.
Diagram steps: 1) L2 miss, 2) Directory access, 3) Memory access, 4) Data return.
Diagram labels: OPIO, PCIe, EDC, IIO, iMC, Misc.
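The difference between the All-to-All and Quadrant modes comes down to which directory an address is hashed to. The sketch below contrasts the two under loudly stated assumptions: the real hash functions are not given in these slides, so the modulo-based hashes, the even split of 36 directories over four quadrants, and all function names are hypothetical.

```python
LINE_BYTES = 64          # cache-line size
N_QUADRANTS = 4          # "four virtual quadrants" from the slide
DIRS_PER_QUADRANT = 9    # assumption: 36 directories split evenly


def line_index(addr):
    """Cache-line number of a byte address."""
    return addr // LINE_BYTES


def memory_quadrant(addr):
    """Hypothetical interleave: consecutive lines rotate over the quadrants."""
    return line_index(addr) % N_QUADRANTS


def directory_all_to_all(addr):
    """All-to-All: hash over every directory, ignoring where the memory lives."""
    return line_index(addr) % (N_QUADRANTS * DIRS_PER_QUADRANT)


def directory_quadrant(addr):
    """Quadrant: pick a directory inside the quadrant that owns the memory."""
    quadrant = memory_quadrant(addr)
    local = (line_index(addr) // N_QUADRANTS) % DIRS_PER_QUADRANT
    return quadrant * DIRS_PER_QUADRANT + local


def quadrant_of_directory(directory):
    """Quadrant that a directory ID belongs to under the even split above."""
    return directory // DIRS_PER_QUADRANT


if __name__ == "__main__":
    addr = 64
    print(memory_quadrant(addr),
          quadrant_of_directory(directory_all_to_all(addr)),
          quadrant_of_directory(directory_quadrant(addr)))
```

In the Quadrant variant the directory and memory always share a quadrant, so steps 2 and 3 of the read flow stay within that quadrant; in the All-to-All variant they can land anywhere on the chip.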

9 Cluster Mode: Sub-NUMA Clustering (SNC)
Each quadrant (cluster) exposed as a separate NUMA domain to OS.
Looks analogous to 4-socket Xeon.
Affinity between tile, directory and memory.
Local communication. Lowest latency of all modes.
SW needs to NUMA optimize to get benefit.
Diagram steps: 1) L2 miss, 2) Directory access, 3) Memory access, 4) Data return.
Diagram labels: OPIO, PCIe, EDC, IIO, iMC, Misc.
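Since SNC exposes each cluster as a NUMA domain, software has to keep threads and their data on the same node. A minimal sketch of the first step on Linux, finding the CPUs of one NUMA node, is shown below. It assumes the standard sysfs layout (`/sys/devices/system/node/nodeN/cpulist`); the helper names are hypothetical.

```python
def parse_cpulist(text):
    """Parse a sysfs CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or trailing comma
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node, sysfs_root="/sys/devices/system/node"):
    """Read the CPUs belonging to one NUMA node (Linux only)."""
    with open(f"{sysfs_root}/node{node}/cpulist") as handle:
        return parse_cpulist(handle.read())


if __name__ == "__main__":
    print(sorted(parse_cpulist("0-3,8")))
    # On a Linux machine the current process could then be pinned with:
    #   import os
    #   os.sched_setaffinity(0, node_cpus(0))
```

Pinning alone does not place memory. Under the usual first-touch behaviour, data tends to end up on the node of the thread that first writes it, so initialization should typically run on the pinned threads as well.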

10 KNL Core and VPU
Out-of-order core w/ 4 SMT threads.
VPU tightly integrated with core pipeline.
2-wide decode/rename/retire.
2x 64B load & 1x 64B store port for D$.
L1 prefetcher and L2 prefetcher.
Fast unaligned and cache-line split support.
Fast gather/scatter support.

11 KNL Memory Modes
Mode selected at boot.
Cache model: MCDRAM cache covers all DDR.
Hybrid model.
Flat models.
Diagram label: Physical Address.

12 MCDRAM: Cache vs Flat Mode
Column labels: DDR only; MCDRAM as cache; MCDRAM only; Flat DDR + MCDRAM; Hybrid. One label is marked "Recommended".
Software effort: No software changes required. Change allocations for bandwidth-critical data.
Performance: Not peak performance. Best performance. Limited memory capacity. Optimal HW utilization + opportunity for new algorithms.
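The flat-mode advice on this slide, "Change allocations for bandwidth-critical data", implies deciding which buffers deserve the limited on-package memory (up to 16GB per the architecture overview). The sketch below is one possible greedy planning heuristic and not a method described in the slides; the function name, the tuple layout and the traffic figures are all hypothetical.

```python
GIB = 2**30


def plan_flat_mode(buffers, fast_bytes=16 * GIB):
    """Assign buffers to fast on-package memory or DDR.

    buffers: list of (name, size_bytes, traffic) tuples, where traffic is a
    hypothetical estimate of bytes moved per time step.
    Greedy rule: serve the highest traffic per byte of capacity first.
    """
    ranked = sorted(buffers, key=lambda b: b[2] / b[1], reverse=True)
    placement = {}
    free = fast_bytes
    for name, size, _traffic in ranked:
        if size <= free:
            placement[name] = "on-package"
            free -= size
        else:
            placement[name] = "DDR"
    return placement


if __name__ == "__main__":
    example = [
        ("grid", 12 * GIB, 120 * GIB),
        ("halo", 2 * GIB, 40 * GIB),
        ("checkpoint", 64 * GIB, 1 * GIB),
    ]
    print(plan_flat_mode(example))
```

A plan like this only says where data should go. Actually placing it requires a NUMA-aware allocator or launch-time memory binding, which is outside the scope of this sketch.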

13 KNL Instruction Set
Xeon 5600 (Nehalem): SSE*
Xeon E5-2600 (Sandy Bridge): SSE*, AVX
Xeon E5-2600v3 (Haswell): SSE*, AVX, AVX2, TSX
Xeon Phi (Knights Landing): SSE*, AVX, AVX2, AVX-512 F, CDI, PF, ER
Xeon (Sky Lake): SSE*, AVX, AVX2, TSX, AVX-512 F, CDI, DQ, BW
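Because Knights Landing and Sky Lake implement different AVX-512 subsets (ER and PF on one, BW and DQ on the other), code that targets both usually has to check which subsets a machine reports. A minimal sketch for Linux follows; it assumes the flag names used in `/proc/cpuinfo` (`avx512f`, `avx512cd`, `avx512er`, `avx512pf`, `avx512bw`, `avx512dq`), and the function names are hypothetical.

```python
# Linux /proc/cpuinfo flag name -> label used on the slide
AVX512_SUBSETS = {
    "avx512f": "AVX-512 F",
    "avx512cd": "AVX-512 CDI",
    "avx512er": "AVX-512 ER",
    "avx512pf": "AVX-512 PF",
    "avx512bw": "AVX-512 BW",
    "avx512dq": "AVX-512 DQ",
}


def avx512_subsets(flags_line):
    """Return the slide labels of the AVX-512 subsets present in a flags string."""
    flags = set(flags_line.split())
    return [label for flag, label in AVX512_SUBSETS.items() if flag in flags]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU listed (Linux only)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""


if __name__ == "__main__":
    sample = "fpu sse2 avx avx2 avx512f avx512cd avx512er avx512pf"
    print(avx512_subsets(sample))
```

Only AVX-512 F and CDI appear in both columns of the slide, so code restricted to those two subsets is the portable common denominator between the two processors.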

15 2-socket+ Intel Xeon Roadmap
Thurley Platform (Intel microarchitecture codenamed Nehalem): Nehalem 45nm (new microarchitecture), Westmere 32nm.
Romley Platform (Intel microarchitecture codenamed Sandy Bridge): Sandy Bridge 32nm (new microarchitecture), Ivy Bridge 22nm.
Grantley Platform (Intel microarchitecture codenamed Haswell): Haswell 22nm (new microarchitecture), Broadwell 14nm.
Purley Platform (Intel microarchitecture codenamed Skylake): Skylake 14nm (new microarchitecture), Future 14nm.
Brickland Platform is Ivy Bridge-EX, Haswell-EX, and Broadwell-EX.
Skylake microarchitecture delivers ~10% (geomean) IPC improvement v. Broadwell.

16 New Skylake Uncore Interconnect Architecture
Broadwell Server: 24-core die, dual-ring interconnect.
Skylake (or Cascade Lake) Server: 28-core die, mesh interconnect.
Broadwell diagram labels: QPI Link, R3QPI, QPI Agent, PCI-E X16/X16/X8/X4 (ESI), PCU, CB DMA, R2PCI, IOAPIC, IIO, Home Agent, Mem Ctlr.
Skylake diagram labels: 2x UPI x20, PCIe* x16, DMI x4, CBDMA, On Pkg PCIe x16, 1x UPI x20, MC.
Legend: CHA = Caching & Home Agent; SF = Snoop Filter.
Mesh interconnect (Skylake Server) replaces dual-ring interconnect (BDW E5/E7).

17 Microarchitecture Enhancements
Core diagram labels: front end (32KB L1 I$, pre-decode, inst Q, decoders, branch prediction unit, μop queue); in-order allocate/rename/retire with reorder buffer, load buffer and store buffer; out-of-order scheduler with ports 0, 1, 5 and 6 (INT and VEC units: ALU, shift, JMP, DIV, LEA, MUL, FMA, shuffle), ports 2 and 3 (load/STA), port 4 (store data) and port 7 (STA); memory (32KB L1 D$, fill buffers, memory control, 1MB L2$).
Broadwell uarch vs. Skylake uarch comparison rows: Out-of-order Window; In-flight Loads + Stores; Scheduler Entries; Registers, Integer + FP; Allocation Queue (56 vs. 64/thread); L1D BW (B/Cyc), Load + Store; L2 Unified TLB (4K+2M: K+2M: G: 16).
Larger and improved branch predictor, higher throughput decoder, larger window to extract ILP.
Improved scheduler and execution engine, improved throughput and latency of divide/sqrt.
More load/store bandwidth, deeper load/store buffers, improved prefetcher.
Data center specific enhancements: Intel AVX-512 with 2 FMAs per core, larger 1MB MLC.
About 10% performance improvement per core on integer applications at same frequency.
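The "Intel AVX-512 with 2 FMAs per core" line translates into a theoretical peak floating-point rate with simple arithmetic: vector lanes, times two operations per fused multiply-add, times the number of FMA units. The sketch below works this out. The 28-core count is taken from the uncore slide, while the 2.0 GHz clock is a purely hypothetical figure, since sustained AVX-512 frequencies are not given in these slides.

```python
def flops_per_cycle(vector_bits, element_bits, fma_units):
    """Peak floating-point operations per core per cycle.

    Each FMA counts as two operations (one multiply, one add) per lane.
    """
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units


def peak_gflops(cores, ghz, per_core_flops_per_cycle):
    """Theoretical peak in GFLOP/s; an upper bound, not a measured result."""
    return cores * ghz * per_core_flops_per_cycle


if __name__ == "__main__":
    dp = flops_per_cycle(512, 64, 2)   # AVX-512, double precision, 2 FMA units
    print(dp)
    print(peak_gflops(28, 2.0, dp))    # 2.0 GHz is a hypothetical clock
```

The same function gives 16 double-precision operations per cycle for a 256-bit vector unit with two FMA units, which shows why the move to 512-bit vectors doubles the per-core peak at equal frequency.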

18 Intel Xeon Scalable Processor Feature Overview
Feature: Details
Socket: Socket P
Scalability: 2S, 4S, 8S, and >8S (with node controller support)
CPU TDP: 70W to 205W
Chipset: Intel C620 Series (code name Lewisburg)
Networking: Intel Omni-Path Fabric (integrated or discrete); 4x10GbE (integrated w/ chipset); 100G/40G/25G discrete options
Compression and Crypto Acceleration: Intel QuickAssist Technology to support 100Gb/s comp/decomp/crypto, 100K RSA2K public key
Storage: Integrated QuickData Technology, VMD, and NTB; Intel Optane SSD, Intel 3D-NAND NVMe & SATA SSD
Security: CPU enhancements (MBE, PPK, MPX); Manageability Engine; Intel Platform Trust Technology; Intel Key Protection Technology
Manageability: Innovation Engine (IE); Intel Node Manager; Intel Datacenter Manager
Platform diagram labels: two Skylake-SP CPUs, each with 3x16 PCIe* Gen3 and 1x 100Gb OPA Fabric, linked by 2 or 3 Intel UPI; DMI to the Lewisburg PCH (Intel QAT, ME, IE, 4x10GbE NIC); TPM; BMC; High Speed IO (USB3, PCIe3, SATA3); GPIO; SPI; eSPI/LPC; firmware; CPU, OPA and memory VRs.
Abbreviations: BMC: Baseboard Management Controller. PCH: Intel Platform Controller Hub. IE: Innovation Engine. Intel OPA: Intel Omni-Path Architecture. Intel QAT: Intel QuickAssist Technology. ME: Manageability Engine. NIC: Network Interface Controller. VMD: Volume Management Device. NTB: Non-Transparent Bridge.

19 Platform Topologies
2S configurations (2S-2UPI & 2S-3UPI shown).
4S configurations (4S-2UPI & 4S-3UPI shown).
8S configuration.
Intel Xeon Scalable Processor supports configurations ranging from 2S-2UPI to 8S.
Diagram labels: LBG, Intel UPI, DMI, 3x16 PCIe*, 1x100G Intel OP Fabric.
