Intel Architecture for Software Developers

Presenter: Dr. Heinrich Bockhorst Date:



Agenda
- Introduction
- Processor Architecture Basics
- Intel Architecture: Intel Core and Intel Xeon, Intel Atom, Intel Xeon Phi Coprocessor
- Use Cases for Software Developers: Intel Core and Intel Xeon, Intel Atom, Intel Xeon Phi Coprocessor
- Summary

Chips and Dies
A chip is the package containing one or more dies (silicon). The major components of a die are easily identified. Example of a typical die (die shot of an Intel Core processor): Processor Graphics, four Cores, System Agent, L3 Cache, Memory Controller, I/O.

Moore's Law
"The number of transistors on a chip will double approximately every two years." [Gordon Moore]
(Moore's Law graph, 1965; source: http://en.wikipedia.org/wiki/Moore's_law)

Parallelism
Problem: The economical operating frequency of (CMOS) transistors is limited. No free lunch anymore!
Solution: More transistors allow more gates/logic in the same die space and power envelope, improving parallelism:
- Thread level parallelism (TLP): multi- and many-core
- Data level parallelism (DLP): wider vectors (SIMD)
- Instruction level parallelism (ILP): microarchitecture improvements, e.g. threading, superscalarity, ...

History: Intel 64 and IA-32 Architecture
1978: 8086 and 8088 processors
1982: Intel 286 processor
1985: Intel386 processor
1989: Intel486 processor
1993: Intel Pentium processor
1995-1999: P6 family: Intel Pentium Pro, Intel Pentium II [Xeon], Intel Pentium III [Xeon], and Intel Celeron processors
2000-2006: Intel Pentium 4 processor
2005: Intel Pentium processor Extreme Edition
2001-2007: Intel Xeon processor
2003-2006: Intel Pentium M processor
2006/2007: Intel Core Duo and Intel Core Solo processors
2006/2007: Intel Core 2 processor
2008: Intel Atom processor and Intel Core i7 processor family
2010: Intel Core processor family
2011: Second generation Intel Core processor family
2012: Third generation Intel Core processor family
2013: Fourth generation Intel Core processor family
2013: Intel Atom processor family based on Silvermont microarchitecture

Processor Architecture Basics: Pipeline
Executing an instruction requires several stages, split between the front end (fetch, decode) and the back end (execute, load/store, commit):
1. Fetch: read instruction (bytes) from memory
2. Decode: translate instruction (bytes) into the microarchitecture's internal operations
3. Execute: perform the operation with a functional unit
4. Memory: load (read) or store (write) data, if required
5. Commit: retire the instruction and update the micro-architectural state

Processor Architecture Basics: Naïve Pipeline: Serial Execution
Characteristics of strict serial execution:
- Only one instruction in the entire pipeline (all stages)
- Low complexity
- Execution time: n_instructions * t_p, where t_p is the time to pass through the whole pipeline
Problem: inefficient, because only one stage is active at any time.

Processor Architecture Basics: Pipelined Execution
Characteristics of pipelined execution:
- Multiple instructions in the pipeline (one per stage)
- Efficient, because all stages are kept active at every point in time
- Execution time: n_instructions * t_s, where t_s is the time per stage
Problem (reality check): what happens if t_s is not constant?

Processor Architecture Basics: Pipeline Stalls
Example instruction stream:
  1: mov $0x1, %rax
  2: mov 0x1234, %rbx
  3: add %rbx, %rax
Pipeline stalls:
- Caused by pipeline stages that take longer than one cycle (e.g. the load in instruction 2)
- Caused by dependencies, since order has to be maintained (instruction 3 needs %rbx from instruction 2)
- Execution time: n_instructions * t_avg, with t_s <= t_avg <= t_p
Problem: stalls lower the pipeline throughput and leave stages idle.

Processor Architecture Basics: Reducing the Impact of Pipeline Stalls
The impact of pipeline stalls can be reduced by:
- Branch prediction
- Superscalarity with multiple-issue fetch & decode
- Out-of-order execution
- Caches
- Non-temporal stores
- Prefetching
- Line fill buffers
- Load/store buffers
- Alignment
- Simultaneous multithreading (SMT)
These are characteristics of the architecture that might require user action!

Processor Architecture Basics: Superscalarity
Characteristics of a superscalar architecture:
- Improves throughput by covering latency (the execution ports are independent)
- Ports can have different functionalities (floating point, integer, addressing, ...)
- Requires multiple-issue fetch & decode (here: 2-issue)
- Execution time: n_instructions * t_avg / n_ports
Problem: more complex, and prone to stalls in case of dependencies.
Solution: out-of-order execution.

Processor Architecture Basics: Out-of-Order Execution
Example instruction stream:
  1: mov $0x1, %rax
  2: mov $0x2, %rbx
  3: add %rbx, %rax
  4: add $0x0, %r8
  5: add $0x1, %r9
  6: add $0x2, %r10
Characteristics of out-of-order (OOO) execution:
- The instruction queue (I-Queue) moves stalling instructions out of the pipeline
- The reorder buffer (ROB) maintains the correct order of committing instructions
- Reduces pipeline stalls, but not entirely!
- Makes speculative execution possible
The opposite of OOO execution is in-order execution.

Processor Architecture Basics: Cache
Example instruction stream:
  1: mov $0x1, %rax
  2: mov 0x1234, %rbx   (cache miss)
  3: mov 0x1234, %rcx   (cache hit)
Cache:
- Small (some KiB), faster than main memory
- Cache hierarchy: L1, L2 & L3 (LLC)
- Reused data can remain in cache (here: address 0x1234)
Problem: data coherency has to be maintained, via a cache coherency protocol (e.g. MESI).

Processor Architecture Basics: Cache Hierarchy & Cache Line
Example instruction stream:
  1: mov 0x1234, %rbx
  2: mov 0x5678, %rcx
  3: mov %rcx, 0x1234
Cache hierarchy:
- Separate L1 caches for instructions (I) and data (D), backed by the unified L2 and L3 caches
- Usually inclusive caches
- Races for shared resources
- Can improve access speed
- Cache misses and cache hits occur at each level
Cache line:
- Always a full 64-byte block
- The minimal granularity of every load/store
- Modifications invalidate the entire cache line (dirty bit)

Processor Architecture Basics: Prefetching
Act ahead of time: continuous data streams could stutter due to caching. Two solutions to avoid this:
- Hardware prefetch: the processor's memory management tries to detect access patterns and automatically loads cache lines before they are accessed. As with branch prediction, the behavior is subject to the underlying architecture and processor generation.
- Software prefetch: the programmer can manually add PREFETCH instructions to selectively load cache lines before they are accessed. Software prefetching also needs to be adjusted for each architecture and generation.
Prefetching only helps smooth the flow; it won't increase bandwidth!
Consider tuning prefetching only as the very last step of application tuning!

Processor Architecture Basics: Example: 4th Generation Intel Core
From the Intel 64 and IA-32 Architectures Optimization Reference Manual: a block diagram of the pipeline, showing instruction fetch with pre-decode, the I-Queue, branch prediction (BTB/BPU), decode & dispatch into the ROB and load/store buffers, and a scheduler issuing to ports 0-7: integer/FP execution on ports 0, 1, 5 and integer execution on port 6; store data on port 4; load & store address on ports 2 and 3; store address on port 7; backed by the L1 data cache with line fill buffers (LFB) and the L2 cache.

Processor Architecture Basics: Core vs. Uncore
Core - the processor core's logic:
- Execution units
- Core caches (L1/L2)
- Buffers & registers
Uncore - everything outside a processor core:
- Memory controller/channels (MC) for DDR3 or DDR4, and Intel QuickPath Interconnect (QPI)
- L3 cache, shared by all cores
- Power management and clocking
- Optionally: integrated graphics
Within the same processor family, only the uncore differentiates the products!

Processor Architecture Basics: UMA and NUMA
UMA (aka non-NUMA): Uniform Memory Access
- Addresses are interleaved across the memory nodes by cache line
- Accesses may or may not have to cross a QPI link
- Provides good portable performance without tuning
NUMA: Non-Uniform Memory Access
- Addresses are not interleaved across the memory nodes by cache line
- Each processor has direct access to a contiguous block of memory
- Provides peak performance, but requires special handling

Processor Architecture Basics: NUMA - Thread Affinity & Enumeration
Non-NUMA: thread affinity might be beneficial (e.g. for cache locality), but is not required.
NUMA: thread affinity is required to favor accesses to local memory over remote memory. Ensure that 3rd-party components support affinity mapping, e.g.:
- Intel TBB via set_affinity()
- Intel OpenMP* via $OMP_PLACES
- Intel MPI via $I_MPI_PIN_DOMAIN
The right way to get the enumeration of cores: Intel 64 Architecture Processor Topology Enumeration, https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration

Processor Architecture Basics: NUMA - Memory, Bandwidth & Latency
Memory allocation:
- Differentiate implicit vs. explicit memory allocation
- Explicit allocation with NUMA-aware libraries, e.g. libnuma (Linux*)
- Bind memory to the (SW) thread, and the (SW) thread to a processor
More information on optimizing for performance: https://software.intel.com/de-de/articles/optimizing-applications-for-numa
Performance:
- Remote memory access latency is ~1.7x greater than local
- Local memory bandwidth can be up to ~2x greater than remote

Clock Rate and Power Gating
Control power utilization and performance:
Clock rate:
- Idle components can be clocked lower to save energy: Intel SpeedStep
- If the thermal specification allows, components can be over-clocked for higher performance: Intel Turbo Boost Technology
- Base frequency is at P1; P0 might be turbo, with the processor deciding how far to boost depending on how many cores/GPU are active
- Intel Turbo Boost Technology 2.0: the processor can decide whether turbo mode may exceed the TDP with higher frequencies
Power gating: turn off components to save power
- P-state: processor state; low latency; combined with SpeedStep
- C-state: chip state; longer latency, more aggressive static power reduction

Documentation
Intel 64 and IA-32 Architectures Software Developer Manuals:
- Volume 1: Basic Architecture
- Volume 2: Instruction Set Reference
- Volume 3: System Programming Guide
- Software Optimization Reference Manual
- Related specifications, application notes, and white papers
https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html?iid=tech_vt_tech+64-32_manuals
Intel Processor Numbers (how the type names are encoded): http://www.intel.com/products/processor_number/eng/

Desktop, Mobile & Server: Big Core

Desktop, Mobile & Server: Tick/Tock Model
Tock = innovate (new microarchitecture); Tick = shrink (new process technology):
- Nehalem (2008): new microarchitecture, 45nm (Tock) - Intel Core
- Westmere (2010): new process technology, 32nm (Tick)
- Sandy Bridge (2011): new microarchitecture, 32nm (Tock) - 2nd Generation Intel Core
- Ivy Bridge (2012): new process technology, 22nm (Tick) - 3rd Generation Intel Core
- Haswell (2013): new microarchitecture, 22nm (Tock) - 4th Generation Intel Core
- Broadwell (2014): new process technology, 14nm (Tick)
- Skylake: new microarchitecture, 14nm (Tock)
- Future: new process technology, 11nm (Tick); new microarchitecture, 11nm (Tock)

Desktop, Mobile & Server: Characteristics
Processor core:
- 4-issue superscalar, out-of-order execution
- Simultaneous multithreading: Intel Hyper-Threading Technology with 2 HW threads per core
Multi-core:
- Intel Core processor family: up to 4 cores (desktop & mobile)
- Intel Xeon processor family: up to 15 cores (server)
Caches:
- Three-level cache hierarchy L1/L2/L3 (Nehalem and later)
- 64-byte cache line

Desktop, Mobile & Server: Caches
Cache hierarchy (per core: L1 I/D and unified L2; L3 shared by all cores), example for Haswell:

Level                            Latency (cycles)           Bandwidth (per core per cycle)   Size
L1-D                             4                          2x 16 bytes                      32 KiB
L2 (unified)                     12                         1x 32 bytes                      256 KiB
L3 (LLC)                         26-31                      1x 32 bytes                      varies (~2 MiB per core)
L2/L1-D cache in another core    43 (clean hit), 60 (dirty hit)

Desktop, Mobile & Server: Performance
Following Moore's Law:

Microarchitecture   Instruction Set   SP FLOPs/cycle/core   DP FLOPs/cycle/core   L1 Cache BW (bytes/cycle)   L2 Cache BW (bytes/cycle)
Nehalem             SSE (128-bit)     8                     4                     32 (16B read + 16B write)   32
Sandy Bridge        AVX (256-bit)     16                    8                     48 (32B read + 16B write)   32
Haswell             AVX2 (256-bit)    32                    16                    96 (64B read + 32B write)   64

Examples of theoretical peak FLOP rates:
- Intel Core i7-2710QE (Sandy Bridge): 2.1 GHz * 16 SP FLOPs * 4 cores = 134.4 SP GFLOPs
- Intel Core i7-4765T (Haswell): 2.0 GHz * 32 SP FLOPs * 4 cores = 256 SP GFLOPs

Thank you!

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
