
Introduction: Modern computer architecture
The stored program computer and its inherent bottlenecks
Multi- and manycore chips and nodes

Motivation: Multi-cores, where and why

Introduction: Moore's law

1965: Gordon Moore claimed that the number of transistors on a microchip doubles every 12 to 24 months.

Transistor counts today: Intel Sandy Bridge EP: 2.3 billion; NVIDIA Kepler: 7 billion.

Multi-core: Intel Xeon 2600v3 (2014)

Xeon E5-2600v3 ("Haswell EP"):
- Up to 18 cores running at 2.3 GHz (max 3.6 GHz turbo)
- Simultaneous multithreading: reports as a 36-way chip
- 5.7 billion transistors / 22 nm
- Die size: 662 mm²
- Typically used in 2-socket servers

A deeper dive into core and chip architecture

General-purpose cache-based microprocessor core

The modern CPU core implements the stored-program computer concept (Turing 1936):
- Similar designs on all modern systems
- (Still) multiple potential bottlenecks
- Flexible!

Basic resources of a stored-program computer: instruction execution and data movement

1. Instruction execution

This is the primary resource of the processor. All efforts in hardware design are targeted towards increasing the instruction throughput. Instructions are the concept of "work" as seen by processor designers. Not all instructions count as work as seen by application developers!

Example: adding two arrays A(:) and B(:)

  do i=1, N
    A(i) = A(i) + B(i)
  enddo

Processor work per iteration:
  LOAD r1 = A(i)
  LOAD r2 = B(i)
  ADD r1 = r1 + r2
  STORE A(i) = r1
  INCREMENT i
  BRANCH -> top if i < N

User work: N flops (ADDs)

Basic resources of a stored-program computer: instruction execution and data movement

2. Data transfer

Data transfers are a consequence of instruction execution and are therefore a secondary resource. Maximum bandwidth is determined by the request rate of executed instructions and by technical limitations (bus width, speed).

Example: adding two arrays A(:) and B(:)

  do i=1, N
    A(i) = A(i) + B(i)
  enddo

Data transfers per iteration:
  8 byte: LOAD r1 = A(i)
  8 byte: LOAD r2 = B(i)
  8 byte: STORE A(i) = r1
  Sum: 24 byte

Crucial question: What is the bottleneck, data transfer or code execution?
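A quick worked estimate based on these numbers (the bandwidth figure is an assumption, not from the slides): the loop moves 24 bytes and performs 1 flop (the ADD) per iteration, i.e., a code balance of 24 byte/flop. With a typical socket memory bandwidth of roughly 40 GB/s, the best possible performance is about 40 GB/s / 24 byte/flop ≈ 1.7 GFlop/s, far below the arithmetic capability of the chip. If the arrays reside in main memory, data transfer is clearly the bottleneck.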

Microprocessors: Pipelining

Pipelining of arithmetic/functional units

Idea:
- Split a complex instruction into several simple/fast steps (stages)
- Each step takes the same amount of time, e.g., a single cycle
- Execute different steps of different instructions at the same time (in parallel)

This allows for shorter cycle times (simpler logic circuits). Example: a floating-point multiplication takes 5 cycles, but the processor can work on 5 different multiplications simultaneously, delivering one result per cycle once the pipeline is full.

Drawbacks:
- The pipeline must be filled: startup times matter unless #instructions >> #pipeline stages
- Efficient use of pipelines requires a large number of independent instructions (instruction-level parallelism)
- Requires complex instruction scheduling by the compiler/hardware (software pipelining, out-of-order execution)

Pipelining is widely used in modern computer architectures.

5-stage multiplication pipeline: A(i) = B(i) * C(i), i = 1, ..., N

The first result is available after 5 cycles (the latency of the pipeline)! During the wind-up and wind-down phases some pipeline stages are empty.
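A short worked example of what this means for throughput (standard pipeline arithmetic, not from the slides): with m = 5 stages and N independent multiplications, pipelined execution needs about N + m - 1 = N + 4 cycles, versus m * N = 5N cycles without pipelining. For N = 100 that is 104 instead of 500 cycles, a speedup of about 4.8, approaching the stage count of 5 for large N.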

Pipelining: the instruction pipeline

Besides the arithmetic/functional units, instruction execution itself is also pipelined. Each instruction performs at least 3 steps:
1. Fetch instruction from L1I
2. Decode instruction
3. Execute instruction

[Diagram: successive instructions overlap in time; while instruction 1 executes, instruction 2 is decoded and instruction 3 is fetched from L1I.]

- Branches can stall this pipeline! (Mitigations: speculative execution, predication)
- Each unit is pipelined itself (e.g., "Execute" = multiply pipeline)

Microprocessors: Superscalarity and simultaneous multithreading

Superscalar processors: instruction-level parallelism

Multiple execution units enable the use of instruction-level parallelism (ILP): the instruction stream is parallelized on the fly.

[Diagram: four fetch/decode/execute pipelines working on the instruction stream in parallel; 4-way superscalar.]

- Issuing m concurrent instructions per cycle: m-way superscalar
- Modern processors are 3- to 6-way superscalar and can perform 2 floating-point instructions per cycle
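Superscalar and pipelined units only pay off if the instruction stream contains enough independent instructions. A minimal sketch of how this is exposed in source code, here for a sum reduction (the function name and the unroll factor of 4 are illustrative choices; compilers can often do this themselves when allowed to reorder floating-point additions):

  ! Minimal sketch: exposing ILP in a sum reduction by keeping four
  ! independent partial sums (note: this changes the FP summation order).
  function sum4(a, n) result(s)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n)
    real(8) :: s, s1, s2, s3, s4
    integer :: i, nr
    s1 = 0.0d0; s2 = 0.0d0; s3 = 0.0d0; s4 = 0.0d0
    nr = n - mod(n, 4)
    do i = 1, nr, 4            ! four independent dependency chains
       s1 = s1 + a(i)
       s2 = s2 + a(i+1)
       s3 = s3 + a(i+2)
       s4 = s4 + a(i+3)
    end do
    s = s1 + s2 + s3 + s4
    do i = nr + 1, n           ! remainder loop
       s = s + a(i)
    end do
  end function sum4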

Core details: simultaneous multithreading (SMT)

[Diagram: a standard core compared with a 2-way SMT core, illustrating the SMT principle.]

Microprocessors: Single Instruction Multiple Data (SIMD), a.k.a. vectorization

Core details: SIMD processing

Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on wide registers.

x86 SIMD instruction sets:
- SSE: register width = 128 bit, i.e., 2 double-precision floating-point operands
- AVX: register width = 256 bit, i.e., 4 double-precision floating-point operands

[Diagram: adding two registers holding double-precision operands. Scalar execution: ADD [R0,R1] -> R2 processes one 64-bit operand pair. SIMD execution: V64ADD [R0,R1] -> R2 processes four 64-bit operand pairs in a single 256-bit instruction.]
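In application code, SIMD is typically exploited by writing unit-stride loops without loop-carried dependencies and letting the compiler generate SSE/AVX instructions. A minimal sketch (the subroutine name is illustrative; the !$omp simd directive from OpenMP 4.0 merely asserts that vectorization is safe):

  ! Minimal sketch: a unit-stride, dependency-free loop that a compiler
  ! can map to SSE/AVX (4 doubles per AVX instruction).
  subroutine vadd(a, b, c, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(out) :: a(n)
    real(8), intent(in)  :: b(n), c(n)
    integer :: i
    !$omp simd
    do i = 1, n
       a(i) = b(i) + c(i)
    end do
  end subroutine vadd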

Microprocessors: Memory hierarchy

Registers and caches: data transfers in a memory hierarchy

Caches help with getting instructions and data to the CPU fast. How does data travel from memory to the CPU and back?

- Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except to/from registers)
- MISS: a load or store instruction does not find the data in a cache level, so a cache line (CL) transfer is required
- A store miss triggers a write-allocate: the cache line is loaded before it can be modified; the modified line is evicted (written back) later

Example: the array copy A(:) = C(:) causes 3 CL transfers per cache line of data (load C, write-allocate A, evict A).
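A quick worked example of the resulting traffic, assuming 8-byte (double-precision) elements: per copied element, A(:) = C(:) transfers 8 bytes to load C, 8 bytes for the write-allocate of A, and 8 bytes to evict the modified A line, i.e., 24 bytes instead of the 16 bytes one might naively expect. On processors that support non-temporal ("streaming") stores the write-allocate can be avoided and the traffic drops to 16 bytes per element; whether the compiler emits such stores depends on the compiler and its flags.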

From UMA to ccNUMA: basic architecture of commodity multi-socket nodes

Yesterday (2006): dual-socket Intel Core2 node
- Uniform Memory Architecture (UMA): "flat" memory, symmetric MPs
- But: system anisotropy

Today: dual-socket node
- Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
- HyperTransport / QPI provide scalable bandwidth, at the price of ccNUMA: Where does my data finally end up?
- On AMD it is even more complicated: ccNUMA within a socket!
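On Linux, memory pages are usually mapped into the ccNUMA domain of the core that first writes to them ("first touch"). A minimal sketch of NUMA-aware initialization, assuming the arrays are later processed with the same OpenMP loop schedule (names and the static schedule are illustrative):

  ! Minimal sketch: first-touch placement on a ccNUMA node.
  ! Initialize with the same thread/loop mapping that the compute loop
  ! uses, so each socket later works on data in its local memory.
  subroutine firsttouch_init(a, b, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(out) :: a(n), b(n)
    integer :: i
    !$omp parallel do schedule(static)
    do i = 1, n
       a(i) = 0.0d0
       b(i) = 0.0d0
    end do
    !$omp end parallel do
  end subroutine firsttouch_init

The compute loop (e.g., a(i) = a(i) + b(i)) should then use the same schedule and thread pinning so that each thread accesses memory local to its own socket.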

Cray XC30 "SandyBridge-EP" 8-core dual-socket node
- 8 cores per socket, 2.7 GHz (3.5 GHz turbo)
- DDR3 memory interface with 4 channels per chip
- Two-way SMT
- Two 256-bit SIMD FP units (SSE4.2, AVX)
- 32 KB L1 data cache per core
- 256 KB L2 cache per core
- 20 MB L3 cache per chip

There is no single driving force for chip performance!

Floating-point (FP) peak performance: P = n_core * F * S * f

Example: Intel Xeon "Sandy Bridge EP" socket (4-, 6-, 8-core variants available)
- n_core = number of cores: 8
- F = FP instructions per cycle: 2 (1 MULT and 1 ADD)
- S = FP operations per instruction: 4 (DP) / 8 (SP), with 256-bit AVX SIMD registers
- f = clock speed: 2.7 GHz

P = 173 GFlop/s (DP) / 346 GFlop/s (SP), roughly the performance of the TOP500 rank 1 system of the mid-90s.
But: P = 5.4 GFlop/s for serial, non-SIMD code.
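Plugging in the numbers for this chip: P = 8 cores x 2 instructions/cycle x 4 flops/instruction x 2.7 GHz = 172.8 GFlop/s (double precision). Serial, non-SIMD code uses one core and one operand per instruction, i.e., at best 2 x 2.7 GHz = 5.4 GFlop/s, a factor of 32 (8 cores x 4-wide SIMD) below the socket peak.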

Interlude: A glance at current accelerator technology

NVIDIA Kepler GK110 block diagram

Architecture:
- 7.1 billion transistors
- 15 SMX units, 192 (SP) cores each
- > 1 TFlop/s DP peak
- 1.5 MB L2 cache
- 384-bit GDDR5 memory interface
- PCI Express Gen3
- 3:1 SP:DP performance ratio

(Block diagram © NVIDIA Corp., used with permission.)

Intel Xeon Phi block diagram

Architecture:
- 3 billion transistors
- 60+ cores
- 512-bit SIMD
- ~1 TFlop/s DP peak
- 0.5 MB L2 cache per core
- GDDR5 memory
- 2:1 SP:DP performance ratio
- 64 byte/cy

Trading single-thread performance for parallelism: GPGPUs vs. CPUs

GPU vs. CPU "light speed" estimate:
1. Compute bound: 2-10x
2. Memory bandwidth: 1-5x

                       Intel Core i5 2500    Intel Xeon E5-2680 DP node   NVIDIA K20x
                       ("Sandy Bridge")      ("Sandy Bridge")             ("Kepler")
Cores @ clock          4 @ 3.3 GHz           2 x 8 @ 2.7 GHz              2880 @ 0.7 GHz
Performance+ / core    52.8 GFlop/s          43.2 GFlop/s                 1.4 GFlop/s
Threads @ STREAM       < 4                   < 16                         > 8000
Total performance+     210 GFlop/s           691 GFlop/s                  4000 GFlop/s
STREAM bandwidth       18 GB/s               2 x 40 GB/s                  168 GB/s (ECC on)
Transistors / TDP      1 billion* / 95 W     2 x (2.27 billion / 130 W)   7.1 billion / 250 W

+ single precision; * includes on-chip GPU and PCI-Express; the K20x column refers to the complete compute device.
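Reading the ratios off this table: the K20x delivers about 4000/691 ≈ 5.8x the single-precision performance and 168/80 ≈ 2.1x the STREAM bandwidth of the dual-socket CPU node, in line with the 2-10x (compute) and 1-5x (bandwidth) light-speed estimates above. The price is that several thousand threads are needed to reach those numbers, versus at most 16 on the CPU node.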

Node topology and programming models

Parallelism in a modern compute node

Parallel and shared resources within a shared-memory node:

[Diagram: dual-socket node with two GPUs attached via PCIe; the numbers mark the resources listed below.]

Parallel resources:
- Execution/SIMD units (1)
- Cores (2)
- Inner cache levels (3)
- Sockets / ccNUMA domains (4)
- Multiple accelerators (5)

Shared resources:
- Outer cache level per socket (6)
- Memory bus per socket (7)
- Intersocket link (8)
- PCIe bus(es) (9)
- Other I/O resources (10)

How does your application react to all of these details?

Scalable and saturating behavior

Clearly distinguish between saturating and scalable performance on the chip level:
- Shared resources (e.g., the memory bus) may show saturating performance
- Parallel resources (e.g., cores, SIMD units) show scalable performance
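To make the distinction concrete, here is a minimal sketch of two OpenMP kernels with opposite behavior (array names and the inner iteration count are illustrative; the actual saturation point depends on the machine):

  ! Minimal sketch: saturating vs. scalable behavior with OpenMP.
  subroutine scaling_demo(a, b, c, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(inout) :: a(n)
    real(8), intent(in)    :: b(n), c(n)
    integer :: i, k

    ! (1) Memory-bound triad: uses the shared memory bus; performance
    !     saturates once a few cores draw the full socket bandwidth.
    !$omp parallel do schedule(static)
    do i = 1, n
       a(i) = b(i) + 1.5d0 * c(i)
    end do
    !$omp end parallel do

    ! (2) Compute-bound kernel: many flops per loaded byte; performance
    !     scales with the number of cores (a parallel resource).
    !$omp parallel do schedule(static) private(k)
    do i = 1, n
       do k = 1, 200
          a(i) = a(i) * 0.999999d0 + 1.0d-6
       end do
    end do
    !$omp end parallel do
  end subroutine scaling_demo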

Parallel programming models on modern compute nodes

Shared-memory (intra-node):
- Good old MPI
- OpenMP
- POSIX threads
- Intel Threading Building Blocks (TBB)
- Cilk+, OpenCL, StarSs, you name it

Distributed-memory (inter-node):
- MPI
- PVM (gone)

Hybrid:
- Pure MPI
- MPI + OpenMP
- MPI + any shared-memory model
- MPI (+ OpenMP) + CUDA/OpenCL/...

All models require awareness of topology and affinity issues to get the best performance out of the machine!

Parallel programming models: pure MPI

- Machine structure is invisible to the user: very simple programming model
- "MPI knows what to do"!?
- Performance issues:
  - Intranode vs. internode MPI
  - Node/system topology

Parallel programming models: pure threading on the node

- Machine structure is invisible to the user: very simple programming model
- The threading software (OpenMP, POSIX threads, TBB, ...) should know about the details
- Performance issues:
  - Synchronization overhead
  - Memory access
  - Node topology

Parallel programming models: lots of choices

Hybrid MPI+OpenMP on a multicore multisocket cluster, for example (a code skeleton follows below):
- One MPI process per node, OpenMP threads pinned round-robin across the cores of the node
- One MPI process per socket, OpenMP threads pinned blockwise on the same socket
- Two MPI processes per socket, OpenMP threads on the same socket
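A minimal sketch of the hybrid model (not from the slides): each MPI process spawns an OpenMP team; how processes and threads are mapped onto nodes, sockets, and cores is decided at launch time (e.g., via the MPI launcher and OMP_PLACES/affinity settings), which is exactly where the topology choices above come in.

  ! Minimal hybrid MPI+OpenMP skeleton (illustrative; error handling omitted).
  program hybrid_hello
    use mpi
    use omp_lib
    implicit none
    integer :: ierr, provided, rank, nranks, tid, nthreads

    ! Request FUNNELED: only the master thread makes MPI calls.
    call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

    !$omp parallel private(tid, nthreads)
    tid      = omp_get_thread_num()
    nthreads = omp_get_num_threads()
    print '(4(a,i0))', 'rank ', rank, ' of ', nranks, &
          ', thread ', tid, ' of ', nthreads
    !$omp end parallel

    call MPI_Finalize(ierr)
  end program hybrid_hello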

Conclusions about architecture

Modern computer architecture has a rich "topology".

Node-level hardware parallelism takes many forms:
- Sockets/devices: CPU 1-8, GPGPU 1-6
- Cores: moderate (CPU: 4-16) to massive (GPGPU: 1000s)
- SIMD: moderate (CPU: 2-8) to massive (GPGPU: 10s-100s)
- Superscalarity (CPU: 2-6)

Exploiting performance = parallelism + bottleneck awareness: "High Performance Computing == computing at a bottleneck."

Performance of programs is sensitive to architecture:
- Topology/affinity influences the overheads of popular programming models
- Standards do not contain (many) topology-aware features; things are starting to improve slowly (MPI 3.0, OpenMP 4.0)
- Apart from overheads, performance features are largely independent of the programming model