Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks. Multi- and manycore chips and nodes


1 Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks. Multi- and manycore chips and nodes

2 Multi-core today: Intel Xeon 2600 v4 (2016) Xeon E5-2600 v4 "Broadwell EP": Up to 22 cores running at 2.2+ GHz (+ "Turbo Mode": 3.5+ GHz) Simultaneous Multithreading reports as 44-way chip 7.2 billion transistors / 14 nm Die size: 456 mm² 2-socket server Optional: Cluster on Die (CoD) mode 2017: Skylake architecture Mesh instead of ring interconnect CoD → Sub-NUMA clustering Up to 28 cores, 2+ GHz (c) RRZE 2018 Basic Architecture 2

3 A deeper dive into core and chip architecture

4 General-purpose cache-based microprocessor core The modern CPU core implements the stored-program computer concept (Turing 1936) Similar designs on all modern systems (Still) multiple potential bottlenecks Flexible! (c) RRZE 2018 Basic Architecture 4

5 Basic resources on a stored program computer: instruction execution and data movement 1. Instruction execution This is the primary resource of the processor. All efforts in hardware design are targeted towards increasing the instruction throughput. Instructions are the concept of "work" as seen by processor designers. Not all instructions count as work as seen by application developers! Example: Adding two arrays A(:) and B(:) do i=1, N A(i) = A(i) + B(i) enddo Processor work: LOAD r1 = A(i) LOAD r2 = B(i) ADD r1 = r1 + r2 STORE A(i) = r1 INCREMENT i BRANCH top if i<N User work: N Flops (ADDs) (c) RRZE 2018 Basic Architecture 5

6 Basic resources on a stored program computer: instruction execution and data movement 2. Data transfer Data transfers are a consequence of instruction execution and therefore a secondary resource. Maximum bandwidth is determined by the request rate of executed instructions and technical limitations (bus width, speed). Example: Adding two arrays A(:) and B(:) do i=1, N A(i) = A(i) + B(i) enddo Data transfers: 8 byte: LOAD r1 = A(i) 8 byte: LOAD r2 = B(i) 8 byte: STORE A(i) = r1 Sum: 24 byte Crucial question: What is the bottleneck? Data transfer? Code execution? (c) RRZE 2018 Basic Architecture 6

7 From high level code to actual execution for(i=0; i<n; ++i) sum += a[i]; The compiler generates: ..label: addsd xmm1, [rdi+rdx*8] inc rdx cmp rax, rdx jb ..label addsd: add 2nd argument to 1st argument and store result in 1st argument inc: register increment cmp: compare register contents jb: jump to label if loop continues Register use: &a[0] in rdi, sum in xmm1, i in rdx, n in rax (c) RRZE 2018 Basic Architecture 7

8 Architectural features in the (single) core Pipelining: instruction execution in multiple steps (Fetch from L1I → Decode → Execute) Superscalarity: multiple instructions per cycle [Figure: several pipelined Fetch/Decode/Execute instruction streams running in parallel] Single Instruction Multiple Data (SIMD): multiple operations per instruction, e.g., C[0..3] = A[0..3] + B[0..3] in one instruction Simultaneous Multi-Threading: multiple instruction sequences in parallel (c) RRZE 2018 Basic Architecture 8

9 Microprocessors Pipelining

10 Pipelining of arithmetic/functional units Idea: Split complex instruction into several simple / fast steps (stages) Each step takes the same amount of time, e.g., a single cycle Execute different steps on different instructions at the same time (in parallel) Benefits: Core can work on 5 independent instructions simultaneously (for a 5-stage pipeline) One instruction finished each cycle after the pipeline is full Drawbacks: Pipeline must be filled; a large number of independent instructions is required Requires complex instruction scheduling by hardware (out-of-order execution) or compiler (software pipelining) Pipelining is widely used in modern computer architectures (c) RRZE 2018 Basic Architecture 10

11 5-stage Multiplication Pipeline: A(i)=B(i)*C(i); i=1,...,N First result is available after 5 cycles (= latency of pipeline)! Wind-up/-down phases: empty pipeline stages (c) RRZE 2018 Basic Architecture 11

12 Pipelining: The instruction pipeline Besides the arithmetic & functional units, instruction execution itself is also pipelined, e.g., one instruction performs at least 3 steps: Fetch from L1I → Decode instruction → Execute [Figure: staggered Fetch/Decode/Execute stages for instructions 1-4 over time t] Branches can stall this pipeline! (Speculative Execution, Predication) Each unit is pipelined itself (e.g., Execute = Multiply Pipeline) (c) RRZE 2018 Basic Architecture 12

13 Microprocessors Superscalarity and Simultaneous Multithreading

14 Superscalar Processors Instruction Level Parallelism Multiple units enable use of Instruction Level Parallelism (ILP): the instruction stream is parallelized on the fly [Figure: four pipelined Fetch/Decode/Execute streams running in parallel: 4-way superscalar] Example: LOAD STORE MULT ADD Issuing m concurrent instructions per cycle: m-way superscalar Modern processors are 3- to 6-way superscalar & can perform 2 floating point instructions per cycle (c) RRZE 2018 Basic Architecture 14

15 Superscalar processors executing multiple instructions concurrently for(int i=1; i<n; ++i) a[i] = a[i] + s; [Figure: cycle-by-cycle instruction schedule; LOAD latency: 4 cy, ADD latency: 3 cy, STORE latency: 2 cy; loads, adds and stores of successive iterations overlap] Correct interleaving / reordering of the instruction streams: Out-Of-Order (OOO) execution Steady state: 3 instructions/cy ("3-way superscalar execution") Instructions Per Cycle: IPC=3 Cycles Per Instruction: CPI=0.33 (c) RRZE 2018 Basic Architecture 15

16 Core details: Simultaneous multi-threading (SMT) SMT principle (2-way example): 2-way SMT vs. standard core (c) RRZE 2018 Basic Architecture 16

17 Microprocessors Single Instruction Multiple Data (SIMD) a.k.a. vectorization

18 Core details: SIMD processing Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers x86 SIMD instruction sets: SSE: register width = 128 bit → 2 double precision floating point operands AVX: register width = 256 bit → 4 double precision floating point operands AVX-512: you guessed it! Adding two registers holding double precision floating point operands: Scalar execution: R2 ← ADD [R0,R1] processes one 64-bit operand pair, e.g., C[0] = A[0] + B[0] SIMD execution: V64ADD [R0,R1] → R2 processes all four 64-bit operand pairs of the 256-bit registers at once: C[0..3] = A[0..3] + B[0..3] (c) RRZE 2018 Basic Architecture 18
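The 4-wide AVX pattern can be sketched in plain C; a compiler typically generates the actual vector instructions from the simple scalar loop (e.g., with -O3 and a suitable -march flag), so the manual 4-way unrolling below only illustrates what one 256-bit "vector iteration" covers:

```c
/* Scalar vs. SIMD-style execution of C[i] = A[i] + B[i].
   One AVX vector ADD processes 4 doubles; the 4-way unrolled
   loop below mimics that grouping in portable C. */
void vadd_simd_style(double *c, const double *a, const double *b, long n) {
    long i = 0;
    for (; i + 4 <= n; i += 4) {     /* one "vector iteration" = 4 adds */
        c[i]   = a[i]   + b[i];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
    for (; i < n; ++i)               /* scalar remainder loop */
        c[i] = a[i] + b[i];
}
```

Note the remainder loop: when n is not a multiple of the SIMD width, the leftover elements must be handled by scalar (or masked) iterations.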

19 There is no single driving force for single core performance! Maximum floating point (FP) performance: P_core = n_super^FP * n_FMA * n_SIMD * f with n_super^FP [inst./cy]: FP superscalarity, n_FMA: FMA factor, n_SIMD [ops/inst.]: SIMD factor, f [Gcy/s]: clock speed. Typical representatives:

Microarchitecture   n_FMA   n_SIMD   Launch    Model
Nehalem             1       2        Q1/2009   X
Westmere            1       2        Q1/2010   X
Sandy Bridge        1       4        Q1/2012   E5
Ivy Bridge          1       4        Q3/2013   E5-2660 v2
Haswell             2       4        Q3/2014   E5-2695 v3
Broadwell           2       4        Q1/2016   E5-2699 v4
Skylake             2       8        Q3/2017   Gold
IBM POWER8          2       2        Q2/2014   S822LC

(c) RRZE 2018 Basic Architecture 19

20 Microprocessors Memory Hierarchy

21 Von Neumann bottleneck reloaded: the DRAM gap DP peak performance and peak main memory bandwidth for a single Intel processor (chip): approx. 15 F/B Main memory access speed not sufficient to keep the CPU busy Recently: mainly driven by SIMD (and FMA) Remedy: introduce fast on-chip caches, holding copies of recently used data items (c) RRZE 2018 Basic Architecture 21

22 Registers and caches: Data transfers in a memory hierarchy Caches help with getting instructions and data to the CPU fast How does data travel from memory to the CPU and back? Remember: Caches are organized in cache lines (CL, e.g., 64 bytes) Only complete cache lines are transferred between memory hierarchy levels (except registers) MISS: Load or store instruction does not find the data in a cache level → CL transfer required Example: Array copy A(:)=C(:) LD C(1): MISS → CL transfer of C's line ST A(1): MISS → write-allocate CL transfer of A's line LD C(2..N_cl) and ST A(2..N_cl): HIT evict A's modified CL (delayed) → 3 CL transfers per cache line (c) RRZE 2018 Basic Architecture 23

23 New kid on the block: AMD Epyc AMD Epyc 24-core processor ("Naples") 24 cores per socket 4 chips w/ 6 cores each ("Zeppelin" die) 3 cores share 8 MB L3 ("Core Complex", CCX) DDR4-2666 memory interface with 2 channels per chip MemBW per node: 16 ch x 8 byte x 2.666 GHz = 341 GB/s Two-way SMT Two 256-bit (actually 4x 128-bit) SIMD FP units AVX2, 8 flops/cycle 32 KiB L1 data cache per core 512 KiB L2 cache per core 2x 8 MiB L3 cache per chip 64 MiB L3 cache per socket ccNUMA memory architecture Infinity Fabric between CCXs and between chips (c) RRZE 2018 Basic Architecture 28

24 Interlude: A glance at current accelerator technology NVidia Pascal GP100 vs. Intel Xeon Phi Knights Landing vs. NEC SX-Aurora Tsubasa

25 NVidia Pascal GP100 block diagram Architecture 15.3 B transistors ~1.4 GHz clock speed Up to 60 SM units 64 SP "cores" each 32 DP "cores" each 2:1 SP:DP performance 5.7 TFlop/s DP peak 4 MB L2 Cache 4096-bit HBM2 MemBW ~732 GB/s (theoretical) MemBW ~510 GB/s (measured) (c) RRZE 2018 Basic Architecture NVIDIA Corp. 30

26 Intel Xeon Phi Knights Landing block diagram [Block diagram: 36 tiles (72 cores) max. on a 2D mesh; per tile: 2 cores (P) with 4 threads (T) each, 2 VPUs per core, 32 KiB L1 per core, 1 MiB L2 shared per tile; 8 MCDRAM devices, 6 DDR4 channels] Architecture 8 B transistors Up to 1.5 GHz clock speed Up to 36x2 cores (2D mesh) 2x 512-bit SIMD units (VPU) each 4-way SMT 3.5 TFlop/s DP peak (SP 2x) 36 MiB L2 Cache 16 GiB MCDRAM MemBW ~470 GB/s (measured) Large DDR4 main memory MemBW ~90 GB/s (measured) (c) RRZE 2018 Basic Architecture 31

27 Trading single thread performance for parallelism: GPGPUs vs. CPUs GPU vs. CPU light speed estimate (per device): MemBW ~5-10x, Peak ~5-10x

                      2x Intel Xeon E5-2697v4   Intel Xeon Phi 7250   NVidia Tesla P100
                      Broadwell                 Knights Landing       Pascal
Cores@Clock           2x 18 @ 2.3 GHz           68 @ 1.4 GHz          56 @ ~1.3 GHz
SP Performance/core   73.6 GFlop/s              89.6 GFlop/s          ~166 GFlop/s
Threads@STREAM        ~8                        ~50                   >
SP peak               2.6 TFlop/s               6.1 TFlop/s           ~9.3 TFlop/s
Stream BW (meas.)     2x 62.5 GB/s              450 GB/s (MCDRAM)     510 GB/s
Transistors / TDP     ~2x7 Billion / 2x145 W    8 Billion / 215 W     14 Billion / 300 W

(c) RRZE 2018 Basic Architecture 32

28 SX-Aurora TSUBASA Architecture May 30th, 2018 Shintaro Momose NEC Deutschland GmbH The information in this material is public; it may be reused in other material without NDA by citing NEC as the source. Copyright NEC, all rights reserved

29 SIMD vs. vector processing [Figure: scalar, SIMD, and SX vector pipelines processing input to result] SX vector processing is more efficient than (short-)SIMD; SX is a "SIMD-vector" architecture

30 SX-Aurora TSUBASA 2018 Technology: 16 nm FinFET CPU Frequency: 1.4/1.6 GHz CPU Performance: 2150/2457 Gflops CPU Memory Bandwidth: 1228 GB/sec

31 Vector Engine Processor [Block diagram: 8 cores, two 8 MB LLC blocks, six HBM2 stacks attached via HBM I/F, 2D-mesh interconnect] Memory Subsystem Memory Bandwidth: 1.22 TB/s LLC Bandwidth: 3.0 TB/s Bandwidth/core: 400 GB/s LLC/Core: 2 MB Core Vector Length = 256 words (256 x 64 bit = 16 kb/instruction) 307.2 GF @ 1.6 GHz 268.8 GF @ 1.4 GHz Processor 8 cores 2.45 TF @ 1.6 GHz 2.15 TF @ 1.4 GHz Memory bandwidth: 1.22 TB/s

32 SPU (Scalar Processing Unit) Memory bandwidth: 1.22 TB/s per processor (avg. 150 GB/s per core) LLC bandwidth: 400 GB/s per core [Figure: single vector core with SPU and the hierarchy of registers/LLC/memory]

33 Vector Execution [Figure: a vector register holds 256 elements (256e x 64 bit); 64 such registers (128 kB total); 3 FMA units, each 32 elements wide] Vector Length = 256e (32e x 8 cycles) 307.2 GF = 2 Flops (FMA) x 3 FMA units x 32 x 1.6 Gcy/s

34 Node topology and programming models

35 Parallelism in a modern compute node Parallel and shared resources within a shared-memory node [Figure: two sockets with cores, caches and memory, intersocket link, PCIe links to two GPUs and other I/O] Parallel resources: 1 Execution/SIMD units 2 Cores 3 Inner cache levels 4 Sockets / ccNUMA domains 5 Multiple accelerators Shared resources: 6 Outer cache level per socket 7 Memory bus per socket 8 Intersocket link 9 PCIe bus(es) 10 Other I/O resources How does your application react to all of those details? (c) RRZE 2018 Basic Architecture 42

36 Scalable and saturating behavior Clearly distinguish between saturating and scalable performance on the chip level: shared resources may show saturating performance; parallel resources show scalable performance (c) RRZE 2018 Basic Architecture 43

37 Parallel programming models: Pure MPI Machine structure is invisible to the user: Very simple programming model "MPI knows what to do"!? Performance issues: Intranode vs. internode MPI Node/system topology (c) RRZE 2018 Basic Architecture 45

38 Parallel programming models are topology-agnostic: Example: Pure threading on the node (relevant for this tutorial) Machine structure is invisible to the user: Very simple programming model Threading SW (OpenMP, pthreads, TBB, ...) should know about the details, but doesn't Performance issues: Synchronization overhead Memory access Node topology (c) RRZE 2018 Basic Architecture 46

39 Conclusions about architecture Modern computer architecture has a rich "topology" Node-level hardware parallelism takes many forms: Sockets/devices CPU: 1-4 or more, GPGPU/Phi: 1-6 or more Cores moderate (CPU: 4-24, Phi: 64-72) SIMD moderate (CPU: 2-8, Phi: 8-16) to massive (GPGPU: 10's-100's) Superscalarity (CPU/Phi: 2-6) Exploiting performance: parallelism + bottleneck awareness "High Performance Computing == computing at a bottleneck" Performance of programs is sensitive to architecture: Topology/affinity influences overheads of popular programming models Standards do not contain (many) topology-aware features Things are starting to improve slowly (MPI 3.0, OpenMP 4.0) Apart from overheads, performance features are largely independent of the programming model (c) RRZE 2018 Basic Architecture 47


More information

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this

More information

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017 Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference London, 2017 Agenda Vectorization is becoming more and more important What is

More information

Bei Wang, Dmitry Prohorov and Carlos Rosales

Bei Wang, Dmitry Prohorov and Carlos Rosales Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512

More information

Programming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy

Programming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy Programming Techniques for Supercomputers: Modern processors Architecture of the memory hierarchy Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), Dr. M. Wittmann (a) (a) HPC Services Regionales Rechenzentrum

More information

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Tools and techniques for optimization and debugging. Fabio Affinito October 2015

Tools and techniques for optimization and debugging. Fabio Affinito October 2015 Tools and techniques for optimization and debugging Fabio Affinito October 2015 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object, made of different

More information

CS 152, Spring 2011 Section 10

CS 152, Spring 2011 Section 10 CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Benchmark results on Knight Landing (KNL) architecture

Benchmark results on Knight Landing (KNL) architecture Benchmark results on Knight Landing (KNL) architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Roma 23/10/2017 KNL, BDW, SKL A1 BDW A2 KNL A3 SKL cores per node 2 x 18 @2.3

More information

Trends in systems and how to get efficient performance

Trends in systems and how to get efficient performance Trends in systems and how to get efficient performance Martin Hilgeman HPC Consultant martin.hilgeman@dell.com The landscape is changing We are no longer in the general purpose era the argument of tuning

More information

Scaling Throughput Processors for Machine Intelligence

Scaling Throughput Processors for Machine Intelligence Scaling Throughput Processors for Machine Intelligence ScaledML Stanford 24-Mar-18 simon@graphcore.ai 1 MI The impact on humanity of harnessing machine intelligence will be greater than the impact of harnessing

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

INF5063: Programming heterogeneous multi-core processors Introduction

INF5063: Programming heterogeneous multi-core processors Introduction INF5063: Programming heterogeneous multi-core processors Introduction Håkon Kvale Stensland August 19 th, 2012 INF5063 Overview Course topic and scope Background for the use and parallel processing using

More information

Basics of performance modeling for numerical applications: Roofline model and beyond

Basics of performance modeling for numerical applications: Roofline model and beyond Basics of performance modeling for numerical applications: Roofline model and beyond Georg Hager, Jan Treibig, Gerhard Wellein SPPEXA PhD Seminar RRZE April 30, 2014 Prelude: Scalability 4 the win! Scalability

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming

More information

Thread and Data parallelism in CPUs - will GPUs become obsolete?

Thread and Data parallelism in CPUs - will GPUs become obsolete? Thread and Data parallelism in CPUs - will GPUs become obsolete? USP, Sao Paulo 25/03/11 Carsten Trinitis Carsten.Trinitis@tum.de Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) Institut für

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know. Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs

More information

EPYC VIDEO CUG 2018 MAY 2018

EPYC VIDEO CUG 2018 MAY 2018 AMD UPDATE CUG 2018 EPYC VIDEO CRAY AND AMD PAST SUCCESS IN HPC AMD IN TOP500 LIST 2002 TO 2011 2011 - AMD IN FASTEST MACHINES IN 11 COUNTRIES ZEN A FRESH APPROACH Designed from the Ground up for Optimal

More information

Tools and techniques for optimization and debugging. Andrew Emerson, Fabio Affinito November 2017

Tools and techniques for optimization and debugging. Andrew Emerson, Fabio Affinito November 2017 Tools and techniques for optimization and debugging Andrew Emerson, Fabio Affinito November 2017 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object,

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

Introducing Sandy Bridge

Introducing Sandy Bridge Introducing Sandy Bridge Bob Valentine Senior Principal Engineer 1 Sandy Bridge - Intel Next Generation Microarchitecture Sandy Bridge: Overview Integrates CPU, Graphics, MC, PCI Express* On Single Chip

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Parallelism, Multicore, and Synchronization

Parallelism, Multicore, and Synchronization Parallelism, Multicore, and Synchronization Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer, Roth, Martin] xkcd/619 3 Big Picture: Multicore

More information

Introduction to tuning on KNL platforms

Introduction to tuning on KNL platforms Introduction to tuning on KNL platforms Gilles Gouaillardet RIST gilles@rist.or.jp 1 Agenda Why do we need many core platforms? KNL architecture Post-K overview Single-thread optimization Parallelization

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

Parallel Computing. November 20, W.Homberg

Parallel Computing. November 20, W.Homberg Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better

More information

Introduction to tuning on KNL platforms

Introduction to tuning on KNL platforms Introduction to tuning on KNL platforms Gilles Gouaillardet RIST gilles@rist.or.jp 1 Agenda Why do we need many core platforms? KNL architecture Single-thread optimization Parallelization Common pitfalls

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information