General-purpose SIMT processors. Sylvain Collange INRIA Rennes / IRISA
1 General-purpose SIMT processors Sylvain Collange INRIA Rennes / IRISA sylvain.collange@inria.fr
2 From GPU to heterogeneous multi-core Yesterday ( ) Homogeneous multi-core Discrete components Today ( ) Heterogeneous multi-core Physically unified CPU + GPU on the same chip Logically separated Different programming models, compilers, instruction sets Tomorrow Unified programming models? Single instruction set? Central Processing Unit (CPU) Latency-optimized cores Throughput-optimized cores Graphics Processing Unit (GPU) Heterogeneous multi-core chip Hardware accelerators 2
3 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 3
4 The 1980s: pipelined processor Example: scalar-vector multiplication: X ← a*X for i = 0 to n-1 X[i] ← a * X[i] Source code move i 0 loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<n? loop Machine code add i 18 Fetch store X[17] Decode mul Execute L/S Unit Sequential CPU Memory 4
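The source code on the slide, written out as a runnable loop (plain Python here for illustration; the slide's machine code unrolls exactly these per-iteration steps):

```python
def scale_vector(a, X):
    # Scalar-vector multiplication X <- a * X, as the slide's
    # sequential machine code runs it: each iteration performs a load,
    # a multiply, a store, an index increment, and a branch test.
    for i in range(len(X)):
        X[i] = a * X[i]
    return X
```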
5 The 1990s: superscalar processor Goal: improve performance of sequential applications Latency: time to get the result Exploits Instruction-Level Parallelism (ILP) Uses many tricks Branch prediction, out-of-order execution, register renaming, data prefetching, memory disambiguation Basis: speculation Take a bet on future events If right: time gain If wrong, roll back: energy loss 5
6 What makes speculation work: regularity Application behavior likely to follow regular patterns Control regularity for(i ) { if(f(i)) { } Regular case Time Irregular case i=0 i=1 i=2 i=3 i=0 i=1 i=2 i=3 taken taken taken taken not tk taken taken not tk Memory regularity } j = g(i); x = a[j]; j=17 j=18 j=19 j=20 j=21 j=4 j=17 j=2 Speculation exploits patterns to guess accurately Applications Caches Branch prediction Instruction prefetch, data prefetch, write combining 6
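How regularity pays off can be seen with a toy predictor. The sketch below is a generic 2-bit saturating-counter branch predictor (an illustrative scheme, not one named on the slide): it scores perfectly on the regular all-taken pattern and much worse on an irregular alternating one.

```python
def prediction_accuracy(outcomes):
    # 2-bit saturating counter: states 0..3, predict "taken" when the
    # counter is >= 2. Regular outcome streams saturate the counter
    # and are predicted almost perfectly; irregular streams cause
    # frequent mispredictions, i.e. roll-backs and wasted energy.
    counter, correct = 2, 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)
```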
7 The 2000s: going multi-threaded Memory wall More and more difficult to hide memory latency Power wall Performance is now limited by power consumption ILP wall Law of diminishing returns on Instruction-Level Parallelism Gradual transition from latency-oriented to throughput-oriented Homogeneous multi-core Simultaneous multi-threading Performance Cost Time Gap Serial performance Compute Memory Transistor density Transistor power Total power Time 7
8 Homogeneous multi-core Replication of the complete execution engine Multi-threaded software move i slice_begin loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<slice_end? loop Machine code add i 18 IF store X[17] mul IF ID EX LSU add i 50 IF store X[49] mul IF ID EX LSU Memory Threads: T0 T1 Improves throughput thanks to explicit parallelism 8
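The slice_begin/slice_end bounds in the machine code correspond to a static partitioning of the iteration space across threads, which could be computed as below (a sketch; the function name is illustrative):

```python
def partition(n, n_threads):
    # Split iterations 0..n-1 into contiguous slices, one per thread:
    # thread t iterates over [slice_begin, slice_end). Remainder
    # iterations go to the first threads, so sizes differ by at most 1.
    base, extra = divmod(n, n_threads)
    slices, begin = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        slices.append((begin, begin + size))
        begin += size
    return slices
```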
9 Simultaneous multi-threading (SMT) Time-multiplexing of processing units Same software view move i slice_begin loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<slice_end? loop Machine code mul mul add i 73 add i 50 load X[89] store X[72] load X[17] store X[49] Fetch Decode Execute L/S Unit Memory Threads: T0 T1 T2 T3 Hides latency thanks to explicit parallelism 9
10 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 10
11 Throughput-oriented architectures Also known as GPUs, but do more than just graphics Target: highly parallel sections of programs Programming model: SPMD one function run by many threads One code: For n threads: X[tid] ← a * X[tid] Many threads: Goal: maximize computation / energy consumption ratio Many-core approach: many independent, multi-threaded cores Can we be more efficient? Exploit regularity 11
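The SPMD model on the slide can be mimicked in a few lines: one kernel function, run once per thread id (sequentially here for illustration; a GPU runs the instances in parallel):

```python
def spmd_launch(kernel, n_threads, *args):
    # Launch n_threads instances of the same kernel; each instance
    # differs only by its thread id.
    for tid in range(n_threads):
        kernel(tid, *args)

def scale_kernel(tid, a, X):
    # The slide's one-liner: every thread scales its own element.
    X[tid] = a * X[tid]
```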
12 Parallel regularity Similarity in behavior between threads Control regularity Time Regular Thread i=17 i=17 i=17 i=17 switch(i) { case 2:... case 17:... case 21:... } Irregular i=21 i=4 i=17 i=2 Memory regularity A load A[8] load A[9] Memory load A[10] load A[11] r=a[i] load A[8] load A[0] load A[11] load A[3] Data regularity a=32 a=32 a=32 a=32 b=52 b=52 b=52 b=52 r=a*b a=17 a=-5 a=11 a=42 b=15 b=0 b=-2 b=52 12
13 Dynamic SPMD vectorization aka SIMT Run SPMD threads in lockstep Mutualize fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers T0 (0-3) load IF T1 T2 T3 (0) mul (0-3) store ID (1) mul (2) mul (3) mul EX Memory (0) (1) (2) (3) LSU SIMT: Single Instruction, Multiple Threads Wave of synchronized threads: warp Improves Area/Power-efficiency thanks to regularity 13
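Lockstep execution can be sketched as issuing one decoded operation to all lanes of a warp, with an activity mask deciding which lanes commit the result:

```python
def simt_issue(op, active, regs):
    # One SIMT issue slot: a single instruction `op`, fetched and
    # decoded once, is applied on every active lane; lanes whose
    # activity bit is 0 discard it and keep their register value.
    return [op(r) if a else r for a, r in zip(active, regs)]
```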
14 Core 127 Core 93 Core 92 Core 91 Core 66 Core 65 Core 64 Core 34 Core 33 Core 32 Core 2 Core 1 Example GPU: NVIDIA GeForce GTX 980 SIMT: warps of 32 threads 16 SMs / chip 4×32 cores / SM, 64 warps / SM Warp 1 Warp 5 Warp 2 Warp 6 Warp 3 Warp 7 Warp 4 Warp 8 Warp 60 Warp 61 Warp 62 Warp 63 Time SM1 SM Gflop/s Up to threads in flight 14
15 SIMT vs. multi-core + explicit SIMD SIMT All parallelism expressed using threads Warp size implementation-defined Dynamic vectorization Threads Multi-core + explicit SIMD Combination of threads, vectors Vector length fixed at compile-time Static vectorization Threads Warp SIMT benefits Easier programming Retain binary compatibility Vector 15
16 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 16
17 Heterogeneity: causes and consequences Amdahl's law S = 1 / ((1 − P) + P/N), where P is the fraction of time spent in parallel sections and N the number of cores Latency-optimized multi-core (CPU) Low efficiency on parallel sections: spends too many resources Throughput-optimized multi-core (GPU) Low performance on sequential sections Heterogeneous multi-core (CPU+GPU) Use the right tool for the right job Resources saved in parallel sections can be devoted to accelerate sequential sections M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer,
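Plugging numbers into Amdahl's law shows why heterogeneity matters: even a 90%-parallel program cannot exceed a 10x speedup, no matter how many throughput cores it gets.

```python
def amdahl_speedup(p, n):
    # Amdahl's law: p is the parallel fraction of execution time,
    # n the number of cores running the parallel sections.
    return 1.0 / ((1.0 - p) + p / n)
```

With p = 0.9, amdahl_speedup(0.9, 16) = 6.4 while the limit as n grows is 1/0.1 = 10: the sequential 10% dominates, which is exactly the job left to the latency-optimized cores.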
18 Single-ISA heterogeneous architectures Proposed in academia, now in embedded systems on chip Example: ARM big.LITTLE High-performance CPU cores Cortex A57 Low-power CPU cores Cortex A53 big cores Thread migration LITTLE cores Different cores, same instruction set Migrate application threads dynamically according to demand 18
19 Single-ISA heterogeneous CPU-GPU Enables dynamic task migration between CPU and GPU Use the best core for the current job Load-balance on available resources CPU cores GPU cores Thread migration 19
20 Option 1: static vectorization Extend CPU SIMD instruction sets to throughput-oriented cores Scalar + wide vector ISA e.g. x86 AVX-512 Latency core Throughput cores Issue: conflicting requirements Same suboptimal SIMD vector length for all cores Or binary compatibility loss, ISA fragmentation Intel. Intel Advanced Vector Extensions 2015/2016 Support in GNU Compiler Collection. GNU Tools Cauldron
21 Our proposal: Dynamic vectorization Extend SIMT execution model to general-purpose cores Scalar ISA on both sides Latency core Throughput cores Flexibility advantage: SIMD width optimized for each core type Challenge: generalize dynamic vectorization to general-purpose instruction processing 21
22 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 22
23 Capturing instruction regularity How to keep threads synchronized? Challenge: conditional branches Rules of the game One thread per SIMD lane Same instruction on all lanes Lanes can be individually disabled Thread 0 Thread 1 Thread 2 Thread 3 1 instruction Lane 0 Lane 1 Lane 2 Lane 3 x = 0; // Uniform condition if(tid > 17) { x = 1; } // Divergent conditions if(tid < 2) { if(tid == 0) { x = 2; } else { x = 3; } } 23
24 Most common: mask stack skip Code x = 0; // Uniform condition if(tid > 17) { } x = 1; // Divergent conditions if(tid < 2) { } push pop if(tid == 0) { } else { } push pop push pop x = 2; x = 3; Mask Stack 1 activity bit / thread tid=0 tid=1 tid=2 tid=3 A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH 84,
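The mask-stack discipline on the slide can be played out in software. The sketch below walks a 4-thread warp through the divergent nested if/else, pushing the activity mask at each divergent branch and popping it at reconvergence (a behavioral model only, not the hardware encoding):

```python
def run_divergent_warp(tids):
    # Executes, under mask-stack control:
    #   if (tid < 2) { if (tid == 0) x = 2; else x = 3; }
    x = [0] * len(tids)
    stack = []
    mask = [True] * len(tids)           # all threads active

    stack.append(mask)                  # push at outer if
    mask = [m and t < 2 for m, t in zip(mask, tids)]

    stack.append(mask)                  # push at inner if
    outer = mask
    mask = [m and t == 0 for m, t in zip(outer, tids)]
    for i, on in enumerate(mask):       # then-branch: x = 2
        if on:
            x[i] = 2
    mask = [m and t != 0 for m, t in zip(outer, tids)]
    for i, on in enumerate(mask):       # else-branch: x = 3
        if on:
            x[i] = 3
    mask = stack.pop()                  # pop at inner endif
    mask = stack.pop()                  # pop at outer endif
    return x
```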
25 Traditional SIMT pipeline Instruction Activity bit Exec Instruction Sequencer PC, Activity mask Instruction Fetch Insn, Activity mask Broadcast Instruction, Activity bit Exec Activity bit=0: discard instruction Mask stack Instruction, Activity bit Exec Used in Nvidia GPUs 25
26 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 26
27 Goto considered harmful? MIPS j jal jr syscall NVIDIA Tesla (2007) bar bra brk brkpt cal cont kil pbk pret ret ssy trap.s NVIDIA Fermi (2010) bar bpt bra brk brx cal cont exit jcal jmx kil pbk pret ret ssy.s Intel GMA Gen4 (2006) jmpi if iff else endif do while break cont halt msave mrest push pop Intel GMA SB (2011) jmpi if else endif case while break cont halt call return fork Control instructions in some CPU and GPU instruction sets Why so many? AMD R500 (2005) jump loop endloop rep endrep breakloop breakrep continue AMD R600 (2007) push push_else pop loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after Expose control flow structure to the instruction sequencer AMD Cayman (2011) push push_else pop push_wqm pop_wqm else_wqm jump_any reactivate reactivate_wqm loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after 27
28 SIMD is so last century Maspar MP-1 (1990) 1 instruction for processing elements (PEs) PE : ~1 mm², 1.6 µm process SIMD programming model /1000 Fewer PEs 50 Bigger PEs More divergence NVIDIA Fermi (2010) 1 instruction for 16 PEs PE : ~0.03 mm², 40 nm process Threaded programming model From centralized control to flexible distributed control 28
29 Moving away from the vector model Requirements for a single-ISA CPU+GPU Run general-purpose applications Switch freely back and forth between SIMT and MIMD modes Conventional techniques do not meet these requirements Solution: stateless dynamic vectorization Key idea Maintain 1 Program Counter (PC) per thread Each cycle, elect one Master PC to fetch from Activate all threads that have the same PC 29
30 1 PC / thread Code x = 0; if(tid > 17) { x = 1; } if(tid < 2) { if(tid == 0) { x = 2; Master PC } } else { x = 3; } Program Counters (PCs) tid=0 tid=1 tid=2 tid=3 Match active PC 0 PC 1 PC 2 PC 3 No match inactive 30
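The election mechanism can be sketched directly: keep one PC per thread, pick a master PC each cycle, and activate exactly the threads whose PC matches (a behavioral sketch using a plain min(PC) policy; instructions are modeled as functions returning the next PC):

```python
def simt_step(pcs, program):
    # One fetch cycle of stateless dynamic vectorization: elect a
    # master PC (MPC), fetch the instruction at MPC once, and execute
    # it only on threads whose private PC matches. Diverged threads
    # simply wait until their own PC is elected.
    mpc = min(pcs)                        # election policy: min(PC)
    active = [pc == mpc for pc in pcs]
    for tid, on in enumerate(active):
        if on:
            pcs[tid] = program[mpc](tid)  # instruction yields next PC
    return mpc, active
```

Running two cycles on a divergent branch shows divergence and then reconvergence at the join point.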
31 Our new SIMT pipeline PC 0 Insn, MPC MPC=PC 0? Insn Exec Update PC PC 0 PC 1 Vote MPC Instruction Fetch Insn, MPC Broadcast Insn, MPC MPC=PC 1? Insn Exec Update PC No match: discard instruction PC 1 PC n Insn, MPC MPC=PC n? Insn Exec Update PC PC n 31
32 Benefits of stateless dynamic vectorization Before: stack, counters O(n), O(log n) memory n = nesting depth 1 R/W port to memory Exceptions: stack overflow, underflow Vector semantics Structured control flow only Specific instruction sets After: multiple PCs O(1) memory No shared state Allows thread suspension, restart, migration Multi-thread semantics Traditional languages, compilers Traditional instruction sets Can be mixed with MIMD 32
33 Scheduling policy: min(sp:pc) Which PC to choose as master PC? Conditionals, loops Order of code addresses min(pc) Functions Favor max nesting depth min(sp) With compiler support Unstructured control flow too No code duplication Full backward and forward compatibility Source Assembly Order if( ) { p? br else } 1 else br endif { else: } 2 endif: 3 while( ) { } f(); void f() { } start: p? br start call f f: ret
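The policy itself reduces to a lexicographic minimum over (SP, PC) pairs. A sketch, under the convention that the stack grows toward lower addresses, so min(SP) picks the deepest call:

```python
def elect_master(contexts):
    # contexts: one (sp, pc) pair per thread. The lexicographic min
    # favors first the deepest function nesting (lowest SP, stack
    # growing down), then the earliest code address, which lets
    # threads waiting at conditionals and loop exits reconverge.
    return min(contexts)
```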
34 Potential of Min(SP:PC) Comparison of fetch policies on SPMD benchmarks PARSEC and SPLASH benchmarks for CPU, using pthreads, OpenMP, TBB Microarchitecture-independent model: ideal SIMD machine Average number of active threads Min(SP:PC) achieves reconvergence at minimal cost T. Milanez et al. Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads. Parallel Computing 40.9:
35 Simty: a synthesizable SIMT CPU Proof of concept for dynamic vectorization Written in synthesizable VHDL Runs the RISC-V instruction set (RV32I) Fully parametrizable warp size, warp count 10-stage pipeline FPGA Prototype On Altera Cyclone IV based DE-2 board 8-warp 4-thread Simty: 7151 LEs, 12 M9Ks (6%), 72 MHz 8-warp 8-thread Simty: LEs, 24 M9Ks (11%), 63 MHz Overhead of per-pc control is small 35
36 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 36
37 DITVA Dynamic Inter-Thread Vectorization Architecture Add dynamic vectorization capability to an in-order SMT CPU Runs existing parallel programs compiled for x86 Scheduling policy: alternate min(sp:pc) and round-robin 4 scalar units 2 SIMD units 4 SIMT units Baseline: 4-thread 4-issue in-order with explicit SIMD DITVA: 4-warp 4-thread 4-issue 37
38 DITVA performance Speedup of 4-warp 2-thread DITVA and 4-warp 4-thread DITVA over baseline 4-thread processor +18% and +30% performance on SPMD workloads 38
39 Outline Performance or efficiency? Latency-oriented architectures Throughput-oriented architectures Heterogeneous architectures Dynamic SPMD vectorization Traditional dynamic vectorization More flexibility with state-free dynamic vectorization New CPU-GPU hybrids DITVA: CPU with dynamic vectorization SBI: GPU with parallel path execution 39
40 Simultaneous Branch Interweaving Co-issue instructions from divergent branches Fill inactive units using parallelism from divergent paths Control-flow graph Same cycle, two instructions SIMT (baseline) SBI N. Brunie, S. Collange, G. Diamos. Simultaneous branch and warp interweaving for sustained GPU performance. ISCA
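With per-thread PCs, SBI amounts to electing a second instruction from another divergent path in the same cycle (a sketch; a real design also constrains which execution units each path may use):

```python
def sbi_elect(pcs):
    # Pick a primary PC, then a secondary PC from a different
    # divergent path, so both instructions can be co-issued in the
    # same cycle and fill lanes that the primary path leaves idle.
    primary = min(pcs)
    others = [pc for pc in pcs if pc != primary]
    secondary = min(others) if others else None
    return primary, secondary
```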
41 Conclusion: the missing link CPU today GPU today CPU ISA SIMT model Multi-core multi-thread DITVA SBI SIMT New design space New range of architecture options between multi-core and GPUs Enables heterogeneous platforms with unified instruction set 41
More informationVOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017
VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100
More information04 - DSP Architecture and Microarchitecture
September 11, 2015 Memory indirect addressing (continued from last lecture) ; Reality check: Data hazards! ; Assembler code v3: repeat 256,endloop load r0,dm1[dm0[ptr0++]] store DM0[ptr1++],r0 endloop:
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationProcessor Performance and Parallelism Y. K. Malaiya
Processor Performance and Parallelism Y. K. Malaiya Processor Execution time The time taken by a program to execute is the product of n Number of machine instructions executed n Number of clock cycles
More informationPutting it all Together: Modern Computer Architecture
Putting it all Together: Modern Computer Architecture Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. May 10, 2018 L23-1 Administrivia Quiz 3 tonight on room 50-340 (Walker Gym) Quiz
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationAntonio R. Miele Marco D. Santambrogio
Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First
More informationSpring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University
18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationIntroduction to GPU programming with CUDA
Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationLecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)
Lecture 7: The Programmable GPU Core Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today A brief history of GPU programmability Throughput processing core 101 A detailed
More informationThe IA-64 Architecture. Salient Points
The IA-64 Architecture Department of Electrical Engineering at College Park OUTLINE: Architecture overview Background Architecture Specifics UNIVERSITY OF MARYLAND AT COLLEGE PARK Salient Points 128 Registers
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationMain Points of the Computer Organization and System Software Module
Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationControl Hazards. Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationHardware-Accelerated Dynamic Binary Translation
Hardware-Accelerated Dynamic Binary Translation Rokicki Simon - Irisa / Université de Rennes 1 Steven Derrien - Irisa / Université de Rennes 1 Erven Rohou - Inria Embedded Systems Tight constraints in
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationProf. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6
Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Why do I need four computing cores on my phone?! Why do I need eight computing
More informationNVIDIA s Compute Unified Device Architecture (CUDA)
NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability 1 History of GPU
More informationNVIDIA s Compute Unified Device Architecture (CUDA)
NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability History of GPU
More informationWritten Exam / Tentamen
Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationComputer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Thread, Multithreading, SMT CMP and multicore Benefits of
More informationLecture 21: Parallelism ILP to Multicores. Parallel Processing 101
18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture
More informationGPU Microarchitecture Note Set 2 Cores
2 co 1 2 co 1 GPU Microarchitecture Note Set 2 Cores Quick Assembly Language Review Pipelined Floating-Point Functional Unit (FP FU) Typical CPU Statically Scheduled Scalar Core Typical CPU Statically
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCE 431 Parallel Computer Architecture Spring Graphics Processor Units (GPU) Architecture
CE 431 Parallel Computer Architecture Spring 2017 Graphics Processor Units (GPU) Architecture Nikos Bellas Computer and Communications Engineering Department University of Thessaly Some slides borrowed
More information