Occupancy-based compilation

1 Occupancy-based compilation Advanced Course on Compilers Spring 2015 (III-V): Lecture 10 Vesa Hirvisalo ESG/CSE/Aalto

2 Today: Threads and occupancy; GPUs as the example; SIMT execution: warp (thread-group) scheduling, divergent vs. uniform thread execution, stack-based and PC-based reconvergence; Divergence analysis: different forms of divergence, the micro-simd language and machine (a toy example), Gated Single Assignment form (GSA); Compiling into Cooperative Thread Arrays (CTAs): having a massive number of threads, GPU memory issues, CTA and warp formation. 2/40

3 Threads and occupancy 3/40

4 Having a massive number of threads. Repeated execution of code: programs spend their time in repeatedly executed parts. Dependence-based compilation: we consider loops as the primary form of repetition; this also covers many other control structures (e.g., recursion); we form the iteration space and analyze dependencies. Occupancy-based compilation: we consider threads as the primary form of repetition; we have a number of threads that execute the same code; we try to maximize the number of parallel threads; light threads help in maximizing HW occupancy. 4/40

5 GPUs as an example (1/2). There are several vendors offering different concepts. The two main ones: NVIDIA SM (Streaming Multiprocessor), with warps and warp scheduling; AMD VLIW4 (or VLIW5), older, basically dependency limited, and Graphics Core Next (GCN), newer, more similar to the NVIDIA SM. Both are moving toward scalar cores and occupancy-limited computing. We will (mostly) use NVIDIA terminology in the following. 5/40

6 GPUs as an example (2/2). Vector or VLIW cores (in parallel, SIMT style): especially VLIW is suitable for graphics; NVIDIA also used it in the past. Both need complex compilers: dependency limited; loop structures, static scheduling; static register conflict resolution. Scalar cores (in parallel, pure SIMT): more toward HW scheduling; the hardware tries to get utilization (occupancy) up; the software must offer threads for occupancy. Simpler compiler, and more predictable performance over a wide range of applications. 6/40

7 Utilization and occupancy. Occupancy measures warp hardware utilization: occupancy = actual active warps / maximum active warps. Why does the HW sit with nothing to do? Interconnects saturate (memory, registers, ...). If per-CTA resource requirements are high, the effective number of simultaneously resident CTAs, and thus threads, is limited by these requirements. High control flow divergence: threads in a warp have different control flow and cannot be run in parallel, i.e., the warping is "sparse" instead of being "dense". Inefficient scheduling mechanisms: after a long memory wait, many of the warps tend to become ready at roughly the same time. A sketch of the occupancy calculation is given below. 7/40
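
As a rough illustration of the occupancy formula above, the following host-side sketch computes how per-CTA requirements limit the number of resident CTAs and thus occupancy. All per-SM limits and per-CTA requirements are hypothetical example values, not figures from the lecture.

// Minimal sketch: estimating occupancy from per-CTA resource requirements.
#include <algorithm>
#include <cstdio>

int main() {
    // Hypothetical per-SM hardware limits (assumptions for illustration only).
    const int maxWarpsPerSM   = 48;
    const int maxThreadsPerSM = 1536;
    const int registersPerSM  = 32768;
    const int sharedMemPerSM  = 48 * 1024;   // bytes
    const int warpWidth       = 32;

    // Per-CTA requirements implied by a kernel (also assumptions).
    const int threadsPerCTA   = 256;
    const int regsPerThread   = 32;
    const int sharedMemPerCTA = 12 * 1024;   // bytes

    // Each resource limit caps the number of CTAs resident on one SM.
    int byThreads = maxThreadsPerSM / threadsPerCTA;
    int byRegs    = registersPerSM / (regsPerThread * threadsPerCTA);
    int bySmem    = sharedMemPerSM / sharedMemPerCTA;
    int ctas      = std::min({byThreads, byRegs, bySmem});

    int activeWarps = ctas * threadsPerCTA / warpWidth;
    printf("resident CTAs: %d, occupancy = %d/%d = %.2f\n",
           ctas, activeWarps, maxWarpsPerSM,
           (double)activeWarps / maxWarpsPerSM);
    return 0;
}

With these example numbers the register and shared-memory limits cap residency at 4 CTAs, giving 32 active warps out of 48, i.e., an occupancy of about 0.67.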

8 SIMT execution 8/40

9 SIMT microarchitecture. MIMD (Multiple Instruction Multiple Data): each core has its own instruction stream; co-operation and synchronization happen through memory. Vector-SIMD: there is only a single instruction stream; multiple separate vector lanes, like cores without the control logic (etc.); each lane operates on a word; there can be predication. SIMT: all cores have their own instruction stream; however, there is (usually) only one issue unit; the cores run the same instruction, use predication, and ignore irrelevant instructions. There are several flavors of each! Be careful about the terminology (processor, core, lane, ...). 9/40

10 Warp scheduling. Microthread issue is done by a warp scheduler. 1. The warp scheduler chooses a warp: the warp must be in an active block; the warp must be ready, i.e., not waiting for memory or register operands; the warp has a PC, which applies to all its unmasked threads. 2. The instruction for the warp is fetched and decoded; let x denote the number of functional units for this instruction. 3. The warp is executed, which can take several cycles: at each cycle, x threads are issued to functional units; this repeats until all threads in the warp are issued. Execution is usually pipelined, and it can take several (e.g., 20) cycles until the result is available. A sketch of this loop is given below. 10/40
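
The scheduling steps above can be mirrored by a small host-side simulation sketch. The types, fields, and the stub decode function below are my own simplifications; real warp scheduling is done in hardware.

#include <vector>

struct Warp  { int pc; bool ready; bool activeBlock; };
struct Instr { int latency; int funcUnits; };

// Stub decode: pretend every instruction uses 8 cores and has a 20-cycle latency.
Instr fetchAndDecode(int /*pc*/) { return {20, 8}; }

// One scheduling decision, mirroring steps 1-3 above.
void scheduleOneInstruction(std::vector<Warp>& warps, int warpWidth) {
    // 1. Choose a warp that is in an active block and ready.
    Warp* w = nullptr;
    for (auto& cand : warps)
        if (cand.activeBlock && cand.ready) { w = &cand; break; }
    if (!w) return;                          // nothing ready: the HW sits idle

    // 2. Fetch and decode the instruction at the warp's PC.
    Instr ins = fetchAndDecode(w->pc);
    int x = ins.funcUnits;                   // x functional units for this instruction

    // 3. Issue x threads per cycle until the whole warp has been issued.
    int issueCycles = (warpWidth + x - 1) / x;   // e.g., 32 threads on 8 cores = 4 cycles
    (void)issueCycles;                       // the result is available ins.latency cycles later
    w->pc += 1;
}

int main() {
    std::vector<Warp> warps = { {0, true, true}, {0, false, true} };
    scheduleOneInstruction(warps, 32);       // issues from the first (ready) warp
    return 0;
}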

11 Warp scheduling example (1/2). Assume: warp width 32; 8 cores, 2 SFUs. Code: I1: FMUL R17, R19, R29; // multiplication, uses a core. I2: RSQ R7, R7; // square root, uses a SFU. Execution: [Timing diagram: warp W00 first issues I1 on the 8 cores over a few cycles (32/8 = 4 issue cycles) and then issues I2 on the 2 SFUs, which occupies the SFUs for far longer (32/2 = 16 issue cycles).] 11/40

12 Warp scheduling example (2/2). Assume: two warp schedulers, 8 cores per scheduler, 12-stage pipeline. Code: I1: IADD R1, R0, R5; I2: IMAD.U16 R3, g [0x6].U16, R5L, R2; I3: IADD R2, R1, R5; // depends on I1 (via R1). Execution (note how low the utilization is): there is plenty of room to run other threads! Remember memory latencies. [Timing diagram: warps W00-W03 each issue I1, I2, and I3 in turn; because I3 depends on I1 through the 12-stage pipeline, long idle gaps remain between the issues.] 12/40

13 Stack-based branch divergence. [Figure: program flow through basic blocks A, B, C, and D with the set of active threads in each block, together with the reconvergence stack (columns: reconvergence PC (R pc), next PC, active mask) shown at four points in time: the initial stack contents (a single entry for A at the stack top), the stack after the divergence at A (a reconvergence entry for D plus entries with next PC C and next PC B), the stack after branch completion (the D entry plus the entry for C), and the stack after reconvergence (only the entry for D).] 13/40

14 Memory divergence. Memory delays can cause warp splitting. Typical current HW does not allow such splitting, i.e., execution is kept uniform even if latencies are divergent. Below we assume warp-splitting support. [Figure: program flow through blocks A, B, and C with the active mask of each split sub-warp over time; threads whose data is ready proceed while the others wait for memory, and the sub-warps reconverge when the memory accesses complete.] 14/40

15 Stack-based vs PC-based reconvergence. Warp splitting together with PC-based reconvergence needs more complex HW. [Figure: a program flow over blocks A to G, comparing the conventional execution sequence over time with the execution sequence obtained using dynamic warp subdivision.] 15/40

16 Divergence analysis 16/40

17 Microthread divergence. Data divergence occurs if the same variable name is mapped to different values in the environments of distinct processing elements. Data divergence produces the other two forms. Control divergence occurs when threads in a warp follow different paths after processing the same branch. Memory divergence occurs when a load or store instruction targeting data-divergent addresses causes different access delays for different threads; an affine address means consecutive threads access adjacent or regularly spaced memory locations, while a uniform address means all threads access the same memory location. Note: support for divergence is essential for manycores to target a wide set of applications; basic tile processors (e.g., Epiphany) or multicore CPUs do not have such support. A small kernel illustrating these forms is sketched below. 17/40
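
A minimal CUDA kernel can make the three forms concrete. The kernel below is an illustrative example of my own (not from the lecture); the table lookup and the modulo pattern are just placeholders for uniform, affine, and data-divergent behavior.

__global__ void divergence_kinds(const float* in, float* out,
                                 const float* table, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // "Tid": data divergent
    if (tid >= n) return;

    float u = table[0];            // uniform address: every thread reads the same word
    float a = in[tid];             // affine address: consecutive threads, stride 1
    float d = in[(tid * 37) % n];  // data-divergent address: irregular per thread

    if (tid % 2 == 0)              // control divergence: the predicate depends on Tid
        out[tid] = u + a;
    else
        out[tid] = u + d;
}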

18 The micro-simd language. We will use a simple (machine) language to demonstrate divergence.
Labels (L) ::= l ∈ N
Variables (V) ::= Tid | {v1, v2, ...}
Instructions ::=
- (jump if zero/not zero) bz/bnz v, l
- (unconditional jump) jump l
- (store into shared memory) store vx = v
- (load from shared memory) load v = vx
- (atomic increment) atominc v = vx
- (other binary operations) binop v1 = v2 op v3
- (immediate copy) const v = n
- (synchronization barrier) sync
- (halt execution) stop
Tid is the thread identifier (corresponding to the get_x_id() functions of OpenCL). 18/40

19 The micro-simd machine. The state of the machine is determined by: the program that the machine executes; the program counter (pc), which all processing elements (PEs) use; the contents of the shared memory, shared by all PEs; the contents of the local memories (every PE has a local memory; the load and store instructions transfer data between the local memory of a PE and the shared memory); the set of active threads Θ; and the contents of the synchronization stack. 19/40

20 The synchronization stack. The synchronization stack contains frames holding quadruples (l_id, Θ_done, l_next, Θ_todo), where l_id is the conditional branch that caused the divergence, Θ_done is the set of PEs that have reached the synchronization point, l_next indicates the instruction where Θ_todo will resume execution, and Θ_todo is the set of PEs waiting to execute. Basically, the machine pushes frames at branches and pops them at synchronization barriers. Note that this is different from the R-pc-based stack; it resembles the predicate mask of vector machines, but allows for nested "if" structures and loops that may include synchronization primitives (in micro-simd: atominc, sync). A data-structure sketch is given below. 20/40
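
A minimal data-structure sketch of such a stack, assuming Θ is represented as a bit mask over the PEs, could look as follows (the type names and the push helper are my own).

#include <cstdint>
#include <vector>

using ThreadSet = std::uint32_t;         // Θ as a bit mask, one bit per PE

struct SyncFrame {
    int       l_id;                      // the branch that caused the divergence
    ThreadSet theta_done;                // PEs that have reached the sync point
    int       l_next;                    // where theta_todo resumes execution
    ThreadSet theta_todo;                // PEs waiting to execute
};

struct MicroSimdState {
    int                    pc;
    ThreadSet              active;       // the currently active PEs (Θ)
    std::vector<SyncFrame> stack;        // the synchronization stack
};

// Push a frame at a divergent branch: the 'pending' group will resume at
// 'pendingTarget' once the matching barrier is reached.
inline void pushDivergence(MicroSimdState& m, int branchPc,
                           ThreadSet pending, int pendingTarget) {
    m.stack.push_back({branchPc, /*theta_done=*/0, pendingTarget, pending});
}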

21 Micro-SIMD machine operation (1/2). Upon reaching bz v, l we evaluate v in the local memory of each active PE. If the condition v is non-zero for every PE, we move to the next instruction, i.e., pc + 1. Similarly, if the condition is zero for every PE, we jump to the instruction at l. However, if we get distinct values for different PEs, then the branch is divergent: we split Θ into Θ_0 (condition not true) and Θ_n (condition true); we execute the "else" branch (with Θ_0); we push the other branch (i.e., Θ_n) onto the stack; basically, we have an execution pending on the stack. 21/40

22 Micro-SIMD machine operation (2/2). The handling of branch non-zero (bnz) is similar. Also the non-divergent branch instructions (where Θ_0 or Θ_n is empty) update the synchronization stack (in order not to get stuck when trying to pop a frame at a sync instruction). If we arrive at a barrier (sync) with a group Θ_n of PEs waiting to execute, then we resume their execution at the "then" branch, putting the previously active PEs on hold. If we reach a barrier without any PE waiting to execute, we synchronize the PEs with the current set of active PEs and resume execution at the next instruction after the barrier. 22/40

23 Example (1/5). Consider the following code in a language resembling OpenCL (note the use of the thread identifier Tid):

kernel TriangleSum(float* m, float* v, int c) {
  if (Tid < c) {
    int d = 0;
    float sum = 0.0F;
    int L = (Tid + 1) * c;
    for (int i = Tid; i < L; i += c) {
      if (d % 2) sum += m[i];
      d += 1;
    }
    v[d-1] = sum;
  }
}
23/40

24 Example (2/5). A simple translation yields the following code in micro-simd:
l00: const d = 0
l01: const s = 0
l02: const i = Tid
l03: binop x = Tid + 1
l04: binop L = c * x
l05: binop p = i - L
l06: bz p, l15
l07: binop p = d % 2
l08: bnz p, l11
l09: load x = i
l10: binop s = s + x
l11: sync
l12: binop d = d + 1
l13: binop i = i + c
l14: jmp l05
l15: sync
l16: binop x = d - 1
l17: store x = s
l18: stop
24/40

25 Example (3/5). If we execute the kernel for a 4x4 matrix (i.e., c = 4) with 4 threads (t0, t1, t2, t3), we can observe divergence. E.g., when executing l06 for the second time (cycle 17), the predicate p is 0 for thread t0 and non-zero for the other PEs; because of this, thread t1 is pushed onto the stack. Once t3 leaves the loop, the threads synchronize via the sync instruction at label l15 and resume lock-step execution. In the following (partial!) listing of the execution we mark an active thread with + and a passive thread with -. 25/40

26 Example (4/5). [Execution trace table with columns Cycle, Instruction, t0, t1, t2, t3; the per-thread +/- activity marks belong to the original slide and are not reproduced here. Starting at cycle 16, the traced instruction sequence is: l05: binop p = i - L; l06: bz p, l15; l07: binop p = d % 2; l08: bnz p, l11; l09: load x = i; l10: binop s = s + x; l11: sync; l12: binop d = d + 1; l13: binop i = i + c; l14: jmp l05; then the loop test l05, l06, l07, l08 again; and finally l06: bz p, l15 followed by l15: sync.] 26/40

27 Example (5/5). The instruction l08: bnz p, l11 cannot cause a divergence, even though the predicate p is data dependent on variable d, which is created inside a divergent loop. Variable d is not divergent, although the variable p that controls the loop is. Note that a variable can be divergent inside a loop but uniform outside it. Precise divergence analysis is needed to observe such behavior. Note that on a real GPU we could use simple predication, without any branching, for the "if"; in a PTX compilation, bra.uni instead of bra. 27/40

28 Gated Single Assignment form (GSA). GSA uses three special functions as instructions, µ, γ, and η, instead of the SSA phi-function (φ). The γ function is a join (for branches): γ(p, v1, v2) is v1 if the predicate (the gate) p is true, or else v2. The µ function is a loop join (for loop headers): µ(v1, v2) is v1 for the first iteration and v2 for the rest. η is the loop exit function η(p, v): it binds a loop-dependent value v to the loop predicate p. GSA allows control dependencies to be transformed into data dependencies. The placement of GSA functions into blocks is similar to SSA; however, to save space, we omit the details of GSA. 28/40

29 Example reconsidered with GSA (1/2). Below, sync instructions are implicit in the η and γ functions. Note that the labeling has changed (because of the assignment for the µ function).
l00: const d0 = 0
l01: const s0 = 0
l02: const i0 = Tid
l03: binop x0 = Tid + 1
l04: binop L0 = c * x0
l05: [i1, s1, d1] = µ[(i0, s0, d0), (i2, s3, d2)]
l06: binop p0 = i1 - L0
l07: bz p0, l16
l08: binop p1 = d1 % 2
l09: bnz p1, l12
l10: load x2 = i1
l11: binop s2 = s1 + x2
l12: [s3] = γ(p1, s2, s1)
l13: binop d2 = d1 + 1
l14: binop i2 = i1 + c
l15: jmp l05
l16: [s4, d3] = η(p0, (s1, d1))
l17: binop x3 = d3 - 1
l18: store x3 = s4
l19: stop
29/40

30 Example reconsidered with GSA (2/2). Using GSA we can analyze the divergence behavior, e.g.: The "if" branch instruction (bnz p1, l12) depends on d1, for which we have no divergent definition; thus, the "if" is not divergent. The "for" loop branch instruction (bz p0, l16) depends on p0, for which we have a divergent definition because Tid is divergent; thus, the "for" loop is divergent. The threads go "hand-in-hand" inside the loop, but their exit from the loop is divergent; note, however, that the fourth value of d (i.e., d3) is divergent. In general, we get more precision by considering values instead of variables. 30/40

31 Compiling CTAs 31/40

32 Cooperative Thread Arrays. Applications are typically divided into kernels, and each kernel is capable of spawning many threads. The threads are usually grouped together into thread blocks; to a compiler, these are known as cooperative thread arrays (CTAs). When an application starts its execution, the CTA scheduler initiates scheduling of CTAs onto the available processors. All the threads within a CTA are executed on the same processor in groups of, e.g., 32 threads. This collection of threads is referred to as a warp; the threads within a warp share the same instruction stream, i.e., single instruction multiple threads (SIMT). A launch-configuration sketch is given below. 32/40
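
The hierarchy can be seen directly in a CUDA launch. The sketch below is a generic example (the kernel and sizes are hypothetical): the grid is the set of CTAs, each CTA is a thread block, and the hardware partitions each block into warps of warpSize threads.

#include <cstdio>

__global__ void scale(float* data, float factor, int n) {
    // Each thread block is one CTA; the hardware splits it into warps of warpSize threads.
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
    int warpId = threadIdx.x / warpSize;                  // warp index within the CTA
    (void)warpId;                                         // shown only to illustrate the partitioning
    if (tid < n) data[tid] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    dim3 cta(256);                        // 256 threads per CTA = 8 warps of 32
    dim3 grid((n + cta.x - 1) / cta.x);   // enough CTAs to cover n threads
    scale<<<grid, cta>>>(d, 2.0f, n);     // the CTA scheduler distributes the CTAs onto SMs
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}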

33 CTA formation. CTAs correspond to OpenCL workgroups (CUDA blocks); more fundamentally, a CTA is an abstraction which encapsulates all synchronization in a thread group. CTAs are partitioned into warps; this sub-division is transparent to the application programmer and is an architectural abstraction based on a cost model (memory, regs, ...) implied by the kernel code. Note the hierarchy: an application consists of kernels (Kernel 1, Kernel 2, Kernel 3, ...), a kernel of CTAs (CTA 1, CTA 2, CTA 3, ...), a CTA of warps (Warp 1, Warp 2, Warp 3, ...), and a warp of threads (Thread 1, Thread 2, Thread 3, ...). 33/40

34 CTA issues. The basic idea of having many threads: we can hide memory latency; while a thread waits for memory, we execute other threads. Two-level scheduling: the CTA scheduler (the compiler produces the CTAs) and the warp scheduler (the compiler must co-operate with the scheduler; divergence and synchronization are not trivial). Performance bottlenecks: 1. on-chip memory and register files are limiting factors on parallelism; 2. high control flow divergence; 3. inefficient scheduling mechanisms (round-robin is typical, and memory use is easily congested). 34/40

35 Memory addressing. Specific memory addressing: we know statically the memory space that we are using:
LLD.U32 R5, local [0x10]; // Load from local space
GLD.U32 R5, global14 [R3]; // Load from global space
G2R.U32 R0, g [A3+0x4].U32; // Load from shared space
Generic memory addressing: memory space resolution is dynamic; CUDA supports this (OpenCL calls for explicit usage):
LD.E R5, generic [R3];
35/40
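
In source terms, the distinction shows up as follows. The CUDA kernel below is an illustrative example of my own: the __shared__ buffer and the local array have statically known spaces, whereas the plain pointer parameter of the helper is generic and benefits from memory-space analysis.

// 'p' is a generic pointer: the callee does not know which space it points to.
__device__ float pick(const float* p, int i) {
    return p[i];
}

__global__ void spaces(float* g_out, const float* g_in, int n) {
    __shared__ float s_buf[256];                 // shared space, known statically
    float l_tmp[4];                              // local (private) space, known statically

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    s_buf[threadIdx.x] = g_in[tid];              // global -> shared
    __syncthreads();

    l_tmp[0] = s_buf[threadIdx.x];               // shared -> local
    // The same helper is called with pointers into different spaces; without
    // memory-space analysis the accesses inside it stay generic.
    g_out[tid] = pick(s_buf, threadIdx.x) + pick(l_tmp, 0);
}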

36 Memory space analysis. On some architectures, accesses can be generic: we know only the space of the pointer, not the points-to space; the HW resolves the access type dynamically, typically local (OpenCL: private), shared (OpenCL: local), or global; such HW support makes programming easier. Memory space analysis is not only about address conversions: it also supports alias analysis and memory disambiguation. It is basically a points-to analysis: a forward data-flow analysis using a lattice (as is typical for data-flow), with the meet operator moving toward "unknown"; see the sketch below. 36/40
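
A sketch of such a lattice and its meet operator (the names are my own assumptions) is given below. At a control-flow join the analysis meets the spaces inferred for a pointer along the incoming edges; accesses whose pointer stays Local, Shared, or Global can then be emitted as specific loads/stores, and the rest remain generic.

// Memory-space lattice for a forward data-flow (points-to style) analysis.
enum class Space { Undefined, Local, Shared, Global, Unknown };

// Meet moves toward Unknown: agreeing facts are kept, conflicting facts
// degrade to Unknown, and Undefined (no information yet) is the identity.
Space meet(Space a, Space b) {
    if (a == Space::Undefined) return b;
    if (b == Space::Undefined) return a;
    if (a == b) return a;
    return Space::Unknown;       // conflicting spaces from different paths
}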

37 Memory organization. GPUs usually have memory restrictions. Banked memory is the most typical: each bank can serve only one thread (core) at a time; a full crossbar would be very expensive. Therefore, memory accesses must be correctly structured; a 2-way conflict causes serialization and doubles access times. [Figure: three mappings of threads 0-7 to banks 0-7: linear addressing with stride 1 (no bank conflict), nonlinear (permuted) addressing (no bank conflict), and linear addressing with stride 2 (2-way bank conflict).] A shared-memory sketch is given below. 37/40
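
The effect can be illustrated with a small CUDA shared-memory kernel (an example of my own; the exact bank behavior depends on the hardware generation).

#define TILE 32

__global__ void bank_demo(const float* in, float* out) {
    __shared__ float buf[2 * TILE];

    int t = threadIdx.x;                  // 0..31, one warp assumed
    buf[t]        = in[t];                // fill the buffer
    buf[t + TILE] = in[t + TILE];
    __syncthreads();

    // Stride 1: thread t reads buf[t], so each thread hits a different bank.
    float conflict_free = buf[t];

    // Stride 2: thread t reads buf[2 * t], so pairs of threads hit the same
    // bank, giving a 2-way conflict that serializes the access.
    float two_way = buf[(2 * t) % (2 * TILE)];

    out[t] = conflict_free + two_way;
}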

38 Assigning microthreads. A cooperative thread array (CTA) is the structure that holds the generated threads. Registers are partitioned among threads, and the number of resident threads depends on that: smaller register usage = getting more threads (not like typical RISC compiling). We need a huge number of threads. Consider, e.g., memory latencies of x times the instruction cycle time and assume every y-th instruction accesses memory; then we need x/y ready warps. Typical GPU HW has, e.g., 48 warps scheduled per scheduler (if some instruction stalls, there are others to choose from). Note: having 32 threads in a warp, 48 warps per scheduler, 4 warp schedulers per processor, and 32 processors means 32 x 48 x 4 x 32 = 196,608 simultaneously scheduled threads. As threads are very short, the software should be able to create millions of threads to keep the HW busy! 38/40
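
The arithmetic can be written out as a small sketch; the latency and memory-instruction ratio below are assumed example values, while the warp and scheduler counts are the ones quoted above.

#include <cstdio>

int main() {
    // Assumed example values: 400-cycle memory latency, one memory op per 5 instructions.
    int x = 400, y = 5;
    printf("ready warps needed ~ x/y = %d\n", x / y);

    // Values quoted on the slide.
    int warpWidth = 32, warpsPerScheduler = 48, schedulersPerProc = 4, procs = 32;
    printf("simultaneously scheduled threads = %d\n",
           warpWidth * warpsPerScheduler * schedulersPerProc * procs);   // 196608
    return 0;
}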

39 Code partitioning to remove synchronization
Input: list of statements F in AST representation
Output: list X of code partitions free of barriers
Begin
  P = new partition;
  while F has next statement S do
    switch type of statement S
      case barrier:
        Add P to X; P = new partition;
      case simple statement:
        Add S to P;
      case statement sequence:
        Prepend the statements comprising S to F;
      otherwise:
        if S contains a barrier statement then
          Add P to X;
          Invoke recursively for the body of S, yielding a list L;
          Append L to X;
          P = new partition;
        else
          Add S to P;
39/40
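
To illustrate what the partitioning produces, the sketch below shows a hypothetical CUDA kernel with a single barrier and the two barrier-free partitions a compiler could extract from it (the names and code are my own, not the algorithm's actual output format).

// Hypothetical kernel with one barrier; the barrier splits the code in two.
__global__ void before(float* out, const float* in) {
    __shared__ float buf[256];
    buf[threadIdx.x] = in[threadIdx.x];        // partition 1
    __syncthreads();                           // barrier
    out[threadIdx.x] = buf[255 - threadIdx.x]; // partition 2
}

// Partition 1: everything up to (but excluding) the barrier.
__device__ void partition1(float* buf, const float* in, int tid) {
    buf[tid] = in[tid];
}

// Partition 2: everything after the barrier.
__device__ void partition2(float* out, const float* buf, int tid) {
    out[tid] = buf[255 - tid];
}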

40 Optimizations. Optimizing applications can be hard: some parallel algorithms are intrinsically divergent, and programmers need to know the overall platform properties. Precise identification of divergences is hard; compiler support is a big help. Divergence-aware optimizations: branch distribution merges common code inside potentially divergent program paths; branch fusion, a generalization of branch distribution, joins chains of common instructions present in two divergent paths; branch splitting divides a parallelizable loop enclosing a multi-path branch into multiple loops, each containing only one branch; thread reallocation regroups divergent threads among warps, so that only one or just a few warps contain divergent threads. A branch-distribution sketch is given below. 40/40
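
As an illustration of branch distribution, the sketch below (my own example) hoists the work common to both sides of a Tid-dependent branch out of the divergent region, leaving only a small divergent remainder; the arrays are assumed large enough for the launched threads.

// Before: the whole body, including the shared multiply, sits under a
// Tid-dependent branch and is executed divergently.
__global__ void before_bd(float* out, const float* a, const float* b) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = a[tid] * b[tid] + 1.0f;
    else
        out[tid] = a[tid] * b[tid] - 1.0f;
}

// After: the common product is computed uniformly; only the small
// +1/-1 part remains divergent (here folded into a select).
__global__ void after_bd(float* out, const float* a, const float* b) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float common = a[tid] * b[tid];               // executed by all threads together
    float delta  = (tid % 2 == 0) ? 1.0f : -1.0f; // the divergent remainder
    out[tid] = common + delta;
}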
