Power-efficient Computing for Compute-intensive GPGPU Applications

1 Nam Sung Kim, w/ Syed Zohaib Gilani* and Michael J. Schulte*. University of Wisconsin-Madison; *Advanced Micro Devices
2 Modern GPU architectures are deeply pipelined for efficient resource sharing, with several buffering and collision-avoidance stages. The long read-after-write (RAW) latency is hidden with extensive multi-threading, so there is no power-hungry data forwarding network (DFN). Modern GPGPU applications, however, often have insufficient thread-level parallelism (TLP) to hide the long RAW latency (18-24 cycles): RAW stalls account for 40% of total stalls.
3-4 Contributions (23% SP/INT and 33% DP higher performance):
- most-recent result forwarding (MoRF): stores the 3 most recent execution results per thread, covering 80% of RAW stalls
- high-throughput FMA unit (HFMA): predicts the exponent, reducing effective FP op latency to 1 cycle
- dual-VDD pipeline: reduces the VDD of the FPUs while keeping frequency the same; MoRF and HFMA counter the negative impact of the increased pipeline stages, allowing more cores under the power constraint
5 Outline:
- motivation: insufficient TLP to hide long RAW latency
- proposed GPU architectural techniques: MoRF, HFMA, dual-VDD pipeline
- evaluation
- summary
6 SM (Fermi-like architecture): 32-wide SIMT with 32K registers; up to 1536 threads (8 CTAs) per SM; 24-cycle RAW latency, longer for SFU ops. Hiding the RAW latency requires 24 active warps: 32 threads/warp x 24 warps = 768 active threads. Pipeline: front-end, register address unit, bank arbitration unit, RF banks+caches, dispatcher, operand collectors, FP/INT execution units, result queue, writeback. (SM: streaming multiprocessor; CTA: cooperative thread array; SIMT: single-instruction, multiple-thread)
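Slide 6's warp-count requirement can be checked with simple arithmetic. This is a back-of-the-envelope sketch, not the paper's model: it assumes one instruction issues per warp per cycle, so one resident warp per latency cycle keeps the scheduler busy.

```python
# With a 24-cycle RAW latency and one issue slot per cycle, 24 warps
# must be resident so a ready warp is always available while the
# others wait on their results.
RAW_LATENCY_CYCLES = 24
THREADS_PER_WARP = 32

warps_needed = RAW_LATENCY_CYCLES          # one issue slot per cycle
threads_needed = warps_needed * THREADS_PER_WARP

print(warps_needed, threads_needed)        # 24 warps -> 768 threads
```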
7-9 Application characteristics. The register file is 10% of GPU power, and shared memory is 8%. How is an application divided into kernels, and how much TLP exists in different kernels? Kernels are partitioned to avoid non-coalesced memory accesses and resource contention, and to allow synchronization and data sharing between threads. GPUs allocate registers statically during work distribution: threads are grouped in units of CTAs, the number of registers per thread is determined by the compiler/driver, and a CTA can be issued only if there are enough registers for all its threads. Likewise, the shared/constant memory requirement per CTA is determined by the compiler/driver, and an SM must have enough memory to accommodate a complete CTA.
10-12 Performance of low-occupancy applications is constrained by long RAW latencies. Examples:
- DCT: application partitioning (512 threads), thread synchronization, and data dependencies limit it to 8 CTAs with 64 threads/CTA, yielding low occupancy
- NDL: 16 KB of shared memory per CTA with 44 threads/CTA limits it to 132 resident threads, 9% occupancy
- CFD: 47 registers per thread with 192 threads/CTA yields 37% occupancy
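The occupancy limits on slides 10-12 follow from the per-SM resource budgets on slide 6. The sketch below computes the binding limiter; the 48 KB shared-memory capacity is an assumption based on Fermi, not stated on the slides.

```python
# Fermi-like SM budgets from slide 6 (48 KB shared memory assumed).
SM = {"regs": 32 * 1024, "smem": 48 * 1024, "max_threads": 1536, "max_ctas": 8}

def occupancy(threads_per_cta, regs_per_thread, smem_per_cta):
    """Fraction of the SM's thread slots filled by resident CTAs."""
    limits = [SM["max_ctas"], SM["max_threads"] // threads_per_cta]
    if regs_per_thread:
        limits.append(SM["regs"] // (regs_per_thread * threads_per_cta))
    if smem_per_cta:
        limits.append(SM["smem"] // smem_per_cta)
    resident_ctas = min(limits)            # the scarcest resource wins
    return resident_ctas * threads_per_cta / SM["max_threads"]

# CFD from slide 12: 47 registers/thread, 192 threads/CTA
cfd = occupancy(192, 47, 0)
print(f"{int(cfd * 100)}%")   # registers allow only 3 CTAs -> 37%
```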
13 (chart of per-application stall breakdown, occupancies ranging from 8% to 50%) RAW stalls due to low occupancy account for 40% of total stalled cycles; an ideal DFN would give an 18% IPC improvement.
14-15 Forwarding requirement: results in 3 post-execution stages must be forwarded to 14 pre-execution stages. With 32 lanes per SM, 32 bits per lane, and 2 datapaths (INT and FP), an ideal DFN needs about 86K wires plus multiplexers: very high wiring complexity, and a high overhead of 16% of GPU power. (Pipeline: RF banks+caches, dispatcher, operand collectors, FP/INT execution units, result queue, writeback.)
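The "86K wires" figure is consistent with simply multiplying the slide's numbers together. How the slide counts wires is an assumption on my part (every post-execution stage forwarding to every pre-execution stage), but the product matches:

```python
# One consistent reading of slides 14-15: each of the 3 post-execution
# stages forwards to each of the 14 pre-execution stages, across 32
# lanes of 32 bits each, duplicated for the INT and FP datapaths.
post_exe_stages = 3
pre_exe_stages = 14
lanes = 32
bits_per_lane = 32
datapaths = 2   # INT and FP

wires = post_exe_stages * pre_exe_stages * lanes * bits_per_lane * datapaths
print(wires)    # 86016, i.e. ~86K
```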
16 Dependence-distance example:
mul.wide.u16 $r3, %ctaid.y, 0x8;
cvt.u32.u16 $r2, $r0.lo;
mov.u32 $r0, s[0x0018];
add.u32 $r6, $r3, $r2;
mul.wide.u16 $r4, $r0.lo, $r6.hi;
The source operands here have dependence distances of one, two, and three. 80% of dynamic instructions read a result produced within the last three instructions, so MoRF stores the 3 most recent execution results per thread in a forwarding buffer (FB).
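The dependence-distance statistic on slide 16 could be gathered from an instruction trace as sketched below. The trace encoding (a destination plus a list of sources per instruction) is invented for illustration; the five instructions are the slide's example.

```python
# For each source register, the dependence distance is how many
# instructions back its producer executed in the trace.
from collections import Counter

# (dest, [sources]) for the slide's five instructions
trace = [
    ("$r3", ["%ctaid.y"]),       # mul.wide.u16
    ("$r2", ["$r0"]),            # cvt.u32.u16
    ("$r0", []),                 # mov.u32 (reads a constant bank)
    ("$r6", ["$r3", "$r2"]),     # add.u32
    ("$r4", ["$r0", "$r6"]),     # mul.wide.u16
]

last_writer = {}
distances = Counter()
for i, (dest, srcs) in enumerate(trace):
    for s in srcs:
        if s in last_writer:                 # producer inside the trace
            distances[i - last_writer[s]] += 1
    last_writer[dest] = i

print(sorted(distances.items()))   # [(1, 1), (2, 2), (3, 1)]
```

All observed distances fall within three, which is exactly what three FB entries per thread can cover.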
17-22 Most-recent result forwarding (MoRF). The forwarding buffer (FB) sits beside the operand collectors, between the dispatch port and the FP/INT units, and has three banks, FB[0], FB[1], and FB[2], with one 32-bit entry per warp (warp[0] .. warp[47]) in each bank, indexed by warp id. As the example warp executes, each writeback also deposits its result into the warp's FB entries in turn: $r3, then $r2, then $r0. When add.u32 $r6, $r3, $r2 issues, the scoreboarding logic dynamically modifies its source register ids to read $r3 and $r2 from FB[0] and FB[1] instead of the register file; the new result $r6 then overwrites the oldest entry. The FB thus always holds the three most recent execution results per thread.
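The FB behavior on slides 17-22 can be sketched behaviorally. This is a deliberate simplification of the hardware: one warp, register ids only, no values, no banking, and eviction modeled as a bounded deque.

```python
# Each warp keeps its 3 most recent destination registers; a source
# found there is read from the FB instead of the register file.
from collections import deque

fb = deque(maxlen=3)          # the three FB entries for one warp, newest first

def issue(dest, srcs):
    """Return which sources hit the FB, then write back dest."""
    hits = [s for s in srcs if s in fb]   # scoreboarding redirects these reads
    fb.appendleft(dest)                   # newest result in; oldest evicted
    return hits

issue("$r3", ["%ctaid.y"])
issue("$r2", ["$r0"])
issue("$r0", [])
add_hits = issue("$r6", ["$r3", "$r2"])   # both sources forwarded from the FB
mul_hits = issue("$r4", ["$r0", "$r6"])   # both sources forwarded from the FB
print(add_hits, mul_hits)
```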
23-24 HFMA: early forwarding via exponent prediction. Predict that the result exponent lies in the range [exp_blk - 13 : exp_blk + 13]; this requires a 75-bit adder instead of a 48-bit one. The intermediate result, together with its [exp_blk - 13 : exp_blk + 13] exponent range, is stored in a 48-entry accumulate register file (ARF). Datapath: the 8-bit exponents of A, B, and C feed an exponent comparison; the 24-bit significands go through the significand multiplier and alignment into the 75-bit adder, then LZC & normalizer, exponent update, and rounder. Forwarding the unrounded intermediate result from the ARF breaks the 7-cycle dependence chain and allows a 1-cycle effective execution latency.
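The 75-bit adder width on slide 23 is consistent with the following arithmetic. The derivation is my reading of the slide's numbers, not the paper's exact reasoning:

```python
# A 24x24-bit significand product is 48 bits wide; accepting any
# accumulator whose exponent lies in [exp_blk-13 : exp_blk+13] means
# the addend may be shifted up to 13 positions either way, so the
# adder must span 48 + 13 + 13 + 1 (carry) = 75 bits.
product_bits = 24 * 2        # 48-bit significand product
exponent_slack = 13          # predicted range is +/-13 around exp_blk

adder_bits = product_bits + 2 * exponent_slack + 1
print(adder_bits)            # 75
```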
25-26 A dedicated DP FPU is power-hungry: it needs a 54-bit x 54-bit significand multiplier, 163-bit align, add, and normalize stages, and a 53-bit rounder. Instead, reuse the SP HFMA for DP ops by shifting and accumulating partial products: each 54-bit significand is split into high and low halves, and the four partial products (A_HI x B_HI, A_HI x B_LO, A_LO x B_HI, A_LO x B_LO) are accumulated over 4 cycles per DP FMA. This requires a 27-bit multiplier instead of the SP HFMA's 24-bit one. The resulting SPwDP unit (27-bit x 27-bit multiplier, 163-bit align/add/normalize, 53-bit round) has far lower dynamic and leakage power than a dedicated DP HFMA (power normalized to the SP HFMA). Replacing 8 out of 16 DP FPUs with SPwDP units saves power.
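The shift/accumulate scheme on slides 25-26 can be verified with integer arithmetic. This sketch shows why a 27-bit multiplier suffices: four 27x27 partial products, shifted into place, reconstruct the full 54x54 product. The operand values are arbitrary test inputs.

```python
# Build a 54x54-bit significand product from four 27x27-bit partial
# products, one per cycle, as the SPwDP unit does.
HALF = 27
MASK = (1 << HALF) - 1

def dp_multiply_27bit(a, b):
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK
    acc = 0
    # one partial product per cycle, shifted into place and accumulated
    for pp, shift in [(a_hi * b_hi, 2 * HALF),
                      (a_hi * b_lo, HALF),
                      (a_lo * b_hi, HALF),
                      (a_lo * b_lo, 0)]:
        acc += pp << shift
    return acc

a, b = (1 << 53) | 0x1234567, (1 << 53) | 0x89ABCDE   # 54-bit significands
print(dp_multiply_27bit(a, b) == a * b)               # True
```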
27 SP/INT benchmark results (memory-intensive and INT-only benchmarks marked in the chart): MoRF speedup 14%; MoRF+HFMA speedup 18%.
28 DP benchmark results (memory-intensive benchmarks marked in the chart): the longer DP FP latency is replaced by the shorter HFMA latency. MoRF speedup 15%; MoRF+HFMA speedup 30%.
29 Baseline pipeline: RF banks+caches, dispatcher, operand collectors, FP/INT execution units (8 cycles), result queue, writeback. FPU power consumption is 35% of total GPU power under TDP conditions!
30-32 Dual-VDD pipeline: lower the VDD of the execution units and double the execution latency (8 to 16 cycles) to maintain the same frequency; this needs only a voltage-domain crossing, with no frequency-domain crossing. MoRF and HFMA counter the negative impact of the increased execution latency. The lower GPU power then allows increasing the number of SMs (i.e., higher performance) for the same power budget. (Pipeline: RF banks+caches, dispatcher, operand collectors, FP/INT execution units at lower VDD, result queue, writeback.)
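A rough sanity check of the dual-VDD savings: dynamic power scales roughly as V^2 at fixed frequency. The 35% execution-unit fraction comes from slide 29; the specific voltage levels below are my assumption, not stated on the slides.

```python
# Dynamic power ~ C * V^2 * f; only the execution units drop to the
# lower voltage domain, so only their 35% share of GPU power scales.
exe_fraction = 0.35          # slide 29: FPUs are 35% of GPU power
v_nominal, v_low = 1.0, 0.8  # assumed voltages for illustration

exe_scale = (v_low / v_nominal) ** 2     # 0.64x execution-unit power
gpu_saving = exe_fraction * (1 - exe_scale)

print(f"{gpu_saving:.3f}")   # 0.126, i.e. ~12.6% of GPU power, in the
                             # ballpark of slide 33's 14% peak reduction
```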
33 Power overhead/savings summary:
- ideal DFN: 16% GPU peak power overhead
- MoRF: 0.9% GPU peak power overhead
- MoRF+HFMA: 20% SP FMA power increase (to support DP FMA operations), 50% DP FMA power decrease (for half as many DP FMA units), 6.4% GPU peak power reduction
- MoRF+HFMA+dual-VDD: 35% execution-unit power reduction (including additional pipeline overhead), 14% GPU peak power reduction
MoRF+HFMA draws slightly higher power due to the additional logic and SRAM; the dual-VDD design reduces both SP and DP power consumption.
34 SP/INT benchmarks: 23% speedup with MoRF+HFMA+dual-VDD and more SMs, a 5% additional improvement over MoRF+HFMA.
35 DP benchmarks: 33% speedup with MoRF+HFMA+dual-VDD and more SMs, a 3% additional improvement over MoRF+HFMA.
36-37 Summary: insufficient TLP to hide the long RAW latency causes 40% of total stalls, leading to an 18% performance loss. MoRF stores the 3 most recent execution results per thread; HFMA reduces effective FP op latency to 1 cycle; the dual-VDD pipeline reduces the VDD of the FPUs while keeping frequency the same, allowing more cores under the power constraint. Overall: 23% (SP/INT) and 33% (DP) higher performance.
38 Questions?
Power-efficient Computing for Compute-intensive GPGPU Applications. Syed Zohaib Gilani, Nam Sung Kim, Michael J. Schulte. The University of Wisconsin-Madison, WI, U.S.A.; Advanced Micro Devices, TX, U.S.A.
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationCORF: Coalescing Operand Register File for GPUs
CORF: Coalescing Operand Register File for GPUs Hodjat Asghari Esfeden University of California, Riverside Riverside, CA hodjat.asghari@email.ucr.edu Farzad Khorasani Tesla, Inc. Palo Alto, CA fkhorasani@tesla.com
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More informationPipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1
55:3/C:60 Spring 00 Pipelined Design Motivation: Increase processor throughput with modest increase in hardware. Bandwidth or Throughput = Performance Pipelined Processors Chapter Bandwidth (BW) = no.
More information1/26/09. Administrative. L4: Hardware Execution Model and Overview. Recall Execution Model. Outline. First assignment out, due Friday at 5PM
Administrative L4: Hardware Execution Model and Overview January 26, 2009 First assignment out, due Friday at 5PM Any questions? New mailing list: cs6963-discussion@list.eng.utah.edu Please use for all
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION
April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU
More information3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More informationece4750-t11-ooo-execution-notes.txt ========================================================================== ece4750-l12-ooo-execution-notes.txt ==========================================================================
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationGPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 GPU Performance vs. Thread-Level Parallelism: Scalability Analysis
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationUnderstanding Outstanding Memory Request Handling Resources in GPGPUs
Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationVertex Shader Design II
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationMetodologie di Progettazione Hardware-Software
Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism
More informationCO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19
CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be
More informationAppendix C: Pipelining: Basic and Intermediate Concepts
Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More information