Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das
1 Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das
2 Parallelize your code! Launch more threads! Multi-threading Caching Main Memory Prefetching Improve Replacement Policies Is the Warp Scheduler aware of these techniques? Improve Memory Scheduling Policies Improve Prefetcher (look deep in the future, if you can!)
3 [Diagram: a warp scheduler aware of multi-threading (Two-level Scheduling, MICRO), caching (Cache-Conscious Scheduling, MICRO), and main memory (Thread-Block-Aware Scheduling, OWL, ASPLOS). What about prefetching?]
4 Our Proposal
- Prefetch-Aware Warp Scheduler
- Goals:
  - Make a simple prefetcher more capable
  - Improve system performance by orchestrating scheduling and prefetching mechanisms
- % average IPC improvement over Prefetching + Conventional Warp Scheduling Policy
- % average IPC improvement over Prefetching + Best Previous Warp Scheduling Policy
5 Outline
- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions
6 High-Level View of a GPU: [Diagram: Streaming Multiprocessors (SMs) run threads grouped into Cooperative Thread Arrays (CTAs), also called thread blocks; each SM has a warp scheduler, L1 caches, a prefetcher, and ALUs, and connects through the interconnect to the L2 cache and DRAM]
7 Warp Scheduling Policy
- Equal scheduling priority
  - Round-Robin (RR) execution
- Problem: warps stall roughly at the same time
[Timeline: all warps finish a compute phase, then all issue DRAM requests (D) at the same time, so the SIMT core stalls until the data returns and the next compute phase can start]
8 TWO-LEVEL (TL) SCHEDULING [Timeline: warps are split into groups; while one group waits on its DRAM requests (D), the other group computes, so part of the memory latency is hidden and cycles are saved compared to RR, where every warp stalls at once]
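To make the contrast concrete, here is a minimal sketch in Python (a model, not the simulator's actual code) of how a round-robin selector differs from a two-level selector that keeps issuing from one fetch group until the whole group stalls; the group size and the ready/stalled bookkeeping are simplifying assumptions:

    # Sketch: round-robin vs. two-level warp selection (not the real hardware logic).
    def rr_select(warps, stalled, last):
        """Round-robin: rotate over all warps, skipping the stalled ones."""
        n = len(warps)
        for i in range(1, n + 1):
            w = warps[(last + i) % n]
            if not stalled[w]:
                return w
        return None  # every warp is stalled at once -> the SIMT core idles

    def tl_select(groups, stalled):
        """Two-level: pick from the highest-priority group that still has a
        ready warp; other groups are touched only when this one is stalled."""
        for group in groups:                 # groups ordered by priority
            for w in group:
                if not stalled[w]:
                    return w
        return None

    # TL groups consecutive warps: while group 0 waits on DRAM, group 1 computes.
    warps = list(range(8))
    groups = [warps[i:i + 4] for i in range(0, len(warps), 4)]

Because one group reaches its memory phase while another group still has computation left, part of the DRAM latency is hidden, which is exactly the "saved cycles" in the timeline above.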
9 Accessing DRAM [Figure: memory addresses X, X+1, X+2, X+3 map to one bank and Y, Y+1, Y+2, Y+3 to another. With RR, both address streams arrive together: high bank-level parallelism and high row buffer locality. With TL, one group's requests arrive at a time, leaving a bank idle for a period: low bank-level parallelism, high row buffer locality]
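The bank-level parallelism vs. row buffer locality trade-off can be illustrated with a toy address mapping (the mapping below, two banks with four blocks per row, is an assumption for illustration only, not the DRAM configuration used in the paper):

    # Toy DRAM mapping: consecutive blocks share a row; rows interleave across banks.
    NUM_BANKS = 2
    BLOCKS_PER_ROW = 4

    def bank_of(block):
        return (block // BLOCKS_PER_ROW) % NUM_BANKS

    def row_of(block):
        return block // (BLOCKS_PER_ROW * NUM_BANKS)

    def blp_and_rbl(requests):
        """Bank-level parallelism = distinct banks touched by the request burst;
        row buffer locality = fraction of requests hitting the open row."""
        open_row, hits = {}, 0
        for addr in requests:
            b, r = bank_of(addr), row_of(addr)
            if open_row.get(b) == r:
                hits += 1
            open_row[b] = r
        return len({bank_of(a) for a in requests}), hits / len(requests)

    group_x, group_y = [0, 1, 2, 3], [4, 5, 6, 7]        # "X..X+3" and "Y..Y+3"
    print(blp_and_rbl(group_x + group_y))  # RR: both groups together -> 2 banks busy
    print(blp_and_rbl(group_x))            # TL: one group at a time  -> 1 bank busy

Either way most requests hit the open row, but serving one group at a time leaves the other group's bank idle, which is the TL drawback that prefetching will later address.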
10 Warp Scheduler Perspective (Summary)
Warp Scheduler | Forms Multiple Warp Groups? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | No | High | High
Two-Level (TL) | Yes | Low | High
11 Evaluating RR and TL schedulers [Bar chart: IPC improvement factor with a perfect L1 cache for Round-Robin (RR) and Two-Level (TL) across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and GMEAN] Can we further reduce this gap via prefetching?
12 (1) Prefetching: Saves more cycles [Timeline: under TL, group 2's compute phase starts only after its DRAM requests (D) return. With prefetching, prefetch requests (P) for group 2 are issued while group 1 computes, so group 2's compute phase can start earlier and more cycles are saved than with TL alone]
13 (2) Prefetching: Improves DRAM Bandwidth Utilization [Figure: prefetch requests to the other group's addresses (Y, Y+1, Y+2, Y+3) keep both banks busy: no idle period, high bank-level parallelism, and high row buffer locality]
14 Challenge: Designing a Prefetcher [Figure: the demanded addresses (X, X+1, ...) and the addresses the other group will need (Y, Y+1, ...) sit in different banks and are not contiguous, so predicting Y from X requires a sophisticated prefetcher]
15 Our Goal
- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher. To this end, we will design a prefetch-aware warp scheduling policy.
- Why? Simple prefetching does not improve performance with the existing scheduling policies.
16 Simple Prefetching + RR scheduling [Timeline: with RR, the prefetch requests (P) are issued alongside the demands (D) they are meant to cover, so they overlap with those demands (late prefetches) and no cycles are saved]
17 Simple Prefetching + TL scheduling [Timeline: within a group, consecutive warps are scheduled together, so the simple prefetches again overlap with the demands they target (late prefetches); no cycles are saved over TL]
18 Let's Try a Simple Prefetcher: on a demand to address X, prefetch X + 1.
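A minimal sketch of such a next-line prefetcher (the issue/in-flight interface is hypothetical and only stands in for the memory system; the next-line rule itself is the point):

    def next_line(block):
        """Simple prefetcher: on a demand to block X, propose block X + 1."""
        return block + 1

    def on_demand_miss(block, cache, memory):
        # Issue the demand, then the next-line prefetch if it is not already
        # cached or in flight (memory.issue / memory.in_flight are assumed hooks).
        memory.issue(block, is_prefetch=False)
        candidate = next_line(block)
        if candidate not in cache and not memory.in_flight(candidate):
            memory.issue(candidate, is_prefetch=True)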
19 Simple Prefetching with TL scheduling [Figure: group 1 demands X, X+1, X+2, X+3; the next-line prefetcher then prefetches an address that may not be equal to Y, the address group 2 actually needs, so the prefetches are useless (UP) and a bank still sits idle for a period]
20 Simple Prefetching with TL scheduling [Timeline: the useless prefetches (U) consume DRAM bandwidth without helping the next group, so no cycles are saved over TL]
21 Warp Scheduler Perspective (Summary)
Warp Scheduler | Forms Multiple Warp Groups? | Simple Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | No | No | High | High
Two-Level (TL) | Yes | No | Low | High
22 Our Goal
- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher. To this end, we will design a prefetch-aware warp scheduling policy.
- Simple prefetching does not improve performance with the existing scheduling policies.
23 Prefetch-Aware (PA) Warp Scheduler + Simple Prefetcher: together they deliver what a sophisticated prefetcher would.
24 Prefetch-aware (PA) warp scheduling [Figure: warps access consecutive addresses X, X+1, X+2, X+3, Y, Y+1, Y+2, Y+3. Round-robin scheduling forms no groups; two-level scheduling puts consecutive warps in the same group; prefetch-aware scheduling associates non-consecutive warps with one group, e.g. one group covers X, X+2, Y, Y+2 and the other covers X+1, X+3, Y+1, Y+3] See paper for the generalized algorithm of the PA scheduler.
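A minimal sketch of the grouping step (group formation only; the exact interleaved assignment below is a simplification of the paper's generalized PA algorithm, which also governs when each group is scheduled):

    def tl_groups(warps, group_size):
        """Two-level: consecutive warps land in the same fetch group."""
        return [warps[i:i + group_size] for i in range(0, len(warps), group_size)]

    def pa_groups(warps, num_groups):
        """Prefetch-aware: consecutive warps are spread over different groups, so
        one group's demands (X, X+2, ...) let a simple next-line prefetcher fetch
        the other group's blocks (X+1, X+3, ...) before that group is scheduled."""
        groups = [[] for _ in range(num_groups)]
        for i, w in enumerate(warps):
            groups[i % num_groups].append(w)
        return groups

    warps = list(range(8))        # warps touching consecutive blocks X, X+1, ..., X+7
    print(tl_groups(warps, 4))    # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(pa_groups(warps, 2))    # [[0, 2, 4, 6], [1, 3, 5, 7]]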
25 Simple Prefetching with PA scheduling [Figure: addresses X, X+1, X+2, X+3 and Y, Y+1, Y+2, Y+3 spread over two banks] The reasoning behind the non-consecutive warp grouping is that the groups can prefetch for each other using the simple prefetcher: the demands of one group's warps trigger next-line prefetches that cover the other group's addresses.
26 Simple Prefetching with PA scheduling [Figure: when the second group is scheduled, its lines have already been prefetched, so its accesses are cache hits]
27 Simple Prefetching with PA scheduling [Timeline: prefetch requests (P) for the second group are issued while the first group computes, so the second group's compute phase starts earlier; cycles are saved over TL, matching the benefit of a sophisticated prefetcher]
28 DRAM Bandwidth Utilization [Figure: with PA scheduling plus the simple prefetcher, both banks stay busy: % increase in bank-level parallelism and % decrease in row buffer locality relative to TL, i.e. high bank-level parallelism while row buffer locality remains high]
29 Warp Scheduler Perspective (Summary)
Warp Scheduler | Forms Multiple Warp Groups? | Simple Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | No | No | High | High
Two-Level (TL) | Yes | No | Low | High
Prefetch-Aware (PA), with prefetching | Yes | Yes | High | High (slightly reduced)
30 Outline
- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions
31 Evaluation Methodology
- Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
- Baseline Architecture
  - SMs and memory controllers connected by a crossbar
  - SIMT width and maximum threads per core as in the baseline configuration
  - L1 data cache, texture and constant caches per SM
  - L1 data cache prefetcher, GDDR DRAM
- Applications chosen from:
  - MapReduce applications
  - Rodinia heterogeneous applications
  - Parboil throughput-computing-focused applications
  - NVIDIA CUDA SDK GPGPU applications
32 Spatial Locality Detector based Prefetching [Figure: a macro-block groups consecutive cache lines X, X+1, X+2, X+3; when some of its lines are demanded (D), the prefetcher issues prefetches (P) for the lines of the same macro-block that were not accessed. D = Demand, P = Prefetch] The prefetch-aware scheduler improves the effectiveness of this simple prefetcher. See paper for more details.
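A minimal sketch of the macro-block rule (the four-line macro-block size and the accessed-line bookkeeping are illustrative assumptions; see the paper for the actual spatial locality detector):

    MACRO_BLOCK_LINES = 4   # assumed number of consecutive cache lines per macro-block

    def macro_block_of(line):
        return line // MACRO_BLOCK_LINES

    def sld_prefetch_candidates(demanded_line, accessed_lines):
        """On a demand (D) to one line, prefetch (P) the lines of the same
        macro-block that have not been demanded yet."""
        base = macro_block_of(demanded_line) * MACRO_BLOCK_LINES
        block = range(base, base + MACRO_BLOCK_LINES)
        return [l for l in block if l != demanded_line and l not in accessed_lines]

    # Demand to X+1 when X was already demanded -> prefetch X+2 and X+3.
    print(sld_prefetch_candidates(5, accessed_lines={4}))   # [6, 7]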
33 Improving Prefetching Effectiveness [Charts comparing RR+Prefetching, TL+Prefetching, and PA+Prefetching on three metrics: fraction of late prefetches, prefetch accuracy, and reduction in L1D miss rates; PA+Prefetching improves all three]
34 Performance Evaluation [Bar chart, normalized to RR scheduling: RR+Prefetching, TL, TL+Prefetching, Prefetch-Aware (PA), and PA+Prefetching across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and GMEAN] % IPC improvement over Prefetching + RR Warp Scheduling Policy (commonly used); % IPC improvement over Prefetching + TL Warp Scheduling Policy (best previous). See paper for additional results.
35 Conclusions
- Existing warp schedulers in GPGPUs cannot take advantage of simple prefetchers
  - Consecutive warps have good spatial locality and can prefetch well for each other
  - But existing schedulers schedule consecutive warps close by in time, so prefetches are too late
- We proposed prefetch-aware (PA) warp scheduling
  - Key idea: group consecutive warps into different groups
  - Enables a simple prefetcher to be timely, since warps in different groups are scheduled at separate times
- Evaluations show that PA warp scheduling improves performance over combinations of conventional (RR) and the best previous (TL) warp scheduling and prefetching policies
  - Better orchestrates warp scheduling and prefetching decisions
36 THANKS! QUESTIONS?
37 BACKUP
38 Effect of Prefetch-aware Scheduling [Chart: percentage of DRAM requests (averaged over a group) grouped by how many misses fall in the same macro-block; requests with many misses to a macro-block are high-spatial-locality requests, and under prefetch-aware scheduling a large share of them is recovered by prefetching compared to two-level scheduling]
39 Working (With Two-Level Scheduling) [Figure: two macro-blocks, X..X+3 and Y..Y+3, are filled entirely by demand requests (D): all the high-spatial-locality requests go to DRAM as demands]
40 Working (With Prefetch-Aware Scheduling) [Figure: in each macro-block, the demands (D) of the currently scheduled group are interleaved with prefetches (P) that cover the other group's lines]
41 Working (With Prefetch-Aware Scheduling) [Figure: when the other group is scheduled, its lines in both macro-blocks have already been prefetched, so its accesses are cache hits]
42 Effect on Row Buffer Locality [Bar chart: row buffer locality for TL, TL+Prefetching, PA, and PA+Prefetching across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and AVG] % decrease in row buffer locality over TL.
43 Effect on Bank-Level Parallelism [Bar chart: bank-level parallelism for RR, TL, and PA across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and AVG] % increase in bank-level parallelism over TL.
44 Simple Prefetching + RR scheduling [Figure: memory addresses X, X+1, X+2, X+3 and Y, Y+1, Y+2, Y+3 spread over two banks; with RR, requests to both banks arrive together, so neither bank idles]
45 Simple Prefetching with TL scheduling [Figure: with TL, one group's addresses are requested at a time, so one bank is idle for a period while the other group's bank is busy]
46 CTA-Assignment Policy (Example) [Figure: the CTAs of a multi-threaded CUDA kernel are distributed across SIMT cores; each core receives some of the CTAs and has its own warp scheduler, L1 caches, and ALUs]
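A minimal sketch of spreading a kernel's CTAs over SIMT cores (the round-robin order is an assumption for this example; real hardware dispatches CTAs to cores as their resources free up):

    def assign_ctas(num_ctas, num_cores):
        """Distribute the kernel's CTAs across SIMT cores in round-robin order;
        each core then runs its CTAs' warps under its own warp scheduler."""
        assignment = {core: [] for core in range(num_cores)}
        for cta in range(num_ctas):
            assignment[cta % num_cores].append(cta)
        return assignment

    print(assign_ctas(num_ctas=4, num_cores=2))   # {0: [0, 2], 1: [1, 3]}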