Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

Size: px
Start display at page:

Download "Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das"

Transcription

1 Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

2 Parallelize your code! Launch more threads! Multi- threading Caching Main Memory Prefetching Improve Replacement Policies Is the arp Scheduler aware of these techniques? Improve Memory Scheduling Policies Improve Prefetcher (look deep in the future, if you can!)

3 Two-level Scheduling MICRO Multi- threading Caching Main Memory Aware arp Scheduler Cache-Conscious Scheduling, MICRO Prefetching Thread-Block-Aware Scheduling (OL) ASPLOS?

4 Our Proposal n Prefetch Aware arp Scheduler n Goals: q Make a Simple prefetcher more Capable q Improve system performance by orchestrating scheduling and prefetching mechanisms n % average IPC improvement over q Prefetching + Conventional arp Scheduling Policy n % average IPC improvement over q Prefetching + Best Previous arp Scheduling Policy

5 Outline n Proposal n Background and Motivation n Prefetch-aware Scheduling n Evaluation n Conclusions

6 High- Level View of a GPU Streaming Multiprocessors (SMs) Threads CTA CTA CTA CTA arps Scheduler L Caches Prefetcher ALUs Interconnect Cooperative Thread Arrays (CTAs) Or Thread Blocks L cache DRAM

7 arp Scheduling Policy n Equal scheduling priority q Round-Robin (RR) execution n Problem: arps stall roughly at the same time Phase () SIMT Core Stalls Phase () DRAM Requests D D D D D D D D Time

8 Phase () DRAM Requests D D D D D D D D SIMT Core Stalls Phase () Time Group Phase () DRAM Requests D D D D TO LEVEL (TL) SCHEDULING Group Phase () D D D D Group Comp. Phase () Group Comp. Phase () Saved Cycles

9 Accessing DRAM High Bank- Level Parallelism High Row Buffer Locality X X + X + X + Y Y + Bank Bank Y + Y + Memory Addresses Legend Group Group Idle for a period Bank Bank Low Bank-Level Parallelism High Row Buffer Locality

10 arp Scheduler Perspec?ve (Summary) arp Scheduler Round- Robin (RR) Forms Multiple arp Groups? DRAM Bandwidth Utilization Bank Level Parallelism Row Buffer Locality Two-Level (TL) 0

11 Evalua?ng RR and TL schedulers 0 SSC Round-robin (RR) PVC KMN SPMV BFSR FFT SCP Two-level (TL) IPC Can Improvement we further factor with Perfect L Cache reduce this gap? Via Prefetching? BLK FT.0X.X JPEG GMEA N

12 () Prefetching: Saves more cycles Phase () DRAM Requests D D D D Phase () D D D D (A) Comp. Phase () (B) Comp. Phase () TL Saved Cycles Time RR Phase- (Group-) Can Start Phase () DRAM Requests D D D Phase () D P P P P Comp. Phase () Prefetch Requests Comp. Phase () Saved Cycles

13 () Prefetching: Improve DRAM Bandwidth U?liza?on X X + X + X + Y Y + Y + Y + Memory Addresses High Bank- Level Parallelism Idle for a period Bank Bank No Idle period! High Row Buffer Locality Bank Bank Prefetch Requests

14 Challenge: Designing a Prefetcher X X + X + X + Y Y + Y + Y + Memory Addresses X Y Prefetch Requests Bank Bank X Sophisticated Prefetcher Y

15 Our Goal n Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher. To this end, we will design a prefetch-aware warp scheduling policy hy? A simple prefetching does not improve performance with existing scheduling policies.

16 Simple Prefetching + RR scheduling RR Phase () DRAM Requests D D D D D D D D Phase () No Saved Cycles Time Phase () DRAM Requests D P D P D Overlap with D (Late Prefetch) P D P Overlap with D (Late Prefetch) Phase () Time

17 Simple Prefetching + TL scheduling Phase () DRAM Requests Group Phase () Overlap with D (Late Group Group Group Group Comp. Comp. Phase Phase Phase () () () D D D D D P D P Group Phase () Prefetch) Overlap with D (Late Prefetch) D D D D D P D P No Saved Cycles (over TL) Group Comp. Phase () Group Comp. Phase () TL Saved Cycles Time RR

18 Let s Try X Simple Prefetcher X +

19 Simple Prefetching with TL scheduling X X + X + X + X + May not be equal to Y Y + Y Idle for a period Bank Bank Y + Y + Memory Addresses Useless Prefetch (X + ) Bank Bank UP UP UP UP Useless Prefetches

20 Simple Prefetching with TL scheduling Phase () DRAM Requests Phase () DRAM Requests D D D D D D D Phase () Phase () D U U U U D D D D D D D D Comp. Phase () Comp. Phase () Comp. Phase () No Saved Cycles Comp. Phase () (over TL) TL Useless Prefetches Saved Cycles Time RR

21 arp Scheduler Perspec?ve (Summary) arp Scheduler Forms Multiple arp Groups? Simple Prefetcher Friendly? DRAM Bandwidth Utilization Bank Level Parallelism Row Buffer Locality Round- Robin (RR) Two-Level (TL)

22 Our Goal n Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher. To this end, we will design a prefetch-aware warp scheduling policy A simple prefetching does not improve performance with existing scheduling policies.

23 Sophisticated Prefetcher Prefetch Aware (PA) arp Scheduler Simple Prefetcher

24 Prefetch- aware (PA) warp scheduling X X + X + X + Y Y + Y + Y + Round Robin Scheduling Group Group X X + X X + X + X + X + X + Y Y Y + Y + Y + Y + Prefetch-aware Scheduling Non-consecutive warps are associated with one group Y + Y + Two-level Scheduling See paper for generalized algorithm of PA scheduler

25 Simple Prefetching with PA scheduling X X + X + X + Y Y + Y + Y + Bank Bank Reasoning of non-consecutive warp grouping is that groups can (simple) prefetch for each other (green warps can prefetch for red warps using simple prefetcher) X Simple Prefetcher X +

26 Simple Prefetching with PA scheduling X X + X + X + Y Y + Y + Y + Cache Hits! Bank Bank X Simple Prefetcher X +

27 Simple Prefetching with PA scheduling (A) (B) Phase () DRAM Requests D D D D Phase () D D D D Comp. Phase () Comp. Phase () TL Saved Cycles Time RR Phase- (Group-) Can Start Phase () DRAM Requests D D D Phase () D P P P P Comp. Phase () Prefetch Requests Comp. Phase () Saved Saved Cycles!!! (over TL) Cycles

28 DRAM Bandwidth U:liza:on X X + X + X + Y + Y + Y + % increase in bank-level parallelism Y % decrease in row buffer locality Bank Bank High Bank-Level Parallelism High Row Buffer Locality X Simple Prefetcher X +

29 arp Scheduler Perspec?ve (Summary) arp Scheduler Round- Robin (RR) Forms Multiple arp Groups? Simple Prefetcher Friendly? DRAM Bandwidth Utilization Bank Level Parallelism Row Buffer Locality Two-Level (TL) Prefetch- Aware (PA) (with prefetching) 9

30 Outline n Proposal n Background and Motivation n Prefetch-aware Scheduling n Evaluation n Conclusions 0

31 Evalua?on Methodology n Evaluated on GPGPU-Sim, a cycle accurate GPU simulator n Baseline Architecture q 0 SMs, memory controllers, crossbar connected q q q 00MHz, SIMT idth =, Max. 0 threads/core KB L data cache, KB Texture and Constant Caches L Data Cache Prefetcher, GDDR@00MHz n Applications Chosen from: q Mapreduce Applications q q q Rodinia Heterogeneous Applications Parboil Throughput Computing Focused Applications NVIDIA CUDA SDK GPGPU Applications

32 Spa?al Locality Detector based Prefetching X MACRO BLOCK D Prefetch:- Not accessed (demanded) Cache Lines X + X + X + P D Prefetch-aware Scheduler Improves effectiveness of this simple prefetcher See paper for more details P D = Demand, P = Prefetch

33 Improving Prefetching Effec?veness RR+Prefetching TL+Prefetching PA+Prefetching 00% 0% 0% 0% 0% Fraction of Late Prefetches 9% % 9% 00% 0% 0% 0% 0% Prefetch Accuracy % 9% 90% 0% 0% % 0% Reduction in LD Miss Rates % 0% % 0% % %

34 Performance Evalua?on.. 0. SSC PVC KMN Results are Normalized to RR scheduling RR+Prefetching TL TL+Prefetching Prefetch-aware (PA) PA+Prefetching SPMV BFSR FFT % IPC improvement over Prefetching + RR arp Results Scheduling Policy (Commonly Used) % IPC improvement over Prefetching + TL arp Scheduling Policy (Best Previous) SCP BLK FT JPEG See paper for Additional GMEAN

35 Conclusions n n n Existing warp schedulers in GPGPUs cannot take advantage of simple prefetchers q Consecutive warps have good spatial locality, and can prefetch well for each other q But, existing schedulers schedule consecutive warps closeby in time à prefetches are too late e proposed prefetch-aware (PA) warp scheduling q Key idea: group consecutive warps into different groups q Enables a simple prefetcher to be timely since warps in different groups are scheduled at separate times Evaluations show that PA warp scheduling improves performance over combinations of conventional (RR) and the best previous (TL) warp scheduling and prefetching policies q Better orchestrates warp scheduling and prefetching decisions

36 THANKS! QUESTIONS?

37 BACKUP

38 Effect of Prefetch- aware Scheduling Percentage of DRAM requests (averaged over group) with: miss misses - misses to a macro-block 0% 0% High Spatial Locality Requests Recovered by Prefetching 0% High Spatial Locality Requests 0% Two-level Prefetch-aware

39 orking (ith Two- Level Scheduling) MACRO BLOCK MACRO BLOCK X D Y D X + X + X + D D D High Spatial Locality Requests Y + Y + Y + D D D 9

40 orking (ith Prefetch- Aware Scheduling) MACRO BLOCK MACRO BLOCK X D Y D X + X + X + P D P High Spatial Locality Requests Y + Y + Y + P D P

41 orking (ith Prefetch- Aware Scheduling) MACRO BLOCK MACRO BLOCK X Y X + X + D Cache Hits Y + Y + D X + D Y + D

42 Effect on Row Buffer locality Row Buffer Locality 0 0 TL TL+Prefetching PA PA+Prefetching SSC PVC KMN SPMV BFSR FFT SCP BLK FT JPEG AVG % decrease in row buffer locality over TL

43 Effect on Bank- Level Parallelism Bank Level Parallelism SSC PVC RR TL PA KMN SPMV BFSR FFT SCP BLK FT JPEG AVG % increase in bank-level parallelism over TL

44 Simple Prefetching + RR scheduling X X + X + X + Y Y + Y + Y + Memory Addresses Bank Bank Bank Bank

45 Simple Prefetching with TL scheduling X X + X + X + Y Y + Y + Y + Memory Addresses Idle for a period Bank Bank Legend Group Group Idle for a period Bank Bank

46 CTA-Assignment Policy (Example) Multi-threaded CUDA Kernel CTA- CTA- CTA- CTA- SIMT Core- SIMT Core- CTA- CTA- CTA- CTA- arp Scheduler arp Scheduler L Caches ALUs L Caches ALUs

Orchestrated Scheduling and Prefetching for GPGPUs

Orchestrated Scheduling and Prefetching for GPGPUs Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog Onur Kayiran Asit K. Mishra Mahmut T. Kandemir Onur Mutlu Ravishankar Iyer Chita R. Das The Pennsylvania State University Carnegie Mellon University

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun Saugata Ghose, Onur Kayiran, Gabriel H. Loh Chita Das, Mahmut Kandemir, Onur Mutlu Overview of This Talk Problem:

More information

Exploiting Core Criticality for Enhanced GPU Performance

Exploiting Core Criticality for Enhanced GPU Performance Exploiting Core Criticality for Enhanced GPU Performance Adwait Jog, Onur Kayıran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das. SIGMETRICS 16 Era of Throughput Architectures

More information

Managing GPU Concurrency in Heterogeneous Architectures

Managing GPU Concurrency in Heterogeneous Architectures Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures

More information

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,

More information

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs Technical Report CSE--6 September, Onur Kayiran onur@cse.psu.edu Adwait Jog adwait@cse.psu.edu Mahmut T. Kandemir kandemir@cse.psu.edu

More information

Exploiting Core Criticality for Enhanced GPU Performance

Exploiting Core Criticality for Enhanced GPU Performance Exploiting Core Criticality for Enhanced GPU Performance Adwait Jog Onur Kayiran 2 Ashutosh Pattnaik Mahmut T. Kandemir Onur Mutlu 4,5 Ravishankar Iyer 6 Chita R. Das College of William and Mary 2 Advanced

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications

Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications Adwait Jog Evgeny Bolotin 2 Zvika Guz 2 Mike Parker 3 Stephen W. Keckler 2,4 Mahmut T. Kandemir Chita R.

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs The 34 th IEEE International Conference on Computer Design Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs Shin-Ying Lee and Carole-Jean Wu Arizona State University October

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

Energy-Efficient Scheduling for Memory-Intensive GPGPU Workloads

Energy-Efficient Scheduling for Memory-Intensive GPGPU Workloads Energy-Efficient Scheduling for Memory-Intensive GPGPU Workloads Seokwoo Song, Minseok Lee, John Kim KAIST Daejeon, Korea {sukwoo, lms5, jjk}@kaist.ac.kr Woong Seo, Yeongon Cho, Soojung Ryu Samsung Electronics

More information

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs Onur Kayıran, Adwait Jog, Mahmut T. Kandemir and hita R. Das Department of omputer Science and Engineering The Pennsylvania State University,

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps Nandita Vijaykumar Gennady Pekhimenko Adwait Jog Saugata Ghose Abhishek Bhowmick Rachata Ausavarungnirun Chita Das Mahmut Kandemir

More information

APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs

APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs Yunho Oh, Keunsoo Kim, Myung Kuk Yoon, Jong Hyun

More information

Register File Organization

Register File Organization Register File Organization Sudhakar Yalamanchili unless otherwise noted (1) To understand the organization of large register files used in GPUs Objective Identify the performance bottlenecks and opportunities

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

CAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads

CAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads 2015 InternaDonal Symposium on Computer Architecture (ISCA- 42) CAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads Shin- Ying Lee Akhil Arunkumar Carole-

More information

A Framework for Modeling GPUs Power Consumption

A Framework for Modeling GPUs Power Consumption A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January

More information

arxiv: v2 [cs.ar] 5 Jun 2015

arxiv: v2 [cs.ar] 5 Jun 2015 Improving GPU Performance Through Resource Sharing Vishwesh Jatala Department of CSE, IIT Kanpur vjatala@cseiitkacin Jayvant Anantpur SERC, IISc, Bangalore jayvant@hpcserciiscernetin Amey Karkare Department

More information

MRPB: Memory Request Priori1za1on for Massively Parallel Processors

MRPB: Memory Request Priori1za1on for Massively Parallel Processors MRPB: Memory Request Priori1za1on for Massively Parallel Processors Wenhao Jia, Princeton University Kelly A. Shaw, University of Richmond Margaret Martonosi, Princeton University Benefits of GPU Caches

More information

Anatomy of GPU Memory System for Multi-Application Execution

Anatomy of GPU Memory System for Multi-Application Execution Anatomy of GPU Memory System for Multi-Application Execution Adwait Jog Onur Kayiran 4 Tuba Kesten 2 Ashutosh Pattnaik 2 Evgeny Bolotin 3 Niladrish Chatterjee 3 Stephen W. Keckler 3,5 Mahmut T. Kandemir

More information

Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management

Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management Haonan Wang, Fan Luo, Mohamed Ibrahim, Onur Kayiran, and Adwait Jog College of William and Mary Advanced Micro Devices, Inc.

More information

Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs

Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai and Jing-Yang Jou Department of Electronics Engineering National Chiao

More information

The Pennsylvania State University. The Graduate School. College of Engineering KERNEL-BASED ENERGY OPTIMIZATION IN GPUS.

The Pennsylvania State University. The Graduate School. College of Engineering KERNEL-BASED ENERGY OPTIMIZATION IN GPUS. The Pennsylvania State University The Graduate School College of Engineering KERNEL-BASED ENERGY OPTIMIZATION IN GPUS A Thesis in Computer Science and Engineering by Amin Jadidi 2015 Amin Jadidi Submitted

More information

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:

More information

Analyzing CUDA Workloads Using a Detailed GPU Simulator

Analyzing CUDA Workloads Using a Detailed GPU Simulator CS 3580 - Advanced Topics in Parallel Computing Analyzing CUDA Workloads Using a Detailed GPU Simulator Mohammad Hasanzadeh Mofrad University of Pittsburgh November 14, 2017 1 Article information Title:

More information

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems Yingchao Huang University of California, Merced yhuang46@ucmerced.edu Abstract A heterogeneous memory system (HMS)

More information

Evaluating GPGPU Memory Performance Through the C-AMAT Model

Evaluating GPGPU Memory Performance Through the C-AMAT Model Evaluating GPGPU Memory Performance Through the C-AMAT Model Ning Zhang Illinois Institute of Technology Chicago, IL 60616 nzhang23@hawk.iit.edu Chuntao Jiang Foshan University Foshan, Guangdong 510000

More information

Cache efficiency based dynamic bypassing technique for improving GPU performance

Cache efficiency based dynamic bypassing technique for improving GPU performance 94 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'18 Cache efficiency based dynamic bypassing technique for improving GPU performance Min Goo Moon 1, Cheol Hong Kim* 1 1 School of Electronics and

More information

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

Introduction to GPU programming with CUDA

Introduction to GPU programming with CUDA Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU

More information

RegMutex: Inter-Warp GPU Register Time-Sharing

RegMutex: Inter-Warp GPU Register Time-Sharing RegMutex: Inter-Warp GPU Register Time-Sharing Farzad Khorasani* Hodjat Asghari Esfeden Amin Farmahini-Farahani Nuwan Jayasena Vivek Sarkar *farkhor@gatech.edu The 45 th International Symposium on Computer

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA Research Email: {jin.wang,sudha}@gatech.edu,

More information

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA

GPGPU LAB. Case study: Finite-Difference Time- Domain Method on CUDA GPGPU LAB Case study: Finite-Difference Time- Domain Method on CUDA Ana Balevic IPVS 1 Finite-Difference Time-Domain Method Numerical computation of solutions to partial differential equations Explicit

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 GPU Performance vs. Thread-Level Parallelism: Scalability Analysis

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko Adwait Jog Abhishek Bhowmick Rachata Ausavarungnirun

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili* *Computer Architecture and System Lab, Georgia Institute of Technology NVIDIA

More information

Understanding GPGPU Vector Register File Usage

Understanding GPGPU Vector Register File Usage Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

THROUGHPUT OPTIMIZATION AND RESOURCE ALLOCATION ON GPUS UNDER MULTI-APPLICATION EXECUTION

THROUGHPUT OPTIMIZATION AND RESOURCE ALLOCATION ON GPUS UNDER MULTI-APPLICATION EXECUTION Southern Illinois University Carbondale OpenSIUC Theses Theses and Dissertations 12-1-2017 THROUGHPUT OPTIMIZATION AND RESOURCE ALLOCATION ON GPUS UNDER MULTI-APPLICATION EXECUTION SRINIVASA REDDY PUNYALA

More information

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research

More information

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread

Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)

More information

Generic Polyphase Filterbanks with CUDA

Generic Polyphase Filterbanks with CUDA Generic Polyphase Filterbanks with CUDA Jan Krämer German Aerospace Center Communication and Navigation Satellite Networks Weßling 04.02.2017 Knowledge for Tomorrow www.dlr.de Slide 1 of 27 > Generic Polyphase

More information

Cooperative Caching for GPUs

Cooperative Caching for GPUs Saumay Dublish, Vijay Nagarajan, Nigel Topham The University of Edinburgh at HiPEAC Stockholm, Sweden 24 th January, 2017 Multithreading on GPUs Kernel Hardware Scheduler Core Core Core Core Host CPU to

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE

A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2016 A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE Dongwei Wang Follow this and additional

More information

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory

More information

An Efficient STT-RAM Last Level Cache Architecture for GPUs

An Efficient STT-RAM Last Level Cache Architecture for GPUs An Efficient STT-RAM Last Level Cache Architecture for GPUs Mohammad Hossein Samavatian, Hamed Abbasitabar, Mohammad Arjomand, and Hamid Sarbazi-Azad, HPCAN Lab, Computer Engineering Department, Sharif

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Cache Memory Access Patterns in the GPU Architecture

Cache Memory Access Patterns in the GPU Architecture Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 7-2018 Cache Memory Access Patterns in the GPU Architecture Yash Nimkar ypn4262@rit.edu Follow this and additional

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

CMU Introduction to Computer Architecture, Spring Midterm Exam 2. Date: Wed., 4/17. Legibility & Name (5 Points): Problem 1 (90 Points):

CMU Introduction to Computer Architecture, Spring Midterm Exam 2. Date: Wed., 4/17. Legibility & Name (5 Points): Problem 1 (90 Points): Name: CMU 18-447 Introduction to Computer Architecture, Spring 2013 Midterm Exam 2 Instructions: Date: Wed., 4/17 Legibility & Name (5 Points): Problem 1 (90 Points): Problem 2 (35 Points): Problem 3 (35

More information

An Optimization Compiler Framework Based on Polyhedron Model for GPGPUs

An Optimization Compiler Framework Based on Polyhedron Model for GPGPUs Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2017 An Optimization Compiler Framework Based on Polyhedron Model for GPGPUs Lifeng Liu Wright State University

More information

CORF: Coalescing Operand Register File for GPUs

CORF: Coalescing Operand Register File for GPUs CORF: Coalescing Operand Register File for GPUs Hodjat Asghari Esfeden University of California, Riverside Riverside, CA hodjat.asghari@email.ucr.edu Farzad Khorasani Tesla, Inc. Palo Alto, CA fkhorasani@tesla.com

More information

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen

ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation. Qisi Wang Hui-Shun Hung Chien-Fu Chen ECE/CS 752 Final Project: The Best-Offset & Signature Path Prefetcher Implementation Qisi Wang Hui-Shun Hung Chien-Fu Chen Outline Data Prefetching Exist Data Prefetcher Stride Data Prefetcher Offset Prefetcher

More information

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth

Analysis Report. Number of Multiprocessors 3 Multiprocessor Clock Rate Concurrent Kernel Max IPC 6 Threads per Warp 32 Global Memory Bandwidth Analysis Report v3 Duration 932.612 µs Grid Size [ 1024,1,1 ] Block Size [ 1024,1,1 ] Registers/Thread 32 Shared Memory/Block 28 KiB Shared Memory Requested 64 KiB Shared Memory Executed 64 KiB Shared

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices *

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * Nam Sung Kim w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices * modern GPU architectures deeply pipelined for efficient resource sharing several buffering

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer

CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer CUDA Optimization: Memory Bandwidth Limited Kernels CUDA Webinar Tim C. Schroeder, HPC Developer Technology Engineer Outline We ll be focussing on optimizing global memory throughput on Fermi-class GPUs

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Dynamic Resource Management for Efficient Utilization of Multitasking GPUs

Dynamic Resource Management for Efficient Utilization of Multitasking GPUs Dynamic Resource Management for Efficient Utilization of Multitasking GPUs Jason Jong Kyu Park University of Michigan Ann Arbor, MI jasonjk@umich.edu Yongjun Park Hongik University Seoul, Korea yongjun.park@hongik.ac.kr

More information

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching Mohammad Sadrosadati 1,2 Amirhossein Mirhosseini 3 Seyed Borna Ehsani 1 Hamid Sarbazi-Azad 1,4

More information

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs 1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs AlgoPARC Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs 32nd ACM International Conference on Supercomputing June 17, 2018 Ben Karsin 1 karsin@hawaii.edu Volker Weichert 2 weichert@cs.uni-frankfurt.de

More information

Power-efficient Computing for Compute-intensive GPGPU Applications

Power-efficient Computing for Compute-intensive GPGPU Applications Power-efficient Computing for Compute-intensive GPGPU Applications Syed Zohaib Gilani, Nam Sung Kim, Michael J. Schulte The University of Wisconsin-Madison, WI, U.S.A. Advanced Micro Devices, TX, U.S.A.

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

Prefetching Techniques for Near-memory Throughput Processors

Prefetching Techniques for Near-memory Throughput Processors Prefetching Techniques for Near-memory Throughput Processors Reena Panda University of Texas at Austin reena.panda@utexas.edu Onur Kayiran AMD Research onur.kayiran@amd.com Yasuko Eckert AMD Research yasuko.eckert@amd.com

More information

Automatic Data Layout Transformation for Heterogeneous Many-Core Systems

Automatic Data Layout Transformation for Heterogeneous Many-Core Systems Automatic Data Layout Transformation for Heterogeneous Many-Core Systems Ying-Yu Tseng, Yu-Hao Huang, Bo-Cheng Charles Lai, and Jiun-Liang Lin Department of Electronics Engineering, National Chiao-Tung

More information

A Detailed GPU Cache Model Based on Reuse Distance Theory

A Detailed GPU Cache Model Based on Reuse Distance Theory A Detailed GPU Cache Model Based on Reuse Distance Theory Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal Eindhoven University of Technology (Netherlands) Henri Bal Vrije Universiteit Amsterdam

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs

Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs ABSTRACT Gwangsun Kim Arm gwangsun.kim@arm.com Mike O Connor NVIDIA and UT-Austin moconnor@nvidia.com 3D-stacked memory

More information

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs

Lecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model

More information

CS377P Programming for Performance GPU Programming - II

CS377P Programming for Performance GPU Programming - II CS377P Programming for Performance GPU Programming - II Sreepathi Pai UTCS November 11, 2015 Outline 1 GPU Occupancy 2 Divergence 3 Costs 4 Cooperation to reduce costs 5 Scheduling Regular Work Outline

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

{jtan, zhili,

{jtan, zhili, Cost-Effective Soft-Error Protection for SRAM-Based Structures in GPGPUs Jingweijia Tan, Zhi Li, Xin Fu Department of Electrical Engineering and Computer Science University of Kansas Lawrence, KS 66045,

More information

Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities

Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Ahmad Lashgar Electrical and Computer Engineering University of Victoria Victoria, BC, Canada Email: lashgar@uvic.ca Amirali Baniasadi

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight

More information

Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs

Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs Chien-Ting Chen 1, Yoshi Shih-Chieh Huang 1, Yuan-Ying Chang 1, Chiao-Yun Tu 1, Chung-Ta King 1, Tai-Yuan Wang 1, Janche Sang

More information