CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads


1 2015 International Symposium on Computer Architecture (ISCA-42) CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads Shin-Ying Lee, Akhil Arunkumar, Carole-Jean Wu Arizona State University 6/17/2015

2 GPU Execution Model Today's GPUs are used to process massively parallel workloads: hundreds or thousands of threads run concurrently. Each thread block is partitioned into warps (also called wavefronts), e.g., warp 0 through warp 3. [Figure: a thread block divided into warps]
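The block-to-warp partitioning above can be sketched in a few lines of Python; the function name and the 1-D block shape are illustrative assumptions, but the warp size of 32 matches NVIDIA hardware:

```python
# Minimal sketch: how a 1-D thread block is carved into warps of 32 threads,
# the hardware's SIMD scheduling unit.
WARP_SIZE = 32

def block_to_warps(block_dim):
    """Partition a 1-D thread block into warps (lists of thread IDs)."""
    tids = list(range(block_dim))
    return [tids[i:i + WARP_SIZE] for i in range(0, block_dim, WARP_SIZE)]

warps = block_to_warps(128)
print(len(warps))    # a 128-thread block forms 4 warps
print(warps[1][0])   # warp 1 starts at thread 32
```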

3 Modern GPU Architecture Modern GPUs consist of multiple streaming multiprocessors (SMs), each similar to a SIMD processor.

4 The Warp Criticality Problem There is significant execution-time disparity among warps in the same thread block. [Figure: execution time relative to the fastest warp for warps w0-w15 (sorted by criticality) in breadth-first-search; and computation vs. idle-time breakdown for bfs, b+tree, heartwall, kmeans, needle, srad_1, strcltr_small, and their average] *Lee and Wu, CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads, PACT 2014

5 Research Questions What is the source of warp criticality? How can we effectively accelerate critical warp execution?

6 Outline Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic; gcaws: greedy Criticality-Aware Warp Scheduler; CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Conclusion

7 Source of Warp Criticality Workload Imbalance; Diverging Branch Behavior; Memory Contention and Memory Access Latency; Execution Order of Warp Scheduling

8 Workload Imbalance & Diverging Branch Workload imbalance or diverging branch behavior gives warps different dynamic instruction counts. [Figure: execution time and instruction count, each normalized to the fastest warp, for warps w0-w15 sorted by criticality]
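A toy illustration of the point above: if per-thread work stands in for the instructions on each thread's branch path, a warp whose threads take longer paths accumulates a larger dynamic instruction count and becomes critical. The function and workloads here are hypothetical, not taken from the paper:

```python
# Illustrative sketch: diverging branches give warps different dynamic
# instruction counts, so they finish at different times.
def dynamic_inst_count(work_items):
    # each item's cost stands in for the instructions on its branch path
    return sum(work_items)

warp0 = [10] * 32               # uniform work across all 32 threads
warp1 = [10] * 16 + [50] * 16   # half the threads take a long branch path
print(dynamic_inst_count(warp0))  # 320
print(dynamic_inst_count(warp1))  # 960 -> warp1 becomes the critical warp
```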

9 Memory Contention Because warps experience different latencies when accessing memory, memory contention can induce warp criticality. [Figure: execution time normalized to the fastest warp, broken into total memory-system latency and other latency, for warps w0-w15 sorted by criticality]

10 Warp Scheduling Order The warp scheduler may introduce additional stall cycles for a ready warp, resulting in warp criticality. [Figure: execution time normalized to the fastest warp, broken into scheduler latency and other latency, for warps w0-w15 sorted by criticality]

11 Outline Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic; gcaws: greedy Criticality-Aware Warp Scheduler; CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Discussion and Conclusion

12 CAWA: Criticality-Aware Warp Acceleration A coordinated warp scheduling and cache prioritization design. Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime. greedy Criticality-Aware Warp Scheduler (gcaws): prioritizing and accelerating critical warp execution. Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse.

13 CAWA: Criticality-Aware Warp Acceleration Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime. greedy Criticality-Aware Warp Scheduler (gcaws): prioritizing and accelerating critical warp execution. Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse.

14 CAWA CPL: Criticality Prediction Logic Estimates the number of additional cycles a warp may experience: Criticality = nInst × CPIavg + nStall, where nInst is decremented whenever an instruction is executed. The nInst term captures instruction-count disparity and diverging branches; the nStall term captures memory latency and scheduling latency.
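The CPL estimate above can be sketched as a small per-warp state record. The class and field names are assumptions for illustration; only the formula Criticality = nInst × CPIavg + nStall comes from the slide:

```python
# Sketch of the CPL criticality estimate. nInst is the remaining instruction
# count (decremented as instructions retire), CPIavg the warp's observed
# average cycles-per-instruction, nStall the accumulated stall cycles.
class WarpState:
    def __init__(self, n_inst, cpi_avg, n_stall=0):
        self.n_inst = n_inst      # decremented whenever an instruction executes
        self.cpi_avg = cpi_avg
        self.n_stall = n_stall    # grows with memory/scheduler stalls

    def criticality(self):
        return self.n_inst * self.cpi_avg + self.n_stall

w = WarpState(n_inst=100, cpi_avg=4.0, n_stall=200)
print(w.criticality())   # 600.0
w.n_inst -= 1            # one instruction executed: criticality drops
print(w.criticality())   # 596.0
```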

15 CAWA: Criticality-Aware Warp Acceleration Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime. greedy Criticality-Aware Warp Scheduler (gcaws): prioritizing and accelerating critical warp execution — reduce delay from the scheduler. Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse.

16 CAWA gcaws: greedy Criticality-Aware Warp Scheduler Prioritizes warps based on their criticality given by CPL. Executes warps in a greedy* manner: select the most critical ready warp, and keep executing the selected warp until it stalls. [Figure: warp pool with per-warp criticality feeding the warp scheduler] Selection sequence — traditional approaches (e.g., RR, 2L, GTO): W0→W1→W2→W3; gcaws: W1→W3→W0→W2. *Rogers et al., Cache-Conscious Wavefront Scheduling, MICRO 2012
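The selection step above can be sketched as follows; the warp-pool representation and function name are assumptions, while the policy (pick the most critical ready warp, run it until it stalls) is the slide's:

```python
# Hypothetical sketch of gcaws warp selection: among ready warps, pick the
# one with the highest CPL criticality; the scheduler then sticks with that
# warp until it stalls, at which point selection runs again.
def gcaws_pick(warps):
    """warps: list of dicts with 'id', 'criticality', and 'ready' fields."""
    ready = [w for w in warps if w['ready']]
    if not ready:
        return None  # nothing schedulable this cycle
    return max(ready, key=lambda w: w['criticality'])['id']

pool = [
    {'id': 0, 'criticality': 40, 'ready': True},
    {'id': 1, 'criticality': 90, 'ready': True},
    {'id': 2, 'criticality': 10, 'ready': True},
    {'id': 3, 'criticality': 70, 'ready': False},  # stalled, not eligible
]
print(gcaws_pick(pool))  # 1  (most critical ready warp wins)
```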

17 CAWA: Criticality-Aware Warp Acceleration Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime. greedy Criticality-Aware Warp Scheduler (gcaws): prioritizing and accelerating critical warp execution. Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse — reduce delay from the memory.

18 Critical Cache Block Reuse Behavior [Figure: breakdown of cache lines into reused critical lines, reused non-critical lines, and lines with no reuse, comparing a 256KB vs. a 16KB L1 data cache across benchmarks] Goals: design a PC/MEM-based cache reuse predictor; capture cache lines that will be used by critical warps; capture cache lines that are likely to be reused.
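A PC-based reuse predictor in the spirit of the goal above can be sketched with a table of saturating counters, as in SHiP-style designs. The table size, counter width, and update rules here are assumptions for illustration, not the paper's exact CACP design:

```python
# Hedged sketch of a PC-signature reuse predictor: a saturating counter per
# load PC learns whether lines brought in by that PC tend to be reused
# before eviction. Lines predicted reusable can be protected in the cache.
class ReusePredictor:
    def __init__(self, bits=2):
        self.max_val = (1 << bits) - 1
        self.table = {}   # PC signature -> saturating counter

    def train(self, pc, reused):
        # increment on an observed reuse, decrement on eviction without reuse
        c = self.table.get(pc, self.max_val // 2)
        c = min(c + 1, self.max_val) if reused else max(c - 1, 0)
        self.table[pc] = c

    def predict_reuse(self, pc):
        return self.table.get(pc, self.max_val // 2) > self.max_val // 2

p = ReusePredictor()
for _ in range(3):
    p.train(0x400A10, reused=True)   # this load PC keeps getting hits
print(p.predict_reuse(0x400A10))     # True  -> protect its lines
print(p.predict_reuse(0x400B20))     # False -> untrained PC, no protection
```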

19 CAWA CACP: Criticality-Aware Cache Prioritization [Figure: CACP organization] *Wu et al., SHiP: Signature-based Hit Predictor for High Performance Caching, MICRO 2011

20 CAWA CACP: Criticality-Aware Cache Prioritization [Figure: handling of a cache request in CACP]

21 Outline Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic; gcaws: greedy Criticality-Aware Warp Scheduler; CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Discussion and Conclusion

22 Experimental Environment GPU simulation infrastructure: GPGPU-Sim v3.2.0* with the NVIDIA nvcc toolchain v3.2 and gcc v4.4.7, modeling the NVIDIA GTX 480 architecture — 15 SMs (maximum 48 warps / 8 blocks per SM), 16KB (8-set/16-way) L1 data cache per SM, 768KB (64-set/16-way) shared L2 cache. Benchmarks: 12 applications from the Rodinia benchmark suite**, including 7 applications with a high degree of criticality. *Bakhoda et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS 2009 **Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing, ISPASS 2009

23 Overall Performance Improvement [Figure: IPC and MPKI normalized to the baseline for the 2L, GTO, and CAWA schedulers; outliers reach 3.13x and 2.54x] CAWA aims to retain data for critical warps. 2L: Narasiman et al., Improving GPU Performance via Large Warps and Two-Level Warp Scheduling, MICRO 2011. GTO/CCWS: Rogers, O'Connor, and Aamodt, Cache-Conscious Wavefront Scheduling, MICRO 2012.

24 gcaws Performance Improvement [Figure: IPC normalized to the baseline RR scheduler for gcaws and CAWA; outliers reach 2.63x] Large memory footprint; static cache partitioning needs improvement.

25 Performance Improvement with CAWA CACP [Figure: MPKI normalized to the baseline RR scheduler for RR+CACP, 2L+CACP, GTO+CACP, and CAWA] CACP can effectively reduce cache interference with any scheduler; schedulers need modification for robust and higher performance improvement with CACP.

26 Outline Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic; gcaws: greedy Criticality-Aware Warp Scheduler; CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Conclusion

27 Research Questions What is the source of warp criticality? Workload imbalance and/or diverging branches, different memory access latencies, and additional scheduling latency. How can we effectively accelerate critical warp execution? CAWA gcaws allocates more time resources to critical warps; CAWA CACP retains, with higher priority, (1) cache lines that will be reused by critical warps, in the critical cache partition, and (2) cache lines that will be reused.

28 Conclusion This is the first work to propose an effective solution for mitigating the degree of warp criticality by accelerating the execution of critical warps on GPGPUs. CAWA consists of: an accurate dynamic warp criticality prediction scheme to guide critical warp acceleration, and a coordinated management scheme to improve warp scheduling and cache prioritization for critical warp acceleration. Our solution improves GPGPU performance by an average of 23% (9% across all applications).

29 Thank you! CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads Shin-Ying Lee, Akhil Arunkumar, Carole-Jean Wu Arizona State University 6/17/2015


More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Application-Specific Autonomic Cache Tuning for General Purpose GPUs

Application-Specific Autonomic Cache Tuning for General Purpose GPUs Application-Specific Autonomic Cache Tuning for General Purpose GPUs Sam Gianelli, Edward Richter, Diego Jimenez, Hugo Valdez, Tosiron Adegbija, Ali Akoglu Department of Electrical and Computer Engineering

More information

D5.5.3 Design and implementation of the SIMD-MIMD GPU architecture

D5.5.3 Design and implementation of the SIMD-MIMD GPU architecture D5.5.3(v.1.0) D5.5.3 Design and implementation of the SIMD-MIMD GPU architecture Document Information Contract Number 288653 Project Website lpgpu.org Contractual Deadline 31-08-2013 Nature Report Author

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Intra-Cluster Coalescing to Reduce GPU NoC Pressure

Intra-Cluster Coalescing to Reduce GPU NoC Pressure Intra-Cluster Coalescing to Reduce GPU NoC Pressure Lu Wang 1, Xia Zhao 1, David Kaeli 2, Zhiying Wang 3, and Lieven Eeckhout 1 1 Ghent University,{luluwang.wang, xia.zhao, lieven.eeckhout}@ugent.be 2

More information

Memory Request Priority Based Warp Scheduling for GPUs

Memory Request Priority Based Warp Scheduling for GPUs Chinese Journal of Electronics Vol.27, No.5, Sept. 2018 Memory Request Priority Based Warp Scheduling for GPUs ZHANG Jun 1,2, HE Yanxiang 2,SHENFanfan 2, LI Qing an 2 and TAN Hai 1 (1. School of Software,

More information

Exploring Non-Uniform Processing In-Memory Architectures

Exploring Non-Uniform Processing In-Memory Architectures Exploring Non-Uniform Processing In-Memory Architectures ABSTRACT Kishore Punniyamurthy kishore.punniyamurthy@utexas.edu The University of Texas at Austin Advancements in packaging technology have made

More information

Automatic Data Layout Transformation for Heterogeneous Many-Core Systems

Automatic Data Layout Transformation for Heterogeneous Many-Core Systems Automatic Data Layout Transformation for Heterogeneous Many-Core Systems Ying-Yu Tseng, Yu-Hao Huang, Bo-Cheng Charles Lai, and Jiun-Liang Lin Department of Electronics Engineering, National Chiao-Tung

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Understanding GPGPU Vector Register File Usage

Understanding GPGPU Vector Register File Usage Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture

More information

arxiv: v5 [cs.ar] 12 Feb 2017

arxiv: v5 [cs.ar] 12 Feb 2017 A Scratchpad Sharing in GPUs Vishwesh Jatala, Indian Institute of Technology, Kanpur Jayvant Anantpur, Indian Institute of Science, Bangalore Amey Karkare, Indian Institute of Technology, Kanpur arxiv:1607.03238v5

More information

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture

Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture Ping Xiang Yi Yang Mike Mantor Norm Rubin Huiyang Zhou Dept. of ECE Dept. of Integrated Systems Graphics Group Nvidia Research Dept. of

More information

Power and Performance Analysis of GPU-Accelerated Systems

Power and Performance Analysis of GPU-Accelerated Systems Power and Performance Analysis of GPU-Accelerated Systems Yuki Abe, Hiroshi Sasaki, Martin Peres, Koji Inoue, Kazuaki Murakami, Shinpei Kato To cite this version: Yuki Abe, Hiroshi Sasaki, Martin Peres,

More information

Cache Memory Access Patterns in the GPU Architecture

Cache Memory Access Patterns in the GPU Architecture Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 7-2018 Cache Memory Access Patterns in the GPU Architecture Yash Nimkar ypn4262@rit.edu Follow this and additional

More information

Parallelising Pipelined Wavefront Computations on the GPU

Parallelising Pipelined Wavefront Computations on the GPU Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick

More information

Inter-warp Divergence Aware Execution on GPUs

Inter-warp Divergence Aware Execution on GPUs Inter-warp Divergence Aware Execution on GPUs A Thesis Presented by Chulian Zhang to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master

More information

Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU

Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU Chinese Journal of Electronics Vol.24, No.4, Oct. 2015 Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU LI Bingchao, WEI Jizeng, GUO Wei and SUN Jizhou (School of Computer Science

More information

ADDRESS TRANSLATION FOR THROUGHPUT-ORIENTED ACCELERATORS

ADDRESS TRANSLATION FOR THROUGHPUT-ORIENTED ACCELERATORS ... ADDRESS TRANSLATION FOR THROUGHPUT-ORIENTED ACCELERATORS... WITH PROCESSOR VENDORS EMBRACING HARDWARE HETEROGENEITY, PROVIDING LOW- OVERHEAD HARDWARE AND SOFTWARE ABSTRACTIONS TO SUPPORT EASY-TO-USE

More information

Orchestrated Scheduling and Prefetching for GPGPUs

Orchestrated Scheduling and Prefetching for GPGPUs Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog Onur Kayiran Asit K. Mishra Mahmut T. Kandemir Onur Mutlu Ravishankar Iyer Chita R. Das The Pennsylvania State University Carnegie Mellon University

More information

DaCache: Memory Divergence-Aware GPU Cache Management

DaCache: Memory Divergence-Aware GPU Cache Management DaCache: Memory Divergence-Aware GPU Cache Management Bin Wang, Weikuan Yu, Xian-He Sun, Xinning Wang Department of Computer Science Department of Computer Science Auburn University Illinois Institute

More information

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs 1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong

More information

A Framework for Modeling GPUs Power Consumption

A Framework for Modeling GPUs Power Consumption A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

Research Article CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU

Research Article CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU e Scientific World Journal Volume 215, Article ID 848416, 1 pages http://dx.doi.org/1.1155/215/848416 Research Article CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU Jianliang

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Extended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays

Extended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays Extended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays Technical Report 13rp003-GRIS A. Schulz 1, D. Wodniok 2, S. Widmer 2 and

More information