CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads
1 2015 International Symposium on Computer Architecture (ISCA-42). CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads. Shin-Ying Lee, Akhil Arunkumar, Carole-Jean Wu, Arizona State University. 6/17/2015
2 GPU Execution Model. Nowadays GPUs are used to process parallel workloads: hundreds or thousands of threads run concurrently. [Figure: a thread block is partitioned into warps (also called wavefronts), shown as warp 0 through warp 3.]
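The thread-to-warp mapping on the slide can be sketched as follows; this is a minimal illustration assuming the usual NVIDIA warp size of 32 threads, not code from the talk.

```python
# Sketch: how a thread block is partitioned into warps on NVIDIA-style GPUs.
# Assumes the conventional warp size of 32 threads (an assumption, not stated
# on the slide).
WARP_SIZE = 32

def warp_id(thread_idx: int) -> int:
    """Warp that a thread (linear index within its block) belongs to."""
    return thread_idx // WARP_SIZE

# A 128-thread block yields warps 0..3, matching the slide's warp 0..warp 3.
warps = sorted({warp_id(t) for t in range(128)})
print(warps)  # [0, 1, 2, 3]
```

All 32 threads of a warp issue the same instruction in lockstep, which is why the following slides treat the warp, not the thread, as the unit of scheduling.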
3 Modern GPU Architecture. Modern GPUs consist of multiple streaming multiprocessors (SMs), which are similar to SIMD processors.
4 The Warp Criticality Problem. There is significant warp execution time disparity among warps in the same thread block. [Figure: execution time relative to the fastest warp, broken into computation and idle time, for warps w0-w15 sorted by criticality, across bfs, b+tree, heartwall, kmeans, needle, srad_1, strcltr_small, and the average.] *Lee and Wu, CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads, PACT 2014
5 Research Questions. What is the source of warp criticality? How can we effectively accelerate critical warp execution?
6 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Conclusion.
7 Source of Warp Criticality. Workload imbalance; diverging branch behavior; memory contention and memory access latency; execution order of warp scheduling.
8 Workload Imbalance & Diverging Branch. Workload imbalance or diverging branch behavior gives warps different dynamic instruction counts. [Figure: execution time and instruction count, each normalized to the fastest warp, for warps w0-w15 sorted by criticality; execution time closely tracks instruction count.]
9 Memory Contention. Because warps experience different latencies to access memory, memory contention can induce warp criticality. [Figure: execution time normalized to the fastest warp, split into total memory system latency and other latency, for warps w0-w15 sorted by criticality.]
10 Warp Scheduling Order. The warp scheduler may introduce additional stall cycles for a ready warp, resulting in warp criticality. [Figure: execution time normalized to the fastest warp, split into scheduler latency and other latency, for warps w0-w15 sorted by criticality.]
11 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Discussion and Conclusion.
12 CAWA: Criticality-Aware Warp Acceleration. A coordinated warp scheduling and cache prioritization design. Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime. greedy Criticality-Aware Warp Scheduler (gCAWS): prioritizing and accelerating critical warp execution. Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse.
13 CAWA: Criticality-Aware Warp Acceleration, focusing first on the Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime.
14 CAWA CPL: Criticality Prediction Logic. CPL estimates the number of additional cycles a warp may still experience: Criticality = n_inst * CPI_avg + n_stall, where n_inst is the warp's estimated remaining instruction count (decremented whenever an instruction is executed), CPI_avg is the warp's average cycles per instruction, and n_stall is its accumulated stall cycles. The first term captures instruction count disparity and diverging branches; the second captures memory latency and scheduling latency.
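The CPL estimate above can be written as a one-line function; this is a sketch of the formula only, and the concrete warp numbers below are invented for illustration.

```python
# Hedged sketch of the CPL criticality estimate from this slide:
#   Criticality = n_inst * CPI_avg + n_stall
# n_inst  : estimated remaining dynamic instruction count of the warp
# cpi_avg : the warp's observed average cycles per instruction
# n_stall : stall cycles the warp has accumulated

def criticality(n_inst: int, cpi_avg: float, n_stall: int) -> float:
    return n_inst * cpi_avg + n_stall

# A warp with more remaining work and more stalls scores as more critical
# (example values are hypothetical, not from the paper).
slow = criticality(n_inst=400, cpi_avg=2.5, n_stall=300)  # 1300.0
fast = criticality(n_inst=100, cpi_avg=2.5, n_stall=50)   # 300.0
assert slow > fast
```

Because n_inst shrinks as the warp makes progress while n_stall grows as it waits, the score naturally decays for warps that are being accelerated and rises for warps that fall behind.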
15 CAWA: Criticality-Aware Warp Acceleration, next the greedy Criticality-Aware Warp Scheduler (gCAWS): prioritizing and accelerating critical warp execution, to reduce delay from the scheduler.
16 CAWA gCAWS: greedy Criticality-Aware Warp Scheduler. Prioritizes warps based on the criticality given by CPL and executes warps in a greedy* manner: select the most critical ready warp, and keep executing the selected warp until it stalls. Example with a warp pool of warps 0-3 ranked by criticality: a traditional scheduler (e.g. RR, 2L, GTO) issues W0 → W1 → W2 → W3, while gCAWS issues W1 → W3 → W0 → W2. *Rogers et al., Cache-Conscious Wavefront Scheduling, MICRO 2012
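The selection rule on this slide can be sketched in a few lines. The warp-pool data model below (dicts with `id`, `criticality`, `ready`) is an assumption for illustration, not the hardware structure.

```python
# Minimal sketch of gCAWS selection: among ready warps, greedily pick the
# one with the highest CPL criticality (and keep issuing from it until it
# stalls, which this sketch models by retiring the warp once selected).

def gcaws_select(warps):
    """warps: list of dicts with 'id', 'criticality', 'ready' (assumed shape)."""
    ready = [w for w in warps if w["ready"]]
    return max(ready, key=lambda w: w["criticality"])["id"] if ready else None

pool = [
    {"id": 0, "criticality": 5, "ready": True},
    {"id": 1, "criticality": 9, "ready": True},
    {"id": 2, "criticality": 1, "ready": True},
    {"id": 3, "criticality": 7, "ready": True},
]
# Repeatedly selecting and retiring reproduces the slide's order W1 → W3 → W0 → W2.
order = []
while pool:
    wid = gcaws_select(pool)
    order.append(wid)
    pool = [w for w in pool if w["id"] != wid]
print(order)  # [1, 3, 0, 2]
```

The greedy "run until stall" policy is borrowed from GTO-style scheduling; gCAWS changes only the ranking key, replacing warp age with CPL criticality.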
17 CAWA: Criticality-Aware Warp Acceleration, finally the Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical warp reuse, to reduce delay from the memory.
18 Critical Cache Block Reuse Behavior. [Figure: fraction of cache lines that are reused by critical warps, reused by non-critical warps, or never reused, for each benchmark under both a 16KB and a 256KB L1D$.] Goal: design a PC/MEM-based cache reuse predictor that captures cache lines that will be used by critical warps and cache lines that are likely to be reused.
19 CAWA CACP: Criticality-Aware Cache Prioritization. *Wu et al., SHiP: Signature-based Hit Predictor for High Performance Caching, MICRO 2011
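A PC-indexed reuse predictor in the spirit of CACP's PC/MEM-based design (and the SHiP predictor it builds on) can be sketched as below. The table size, counter width, hash, and threshold are all assumptions for illustration; the paper's exact structure differs.

```python
# Loose sketch of a signature-based reuse predictor: a table of saturating
# counters indexed by the PC of the load that filled a line. Trained on
# eviction, consulted on fill. Parameters here are invented, not the paper's.
TABLE_SIZE = 256
MAX_CTR = 3  # 2-bit saturating counters

table = [1] * TABLE_SIZE  # start weakly biased toward "no reuse"

def idx(pc: int) -> int:
    return pc % TABLE_SIZE  # hypothetical hash: low bits of the PC

def train(pc: int, reused: bool) -> None:
    """On eviction: bump the counter if the line was reused, decay otherwise."""
    i = idx(pc)
    table[i] = min(table[i] + 1, MAX_CTR) if reused else max(table[i] - 1, 0)

def predict_reuse(pc: int) -> bool:
    """On fill: predict reuse once the counter reaches its upper half."""
    return table[idx(pc)] >= 2

pc = 0x400123
train(pc, reused=True)
train(pc, reused=True)
print(predict_reuse(pc))  # True
```

CACP extends this idea with criticality: lines predicted to be reused by critical warps are placed in a protected partition, so the critical warp's working set survives interference from non-critical warps.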
20 CAWA CACP: Criticality-Aware Cache Prioritization. [Figure: handling of a cache request in CACP.]
21 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Discussion and Conclusion.
22 Experimental Environment. GPU simulation infrastructure: GPGPU-Sim v3.2.0* with the NVIDIA nvcc toolchain v3.2 and gcc v4.4.7, modeling the NVIDIA GTX480 architecture: 15 SMs (maximum 48 warps / 8 blocks per SM), 16KB (8-set/16-way) L1D$ per SM, 768KB (64-set/16-way) shared L2$. Benchmarks: 12 applications from the Rodinia benchmark suite**, including 7 applications with a high degree of criticality. *Bakhoda et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS 2009. **Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing, ISPASS 2009
23 Overall Performance Improvement. [Figure: IPC and MPKI normalized to the baseline for the 2L, GTO, and CAWA schedulers; individual benchmarks reach 3.13x and 2.54x speedups.] CAWA aims to retain data for critical warps. Baselines: Narasiman et al., Improving GPU Performance via Large Warps and Two-Level Warp Scheduling, MICRO 2011; Rogers, O'Connor, and Aamodt, Cache-Conscious Wavefront Scheduling, MICRO 2012.
24 gCAWS Performance Improvement. [Figure: IPC of gCAWS and full CAWA normalized to the baseline RR scheduler; speedups reach up to 2.63x.] Takeaways: workloads with a large memory footprint remain challenging, and static cache partitioning needs improvement.
25 Performance Improvement with CAWA CACP. [Figure: MPKI of RR+CACP, 2L+CACP, GTO+CACP, and CAWA normalized to the baseline RR.] CACP can effectively reduce cache interference with any scheduler; schedulers need modification for robust and higher performance improvement with CACP.
26 Outline. Introduction and Motivation; Source of Warp Criticality; CAWA: Coordinated Criticality-Aware Warp Acceleration (CPL: Criticality Prediction Logic, gCAWS: greedy Criticality-Aware Warp Scheduler, CACP: Criticality-Aware Cache Prioritization); Methodology and Evaluation Results; Conclusion.
27 Research Questions. What is the source of warp criticality? Workload imbalance and/or diverging branches, different memory access latencies, and additional scheduling latency. How can we effectively accelerate critical warp execution? CAWA gCAWS allocates more time resources to critical warps; CAWA CACP retains (1) cache lines that will be reused by critical warps in the critical cache partition and (2) cache lines that will be reused, with higher priority.
28 Conclusion. This is the first work that proposes an effective solution for mitigating the degree of warp criticality by accelerating the execution of critical warps on GPGPUs. CAWA consists of: an accurate dynamic warp criticality prediction scheme to guide critical warp acceleration, and a coordinated management scheme to improve warp scheduling and cache prioritization for critical warp acceleration. Our solution improves GPGPU performance by an average of 23% (9% across all applications).
29 Thank you! CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads. Shin-Ying Lee, Akhil Arunkumar, Carole-Jean Wu, Arizona State University.
More informationADDRESS TRANSLATION FOR THROUGHPUT-ORIENTED ACCELERATORS
... ADDRESS TRANSLATION FOR THROUGHPUT-ORIENTED ACCELERATORS... WITH PROCESSOR VENDORS EMBRACING HARDWARE HETEROGENEITY, PROVIDING LOW- OVERHEAD HARDWARE AND SOFTWARE ABSTRACTIONS TO SUPPORT EASY-TO-USE
More informationOrchestrated Scheduling and Prefetching for GPGPUs
Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog Onur Kayiran Asit K. Mishra Mahmut T. Kandemir Onur Mutlu Ravishankar Iyer Chita R. Das The Pennsylvania State University Carnegie Mellon University
More informationDaCache: Memory Divergence-Aware GPU Cache Management
DaCache: Memory Divergence-Aware GPU Cache Management Bin Wang, Weikuan Yu, Xian-He Sun, Xinning Wang Department of Computer Science Department of Computer Science Auburn University Illinois Institute
More informationBenchmarking the Memory Hierarchy of Modern GPUs
1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong
More informationA Framework for Modeling GPUs Power Consumption
A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationResearch Article CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU
e Scientific World Journal Volume 215, Article ID 848416, 1 pages http://dx.doi.org/1.1155/215/848416 Research Article CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU Jianliang
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationThread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationDynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationExtended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays
Extended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays Technical Report 13rp003-GRIS A. Schulz 1, D. Wodniok 2, S. Widmer 2 and
More information