Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs
1 The 34th IEEE International Conference on Computer Design. Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs. Shin-Ying Lee and Carole-Jean Wu, Arizona State University. October 3, 2016
2 Graphics Processing Unit (GPU). Runs a massive number of parallel threads to maximize throughput. Accelerates parallel computation: image processing, weather forecasting, scientific and engineering computation, data mining, machine learning. [Figure: a GPU with Cores 0-3]
3 Cache Contention and Thrashing. A massive number of concurrent threads (1000+) contend for the limited cache storage (16-64kB). A large fraction of memory requests result in cache misses, and a large number of cache lines are evicted too early.
4 GPU Cache Sensitivity. Caches at their current capacity do not effectively improve GPU performance, yet GPGPU applications can obtain high speedups with larger L1 data caches. [Chart: speedup over the 16kB L1D$ baseline with the L1D$ off and with a 64kB L1D$, across BO, PTH, HOT, BP, FWT, HTW, SR1, NW, SR2, SC, BT, DCT, WC, MIS, CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN, and Avg; labeled peaks of 3.94x and 6.82x]
5 Research Question. How can we effectively mitigate the cache contention problem and improve GPU performance without increasing the cache capacity?
6 Outline. Introduction and Motivation; GPU Cache Access Behavior Characterization; Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing; Methodology and Evaluation Results; Conclusion
7 Per-Instruction Cache Line Reuse Distance. Bypassing memory requests generated by instructions that exhibit long reuse distances increases the effective cache capacity. [Chart: reuse-distance distribution (<16 vs. >16) for memory instructions PC_0 through PC_8 of the Rodinia* BFS application] * Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing, IISWC '09
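A trace-based profiler along these lines could attribute reuse distances to memory-instruction PCs as follows (a minimal sketch; the trace format and the helper name are our own, not from the slides):

```python
from collections import OrderedDict

def profile_reuse_distances(trace):
    """trace: list of (pc, line_addr) pairs. Returns {pc: [distances]},
    where each distance is the number of distinct lines touched between
    two consecutive accesses to the same line (inf for a cold access)."""
    stack = OrderedDict()          # LRU stack of cache-line addresses
    per_pc = {}
    for pc, line in trace:
        if line in stack:
            # Reuse distance = depth from the MRU end of the stack.
            dist = list(reversed(stack)).index(line)
            per_pc.setdefault(pc, []).append(dist)
            del stack[line]
        else:
            per_pc.setdefault(pc, []).append(float('inf'))
        stack[line] = True         # (re)insert at the MRU position
    return per_pc

per = profile_reuse_distances([(0, 'A'), (1, 'B'), (0, 'A')])
print(per[0])   # [inf, 1]: one distinct line ('B') touched between the reuses
```

An instruction whose distance histogram is dominated by large values (e.g. >16) is the kind of bypass candidate the slide identifies.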
8 GPU Cache Bypassing. [Charts: reuse-distance distribution (<16 vs. >16) and speedup over no bypassing, for per-instruction bypassing and complete bypassing of memory instructions PC_3, PC_4, and PC_6 of BFS]
9 Bypassing Aggressiveness. The optimal bypassing aggressiveness varies across applications and memory instructions. [Chart: speedup over no bypassing with AGG = 1, 3, 5, 7 and with complete bypassing, for PC_3/PC_4/PC_6 of BFS, PC_0 of KMN, and PC_1/PC_2/PC_5 of ELL; the labeled optima (AGG_opt = 3 or AGG_opt = 5) differ per instruction] Table: AGG 7 -> 99% bypassing probability; AGG 5 -> 96%; AGG 3 -> 75%; AGG 1 -> 50%.
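One plausible reading of how AGG maps to a bypassing probability is a 1-in-2^AGG insertion counter (this helper is our own reconstruction, not the paper's code; it yields 50% at AGG=1 and ~99% at AGG=7, bracketing the slide's table, though it need not reproduce every entry exactly):

```python
def should_bypass(state, agg):
    """Counter-based bypass: insert one request out of every 2**agg
    and bypass the rest. `state` holds the per-PC BYP counter."""
    state['byp'] += 1
    if state['byp'] >= 2 ** agg:
        state['byp'] = 0
        return False    # insert this request into the cache
    return True         # bypass the cache

state = {'byp': 0}
decisions = [should_bypass(state, 3) for _ in range(16)]
print(decisions.count(True) / len(decisions))   # 0.875: 7 of every 8 bypassed
```

Raising AGG by one roughly halves the insertion rate, which is what makes a small integer a convenient control knob for the feedback loop.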
10 Outline. Introduction and Motivation; GPU Cache Access Behavior Characterization; Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing; Methodology and Evaluation Results; Conclusion
11 Ctrl-C: Instruction-Aware Control Loop Based Cache Bypassing. Adjusting the bypassing aggressiveness per instruction to achieve the optimal cache hit rate by using feedback control loops. Capturing the per-instruction cache line reuse behavior by the unique PC signature of memory instructions [1][2][3]. Bypassing memory requests stochastically to alleviate the degree of cache thrashing [4]. [1] Wu et al., SHiP: Signature-based Hit Predictor for High Performance Caching, MICRO '12. [2] Tian et al., Adaptive GPU Cache Bypassing, GPGPU '15. [3] Lee et al., CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration of GPGPU Workloads, ISCA '15. [4] Qureshi et al., Adaptive Insertion Policies for High Performance Caching, ISCA '07
12 Ctrl-C Design Overview. Per-cache-line metadata: Tag, Valid, Data, Reuse, InsertionPC. Instruction-Reuse (ireuse) Table: an array indexed by the lower 7 bits of the instruction PC. Each ireuse entry holds: AGG (bypassing AGGressiveness), BYP (number of requests BYPassed), INSERT (number of cache lines INSERTed), ZERO (number of ZERO-reuse lines). (TH_L, TH_H): target thresholds on the fraction of zero-reuse lines.
13 Ctrl-C Feedback Control Loop. At eviction and at miss, the feedback loop reads the per-PC zero-reuse ratio k and adjusts the aggressiveness: if (k > TH_H) AGG++; if (k < TH_L) AGG--. For each memory request, the bypass decision is: if (BYP == 2^AGG) { insert; BYP = 0; } else { bypass; BYP++; }
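The aggressiveness update above can be sketched as follows (a sketch under our reading of the slide; the `entry` layout, the helper name, and the 0-7 clamping bounds are assumptions, with k taken as the fraction of inserted lines that saw zero reuse):

```python
TH_L, TH_H = 0.1, 0.4      # target thresholds on the zero-reuse fraction
AGG_MIN, AGG_MAX = 0, 7    # assumed clamp for the 3-bit AGG counter

def update_aggressiveness(entry):
    """entry: one ireuse-table entry with 'insert', 'zero', 'agg' fields.
    Called at eviction/miss time to steer AGG toward the target band."""
    if entry['insert'] == 0:
        return                       # nothing inserted yet for this PC
    k = entry['zero'] / entry['insert']
    if k > TH_H and entry['agg'] < AGG_MAX:
        entry['agg'] += 1            # too many dead lines: bypass more
    elif k < TH_L and entry['agg'] > AGG_MIN:
        entry['agg'] -= 1            # lines are being reused: bypass less

entry = {'insert': 10, 'zero': 5, 'agg': 3}
update_aggressiveness(entry)
print(entry['agg'])   # 4: k = 0.5 > TH_H, so the loop bypasses more
```

This is a classic bang-bang controller: AGG drifts up while zero-reuse insertions exceed TH_H and drifts down once reuse recovers, so each PC converges toward its own operating point.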
14 Outline. Introduction and Motivation; GPU Cache Access Behavior Characterization; Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing; Methodology and Evaluation Results; Conclusion
15 Methodology. GPGPU-Sim [1] is used to simulate an NVIDIA Fermi-based GTX 480 GPU: 15 streaming multiprocessors; a 16kB L1 data cache (4-way, 32-set) per SM with the Fermi-hashing [2] index; a 768kB L2 cache (8-way, 64-set, 12-partition). (TH_L, TH_H) = (0.1, 0.4). 27 benchmarks (including 13 highly cache-sensitive workloads) represent a wide range of GPU behavior. [1] Bakhoda et al., Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS '09. [2] Nugteren et al., A Detailed GPU Cache Model Based on Reuse Distance Theory, HPCA '14
16 Ctrl-C Performance Improvement. [Chart: speedup over the 16kB L1D$ baseline for Ctrl-C and for a 32kB L1D$; labeled peaks of 2.39x and 2.38x]
17 Ctrl-C Performance Improvement. [Chart: speedup over the 16kB L1D$ baseline for Adaptive Bypass* and Ctrl-C; labeled points of 0.42x and 2.39x] * Tian et al., Adaptive GPU Cache Bypassing, GPGPU '15
18 Research Question. How can we effectively mitigate the cache contention problem and improve GPU performance without increasing the cache capacity? Proposed an instruction-aware algorithm to predict cache access behavior. Employed feedback control loops to adaptively bypass memory requests based on an instruction's reuse pattern.
19 Conclusion. This is the first work that designs a feedback control loop to determine the optimal bypassing setting at the instruction granularity. This paper offers detailed characterization results showing that the optimal cache bypassing aggressiveness varies across applications and memory instructions. We propose the instruction-aware Ctrl-C cache bypassing scheme to dynamically predict the best bypassing aggressiveness and improve performance by an average 1.42x speedup for cache-sensitive applications.
20 Thank you! Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs. Shin-Ying Lee and Carole-Jean Wu, Arizona State University. October 3, 2016
21 Backup
22 Cache Contention and Thrashing. The GPU cache capacity is too small to fit the active dataset of all concurrent threads. Working dataset: Thread_0: D[0]..D[N-1]; Thread_1: D[N]..D[2N-1]; Thread_2: D[2N]..D[3N-1]. Access sequence: 1. Thread_0: D[0]; 2. Thread_1: D[N]; 3. Thread_2: D[2N]; 4. Thread_0: D[1]; 5. Thread_1: D[N+1]; 6. Thread_2: D[2N+1]; 7. Thread_0: D[3]... [Figure: the contents of a 2-way cache over time, showing each line evicted by another thread before it can be reused]
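The interleaving above can be reproduced with a small LRU simulation (a sketch; the thread/array layout below is our own example, not taken verbatim from the slide):

```python
from collections import OrderedDict

def lru_hits(accesses, ways):
    """Simulate one fully-associative LRU set of `ways` lines; return hits."""
    lru = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in lru:
            hits += 1
            lru.move_to_end(addr)          # promote to the MRU position
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)    # evict the LRU line
            lru[addr] = True
    return hits

# Three threads round-robin over disjoint N-line arrays, two passes each:
N = 4
one_pass = [t * N + i for i in range(N) for t in range(3)]
trace = one_pass * 2
print(lru_hits(trace, ways=2))    # 0: every line is evicted before its reuse
print(lru_hits(trace, ways=12))   # 12: the whole working set fits
```

With only 2 ways the second pass gets zero hits even though every line is reused, which is exactly the thrashing pattern the slide illustrates; enlarging the cache to fit the working set recovers all of the reuse.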
23 GPU Cache Line Reuse Distance. [Chart: reuse-distance distribution (<16 vs. >16) for CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN]
24 Storage Overhead. Per-ireuse-entry counters: AGG (3-bit), BYP (7-bit), REF (10-bit), ZERO (10-bit). Cache metadata: 8-bit per cache line. Only 3.5% storage overhead is needed for a 16kB cache (32-set, 4-way) with a 128-entry ireuse table to gain a 1.41x speedup.
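A back-of-the-envelope check of this figure (our own arithmetic, which lands near but not exactly on the quoted 3.5%, likely because the paper counts the structures slightly differently):

```python
# 128-entry ireuse table: 3 + 7 + 10 + 10 bits per entry (AGG/BYP/REF/ZERO).
ireuse_bits = 128 * (3 + 7 + 10 + 10)
lines = 32 * 4                      # 32 sets x 4 ways of cache lines
meta_bits = lines * 8               # 8-bit metadata per cache line
cache_bits = 16 * 1024 * 8          # 16kB L1 data array

overhead = (ireuse_bits + meta_bits) / cache_bits
print(f"{overhead:.1%}")            # 3.7%, in the ballpark of the quoted 3.5%
```

Either way, the added state is a few hundred bytes against a 16kB data array, which is what makes the scheme cheap to deploy per SM.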
25 Storage Overhead Comparison. 16kB L1 data cache (32 sets, 4 ways); 768kB unified L2 cache (64 sets, 8 ways, 12 partitions). Storage overhead w.r.t. the L1D$: Ctrl-C 3.5% with a 1.42x speedup, vs. Adaptive Bypassing* 5.6% with a 1.23x speedup. * Tian et al., Adaptive GPU Cache Bypassing, GPGPU '15
26 Ctrl-C MPKI Reduction. [Chart: MPKI normalized to the 16kB L1D$ baseline for Ctrl-C across CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN, Avg (CS), Avg (NS), and Avg (All)]
27 Ctrl-C Bus Traffic Reduction. [Chart: bus traffic normalized to the 16kB L1D$ baseline for Adaptive Bypass and Ctrl-C across CLR, PF, PVR, BC, CSR, FLD, SS, BFS, STR, ELL, PRK, MM, KMN, Avg (CS), Avg (NS), and Avg (All)]
More informationLecture 12: Large Cache Design. Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers
Lecture 12: Large ache Design Topics: Shared vs. private, centralized vs. decentralized, UA vs. NUA, recent papers 1 Shared Vs. rivate SHR: No replication of blocks SHR: Dynamic allocation of space among
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationParallelization Techniques for Implementing Trellis Algorithms on Graphics Processors
1 Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti +, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor
More informationTo Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs
To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com
More informationA Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware
A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationWALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems
: A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University
More informationLecture 10: Large Cache Design III
Lecture 10: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity Sign up for class mailing list Pseudo-LRU has a 9% higher miss rate than true LRU 1 Overview 2 Set
More informationWarp-Level Divergence in GPUs: Characterization, Impact, and Mitigation
Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation Dept. of Electrical and Computer Engineering North Carolina State University Raleigh, NC, USA {pxiang, hzhou}@ncsu.edu Abstract High
More informationMemory Access Pattern-Aware DRAM Performance Model for Multi-core Systems
Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems ISPASS 2011 Hyojin Choi *, Jongbok Lee +, and Wonyong Sung * hjchoi@dsp.snu.ac.kr, jblee@hansung.ac.kr, wysung@snu.ac.kr * Seoul
More informationCUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni
CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3
More information