UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS
|
|
- Jessie Nicholson
- 6 years ago
- Views:
Transcription
1 UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS Neha Agarwal* David Nellans Mike O Connor Stephen W. Keckler Thomas F. Wenisch* NVIDIA University of Michigan* (Major part of this work was done when Neha Agarwal was an intern at NVIDIA)
2 EVOLVING GPU MEMORY SYSTEM NVLink (Cache PCI-E Coherent) GDDR5 GPU CPU 2 GB/s GB/s 8 GB/s DDR4 Roadmap CUDA 1.-5.x CUDA 6.+: Current Future cudamemcpy Unified virtual memory CPU-GPU cache-coherent high BW interconnect Programmer controlled copying to GPU memory Run-time controlled copying à Better productivity How to best exploit full BW while maintaining programmability? 2
3 DESIGN GOALS & OPPORTUNITIES! Simple programming model:! No need for explicit data copying! Exploit full DDR + GDDR BW! Additional 3% BW via NVLink! Crucial to BW sensitive GPU apps Total Bandwidth (GB/s) Legacy CUDA UVM PCI-E NVLink CC-NUMA BW Target Design intelligent dynamic page migration policies to achieve both these goals 3
4 BANDWIDTH UTILIZATION NVLink GPU 8 GB/s CPU 2 GB/s 8 GB/s GDDR5 DDR4 Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW BW from CC % of Migrated Pages to GPU Memory 1! Coherence-based accesses, no page migration! Wastes GPU memory BW 4
5 BANDWIDTH UTILIZATION NVLink GPU 8 GB/s CPU 2 GB/s 8 GB/s GDDR5 DDR4 Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW Additional BW from CC % of Migrated Pages to GPU Memory 1! Static Oracle: Place data in the ratio of memory bandwidths [ASPLOS 15] Dynamic migration can exploit the full system memory BW 5
6 BANDWIDTH UTILIZATION NVLink GPU 8 GB/s CPU 2 GB/s 8 GB/s GDDR5 DDR4 Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW Additional BW from CC % of Migrated Pages to GPU Memory 1! Excessive migration leads to under-utilization of DDR BW Migrate pages for optimal BW utilization 6
7 CONTRIBUTIONS Intelligent Dynamic Page Migration! Aggressively migrate pages upon First-Touch to GDDR memory! Pre-fetch neighbors of touched pages to reduce TLB shootdowns! Throttle page migrations when nearing peak BW Dynamic page migration performs 1.95x better than no migration Comes within 28% of the static oracle performance 6% better than Legacy CUDA 7
8 OUTLINE! Page Migration Techniques! First-Touch page migration! Range-Expansion to save TLB shootdowns! BW balancing to stop excessive migrations! Results & Conclusions 8
9 FIRST-TOUCH PAGE MIGRATION! Naive: Migrate pages that are touched Throughput Rela.ve to No Migra.on CPU memory GPU memory! First-Touch migration approaches Legacy CUDA First-Touch migration is cheap, no hardware counters required 9
10 PROBLEMS WITH FIRST-TOUCH MIGRATION VA PA Send Ack VA PA V1 P2 P1 Invalidate V1 P2 P1 CPU TLB GPU TLB Migrate V1 CPU GPU Throughput Rela.ve To No Migra.on No overhead TLB shootdown backprop needle pathfinder! TLB shootdowns may negate benefits of page migration How to migrate pages without incurring shootdown cost? 1
11 PRE-FETCH TO AVOID TLB SHOOTDOWNS Hot pages, consume 8% BW! Intuition: Hot virtual addresses are clustered! Pre-fetch pages before access by the GPU Max Min Min No shootdown cost for pre-fetched pages 11
12 MIGRATION USING RANGE-EXPANSION CPU memory GPU memory X X X X X Accessed, shootdown required Pre-fetched, shootdown not required Throughput Relative to No Migration No overhead No range expansion TLB shootdown No overhead TLB shootdown No overhead backprop needle pathfinder TLB shootdown! Pre-fetch pages in spatially contiguous range 12
13 MIGRATION USING RANGE-EXPANSION CPU memory GPU memory X X X Accessed, shootdown required Pre-fetched, shootdown not required Throughput Relative to No Migration No range expansion Range:16 Range:64 Range: No overhead TLB shootdown No overhead TLB shootdown No overhead backprop needle pathfinder TLB shootdown! Pre-fetch pages in spatially contiguous range Range-Expansion hides TLB shootdown overhead 13
14 REVISITING BANDWIDTH UTILIZATION Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW First-Touch migration BW from CC Excessive migration % of Migrated Pages to GPU Memory 1! First-Touch + Range-Expansion aggressively unlocks GDDR BW How to avoid excessive page migrations? 14
15 BANDWIDTH BALANCING First- Touch + Range Exp % of Migrated Pages to GPU Memory 1 7 Excessive migrations First-Touch migrations Total Bandwidth (GB/s) Target Start DDR under- subscribed GDDR under- subscribed Time End % of Accesses from GPU Memory Throttle migrations when nearing peak BW 15
16 BANDWIDTH BALANCING First- Touch + Range Exp First- Touch + Range Exp + BW Balancing % of Migrated Pages to GPU Memory 1 7 Excessive migrations First-Touch migrations Total Bandwidth (GB/s) Target Start DDR under- subscribed GDDR under- subscribed Time End % of Accesses from GPU Memory Throttle migrations when nearing peak BW 16
17 SIMULATION ENVIRONMENT! Simulator: GPGPU-Sim 3.x! Heterogeneous 2-level memory! GDDR5 (2GB/s, 8-channels)! DDR4 (8GB/s, 4-channels) GPU 1 clks additional latency 8 GB/s CPU! GPU-CPU interconnect! Latency: 1 GPU core cycles! Workloads:! Rodinia applications [Che IISWC29]! DoE mini apps [Villa SC214] GDDR5 2 GB/s 8 GB/s DDR4 17
18 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle 18
19 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle First-Touch + Range-Expansion + BW Balancing outperforms Legacy CUDA 19
20 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle Streaming accesses, no-reuse after migration 2
21 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle Dynamic page migration performs 1.95x better than no migration Comes within 28% of the static oracle performance 6% better than Legacy CUDA 21
22 CONCLUSIONS! Developed migration policies without any programmer involvement! First-Touch migration is cheap but has high TLB shootdowns! First-Touch + Range-Expansion technique unlocks GDDR memory BW! BW balancing maximizes BW utilization, throttles excessive migrations These 3 complementary techniques effectively unlock full system BW 22
23 THANK YOU 23
24 EFFECTIVENESS OF RANGE-EXPANSION TLB Shootdowns Execution % Migrations Shootdown Execution Saved Benchmark Execution % Migrations Exececution backpropbenchmark 29.1% Overhead 26% 7.6% TLB Shootdowns of Shootdown Without Saved Runtime backprop 29.1% 26% 7.6% bfs TLB 6.7% Shootdown Shootdown 12% Saved.8% cns 2.4% 2%.5% comd 2.2% 89% 1.8% cns Backprop 2.4% 29.1% 26% 2% 7.6%.5% minife 3.6% 36% 1.3% mummer 21.15% 13% 2.75% comd Pathfinder 2.2% 25.9% 1% 89% 2.6% 1.8% pathfinder 25.9% 1% 2.6% srad v1.5% 27%.14% kmeans Needle 4.1% 24.9% 55% 79% 2.75% 3.17% Average 11.13% 33.5% 3.72% minife Mummer TABLE 3.6% 21.15% II: Effectiveness of range prefetching 13% at36% avoiding 2.75% 1.3% mummer Bfs 21.15% 6.7% 12% 13%.8% 2.75% needle 24.9% 55% 13.7% pathfinder Range-Expansion 25.9% can save up to 45% 1% TLB shootdowns 2.6% srad v1.5% 27%.14% 24
25 DATA TRANSFER RATIO Performance is low when GDDR/DDR ratio is away from optimal 25
26 GPU WORKLOADS: BW SENSITIVITY backprop" bfs" btree" gaussian" heartwell" hotspot" kmeans" lava_md" lud_cuda" mummergpu" needle" pathfinder" sc_gpu" srad_v1" cns" comd" minife" xsbench" 3" Rela%ve'Throughput'' 2.5" 2" 1.5" 1".5" Rela%ve' Throughput' " 1.2$.8$.12x".14x".17x".2x".25x".33x".5x" 1x" 2x" 3x" 1$ Aggregate'DRAM'Bandwidth,'1x=2GB/sec' '99$ '8$ '6$ '4$ '2$ $ 2$ 4$ 6$ 8$ 1$ Additonal'DRAM'delay'in'GPU'core'cycles' GPU workloads are highly BW sensitive 26
27 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch First- Touch + Range Exp First- Touch + Range Exp + BW Balancing ORACLE Oracle 6 Dynamic page migration performs 1.95x better than no migration Comes within 28% of the static oracle performance 6% better than Legacy CUDA 27
PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS
PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS Neha Agarwal* David Nellans Mark Stephenson Mike O Connor Stephen W. Keckler NVIDIA University of Michigan* ASPLOS 2015 EVOLVING GPU
More informationgem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood
gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory
More informationSelective GPU Caches to Eliminate CPU GPU HW Cache Coherence
Selective GPU Caches to Eliminate CPU GPU HW Cache Coherence Neha Agarwal, David Nellans, Eiman Ebrahimi, Thomas F. Wenisch, John Danskin, Stephen W. Keckler NVIDIA and University of Michigan {dnellans,eebrahimi,jdanskin,skeckler}@nvidia.com,
More informationElaborazione dati real-time su architetture embedded many-core e FPGA
Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T
More informationGaaS Workload Characterization under NUMA Architecture for Virtualized GPU
GaaS Workload Characterization under NUMA Architecture for Virtualized GPU Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li Presented by Huixiang Chen ISPASS 2017 April 24, 2017, Santa Rosa, California
More informationTransparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh
Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationTowards High Performance Paged Memory for GPUs
Towards High Performance Paged Memory for GPUs Tianhao Zheng, David Nellans, Arslan Zulfiqar, Mark Stephenson, Stephen W. Keckler NVIDIA and The University of Texas at Austin {dnellans,azulfiqar,mstephenson,skeckler}@nvidia.com,
More informationAn Investigation of Unified Memory Access Performance in CUDA
An Investigation of Unified Memory Access Performance in CUDA Raphael Landaverde, Tiansheng Zhang, Ayse K. Coskun and Martin Herbordt Electrical and Computer Engineering Department, Boston University,
More informationCtrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs
The 34 th IEEE International Conference on Computer Design Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs Shin-Ying Lee and Carole-Jean Wu Arizona State University October
More informationNUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems
NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationHETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA
HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS
More informationHigh Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationReducing GPU Address Translation Overhead with Virtual Caching
Computer Sciences Technical Report #1842 Reducing GPU Address Translation Overhead with Virtual Caching Hongil Yoon Jason Lowe-Power Gurindar S. Sohi University of Wisconsin-Madison Computer Sciences Department
More informationExploiting Inter-Warp Heterogeneity to Improve GPGPU Performance
Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun Saugata Ghose, Onur Kayiran, Gabriel H. Loh Chita Das, Mahmut Kandemir, Onur Mutlu Overview of This Talk Problem:
More informationOptimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink
Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationTHE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017
THE FUTURE OF GPU DATA MANAGEMENT Michael Wolfe, May 9, 2017 CPU CACHE Hardware managed What data to cache? Where to store the cached data? What data to evict when the cache fills up? When to store data
More informationThe Case for Heterogeneous HTAP
The Case for Heterogeneous HTAP Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki Data-Intensive Applications and Systems Lab EPFL 1 HTAP the contract with the hardware Hybrid
More informationCooperative Caching for GPUs
Saumay Dublish, Vijay Nagarajan, Nigel Topham The University of Edinburgh at HiPEAC Stockholm, Sweden 24 th January, 2017 Multithreading on GPUs Kernel Hardware Scheduler Core Core Core Core Host CPU to
More informationDVFS Space Exploration in Power-Constrained Processing-in-Memory Systems
DVFS Space Exploration in Power-Constrained Processing-in-Memory Systems Marko Scrbak and Krishna M. Kavi Computer Systems Research Laboratory Department of Computer Science & Engineering University of
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationTowards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu
Towards Automatic Heterogeneous Computing Performance Analysis Carl Pearson pearson@illinois.edu Adviser: Wen-Mei Hwu 2018 03 30 1 Outline High Performance Computing Challenges Vision CUDA Allocation and
More informationFusionSim: Characterizing the Performance Benefits of Fused CPU/GPU Systems
FusionSim: Characterizing the Performance Benefits of Fused CPU/GPU Systems by Vitaly Zakharenko A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationHETEROGENEOUS MEMORY MANAGEMENT. Linux Plumbers Conference Jérôme Glisse
HETEROGENEOUS MEMORY MANAGEMENT Linux Plumbers Conference 2018 Jérôme Glisse EVERYTHING IS A POINTER All data structures rely on pointers, explicitly or implicitly: Explicit in languages like C, C++,...
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationAn Evaluation of Unified Memory Technology on NVIDIA GPUs
An Evaluation of Unified Memory Technology on NVIDIA GPUs Wenqiang Li 1, Guanghao Jin 2, Xuewen Cui 1, Simon See 1,3 Center for High Performance Computing, Shanghai Jiao Tong University, China 1 Tokyo
More informationToward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs
Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs ABSTRACT Gwangsun Kim Arm gwangsun.kim@arm.com Mike O Connor NVIDIA and UT-Austin moconnor@nvidia.com 3D-stacked memory
More informationHETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)
More informationLoGA: Low-overhead GPU accounting using events
: Low-overhead GPU accounting using events Jens Kehne Stanislav Spassov Marius Hillenbrand Marc Rittinghaus Frank Bellosa Karlsruhe Institute of Technology (KIT) Operating Systems Group os@itec.kit.edu
More informationDesigning Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services. Presented by: Jitong Chen
Designing Next-Generation Data- Centers with Advanced Communication Protocols and Systems Services Presented by: Jitong Chen Outline Architecture of Web-based Data Center Three-Stage framework to benefit
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationFPGA-based Supercomputing: New Opportunities and Challenges
FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationHigh performance computing. Memory
High performance computing Memory Performance of the computations For many programs, performance of the calculations can be considered as the retrievability from memory and processing by processor In fact
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationAPRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs Yunho Oh, Keunsoo Kim, Myung Kuk Yoon, Jong Hyun
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationReview on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala
Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance
More informationIntel Architecture for HPC
Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter
More informationCarlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)
Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain) 4th IEEE International Workshop of High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationTHE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD
THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU
More informationUnified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association
Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some
More informationCharacterizing the Intra-warp Address Distribution and Bandwidth Demands of GPGPUs
Purdue University Purdue e-pubs Open Access Theses Theses and Dissertations Spring 2014 Characterizing the Intra-warp Address Distribution and Bandwidth Demands of GPGPUs Calvin Holic Purdue University
More information1 Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead
1 Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead Konstantinos Koukos, Uppsala University Alberto Ros, University of Murcia Erik Hagersten, Uppsala University Stefanos Kaxiras,
More informationUnderstanding Reduced-Voltage Operation in Modern DRAM Devices
Understanding Reduced-Voltage Operation in Modern DRAM Devices Experimental Characterization, Analysis, and Mechanisms Kevin Chang A. Giray Yaglikci, Saugata Ghose,Aditya Agrawal *, Niladrish Chatterjee
More informationUnified memory. GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015
Unified memory GPGPU 2015: High Performance Computing with CUDA University of Cape Town (South Africa), April, 20th-24th, 2015 Manuel Ujaldón Associate Professor @ Univ. of Malaga (Spain) Conjoint Senior
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationA Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,
More informationOrchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das
Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das Parallelize your code! Launch more threads! Multi- threading
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationRethinking Prefetching in GPGPUs: Exploiting Unique Opportunities
Rethinking Prefetching in GPGPUs: Exploiting Unique Opportunities Ahmad Lashgar Electrical and Computer Engineering University of Victoria Victoria, BC, Canada Email: lashgar@uvic.ca Amirali Baniasadi
More informationA Comparison of Performance and Accuracy of Measurement Algorithms in Software
A Comparison of Performance and Accuracy of Measurement Algorithms in Software Omid Alipourfard, Masoud Moshref 1, Yang Zhou 2, Tong Yang 2, Minlan Yu 3 Yale University, Barefoot Networks 1, Peking University
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number
More informationBe Fast, Cheap and in Control with SwitchKV Xiaozhou Li
Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li Raghav Sethi Michael Kaminsky David G. Andersen Michael J. Freedman Goal: fast and cost-effective key-value store Target: cluster-level storage for
More informationNVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas
NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -
More informationBigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu
Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by Professor Wen-Mei Hwu 1 Outline Experiences from working with domain experts to develop GPU codes on Blue Waters
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationSharing High-Performance Devices Across Multiple Virtual Machines
Sharing High-Performance Devices Across Multiple Virtual Machines Preamble What does sharing devices across multiple virtual machines in our title mean? How is it different from virtual networking / NSX,
More informationCAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads
2015 InternaDonal Symposium on Computer Architecture (ISCA- 42) CAWA: Coordinated Warp Scheduling and Cache Priori6za6on for Cri6cal Warp Accelera6on of GPGPU Workloads Shin- Ying Lee Akhil Arunkumar Carole-
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationAdapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES
Adapted from: TRENDS AND ATTRIBUTES OF HORIZONTAL AND VERTICAL COMPUTING ARCHITECTURES Tom Atwood Business Development Manager Sun Microsystems, Inc. Takeaways Understand the technical differences between
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationOpenPOWER Performance
OpenPOWER Performance Alex Mericas Chief Engineer, OpenPOWER Performance IBM Delivering the Linux ecosystem for Power SOLUTIONS OpenPOWER IBM SOFTWARE LINUX ECOSYSTEM OPEN SOURCE Solutions with full stack
More informationPosition Paper: OpenMP scheduling on ARM big.little architecture
Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM
More informationPrefetching Techniques for Near-memory Throughput Processors
Prefetching Techniques for Near-memory Throughput Processors Reena Panda University of Texas at Austin reena.panda@utexas.edu Onur Kayiran AMD Research onur.kayiran@amd.com Yasuko Eckert AMD Research yasuko.eckert@amd.com
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationLecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability
Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood
More informationEfficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory
Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014
More informationEvaluating On-Node GPU Interconnects for Deep Learning Workloads
Evaluating On-Node GPU Interconnects for Deep Learning Workloads NATHAN TALLENT, NITIN GAWANDE, CHARLES SIEGEL ABHINAV VISHNU, ADOLFY HOISIE Pacific Northwest National Lab PMBS 217 (@ SC) November 13,
More informationExploring GPU Architecture for N2P Image Processing Algorithms
Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationFast and Efficient Automatic Memory Management for GPUs using Compiler- Assisted Runtime Coherence Scheme
Fast and Efficient Automatic Memory Management for GPUs using Compiler- Assisted Runtime Coherence Scheme Sreepathi Pai R. Govindarajan Matthew J. Thazhuthaveetil Supercomputer Education and Research Centre,
More informationCache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs
Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai and Jing-Yang Jou Department of Electronics Engineering National Chiao
More informationPerformance in GPU Architectures: Potentials and Distances
Performance in GPU Architectures: s and Distances Ahmad Lashgar School of Electrical and Computer Engineering College of Engineering University of Tehran alashgar@eceutacir Amirali Baniasadi Electrical
More informationLecture 27: Multiprocessors. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs
Lecture 27: Multiprocessors Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood programming model
More informationCache efficiency based dynamic bypassing technique for improving GPU performance
94 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'18 Cache efficiency based dynamic bypassing technique for improving GPU performance Min Goo Moon 1, Cheol Hong Kim* 1 1 School of Electronics and
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationBest Practices for Setting BIOS Parameters for Performance
White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page
More informationCUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation
CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark
More informationEXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MARCH 17 TH, MIC Workshop PAGE 1. MIC workshop Guillaume Colin de Verdière
EXASCALE COMPUTING ROADMAP IMPACT ON LEGACY CODES MIC workshop Guillaume Colin de Verdière MARCH 17 TH, 2015 MIC Workshop PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France March 17th, 2015 Overview Context
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationRevisiting Virtual Memory for High Performance Computing on Manycore Architectures: A Hybrid Segmentation Kernel Approach
Revisiting Virtual Memory for High Performance Computing on Manycore Architectures: A Hybrid Segmentation Kernel Approach Yuki Soma, Balazs Gerofi, Yutaka Ishikawa 1 Agenda Background on virtual memory
More informationGeneral Purpose GPU Programming. Advanced Operating Systems Tutorial 9
General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous
More informationMediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency
MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationStorage. Hwansoo Han
Storage Hwansoo Han I/O Devices I/O devices can be characterized by Behavior: input, out, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 2 I/O System Characteristics
More informationOverview of Project's Achievements
PalDMC Parallelised Data Mining Components Final Presentation ESRIN, 12/01/2012 Overview of Project's Achievements page 1 Project Outline Project's objectives design and implement performance optimised,
More informationEliminating Dark Bandwidth
Eliminating Dark Bandwidth Jonathan Beard (twitter: @jonathan_beard) Staff Research Engineer Memory Systems, HPC HCPM 2017 22 June 2017 Dark Bandwidth: Data moved throughout the memory hierarchy that is
More information