PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS
1 PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS
Neha Agarwal*, David Nellans, Mark Stephenson, Mike O'Connor, Stephen W. Keckler
NVIDIA, University of Michigan*
ASPLOS 2015
2 EVOLVING GPU MEMORY SYSTEM
[Figure: GPU with GDDR5 (200 GB/s) connected over PCI-E (15.8 GB/s) to CPU with DDR4 (80 GB/s)]
Roadmap
- Current: CUDA 1-7, cudaMemcpy / Unified Virtual Memory, low-BW interconnect
3 EVOLVING GPU MEMORY SYSTEM
[Figure: GPU with GDDR5 (200 GB/s) connected over cache-coherent NVLink (80 GB/s) to CPU with DDR4 (80 GB/s)]
Roadmap
- Current: CUDA 1-7, cudaMemcpy / Unified Virtual Memory, low-BW interconnect
- Future: CPU-GPU cache-coherent high-BW interconnect
How to best exploit BW from both GPU memory & the cache-coherent link?
4 GPU WORKLOAD CHARACTERISTICS
Highly Sensitive to Memory Bandwidth
[Figure: relative throughput of Rodinia, Parboil and DoE mini apps vs. aggregate DRAM bandwidth, from 0.125x to 3x, where 1x = 200 GB/sec]
Applications perform up to 2.8x better with 3x more BW
5 GPU WORKLOAD CHARACTERISTICS
Memory Latency Tolerant
[Figure: relative throughput of Rodinia, Parboil and DoE mini apps vs. DRAM latency in cycles; baseline is 100 cycles]
Applications are more sensitive to memory bandwidth than to latency
6 CPU-GPU MEMORY ORGANIZATION
Variation in Bandwidth and Capacity
- BW-optimized memory → GPU
- Higher BW, limited capacity
- HBM, GDDR5, WIO2
- Capacity-optimized memory → CPU
- Higher capacity, lower BW
- DDR4, LPDDR4
System configurations (BW-optimized memory; capacity-optimized memory; BW ratio):
- HPC system: GPU + HBM (1000 GB/sec, 16 GB); CPU + DDR4 (120 GB/sec, 256 GB); 8.3x
- Desktop system: GPU + GDDR5 (200 GB/sec, 4 GB); CPU + DDR4 (80 GB/sec, 32 GB); 2.5x
- Mobile system: GPU + WIO2 (51 GB/sec, 1 GB); CPU + LPDDR4 (24 GB/sec, 4 GB); 2.1x
How to place data to exploit BW from all levels of the memory?
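The BW ratios quoted on slide 6 follow directly from the listed bandwidths. The short Python sketch below recomputes them and also shows what share of the aggregate bandwidth the BW-optimized memory contributes in each system. The dictionary layout and names such as `systems` and `bo_share` are illustrative choices, not part of the slides.

```python
# Bandwidths in GB/sec as listed on slide 6:
# (BW-optimized memory, capacity-optimized memory).
systems = {
    "HPC": (1000.0, 120.0),
    "Desktop": (200.0, 80.0),
    "Mobile": (51.0, 24.0),
}

# Ratio of BW-optimized to capacity-optimized bandwidth, rounded as on the slide.
ratios = {name: round(bo / co, 1) for name, (bo, co) in systems.items()}

# Share of the total (aggregate) bandwidth that the BW-optimized memory provides.
bo_share = {name: bo / (bo + co) for name, (bo, co) in systems.items()}

for name in systems:
    print(f"{name:8s} ratio={ratios[name]}x  BO share={bo_share[name]:.1%}")
```

The share computed here is the same quantity that the BW-AWARE policy later uses as its target placement fraction.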
7 CONTRIBUTIONS
BW-Maximizing Page Placement Policies
- BW-AWARE policy places pages in the ratio of the memory bandwidths
- Application-Aware policy selectively places hot pages in GPU memory
- Data-structure-based annotation can identify hot pages
BW-AWARE page placement is 35% better than Linux INTERLEAVE and 18% better than Linux LOCAL
Application-Aware policy achieves 90% of the static oracle performance
8 OUTLINE
- Page Placement Policies
- Linux NUMA page placement
- BW-AWARE placement
- Application memory access patterns
- Application-Aware placement
- Results & Conclusions
9 NUMA PAGE PLACEMENT
Linux's Local Placement for CPU
[Figure: GPU with GDDR5 (200 GB/s), NVLink (80 GB/s), CPU with DDR4 (80 GB/s). Bar chart: time (s) to serve 100 GB from GPU memory (200 GB/s) and CPU memory (80 GB/s), with the minimum time to serve 100 GB marked; lower is better]
- All data in capacity-optimized CPU memory
- Wastes high-BW GPU memory
10 NUMA PAGE PLACEMENT
Linux's Local Placement for GPU
[Figure: same system and bar chart as slide 9; lower is better]
- All data in bandwidth-optimized GDDR memory
- Wastes cache-coherent BW
11 NUMA PAGE PLACEMENT
Linux's Interleave Placement
[Figure: same system and bar chart as slide 9; lower is better]
- 50% of data in GDDR memory, 50% in DDR memory
- Exploits BW of both GDDR & DDR memories
12 NUMA PAGE PLACEMENT
Bandwidth Optimal Placement?
[Figure: same system and bar chart as slide 9; lower is better]
- Goal: Minimize time to access the data
- What is the optimal data placement ratio?
13 NUMA PAGE PLACEMENT
Bandwidth Optimal -> BW-AWARE Placement
[Figure: GPU with GDDR5 (200 GB/s) serving 70% of accesses; NVLink (80 GB/s); CPU with DDR4 (80 GB/s) serving 30% of accesses. Bar chart: time (s) to serve 100 GB from GPU memory (200 GB/s) and CPU memory (80 GB/s), with the minimum time to serve 100 GB marked; lower is better]
- Place data in the ratio of memory bandwidths
- Achieves the goal of minimizing the time to access the data
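The four placements on slides 9 to 13 can be compared with a little arithmetic. The Python sketch below assumes a simple model in which both memories are read in parallel and every byte is accessed equally often, so the slower side determines the total time. The function name `time_to_serve` and the dictionary keys are illustrative, not taken from the slides.

```python
def time_to_serve(total_gb, frac_gddr, gddr_bw=200.0, ddr_bw=80.0):
    """Seconds to stream total_gb when frac_gddr of the data sits in GDDR.

    Assumes both memories are read concurrently, so the slower side
    sets the overall time.
    """
    t_gddr = total_gb * frac_gddr / gddr_bw
    t_ddr = total_gb * (1.0 - frac_gddr) / ddr_bw
    return max(t_gddr, t_ddr)

# Fraction of the data placed in GDDR under each policy.
placements = {
    "LOCAL (CPU)": 0.0,
    "LOCAL (GPU)": 1.0,
    "INTERLEAVE": 0.5,
    "BW-AWARE": 200.0 / (200.0 + 80.0),
}

times = {name: time_to_serve(100.0, f) for name, f in placements.items()}
for name, t in times.items():
    print(f"{name:12s} {t:.3f} s")
```

Under this model BW-AWARE uses a GDDR fraction of 200/280, roughly 71%, which the slides round to 70%.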
14 BW-AWARE IMPLEMENTATION
Physical page allocation
- Generate a random number N ∈ [0, 99]
- If N ≥ 30: page → GDDR
- Else: page → DDR
[Figure: pages distributed between DDR4 and GDDR5]
Random placement distributes data accesses in the optimal ratio
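A minimal sketch of the random placement rule on slide 14, written in Python rather than as an OS allocator. The comparison operator is not legible in the transcript; the sketch assumes `N >= 30` sends a page to GDDR, chosen so that about 70% of pages land in GDDR as the slides state. The names `place_page` and `ddr_percent` are hypothetical.

```python
import random

def place_page(rng, ddr_percent=30):
    """Pick a memory for one newly allocated physical page.

    Assumption: values 0..29 map to DDR and 30..99 map to GDDR, which
    yields the 30:70 DDR:GDDR split described in the slides.
    """
    n = rng.randrange(100)  # uniform integer in [0, 99]
    return "GDDR" if n >= ddr_percent else "DDR"

rng = random.Random(0)  # fixed seed so the run is repeatable
pages = [place_page(rng) for _ in range(100_000)]
gddr_share = pages.count("GDDR") / len(pages)
print(f"fraction of pages in GDDR: {gddr_share:.3f}")
```

Because each page is placed independently, no per-application state is needed, which is what makes the policy application-agnostic.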
15 BW-AWARE PERFORMANCE
- BW-AWARE performs
- 35% better than INTERLEAVE
- 18% better than LOCAL
- BW-AWARE policy places
- 70% of accesses → GDDR (70G)
- 30% of accesses → DDR (30D)
[Figure: throughput relative to Linux INTERLEAVE vs. ratio of pages placed in GDDR:DDR memories, with LOCAL, BW-AWARE and INTERLEAVE marked. Benchmarks: backprop, bfs, cfd, gaussian, kmeans, mummergpu, needle, nn, pathfinder, srad_v1, cns, comd, minife, xsbench, histo, lbm, sad, sgemm, stencil]
What if GDDR memory doesn't have enough capacity?
16 LIMITED MEMORY CAPACITY
- Performance degrades at low GDDR memory capacity
- At low GDDR capacities (10%)
- Up to 4x slowdown
[Figure: throughput relative to optimal vs. available GDDR memory capacity (%), for the same benchmarks as slide 15]
Data access ratio != optimal
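To see why a capacity limit hurts, the Python sketch below applies the uniform-access model from the backup slide on BW-AWARE placement to a GDDR memory that can hold only part of the footprint. It is an idealized illustration, not the simulator used for the slides: real applications access pages non-uniformly, so measured slowdowns (the slides report up to 4x) need not match it. The function name `relative_throughput` is hypothetical.

```python
def relative_throughput(capacity_frac, gddr_bw=200.0, ddr_bw=80.0):
    """Throughput relative to the unconstrained optimum.

    capacity_frac is the fraction of the footprint that fits in GDDR.
    Assumes uniformly accessed pages and concurrent use of both memories.
    """
    f_opt = gddr_bw / (gddr_bw + ddr_bw)   # ideal GDDR fraction
    f = min(capacity_frac, f_opt)          # cannot exceed what fits
    t = max(f / gddr_bw, (1.0 - f) / ddr_bw)
    t_opt = 1.0 / (gddr_bw + ddr_bw)
    return t_opt / t

for pct in (10, 30, 50, 70, 100):
    print(f"{pct:3d}% capacity -> {relative_throughput(pct / 100):.2f}x of optimal")
```

Once the GDDR fraction falls below the ideal one, the DDR side becomes the bottleneck and throughput falls with it.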
17 APPLICATION CHARACTERISTICS
- Bandwidth CDF
- At OS page granularity (4KB)
- Hotness: number of accesses to a page served from DRAM
- Application pages are accessed non-uniformly
[Figure: fraction of memory bandwidth (%) vs. fraction of application memory footprint (%), both from 0 to 100, with hot and cold pages marked. Benchmarks: backprop, bfs, cfd, gaussian, kmeans, mummergpu, needle, nn, pathfinder, srad_v1, cns, comd, minife, xsbench, histo, lbm, sad, sgemm, stencil]
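A bandwidth CDF of the kind plotted on slide 17 can be built from per-page access counts by sorting pages from hottest to coldest and accumulating their share of all accesses. The Python sketch below uses made-up counts purely for illustration; `bandwidth_cdf` and `accesses` are hypothetical names, and the numbers are not measurements from the slides.

```python
def bandwidth_cdf(page_accesses):
    """Return (footprint_pct, bandwidth_pct) points, hottest pages first.

    page_accesses holds one DRAM access count per 4KB page.
    """
    counts = sorted(page_accesses, reverse=True)
    total = sum(counts)
    points = []
    running = 0
    for i, count in enumerate(counts, start=1):
        running += count
        points.append((100.0 * i / len(counts), 100.0 * running / total))
    return points

# Made-up access counts for ten pages: one very hot page, many cold ones.
accesses = [900, 50, 20, 10, 5, 5, 4, 3, 2, 1]
cdf = bandwidth_cdf(accesses)
for footprint, bandwidth in cdf:
    print(f"{footprint:5.1f}% of footprint -> {bandwidth:5.1f}% of accesses")
```

A curve that rises steeply at the left, as in this toy input, indicates that a small set of hot pages accounts for most of the memory traffic.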
18 VISUALIZING PAGE ACCESS PATTERN
- Application CDF of data footprint vs. virtual address layout
- Intuition: Hot pages are clustered together
- Place hot data structures in limited-capacity GDDR memory
[Figure: BFS. Virtual page address (0 to 20000) vs. application pages, with hot and cold pages marked. Application data structures: d_graph_nodes, d_graph_mask, d_graph_visited, d_over, d_graph_edges, d_updating_graph_mask, d_cost]
19 APPLICATION-AWARE PAGE PLACEMENT
Selectively place hot pages in GDDR memory
- Goal: Achieve a data access ratio close to the optimal
- Compiler-based profiling to identify hot data structures [Stephenson ISCA2015]
- Augmented nvcc and ptxas to support data structure access profiling
- Compiler inserts memory instrumentation code for all loads & stores
- Program annotation to hint hot data structures for placement
- Runtime uses these hints to place hot pages in GDDR memory
20 PROGRAM ANNOTATION
- Annotate the cudaMalloc with memory placement hints
- CUDA runtime uses these hints to make data placement decisions
Original Code:
    cudaMalloc(devptr0, size[0]);
    cudaMalloc(devptr1, size[1]);
Annotated Code (hotness from profiling or an expert programmer):
    hotness[0] = 2;
    hotness[1] = 1;
    hint[] = GetAllocation(size[], hotness[]);
    cudaMalloc(devptr0, size[0], hint[0]);
    cudaMalloc(devptr1, size[1], hint[1]);
[Figure: virtual address space layout of devptr0 and devptr1 for the original and the annotated code, with the annotated layout ordered by hotness]
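The slides do not say how GetAllocation turns sizes and hotness values into hints. One plausible reading, sketched below in Python, is a greedy pass that gives GDDR to the hottest data structures first until the available capacity runs out. Everything here is an assumption made for illustration: the function `get_allocation`, the explicit `gddr_capacity` parameter (the slide's call takes only sizes and hotness), and the string hints "GDDR" and "DDR".

```python
def get_allocation(sizes, hotness, gddr_capacity):
    """Hypothetical stand-in for the slide's GetAllocation.

    Greedily assigns the hottest data structures to GDDR while they fit;
    everything else goes to DDR. Returns one hint per allocation.
    """
    hints = ["DDR"] * len(sizes)
    remaining = gddr_capacity
    # Visit allocations from highest to lowest hotness.
    order = sorted(range(len(sizes)), key=lambda i: hotness[i], reverse=True)
    for i in order:
        if sizes[i] <= remaining:
            hints[i] = "GDDR"
            remaining -= sizes[i]
    return hints

sizes = [4096, 8192]   # bytes requested for devptr0 and devptr1
hotness = [2, 1]       # devptr0 is hotter, as in the annotated code
hints = get_allocation(sizes, hotness, gddr_capacity=6000)
print(hints)
```

With only 6000 bytes of GDDR available in this toy call, the hotter allocation is placed in GDDR and the colder one falls back to DDR. A real runtime could equally split a structure across both memories, which this sketch does not attempt.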
21 SIMULATION ENVIRONMENT
- Simulator: GPGPU-Sim 3.x
- Heterogeneous 2-level memory
- GDDR5 (200 GB/s, 8 channels)
- DDR4 (80 GB/s, 4 channels)
- GPU-CPU interconnect
- Latency: 100 GPU core cycles
- Workloads:
- Rodinia [Che IISWC2009], Parboil [Stratton TR2012]
- DoE mini apps [Villa SC2014]
[Figure: GPU with GDDR5 (200 GB/s) connected to CPU with DDR4 (80 GB/s) over an 80 GB/s link with 100 clocks of additional latency]
22 PROGRAM ANNOTATION PERFORMANCE
[Figure: throughput relative to Linux INTERLEAVE for BW-AWARE, Program Annotation and Oracle]
- At 10% footprint in GDDR
- 19% better than INTERLEAVE, 90% of oracle performance
Program annotation correctly identifies hot data structures
23 SUMMARY
- BW-AWARE page placement is an application-agnostic policy
- Directly implementable as a default OS policy
- Near-optimal performance when applications fit in GDDR memory
- Application-aware page placement
- Governed by page-access frequency
- Program annotation achieves 90% of Oracle performance
- Dynamic migration may be required in applications with phases
These 2 policies effectively exploit full system BW (code available at
24 THANK YOU
25 BW-AWARE ADAPTABILITY TO BW-RATIOS
- Bandwidth ratios vary
- BW-AWARE adapts to bandwidth variation
[Figure: throughput relative to 200 GB/sec vs. BO + CO memory bandwidth (GB/sec) for the INTERLEAVE, BW-AWARE and LOCAL policies]
26 SENSITIVITY TO DATA SETS
[Figure: throughput relative to Linux INTERLEAVE for BW-AWARE, Program Annotation and Oracle on bfs, minife, mummergpu and xsbench, each with training-set, data-set-1, data-set-2 and data-set-3]
Feedback-driven optimization is not sensitive to datasets/parameters
27 ORACLE: CONSTRAINED CAPACITY
[Figure: throughput relative to unconstrained INTERLEAVE for Oracle with unconstrained capacity, Oracle with 10% GDDR memory capacity, BW-AWARE with unconstrained capacity, and BW-AWARE with 10% GDDR memory capacity]
At 10% GDDR capacity, Oracle performance drops relative to 100% capacity
28 BW-AWARE PAGE PLACEMENT
Place pages in the BW ratio of the NUMA zones
- BO & CO memory BW: $b_B$ and $b_C$
- $N$ uniformly accessed pages
- Page fraction in BO memory: $f_B$
- $T = \max\left(N f_B / b_B,\; N (1 - f_B) / b_C\right)$
- $f_B^{opt} = b_B / (b_B + b_C)$
- Application-agnostic policy
[Figure: time $T$ vs. fraction of pages in BO memory ($f_B$, from 0 to 1); the curve starts at $N / b_C$ for $f_B = 0$ and reaches its minimum at $(f_B^{opt}, T_{min})$]
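The step from the expression for $T$ to the optimal fraction is not spelled out on the slide. A short derivation, using only the symbols defined above and assuming both memories are accessed concurrently, could read as follows; the closed form for $T_{min}$ and the numeric check with 200 GB/s and 80 GB/s are worked out here rather than quoted from the slides.

```latex
% The first argument of the max grows with f_B and the second shrinks,
% so the maximum is smallest where the two are equal.
\begin{align*}
T(f_B) &= \max\left(\frac{N f_B}{b_B},\; \frac{N (1 - f_B)}{b_C}\right) \\
\frac{N f_B}{b_B} &= \frac{N (1 - f_B)}{b_C}
  \;\Longrightarrow\; f_B \, b_C = (1 - f_B) \, b_B
  \;\Longrightarrow\; f_B^{opt} = \frac{b_B}{b_B + b_C} \\
T_{min} &= \frac{N f_B^{opt}}{b_B} = \frac{N}{b_B + b_C} \\
% Numeric check with the desktop configuration used in the talk:
b_B = 200,\; b_C = 80 &\;\Longrightarrow\; f_B^{opt} = \frac{200}{280} \approx 0.71
\end{align*}
```

The value of roughly 0.71 is consistent with the 70:30 split that the BW-AWARE slides use for GDDR and DDR.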
More informationShadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies
Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu
More informationWHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016
WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationOpenPOWER Performance
OpenPOWER Performance Alex Mericas Chief Engineer, OpenPOWER Performance IBM Delivering the Linux ecosystem for Power SOLUTIONS OpenPOWER IBM SOFTWARE LINUX ECOSYSTEM OPEN SOURCE Solutions with full stack
More informationCS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15
CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15 LAST TIME! Discussed concepts of locality and stride Spatial locality: programs tend to access values near values they have already accessed
More informationLecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program
More informationPage 1. Multilevel Memories (Improving performance using a little cash )
Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationHow to Optimize Geometric Multigrid Methods on GPUs
How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient
More informationVirtualized and Flexible ECC for Main Memory
Virtualized and Flexible ECC for Main Memory Doe Hyun Yoon and Mattan Erez Dept. Electrical and Computer Engineering The University of Texas at Austin ASPLOS 2010 1 Memory Error Protection Applying ECC
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationComputer Architecture Computer Science & Engineering. Chapter 6. Storage and Other I/O Topics BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine
More informationEfficient and Fair Multi-programming in GPUs via Effective Bandwidth Management
Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management Haonan Wang, Fan Luo, Mohamed Ibrahim, Onur Kayiran, and Adwait Jog College of William and Mary Advanced Micro Devices, Inc.
More information