Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs
11th IFIP International Conference on Network and Parallel Computing
Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu
CS Department, Hong Kong Baptist University
September 19, 2014

OUTLINE
1 Background & Motivation: GPU Computing, P-Chase Benchmark
2 Fine-grained Benchmark: Design, Methodology
3 Experimental Results: Shared Memory, Global Memory, Texture Memory
4 Conclusion

GPU Computing: GPU ARCHITECTURE
The GPU is a SIMD, many-core parallel architecture.
GeForce GTX 780 (Kepler): 12 streaming multiprocessors (SMs); each SM contains 192 ALUs, 64 DPUs, 32 SFUs, 16 texture units, and shared memory / L1 data cache plus a read-only data cache. The SMs share an L2 cache backed by DRAM.
Figure: Block Diagram of GeForce 780

GPU Computing: GPU MEMORY HIERARCHY

Table: GPU Memory Spaces (sorted by typical access time, ascending)

Memory            Type  Location  Cached  Lifetime
Register          R/W   on-chip   no      per-thread
Shared Memory     R/W   on-chip   no      per-block
Constant Memory   R     off-chip  yes     host allocation
Texture Memory    R     off-chip  yes     host allocation
Local Memory      R/W   off-chip  yes*    per-thread
Global Memory     R/W   off-chip  yes*    host allocation

* Local/global memory accesses are cached only on devices of compute capability 2.0 and above.

GPU Computing: GPU MEMORY HIERARCHY (cont.)
(Same table as the previous slide.) Memory accesses have long been the bottleneck of further performance improvement.

GPU Computing: TARGET STRUCTURE
Study the GPU memory hierarchy, focusing on the three most popular spaces:
1 Shared memory
2 Global memory
3 Texture memory
Characteristics of interest:
1 Memory access latency
2 Unknown cache mechanisms

P-Chase Benchmark: REVIEW OF CACHE STRUCTURE
A cache is a small, fast buffer between the processor and main memory; it is typically set-associative with LRU replacement.
Example: a 48-byte (12-word) 2-way set-associative cache with 3 sets and 2-word cache lines, i.e. 6 lines in total. The memory address splits into a set index and a word offset, and consecutive 2-word blocks (words 0-1, 2-3, ..., 10-11) are assigned to sets 0, 1, 2 in round-robin order, each set holding two ways.
Figure: Typical Set-Associative CPU Cache Addressing
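
As a worked example of this addressing (a hypothetical sketch, not part of the slides), the following C program prints the set and word offset that each of the 12 words maps to in the 2-way, 3-set example cache:

    #include <stdio.h>

    int main(void) {
        const int words_per_line = 2;           /* 2-word (8-byte) cache lines  */
        const int num_sets       = 3;

        for (int word = 0; word < 12; word++) {
            int line = word / words_per_line;   /* which 2-word memory block    */
            int set  = line % num_sets;         /* set the block is mapped to   */
            int off  = word % words_per_line;   /* word offset within the line  */
            printf("word %2d -> set %d, offset %d\n", word, set, off);
        }
        return 0;
    }

Either of the two ways within the selected set can hold the block; with LRU replacement, the least recently used way is evicted on a miss.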

P-Chase Benchmark: P-CHASE BENCHMARK
P-Chase: stride memory access (pointer chasing).
Pseudo code:
    for (i = 0; i < iterations; i++)
        j = A[j];
Initialization:
    for (i = 0; i < array_size; i++)
        A[i] = (i + stride) % array_size;
Memory access process: with array A of 24 elements, stride 2, and 12 iterations, the chase visits elements 0, 2, 4, ..., 22.
Cache miss: stride ≥ cache line size and array size > cache size.
Cache hit: stride < cache line size or array size ≤ cache size.

P-Chase Benchmark: P-CHASE BENCHMARK (cont.)
Average memory access time reflects the cache structure!
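
For reference, a minimal CUDA sketch of this traditional P-chase measurement (average latency only) could look as follows; the kernel name, parameter list, and single-thread launch are illustrative assumptions, not the original code:

    __global__ void pchase_avg(unsigned int *A, int iterations,
                               unsigned int *out_index, unsigned int *out_cycles) {
        unsigned int j = 0;
        unsigned int start = clock();
        for (int k = 0; k < iterations; k++)
            j = A[j];                   /* dependent (pointer-chasing) load        */
        unsigned int end = clock();
        *out_index  = j;                /* keep j live so the loop is not removed  */
        *out_cycles = end - start;      /* total cycles; divide by iterations      */
    }

Launched as pchase_avg<<<1, 1>>>(...), the kernel yields one average latency per (array size, stride) pair, which is exactly the coarse view the next slides argue is insufficient.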

P-Chase Benchmark: LITERATURE REVIEW
Figure: Kepler Texture L1 Cache, measured with the method of Saavedra et al. (1992).
Figure: Kepler Texture L1 Cache, measured with the method of Wong et al. (2010).
The two measurements report contradictory cache line sizes: one finds 32 bytes, the other 128 bytes. Average memory latency hides some details.

P-Chase Benchmark: MOTIVATION
Hardware has been upgraded: in earlier studies, global memory was not cached and memory access times were much longer.
The traditional P-Chase is based on a CPU cache model; the GPU cache could be different.
Observe every latency rather than the average: our fine-grained benchmark records the access time of every array element.

Design: DESIGN
Storage: shared memory, which is on-chip (writes complete promptly) and offers 48 KB per SM.
Declare two arrays:
1 s_tvalue: access latency of the current element
2 s_index: index of the next memory access
Timing: clock(). Store the loaded value inside the timed region, right after it is used, so the measured interval covers the actual memory access.
Pseudo code:
    __global__ void KernelFunction() {
        __shared__ unsigned int s_tvalue[ ];
        __shared__ unsigned int s_index[ ];
        for (k = 0; k < iterations; k++) {
            start_time = clock();
            j = my_array[j];
            s_index[k] = j;                        // store the element index
            end_time = clock();
            s_tvalue[k] = end_time - start_time;   // store the element access latency
        }
    }
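
Below is a self-contained CUDA sketch of how such a fine-grained kernel might be driven end to end. It is illustrative only: the kernel and buffer names, the ITERS constant, the array size and stride, and the flush of the shared arrays to global memory are assumptions made for the example, not the authors' released code.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define ITERS 64                         /* accesses recorded per run          */

    __global__ void fine_grained_pchase(unsigned int *my_array,
                                        unsigned int *d_index,
                                        unsigned int *d_tvalue) {
        __shared__ unsigned int s_index[ITERS];
        __shared__ unsigned int s_tvalue[ITERS];
        unsigned int j = 0;
        for (int k = 0; k < ITERS; k++) {
            unsigned int start = clock();
            j = my_array[j];                 /* dependent load                      */
            s_index[k] = j;                  /* use the value inside the timed region */
            unsigned int end = clock();
            s_tvalue[k] = end - start;       /* per-access latency                  */
        }
        for (int k = 0; k < ITERS; k++) {    /* flush the records to global memory  */
            d_index[k]  = s_index[k];
            d_tvalue[k] = s_tvalue[k];
        }
    }

    int main(void) {
        const int n = 4096, stride = 32;     /* one (assumed) point of the sweep    */
        unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
        for (int i = 0; i < n; i++) h[i] = (i + stride) % n;

        unsigned int *d_a, *d_idx, *d_lat;
        cudaMalloc(&d_a,   n * sizeof(unsigned int));
        cudaMalloc(&d_idx, ITERS * sizeof(unsigned int));
        cudaMalloc(&d_lat, ITERS * sizeof(unsigned int));
        cudaMemcpy(d_a, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

        /* one thread, one block: latencies are not disturbed by other warps */
        fine_grained_pchase<<<1, 1>>>(d_a, d_idx, d_lat);

        unsigned int idx[ITERS], lat[ITERS];
        cudaMemcpy(idx, d_idx, sizeof(idx), cudaMemcpyDeviceToHost);
        cudaMemcpy(lat, d_lat, sizeof(lat), cudaMemcpyDeviceToHost);
        for (int k = 0; k < ITERS; k++)
            printf("access %2d -> element %4u : %3u cycles\n", k, idx[k], lat[k]);

        free(h);
        cudaFree(d_a); cudaFree(d_idx); cudaFree(d_lat);
        return 0;
    }

The single-thread launch keeps other warps from perturbing the measurements; the host then inspects every recorded (index, latency) pair instead of a single average.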

Methodology: DIRECTIONS
Flowchart of applying our fine-grained benchmark, in two stages:
Stage 1: stride = 1 element, array size = cache size + 1 element.
  Get the cache size from the user manual or a footprint experiment, overflow the cache by one element, and obtain the cache replacement policy as well as the cache line size from the cache miss pattern.
Stage 2: stride = 1 cache line, array size = cache size + 1..n cache lines.
  Increase the array size, one cache line per increment, to expose the first, second, ..., last cache set, and obtain the cache associativity and the memory-to-cache mapping.
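
The two-stage sweep can be pictured with a small host-side sketch. Everything here is illustrative: run_pchase() merely stands in for launching the fine-grained kernel, and the cache size and line size are assumed example values (given in 4-byte elements).

    #include <stdio.h>

    static void run_pchase(int array_size, int stride) {
        printf("P-chase: array_size = %d elements, stride = %d elements\n",
               array_size, stride);
        /* ... launch the fine-grained kernel with these parameters ... */
    }

    int main(void) {
        const int cache_size = 16 * 1024 / 4;   /* e.g. a 16 KB cache           */
        const int line_size  = 128 / 4;         /* e.g. a 128-byte cache line   */

        /* Stage 1: stride = 1 element, array overflows the cache by 1 element  */
        run_pchase(cache_size + 1, 1);

        /* Stage 2: stride = 1 cache line, array size = cache size + 1..n lines */
        for (int extra = 1; extra <= 8; extra++)
            run_pchase(cache_size + extra * line_size, line_size);
        return 0;
    }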

Shared Memory: BANK CONFLICT I
Normal shared memory latency: about 50 clock cycles.
Shared memory is organized as 32 memory banks of 4-byte words.
Bank conflict: two or more threads in the same warp access memory locations that belong to the same shared memory bank.
Stride memory access, pseudo code:
    data = threadIdx.x * stride;
With stride = 2, thread 0 and thread 16 fall into the same bank (words 0 and 32), thread 1 and thread 17 into the next, and so on: a 2-way bank conflict.
Figure: 2-way Shared Memory Bank Conflict Caused by Stride Memory Access
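
The access pattern can be reproduced with a short CUDA kernel (an illustrative sketch; the kernel and array names are assumptions). Launched with a single warp, stride = 2 makes pairs of threads hit the same 4-byte bank, i.e. a 2-way conflict; stride = 32 serializes the whole warp.

    __global__ void strided_shared_read(int stride, unsigned int *out) {
        __shared__ unsigned int s_data[32 * 32];          /* room for stride up to 32 */
        for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
            s_data[i] = i;                                /* populate shared memory   */
        __syncthreads();

        unsigned int v = s_data[threadIdx.x * stride];    /* strided, possibly
                                                             conflicting access       */
        out[threadIdx.x] = v;                             /* keep the load live       */
    }

    /* example launch: strided_shared_read<<<1, 32>>>(2, d_out); */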

Shared Memory: BANK CONFLICT II
Bank conflict latency is much longer: the conflicting memory requests are executed sequentially, so latency grows with the degree of the conflict.
Figure: Bank Conflict Memory Latency of Fermi (memory latency in clock cycles versus stride / #-way bank conflict, for strides 0, 2, 4, 8, 16, 32).

Shared Memory: BANK CONFLICT III
Kepler outperforms Fermi at avoiding shared memory bank conflicts by introducing an 8-byte shared memory bank mode.
Figure: Memory Latency of Kepler Shared Memory (latency in clock cycles versus stride 0-64, 4-byte mode versus 8-byte mode).
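
On Kepler, the bank width can be selected from the host through the CUDA runtime; a minimal sketch (error handling omitted):

    #include <cuda_runtime.h>

    void select_eight_byte_banks(void) {
        /* Use 8-byte shared memory banks for subsequent kernels on this device;
           cudaSharedMemBankSizeFourByte restores the default 4-byte banks.      */
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    }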

Global Memory: GLOBAL MEMORY, AN OVERVIEW I
Global memory accesses:
1 Kepler: cached in the L2 data cache.
2 Fermi: cached in the L1 and L2 data caches; L1 caching can be disabled.
3 Two levels of TLB.
To exhibit the memory latencies, we use our fine-grained benchmark with a specialized initialization.

Global Memory GLOBAL MEMORY: AN OVERVIEW II 1,400 Kepler Fermi: cache in L1 Fermi: cache in L2 Memory latency (clock cycles) 1,200 1,000 800 600 400 200 0 pattern1 pattern2 pattern3 pattern4 pattern5 pattern6 Figure: Global Memory Access Latencies Table: Global Memory Access Patterns Pattern Data cache TLB 1 hit L1 hit 2 hit L1 miss, L2 hit 3 hit L2 miss 4 miss L1 hit 5 miss L2 miss 6 miss L2 miss a a page table miss : switch between tables Experimental Results 21 of 30

Global Memory: GLOBAL MEMORY, AN OVERVIEW III
Observations:
1 Big gap between pattern 1 and pattern 2 for Fermi cached in L1 ⇒ both the L1 TLB and the L2 TLB are off-chip.
2 Differences between enabling and disabling Fermi's L1 ⇒ caching in L1 brings some extra time consumption.
3 Kepler outperforms Fermi in terms of (i) cache miss penalty, (ii) L1/L2 TLB miss penalty, and (iii) memory latency (when both cache in L2).
4 The Kepler page table needs context switching (only 512 MB of page entries are activated at a time).

Global Memory: FERMI L1 DATA CACHE
Fermi L1 data cache structure (default setting):
  cache size: 16 KB
  cache line size: 128 bytes
  set-associative: 4-way, 32 sets (128 cache lines in total)
Cache addressing is non-conventional: one hot cache way is replaced more frequently than the others, with replacement probabilities of (1/6, 1/2, 1/6, 1/6) across the four ways.
Figure: Fermi L1 Data Cache Structure
Note: this is the default setting; the Fermi L1 data cache can also be configured as 48 KB, in which case the number of ways is 6.

Global Memory: TLB
The page size of both Fermi and Kepler is 2 MB (found by brute-force experiments).
The TLB structure of Fermi and Kepler is the same:
1 L1 TLB: 16 entries, fully associative.
2 L2 TLB: 65 entries, set-associative with non-uniform sets; 7 sets in total, one big set of 17 entries and 6 normal sets of 8 entries each.
Figure: TLB Structure of Fermi & Kepler

Texture Memory: TEXTURE MEMORY, AN OVERVIEW
Memory latency (clock cycles):

Table: Texture Memory Access Latency of Fermi & Kepler
          Texture cache               Global cache
Device    L1 hit   L1 miss, L2 hit    L1 hit   L1 miss, L2 hit
Fermi     240      470                116      404
Kepler    110      220                -        230
(Kepler global loads are not cached in L1; see the global memory overview.)

⇒ Fermi texture memory management is expensive!
Texture memory shares the L2 cache and the TLBs with global memory, but has a different L1 cache: 12 KB, 32-byte cache lines, 4 sets.

Texture Memory: TEXTURE L1 CACHE
The texture L1 cache addressing is optimized for 2D spatial locality.
In the 4-set texture L1 cache, consecutive 32-byte lines do not rotate through the sets one by one: four consecutive lines (128 bytes, e.g. words 0-7, 8-15, 16-23, 24-31) stay in the same set before the mapping moves to the next set, and after 512 bytes it wraps back to set 0 (96 lines per set, 384 lines in total).
Figure: Fermi & Kepler Texture L1 Cache Addressing
In the typical set-associative mapping, address bits 5-6 would select the cache set; here bits 7-8 do instead. The cache therefore behaves like a conventional set-associative cache with 128-byte lines and 24 ways.
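
The difference in set selection can be written out as two one-line helpers (illustrative only; plain byte addresses are assumed):

    /* Conventional mapping for a 4-set cache with 32-byte lines: bits 5-6. */
    unsigned int conventional_set(unsigned int byte_addr) {
        return (byte_addr >> 5) & 0x3;
    }

    /* Mapping observed for the Fermi/Kepler texture L1 cache: bits 7-8.   */
    unsigned int texture_l1_set(unsigned int byte_addr) {
        return (byte_addr >> 7) & 0x3;
    }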

SUMMARY
1 Study the memory hierarchy of current GPUs: Fermi and Kepler.
2 Expose detailed GPU memory features*:
  Latency of shared memory bank conflicts
  Latency of Fermi/Kepler global memory accesses
  Structure of the Fermi L1 data cache
  Structure of the Fermi/Kepler TLBs
  Latency of Fermi/Kepler texture memory accesses
  Structure of the Fermi/Kepler texture L1 cache
  Structure of the Kepler read-only data cache
* This is an open-source project. The testing files are available at http://www.comp.hkbu.edu.hk/~chxw/gpu_benchmark.html. More technical details can be found in our submitted paper.

Conclusions:
1 GPU cache design is much different from the CPU's.
2 Fermi and Kepler outperform the old architectures.
Contributions:
1 Designed a benchmark with some novelty.
2 Unveiled unknown GPU cache characteristics.
