Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs
11th IFIP International Conference on Network and Parallel Computing
Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu
CS Department, Hong Kong Baptist University
September 19, 2014

OUTLINE
1 Background & Motivation: GPU Computing, P-Chase Benchmark
2 Fine-grained Benchmark: Design, Methodology
3 Experimental Results: Shared Memory, Global Memory, Texture Memory
4 Conclusion

GPU Computing: GPU ARCHITECTURE
The GPU is a SIMD, many-core parallel architecture.
GeForce GTX 780 (Kepler): 12 streaming multiprocessors (SMs); each SM contains 192 ALUs, 64 DPUs, 32 SFUs, 16 texture units, and shared memory / L1 data cache plus a read-only data cache. The SMs share an L2 cache backed by DRAM.
Figure: Block Diagram of GeForce 780

GPU Computing: GPU MEMORY HIERARCHY

Table: GPU Memory Spaces (sorted by typical access time, ascending)

Memory            Type  Location  Cached  Lifetime
Register          R/W   on-chip   no      per-thread
Shared Memory     R/W   on-chip   no      per-block
Constant Memory   R     off-chip  yes     host allocation
Texture Memory    R     off-chip  yes     host allocation
Local Memory      R/W   off-chip  yes*    per-thread
Global Memory     R/W   off-chip  yes*    host allocation

* Local/global memory accesses are cached only on devices of compute capability 2.0 and above.

GPU Computing: GPU MEMORY HIERARCHY (cont.)
(Same table as the previous slide.) Memory accesses have long been the bottleneck of further performance improvement.

GPU Computing: TARGET STRUCTURE
Study the GPU memory hierarchy, focusing on the three most popular spaces:
1 Shared memory
2 Global memory
3 Texture memory
Characteristics of interest:
1 Memory access latency
2 Unknown cache mechanisms

P-Chase Benchmark: REVIEW OF CACHE STRUCTURE
A cache is a small, fast buffer between the processor and main memory; it is typically set-associative with LRU replacement.
Example: a 48-byte (12-word) 2-way set-associative cache with 3 sets and 2-word cache lines, i.e. 6 lines in total. The memory address splits into a set index and a word offset, and consecutive 2-word blocks (words 0-1, 2-3, ..., 10-11) are assigned to sets 0, 1, 2 in round-robin order, each set holding two ways.
Figure: Typical Set-Associative CPU Cache Addressing
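
As a worked example of this addressing (a hypothetical sketch, not part of the slides), the following C program prints the set and word offset that each of the 12 words maps to in the 2-way, 3-set example cache:

    #include <stdio.h>

    int main(void) {
        const int words_per_line = 2;           /* 2-word (8-byte) cache lines  */
        const int num_sets       = 3;

        for (int word = 0; word < 12; word++) {
            int line = word / words_per_line;   /* which 2-word memory block    */
            int set  = line % num_sets;         /* set the block is mapped to   */
            int off  = word % words_per_line;   /* word offset within the line  */
            printf("word %2d -> set %d, offset %d\n", word, set, off);
        }
        return 0;
    }

Either of the two ways within the selected set can hold the block; with LRU replacement, the least recently used way is evicted on a miss.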

P-Chase Benchmark: P-CHASE BENCHMARK
P-Chase: stride memory access (pointer chasing).
Pseudo code:
    for (i = 0; i < iterations; i++)
        j = A[j];
Initialization:
    for (i = 0; i < array_size; i++)
        A[i] = (i + stride) % array_size;
Memory access process: with array A of 24 elements, stride 2, and 12 iterations, the chase visits elements 0, 2, 4, ..., 22.
Cache miss: stride ≥ cache line size and array size > cache size.
Cache hit: stride < cache line size or array size ≤ cache size.

P-Chase Benchmark: P-CHASE BENCHMARK (cont.)
Average memory access time reflects the cache structure!
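
For reference, a minimal CUDA sketch of this traditional P-chase measurement (average latency only) could look as follows; the kernel name, parameter list, and single-thread launch are illustrative assumptions, not the original code:

    __global__ void pchase_avg(unsigned int *A, int iterations,
                               unsigned int *out_index, unsigned int *out_cycles) {
        unsigned int j = 0;
        unsigned int start = clock();
        for (int k = 0; k < iterations; k++)
            j = A[j];                   /* dependent (pointer-chasing) load        */
        unsigned int end = clock();
        *out_index  = j;                /* keep j live so the loop is not removed  */
        *out_cycles = end - start;      /* total cycles; divide by iterations      */
    }

Launched as pchase_avg<<<1, 1>>>(...), the kernel yields one average latency per (array size, stride) pair, which is exactly the coarse view the next slides argue is insufficient.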

P-Chase Benchmark: LITERATURE REVIEW
Figure: Kepler Texture L1 Cache, measured with the method of Saavedra et al. (1992).
Figure: Kepler Texture L1 Cache, measured with the method of Wong et al. (2010).
The two measurements report contradictory cache line sizes: one finds 32 bytes, the other 128 bytes. Average memory latency hides some details.

P-Chase Benchmark: MOTIVATION
Hardware has been upgraded: in earlier studies, global memory was not cached and memory access times were much longer.
The traditional P-Chase is based on a CPU cache model; the GPU cache could be different.
Observe every latency rather than the average: our fine-grained benchmark records the access time of every array element.

Design: DESIGN
Storage: shared memory, which is on-chip (writes complete promptly) and offers 48 KB per SM.
Declare two arrays:
1 s_tvalue: access latency of the current element
2 s_index: index of the next memory access
Timing: clock(). Store the loaded value inside the timed region, right after it is used, so the measured interval covers the actual memory access.
Pseudo code:
    __global__ void KernelFunction() {
        __shared__ unsigned int s_tvalue[ ];
        __shared__ unsigned int s_index[ ];
        for (k = 0; k < iterations; k++) {
            start_time = clock();
            j = my_array[j];
            s_index[k] = j;                        // store the element index
            end_time = clock();
            s_tvalue[k] = end_time - start_time;   // store the element access latency
        }
    }
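
Below is a self-contained CUDA sketch of how such a fine-grained kernel might be driven end to end. It is illustrative only: the kernel and buffer names, the ITERS constant, the array size and stride, and the flush of the shared arrays to global memory are assumptions made for the example, not the authors' released code.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define ITERS 64                         /* accesses recorded per run          */

    __global__ void fine_grained_pchase(unsigned int *my_array,
                                        unsigned int *d_index,
                                        unsigned int *d_tvalue) {
        __shared__ unsigned int s_index[ITERS];
        __shared__ unsigned int s_tvalue[ITERS];
        unsigned int j = 0;
        for (int k = 0; k < ITERS; k++) {
            unsigned int start = clock();
            j = my_array[j];                 /* dependent load                      */
            s_index[k] = j;                  /* use the value inside the timed region */
            unsigned int end = clock();
            s_tvalue[k] = end - start;       /* per-access latency                  */
        }
        for (int k = 0; k < ITERS; k++) {    /* flush the records to global memory  */
            d_index[k]  = s_index[k];
            d_tvalue[k] = s_tvalue[k];
        }
    }

    int main(void) {
        const int n = 4096, stride = 32;     /* one (assumed) point of the sweep    */
        unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
        for (int i = 0; i < n; i++) h[i] = (i + stride) % n;

        unsigned int *d_a, *d_idx, *d_lat;
        cudaMalloc(&d_a,   n * sizeof(unsigned int));
        cudaMalloc(&d_idx, ITERS * sizeof(unsigned int));
        cudaMalloc(&d_lat, ITERS * sizeof(unsigned int));
        cudaMemcpy(d_a, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

        /* one thread, one block: latencies are not disturbed by other warps */
        fine_grained_pchase<<<1, 1>>>(d_a, d_idx, d_lat);

        unsigned int idx[ITERS], lat[ITERS];
        cudaMemcpy(idx, d_idx, sizeof(idx), cudaMemcpyDeviceToHost);
        cudaMemcpy(lat, d_lat, sizeof(lat), cudaMemcpyDeviceToHost);
        for (int k = 0; k < ITERS; k++)
            printf("access %2d -> element %4u : %3u cycles\n", k, idx[k], lat[k]);

        free(h);
        cudaFree(d_a); cudaFree(d_idx); cudaFree(d_lat);
        return 0;
    }

The single-thread launch keeps other warps from perturbing the measurements; the host then inspects every recorded (index, latency) pair instead of a single average.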

Methodology: DIRECTIONS
Flowchart of applying our fine-grained benchmark, in two stages:
Stage 1: stride = 1 element, array size = cache size + 1 element.
  Get the cache size from the user manual or a footprint experiment, overflow the cache by one element, and obtain the cache replacement policy as well as the cache line size from the cache miss pattern.
Stage 2: stride = 1 cache line, array size = cache size + 1..n cache lines.
  Increase the array size, one cache line per increment, to expose the first, second, ..., last cache set, and obtain the cache associativity and the memory-to-cache mapping.
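
The two-stage sweep can be pictured with a small host-side sketch. Everything here is illustrative: run_pchase() merely stands in for launching the fine-grained kernel, and the cache size and line size are assumed example values (given in 4-byte elements).

    #include <stdio.h>

    static void run_pchase(int array_size, int stride) {
        printf("P-chase: array_size = %d elements, stride = %d elements\n",
               array_size, stride);
        /* ... launch the fine-grained kernel with these parameters ... */
    }

    int main(void) {
        const int cache_size = 16 * 1024 / 4;   /* e.g. a 16 KB cache           */
        const int line_size  = 128 / 4;         /* e.g. a 128-byte cache line   */

        /* Stage 1: stride = 1 element, array overflows the cache by 1 element  */
        run_pchase(cache_size + 1, 1);

        /* Stage 2: stride = 1 cache line, array size = cache size + 1..n lines */
        for (int extra = 1; extra <= 8; extra++)
            run_pchase(cache_size + extra * line_size, line_size);
        return 0;
    }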

Shared Memory: BANK CONFLICT I
Normal shared memory latency: about 50 clock cycles.
Shared memory is organized as 32 memory banks of 4-byte words.
Bank conflict: two or more threads in the same warp access memory locations that belong to the same shared memory bank.
Stride memory access, pseudo code:
    data = threadIdx.x * stride;
With stride = 2, thread 0 and thread 16 fall into the same bank (words 0 and 32), thread 1 and thread 17 into the next, and so on: a 2-way bank conflict.
Figure: 2-way Shared Memory Bank Conflict Caused by Stride Memory Access
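
The access pattern can be reproduced with a short CUDA kernel (an illustrative sketch; the kernel and array names are assumptions). Launched with a single warp, stride = 2 makes pairs of threads hit the same 4-byte bank, i.e. a 2-way conflict; stride = 32 serializes the whole warp.

    __global__ void strided_shared_read(int stride, unsigned int *out) {
        __shared__ unsigned int s_data[32 * 32];          /* room for stride up to 32 */
        for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
            s_data[i] = i;                                /* populate shared memory   */
        __syncthreads();

        unsigned int v = s_data[threadIdx.x * stride];    /* strided, possibly
                                                             conflicting access       */
        out[threadIdx.x] = v;                             /* keep the load live       */
    }

    /* example launch: strided_shared_read<<<1, 32>>>(2, d_out); */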

Shared Memory: BANK CONFLICT II
Bank conflict latency is much longer: the conflicting memory requests are executed sequentially, so latency grows with the degree of the conflict.
Figure: Bank Conflict Memory Latency of Fermi (memory latency in clock cycles versus stride / #-way bank conflict, for strides 0, 2, 4, 8, 16, 32).

Shared Memory: BANK CONFLICT III
Kepler outperforms Fermi at avoiding shared memory bank conflicts by introducing an 8-byte shared memory bank mode.
Figure: Memory Latency of Kepler Shared Memory (latency in clock cycles versus stride 0-64, 4-byte mode versus 8-byte mode).
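
On Kepler, the bank width can be selected from the host through the CUDA runtime; a minimal sketch (error handling omitted):

    #include <cuda_runtime.h>

    void select_eight_byte_banks(void) {
        /* Use 8-byte shared memory banks for subsequent kernels on this device;
           cudaSharedMemBankSizeFourByte restores the default 4-byte banks.      */
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    }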

Global Memory: GLOBAL MEMORY, AN OVERVIEW I
Global memory accesses:
1 Kepler: cached in the L2 data cache.
2 Fermi: cached in the L1 and L2 data caches; L1 caching can be disabled.
3 Two levels of TLB.
To exhibit the memory latencies, we use our fine-grained benchmark with a specialized initialization.

Global Memory GLOBAL MEMORY: AN OVERVIEW II 1,400 Kepler Fermi: cache in L1 Fermi: cache in L2 Memory latency (clock cycles) 1,200 1,000 800 600 400 200 0 pattern1 pattern2 pattern3 pattern4 pattern5 pattern6 Figure: Global Memory Access Latencies Table: Global Memory Access Patterns Pattern Data cache TLB 1 hit L1 hit 2 hit L1 miss, L2 hit 3 hit L2 miss 4 miss L1 hit 5 miss L2 miss 6 miss L2 miss a a page table miss : switch between tables Experimental Results 21 of 30

Global Memory: GLOBAL MEMORY, AN OVERVIEW III
Observations:
1 Big gap between pattern 1 and pattern 2 for Fermi cached in L1 ⇒ both the L1 TLB and the L2 TLB are off-chip.
2 Differences between enabling and disabling Fermi's L1 ⇒ caching in L1 brings some extra time consumption.
3 Kepler outperforms Fermi in terms of (i) cache miss penalty, (ii) L1/L2 TLB miss penalty, and (iii) memory latency (when both cache in L2).
4 The Kepler page table needs context switching (only 512 MB of page entries are activated at a time).

Global Memory: FERMI L1 DATA CACHE
Fermi L1 data cache structure (default setting):
  cache size: 16 KB
  cache line size: 128 bytes
  set-associative: 4-way, 32 sets (128 cache lines in total)
Cache addressing is non-conventional: one hot cache way is replaced more frequently than the others, with replacement probabilities of (1/6, 1/2, 1/6, 1/6) across the four ways.
Figure: Fermi L1 Data Cache Structure
Note: this is the default setting; the Fermi L1 data cache can also be configured as 48 KB, in which case the number of ways is 6.

Global Memory: TLB
The page size of both Fermi and Kepler is 2 MB (found by brute-force experiments).
The TLB structure of Fermi and Kepler is the same:
1 L1 TLB: 16 entries, fully associative.
2 L2 TLB: 65 entries, set-associative with non-uniform sets; 7 sets in total, one big set of 17 entries and 6 normal sets of 8 entries each.
Figure: TLB Structure of Fermi & Kepler

Texture Memory: TEXTURE MEMORY, AN OVERVIEW
Memory latency (clock cycles):

Table: Texture Memory Access Latency of Fermi & Kepler
          Texture cache               Global cache
Device    L1 hit   L1 miss, L2 hit    L1 hit   L1 miss, L2 hit
Fermi     240      470                116      404
Kepler    110      220                -        230
(Kepler global loads are not cached in L1; see the global memory overview.)

⇒ Fermi texture memory management is expensive!
Texture memory shares the L2 cache and the TLBs with global memory, but has a different L1 cache: 12 KB, 32-byte cache lines, 4 sets.

Texture Memory: TEXTURE L1 CACHE
The texture L1 cache addressing is optimized for 2D spatial locality.
In the 4-set texture L1 cache, consecutive 32-byte lines do not rotate through the sets one by one: four consecutive lines (128 bytes, e.g. words 0-7, 8-15, 16-23, 24-31) stay in the same set before the mapping moves to the next set, and after 512 bytes it wraps back to set 0 (96 lines per set, 384 lines in total).
Figure: Fermi & Kepler Texture L1 Cache Addressing
In the typical set-associative mapping, address bits 5-6 would select the cache set; here bits 7-8 do instead. The cache therefore behaves like a conventional set-associative cache with 128-byte lines and 24 ways.
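
The difference in set selection can be written out as two one-line helpers (illustrative only; plain byte addresses are assumed):

    /* Conventional mapping for a 4-set cache with 32-byte lines: bits 5-6. */
    unsigned int conventional_set(unsigned int byte_addr) {
        return (byte_addr >> 5) & 0x3;
    }

    /* Mapping observed for the Fermi/Kepler texture L1 cache: bits 7-8.   */
    unsigned int texture_l1_set(unsigned int byte_addr) {
        return (byte_addr >> 7) & 0x3;
    }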

SUMMARY
1 Study the memory hierarchy of current GPUs: Fermi and Kepler.
2 Expose detailed GPU memory features*:
  Latency of shared memory bank conflicts
  Latency of Fermi/Kepler global memory accesses
  Structure of the Fermi L1 data cache
  Structure of the Fermi/Kepler TLBs
  Latency of Fermi/Kepler texture memory accesses
  Structure of the Fermi/Kepler texture L1 cache
  Structure of the Kepler read-only data cache
* This is an open-source project. The testing files are available at http://www.comp.hkbu.edu.hk/~chxw/gpu_benchmark.html. More technical details can be found in our submitted paper.

Conclusions:
1 GPU cache design is much different from the CPU's.
2 Fermi and Kepler outperform the old architectures.
Contributions:
1 Designed a benchmark with some novelty.
2 Unveiled unknown GPU cache characteristics.
