Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
1 Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu
2 Executive Summary. Our Goal: Accelerating pointer chasing inside main memory. Challenges: Parallelism challenge and Address translation challenge. Our Solution: In-Memory PoInter Chasing Accelerator (IMPICA). Address-access decoupling: enabling parallelism in the accelerator with low cost. IMPICA page table: low-cost page table structure. Key Results: 1.2X to 1.9X speedup for pointer chasing operations, +16% database throughput, 6%-41% reduction in energy consumption
3 Linked Data Structures. Linked data structures are widely used in many important applications, e.g., databases (B-Tree) and key-value stores (Hash Table)
9 The Problem: Pointer Chasing. Traversing linked data structures requires chasing pointers: Find(A) walks the tree H -> E -> A, sending Addr(H), Addr(E), Addr(A) to memory and waiting for Data(H), Data(E), Data(A) in turn. Serialized and irregular access pattern; 6X cycles per instruction in real workloads
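The serialized pattern on this slide can be sketched in a few lines of Python; the tree and keys follow the slides' Find(A) example, while the code itself is an illustrative model rather than anything from the paper. Each iteration's load address is known only after the previous load completes, which is exactly what serializes the accesses.

```python
# Illustrative model of the Find(A) traversal: every node load is a
# dependent load, because the next "address" (pointer) comes out of the
# data returned by the previous load.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left      # pointer to be chased on "key < node.key"
        self.right = right    # pointer to be chased otherwise

def find(root, key):
    """Chase pointers from the root; count the serialized memory hops."""
    node = root
    hops = 0
    while node is not None:
        hops += 1                       # one memory access per hop
        if key == node.key:
            return node, hops
        # The next address is known only after this node's data arrives.
        node = node.left if key < node.key else node.right
    return None, hops

# The example tree from the slides: H at the root, E/Q below, A under E.
root = Node("H",
            left=Node("E", left=Node("A"), right=Node("F")),
            right=Node("Q", left=Node("M")))
node, hops = find(root, "A")   # walks H -> E -> A: three dependent loads
```

No amount of out-of-order hardware can overlap these three loads within one traversal; each one feeds the next.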
15 Our Goal. Accelerating pointer chasing inside main memory: in a 3D-stacked memory, DRAM layers sit on top of a logic layer; the CPU sends Find(A) to the memory, the logic layer traverses the structure (H -> E -> A), and only Data(A) is returned
22 Outline Motivation and Our Approach Parallelism Challenge IMPICA Core Architecture Address Translation Challenge IMPICA Page Table Evaluation Conclusion 22
23 Parallelism Challenge. Timeline view: a single CPU core interleaves computation with memory accesses, and so does a single in-memory accelerator. With multiple CPU cores, the cores' memory accesses overlap in time, while one in-memory accelerator must issue its accesses serially, one traversal at a time
33 Parallelism Challenge and Opportunity. A simple in-memory accelerator can still be slower than multiple CPU cores. Opportunity: a pointer-chasing accelerator spends a long time waiting for memory access (10-15X the address computation time)
37 Our Solution: Address-Access Decoupling. Split the accelerator into an Address Engine, which computes the next pointer, and an Access Engine, which performs the memory accesses. While one traversal waits on memory, the Address Engine works on another traversal, so the accesses of multiple traversals overlap in time, just as accesses from multiple CPU cores do
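A toy timing model can show why the decoupling helps. The latencies here (1 cycle of address computation, 10 cycles per memory access) and the round-robin scheduling policy are assumptions chosen for illustration, not the paper's parameters.

```python
# Toy timing model of address-access decoupling (illustrative only).
T_ADDR, T_MEM = 1, 10      # assumed: address computation vs. memory access (cycles)
HOPS, TRAVERSALS = 4, 3    # assumed: hops per traversal, concurrent traversals

def serial_time():
    # A single in-order engine: every hop of every traversal runs back to back.
    return TRAVERSALS * HOPS * (T_ADDR + T_MEM)

def decoupled_time():
    # The address engine round-robins across traversals; while the access
    # engine waits T_MEM cycles for one traversal, the address engine issues
    # work for the others, overlapping the memory stalls.
    addr_free = 0                   # cycle at which the address engine is free
    ready = [0] * TRAVERSALS        # cycle at which each traversal's data is ready
    for _ in range(HOPS):
        for t in range(TRAVERSALS):
            start = max(addr_free, ready[t])   # need engine AND this traversal's data
            addr_free = start + T_ADDR         # compute the next address
            ready[t] = addr_free + T_MEM       # memory access completes later
    return max(ready)

print(serial_time(), decoupled_time())   # decoupled finishes far earlier
```

With these numbers the serialized design takes 132 cycles while the decoupled one takes 46: the T_MEM stalls of the three traversals now overlap instead of adding up.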
47 IMPICA Core Architecture. The 3D-stacked DRAM dies (DRAM Layers) sit above the Logic Layer. On the logic layer, a Request Queue carries work to/from the CPU into the Address Engine; the Address Engine and the Access Engine communicate through an Access Queue and a Response Queue; the Access Engine has an IMPICA Cache and talks to the Memory Controller. The Address Engine interleaves multiple in-flight traversals (Traversal 1, Traversal 2)
55 Outline Motivation and Our Approach Parallelism Challenge IMPICA Core Architecture Address Translation Challenge IMPICA Page Table Evaluation Conclusion 55
56 Address Translation Challenge. A pointer is a virtual address, so every dereference must go through the TLB/MMU to obtain the physical address (PA); on a TLB miss, this requires a page table walk (PTW)
59 Our Solution: IMPICA Page Table. Completely decouple the page table of IMPICA from the page table of the CPUs. The CPU page table maps virtual pages to physical pages as usual; a contiguous IMPICA Region of the virtual address space is instead mapped to physical pages through a separate IMPICA page table
67 IMPICA Page Table: Mechanism. The virtual address is split into four fields: Bit [47:41] indexes the Region Table; Bit [40:21] indexes a Flat Page Table (2MB); Bit [20:12] indexes a Small Page Table (4KB); Bit [11:0] is the page offset added to form the Physical Address
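The bit layout above can be exercised with a hypothetical walk. The field widths come straight from the slide; the table contents and the names `region_table`, `flat_tables`, and `small_tables` are invented example mappings, not real IMPICA entries.

```python
# Illustrative walk over the IMPICA page table bit layout from the slide.
# Bit [47:41] -> region table, [40:21] -> flat table, [20:12] -> small
# table, [11:0] -> page offset. All table contents below are made up.

def split_va(va):
    region = (va >> 41) & 0x7F       # 7 bits:  [47:41]
    flat   = (va >> 21) & 0xFFFFF    # 20 bits: [40:21]
    small  = (va >> 12) & 0x1FF      # 9 bits:  [20:12]
    offset = va & 0xFFF              # 12 bits: [11:0]
    return region, flat, small, offset

region_table = {0: "flat0"}                       # region index -> flat table
flat_tables  = {("flat0", 3): ("small0", None)}   # -> small table (4KB pages)
small_tables = {("small0", 5): 0x1234000}         # -> 4KB physical frame base

def translate(va):
    r, f, s, off = split_va(va)
    flat_id = region_table[r]                     # level 1: region table
    small_id, big_frame = flat_tables[(flat_id, f)]   # level 2: flat table
    if big_frame is not None:                     # entry maps a 2MB page directly
        return big_frame | (va & 0x1FFFFF)
    frame = small_tables[(small_id, s)]           # level 3: small table
    return frame | off

va = (0 << 41) | (3 << 21) | (5 << 12) | 0x42
pa = translate(va)
```

Because IMPICA regions are contiguous, this flat layout needs at most these few lookups instead of a general multi-level radix walk.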
70 Outline Motivation and Our Approach Parallelism Challenge IMPICA Core Architecture Address Translation Challenge IMPICA Page Table Evaluation Conclusion 70
71 Evaluated Workloads Microbenchmarks Linked list (from Olden benchmark) Hash table (from Memcached) B-tree (from DBx1000) Application DBx1000 (with TPC-C benchmark) 71
72 Evaluation Methodology. Simulator: gem5. System Configuration: CPU: 4 OoO cores, 2GHz; Cache: 32KB L1, 1MB L2. IMPICA: 1 core, 500MHz, 32KB cache. Memory bandwidth: 12.8 GB/s for CPU, 51.2 GB/s for IMPICA. Our simulator code will be released in Dec.
73 Result: Microbenchmark Performance. [Bar chart: speedup of the baseline with an extra 128KB L2 and of IMPICA, for Linked List, Hash Table, and B-Tree; IMPICA's speedups range from 1.2X to 1.9X across the three microbenchmarks]
77 Result: Database Performance. [Bar charts comparing baseline + extra 128KB L2, baseline + extra 1MB L2, and IMPICA.] Database throughput: +5% with an extra 1MB L2, +16% with IMPICA. Database latency: -4% with an extra 1MB L2, -13% with IMPICA
84 Energy Consumption. Normalized energy of IMPICA relative to baseline + extra 128KB L2: Linked List -41%, Hash Table -24%, B-Tree -10%, DBx1000 -6%
87 More in the Paper Interface and design considerations CPU interface and programming model Page table management Cache coherence Area and power overhead analysis Sensitivity to IMPICA page table design 87
88 Conclusion. Performing pointer chasing inside main memory can greatly speed up the traversal of linked data structures. Challenges: Parallelism challenge and Address translation challenge. Our Solution: In-Memory PoInter Chasing Accelerator (IMPICA). Address-access decoupling: enabling parallelism with low cost. IMPICA page table: low-cost page table structure. Key Results: 1.2X to 1.9X speedup for pointer chasing operations, +16% database throughput, 6%-41% reduction in energy consumption. Our solution can be applied to a broad class of in-memory accelerators
89 Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu
90 Microarchitecture Metrics. [Charts: TLB MPKI and normalized cache miss latency for Linked List, Hash Table, and B-Tree]
95 Address Translation Speedup. [Chart: sensitivity to IMPICA TLB size and page table design (32 TLB, 64 TLB, 32 TLB + RPT, 64 TLB + RPT) for Linked List, Hash Table, B-Tree, and DBx1000]
96 Full IMPICA Core Architecture DRAM Layers DRAM DRAM Dies Logic Layer Data RAM Request Queue Inst RAM Address Engine IMPICA Cache Access Queue Controller Access Engine To/From CPU Response Queue 96
97 CPU Interface. We use a packet-based interface between the CPU and IMPICA. Execution steps: the CPU sends a function call and its parameters to IMPICA; the packet is written to IMPICA's data RAM; IMPICA loads the function into its inst RAM; IMPICA writes results to the data RAM, from which the CPU polls the results.
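The four execution steps might look like the following sketch. The queue, the packet layout, and the `data_ram` slot scheme are stand-ins, since the slide does not specify the actual hardware interface.

```python
# Hypothetical model of the packet-based CPU <-> IMPICA offload protocol.
# The names (request_q, data_ram, slot) are illustrative, not the paper's.
import queue
import threading

request_q = queue.Queue()   # stands in for the request queue to IMPICA
data_ram = {}               # stands in for IMPICA's data RAM

def impica_worker():
    # IMPICA side: take a function-call packet from the queue, "execute"
    # the function, and write the result into data RAM for the CPU.
    func, arg, slot = request_q.get()
    data_ram[slot] = func(arg)     # stand-in for running code from inst RAM

def cpu_offload(func, arg, slot):
    # CPU side: send the call packet, then wait for the result slot.
    t = threading.Thread(target=impica_worker)
    t.start()
    request_q.put((func, arg, slot))
    t.join()                       # real hardware would poll data RAM instead
    return data_ram[slot]

result = cpu_offload(lambda x: x * 2, 21, slot=0)   # -> 42
```

The key property the sketch preserves is that the CPU only sees the final result in data RAM, never the intermediate pointer chases.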
98 Programming Model. An IMPICA program is written as a function in the application code, marked with a compiler directive. The compiler compiles these functions into IMPICA instructions and wraps the function calls with communication code.
99 Page Table Management. The application allocates the memory for its linked data structures with a special API. The OS reserves a portion of the virtual address space as IMPICA regions. The OS maintains coherence between the CPU page table and the IMPICA page table in the page fault handler.
100 IMPICA Page Table Size. Region Table: 4 entries (covers a 2TB memory range), 68 B. Flat page table (each): 2^20 entries, 8 MB. Small page table (each): 2^9 entries, 4 KB.
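The flat and small page table totals on this slide are consistent with 8-byte entries; a quick check (the 8-byte entry size is an assumption inferred from the totals, not stated on the slide):

```python
# Sanity-checking the page table sizes, assuming 8-byte entries.
ENTRY = 8                            # assumed bytes per page table entry
flat_size  = (2 ** 20) * ENTRY       # 2^20 entries (bits [40:21]) -> 8 MB
small_size = (2 ** 9) * ENTRY        # 2^9 entries (bits [20:12])  -> 4 KB
```

The entry counts also match the mechanism slide's bit fields: 20 index bits for the flat table and 9 for the small table.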
101 Handling of Multiple Memory Stacks. The OS knows the IMPICA region because of our page table management. The OS always maps the IMPICA region of the same application into the same memory stack, including the corresponding IMPICA page table.
102 Cache Coherence. We execute every function that operates on the IMPICA regions in the accelerator. This can be extended with more advanced cache coherence mechanisms.
103 Limit of Parallelism. The parallelism of IMPICA is limited by: the data RAM size (for call stacks); memory access time vs. address computation time; the size of the queues. Each IMPICA core can easily parallelize pointer chasing requests.
104 Area and Power Overhead. CPU (Cortex-A57): 5.85 mm^2 per core. L2 Cache: 5 mm^2 per MB. Memory Controller: 10 mm^2. IMPICA (+32KB cache): 0.45 mm^2. Power overhead: average power increases by 5.6%.
More informationMemory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging
Memory Management Outline Operating Systems Processes (done) Memory Management Basic (done) Paging (done) Virtual memory Virtual Memory (Chapter.) Motivation Logical address space larger than physical
More informationVirtual Memory. Samira Khan Apr 27, 2017
Virtual Memory Samira Khan Apr 27, 27 Virtual Memory Idea: Give the programmer the illusion of a large address space while having a small physical memory So that the programmer does not worry about managing
More informationMain Memory (Fig. 7.13) Main Memory
Main Memory (Fig. 7.13) CPU CPU CPU Cache Multiplexor Cache Cache Bus Bus Bus Memory Memory bank 0 Memory bank 1 Memory bank 2 Memory bank 3 Memory b. Wide memory organization c. Interleaved memory organization
More informationRxNetty vs Tomcat Performance Results
RxNetty vs Tomcat Performance Results Brendan Gregg; Performance and Reliability Engineering Nitesh Kant, Ben Christensen; Edge Engineering updated: Apr 2015 Results based on The Hello Netflix benchmark
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationImpact of Cache Coherence Protocols on the Processing of Network Traffic
Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background
More informationVirtual Memory Virtual memory first used to relive programmers from the burden of managing overlays.
CSE420 Virtual Memory Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RIT) Virtual Memory Virtual memory first used to relive programmers from the burden
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationEE 457 Unit 7b. Main Memory Organization
1 EE 457 Unit 7b Main Memory Organization 2 Motivation Organize main memory to Facilitate byte-addressability while maintaining Efficient fetching of the words in a cache block Low order interleaving (L.O.I)
More informationMain Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row (~every 8 msec). Static RAM may be
More informationPerformance of Commercial Database Benchmarks on a CC-NUMA Computer System
STiNG Revisited: Performance of Commercial Database Benchmarks on a CC-NUMA Computer System Russell M. Clapp IBM NUMA-Q rclapp@us.ibm.com January 21st, 2001 1 Background STiNG was the code name of an engineering
More informationMemory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row. Static RAM may be used for main memory
More informationCOEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory
1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationOutline. Low-Level Optimizations in the PowerPC/Linux Kernels. PowerPC Architecture. PowerPC Architecture
Low-Level Optimizations in the PowerPC/Linux Kernels Dr. Paul Mackerras Senior Technical Staff Member IBM Linux Technology Center OzLabs Canberra, Australia paulus@samba.org paulus@au1.ibm.com Introduction
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationVirtual Memory. Motivation:
Virtual Memory Motivation:! Each process would like to see its own, full, address space! Clearly impossible to provide full physical memory for all processes! Processes may define a large address space
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More information14 May 2012 Virtual Memory. Definition: A process is an instance of a running program
Virtual Memory (VM) Overview and motivation VM as tool for caching VM as tool for memory management VM as tool for memory protection Address translation 4 May 22 Virtual Memory Processes Definition: A
More informationMEMORY: SWAPPING. Shivaram Venkataraman CS 537, Spring 2019
MEMORY: SWAPPING Shivaram Venkataraman CS 537, Spring 2019 ADMINISTRIVIA - Project 2b is out. Due Feb 27 th, 11:59 - Project 1b grades are out Lessons from p2a? 1. Start early! 2. Sketch out a design?
More informationComputer Systems Laboratory Sungkyunkwan University
I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage
More informationQuantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms
Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Arizona State University Dhinakaran Pandiyan(dpandiya@asu.edu) and Carole-Jean Wu(carole-jean.wu@asu.edu
More informationMemshare: a Dynamic Multi-tenant Key-value Cache
Memshare: a Dynamic Multi-tenant Key-value Cache ASAF CIDON*, DANIEL RUSHTON, STEPHEN M. RUMBLE, RYAN STUTSMAN *STANFORD UNIVERSITY, UNIVERSITY OF UTAH, GOOGLE INC. 1 Cache is 100X Faster Than Database
More informationArrakis: The Operating System is the Control Plane
Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building
More informationModule Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies.
M6 Memory Hierarchy Module Outline CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies. Events on a Cache Miss Events on a Cache Miss Stall the pipeline.
More informationEbbRT: A Framework for Building Per-Application Library Operating Systems
EbbRT: A Framework for Building Per-Application Library Operating Systems Overview Motivation Objectives System design Implementation Evaluation Conclusion Motivation Emphasis on CPU performance and software
More informationOperating Systems. Paging... Memory Management 2 Overview. Lecture 6 Memory management 2. Paging (contd.)
Operating Systems Lecture 6 Memory management 2 Memory Management 2 Overview Paging (contd.) Structure of page table Shared memory Segmentation Segmentation with paging Virtual memory Just to remind you...
More informationV. Primary & Secondary Memory!
V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationSE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan govind@serc Reality Check Question 1: Are real caches built to work on virtual addresses or physical addresses? Question 2: What about
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationRISCV with Sanctum Enclaves. Victor Costan, Ilia Lebedev, Srini Devadas
RISCV with Sanctum Enclaves Victor Costan, Ilia Lebedev, Srini Devadas Today, privilege implies trust (1/3) If computing remotely, what is the TCB? Priviledge CPU HW Hypervisor trusted computing base OS
More information15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011
15-740/18-740 Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 Review of Last Lecture More ISA Tradeoffs Programmer vs. microarchitect Transactional
More informationVirtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili
Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed
More informationManaging GPU Concurrency in Heterogeneous Architectures
Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures
More information