Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks

Size: px

Start display at page:

Download "Snatch: Opportunistically Reassigning Power Allocation between Processor and Memory in 3D Stacks"

Magnus Fletcher
5 years ago
Views:

1 Snatch: Opportunistically Reassigning Power Allocation between and in 3D Stacks Dimitrios Skarlatos, Renji Thomas, Aditya Agrawal, Shibin Qin, Robert Pilawa, Ulya Karpuzcu, Radu Teodorescu, Nam Sung Kim, and Josep Torrellas UIUC, OSU, UMN, NVIDIA 1

2 Motivation: Cost of Power/Ground Pins in 3D stacks DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 2

3 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 3

4 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins 3D Stacks: Disjoint Power/Ground pins for and DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 4

5 Motivation: Cost of Power/Ground Pins in 3D stacks Size & cost of packages is proportional to # of pins 3D Stacks: Disjoint Power/Ground pins for and Each dimensioned for the worst case DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins 5

6 Motivation: Underutilization of Power Budget High or Power phases 6

7 Contribution: Snatch Dynamically and opportunistically divert power between processor and memory Mem1 die Conventional Mem Mem0 die TSVs die Proc 7

8 Contribution: Snatch Dynamically and opportunistically divert power between processor and memory On-chip voltage regulator connects the two Power Delivery Networks or can consume more power for the same # of pins Mem1 die Snatch Mem Mem0 die TSVs die Proc On-chip 8

9 Impact Compared to Conventional 3D Stacks For same # of power/ground pins: Application can consume more power Up to 23% application speedup For the same maximum power in and Fewer pins, about 30% package cost reduction 9

10 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 10

11 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 11

12 Conventional Implementation V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V 5.5W 4.5W 12V Cross-section 12

13 Snatch implementation V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V on-chip Single On-Chip 5.5W 4.5W 12V Cross-section 13

14 Snatch implementation Small 2W on-chip bidirectional on Proc die Bulk of work from off-chip s V PCB substrate DRAM1 die DRAM0 die die TSVs 1.1V microbumps C4 bumps BGA pins 12V on-chip Single On-Chip 5.5W 4.5W 12V Cross-section 14

15 Snatch: Dynamic power reassignment Up/Down convert power Snatched DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip 15

95V DRAM1 die PCB substrate DRAM0 die die TSVs

16 Snatch: Dynamic power reassignment Up/Down convert power Snatched 2W 1.1V V DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip 16

17 Snatch: Cross-Section Small 2W on-chip bidirectional on Proc die DRAM1 die PCB substrate DRAM0 die die TSVs microbumps C4 bumps BGA pins Single on-chip Single On-Chip Cross-section 17

18 Snatch: Top Down Small 2W on-chip bidirectional on Proc die DRAM1 die DRAM0 die die TSVs microbumps 5.5W V On-chip PCB substrate C4 bumps BGA pins Single on-chip Single On-Chip 4.5W 1.1V Cross-section Top Down 18

19 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 19

20 Snatching Power On processor intensive phase 5.5W 5.5W On-chip 4.5W 4.5W 20

21 Snatching Power On processor intensive phase Snatch Power TurboBoost 5.5W 7.5W 2W On-chip 4.5W 2.5W 21

22 Snatching Power On memory intensive phase Snatch Power TurboBoost 5.5W 3.5W 2W On-chip 4.5W 6.5W 22

23 Snatching Decisions or Intensive Phase? 5.5W 5.5W?W On-chip 4.5W 4.5W 23

24 Snatching Decisions or Intensive Phase? How much Power is available? 5.5W 5.5W?W On-chip 4.5W 4.5W 24

25 Snatching Decisions or Intensive Phase? How much Power is available? How much Power can we Snatch? 5.5W 5.5W?W On-chip 4.5W 4.5W 25

26 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs 5.5W?W On-chip 4.5W 26

27 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection 5.5W?W On-chip 4.5W 27

28 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection 5.5W MAX for power availability?w On-chip 4.5W 28

29 Conservative Snatching Algorithm Keep track of past power values of 10µs epochs Average for activity detection MAX for power availability 5.5W Avoid hysteresis?w On-chip 4.5W 29

30 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 30

31 Conventional Power Provisioning provisioned for 7.5W On-chip 31

32 Conventional Power Provisioning provisioned for 7.5W 7.5W 7.5W On-chip 32

33 Conventional Power Provisioning provisioned for 7.5W provisioned for 6.5W 7.5W 7.5W On-chip 6.5W 6.5W 33

34 Conventional Power Provisioning provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W On-chip 6.5W 6.5W 34

35 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W On-chip 6.5W 6.5W 35

36 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W provisioned for 6.5W Total = + = 14W 7.5W 7.5W Snatch 2W On-chip 6.5W 6.5W 36

37 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W Total = + = 14W 5.5W 7.5W Snatch 2W 5.5W +-2W On-chip 6.5W 6.5W 37

38 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W 5.5W 5.5W +-2W 7.5W Snatch 2W 6.5W 4.5W On-chip 4.5W +-2W38

39 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W - 4W = 10W Reduce Total Provisioning from 14W to 10W, approx same performance 5.5W 5.5W +-2W 7.5W Snatch 2W On-chip 6.5W 4.5W 4.5W +-2W39

40 Snatch: Provisioning 3D Stacks Just Right provisioned for 7.5W - 2W = 5.5W provisioned for 6.5W - 2W = 4.5W Total = + = 14W - 4W = 10W Reduce Total Provisioning from 14W to 10W, approx same performance 5.5W 5.5W +-2W 7.5W Snatch 2W On-chip 30% Reduction in Package Power/Ground Pins 6.5W 4.5W 4.5W +-2W40

41 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 41

42 Conventional Power Provisioning & provisioned for 5.5W & 4.5W On-chip 42

43 Conventional Power Provisioning & provisioned for 5.5W & 4.5W 5.5W 5.5W On-chip 4.5W 4.5W 43

44 Snatch: Provide Additional Power & provisioned for 5.5W & 4.5W Snatch power 5.5W 5.5W Snatch 2W On-chip 4.5W 4.5W 44

45 Snatch: Opportunistically boost performance & provisioned for 5.5W & 4.5W Snatch power and boost performance 5.5W 5.5W +-2W Snatch 2W On-chip 4.5W 4.5W +-2W45

46 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional 5.5W 5.5W +-2W Snatch 2W On-chip 4.5W 4.5W +-2W46

47 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional Higher performance for the same package cost 5.5W Snatch 2W 5.5W +-2W On-chip 4.5W 4.5W +-2W47

48 Snatch: Boost Performance with Same # of Pins & provisioned for 5.5W & 4.5W Snatch power and boost performance Same # of pins as conventional Higher performance for the same package cost IR-drop and EM characteristics remain the same 5.5W Snatch 2W 4.5W 5.5W +-2W On-chip 4.5W +-2W48

49 Snatch Outline Implementation Operation Case 1: Same Max Power in and, reduced # of pins Case 2: Same # of pins, improved performance Evaluation 49

50 Evaluation Methodology Case 2: Same # of pins, improved performance : 22nm LP 8-core w/ SESC + McPAT : 4GB 2-layer WideIO2 w/ DRAMSim2 Benchmarks: SPLASH-2, NAS, and SPEC 50

51 Performance 1.3 Baseline Turbo Boost Snatch Speedup Average(Splash + NAS) Average(SPEC) 51

52 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Average(Splash + NAS) Average(SPEC) 52

Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup 1.2 1.0 0.9 0.

53 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Average(Splash + NAS) Average(SPEC) 53

54 Performance 1.3 Baseline Turbo Boost Snatch Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Speedup Average(Splash + NAS) Average(SPEC) Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W 54

55 Performance Speedup Baseline Turbo Boost Snatch 25% 8% Average(Splash + NAS) 10% Average(SPEC) Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch boosts performance on average, by 25% Snatch against Baseline and 8% against Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W for Splash and NAS benchmarks 55

56 Performance Speedup Baseline Turbo Boost Snatch 25% 8% Average(Splash + NAS) 10% Average(SPEC) Baseline P = {5.5W, 1.2GHz} M= {4.5W, 400MHz} Turbo Boost P = {5.5W, GHz} M= {4.5W, MHz} DVFS Within Power Budget Snatch boosts performance on average, by 10% Snatch against Baseline for SPEC benchmarks P = {5.5W, GHz} M= {4.5W, MHz} DVFS and Snatch up to 2W Negligible gains against Turbo Boost 56

57 Snatching Activity Overview 90 M->P P->M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 57

Snatching Activity Overview 90 M P P M % of Total Time Snatching 67.5 45 22.

58 Snatching Activity Overview 90 M P P M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 58

59 Snatching Activity Overview 90 M P P M % of Total Time Snatching Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 59

60 Snatching Activity Overview 90 M P P M % of Total Time Snatching Application Snatch on average, 30% for Splash+NAS 9.4% for SPEC 0 Barnes BT CG Cholesky FFT FMM FT IS LU LU(NAS) MG Radiosity Radix Raytrace SP W-Nsquared W-Spatial AvgM->P mcf milc lbm bzip2 AvgP->M 60

61 More On the Paper Design and Implementation: On-chip Voltage Regulator Snatch Algorithm Additional Evaluation: Snatch Algorithm Power Delivery Network Pin Reliability 3D Stack Thermals 61

62 Summary Snatch: An opportunistic power reassignment design for 3D Stacked architectures Small on-chip bidirectional - phase detection and power availability estimation Up to 23% application speedup Alternatively, about 30% package cost reduction 62

63 Image Source : 63

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks Aditya Agrawal, Josep Torrellas and Sachin Idgunji University of Illinois at Urbana Champaign and Nvidia Corporation http://iacoma.cs.uiuc.edu