Peeling the Power Onion

1 CERCS IAB Workshop, April 26, 2010. Peeling the Power Onion. Hsien-Hsin S. Lee, Associate Professor, Electrical & Computer Engineering, Georgia Tech

2 Power Allocation for Server Farm Room. Datacenter 8.1: total power = 580 kW. Datacenter 8.2: total power = 1700 kW. Data from Lawrence Berkeley National Lab [2002]

3 Per-System Power Breakdown. Figure panels: nameplate peak power breakdown; likely actual power breakdown. Peak component power breakdown [ISCA-34, 07]: two Intel CPUs, 2 PCI, 1 IDE disk, 4 DDR1 DIMMs. Total = 231 W, measured = 145 W. The main variation comes from the CPU and memory; other components correlate well with CPU activity.

4 Per-CPU Power Breakdown. Simulation results by UT researchers [ISLPED-03]. Modern scenarios (requiring quantification) are: multiple cores; more power-aware designs (aggressive gating, DVFS techniques); larger and deeper caches.

5 Efficiency of Power Delivery Infrastructure. Diagram labels: 12V, 3ɸ, UPS, PDU, ADC, DC-DC, PSU, V-Reg, V-Reg, Load, Rack, Fan. Baseline: 88% × 93% × 79% × 75% = 48%. Power conversion loses efficiency. Supplies are designed for fault tolerance with backup, which decreases the efficiency. Room for improvement: pay more for a higher-efficiency PSU (care about TCO instead of the equipment price); consolidate power conversion at a higher level (reduce redundancy). Source: datacenter journal & Rocky Mountain Institute
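The baseline figure on this slide is a product of per-stage efficiencies. As a minimal sketch of that arithmetic, the snippet below multiplies the four percentages quoted above. The slide text does not say which percentage belongs to which stage, so the stages are left unnamed; the function names and the 100 W load are illustrative assumptions.

```python
from math import prod

# Per-stage efficiencies quoted on the slide (baseline chain).
# Which stage each number belongs to is not recoverable from the text.
STAGE_EFFICIENCIES = [0.88, 0.93, 0.79, 0.75]


def end_to_end_efficiency(stages):
    """Fraction of input power that reaches the load after all stages."""
    return prod(stages)


def input_power_for_load(load_watts, stages):
    """Input power needed so that load_watts arrives at the load."""
    return load_watts / end_to_end_efficiency(stages)


if __name__ == "__main__":
    eff = end_to_end_efficiency(STAGE_EFFICIENCIES)
    print(f"end-to-end efficiency: {eff:.1%}")
    # Hypothetical 100 W load, chosen only to make the loss tangible.
    print(f"input power for a 100 W load: "
          f"{input_power_for_load(100, STAGE_EFFICIENCIES):.0f} W")
```

Because the stages multiply, improving any single stage raises the overall figure proportionally, which is why both a better PSU and fewer conversion stages appear under room for improvement.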

6 Bottom Line: Every counts. Cost of power infrastructure = $10 to $20/W [Turner, uptimes 06]. Energy cost = 10¢/kWh; 10 years of energy bill ~ $10/W [Fan, ISCA-07]. Figure labels: Infrastructure, Energy Bill
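The roughly $10/W energy figure can be checked with simple arithmetic. The sketch below assumes a load that draws a constant 1 W around the clock for the whole period and ignores delivery losses and cooling overhead; those simplifications, and the function name, are assumptions made for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # leap days ignored


def energy_cost_per_watt(price_per_kwh, years):
    """Dollars of energy used by a constant 1 W load over `years` years."""
    # 1 W running for N hours consumes N / 1000 kWh.
    kwh = HOURS_PER_YEAR * years / 1000.0
    return kwh * price_per_kwh


if __name__ == "__main__":
    # 10 cents per kWh over 10 years, as on the slide.
    print(f"${energy_cost_per_watt(0.10, 10):.2f} per watt")
```

Under these assumptions the result is a little under $9 per watt, the same order of magnitude as both the slide's ~$10/W energy bill and the $10 to $20/W infrastructure cost.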

7 Hierarchical Power Optimization: Device/Circuits level; Microarchitectural level; Chip/Component level; System level; Compiler/Algorithmic level

9 Micro-Arch. Level Power Optimization. Utility-based, fine-grained clock gating. On-chip cache optimized design: (sub-)banking (encapsulated computing); vertical cache partition (filter cache or multi-level cache); horizontal cache partition; multi-lateral cache; segmented design; region-based caching. Lookup filtering. Snoop filtering.

10 Way Guard Cache w/ Bloom Filter [Ghosh et al. ISLPED 09]. Diagram labels: address fields Tag, Idx, ofst; WG0 CBF, WG1 CBF, WG2 CBF, WG3 CBF; four Tag arrays, each with a comparator (=); four Data arrays; Hit

11 Way Guard Cache w/ Bloom Filter [Ghosh et al. ISLPED 09]. Same diagram as slide 10, with Data out added
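The diagram pairs each cache way with its own CBF in front of the tag and data arrays. As a rough sketch of the general idea, not the design from the cited paper, the code below keeps one counting Bloom filter per way and probes only the ways whose filter says the block might be present. The class names, hash functions, filter size, and round-robin replacement are all illustrative assumptions.

```python
class CountingBloomFilter:
    """Counters are incremented on insert and decremented on remove."""

    def __init__(self, num_counters=64, num_hashes=2):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _slots(self, block_addr):
        # Illustrative hash functions only; real hardware would differ.
        n = len(self.counters)
        return [(block_addr * (2 * i + 3) + i) % n
                for i in range(self.num_hashes)]

    def insert(self, block_addr):
        for s in self._slots(block_addr):
            self.counters[s] += 1

    def remove(self, block_addr):
        for s in self._slots(block_addr):
            self.counters[s] -= 1

    def may_contain(self, block_addr):
        # False means definitely absent; True may be a false positive.
        return all(self.counters[s] > 0 for s in self._slots(block_addr))


class WayGuardedCache:
    """Set-associative cache model with one filter guarding each way."""

    def __init__(self, num_sets=16, num_ways=4):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.filters = [CountingBloomFilter() for _ in range(num_ways)]
        self.next_victim = [0] * num_sets  # round-robin replacement

    def access(self, block_addr):
        """Return (hit, number_of_ways_probed) and fill the block on a miss."""
        set_idx = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        # Only ways whose filter reports "maybe present" are looked up.
        candidates = [w for w in range(self.num_ways)
                      if self.filters[w].may_contain(block_addr)]
        for way in candidates:
            if self.tags[set_idx][way] == tag:
                return True, len(candidates)
        self._fill(block_addr, set_idx, tag)
        return False, len(candidates)

    def _fill(self, block_addr, set_idx, tag):
        way = self.next_victim[set_idx]
        self.next_victim[set_idx] = (way + 1) % self.num_ways
        old_tag = self.tags[set_idx][way]
        if old_tag is not None:
            # Keep the filter in sync with the evicted block.
            self.filters[way].remove(old_tag * self.num_sets + set_idx)
        self.tags[set_idx][way] = tag
        self.filters[way].insert(block_addr)
```

A filter never reports a resident block as absent, so skipping a way cannot turn a hit into a miss; false positives only cost an unnecessary probe. The number of ways probed per access is the quantity that an "average number of cache ways looked up" metric would track.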

12 Average Number of Cache Ways Looked Up

14 Hierarchical Power Optimization: Device/Circuits level; Microarchitectural level; Chip/Component level; System level; Compiler/Algorithmic level

17 Chip Level Power Challenge for Many-Core: I/O power. ITRS predicts slow growth in pin count: 2/3 for power and ground, 1/3 for signal I/O; limited by physical metal properties and cost. DDR3 ~ 40 mW per pin; 1024 data pins ~ …; 4096 data pins ~ …. Techniques for avoiding I/O pad power: memory locality optimization; very tight integration (Intel: Cores+MCH+ICH, then Cores+PCH); 3D integrated circuits
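The per-pin figure makes it easy to estimate how I/O power scales with bus width. The sketch below multiplies the quoted 40 mW per pin by the two pin counts on the slide. It assumes power scales linearly with pin count and that every data pin draws the quoted figure, which is a simplification; the function name is hypothetical, and the results are this sketch's arithmetic rather than numbers taken from the slide.

```python
DDR3_MW_PER_PIN = 40  # per-pin figure quoted on the slide


def io_power_watts(num_data_pins, mw_per_pin=DDR3_MW_PER_PIN):
    """Estimated I/O power in watts, assuming linear scaling with pins."""
    return num_data_pins * mw_per_pin / 1000.0


if __name__ == "__main__":
    for pins in (1024, 4096):
        print(f"{pins} data pins: about {io_power_watts(pins):.0f} W")
```

Under these assumptions a 1024-pin interface lands in the tens of watts and a 4096-pin interface well above 100 W, which is the motivation for the techniques listed on the slide.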

18 3D Memory-Stacked Processor [Woo et al. HPCA-16]. Diagram labels: WL, BL; or GST, BL, WL; or any high-density emerging memory technology. Can reduce power in: I/O pads; interconnects on package; clock routing

19 POD: 3D-Integrated Broad-Purpose Co-Processor [Woo et al. IEEE MICRO 08]. Diagram labels: Interleaved 2D Torus Links; Memory Bus (MBUS); Data Return Buffer (DRB); Memory ring; Processing Element; Row Response Queue (RRQ); External TLB (xtlb); Arbiter (ARB); Memory Controller; L2 Cache; x86 Processor; Instruction Bus (IBUS); ORTree. Collaborated with Intel (Allan Knies, Josh Fryman)

20 3D-MAPS Many-Core Processor @GA Tech. Diagram labels: D array tier; Data S; TSV bus; TIW core; 2D mesh; DDR memory controller; D control logic tier. Optimized 3D D design by moving peripheral logic to one separate die tier. 8 S tiles share one memory controller. 64 cores, each of which communicates with its private scratchpad memory tile through TSVs. PIs: H.-H. S. Lee, S. Lim, G. Loh

21 Hierarchical Power Optimization: Device/Circuits level; Microarchitectural level; Chip/Component level; System level; Compiler/Algorithmic level

23 System Level Power Provisioning (I) [Woo & Lee, ACM OS Review, 2009]. PROPHET: PROvisioning Processing in a Heterogeneous EnvironmenT. Use history to guide the most power-efficient execution for a given application. Netflix, YouTube, Hulu.com, etc. Diagram labels: Application; Profile History Server; OS; Profile; Multi-core Processor

24 System Level Power Provisioning (II): Power-Capping Stepping. Chart labels: max power 200W, 120W, 100W; max power consumed by CPU; by the other devices; no capping; strict capping. Current work: provision jobs to differently capped servers based on power-use prediction, lowering the overall infrastructure cost. Diagram label: RACK

25 System Level Power Provisioning (II). Schedule each job to a specific power-capped machine so that it just meets the performance requirement of the SLA. Jobs do not have to be this fast to meet the SLA, so power savings can be achieved. Chart panels: round-robin job placement vs. intelligent job placement; axes: number of jobs (90% marker) against x (seconds)
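The placement idea on this slide can be illustrated with a small sketch: among servers with different power caps, pick the lowest cap whose predicted runtime still meets the SLA. The slides do not describe the actual scheduler, so the function name, the prediction table, and the runtimes below are hypothetical; only the cap values (200 W, 120 W, 100 W) come from the power-capping slide.

```python
def place_job(predicted_runtime_by_cap, sla_seconds):
    """Pick the lowest power cap (in watts) predicted to meet the SLA.

    predicted_runtime_by_cap maps a server's power cap to the job's
    predicted runtime in seconds on that server. Returns None when no
    cap meets the SLA.
    """
    feasible = [cap for cap, runtime in predicted_runtime_by_cap.items()
                if runtime <= sla_seconds]
    return min(feasible) if feasible else None


if __name__ == "__main__":
    # Made-up runtime predictions for one job on three capped servers.
    predictions = {200: 4.0, 120: 6.0, 100: 9.0}
    print(place_job(predictions, sla_seconds=8.0))
```

With these made-up predictions the 200 W server finishes with the most slack, but the 120 W server is chosen because it is the lowest cap that still meets the 8-second SLA. Round-robin placement would ignore that slack.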

26 Power Onion and Optimization Stack. Diagram labels: Efficiency Losses; HVAC; System; Memory; LLC; HDD; ICN

27 Thank You!
