LUMOS. A Framework with Analy1cal Models for Heterogeneous Architectures. Liang Wang, and Kevin Skadron (University of Virginia)

Size: px

Start display at page:

Download "LUMOS. A Framework with Analy1cal Models for Heterogeneous Architectures. Liang Wang, and Kevin Skadron (University of Virginia)"

Fay Barnett
5 years ago
Views:

1 LUMOS A Framework with Analy1cal Models for Heterogeneous Architectures Liang Wang, and Kevin Skadron (University of Virginia)

2 What is LUMOS A set of first- order analy1cal models targe1ng heterogeneous system including: conven1onal cores and accelerators User Input Applica@on Descrip@on Accelerator Descrip@on System Defini@on Technology Scaling Model Workload Model Lumos Voltage- Frequency Scaling Model Core Performance Model Accelerator Performance Model 2

3 Why LUMOS Heterogeneous systems with accelerators have huge design space Big Core Reconfigurable Logic (e.g. FPGA) Small Core S S App- specific Accelerator (e.g ASIC) S S S ASIC ASIC Power Efficiency App- Specific Accelerators Reconfigurable Logic Small Throughput Cores Big High- Perf. Cores Generalization 3

4 Why LUMOS (cont.) Quickly explore answer for: Big cores vs. many small cores Are massive small cores (poten1ally at lower Vdd.) beoer than a couple of big cores? Conven1onal cores vs. hardware accelerators Are hardware accelerators actually needed? Reconfigurable vs. app- specific In which case, reconfigurable logic is favored over app- specific accelerators, and vice versa. 4

5 Agenda Model Details Framework Interfaces Case Studies Extension to LUMOS 5

6 Technology Scaling Model Based on Predic1ve Technology Model (PTM)* From 45nm to 16nm Both High- Performance (HP) and Low- Power (LP) variants PTM_HP Vdd 0.7 Vt nm 32nm 22nm 16nm 6 *[B. H. Calhoun et al, at ISCAS 2009]

7 Voltage- Frequency Scaling Model Simulate an 32- bit ripple carry adder with Predic1ve Technology Model (PTM) *, from 45nm to 16nm Sweep voltage from 0.3V to 1.1V Using 3- σ in Monte- Carlo simula1ons to determine the worst frequency due to process varia1ons *[B. H. Calhoun et al, at ISCAS 2009] 7

8 Impact of Process Varia1ons Simulated adder frequency with HP process, at 16nm 10x 2x Larger performance near threshold than nominal 8

Core Performance Model Symmetric Mul@- core

Area: A Power: P Area: a Power: p(v) Freq:

Use Amdahl s Law to model mul@- core speedup

9 Core Performance Model Symmetric core System A Single core v Dynamic Power 2 f v Area: A Power: P Area: a Power: p(v) Freq: f(v) Sta@c Power Frequency Fi[ed to circuit sim. Fi[ed to circuit sim. Number of ac@ve cores n = min( P, p( v) A ) a Use Amdahl s Law to model mul@- core speedup in terms of throughput. speedup = 1 ρ perf ( V max ) 1 + n ρ perf ( V ) 9

10 Accelerator Performance Model Accelerator Parameters * A 0 :Area of a single baseline core A : Area of an accelerator μ:relative performance of an accelerator with the size of A 0 φ:relative power of an accelerator with the size of A 0 * Similar to [E. S. Chung, at. el., MICRO10] Accelerator Modeling perf= μ A/ A 0 power= φ A/ A 0 10

Workload model Execu@on Model Workload A0 A1 A n Serial Applica1on Serial (1- ρ) k2 Parallel

c 1 c 0 Kernel Kernel 1 Parameters for a kernel: 1) Speedup with accelerators 2 2) Coverage

1 Kernels are always within parallel porfon of an applicafon 2 Accelerator s speedup is

11 Workload model Model Workload A0 A1 A n Serial Applica1on Serial (1- ρ) k2 Parallel (ρ) k1 k0 Executed by a conven@onal core Parallel Executed by conven@onal cores in parallel c 2 c 1 c 0 Kernel Kernel 1 Parameters for a kernel: 1) Speedup with accelerators 2 2) Coverage within an app (c i ). 1 Kernels are always within parallel porfon of an applicafon 2 Accelerator s speedup is proporfonal to its area Executed by the best available accelerators (e.g. app- specific will be favored over reconfigurable), otherwise executed by conven@onal cores in parallel 11

12 Agenda Model Details Framework Interfaces Case Studies Extension to LUMOS 12

13 Applica1on Descrip1on Encoded in XML format Each applica1on is characterized by: Parallel frac1on (f_parallel) Kernels rela1ve execu1on 1me (cov) <workload> <app name= sigproc > <f_parallel>0.7</f_parallel> <kernel_config> <kernel name= sort cov= 0.3 /> <kernel name= fft cov= 0.35 /> </kernel_config> </app> </workload> 13

Accelerator Descrip1on Encoded in XML format, indexed by kernels Accelerator devices are characterized by: Rela1ve performance (µ) Rela1ve power (φ)

14 Accelerator Descrip1on Encoded in XML format, indexed by kernels Accelerator devices are characterized by: Rela1ve performance (µ) Rela1ve power (φ) Scaled to the same size of a baseline core <kernels> <kernel name= fft > <fpga mu= 2.4, phi= 1.5 /> <asic mu= 10.2 phi= 2.3 /> </kernel> </kernels> 14

15 System Defini1on Command- line parameter (could be migrated to XML in the future) Chip- level design constraints Area TDP Technology node Core parameters Core type (e.g. IOCore, O3Core) Accelerator related parameters Area alloca1on Type of ASIC accelerators System Def. area=100 power=60 tech=16 Core_type=IOCore fpga=10% asic_sort=10% asic_fft=5% 15

16 Agenda Model Details Framework Interfaces Case Studies Case I: Conven1onal Cores Case II: Op1mal Accelerator Alloca1on Case III: Benefit of App- Specific Accelerators Extension to LUMOS 16

17 Case Study I: Conven1onal Cores Baseline (Niagara2- like IOCore) Frequency (GHz) Chip- level budgets Dynamic Power (W) Leakage Power (W) Area (mm 2 ) * Derived from McPAT, scaled to 45nm System Type Reference Area (mm2)ⱡ TDP (W) ⱡ Max # of Base Cores at 45nm* Desktop Core2 Quad Small Server Xeon Large Server SPARC T ⱡ Assume 50% budget for core- only components * Calculated by dividing base core area from total core area budget 17

18 Case Study I: Assump1ons Sufficient bandwidth Highly parallel applica1on with ρ=0.99. (e.g. HPC) Speedup normalized to a single baseline core When accelera1ng parallel part of an applica1on, LUMOS searches for the best opera1ng supply voltage achieving the highest throughput 18

19 Case Study I: Dark Si. vs. Dim Si. Alterna1ve cases Ideal: ignore varia1on impacts maximize freq subject to TDP VarWC: worst- case varia1on penalty VarRdc1: 50% of WC penalty VarRdc2: 10% of WC penalty Area=107mm^2, TDP=33W 3x 2x 2x speedup if ignoring impacts 3x slowdown with worst- case frequency 19

20 Case Study II: Accelerator Setup Synthe1c kernels and accelerators Reconfigurable accelerator rela1ve performance (μ) ranging from 20x to 60x Assume a fixed ra1o (5x) of App- specific over reconfigurable accelerators Rela@ve Performance (μ) of Synthe@c Accelerators K1 K2 K3 K4 K5 K6 K7 K8 K9 K10 RL ASIC

21 Case Study II: Synthe1c Applica1ons Synthetic kernels K 1 K 2 K 3 K 4 K 5 K 6 K 7 K 8 K 9 K 10 Probability of appearance Weight of coverage lowest Highest lowest P 1 P 2 P 3 P 4 P 5 P 6 P 7 P 8 P 9 P 10 1/P 1 1/P 2 1/P 3 1/P 4 1/P 5 1/P 6 1/P 7 1/P 8 1/P 9 1/P 10 Library- like kernels are likely to appear, but unlikely to dominate coverage, while app- specific kernels are the other way around Binary mask indicate kernels appearance Slot i has the probability of P i to be 1 Kernels in an application K 1 K 5 K 7 Each kernel s coverage is allocated by weight Parallel K 1 K 7 K 5 We generate 500 Total coverage of all kernels follows a normal distribution N(0.4, 0.1) synthe@c applica@ons to compose a generalized workload 21

22 Case Study II: Op1mal Accl. Configura1on 16nm, ideal parallelism for all applica1ons Best config: 80% - > Dim Si. 20% - > FPGA Op@mal at 20% of Acc. alloca@on 22

23 Why RL Only? RL is almost as powerful as AISC Assume 5x performance ra1o between RL and ASIC Assume linear performance scaling to allocated area A large RL alloca1on is jus1fied by the high u1liza1on achievable across mul1ple kernels A single ASIC is too expensive due to small average coverage of the target kernel Small coverage within applica1ons (library- call kernels) Rare presence among applica1ons (app- specific kernels) 23

24 Case Study III: Benefit of ASIC accelerators Alterna1ve ra1o between RL and ASIC 5x, 10x, 50x Add one more synthe1c kernel RL has a rela1ve performance µ=40 Keep its coverage at 20% across applica1ons Total coverage for all other kernels is fixed across applica1ons. 24

(SigPerf) SigCov SigPerf Sys config for the

25 Case Study III: Benefit of ASIC accelerators Dedicated ASIC is beneficial, when: Significant coverage of the targeted kernel (SigCov) Significant performance gap between ASIC and FPGA (SigPerf) SigCov SigPerf Sys config for the optimal throughput FPGA Only 20% coverage for the targeted kernel 25

26 Agenda Model Details Framework Interfaces Case Studies Extension to LUMOS 26

27 Extension to LUMOS New technology model (FinFET, T- FET, etc.) Detailed performance model (memory impacts) Realis1c accelerator and workload characteris1cs 27

28 LUMOS Release LUMOS is available at: liangwang.github.io/lumos Related publica1on L. Wang, K. Skadron, Implica1ons of the Power Wall: Dim Cores and Reconfigurable Logic. IEEE MICRO, Sep.- Oct L. Wang, K. Skadron, Dark vs. Dim Silicon and Near- Threshold Compu1ng Extended Results. TR, UVA- CS , University of Virginia. 28

29 BACKUP 29

30 U- core area for Case III Ra1o \ cov. 10% 20% 30% 40% 5x x x

(e.g. more than the total coverage of all other kernels) 50% FPGA+ 50%

31 Benefit of ASIC Accelerators (detailed) An ASIC accelerator is beneficial only when its targeted kernel covers significant por1on of the applica1on (e.g. more than the total coverage of all other kernels) 50% FPGA+ 50% ASIC FPGA Only Kernel coverage: 30% targeted kernel, 10% all other kernels 31

Implications of the Power Wall: Dim Cores and Reconfigurable Logic

Implications of the Power Wall: Dim Cores and Reconfigurable Logic Liang Wang Department of Computer Science,University of Virginia lw2aw@virginia.edu Kevin Skadron Department of Computer Science,University