ASIC Design of Shared Vector Accelerators for Multicore Processors

1 26th International Symposium on Computer Architecture and High Performance Computing, 2014
ASIC Design of Shared Vector Accelerators for Multicore Processors
Spiridon F. Beldianu & Sotirios G. Ziavras
Department of Electrical and Computer Engineering, New Jersey Institute of Technology, USA

2 Motivations/Objectives
Motivations
- Omnipresence of vector operations in high-performance scientific and embedded applications
- Need for high performance and energy efficiency
- Per-core dedicated vector coprocessors (VPs) in multicores are underutilized due to:
  - Lack of sustained DLP
  - Presence of vector-length variations in applications
- Energy not wisely spent: low resource utilization increases the impact of static power
- Rigid architecture → less room for energy-performance trade-offs
Objectives
- Develop techniques for multiple cores to simultaneously share an on-chip VP
- Develop performance and power models for the VP
- Create a robust runtime framework to dynamically resize VP resources via power gating
- Minimize energy consumption
- Facilitate energy-performance trade-offs
- ASIC implementation
Benefits
- Increased VP and overall resource utilization
- Reduced energy per task
- Reduced area needs → free resources for other critical HW design options
- Increased performance: QoS (resize resources as per individual application needs)

3 Introduction: Lane-based Vector Processor
[Figure: lane-based vector processor. Labels: scalar core with scalar registers and scalar pipeline; Lane 0 to Lane M-1, each with functional units (+, -, x, /, logic, misc), a slice of vector registers VR 0 to VR (P-1), and a load/store unit (L/S 0 to L/S M-1); interconnection network; banked memory. Lane 0 holds elements 0, M, ..., VL-M; lane 1 holds 1, M+1, ..., VL-M+1; lane M-1 holds M-1, 2M-1, ..., VL-1.]
- Vector Register File (VRF) interleaved across multiple banks → reduced connectivity between VRF and functional units, increased access bandwidth to VRF [Asanovic, 1998]
- Lanes compete only for memory accesses
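The element layout shown in the lane diagram amounts to round-robin interleaving of vector elements across lanes. A minimal Python sketch of that mapping, assuming VL is a multiple of M; the function name is illustrative and does not come from the slides:

```python
def elements_per_lane(vl: int, m: int) -> list[list[int]]:
    """Return, for each of m lanes, the vector element indices it holds."""
    if m <= 0 or vl % m != 0:
        raise ValueError("this sketch assumes M > 0 and VL a multiple of M")
    # Element i lives in lane (i mod M), at local position (i // M).
    return [list(range(lane, vl, m)) for lane in range(m)]


lanes = elements_per_lane(vl=32, m=8)
print(lanes[0])  # elements held by lane 0: 0, M, ..., VL-M
print(lanes[7])  # elements held by lane M-1: M-1, 2M-1, ..., VL-1
```

Because consecutive elements land in different lanes, a unit-stride vector operation keeps all M lanes busy at once.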

4 VP Sharing Architecture
[Figure: VP sharing architecture. Labels: CPU 0 and CPU 1; FSL 0 and FSL 1 links (instruction path, control/ack path); Vector Controller 0 and Vector Controller 1; scheduler (control and configuration signals to VCs and lanes, instructions to lanes); per-lane instruction FIFOs, FPU, vector registers, vector flag registers and LDST unit (Lane 0 to Lane M); memory crossbar (MC) with arbiters Arb 0 to Arb L; BRAM memory banks 0 to L; Vector Memory Controller (VMC); PLB interface, DMA, PLB bus.]
- M vector lanes, L memory banks
- M x L 32-bit crossbar
- [Beldianu and Ziavras, ACM Trans. Emb. Comput. Syst. 2013]
- [Beldianu and Ziavras, IEEE Trans. on Comp. 2014]
- FPGA prototype

5 VP Sharing Architecture
Two instruction buses for two cores (VC0 Instruction Bus, VC1 Instruction Bus).
[Figure: lane datapath. Labels: LDST decode, control and arbitration; ALU decode, control and arbitration; VRF and FVRF with 2R and 1W ports; address generation stages 1 and 2 with memory index; execution units (+/-, X, NEG, ABS, MOVE); result buffers; L/S, ALU and SM write-back with WB stage arbiter; request, address and write data to the memory crossbar; data from the memory crossbar.]
VRF: bit Vector Register File
FVRF: 512-bit Flag Vector Register File for conditional execution

6 VP sharing techniques: Coarse-grain Temporal Sharing (CTS)
Temporally multiplexes the execution of sequences of vector instructions or threads containing them.
[Figure: timeline for VC0 and VC1 over Lane 0 to Lane 7. Between each VP_REQ and VP_REL, MB0 and MB1 alternate, each with exclusive access to the lanes.]

7 VP sharing techniques (Ctd.): Vector-Lane Sharing (VLS)
Independent vector lanes assigned to distinct cores.
[Figure: timeline for VC0 and VC1. Between VP_REQ and VP_REL, MB0 runs on 4 lanes (Lane 0 to Lane 3) while MB1 runs on 4 lanes (Lane 4 to Lane 7).]
No separate control bus for each lane. The lane is self-controlled by receiving appropriate instructions.

8 VP sharing techniques (Ctd.): Fine-grain Temporal Sharing (FTS)
Spatial (i.e., resource-based) multiplexing of vector instructions in each lane (like simultaneous multithreading).
[Figure: timeline for VC0 and VC1 over Lane 0 to Lane 7. Between VP_REQ and VP_REL, MB0 and MB1 both run on ALL lanes at the same time.]
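The difference between CTS, VLS and FTS can be summarized by which lanes a core may issue to at a given moment. The sketch below is a toy model under stated assumptions, not the authors' scheduler: the names `Mode` and `lanes_visible` are hypothetical, and the even lane split for VLS simply mirrors the 4 + 4 split drawn for two cores and eight lanes.

```python
from enum import Enum


class Mode(Enum):
    CTS = "coarse-grain temporal sharing"
    VLS = "vector-lane sharing"
    FTS = "fine-grain temporal sharing"


def lanes_visible(mode: Mode, core: int, owner: int,
                  num_lanes: int = 8, num_cores: int = 2) -> list[int]:
    """Lanes that `core` may issue vector instructions to right now.

    `owner` is the core currently holding the VP (between its VP_REQ and
    VP_REL); it only matters for CTS.
    """
    if mode is Mode.CTS:
        # One core at a time has exclusive access to every lane.
        return list(range(num_lanes)) if core == owner else []
    if mode is Mode.VLS:
        # Assumption: lanes are split evenly into disjoint groups per core.
        share = num_lanes // num_cores
        return list(range(core * share, (core + 1) * share))
    # FTS: every core can issue to all lanes; instructions from different
    # cores are multiplexed inside each lane.
    return list(range(num_lanes))


for mode in Mode:
    print(mode.name, [lanes_visible(mode, core, owner=0) for core in (0, 1)])
```

In this model only FTS lets both cores reach all lanes simultaneously, which is why it can fill lane units that a single thread leaves idle.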

9 FPGA Prototype: Summary of Results
Performance: FTS best
- Under low resource utilization, FTS doubles the speedup and reduces the energy by about 50% compared to cores with exclusive VP access
- The lack of adequate DLP in applications is overcome via VP sharing
- Larger VL (vector length) → increases the performance (due to higher DLP)
- Loop unrolling → increases the ILP, utilization and overall performance
- With low utilization of lane units in CTS → the speedup of FTS almost doubles
- With high utilization of units in CTS → FTS improves the performance by 25-50%
- TLP provides higher speedup than DLP and ILP → one more justification for VP sharing
Power: FTS best
- VP sharing for a dual-core yields speedups and halves the energy needs compared to a system having a single core with an attached VP

10 Benchmarks
- 32-tap FIR (Finite Impulse Response) filter
- 1024x1024 dense matrix multiplication (MM)
- 32-point decimation-in-time radix-2 butterfly FFT: five-stage butterfly; each stage involves complex multiply and add, and shuffle operations (the complex coefficients were pre-computed and stored in a vector register)
- LU decomposition: dense matrix using the Doolittle algorithm (Gaussian elimination)
- Sparse Matrix Vector Multiplication (SpMVM): Compressed Row Storage (CSR) format with 2 stages
  - First stage: array values are multiplied with vector elements
  - Second stage: additions along each row
  - The Load Index instruction is used intensively in both stages (the index vector has random values corresponding to positions in the sparse matrix) → the non-uniform I/O access of the LDST units to VM banks produces contention in the crossbar
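The two SpMVM stages can be illustrated with a scalar Python reference. This is a sketch of the general CSR technique, not the authors' vectorized kernel; the function and variable names are assumptions, and stage 2 uses plain slicing where the VP version would rely on indexed loads.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    # Stage 1: gather x through the column indices (the indexed-load step)
    # and multiply element-wise with the stored non-zero values.
    products = [v * x[j] for v, j in zip(values, col_idx)]
    # Stage 2: add the products along each row; row r owns the slice
    # products[row_ptr[r]:row_ptr[r + 1]].
    return [sum(products[row_ptr[r]:row_ptr[r + 1]])
            for r in range(len(row_ptr) - 1)]


# Example matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

print(spmv_csr(values, col_idx, row_ptr, x))  # [5.0, 6.0, 19.0]
```

The gather `x[j]` is where irregular column indices turn into irregular memory-bank accesses, which is the source of the crossbar contention noted on the slide.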

11 ASIC Implementation: FPGA to ASIC Design Transition
- SP FP Multiply, Add/Sub: Xilinx IP cores → OpenCore IPs, in-house customized and optimized
- VRF: Xilinx Block RAM → latch-based VRF; CACTI 6.0 model for area/delay/power analysis
- Vector Memory bank: CACTI model for area/power analysis
- Xilinx DSP blocks → Synopsys DesignWare Basic Block
- 40 nm TSMC High Performance (HP) process
- Fine-grain clock gating enforced

12 ASIC Implementation: Synopsys Design Flow
1. RTL simulation (Synopsys VCS-MX): simulation of the RTL description (.vhd) to measure performance.
2. Synthesis (Synopsys Design Compiler): uses the RTL (.vhd), the design constraints (.tcl) and the TSMC library (.db, .lib); produces the netlist (.v), delay info (.sdf) and constraints (.sdc).
3. Netlist simulation (Synopsys VCS-MX): simulation of the netlist produced by synthesis, using the TSMC library (.v); produces the activity file (.saif).
4. Power analysis (Synopsys PrimeTime-PX/SX): analysis of the power consumption of the implemented design; wire load model, <10% estimation error.

13 ASIC Implementation: Design Exploration, Clock Gating vs. No Clock Gating
- ALU execution, 1 GHz
- 40 nm TSMC process with VDD = 1.21 V and low Vth (threshold voltage)
- Wire load model: most conservative
[Figure: total power in a lane for FP operations. Power (mW) vs. activity rate (%). Series: FP ADD/SUB, FP MUL, FP MISC and standby power, each with clock gating (CG) and without (NO CG).]

14 TSMC HP 40nm Process Corners
PC      Description
PC_01   VDD: 1.21 V, Vth: Low
PC_02   VDD: 1.21 V, Vth: Nominal
PC_03   VDD: 1.21 V, Vth: High
PC_04   VDD: 0.99 V, Vth: Low
- High Performance non-well biased with UPF (unified power format) and multi-voltage support
- Threshold voltage range: mV
- Temperature: 125 C

15 ASIC Implementation: Design Exploration, Pareto Trade-Off
[Figure: one-lane ALU unit. (a) Area (µm², x 10^4) vs. throughput (GFLOP/s); (b) power (mW) vs. throughput (GFLOP/s). Series: PC_01 (VDD = 1.21 V, low Vt), PC_02 (VDD = 1.21 V, nominal Vt), PC_03 (VDD = 1.21 V, high Vt), PC_04 (VDD = 0.99 V, low Vt).]
For performance greater than 1.1 GFLOP/s, the process corner adopted is TSMC 40 nm High Performance with VDD = 1.21 V and low Vt.

16 ASIC Implementation: Area/Power
CACTI 6.0 model; 40 nm, 1 GHz
Vector Register File bank
- 4 read ports / 2 write ports (6 ports in total)
- bit elements (2 KBytes)
- VDD: V
- Access time (ns):
- Total read dynamic energy per read port (nJ):
- Total read dynamic power per read port at max freq (mW):
- Total standby leakage power per bank (mW):
- Total area (µm²):
Vector Memory bank
- Read/Write ports
- bit elements (8 KBytes)
- VDD: 0.661 V
- Access time (ns):
- Total read dynamic energy per read port (nJ):
- Total read dynamic power per read port at max freq (mW):
- Total standby leakage power per bank (mW):
- Total area (µm²):

17 ASIC Implementation: Area/Power (Ctd.)
[Figures: lane area breakdown; VP power breakdown (mW).]
~16 GFLOPs/Watt (8 lanes, 1 GHz)

18 ASIC Implementation: Area/Power
Columns: Component; Area (µm²); Power (mW): Leakage, Standby, Max
Component                          Area (µm²)
ALU                                (100%)
  ALU_CTRL                         1951 (9.4%)
  ALU_ADD/SUB                      5482 (26.5%)
  ALU_MUL                          8126 (39.3%)
  ALU_MISC                         1271 (6.2%)
  ALU_vc0_q                        1571 (7.6%)
  ALU_vc1_q                        1571 (7.6%)
LDST                               (100%)
  LDST_CTRL                        (56.6%)
  LDST_vc0_q                       (21.7%)
  LDST_vc1_q                       (21.6%)
VRF - Latch based (one bank)
VRF - SRAM (CACTI) (one bank)      4.177/port
VC
Scheduler
Crossbar Switch
Vector Memory - SRAM (one bank)    /port
TOTAL VP AREA
Gate Count

19 ASIC Implementation: Performance
~16 GFLOPs/Watt (8 lanes, 1 GHz)
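An efficiency figure in GFLOPs/Watt converts directly into energy per floating-point operation, which is often easier to compare across designs. A small Python sketch of the unit conversion; only the value 16 comes from the slides, and the function name is illustrative:

```python
def energy_per_flop_pj(gflops_per_watt: float) -> float:
    """Convert an efficiency in GFLOPs/Watt to energy per FLOP in picojoules."""
    if gflops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    # 1 GFLOP/s per watt = 1e9 FLOP per joule, and 1 J = 1e12 pJ.
    return 1e12 / (gflops_per_watt * 1e9)


print(energy_per_flop_pj(16.0))  # 62.5
```

At roughly 16 GFLOPs/Watt, each floating-point operation therefore costs on the order of 62.5 pJ.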

20 Conclusions
- ASIC implementation analysis of three architectural contexts (CTS, VLS, FTS) for the implementation of shared vector coprocessors in multicores
  → Increased throughput and lower energy per operation
  → ~16 GFLOPs/Watt at 1 GHz with 8 lanes in a 40 nm HP TSMC process
  → Power consumption of FTS is better than that of CTS and VLS, primarily under low average utilization (see SpMV)
  → Largest power consumption for operations on sparse matrices, since substantial energy is spent moving data between LDST and Vector Memory
  → Leakage power: 13-19% of total; clock network power: 23-34% of total
- Performance and power models suggest several techniques to increase the performance and/or reduce the energy consumption:
  - Increase DLP by increasing the vector length
  - Increase ILP at compile time by loop unrolling
  - Use TLP via sharing
  - Resize the VP to reduce standby energy consumption per FLOP
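The last point, resizing the VP to cut standby energy per FLOP, can be made concrete with a toy model. Everything here is an assumption for illustration: the per-lane standby power, the dynamic energy per FLOP, one FLOP per lane per nanosecond (one per cycle at 1 GHz), and zero leakage in power-gated lanes. None of the numbers come from the slides.

```python
def energy_per_flop_nj(powered_lanes: int, usable_lanes: int,
                       standby_mw_per_lane: float,
                       dynamic_nj_per_flop: float,
                       flops_per_lane_per_ns: float = 1.0) -> float:
    """Toy estimate of energy per FLOP (nJ) for a VP with some lanes powered.

    usable_lanes: lanes the application's parallelism can actually keep busy.
    """
    active = min(powered_lanes, usable_lanes)
    if active <= 0:
        raise ValueError("need at least one powered, usable lane")
    throughput = active * flops_per_lane_per_ns          # FLOP per ns
    standby_mw = powered_lanes * standby_mw_per_lane     # all powered lanes leak
    # 1 mW equals 1e-3 nJ/ns, so dividing by FLOP/ns gives nJ per FLOP.
    standby_nj_per_flop = standby_mw * 1e-3 / throughput
    return dynamic_nj_per_flop + standby_nj_per_flop


# Hypothetical workload that can keep only 2 of 8 lanes busy.
full = energy_per_flop_nj(8, 2, standby_mw_per_lane=2.0, dynamic_nj_per_flop=0.05)
resized = energy_per_flop_nj(2, 2, standby_mw_per_lane=2.0, dynamic_nj_per_flop=0.05)
print(full, resized)
```

In this model the dynamic term is unchanged by resizing; only the standby share per FLOP shrinks when idle lanes are gated off, which is the effect the conclusion refers to.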
