Byeong-Gyu Nam, Jeabin Lee, Kwanho Kim, Seung Jin Lee, and Hoi-Jun Yoo

1 A Low-Power Handheld GPU using Logarithmic Arithmetic and Triple DVFS Power Domains
Byeong-Gyu Nam, Jeabin Lee, Kwanho Kim, Seung Jin Lee, and Hoi-Jun Yoo

2 Outline
- Backgrounds
- Proposed Handheld GPU
- Low-Power Vertex Shader
- Low-Power Rendering Engine
- Triple-Domain DVFS
- Evaluation Results
- Conclusion

3 Motivations
Handheld 3D Graphics Systems
- Low power: long battery lifetime
- Small area: limited footprint
- High performance: more realistic scenes
Previous Works
- [Lindholm et al., SIGGRAPH01] Conventional vertex shader architecture
- [Kameyama et al., GH03] Low cost, limited performance
- [Sohn et al., GH04] Fixed-point vertex shader
- [Yu et al., ISSCC06] High-performance floating-point vertex shader

4 Contributions
Handheld GPU using Logarithmic Arithmetic
- Power-efficient, area-efficient, and high-throughput graphics pipeline
- Vertex shader using a multifunction unit
- Merges matrix, vector, and elementary functions in a single 4-way arithmetic unit
Handheld GPU with Triple-Domain DVFS
- Separate control of each power domain according to its workload
- Minimizes power consumption for a given workload

5 Logarithmic Arithmetic
Various Math Operations for the 3D Graphics Pipeline
- Matrix-vector multiplication
- Vector operations
- Elementary functions
- Too complicated for resource-limited handsets
Logarithmic Number System (LNS)
- Reduces arithmetic complexity
- Introduces conversion errors
Handheld 3D Graphics
- Small screen tolerates a small computation error
- Logarithmic arithmetic can be used
Operations: MAT, VMUL, VMAD, VDOT, VDIV, VDSQ, POW, LOG, SIN, COS, ATAN
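
The complexity reduction claimed for LNS can be illustrated with a minimal Python sketch: once operands are held as base-2 logarithms, multiplication becomes addition, division becomes subtraction, and square root becomes a halving. The function names are hypothetical, exact floating-point logs stand in for the hardware converters, and only positive operands are handled (sign handling is left out as a simplifying assumption).

```python
import math

def to_log(x):
    """Convert a positive number to its base-2 logarithm (its LNS value)."""
    return math.log2(x)

def from_log(lx):
    """Convert an LNS value back to the linear domain."""
    return 2.0 ** lx

def lns_mul(a, b):
    # Multiplication is an addition in the log domain.
    return from_log(to_log(a) + to_log(b))

def lns_div(a, b):
    # Division is a subtraction in the log domain.
    return from_log(to_log(a) - to_log(b))

def lns_sqrt(a):
    # Square root is a halving (a one-bit right shift in hardware).
    return from_log(to_log(a) / 2)

print(lns_mul(3.0, 4.0), lns_div(10.0, 4.0), lns_sqrt(9.0))
```

In hardware the conversions are approximate, which is the source of the conversion errors listed on the slide.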

6 Dynamic Voltage Frequency Scaling
Power consumption is proportional to
- Square of supply voltage
- Operating frequency
Dynamically scaling supply voltage and frequency is important for low-power design
DVFS in a module-wise manner
- Can assign the optimal frequency and voltage to each module [Sheaffer et al., GH04]
- cf.) Chip-wise DVFS cannot exploit the dynamic workload characteristics of each module
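
The proportionality stated above (power scales with the square of supply voltage and linearly with frequency) can be made concrete with a small Python sketch. The function name and the operating points in the example are hypothetical and are not taken from the slides.

```python
def relative_dynamic_power(vdd, freq, vdd_ref, freq_ref):
    """Dynamic power relative to a reference point, using P ~ Vdd^2 * f."""
    return (vdd / vdd_ref) ** 2 * (freq / freq_ref)

# Hypothetical example: halving both voltage and frequency
# relative to an assumed 1.8 V / 200 MHz reference point.
ratio = relative_dynamic_power(0.9, 100.0, 1.8, 200.0)
print(ratio)
```

Because voltage enters squared, lowering Vdd together with frequency saves far more power than lowering frequency alone, which is why the two are scaled together.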

7 Overall Architecture
[Block diagram: RISC, VS, and RE domains, each with its own PMU supplying Vdd and Clk. Blocks: RISC, Matrix FIFO, VIB, Vertex Shader (VS), Index FIFO, VOB, Rendering Engine (RE), external asynchronous SRAM. Stages: Application, Geometry, Rendering]
- Integrates RISC, VS, and RE
- Logarithmic arithmetic based pipeline
- Triple power domains with DVFS
- 3 PMUs for management and FIFOs for communication

8 Vertex Shader
[Block diagram: Vertex Fetch, Matrix FIFO (4KB), Instruction Memory (2KB), Vertex Input Buffer (512B), Constant Memory (4KB), General Purpose Register (512B), swizzle (SWZ) units, Multifunction unit, Instruction Decode and Control, Index FIFO (64B), VOR (256B)]
Multifunction Unit
- Unifies matrix, vector, and elementary functions
- Single-cycle throughput for vector and elementary functions
2-cycle Transformation
- Matrix-vector multiplication using LNS
- 4 cycles in the conventional way
Operand Forwarding
- Log-domain forwarding

9 Number System
Hybrid Number System (HNS)
- Complex operations in LNS
- Linear add, sub in FLP
Number converters based on piecewise linear approximation
[Diagram: HNS combines FLP* (add, sub) and LNS** (complex op), linked by converters. Logarithmic converter: input x, 15-entry LUT supplying b_i, c_i, d_i, e_i; terms m>>c_i, m>>d_i, m>>e_i, and m+b_i are summed by two CSAs* and a CPA**; output labeled a_i m + b_i. Antilogarithmic converter: input x, 8-entry LUT supplying b_i, c_i, d_i; terms f>>c_i, f>>d_i, and f+b_i are summed by two CSAs* and a CPA**; output labeled a_i f + b_i]
* FLP: Floating-Point Number System
** LNS: Logarithmic Number System
* CSA: Carry Save Adder
** CPA: Carry Propagate Adder
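
A piecewise linear logarithmic converter of the kind named on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the chip's converter: the segment count (`SEGMENTS = 16`) is hypothetical, the slope and intercept of each segment are fitted as simple chords of log2(1 + m), and ordinary multiplication replaces the shift-and-add network shown in the diagram.

```python
import math

SEGMENTS = 16  # hypothetical segment count, not the LUT size from the slide

def build_lut(segments=SEGMENTS):
    """Per segment of m in [0, 1), store slope a_i and intercept b_i
    of the chord of log2(1 + m) across that segment."""
    lut = []
    for i in range(segments):
        m0, m1 = i / segments, (i + 1) / segments
        y0, y1 = math.log2(1 + m0), math.log2(1 + m1)
        a = (y1 - y0) / (m1 - m0)
        b = y0 - a * m0
        lut.append((a, b))
    return lut

LUT = build_lut()

def approx_log2(x):
    """Approximate log2(x) for x > 0 as exponent + a_i * m + b_i."""
    frac, exp = math.frexp(x)      # x = frac * 2**exp with 0.5 <= frac < 1
    m = frac * 2 - 1               # mantissa fraction in [0, 1)
    e = exp - 1                    # integer part of log2(x)
    i = min(int(m * len(LUT)), len(LUT) - 1)
    a, b = LUT[i]
    return e + a * m + b

print(approx_log2(10.0), math.log2(10.0))
```

With chords over 16 equal segments the absolute error stays below about 0.001; more segments or better-placed coefficients trade LUT size against accuracy.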

10 Matrix-Vector Multiplication
- MAT requires 16 multiplications
- MAT in HNS requires 20 LOGCs, 16 ADDs, 16 ALOGCs
- Fixed matrix coefficients for an object: 4 LOGCs, 16 ADDs, 16 ALOGCs
- 2-phase implementation requires 2 LOGCs, 8 ADDs, 8 ALOGCs per phase (MAT 1st phase, MAT 2nd phase)
[Datapath diagram: Channels 0 to 3, pipeline stages E1 to E5. Each channel contains LOGC (LOG LUT, adder tree), ADD, PMUL* with Booth encoder, ALOGC (ALOG LUT, adder tree), PADD, and ACC. Channel 0 takes x0 and x2 with log2 c00 and log2 c02, and its LOGC output goes to Channels 1, 2, 3; log2 c01 and log2 c03 are combined with log2 x1 (1st phase) and log2 x3 (2nd phase) from the LOGC of Channel 1]
* PMUL: programmable multiplier
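
The operation counts on this slide can be reproduced with a small Python model. It is a sketch under assumptions: the matrix coefficients are converted to logs once per object (so they cost no LOGC per vertex), phase 1 handles x0 and x1 and phase 2 handles x2 and x3, and each channel accumulates one output row. All function names are hypothetical, and exact floating-point logs replace the approximate hardware converters.

```python
import math

def logc(x):
    """Log converter model: return (sign, log2|x|); zero maps to -inf."""
    if x == 0:
        return (1.0, -math.inf)
    return (math.copysign(1.0, x), math.log2(abs(x)))

def alogc(sign, lg):
    """Antilog converter model: back to the linear domain."""
    return sign * 2.0 ** lg

def precompute_log_coeffs(matrix):
    """Done once per object, since the matrix is fixed for that object."""
    return [[logc(c) for c in row] for row in matrix]

def mat_vec_two_phase(log_coeffs, x):
    acc = [0.0] * 4
    counts = {"LOGC": 0, "ADD": 0, "ALOGC": 0}
    for phase in range(2):
        cols = (2 * phase, 2 * phase + 1)
        log_x = {j: logc(x[j]) for j in cols}     # 2 LOGCs per phase
        counts["LOGC"] += 2
        for row in range(4):                      # one channel per output row
            for j in cols:
                c_sign, c_log = log_coeffs[row][j]
                x_sign, x_log = log_x[j]
                # Multiply = add in the log domain, then convert back.
                acc[row] += alogc(c_sign * x_sign, c_log + x_log)
                counts["ADD"] += 1
                counts["ALOGC"] += 1
    return acc, counts

M = [[1, 2, 3, 4], [0, 1, 0, -1], [2, 0, 0.5, 0], [0, 0, 0, 1]]
result, counts = mat_vec_two_phase(precompute_log_coeffs(M), [1.0, -2.0, 4.0, 1.0])
print(result, counts)
```

The totals come out to 4 LOGCs, 16 ADDs, and 16 ALOGCs over the two phases, matching the fixed-coefficient figures on the slide.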

11 Vector Operations
- Defined as a generic operation
- Vector MUL, DIV, DSQ, MAD
- Converted into HNS
- 2 LOGCs are required per channel
[Datapath diagram: Channels 0 to 3, pipeline stages E1 to E5. Each channel takes inputs x0, y0 (for Channel 0) through two LOGCs (LOG LUT, adder tree), then ADD, PMUL with Booth encoder, ALOGC (ALOG LUT, adder tree), PADD*, and ACC, producing z0; the channel outputs can be summed]
* PADD: programmable adder

12 Elementary Functions
- Require power function evaluations: POW, Taylor series for trigonometric functions (SIN, COS, ATAN, ...)
- Converted into HNS
- Booth multiplier is required
[Datapath diagram: Channels 0 to 3, pipeline stages E1 to E5. Inputs x0, y0 pass through LOGC (LOG LUT, adder tree), ADD, PMUL with Booth encoder, ALOGC (ALOG LUT, adder tree), PADD, and ACC, with a bias input]
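
Why a multiplier is needed for elementary functions can be seen from the identity x^y = 2^(y * log2 x): raising to a power is a multiplication in the log domain. The Python sketch below shows this, along with a Taylor series for sine whose power terms are evaluated in the log domain and summed in the linear domain. The function names, the number of series terms, and the way terms are mapped to operations are assumptions for illustration; the slide states only that power functions and Taylor series are used.

```python
import math

def lns_pow(x, y):
    """x**y for x > 0: a multiplication of y by log2(x) in the log domain."""
    return 2.0 ** (y * math.log2(x))

def hns_sin(x, terms=6):
    """sin(x) by Taylor series; each term x**k / k! is formed in the
    log domain, and the signed terms are added in the linear domain."""
    if x == 0:
        return 0.0
    sign_x = -1.0 if x < 0 else 1.0
    log_x = math.log2(abs(x))
    total = 0.0
    for n in range(terms):
        k = 2 * n + 1
        # k * log2|x| is the power; subtracting log2(k!) is the division.
        term = 2.0 ** (k * log_x - math.log2(math.factorial(k)))
        total += term if n % 2 == 0 else -term
    return sign_x * total

print(lns_pow(2.0, 10.0), hns_sin(0.5), math.sin(0.5))
```

The series is only accurate for small arguments; range reduction is not shown here.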

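13 Operand Forwarding in Log-domain
[Diagram: without forwarding, op 1 (Log, Log, Alog) is followed by op 2 (Log, Log, Alog); the Alog at the end of op 1 and the Log at the start of op 2 cancel out. With forwarding, the log-domain result of op 1 feeds op 2 directly]
- Avoids back-to-back antilog and log converters
- Forwards the intermediate log value to the next instruction
- Pipeline latency reduction
- Computation error reduction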
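
The saving from log-domain forwarding can be counted with a small Python model of two dependent instructions, a multiply followed by a divide. The `Converters` class and both function names are hypothetical; exact floating-point logs stand in for the approximate hardware converters, so the sketch shows the converter count rather than the error reduction.

```python
import math

class Converters:
    """Counts how many log and antilog conversions are performed."""
    def __init__(self):
        self.logs = 0
        self.alogs = 0

    def log(self, x):
        self.logs += 1
        return math.log2(x)

    def alog(self, lx):
        self.alogs += 1
        return 2.0 ** lx

def mul_then_div_no_forwarding(cv, a, b, c):
    t = cv.alog(cv.log(a) + cv.log(b))       # op 1: t = a * b, back to linear
    return cv.alog(cv.log(t) - cv.log(c))    # op 2: t / c, t converted again

def mul_then_div_forwarding(cv, a, b, c):
    log_t = cv.log(a) + cv.log(b)            # op 1 result stays in log domain
    return cv.alog(log_t - cv.log(c))        # op 2 consumes it directly

plain, fwd = Converters(), Converters()
r_plain = mul_then_div_no_forwarding(plain, 6.0, 4.0, 3.0)
r_fwd = mul_then_div_forwarding(fwd, 6.0, 4.0, 3.0)
print(r_plain, (plain.logs, plain.alogs), r_fwd, (fwd.logs, fwd.alogs))
```

Forwarding removes one antilog and one log conversion from the dependent pair; with approximate converters, each removed conversion is also one fewer source of approximation error.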

14 Rendering Engine
[Block diagram: Vertex Shader, Vertex buffer, Triangle Setup Engine (TSE), Rasterizer (RST), Pixel pipeline, Texture Unit (TEX), with interpolators (Intpl) and a divider (DIV)]
- Divisions required in RE are implemented as subtractors
- SIMD interpolators in TSE and RST
- SIMD divider in Texture Unit
[Divider diagram: inputs s, t, u, v, w each pass through a LOG converter; four subtractors (SUB) and four ALOG converters produce s/w, t/w, u/w, v/w]
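
The SIMD divider in the diagram can be modeled in a few lines of Python: five log conversions, four subtractions that share log2(w), and four antilog conversions. The function name is hypothetical, the inputs are assumed positive, and exact logs replace the hardware converters.

```python
import math

def simd_divide_by_w(s, t, u, v, w):
    """Return (s/w, t/w, u/w, v/w) using one shared log2(w) and subtractions."""
    log_w = math.log2(w)                                   # 1 LOG conversion for w
    logs = [math.log2(c) for c in (s, t, u, v)]            # 4 LOG conversions
    return tuple(2.0 ** (lc - log_w) for lc in logs)       # 4 SUBs, 4 ALOGs

print(simd_divide_by_w(2.0, 4.0, 6.0, 8.0, 4.0))
```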

15 Triple-Domain DVFS
[Diagram: RISC, VS, and RE, each with its own PMU, connected through FIFOs; workloads labeled Objects, Vertices, Pixels]
- Workloads for each module in the 3D pipeline are different
- Objects for RISC, vertices for VS, pixels for RE
- Separate power domains with DVFS
- Each domain is managed separately according to its workload

16 Performance Evaluations
Environment Setup
- Accuracy simulation: C libraries for HNS and FLP operations
- Performance measurement: cycle count evaluator
- Hardware model for graphics processor core: fully synthesizable structural HDL model
[Flow diagram: 3D application, HNS library, FLP library, cycle count evaluator, hardware model, power management model]

17 Accuracy Comparison
[Images: test screens rendered from FLP and from HNS]
- FLP and HNS libraries
- QVGA test screens: Tr&Lighting, Tr&Texturing
- Tolerable image difference

18 Performance Comparison
[Chart: cycle counts for the test vectors OpenGL, Oren-Nayar, and Cook-Torrance, comparing [AYH*04], [SWY04], [KCY*06], [YCK*06], and this work]
Test vectors
- Standard OpenGL TnL
- O-N and C-T for advanced diffuse and specular lighting
Comparison results
- 19.1%, 49.3%, and 23.9% cycle count reduction for the OpenGL, O-N, and C-T models relative to the latest work [YCK*06]

19 Implementation
[Chip photograph labels: I$, D$, RISC, RISC PMU, Matrix FIFO, Vertex Shader, CMEM, IMEM, VIB, DMEM, VS PMU, VOB, Rendering Engine, RE PMU]
- Process technology: TSMC 0.18um CMOS 1P 6M
- Area: 17.2mm^2, 393K gates, 29KB SRAM
- Processing speed: 5.26 Mvertices/s, 50 Mpixels/s
- Power consumption: [values missing] at 60fps and at full speed
- PMU specifications: [ranges missing] V, MHz; 0.45mm^2, 5mW

20 Power Consumption
[Plots: triangle size (# of pixels), FIFO entry level (# of entries), frequency of VS (Clk), voltage of VS (Vdd)]
DVFS for VS according to the workload
- Higher workload causes a droop in the FIFO entry level
- FIFO droop increases the frequency and Vdd of VS
- Increased VS speed restores the FIFO entry level
For a test model with 5,878 triangles, 5 DC, and OpenGL TnL
- GPU consumes [value missing] at 60fps
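
The feedback loop described above (FIFO droop raises the VS operating point, a refilled FIFO lets it fall again) can be sketched as a simple threshold controller in Python. Everything specific here is hypothetical: the operating points in `OPERATING_POINTS`, the `low` and `high` thresholds, and the function name are illustrative choices, not values from the slides.

```python
# Hypothetical (Vdd in volts, clock in MHz) operating points, lowest first.
OPERATING_POINTS = [(1.0, 50), (1.4, 100), (1.8, 200)]

def next_level(level, fifo_entries, low=4, high=12):
    """Pick the next operating-point index from the FIFO entry level.

    A FIFO level below `low` (droop) steps frequency and Vdd up;
    a level above `high` steps them down; otherwise hold.
    """
    if fifo_entries < low and level < len(OPERATING_POINTS) - 1:
        return level + 1
    if fifo_entries > high and level > 0:
        return level - 1
    return level

level = 0
for entries in [10, 3, 2, 9, 14, 15]:   # hypothetical FIFO level trace
    level = next_level(level, entries)
    print(entries, OPERATING_POINTS[level])
```

The gap between the two thresholds acts as hysteresis so the operating point does not oscillate on every small change in FIFO level.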

21 Handheld GPUs
Columns: Performance (Mvertices/s), full speed, Area (Kgates), Functions, Process / Freq., Kvertices/s/MHz
(entries not listed below are missing)

Reference   Functions  Process / Freq.   Area (Kgates)      Kvertices/s/MHz
[KKF*03]    GE+RE      0.18um / 30MHz                       6.17
[SWY04]     GE         0.13um / 400MHz                      18
[YCK*06]    GE         0.18um / 100MHz                      21.3
This work   GE+RE      0.18um / 200MHz   393 (242 for GE)   26.3

This work also lists "(87 for GE)" in a column whose header and main value are missing.
Compared with the latest work [YCK*06]
- 2.47x performance improvement
- 50.5% power reduction
- 38.5% area reduction
- 23.5% Kvertices/s/MHz improvement
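
The efficiency figures in this comparison can be cross-checked from numbers that do appear in the slides: 5.26 Mvertices/s at 200MHz for this work, and 21.3 Kvertices/s/MHz at 100MHz for [YCK*06]. The Python sketch below does that arithmetic; the 2.13 Mvertices/s figure for [YCK*06] is derived here from its efficiency and clock, not quoted from the source.

```python
def kvertices_per_mhz(mvertices_per_s, freq_mhz):
    """Throughput normalized by clock frequency, in Kvertices/s per MHz."""
    return mvertices_per_s * 1000.0 / freq_mhz

this_work = kvertices_per_mhz(5.26, 200)          # about 26.3

# Derived, not quoted: [YCK*06] throughput implied by 21.3 Kvertices/s/MHz at 100MHz.
prev_mvertices = 21.3 * 100 / 1000.0              # about 2.13 Mvertices/s

speedup = 5.26 / prev_mvertices                   # performance ratio
efficiency_gain = (26.3 / 21.3 - 1) * 100         # percent improvement

print(round(this_work, 1), round(speedup, 2), round(efficiency_gain, 1))
```

The results agree with the 26.3 Kvertices/s/MHz, 2.47x, and 23.5% figures on the slide.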

22 Conclusion
A Handheld GPU for Low-Power Wireless Applications
- 2.47x performance improvement
- 50.5% power reduction
- 38.5% area reduction
3D Graphics Pipeline based on Logarithmic Arithmetic
- 5.26 Mvertices/s for OpenGL TnL
Triple power domains with DVFS
- [value missing] at 60fps

23 Thank You For more information
