A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

1 A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications
Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo
Semiconductor System Lab., Dept. of EECS, Korea Advanced Institute of Science and Technology

2 Outline
Introduction
System Architecture
  Architecture Overview
  Dual Operations
  Vertex Shader, Rendering Engine
Low Power Techniques
  Fixed-Point SIMD Datapath
  Instruction-level Power Management
  Programmable Frequency Synthesizer
Implementation Results
Summary

3 Motivation
Portable 3D graphics
  Wireless multimedia
  PDA and cell-phone
Requirements
  Long battery time (3 hours play)
  High performance (>3Mvertices/s)
  Low cost (<$5)
Problems
  Too large chip size
  Very complicated chip
Graphics Evolutions
  Simple Shading: ISSCC 2001
  Texture Mapping: ISSCC 2003
  3D-CG IP: ISSCC 2004
  Programmable Shading: ISSCC
Programmable shading architecture
  Flexibility and low power consumption

4 Programmable Graphics Pipeline
Traditional Graphics Pipeline (fixed functions, simple graphics)
  Transform -> Lighting -> Clipping -> Rasterization -> Texture Mapping
Programmable Graphics Pipeline (flexible 3D graphics functions)
  Vertex Shader (user-defined vertex processing, vertex program) -> Clipping -> Rasterization -> Texture Mapping

5 Standouts of Portable 3D Graphics
                                        ISSCC 2003      ISSCC 2004             This Work
Process                                 0.16 µm DRAM    0.18 µm CMOS           0.18 µm CMOS
Frequency (MHz)                         132 / / 50
Programmable Shading                    x               x                      Vertex Program 1.1
Type of HW                              Integer RISC    Floating-point Engine  Fixed-point Vertex Shader
Graphics Speed (Mvertices/s)
Pixel Fill Rate (Mpixels/s)
Power Consumption (mW)
Performance Index (Kvertices/s per mW)

6 System Architecture
ARM-10 Co-processor Architecture
[Block diagram] ARM-10 (16KB I$, 16KB D$) with co-processor interface
  Vertex Shader: fixed-point SIMD datapath, 2KB code mem., 32KB display buf.
  Vertex buffer
  Rendering Engine: triangle setup, pixel processor, 26KB graphics cache
  PFS: CTRL, PLL
  Memory CTRL with async. SRAM I/F
  PERI and external I/O
  System bus (800MB/s)
  Bus widths shown: 32b, 56b, 90b, 128b

7 Data Transfer Flow
Separation of data and command paths
  Increase the parallelism of hardware blocks.
  Reduce the bus arbitration cycles and bandwidth.
[Diagram] Data transfer path (vertex data transfer, pixel data transfer): System Bus, External Mem. (graphics model data), Vertex Shader, Vertex Buf., Rendering Engine
[Diagram] Command transfer path (VS commands transfer, RE commands transfer): ARM-10, Co-Proc I/F

8 Dual Operations
Two operating states
Tightly coupled co-processor (TCC)
  Normal co-processor
  General SIMD instructions
  Handshaking with ARM-10
  Timeline: ARM-10 executes ARM Inst. 1 and ARM Inst. 3; Vertex Shader executes SIMD Inst. 2 and SIMD Inst. 4
Parallel processor (PP)
  Independent processor
  Graphics instructions
  Parallel operations with ARM-10
  Timeline: ARM-10 executes ARM Program 1 and ARM Program 2; Vertex Shader executes Vertex Program

9 Programmable Vertex Shader
[Block diagram] Control: VP CTRL, 2KB code mem. (graphics INSTR), ARM-10 Co-Proc I/F (SIMD INSTR), FETCH, DEC & CTRL, CTRL reg.
[Block diagram] Datapath: input vertex RF, SWZ, general SIMD RF, operands opa / opb / opc, fixed-point SIMD datapath, fixed-point special function unit, output vertex RF 0 / 1 / 2, 32KB display buf.
Connected blocks: Vertex Buf., RE

10 Rendering Engine
[Block diagram] Triangle Setup Engine (TSE): set-up operations, RE instructions, 128b input from Vertex Buf.
[Block diagram] Pixel Processor: interpolation, depth compare (write_mask), texture engine (TM_req, even/odd aligner), blending
Graphics Cache
  Depth / Frame: direct-map
  Texture: 2-way
  8KB depth cache, 12KB frame cache, 3KB texture cache 0, 3KB texture cache 1
MEM I/F to system bus
REclk from PFS, with clock gating

11 Fixed-point Graphics Processing: Low Power Technique (1)
Fixed-point number representation
  Bit index: b7 b6 b5 b4 b3 b2 b1 b0, with b_i ∈ {0,1} (ex: Q4.4 format)
  Fields: sign, m-bit integer part, fraction point, n-bit fraction part
Energy efficiency
  40% less energy than floating-point on average.
[Chart] Fixed-point arithmetic unit vs. floating-point arithmetic unit
  Power consumption: 17% less power
  Execution time: 30% faster
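To make the Q4.4 example concrete, here is a minimal sketch of encoding and decoding a Qm.n value. It assumes two's-complement storage with the sign bit counted among the m integer bits, which the slide does not state; the helper names `to_fixed` and `from_fixed` are illustrative and not taken from the slides.

```python
def to_fixed(x, m=4, n=4):
    """Encode a real number as an (m+n)-bit two's-complement Qm.n pattern.

    Assumption: the sign bit is the MSB of the m-bit integer part.
    Out-of-range values saturate instead of wrapping.
    """
    total = m + n
    raw = round(x * (1 << n))                 # scale by 2^n
    lo, hi = -(1 << (total - 1)), (1 << (total - 1)) - 1
    raw = max(lo, min(hi, raw))               # saturate
    return raw & ((1 << total) - 1)           # keep only m+n bits


def from_fixed(bits, m=4, n=4):
    """Decode an (m+n)-bit two's-complement Qm.n pattern to a float."""
    total = m + n
    if bits & (1 << (total - 1)):             # sign bit set: negative value
        bits -= 1 << total
    return bits / (1 << n)


print(format(to_fixed(2.5), "08b"))    # b7..b0 for 2.5 in Q4.4
print(from_fixed(to_fixed(-1.25)))
```

With n = 4 the resolution is 1/16, so any value that is a multiple of 0.0625 inside the representable range round-trips exactly.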

12 Fixed-point SIMD Datapath (1)
8-stage pipeline with 4-way SIMD
  Parallel accesses of operands
  Forwarding only in general SIMD reg.
  Single cycle throughput for fixed-point MAC
[Pipeline diagram] Stages: F, I, D, E1, E2, E3, E4, W
  Instruction sources: SIMD INSTR (Co-Proc I/F), graphics INSTR (Code Mem.)
  Decode: 1st DEC, 2nd DEC
  Operand access: display buf. ADDR generation and SRAM display buf. access, SIMD reg. index generation and SIMD reg. access, forwarding
  Execution: 4-way 32b integer ALU, CLZ, 4-way 32x16 integer MUL (two instances), CPA array for low 32b result, CPA array for high 32b result, integer DIV / SQRT, SHIFT array, CPA, pipeline regs.
  Writeback: reg. writeback

13 Fixed-point SIMD Datapath (2)
Fast 4-cycle matrix transformation
  50Mvertices/s graphics performance
Broadcasted vector elements (matrix M0..M15, vertex X Y Z W)
  MUL: (M0,  M1,  M2,  M3)  x (X, X, X, X)
  MAC: (M4,  M5,  M6,  M7)  x (Y, Y, Y, Y)
  MAC: (M8,  M9,  M10, M11) x (Z, Z, Z, Z)
  MAC: (M12, M13, M14, M15) x (W, W, W, W)
[Datapath diagram] Convert 32b fixed to 64b integer; two 32x16 MULs; CSA and CPA for the low and high halves; accumulation with bypass of low 32b and bypass of high 32b; SHIFT converts 64b integer to 32b fixed-point result
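The four-step MUL/MAC sequence can be sketched in software as follows. This is only an illustration of the broadcast scheme: it assumes a Q16.16 layout (the slides do not give the format used for vertices), uses Python integers in place of the 64b accumulator, and the names `fx` and `transform` are hypothetical.

```python
FRAC_BITS = 16  # assumed Q16.16 layout; not stated on the slide


def fx(x):
    """Convert a real number to an assumed Q16.16 integer."""
    return int(round(x * (1 << FRAC_BITS)))


def transform(matrix, vertex):
    """Multiply a 4x4 matrix (M0..M15) by a vertex (X, Y, Z, W).

    Each loop iteration mirrors one cycle on the slide: four matrix
    elements are multiplied by one broadcast vertex element and
    accumulated (MUL on the first cycle, MAC on the next three).
    """
    acc = [0, 0, 0, 0]                        # wide accumulators, one per lane
    for cycle in range(4):
        column = matrix[4 * cycle:4 * cycle + 4]
        scalar = vertex[cycle]                # X, Y, Z, then W, broadcast to all lanes
        acc = [a + m * scalar for a, m in zip(acc, column)]
    # A single shift at the end converts the wide sums back to fixed point.
    return [a >> FRAC_BITS for a in acc]


one = fx(1.0)
# Identity with a translation of (5, -2, 0) held in M12..M14.
m = [one, 0, 0, 0,
     0, one, 0, 0,
     0, 0, one, 0,
     fx(5.0), fx(-2.0), 0, one]
print([v / one for v in transform(m, [fx(1.0), fx(2.0), fx(3.0), one])])
```

At the 200MHz clock quoted on the floating-point extension slide, four cycles per vertex works out to 50M transformed vertices per second, which matches the headline figure.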

14 Fixed-point Special Function Unit
Calculate RCP (1/x) and RSQ (1/sqrt(x))
  Fixed-point input, fixed-point output
  No additional data type conversion circuits
Scale up fixed-point dividend to 64b space
[Flow diagram, for Qm.n fixed-point format] Fixed-point input -> count leading zero -> SHIFT -> normalized input; 1.0 in Qm.n and pre-scale value -> radix-4 integer DIV / SQRT (quotient, remainder) -> carry propagate adder -> fixed-point output
Pre-scale value
  RCP: # of leading zero
  RSQ: n/2 + # of leading zero
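The idea of keeping both input and output in fixed point by scaling the dividend up can be illustrated with plain integer arithmetic. The sketch below assumes Q16.16 and does not model the radix-4 iteration or the normalization shift; `rcp`, `rsq`, `leading_zeros` and `prescale` are hypothetical names, and `prescale` only reproduces the table entries from the slide without claiming how the hardware applies them.

```python
from math import isqrt

FRAC_BITS = 16              # assumed Q16.16 (n = 16)
ONE = 1 << FRAC_BITS        # 1.0 in the assumed format


def rcp(x_raw):
    """1/x in fixed point: scale the dividend 1.0 up by 2^n, then integer-divide."""
    if x_raw <= 0:
        raise ValueError("sketch handles positive inputs only")
    return (ONE << FRAC_BITS) // x_raw


def rsq(x_raw):
    """1/sqrt(x) in fixed point, computed as isqrt(2^(3n) / x_raw)."""
    if x_raw <= 0:
        raise ValueError("sketch handles positive inputs only")
    return isqrt((1 << (3 * FRAC_BITS)) // x_raw)


def leading_zeros(x_raw, width=32):
    """Number of leading zero bits of a positive value in a `width`-bit word."""
    return width - x_raw.bit_length()


def prescale(op, x_raw, n=FRAC_BITS):
    """Pre-scale value as tabulated on the slide (illustrative only)."""
    lz = leading_zeros(x_raw)
    return lz if op == "RCP" else n // 2 + lz


x = 4 * ONE                 # 4.0
print(rcp(x) / ONE, rsq(x) / ONE)
```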

15 Efficient Floating-point Extension
Software floating-point emulation
  Add two special instructions: CAS, CLS
  If-else becomes a single-cycle SIMD arithmetic operation.
CAS (controlled ADD / SUB): compare, then the negative test selects between SUB and ADD
CLS (controlled logical shift): compare, then the negative test selects between left shift and right shift
80MFLOPS peak performance at 200MHz
  10x improvement over conventional integer SIMD unit
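A rough software model of the two instructions may help show how they fold an if-else into one SIMD operation. The polarity is an assumption, since the flattened flowchart does not make it recoverable: here a negative control value selects SUB (for CAS) or a right shift (for CLS), and for CLS the magnitude of the control value is used as the shift amount. The function names `cas` and `cls` follow the mnemonics, but their signatures are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # lanes are modeled as 32-bit words


def cas(a, b, ctrl):
    """Controlled ADD/SUB per lane.

    Assumption: ctrl < 0 selects a - b, otherwise a + b.
    """
    return [((x - y) if c < 0 else (x + y)) & MASK32
            for x, y, c in zip(a, b, ctrl)]


def cls(a, ctrl):
    """Controlled logical shift per lane.

    Assumption: ctrl < 0 shifts right by |ctrl|, otherwise left by ctrl.
    """
    out = []
    for x, c in zip(a, ctrl):
        x &= MASK32
        out.append((x >> -c) if c < 0 else ((x << c) & MASK32))
    return out


# Four lanes processed together, each taking its own branch.
print(cas([5, 5, 5, 5], [3, 3, 3, 3], [-1, 1, -1, 1]))
print(cls([8, 8, 8, 8], [-1, 1, -2, 0]))
```

In floating-point emulation, operations of this shape are useful for mantissa alignment (shift direction depends on the sign of the exponent difference) and for signed mantissa addition, without a branch per lane.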

16 Instruction-level Power Management: Low Power Technique (2)
SIMD RF and datapath activation on an instruction-by-instruction basis.
[Diagram] Labels: PFS clock source; vertex shader clock off, enabled when the shader is called; SIMD reg. files; ARM-10 Co-Proc I/F; valid reg.; D latch (D, E, Q) on OP in front of the fixed-point SIMD datapath to prevent signal transitions

17 Programmable Frequency Synthesizer: Low Power Technique (3)
Frequency scaling with clock gating
  Software library on ARM-10: measure workload of current frame, set power mode and target freq.
  Coarse-grained clock gating
  Continuous frequency scaling in three power modes
[Diagram] PFS clock outputs labeled 1x, 0.5x, 0.25x; blocks labeled VS, RE, RE cache
[Chart] ARM-10 CLK. (MHz) per step for Fast, Normal and Slow modes

18 PFS Block Diagram and Measurement
[Block diagram] REF CLK (1MHz), divide by N, CK, PFD, CP, LPF, VCO (UP / DOWN); feedback through prescaler, program counter (P, 4b), swallow counter (S, 4b), RESET; FREQ_CTRL; OP_MODE [FAST / NORMAL / SLOW]; FREQ DIV with 1x, 1/2x, 1/4x outputs and Enable; clocks RISCclk, GEclk, REclk, REMEMclk; Software
CKout = (16P+S) x CK
  Programmable counters tune target freq.
Measured waveform (RISCclk), frequency change in normal mode: 127MHz, 112MHz, 93MHz, 70MHz
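The relation CKout = (16P+S) x CK can be worked through for the measured frequencies. The sketch below assumes CK equals the 1MHz reference (the diagram also shows a divide-by-N stage whose setting is not given) and that P and S are 4-bit values; `ckout_mhz` and `counters_for` are hypothetical helper names.

```python
def ckout_mhz(p, s, ck_mhz=1.0):
    """Output frequency for program counter P and swallow counter S.

    Assumption: both counters are 4 bits wide (0..15), and CK defaults
    to the 1MHz reference clock.
    """
    if not (0 <= p <= 15 and 0 <= s <= 15):
        raise ValueError("P and S are assumed to be 4-bit values")
    return (16 * p + s) * ck_mhz


def counters_for(target_mhz, ck_mhz=1.0):
    """Pick (P, S) so that (16P + S) x CK is as close as possible to the target."""
    n = round(target_mhz / ck_mhz)
    p, s = divmod(n, 16)
    if not 0 <= p <= 15:
        raise ValueError("target is outside the range of 4-bit counters")
    return p, s


for f in (127, 112, 93, 70):          # frequencies from the measured waveform
    p, s = counters_for(f)
    print(f, "MHz -> P =", p, "S =", s, "check:", ckout_mhz(p, s))
```

Under these assumptions the counters give 1MHz steps up to 255MHz, which is consistent with the "continuous frequency scaling" described on the previous slide.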

19 System Power Consumption
[Chart] Power consumption (mW) vs. sustaining graphics performance (polygon/s)
  Performance axis labels: 0.07M 1.56M 2.77M M 10M
  Annotations: 50x improvement, 26% reduction
Configurations
  A: No Vertex Shader (conventional)
  B: Integer SIMD Processor (conventional)
  C: Floating-point Graphics Processor (conventional)
  D: This work (w/ light and texture)
  E: This work (w/o light and texture)
Power breakdown
  Vertex Shader
  RE with Graphics Mem.
  RISC with I/D $
  Power Manager
  Others (BUS, IO)

20 Performance Comparison
Performance index of portable 3D graphics
  Processing speed / power consumption (Kvertices/s per mW)
  Analogous to MIPS/mW
[Chart] Data labels: 5.0; ISSCC 2003 (2.3) V, 75MHz; ISSCC 2004 (18.4) 1.25V, 400MHz; ISSCC 2004 (18.5) 1.8V, 200MHz; This Work
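As a worked example of the index, the peak figures quoted in the summary (50 Mvertices/s and 155 mW) can be plugged into the definition. This is an estimate from those two numbers, not the value plotted on the slide, which did not survive extraction; `performance_index` is a hypothetical helper name.

```python
def performance_index(vertices_per_s, power_mw):
    """Kvertices/s per mW: processing speed divided by power consumption."""
    return (vertices_per_s / 1e3) / power_mw


# Peak rate and power figure taken from the summary slide.
peak = performance_index(50e6, 155.0)
print(round(peak, 1), "Kvertices/s per mW")
```

Because the peak vertex rate and the stated power figure may not correspond to the same operating point, the result should be read as an upper-level estimate rather than a measured value.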

21 Die Photograph
0.18μm CMOS 1P6M
36 mm², 256-pin PBGA
Power supply
  1.8V: Core
  3.3V: IO
Transistors
  2M Logic
  96KB SRAM
Power consumption
  Less than 155 mW

22 Summary
Low power 3D graphics processor for mobile applications
  Integration of full 3D graphics pipeline with programmable vertex shader
  ARM-10 coprocessor architecture with dual operations
Low power techniques
  Energy-efficient fixed-point SIMD datapath
  Instruction-level power management
  Programmable frequency synthesizer
50 Mvertices/s peak graphics performance
155 mW power consumption
