Vertex Shader Design I

The following content is extracted from the paper cited on the next page. If any citation is wrong or a reference is missing, please contact ldvan@cs.nctu.edu.tw and the error will be corrected as soon as possible. This material is for course use only; please do NOT redistribute it. Thank you.

Vertex Shader Design I
Lan-Da Van (范倫達), Ph.D.
Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
Fall 2016

Source: J. H. Sohn, J. H. Woo, M. W. Lee, H. J. Kim, R. Woo, and H. J. Yoo, "A 155-mW 50-Mvertices/s graphics processor with fixed-point programmable vertex shader for mobile applications," IEEE J. Solid-State Circuits, vol. 41, no. 5, pp. 1081-1091, May 2006.

Outline
Introduction
System Architecture
System Operations
    Data Transfer Flow
    Dual Operations
Programmable SIMD Vertex Shader
    Internal Architecture
    Pipeline Structure
Fixed-Point Graphics Processing
SIMD Multiply Engine

Introduction (1/2)
For wireless applications, low power consumption is the most important issue because of limited battery lifetime.
Reduced instruction set computer (RISC) architectures are widely used as the main platforms for wireless applications because of their high MIPS/watt.
However, these low-power RISC platforms have very limited system resources in terms of computation power and memory bandwidth.

Introduction (2/2)
Moreover, since users view 3-D graphics images on a small screen held very close to their eyes, the graphics processor must generate high-quality images using advanced graphics algorithms. These algorithms increase image quality and/or lower memory bandwidth usage, and thus also lower power consumption.
In previous works, graphics processors did not integrate a vertex shader and lacked processing parallelism for streaming graphics data.
In this work, we designed and implemented a graphics processor with a programmable fixed-point single-instruction-multiple-data (SIMD) vertex shader for mobile applications.

System Architecture (1/2)
[Figure: Block diagram of the graphics processor]

System Architecture (2/2)
The vertex shader is implemented as an ARM-10 co-processor and processes all per-vertex operations such as geometry transformation and lighting calculation.
Primitive assembly, such as clipping and culling, is also performed by the vertex shader in collaboration with the RISC processor.
The rendering engine is responsible for rasterization and per-pixel operations such as pixel blending and texture mapping.
A rendering engine instruction is composed of transformed vertex coordinates, texture coordinates and the lit vertex color.
The PFS (Programmable Frequency Synthesizer) reduces the dynamic power consumption of the chip by clock gating and frequency scaling.

System Operations (1/8)
A. Data Transfer Flow
The overall system performance of graphics hardware depends not only on the performance of the individual hardware acceleration blocks but also on the communication cycles needed to transfer graphics data between external memory and the graphics hardware.

System Operations (2/8)
In this data flow, the hardware-instruction path and the graphics-data path are separated from each other to improve streaming processing within the available memory bandwidth.

System Operations (3/8)
The vertex data stored in the data cache of the ARM-10 processor are transferred by the vertex-attribute-move instruction of the vertex shader, which is mapped onto the co-processor register transfer instruction of the ARM-10 processor.
The separation of instruction and data paths increases the processing parallelism of the hardware blocks and reduces the required bus bandwidth.

System Operations (4/8)
B. Dual Operations
(a) Tightly Coupled Co-processor (TCC) state
(b) Parallel Processor (PP) state

System Operations (5/8)
(a) Tightly Coupled Co-processor (TCC) state
Instructions in this state are conditionally executed: they do not affect memory or registers unless the arithmetic flags (negative, zero, carry out and overflow) of the ARM-10 processor satisfy a condition specified in the instruction.
In the vertex shader, SIMD control flags such as arithmetic flags, saturation, overflow and underflow are updated after the execution of every SIMD data-processing instruction and can be moved to the program status register (PSR) of the ARM-10 processor.
The general SIMD instructions, such as arithmetic and movement operations, are implemented in the TCC state and perform the clipping and back-face culling operations of the 3-D graphics pipeline.
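
As an illustration of the kind of sign test that back-face culling reduces to, here is a minimal Python sketch. It is not taken from the paper: the function names are hypothetical, and it assumes counter-clockwise front faces in a y-up screen space. The sign of the result plays the role that a negative flag would play in a conditionally executed instruction.

```python
def signed_area2(p0, p1, p2):
    """Twice the signed area of a screen-space triangle (integer inputs)."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])


def is_back_facing(p0, p1, p2):
    """True if the triangle should be culled.

    Assumption: front faces are wound counter-clockwise, so a non-positive
    signed area means the triangle faces away or is degenerate.
    """
    return signed_area2(p0, p1, p2) <= 0


front = [(0, 0), (4, 0), (0, 4)]   # counter-clockwise winding
back = [(0, 0), (0, 4), (4, 0)]    # same triangle, clockwise winding
```

Only integer multiplies and subtracts are needed, which is why such a test suits a fixed-point datapath.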

System Operations (6/8)
(b) Parallel Processor (PP) state
In this state, the vertex shader behaves like an independent processor and does not need any control from the ARM-10 processor.
The PP state has a separate graphics instruction set, different from the general SIMD instructions of the TCC state.

System Operations (7/8)
To maintain the communication protocol of the ARM-10 co-processor interface, the vertex shader drives the co-processor busy (CPbusy) signal to the ARM-10 processor in the PP state, blocking the next co-processor instruction from the ARM-10 processor for synchronization.
The vertex shader executes independent vertex program code while the ARM-10 processor runs its main application program or even stalls on a cache miss.
Various user-defined vertex processing operations such as geometry transformation and lighting calculations can be performed on the current vertex input while the next vertex data is fetched from the ARM-10 processor.

System Operations (8/8)
The graphics instruction set is a subset of the general SIMD instructions with graphics extensions such as source swizzling and write-masks.
In the PP state, more register file sets can be used as input operands of instructions in the programmer's model.
The general SIMD instructions in the TCC state can also accelerate various multimedia functions beyond 3-D graphics, such as MPEG-4 video.
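
Source swizzling and write-masks can be sketched in a few lines of Python. This is only an illustration: the `xyzw` component letters follow the common shader-assembly convention, and the helper names are hypothetical; the chip's actual instruction encoding is not given in these slides.

```python
COMPONENT = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(src, pattern):
    """Reorder or replicate the four lanes of a source operand."""
    return [src[COMPONENT[c]] for c in pattern]


def masked_write(dst, value, mask):
    """Write only the lanes named in the mask; other lanes keep their old value."""
    out = list(dst)
    for c in mask:
        out[COMPONENT[c]] = value[COMPONENT[c]]
    return out


reversed_lanes = swizzle([1, 2, 3, 4], "wzyx")             # lanes in reverse order
broadcast_x = swizzle([1, 2, 3, 4], "xxxx")                # replicate lane x
partial = masked_write([0, 0, 0, 0], [5, 6, 7, 8], "xz")   # update x and z only
```

Both features save instructions: a swizzle removes the need for separate lane-shuffle moves, and a write-mask lets one instruction update part of a register without disturbing the rest.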

Programmable SIMD Vertex Shader (1/4)

Programmable SIMD Vertex Shader (2/4)
In the control part, there is a 2 kb code memory that stores the vertex program code of graphics instructions.
The vertex program control unit (VPCTRL) issues the graphics instructions without control from the ARM-10 processor.
The contents of the control register determine the operating state.
The two operating states, the TCC state and the PP state, share all of the hardware blocks except the instruction fetch units.

Programmable SIMD Vertex Shader (3/4)
The fixed-point vector unit is responsible for all SIMD arithmetic operations such as addition and multiplication.
The special function unit (SFU) is responsible for reciprocal (RCP) and reciprocal square root (RSQ) operations.
The display buffer, implemented as a 32 kb synchronous SRAM, stores graphics constants such as the transformation matrix, lighting parameters and lookup table entries.
Examples of lighting parameters: normal vectors, view vectors, light vectors

Programmable SIMD Vertex Shader (4/4)
For streaming graphics processing, the vertex shader contains multiple register files:
Vertex Input Registers (VIR)
    Hold the vertex attributes such as position and normal vector
    Feed into the fixed-point SIMD datapath
Vertex Output Registers (VOR)
    There are three vertex output register files for caching vertex data during primitive assembly; only one of them is accessible in the vertex program
Vertex General SIMD Registers (VGR)
    Store temporary results during vertex program execution

Pipeline Structure (1/2)

Pipeline Structure (2/2)
For programmable shading, operands in the SRAM display buffer and the SIMD register files are accessed at the same time in the decode stage.
In the execution stage, there are three separate pipelines:
    SIMD arithmetic-and-logic (ALU) pipeline
    SIMD multiply pipeline
    SFU pipeline
To reduce design complexity, register forwarding logic between pipeline stages is used only for the general SIMD register file.

Fixed-Point Graphics Processing (1/3)
Qm.n denotes the format of a fixed-point number, where m is the number of bits used for the integer part and n is the number of bits used for the fraction part.
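
To make the Qm.n notation concrete, the following Python sketch converts between real values and their integer encoding. The Q16.16 layout used in the example is an illustrative choice, not a format confirmed by the slides, and counting the sign bit as part of m is an assumption (conventions differ).

```python
def to_fixed(value: float, n: int) -> int:
    """Encode a real number with n fraction bits: scale by 2**n and round.

    Note: Python's round() resolves exact ties to the nearest even integer.
    """
    return round(value * (1 << n))


def to_float(raw: int, n: int) -> float:
    """Decode a fixed-point integer with n fraction bits back to a real number."""
    return raw / (1 << n)


def q_range(m: int, n: int) -> tuple:
    """Smallest and largest representable values of a two's-complement Qm.n word.

    Assumption: the m integer bits include the sign bit.
    """
    lowest = -(1 << (m + n - 1))
    highest = (1 << (m + n - 1)) - 1
    return to_float(lowest, n), to_float(highest, n)


raw = to_fixed(1.5, 16)       # Q16.16 encoding of 1.5
value = to_float(raw, 16)     # decodes back to 1.5
lo, hi = q_range(16, 16)      # range of a 32-bit Q16.16 word
```

The resolution of the format is 2**-n, so increasing n trades integer range for fractional precision within the same word width.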

Fixed-Point Graphics Processing (2/3)
In this work, fixed-point number representation is used instead of floating-point format.
The simple integer datapath of a fixed-point unit can achieve a higher clock frequency while consuming less power than a floating-point unit, yielding a reduction in total energy.
To make the fixed-point arithmetic operations more useful, hardware status registers automatically indicate overflows and underflows that occur in the multiplication of two fixed-point numbers.
These status registers can be used to check for errors in the fixed-point arithmetic without extra cycle penalties or degradation of SIMD parallelism.
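
A software model of a fixed-point multiply with status flags might look like the sketch below. Several details are assumptions made for illustration and are not specified in the slides: the Q16.16 layout, saturation on overflow, flushing results smaller than one LSB to zero, and flags that stay set until cleared.

```python
from dataclasses import dataclass

FRAC_BITS = 16                           # assumed Q16.16 layout
WORD_BITS = 32
MAX_RAW = (1 << (WORD_BITS - 1)) - 1
MIN_RAW = -(1 << (WORD_BITS - 1))


@dataclass
class Status:
    """Sticky flags shared by all lanes (assumed behavior)."""
    overflow: bool = False
    underflow: bool = False


def fx_mul(a: int, b: int, status: Status) -> int:
    """Multiply two fixed-point values and record range errors in status."""
    product = a * b                      # exact double-width product
    if product != 0 and abs(product) < (1 << FRAC_BITS):
        status.underflow = True          # nonzero result below one LSB
        return 0
    result = product >> FRAC_BITS        # rescale to FRAC_BITS fraction bits
    if result > MAX_RAW or result < MIN_RAW:
        status.overflow = True           # result does not fit the word
        result = max(MIN_RAW, min(MAX_RAW, result))
    return result


def lanes_mul(xs, ys, status: Status):
    """Apply fx_mul lane by lane, as a SIMD multiply would."""
    return [fx_mul(x, y, status) for x, y in zip(xs, ys)]


ONE = 1 << FRAC_BITS
flags = Status()
out = lanes_mul([ONE, MAX_RAW, 1, -ONE], [ONE, 2 * ONE, 1, 3 * ONE], flags)
```

Because the flags are set as a side effect of the multiply itself, a program can run a whole sequence of SIMD operations and test the flags once at the end, which is the "no extra cycle penalty" property described above.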

Fixed-Point Graphics Processing (3/3)
[Figure: Block diagram of the SIMD ALU]

SIMD Multiply Engine (1/4)
Since multiplication-equivalent instructions account for most of the time in graphics operations, the fixed-point MAC unit is designed for a throughput of one operation per cycle.
In addition, a fast 4-cycle matrix transformation (TRFM) is implemented.
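
One way to see how a 4x4 matrix transformation can fit into four SIMD steps is to process the matrix column by column: one 4-lane multiply followed by three 4-lane multiply-accumulates. The Python sketch below models this. The decomposition, the Q16.16 format, and the names `vec_mul`, `vec_mac` and `trfm` are assumptions for illustration; the slides do not describe how TRFM is scheduled internally.

```python
FRAC_BITS = 16                 # assumed Q16.16 layout
ONE = 1 << FRAC_BITS


def vec_mul(a, b):
    """4-lane fixed-point multiply."""
    return [(x * y) >> FRAC_BITS for x, y in zip(a, b)]


def vec_mac(acc, a, b):
    """4-lane fixed-point multiply-accumulate: acc + a * b per lane."""
    return [c + ((x * y) >> FRAC_BITS) for c, x, y in zip(acc, a, b)]


def trfm(columns, v):
    """Transform vector v by a 4x4 matrix given as four columns.

    Step 1 is a multiply, steps 2 to 4 are multiply-accumulates, each using
    one component of v broadcast across all four lanes.
    """
    acc = vec_mul(columns[0], [v[0]] * 4)
    for k in range(1, 4):
        acc = vec_mac(acc, columns[k], [v[k]] * 4)
    return acc


# Translation by (10, 20, 30), stored column by column.
translate = [
    [ONE, 0, 0, 0],
    [0, ONE, 0, 0],
    [0, 0, ONE, 0],
    [10 * ONE, 20 * ONE, 30 * ONE, ONE],
]
point = [1 * ONE, 2 * ONE, 3 * ONE, ONE]
moved = trfm(translate, point)
```

Each step depends on the accumulator produced by the previous one, which is why a bypass path from the multiplier into the accumulator input matters for keeping one step per cycle.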

SIMD Multiply Engine (2/4)
However, fixed-point MUL and MAC operations require two-cycle integer multiplications and two-cycle integer additions, leading to a 4-cycle latency.
To resolve data dependencies between these MUL and MAC operations, the intermediate value of the integer multipliers can be bypassed to the accumulator input of the integer adders in the SIMD multiply engine.

SIMD Multiply Engine (3/4)
With this scheme, the graphics processor achieves a peak graphics performance of 50 Mvertices/s for parallel projection at a 200 MHz operating frequency.
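
The peak figure can be checked with simple arithmetic. The two input numbers come from the slide; reading the result as one 4-cycle TRFM per vertex is an inference, not a statement from the paper.

```python
clock_hz = 200_000_000           # 200 MHz operating frequency
peak_vertices_per_s = 50_000_000 # 50 Mvertices/s peak rate

# Cycle budget available per vertex at the peak rate.
cycles_per_vertex = clock_hz // peak_vertices_per_s
```

The budget works out to four cycles per vertex, the same as the 4-cycle matrix transformation.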

SIMD Multiply Engine (4/4)
[Figure: Hardware architecture of a single fixed-point multiplier unit in the SIMD multiply engine]

Instruction-level Power Management

Rendering Engine and Clock Gating

Programmable Frequency Synthesizer (PFS)

Performance Results (1/3)

Performance Results (2/3)

Performance Results (3/3)

Conclusion
A low-power graphics processor is designed and implemented with a co-processor architecture and dual operations.
Fixed-point graphics processing
Instruction-level power management of the vertex shader
Pixel-level clock gating of the rendering engine
Programmable clocking of the PFS