Reconfigurable Cell Array for DSP Applications


Chenxin Zhang
Department of Electrical and Information Technology, Lund University, Sweden
ETI180 DSP-design, Dec. 6th, 2011

Outline
Reconfigurable computing
Coarse-grained reconfigurable cell array: processing cell, memory cell, network router cell, system reconfiguration
Reconfigurable FIR
Reconfigurable FFT processor
Multi-standard OFDM coarse time synchronization

Reconfigurable computing
High-performance real-time DSP computing.
Updates on the data path in addition to the control flow.
Combines flexibility with high performance at a feasible hardware cost.
Software-centric programming approach.
Coarse-grained granularity: a trade-off between efficiency, flexibility, and programmability.
Dynamic reconfigurability.


Multiple standards
Standards: GPS, WCDMA, BT, Cellular, DVB-H, LTE-A, WLAN, WiMAX, 5G?
Hardware sharing between media processors, application processors, and software-defined hardware.
Accelerators: poor hardware reusability (processing chain A, B, C).
Reconfigurable architecture:
+ Multi-task
+ Multi-standard
+ Multi-algorithm
- Control overhead, e.g. area, power

Application-specific DSP: performance vs. flexibility
Tensilica ConnX Baseband Engine
Specialized hardware (ASIC)
+ High performance, small size, low power
- Less flexible, manufacturing defects
- High NRE cost
Standard processor (GPP, DSP)
+ Flexible, short design time
- Lack of computation capacity
Fine-grained reconfigurable architecture (FPGA)
+ High calculation capacity, flexible
- Routing overhead, high power consumption
- Hardware-oriented design approach


Tabula Spacetime: ultra-rapid reconfiguration (multi-GHz rates), 2.5x logic density, 3.7x DSP performance.

Coarse-grained reconfigurable architecture
+ High calculation capacity and flexible
+ Software oriented: relatively fast development
+ Tolerance to manufacturing defects
- Sacrificed area and energy efficiency compared to ASICs
- Sacrificed mapping flexibility compared to FPGAs

CGRA related work
System infrastructure:
ALU clusters: MathStar FPOA, RICA. Instruction-level and data-level parallelism; SIMD or VLIW. (Courtesy: MathStar FPOA architecture guide.)
Processor array: RAW, WPPA, REMARC. Instruction-level, data-level, and task-level parallelism; MIMD. (Courtesy: D. Kissler et al., A Highly Parameterizable Parallel Processor Array Architecture.)
Hybrid structure: ADRES, PACT XPP. Instruction-level, data-level, and task-level parallelism; SIMD or VLIW and MIMD. Combined complexity? (Courtesy: PACT: XPP-III Processor Overview.)

An array of resource cells
Heterogeneous cell array: processing cell, memory cell, accelerator (e.g. no configuration).
Hierarchical cell array.
Figure labels: Coeff. gen, Addr. gen.


Resource cell
Dedicated local interconnections: high data throughput.
Hierarchical global routing network: flexible global data transmission, external data access, global cell (re)configuration.
Data-driven synchronization.
Single-cycle-per-hop latency.
AMBA 4 AXI4-Stream protocol.
GALS network data transmission.
Figure labels: ports L0, L1, L2, L3, G0, C; processing shell; network adapter; local IO ports; global IO port.

Processing cell
Processing core: ALU, DSP, SIMD, VLIW, CORDIC...
Implicit load-store operations in all instructions.
Run-time control and conditional reconfiguration.
In-cell NoC supervision and reconfiguration.

Example 1: Generic signal processing cell
P3 = f(P1, P2)

Example 2: Dataflow processing cell (I)
4 pipeline stages (figure labels: PC; L0, L1, ..., Lx, G; IF/ID, ID/EXE, EXE/WB; branch; register; operation controller).
Hybrid load-store and memory-memory architecture.
Compact program size (memory references).
With external memory cells:
Complex addressing modes, e.g. memory indirect, auto-increment.
Flexible usage: program/data memory, processor stack, (cache).
Single-cycle delayed branch.
Zero-delay conditional inner loop control.
Data path: input arrangement MUX, arithmetic/logic selection, output arrangement MUX, output MUX.
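The data-driven synchronization mentioned for the resource cell can be illustrated with the valid/ready handshake that AXI4-Stream uses: a word moves only in a cycle where the sender asserts valid and the receiver asserts ready. The following is a minimal cycle-based sketch, not the slides' implementation; the function name `simulate_handshake` and the per-cycle `ready_pattern` input are assumptions made for illustration.

```python
def simulate_handshake(words, ready_pattern):
    """Cycle-based model of a valid/ready link between two cells (hypothetical helper).

    words         -- payload the sending cell wants to transmit, in order
    ready_pattern -- one boolean per cycle: is the receiving cell ready?
    Returns (received_words, trace) where trace holds (cycle, valid, ready, fired).
    """
    pending = list(words)
    received = []
    trace = []
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)          # sender asserts valid while it has data
        fired = valid and ready        # a transfer happens only when both are high
        if fired:
            received.append(pending.pop(0))
        trace.append((cycle, valid, ready, fired))
    return received, trace


if __name__ == "__main__":
    got, trace = simulate_handshake([10, 20, 30], [True, False, True, False, True, True])
    print(got)
    for row in trace:
        print(row)
```

Because the receiver can stall the link simply by deasserting ready, neither cell needs a shared global schedule, which is one reason the style suits a GALS network.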


Example 2: Dataflow processing cell (II)
Dynamic data path reconfiguration across the input arrangement MUX, arithmetic/logic selection, output arrangement MUX, and output MUX.
SIMD/VLIW-like operation: 2/4-way 16/8-bit independent data processing.
Multi-level data processing (implicit prolog and epilog processing).
Dual-operand instruction set: dual-opcode and dual-operand, e.g. ADDSUB [d1], [d2], [s1], [s2].
Vector operation option, e.g. complex number arithmetic.
Conditional instruction execution.

Dataflow processing cell: run-time data arrangement
Complex number multiplication vs. real number multiplication:
MUL R3, R1, R2 ; R3 = R1 * R2
where {a, b} is stored in R1 and {c, d} is stored in R2.

Memory cell (I)
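To make the difference between the two MUL interpretations concrete, here is a small sketch of a 32-bit register holding two 16-bit fields. The packing order (first element in the upper half), the 16-bit wrap-around of results, and all function names are assumptions for illustration; the slides do not specify scaling or saturation behavior.

```python
def to_s16(value):
    """Wrap an integer to a signed 16-bit value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack(hi, lo):
    """Pack two signed 16-bit fields into one 32-bit register word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack(word):
    """Return (hi, lo) as signed 16-bit values."""
    return to_s16(word >> 16), to_s16(word)


def mul_real(r1, r2):
    """2-way real multiplication: {a, b} * {c, d} -> {a*c, b*d}."""
    a, b = unpack(r1)
    c, d = unpack(r2)
    return pack(to_s16(a * c), to_s16(b * d))


def mul_complex(r1, r2):
    """Complex multiplication: (a + jb)(c + jd) = (ac - bd) + j(ad + bc)."""
    a, b = unpack(r1)
    c, d = unpack(r2)
    return pack(to_s16(a * c - b * d), to_s16(a * d + b * c))


if __name__ == "__main__":
    r1, r2 = pack(3, 4), pack(1, 2)
    print(unpack(mul_real(r1, r2)))     # lane-wise products
    print(unpack(mul_complex(r1, r2)))  # real and imaginary parts
```

The complex case needs four multiplications and a cross-lane data arrangement, which is what the input and output arrangement MUXes would have to provide at run time.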

Memory cell (II)
Memory descriptor (DSC).

Memory cell (III)
Figure: packing of I/Q data into a memory word (PC0 -> MC0). Bit positions marked: 31, 27, 23, 19, 16, 11, 7, 3, 0.
(a) Sign, in-phase; sign, quadrature; 12 bits -> 4 bits.
(b) Sign I, sign Q.
(c) Logic OR of: shift by 0 and mask, shift by 4 and mask, shift by 16 and mask, shift by 20 and mask.
(d) After 4 iterations, address X holds: 4(I) 3(I) 4(Q) 3(Q) 2(I) 1(I) 2(Q) 1(Q).

Memory cell (IV): Reconfiguration
Individual memory DSC loading and tracing.
Memory DSC execution program.
Memory DSC execution mode: restart, resume.
Memory data dump (debug).

Network router cell (I)
Cell structure: decision unit, routing structure.
Routing structure: parallel network, MUX-DEMUX switch.
Output packet queue (FIFOs).
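The shift-and-mask packing in the Memory cell (III) figure can be sketched as follows. This is one reading of the figure, not a confirmed description: it assumes each incoming word already carries a 4-bit I field at bits [11:8] and a 4-bit Q field at bits [3:0], and that four such words are shifted by 0, 4, 16, and 20 bits and OR-ed into one 32-bit word. The names `pack_iq` and `SHIFTS` are hypothetical.

```python
SHIFTS = (0, 4, 16, 20)   # per-iteration shift amounts read off the figure
FIELD_MASK = 0x0F0F       # keeps the 4-bit I field [11:8] and Q field [3:0] (assumed layout)


def pack_iq(samples):
    """Pack four (i, q) pairs of 4-bit values into one 32-bit word.

    Resulting nibbles, most significant first:
    4(I) 3(I) 4(Q) 3(Q) 2(I) 1(I) 2(Q) 1(Q)
    """
    if len(samples) != len(SHIFTS):
        raise ValueError("expected exactly four (i, q) samples")
    word = 0
    for (i, q), shift in zip(samples, SHIFTS):
        sample_word = ((i & 0xF) << 8) | (q & 0xF)   # I at [11:8], Q at [3:0]
        word |= (sample_word & FIELD_MASK) << shift  # shift, mask, then logic OR
    return word & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(pack_iq([(1, 5), (2, 6), (3, 7), (4, 8)])))
```

With these shift amounts the fields of the four samples interleave without overlapping, so a plain OR is enough to merge them.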


Network router cell (II)
Decision unit:
Static routing table.
Managing data transactions: check-in; packet arbitration (MUX-DEMUX switch), fixed or round-robin; data broadcast; configure routing path.

Static & Dynamic configuration (I)
Action list with candidate transactions, outputs O(0) to O(4), parallel network:
In(0): o
In(1): o o o
In(2): x
In(3): o
In(def): x
Action list with candidate transactions, outputs O(0) to O(4), MUX-DEMUX switch:
In(0): x
In(1): o x x
In(2): x
In(3): x
In(def): x

Static & Dynamic configuration (II)
Figure labels: memory, master, icache, dcache, MPMC, Conf. Ctrl, Stream Ctrl.

Case study: Reconfigurable FIR
FIR filter mapped onto cells M1 to M4.
Processing cell: MAC.
Memory cell: input data FIFO, coefficient ROM.
Time-multiplexed structure for area-driven applications.
Unfolding (parallelizing) to improve processing throughput.
High-precision computations.
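The trade-off between the time-multiplexed and the unfolded FIR mapping can be sketched in software. This is only a behavioral model under stated assumptions: one MAC operation counts as one cycle, the unfolded version distributes taps round-robin over `n_macs` parallel processing cells, and both function names are hypothetical.

```python
def fir_time_multiplexed(samples, coeffs):
    """Single MAC reused for every tap: len(coeffs) MAC cycles per output."""
    fifo = [0] * len(coeffs)          # models the input data FIFO in a memory cell
    outputs, cycles = [], 0
    for x in samples:
        fifo = [x] + fifo[:-1]        # shift the newest sample in
        acc = 0
        for tap, c in zip(fifo, coeffs):
            acc += tap * c            # one multiply-accumulate per cycle
            cycles += 1
        outputs.append(acc)
    return outputs, cycles


def fir_unfolded(samples, coeffs, n_macs):
    """Taps split over n_macs parallel MACs: about len(coeffs)/n_macs cycles per output."""
    fifo = [0] * len(coeffs)
    outputs, cycles = [], 0
    for x in samples:
        fifo = [x] + fifo[:-1]
        partial_sums, longest = [], 0
        for m in range(n_macs):
            taps = list(zip(fifo[m::n_macs], coeffs[m::n_macs]))
            partial_sums.append(sum(t * c for t, c in taps))
            longest = max(longest, len(taps))
        cycles += longest             # the parallel MACs finish together
        outputs.append(sum(partial_sums))
    return outputs, cycles


if __name__ == "__main__":
    print(fir_time_multiplexed([1, 2, 3, 4], [1, 1, 1, 1]))
    print(fir_unfolded([1, 2, 3, 4], [1, 1, 1, 1], n_macs=2))
```

Both functions produce the same output sequence; only the cycle count changes, which mirrors trading cell count (area) for throughput.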

Case study: Reconfigurable FFT processor
Radix-2^2 structure: basic radix-2^2 FFT building block.
Folding: a 2,048-point radix-2^2 pipeline FFT.

Radix-2^2 pipeline FFT mapping options
Simple mapping: simple to scale up; local communication only; high storage capacity demand in each single memory cell.
Simple mapping with concatenated memory cells: low storage capacity demand in each single memory cell; global data communications.
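For readers unfamiliar with the radix-2^2 decomposition, the sketch below shows its arithmetic in a recursive, decimation-in-frequency form: two radix-2 butterfly stages linked by a trivial multiplication by -j, followed by the non-trivial twiddle factors. It is a software illustration only and does not model the pipelined, folded hardware mapping from the slides; the function name `fft_radix22` is an assumption, and the input length must be a power of two.

```python
import cmath


def fft_radix22(x):
    """Radix-2^2 DIF FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:                        # leftover radix-2 stage, e.g. for N = 2 * 4^k
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n // 2, n // 4

    # First butterfly stage (distance N/2).
    a = [x[i] + x[i + half] for i in range(half)]
    b = [x[i] - x[i + half] for i in range(half)]

    def w(k):
        """Twiddle factor W_N^k."""
        return cmath.exp(-2j * cmath.pi * k / n)

    # Second butterfly stage (distance N/4) with the trivial -j rotation,
    # then the non-trivial twiddles W_N^(m*i) for output residue m.
    branches = [
        [a[i] + a[i + quarter] for i in range(quarter)],                     # m = 0
        [(b[i] - 1j * b[i + quarter]) * w(i) for i in range(quarter)],       # m = 1
        [(a[i] - a[i + quarter]) * w(2 * i) for i in range(quarter)],        # m = 2
        [(b[i] + 1j * b[i + quarter]) * w(3 * i) for i in range(quarter)],   # m = 3
    ]

    out = [0j] * n
    for m, branch in enumerate(branches):
        sub = fft_radix22(branch)
        for k in range(quarter):
            out[4 * k + m] = sub[k]   # branch m yields the bins X[4k + m]
    return out


if __name__ == "__main__":
    print([round(abs(v), 6) for v in fft_radix22([1, 0, 0, 0, 0, 0, 0, 0])])
```

Only every second stage needs a general complex multiplier, which is the property that makes the radix-2^2 building block attractive for a pipeline.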

Time-multiplexed FFT (I)
Time-multiplexed FFT (II)

FFT benchmark comparison
Columns: architecture; f_max [MHz]; FFT size [point]; execution time [cc]; code size [byte]; reconfiguration code size [byte].
CGRA: [...],242; 9,943; 1,[...]
Texas Instruments TMS320VC[...]: [...],389; 25,[...]; 462 (code reload)
ARM926EJ-S: 276; 256; 1024; 13,194; 66,196; -; -
Rapid system reconfiguration: 40 ns @ 300 MHz.
High performance: 2.5x vs. DSPs, 6.5x vs. GPPs.

Case study: Multi-standard OFDM synchronization
Multiple wireless radio standards.
Concurrent data stream processing.
Coarse time synchronization: $\hat{\theta} = \arg\max_{\theta} \{ \gamma[\theta] \}$
Carrier frequency offset (CFO) estimation.

Implementation results (I)
65 nm low-power regular-VT CMOS.
Area: 0.48 mm^2.
Clock frequency: 534 MHz.
Adaptive word length scheduling.
Adoption of different algorithms, e.g. novel sign-bit OFDM acquisition.
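As an illustration of coarse time synchronization by maximizing a timing metric γ[θ], the sketch below uses a generic delay-and-correlate metric over a preamble that repeats after L samples. This is a common textbook technique substituted here for illustration; it is not the sign-bit acquisition algorithm named in the slides, and the function names and the choice of metric are assumptions.

```python
def timing_metric(r, length):
    """gamma[theta] = |sum_k r[theta + k] * conj(r[theta + k + L])| for k = 0..L-1."""
    last = len(r) - 2 * length
    return [
        abs(sum(r[theta + k] * complex(r[theta + k + length]).conjugate()
                for k in range(length)))
        for theta in range(last + 1)
    ]


def coarse_time_sync(r, length):
    """Return the theta that maximizes the timing metric."""
    gamma = timing_metric(r, length)
    return max(range(len(gamma)), key=gamma.__getitem__)


if __name__ == "__main__":
    preamble = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
    received = [0j] * 5 + preamble + preamble + [0j] * 5   # noise-free toy signal
    print(coarse_time_sync(received, len(preamble)))
```

In the noise-free toy signal the metric peaks exactly where the repeated preamble starts; with noise and a frequency offset the peak broadens, which is why this step is only a coarse estimate.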

Summary
A reconfigurable cell array enables hardware sharing at different levels, i.e., task, function, and algorithm level.
The coarse-grained reconfigurable cell array comprises distributed processing and memory cells and a hierarchical NoC structure.
In-cell dynamic reconfiguration enables fast context switching.
