Development and synthesis of adaptive multi-grained reconfigurable hardware architecture for dynamic function patterns (AMURHA)



1 Development and synthesis of adaptive multi-grained reconfigurable hardware architecture for dynamic function patterns (AMURHA)
Alexander Thomas
Institut für Technik der Informationsverarbeitung (ITIV), Universität Karlsruhe (TH)
Prof. Dr.-Ing. Klaus Müller-Glaser, Prof. Dr.-Ing. Jürgen Becker

2 AMURHA Project Overview
Main goal: development and implementation of a new reconfigurable array-based hardware architecture, the HoneyComb architecture
Hardware goals:
- Hardware exploration: integration of adaptive switching circuits
- The resulting architecture implements a set of new features, such as adaptive online routing, multi-context data paths, and a programmable I/O interface
- Highly parametrizable hardware description (RTL)
- Synthesis and layout, resulting in a chip prototype (tape-out expected Oct. 2009)
- Specification of the final system demonstrator
Software goals:
- Programming model for the new architecture
  - Programming language specification
  - Compiler design
- Visualization and simulation tools
- Configuration manager / SuperConfigurator
- Runtime environment (not finished)
- Applications for architecture demonstration

3 HoneyComb Platform
[Figure: tool flow of the HoneyComb platform. Design environment: porting of applications / algorithms, HCL description, offline Mapper and Configuration-Template-Generator (MCTG), assembler code, transformation rules, Super-Configuration-Generator, assembler, configuration template, executable code, RTL configuration. Runtime environment: Dynamic Allocation and Distribution Manager (dadm), configuration control, parameters, HoneyComb architecture. Debugging and simulation environment: HCViewer, HCSim, simulation data. Legend: implemented path, open path, implemented modules / tools, open implementation]

4 HoneyComb: Array-based Reconfigurable Architecture
Main features:
- Hexagonal cell structure
- Three different cell types:
  - Datapath-HoneyComb-Cell
  - Memory-HoneyComb-Cell (MEMHC)
  - Input/Output-HoneyComb-Cell (IOHC)
- Multi-grained data types / multi-context datapath cells
- Programmable I/O interfaces (IOHC)
- Hardware-supported online routing
- Fully synchronized communication network
- Two clock domains
- Two-level clock gating
Unified cell structure containing:
- Routing Unit
  - Part of the communication network
  - Connects FU outputs and inputs within the array
- Functional Module
  - Specifies the cell type: datapath, MEMHC, or IOHC
[Figure: HoneyComb array with IOHCs at the border and MEMHCs inside; honeycomb cell structure with Routing Unit, Functional Module, CG links, and MG links]


5 HoneyComb-Architecture: Cell Types
Cell types are defined by their functional modules.
Datapath cell
- Integrates ALUs, LUTs, CG/FG registers / FIFOs
- Coarse-/fine-grained data types
- Multi-context features
- Register control functions
- Highly parametrizable at RTL regarding interconnections, registers, operations, LUT parameters, etc.
Memory cell (MEMHC)
- Storage functionality like RAM, FIFO, LIFO
- Supports all data types
- Complex FSM programming is possible
- Adaptable at RTL (module count / size, interconnect, registers)
Input / output cell (IOHC)
- System interface / programmable µcontroller
- Configuration sequencing
- Conditional control of the array
[Figure: honeycomb cell structure with Routing Unit and Functional Module; datapath module and memory module]

6 HoneyComb-Architecture: Hardware-supported online routing
Routing Unit (RU)
- Main component of the routing network
- Each cell integrates an RU
- Parametrizable at RTL (position, neighbors, CG/MG connections / direction)
- Instruction-based control of the point-to-point routing
- Four instructions have been defined:
  - CG Routing Instruction (CGRI)
  - MG Routing Instructions (MG1RI, MG2RI; 2 words)
  - End Packet Instruction (EPI)
RU controller algorithm:
1) Store incoming instructions in the input registers (InReg)
2) InRegs forward pending requests to the Routing Unit (RU)
3) RU selects the next request (round robin)
4) If the current cell is the destination: acknowledge the route, continue with 5; otherwise: calculate a new route, continue with 6
5) Establish the connection to the Functional Module, continue with 1
6) Establish the connection to the next cell, continue with 1
[Figure: honeycomb cell structure with input registers InReg0 to InRegN, RU controller, Routing Unit, and Functional Module]
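The RU controller loop on slide 6 can be mimicked in software to make the round-robin selection concrete. The following is only a behavioural sketch: the class name, the number of input registers, and the returned action strings are assumptions for illustration and are not taken from the RTL.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RoutingUnit:
    """Toy model of one cell's Routing Unit; all names here are hypothetical."""
    position: tuple                 # (x, y) coordinates of this cell
    num_inregs: int = 3             # number of input registers (assumed value)
    inregs: list = field(default_factory=list)
    next_reg: int = 0               # round-robin pointer

    def __post_init__(self):
        self.inregs = [deque() for _ in range(self.num_inregs)]

    def store(self, reg, target):
        # Step 1: store an incoming routing request in an input register.
        self.inregs[reg].append(target)

    def step(self):
        # Steps 2-3: scan pending requests round robin, starting after the last grant.
        for offset in range(self.num_inregs):
            reg = (self.next_reg + offset) % self.num_inregs
            if self.inregs[reg]:
                target = self.inregs[reg].popleft()
                self.next_reg = (reg + 1) % self.num_inregs
                # Step 4: is this cell the destination?
                if target == self.position:
                    return reg, "connect_functional_module"  # step 5
                return reg, "forward_to_next_cell"            # step 6
        return None  # no pending request
```

Calling `store` for several registers and then `step` repeatedly shows that a register which was just served is checked last on the next selection, which is the fairness property round robin is meant to give.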

7 HoneyComb-Architecture: Hardware-supported online routing
Routing network
- Coordinate-based
- Depth-first search strategy
- Backtracking algorithm
- Routing performance: 4 cycles per cell
  - 3 cycles for getting to the next cell
  - 1 cycle for acknowledgement
- Establishes point-to-point connections between ports of functional modules
- Option to force shortest-path routing: Optimum-Bit
- Support for multi-grained data types:
  - Coarse-grained
  - Multi-grained: 1 to n bits
- Transports configurations as well as application data
[Figure: routing without obstacles on the HoneyComb array, with 3-cycle hop and 1-cycle acknowledgement annotations along the path. Routing path establishment: 20 cycles; communication latency: 5 cycles]

8 HoneyComb-Architecture: Hardware-supported online routing (routing network, continued)
[Figure: routing with an obstacle. Routing path establishment: 24 cycles; communication latency: 6 cycles]
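The establishment times quoted on slides 7 and 8 are consistent with the stated cost of 4 cycles per cell if the paths span five and six cells respectively. A minimal check, assuming the cost simply scales linearly with the number of cells traversed (the function name is illustrative):

```python
CYCLES_TO_NEXT_CELL = 3  # from the slides
CYCLES_ACK = 1           # from the slides


def establishment_cycles(cells_traversed: int) -> int:
    """Cycles to establish a route, assuming a fixed cost per traversed cell."""
    return cells_traversed * (CYCLES_TO_NEXT_CELL + CYCLES_ACK)


print(establishment_cycles(5))  # 20, the figure quoted without obstacles
print(establishment_cycles(6))  # 24, the figure quoted with an obstacle
```

Under the same assumption, the quoted communication latencies of 5 and 6 cycles would correspond to one cycle per cell on the established path.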

9 HoneyComb-Architecture: Hardware-supported online routing
Optimal path routing
- Optimal paths map: shows all possible shortest paths
- Decision whether a direction is optimal: determined through direction spots
Algorithm within each cell:
1. Wait for incoming requests
2. Check all directions
3. Found one: take the one with the smallest utilization; else: go back to the previous cell, go to step 1
4. Forward to the selected direction and wait for the response
5. Positive: acknowledge and reserve the path; negative: continue with step 2
6. Continue with step 1
[Figure: routing with an obstacle and Optimum-Bit, showing the possible paths to get the shortest path. Routing path establishment: 24 cycles; communication latency: 5 cycles]
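The per-cell algorithm on slide 9 amounts to a depth-first search with backtracking. The sketch below models it on a hexagonal grid in axial coordinates. The coordinate system, the `utilization` map, and the reading of the Optimum-Bit as "only follow directions that reduce the distance to the target" are assumptions made for illustration; the slides do not specify these details.

```python
HEX_DIRS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]  # axial neighbours


def hex_distance(a, b):
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def route(start, target, cells, blocked=frozenset(), utilization=None, optimum_bit=False):
    """Depth-first search with backtracking; returns a list of cells or None."""
    utilization = utilization or {}
    visited = {start}

    def visit(cell, path):
        if cell == target:
            return path  # destination reached: acknowledge the route
        here = hex_distance(cell, target)
        candidates = []
        for dq, dr in HEX_DIRS:  # step 2: check all directions
            nxt = (cell[0] + dq, cell[1] + dr)
            if nxt not in cells or nxt in blocked or nxt in visited:
                continue
            detour = hex_distance(nxt, target) >= here
            if optimum_bit and detour:
                continue  # assumed meaning of the Optimum-Bit
            candidates.append((detour, utilization.get(nxt, 0), nxt))
        # step 3: prefer distance-reducing directions, then the smallest utilization
        for _, _, nxt in sorted(candidates):
            if nxt in visited:
                continue
            visited.add(nxt)
            found = visit(nxt, path + [nxt])  # step 4: forward and wait
            if found is not None:
                return found                   # step 5: positive response
        return None  # negative response: the previous cell tries another direction

    return visit(start, [start])
```

With `optimum_bit=True` every returned path has exactly `hex_distance(start, target)` hops, while the unrestricted search may return a longer detour around an obstacle. In this sketch the restricted search returns `None` when no shortest path is free; how the hardware behaves in that case is not stated on the slides.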

10 HoneyComb-Architecture: Using Online Routing for Replacement
Configuration technique
- Establish a configuration path to the target cell
- The target is specified by X, Y coordinates
- Transport the configuration data to the target
- Configuration data is position independent
- The cell configuration must meet the configuration data requirements (RTL compatibility)
Online placement
- By changing the target coordinates X, Y, the hardware establishes a configuration path to the new target: replacement is done
- Explicit handling of the data streams is necessary
- Replacement can be done by the runtime environment
[Figure: routing with an obstacle and Optimum-Bit; original placement at (x, y) and new placement at (x+Δx, y+Δy)]
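Because the configuration data is position independent, moving a configuration only requires new target coordinates. A minimal sketch of that idea, with a hypothetical `CellConfiguration` record that is not part of the HoneyComb tool chain:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CellConfiguration:
    target: tuple    # (x, y) coordinates of the target cell
    payload: bytes   # position-independent configuration data


def relocate(cfg: CellConfiguration, dx: int, dy: int) -> CellConfiguration:
    """Return the same configuration addressed to (x + dx, y + dy)."""
    x, y = cfg.target
    return replace(cfg, target=(x + dx, y + dy))
```

The payload is untouched by `relocate`; in the architecture described here, the routing hardware would then establish the configuration path to the new coordinates.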

11 RTL Configuration Manager
Problem
- Highly parametrizable architecture description (RTL)
- High count of parameters (: parameters)
  - Input / output definitions of cells
  - Data width / granularity
  - Number of ALUs / LUTs / registers
  - Interconnection between modules
- How can this kind of complexity be managed?
Approach
- Easy representation of the parameters in a table, e.g. MS Excel
- Script-based consistency checks (Excel VBA)
- Generation of the complete HoneyComb array, including VHDL and compiler/viewer configuration files
- The HoneyComb assembler is part of the application
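The slides implement the consistency checks in Excel VBA. The same idea is shown here in Python purely for illustration; the parameter names (`num_alus`, `data_width`, `modules`, `interconnect`) are hypothetical and do not reflect the actual table layout.

```python
def check_cell(params: dict) -> list:
    """Return a list of consistency errors for one cell's parameter row."""
    errors = []
    for key in ("num_alus", "num_luts", "num_registers"):
        if params.get(key, 0) < 0:
            errors.append(f"{key} must not be negative")
    if params.get("data_width", 0) <= 0:
        errors.append("data_width must be positive")
    modules = set(params.get("modules", []))
    # Every interconnection must start and end at a module that exists in the cell.
    for src, dst in params.get("interconnect", []):
        for name in (src, dst):
            if name not in modules:
                errors.append(f"interconnect references unknown module {name!r}")
    return errors
```

A generator would run such checks on every row before emitting VHDL, so that an inconsistent table is rejected early rather than during synthesis.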

12 [Screenshot labels: Pre-defined Template List; Control Buttons; Currently defined Array]

13 HoneyComb Architecture: Programming
HoneyComb Assembler (HCA)
- Low-level programming language
- Highly dependent on the RTL configuration
- Quite complex / not easy to understand
- Structural programming
HoneyComb Language (HCL)
- Abstraction from strict structural programming
- Functional description on cell level
  - Partitioning is done by the programmer
  - Use of high-level constructs, like if-then-else
- Management of configurations (IOHC)
  - Conditional/unconditional I/O control
  - Configuration sequencing
- RTL-independent code
  - Dependency is still selectable by the programmer
- Utilization of the given hardware parallelism
- Process-based, VHDL-like parallel language

14 HoneyComb-Language (HCL): Functional Process

    CELL CounterExample
      // Definition part
      IN    Start#1, Stop#1, Range;
      OUT   CounterOut, Finish#1;
      ALIAS S1 = 1#1, S2 = 0#1;
      VAR   State#1, Counter;
      // Initial part
      INIT
        SET State = S1;
      // Functional statements
      BEGIN
        State <= State;
        Finish = 0;
        IF (State = S1) THEN
          IF (Start) THEN
            State <= S2;
            Counter <= Range;
          END IF
        ELSE // State = S2
          Counter <= Counter - 1;
          IF (Stop OR ALU(Counter).zf) THEN
            State <= S1;
            Finish = 1;
          END IF;
        END IF;
        CounterOut = Counter;
    END CELL;

15 HoneyComb-Language (HCL): Functional Process, state diagram of CounterExample
- S1, on Start: Counter = Range, Finish = 0, go to S2
- S1, else: Finish = 0, stay in S1
- S2, on Stop OR Counter = 0: Finish = 1, go to S1
- S2, else: Counter = Counter - 1, Finish = 0, stay in S2
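The behaviour of the CounterExample cell can be traced with a small cycle-by-cycle model. This sketch makes two assumptions that the slides do not state explicitly: `<=` is read as a registered assignment that takes effect in the next cycle (as in VHDL signal assignment), and `ALU(Counter).zf` is read as the zero flag of the decrement result.

```python
S1, S2 = 1, 0  # state encodings from the ALIAS line


def counter_example_step(state, counter, start, stop, range_value):
    """Simulate one clock cycle.

    Returns (next_state, next_counter, counter_out, finish).
    """
    next_state, next_counter, finish = state, counter, 0
    if state == S1:
        if start:
            next_state, next_counter = S2, range_value
    else:  # state == S2
        next_counter = counter - 1
        if stop or next_counter == 0:  # assumed zero-flag semantics
            next_state, finish = S1, 1
    return next_state, next_counter, counter, finish
```

Starting in S1 with `Range = 3` and a one-cycle `Start` pulse, this model leaves S1 in the first cycle and raises `Finish` three cycles later, when the decrement reaches zero.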

16 HoneyComb-Language (HCL): Programming Methodology
Applications
- Functions / procedures
- Input / output
- DFG / CFG
[Figure: application flow: Start Application -> Function: Read Data -> Procedure: Data Preprocessing -> Function: Calculation Loop -> Procedure: Data Postprocessing -> Function: Write Data -> Exit]
Manual partitioning
- Break down to single cells
- Consider communication
- Goal: cell descriptions in HCL
[Figure: Function: Calculation Loop partitioned into processes proc1 to proc6]
Definition of sub-configurations
- Process/cell instantiations
- Optional location predefinition
- Interconnection description: inter/intra sub-configuration
- Reuse of predefined sub-cfgs; libraries are imaginable
Scheduling
- Configuration sequencing
- Load/delete of sub-cfgs
- Conditional flow control
- Parallel/sequential execution of sub-cfgs
- Task of the main processes
[Figure: sub-configuration "Calculation Loop" of the application configuration, built from proc1 to proc6, with a similar procedure for the remaining functions; timeline with SubCfg 1 to 3 executed in parallel and SubCfg 4 and 5 executed sequentially]

17 HoneyComb Architecture: Design Flow for the Specification of the RTL Configuration
Template library / Excel generator
- HoneyComb RTL template generation
- Cells contain all user-predefined elements
Reference applications (HCL applications A to F) and MCTG
- Initial point: overloaded RTL configuration with all allowed elements
- Compilation of the chosen applications; the RTL template is used as the target
- Result: set of RTL-dependent configurations for best-possible application execution
SuperCfg Generator
- Creates a superset for the given RTL descriptions and reduces the template
- Result: reduced HoneyComb configuration (Super RTL Configuration)
- Further iterative steps are possible

18 Extraction of the Super-Configurations
Simplified presentation:
- Create an empty cell template for the generation
- Application-dependent analysis of the given operations, source units, and target units for each unit
- Incremental adding of the required resources to the current unit
- Quit when all applications are satisfied
[Figure: units of Application A and Application B merged into the resulting RTL configuration]
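The incremental extraction on slide 18 can be read as taking, per resource type, the largest requirement over all applications. The sketch below illustrates that reading for a single cell; the resource names are made up, and the real SuperCfg Generator also tracks source and target units, which is omitted here.

```python
from collections import Counter


def super_configuration(applications: dict) -> Counter:
    """Merge per-application resource needs into one cell template.

    applications maps an application name to a Counter of required resources.
    """
    template = Counter()  # start from an empty cell template
    for requirements in applications.values():
        for resource, count in requirements.items():
            # Add only what is still missing for this application.
            template[resource] = max(template[resource], count)
    return template
```

For example, an application needing two adders and one LUT, merged with one needing one adder and one multiplier, yields a template with two adders, one LUT, and one multiplier.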

19 Extraction of the Super-Configurations: Homogeneous Arrays
Mapping of all the application cells onto one single cell: homogeneous array configuration
Advantages:
- Highest flexibility for the runtime mapping
- Simplified application development
Disadvantage:
- Considering the application structure: non-optimal utilization in the peripheral area
[Figure: cells of Application A and Application B merged into one cell with all required characteristics; HoneyComb array with IOHCs at the border]

20 Application examples (1)
1024-point FFT
- Radix-2 butterfly implementation
- Precision: fixed point
- Single butterfly version requires 5 cells
  - Butterfly: 1 cell (8 ALUs, 1 LUT), 2 cycles / operation
  - Controller: 2 cells (11 ALUs, 7 LUTs)
  - Interleaver: 1 cell (4 ALUs, 2 LUTs)
  - Memory: 1 cell (4 HCMEMs, 4x4 Kbytes)
- Performance: 2 cycles / operation, 5120 butterfly operations = cycles / operation + store/load time of 1024 cycles
Wavelet transformation
- Frequency filter for JPEG2000
- Works on the whole image
- Single wavelet filter implementation:
  - High-pass filter: 1 cell (5 ALUs, 2 LUTs)
  - Low-pass filter: 1 cell (6 ALUs, 2 LUTs)
- Performance: 1 pixel / cycle
- Easy performance increase through parallel execution
[Figure: FFT-1024 mapping with Controller, Memory, Butterfly, and Interleaver cells]
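The butterfly count on this slide follows from the radix-2 structure: an N-point FFT has log2(N) stages of N/2 butterflies each. The sketch below reproduces the 5120 figure and derives a total cycle count, assuming that the store/load time is simply added and not overlapped with computation (the slide's own total did not survive, so the result here is a derived estimate).

```python
def fft_cycles(n_points=1024, cycles_per_butterfly=2, store_load_cycles=1024):
    """Return (butterfly_operations, total_cycles) for a single-butterfly radix-2 FFT."""
    stages = n_points.bit_length() - 1      # log2(n_points) for a power of two
    butterflies = (n_points // 2) * stages  # 512 butterflies in each of 10 stages
    total = butterflies * cycles_per_butterfly + store_load_cycles
    return butterflies, total


print(fft_cycles())  # (5120, 11264)
```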

21 Application examples (1), continued
[Figure: Wavelet 3x mapping with High-pass Filter, Low-pass Filter, and Memory cells]


22 Application examples (2)
AES-256
- Advanced Encryption Standard
- Block-based algorithm
- The HC implementation processes 4 bytes at once in each functional block
- Requires 13 cells (complete prototype size)
  - 11 datapath cells: 69 ALUs, 7 LUTs
  - 2 MEMHCs: 16 x 4 kbytes
- Performance: 25.6 MB/s encryption speed
imdct
- Application: MP3/OggVorbis decoder
- Uses the recursive approach due to Nikolajevic/Fettweis
- Single finger version requires 4 cells
  - Finger: 1 cell (7 ALUs, 1 LUT)
  - Controller: 1 cell (4 ALUs, 6 LUTs)
  - Interleaver: 1 cell (5 ALUs)
  - Memory: 1 cell (8 HCMEMs, 8 x 4 kbytes)
- Performance using the OggVorbis specification transformation and one finger: 47.6 blocks / sec (1 block = 2048 samples); 43 blocks / sec are required
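The performance figures above can be put into perspective with a little arithmetic. The sketch assumes the 100 MHz clock quoted for the application results on the synthesis slide and reads MB as 10^6 bytes; both function names are illustrative.

```python
def imdct_margin(achieved=47.6, required=43.0, samples_per_block=2048):
    """Return (achieved samples/s, required samples/s, speed margin)."""
    return achieved * samples_per_block, required * samples_per_block, achieved / required


def aes_cycles_per_block(mb_per_s=25.6, clock_hz=100e6, block_bytes=16):
    """Average clock cycles spent per 16-byte AES block at the given throughput."""
    bytes_per_cycle = mb_per_s * 1e6 / clock_hz
    return block_bytes / bytes_per_cycle
```

With the slide's numbers, the imdct delivers roughly 97,500 samples per second against roughly 88,000 required, a margin of about 10 percent, and the AES throughput corresponds to about 62.5 cycles per 16-byte block.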

23 Application examples (2), continued
[Figure: imdct mapping with Controller, Multiplexing, Low Pass Filter, and Memory cells; in this build the 5-ALU cell is labeled Multiplexing instead of Interleaver]

24 Prototype Configuration: Synthesis
RTL configuration
- Generated based on the given application set (AES, imdct, FFT, Wavelet)
- 11 datapath cells
- 2 MEMHCs
- 2 IOHCs
- Additional functionality will be added if some area can be spared during the layout process
Synthesis
- Performed using Synopsys Design Compiler
- Target technology: TSMC 90 nm standard cell technology
- Maximum possible frequencies:
  - IOHCs: up to 400 MHz
  - Array: around 166 MHz
Application results (@ 100 MHz); columns: application, datapath cells, MEMHCs, configuration time, performance
- AES256: 11, 2, 6.85 µs, 25.6 MB / sec
- imdct: performance 47.6 blocks / sec
- FFT: performance in blocks / sec
- Wavelet 3x: performance 0.6 cycles / pixel
Synthesis results (TSMC 90 nm standard cell technology); columns: application, area in mm² (datapath cells, MEMHC, IOHC), power in mW (static, dynamic)
- AES256: 0.362636, 1.226638, 0.
- imdct
- FFT: ,02
- Wavelet 3x: ,84
[Figure: ASIC prototype array with cell coordinates (0,0) to (4,4) and the MEMHC and IOHC positions]

25 Prototype Configuration - Layout

26 Prototype Configuration and PCB Integration
Multi-chip approach
- Due to limited budget and limited available area
On-die HoneyComb prototype
- Technology: 90 nm TSMC standard cell
- Cell area: mm² ()
- Maximum available area: 16 mm²
- Current area of the array: 11.5 mm²
Host system implementation on FPGA
- Flexible peripheral interfaces
HoneyComb Controller (HC Controller)
- Additional device to control the HC array
- Receives information from
  - IOHC pipeline modules
  - IOHC FIFO states / activity
  - Routing unit status
- Controls
  - IOHC pipelines (starting, resetting, ...)
  - Routing units (disable if faulty)
[Figure: PCB-level integration: HoneyComb prototype with MEMHCs and IOHCs, FPGA with HC Controller and SoC interface, DDR SDRAM, and peripheral interfaces RS232, SPI, I²C, USB 2.0, Ethernet, SATA, VGA]

27 HoneyComb Architecture: Discussion and Perspectives
Advantages
- Runtime adaptive routing technique
- Hexagonal cell shape
- Programmable I/O interface
- Multi-context / multi-grained functions
- Hardware template characterization
- Array-based approach: bandwidth advantage
- Local memories
- Clock gating
- Lower frequencies
Disadvantages
- Synchronization protocol
  - Additional hardware overhead
- Application specialization
  - The adapted array is highly application dependent
- Array-based approach: no real DMA available
  - Von Neumann flexibility is practically gone
- Programming is hard work
  - Structural programming is harder than it seems
Future work / improvements
- Optimization of the multiplexing structures (GFTs, crossbar networks; savings of over 50% possible)
- Away from synchronized networks (architecture generalization)
- Adding debugging functions (currently very rudimentary)
- Runtime environment: the only way to exploit all given features
- C/C++ compiler development

28 Thank you for your attention
