C-Based Hardware Design Platform for Dynamically Reconfigurable Processor

Size: px

Start display at page:

Download "C-Based Hardware Design Platform for Dynamically Reconfigurable Processor"

Abigail Barber
6 years ago
Views:

1 C-Based Hardware Design Platform for Dynamically Reconfigurable Processor September 22 nd, 2005 IPFlex Inc.

2 Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW II Design Tool Programming Example and Performance < 2 >

3 Short development iteration while achieving high performance Until Now IPFlex s Technology Enables Development Idea Development Idea Algorithm & System Design Algorithm & System Design Long re-design iterations takes 80% of system dev. time Logic Design Physical Design HW Test Chip SW Long Development Cycle and Short PLC leads to Risk SW Design = Product HW Design Direct and Timely Link between Idea and Product Reduced Development Cycle = Reduced Risk Product < 3 >

4 HW and SW enabling short iteration cycle Dynamically Reconfigurable Hardware Hardware functions reconfigured in a single clock Software to Silicon TM Hardware design with C-like language < 4 >

5 Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW II Design Tool Programming Example and Performance < 5 >

RISC core and dynamically reconfigurable data flow machine 166MHz 376 32 bit processing elements

6 RISC core and dynamically reconfigurable data flow machine 166MHz bit processing elements 512KB Internal RAM Direct I/O DDR SDRAM Processor chip, design tool, evaluation boards available < 6 >

7 RISC core and dynamically reconfigurable data flow machine PE Matrix < 7 >

8 DAP RISC and DNA Parallel Data Processing Memory.L8: cmpge t3,8 br/t.l15.l9: mov t8,cos_table1 mov t3,cos_table2 addv t2,t8,t2 mov cos_table3,t9 lsl t7,t3,3 ld t8,[fp+4] addv t2,t8,t2 IDU EU REG lsl t7,t7,1 : : : PE matrix Memory addv t8,t8,t11 addv t7,t7,t10 addv t4,t4,a1.l12: mov t9,1 addv t2,t2,t9 br.l10.l13:.l14: mov t2,1 addv t3,t3,t2 br.l8c Config Mem < 8 >

9 Heterogeneous PE Matrix PE Matrix PE EXE EXM DLE RAM # Functionality 32 bit 2 inputs / 1 output execution elements 16bitx16bit->32bit multipliers 32 bit 2 inputs / 1 output delay element DNA Internal memory element 16KB 32 =512KB C16E C32E LDx STx Total Address generators I/O buffers < 9 >

10 Scalability with DNA Direct I/O 16 DNAs at 16Gbps flow-thru (Total 32 Gbps I/O) DRAM DRAM DRAM #0 Bus Switch #1 Bus Switch #2 Bus Switch Ext. Device Direct IO DNA Direct IO Direct IO DNA Direct IO Direct IO DNA Direct IO Ext. Device Synchronize with the external clock 166MHz, 96 bit Interconnection between DNAs equivalent to DLE between segments < 10 >

11 DAPDNA architecture: solution at any system size System Example FPGA/ ASIC FPGA/ ASIC FPGA/ ASIC FPGA/ ASIC 4 Pipelined Processes Dynamic Reconfiguration Scalability Small scale Mid/Large size Ultra large Function 4 Function 3 Function 2 Function 1 One DNA Plane One DAPDNA Processor Up to 16 DAPDNA Processors < 11 >

12 Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW II Design Tool Programming Example and Performance < 12 >

13 Data Flow C compiler: A part of Software to Silicon design tool Design Specify : Tools included in DAPDNA-FW II DNA P.A. C for DAP System Specifications Profiler Wrapper for DNA Configuration Algorithm Design C Source Code Partitioning DFC Compiler C for DNA DFC Compiler DNA Configuration Block Diagram DNA Designer MATLAB/Simulink DNA Blockset DAP Compiler DNA Compiler (Dynamic Reconfiguration, Place & Route) Compile DAP DAPDNA Co-Simulator / Emulator DNA DAPDNA Object < 13 >

14 DFC Compiler: HW configurations from C language C based Data Flow Description DNA Configuration DNA Control code for DAP DFC Compiler < 14 >

15 HW configurations written in DFC as functions Application (DAP C code) void top_level() { fir(d,h,r); } Call DNA Control Code void fir (int d[ ], int h[ ], int r[ ]) { dna_cfgram_load3(bank3, cfg_fir); dna_config(0,bank3); dna_run(); Load config data Bank Switch DNA Run DNA Hardware DNA Config mem dna_stop(); } DNA Stop DNA PE Matrix < 15 >

16 HW configurations switched dynamically in single clock (6 nano second) Application (DAP C code)... while (1) {... VLC_Decoder(Vector,Quant_dat,Input); Inv_Quantize(dct_dat, quant_dat); Inv_DCT(mc_dat, dct_dat); Motion_Comp(mcu_dat,mc_dat); Color_Conv(mcu_dat, mcu_dat);... } Color Convert Motion Comp Inv DCT Dynam ic Recon. (Partition in Tim e) DNA Hardware Inv Quantization VLC Decoder < 16 >

17 Extension on ANSI C for data flow systems Delayline: delayed data signals delayline d; a d[0] d = a; z = d[0] + d[1] + d[2] + d[3]; Z -1 d[1] Z -1 d[2] + z SEQ: Static code replication Z -1 d[3] int h[] = {1,2,3}; 1 z[0] int z[]; seq (i=0; i<3; i++) { a 2 z[1] z[i] = a * h[i]; } 3 z[2] < 17 >

18 Advantages of DAPDNA architecture Short development period - Rapid prototyping - Co-Simulation between Processor (DAP RISC) and Data Flow Machine (DNA PE Matrix) - Deterministic placement / guaranteed clock 166Mhz Flexibility to change HW according to the application in demand - Dynamic reconfiguration in single clock Performance close to custom device - Deep pipelining and massive parallelization processing elements < 18 >

19 Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW II Design Tool Programming Example and Performance < 19 >

20 Using dynamic reconfigurability to change image filters DAPDNA-EB5 Evaluation Board DAPDNA-FW II Debug Box DAPDNA-2 Various image filters written in DFC and compiled Laplacian Binary Average Etc. Filters run on DAPDNA-2 using dynamic reconfigurability of hardware < 20 >

21 DAP calls DNA configurations with simple function calls List of filters written in Data Flow c An Average3x3 filter being called in main.c as a function Main.c for DAP RISC < 21 >

22 DFC-described filters compiled into HW with a single click of build button Average3x3 filter described in DFC uses delayline and seq commends to take advantage of parallelization < 22 >

23 resulting in a logical representation of the filter with 32-bit processing elements Pipeline Depth Parallelization Data In Data Out < 23 >

24 as well as physical mapping on DNA Place & Route result Resource manager displays usage and power consumption < 24 >

25 DFC compiler result for image filters DAPDNA-2 (166 Mhz) Image Filters EXE EXM RAM DLE Parallel Performance M Pixel/sec Fps(800x600) 3x3 Average Usage % Binary Usage % Binary Image Edge Det. Usage % < 25 >

Optimized result with DNA Designer (Block Diagram)

0 42 3x3 Laplacian (Edge Detection) 15.8 664.

0 4 Binary Image Edge Detection 38.6 664.

26 Optimized result with DNA Designer (Block Diagram) tool Image Filters Images Performance (M Pixels / s) Pentium 4* (3.06Ghz) DAPDNA-2 (166Mhz) Performance Multiple 3x3 Average x3 Laplacian (Edge Enhancement) x3 Laplacian (Edge Detection) Binary Binary Image Edge Detection * Pentium4@3.06GHz, Red Hat Linux 8.0, gcc 3.2, SIMD instruction not used, 24bit color, 800x600 pixels, 1000 Repetitions, Average performance (pixel/sec) < 26 >

$FFT: Protein structure analysis Extrapolating molecular structure from X-ray diffraction image Extrapolation Algorithm$

27 FFT: Protein structure analysis Extrapolating molecular structure from X-ray diffraction image Extrapolation Algorithm (HIO) Reciprocal Space Real Space 2D-IFFT Adjust x1000 Adjust 2D-FFT Requires 1024 X bit FFTs Source: RIKEN < 27 >

28 DAPDNA-2 is 10X faster than P4 with 1/200 energy Time elapsed (msec) D-FFT (1024x1024) 360 msec P4 8mWh 27 msec DD2 Energy used (mwh) mWh 1000 iterations (HIO) Time elapsed Energy used 16 (min) (Wh) Wh min 1.5 min P4 DD2.1Wh Time 13 : 1 10 : 1 Energy 267 : : 1 * Assuming 80W for P4 and 4W for DAPDNA-2 Source: IPFlex experimentation < 28 >

29 10Gbps active firewall by NTT 10Gb/s Firewall System using Reconfigurable Processors Reconfigurable system technical committee < 29 >

30 10Gbps active firewall by NTT 10Gb/s Firewall System using Reconfigurable Processors Reconfigurable system technical committee < 30 >

31 IPFlex s technology / product roadmap General MPU like Pentium and DNA+AXION complement well Realizes extreme performance Granularity of operands Fine Coarse General MPU Low DNA AXION High Degree of computational parallelism Performance Low High DAPDNA-2 Pentium series Single-core Super chip Pentium Extreme edition IPFlex-enhanced multi-core Processor series Intel multi-core Pentium series Multi-core 2005 Year < 31 >

32 IPFlex s technology / product roadmap Today HW Concept Proof DAPDNA -HP First Commercial Product DAPDNA -2 High-volume Product DAPDNA -2C High-end Product DAPDNA -3A First Software-defined (Axion-based) Product DAPDNA 100MHz 144PEs 16KB RAM 166MHz 376PEs 512KB RAM Direct I/O DDR SDRAM 200MHz ~1000 Simple PEs 333~800MHz 2~4 DAP core 2048~3072PEs 666~1200MHz 4~8 DAP core 4096~6144PEs SW DAP Compiler DNA Compiler Emulator Simulator DAP Compiler DNA Compiler Emulator Simulator DAP Compiler DNA Compiler Emulator Simulator DAP Compiler DNA Compiler Emulator Simulator DAP Compiler DNA Compiler Emulator Simulator Auto Fitter DataFlow-C Compiler Auto Fitter DataFlow-C Compiler Auto Fitter DataFlow-C Compiler Auto Fitter DataFlow-C Compiler Dry-C Compiler Dry-C Compiler C++ Compiler JAVA Compiler Dry-C Compiler C++ Compiler JAVA Compiler System-C Compiler System-Verilog Compiler < 32 >

33 Dynamically Reconfigurable Processor and C-Based Hardware Design Platform by IPFlex Inc. < 33 >

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY