Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture

Size: px

Start display at page:

Download "Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture"

Ella Gardner
6 years ago
Views:

1 1 Physical Implementation of the DSPI etwork-on-chip in the FAUST Architecture Ivan Miro-Panades 1,2,3, Fabien Clermidy 3, Pascal Vivet 3, Alain Greiner 1 1 The University of Pierre et Marie Curie, Paris, France 2 STMicroelectronics, Crolles, France 3 CEA-Leti, MIATEC, Grenoble, France

2 Outline Motivation FAUST architecture Migration of DSPI into FAUST DSPI implementation etworks comparison 2

3 3 Motivation Physically implement the DSPI oc into the FAUST application platform - DSPI is a oc developed between Lip6 and ST - FAUST is a stream-oriented application platform for 4G telecom applications, based on AOC, developed by CEA-Leti. Compare the performances between AOC and DSPI on a real application and traffic

4 4 FAUST architecture OC1 IF 84 Pads SPort APort EXP OFDM MOD. RAC ALAM. MOD. CDMA MOD. JTAG MAPP. Clk & Reset CTRL BIT ITER. TURBO CODER Clk, Rst COV. CODER 23 computation units Asynchronous oc (AOC) 20 AOC routers TX units oc Perf. AHB RAM ARM946 RAM EXT. RAM CTRL RAM IF 58 Pads GALS conception 24 independent Clks AHB units RX units Async/ Sync IF Async node ROTOR FRAME SYC. EQUAL. ODFM DEM. CHA. EST. CDMA DEM. COV. DEC. DE- MAPP. ETHER ET DE- ITER. ETHERET IF 17 Pads Ethernet port Internal/External RAM CPU ARM946ES Cache 4KB-I 4KB-D OC2 IF 83 Pads EXP SPort APort DART Hardware OFDM modulation/demodulation

5 5 AOC architecture Crossbar Crossbar IP IC Asynchronous handshake protocol GALS interface Crossbar Crossbar Asynchronous oc - Asynchronous send/accept handshake protocol - QDI 4-phase/4-rail asynchronous logic - QoS with two Virtual Channels (Best Effort, Guaranteed Service) oc Routers - 5 ports router - Source routing - Wormhole packet switching - 32 bits payload FIFO based GALS interfaces Mapped onto libraries - CMOS ST 130nm - TIMA TAL130 library

5 M Gates 276 chip pins Chip area 80 mm² ST CMOS 130nm 166 MHz (worst-case) oc implementation : - Uses AOC router hard-macro - Perform

6 FAUST floor-plan with AOC Exp. RAC RAM1 Rotor Exp. Frame sync. Equal. P2 OFDM mod. ARM946 Channel Est. OFDM demod. Ala. P1 CDMA Mod. CDMA Dem. RAM2 CLK Reset Mapp. Conv. Dec. Demapp. Bit. Inter. Turbo Dec. Conv. Codec. Ext. RAM Ctrl. Ethernet Deinter. 4.5 M Gates 276 chip pins Chip area 80 mm² ST CMOS 130nm 166 MHz (worst-case) oc implementation : - Uses AOC router hard-macro - Perform buffering and routing of AOC links - o spaghetti routing at top level! orth Port Unit Port DART West Port East Port 6 South Port AOC router (0.211 mm²)

7 Outline Motivation FAUST architecture Migration of DSPI into FAUST DSPI implementation etworks comparison 7

8 in in in in DSPI architecture CK6 CK7 CK8 (Y+1,X-1) 1) Cluster (Y+1,X) (Y+1,X+1) CK3 CK4 CK5 (Y,X-1) Cluster (Y,X) (Y,X+1) CK0 CK1 CK 2 in in in in in Local port 8 Packet base Distributed router architecture Suited to GALS approach Mesochronous links between routers Synthesizable with standard cells either asynchronous nor custom cells Metastability resolved by bi-synchronous FIFO (Y-1,X 1,X-1) 1) Cluster (Y-1,X) (Y-1,X+1) More details in: "A Low Cost etwork-on-chip with Guaranteed Service Well Suited to the GALS Approach anoet 06

9 9 oc architectures comparison AOC Router arity 5 port router 5 port router DSPI Topology Irregular 2D mesh Regular 2D mesh Routing technique Source routing (9 hops) Address-based (8 bits) X-First algorithm Switching technique Wormhole Wormhole Flit size (payload) 34 bits (32 bits) Generic: 34 bits (32 bits) Flow control bits on the flit Virtual channels Begin of packet (BOP) End of packet (EOP) Best effort Guaranteed service Begin of packet (BOP) End of packet (EOP) Best effort Guaranteed service Programming model Message passing Shared memory (2 routers per cluster) Message passing (1 router per cluster) Clocking scheme Fully asynchronous (QDI) with GALS interfaces Multi-synchronous with mesochronous interfaces Flow control protocol Send/accept asynchronous handshake FIFO protocol (Write and WriteOk) Clock tree one One per router Physical implementation Hard macro Soft macro distributed on five modules Long wires Inter-router wires Intra-cluster wires

10 10 Packet format 34 bits 18 bits 34 bits (generic) 8 bits First flit H8... H2 H1 H0 Y X 2 bits 2 bits 2 bits 4 bits 4 bits Following flits AOC packet DSPI packet Similar packet format and control bits AOC uses Source-routing (18 bits) allowing 9 hops DSPI uses Address-based (8 bits) Packet conversion module required: - Design of Protocol_conversion module

11 FAUST integration CLK_IP CLK_IP IP IC Synchronous SED/ACCEPT IP IC Synchronous SED/ACCEPT Protocol_conversion module: Translates the routing algorithm using a LUT GALS interface Protocol_conversion Adapts the flow control signals: LUT - AOC: Send/Accept AOC router Asynchronous SED/ACCEPT Asynchronous SED/ACCEPT DSPI router Synchronous READ/WRITE Mesochronous READ/WRITE - DSPI: FIFO protocol Implements two bi-synchronous FIFOs CLK_oC AOC IP template DSPI IP template AOC GALS Interface: Implements 4 FIFOs synchronous asynchronous FAUST IPs and ICs are not modified 11

12 Outline Motivation FAUST architecture Migration of DSPI into FAUST DSPI implementation etworks comparison 12

13 13 DSPI implementation Synthesis Floorplanning Place Optimize placement Clock-tree Route Optimize Hierarchical synthesis Low Power CMOS ST 130nm technology Standard cells Without asynchronous nor custom cells Timing constraints file: - Muti-cycle path (mesochronous interfaces) - False path (asynchronous interfaces) GALS compatible Clock gating Implemented in 4 steps

14 DSPI clock-tree Mesochronous links Clk Clk Clk Clk GALS compatible Bi-synchronous FIFO [OCS 2007] Max skew 50% clock period Timing constraints: set_multi_cycle_path Clk_oC 5% skew within the router 180 phase shift and 30% skew between routers Router (0,0) Router (0,1) Router (0,2) 4 th Step 30% skew (top tree) 1 st Step 2 nd, 3 th Step 5% skew ( bottom tree) Clock-tree implementation 1. Add buffers/inverters 2. Built bottom clock tree (5% skew) 3. Characterize bottom clock-tree 4. Build top clock-tree (30% skew) 14

15 15 FAUST floor-plan with DSPI W S L RAC Rotor W S RAM1 E W Frame sync. W S W L S L L Equal. E W W E S L E E S W L OFDM mod. ARM946 L E W L S E W S E W L S E W S L E W L Channel Est. OFDM demod. Ala. P1 CDMA Dem. S E L S E CDMA Mod. CLK RAM2 W L S L E W W L L S E S E W S Conv. Dec. E W S E W Demapp. Bit. Inter. L L S E Turbo Dec. Conv. Codec. Ext. RAM Ctrl. Ethernet L E Deinter. E W L S E Distributed router implementation Soft macro approach Higher floor-plan flexibility oc adapts to the SoC Long wires are routed in a tree manner Different router configurations are possible DART (reserved area) W L S E DSPI router Placement density: %

16 16 FAUST floor-plan Mapp. Exp. P2 P2 Exp. P1 CLK Reset FAUST with AOC FAUST with DSPI

17 Outline Motivation FAUST architecture Migration of DSPI into FAUST DSPI implementation etworks comparison 17

18 oc Area AOC DSPI Router mm² mm² Interface GALS mm² mm² Clock tree mm² mm² Total mm² mm² CMOS 130nm AOC is implemented as a hard macro DSPI is implemented as a soft macro DSPI is 33% smaller than AOC 18

19 oc Throughput AOC DSPI Throughput on worst-case conditions Throughput on nominal conditions ~ 160Mflit/s 289Mflit/s ~ 220Mflit/s 408Mflit/s DSPI throughput is deterministic with respect to the clock frequency (one flit per clock cycle) Long wire latency penalty on throughput: - DSPI: critical path crosses one time the long wires - AOC: critical path crosses 4 times the long wires, 4-phase protocol AOC link pipelining is feasible In a commercial circuit, DSPI will be clocked not far away from worst-case (289 MHz) to improve the fabrication yield 19

20 Packet latency (1) AOC DSPI F = 150 MHz AOC DSPI F = 250 MHz Compute the packet latency through many routers DSPI has deterministic packet latency with respect to clock frequency DSPI has lower First and Last packet latencies AOC intermediate latency is lower than DSPI. DSPI resynchronize the data on each router Intermediate router latency First + Last latency 6.80 ns ns ns ns 6.80 ns ns ns ns 20

21 Packet latency (2) AOC DSPI AOC DSPI F = 150 MHz F = 250 MHz Latency for 5 hops path Latency for 9 hops path ns ns ns ns ns ns ns ns The packet latency are similar for clock frequencies >250 MHz Intermediate packet latency is important but the application should be mapped in order to optimize the data locality (try to communicate with neighbor IPs rather than with faraway IPs) 21

22 Power consumption AOC F = 150 MHz DSPI F = 250 MHz Router 2.07 mw 2.89 mw 4.85 mw GALS interface 1.62 mw 0.56 mw 0.81 mw Clock tree 0.00 mw 2.44 mw 4.73 mw Total 3.69 mw 5.89 mw mw Power extraction after P&R Functional packet traffic (OFDM demodulation) Power consumption majorly dominated by FIFO data registers The DSPI clock-gating reduced the power consumption by 67% DSPI clock-tree consumes as much power as the router itself - eeds to improve DSPI clock-gating - GALS clock-tree consumes only 2.5% of total clock-tree power 22

23 Conclusion Physical implementation of the DSPI etwork-on-chip on FAUST platform - Comparison between AOC and DSPI at architecture level - Adaptation of DSPI architecture to manage stream-oriented communications - Implementation up to layout of DSPI network on FAUST platform => DSPI mesochronous links fully implemented with standard tools Comparison between DSPI and AOC ocs (STMicroelectronics 130nm) - Area of DSPI is 33% smaller than AOC one - Maximum sustained throughput of DSPI is 31% higher than AOC - AOC has lower packet latency - DSPI power consumption 1.5 to 3 times higher than AOC AOC is a good candidate for low latency and low power applications, while DSPI is more suited to low area and high performance applications 23

24 24 See you at my PhD defense on May 20 th

Design of On-chip and Off-chip Interfaces for a GALS NoC Architecture

2006 Design of On-chip and Off-chip Interfaces for a GALS NoC Architecture Async 06 Edith Beigné Pascal Vivet {edith.beigne; pascal.vivet}@cea.fr Async 06 Symposium Grenoble P. Vivet 1 2006 FAUST : an