Ultra-Fast NoC Emulation on a Single FPGA


1 Ultra-Fast NoC Emulation on a Single FPGA
Thiem Van Chu, Shimpei Sato, and Kenji Kise
Tokyo Institute of Technology
The 25th International Conference on Field-Programmable Logic and Applications (FPL 2015), September 3, 2015

2 Contributions
- Methodologies for emulating Network-on-Chip (NoC) architectures with up to 1000s of nodes on a single FPGA
  - Up to 16,384 nodes
  - Cycle-accurate, with a 5000x simulation speedup over BookSim, a widely used software simulator
- Producing a load-latency graph:
  - BookSim on a high-end PC (Core i CPU, 32GB RAM): several days
  - Proposal on a Virtex-7 FPGA: several minutes

3 Multi/Many-Core Architectures Have Become Mainstream
- Multi/many-core chips with 60+ cores and 100 cores (figure)
- 1000s of cores in near-future architectures

4 Network-on-Chip (NoC)
- The interconnection network of many-core architectures
- NoC simulation plays a vital role in designing many-core architectures
[Figure: a many-core architecture with a 2D mesh NoC. Legend: C = core, R = router]

5 Software Simulators
- Flexible and easy to debug
- Too slow to simulate large architectures
- Parallelization is non-trivial: without sacrificing accuracy, only a limited degree of parallelization can be achieved
Typical simulation time of 1000s-core architectures:
  Simple models:   several days
  Moderate models: several months
  Detailed models: several years

6 FPGA-Accelerated Simulation
- Avoids impractical designs
- Possible to reuse RTL code
- Can achieve an ultra-fast simulation speed
  - Many operations can be simulated simultaneously in one tick of the FPGA's clock
  - Adding detail to a model requires more hardware, but does not necessarily degrade performance

7 FPGA-Accelerated Simulation Challenges
- Single FPGA: limited resources
- Multiple FPGAs or off-chip memory: more resources, but more complex and requires off-chip communication
- Memory constraint: the source queues must be very large to ensure that no generated packet is dropped (software simulators can rely on dynamic memory allocation)
[Figure: Traffic Generation Units 1 to N, each feeding its Source Queue (1 to N), which feeds the Network]

8 Proposals
- Method 1: Decoupling Time Counters
  - Eliminates the memory constraint (source queues that must be very large to ensure that no generated packet is dropped)
- Method 2: Time-Multiplexing
  - A physical cluster emulates many logical clusters
  - Simulates architectures with 1000s of cores on a single FPGA
[Figure: Traffic Generation Units 1 to N and Source Queues 1 to N feeding the Network; logical clusters mapped onto a physical cluster]

9 Emulation Model
Three basic components:
- Router: a state-of-the-art pipelined router architecture
- Traffic generator: generates and injects synthetic workloads
- Traffic sink: collects performance characteristics
[Figure: the primary router model (Routing Logic, VC Allocator, Switch Allocator, Input Ports, Crossbar); Traffic Generators and Traffic Sinks attached to the Network]

10 Emulation Model: 2D Mesh
- Each router port exchanges flits and credits with its neighbor (flit in / credit out, flit out / credit in)
- Traffic sink: records the number of packets and the total latency
- Traffic generator: Packet Source, Source Queue, Flit Generator (flit to router, credit from router)
  - Packet source: models injection processes (e.g. Bernoulli process)
  - Source queue: every packet generated by the packet source is stored in the source queue until it can enter the network
  - Flit generator: models traffic patterns (e.g. uniform random)
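The three parts of the traffic generator can be sketched in software as follows. This is a minimal illustration, not the authors' RTL: the function names, the dictionary flit format, and the choice to allow a node to pick itself as destination are assumptions made for the example. The 8-flit packet length is taken from the configuration slide.

```python
import random

PACKET_LEN_FLITS = 8  # packet length from the configuration slide


def packet_source(rate, rng):
    """Bernoulli injection process: one packet with probability `rate` per cycle."""
    return rng.random() < rate


def flit_generator(src, num_nodes, rng):
    """Apply a uniform-random traffic pattern and split one packet into flits."""
    # Assumption: every node, including src itself, is an equally likely destination.
    dest = rng.randrange(num_nodes)
    flits = []
    for k in range(PACKET_LEN_FLITS):
        if k == 0:
            kind = "head"
        elif k == PACKET_LEN_FLITS - 1:
            kind = "tail"
        else:
            kind = "body"
        flits.append({"src": src, "dest": dest, "kind": kind})
    return flits


def generate(src, num_nodes, rate, cycles, seed=0):
    """Return (generation cycle, flits) for every packet produced by one node."""
    rng = random.Random(seed)
    packets = []
    for cycle in range(cycles):
        if packet_source(rate, rng):
            packets.append((cycle, flit_generator(src, num_nodes, rng)))
    return packets
```

With `rate=1.0` a packet is produced in every cycle, and with `rate=0.0` none is, which makes the Bernoulli parameter easy to check by hand.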

11 Decoupling Time Counters
Conventional approach:
- Every packet source is synchronized with the network
- The source queues must be very large to cope with the case where the packet sources generate so many packets that the network becomes very congested
[Figure: Packet Sources, Source Queues, and the Network all driven by a single time counter T]

12 Decoupling Time Counters
Proposal:
- Each packet source, as well as the network, has its own time counter and operates based on a separate state machine with two states: R (Running) and W (Waiting)
- The state transitions are based on the status of the source queues and the relationship between the time counters
[Figure: Packet Sources 1 to N with time counters T_1 to T_N, Source Queues 1 to N, and the Network with time counter T, each with its own R/W state machine]

13 Decoupling Time Counters
State machine of packet source i (PS_i), with states R (Running) and W (Waiting):
- R to W: PS_i intends to generate a packet but SQ_i is full
- W to R: one space in SQ_i becomes available
No packet is dropped. But some packets may not be generated on time.
[Figure: Packet Sources 1 to N (T_1 to T_N), Source Queues 1 to N, Network (T)]

14 Decoupling Time Counters
State machine of the network, with states R (Running) and W (Waiting):
- R to W: for some i, SQ_i is empty and T_i < T
- W to R: for all i, SQ_i is not empty or T_i = T (either SQ_i contains at least one packet or the time of PS_i is synchronized with the time of the network)
The specified synthetic workload is simulated accurately.
[Figure: Packet Sources 1 to N (T_1 to T_N), Source Queues 1 to N, Network (T)]
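The two state machines can be exercised with a small software model. The sketch below is an illustration under stated assumptions, not the authors' hardware: the network is reduced to a hypothetical round-robin drain of at most `drain` packets per cycle, a packet source is not allowed to run ahead of the network, and all names are invented for the example. Passing `queue_len=None` gives unbounded queues, so no source ever waits, which corresponds to the conventional synchronized approach.

```python
import random
from collections import deque


def run(num_sources, rate, cycles, queue_len, drain, seed=0):
    """Return (injection log, largest source-queue occupancy seen)."""
    rngs = [random.Random(seed + i) for i in range(num_sources)]
    queues = [deque() for _ in range(num_sources)]   # SQ_i, holds generation times
    t_src = [0] * num_sources                        # T_i
    pending = [False] * num_sources                  # PS_i holds a packet it could not enqueue
    t_net = 0                                        # T
    log, max_occupancy = [], 0
    while t_net < cycles:
        for i in range(num_sources):
            if t_src[i] >= t_net:
                continue  # assumption: a source never runs ahead of the network
            if not pending[i]:
                pending[i] = rngs[i].random() < rate  # Bernoulli draw for cycle T_i
            if pending[i]:
                if queue_len is not None and len(queues[i]) >= queue_len:
                    continue  # PS_i is Waiting: SQ_i is full, T_i does not advance
                queues[i].append(t_src[i])            # packet keeps its generation time
                pending[i] = False
                max_occupancy = max(max_occupancy, len(queues[i]))
            t_src[i] += 1
        # Network runs only if, for all i, SQ_i is not empty or T_i = T.
        if all(len(queues[i]) > 0 or t_src[i] == t_net for i in range(num_sources)):
            served = 0
            for k in range(num_sources):              # hypothetical round-robin drain
                i = (t_net + k) % num_sources
                if queues[i] and served < drain:
                    log.append((t_net, i, queues[i].popleft()))
                    served += 1
            t_net += 1
    return log, max_occupancy
```

In this model, a run with 8-entry source queues produces the same injection log as a run with unbounded queues, even when the offered load exceeds what the network drains. This illustrates why stalling the network, rather than enlarging the queues, is enough to reproduce the workload.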

15 Time-Multiplexing
- To complete one cycle of the network, the physical cluster sequentially emulates a number of logical clusters
[Figure: the mesh partitioned into logical clusters for physical cluster sizes 1x1, 1x2, and 2x]

16 Time-Multiplexing
- Combinational logic and block RAMs (BRAMs) are utilized much more efficiently because they can be shared between many NoC nodes
- Input buffers can be implemented using BRAMs
- Example: 128x128 mesh NoC
  - Direct implementation: 5x128x128 = 81,920 BRAMs
  - Using time-multiplexing: < 500 BRAMs
- Method for translating to emulation code: please see the paper
[Figure: router (Routing Logic, VC Allocator, Switch Allocator, Input Ports, Crossbar) with flit in / credit out and flit out / credit in signals]
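The direct-implementation figure can be reproduced with a one-line calculation. The sketch assumes that the factor 5 is the number of input ports of a 2D mesh router (north, south, east, west, local) and that each input buffer occupies one BRAM; the slide gives only the product.

```python
MESH_X, MESH_Y = 128, 128
PORTS_PER_ROUTER = 5  # assumption: north, south, east, west, and local input ports


def direct_bram_count(mesh_x, mesh_y, ports_per_router):
    """BRAMs needed if every input buffer of every router gets its own BRAM."""
    return ports_per_router * mesh_x * mesh_y


print(direct_bram_count(MESH_X, MESH_Y, PORTS_PER_ROUTER))  # 81920
```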

17 Datapath
- The physical cluster emulates different logical clusters using different states loaded from the state memory
- The in buffer and out buffer store data passed between logical clusters
- For each logical cluster: (1) load a state and emulate the corresponding logical cluster, (2) store the updated state
[Figure: logical clusters (numbered up to N-1), In Buffer, State Memory, Physical Cluster, Out Buffer]

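The load, emulate, and store loop can be sketched in software. This is a simplified illustration, not the datapath of the paper: the nodes form a ring instead of a mesh, `node_logic` is a hypothetical stand-in for the router logic, and the out buffer of one emulation cycle is simply reused as the in buffer of the next.

```python
def node_logic(state, from_left, from_right):
    """Hypothetical stand-in for the combinational logic of one node."""
    return (3 * state + from_left + 2 * from_right + 1) % 251


def step_direct(states):
    """Reference: all nodes of the ring update in the same cycle."""
    n = len(states)
    return [node_logic(states[j], states[j - 1], states[(j + 1) % n])
            for j in range(n)]


def step_time_multiplexed(state_memory, in_buffer):
    """One emulation cycle: a single physical node visits every logical node."""
    n = len(state_memory)
    out_buffer = [None] * n
    for j in range(n):
        state = state_memory[j]                       # (1) load a state
        new_state = node_logic(state, in_buffer[j - 1], in_buffer[(j + 1) % n])
        state_memory[j] = new_state                   # (2) store the updated state
        out_buffer[j] = new_state                     # data for the neighbors, next cycle
    return out_buffer                                 # becomes the next in buffer


def emulate(initial, cycles):
    """Run the time-multiplexed model for a number of emulation cycles."""
    state_memory = list(initial)
    in_buffer = list(initial)
    for _ in range(cycles):
        in_buffer = step_time_multiplexed(state_memory, in_buffer)
    return state_memory
```

Because neighbors are always read from the in buffer, a logical node never sees a value that was already updated in the current emulation cycle, so the sequential loop gives the same result as the parallel update in `step_direct`.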

19 Example
[Figure: one step of emulation cycle i. Legend for logical clusters: emulated, emulating, to be emulated. The cluster being emulated needs data D_0, D_2, and D_5: D_0 is read from the In Buffer, D_2 and D_5 from the Out Buffer. Blocks shown: State Memory, In Buffer, Physical Cluster, Out Buffer, and a timeline of the data items for emulation cycles i and i+1.]

20 Evaluation and Analysis
- 128x128 mesh NoC (16,384 nodes) on a Xilinx VC707 board
- Four NoC designs
  - 5-stage 2-VC: canonical 5-stage pipelined VC router architecture with 2 VCs per port
  - 5-stage 1-VC: canonical 5-stage pipelined VC router architecture with 1 VC per port
  - 4-stage 2-VC: canonical 4-stage pipelined VC router architecture with 2 VCs per port
  - 4-stage 1-VC: canonical 4-stage pipelined VC router architecture with 1 VC per port
- Three configurations
  - 4-phy (2x2): use four nodes to emulate the entire mesh network
  - 16-phy (4x4): use 16 nodes to emulate the entire mesh network
  - 32-phy (8x4): use 32 nodes to emulate the entire mesh network
- Metrics
  - Hardware usage
  - Verification against BookSim, a widely used cycle-accurate software simulator
  - Simulation performance: speedup over BookSim
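The amount of time-multiplexing in each configuration follows from the mesh size. The sketch below assumes that the mesh is tiled by logical clusters of the same shape as the physical cluster; the slides state only the physical cluster shapes.

```python
MESH_X, MESH_Y = 128, 128
CONFIGS = {"4-phy": (2, 2), "16-phy": (4, 4), "32-phy": (8, 4)}


def logical_clusters(mesh_x, mesh_y, shape):
    """Number of logical clusters one physical cluster of `shape` must emulate."""
    width, height = shape
    # Assumption: the cluster shape tiles the mesh exactly.
    assert mesh_x % width == 0 and mesh_y % height == 0
    return (mesh_x // width) * (mesh_y // height)


for name, shape in CONFIGS.items():
    print(name, logical_clusters(MESH_X, MESH_Y, shape))
# 4-phy 4096
# 16-phy 1024
# 32-phy 512
```

Under this assumption, going from 4-phy to 32-phy cuts the number of logical clusters emulated per network cycle by a factor of 8.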

21 Configuration Parameters
  Topology:             128x128 mesh (16,384 nodes)
  Router architecture:  5-stage pipelined VC router or 4-stage pipelined VC router (employing look-ahead routing)
  # of VCs per port:    2 or 1
  Routing algorithm:    Dimension-order (XY)
  Flow control:         Credit-based
  VC/Switch allocator:  Separable output first
  Arbiter type:         Fixed priority
  Flit size:            25-bit or 22-bit
  VC size:              4-flit
  Packet length:        8-flit
  Injection process:    Bernoulli
  Traffic pattern:      Uniform random
  Source queue length:  8-entry

22 Hardware Usage
- Simulation of the same NoC design using 4 physical nodes (4-phy), 16 physical nodes (16-phy), and 32 physical nodes (32-phy)
- Slices (registers and logic): using more physical nodes consumes more registers and logic
- BRAMs (memory): only a minor difference in the number of occupied BRAMs
[Figure: bar chart of slice and BRAM utilization (0% to 100%) for the four NoC designs (5-stage 2-VC, 5-stage 1-VC, 4-stage 2-VC, 4-stage 1-VC), each in the 4-phy, 16-phy, and 32-phy configurations]

23 Accuracy
- The proposed methods do not affect the simulation accuracy
  - Synthetic workloads can be modeled accurately without using a large amount of memory
  - No compromise in simulation accuracy is made
- Verification: compare the output of the FPGA-based emulator with that of BookSim when simulating the 4 NoC designs
  - 5-stage 2-VC
  - 5-stage 1-VC
  - 4-stage 2-VC
  - 4-stage 1-VC

24 Verification: Proposal vs BookSim
- Solid lines: Proposal (FPGA-based)
- Dotted lines: BookSim (software-based)
- The results are nearly identical

25 Simulation Performance
  Topology:             128x128 mesh
  Router architecture:  5-stage
  # of VCs per port:    2
- The drop is caused by stalling the emulated network in the first proposed method, which helps to eliminate the memory constraint
[Figure: speedup over BookSim plotted against the amount of network activity]

26 Simulation Performance
- Speedup over BookSim: 702x (4-phy), 2782x (16-phy), 5490x (32-phy)
- BookSim on a high-end PC (Core i CPU, 32GB RAM): 5.5 days
- Proposal on a Virtex-7 FPGA: 1.5 minutes
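As a rough consistency check, the two run times on this slide can be turned into a ratio. The sketch assumes that both times belong to the 32-phy configuration and that they are rounded, so the result is expected to be close to, but not exactly, the reported 5490x.

```python
MINUTES_PER_DAY = 24 * 60

booksim_minutes = 5.5 * MINUTES_PER_DAY   # BookSim: 5.5 days
proposal_minutes = 1.5                    # Proposal: 1.5 minutes

ratio = booksim_minutes / proposal_minutes
print(ratio)  # 5280.0
```

The result, 5280, is within about 4% of 5490, which is plausible if the slide rounds the run times to one decimal place.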

27 Conclusions & Future Work
Conclusions:
- Two methods are proposed to enable ultra-fast and accurate emulation of large-scale NoC architectures on a single FPGA
- More than 5000x simulation speedup over BookSim is achieved when emulating a 128x128 NoC with state-of-the-art router architectures
Future work:
- Support full-system simulations
- Support a wide range of benchmarks/workloads

28 Q & A
Thank you!
