FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations

Size: px

Start display at page:

Download "FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations"

Jennifer Farmer
6 years ago
Views:

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for oc Modeling in Full-System Simulations Michael K. Papamichael, James C. Hoe, Onur Mutlu papamix@cs.

1 FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for oc Modeling in Full-System Simulations Michael K. Papamichael, James C. Hoe, Onur Mutlu papamix@cs.cmu.edu, jhoe@ece.cmu.edu, onur@cmu.edu Computer Architecture Lab at Our work was supported by SF. We thank Xilinx and Bluespec for their FPGA and tool donations. 5/3/2011

Speed, accuracy, flexibility trade-off Full-system simulators sacrifice accuracy for speed and

2 Simulation in Computer Architecture Slow for large-scale multiprocessor studies Full-system fidelity + long benchmarks How can we make it faster? Speed, accuracy, flexibility trade-off Full-system simulators sacrifice accuracy for speed and flexibility Accuracy Speed Flexibility Accelerate simulation with FPGAs Can simulate up to millions of gates Orders of magnitude simulation speedup 2

The FIST Project Explores fast oc models for full-system simulations FPGA-friendly, but avoid direct implementation Low error, many topologies, >10M packets/sec Simpler requirements of full-system

3 The FIST Project Explores fast oc models for full-system simulations FPGA-friendly, but avoid direct implementation Low error, many topologies, >10M packets/sec Simpler requirements of full-system simulation Estimate packet latencies, capture high-order effects P L1 L2 P L1 L2 FPGA (Xilinx LX110T) 4x4 5x5 P L1 L2 P L1 L2 FPGA area requirements for state-of-the-art mesh oc* *oc TL from 3

4 Latency FIST Approach View oc as set of routers/links Abstract router into black-box epresent by load-delay curves Specific to each router configuration and traffic pattern State Logic Buffers Load 4

5 Latency FIST Approach Treat each hop as a set of load-delay curves Trade-off between model complexity and fidelity Keep track of load at each node To track router load monitor traffic over window of time Load 5

6 FIST in Action oute packet from source to destination Determine routers that will be traversed Sum up the delays for each traversed router Index load-delay curves using current load at each router S packet delay D 6

7 Outline Introduction to FIST FIST-based etwork Models Evaluation elated Work & Conclusions

8 Outline Introduction to FIST FIST-based etwork Models Evaluation elated Work & Conclusions

9 Feedback Putting FIST Into Context Detailed network models Cycle-accurate network simulators (e.g. BookSim) Analytical network models Typically study networks under synthetic traffic patterns Updated Curves Train Curves Use Curves etwork models within full-system simulators Model network within a broader simulated system Assign delay to each packet traversing the network Traffic generated by real workloads 9

10 Offline and Online FIST Offline FIST Detailed network simulator generates curves offline Can use synthetic or actual workload traffic Load curves into FIST and run experiment Detailed etwork Model Online FIST (tolerates dynamic changes in network behavior) Initialization of curves same as offline Periodically run detailed network simulator on the side Compare accuracy and, if necessary, update curves Provide feedback and receive updated curves Detailed etwork Model 10

11 Latency Online Training in Action Example with no initial training Before Training El El After Training Elapsed Ellapsed cycles Cycles (in (in 1000s) 1000s) Actual Latency Estimated Latency

12 FIST Applicability FIST-Friendly etworks Exhibit stable, predictable behavior as load fluctuates Actual traffic similar to training traffic FIST Limitations Depends on fidelity, representativeness of training models Higher loads and large buffers can limit FIST s accuracy High network load increased packet latency variance Large buffers increased range of observed packet latencies Cannot capture fine-grain packet interactions Cannot replace cycle-accurate detailed network models FIST only as good as its training data 12

13 Applying FIST to ocs ocs affected by on-chip limitations and scarce resources Employ simple routing algorithms Usually simple deterministic routing Operate at low loads ocs usually over-provisioned to handle worst-case Have been observed to operate at low injection rates Small buffers On-chip abundance of wires reduces buffering requirements Amount of buffering in ocs is limited or even eliminated ocs are FIST-Friendly 13

14 Outline Introduction to FIST FIST-based etwork Models Evaluation elated Work & Conclusions

15 FIST Implementations Software Implementation of FIST (written in C++) Implements online and offline FIST models Hardware Implementation (written in Bluespec) Precisely replicates software-based FIST Block diagram of architecture Packet Descriptors outing Logic outer Elements Src Dest Size Pick routers Calculate Delay Packet Delays Load Tracker Curve BAM Handle load tracking & delay queries Partial Delays 15

Peeking Under The Hood Tracking Latency T 7 6 5 4 3 2 H S D Store-and-forward 0 1 2 0 0 5 10 15 20 25 30 H 2 3 4 5 6 7 T H 2 3 4

H 2 3 4 5 6 7 T H 2 3 H 2 3 4 2 Injection latency 4 5 6 7 Traversal latencies Packet Latency = 30 Latency = Latency = Latency =

16 Peeking Under The Hood Tracking Latency T H S D Store-and-forward H T H T H T 1 2 Packet Latency = 30 Latency = 9 Latency = 14 Latency = Wormhole-routed H T H 2 3 H Injection latency Traversal latencies Packet Latency = 30 Latency = Latency = Latency = Similar issues arise for load tracking & dynamic training T Use separate injection and traversal latency curves per router 5 T 0 1 2?

17 Methodology Examined online and offline FIST models eplaced cycle-accurate oc model in tiled CMP simulator etwork and system configuration 4x4, 8x8, 16x16 wormhole-routed mesh Each network node hosts core+coherent L1 and a slice of L2 Multiprogrammed and multithreaded workloads 26 SPEC CPU2006 benchmarks of varying network intensity 8 SPLASH-2 and 2 PASEC workloads Traffic generated by cache misses Consists of control, data and coherence packets Offline and Online FIST models with two curves per router Curves represent injection and traversal latency at each router Initial training using uniform random synthetic traffic Please see paper for more details! 17

18 Latency/IPC Error (in %) Accuracy esults (offline) 8x8 mesh using FIST offline model Average Latency and Aggregate IPC Error 15% 10% Latency Error IPC Error 5% IPC Error < 4% 0% -5% -10% Latency Error < 8% -15% MP (Low) MP (Med) MP (High) MT (SPL/PA) 18

19 Latency/IPC Error (in %) Accuracy esults (online) 8x8 mesh using FIST online model Average Latency and Aggregate IPC Error 10% % Latency Error IPC Error 0% % -10% -0.1 MP (Low) MP (Med) MP (High) MT (SPL/PA) Both Latency and IPC Error below 3% 19

20 Latency/IPC Latency Error Error (in %) What about a very simple model? 8x8 mesh using hop-based model 1How does simple network model affect high-order results? 80% 0.8 tency Error Error 60% % % 0.2 0% 0-20% -0.2 Latency Error Latency Error IPC Error IPC Error FIST models always within this range -40% % -60% % -0.8 MP (Low) MP (Med) MP (High) MT (SPL/PA) Very high error for both latency and IPC! 20

21 Speed (packets/sec) Performance esults Qualitative comparison 100M 10M 1M 100K 10K 1K HW FIST SW FIST SW oc Sims (e.g. BookSim) Complexity / Level of Detail SW-based speedup results for 16x16 mesh Offline FIST: 43x Online FIST: 18x HW-based speedup (offline): ~3-4 orders of magnitude 21

22 Hardware Implementation esults FPGA resource usage & clock frequency Different mesh configurations Xilinx Virtex-5 LX155T FPGA FIST Model Direct Implementation Size FPGA Area Freq. FPGA Area Freq. 4x4 4% 380 MHz 61% 130 MHz 8x8 15% 263 MHz x12 34% 250 MHz - - Will not fit 16x16 60% 214 MHz x20 94% 200 MHz - - FIST can scale to large ocs with many routers

23 Outline Introduction to FIST FIST-based etwork Models Evaluation elated Work & Conclusions

24 elated Work Vast body of work on network modeling Analytical models, hardware prototyping, etc. Abstract network modeling Performance vs. accuracy trade-off studies [Burger 95] Load-delay curve representation of network [Lugones 09] FPGAs for network modeling Cycle-accurate fidelity at the cost of limited scalability Time-multiplexing can help with scalability [Wang 10] But still suffer from high implementation complexity

25 Conclusions & Future Directions Conclusions Full-system simulators can tolerate small inaccuracies FIST can provide fast SW- or HW-based oc models SW model provides 18x-43x average speedup w/ <2% error HW model can scale to 100s routers with >1000x speedup ocs within a CMP are FIST-friendly But not all networks good candidates for FIST modeling Future Directions FPGA-friendly oc models at multiple levels of fidelity Configurable generation of hardware oc models

26 Thanks! Questions?

Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs

This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Re-Examining Conventional Wisdom for Networks-on-Chip in the Context of FPGAs