ProtoFlex: FPGA-Accelerated Hybrid Simulator


1 ProtoFlex: FPGA-Accelerated Hybrid Simulator
Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai
Computer Architecture Lab at

2 Multiprocessor Simulation

- Simulating one processor in software is slow; simulating many processors is even slower.
- The hardware concurrency of FPGA emulation can scale up multiprocessor simulation speed.
- BUT we also want full-system fidelity (OS and I/O support), and the effort of building a full system on FPGA can outweigh the benefits.

[Figure labels: CPU, CPU, I/O MMU controller, DMA controller, IRQ controller, Terminal, FPGA, PCI Bus, Memory, Graphics card, Ethernet controller, SCSI controller, Disk, Disk]

January 2007, slide-2

3 Combining Simulators & FPGAs

Simulators already provide full-system behaviors, so why not just simulate the infrequent behaviors (e.g., I/O devices)?

[Figure: FPGA hosting CPUs and memory next to a simulator hosting CPUs, memory, SCSI, Ethernet, and disk]

Advantages:
- avoid implementing infrequent behaviors
- simplify full-system emulator development
- low impact on scalability and performance acceleration

4 Transplanting Hybrid Simulator

[Figure: target objects of the target design (CPU, CPU, mem, I/O) mapped onto the FPGA and simulator hosts by mappings 1, 2, and 3]

3 ways to map a target object to a hybrid-simulation host:
- Emulation-only
- Simulation-only
- Transplantable

Transplantable objects switch modes between the FPGA and simulator hosts:
- the complete behavior need not be implemented in the FPGA
- i.e., implement only the frequently used ISA subset in the FPGA

5 Transplant Example

Target-to-host mappings:
- CPU = transplantable
- Memory = FPGA-only
- Devices = SW-only

Example CPU instruction stream: load, add, multiply, I/O (SCSI cmd), add, sub, ...

[Figure: timeline in which the CPU runs on the FPGA alongside memory, the CPU state is transferred to the simulator for the SCSI command to the disk, and execution then continues]
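
The instruction stream above can be traced with a small sketch. This is an illustration only, not the ProtoFlex implementation: the set `FPGA_SUPPORTED`, the operation names, and the `run` helper are all hypothetical, and a "transfer" here just counts each change of host.

```python
# Hypothetical subset of operations implemented in the FPGA (an assumption
# made for this sketch; the slides do not list the actual subset here).
FPGA_SUPPORTED = {"load", "add", "multiply", "sub"}

def run(stream):
    """Return (trace, transfers): which host executed each op, and how many
    times the CPU state moved between the FPGA and the simulator."""
    host = "FPGA"
    trace = []
    transfers = 0
    for op in stream:
        wanted = "FPGA" if op in FPGA_SUPPORTED else "Simulator"
        if wanted != host:
            transfers += 1  # CPU state is transferred to the other host
            host = wanted
        trace.append((op, host))
    return trace, transfers

stream = ["load", "add", "multiply", "io_scsi_cmd", "add", "sub"]
trace, transfers = run(stream)
for op, host in trace:
    print(f"{op:12s} -> {host}")
print("state transfers:", transfers)
```

In this sketch the single I/O operation costs two state transfers, one to the simulator and one back to the FPGA.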

6 It Really Works

[Figure: Xilinx XUP Virtex-II Pro 30 (BlueSPARC, DDR memory, embedded PowerPC, transplant & message interface) connected over Ethernet to Virtutech Simics, a commercial simulator (Simics UltraSPARC, simulated target devices)]

- FPGA + Simics = SUN 3800 Server (1x UltraSPARC III, Solaris 8)
- 1 graduate student in 6 months

7 Our BlueSPARC model

- UltraSPARC III ISA (64-bit, V9)
- multi-cycle, in-order microarchitecture (avg. CPI ~ 6)
- implements only 38 out of 93 instr+event classes
- % dynamic instructions in SPECINT GZIP
- remaining behaviors transplanted to software

Implementation:
- 7K lines in Bluespec synthesizable HDL
- 13K LUTs partial SPARC core (47% of XC2VP30) + 3K LUTs L1 I/D cache + 6K LUTs Xilinx glue/peripherals
- 100 MHz, 16 MIPS
- HW/SW co-simulation validation

8 Transplantation Services

Unimplemented instructions and I/O device accesses interrupt the embedded PowerPC and transplant to Simics.

Software on the PowerPC:
- analyzes the request
- transfers only the required processor state to Simics
- restores the updated processor state returned by Simics

Transplanting cost = 10 millisec per instance

[Figure: Xilinx XUP Virtex-II Pro 30 (BlueSPARC, DDR memory, embedded PowerPC, transplant & message interface) connected over Ethernet to Virtutech Simics (Simics UltraSPARC, simulated target devices)]

9 Performance Reality Check

- Current BlueSPARC core = 16 MIPS raw
- Transplant overhead = 10 millisec, or 1 million cycles

$$\mathrm{MIPS}_{\mathrm{effective}} = \frac{\mathrm{MIPS}_{\mathrm{raw}}}{1 + \mathrm{MIPS}_{\mathrm{raw}} \times \mathrm{rate}_{\text{xplant-per-million}} \times 0.010}$$

- Even if just 1 in 1 million instructions requires transplanting, effective MIPS goes down to 13.8 MIPS.

Big problem:
- decreasing the transplant rate any further would require the escalating effort of implementing many rare and difficult instructions
- increasing MIPS raw increases the discounting factor, such that 100 MIPS raw would yield only 50 MIPS effective

Can't fight against diminishing returns!
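
The effective-MIPS relation on this slide is easy to check numerically. The sketch below is a direct transcription of that formula; the function and parameter names are chosen here for illustration and do not come from the slides.

```python
def effective_mips(mips_raw, transplants_per_million, transplant_seconds=0.010):
    """Effective MIPS after charging each transplant a fixed time penalty.

    One million instructions take 1/mips_raw seconds of raw execution plus
    transplants_per_million * transplant_seconds of transplant overhead.
    """
    return mips_raw / (1 + mips_raw * transplants_per_million * transplant_seconds)

# The two cases quoted on the slide: 1 transplant per million instructions.
for raw in (16, 100):
    print(raw, "MIPS raw ->", round(effective_mips(raw, 1), 1), "MIPS effective")
```

With one transplant per million instructions this gives about 13.8 MIPS for the 16 MIPS core and 50 MIPS for a 100 MIPS core, matching the figures quoted above.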

10 Hierarchical Micro-Transplantation

Recall the lessons in hierarchical cache design.

Run a SW simulator kernel on the embedded PowerPC:
- very little work to cover nearly the entire ISA
- only I/O operations need a full-blown transplant to Simics (a 10x saving in our case)

Key implications:
1. Now it makes sense to improve MIPS raw by pipelining
2. You actually need to put fewer instructions in HW

[Figure: three-level hierarchy. FPGA fabric: coverage = %, CPI raw = 6. Embedded PPC ISA simulator: coverage = %, CPI = 4000. Full-system Simics: coverage = 100%, CPI = 1,000,000. Annotations: CPI effective = 7.6, CPI effective = 6.2]
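
By analogy with average memory access time in a cache hierarchy, the effective CPI of such a hierarchy can be sketched as a coverage-weighted sum. The per-level CPI values (6, 4000, 1,000,000) are taken from the slide; the coverage fractions below are hypothetical placeholders, since the slide's actual percentages are not recoverable, and treating coverage as cumulative is also an assumption.

```python
def effective_cpi(levels):
    """levels: list of (cumulative_coverage, cpi), fastest level first.

    Each level handles the instructions it covers that no faster level
    covered, so its share is the difference in cumulative coverage.
    """
    total = 0.0
    previous = 0.0
    for coverage, cpi in levels:
        total += (coverage - previous) * cpi
        previous = coverage
    return total

# Hypothetical coverage values, for illustration only.
two_level = [(0.9999, 6), (1.0, 1_000_000)]                      # FPGA, Simics
three_level = [(0.9999, 6), (0.999999, 4000), (1.0, 1_000_000)]  # FPGA, PPC, Simics

print("FPGA + Simics:       ", effective_cpi(two_level))
print("FPGA + PPC + Simics: ", effective_cpi(three_level))
```

Under these made-up fractions, routing most misses to the embedded PowerPC instead of Simics removes most of the transplant penalty, which is the effect the slide describes.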

11 How to decide what goes where?

Whether an instruction should be supported by the FPGA, the embedded PPC, or Simics depends on:
1. frequency of occurrence
2. cost in implementation effort
3. cost in logic resources
4. relative performance of the 3 options
5. simulation performance goal

Use a linear programming solver:
- assign instruction types to the 3 options
- minimize implementation effort while satisfying the performance goal (e.g., 90% of MIPS raw) and resource bounds
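
The assignment problem above can be illustrated on a toy instance. Note that this sketch swaps the linear programming solver for a plain exhaustive search, which is only practical for a handful of instruction classes. The per-host CPI values come from the slides; every instruction class, frequency, effort figure, and LUT count below is invented for illustration.

```python
from itertools import product

HOSTS = ("FPGA", "PPC", "Simics")
CPI = {"FPGA": 6, "PPC": 4000, "Simics": 1_000_000}  # per-host CPI from the slides

# Hypothetical instruction classes (all numbers invented for this sketch):
# name -> (dynamic frequency, implementation effort per host, LUTs if on FPGA)
CLASSES = {
    "alu":        (0.70,      {"FPGA": 2,  "PPC": 1,  "Simics": 0}, 3000),
    "load_store": (0.25,      {"FPGA": 4,  "PPC": 2,  "Simics": 0}, 4000),
    "branch":     (0.0499,    {"FPGA": 3,  "PPC": 1,  "Simics": 0}, 2000),
    "fp_div":     (0.0000999, {"FPGA": 10, "PPC": 2,  "Simics": 0}, 6000),
    "io":         (0.0000001, {"FPGA": 30, "PPC": 15, "Simics": 0}, 8000),
}

def solve(classes, cpi_limit, lut_budget):
    """Try every class-to-host assignment; return (effort, assignment) with the
    least total effort that meets the CPI limit and LUT budget, or None."""
    best = None
    names = list(classes)
    for hosts in product(HOSTS, repeat=len(names)):
        cpi = effort = luts = 0
        for name, host in zip(names, hosts):
            freq, efforts, fpga_luts = classes[name]
            cpi += freq * CPI[host]
            effort += efforts[host]
            luts += fpga_luts if host == "FPGA" else 0
        if cpi <= cpi_limit and luts <= lut_budget:
            if best is None or effort < best[0]:
                best = (effort, dict(zip(names, hosts)))
    return best

# Performance goal: 90% of raw MIPS, i.e. effective CPI at most 6 / 0.9.
effort, assignment = solve(CLASSES, cpi_limit=6 / 0.9, lut_budget=13_000)
print(effort, assignment)
```

On this toy data the frequent classes land on the FPGA, the rare but not negligible class on the PPC, and the rarest class stays in Simics.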

12 Profiling and Solver Results

[Figure: % of total instruction behaviors (0% to 100%) assigned to Unused, Simics, PPC, and FPGA for db2-tpcc, oracle-tpcc, gcc, and gzip]

- only about half of the instructions need to be on the FPGA
- rare or hard instructions can be left in SW

13 GZIP/Solaris8 ProtoFlex Screenshot

- first 4 billion instructions of the SPEC2000 GZIP benchmark, train input

[Screenshot panels: ideal and actual CPI; time per transplant]

14 How to build a 1K-node MP emulator, without building 1024 nodes?

15 How fast do you need to simulate?

[Figure: slowdown relative to the real system (axis up to 1000, with sim-outorder marked) versus number of processor cores in the simulated system, with aggregate-throughput curves for 10 MIPS, 100 MIPS, and 1000 MIPS simulators]

In the uniprocessor world:
- up to 100x slowdown for interactive software research (e.g., Simics)
- 1k to 10k slowdown for design exploration (e.g., SimpleScalar)

16 Different ways to simulate 1K cores

Even for a 1K-node MP, you only need 1000 to 10,000 MIPS (in aggregate) to do useful work.

The naïve approach:
- build a fast ISA core (estimate 100 MIPS per core)
- physically replicate the core 1000 times
- 10x to 100x faster than it needs to be
- Why spend effort on performance I don't need?

The better approach:
- think in terms of MIPS
- build a 100-MIPS ISA core with a statically interleaved pipeline that can support multiple contexts
- interleave 100 contexts per core to emulate a 1K-node system with just 10 physical cores
- the parameters and the effort required can be tuned to make the emulator just fast enough and not more
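
The sizing arithmetic on this slide can be written out as a short sketch. The function and parameter names are hypothetical; the numbers (1000 contexts, 100 MIPS per core, 1000 MIPS aggregate) are the ones quoted above.

```python
import math

def size_emulator(target_contexts, aggregate_mips_needed, mips_per_engine):
    """Size the emulator by required throughput, not by target node count."""
    engines = math.ceil(aggregate_mips_needed / mips_per_engine)
    contexts_per_engine = math.ceil(target_contexts / engines)
    mips_per_context = mips_per_engine / contexts_per_engine
    return engines, contexts_per_engine, mips_per_context

# Interleaved approach: 1K-node target, 1000 MIPS aggregate, 100 MIPS per core.
print(size_emulator(1000, 1000, 100))

# Naive approach for comparison: one 100-MIPS core per target node.
naive_aggregate = 1000 * 100
print("naive aggregate MIPS:", naive_aggregate)
```

The interleaved plan needs 10 engines with 100 contexts each, while the naive plan delivers 100,000 MIPS in aggregate, which is 10x to 100x the 1000 to 10,000 MIPS the slide says is needed.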

17 PROTOFLEX MP

Build a 1000-MIPS simulator from 10s of FPGAs:
- maximize throughput per emulation engine, to be shared by multiple interleaved contexts
- multiplex a large number of emulated contexts onto a few emulation engines

Base the number of emulation engines you need on how much performance you need, and not on how many nodes you are emulating.

[Figure: N-way target system (CPU 1 ... CPU N) mapped onto P-way FPGA emulation engines (CPU 1 ... CPU P, plus memory), P << N]

18 Interleaved Emulation Engine

Each emulation engine is essentially a statically interleaved multithreaded datapath (à la HEP):
- has a simpler pipeline without forwarding or interlock
- can have deeper pipelines for higher frequency
- can help hide the latency of memory and transplant

It is actually easier to optimize instruction throughput.

Issues to work out:
- How to manage a very large number of core contexts? Do we need to dynamically page clusters of contexts in and out of the core?
- How to fake memory capacity? How much DRAM do you need to emulate a 1000-node system?
- Lots of interesting problems left
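
A minimal sketch of why static interleaving can drop forwarding and interlock logic: if contexts issue in strict round-robin order and there are at least as many contexts as pipeline stages, no context has two instructions in flight at once. The helper names and the simple issue model below are assumptions for illustration, not the ProtoFlex design.

```python
def interleave(num_contexts, cycles):
    """Static round-robin issue: cycle c issues one instruction from context c mod N."""
    return [cycle % num_contexts for cycle in range(cycles)]

def needs_interlock(schedule, pipeline_depth):
    """True if any context issues again before its previous instruction has
    left a pipeline of the given depth (so a hazard would be possible)."""
    last_issue = {}
    for cycle, ctx in enumerate(schedule):
        if ctx in last_issue and cycle - last_issue[ctx] < pipeline_depth:
            return True
        last_issue[ctx] = cycle
    return False

# 8 contexts on a 5-stage pipeline: each context reissues every 8 cycles.
print(needs_interlock(interleave(8, 64), pipeline_depth=5))
# 2 contexts on the same pipeline: a context reissues after only 2 cycles.
print(needs_interlock(interleave(2, 64), pipeline_depth=5))
```

The same spacing between a context's instructions is what gives the engine slack to hide memory and transplant latency.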

19 What about performance simulation?

ProtoFlex facilitates performance simulation via simulation sampling [Wenisch et al., IEEE Micro, Aug 2006]:
- obtain accurate performance estimates by measuring only many small segments of execution

[Figure: execution timeline alternating Functional Warming, Detailed Warmup, and Measurement]

- the amount you sample is so small that the speed of the timing simulator is inconsequential
- the bottleneck really is in architectural-level simulation, which is needed
  - to generate the architectural state at the start of the sampled sections
  - to maintain the microarchitecture structures with long transients (L2 cache) in between sampled sections
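
The sampling idea can be sketched with plain systematic sampling over a per-window CPI trace. This is a generic illustration, not the procedure from the cited paper: the synthetic trace, the window granularity, and the `sample_cpi` helper are all hypothetical.

```python
import math
import statistics

def sample_cpi(trace, period, offset=0):
    """Measure every `period`-th window and return (mean CPI, standard error,
    number of measured windows). Everything in between would only need fast
    functional simulation to keep architectural state and caches warm."""
    samples = trace[offset::period]
    mean = statistics.mean(samples)
    if len(samples) > 1:
        stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    else:
        stderr = 0.0
    return mean, stderr, len(samples)

# Synthetic per-window CPI trace (made up for this sketch).
trace = [1.0 + 0.1 * (i % 13) for i in range(10_000)]

mean, stderr, measured = sample_cpi(trace, period=100)
print(f"measured {measured} of {len(trace)} windows")
print(f"estimated CPI {mean:.3f} +/- {stderr:.3f}, true mean {statistics.mean(trace):.3f}")
```

Only 1% of the windows are measured in detail here, which is why the speed of the detailed timing model matters far less than the speed of the simulation that runs between samples.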

20 Conclusions

Technology to build a large-scale full-system multicore/multiprocessor simulator:
- Use hybrid transplantation to avoid a full-system construction effort
- Use interleaved emulation cores to reduce physical system size and complexity

[Figure: FPGA platforms (BEE2) hosting components on FPGAs (CPU, CPU, CPU, Memory); micro-transplant simulators (MMU, DMA, Terminal); full-system simulator host (Graphics, NIC, SCSI)]

21 ProtoFlex
Computer Architecture Lab (CALCM)
