ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs

Size: px

Start display at page:

Download "ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs"

Loraine Lester
5 years ago
Views:

1 rotoflex Tutorial: Full-System M Simulations Using FGAs Eric S. Chung, Michael apamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai ROTOFLEX Computer Architecture Lab at Our work in this area has been supported in part by NSF, IBM, Intel, Sun, and Xilinx.

2 Overview art I: Basic rotoflex Concepts FGA Virtualization techniques BlueSARC FGA-accelerated simulator art II: Hands-on Technology review Access CMU s remote FGA infrastructure Compile and run code within an FGA-accelerated simulation of 16-CU UltraSARC III server Run real-time CM cache simulations 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 2

3 art I: Basic Concepts 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 3

4 Full-system Functional Simulators Convenient exploration of real (or future) HW Can boot OS, run commercial apps But too slow for large-scale (>100-way) M studies Existing functional simulators single-threaded High instrumentation overhead 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 4

5 FGA-based simulation Advantages Fast and flexible Low HW instrumentation performance overhead 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 5

scale Full-system support need device models + full ISA

6 FGA-based simulation Caveat: simulation w/ FGAs more complex than SW Large-M configurations (>100) expensive to scale Full-system support need device models + full ISA 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 6

7 Reducing Complexity 1 Hybrid Full-System Simulation 2 Multiprocessor Host Interleaving CU Memory Common-case behaviors Devices Uncommon behaviors 4-way Memory 4-way 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 7

8 Outline Hybrid Full-System Simulation Multiprocessor Host Interleaving BlueSARC Implementation Conclusion 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 8

transplanting between hosts Only common-case instructions/behaviors implemented in FGA Complex, rare behaviors

9 Hybrid Full-System Simulation FGA host transplant 3 I/O instr CD FIBRE TERM C Host Simulator 1 Memory SCSI 2 GFX NIC CI 3 ways to map target components FGA-only Simulation-only Transplantable CUs can fallback to SW by transplanting between hosts Only common-case instructions/behaviors implemented in FGA Complex, rare behaviors relegated to software Transplants reduce full-system complexity 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 9

10 Multiprocessor Host Interleaving Advantages: # processors to model in functional simulator # soft cores implemented in FGA Trade away FGA throughput for smaller implementation 1-to-1 mapping between target and host CUs Decouple logical simulated size from FGA host size 1-to-1 Host processor in FGA can be made very simple 4-to-1 host interleaving 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 10

11 FGA Host rocessor Architecturally executes multiple contexts Existing multithreaded micro-architectures are good candidates Our design: Instruction-Interleaved ipeline Switch in new processor context on each cycle Simple, efficient design w/ no stalling or forwarding Long-latency tolerance (e.g., cache miss, transplants) Coherence is free between contexts mapped to same pipeline 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 11

12 Outline Hybrid Full-System Simulation Host M Interleaving BlueSARC Implementation Conclusion 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 12

BlueSARC Simulator Functionally models 16-CU UltraSARC III Server HW components are hosted on BEE2 FGA platform Berkeley Emulation Engine 2 5 Virtex-II ro 70s (~66 KLUTs) 2 embedded

13 BlueSARC Simulator Functionally models 16-CU UltraSARC III Server HW components are hosted on BEE2 FGA platform Berkeley Emulation Engine 2 5 Virtex-II ro 70s (~66 KLUTs) 2 embedded owercs per FGA Only use 2 out of 5 FGAs (pipeline + DDR controllers) Virtutech Simics used as full-system I/O simulator 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 13

BlueSARC Simulator Virtex II ro 70 Core 2 Duo 16-way ipeline rocessor Bus owerc (Hard core) Simulated I/O Devices (using full-system simulator) Interface to 2nd FGA (memory) owerc simulates

14 BlueSARC Simulator Virtex II ro 70 Core 2 Duo 16-way ipeline rocessor Bus owerc (Hard core) Simulated I/O Devices (using full-system simulator) Interface to 2nd FGA (memory) owerc simulates unimplemented SARC instructions (~10-50 µsec) Ethernet Simics simulates devices, memorymapped I/O and DMA (~ µsec) 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 14

15 Hybrid host partitioning choices BlueSARC owerc405 Simics on C Integer ALU Register Windows Traps/Interrupts I-/D-MMU Memory + Atomics 36% special SARC insns Multimedia Floating oint Rare TLB operations 64% special SARC insns CI bus IS2200 Fibre Channel I21152 CI bridge IRQ bus Text Console SBBC CI device Serengeti I/O ROM Cheerio-hme NIC SCSI bus Implementation = 60% of basic ISA, 36% of special SARC insns, 0% devices 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 15

16 BlueSARC host microarchitecture 1 2, 3 4, ,10,11 12,13 14 Context Selector C, state x16 I-TLB 128-entry (2-way) x16 64KB I-cache RegFile Decode ALU 1 ALU 2 D-TLB 512-entry (2-way) x16 64KB D-cache Writeback Unit Transplant Unit I-TLB 16-entry (directmapped) x16 D-TLB 16-entry (fullyassoc) x16 Trap Unit Assist Unit owerc405 (transplant service processor) 64-bit ISA, SW-visible MMU, complex memory Multi-context state high # of pipeline stages Normal pipeline stage Transplant support unit 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 16

17 BlueSARC Simulator (continued) rocessing Nodes L1 caches Clock frequency Main memory Resources (Xilinx V270) Instrumentation Statistics Software bit UltraSARC III contexts 14-stage instruction-interleaved pipeline Split I/D, 64KB, 64B, direct-mapped, writeback Non-blocking loads/stores 16-entry MSHR, 4-entry store buffer 90MHz on Xilinx V270 4GB total 33.5 KLUTs (50%), 222 BRAMs (67%) w/o stats+debug 43.2 KLUTs (65%), 238 BRAMs (72%) All internal state fully traceable Attachable to FGA-based CM cache simulator 25K lines Bluespec HDL, 511 rules, 89 module types Runs unmodified Solaris + closed-source binaries 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 17

18 MIS erformance BlueSARC (90mhz) Simics-fast (2.0GHz C2Duo) Simics-trace 1.18 oracle bzip2 crafty gcc gzip parser vortex average 39x speedup on average over Simics-trace 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 18

19 One more thing Our latest application w/ BlueSARC: Trace CM L1 Caches InstructionCaches L2 Cache 8 ways Memory References Data Caches Statistics Cache Contents Statistics 16 64KB 2-way split I&D L1 caches 8-way seudo-lru Statistics 4MB 8-way L2 cache 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 19

20 Demo application BlueSARC Demo on BEE2 On-Line Transaction rocessing benchmark (TC-C) in Oracle Runs in Solaris 8 (unmodified binary) FGA + Memory directly loaded from Simics checkpoint 4 DDR2 Controllers + 4 GB memory Virtex-II ro 70 (owerc & BlueSARC) BEE2 latform Ethernet (to Simics on C) RS232 (Debugging) 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 20 20

21 Summary Two techniques for reducing complexity Hybrid full-system simulation M host interleaving Future work Timing extensions Larger-scale implementation (hundreds of CUs) Run-time instrumentation tools 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 21

22 art II: Hands-on Tutorial 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 22

23 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 23

24 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 24

25 Setup rocedures Divide up into FOUR groups Each group needs 1 laptop with: Remote Desktop Connection for Windows X Instructions: user/login: ramp/ramp 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 25

26 Live CM Cache Statistics Statistics updated in real-time on webpage: Click on tabs to watch statistics for any of the four BEE2 boards as they run simulations 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 26

27 Thanks! Any questions? Acknowledgements We would like to thank our colleagues in the RAM and TRUSS projects. 2-Mar-2008 RAM 2008 Tutorial / Eric S. Chung and Michael apamichael 27

ProtoFlex: FPGA Accelerated Full System MP Simulation

ProtoFlex: FPGA Accelerated Full System MP Simulation Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Our work in this area has been supported in part