Evaluation of RISC-V RTL with FPGA-Accelerated Simulation

Size: px

Start display at page:

Download "Evaluation of RISC-V RTL with FPGA-Accelerated Simulation"

Buck Tyler
5 years ago
Views:

1 Evaluation of RISC-V RTL with FPGA-Accelerated Simulation Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, Krste Asanovic CARRV /14/2017

2 Evaluation Methodologies For Computer Architecture Research Microarchitectural Cycle-by-Cycle Software Simulators Analytic Power/Energy Modeling Simulation Sampling 2

3 How To Do Computer Architecture Research In The Past? 100M~1B instructions Few lines of C/C++ code Paper 3

4 How Easy Single-Cycle Perfect Caches? 10 lines of C++ code in MARRSx86 è Easy to implement unrealistic, non-cycle-accurate models 4

5 No µarch Simulation Any More! Validation of one design instance[1] Other designs points? Your target modeling? Validation is difficult - Is your modeling still cycle-accurate? Simulation is too slow à Simulation Sampling [1] Gutierrez et al. Sources of error in full-system simulation, ISPASS

No Simulation Sampling! Phase-based Sampling (e.g. SimPoint) - No theoretically guaranteed error bounds - Short periods of phases show similar IPC whenever repeated - 401.

6 No Simulation Sampling! Phase-based Sampling (e.g. SimPoint) - No theoretically guaranteed error bounds - Short periods of phases show similar IPC whenever repeated bzip2 with its reference input in BOOM-2w Statistical Sampling (e.g. SMARTS) - Statically bounded errors - State warming problems µarchitectural state should be recovered from functional simulators What about managed-language applications? 6

7 Build RTL To Validate Your Design Ideas! Hardware Specification FGPA-Accelerated Simulation Performance Evaluation RTL Implementation CAD Tools Area, Timing, Power Evaluation RTL is no longer difficult - Hardware construction languages (e.g. Chisel) - RISC-V implementation code base (e.g. RocketChip) RTL simulation is no longer slow - Tens of MIPS using FPGAs 7

8 Old BOOM Layout DCache Issue Window LSU ROB DCache Control IMUL Free List ALUs Busy Table IDIV Branch Predictor FPU Bypasses Fetch Buffer Rename Table Uncore RegFile ICache 8

evaluates performance, power, and energy of RTL designs with

9 MIDAS: Turn Your RTL into Gold Target RTL Automatically transforms any RTL designs into FPGA-Accelerated RTL simulators Quickly evaluates performance, power, and energy of RTL designs with realistic software applications. Flexibly Co-simulates abstract HW & SW models 9

10 MIDAS Custom Compiler Passes FIRRTL Compiler Target RTL Design Macro Mapping FAME1 Transform Scan Chain Insertion Simulation Mapping Platform Mapping FPGA-Accelerated RTL Simulator FIRRTL: IR for RTL transforms Instrument RTL designs for - Accurate performance modeling - Easy simulation control in FPGA - Interactions with abstract timing models - RTL state snapshots for energy modeling 10

11 Mapping Simulation to the FPGA Host I/O Devices Processor L2 Cache / Main Memory Software FPGA Board( e.g. Xilinx Zynq) FPGA Board DRAM I/O Devices Simulation Driver I/O Endpoints Processor Memory System Timing L2 Cache / Main Memory Simulation Speed: 3.56MHz(ISCA`16) à ~40MHz(CARRV`17) I/O Endpoints - Low-level timing tokens <-> High-level transactions - Optimize the communications between SW & FPGA L2 / Main Memory - Abstract timing model(l2$ tags) in FPGA - Actual data in Board DRAM - Set size, associativity, block size, latency: runtime configurable 11

12 Memory Timing Model Validation A pointer-chase µbenchmark of ccbench running in BOOM ( L1$: 16KiB, 6 cycles L2$: 1MiB, cycles (runtime configurable) DRAM: cycles (runtime configurable) 12

registers 32(int)/32(fp) 110 Branch predictor gshare: 16KiB history BTB entries 40 40 RAS entries 2 4 MSHR

13 Target Design: Rocket Chip Generator Rocket (In-order Processor) BOOM-2w (Version 1) (Out-of-order Processor) Fetch-width 1 2 Issue-width 1 3 Issue slots 20 ROB size 80 Ld/St entries 16/16 Physical registers 32(int)/32(fp) 110 Branch predictor gshare: 16KiB history BTB entries RAS entries 2 4 MSHR entreis 2 2 L1 I$ / D$ ITLB / DTLB reaches L2 $ DRAM latency 16 KiB or 32 KiB 128 KiB / 128 KiB 1MiB / 23 cycles 80 cycles 13

SPEC2006int Benchmark Suite Instruction count with RISC-V for the reference inputs Benchmarks Instruction Count(T) Benchmarks Instruction Count(T) 400.perlbench 2.48 458.sjeng 2.85 401.bzip2 3.08 462.

14 SPEC2006int Benchmark Suite Instruction count with RISC-V for the reference inputs Benchmarks Instruction Count(T) Benchmarks Instruction Count(T) 400.perlbench sjeng bzip libquantum gcc h264ref mcf omnetpp gobmk astar hmmer xalanbmk 1.10 Instruction Count Average: 2.08 Trillion 445.gobmk, 456.hmmer, 462.libquantum: fail in BOOM 14

Simulation Time for SPEC2006int Simulators Speed Average GEM5+Ruby 100 KIPS[1] 240 days (8 months) Max (464.h264ref: 5T) 640 days (1.8 years) MARSSx86 400 KIPS[2] 60 days (2 months) 160 days (5.

15 Simulation Time for SPEC2006int Simulators Speed Average GEM5+Ruby 100 KIPS[1] 240 days (8 months) Max (464.h264ref: 5T) 640 days (1.8 years) MARSSx KIPS[2] 60 days (2 months) 160 days (5.3 months) MIDAS 18 MIPS (40MHz) 1.4 days 3 days [1] [2] Patel et al. MARSSx86: A full system simulator for x86 CPUs, DAC

16 Case Study: IPC 1.41 IPC Rocket 16KiB L1 Rocket 32KiB L1 BOOM-2w 16KiB L1 BOOM-2w 32KiB L1 Cortex A9 Rocket is comparable to Cortex A9 BOOM outperforms Cortex A9 Performance improvement: 16KiB L1$ à 32KiB L1$ 16

17 Case Study: MPKIs of BOOM-2w MPKI Conditional Branch Indirect Branch L1 I-Cache ITLB L1 D-Cache DTLB L2 Cache 40 BTB entries, 32KiB L1$, 1MiB L2$, TLB reach = 128KiB 473.omnetpp, 483.xalancbmk: Large indirect branch MPKIs à Bigger BTBs 17

18 Issue Slots / Issue Width (%) Case Study: Issue Queue Utilization Issued Slots Empty Slots Non-issued Slots Issued Slots / Cycle = IPC Empty Slots: frontend hazards, lack of resources(403.gcc) Non-issued slots: backend hazards 18

19 Strober Power/Energy Modeling Full Program Execution In FPGA RTL State Snapshots I/O Traces RTL Simulation RTL Signal Activities Post-synthesis Designs PrimeTime PX Average Power No state warming is necessary in RTL/gate-level simulation 19

20 Power (mw) Case Study: Power Breakdown Rocket BOOM-2w Rocket BOOM-2w Rocket BOOM-2w Rocket BOOM-2w Rocket BOOM-2w Rocket BOOM-2w Rocket BOOM-2w Rocket BOOM-2w Rocket BOOM-2w 400.perlbench 401.bzip2 403.gcc 429.mcf 458.sjeng 464.h264ref 471.omnetpp 473.astar 483.xalancbmk Misc Uncore L1 D-cache control L1 D-cache meta + data L1 I-cache ROB LSU FPU Integer Unit Issue Logic Register File Rename + Control Branch Predictior Fetch Unit Synopsys 32nm educational technology 50 random sample RTL state snapshots Pipeline Utilization à Power BOOM: 40% power from Rename Logic & Register File 20

21 1000 Case Study: Energy Per Instruction EPI(pJ / Inst) Rocket 32 KiB L1 BOOM-2w 32 KiB L1 BOOM-2w is performant, Rocket is more energy efficient 429.mcf, 471.omnetpp: low power, less energy efficient Pipeline Utilization à Energy Efficiency 21

22 On-going MIDAS Research In UCB Debugging, validation for RTL designs -Assertion detection -Commit log comparison with Spike Memory system timing modeling - Realistic DRAM timing models in FPGA FireSim: datacenter simulation(fires.im) - Connect RocketChips with NICs & switch timing models - Amazon blog post, 7 th RISC-V workshop 22

23 MIDAS(Strober) Is Open-Source Now strober.org Your Processors / Accelerators Examples: RocketChip template: Will support various host platforms (Xilinx Zynq, Amazon EC2 F1, Intel Xeon + In-package FPGA) 23

24 Acknowledgements Funding: - DARPA Award Number HR The Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA - ASPIRE Lab industrial sponsors and affiliates: Intel, Google, HPE, Huawei, LGE, Nokia, NVIDIA, Oracle, and Samsung - Kwanjeong Scholarship 24

Strober: Fast and Accurate Sample-Based Energy Simulation Framework for Arbitrary RTL

Strober: Fast and Accurate Sample-Based Energy Simulation Framework for Arbitrary RTL Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, Krste Asanović