A Fast Instruction Set Simulator for RISC-V

Similar documents
Lightweight Memory Tracing

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

UCB CS61C : Machine Structures

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Energy Models for DVFS Processors

ISA-Aging. (SHRINK: Reducing the ISA Complexity Via Instruction Recycling) Accepted for ISCA 2015

Computer Architecture. Introduction

UCB CS61C : Machine Structures

Architecture of Parallel Computer Systems - Performance Benchmarking -

Footprint-based Locality Analysis

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Lightweight Memory Tracing

Near-Threshold Computing: How Close Should We Get?

Sandbox Based Optimal Offset Estimation [DPC2]

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Bias Scheduling in Heterogeneous Multi-core Architectures

PIPELINING AND PROCESSOR PERFORMANCE

A Dynamic Program Analysis to find Floating-Point Accuracy Problems

Linux Performance on IBM zenterprise 196

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

Efficient Memory Shadowing for 64-bit Architectures

Generating Low-Overhead Dynamic Binary Translators

Open Access Research on the Establishment of MSR Model in Cloud Computing based on Standard Performance Evaluation

Data Prefetching by Exploiting Global and Local Access Patterns

PMCTrack: Delivering performance monitoring counter support to the OS scheduler

Information System Architecture Natawut Nupairoj Ph.D. Department of Computer Engineering, Chulalongkorn University

HOTL: a Higher Order Theory of Locality

EKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.

SEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)

Making Data Prefetch Smarter: Adaptive Prefetching on POWER7

Efficient and Effective Misaligned Data Access Handling in a Dynamic Binary Translation System

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

HOTL: A Higher Order Theory of Locality

Thesis Defense Lavanya Subramanian

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

Scalable Dynamic Task Scheduling on Adaptive Many-Cores

Perceptron Learning for Reuse Prediction

Last time. Lecture #29 Performance & Parallel Intro

OpenPrefetch. (in-progress)

A Front-end Execution Architecture for High Energy Efficiency

Potential for hardware-based techniques for reuse distance analysis

A Comprehensive Scheduler for Asymmetric Multicore Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Coverage-guided Fuzzing of Individual Functions Without Source Code

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

Kruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring

Exploi'ng Compressed Block Size as an Indicator of Future Reuse

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach

Pipelining. CS701 High Performance Computing

Evaluation of RISC-V RTL with FPGA-Accelerated Simulation

Multi-Cache Resizing via Greedy Coordinate Descent

HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation

Introducing the GCC to the Polyhedron Model

Energy-centric DVFS Controlling Method for Multi-core Platforms

Predicting Performance Impact of DVFS for Realistic Memory Systems

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory

562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016

Software-Controlled Transparent Management of Heterogeneous Memory Resources in Virtualized Systems

SiNUCA: A Validated Micro-Architecture Simulator

Input-Sensitive Profiling

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era

Multiperspective Reuse Prediction

Enhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization

Impact of Compiler Optimizations on Voltage Droops and Reliability of an SMT, Multi-Core Processor

Dynamic and Adaptive Calling Context Encoding

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

LLVM Performance Improvements and Headroom

System Simulator for x86

IAENG International Journal of Computer Science, 40:4, IJCS_40_4_07. RDCC: A New Metric for Processor Workload Characteristics Evaluation

Microarchitecture Overview. Performance

Fine-Grained User-Space Security Through Virtualization

Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality

Umbra: Efficient and Scalable Memory Shadowing

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Computing System Fundamentals/Trends + Review of Performance Evaluation and ISA Design

Performance Analysis in Modern Multicores

Protecting the Stack with Metadata Policies and Tagged Hardware

COP: To Compress and Protect Main Memory

Performance analysis of Intel Core 2 Duo processor

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Energy-Based Accounting and Scheduling of Virtual Machines in a Cloud System

An SSA-based Algorithm for Optimal Speculative Code Motion under an Execution Profile

Decoupled Dynamic Cache Segmentation

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Scheduling the Intel Core i7

High System-Code Security with Low Overhead

Speedup Factor Estimation through Dynamic Behavior Analysis for FPGA

L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors

Architectural Supports to Protect OS Kernels from Code-Injection Attacks

Amplifying Side Channels Through Performance Degradation

Modeling Virtual Machines Misprediction Overhead

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs

Practical Data Compression for Modern Memory Hierarchies

Transcription:

A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc. 5 th RISC-V Workshop November 30, 2016 Esperanto Technologies: A Fast ISA Simulator for RISC-V 1 5 th RISC-V Workshop Nov 30, 2016

Background Esperanto is a stealth mode startup designing chips with RISC-V. Esperanto wanted a fast RISC-V ISA simulator capable of: Running large applications with minimal (<5x) slowdown Running large number of threads with good scalability Providing flexibility in testing instruction extensions Providing flexibility in gathering performance data Fast simulation is a key productivity tool Gives a fast compile, run and test/debug loop prior to silicon Current simulators were judged to be too slow Undertook project to modify Eltechs ExaGear to run RISC-V Esperanto Technologies: A Fast ISA Simulator for RISC-V 2 5 th RISC-V Workshop Nov 30, 2016

Motivation Evaluated two existing options: QEMU and Spike Would either be sufficient? We compiled several tests from Spec2006 benchmark both for x86-64 and RISC-V with the same compiler version and options: GCC 6.1.0 O2 Esperanto Technologies: A Fast ISA Simulator for RISC-V 3 5 th RISC-V Workshop Nov 30, 2016

Comparison of native, QEMU and Spike runtimes Benchmark x86-64 native (in seconds) QEMU system mode ( in seconds) Spike pk (in seconds) 099.go 8.6 585 2,294 401.bzip2 475 Failed 41,043 410.bwaves 416.gamess 445.gobmk 435.gromacs 444 Failed 48,859 66* Failed 16,045 64* 2145 10,778 464 26040 Failed * Only one subtask Esperanto Technologies: A Fast ISA Simulator for RISC-V 4 5 th RISC-V Workshop Nov 30, 2016

Result of Spike and QEMU evaluation Spike is a very slow simulator: 86x 267x times slower than native Spike user mode (pk) requires static binaries not applicable in some cases QEMU system mode was not fast enough: 34x 68x times slower than native QEMU had no user mode support for RISC-V when we started the evaluation Esperanto Technologies: A Fast ISA Simulator for RISC-V 5 5 th RISC-V Workshop Nov 30, 2016

Fast Simulator Design Esperanto Technologies: A Fast ISA Simulator for RISC-V 6 5 th RISC-V Workshop Nov 30, 2016

User mode approach User mode emulation means we only need to run the RISC-V instructions in the application, not the OS. The OS still runs compiled to the native hardware. User mode approach advantages: Faster emulation (does not need sophisticated software MMU techniques) Does not need stable RISC-V kernel right now Kernel code does not blur performance results Esperanto Technologies: A Fast ISA Simulator for RISC-V 7 5 th RISC-V Workshop Nov 30, 2016

User mode approach RISC-V World! RISC-V Apps RISC-V Libs Fast Simulator x86 Apps x86 Libs x86-64 Linux x86-64 Native Hardware Esperanto Technologies: A Fast ISA Simulator for RISC-V 8 5 th RISC-V Workshop Nov 30, 2016

User mode key points Fast Simulator can only execute Linux RISC-V ELFbinaries with user mode RISC-V code is translated into corresponding x86-64 instructions RISC-V Linux system calls are translated into corresponding x86-64 Linux system calls RISC-V applications can only see a RISC-V world (some kind of chroot), but can communicate with host via kernel (for example, using sockets) Esperanto Technologies: A Fast ISA Simulator for RISC-V 9 5 th RISC-V Workshop Nov 30, 2016

Fast Simulator Translation Flow Perform optimizations here 3. Run translated trace on x86 RISC-V APPLICATION FAST TRANSLATION Was this trace previously translated? 2. If the trace was not translated, then translate it 4. Save translated trace in cache RUNNING TRANSLATED TRACE { RISC-V traces 1. Check if the trace was previously translated 5. If trace was translated before restore it from cache CACHE 6. Run restored trace on x86 Esperanto Technologies: A Fast ISA Simulator for RISC-V 10 5 th RISC-V Workshop Nov 30, 2016

Fast Simulator Translation Flow Key Points Fast Simulator translates traces All compiled traces are saved in a Translation Cache and are then reused Several optimizations are applied, including efficient register allocation, peephole optimization, dynamic jump cache, etc. Floating point calculations directly use hardware FPU and FPU registers Esperanto Technologies: A Fast ISA Simulator for RISC-V 11 5 th RISC-V Workshop Nov 30, 2016

Fast Simulator Design Frontends New Arch ARM v7 x86-32 IR Component ARM v7 ARM v8 x86-64 Backends Esperanto Technologies: A Fast ISA Simulator for RISC-V 12 5 th RISC-V Workshop Nov 30, 2016

Fast Simulator Design Frontends RISC-V 64bit ARM v7 x86-32 IR Component ARM v7 ARM v8 x86-64 Backends Esperanto Technologies: A Fast ISA Simulator for RISC-V 13 5 th RISC-V Workshop Nov 30, 2016

Fast Simulator Design Benefits x86-64 backend was implemented previously and we just reused it Intermediate Representation (IR) was also reused All reused components are very reliable due to exhaustive testing in previous products For first reasonable version we had to implement only RISC-V frontend and tune other components Many optimizations worked without changes Esperanto Technologies: A Fast ISA Simulator for RISC-V 14 5 th RISC-V Workshop Nov 30, 2016

Development time frame About 2 months for a first reliable version which is able to run full Spec2006 benchmark About one month of performance tuning the optimizations to get the presented numbers Esperanto Technologies: A Fast ISA Simulator for RISC-V 15 5 th RISC-V Workshop Nov 30, 2016

As a Result Fast development High reliability High performance Esperanto Technologies: A Fast ISA Simulator for RISC-V 16 5 th RISC-V Workshop Nov 30, 2016

Performance results Esperanto Technologies: A Fast ISA Simulator for RISC-V 17 5 th RISC-V Workshop Nov 30, 2016

Performance results Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz GCC 6.1.0, O2 optimization level RISC-V toolchain available on 24 OCT 2016 QEMU 2.7.50 (RISC-V user mode appears!) Spec2006 benchmarks Esperanto Technologies: A Fast ISA Simulator for RISC-V 18 5 th RISC-V Workshop Nov 30, 2016

CINT2006 Fast Sim/x86-64 performance comparison x86-64 Fast Sim Spec name (in seconds) (in seconds) Ratio 400.perlbench 359 1168 3.25 401.bzip2 475 1188 2.50 403.gcc 318 890 2.80 429.mcf 376 469 1.25 445.gobmk 447 1473 3.30 456.hmmer 427 1363 3.19 458.sjeng 494 1745 3.53 462.libquantum 550 787 1.43 464.h264ref 537 1986 3.70 471.omnetpp 383 719 1.88 473.astar 401 721 1.80 483.xalancbmk 301 828 2.75 GeoMean 2.47 Esperanto Technologies: A Fast ISA Simulator for RISC-V 19 5 th RISC-V Workshop Nov 30, 2016

CFP2006 Fast Sim/x86-64 performance comparison Spec name x86-64 (in seconds) Fast Sim (in seconds) Ratio 410.bwaves 444 1213 2.73 416.gamess 66 235 3.56 433.milc 441 1134 2.57 434.zeusmp 411 1938 4.72 435.gromacs 464 1360 2.93 436.cactusADM 763 4756 6.23 437.leslie3d 432 1595 3.69 444.namd 363 1263 3.48 447.dealII 310 956 3.08 450.soplex 314 667 2.12 453.povray 169 737 4.36 454.calculix 710 4149 5.84 459.GemsFDTD 516 1126 2.18 465.tonto 656 2126 3.24 470.lbm 490 1244 2.54 481.wrf 589 2280 3.87 482.sphinx3 733 3704 5.05 GeoMean 3.48 Esperanto Technologies: A Fast ISA Simulator for RISC-V 20 5 th RISC-V Workshop Nov 30, 2016

CINT2006 QEMU user mode/fast Sim performance comparison Fast Sim QEMU user mode Spec name (in seconds) (in seconds) Ratio 400.perlbench 1168 3334 2.85 401.bzip2 1188 1478 1.24 403.gcc 890 1766 1.98 429.mcf 469 506 1.08 445.gobmk 1473 2385 1.62 456.hmmer 1363 1609 1.18 458.sjeng 1745 3042 1.74 462.libquantum 787 921 1.17 464.h264ref 1986 4492 2.26 471.omnetpp 719 2050 2.85 473.astar 721 971 1.35 483.xalancbmk 828 2060 2.49 GeoMean 1.71 Esperanto Technologies: A Fast ISA Simulator for RISC-V 21 5 th RISC-V Workshop Nov 30, 2016

CFP2006 QEMU user mode/fast Sim performance comparison Spec name Fast Sim (in seconds) QEMU user mode (in seconds) Ratio 410.bwaves 1213 10127 8.35 416.gamess 235 1362 5.80 433.milc 1134 9306 8.21 434.zeusmp 1938 10104 5.21 435.gromacs 1360 16519 12.15 436.cactusADM 4756 24871 5.23 437.leslie3d 1595 9292 5.83 444.namd 1263 17074 13.52 447.dealII 956 4987 5.22 450.soplex 667 1739 2.61 453.povray 737 4325 5.87 454.calculix 4149 47720 11.50 459.GemsFDTD 1126 12004 10.66 465.tonto 2126 16752 7.88 470.lbm 1244 14201 11.42 481.wrf 2280 17689 7.76 482.sphinx3 3704 24365 6.58 GeoMean 7.29 Esperanto Technologies: A Fast ISA Simulator for RISC-V 22 5 th RISC-V Workshop Nov 30, 2016

CINT2006 Spike pk/fast Sim performance comparison Fast Sim Spike pk Spec name (in seconds) (in seconds) Ratio 400.perlbench 1168 55111 47.18 401.bzip2 1188 41040 34.55 403.gcc 890 failed 429.mcf 469 4762 10.15 445.gobmk 1473 71583 48.60 456.hmmer 1363 32425 23.79 458.sjeng 1745 102748 58.88 462.libquantum 787 10245 13.02 464.h264ref 1986 failed 471.omnetpp 719 19965 27.77 473.astar 721 11651 16.16 483.xalancbmk 828 failed GeoMean 26.56 Esperanto Technologies: A Fast ISA Simulator for RISC-V 23 5 th RISC-V Workshop Nov 30, 2016

CFP2006 Spike pk/fast Sim performance comparison Spec name Fast Sim (in seconds) Spike pk (in seconds) Ratio 410.bwaves 1213 48859 40.28 416.gamess 235 16045 68.28 433.milc 1134 29221 25.77 434.zeusmp 1938 35506 18.32 435.gromacs 1360 failed 436.cactusADM 4756 199003 41.84 437.leslie3d 1595 26769 16.78 444.namd 1263 44916 35.56 447.dealII 956 24788 25.93 450.soplex 667 7861 11.79 453.povray 737 37731 51.20 454.calculix 4149 159790 38.51 459.GemsFDTD 1126 30716 27.28 465.tonto 2126 91514 43.05 470.lbm 1244 39644 31.87 481.wrf 2280 failed 482.sphinx3 3704 failed GeoMean 30.92 Esperanto Technologies: A Fast ISA Simulator for RISC-V 24 5 th RISC-V Workshop Nov 30, 2016

Performance summary Fast Simulator is only 2.5 times slower than native on SpecInt2006 and 3.5 times on SpecFPU2006 (3x in average and 6.2x in the worst case) Fast Simulator is 1.7 times faster than QEMU user mode on SpecInt2006 and 7.3 times faster on SpecFPU2006 (4x in average and up to 13.5x on some tests ) Fast simulator ~30 times faster than Spike Esperanto Technologies: A Fast ISA Simulator for RISC-V 25 5 th RISC-V Workshop Nov 30, 2016

Why is Fast Simulator so fast? Use of a performance oriented architecture: Compiler style Intermediate Representation (allows implementing more optimizations) Smart trace collection Optimized x86-64 backend Many of optimization tricks in other components are enabled by default Hardware floating point emulation Esperanto Technologies: A Fast ISA Simulator for RISC-V 26 5 th RISC-V Workshop Nov 30, 2016

Good Multithreading Scalability Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz, 64 cores GeoBenchmark: $ mig_benchmark 1001 1001 1 Number of threads x86-64 seconds Ratio Fast Sim seconds Ratio 1 32.262 1 154.477 1 2 16.136 2.00 77.322 2.00 4 8.084 3.99 38.811 3.98 8 4.07 7.93 19.525 7.91 16 2.049 15.75 9.799 15.76 32 1.051 30.70 5.082 30.40 64 0.536 60.19 2.614 59.10 Esperanto Technologies: A Fast ISA Simulator for RISC-V 27 5 th RISC-V Workshop Nov 30, 2016

Future plans Still a proprietary simulator under development Future plans may include: Additional optimizations to improve performance Improved floating point precision support Support for additional extensions Support for additional metrics Improved debugging options System Mode? Esperanto Technologies: A Fast ISA Simulator for RISC-V 28 5 th RISC-V Workshop Nov 30, 2016

Summary Internal working name is ExaGear-RVx ExaGear-RVx is an extremely fast simulator Only 2.5 (int) to 3.5x(float) slower than native execution Up to 10+ times faster(4x in average) than QEMU user mode Good multithreading scalability Industrial level reliability Reasonable design flexibility Allows quick support for hardware ISA changes Esperanto Technologies: A Fast ISA Simulator for RISC-V 29 5 th RISC-V Workshop Nov 30, 2016