A Fast Instruction Set Simulator for RISC-V
1 A Fast Instruction Set Simulator for RISC-V
Esperanto Technologies, Inc.
5th RISC-V Workshop, November 30, 2016
Esperanto Technologies: A Fast ISA Simulator for RISC-V, 5th RISC-V Workshop, Nov 30, 2016
2 Background
- Esperanto is a stealth mode startup designing chips with RISC-V.
- Esperanto wanted a fast RISC-V ISA simulator capable of:
  - Running large applications with minimal (<5x) slowdown
  - Running a large number of threads with good scalability
  - Providing flexibility in testing instruction extensions
  - Providing flexibility in gathering performance data
- Fast simulation is a key productivity tool:
  - Gives a fast compile, run and test/debug loop prior to silicon
- Current simulators were judged to be too slow
- Undertook a project to modify Eltechs ExaGear to run RISC-V
3 Motivation
- Evaluated two existing options: QEMU and Spike
- Would either be sufficient?
- We compiled several tests from the Spec2006 benchmark for both x86-64 and RISC-V with the same compiler version and options: GCC -O2
4 Comparison of native, QEMU and Spike runtimes
Columns: Benchmark | x86-64 native (in seconds) | QEMU system mode (in seconds) | Spike pk (in seconds)
[Table: most cell values were lost in transcription. Surviving fragments, in order: 099.go , bzip2 475 Failed 41, bwaves 416.gamess 445.gobmk 435.gromacs 444 Failed 48,859 66* Failed 16,045 64* , Failed]
* Only one subtask
5 Result of Spike and QEMU evaluation
- Spike is a very slow simulator: 86x to 267x slower than native
- Spike user mode (pk) requires static binaries, which is not applicable in some cases
- QEMU system mode was not fast enough: 34x to 68x slower than native
- QEMU had no user mode support for RISC-V when we started the evaluation
6 Fast Simulator Design
7 User mode approach
- User mode emulation means we only need to run the RISC-V instructions in the application, not the OS. The OS still runs compiled to the native hardware.
- User mode approach advantages:
  - Faster emulation (does not need sophisticated software MMU techniques)
  - Does not need a stable RISC-V kernel right now
  - Kernel code does not blur performance results
8 User mode approach
[Diagram: software stack. "RISC-V World!" (RISC-V Apps, RISC-V Libs) sits on the Fast Simulator, next to x86 Apps and x86 Libs; everything runs on x86-64 Linux on x86-64 Native Hardware.]
9 User mode key points
- Fast Simulator can only execute user mode Linux RISC-V ELF binaries
- RISC-V code is translated into corresponding x86-64 instructions
- RISC-V Linux system calls are translated into corresponding x86-64 Linux system calls
- RISC-V applications can only see a RISC-V world (some kind of chroot), but can communicate with the host via the kernel (for example, using sockets)
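The system call translation mentioned above can be illustrated with a very small sketch. The syscall numbers are taken from the public Linux syscall tables (RISC-V uses the generic table, x86-64 has its own); the table name, the function name, and the idea of a flat dictionary are assumptions made for illustration and are not the simulator's actual design. A real translator would also have to convert arguments such as structure layouts and flag values.

```python
# Hypothetical RISC-V -> x86-64 Linux syscall number map (illustration only).
RISCV_TO_X86_64_SYSCALL = {
    56: 257,  # openat
    57: 3,    # close
    63: 0,    # read
    64: 1,    # write
    93: 60,   # exit
    94: 231,  # exit_group
}


def translate_syscall_number(riscv_nr: int) -> int:
    """Return the x86-64 syscall number for a RISC-V Linux syscall number."""
    try:
        return RISCV_TO_X86_64_SYSCALL[riscv_nr]
    except KeyError:
        # Unmapped calls are rejected here; a real tool would cover the full table.
        raise NotImplementedError(f"unmapped RISC-V syscall {riscv_nr}") from None
```

For example, a guest `write` (number 64 on RISC-V) would be forwarded to the host as syscall 1.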
10 Fast Simulator Translation Flow
[Diagram: RISC-V APPLICATION -> RISC-V traces -> "Was this trace previously translated?" -> FAST TRANSLATION ("Perform optimizations here") or CACHE -> RUNNING TRANSLATED TRACE]
1. Check if the trace was previously translated
2. If the trace was not translated, then translate it
3. Run translated trace on x86
4. Save translated trace in cache
5. If trace was translated before, restore it from cache
6. Run restored trace on x86
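The six-step flow above can be sketched as a translate-once, reuse-afterwards loop. This is a toy model under stated assumptions: the guest "ISA" (`add`, `branch_lt`, `halt`), the `GUEST_TRACES` table, and the functions `translate` and `run` are all hypothetical, and Python closures stand in for generated x86 code. Folding consecutive `add` operations into one constant is a stand-in for the "perform optimizations here" step.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical guest program: start PC -> list of ops ending in a terminator.
GUEST_TRACES = {
    0: [("add", 1), ("add", 2), ("branch_lt", 9, 0, 8)],  # loop while acc < 9
    8: [("add", 100), ("halt",)],
}

HostCode = Callable[[int], Tuple[int, Optional[int]]]


def translate(pc: int) -> HostCode:
    """Translate one guest trace into a host callable (toy 'code generation')."""
    *body, terminator = GUEST_TRACES[pc]
    delta = sum(op[1] for op in body)  # toy peephole: fold all adds into one

    def host_code(acc: int) -> Tuple[int, Optional[int]]:
        acc += delta
        if terminator[0] == "branch_lt":
            _, limit, taken, fallthrough = terminator
            return acc, (taken if acc < limit else fallthrough)
        return acc, None  # "halt": no next trace

    return host_code


def run(entry_pc: int, acc: int = 0):
    cache: Dict[int, HostCode] = {}
    stats = {"translated": 0, "reused": 0}
    pc: Optional[int] = entry_pc
    while pc is not None:
        host_code = cache.get(pc)        # 1. was this trace translated before?
        if host_code is None:
            host_code = translate(pc)    # 2. translate it
            cache[pc] = host_code        # 4. save it in the cache
            stats["translated"] += 1
        else:
            stats["reused"] += 1         # 5. restored from the cache
        acc, pc = host_code(acc)         # 3./6. run the translated trace
    return acc, stats
```

In this toy run, `run(0)` executes the loop trace three times but translates it only once, which is the effect the translation cache is meant to achieve.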
11 Fast Simulator Translation Flow Key Points
- Fast Simulator translates traces
- All compiled traces are saved in a Translation Cache and are then reused
- Several optimizations are applied, including efficient register allocation, peephole optimization, dynamic jump cache, etc.
- Floating point calculations directly use hardware FPU and FPU registers
12 Fast Simulator Design
[Diagram: Frontends (New Arch, ARM v7, x86-32) -> IR Component -> Backends (ARM v7, ARM v8, x86-64)]
13 Fast Simulator Design
[Diagram: Frontends (RISC-V 64bit, ARM v7, x86-32) -> IR Component -> Backends (ARM v7, ARM v8, x86-64)]
14 Fast Simulator Design Benefits
- x86-64 backend was implemented previously and we just reused it
- Intermediate Representation (IR) was also reused
- All reused components are very reliable due to exhaustive testing in previous products
- For the first reasonable version we had to implement only the RISC-V frontend and tune other components
- Many optimizations worked without changes
15 Development time frame
- About 2 months for a first reliable version which is able to run the full Spec2006 benchmark
- About one month of performance tuning the optimizations to get the presented numbers
16 As a Result
- Fast development
- High reliability
- High performance
17 Performance results
18 Performance results
- Intel(R) Core(TM) i GHz
- GCC 6.1.0, O2 optimization level
- RISC-V toolchain available on 24 OCT 2016
- QEMU (RISC-V user mode appears!)
- Spec2006 benchmarks
19 CINT2006 Fast Sim/x86-64 performance comparison
Columns: Spec name | x86-64 (in seconds) | Fast Sim (in seconds) | Ratio
[Table: per-benchmark values were lost in transcription. Rows: 400.perlbench, bzip, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk]
GeoMean 2.47
20 CFP2006 Fast Sim/x86-64 performance comparison
Columns: Spec name | x86-64 (in seconds) | Fast Sim (in seconds) | Ratio
[Table: per-benchmark values were lost in transcription. Rows: 410.bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx]
GeoMean 3.48
21 CINT2006 QEMU user mode/Fast Sim performance comparison
Columns: Spec name | Fast Sim (in seconds) | QEMU user mode (in seconds) | Ratio
[Table: per-benchmark values were lost in transcription. Rows: 400.perlbench, bzip, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk]
GeoMean 1.71
22 CFP2006 QEMU user mode/Fast Sim performance comparison
Columns: Spec name | Fast Sim (in seconds) | QEMU user mode (in seconds) | Ratio
[Table: per-benchmark values were lost in transcription. Rows: 410.bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx]
GeoMean 7.29
23 CINT2006 Spike pk/Fast Sim performance comparison
Columns: Spec name | Fast Sim (in seconds) | Spike pk (in seconds) | Ratio
[Table: most per-benchmark values were lost in transcription. Rows: 400.perlbench, bzip, gcc, 429.mcf, gobmk, hmmer, sjeng, libquantum, h264ref, 471.omnetpp, astar, xalancbmk]
Surviving entries: gcc 890 / failed; h264ref 1986 / failed; xalancbmk 828 / failed
GeoMean
24 CFP2006 Spike pk/Fast Sim performance comparison
Columns: Spec name | Fast Sim (in seconds) | Spike pk (in seconds) | Ratio
[Table: most per-benchmark values were lost in transcription. Rows: 410.bwaves, gamess, milc, zeusmp, gromacs, 436.cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, 482.sphinx]
Surviving entries: gromacs 1360 / failed; wrf 2280 / failed; 482.sphinx failed
GeoMean
25 Performance summary
- Fast Simulator is only 2.5 times slower than native on SpecInt2006 and 3.5 times slower on SpecFPU2006 (3x on average and 6.2x in the worst case)
- Fast Simulator is 1.7 times faster than QEMU user mode on SpecInt2006 and 7.3 times faster on SpecFPU2006 (4x on average and up to 13.5x on some tests)
- Fast Simulator is ~30 times faster than Spike
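The comparison tables report a GeoMean of per-benchmark ratios. A minimal sketch of how such a figure can be computed is shown below; the function name `geomean_ratio` and the sample timings are made up for illustration and are not data from the talk.

```python
import math
from typing import Iterable, Tuple


def geomean_ratio(times: Iterable[Tuple[float, float]]) -> float:
    """Geometric mean of simulated/native runtime ratios.

    Each item is a (native_seconds, simulated_seconds) pair.
    """
    ratios = [simulated / native for native, simulated in times]
    if not ratios:
        raise ValueError("need at least one benchmark")
    # exp(mean(log r)) avoids overflow from multiplying many ratios together.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up timings: one benchmark slows down 2x, the other 8x.
sample = [(10.0, 20.0), (10.0, 80.0)]
slowdown = geomean_ratio(sample)
```

The geometric mean is the usual choice for summarizing ratios because one very slow benchmark does not dominate the result the way it would in an arithmetic mean.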
26 Why is Fast Simulator so fast?
- Use of a performance oriented architecture:
  - Compiler style Intermediate Representation (allows implementing more optimizations)
  - Smart trace collection
  - Optimized x86-64 backend
- Many optimization tricks in other components are enabled by default
- Hardware floating point emulation
27 Good Multithreading Scalability
- Intel(R) Xeon Phi(TM) CPU 1.30GHz, 64 cores
- GeoBenchmark: $ mig_benchmark
Columns: Number of threads | x86-64 seconds | Ratio | Fast Sim seconds | Ratio
[Table: values were lost in transcription.]
28 Future plans
- Still a proprietary simulator under development
- Future plans may include:
  - Additional optimizations to improve performance
  - Improved floating point precision support
  - Support for additional extensions
  - Support for additional metrics
  - Improved debugging options
  - System Mode?
29 Summary
- Internal working name is ExaGear-RVx
- ExaGear-RVx is an extremely fast simulator
  - Only 2.5x (int) to 3.5x (float) slower than native execution
  - Up to 10+ times faster (4x on average) than QEMU user mode
- Good multithreading scalability
- Industrial level reliability
- Reasonable design flexibility
  - Allows quick support for hardware ISA changes
Lightweight Memory Tracing
Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for
More informationResource-Conscious Scheduling for Energy Efficiency on Multicore Processors
Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe
More informationImproving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationNightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems
NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science
More informationCS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines
CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell
More informationPerformance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing
More informationEnergy Models for DVFS Processors
Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July
More informationISA-Aging. (SHRINK: Reducing the ISA Complexity Via Instruction Recycling) Accepted for ISCA 2015
ISA-Aging (SHRINK: Reducing the ISA Complexity Via Instruction Recycling) Accepted for ISCA 2015 Bruno Cardoso Lopes, Rafael Auler, Edson Borin, Luiz Ramos, Rodolfo Azevedo, University of Campinas, Brasil
More informationComputer Architecture. Introduction
to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe
More informationUCB CS61C : Machine Structures
inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 38 Performance 2008-04-30 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in
More informationArchitecture of Parallel Computer Systems - Performance Benchmarking -
Architecture of Parallel Computer Systems - Performance Benchmarking - SoSe 18 L.079.05810 www.uni-paderborn.de/pc2 J. Simon - Architecture of Parallel Computer Systems SoSe 2018 < 1 > Definition of Benchmark
More informationFootprint-based Locality Analysis
Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.
More informationEnergy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012
Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso
More informationImproving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.
Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses
More informationPerformance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor
Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University
More informationLightweight Memory Tracing
Lightweight Memory Tracing Mathias Payer ETH Zurich Enrico Kravina ETH Zurich Thomas R. Gross ETH Zurich Abstract Memory tracing (executing additional code for every memory access of a program) is a powerful
More informationNear-Threshold Computing: How Close Should We Get?
Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on
More informationSandbox Based Optimal Offset Estimation [DPC2]
Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset
More informationAddressing End-to-End Memory Access Latency in NoC-Based Multicores
Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu
More informationBias Scheduling in Heterogeneous Multi-core Architectures
Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that
More informationPIPELINING AND PROCESSOR PERFORMANCE
PIPELINING AND PROCESSOR PERFORMANCE Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 1, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationA Dynamic Program Analysis to find Floating-Point Accuracy Problems
1 A Dynamic Program Analysis to find Floating-Point Accuracy Problems Florian Benz fbenz@stud.uni-saarland.de Andreas Hildebrandt andreas.hildebrandt@uni-mainz.de Sebastian Hack hack@cs.uni-saarland.de
More informationLinux Performance on IBM zenterprise 196
Martin Kammerer martin.kammerer@de.ibm.com 9/27/10 Linux Performance on IBM zenterprise 196 visit us at http://www.ibm.com/developerworks/linux/linux390/perf/index.html Trademarks IBM, the IBM logo, and
More informationCPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:
CPI CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: Clock cycle where: Clock rate = 1 / clock cycle f =
More informationEfficient Memory Shadowing for 64-bit Architectures
Efficient Memory Shadowing for 64-bit Architectures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Qin Zhao, Derek Bruening,
More informationGenerating Low-Overhead Dynamic Binary Translators
Generating Low-Overhead Dynamic Binary Translators Mathias Payer ETH Zurich, Switzerland mathias.payer@inf.ethz.ch Thomas R. Gross ETH Zurich, Switzerland trg@inf.ethz.ch Abstract Dynamic (on the fly)
More informationOpen Access Research on the Establishment of MSR Model in Cloud Computing based on Standard Performance Evaluation
Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2015, 7, 821-825 821 Open Access Research on the Establishment of MSR Model in Cloud Computing based
More informationData Prefetching by Exploiting Global and Local Access Patterns
Journal of Instruction-Level Parallelism 13 (2011) 1-17 Submitted 3/10; published 1/11 Data Prefetching by Exploiting Global and Local Access Patterns Ahmad Sharif Hsien-Hsin S. Lee School of Electrical
More informationPMCTrack: Delivering performance monitoring counter support to the OS scheduler
PMCTrack: Delivering performance monitoring counter support to the OS scheduler J. C. Saez, A. Pousa, R. Rodríguez-Rodríguez, F. Castro, M. Prieto-Matias ArTeCS Group, Facultad de Informática, Complutense
More informationInformation System Architecture Natawut Nupairoj Ph.D. Department of Computer Engineering, Chulalongkorn University
2110684 Information System Architecture Natawut Nupairoj Ph.D. Department of Computer Engineering, Chulalongkorn University Agenda Capacity Planning Determining the production capacity needed by an organization
More informationHOTL: a Higher Order Theory of Locality
HOTL: a Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationEKT 303 WEEK Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ EKT 303 WEEK 2 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. Chapter 2 + Performance Issues + Designing for Performance The cost of computer systems continues to drop dramatically,
More informationSEN361 Computer Organization. Prof. Dr. Hasan Hüseyin BALIK (2 nd Week)
+ SEN361 Computer Organization Prof. Dr. Hasan Hüseyin BALIK (2 nd Week) + Outline 1. Overview 1.1 Basic Concepts and Computer Evolution 1.2 Performance Issues + 1.2 Performance Issues + Designing for
More informationMaking Data Prefetch Smarter: Adaptive Prefetching on POWER7
Making Data Prefetch Smarter: Adaptive Prefetching on POWER7 Víctor Jiménez Barcelona Supercomputing Center Barcelona, Spain victor.javier@bsc.es Alper Buyuktosunoglu IBM T. J. Watson Research Center Yorktown
More informationEfficient and Effective Misaligned Data Access Handling in a Dynamic Binary Translation System
Efficient and Effective Misaligned Data Access Handling in a Dynamic Binary Translation System JIANJUN LI, Institute of Computing Technology Graduate University of Chinese Academy of Sciences CHENGGANG
More informationInsertion and Promotion for Tree-Based PseudoLRU Last-Level Caches
Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency
More informationHOTL: A Higher Order Theory of Locality
HOTL: A Higher Order Theory of Locality Xiaoya Xiang Chen Ding Hao Luo Department of Computer Science University of Rochester {xiang, cding, hluo}@cs.rochester.edu Bin Bao Adobe Systems Incorporated bbao@adobe.com
More informationThesis Defense Lavanya Subramanian
Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Thesis Defense Lavanya Subramanian Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel)
More informationDEMM: a Dynamic Energy-saving mechanism for Multicore Memories
DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University
More informationScalable Dynamic Task Scheduling on Adaptive Many-Cores
Introduction: Many- Paradigm [Our Definition] Scalable Dynamic Task Scheduling on Adaptive Many-s Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel Bus CES Chair for
More informationPerceptron Learning for Reuse Prediction
Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level
More informationLast time. Lecture #29 Performance & Parallel Intro
CS61C L29 Performance & Parallel (1) inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture #29 Performance & Parallel Intro 2007-8-14 Scott Beamer, Instructor Paper Battery Developed by Researchers
More informationOpenPrefetch. (in-progress)
OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),
More informationA Front-end Execution Architecture for High Energy Efficiency
A Front-end Execution Architecture for High Energy Efficiency Ryota Shioya, Masahiro Goshima and Hideki Ando Department of Electrical Engineering and Computer Science, Nagoya University, Aichi, Japan Information
More informationPotential for hardware-based techniques for reuse distance analysis
Michigan Technological University Digital Commons @ Michigan Tech Dissertations, Master's Theses and Master's Reports - Open Dissertations, Master's Theses and Master's Reports 2011 Potential for hardware-based
More informationA Comprehensive Scheduler for Asymmetric Multicore Systems
A Comprehensive Scheduler for Asymmetric Multicore Systems Juan Carlos Saez Manuel Prieto Complutense University, Madrid, Spain {jcsaezal,mpmatias}@pdi.ucm.es Alexandra Fedorova Sergey Blagodurov Simon
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationCoverage-guided Fuzzing of Individual Functions Without Source Code
Coverage-guided Fuzzing of Individual Functions Without Source Code Alessandro Di Federico Politecnico di Milano October 25, 2018 1 Index Coverage-guided fuzzing An overview of rev.ng Experimental results
More informationThe Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory
The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*
More informationKruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring
NDSS 2012 Kruiser: Semi-synchronized Nonblocking Concurrent Kernel Heap Buffer Overflow Monitoring Donghai Tian 1,2, Qiang Zeng 2, Dinghao Wu 2, Peng Liu 2 and Changzhen Hu 1 1 Beijing Institute of Technology
More informationExploi'ng Compressed Block Size as an Indicator of Future Reuse
Exploi'ng Compressed Block Size as an Indicator of Future Reuse Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons, Michael A. Kozuch Execu've Summary In a compressed
More informationA Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach
A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements
More informationPipelining. CS701 High Performance Computing
Pipelining CS701 High Performance Computing Student Presentation 1 Two 20 minute presentations Burks, Goldstine, von Neumann. Preliminary Discussion of the Logical Design of an Electronic Computing Instrument.
More informationEvaluation of RISC-V RTL with FPGA-Accelerated Simulation
Evaluation of RISC-V RTL with FPGA-Accelerated Simulation Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, Krste Asanovic CARRV 2017 10/14/2017 Evaluation Methodologies For Computer
More informationMulti-Cache Resizing via Greedy Coordinate Descent
Noname manuscript No. (will be inserted by the editor) Multi-Cache Resizing via Greedy Coordinate Descent I. Stephen Choi Donald Yeung Received: date / Accepted: date Abstract To reduce power consumption
More informationHW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation
HW/SW Co-designed Processors: Challenges, Design Choices and a Simulation Infrastructure for Evaluation Rakesh Kumar, José Cano, Aleksandar Brankovic, Demos Pavlou, Kyriakos Stavrou, Enric Gibert, Alejandro
More informationIntroducing the GCC to the Polyhedron Model
1/15 Michael Claßen University of Passau St. Goar, June 30th 2009 2/15 Agenda Agenda 1 GRAPHITE Introduction Status of GRAPHITE 2 The Polytope Model in GRAPHITE What code can be represented? GPOLY - The
More informationEnergy-centric DVFS Controlling Method for Multi-core Platforms
Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To
More informationPredicting Performance Impact of DVFS for Realistic Memory Systems
Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko Advisors: Todd C. Mowry and Onur Mutlu Computer Science Department, Carnegie Mellon
More informationPiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory
PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory Tri M. Nguyen Department of Electrical Engineering Princeton University Princeton, USA trin@princeton.edu David Wentzlaff
More information562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,
More informationSoftware-Controlled Transparent Management of Heterogeneous Memory Resources in Virtualized Systems
Software-Controlled Transparent Management of Heterogeneous Memory Resources in Virtualized Systems Min Lee Vishal Gupta Karsten Schwan College of Computing Georgia Institute of Technology {minlee,vishal,schwan}@cc.gatech.edu
More informationSiNUCA: A Validated Micro-Architecture Simulator
SiNUCA: A Validated Micro-Architecture ulator Marco A. Z. Alves, Matthias Diener, Francis B. Moreira, Philippe O. A. Navaux Informatics Institute Federal University of Rio Grande do Sul Email: {mazalves,
More informationInput-Sensitive Profiling
Input-Sensitive Profiling Emilio Coppa Dept. of Computer and System Sciences Sapienza University of Rome ercoppa@gmail.com Camil Demetrescu Dept. of Computer and System Sciences Sapienza University of
More informationMinimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era
Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu
More informationMultiperspective Reuse Prediction
ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting
More informationEnhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization
Enhanced Operating System Security Through Efficient and Fine-grained Address Space Randomization Anton Kuijsten Andrew S. Tanenbaum Vrije Universiteit Amsterdam 21st USENIX Security Symposium Bellevue,
More informationImpact of Compiler Optimizations on Voltage Droops and Reliability of an SMT, Multi-Core Processor
Impact of ompiler Optimizations on Voltage roops and Reliability of an SMT, Multi-ore Processor Youngtaek Kim Lizy Kurian John epartment of lectrical & omputer ngineering The University of Texas at ustin
More informationDynamic and Adaptive Calling Context Encoding
ynamic and daptive alling ontext Encoding Jianjun Li State Key Laboratory of omputer rchitecture, Institute of omputing Technology, hinese cademy of Sciences lijianjun@ict.ac.cn Wei-hung Hsu epartment
More informationLinearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,
More informationLLVM Performance Improvements and Headroom
LLVM Performance Improvements and Headroom Gerolf Hoflehner Apple LLVM Developers Meeting 2015 San Jose, CA Messages Tuning and focused local optimizations Advancing optimization technology Getting inspired
More informationSystem Simulator for x86
MARSS Micro Architecture & System Simulator for x86 CAPS Group @ SUNY Binghamton Presenter Avadh Patel http://marss86.org Present State of Academic Simulators Majority of Academic Simulators: Are for non
More informationIAENG International Journal of Computer Science, 40:4, IJCS_40_4_07. RDCC: A New Metric for Processor Workload Characteristics Evaluation
RDCC: A New Metric for Processor Workload Characteristics Evaluation Tianyong Ao, Pan Chen, Zhangqing He, Kui Dai, Xuecheng Zou Abstract Understanding the characteristics of workloads is extremely important
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationFine-Grained User-Space Security Through Virtualization
Fine-Grained User-Space Security Through Virtualization Mathias Payer mathias.payer@inf.ethz.ch ETH Zurich, Switzerland Thomas R. Gross trg@inf.ethz.ch ETH Zurich, Switzerland Abstract This paper presents
Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers
2012 IEEE 18th International Conference on Parallel and Distributed Systems Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers Mónica Serrano, Salvador Petit, Julio Sahuquillo,
ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality
ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce
Umbra: Efficient and Scalable Memory Shadowing
Umbra: Efficient and Scalable Memory Shadowing Qin Zhao CSAIL Massachusetts Institute of Technology Cambridge, MA, USA qin_zhao@csail.mit.edu Derek Bruening VMware, Inc. bruening@vmware.com Saman Amarasinghe
Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1
Available online at www.sciencedirect.com Physics Procedia 33 (2012) 1029-1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006
Computing System Fundamentals/Trends + Review of Performance Evaluation and ISA Design
Computing System Fundamentals/Trends + Review of Performance Evaluation and ISA Design Computing Element Choices: Computing Element Programmability Spatial vs. Temporal Computing Main Processor Types/Applications
Performance Analysis in Modern Multicores
Practical Parallel Programming course - Winter 2017 CS @ Haifa University Performance Analysis in Modern Multicores Ahmad Yasin CPU Architect, Intel Corporation 14 January 2018 1 Ahmad Yasin Performance
Protecting the Stack with Metadata Policies and Tagged Hardware
2018 IEEE Symposium on Security and Privacy Protecting the Stack with Metadata Policies and Tagged Hardware Nick Roessler University of Pennsylvania nroess@seas.upenn.edu André DeHon University of Pennsylvania
COP: To Compress and Protect Main Memory
COP: To Compress and Protect Main Memory David J. Palframan Nam Sung Kim Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin Madison palframan@wisc.edu, nskim3@wisc.edu,
Performance analysis of Intel Core 2 Duo processor
Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 2007 Performance analysis of Intel Core 2 Duo processor Tribuvan Kumar Prakash Louisiana State University and Agricultural
Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems
Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,
Energy-Based Accounting and Scheduling of Virtual Machines in a Cloud System
Energy-Based Accounting and Scheduling of Virtual Machines in a Cloud System Nakku Kim Email: nkkim@unist.ac.kr Jungwook Cho Email: jmanbal@unist.ac.kr School of Electrical and Computer Engineering, Ulsan
An SSA-based Algorithm for Optimal Speculative Code Motion under an Execution Profile
An SSA-based Algorithm for Optimal Speculative Code Motion under an Execution Profile Hucheng Zhou Tsinghua University zhou-hc07@mails.tsinghua.edu.cn Wenguang Chen Tsinghua University cwg@tsinghua.edu.cn
Decoupled Dynamic Cache Segmentation
Appears in Proceedings of the 18th International Symposium on High Performance Computer Architecture (HPCA-18), February, 2012. Decoupled Dynamic Cache Segmentation Samira M. Khan, Zhe Wang and Daniel A.
Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors
Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University
Scheduling the Intel Core i7
Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne
High System-Code Security with Low Overhead
High System-Code Security with Low Overhead Jonas Wagner, Volodymyr Kuznetsov, George Candea, and Johannes Kinder École Polytechnique Fédérale de Lausanne Royal Holloway, University of London High System-Code
Speedup Factor Estimation through Dynamic Behavior Analysis for FPGA
Speedup Factor Estimation through Dynamic Behavior Analysis for FPGA Zhongda Yuan 1, Jinian Bian 1, Qiang Wu 2, Oskar Mencer 2 1 Dept. of Computer Science and Technology, Tsinghua Univ., Beijing 100084,
L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors
L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors Josué Feliu, Julio Sahuquillo, Salvador Petit, and José Duato Department of Computer Engineering (DISCA) Universitat Politècnica de València
Architectural Supports to Protect OS Kernels from Code-Injection Attacks
Architectural Supports to Protect OS Kernels from Code-Injection Attacks 2016-06-18 Hyungon Moon, Jinyong Lee, Dongil Hwang, Seonhwa Jung, Jiwon Seo and Yunheung Paek Seoul National University 1 Why to
Amplifying Side Channels Through Performance Degradation
Amplifying Side Channels Through Performance Degradation Thomas Allan The University of Adelaide and Data61, CSIRO tom.allan@student.adelaide.edu.au Katrina Falkner The University of Adelaide katrina.falkner@adelaide.edu.au
Modeling Virtual Machines Misprediction Overhead
Modeling Virtual Machines Misprediction Overhead Divino César, Rafael Auler, Rafael Dalibera, Sandro Rigo, Edson Borin and Guido Araújo Institute of Computing, University of Campinas Campinas, São Paulo
An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs
International Journal of Computer Systems (ISSN: 2394-1065), Volume 04 Issue 04, April, 2017 Available at http://www.ijcsonline.com/ An Intelligent Fetching algorithm For Efficient Physical Register File
Practical Data Compression for Modern Memory Hierarchies
Practical Data Compression for Modern Memory Hierarchies Thesis Oral Gennady Pekhimenko Committee: Todd Mowry (Co-chair) Onur Mutlu (Co-chair) Kayvon Fatahalian David Wood, University of Wisconsin-Madison