SimOS PPC. Full System Simulation of PowerPC Architecture. Tom Keller Austin Research Lab IBM Research Division

Similar documents
G Disco. Robert Grimm New York University

Full-System Timing-First Simulation

The Role of Performance

Virtual Memory Review. Page faults. Paging system summary (so far)

Outline Background Jaluna-1 Presentation Jaluna-2 Presentation Overview Use Cases Architecture Features Copyright Jaluna SA. All rights reserved

24-vm.txt Mon Nov 21 22:13: Notes on Virtual Machines , Fall 2011 Carnegie Mellon University Randal E. Bryant.

Operating Systems. Operating System Structure. Lecture 2 Michael O Boyle

RAMP-White / FAST-MP

Why Use Simulation? Simics & dark2 Assignment 1. What is a Simulator? Full-System Simulation

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

CS 134: Operating Systems

Wind River. All Rights Reserved.

Virtual Memory. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Translation Buffers (TLB s)

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Uniprocessor Computer Architecture Example: Cray T3E

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

Embedded Linux Architecture

Architectural Support for Operating Systems

Operating System Review

CS510 Operating System Foundations. Jonathan Walpole

EE 457 Unit 8. Exceptions What Happens When Things Go Wrong

by I.-C. Lin, Dept. CS, NCTU. Textbook: Operating System Concepts 8ed CHAPTER 13: I/O SYSTEMS

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

Announcement. Exercise #2 will be out today. Due date is next Monday

What are Exceptions? EE 457 Unit 8. Exception Processing. Exception Examples 1. Exceptions What Happens When Things Go Wrong

Performance Optimization for an ARM Cortex-A53 System Using Software Workloads and Cycle Accurate Models. Jason Andrews

Operating System Structure

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

ProtoFlex: FPGA Accelerated Full System MP Simulation

Hardware Speculation Support

CSE 451: Operating Systems Winter Page Table Management, TLBs and Other Pragmatics. Gary Kimura

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Parallel SimOS: Scalability and Performance for Large System Simulation

MEASURING COMPUTER TIME. A computer faster than another? Necessity of evaluation computer performance

int $0x32 // call interrupt number 50

Hardware, Modularity, and Virtualization CS 111

Chapter 13: I/O Systems

Virtual Machines and Dynamic Translation: Implementing ISAs in Software

Today: Computer System Overview (Stallings, chapter ) Next: Operating System Overview (Stallings, chapter ,

Instruction Set Principles and Examples. Appendix B

Chapter 12: I/O Systems

Chapter 13: I/O Systems

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition

Efficient and Large Scale Program Flow Tracing in Linux. Alexander Shishkin, Intel

19: I/O. Mark Handley. Direct Memory Access (DMA)

High Performance Microprocessor Emulation for Software Validation Facilities and Operational Simulators. Dr. Mattias Holm

CSE 120 Principles of Operating Systems

Disco: Running Commodity Operating Systems on Scalable Multiprocessors

What is an Operating System? A Whirlwind Tour of Operating Systems. How did OS evolve? How did OS evolve?

Operating System Structure

CSE 120 Principles of Operating Systems Spring 2017

ProtoFlex: FPGA-Accelerated Hybrid Simulator

Lecture 14: Cache & Virtual Memory

Distributed Systems Operation System Support

Processes and Threads. Processes and Threads. Processes (2) Processes (1)

Anne Bracy CS 3410 Computer Science Cornell University

The control of I/O devices is a major concern for OS designers

System Virtual Machines

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

EECS 373 Design of Microprocessor-Based Systems

LEON2/3 SystemC Instruction Set Simulator

Chapter 5 B. Large and Fast: Exploiting Memory Hierarchy

Virtual Memory Outline

Midterm #2 Solutions April 23, 1997

Math 230 Assembly Programming (AKA Computer Organization) Spring MIPS Intro

Topics. Operating System I. What is an Operating System? Let s Get Started! What is an Operating System? OS History.

Lecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36

CSE 451: Operating Systems Winter Module 2 Architectural Support for Operating Systems

There are different characteristics for exceptions. They are as follows:

System Virtual Machines

Even coarse architectural trends impact tremendously the design of systems

Administrivia. Lab 1 due Friday 12pm. We give will give short extensions to groups that run into trouble. But us:

Even coarse architectural trends impact tremendously the design of systems. Even coarse architectural trends impact tremendously the design of systems

Lecture 2 - Fundamental Concepts

Even coarse architectural trends impact tremendously the design of systems

Part I Introduction Hardware and OS Review

Unit OS2: Operating System Principles. Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze

Multiprocessor Systems. Chapter 8, 8.1

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School

A Survey on Virtualization Technologies

Introduction to gem5. Nizamudheen Ahmed Texas Instruments

Lecture 4: Instruction Set Architecture

Final Lecture. A few minutes to wrap up and add some perspective

CSE 120 Principles of Operating Systems

Software Development Using Full System Simulation with Freescale QorIQ Communications Processors

Sistemi in Tempo Reale

Lecture 7. Xen and the Art of Virtualization. Paul Braham, Boris Dragovic, Keir Fraser et al. 16 November, Advanced Operating Systems

Chapter 12. CPU Structure and Function. Yonsei University

Background: Operating Systems

CS 31: Intro to Systems Operating Systems Overview. Kevin Webb Swarthmore College March 31, 2015

CPS104 Computer Organization and Programming Lecture 17: Interrupts and Exceptions. Interrupts Exceptions and Traps. Visualizing an Interrupt

ECEN 449 Microprocessor System Design. Hardware-Software Communication. Texas A&M University

CS510 Operating System Foundations. Jonathan Walpole

Anne Bracy CS 3410 Computer Science Cornell University

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016

CS533 Concepts of Operating Systems. Jonathan Walpole

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Transcription:

SimOS PPC Full System Simulation of PowerPC Architecture Tom Keller IBM Research Division tkeller@us.ibm.com This Talk:(Inside IBM) (Outside IBM) http://simos.austin.ibm.com http://www.ece.utexas.edu/projects/ece/lca/events

Full time Pat Bohrer Tom Keller Ann Marie Maynard Rick Simpson Contractor (ex-aix development) Bob Willcox Temporary assignment (now completed) Brian Twichell The SimOS-PPC Team SimOS-PPC/1999-3-10 p1

Full System Simulation Background Processor and system design has been focused around trace-based analysis, using traces (mostly) of problem-state programs. Traces that include operating system code are difficult to collect. Traces of multi-processors are extremely difficult to collect and to correlate. Traces tend to be old, because they re difficult to collect. From the traces, we generate statistics such as cache and TLB miss rates, code frequency-of-use, memory bandwidth requirements, and the like. Tracing problem state code tells us nothing about Execution paths through supervisor state code Operating system services Device drivers Cache traffic due to supervisor state code Cache traffic due to interrupt handlers Memory traffic due to I/O operations SimOS-PPC/1999-3-10 p2

Full System Simulation Background For multiprocessors, trace-based analysis is even more problematic Problem state traces miss operating system code, especially code that deals with MP synchronization. Execution of code on one processor affects execution (and hence trace) on other processors (lock spinning, contents of shared caches,...). The usual trick, which is to pretend that several copies of the same uniprocessor trace are running on the different processors, usually with different starting points, doesn t model any of the MP interaction. SimOS-PPC/1999-3-10 p3

Full System Simulation A way around the difficulties of traces is to use execution based simulation on a full system simulator. A (mostly) unmodified operating system and interesting applications are run on the simulator. The simulator models Instruction set architecture Caches Memories Busses I/O devices Multiple processors with enough precision to answer interesting questions, and rapidly enough to give answers in finite time. SimOS-PPC/1999-3-10 p4

Full System Simulation Recent History SimOS (Stanford University) MIPS R4000, R10000 + Irix Compaq Alpha + Unix http://www.simos.stanford.edu SimICS (virtutech, spinoff from Swedish Institute of Computer Science) Sparc V8 + SunOS 5.x http://www.simics.com SimOS-PPC/1999-3-10 p5

SimOS PowerPC Our project is a port of SimOS to AIX on PowerPC, with the addition of PowerPC processor simulators. We model PowerPC ISA (64-bit Raven ) Caches, memory Selected I/O devices, sufficient to run a server workload: Disk Ethernet Console Each element can be modeled at varying levels of faithfulness, with an inverse relationship between accuracy and speed of model. Example: Disk model Simple instantaneous, interrupt-free model used for AIX bring-up More compex model now in use models delays of actual disk access, interrupts at proper simulated time, models DMA transfers. SimOS-PPC/1999-3-10 p6

SimOS-PPC Applications 32- or 64-bit AIX 4.3.1 Disk driver Ethernet driver SimOS-PPC 64-bit PowerPC Simple ISA simulator Cache simulator Block ISA simulator Memory simulator Disk simulator Ethernet simulator AIX 4.x PowerPC processor 32- or 64-bit SimOS-PPC/1999-3-10 p7

Adapting AIX to SimOS AIX kernel Is unmodified. We use a copy we ve built with g so that all the debugging information is present. RAMFS Contains drivers for our disk and console. Contains an ODM database with special config rules to configure our drivers. savebase information Built by us to describe the simulated machine configuration to the early stages of the boot process. Everything non-essential is removed, so that AIX doesn t spend time trying to discover what s on the bus. All this is bundled into a boot image by a modified version of the bosboot command. SimOS-PPC/1999-3-10 p8

Adapting AIX to SimOS Device drivers Interface with SimOS device simulators via a special PowerPC instruction Unassigned PowerPC opcode interpreted by SimOS as simulator support call (same trick as diagnose on VM/370) Interrupt-driven console, disk and ethernet drivers Console interface will remain simple, as it isn t performance critical. (Simple doesn t mean lack of function, though: it runs well enough for vi, smitty, and emacs.) Files can be copied between the simulated AIX and the host environment: simos-source /simos/src/tmp/ros/emacs.tar tar xvf SimOS-PPC/1999-3-10 p9

TCL interface to SimOS TCL scripts are used to control the simulator Configuration (cache geometry, memory size, number of processors,...) Statistics collection Run-time control Most TCL is in the form of annotations Run arbitrary TCL scripts at specified points of interest Specify where/when to run an annotation by: Execution address (numeric, symbolic) Load or store to specified address Hardware event (device interrupt) User-defined events Creating a new process Entering the idle loop Dispatching a particular thread... SimOS-PPC/1999-3-10 p10

TCL Annotations Annotations are the basis of SimOS s data collection. Annotations have access to all the symbols of the program(s) being executed: symbol load kernel unix Through special TCL variables, annotations have access to the entire machine state: REGISTER(regname) MEMORY(virtual address) PHYSICAL(real address) CYCLES INSTRUCTIONS Examples (from IRIX): Get the name of the current process: (current cycle count) (current instr count on current CPU) symbol read kernel::((struct user*)$uarea)->u_comm Count number of TLB faults via an annotation on the entry to vfault(): annotation set pc kernel::vfault:start { incr vfaultcount } SimOS-PPC/1999-3-10 p11

Another IRIX example: Tracking process fork, exec, and exit annotation set osevent procstart { log PFORK $CYCLES cpu=$cpu $PID($CPU)-$PROCESS($CPU)\n } annotation set pc kernel::exece { set argv [symbol read kernel:exec.c:((struct execa*)$a0)->argp ] log PEXEC $CYCLES cpu=$cpu } # print out the whole command line set arg 1 while {$arg!= 0} { set arg [symbol read kernel::((int*)$argv)<0>] if {$arg!= 0} { set arg [symbol read kernel::((char**)$argv)<0>] log $arg set argv [expr $argv + 4] } } log \n annotation set osevent procexit { log PEXIT $CYCLES cpu=$cpu $PID($CPU)-$PROCESS($CPU)\n } SimOS-PPC/1999-3-10 p12

Copy-on-write disks The AIX disk image is copied from a real disk on a running AIX 4.3.1 system. The disk is not modified during SimOS execution. One disk image serves for multiple simulations, multiple users. ckpts AIX AIX ckpts SimOS PPC SimOS PPC disk read-write AIX disk image read-only disk read-write SimOS-PPC/1999-3-10 p13

How Fast? (1) SimOS-PPC s simple simulator running a cpu-intensive speech benchmark: 32-bit 160 Mhz 604e 64-bit 125 Mhz Raven 64-bit 250 Mhz Blackbird 340 1.9 240 1.4 156.97 Simulated throughput, KIPS Simulated KIPS Host MHz SimOS-PPC/1999-3-10 p14

How Fast? (2) SimOS-PPC Simple simulator running on an IBM 260 Mhz 64-bit host: CPU-intensive benchmark runs native at 1.6 cycles/instruction or 162 MIPS SimOS-PPC running on host emulates host at.34 MIPS,or 466 to 1 SimOS-PPC running on host, modeling host s L1 and L2 caches, executes at.095 MIPS, or 1713 to 1 Simple simulator has not been tuned for performance. Block translator should improve the 466 to 1 case by a factor of 5. SimOS-PPC/1999-3-10 p15

Checkpointing A checkpoint can be taken at any time After every n instructions When a specific point has been reached (via an annotation) By operator command at the simulated OS console The checkpoint includes the entire state of the system: Registers and memory on all processors Cache contents Outstanding interrupt state Disk contents Start-up from a checkpoint is immediate (about 2 seconds) Repeatability for debugging and for what if experiments with different configurations SimOS-PPC/1999-3-10 p16

What s currently running Simple CPU simulator One-at-a-time fetch/decode/execute loop Implements semantics of 64-bit Raven running as a uniprocessor, 2 or 4-way SMP General L1/L2 SMP caches modelled Idle loop is recognized; clock advanced to the next interesting event Ethernet Simulated machines are seen as real to site, each with unique IP address Full Telnet and FTP. X11 is supported. Simple model of disk (fixed delays) Simple console TCL interface for configuring the simulator SimOS-PPC/1999-3-10 p17

Lines of Code SimOS Framework 600 Files 95,000 lines SGI MIPS 175 Files 57,000 lines GNU Environment Compaq Alpha 160 Files 55,000 lines 64-bit AIX gdb 165 files changed 10 files added 14,000 lines added or changed IBM PowerPC 180 Files 47,000 lines SimOS-PPC/1999-3-10 p18

What s next 1Q-99 Block CPU simulator for UP Translates and caches basic blocks Drops back to Simple simulator for hard instructions (e.g., mtmsr) 2Q-99 User program debugging with gdb. Running on a real machine, user attaches to a simulated process running in SimOS-PPC and debugs it. 3Q-99 AIX system and user program profiling 4Q-99 & Beyond SimOS-PPC Support and IBM Proprietary Studies SimOS-PPC/1999-3-10 p19

Customers & Status External to IBM Users Principal Requirement Additional Development University of Texas Carnegie Mellon General architectural analyses tool Front End for PowerPC MW simulator Integrate cycle accurate models Integrate Microprocessor Workbench IBM Customers Principal Requirement Additional Development Unix Performance Group Software Tuning and Performance Debugging Future IBM Systems High End SMP and Processor Design OS Research Platform for debugging new OS boot Profilers and Instruction flow tracing Integrate cycle accurate models None SimOS-PPC/1999-3-10 p20

The Block Simulator Translates and caches sequences (blocks) of instructions Almost all the simulated machine registers reside in actual registers r0, r2 r15, r24 r31, cr, xer, all FP regs Re-use of values in registers across separately-generated blocks Block ended by Branch instruction Any instruction the block simulator can t deal with Generated code calls (via blrl) an assembly-language routine to handle Address translation for load/store Periodically checking for pending interrupts Complicated instructions (lmw/stmw, trap,...) Resolution of branch addresses SimOS-PPC/1999-3-10 p21

Translated block example Original block: 10008154 rlwinm r6, r6, 2, 0xFFFFFFFC 10008158 lwzx r0, r7, r6 1000815C cmpwi cr1, r0, 10 10008160 bge cr0, 0x10008180 Translated block: decr ctr, branch non-zero call support (time expired) rlwinm r6, r6, 2, 0xFFFFFFFC add r17, r7, r6 call support (translate for load) lwzx r0, r0, r17 cmpwi cr1, r0, 10 add 4 to instruction count bge cr0, call support (resolve branch fallthru) call support (resolve branch target) SimOS-PPC/1999-3-10 p21

Java Spec 1% MTRT KERNEL_INSTS USER_INSTS IDLE_INSTS 100% 90% 80% Percentage of Instructions 70% 60% 50% 40% 30% 20% 10% 0% 1 25 49 73 97 121 145 169 193 217 241 265 289 313 337 361 385 409 433 457 481 505 529 553 577 601 625 649 673 697 721 745 769 793 817 Interval is 1 Million Instructions

Java Spec 1% MTRT Caches (Stacked) IL1 Miss / KI DL1 Miss / KI L2 Miss / KI 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0 1 27 53 79 105 131 157 183 209 235 261 287 313 339 Misses Per 1000 Instructions 365 391 417 443 469 495 521 547 573 599 625 651 677 703 729 755 781 807 Each Interval is 1M Ins

Java Spec 1% MTRT Memory Reads/Writes (Stacked) MEM_READ MEM_WRITE 1.40E+06 2325254 1.20E+06 2131417 1.00E+06 Count 8.00E+05 6.00E+05 4.00E+05 2.00E+05 0.00E+00 1 28 55 82 109 136 163 190 217 244 271 298 325 352 379 406 433 460 487 514 541 568 595 622 649 676 703 730 757 784 811 Each Interval is 1M Instructions

Spec Java 1% MTRT Disk Activity (Stacked!) DISK_WRITE DISK_READ 140 120 248 100 80 60 40 20 0 1 27 53 79 105 131 157 183 209 235 261 287 313 339 365 391 417 443 469 495 521 547 573 599 625 651 677 703 729 755 781 807 Count of Reads/Writes Each Interval is 1M Instructions