Parallel SimOS: Scalability and Performance for Large System Simulation
1 Parallel SimOS: Scalability and Performance for Large System Simulation Ph.D. Oral Defense Robert E. Lantz Computer Systems Laboratory Stanford University 1
2 Overview This work develops methods to simulate large computer systems with practical performance We use smaller machines to simulate larger machines We extend the capabilities of computer system simulation by an order of magnitude, to systems of more than 1000 processors 2
3 Outline Background and Motivation Parallel SimOS Investigation Design Issues and Experiences Performance Evaluation Usability Evaluation Related Work Future Work and Conclusions 3
4 Why large systems? Large applications! Biology, chemistry, physics, engineering: from large systems (e.g. Earth's climate) to small systems (e.g. cells, DNA). Web applications, search, databases. Simulation, visualization (and games!)
5 Why simulate large systems? Compare alternative designs Verify a system before building it Predict behavior and performance Debug a system during bring-up Write software when the system is not available (or before it exists!) Avoid expensive mistakes 5
6 The SimOS System. Complete machine simulator developed in the CSL. Simulates the complete hardware of a computer system: CPU, memory, devices. Enough speed and detail to run a full operating system, system software, and application programs. Multiple CPU and memory models for fast or detailed performance and behavioral modeling. [Diagram: the target workload and target OS run on SimOS's simulated hardware (CPU model, memory model with processors and memories, and device models for disk, network, and other devices), all hosted on the host OS and host hardware]
7 Using SimOS. [Diagram: a disk image (OS, system software, user applications), config/control scripts, application data, and external I/O feed into SimOS, which produces modeled performance and event statistics, program output, and simulator statistics]
8 Performance Terminology Execution time is the most meaningful measurement of simulator performance Slowdown = Real Time/Simulated Time Slowdown tells you how much longer it will take to simulate a workload compared to running it on actual hardware Self-relative slowdown compares a simulator with the machine it is running on 8
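As a quick illustration of these definitions (the numbers below are hypothetical, chosen only for arithmetic clarity):

```python
# Hypothetical numbers for illustration: a 60-second workload that takes
# 10 minutes of wall-clock time to simulate.
real_seconds = 600.0       # wall-clock (real) time the simulator ran
simulated_seconds = 60.0   # virtual time covered by the workload

slowdown = real_seconds / simulated_seconds   # Slowdown = Real/Simulated
print(slowdown)   # 10.0, i.e. a "10x" simulator
```

A self-relative slowdown of 10 means the simulated machine runs its workload ten times slower than the host it is simulated on.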
9 Speed/Detail Trade-off: SimOS CPU and Memory Models
- MXS: dynamic, superscalar microarchitecture model; non-blocking memory system
- Mipsy: sequential interpreter; blocking memory system
- Embra w/caches: single-cycle CPU model; simplified cache model
- Embra: single-cycle CPU and memory model; ~10x self-relative slowdown
(The slide's table also gave approximate KIPS on a 225 MHz R10000 and self-relative slowdowns for each mode.)
10 Benefits of fast simulation. Makes it possible to simulate complex workloads of many billions of cycles. Allows software development and debugging with interactive usability. Enables exploration of a large design space: a real OS, system software, and large applications. Provides positioning before more detailed simulation: a rough estimate of performance and trends.
11 SimOS Applications Used in design, development, debugging of Stanford FLASH multiprocessor throughout its life cycle Enabled numerous studies of OS and application performance Research platform for operating systems, virtual machines, visualization 11
12 SimOS Limitations. As we simulate larger machines, slowdown increases. [Figure: slowdown (real time/simulated time) vs. number of simulated processors for Barnes, FFT, Radix, and LU, climbing toward 15,000x]
13 SimOS Limitations....resulting in longer simulation times. [Figure: time (minutes) to simulate one minute of virtual time vs. number of simulated processors, growing from about 10 minutes at small sizes through hours to more than a week at the largest sizes]
14 Problem: Simulator Slowdown. What causes simulator slowdown? Intrinsic slowdown, resource exhaustion, and linear slowdown, which combine multiplicatively: Simulation Time = Workload Time * (Intrinsic Slowdown + Resource Exhaustion Penalty) * Linear Slowdown
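The multiplicative model above can be sketched numerically; all values here are hypothetical, chosen only to show how the three terms compound:

```python
# Sketch of the slide's multiplicative slowdown model. The specific
# numbers below are hypothetical, not measurements from SimOS.
def simulation_time(workload_time, intrinsic, resource_penalty, linear):
    # Simulation Time =
    #   Workload Time * (Intrinsic + Resource Exhaustion Penalty) * Linear
    return workload_time * (intrinsic + resource_penalty) * linear

# 60 s workload, 10x intrinsic slowdown, 5x resource-exhaustion penalty,
# 32-way CPU multiplexing (the linear term):
seconds = simulation_time(60, 10, 5, 32)
print(seconds / 3600)   # 8.0 hours to simulate one minute
```

Because the terms multiply, shrinking any one of them (as Parallel SimOS does for the resource and linear terms) cuts total simulation time proportionally.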
15 Solution: Parallel SimOS Use increased capacity of shared-memory multiprocessors to address resource exhaustion and linear slowdown Extend speed/detail trade-off with fast, parallel mode of simulation Goal: eliminate slowdown due to parallelism and increase scalability to enable large system simulation with practical performance 15
16 Outline Background and Motivation Parallel SimOS Investigation Design Issues and Experiences Embra background Parallel Embra Design Performance Evaluation Usability Evaluation Related Work Future Work and Conclusions 16
17 Embra: SimOS's fastest simulation mode. A binary-translation CPU and memory simulator built around a Translation Cache (TC), with callouts to handle events, MMU operations, exceptions, and annotations; simulates multiple CPUs by multiplexing; ~10x base slowdown. [Diagram: Embra internals: kernel and user Translation Caches with a TC index, decoder and translator, callout and exception handlers, event handlers, MMU cache and MMU handler, statistics reporting, and the SimOS interface]
18 Embra: sources of slowdown Binary translation overhead Multiplexing overhead Resource Exhaustion ST = WT * (Slowdown(I) + Slowdown(R)) * M 18
19 Binary translation overhead. The decoder and translator expand each simulated instruction into several host instructions, which are stored in the Translation Cache (TC), indexed by simulated PC. For example, the simulated sequence

    lw r1, (r2)
    lw r3, (r4)
    add r5, r1, r3

is translated into

    lw SIM_T1, R2(cpu_base)
    jal mem_read_addr
    lw SIM_T2, (SIM_T1)
    sw SIM_T2, R1(cpu_base)
    lw SIM_T1, R4(cpu_base)
    jal mem_read_addr
    lw SIM_T3, (SIM_T1)
    sw SIM_T3, R3(cpu_base)
    add.w SIM_T1, SIM_T2, SIM_T3
    sw SIM_T1, R5(cpu_base)
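The translate-once, execute-many structure that makes binary translation pay off can be sketched as follows; the class and names are hypothetical stand-ins, not Embra's actual code:

```python
# Minimal sketch of a PC-indexed translation cache: each simulated PC is
# translated once, then the cached translation is reused on every later
# execution, amortizing the translation overhead.
class TranslationCache:
    def __init__(self, translate):
        self.tc = {}               # index: simulated PC -> translated code
        self.translate = translate # stand-in for the decoder/translator
        self.misses = 0

    def lookup(self, pc):
        if pc not in self.tc:      # TC miss: run the decoder/translator
            self.misses += 1
            self.tc[pc] = self.translate(pc)
        return self.tc[pc]         # TC hit: reuse the stored translation

tc = TranslationCache(lambda pc: f"host code for {pc:#x}")
for _ in range(1000):
    tc.lookup(0x1000)              # translated once, executed 1000 times
```

In a hot loop the translation cost is paid once, so the steady-state slowdown is dominated by the expanded host-instruction count rather than by translation itself.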
20 CPU multiplexing. Each simulated CPU's state (registers, FPU, MMU, and other state) lives in a CPU state array, and Embra context-switches among the simulated CPUs with a variable timeslice: a large timeslice gives low overhead, a small one gives better responsiveness, and the minimal timeslice corresponds to MPinUP mode.
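A minimal sketch of this multiplexing loop, with an illustrative interface (not Embra's):

```python
# Sketch of CPU multiplexing: one host CPU round-robins over the simulated
# CPUs' state, running each for a timeslice of simulated cycles before
# context-switching to the next.
def multiplex(num_cpus, timeslice, cycles_per_cpu):
    executed = [0] * num_cpus
    switches = 0
    while min(executed) < cycles_per_cpu:
        for cpu in range(num_cpus):          # context switch to next CPU
            run = min(timeslice, cycles_per_cpu - executed[cpu])
            executed[cpu] += run             # simulate `run` cycles here
            switches += 1
    return executed, switches

# A large timeslice means fewer context switches (lower overhead); a small
# one keeps the simulated CPUs closer together in virtual time.
done, switches = multiplex(num_cpus=4, timeslice=100, cycles_per_cpu=1000)
```

With a timeslice of 100 cycles, each of the 4 CPUs needs 10 slices, so the loop performs 40 context switches; doubling the timeslice would halve that overhead at the cost of coarser interleaving.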
21 A new, faster mode: Parallel Embra. Uses the parallelism and memory system of a shared-memory multiprocessor, taking a decimation-in-space approach: simulated nodes are distributed across simulator threads. Parallelism and increased memory bandwidth reduce linear slowdown and resource exhaustion: ST = WT * (Slowdown(I) + Slowdown(R)) * M
22 Design Evolution We started with a baseline design and evolved it to achieve scalable performance Baseline: thread-based parallelism, shared memory Critical design features: Mirroring hardware in software Replication, fine-grained parallelism Unsynchronized execution speed 22
23 Design: Software should mirror Hardware. A shared Translation Cache to reduce overhead? Problem: contention and serialization from chaining and cache conflicts; sharing fuses what the hardware keeps separate and breaks parallelism. Solution: mirror hardware in software with replicated Translation Caches.
24 Design: Software should mirror Hardware. A shared event queue for global ordering? It seems cheap because events are rare, but event frequency increases with parallelism. Solution: replicated event queues, again mirroring hardware in software.
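A per-node event queue along these lines might look like the following sketch; the class and method names are hypothetical:

```python
# Sketch of replicated (per-node) event queues, mirroring hardware in
# software: each simulator thread posts to and drains its own private
# queue, so no global lock or shared ordering structure is needed.
import heapq

class NodeEventQueue:
    def __init__(self):
        self.q = []                          # private to one simulator thread

    def post(self, fire_time, event):
        heapq.heappush(self.q, (fire_time, event))

    def drain_until(self, now):
        """Fire every event whose time has arrived on this node."""
        fired = []
        while self.q and self.q[0][0] <= now:
            fired.append(heapq.heappop(self.q)[1])
        return fired

q = NodeEventQueue()
q.post(50, "timer")
q.post(10, "disk-done")
ready = q.drain_until(20)        # only "disk-done" has fired by cycle 20
```

Because each queue is touched by exactly one thread, event handling scales with the number of simulated nodes instead of serializing on one shared queue.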
25 Design: Software should mirror Hardware. 90% of time is spent in the TC, so why not parallelize only the TC? Problem: Amdahl's law limits the achievable speedup, and frequent callouts create contention everywhere. Result: critical region expansion and serialization.
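Amdahl's law makes the ceiling concrete: if only the ~90% of time spent in the TC is parallelized, speedup can never exceed 10x no matter how many host CPUs are used. A quick calculation (the 0.9 fraction is from the slide; the CPU counts are illustrative):

```python
# Amdahl's law: speedup with a fraction p of execution parallelized
# across n processors, the rest remaining serial.
def amdahl(p, n):
    return 1.0 / ((1 - p) + p / n)

print(round(amdahl(0.9, 32), 1))      # only ~7.8x on 32 host CPUs
print(round(amdahl(0.9, 10**9), 1))   # approaches the 10x ceiling
```

This is why Parallel Embra parallelizes callouts, event handling, and the rest of the simulator as well, rather than the TC alone.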
26 Critical Region Expansion. [Diagram: over time, contention and descheduling expand critical regions, leading to serialization]
27 Design: Software should mirror Hardware. Solution: mirror hardware in software with fine-grained parallelism throughout Parallel Embra. The OS and applications require parallel callouts from the Translation Cache. Parallel statistics reporting is also a good idea, but it happens infrequently.
28 Design: flexible virtual time synchronization. Problem: cycle skew between fast and slow processors. Solution: configurable barrier synchronization in which fast processors wait for slow processors, with a variable interval for flexibility: fine-grained (like MPinUP mode) or loose-grained (to reduce synchronization overhead).
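A sketch of interval-based barrier synchronization, using Python's threading.Barrier as a stand-in for Embra's actual mechanism (all parameters illustrative):

```python
# Sketch of configurable virtual-time synchronization: each simulator
# thread runs freely for `interval` simulated cycles, then waits at a
# barrier, so no CPU can outrun another by more than one interval.
import threading

def simulate_cpu(cpu_id, barrier, interval, total_cycles, finished):
    cycle = 0
    while cycle < total_cycles:
        cycle += interval          # execute one interval of cycles
        barrier.wait()             # fast processors wait for slow ones
    finished.append(cpu_id)

finished = []
barrier = threading.Barrier(2)     # two simulated CPUs
threads = [threading.Thread(target=simulate_cpu,
                            args=(i, barrier, 1000, 4000, finished))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Growing `interval` reduces barrier overhead at the cost of more cycle skew; in the limit (an interval longer than the workload) the CPUs never synchronize, which is the unsynchronized mode discussed on the next slides.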
29 Design: synchronization causes slowdown. [Figure: 32-processor slowdown vs. synchronization interval in cycles for Barnes, FFT, LU, MP3D, Ocean, Raytrace, Radix, and Water; slowdown falls as the interval grows]
30 Design: unsynchronized execution For performance, the best synchronization interval is longer than the workload, i.e. never synchronize We were surprised to find that both the OS and parallel benchmarks ran correctly with unlimited time skew This is because every thread sees a consistent ordering of memory and synchronization events 30
31 Design conclusions
- Parallelism increases contention for callouts, the event system, the TC, the clock, the MMU, interrupt controllers, and any shared subsystem
- Contention cascades, resulting in critical region expansion and serialization
- Mirroring hardware in software preserves parallelism and avoids contention effects
- Fine-grained synchronization is required to permit correct and highly parallel access to simulator data
- Time synchronization across processors is unnecessary for correctness and undesirable for speed
- Performance depends on the combination of all parallel performance features
32 Outline Background and Motivation Parallel SimOS Investigation Design Issues and Experiences Performance Evaluation Usability Evaluation Related Work Future Work Conclusions 32
33 Performance: Test Configuration. Machine: Stanford FLASH Multiprocessor, 64 nodes, MIPS R10000 at 225 MHz, 220 MB DRAM/node (14 GB total); configurations flash1, flash32, flash64, etc. Workloads:
- Barnes: hierarchical Barnes-Hut method for the N-body problem
- FFT: Fast Fourier Transform
- LU: lower/upper matrix factorization
- MP3D: particle-based hypersonic wind tunnel simulation
- Radix: integer radix sort
- Raytrace: ray tracer
- Ocean: ocean currents simulation
- Water: water molecule simulation
- pmake: compile phase of the Modified Andrew Benchmark
- ptest: simple benchmark for sanity check/peak performance
34 Performance: Peak and actual MIPS. [Figure: simulation MIPS over time for Flash32 running ptest and the SPLASH-2 suite, peaking near 1600 MIPS]. Overall result: > 1000 MIPS in simulation, ~10x slowdown compared to hardware.
35 Performance: Hardware self-relative slowdown. [Figure: self-relative slowdown vs. simulated machine size for Barnes, FFT, LU, MP3D, Ocean, Radix, Raytrace, Water, pmake, LU-big, and Radix-big]. ~10x slowdown regardless of machine size.
36 Performance: benchmark phases. [Figures: per-phase behavior of Barnes and LU on Flash32]
37 Performance: benchmark phases (continued). [Figure: per-phase behavior of MP3D on Flash32]
38 Large Scale Performance 38
39 Large Scale Performance. [Figure: slowdown (real time/simulated time) for serial SimOS vs. Parallel SimOS on Radix/Flash32 (10,323x serial) and LU/Flash64 (9,409x serial), with Parallel SimOS far lower]. Hours or days rather than weeks.
40 Speed/Detail Trade-off, revisited: Parallel SimOS CPU and Memory Models
- MXS: dynamic, superscalar microarchitecture model; non-blocking memory system
- Mipsy: sequential interpreter; blocking memory system
- Embra w/caches: single-cycle CPU model; simplified cache model
- Embra: single-cycle CPU and memory model; ~10x self-relative slowdown
- Parallel Embra: non-deterministic, single-cycle CPU and memory model; > 1,000,000 KIPS at ~10x self-relative slowdown
(The slide's table gave approximate KIPS relative to a 225 MHz R10K.)
41 Performance Conclusions. Parallel SimOS achieves peak and actual MIPS far beyond serial SimOS. Parallel SimOS simulates a multiprocessor with performance comparable to serial SimOS simulating a uniprocessor. Parallel SimOS extends the scalability of complete machine simulation to 1024-processor systems.
42 Usability Study. A study of a large, complex parallel program: Parallel SimOS itself. Self-hosting exercises the orthogonal capabilities of the simulators, serving both as performance debugging of Parallel SimOS and as a test of functionality and usability. Self-hosting stack, top to bottom: benchmark (Radix); inner Irix 6.5; inner SimOS; outer Irix 6.5; outer SimOS; Irix 6.5; hardware (SGI Origin).
43 Phase profile. [Figures: CPU time per computation interval for self-hosted Radix under serial SimOS and under Parallel SimOS]. Bugs found: excessive TLB misses, interrupt storms. Limitation: system imbalance effects.
44 Usability Conclusions. Parallel SimOS worked correctly on itself and revealed bugs and limitations of Parallel SimOS. The speed/detail trade-off was enabled with checkpoints. Detailed mode was too slow; we ended up scaling down the workload. There is a need for faster detailed simulation modes.
45 Limitations. Virtual time depends on real time (but checkpoints can help). System imbalance effects. Memory limits. Need for a fast detailed mode. Loss of determinism and repeatability (future work).
46 Related Work Parallel SimOS uses shared-memory multiprocessors and decimation in space Other approaches to improving performance using parallelism include: Decimation in time Cluster-based simulation 46
47 Related Work: Decimation in Time. An initial serial execution drops checkpoints that divide the run into segments; the segments are then re-executed in parallel from their checkpoints (with some overlap), and the results are serially reconstructed. ST = WT * (Slowdown(I) + Slowdown(R)) * N
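The checkpoint-then-reexecute structure can be sketched as follows; the thread pool and function names are illustrative, not any particular system's implementation:

```python
# Sketch of decimation in time: a fast serial pass drops checkpoints, then
# each segment is re-simulated in detail in parallel, starting from its
# checkpoint, and the per-segment results are stitched back together.
from concurrent.futures import ThreadPoolExecutor

def detailed_resimulation(checkpoint_cycle):
    # Stand-in for detailed re-execution of one segment from a checkpoint.
    return f"detailed stats from cycle {checkpoint_cycle}"

checkpoints = [0, 1_000_000, 2_000_000, 3_000_000]  # from the serial pass
with ThreadPoolExecutor(max_workers=4) as pool:
    segment_stats = list(pool.map(detailed_resimulation, checkpoints))
```

Note the trade-off the summary slide draws: the serial pass must finish before parallel re-execution starts, which is why this approach costs interactivity even as it gains speedup.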
48 Parallel SimOS: Decimation in Space. Simulated nodes are distributed across simulator threads. ST = WT * (Slowdown(I) + Slowdown(R)) * M
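A minimal sketch of the decimation-in-space partitioning, with an illustrative ceiling-division scheme (the real assignment policy is not specified here):

```python
# Sketch of decimation in space: the simulated machine's CPUs are
# statically partitioned across the simulator threads, each thread
# owning one contiguous slice of the machine.
def partition(simulated_cpus, simulator_threads):
    per = -(-simulated_cpus // simulator_threads)   # ceiling division
    return [list(range(lo, min(lo + per, simulated_cpus)))
            for lo in range(0, simulated_cpus, per)]

slices = partition(simulated_cpus=8, simulator_threads=3)
# slices -> [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Each thread then runs a full Embra-style simulation loop over only its own slice, which is what lets parallelism attack the linear (M) term in the slowdown equation.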
49 Related Work: Cluster-based Simulation. The most common means of parallel simulation (Shaman, BigSim, others). But even a "fast" LAN means high-latency communication, software-based shared memory means low performance, and flexibility is reduced.
50 Parallel SimOS: Flexible Simulation. Supports tightly and loosely coupled machines, from workstation clusters (e.g. Sweet Hall) to NUMA shared-memory multiprocessors (the Stanford FLASH machine) and everything in between, including parallelism across multiprocessor nodes and multi-level multiprocessor clusters. [Diagram: a networked workstation cluster; a NUMA multiprocessor with per-node CPU, cache, and memory controller on a bus/interconnect; and a multiprocessor cluster with per-node network interfaces]
51 Related Work Summary. Decimation in time achieves good speedup at the expense of interactivity; it is synergistic with Parallel SimOS. Cluster-based simulation addresses the needs of loosely-coupled systems, generally without shared memory. The Parallel SimOS approach achieves programmability and performance for a larger design space that includes tightly-coupled and hybrid systems.
52 Future Work Faster detailed simulation Parallel detailed mode with flexible memory, pipeline models Try to recapture determinism Faster less-detailed simulation Global memory ordering in virtual time Revisit direct execution, using virtual machine monitors, user-mode OS, etc. 52
53 Conclusion: Thesis Contributions. Developed the design and implementation of scalable, parallel complete machine simulation. Eliminated slowdown due to resource exhaustion and multiplexing. Scaled complete machine simulation up by an order of magnitude, to 1024-processor machines, on our hardware. Developed a flexible simulator capable of simulating large, tightly-coupled systems with interactive performance.
More informationMulticast Snooping: A Multicast Address Network. A New Coherence Method Using. With sponsorship and/or participation from. Mark Hill & David Wood
Multicast Snooping: A New Coherence Method Using A Multicast Address Ender Bilir, Ross Dickson, Ying Hu, Manoj Plakal, Daniel Sorin, Mark Hill & David Wood Computer Sciences Department University of Wisconsin
More informationcsci 3411: Operating Systems
csci 3411: Operating Systems Memory Management II Gabriel Parmer Slides adapted from Silberschatz and West Each Process has its Own Little World Virtual Address Space Picture from The
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationSorting. Overview. External sorting. Warm up: in memory sorting. Purpose. Overview. Sort benchmarks
15-823 Advanced Topics in Database Systems Performance Sorting Shimin Chen School of Computer Science Carnegie Mellon University 22 March 2001 Sort benchmarks A base case: AlphaSort Improving Sort Performance
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More informationFinal Lecture. A few minutes to wrap up and add some perspective
Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection
More informationPerformance and Power Impact of Issuewidth in Chip-Multiprocessor Cores
Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions
More informationChapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!
Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and
More informationExample Networks on chip Freescale: MPC Telematics chip
Lecture 22: Interconnects & I/O Administration Take QUIZ 16 over P&H 6.6-10, 6.12-14 before 11:59pm Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Exams in ACES
More informationDisco. CS380L: Mike Dahlin. September 13, This week: Disco and Exokernel. One lesson: If at first you don t succeed, try try again.
Disco CS380L: Mike Dahlin September 13, 2007 Disco: A bad idea from the 70 s, and it s back! Mendel Rosenblum (tongue in cheek) 1 Preliminaries 1.1 Review 1.2 Outline 1.3 Preview This week: Disco and Exokernel.
More informationJackson Marusarz Intel Corporation
Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits
More informationReadings. Storage Hierarchy III: I/O System. I/O (Disk) Performance. I/O Device Characteristics. often boring, but still quite important
Storage Hierarchy III: I/O System Readings reg I$ D$ L2 L3 memory disk (swap) often boring, but still quite important ostensibly about general I/O, mainly about disks performance: latency & throughput
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationLight64: Ligh support for data ra. Darko Marinov, Josep Torrellas. a.cs.uiuc.edu
: Ligh htweight hardware support for data ra ce detection ec during systematic testing Adrian Nistor, Darko Marinov, Josep Torrellas University of Illinois, Urbana Champaign http://iacoma a.cs.uiuc.edu
More informationProtoFlex: FPGA-Accelerated Hybrid Simulator
ProtoFlex: FPGA-Accelerated Hybrid Simulator Eric S. Chung, Eriko Nurvitadhi James C. Hoe, Babak Falsafi, Ken Mai Computer Architecture Lab at Multiprocessor Simulation Simulating one processor in software
More informationPerformance analysis basics
Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis
More informationDEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level
More informationLecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1
Lecture - 4 Measurement Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1 Acknowledgements David Patterson Dr. Roger Kieckhafer 9/29/2009 2 Computer Architecture is Design and Analysis
More informationMultiprocessor and Real-Time Scheduling. Chapter 10
Multiprocessor and Real-Time Scheduling Chapter 10 1 Roadmap Multiprocessor Scheduling Real-Time Scheduling Linux Scheduling Unix SVR4 Scheduling Windows Scheduling Classifications of Multiprocessor Systems
More informationSDSM Progression. Implementing Shared Memory on Distributed Systems. Software Distributed Shared Memory. Why a Distributed Shared Memory (DSM) System?
SDSM Progression Implementing Shared Memory on Distributed Systems Sandhya Dwarkadas University of Rochester TreadMarks shared memory for networks of workstations Cashmere-2L - 2-level shared memory system
More informationChapter 6. Storage and Other I/O Topics
Chapter 6 Storage and Other I/O Topics Introduction I/O devices can be characterized by Behaviour: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationPersistent Storage - Datastructures and Algorithms
Persistent Storage - Datastructures and Algorithms 1 / 21 L 03: Virtual Memory and Caches 2 / 21 Questions How to access data, when sequential access is too slow? Direct access (random access) file, how
More informationEI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)
EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building
More informationLecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter
Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)
More informationComputer Architecture
Computer Architecture Pipelined and Parallel Processor Design Michael J. Flynn Stanford University Technische Universrtat Darmstadt FACHBEREICH INFORMATIK BIBLIOTHEK lnventar-nr.: Sachgebiete: Standort:
More informationChapter 8: Main Memory. Operating System Concepts 9 th Edition
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationBuilding High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye
Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink Robert Kaye 1 Agenda Once upon a time ARM designed systems Compute trends Bringing it all together with CoreLink 400
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationVirtual Memory: From Address Translation to Demand Paging
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 12, 2014
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More information