ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Similar documents
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

Fast and Accurate Microarchitectural Simulation with ZSim

STATS AND CONFIGURATION

ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems

Multithreaded Processors. Department of Electrical Engineering Stanford University

Practical Near-Data Processing for In-Memory Analytics Frameworks

CSE502: Computer Architecture CSE 502: Computer Architecture

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

45-year CPU Evolution: 1 Law -2 Equations

Computer Architecture Spring 2016

Profiling: Understand Your Application

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

ibench: Quantifying Interference in Datacenter Applications

Towards Energy-Proportional Datacenter Memory with Mobile DRAM

High Level Queuing Architecture Model for High-end Processors

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Flexible Architecture Research Machine (FARM)

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

CS425 Computer Systems Architecture

Introduction to GPU programming with CUDA

Lecture 14: Multithreading

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Multithreading: Exploiting Thread-Level Parallelism within a Processor

System Simulator for x86

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

Grassroots ASPLOS. can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands

ATLAS: A Chip-Multiprocessor. with Transactional Memory Support

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

EECS 570 Final Exam - SOLUTIONS Winter 2015

Speculation and Future-Generation Computer Architecture

CS 654 Computer Architecture Summary. Peter Kemper

Lecture 18: Multithreading and Multicores

Software-Controlled Multithreading Using Informing Memory Operations

TDT 4260 lecture 7 spring semester 2015

ECE 571 Advanced Microprocessor-Based Design Lecture 4

RAMP-White / FAST-MP

Graphite IntroducDon and Overview. Goals, Architecture, and Performance

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Intel Architecture for Software Developers

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline?

Exploitation of instruction level parallelism

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation!

Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Kaisen Lin and Michael Conley

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed

November 7, 2014 Prediction

StressRight: Finding the Right Stress for Accurate In-development System Evaluation

Computer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014

Lecture 1: Introduction

SPECULATIVE MULTITHREADED ARCHITECTURES

MODELING OUT-OF-ORDER SUPERSCALAR PROCESSOR PERFORMANCE QUICKLY AND ACCURATELY WITH TRACES

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001

Copyright 2012, Elsevier Inc. All rights reserved.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Computer Systems Architecture

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

CS 6354: Memory Hierarchy I. 29 August 2016

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Computer Systems Architecture

Modern CPU Architectures

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013

Memory Hierarchy. Slides contents from:

Computer Science 146. Computer Architecture

Effective Prefetching for Multicore/Multiprocessor Systems

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

Handout 3 Multiprocessor and thread level parallelism

Simultaneous Multithreading on Pentium 4

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Spectre and Meltdown. Clifford Wolf q/talk

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]


Instruction Level Parallelism (Branch Prediction)

Microarchitecture Overview. Performance

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Adapted from David Patterson s slides on graduate computer architecture

CSE502: Computer Architecture CSE 502: Computer Architecture

Advanced Parallel Programming I

Advanced issues in pipelining

Hardware-Based Speculation

Architecture-Conscious Database Systems

Introduction to GPU computing

Final Lecture. A few minutes to wrap up and add some perspective

Chip-Multithreading Systems Need A New Operating Systems Scheduler

CS252 Spring 2017 Graduate Computer Architecture. Lecture 14: Multithreading Part 2 Synchronization 1

UNIT I (Two Marks Questions & Answers)

Transcription:

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013

Introduction 2 Current detailed simulators are slow (~200 KIPS) Simulation performance wall More complex targets (multicore, memory hierarchy, ) Hard to parallelize Problem: Time to simulate 1000 cores @ 2GHz for 1s at 200 KIPS: 4 months 200 MIPS: 3 hours Alternatives? FPGAs: Fast, good progress, but still hard to use Simplified/abstract models: Fast but inaccurate

ZSim Techniques 3 Three techniques to make 1000-core simulation practical: 1. Detailed DBT-accelerated core models to speed up sequential simulation 2. Bound-weave to scale parallel simulation 3. Lightweight user-level virtualization to bridge user-level/fullsystem gap ZSim achieves high performance and accuracy: Simulates 1024-core systems at 10s-1000s of MIPS 100-1000x faster than current simulators Validated against real Westmere system, avg error ~10%

This Presentation is Also a Demo! 4 ZSim is simulating these slides OOO cores @ 2 GHz 3-level cache hierarchy! ZSim performance relevant when busy Running 2-core laptop CPU ~12x slower than 16-core server Idle (< 0.1 cores active) 0.1 < cores active < 0.9 Busy (> 0.9 cores active) Total cycles and instructions simulated (in billions) Current simulation speed and basic stats (updated every 500ms)

Main Design Decisions 5 General execution-driven simulator: Functional model Timing model Emulation? (e.g., gem5, MARSSx86) Instrumentation? (e.g., Graphite, Sniper) Dynamic Binary Translation (Pin) Functional model for free Base ISA = Host ISA (x86) Cycle-driven? Event-driven? DBT-accelerated, instruction-driven core + Event-driven uncore

Outline 6 Introduction Detailed DBT-accelerated core models Bound-weave parallelization Lightweight user-level virtualization

Accelerating Core Models 7 Shift most of the work to DBT instrumentation phase Basic block Instrumented basic block + Basic block descriptor mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a Load(addr = (%rbp)) mov (%rbp),%rcx add %rax,%rdx Store(addr = (%rbp)) mov %rdx,(%rbp) BasicBlock(BBLDescriptor) ja 10840530a Ins µop decoding µop dependencies, functional units, latency Front-end delays Instruction-driven models: Simulate all stages at once for each instruction/ µop Accurate even with OOO if instruction window prioritizes older instructions Faster, but more complex than cycle-driven See paper for details

Detailed OOO Model 8 OOO core modeled and validated against Westmere Fetch Main Features Wrong-path fetches Branch Prediction Decode Issue OOO Exec Commit Front-end delays (predecoder, decoder) Detailed instruction to µop decoding Rename/capture stalls IW with limited size and width Functional unit delays and contention Detailed LSU (forwarding, fences, ) Reorder buffer with limited size and width

Detailed OOO Model 9 OOO core modeled and validated against Westmere Fetch Decode Fundamentally Hard to Model Wrong-path execution In Westmere, wrong-path instructions don t affect recovery latency or pollute caches Skipping OK Issue OOO Exec Commit Not Modeled (Yet) Rarely used instructions BTB LSD TLBs

Single-Thread Accuracy 10 29 SPEC CPU2006 apps for 50 Billion instructions Real: Xeon L5640 (Westmere), 3x DDR3-1333, no HT Simulated: OOO cores @ 2.27 GHz, detailed uncore 9.7% average IPC error, max 24%, 18/29 within 10%

Single-Thread Performance 11 Host: E5-2670 @ 2.6 GHz (single-thread simulation) 29 SPEC CPU2006 apps for 50 Billion instructions 40 MIPS hmean ~3x between least and most detailed models! 12 MIPS hmean ~10-100x faster

Outline 12 Introduction Detailed DBT-accelerated core models Bound-weave parallelization Lightweight user-level virtualization

Parallelization Techniques 13 Parallel Discrete Event Simulation (PDES): Divide components across host threads Execute events from each component maintaining illusion of full order Accurate Not scalable Skew < 10 cycles Host Thread 0 Mem 0 15 L3 Bank 0 L3 Bank 1 5 10 Core 0 Host Thread 1 15 10 5 Core 1 Lax synchronization: Allow skews above inter-component latencies, tolerate ordering violations Scalable Inaccurate

Characterizing Interference 14 Path-altering interference If we simulate two accesses out of order, their paths through the memory hierarchy change Path-preserving interference If we simulate two accesses out of order, their timing changes but their paths do not Mem Mem Mem 3 4 Mem 4 5 LLC LLC LLC (blocking) LLC (blocking) 1 2 GETS A MISS GETS A HIT 2 1 GETS A HIT GETS A MISS 1 5 2 6 GETS A MISS GETS B HIT 2 6 1 3 GETS A MISS GETS A HIT Core 0 Core1 Core 0 Core 1 Core 0 Core 1 Core 0 Core 1 In small intervals (1-10K cycles), path-altering interference is extremely rare (<1 in 10K accesses)

Bound-Weave Parallelization 15 Divide simulation in small intervals (e.g., 1000 cycles) Two parallel phases per interval: Bound and weave Bound phase: Find paths Weave phase: Find timings Bound-Weave equivalent to PDES for path-preserving interference

Bound-Weave Example 16 2-core host simulating 4-core system 1000-cycle intervals Divide components L1I among 2 domains Core 1 Bound Phase: Parallel simulation until cycle 1000, gather access traces Host Thread 0 Host Thread 1 Core 0 Core 3 Core 1 Core 2 Mem Ctrl 0 Mem Ctrl 1 L3 Bank 0 L3 Bank 1 L3 Bank 2 L3 Bank 3 L2 L2 L2 L2 L1D L1I L1D L1I L1D L1I L1D Core 0 Core 2 Core 3 Domain 0 Domain 1 Weave Phase: Parallel event-driven simulation of gathered traces until actual cycle 1000 Domain 0 Domain 1 Feedback: Adjust core cycles Core 3 Core 1 Core 2 Core 0 Bound Phase (until cycle 2000) Host Time

Bound-Weave Take-Aways 17 Minimal synchronization: Bound phase: Unordered accesses (like lax) Weave: Only sync on actual dependencies No ordering violations in weave phase Works with standard event-driven models e.g., 110 lines to integrate with DRAMSim2 See paper for details!

Multithreaded Accuracy 18 23 apps: PARSEC, SPLASH-2, SPEC OMP2001, STREAM 11.2% avg perf error (not IPC), 10/23 within 10% Similar differences as single-core results Scalability, contention model validation see paper

1024-Core Performance 19 Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads) Results for the 14/23 parallel apps that scale 200 MIPS hmean 41 MIPS hmean ~5x between least and most detailed models! ~100-1000x faster

Bound-Weave Scalability 20 10.1-13.6x speedup @ 16 cores

Outline 21 Introduction Detailed DBT-accelerated core models Bound-weave parallelization Lightweight user-level virtualization

Lightweight User-Level Virtualization 22 No 1Kcore OSs No parallel full-system DBT ZSim has to be user-level for now Problem: User-level simulators limited to simple workloads Lightweight user-level virtualization: Bridge the gap with full-system simulation Simulate accurately if time spent in OS is minimal

Lightweight User-Level Virtualization 23 Multiprocess workloads Scheduler (threads > cores) Time virtualization System virtualization See paper for: Simulator-OS deadlock avoidance Signals ISA extensions Fast-forwarding

Lightweight User-Level Virtualization 24 Multiprocess workloads Scheduler (threads > cores) Time virtualization System virtualization See paper for: Simulator-OS deadlock avoidance Signals ISA extensions Fast-forwarding

Lightweight User-Level Virtualization 25 Multiprocess workloads Scheduler (threads > cores) Time virtualization System virtualization See paper for: Simulator-OS deadlock avoidance Signals ISA extensions Fast-forwarding

Lightweight User-Level Virtualization 26 Multiprocess workloads Scheduler (threads > cores) Time virtualization System virtualization See paper for: Simulator-OS deadlock avoidance Signals ISA extensions Fast-forwarding

Lightweight User-Level Virtualization 27 Multiprocess workloads Scheduler (threads > cores) Time virtualization System virtualization See paper for: Simulator-OS deadlock avoidance Signals ISA extensions Fast-forwarding

ZSim Limitations 28 Not implemented yet: Multithreaded cores Detailed NoC models Virtual memory (TLBs) Fundamentally hard: Simulating speculation (e.g., transactional memory) Fine-grained message-passing across whole chip Kernel-intensive applications

Conclusions 29 Three techniques to make 1Kcore simulation practical DBT-accelerated models: 10-100x faster core models Bound-weave parallelization: ~10-15x speedup from parallelization with minimal accuracy loss Lightweight user-level virtualization: Simulate complex workloads without full-system support ZSim achieves high performance and accuracy: Simulates 1024-core systems at 10s-1000s of MIPS Validated against real Westmere system, avg error ~10% Source code available soon at zsim.csail.mit.edu

THANKS FOR YOUR ATTENTION! QUESTIONS?