ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS


Daniel Sanchez (MIT) and Christos Kozyrakis (Stanford), ISCA-40, June 27, 2013

Introduction

Current detailed simulators are slow (~200 KIPS), creating a simulation performance wall: targets keep getting more complex (multicore, memory hierarchy, ...) and simulators are hard to parallelize. The problem: simulating 1000 cores @ 2 GHz for 1 s takes 4 months at 200 KIPS, but only 3 hours at 200 MIPS. Alternatives? FPGAs are fast and making good progress, but still hard to use; simplified/abstract models are fast but inaccurate.
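The slide's time figures follow from back-of-the-envelope arithmetic; this sketch checks them, assuming ~1 instruction per cycle per core (the 4-month and 3-hour numbers on the slide are rounded):

```python
# Back-of-the-envelope check of the slide's simulation-time figures.
cores = 1000
target_freq = 2e9           # 2 GHz target clock
target_seconds = 1.0
ipc = 1.0                   # assumption: ~1 instruction per cycle per core

total_instrs = cores * target_freq * target_seconds * ipc  # 2e12 instructions

def sim_time(rate_ips):
    """Wall-clock seconds to simulate total_instrs at rate_ips instructions/s."""
    return total_instrs / rate_ips

months_at_200kips = sim_time(200e3) / (30 * 24 * 3600)
hours_at_200mips = sim_time(200e6) / 3600

print(f"At 200 KIPS: {months_at_200kips:.1f} months")   # ~3.9 months
print(f"At 200 MIPS: {hours_at_200mips:.1f} hours")     # ~2.8 hours
```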

ZSim Techniques

Three techniques make 1000-core simulation practical:
1. Detailed DBT-accelerated core models to speed up sequential simulation
2. Bound-weave parallelization to scale parallel simulation
3. Lightweight user-level virtualization to bridge the user-level/full-system gap

ZSim achieves high performance and accuracy: it simulates 1024-core systems at 10s-1000s of MIPS, 100-1000x faster than current simulators, and is validated against a real Westmere system with ~10% average error.

This Presentation Is Also a Demo!

ZSim is simulating these slides: OOO cores @ 2 GHz, 3-level cache hierarchy! Running on a 2-core laptop CPU, it is ~12x slower than on a 16-core server; ZSim performance is most relevant when the machine is busy. The on-screen display classifies activity as idle (< 0.1 cores active), partially active (0.1 to 0.9 cores active), or busy (> 0.9 cores active), and shows total cycles and instructions simulated (in billions) plus current simulation speed and basic stats, updated every 500 ms.

Main Design Decisions

ZSim is a general execution-driven simulator, combining a functional model and a timing model. For the functional model: emulation (e.g., gem5, MARSSx86)? Instrumentation (e.g., Graphite, Sniper)? ZSim uses dynamic binary translation (Pin), which gives the functional model for free, with base ISA = host ISA (x86). For the timing model: cycle-driven? Event-driven? ZSim uses a DBT-accelerated, instruction-driven core plus an event-driven uncore.

Outline

- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

Accelerating Core Models

Shift most of the work to the DBT instrumentation phase. A basic block:

mov (%rbp),%rcx
add %rax,%rdx
mov %rdx,(%rbp)
ja 40530a

becomes an instrumented basic block plus a basic block descriptor:

Load(addr = (%rbp))
mov (%rbp),%rcx
add %rax,%rdx
Store(addr = (%rbp))
mov %rdx,(%rbp)
BasicBlock(BBLDescriptor)
ja 40530a

The descriptor precomputes, per instruction: µop decoding; µop dependencies, functional units, and latencies; and front-end delays. Instruction-driven models simulate all stages at once for each instruction/µop. This is accurate even with OOO cores if the instruction window prioritizes older instructions, and it is faster, though more complex, than cycle-driven simulation. See the paper for details.
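The idea can be sketched in a few lines (a simplification with hypothetical names; zsim itself does this in C++ via Pin): decode each static basic block into a descriptor once, at instrumentation time, so each dynamic execution only replays precomputed costs.

```python
from dataclasses import dataclass

@dataclass
class BblDescriptor:
    """Per-basic-block state precomputed once, at DBT instrumentation time."""
    uop_latencies: list       # latency of each decoded uop
    frontend_delay: int       # predecoder/decoder stalls for this block

# Descriptor cache: decode each static basic block only once.
descriptors = {}

def instrument(pc, instructions):
    """Runs once per static basic block (the expensive part)."""
    if pc not in descriptors:
        # Stand-in for real uop decoding: 1 cycle per instruction,
        # plus a fixed front-end penalty for long blocks.
        descriptors[pc] = BblDescriptor(
            uop_latencies=[1] * len(instructions),
            frontend_delay=1 if len(instructions) > 4 else 0,
        )
    return descriptors[pc]

def simulate_bbl(core_cycles, desc):
    """Runs on every dynamic execution (the cheap part): just add costs."""
    return core_cycles + desc.frontend_delay + sum(desc.uop_latencies)

cycles = 0
desc = instrument(0x40530a, ["mov", "add", "mov", "ja"])
for _ in range(1000):            # hot loop: the decode cost is paid only once
    cycles = simulate_bbl(cycles, desc)
print(cycles)  # 4000
```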

Detailed OOO Model

The OOO core is modeled and validated against Westmere. Pipeline stages: Fetch, Decode, Issue, OOO Exec, Commit. Main features: wrong-path fetches; branch prediction; front-end delays (predecoder, decoder); detailed instruction-to-µop decoding; rename/capture stalls; an instruction window with limited size and width; functional unit delays and contention; a detailed LSU (forwarding, fences, ...); and a reorder buffer with limited size and width.

Detailed OOO Model (continued)

Fundamentally hard to model: wrong-path execution. In Westmere, wrong-path instructions don't affect recovery latency or pollute caches, so skipping them is OK. Not modeled (yet): rarely used instructions, the BTB, the LSD, and TLBs.

Single-Thread Accuracy

29 SPEC CPU2006 apps, run for 50 billion instructions each. Real system: Xeon L5640 (Westmere), 3x DDR3-1333, no HT. Simulated: OOO cores @ 2.27 GHz, detailed uncore. Result: 9.7% average IPC error, 24% max, with 18/29 apps within 10%.

Single-Thread Performance

Host: E5-2670 @ 2.6 GHz (single-thread simulation), 29 SPEC CPU2006 apps for 50 billion instructions. The least detailed core model achieves 40 MIPS (harmonic mean) and the most detailed 12 MIPS, only ~3x between them, and ~10-100x faster than current simulators.

Outline

- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

Parallelization Techniques

Parallel discrete event simulation (PDES): divide components across host threads and execute events from each component while maintaining the illusion of full order (skew < 10 cycles). Accurate, but not scalable. Lax synchronization: allow skews above inter-component latencies and tolerate ordering violations. Scalable, but inaccurate. [Figure: two host threads exchanging timed events among cores, L3 banks, and memory.]

Characterizing Interference

Path-altering interference: if we simulate two accesses out of order, their paths through the memory hierarchy change. Path-preserving interference: if we simulate two accesses out of order, their timing changes but their paths do not. [Figure: cores issuing GETS requests to a shared LLC; reordering can flip a hit into a miss.] Key observation: in small intervals (1-10K cycles), path-altering interference is extremely rare (<1 in 10K accesses).
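A toy single-line cache (hypothetical, not zsim code) shows why simulation order matters: reordering two conflicting accesses changes which one hits, i.e., path-altering interference.

```python
class TinyCache:
    """One-line direct-mapped cache: any access evicts the previous line."""
    def __init__(self):
        self.line = None

    def access(self, addr):
        hit = (self.line == addr)
        self.line = addr          # fill on miss, no-op on hit
        return "HIT" if hit else "MISS"

def simulate(order):
    cache = TinyCache()
    cache.access("A")             # warm-up: line A is resident
    return [(addr, cache.access(addr)) for addr in order]

# Core 0 accesses B, core 1 accesses A. Simulating the two accesses in
# different orders changes A's path through the hierarchy (hit vs. miss):
print(simulate(["B", "A"]))  # [('B', 'MISS'), ('A', 'MISS')]
print(simulate(["A", "B"]))  # [('A', 'HIT'), ('B', 'MISS')]
```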

Bound-Weave Parallelization

Divide the simulation into small intervals (e.g., 1000 cycles), with two parallel phases per interval: the bound phase finds paths, and the weave phase finds timings. Bound-weave is equivalent to PDES for path-preserving interference.

Bound-Weave Example

A 2-core host simulates a 4-core system with 1000-cycle intervals; the components (cores, L1I/L1D and L2 caches, L3 banks, memory controllers) are divided among 2 domains. Bound phase: unordered simulation until cycle 1000, gathering access traces. Weave phase: parallel event-driven simulation of the gathered traces until cycle 1000. Then the bound phase for the next interval runs (until cycle 2000), and so on. [Figure: timeline of the two host threads alternating bound and weave phases across the domains.]
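The interval structure can be sketched as follows (a serial simplification with hypothetical names; real zsim runs both phases in parallel across host threads and domains):

```python
from dataclasses import dataclass

INTERVAL = 1000  # cycles per interval

@dataclass
class Access:
    core: int
    cycle: int      # approximate issue cycle from the bound phase
    latency: int    # path-determined latency (e.g., hit vs. miss)

class Core:
    """Toy core that issues one memory access every 100 cycles."""
    def __init__(self, cid):
        self.cid, self.cycle = cid, 0

    def run_until(self, limit):
        trace = []
        while self.cycle < limit:
            trace.append(Access(self.cid, self.cycle, latency=10))
            self.cycle += 100
        return trace

def bound_phase(cores, limit):
    # Unordered simulation: each core runs independently to the interval
    # end, recording the path of every access (no cross-core ordering).
    traces = []
    for core in cores:
        traces.extend(core.run_until(limit))
    return traces

def weave_phase(traces):
    # Event-driven replay in timestamp order: timings are adjusted for
    # contention (here, a single-ported memory), but paths never change.
    port_free, finish = 0, {}
    for ev in sorted(traces, key=lambda e: e.cycle):
        start = max(ev.cycle, port_free)   # wait for the memory port
        port_free = start + ev.latency
        finish[ev.core] = port_free
    return finish

cores = [Core(0), Core(1)]
for end in range(INTERVAL, 2 * INTERVAL + 1, INTERVAL):
    finish = weave_phase(bound_phase(cores, end))  # bound: paths; weave: timings
print(finish)  # {0: 1910, 1: 1920}
```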

Bound-Weave Take-Aways

Minimal synchronization: the bound phase uses unordered accesses (like lax synchronization), and the weave phase only synchronizes on actual dependencies, with no ordering violations. Bound-weave works with standard event-driven models: for example, it took 110 lines to integrate with DRAMSim2. See the paper for details!

Multithreaded Accuracy

23 apps from PARSEC, SPLASH-2, SPEC OMP2001, and STREAM: 11.2% average performance error (not IPC), 10/23 within 10%, with differences similar to the single-core results. For scalability and contention model validation, see the paper.

1024-Core Performance

Host: 2-socket Sandy Bridge @ 2.6 GHz (16 cores, 32 threads); results for the 14/23 parallel apps that scale. The least detailed model achieves 200 MIPS (harmonic mean) and the most detailed 41 MIPS, only ~5x between them, and ~100-1000x faster than current simulators.

Bound-Weave Scalability

10.1-13.6x speedup at 16 host cores.

Outline

- Introduction
- Detailed DBT-accelerated core models
- Bound-weave parallelization
- Lightweight user-level virtualization

Lightweight User-Level Virtualization

There are no 1K-core OSs and no parallel full-system DBT, so ZSim has to be user-level for now. Problem: user-level simulators are limited to simple workloads. Lightweight user-level virtualization bridges the gap with full-system simulation, simulating accurately as long as the time spent in the OS is minimal.

Lightweight User-Level Virtualization (continued)

Features: multiprocess workloads, a scheduler (for threads > cores), time virtualization, and system virtualization. See the paper for simulator-OS deadlock avoidance, signals, ISA extensions, and fast-forwarding.
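Time virtualization, for instance, means intercepting the workload's time reads and answering with simulated time rather than host time. A minimal sketch of the idea (hypothetical names, not zsim's actual interface):

```python
class VirtualClock:
    """Maps simulated core cycles to the timestamps the workload sees."""
    def __init__(self, freq_hz, boot_ns=0):
        self.freq_hz, self.boot_ns = freq_hz, boot_ns
        self.cycles = 0                # advanced by the timing model

    def tick(self, cycles):
        self.cycles += cycles

    def rdtsc(self):
        # An intercepted rdtsc returns simulated cycles, not the host TSC.
        return self.cycles

    def gettimeofday_ns(self):
        # Wall-clock answers are derived from simulated time, so a workload
        # that times itself observes target speed, not host speed.
        return self.boot_ns + int(self.cycles * 1e9 / self.freq_hz)

clock = VirtualClock(freq_hz=2_000_000_000)   # 2 GHz target
clock.tick(4_000_000)                          # timing model simulated 4M cycles
print(clock.rdtsc())             # 4000000
print(clock.gettimeofday_ns())   # 2000000 ns = 2 ms of simulated time
```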

ZSim Limitations

Not implemented yet: multithreaded cores, detailed NoC models, virtual memory (TLBs). Fundamentally hard: simulating speculation (e.g., transactional memory), fine-grained message-passing across the whole chip, and kernel-intensive applications.

Conclusions

Three techniques make 1K-core simulation practical:
- DBT-accelerated models: 10-100x faster core models
- Bound-weave parallelization: ~10-15x speedup from parallelization with minimal accuracy loss
- Lightweight user-level virtualization: complex workloads without full-system support

ZSim achieves high performance and accuracy: it simulates 1024-core systems at 10s-1000s of MIPS and is validated against a real Westmere system with ~10% average error. Source code available soon at zsim.csail.mit.edu