Performance and Power Impact of Issue-Width in Chip-Multiprocessor Cores


Performance and Power Impact of Issue-Width in Chip-Multiprocessor Cores. Magnus Ekman and Per Stenstrom, Department of Computer Engineering, Chalmers University of Technology.

Outline
- Problem statement
- Assumptions and studied system
- Methodology
- Results
- Conclusion

Problem
What is the best trade-off between the number of cores and their complexity in a CMP? The design space is wide, ranging from a few very complex superscalar cores to many very simple single-issue cores.

Assumptions
- Chip-area requirements are constant in all designs
- Clock frequency is constant in all designs
- Parallel applications

Assumptions & Disclaimers
- Chip-area requirements are constant in all designs (but only very rough area estimates are used)
- Clock frequency is constant in all designs (a faster clock for the simpler designs would perhaps be more realistic)
- Parallel applications (the world is not entirely parallel)

Four basic systems studied
- 2 cores, 8-issue
- 4 cores, 4-issue
- 8 cores, dual-issue
- 16 cores, single-issue

Things that we study
- Total execution time of the same task on all systems: how do applications exploit ILP vs. TLP? (see the model below)
- Power consumption of the different systems: gives hints about hot spots in the designs
- Total energy consumption of executing the same task on all systems: how efficient is each system?
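A first-order model (our addition, not from the slides) makes the ILP-vs.-TLP trade-off explicit. With the clock frequency f held constant across all designs, the execution time of a perfectly parallel task of N dynamic instructions is roughly

T \approx \frac{N}{n_{\text{cores}} \cdot \mathrm{IPC}_{\text{core}} \cdot f}

so halving per-core IPC only pays off if the constant area budget buys more than twice as many cores, which is exactly the question the results below address.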

Simulation methodology (complexity effective?)
Multiprocessor version of SimWattch [1]. SimWattch is based on Simics [2] and Wattch [3] (which is based on SimpleScalar [4] and Cacti [5]).
[1] SimWattch: 2003 IEEE International Symposium on Performance Analysis of Systems and Software
[2] www.simics.net
[3] Wattch: ISCA 2000
[4] www.simplescalar.org
[5] research.compaq.com/wrl/people/jouppi/cacti.html

How it works
- Simics generates traces dynamically
- Traces are fed into the detailed processor simulators, which tell Simics whether they can handle more instructions or whether they should stall
- Activity counters are used to obtain an estimate of energy consumption
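The loop below is a minimal C sketch of this coupling, under our own simplifying assumptions: all types, constants, and energy numbers are invented for illustration and are not the real Simics, Wattch, or SimWattch interfaces. A functional front end produces trace entries only while the detailed core model signals that it can accept more, and activity counters are converted into an energy estimate at the end.

/* Hypothetical trace-coupling sketch; not the real Simics/Wattch API. */
#include <stdio.h>

#define ISSUE_WIDTH 4
#define TRACE_LEN   1000

typedef struct {
    unsigned long issued_ops;     /* activity counter: issue slots used */
    unsigned long cache_accesses; /* activity counter: L1 accesses      */
} activity_t;

/* Per-event energy costs would come from Wattch/Cacti; these are
 * placeholder numbers. */
static const double E_ISSUE = 0.4; /* nJ per issued instruction (made up) */
static const double E_CACHE = 0.8; /* nJ per cache access (made up)       */

/* Detailed core model: accepts at most ISSUE_WIDTH instructions per
 * cycle and updates its activity counters; returning fewer than
 * 'offered' effectively stalls the trace producer. */
static int core_consume(int offered, activity_t *act) {
    int accepted = offered < ISSUE_WIDTH ? offered : ISSUE_WIDTH;
    act->issued_ops += accepted;
    act->cache_accesses += accepted / 3; /* crude load/store fraction */
    return accepted;
}

int main(void) {
    activity_t act = {0};
    int pending = 0, fetched = 0;
    unsigned long cycle = 0;

    while (fetched < TRACE_LEN || pending > 0) {
        /* Functional simulator (the Simics role): generate new trace
         * entries only while the core signals it can take more. */
        while (pending < 2 * ISSUE_WIDTH && fetched < TRACE_LEN) {
            pending++; /* stands in for one decoded trace entry */
            fetched++;
        }
        pending -= core_consume(pending, &act);
        cycle++;
    }

    double energy = act.issued_ops * E_ISSUE + act.cache_accesses * E_CACHE;
    printf("cycles=%lu issued=%lu energy=%.1f nJ (illustrative)\n",
           cycle, act.issued_ops, energy);
    return 0;
}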

Simulation parameters (all systems)
- SimpleScalar pipeline
- Snoop-based MOESI protocol (snoop-side transitions sketched below)
- Shared bus, with contention modeled
- Shared L2 cache: 2M, 8-way
- L1 latency: 1 cycle
- L2 latency: 12 cycles + bus arbitration
- Memory latency: 128 cycles
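Since the coherence protocol is only named on the slide, here is a minimal sketch of the snoop-side MOESI transitions (our own simplification; a real controller also handles data supply, write-backs, and upgrades):

/* Snoop-side MOESI transitions for one cache line (illustrative). */
#include <stdio.h>

typedef enum { I, S, E, O, M } moesi_t;
typedef enum { BUS_RD, BUS_RDX } bus_op_t;

/* State change of a cache that observes another core's bus request. */
static moesi_t snoop(moesi_t st, bus_op_t op) {
    if (op == BUS_RDX)      /* remote write intent: all others invalidate */
        return I;
    switch (st) {           /* remote read (BUS_RD) */
    case M: return O;       /* supply dirty data, keep ownership */
    case E: return S;       /* another sharer now exists */
    default: return st;     /* O, S, I unchanged */
    }
}

int main(void) {
    const char *name = "ISEOM";  /* index by enum value */
    printf("M --BusRd--> %c\n", name[snoop(M, BUS_RD)]);
    printf("E --BusRd--> %c\n", name[snoop(E, BUS_RD)]);
    printf("S --BusRdX-> %c\n", name[snoop(S, BUS_RDX)]);
    return 0;
}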

Simulation parameters (8-issue core)
- Issue width: 8
- Window and ROB size: 128
- Load/store queue: 64
- G-share branch predictor: 16K entries
- Branch target buffer: 4K entries
- Return address stack: 8 entries
- L1 I-cache: 64K, 2-way
- L1 D-cache: 64K, 4-way

Scaling methodology
Everything except the return address stack is scaled linearly with issue width. This tends to favor the systems with many cores.
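As an illustration of this rule (our reconstruction: the slides give only the 8-issue baseline and the statement that everything but the RAS scales linearly), the snippet below derives the per-core resources for each of the four configurations:

/* Linear scaling of the 8-issue baseline down to narrower cores. */
#include <stdio.h>

int main(void) {
    const int widths[] = {8, 4, 2, 1};
    printf("%-6s %-8s %-6s %-8s %-4s\n",
           "issue", "ROB", "LSQ", "BP(K)", "RAS");
    for (int i = 0; i < 4; i++) {
        int f = 8 / widths[i];      /* scale-down factor vs. baseline */
        printf("%-6d %-8d %-6d %-8d %-4d\n",
               widths[i],
               128 / f,             /* window/ROB entries       */
               64 / f,              /* load/store queue entries */
               16 / f,              /* G-share BP entries (K)   */
               8);                  /* RAS: deliberately not scaled */
    }
    return 0;
}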

Benchmarks
Parallel applications from Splash-2:
- Cholesky
- Raytrace
- FFT
- Radix
- Water-sp

Execution time

Instructions per cycle

Executed instructions (figure comparing the 1-IPC system against the baseline system)

IPC with perfect memory

Execution time with longer memory latency (3x)
Increase in execution time:
- Cholesky: 114%
- Radix: 112%
- FFT: 103%
- Water-sp: 61%
- Raytrace: 94%

Power consumption (figures for Radix, FFT, and Water-sp)

Energy consumption (figures for Radix, FFT, and Water-sp)

Conclusions
- Four 4-issue cores seem to yield almost as good performance as more cores for these multi-threaded applications.
- Considering power and energy, four or eight cores seem beneficial.
- Choose four cores in order to also achieve good single-thread performance!