CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Keckler, D. Burger. Presented by: Ning Chen


Transistor Changes Development of silicon fabrication technology caused transistor sizes to decrease. Benefits: provides area for more complex microarchitectures; reduces transistor switching time. Impact: shrinking wires have larger resistance per unit length, while wire capacitance does not decrease proportionally, so wire delay improves much more slowly than gate delay.

Delay of a Wire
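The slide's figure was lost in transcription, but the model it illustrates is standard: an unbuffered wire behaves as a distributed RC line whose Elmore delay grows quadratically with length. A minimal sketch (the r and c values are illustrative assumptions, not the paper's technology parameters):

```python
# Elmore delay of an unbuffered distributed RC wire: delay ~ 0.5 * r * c * L^2.
# r and c are per-millimeter resistance and capacitance; values are illustrative.

def wire_delay_ps(length_mm, r_ohm_per_mm=100.0, c_farad_per_mm=200e-15):
    """Return the wire delay in picoseconds."""
    delay_s = 0.5 * r_ohm_per_mm * c_farad_per_mm * length_mm ** 2
    return delay_s * 1e12

# Doubling the length quadruples the delay, which is why long cross-chip
# wires stop fitting in one cycle as clocks speed up:
print(wire_delay_ps(1.0), wire_delay_ps(2.0))  # 10.0 40.0
```

The quadratic term is what repeater insertion attacks in practice; without repeaters, a wire twice as long is four times as slow.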

Clock Scaling With increasing clock rates, the distance that a signal can travel in a single clock cycle decreases. The absolute # of bits that can be reached in a single clock cycle increases

Clock Scaling With increasing clock rates, the fraction of the total chip area that can be reached in a single clock cycle decreases.
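A back-of-the-envelope model of single-cycle reach: the distance a signal covers in one cycle scales as 1/f, so the reachable area scales as 1/f². The die size and effective signal velocity below are assumptions for illustration, not the paper's numbers.

```python
# Toy model: die area reachable in one cycle shrinks as 1/f^2 once wires,
# not gates, set the limit. Signal velocity and die size are assumptions.

def reachable_fraction(clock_ghz, die_side_mm=15.0, signal_mm_per_ns=3.0):
    cycle_ns = 1.0 / clock_ghz
    reach_mm = min(signal_mm_per_ns * cycle_ns, die_side_mm)
    return (reach_mm / die_side_mm) ** 2

# Quadrupling the clock rate cuts the reachable fraction by 16x:
print(round(reachable_fraction(1.0), 4), round(reachable_fraction(4.0), 4))
```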

Conclusion (1) Due to increasing clock frequencies, wire delays are increasing at a high rate. Chip performance will no longer be determined solely by the # of transistors, but will depend on the amount of state and logic that can be reached in a sufficiently small # of clock cycles. With future wire delays, structure size will be limited and the time to bypass results between pipeline stages will grow.

Access Time Factors affecting memory structure access time: cache capacity, block size, associativity, number of ports, and process technology.
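To make the capacity and port factors concrete, here is a toy access-time sketch. This is not the paper's CACTI data; the constants and the square-root shape are illustrative assumptions standing in for wordline/bitline wire delay growing with array size.

```python
import math

# Toy access-time sketch (NOT the paper's CACTI numbers): a fixed
# decode/sense overhead plus array wire delay that grows roughly with the
# square root of capacity; extra ports enlarge each cell and slow the array.

def access_time_ns(capacity_kb, ports=1, base_ns=0.2, k_ns=0.05):
    return base_ns + k_ns * ports * math.sqrt(capacity_kb)

for kb in (8, 32, 128):
    print(kb, round(access_time_ns(kb), 3))
```

Even this crude shape captures the slide's point: bigger or more heavily ported structures take longer to access, so at a fixed clock they need more cycles.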

Access Time and Capacity

Access Time and Instruction Window

Conclusion (2) Accessing caches, register files, branch prediction tables, and instruction windows in a single cycle will require the capacity of these structures to decrease as clock rates increase; otherwise, the # of cycles needed to access these structures will grow.

Performance Analysis Approaches Capacity scaling: shrink the microarchitectural structures sufficiently so that their access penalties are constant across technologies, where access penalty is the access time for a structure measured in clock cycles. Pipeline scaling: hold the capacity of a structure constant and increase the pipeline depth as necessary to cover the increased latency across technologies.
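The two strategies can be contrasted with a toy model (all constants here are illustrative assumptions, not the paper's simulation results): capacity scaling shrinks the cache until it fits in one cycle and loses IPC to extra misses, while pipeline scaling keeps the cache and loses IPC to longer load latencies.

```python
import math

# Toy contrast of the two scaling strategies. Performance = clock * IPC.
BASE_IPC = 2.0

def access_ns(capacity_kb):            # sketchy cache access-time model
    return 0.1 * math.sqrt(capacity_kb)

def capacity_scaling_perf(clock_ghz):
    """Shrink the cache until it is reachable in a single cycle."""
    capacity_kb = (1.0 / clock_ghz / 0.1) ** 2        # invert access_ns
    # Smaller cache -> more misses -> lower IPC (log-shaped, illustrative).
    ipc = BASE_IPC * min(1.0, 0.5 + math.log2(max(capacity_kb, 1.0)) / 12)
    return clock_ghz * ipc

def pipeline_scaling_perf(clock_ghz, capacity_kb=64):
    """Keep the cache size fixed; pipeline the access over more cycles."""
    access_cycles = math.ceil(access_ns(capacity_kb) * clock_ghz)
    # Longer load latency -> more stall cycles -> lower IPC (illustrative).
    ipc = BASE_IPC / (1 + 0.2 * (access_cycles - 1))
    return clock_ghz * ipc

for f_ghz in (1, 2, 4, 8):
    print(f_ghz, round(capacity_scaling_perf(f_ghz), 2),
          round(pipeline_scaling_perf(f_ghz), 2))
```

Under either strategy, performance grows much more slowly than the clock rate, which is the qualitative shape behind the paper's "nearly identical" result.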

Capacity Scaling vs Pipeline Scaling Performance increases for different scaling strategies.

Conclusion (3) The overall performance for both clock targets under both scaling methodologies is nearly identical. The maximal performance increase is a factor of 7.4, which corresponds to a 12.5% annual improvement over that 17-year span.
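The slide's arithmetic checks out: compounding 12.5% per year for 17 years gives roughly the quoted factor of 7.4.

```python
# Checking the slide's arithmetic: a 7.4x total speedup spread over 17
# years corresponds to 7.4 ** (1/17) ~ 1.125, i.e. about 12.5% per year.
annual_factor = 7.4 ** (1 / 17)
print(round((annual_factor - 1) * 100, 1))  # 12.5
```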

Conclusion (4) No scaling strategy permits annual performance improvements of better than 12.5%, which is far worse than the annual 50-60% to which we have grown accustomed.

CSE 548 Optimizing Pipelines for Power and Performance Presented by: Anna Cavender

Terminology Latch = a logic circuit. Inputs: data and clock; output: data. It transfers data to the output when signaled by the clock. Clock gating = logic units that aren't being used don't need to receive signals from the clock; power can be saved by sending a clock signal only when necessary.
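The clock-gating idea can be sketched in a few lines. The unit names and the one-unit-one-energy cost model below are made up for illustration; the point is only that gating charges for active units instead of all units.

```python
# Toy clock-gating sketch: clock power is spent only on units active in a
# given cycle. The unit names and per-unit cost here are made up.

ALL_UNITS = ("fetch", "decode", "alu", "fpu", "lsu")
CLOCK_POWER_PER_UNIT = 1.0   # arbitrary energy units per clocked unit/cycle

def clock_energy(active_per_cycle, gated):
    total = 0.0
    for active in active_per_cycle:
        clocked = active if gated else ALL_UNITS
        total += CLOCK_POWER_PER_UNIT * len(clocked)
    return total

trace = [("fetch", "decode", "alu"), ("fetch", "decode", "lsu")]
print(clock_energy(trace, gated=False), clock_energy(trace, gated=True))  # 10.0 6.0
```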

Analytical Model: an English Translation Basic idea: model the throughput of the machine in terms of the pipeline stages. Throughput is related to the number of stalls that occur in the pipelines. Stalls due to data dependences: the load/store pipe is split into two, one for cache hits and the other for cache misses. Stalls due to instruction fetch delay: modeled from pipeline utilization and the total time per stage of the pipe.
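A toy model in the spirit of the slide shows why an optimal depth exists at all: splitting the logic over more stages shortens the cycle, but each stage adds latch overhead and makes every stall cost more cycles. The constants below are illustrative, not the paper's fitted values.

```python
# Toy throughput model (illustrative constants, not the paper's exact
# equations). Total logic is split across `depth` stages; each stage pays
# a latch overhead, and hazard stalls cost more cycles as depth grows.

TOTAL_LOGIC_FO4 = 180    # total logic depth, in fan-out-of-4 (FO4) delays
LATCH_FO4 = 3            # latch insertion overhead per stage
STALLS_PER_INSTR = 0.15  # stall events per instruction (deps, misses)

def time_per_instruction(depth):
    cycle_fo4 = TOTAL_LOGIC_FO4 / depth + LATCH_FO4
    cycles_per_instr = 1 + STALLS_PER_INSTR * depth  # stalls scale w/ depth
    return cycle_fo4 * cycles_per_instr              # FO4 delays per instr

best_depth = min(range(2, 60), key=time_per_instruction)
print(best_depth, round(TOTAL_LOGIC_FO4 / best_depth + LATCH_FO4, 1))
```

In this sketch the performance-only optimum lands at a fairly deep pipeline with a short cycle in FO4 terms; the power discussion later in the deck is what pushes the optimum shallower.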

Performance and Power Methodology Two types of power: Dynamic power: hold power = power consumed when no switching is occurring; switching power = power from switching logic and data. Leakage power. Increased pipeline depth = increased power usage.

Results from Simulator SPEC2000 = standard benchmarks: the optimal pipeline depth changes based on which metric you choose. TPC-C = transaction-processing benchmarks: the optimal pipeline depth shifts because BIPS decreases more dramatically, so the ratio peaks sooner.
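The metric-dependent optimum can be illustrated with a toy power-performance model (illustrative constants, not the paper's simulator data): performance peaks at some depth, but power keeps growing with depth because every added stage adds latches, so a power-aware metric such as BIPS³/W peaks at a shallower pipeline.

```python
# Toy power-performance model (illustrative constants, not simulator data).

def bips(depth):                        # relative performance
    cycle_fo4 = 180.0 / depth + 3.0     # cycle time incl. latch overhead
    cpi = 1 + 0.15 * depth              # stalls grow with depth
    return 1.0 / (cycle_fo4 * cpi)

def watts(depth):                       # relative power (latch + leakage)
    return 1.0 + 0.08 * depth

perf_opt = max(range(2, 60), key=bips)
power_perf_opt = max(range(2, 60), key=lambda d: bips(d) ** 3 / watts(d))
print(perf_opt, power_perf_opt)         # the power-aware optimum is shallower
```

With these made-up constants the power-aware optimum has a noticeably longer stage in FO4 terms than the performance-only one, qualitatively matching the deck's conclusion that the power-performance-optimal pipeline is shallower.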

Sensitivity Analysis Finding the optimal pipeline depth is sensitive to many parameters. Latch growth factor: the number of latches increases with pipeline depth, since latches are added to break the logic into more stages; determined by the logic shape; favors a shallower pipeline. The graph shows 4 different growth factors, going from logic that is easier to pull apart to logic that is harder. They all result in the same estimate for the optimal pipeline depth, but they show there is a range where we can increase performance without losing too much power.

Sensitivity Analysis Latch power ratio: the ratio of hold power to total power; favors a deeper pipeline. Latch insertion delay: more latches are needed for more pipe stages; favors a shallower pipeline for lower-power latches. Glitch factor: the difference in delay from latch output to gate; linearly dependent on the logic depth; favors a deeper pipeline. Leakage factor: favors a deeper pipeline.

Conclusion Modeled and simulated power-performance trade-offs. The optimal pipeline stage size is around 18 FO4, with a little wiggle room to achieve better performance at a small sacrifice in power usage. This optimum is shallower than it would be if performance were our only concern.

Questions What technology is being developed to make sure we keep getting really good performance (WaveScalar, and what else)? More local communication and optimized layout (e.g., circular, with shared units in the middle) could help; aren't there tools for such optimization? The clock is one cause of all this; is there any research on new asynchronous cores (in part, or completely)?