A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST

Size: px
Start display at page:

Download "A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST"

Transcription

1 A Cost Effective Spatial Redundancy with Data-Path Partitioning Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST 1

2 Outline Introduction Data-path Partitioning for a dependable processor Simple Multiplexing (DPSM) Compressed Multiplexing (DPCM) Evaluation Conclusions 2

3 Introduction Dependability of computer systems is one of the most important design constrains!! The field where high reliability is demanded Online transaction processing Artificial satellite Medical treatment Traffic control Electric money and private information 3

4 Fault Period of time Caused by Permanent fault long Internal factors Destruction of semiconductor junction Short circuit Disconnection Temporal fault very short External factors Unexpected temperature Vibration Alpha particle Cosmic ray 4

5 Conventional fault detection Temporal Redundancy Spatial Redundancy How to detect Execution Time Hardware Cost Detectable Fault Temporal Fault Temporal Fault Permanent Fault 5

6 Our Goal Execution Time Hardware Cost Detectable Fault Temporal Redundancy Temporal Fault Spatial Redundancy Temporal Fault Permanent Fault Supporting Spatial Redundancy without increasing the hardware cost!! 6

7 Approach: Data-Path Partitioning Data-Path Partitioning for a dependable processor: Data-path is partitioned some narrow-width datapath An instruction is executed in each partitioned datapath in parallel Comparing the execution results generated from the partitioned data-path Fault is detected!! 7

8 Approach: Data-Path Partitioning Implementation alternatives Simple Multiplexing (DPSM) Compressed Multiplexing (DPCM) Assumptions Baseline Processor has a 32-bit Data-Path Each 32-bit data execution can be completed in 1 clock cycle Degree of redundancy to be realized is two or four 32-bit data-path 1 (SR1) 32-bit data-path 16-bit data-path 2 (SR2) 8-bit data-path 4 (SR4) 8

9 Simple Multiplexing (DPSM) 32bit 1[SR1] 32bit 16bit 2 [SR2] 8bit 4 [SR4] 16bit 16bit 8bit 8bit 8bit 8bit 3CC Instruction1 6CC Instruction2 Instruction3 CC: Clock Cycle Increasing Execution time!! Detecting Permanent fault!! 12CC 9

10 Effective Bit-width The data which instruction uses Data-path bit-width *D. Brooks and M. Martonosi. Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance. HPCA, pp.13-22, Bit-width to be unnecessary for each instruction execution Bit-width to be required for each instruction execution Effective bit-width Many applications have small effective bit-width!! In SPECint95, 50% of instructions have both operands less than or equal to 16 bits* Media applications mainly deal with 8-bit pixel value 10

11 Compressed Multiplexing (DPCM) Effective bit-width Bit-width of the partitioned data-path 11

12 12 Coverage P C Instruction Memory Data Memory read address Sign Extend Register File read register1 read register2 write register data read data1 read data2 Instruction M U X M U X M U X ALU result Shift left read adress write data data Add Add result

13 Evaluation Evaluation Purpose Primary evaluation for the impact of Data-Path Partitioning on processor performance Experimental setup Simulator SimpleScalar(ver.3.0d) Instruction-level simulation Benchmark program SPEC2000 benchmark suite 164.gzip, 175.vpr, 176.gcc, 181.mcf 197.parser, 255.vortex, 256.bzip Input: small input data set Assumption Perfect cache 13

14 Execution time IC : Instruction Count CPI : Clock Cycle Per Instruction 1 CCT : Clock Cycle Time fixed value SR2:IC org + IC gt16b 1 SR4:IC org + IC gt8b 3 IC org :The number of instructions with 32-bit data-path IC gt16b : The number of instructions with Effective bit-width 16 IC gt8b : The number of instructions with Effective bit-width 8 14

15 Execution time overhead (DPCM) 4.0 SR2 SR Normalized execution time 164.gzip 175.vpr 176.gcc 181.mcf 197.parser 255.vortex 256.bzip benchmark

16 Breakdown for compressed instruction Branch load/store ALU Others Execution Frequency 18.23% 37.32% 41.59% 2.86% % of Compressible Instruction SR2 SR % 75.61% 0.00% 0.00% 52.62% 36.32% 0.00% 0.00% Base address is set to a large value!! Replacement of data by a compiler is necessary to reduce execution time increase 16

17 Conclusions This work Primary evaluation for the impact of data-path partitioning on processor performance Simple Multiplexing (DPSM) Compressed Multiplexing (DPCM) DPSM DPCM Normalized execution time SR SR Future work Establishing the complete microarchitecture to support the proposed idea 17

18 Thank you!! 18

Low-Complexity Reorder Buffer Architecture*

Low-Complexity Reorder Buffer Architecture* Low-Complexity Reorder Buffer Architecture* Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose Department of Computer Science State University of New York Binghamton, NY 13902-6000 http://www.cs.binghamton.edu/~lowpower

More information

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)

More information

Energy Efficient Asymmetrically Ported Register Files

Energy Efficient Asymmetrically Ported Register Files Energy Efficient Asymmetrically Ported Register Files Aneesh Aggarwal ECE Department University of Maryland College Park, MD 20742 aneesh@eng.umd.edu Manoj Franklin ECE Department and UMIACS University

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Design of Experiments - Terminology

Design of Experiments - Terminology Design of Experiments - Terminology Response variable Measured output value E.g. total execution time Factors Input variables that can be changed E.g. cache size, clock rate, bytes transmitted Levels Specific

More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

POSH: A TLS Compiler that Exploits Program Structure

POSH: A TLS Compiler that Exploits Program Structure POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami Dept. of Informatics, Kyushu

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency

Koji Inoue Department of Informatics, Kyushu University Japan Science and Technology Agency Lock and Unlock: A Data Management Algorithm for A Security-Aware Cache Department of Informatics, Japan Science and Technology Agency ICECS'06 1 Background (1/2) Trusted Program Malicious Program Branch

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Ultra Low-Cost Defect Protection for Microprocessor Pipelines

Ultra Low-Cost Defect Protection for Microprocessor Pipelines Ultra Low-Cost Defect Protection for Microprocessor Pipelines Smitha Shyam Kypros Constantinides Sujay Phadke Valeria Bertacco Todd Austin Advanced Computer Architecture Lab University of Michigan Key

More information

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching

Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics

Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics Automatic Selection of Compiler Options Using Non-parametric Inferential Statistics Masayo Haneda Peter M.W. Knijnenburg Harry A.G. Wijshoff LIACS, Leiden University Motivation An optimal compiler optimization

More information

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2

José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 CHERRY: CHECKPOINTED EARLY RESOURCE RECYCLING José F. Martínez 1, Jose Renau 2 Michael C. Huang 3, Milos Prvulovic 2, and Josep Torrellas 2 1 2 3 MOTIVATION Problem: Limited processor resources Goal: More

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh Department of Computer Science and Information Engineering Chang Gung University Tao-Yuan, Taiwan Hsin-Dar Chen

More information

Course web site: teaching/courses/car. Piazza discussion forum:

Course web site:   teaching/courses/car. Piazza discussion forum: Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start

More information

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache

More information

cs470 - Computer Architecture 1 Spring 2002 Final Exam open books, open notes

cs470 - Computer Architecture 1 Spring 2002 Final Exam open books, open notes 1 of 7 ay 13, 2002 v2 Spring 2002 Final Exam open books, open notes Starts: 7:30 pm Ends: 9:30 pm Name: (please print) ID: Problem ax points Your mark Comments 1 10 5+5 2 40 10+5+5+10+10 3 15 5+10 4 10

More information

Value Compression for Efficient Computation

Value Compression for Efficient Computation Value Compression for Efficient Computation Ramon Canal 1, Antonio González 12 and James E. Smith 3 1 Dept of Computer Architecture, Universitat Politècnica de Catalunya Cr. Jordi Girona, 1-3, 08034 Barcelona,

More information

Dependability, Power, and Performance Trade-off on a Multicore Processor

Dependability, Power, and Performance Trade-off on a Multicore Processor Dependability, Power, and Performance Trade-off on a Multi Processor Toshinori Sato System LSI Research Center Kyushu University toshinori.sato@computer.org Abstract - As deep submicron technologies are

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example 1 Which is the best? 2 Lecture 05 Performance Metrics and Benchmarking 3 Measuring & Improving Performance (if planes were computers...) Plane People Range (miles) Speed (mph) Avg. Cost (millions) Passenger*Miles

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

REGISTER FILE ACCESS REDUCTION BY DATA REUSE

REGISTER FILE ACCESS REDUCTION BY DATA REUSE REGISTER FILE ACCESS REDUCTION BY DATA REUSE Hiroshi Takamura Koji Inoue Vasily G. Moshnyaga Dept. of Electronics Engineering and Computer Science Fukuoka University, Japan 1 Overview of the talk Motivation

More information

Speculative Multithreaded Processors

Speculative Multithreaded Processors Guri Sohi and Amir Roth Computer Sciences Department University of Wisconsin-Madison utline Trends and their implications Workloads for future processors Program parallelization and speculative threads

More information

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh * and Hsin-Dar Chen Department of Computer Science and Information Engineering Chang Gung University, Taiwan

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor

Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor Decoupling Dynamic Information Flow Tracking with a Dedicated Coprocessor Hari Kannan, Michael Dalton, Christos Kozyrakis Computer Systems Laboratory Stanford University Motivation Dynamic analysis help

More information

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies Administrivia CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) HW #3, on memory hierarchy, due Tuesday Continue reading Chapter 3 of H&P Alan Sussman als@cs.umd.edu

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

The Processor: Datapath and Control. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University The Processor: Datapath and Control Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction CPU performance factors Instruction count Determined

More information

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J Lilja lilja@eceumnedu Acknowledgements! Graduate students

More information

Exploiting Idle Floating-Point Resources for Integer Execution

Exploiting Idle Floating-Point Resources for Integer Execution Exploiting Idle Floating-Point Resources for Integer Execution, Subbarao Palacharla, James E. Smith University of Wisconsin, Madison Motivation Partitioned integer and floating-point resources on current

More information

Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation

Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation Houman Homayoun + houman@houman-homayoun.com ABSTRACT We study lazy instructions. We define lazy instructions as those spending

More information

Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core

Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core 3 rd HiPEAC Workshop IBM, Haifa 17-4-2007 Initial Results on the Performance Implications of Thread Migration on a Chip Multi-Core, P. Michaud, L. He, D. Fetis, C. Ioannou, P. Charalambous and A. Seznec

More information

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department of Computer Science State University of New York

More information

APPENDIX Summary of Benchmarks

APPENDIX Summary of Benchmarks 158 APPENDIX Summary of Benchmarks The experimental results presented throughout this thesis use programs from four benchmark suites: Cyclone benchmarks (available from [Cyc]): programs used to evaluate

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Dynamic Scheduling with Narrow Operand Values

Dynamic Scheduling with Narrow Operand Values Dynamic Scheduling with Narrow Operand Values Erika Gunadi University of Wisconsin-Madison Department of Electrical and Computer Engineering 1415 Engineering Drive Madison, WI 53706 Submitted in partial

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Suresh Kumar, Vishal Gupta *, Vivek Kumar Tamta Department of Computer Science, G. B. Pant Engineering College, Pauri, Uttarakhand,

More information

Simple and Efficient Construction of Static Single Assignment Form

Simple and Efficient Construction of Static Single Assignment Form Simple and Efficient Construction of Static Single Assignment Form saarland university Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon and Andreas Zwinkau computer science

More information

Area-Efficient Error Protection for Caches

Area-Efficient Error Protection for Caches Area-Efficient Error Protection for Caches Soontae Kim Department of Computer Science and Engineering University of South Florida, FL 33620 sookim@cse.usf.edu Abstract Due to increasing concern about various

More information

University of Toronto Faculty of Applied Science and Engineering

University of Toronto Faculty of Applied Science and Engineering Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science

More information

Introduction. Datapath Basics

Introduction. Datapath Basics Introduction CPU performance factors - Instruction count; determined by ISA and compiler - CPI and Cycle time; determined by CPU hardware 1 We will examine a simplified MIPS implementation in this course

More information

{ssnavada, nkchoudh,

{ssnavada, nkchoudh, Criticality-driven Superscalar Design Space Exploration Sandeep Navada, Niket K. Choudhary, Eric Rotenberg Department of Electrical and Computer Engineering North Carolina State University {ssnavada, nkchoudh,

More information

Chapter 7-1. Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授. V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor)

Chapter 7-1. Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授. V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor) Chapter 7-1 Large and Fast: Exploiting Memory Hierarchy (part I: cache) 臺大電機系吳安宇教授 V1 11/24/2004 V2 12/01/2004 V3 12/08/2004 (minor) 臺大電機吳安宇教授 - 計算機結構 1 Outline 7.1 Introduction 7.2 The Basics of Caches

More information

L1 Data Cache Decomposition for Energy Efficiency

L1 Data Cache Decomposition for Energy Efficiency L1 Data Cache Decomposition for Energy Efficiency Michael Huang, Joe Renau, Seung-Moon Yoo, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/flexram Objective Reduce

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

3D Memory Architecture. Kyushu University

3D Memory Architecture. Kyushu University 3D Memory Architecture Koji Inoue Kyushu University 1 Outline Why 3D? Will 3D always work well? Support Adaptive Execution! Memory Hierarchy Run time Optimization Conclusions 2 Outline Why 3D? Will 3D

More information

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores

Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Demand-Only Broadcast: Reducing Register File and Bypass Power in Clustered Execution Cores Mary D. Brown Yale N. Patt Electrical and Computer Engineering The University of Texas at Austin {mbrown,patt}@ece.utexas.edu

More information

Reducing Reorder Buffer Complexity Through Selective Operand Caching

Reducing Reorder Buffer Complexity Through Selective Operand Caching Appears in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003 Reducing Reorder Buffer Complexity Through Selective Operand Caching Gurhan Kucuk Dmitry Ponomarev

More information

Spatial Memory Streaming (with rotated patterns)

Spatial Memory Streaming (with rotated patterns) Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Detecting Global Stride Locality in Value Streams

Detecting Global Stride Locality in Value Streams Detecting Global Stride Locality in Value Streams Huiyang Zhou, Jill Flanagan, Thomas M. Conte TINKER Research Group Department of Electrical & Computer Engineering North Carolina State University 1 Introduction

More information

Computer Performance Evaluation: Cycles Per Instruction (CPI)

Computer Performance Evaluation: Cycles Per Instruction (CPI) Computer Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate: where: Clock rate = 1 / clock cycle A computer machine

More information

Page 1. Structure of von Nuemann machine. Instruction Set - the type of Instructions

Page 1. Structure of von Nuemann machine. Instruction Set - the type of Instructions Structure of von Nuemann machine Arithmetic and Logic Unit Input Output Equipment Main Memory Program Control Unit 1 1 Instruction Set - the type of Instructions Arithmetic + Logical (ADD, SUB, MULT, DIV,

More information

MCD: A Multiple Clock Domain Microarchitecture

MCD: A Multiple Clock Domain Microarchitecture MCD: A Multiple Clock Domain Microarchitecture Dave Albonesi in collaboration with Greg Semeraro Grigoris Magklis Rajeev Balasubramonian Steve Dropsho Sandhya Dwarkadas Michael Scott Caveats We started

More information

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors

Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Banked Multiported Register Files for High-Frequency Superscalar Microprocessors Jessica H. T seng and Krste Asanoviü MIT Laboratory for Computer Science, Cambridge, MA 02139, USA ISCA2003 1 Motivation

More information

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 10: Runahead and MLP. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 10: Runahead and MLP Prof. Onur Mutlu Carnegie Mellon University Last Time Issues in Out-of-order execution Buffer decoupling Register alias tables Physical

More information

Data-Triggered Threads: Eliminating Redundant Computation

Data-Triggered Threads: Eliminating Redundant Computation In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA 2011) Data-Triggered Threads: Eliminating Redundant Computation Hung-Wei Tseng and Dean M. Tullsen Department

More information

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University The Processor (1) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Cache Pipelining with Partial Operand Knowledge

Cache Pipelining with Partial Operand Knowledge Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin - Madison {egunadi,mikko}@ece.wisc.edu Abstract

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533 Lecture 2: Computer Performance Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533 Performance and Cost Purchasing perspective given a collection of machines, which has the - best performance?

More information

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002, Chapter 1 Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Course Goals Introduce you to design principles, analysis techniques and design options in computer architecture

More information

Inf2C - Computer Systems Lecture Processor Design Single Cycle

Inf2C - Computer Systems Lecture Processor Design Single Cycle Inf2C - Computer Systems Lecture 10-11 Processor Design Single Cycle Boris Grot School of Informatics University of Edinburgh Previous lectures Combinational circuits Combinations of gates (INV, AND, OR,

More information

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Jian Chen, Nidhi Nayyar and Lizy K. John Department of Electrical and Computer Engineering The

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Number Representation 09212011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Logic Circuits for Register Transfer

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Select-Free Instruction Scheduling Logic

Select-Free Instruction Scheduling Logic Select-Free Instruction Scheduling Logic Mary D. Brown y Jared Stark z Yale N. Patt y Dept. of Electrical and Computer Engineering y The University of Texas at Austin fmbrown,pattg@ece.utexas.edu Microprocessor

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

Breaking Cyclic-Multithreading Parallelization with XML Parsing. Simone Campanoni, Svilen Kanev, Kevin Brownell Gu-Yeon Wei, David Brooks

Breaking Cyclic-Multithreading Parallelization with XML Parsing. Simone Campanoni, Svilen Kanev, Kevin Brownell Gu-Yeon Wei, David Brooks Breaking Cyclic-Multithreading Parallelization with XML Parsing Simone Campanoni, Svilen Kanev, Kevin Brownell Gu-Yeon Wei, David Brooks 0 / 21 Scope Today s commodity platforms include multiple cores

More information

Chapter 4. The Processor Designing the datapath

Chapter 4. The Processor Designing the datapath Chapter 4 The Processor Designing the datapath Introduction CPU performance determined by Instruction Count Clock Cycles per Instruction (CPI) and Cycle time Determined by Instruction Set Architecure (ISA)

More information

Introduction. Chapter 4. Instruction Execution. CPU Overview. University of the District of Columbia 30 September, Chapter 4 The Processor 1

Introduction. Chapter 4. Instruction Execution. CPU Overview. University of the District of Columbia 30 September, Chapter 4 The Processor 1 Chapter 4 The Processor Introduction CPU performance factors Instruction count etermined by IS and compiler CPI and Cycle time etermined by CPU hardware We will examine two MIPS implementations simplified

More information

Enhancing Energy Efficiency of Processor-Based Embedded Systems thorough Post-Fabrication ISA Extension

Enhancing Energy Efficiency of Processor-Based Embedded Systems thorough Post-Fabrication ISA Extension Enhancing Energy Efficiency of Processor-Based Embedded Systems thorough Post-Fabrication ISA Extension Hamid Noori, Farhad Mehdipour, Koji Inoue, and Kazuaki Murakami Institute of Systems, Information

More information

Energy-Effective Instruction Fetch Unit for Wide Issue Processors

Energy-Effective Instruction Fetch Unit for Wide Issue Processors Energy-Effective Instruction Fetch Unit for Wide Issue Processors Juan L. Aragón 1 and Alexander V. Veidenbaum 2 1 Dept. Ingen. y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain

More information

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras Performance, Cost and Amdahl s s Law Arquitectura de Computadoras Arturo Díaz D PérezP Centro de Investigación n y de Estudios Avanzados del IPN adiaz@cinvestav.mx Arquitectura de Computadoras Performance-

More information

Lecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4)

Lecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4) Lecture 2: Pipelining Basics Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4) 1 Defining Fault, Error, and Failure A fault produces a latent error; it becomes effective when

More information

GRE Architecture Session

GRE Architecture Session GRE Architecture Session Session 2: Saturday 23, 1995 Young H. Cho e-mail: youngc@cs.berkeley.edu www: http://http.cs.berkeley/~youngc Y. H. Cho Page 1 Review n Homework n Basic Gate Arithmetics n Bubble

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 41 Performance II CS61C L41 Performance II (1) Lecturer PSOE Dan Garcia www.cs.berkeley.edu/~ddgarcia UWB Ultra Wide Band! The FCC moved

More information

Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions

Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions 1 Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions Lieven Eeckhout John Cavazos ELIS Department, Ghent University, Belgium School of Informatics, University of Edinburgh,

More information

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design

TDT4255 Computer Design. Lecture 4. Magnus Jahre. TDT4255 Computer Design 1 TDT4255 Computer Design Lecture 4 Magnus Jahre 2 Outline Chapter 4.1 to 4.4 A Multi-cycle Processor Appendix D 3 Chapter 4 The Processor Acknowledgement: Slides are adapted from Morgan Kaufmann companion

More information

ShortCut: Architectural Support for Fast Object Access in Scripting Languages

ShortCut: Architectural Support for Fast Object Access in Scripting Languages Jiho Choi, Thomas Shull, Maria J. Garzaran, and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu ISCA 2017 Overheads of Scripting Languages

More information

Exploring Wakeup-Free Instruction Scheduling

Exploring Wakeup-Free Instruction Scheduling Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University Outline Motivation Case study: Cyclone Towards high-performance

More information

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection

More information

Efficient Program Power Behavior Characterization

Efficient Program Power Behavior Characterization Efficient Program Power Behavior Characterization Chunling Hu Daniel A. Jiménez Ulrich Kremer Department of Computer Science {chunling, djimenez, uli}@cs.rutgers.edu Rutgers University, Piscataway, NJ

More information

Designing and Optimizing the Fetch Unit for a RISC Core

Designing and Optimizing the Fetch Unit for a RISC Core Journal of Computer and Robotics (200) 3-25 3 Designing and Optimizing the Fetch Unit for a RISC Core Mojtaba Shojaei, Bahman Javadi *, Mohammad Kazem Akbari, Farnaz Irannejad Computer Engineering and

More information

Copyright by Mary Douglass Brown 2005

Copyright by Mary Douglass Brown 2005 Copyright by Mary Douglass Brown 2005 The Dissertation Committee for Mary Douglass Brown certifies that this is the approved version of the following dissertation: Reducing Critical Path Execution Time

More information

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections )

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections ) Lecture 23: Storage Systems Topics: disk access, bus design, evaluation metrics, RAID (Sections 7.1-7.9) 1 Role of I/O Activities external to the CPU are typically orders of magnitude slower Example: while

More information

CS/COE1541: Introduction to Computer Architecture

CS/COE1541: Introduction to Computer Architecture CS/COE1541: Introduction to Computer Architecture Dept. of Computer Science University of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/1541p/index.html 1 Computer Architecture? Application pull Operating

More information

Processor (I) - datapath & control. Hwansoo Han

Processor (I) - datapath & control. Hwansoo Han Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two

More information

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations Chapter 4 The Processor Part I Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations

More information