Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks


1 Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks
Ajay Joshi (University of Texas), Lieven Eeckhout (Ghent University, Belgium), Robert H. Bell Jr. (IBM Corp.), Lizy John (University of Texas)
IEEE International Symposium on Workload Characterization, October 26, 2006

2 Outline
- Background and Motivation
- Performance Cloning Central Idea
- Performance Cloning Framework
- Workload Profiling
- Algorithm for Clone Generation
- Analysis and Results
- Summary

3 Benchmark Spectrum
- Benchmark categories shown:
  - Toy Benchmarks, e.g. Hanoi, Heapsort
  - Microbenchmarks, e.g. STREAM
  - Kernel Codes, e.g. Livermore Loops
  - Application Suites, e.g. SPEC CPU
  - Synthetic Benchmarks, e.g. Dhrystone, Whetstone
  - Complete Application Code
- One end of the spectrum: less development effort, more scalable, more maintainable, less representative
- Other end of the spectrum: more development effort, less scalable, less maintainable, more representative

4 Real World Applications as Benchmarks
- Increases confidence in making design tradeoffs
- Customize microprocessor design to specific applications
- Best way to understand processor's use
- Perhaps the only way to understand emerging workload characteristics
- Simplifies purchasing decisions for customers

5 Challenges With Using Real World Applications
- Real world applications tend to be proprietary
- Using real world applications for performance studies can be tedious
  - Difficult to duplicate user environment
  - Modifying application to research environment
  - Duplicating real input data set
- Real world workloads are a moving target

6 The Problem
- Need a methodology to create benchmarks that capture the main performance characteristics of real world applications
- Resulting benchmarks should hide functional meaning of code
- Ability to study what-if scenarios by varying program characteristics


8 Performance Cloning Central Idea
- Measure inherent workload characteristics of the real world application: instruction mix, basic block size, ILP, data locality, ...
- Generate a performance clone with similar characteristics
- Figure: the real world application's instruction stream (ADD R1, R2, R3; LD R4, R1, R6; MUL R3, R6, R7; ADD R3, R2, R5; DIV R10, R2, R1; SUB R3, R5, R6; STORE R3, R10, R20) next to a performance clone built from the same kinds of instructions and closed by a loop branch (BEQ R3, R6, LOOP)

9 Performance Cloning Framework
- Microarchitecture-independent workload profiling
  - Real world proprietary workload -> Workload Profiler (binary instrumentation OR simulation) -> Workload Profile
  - Workload Profile = workload attributes + distribution of attribute values
- Modeling workload attributes into synthetic workload
  - Workload Profile -> Workload Synthesizer -> Synthetic Benchmark Clone
- Experiment environment
  - Real hardware
  - Execution driven simulator


11 Microarchitecture-Independent Profile
- Control Flow Behavior
- Data Locality
- Control Flow Predictability
- Instruction Mix
- Instruction Level Parallelism

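12 Control Flow Behavior (1)
- Statistical Flow Graph (K=0) [Eeckhout et al., ISCA 2004]
- Figure: profiling the application binary yields a statistical flow graph whose nodes are basic blocks annotated with occurrence counts and whose edges carry transition probabilities (values shown: 40%, 60%, 100%, 33%, 67%)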
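A minimal sketch of how a K=0 statistical flow graph could be built from a basic-block trace. The trace, the block names, and the function name `build_sfg` are hypothetical and chosen only for illustration; the slides do not show the authors' profiler.

```python
from collections import Counter, defaultdict

def build_sfg(trace):
    """Build a K=0 statistical flow graph from a sequence of basic-block IDs.

    Returns (node_counts, edge_probs), where node_counts maps a block to its
    occurrence count and edge_probs[src][dst] is the fraction of times that
    block `dst` followed block `src` in the trace.
    """
    node_counts = Counter(trace)
    edge_counts = defaultdict(Counter)
    for src, dst in zip(trace, trace[1:]):
        edge_counts[src][dst] += 1
    edge_probs = {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in edge_counts.items()
    }
    return node_counts, edge_probs

# Hypothetical trace of basic blocks, not taken from the slides.
example_counts, example_probs = build_sfg(list("ABACABD"))
```

Keeping counts and probabilities separate mirrors the figure, where nodes carry occurrence counts and edges carry percentages.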

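13 Control Flow Behavior (2)
- Statistical Flow Graph (K=1) [Eeckhout et al., ISCA 2004]
- Figure: the same application binary profiled with K=1, where each node is a basic block together with its predecessor block; nodes are annotated with occurrence counts and edges with transition probabilities (values shown: 60%, 100%, 67%, 33%)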

14 Modeling Memory Data Access Pattern
- Identify streams of data references
- Stream? Sequence of memory addresses in an arithmetic progression
- Elements of arrays A, B, and C form 3 streams:
  for (ii = 0; ii < N; ii++) A[ii] = B[ii] + C[ii]
  - A: 200, 204, ...
  - B: 320, 324, ...
  - C: 404, 408, ...
  - Issuing sequence: 320, 404, 200, 324, 408, 204, ...
- Streams are interleaved and may contain noise: 4, 8, 12, 16, 1, 3, 20, 24, 5, 7, 2, 9, 11, 28
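As an illustration of the stream idea, the sketch below separates an interleaved address sequence into arithmetic progressions. It assumes a single known stride, which is a simplification made here for brevity; the function name `split_streams` is hypothetical and this is not the authors' extraction algorithm.

```python
def split_streams(addresses, stride):
    """Greedily assign each address to the stream it extends.

    An address extends a stream when it equals that stream's last address
    plus `stride`; otherwise it starts a new stream.
    """
    streams = []
    for addr in addresses:
        for stream in streams:
            if stream[-1] + stride == addr:
                stream.append(addr)
                break
        else:
            # No existing stream is continued by this address.
            streams.append([addr])
    return streams

# Issuing sequence from the slide, with 4-byte elements assumed.
issued = [320, 404, 200, 324, 408, 204]
streams = split_streams(issued, stride=4)
```

With a fixed stride this greedy pass also groups noise addresses that happen to be one stride apart, so a real profiler would need an additional filter such as a minimum stream length.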

15 Extracting Streams
- Reference pattern of static Load / Store instructions
- PC-correlated spatial locality
  - Dependence on address referenced by nearby Ld / St
  - Programs with pointer chasing codes
- PC-correlated temporal locality
  - Dependence on previous address generated by same Ld / St
  - Programs with multidimensional arrays
- Could static Load / Store instructions be natural sources of streams?
- Profile every static Load / Store instruction
  - Number of different strides with which it accesses data
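The per-instruction profiling described above can be sketched as follows: for every static load or store (identified by its PC), record the strides between its consecutive addresses. The input format of `(pc, address)` pairs, the PC values, and the helper names are assumptions made for this example.

```python
from collections import Counter, defaultdict

def stride_profile(refs):
    """Map each static load/store PC to a Counter of the strides it exhibits.

    `refs` is an iterable of (pc, address) pairs in program order.
    """
    last_addr = {}
    strides = defaultdict(Counter)
    for pc, addr in refs:
        if pc in last_addr:
            strides[pc][addr - last_addr[pc]] += 1
        last_addr[pc] = addr
    return strides

def dominant_stride(stride_counts):
    """Return the most frequently observed stride of one static instruction."""
    stride, _count = stride_counts.most_common(1)[0]
    return stride

# Hypothetical PCs for the two loads and one store of A[ii] = B[ii] + C[ii].
refs = [(0x10, 320), (0x14, 404), (0x18, 200),
        (0x10, 324), (0x14, 408), (0x18, 204)]
profile = stride_profile(refs)
```

The number of distinct keys in each Counter is the "number of different strides" for that instruction; a single key means the instruction behaves as one stream.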

16 Behavior of Static Load/Store Instructions
- Chart: percentage of dynamic memory references, plotted per benchmark for the MiBench and MediaBench programs
- As a first-order model, static load/stores can be modeled as a single stream

17 Modeling Control Flow Predictability
- Capture behavior of easy and difficult to predict branches
- Transition rate: an inherent program feature that captures branch behavior [Haungs et al., HPCA 00]
  - Transition rate = # of Taken-Not Taken transitions / # of times executed
- Branches with low transition rate (easier to predict): TTTTTTTTTN, NNNNNNNNNT
- Branches with high transition rate (easier to predict): TNTNTNTNTN
- Branches with moderate transition rate (tougher to predict)
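A small sketch of the transition rate computation, assuming that a "transition" is any change of direction between two consecutive executions of the same branch. The string encoding of outcomes ("T"/"N") follows the slide; the function name is hypothetical.

```python
def transition_rate(outcomes):
    """Fraction of executions at which the branch switched direction.

    `outcomes` is a string such as "TTTN" with one character per execution.
    """
    if not outcomes:
        return 0.0
    transitions = sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return transitions / len(outcomes)

low = transition_rate("TTTTTTTTTN")   # one switch in ten executions
high = transition_rate("TNTNTNTNTN")  # nine switches in ten executions
```

Both extremes are easy to predict because the next outcome is almost always either "same as last time" or "opposite of last time"; rates in between give a predictor less to work with.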

18 Modeling Instruction Level Parallelism
- Dependency distance, example:
  ADD R1, R3, R4
  MUL R5, R3, R2
  ADD R5, R3, R6
  LD R4, (R8)
  SUB R8, R2, R1
  - Read After Write dependency distance = 3
- Measure distribution of dependency distances
  - Upto 1, Upto 2, Upto 4, Upto 8, Upto 16, Upto 32, >32
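The sketch below measures read-after-write dependency distances and sorts them into the buckets listed on the slide. It assumes the destination register is the last operand (as in the Alpha ISA used later in the talk), under which the slide's example has a distance of 3 between the MUL that writes R2 and the SUB that reads it. The tuple encoding and function names are hypothetical.

```python
import bisect

BUCKET_LIMITS = [1, 2, 4, 8, 16, 32]
BUCKET_NAMES = ["Upto 1", "Upto 2", "Upto 4", "Upto 8", "Upto 16", "Upto 32", ">32"]

def raw_distances(instrs):
    """Return the RAW dependency distance of every register read.

    `instrs` is a list of (dest, [sources]) tuples in program order. Reads of
    registers with no earlier writer in the list are ignored.
    """
    last_writer = {}
    distances = []
    for index, (dest, sources) in enumerate(instrs):
        for reg in sources:
            if reg in last_writer:
                distances.append(index - last_writer[reg])
        if dest is not None:
            last_writer[dest] = index
    return distances

def bucket(distance):
    """Name of the smallest bucket whose limit is >= distance."""
    return BUCKET_NAMES[bisect.bisect_left(BUCKET_LIMITS, distance)]

# Slide example, assuming destination-last operand order.
example = [
    ("R4", ["R1", "R3"]),  # ADD R1, R3, R4
    ("R2", ["R5", "R3"]),  # MUL R5, R3, R2
    ("R6", ["R5", "R3"]),  # ADD R5, R3, R6
    ("R4", ["R8"]),        # LD  R4, (R8)
    ("R1", ["R8", "R2"]),  # SUB R8, R2, R1
]
example_distances = raw_distances(example)
```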


20 Performance Clone Generation
- Workload profile: instruction mix, register dependency distance, stride pattern of load/store, branch transition rate, branch transition probabilities
- Figure: statistical flow graph of basic blocks with edge probabilities (0.9 and 0.1 shown)

21 Performance Clone Generation
- Workload profile: instruction mix, register dependency distance, stride pattern of load/store, branch transition rate, branch transition probabilities
- Synthetic clone generation from the statistical flow graph (edge probabilities 0.8 and 0.2 shown)

22 Performance Clone Generation
- Workload profile: instruction mix, register dependency distance, stride pattern of load/store, branch transition rate, branch transition probabilities
- Synthetic clone generation from the statistical flow graph (edge probabilities 0.8 and 0.2 shown)
- 1 Big Loop

23 Performance Clone Generation
- Workload profile: instruction mix, register dependency distance, stride pattern of load/store, branch transition rate, branch transition probabilities
- Synthetic clone generation from the statistical flow graph (edge probabilities 0.8 and 0.2 shown)
- 1 Big Loop
- Memory access model (strides)

24 Performance Clone Generation
- Workload profile: instruction mix, register dependency distance, stride pattern of load/store, branch transition rate, branch transition probabilities
- Synthetic clone generation from the statistical flow graph (edge probabilities 0.8 and 0.2 shown)
- 1 Big Loop
- Memory access model (strides)
- Branching model based on transition rate

25 Performance Clone Generation
- Workload profile: instruction mix, register dependency distance, stride pattern of load/store, branch transition rate, branch transition probabilities
- Synthetic clone generation from the statistical flow graph (edge probabilities 0.8 and 0.2 shown)
- 1 Big Loop
- Memory access model (strides)
- Branching model based on transition rate
- Register assignment
- C code with asm & volatile constructs
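One way the first generation step could look is a weighted random walk over the statistical flow graph that picks the basic blocks making up the body of the big loop. This is only a sketch under that assumption; the graph, block names, and `sample_block_sequence` are hypothetical, and the later steps (strides, branching model, register assignment, emitting C with asm statements) are not shown.

```python
import random

def sample_block_sequence(edge_probs, start, length, seed=0):
    """Walk the flow graph from `start`, choosing successors by probability.

    `edge_probs[src][dst]` is the transition probability from src to dst.
    The walk stops early if it reaches a block with no recorded successors.
    """
    rng = random.Random(seed)  # fixed seed keeps the generated clone reproducible
    sequence = [start]
    while len(sequence) < length:
        successors = edge_probs.get(sequence[-1])
        if not successors:
            break
        blocks = list(successors)
        weights = [successors[block] for block in blocks]
        sequence.append(rng.choices(blocks, weights=weights, k=1)[0])
    return sequence

# Hypothetical graph using the 0.8 / 0.2 split shown on the slide.
EXAMPLE_SFG = {
    "A": {"B": 0.8, "C": 0.2},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {"A": 1.0},
}
loop_body = sample_block_sequence(EXAMPLE_SFG, "A", length=12)
```

Each sampled block would then be filled with instructions drawn from the instruction mix and wired together using the dependency distance, stride, and transition rate statistics.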


27 Tools & Benchmarks
- SimpleScalar/Wattch simulators for profiling and cycle-accurate simulation
- Alpha ISA
- Programs compiled with Compaq cc at the O3 optimization level
- Benchmarks from the MiBench and MediaBench benchmark suites as representatives of characteristics of embedded applications
- Application domain: programs
  - Automotive: basicmath, qsort, bitcount, susan
  - Networking: crc32, dijkstra, patricia
  - Telecommunication: fft, gsm
  - Office: ghostscript, rsynth, stringsearch
  - Consumer: jpeg, typeset
  - Media: cjpeg, djpeg, g721-decode, ghostscript, mpeg, rasta, rawaudio, texgen, unepic

28 Evaluation
- Absolute accuracy
  - Ability of performance clone to estimate absolute IPC and power
- Relative accuracy
  - Sensitivity (IPC and power) of performance clone to cache & microarchitecture design changes
- Base configuration
  - L1 I-cache: 16 KB/2-way/32 B
  - L1 D-cache: 16 KB/2-way/32 B
  - L2 unified cache: 64 KB/4-way/64 B
  - Fetch, decode, and issue width: 1-wide out-of-order
  - Fetch queue: 8 entry
  - Branch predictor: 2-level GAp predictor
  - Functional units: 2 integer ALU, 1 FP multiplication unit, 1 FP ALU
  - Reorder buffer: 16 entries
  - Load store queue: 8 entries
  - Memory (bus width, first block latency): 8 B, 40 cycles

29 Absolute Accuracy in IPC
- Chart: IPC on base configuration, original benchmark vs. synthetic clone, for each MiBench and MediaBench program
- Average absolute error in estimating IPC is 8.7%
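For readers who want to reproduce this kind of summary figure, the sketch below averages the absolute percentage error between original and clone measurements. It assumes the error is taken relative to the original benchmark's value, which the slide does not state; the IPC numbers are made up for illustration.

```python
def mean_abs_pct_error(original, clone):
    """Average of |clone - original| / original, expressed in percent."""
    errors = [abs(c - o) / o * 100.0 for o, c in zip(original, clone)]
    return sum(errors) / len(errors)

# Hypothetical IPC values, not data from the talk.
original_ipc = [1.0, 0.5]
clone_ipc = [1.1, 0.45]
avg_error = mean_abs_pct_error(original_ipc, clone_ipc)
```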

30 Absolute Accuracy in Power
- Chart: power consumption on base configuration, original benchmark vs. synthetic clone, for each MiBench and MediaBench program
- Average absolute error in estimating power is 6.4%

31 Tracking Design Changes (1)
- Across 28 cache configurations
- Chart: Pearson's correlation coefficient between original and clone, per benchmark and on average
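Pearson's correlation coefficient measures how closely the clone's metric follows the original's as the configuration changes. A minimal sketch, written without external libraries; the sample values in the usage line are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical metric values for the original and the clone over three configurations.
r = pearson([0.9, 1.0, 1.2], [0.8, 0.95, 1.1])
```

A coefficient near 1 means the clone ranks and scales design points the way the original does, even if its absolute values differ.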

32 Tracking Design Changes (2)
- Across 28 cache configurations
- Chart: ranking of cache configuration (real) vs. ranking of cache configuration (synthetic)

33 Tracking Design Changes (3)
- 5 different microarchitecture changes
- Design change: average relative error in IPC, average relative error in power
  - Double the number of entries in the reorder buffer and load store queue: 5.81%, 3.41%
  - Reduce the L1 cache size to half: 1.48%, 0.39%
  - Double the fetch, decode, and issue width
  - Change the predictor from a 2-level to a not-taken predictor
  - Change the instruction issue policy to in-order
- Values for the last three changes, in the order they appear in the transcription: 5.41%, 4.59%, 6.51%, 1.80%, 3.26%, 1.22%


35 Conclusions
- Technique that clones performance but hides functional meaning of code
  - Architects & designers can get access to proprietary workloads
  - Foster benchmark sharing between industry and academia
  - Customers can make informed purchase decisions
- Evaluation of technique on embedded benchmarks is promising
  - Synthetic clone exhibits similar power/performance characteristics
  - Synthetic clone is a good proxy to original application

36 Challenges & Limitations
- Compiler technology is absorbed into the performance clone
  - Limited use for compiler studies
- Benchmark contains ISA specific embedded asm statements
  - Every embedded microprocessor designer cares about single ISA
  - Possibilities for true portability: virtual ISA, binary translation
- Abstract workload model simple by construction
  - Ability to perform what-if performance studies
  - Higher order models to capture complex dataflow


Insight into Application Performance Using Application-Dependent Characteristics Insight into Application Performance Using Application-Dependent Characteristics Waleed Alkohlani 1, Jeanine Cook 2, and Nafiul Siddique 1 1 Klipsch School of Electrical and Computer Engineering, New Mexico

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

A Reconfigurable Functional Unit for an Adaptive Extensible Processor

A Reconfigurable Functional Unit for an Adaptive Extensible Processor A Reconfigurable Functional Unit for an Adaptive Extensible Processor Hamid Noori Farhad Mehdipour Kazuaki Murakami Koji Inoue and Morteza SahebZamani Department of Informatics, Graduate School of Information

More information

Introduction. Chapter 4. Instruction Execution. CPU Overview. University of the District of Columbia 30 September, Chapter 4 The Processor 1

Introduction. Chapter 4. Instruction Execution. CPU Overview. University of the District of Columbia 30 September, Chapter 4 The Processor 1 Chapter 4 The Processor Introduction CPU performance factors Instruction count etermined by IS and compiler CPI and Cycle time etermined by CPU hardware We will examine two MIPS implementations simplified

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Accurate & Efficient Regression Modeling for Microarchitectural Performance & Power Prediction

Accurate & Efficient Regression Modeling for Microarchitectural Performance & Power Prediction Accurate & Efficient Regression Modeling for Microarchitectural Performance & Power Prediction {bclee,dbrooks}@eecs.harvard.edu Division of Engineering and Applied Sciences Harvard University 24 October

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Understanding multimedia application characteristics for designing programmable media processors

Understanding multimedia application characteristics for designing programmable media processors Header for SPIE use Understanding multimedia application characteristics for designing programmable media processors Jason Fritts *, Wayne Wolf, and Bede Liu Dept. of Electrical Engineering, Princeton

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery Wrong Path Events and Their Application to Early Misprediction Detection and Recovery David N. Armstrong Hyesoon Kim Onur Mutlu Yale N. Patt University of Texas at Austin Motivation Branch predictors are

More information

The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems

The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems The Effect of Nanometer-Scale Technologies on the Cache Size Selection for Low Energy Embedded Systems Hamid Noori, Maziar Goudarzi, Koji Inoue, and Kazuaki Murakami Kyushu University Outline Motivations

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I

More information

Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues

Performance measurement. SMD149 - Operating Systems - Performance and processor design. Introduction. Important trends affecting performance issues Performance measurement SMD149 - Operating Systems - Performance and processor design Roland Parviainen November 28, 2005 Performance measurement Motivation Techniques Common metrics Processor architectural

More information

Benchmark: Uses. Guide computer design. Guide purchasing decisions. Marketing tool Benchmarks. Program used to evaluate performance.

Benchmark: Uses. Guide computer design. Guide purchasing decisions. Marketing tool Benchmarks. Program used to evaluate performance. 02 1 Benchmarks 02 1 Benchmark: Program used to evaluate performance. Uses Guide computer design. Guide purchasing decisions. Marketing tool. 02 1 EE 4720 Lecture Transparency. Formatted 9:15, 6 April

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School

[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator, ACAPS Technical Memo 64, School References [1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School of Computer Science, McGill University, May 1993. [2] C. Young, N. Gloy, and M. D. Smith, \A Comparative

More information

Benchmark Synthesis for Architecture and Compiler Exploration

Benchmark Synthesis for Architecture and Compiler Exploration Benchmark Synthesis for Architecture and Compiler Exploration Luk Van Ertvelde Lieven Eeckhout Ghent University, Belgium Abstract This paper presents a novel benchmark synthesis framework with three key

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit

Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set

A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set Ahmad Zmily and Christos Kozyrakis Electrical Engineering Department, Stanford University Stanford, CA 94305, USA zmily@stanford.edu,

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

Method-Level Phase Behavior in Java Workloads

Method-Level Phase Behavior in Java Workloads Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS

More information

Central Processing Unit

Central Processing Unit Central Processing Unit Networks and Embedded Software Module.. by Wolfgang Neff Components () lock diagram Execution Unit Control Unit Registers rithmetic logic unit DD, SU etc. NOT, ND etc. us Interface

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

The University of Texas at Austin

The University of Texas at Austin EE382 (20): Computer Architecture - Parallelism and Locality Lecture 4 Parallelism in Hardware Mattan Erez The University of Texas at Austin EE38(20) (c) Mattan Erez 1 Outline 2 Principles of parallel

More information

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

I/O Characterization of Commercial Workloads

I/O Characterization of Commercial Workloads I/O Characterization of Commercial Workloads Kimberly Keeton, Alistair Veitch, Doug Obal, and John Wilkes Storage Systems Program Hewlett-Packard Laboratories www.hpl.hp.com/research/itc/csl/ssp kkeeton@hpl.hp.com

More information

ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial

ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial Roxana Rusitoru Systems Research Engineer, ARM 1 Motivation & background Goal: Why: Who: 2 HPC-oriented

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Copyright by Karthik Ganesan 2011

Copyright by Karthik Ganesan 2011 Copyright by Karthik Ganesan 2011 The Dissertation Committee for Karthik Ganesan certifies that this is the approved version of the following dissertation: Automatic Generation of Synthetic Workloads for

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information