Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks
1 Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks
Ajay Joshi (University of Texas), Lieven Eeckhout (Ghent University, Belgium), Robert H. Bell Jr. (IBM Corp.), Lizy John (University of Texas)
IEEE International Symposium on Workload Characterization, October 26, 2006
2 Outline
- Background and Motivation
- Performance Cloning Central Idea
- Performance Cloning Framework
- Workload Profiling
- Algorithm for Clone Generation
- Analysis and Results
- Summary
3 Benchmark Spectrum
- Toy Benchmarks, e.g. Hanoi, Heapsort
- Microbenchmarks, e.g. STREAM
- Kernel Codes, e.g. Livermore Loops
- Application Suites, e.g. SPEC CPU
- Synthetic Benchmarks, e.g. Dhrystone, Whetstone
- Complete Application Code
- Toward toy benchmarks: less development effort, more scalable, more maintainable, less representative
- Toward complete application code: more development effort, less scalable, less maintainable, more representative
4 Real World Applications as Benchmarks
- Increases confidence in making design tradeoffs
- Customize microprocessor design to specific applications
- Best way to understand how a processor is used
- Perhaps the only way to understand emerging workload characteristics
- Simplifies purchasing decisions for customers
5 Challenges With Using Real World Applications
- Real world applications tend to be proprietary
- Using real world applications for performance studies can be tedious
  - Difficult to duplicate user environment
  - Modifying application to research environment
  - Duplicating real input data set
- Real world workloads are a moving target
6 The Problem
- Need a methodology to create benchmarks that capture the main performance characteristics of real world applications
- Resulting benchmarks should hide the functional meaning of the code
- Ability to study what-if scenarios by varying program characteristics
8 Performance Cloning Central Idea
- Measure the inherent workload characteristics of the real world application: instruction mix, basic block size, ILP, data locality, ...
- Generate a clone with similar characteristics
[Figure: the real world application and its performance clone, each shown as a short assembly sequence (ADD, LD, MUL, DIV, SUB, BEQ, STORE)]
9 Performance Cloning Framework
- Microarchitecture-independent workload profiling: Real World Proprietary Workload -> Workload Profiler (binary instrumentation OR simulation) -> Workload Profile
- Workload Profile = Workload Attributes + Distribution of Attribute Values
- Modeling workload attributes into a synthetic workload: Workload Profile -> Workload Synthesizer -> Synthetic Benchmark Clone
- Experiment environment: Real Hardware, Execution Driven Simulator
11 Microarchitecture-Independent Profile
- Control Flow Behavior
- Data Locality
- Control Flow Predictability
- Instruction Mix
- Instruction Level Parallelism
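12 Control Flow Behavior (1)
- Statistical Flow Graph (K=0) [Eeckhout et al., ISCA 2004]
[Figure: statistical flow graph obtained by profiling the application binary; nodes carry occurrence counts and edges carry transition probabilities such as 40%, 60%, 33%, 67%, and 100%]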
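13 Control Flow Behavior (2)
- Statistical Flow Graph (K=1) [Eeckhout et al., ISCA 2004]
[Figure: order-1 statistical flow graph obtained by profiling the application binary; nodes carry occurrence counts and edges carry transition probabilities such as 60%, 67%, 33%, and 100%]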
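The construction of a statistical flow graph from a profiled run can be sketched as follows. This is a minimal illustration, assuming the profiler yields a trace of basic-block labels; the function name `build_sfg` and the example trace are hypothetical and are not the trace behind the slides. For K=0 a node is a single basic block; for K=1 a node is a basic block together with its predecessor.

```python
from collections import Counter, defaultdict

def build_sfg(trace, k=0):
    """Build an order-k statistical flow graph from a basic-block trace.

    A node is a tuple of k+1 consecutive basic blocks. Returns
    (node_counts, edge_probs), where edge_probs[node][successor] is the
    fraction of transitions out of `node` that went to `successor`.
    """
    # A sliding window of k+1 consecutive blocks gives the node sequence.
    nodes = [tuple(trace[i:i + k + 1]) for i in range(len(trace) - k)]
    node_counts = Counter(nodes)
    edge_counts = defaultdict(Counter)
    for src, dst in zip(nodes, nodes[1:]):
        edge_counts[src][dst] += 1
    edge_probs = {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in edge_counts.items()
    }
    return node_counts, edge_probs

# Hypothetical trace of basic-block labels, for illustration only.
trace = ["A", "B", "A", "C", "A", "B", "D"]
counts0, probs0 = build_sfg(trace, k=0)
counts1, probs1 = build_sfg(trace, k=1)
```

With K=1 the block pair (A, B) becomes its own node, so the graph can record that B is followed by different successors depending on how it was reached.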
14 Modeling Memory Data Access Pattern
- Identify streams of data references
- Stream: a sequence of memory addresses in an arithmetic progression
- Elements of arrays A, B, and C form 3 streams:
  for (ii = 0; ii < N; ii++) A[ii] = B[ii] + C[ii]
  A: 200, 204, ...   B: 320, 324, ...   C: 404, 408, ...
- Issuing sequence: 320, 404, 200, 324, 408, 204, ...
- Streams are interleaved and may contain noise: 4, 8, 12, 16, 1, 3, 20, 24, 5, 7, 2, 9, 11, 28
15 Extracting Streams
- Reference pattern of static Load/Store instructions
- PC-correlated spatial locality
  - Dependence on address referenced by nearby Ld/St
  - Programs with pointer chasing codes
- PC-correlated temporal locality
  - Dependence on previous address generated by same Ld/St
  - Programs with multidimensional arrays
- Could static Load/Store instructions be natural sources of streams?
- Profile every static Load/Store instruction: number of different strides with which it accesses data
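Profiling the strides of each static load/store can be sketched as below. This is an illustrative sketch, assuming a memory trace of (PC, address) pairs; the names `stride_profile`, `dominant_stride`, and the PC labels are hypothetical. The base addresses 320, 404, and 200 follow the issuing sequence of the array example, with 4-byte elements.

```python
from collections import Counter, defaultdict

def stride_profile(accesses):
    """Return {pc: Counter of strides} for a trace of (pc, address) pairs."""
    last_addr = {}
    strides = defaultdict(Counter)
    for pc, addr in accesses:
        if pc in last_addr:
            # Stride = difference to the previous address from the same PC.
            strides[pc][addr - last_addr[pc]] += 1
        last_addr[pc] = addr
    return dict(strides)

def dominant_stride(profile, pc):
    """Most frequent stride of one static load/store and its share of accesses."""
    hist = profile[pc]
    stride, count = hist.most_common(1)[0]
    return stride, count / sum(hist.values())

# Hypothetical trace for A[ii] = B[ii] + C[ii]: two loads and one store
# per iteration, interleaved in issue order.
trace = []
for ii in range(4):
    trace.append(("ld_B", 320 + 4 * ii))
    trace.append(("ld_C", 404 + 4 * ii))
    trace.append(("st_A", 200 + 4 * ii))

profile = stride_profile(trace)
print(dominant_stride(profile, "ld_B"))
```

Grouping by PC separates the interleaved streams without any search: each static instruction in this loop sees a single constant stride.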
16 Behavior of Static Load/Store Instructions
[Figure: percentage of dynamic memory references per benchmark: basicmath, bitcount, crc32, dijkstra, fft, ghostscript_mibench, gsm, jpeg, patricia, qsort, rsynth, stringsearch, susan, typeset, cjpeg, djpeg, g721-decode, ghostscript_media, mpeg-decode, rasta, rawaudio, texgen, unepic]
- As a first-order model, static loads/stores can be modeled as a single stream
17 Modeling Control Flow Predictability
- Capture behavior of easy and difficult to predict branches
- Transition rate [Haungs et al., HPCA '00]: an inherent program feature that captures branch behavior
- Transition rate = # of taken/not-taken transitions / # of times executed
- Branches with low transition rate (easier to predict): TTTTTTTTTN, NNNNNNNNNT
- Branches with high transition rate (easier to predict): TNTNTNTNTN
- Branches with moderate transition rate (tougher to predict)
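The transition-rate definition can be checked on the example patterns with a few lines of Python. This is a small sketch; the function name `transition_rate` and the string encoding of outcomes ("T" for taken, "N" for not taken) are assumptions made for illustration.

```python
def transition_rate(outcomes):
    """Number of taken/not-taken transitions divided by number of executions."""
    if not outcomes:
        return 0.0
    transitions = sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return transitions / len(outcomes)

for pattern in ("TTTTTTTTTN", "NNNNNNNNNT", "TNTNTNTNTN"):
    print(pattern, transition_rate(pattern))
```

The first two patterns give 0.1 and the alternating pattern gives 0.9; both extremes are regular, which is why the moderate range is the hard one.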
18 Modeling Instruction Level Parallelism
- Dependency distance example:
  ADD R1, R3, R4
  MUL R5, R3, R2
  ADD R5, R3, R6
  LD R4, (R8)
  SUB R8, R2, R1
- Read After Write dependency distance = 3
- Measure the distribution of dependency distances: up to 1, up to 2, up to 4, up to 8, up to 16, up to 32, >32
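Measuring the distribution of dependency distances might look like the sketch below. Two assumptions are made and should be read as such: the first operand of each instruction is its destination, and the distance counts the instructions strictly between the write and the read, which is one convention that reproduces the value 3 from the example (counting the index difference instead would give 4). The mnemonics in the comments are a reading of the slide's five-instruction example; all function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the buckets listed on the slide; anything larger is ">32".
BUCKETS = [1, 2, 4, 8, 16, 32]

def raw_distances(instructions):
    """Yield the read-after-write distance for every register read.

    `instructions` is a list of (dest, sources) pairs. Assumption: the
    distance is the number of instructions strictly between producer
    and consumer.
    """
    last_writer = {}
    for index, (dest, sources) in enumerate(instructions):
        for reg in sources:
            if reg in last_writer:
                yield index - last_writer[reg] - 1
        if dest is not None:
            last_writer[dest] = index

def bucket_label(distance):
    """Map a distance to the smallest bucket whose bound is not exceeded."""
    i = bisect_left(BUCKETS, distance)
    return f"Upto {BUCKETS[i]}" if i < len(BUCKETS) else ">32"

example = [
    ("R1", ["R3", "R4"]),  # ADD R1, R3, R4
    ("R5", ["R3", "R2"]),  # MUL R5, R3, R2
    ("R5", ["R3", "R6"]),  # ADD R5, R3, R6
    ("R4", ["R8"]),        # LD  R4, (R8)
    ("R8", ["R2", "R1"]),  # SUB R8, R2, R1
]
distances = list(raw_distances(example))
histogram = Counter(bucket_label(d) for d in distances)
print(distances, dict(histogram))
```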
20-25 Performance Clone Generation (slide built up step by step)
- Workload profile: instruction mix, register dependency distance, stride pattern of loads/stores, branch transition rate, branch transition probabilities
- Synthetic clone generation:
  - Basic blocks chosen using the branch transition probabilities
  - One big loop
  - Memory access model (strides)
  - Branching model based on transition rate
  - Register assignment
  - C code with asm & volatile constructs
[Figure: flow graph of basic blocks with branch transition probabilities (e.g. 0.9/0.1 in the profile, 0.8/0.2 in the clone), turned into a synthetic clone]
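The first steps of clone generation can be illustrated with a rough Python sketch: walk a flow graph using transition probabilities to pick a sequence of basic blocks, then fill each block according to the instruction mix. This is not the authors' generator. All names and numeric values are hypothetical, and register assignment, the stride-based memory model, the transition-rate branching model, and the emission of C with asm statements are left out.

```python
import random

def walk_flow_graph(transitions, start, steps, rng):
    """Pick a sequence of basic blocks by following transition probabilities."""
    path = [start]
    while len(path) < steps and transitions.get(path[-1]):
        successors = transitions[path[-1]]
        names = list(successors)
        weights = [successors[name] for name in names]
        path.append(rng.choices(names, weights=weights, k=1)[0])
    return path

def fill_block(instruction_mix, size, rng):
    """Draw `size` instruction types so that blocks follow the mix on average."""
    kinds = list(instruction_mix)
    weights = [instruction_mix[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=size)

# Hypothetical profile values, for illustration only.
transitions = {
    "A": {"B": 0.8, "C": 0.2},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {"A": 1.0},
}
instruction_mix = {"int_alu": 0.5, "load": 0.25, "store": 0.1, "branch": 0.15}

rng = random.Random(0)  # fixed seed so the generated sequence is reproducible
blocks = walk_flow_graph(transitions, "A", steps=8, rng=rng)
body = [fill_block(instruction_mix, size=4, rng=rng) for _ in blocks]
```

Seeding the generator keeps the output stable across runs, which matters if a generated clone is to be shared and compared.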
27 Tools & Benchmarks
- SimpleScalar/Wattch simulators for profiling and cycle-accurate simulation
- Alpha ISA
- Programs compiled with Compaq cc, O3 level
- Benchmarks from the MiBench and MediaBench suites as representatives of the characteristics of embedded applications
- Programs by application domain:
  - Automotive: basicmath, qsort, bitcount, susan
  - Networking: crc32, dijkstra, patricia
  - Telecommunication: fft, gsm
  - Office: ghostscript, rsynth, stringsearch
  - Consumer: jpeg, typeset
  - Media: cjpeg, djpeg, g721-decode, ghostscript, mpeg, rasta, rawaudio, texgen, unepic
28 Evaluation
- Absolute accuracy: ability of the performance clone to estimate absolute IPC and power
- Relative accuracy: sensitivity (IPC and power) of the performance clone to cache & microarchitecture design changes
- Base configuration:
  - L1 I-cache: 16 KB / 2-way / 32 B
  - L1 D-cache: 16 KB / 2-way / 32 B
  - L2 unified cache: 64 KB / 4-way / 64 B
  - Fetch, decode, and issue width: 1-wide out-of-order
  - Fetch queue: 8 entries
  - Branch predictor: 2-level GAp predictor
  - Functional units: 2 integer ALUs, 1 FP multiplication unit, 1 FP ALU
  - Reorder buffer: 16 entries
  - Load store queue: 8 entries
  - Memory (bus width, first block latency): 8 B, 40 cycles
29 Absolute Accuracy in IPC
[Figure: IPC on the base configuration, original benchmark vs. synthetic clone, for each benchmark]
- Average absolute error in estimating IPC is 8.7%
30 Absolute Accuracy in Power
[Figure: power consumption on the base configuration, original benchmark vs. synthetic clone, for each benchmark]
- Average absolute error in estimating power is 6.4%
31 Tracking Design Changes (1)
- Across 28 cache configurations
[Figure: Pearson's correlation coefficient for each benchmark, and the average]
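For reference, Pearson's correlation coefficient between the original benchmark and its clone across a set of configurations can be computed as in this sketch. The transcript does not say which metric is correlated; IPC is assumed here, and the four values per series are made up purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values, one per cache configuration (not results from the talk).
original_ipc = [0.80, 0.90, 1.00, 1.10]
clone_ipc = [0.70, 0.85, 0.90, 1.05]
r = pearson(original_ipc, clone_ipc)
print(round(r, 3))
```

A coefficient close to 1 means the clone moves in the same direction as the original when the configuration changes, even if its absolute values are offset.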
32 Tracking Design Changes (2)
- Across 28 cache configurations
[Figure: ranking of cache configurations (synthetic) plotted against ranking of cache configurations (real)]
33 Tracking Design Changes (3)
- 5 different microarchitecture changes; average relative error in IPC / average relative error in power:
  - Double the number of entries in the reorder buffer and load store queue: 5.81% / 3.41%
  - Reduce the L1 cache size to half: 1.48% / 0.39%
  - Double the fetch, decode, and issue width; change the predictor from a 2-level to a not-taken predictor; change the instruction issue policy to in-order: 5.41% 4.59% 6.51% 1.80% 3.26% 1.22%
35 Conclusions
- Technique that clones performance but hides the functional meaning of the code
  - Architects & designers can get access to proprietary workloads
  - Foster benchmark sharing between industry and academia
  - Customers can make informed purchase decisions
- Evaluation of the technique on embedded benchmarks is promising
  - Synthetic clone exhibits similar power/performance characteristics
  - Synthetic clone is a good proxy for the original application
36 Challenges & Limitations
- Compiler technology is absorbed into the performance clone
  - Limited use for compiler studies
- Benchmark contains ISA-specific embedded asm statements
  - Every embedded microprocessor designer cares about a single ISA
  - Possibilities for true portability: virtual ISA, binary translation
- Abstract workload model is simple by construction
  - Ability to perform what-if performance studies
  - Higher order models to capture complex dataflow
More informationBenchmark: Uses. Guide computer design. Guide purchasing decisions. Marketing tool Benchmarks. Program used to evaluate performance.
02 1 Benchmarks 02 1 Benchmark: Program used to evaluate performance. Uses Guide computer design. Guide purchasing decisions. Marketing tool. 02 1 EE 4720 Lecture Transparency. Formatted 9:15, 6 April
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More information[1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School
References [1] C. Moura, \SuperDLX A Generic SuperScalar Simulator," ACAPS Technical Memo 64, School of Computer Science, McGill University, May 1993. [2] C. Young, N. Gloy, and M. D. Smith, \A Comparative
More informationBenchmark Synthesis for Architecture and Compiler Exploration
Benchmark Synthesis for Architecture and Compiler Exploration Luk Van Ertvelde Lieven Eeckhout Ghent University, Belgium Abstract This paper presents a novel benchmark synthesis framework with three key
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More information15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15
More informationCustom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit
Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi
More informationSimultaneous Multithreading (SMT)
Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue
More informationA Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set
A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set Ahmad Zmily and Christos Kozyrakis Electrical Engineering Department, Stanford University Stanford, CA 94305, USA zmily@stanford.edu,
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationMethod-Level Phase Behavior in Java Workloads
Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS
More informationCentral Processing Unit
Central Processing Unit Networks and Embedded Software Module.. by Wolfgang Neff Components () lock diagram Execution Unit Control Unit Registers rithmetic logic unit DD, SU etc. NOT, ND etc. us Interface
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationThe University of Texas at Austin
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 4 Parallelism in Hardware Mattan Erez The University of Texas at Austin EE38(20) (c) Mattan Erez 1 Outline 2 Principles of parallel
More informationCMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago
CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationI/O Characterization of Commercial Workloads
I/O Characterization of Commercial Workloads Kimberly Keeton, Alistair Veitch, Doug Obal, and John Wilkes Storage Systems Program Hewlett-Packard Laboratories www.hpl.hp.com/research/itc/csl/ssp kkeeton@hpl.hp.com
More informationARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial
ARMv8 Micro-architectural Design Space Exploration for High Performance Computing using Fractional Factorial Roxana Rusitoru Systems Research Engineer, ARM 1 Motivation & background Goal: Why: Who: 2 HPC-oriented
More informationImproving Data Cache Performance via Address Correlation: An Upper Bound Study
Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing
More informationCopyright by Karthik Ganesan 2011
Copyright by Karthik Ganesan 2011 The Dissertation Committee for Karthik Ganesan certifies that this is the approved version of the following dissertation: Automatic Generation of Synthetic Workloads for
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More information