Power-Aware Compile Technology. Xiaoming Li

Size: px
Start display at page:

Download "Power-Aware Compile Technology. Xiaoming Li"

Transcription

1 Power-Aware Compile Technology Xiaoming Li

2 Frying Eggs

3 Future CPU? Watts/cm i386 Hot plate i486 Nuclear Reactor Pentium III processor Pentium II processor Pentium Pro processor Pentium processor!"#$!$ %"&$ %"#$ %"'#$ %"(#$ %"!)$ %"!'$ %"!$ %"%&$ Rocket Nozzle Sun s Surface ( Source: Fred Pollack, Intel, Micro 1999 keynote)

4 Maximizing power efficiency is within reach Hardware support Enhanced SpeedStep: Low overhead frequency/voltage scaling. 10us/transition. Opportunities: CPU frequency and frontbus frequency are decoupled. Programs have memory bound segments and CPU bound segments

5 Is any energy wasted when executing programs? Current Status: Programs run at a single frequency from start to end, neglecting segmental behavior of execution. CPU idles in memory bound segments. Out plan: Find out how a program execute. Remove the fat in energy consumption

6 Searching for the most power efficient FFT Select the proper frequency/voltage for different regions in the FFT code. Challenges include: Where to switch? Which frequency? Schedule the code to reveal more opportunities for frequency scaling. Challenges include: How to schedule for power?

7 What previous research does? Hardware/OS [Berkeley, MIT, UMD, UMich, UVa, ] Interactive applications Predict memory access pattern Fix window size -> Wrong prediction Batch applications Predict the execution time of every task Distribute unused time to remaining tasks Low granularity, no use for DSP program 7

8 Previous Compiler-based DVS algorithms The Design, Implementation, and Evaluation of a Compiler Algorithm for CPU Energy Reduction Chung- Hsing Hsu and Ulrich Kremer, PLDI 03. Select basic blocks from program structure ENTRY C1 C2 L3 Entry/exit is unique EXIT Loop Call site If statement Seq. of regions Entire procedure C5 L4 8 16

9 Chung-Hsing Hsu and Ulrich Kremer s Approach (Cont ) Measure the execution time and the power consumption of every region at every possible frequencies. Change the frequency of only one region Exhaustive search for the best region and the optimal frequency. 9

10 Run programs on the simulator Compile-time Dynamic Voltage Scaling Settings: Opportunities and Limits, Fen Xie, Margaret Martonosi, and Sharad Malik, PLDI 03 Divide the execution into memory accesses and cpu operations. Assuming the processor has continuous frequency spectrum. Model power consumption in memory accesses phases and cpu active phases. Use existing optimizing software to find the best single region for scheduling. 10

11 Re-examine our goal What we really want to optimize? Power ~ O(v) Trivial solution if just to reduce power Energy ~ O(v 2 ) Minimal energy consumption at the lowest frequency Energy Delay SPEC / Jules Energy Delay 2 Test if we really make improvement 11

12 Energy vs. Delay Landscape A energy B C How to affect tradeoffs? How to compare tradeoffs? delay

13 Optimization Space energy Pareto Optimal delay

14 Projection of Compile Optimizations F parallelizing scaling energy energy-aware compilation F =(F, s, p, -O) runtime 14

15 Our Goal energy new, higher quality Pareto front for any metric runtime 15

16 Simulator vs. Real Machine? Simulator Watt, SimPower Power-model should be verified. Not the best environment for compiler research. Real machine How to identify phases in the program? How to measure power consumption? How to search the front-line of energy-delay? 16

17 Identify Program Phases Use hardware counters Low overhead Limited number Find the correct events Memory access: L2_Cache_Load_In, L2_Prefetch_Load_In Instruction number: Instruction_Retired Execution time: Cycle 17

18 Insert Reading Points Control the overhead of reading. Reading evenly during the execution Use a simplified model of memory accesses and working cycles. Understand how compiler translate instructions. Constant loading Array access 18

19 L2 Cache Miss/10 us Are there really program phases? e e+06 2e e+06 3e+06 Cycle PM 19

20 30 25 L2 Cache Miss/10 us e e e e+06 Cycle PM 20

21 Iterations have different patterns 21

22 Frequency Scaling Select the program region with the highest cache miss ratio. Lower the processor frequency before entering the region. Restore the frequency after exiting the region. 22

23 WHT-2 19 (out-of-cache) low frequency Each point shows the cache miss ratio every 100!seconds Cache miss ratio high frequency Time 23

24 Example: code with voltage/frequency scaling instructions setfreq(2); i14 = 0; while (i14 <= 32767) { s277 = T2[i14]; s278 = T2[ i14]; t459 = s277 - s278; i14++; } setfreq(3); decrease frequency increase frequency 24

25 Frequency Scaling Select the program region with the highest cache miss ratio. Lower the processor frequency before entering the region. Restore the frequency after exiting the region. Transform the program to reveal more opportunities for frequency scaling. 25

26 26

27 Measure Energy Consumption Energy = Volt*Amp*Time Volt: Constant Amp: Oscilloscope Time: Cycle/Frequency 27

28 Pentium-M 2.13GHz Six frequency settings 2.13 GHz at volt (max performance) 800 MHz at volt (min performance/energy) The change in performance/energy tradeoff is dramatic. 28

29 29

30 30

31 WHT-2 20 Experiment Results Energy versus execution time Pareto curve Energy (Joules) Energy (Joules) % energy reduction % energy reduction Withing 5% of the execution time of the fastest version Execution Time (Seconds) Execution Time (Seconds) Fixed Dynamic 31

32 DCT-2 20 Energy versus execution time Pareto curve Energy (Joules) Energy (Joules) Execution Time (Seconds) Execution Time (Seconds) Fixed Dynamic 32

33 Real DFT-2 20 Energy versus execution time Pareto curve Energy (Joules) Energy (Joules) Execution Time (Seconds) Execution Time (Seconds) Fixed Dynamic 33

34 DFT-2 20 Energy versus execution time Pareto curve Energy (Joules) Execution Time (Seconds) Energy (Joules) Execution Time (Seconds) Fixed Dynamic 34

35 Future Directions Loop transformation Global optimization Strength reduction Parallelization for power... 35

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI Contents 1.7 - End of Chapter 1 Power wall The multicore era

More information

New Challenges in Microarchitecture and Compiler Design

New Challenges in Microarchitecture and Compiler Design New Challenges in Microarchitecture and Compiler Design Contributors: Jesse Fang Tin-Fook Ngai Fred Pollack Intel Fellow Director of Microprocessor Research Labs Intel Corporation fred.pollack@intel.com

More information

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem. The VLSI Interconnect Challenge Avinoam Kolodny Electrical Engineering Department Technion Israel Institute of Technology VLSI Challenges System complexity Performance Tolerance to digital noise and faults

More information

Crusoe Power Management:

Crusoe Power Management: Crusoe Power Management: Cutting x86 Operating Power Through LongRun Marc Fleischmann Director, Low Power Programs Transmeta Corporation Crusoe, LongRun and Code Morphing are trademarks of Transmeta Corp.

More information

Introduction to Energy-Efficient Software 2 nd life talk

Introduction to Energy-Efficient Software 2 nd life talk Introduction to Energy-Efficient Software 2 nd life talk Intel Software and Solutions Group Bob Steigerwald Nov 8, 2007 Taylor Kidd Nov 15, 2007 Agenda Demand for Mobile Computing Devices What is Energy-Efficient

More information

DFT Compiler for Custom and Adaptable Systems

DFT Compiler for Custom and Adaptable Systems DFT Compiler for Custom and Adaptable Systems Paolo D Alberto Electrical and Computer Engineering Carnegie Mellon University Personal Research Background Embedded and High Performance Computing Compiler:

More information

Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer

Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Phase-Based Application-Driven Power Management on the Single-chip Cloud Computer Nikolas Ioannou, Michael Kauschke, Matthias Gries, and Marcelo Cintra University of Edinburgh Intel Labs Braunschweig Introduction

More information

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John

Computer Performance Evaluation and Benchmarking. EE 382M Dr. Lizy Kurian John Computer Performance Evaluation and Benchmarking EE 382M Dr. Lizy Kurian John Evolution of Single-Chip Transistor Count 10K- 100K Clock Frequency 0.2-2MHz Microprocessors 1970 s 1980 s 1990 s 2010s 100K-1M

More information

Parallel Computing. Parallel Computing. Hwansoo Han

Parallel Computing. Parallel Computing. Hwansoo Han Parallel Computing Parallel Computing Hwansoo Han What is Parallel Computing? Software with multiple threads Parallel vs. concurrent Parallel computing executes multiple threads at the same time on multiple

More information

Multicore and Parallel Processing

Multicore and Parallel Processing Multicore and Parallel Processing Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 4.10 11, 7.1 6 xkcd/619 2 Pitfall: Amdahl s Law Execution time after improvement

More information

CSE 141: Computer Architecture. Professor: Michael Taylor. UCSD Department of Computer Science & Engineering

CSE 141: Computer Architecture. Professor: Michael Taylor. UCSD Department of Computer Science & Engineering CSE 141: Computer 0 Architecture Professor: Michael Taylor RF UCSD Department of Computer Science & Engineering Computer Architecture from 10,000 feet foo(int x) {.. } Class of application Physics Computer

More information

Efficient Program Power Behavior Characterization

Efficient Program Power Behavior Characterization Efficient Program Power Behavior Characterization Chunling Hu Daniel A. Jiménez Ulrich Kremer Department of Computer Science {chunling, djimenez, uli}@cs.rutgers.edu Rutgers University, Piscataway, NJ

More information

High-Performance Parallel Computing

High-Performance Parallel Computing High-Performance Parallel Computing P. (Saday) Sadayappan Rupesh Nasre Course Overview Emphasis on algorithm development and programming issues for high performance No assumed background in computer architecture;

More information

A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications

A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications Li Tan 1, Zizhong Chen 1, Ziliang Zong 2, Rong Ge 3, and Dong Li 4 1 University of California, Riverside 2 Texas

More information

CIT 668: System Architecture

CIT 668: System Architecture CIT 668: System Architecture Computer Systems Architecture I 1. System Components 2. Processor 3. Memory 4. Storage 5. Network 6. Operating System Topics Images courtesy of Majd F. Sakr or from Wikipedia

More information

How to get realistic C-states latency and residency? Vincent Guittot

How to get realistic C-states latency and residency? Vincent Guittot How to get realistic C-states latency and residency? Vincent Guittot Agenda Overview Exit latency Enter latency Residency Conclusion Overview Overview PMWG uses hikey960 for testing our dev on b/l system

More information

Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters

Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters Gregor von Laszewski, Lizhe Wang, Andrew J. Younge, Xi He Service Oriented Cyberinfrastructure Lab Rochester Institute of Technology,

More information

EE241 - Spring 2004 Advanced Digital Integrated Circuits

EE241 - Spring 2004 Advanced Digital Integrated Circuits EE24 - Spring 2004 Advanced Digital Integrated Circuits Borivoje Nikolić Lecture 2 Impact of Scaling Class Material Last lecture Class scope, organization Today s lecture Impact of scaling 2 Major Roadblocks.

More information

Dynamic Performance Tuning for Speculative Threads

Dynamic Performance Tuning for Speculative Threads Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of

More information

Power Measurement Using Performance Counters

Power Measurement Using Performance Counters Power Measurement Using Performance Counters October 2016 1 Introduction CPU s are based on complementary metal oxide semiconductor technology (CMOS). CMOS technology theoretically only dissipates power

More information

Applications. Department of Computer Science, University of Pittsburgh, Pittsburgh, PA fnevine, mosse, childers,

Applications. Department of Computer Science, University of Pittsburgh, Pittsburgh, PA fnevine, mosse, childers, Toward the Placement of Power Management Points in Real Time Applications Nevine AbouGhazaleh, Daniel Mosse, Bruce Childers and Rami Melhem Department of Computer Science, University of Pittsburgh, Pittsburgh,

More information

Separating Reality from Hype in Processors' DSP Performance. Evaluating DSP Performance

Separating Reality from Hype in Processors' DSP Performance. Evaluating DSP Performance Separating Reality from Hype in Processors' DSP Performance Berkeley Design Technology, Inc. +1 (51) 665-16 info@bdti.com Copyright 21 Berkeley Design Technology, Inc. 1 Evaluating DSP Performance! Essential

More information

Administration. Prerequisites. CS 395T: Topics in Multicore Programming. Why study parallel programming? Instructors: TA:

Administration. Prerequisites. CS 395T: Topics in Multicore Programming. Why study parallel programming? Instructors: TA: CS 395T: Topics in Multicore Programming Administration Instructors: Keshav Pingali (CS,ICES) 4.126A ACES Email: pingali@cs.utexas.edu TA: Aditya Rawal Email: 83.aditya.rawal@gmail.com University of Texas,

More information

Embedded Systems Architecture

Embedded Systems Architecture Embedded System Architecture Software and hardware minimizing energy consumption Conscious engineer protects the natur M. Eng. Mariusz Rudnicki 1/47 Software and hardware minimizing energy consumption

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

Administration. Course material. Prerequisites. CS 395T: Topics in Multicore Programming. Instructors: TA: Course in computer architecture

Administration. Course material. Prerequisites. CS 395T: Topics in Multicore Programming. Instructors: TA: Course in computer architecture CS 395T: Topics in Multicore Programming Administration Instructors: Keshav Pingali (CS,ICES) 4.26A ACES Email: pingali@cs.utexas.edu TA: Xin Sui Email: xin@cs.utexas.edu University of Texas, Austin Fall

More information

CHAPTER 7 IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING IN LINUX SCHEDULER

CHAPTER 7 IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING IN LINUX SCHEDULER 73 CHAPTER 7 IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING IN LINUX SCHEDULER 7.1 INTRODUCTION The proposed DVS algorithm is implemented on DELL INSPIRON 6000 model laptop, which has Intel Pentium Mobile Processor

More information

Parallel Functional Programming Lecture 1. John Hughes

Parallel Functional Programming Lecture 1. John Hughes Parallel Functional Programming Lecture 1 John Hughes Moore s Law (1965) The number of transistors per chip increases by a factor of two every year two years (1975) Number of transistors What shall we

More information

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS Many-Core Computing Era and New Challenges Nikos Hardavellas, EECS Moore s Law Is Alive And Well 90nm 90nm transistor (Intel, 2005) Swine Flu A/H1N1 (CDC) 65nm 2007 45nm 2010 32nm 2013 22nm 2016 16nm 2019

More information

Operating System Support for Shared-ISA Asymmetric Multi-core Architectures

Operating System Support for Shared-ISA Asymmetric Multi-core Architectures Operating System Support for Shared-ISA Asymmetric Multi-core Architectures Tong Li, Paul Brett, Barbara Hohlt, Rob Knauerhase, Sean McElderry, Scott Hahn Intel Corporation Contact: tong.n.li@intel.com

More information

How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group

How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group avi.mendelson@intel.com 1 Disclaimer No Intel proprietary information is disclosed. Every future estimate

More information

Energy-centric DVFS Controlling Method for Multi-core Platforms

Energy-centric DVFS Controlling Method for Multi-core Platforms Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To

More information

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization

Overview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26

More information

Lecture #1. Teach you how to make sure your circuit works Do you want your transistor to be the one that screws up a 1 billion transistor chip?

Lecture #1. Teach you how to make sure your circuit works Do you want your transistor to be the one that screws up a 1 billion transistor chip? Instructor: Jan Rabaey EECS141 1 Introduction to digital integrated circuit design engineering Will describe models and key concepts needed to be a good digital IC designer Models allow us to reason about

More information

Embedded System Architecture

Embedded System Architecture Embedded System Architecture Software and hardware minimizing energy consumption Conscious engineer protects the natur Embedded Systems Architecture 1/44 Software and hardware minimizing energy consumption

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

Energy Conservation In Computational Grids

Energy Conservation In Computational Grids Energy Conservation In Computational Grids Monika Yadav 1 and Sudheer Katta 2 and M. R. Bhujade 3 1 Department of Computer Science and Engineering, IIT Bombay monika@cse.iitb.ac.in 2 Department of Electrical

More information

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University. P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University P & H Chapter 4.10, 1.7, 1.8, 5.10, 6 Why do I need four computing cores on my phone?! Why do I need eight computing

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 2: Figures of Merit and Evaluation Methodologies

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 2: Figures of Merit and Evaluation Methodologies 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 2: Figures of Merit and Evaluation Methodologies Instructor: Ron Dreslinski Winter 2016 1 1 Measuring performance 2 2 Performance

More information

Potentials and Limitations for Energy Efficiency Auto-Tuning

Potentials and Limitations for Energy Efficiency Auto-Tuning Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne

More information

What is this class all about?

What is this class all about? EE141-Fall 2012 Digital Integrated Circuits Instructor: Elad Alon TuTh 11-12:30pm 247 Cory 1 What is this class all about? Introduction to digital integrated circuit design engineering Will describe models

More information

Analyzing the Energy-Time Tradeoff in High-Performance Computing Applications

Analyzing the Energy-Time Tradeoff in High-Performance Computing Applications Analyzing the Energy-Time Tradeoff in High-Performance Computing Applications Vincent W. Freeh Feng Pan David K. Lowenthal Nandini Kappiah Rob Springer Barry L. Rountree Mark E. Femal Department of Computer

More information

What is this class all about?

What is this class all about? EE141-Fall 2007 Digital Integrated Circuits Instructor: Elad Alon TuTh 3:30-5pm 155 Donner 1 1 What is this class all about? Introduction to digital integrated circuit design engineering Will describe

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs

An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering

More information

CSC630/CSC730 Parallel & Distributed Computing

CSC630/CSC730 Parallel & Distributed Computing CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2

More information

EITF35: Introduction to Structured VLSI Design

EITF35: Introduction to Structured VLSI Design EITF35: Introduction to Structured VLSI Design Part 1.1.2: Introduction (Digital VLSI Systems) Liang Liu liang.liu@eit.lth.se 1 Outline Why Digital? History & Roadmap Device Technology & Platforms System

More information

Last Time. Making correct concurrent programs. Maintaining invariants Avoiding deadlocks

Last Time. Making correct concurrent programs. Maintaining invariants Avoiding deadlocks Last Time Making correct concurrent programs Maintaining invariants Avoiding deadlocks Today Power management Hardware capabilities Software management strategies Power and Energy Review Energy is power

More information

EITF20: Computer Architecture Part1.1.1: Introduction

EITF20: Computer Architecture Part1.1.1: Introduction EITF20: Computer Architecture Part1.1.1: Introduction Liang Liu liang.liu@eit.lth.se 1 Course Factor Computer Architecture (7.5HP) http://www.eit.lth.se/kurs/eitf20 EIT s Course Service Desk (studerandeexpedition)

More information

CS3350B Computer Architecture CPU Performance and Profiling

CS3350B Computer Architecture CPU Performance and Profiling CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada

More information

Lecture #10 Context Switching & Performance Optimization

Lecture #10 Context Switching & Performance Optimization SPRING 2015 Integrated Technical Education Cluster At AlAmeeria E-626-A Real-Time Embedded Systems (RTES) Lecture #10 Context Switching & Performance Optimization Instructor: Dr. Ahmad El-Banna Agenda

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

SDR Forum Technical Conference 2007

SDR Forum Technical Conference 2007 THE APPLICATION OF A NOVEL ADAPTIVE DYNAMIC VOLTAGE SCALING SCHEME TO SOFTWARE DEFINED RADIO Craig Dolwin (Toshiba Research Europe Ltd, Bristol, UK, craig.dolwin@toshiba-trel.com) ABSTRACT This paper presents

More information

Performance of computer systems

Performance of computer systems Performance of computer systems Many different factors among which: Technology Raw speed of the circuits (clock, switching time) Process technology (how many transistors on a chip) Organization What type

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Evaluating MMX Technology Using DSP and Multimedia Applications

Evaluating MMX Technology Using DSP and Multimedia Applications Evaluating MMX Technology Using DSP and Multimedia Applications Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * November 22, 1999 The University of Texas at Austin Department of Electrical

More information

Power Management as I knew it. Jim Kardach

Power Management as I knew it. Jim Kardach Power Management as I knew it Jim Kardach 1 Agenda Philosophy of power management PM Timeline Era of OS Specific PM (OSSPM) Era of OS independent PM (OSIPM) Era of OS Assisted PM (APM) Era of OS & hardware

More information

Saving Energy with Architectural and Frequency Adaptations for Multimedia Applications Chris Hughes

Saving Energy with Architectural and Frequency Adaptations for Multimedia Applications Chris Hughes Saving Energy with Architectural and Frequency Adaptations for Multimedia Applications Chris Hughes w/ Jayanth Srinivasan and Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign

More information

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Low-power Architecture. By: Jonathan Herbst Scott Duntley Low-power Architecture By: Jonathan Herbst Scott Duntley Why low power? Has become necessary with new-age demands: o Increasing design complexity o Demands of and for portable equipment Communication Media

More information

19: I/O Devices: Clocks, Power Management

19: I/O Devices: Clocks, Power Management 19: I/O Devices: Clocks, Power Management Mark Handley Clock Hardware: A Programmable Clock Pulses Counter, decremented on each pulse Crystal Oscillator On zero, generate interrupt and reload from holding

More information

Performance Analysis in the Real World of Online Services

Performance Analysis in the Real World of Online Services Performance Analysis in the Real World of Online Services Dileep Bhandarkar, Ph. D. Distinguished Engineer 2009 IEEE International Symposium on Performance Analysis of Systems and Software My Background:

More information

A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance

A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance Qiang Wu 1, V.J. Reddi 2, Youfeng Wu 3, Jin Lee 3, Dan Connors 2, David Brooks 4, Margaret Martonosi 1, Douglas W.

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT : EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1

More information

EE141- Spring 2004 Introduction to Digital Integrated Circuits. What is this class about?

EE141- Spring 2004 Introduction to Digital Integrated Circuits. What is this class about? - Spring 2004 Introduction to Digital Integrated Circuits Tu-Th am-2:30pm 203 McLaughlin What is this class about? Introduction to digital integrated circuits.» CMOS devices and manufacturing technology.

More information

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Independent DSP Benchmarks: Methodologies and Results. Outline

Independent DSP Benchmarks: Methodologies and Results. Outline Independent DSP Benchmarks: Methodologies and Results Berkeley Design Technology, Inc. 2107 Dwight Way, Second Floor Berkeley, California U.S.A. +1 (510) 665-1600 info@bdti.com http:// Copyright 1 Outline

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 Structured Computer Organization A computer s native language, machine language, is difficult for human s to use to program the computer

More information

Power Profiling and Optimization for Heterogeneous Multi-Core Systems

Power Profiling and Optimization for Heterogeneous Multi-Core Systems Power Profiling and Optimization for Heterogeneous Multi-Core Systems Kuen Hung Tsoi and Wayne Luk Department of Computing, Imperial College London {khtsoi, wl}@doc.ic.ac.uk ABSTRACT Processing speed and

More information

The Role of Performance

The Role of Performance Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware

More information

Designing for Performance. Patrick Happ Raul Feitosa

Designing for Performance. Patrick Happ Raul Feitosa Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance

More information

Profiling and Workflow

Profiling and Workflow Profiling and Workflow Preben N. Olsen University of Oslo and Simula Research Laboratory preben@simula.no September 13, 2013 1 / 34 Agenda 1 Introduction What? Why? How? 2 Profiling Tracing Performance

More information

EE141- Spring 2002 Introduction to Digital Integrated Circuits. What is this class about?

EE141- Spring 2002 Introduction to Digital Integrated Circuits. What is this class about? - Spring 2002 Introduction to Digital Integrated Circuits Tu-Th 9:30-am 203 McLaughlin What is this class about? Introduction to digital integrated circuits.» CMOS devices and manufacturing technology.

More information

EE141- Spring 2007 Introduction to Digital Integrated Circuits

EE141- Spring 2007 Introduction to Digital Integrated Circuits - Spring 2007 Introduction to Digital Integrated Circuits Tu-Th 5pm-6:30pm 150 GSPP 1 What is this class about? Introduction to digital integrated circuits.» CMOS devices and manufacturing technology.

More information

COL862 Programming Assignment-1

COL862 Programming Assignment-1 Submitted By: Rajesh Kedia (214CSZ8383) COL862 Programming Assignment-1 Objective: Understand the power and energy behavior of various benchmarks on different types of x86 based systems. We explore a laptop,

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

ECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141

ECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141 ECE 637 Integrated VLSI Circuits Introduction EE141 1 Introduction Course Details Instructor Mohab Anis; manis@vlsi.uwaterloo.ca Text Digital Integrated Circuits, Jan Rabaey, Prentice Hall, 2 nd edition

More information

Dynamic Translation for EPIC Architectures

Dynamic Translation for EPIC Architectures Dynamic Translation for EPIC Architectures David R. Ditzel Chief Architect for Hybrid Computing, VP IAG Intel Corporation Presentation for 8 th Workshop on EPIC Architectures April 24, 2010 1 Dynamic Translation

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Automatic Post Silicon Clock Scheduling 08/12/2008. UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering

Automatic Post Silicon Clock Scheduling 08/12/2008. UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Power Issues in Computer Architecture Fall 2008 Power Density Trend for Intel mp 1000 Watt/cm2 100 10

More information

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program.

What is Good Performance. Benchmark at Home and Office. Benchmark at Home and Office. Program with 2 threads Home program. Performance COMP375 Computer Architecture and dorganization What is Good Performance Which is the best performing jet? Airplane Passengers Range (mi) Speed (mph) Boeing 737-100 101 630 598 Boeing 747 470

More information

Administration. Prerequisites. Website. CSE 392/CS 378: High-performance Computing: Principles and Practice

Administration. Prerequisites. Website. CSE 392/CS 378: High-performance Computing: Principles and Practice CSE 392/CS 378: High-performance Computing: Principles and Practice Administration Professors: Keshav Pingali 4.126 ACES Email: pingali@cs.utexas.edu Jim Browne Email: browne@cs.utexas.edu Robert van de

More information

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects CS 378: Programming for Performance Administration Instructors: Keshav Pingali (Professor, CS department & ICES) 4.126 ACES Email: pingali@cs.utexas.edu TA: Hao Wu (Grad student, CS department) Email:

More information

Response Time and Throughput

Response Time and Throughput Response Time and Throughput Response time How long it takes to do a task Throughput Total work done per unit time e.g., tasks/transactions/ per hour How are response time and throughput affected by Replacing

More information

The benefits and costs of writing a POSIX kernel in a high-level language

The benefits and costs of writing a POSIX kernel in a high-level language 1 / 38 The benefits and costs of writing a POSIX kernel in a high-level language Cody Cutler, M. Frans Kaashoek, Robert T. Morris MIT CSAIL Should we use high-level languages to build OS kernels? 2 / 38

More information

What is this class all about?

What is this class all about? -Fall 2004 Digital Integrated Circuits Instructor: Borivoje Nikolić TuTh 3:30-5 247 Cory EECS141 1 What is this class all about? Introduction to digital integrated circuits. CMOS devices and manufacturing

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Structural Graph Matching With Polynomial Bounds On Memory and on Worst-Case Effort

Structural Graph Matching With Polynomial Bounds On Memory and on Worst-Case Effort Structural Graph Matching With Polynomial Bounds On Memory and on Worst-Case Effort Fred DePiero, Ph.D. Electrical Engineering Department CalPoly State University Goal: Subgraph Matching for Use in Real-Time

More information

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 6

ECE 571 Advanced Microprocessor-Based Design Lecture 6 ECE 571 Advanced Microprocessor-Based Design Lecture 6 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 4 February 2016 HW#3 will be posted HW#1 was graded Announcements 1 First

More information

Improving Cache Performance

Improving Cache Performance Improving Cache Performance Computer Organization Architectures for Embedded Computing Tuesday 28 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition,

More information

R-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell

R-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell R-Storm: A Resource-Aware Scheduler for STORM Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell Introduction STORM is an open source distributed real-time data stream processing system

More information

CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture CMSC 611: Advanced Computer Architecture Performance Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science

More information

Improving Cache Performance

Improving Cache Performance Improving Cache Performance Tuesday 27 October 15 Many slides adapted from: and Design, Patterson & Hennessy 5th Edition, 2014, MK and from Prof. Mary Jane Irwin, PSU Summary Previous Class Memory hierarchy

More information

Rechnerarchitektur (RA)

Rechnerarchitektur (RA) 12 Rechnerarchitektur (RA) Sommersemester 2017 Architecture-Aware Optimizations -Software Optimizations- Jian-Jia Chen Informatik 12 Jian-jia.chen@tu-.. http://ls12-www.cs.tu.de/daes/ Tel.: 0231 755 6078

More information

Last Level Cache Size Flexible Heterogeneity in Embedded Systems

Last Level Cache Size Flexible Heterogeneity in Embedded Systems Last Level Cache Size Flexible Heterogeneity in Embedded Systems Mario D. Marino, Kuan-Ching Li Leeds Beckett University, m.d.marino@leedsbeckett.ac.uk Corresponding Author, Providence University, kuancli@gm.pu.edu.tw

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU

Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU Effects of Dynamic Voltage and Frequency Scaling on a K2 GPU Rong Ge, Ryan Vogt, Jahangir Majumder, and Arif Alam Dept. of Mathematics, Statistics and Computer Science Marquette University Milwaukee, WI,

More information