COSC 6374 Parallel Computation. Performance Oriented Software Design. Edgar Gabriel. Spring 2008. Amdahl's Law

COSC 6374 Parallel Computation
Performance Oriented Software Design
Spring 2008

Amdahl's Law

Describes the performance gain obtained by enhancing one part of the overall system (code, computer).

\[
\text{Speedup} = \frac{\text{Performance of entire task using the enhancement}}{\text{Performance of entire task not using the enhancement}}
\]

or

\[
\text{Speedup} = \frac{\text{Execution time of the task not using the enhancement}}{\text{Execution time of the task using the enhancement}}
\]

Amdahl's Law (II)

Amdahl's Law depends on two factors:
- Fraction of the execution time affected by the enhancement
- The improvement gained by the enhancement for this fraction

Thus

\[
\text{Execution\_time}_{enh} = \text{Execution\_time}_{org} \left( (1 - \text{Fraction}_{enh}) + \frac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}} \right)
\]

\[
\text{Speedup}_{overall} = \frac{\text{Execution\_time}_{org}}{\text{Execution\_time}_{enh}} = \frac{1}{(1 - \text{Fraction}_{enh}) + \dfrac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}}}
\]

Amdahl's Law (III)

\[
\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enh}) + \dfrac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}}}
\]

[Figure: Speedup total plotted against Speedup enhanced, one curve each for Fraction enhanced: 20%, 40%, 60%, 80%]
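As a rough illustration of the overall-speedup formula, the following C sketch evaluates it for a given fraction and enhancement speedup. The function names `amdahl_speedup` and `amdahl_limit` are hypothetical and not part of the lecture; the second function gives the bound 1 / (1 - Fraction_enh) that the formula approaches as Speedup_enh grows without limit.

```c
#include <assert.h>

/* Overall speedup according to Amdahl's Law.
 * fraction_enh: fraction of the original execution time affected by the
 *               enhancement, in [0, 1]
 * speedup_enh:  speedup of that fraction, must be > 0
 * (hypothetical helper, not from the lecture) */
double amdahl_speedup(double fraction_enh, double speedup_enh)
{
    assert(fraction_enh >= 0.0 && fraction_enh <= 1.0);
    assert(speedup_enh > 0.0);
    return 1.0 / ((1.0 - fraction_enh) + fraction_enh / speedup_enh);
}

/* Upper bound on the overall speedup when the enhanced part becomes
 * arbitrarily fast: only the unenhanced fraction remains. */
double amdahl_limit(double fraction_enh)
{
    assert(fraction_enh >= 0.0 && fraction_enh < 1.0);
    return 1.0 / (1.0 - fraction_enh);
}
```

For example, enhancing 40% of the execution time by a factor of 10 gives 1 / (0.6 + 0.04), roughly 1.56, and with 80% enhanced the overall speedup can never exceed about 5, no matter how large the enhancement is.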

Amdahl's Law (IV)

[Figure: Speedup according to Amdahl's Law. Speedup total plotted against Fraction enhanced, one curve per value of Speedup enhanced (including 2 and 4)]

Three big questions
- Where do I spend the most time?
- How efficient are those routines?
- Where do we lose efficiency?

Where do we spend most time?

- Need to profile the application
- Standard tools in UNIX-like environments: gprof, valgrind
- Valgrind: collection of various tools to analyze an application at runtime
  - --tool=memcheck: memory debugger
  - --tool=cachegrind: estimate of the cache usage of an application
  - --tool=callgrind: provides a trace of the function calls
- Most tools produce an output file (e.g. cachegrind.out.<pid>)
- kcachegrind: visualization tool for valgrind output files

How to determine the sources of overhead?

- Get detailed data for different sections of the routine
- Get an estimate of the number of operations executed within these sections

Scaling issues: for each of the n processes we might end up with
- a large number of time stamps (e.g. k per process)
- a large number of measurements per time stamp (e.g. m per time stamp): execution time of MPI functions, various PAPI counters, user-defined values

This leads to (n * k * m) data values for the performance analysis.

Data reduction for performance analysis

- Data reduction for the number of processes analyzed: find processors exposing the same behavior and focus the performance analysis on a single processor of each group
- Data reduction per process: eliminate the measurements exposing the same information
- Data reduction in time: find a small, typical cycle in the application and ignore the rest

Automatic, statistical methods are inevitable.

Where do we lose efficiency?

    valgrind --tool=cachegrind ./atf

    ==27050==
    ==27050== I   refs:      7,477,574,763
    ==27050== I1  misses:            1,856
    ==27050== L2i misses:            1,774
    ==27050== I1  miss rate:          0.00%
    ==27050== L2i miss rate:          0.00%
    ==27050==
    ==27050== D   refs:      3,663,973,777  (3,517,790,756 rd + 146,183,021 wr)
    ==27050== D1  misses:       89,705,595  (   85,089,836 rd +   4,615,759 wr)
    ==27050== L2d misses:       85,614,772  (   81,648,115 rd +   3,966,657 wr)
    ==27050== D1  miss rate:           2.4% (          2.4%   +         3.1%  )
    ==27050== L2d miss rate:           2.3% (          2.3%   +         2.7%  )
    ==27050==
    ==27050== L2 refs:          89,707,451  (   85,091,692 rd +   4,615,759 wr)
    ==27050== L2 misses:        85,616,546  (   81,649,889 rd +   3,966,657 wr)
    ==27050== L2 miss rate:            0.7% (          0.7%   +         2.7%  )
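The miss rates in this output are plain ratios of misses to references, and recomputing them helps when reading such a report. The sketch below assumes the D1 and L2d rates are taken relative to the data references (D refs), which is consistent with the numbers shown; the helper name `miss_rate_percent` is hypothetical.

```c
#include <assert.h>

/* Miss rate in percent: misses per reference.
 * (hypothetical helper for reading a cachegrind summary) */
double miss_rate_percent(long long misses, long long refs)
{
    assert(refs > 0);
    assert(misses >= 0 && misses <= refs);
    return 100.0 * (double)misses / (double)refs;
}
```

With the values above, `miss_rate_percent(89705595LL, 3663973777LL)` gives roughly 2.4, in line with the reported D1 miss rate. Applying the same helper to the L2d misses relative to the D1 misses, `miss_rate_percent(85614772LL, 89705595LL)`, gives roughly 95: almost every access that misses the first-level data cache in this run also misses the second level, which is where the efficiency is lost.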

PAPI hardware performance counters

- Modern processors expose some counters which give information about the performance
- Limited number of counters
- The number of simultaneous counters and the supported combinations of hardware counters depend on the processor
- Available on most modern operating systems:
  - Linux: requires recompiling the kernel
  - Windows: works right away, however not very accurate due to some restrictions of the OS on context switches
- Requires modification of your source code to insert the PAPI calls

General Counters

    PAPI_FP_OPS     Floating point operations
    PAPI_TOT_CYC    Total cycles
    PAPI_HW_INT     Hardware interrupts

Instruction Counters

    PAPI_TOT_IIS    Instructions issued
    PAPI_TOT_INS    Instructions completed
    PAPI_INT_INS    Integer instructions
    PAPI_LD_INS     Load instructions
    PAPI_SR_INS     Store instructions
    PAPI_BR_INS     Branch instructions
    PAPI_VEC_INS    Vector/SIMD instructions
    PAPI_LST_INS    Load/store instructions completed
    PAPI_SYC_INS    Synchronization instructions completed

Slides based on a talk and courtesy of Andreas Knuepfer and Matthias Mueller, Center for Information Services and High Performance Computing, Technical University Dresden.

FP Instruction Counters

    PAPI_FP_INS     Floating point instructions
    PAPI_FML_INS    Floating point multiply
    PAPI_FAD_INS    Floating point add
    PAPI_FDV_INS    Floating point divide
    PAPI_FSQ_INS    Floating point square root
    PAPI_FNV_INS    Floating point inverse

Cache Counters

    PAPI_L[1|2|3]_[D|I|T]C[M|H|A|R|W]
        Cache level 1/2/3
        [D|I|T]: data/instruction/total cache
        [M|H|A|R|W]: misses/hits/accesses/reads/writes
    PAPI_L[1|2|3]_[LD|ST]M
        Cache level 1/2/3
        [LD|ST]: load/store misses
    PAPI_PRF_DM
        Data prefetch cache misses

PAPI manual example

    PAPI_library_init(PAPI_VER_CURRENT);

    /* query and set up the right events to monitor */
    if (PAPI_query_event(PAPI_FP_INS) == PAPI_OK) {
        Events[0] = PAPI_FP_INS;
    } else {
        Events[0] = PAPI_TOT_INS;
    }
    Events[1] = PAPI_TOT_CYC;

    PAPI_start_counters((int *) Events, NUM_EVENTS);

    /* Execute the real code */
    do_flops(num_flops);

    PAPI_read_counters(values, NUM_EVENTS);

Vampir: Process Timeline

[Figure: Vampir process timeline display]

Example: low FP rate due to FP exceptions

Consequences for software design

Multi-dimensional allocations in C/C++. Typical code sequence:

    double **matrix;
    matrix = (double **) malloc(dim1 * sizeof(double *));
    for (i = 0; i < dim1; i++) {
        matrix[i] = (double *) malloc(dim2 * sizeof(double));
    }

- The allocated memory might not be contiguous
- This lowers performance

Consequences for software design

Alternative allocation technique:

    double **matrix;
    double *data;
    data = (double *) malloc(dims1 * dims2 * sizeof(double));
    matrix = (double **) malloc(dims1 * sizeof(double *));
    for (i = 0; i < dims1; i++) {
        matrix[i] = &(data[i * dims2]);   /* each row holds dims2 elements */
    }

Consequences for software design

The inner loop should run over the last (rightmost) index of multidimensional arrays in C/C++, since consecutive elements of a row are adjacent in memory.

Correct version:

    for (i = 0; i < dims1; i++) {
        for (j = 0; j < dims2; j++) {
            matrix[i][j] = ...;
        }
    }

Wrong version:

    for (j = 0; j < dims2; j++) {
        for (i = 0; i < dims1; i++) {
            matrix[i][j] = ...;
        }
    }
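A minimal, self-contained sketch of the contiguous allocation technique together with a row-major traversal might look as follows. The function names `alloc_matrix`, `free_matrix`, and `fill_matrix` are hypothetical; error handling is limited to returning NULL when an allocation fails.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a dims1 x dims2 matrix whose rows live in one contiguous block.
 * Returns NULL if either allocation fails. (hypothetical helper) */
double **alloc_matrix(size_t dims1, size_t dims2)
{
    assert(dims1 > 0 && dims2 > 0);
    double *data = malloc(dims1 * dims2 * sizeof *data);
    double **matrix = malloc(dims1 * sizeof *matrix);
    if (data == NULL || matrix == NULL) {
        free(data);
        free(matrix);
        return NULL;
    }
    for (size_t i = 0; i < dims1; i++)
        matrix[i] = &data[i * dims2];   /* row i starts after i full rows */
    return matrix;
}

/* Release both the data block and the row-pointer array. */
void free_matrix(double **matrix)
{
    if (matrix != NULL) {
        free(matrix[0]);   /* matrix[0] points at the start of the data block */
        free(matrix);
    }
}

/* Fill in row-major order: the inner loop runs over the last index,
 * so memory is touched with unit stride. */
void fill_matrix(double **matrix, size_t dims1, size_t dims2, double value)
{
    for (size_t i = 0; i < dims1; i++)
        for (size_t j = 0; j < dims2; j++)
            matrix[i][j] = value;
}
```

Because all rows share one block, only two allocations and two `free` calls are needed regardless of the number of rows, and the whole matrix can be handed to routines that expect a flat array via `matrix[0]`.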

What shall you do if one variable requires access along the rows and one variable along the columns?

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            for (k = 0; k < dim; k++)
                c[i][j] += a[i][k] * b[k][j];

Blocked code versions optimize cache usage (the version below assumes that dim is a multiple of block):

    for (i = 0; i < dim; i += block)
        for (j = 0; j < dim; j += block)
            for (k = 0; k < dim; k += block)
                for (ii = i; ii < (i + block); ii++)
                    for (jj = j; jj < (j + block); jj++)
                        for (kk = k; kk < (k + block); kk++)
                            c[ii][jj] += a[ii][kk] * b[kk][jj];

Comparison operators

- Comparing integer values is orders of magnitude faster than comparing strings
- Map options to integers and use if or switch statements
- Avoid strcmp or similar functions wherever possible

Avoid unnecessary memory copy operations

- Minimizing the memory footprint improves cache behavior
- Passing pointers to a subroutine instead of making a copy of the data array
- This might, however, have a negative impact on loops within the subroutine, since the compiler does not know the boundaries of the array/loop
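Returning to the blocked matrix multiplication above, the following sketch shows one way to write it so that it also works when the matrix dimension is not a multiple of the block size. It is an illustration under assumptions, not the lecture's code: the matrices are stored as flat row-major arrays, and the names `matmul_naive`, `matmul_blocked`, and `min_size` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b for dim x dim row-major matrices, straightforward version. */
void matmul_naive(size_t dim, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < dim; i++)
        for (size_t j = 0; j < dim; j++)
            for (size_t k = 0; k < dim; k++)
                c[i * dim + j] += a[i * dim + k] * b[k * dim + j];
}

/* Same computation, tiled into sub-matrices of at most block x block
 * elements. The upper bounds are clamped to dim, so dim does not have
 * to be a multiple of block. */
void matmul_blocked(size_t dim, size_t block,
                    const double *a, const double *b, double *c)
{
    assert(block > 0);
    for (size_t i = 0; i < dim; i += block)
        for (size_t j = 0; j < dim; j += block)
            for (size_t k = 0; k < dim; k += block) {
                size_t i_end = min_size(i + block, dim);
                size_t j_end = min_size(j + block, dim);
                size_t k_end = min_size(k + block, dim);
                for (size_t ii = i; ii < i_end; ii++)
                    for (size_t jj = j; jj < j_end; jj++)
                        for (size_t kk = k; kk < k_end; kk++)
                            c[ii * dim + jj] +=
                                a[ii * dim + kk] * b[kk * dim + jj];
            }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. A suitable block size depends on the cache sizes of the target machine and is best determined by measurement.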

Object structures

Rule of thumb: it is better to have an object containing a vector of data than a vector of objects with one data point each.

- Fewer indirections
- Better cache usage
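The rule of thumb can be made concrete with two hypothetical layouts for the same data. In this sketch, `struct point` and `struct point_set` are illustrative names only: the first layout reaches every value through its own pointer, the second keeps all values in one contiguous array.

```c
#include <assert.h>
#include <stddef.h>

/* Vector of objects: one data point per object, each reached through
 * its own pointer. (hypothetical type) */
struct point {
    double value;
};

double sum_points(struct point *const *points, size_t n)
{
    assert(points != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += points[i]->value;    /* one extra indirection per element */
    return s;
}

/* One object containing a vector of data. (hypothetical type) */
struct point_set {
    size_t  n;
    double *values;               /* n contiguous values */
};

double sum_point_set(const struct point_set *set)
{
    assert(set != NULL);
    double s = 0.0;
    for (size_t i = 0; i < set->n; i++)
        s += set->values[i];      /* unit-stride access */
    return s;
}
```

Both functions compute the same sum, but the second loop reads consecutive memory locations, which tends to make better use of cache lines and hardware prefetching; the actual difference should be confirmed by profiling.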

Organizational issues (I)

Organizational issues (I) COSC 6385 Computer Architecture Introduction and Organizational Issues Fall 2007 Organizational issues (I) Classes: Monday, 1.00pm 2.30pm, PGH 232 Wednesday, 1.00pm 2.30pm, PGH 232 Evaluation 25% homework

More information

Organizational issues (I)

Organizational issues (I) COSC 6385 Computer Architecture Introduction and Organizational Issues Fall 2009 Organizational issues (I) Classes: Monday, 1.00pm 2.30pm, SEC 202 Wednesday, 1.00pm 2.30pm, SEC 202 Evaluation 25% homework

More information

Organizational issues (I)

Organizational issues (I) COSC 6385 Computer Architecture Introduction and Organizational Issues Fall 2008 Organizational issues (I) Classes: Monday, 1.00pm 2.30pm, PGH 232 Wednesday, 1.00pm 2.30pm, PGH 232 Evaluation 25% homework

More information

HiPERiSM Consulting, LLC.

HiPERiSM Consulting, LLC. HiPERiSM Consulting, LLC. George Delic, Ph.D. HiPERiSM Consulting, LLC (919)484-9803 P.O. Box 569, Chapel Hill, NC 27514 george@hiperism.com http://www.hiperism.com Models-3 User s Conference September

More information

PAPI Software Specification

PAPI Software Specification PAPI Software Specification This software specification describes the PAPI 3.0 Release, and is current as of March 08, 2004. It consists of the following sections: Introduction to PAPI Constants Standardized

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

PAPI - PERFORMANCE API. ANDRÉ PEREIRA

PAPI - PERFORMANCE API. ANDRÉ PEREIRA PAPI - PERFORMANCE API ANDRÉ PEREIRA ampereira@di.uminho.pt 1 Motivation Application and functions execution time is easy to measure time gprof valgrind (callgrind) It is enough to identify bottlenecks,

More information

PAPI - PERFORMANCE API. ANDRÉ PEREIRA

PAPI - PERFORMANCE API. ANDRÉ PEREIRA PAPI - PERFORMANCE API ANDRÉ PEREIRA ampereira@di.uminho.pt 1 Motivation 2 Motivation Application and functions execution time is easy to measure time gprof valgrind (callgrind) 2 Motivation Application

More information

Go Multicore Series:

Go Multicore Series: Go Multicore Series: Understanding Memory in a Multicore World, Part 2: Software Tools for Improving Cache Perf Joe Hummel, PhD http://www.joehummel.net/freescale.html FTF 2014: FTF-SDS-F0099 TM External

More information

Principles. Performance Tuning. Examples. Amdahl s Law: Only Bottlenecks Matter. Original Enhanced = Speedup. Original Enhanced.

Principles. Performance Tuning. Examples. Amdahl s Law: Only Bottlenecks Matter. Original Enhanced = Speedup. Original Enhanced. Principles Performance Tuning CS 27 Don t optimize your code o Your program might be fast enough already o Machines are getting faster and cheaper every year o Memory is getting denser and cheaper every

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Vector and Parallel Processors. Amdahl's Law

Vector and Parallel Processors. Amdahl's Law Vector and Parallel Processors. Vector processors are processors which have special hardware for performing operations on vectors: generally, this takes the form of a deep pipeline specialized for this

More information

COSC 6385 Computer Architecture. Defining Computer Architecture

COSC 6385 Computer Architecture. Defining Computer Architecture COSC 6385 Computer rchitecture Defining Computer rchitecture Fall 007 icro-processors in today s world arkets Desktop computing Servers Embedded computers Characteristics Price vailability Reliability

More information

Performance Metrics for Ocean and Air Quality Models on Commodity Linux Platforms

Performance Metrics for Ocean and Air Quality Models on Commodity Linux Platforms Performance Metrics for Ocean and Air Quality Models on Commodity Linux Platforms George Delic george@hiperism.com HiPERiSM Consulting, LLC Durham, North Carolina Abstract. This report examines performance

More information

ECE 695 Numerical Simulations Lecture 3: Practical Assessment of Code Performance. Prof. Peter Bermel January 13, 2017

ECE 695 Numerical Simulations Lecture 3: Practical Assessment of Code Performance. Prof. Peter Bermel January 13, 2017 ECE 695 Numerical Simulations Lecture 3: Practical Assessment of Code Performance Prof. Peter Bermel January 13, 2017 Outline Time Scaling Examples General performance strategies Computer architectures

More information

Cache Profiling with Callgrind

Cache Profiling with Callgrind Center for Information Services and High Performance Computing (ZIH) Cache Profiling with Callgrind Linux/x86 Performance Practical, 17.06.2009 Zellescher Weg 12 Willers-Bau A106 Tel. +49 351-463 - 31945

More information

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002,

Chapter 1. Instructor: Josep Torrellas CS433. Copyright Josep Torrellas 1999, 2001, 2002, Chapter 1 Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Course Goals Introduce you to design principles, analysis techniques and design options in computer architecture

More information

Parallel Performance and Optimization

Parallel Performance and Optimization Parallel Performance and Optimization Erik Schnetter Gregory G. Howes Iowa High Performance Computing Summer School University of Iowa Iowa City, Iowa May 20-22, 2013 Thank you Ben Rogers Glenn Johnson

More information

Evaluation of Profiling Tools for the Acquisition of Time Independent Traces

Evaluation of Profiling Tools for the Acquisition of Time Independent Traces Evaluation of Profiling Tools for the Acquisition of Time Independent Traces Frédéric Desprez, George S. Markomanolis, Frédéric Suter TECHNICAL REPORT N 437 July 2013 Project-Team AVALON ISSN 0249-0803

More information

Distributed and Parallel Technology

Distributed and Parallel Technology Distributed and Parallel Technology Parallel Performance Tuning Hans-Wolfgang Loidl http://www.macs.hw.ac.uk/~hwloidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 0 No

More information

Performance Optimization: Simulation and Real Measurement

Performance Optimization: Simulation and Real Measurement Performance Optimization: Simulation and Real Measurement KDE Developer Conference, Introduction Agenda Performance Analysis Profiling Tools: Examples & Demo KCachegrind: Visualizing Results What s to

More information

Profiling Parallel Performance using Vampir and Paraver

Profiling Parallel Performance using Vampir and Paraver Profiling Parallel Performance using Vampir and Paraver Andrew Sunderland, Andrew Porter STFC Daresbury Laboratory, Warrington, WA4 4AD Abstract Two popular parallel profiling tools installed on HPCx are

More information

Profiling and debugging. John Cazes Texas Advanced Computing Center

Profiling and debugging. John Cazes Texas Advanced Computing Center Profiling and debugging John Cazes Texas Advanced Computing Center Outline Debugging Profiling GDB DDT Basic use Attaching to a running job Identify MPI problems using Message Queues Catch memory errors

More information

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable

More information

PAPI Programmer s Reference

PAPI Programmer s Reference PAPI Programmer s Reference This document is a compilation of the reference material needed by a programmer to effectively use PAPI. It is identical to the material found in the PAPI man pages, but organized

More information

CSCI-580 Advanced High Performance Computing

CSCI-580 Advanced High Performance Computing CSCI-580 Advanced High Performance Computing Performance Hacking: Matrix Multiplication Bo Wu Colorado School of Mines Most content of the slides is from: Saman Amarasinghe (MIT) Square-Matrix Multiplication!2

More information

ECE 563 Spring 2012 First Exam

ECE 563 Spring 2012 First Exam ECE 563 Spring 2012 First Exam version 1 This is a take-home test. You must work, if found cheating you will be failed in the course and you will be turned in to the Dean of Students. To make it easy not

More information

Can We Understand Performance Counter Results?

Can We Understand Performance Counter Results? Can We Understand Performance Counter Results? Vince Weaver ICL Lunch Talk 23 July 2010 How Do We Know if Counters are Working? Three common failures: Wrong counter (PAPI, Kernel, User) Counter works but

More information

Profiling and debugging. Yaakoub El Khamra Texas Advanced Computing Center

Profiling and debugging. Yaakoub El Khamra Texas Advanced Computing Center Profiling and debugging Yaakoub El Khamra Texas Advanced Computing Center Outline Debugging GDB DDT PTP Basic use Attaching to a running job Identify MPI problems using Message Queues Catch memory errors

More information

COSC 6385 Computer Architecture - Instruction Set Principles

COSC 6385 Computer Architecture - Instruction Set Principles COSC 6385 Computer rchitecture - Instruction Set Principles Fall 2006 Organizational Issues September 4th: no class (labor day holiday) Classes of onday Sept. 11 th and Wednesday Sept. 13 th have to be

More information

Final CSE 131B Winter 2003

Final CSE 131B Winter 2003 Login name Signature Name Student ID Final CSE 131B Winter 2003 Page 1 Page 2 Page 3 Page 4 Page 5 Page 6 Page 7 Page 8 _ (20 points) _ (25 points) _ (21 points) _ (40 points) _ (30 points) _ (25 points)

More information

Debugging, Profiling and Optimising Scientific Codes. Wadud Miah Research Computing Group

Debugging, Profiling and Optimising Scientific Codes. Wadud Miah Research Computing Group Debugging, Profiling and Optimising Scientific Codes Wadud Miah Research Computing Group Scientific Code Performance Lifecycle Debugging Scientific Codes Software Bugs A bug in a program is an unwanted

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 2

ECE 571 Advanced Microprocessor-Based Design Lecture 2 ECE 571 Advanced Microprocessor-Based Design Lecture 2 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 January 2016 Announcements HW#1 will be posted tomorrow I am handing out

More information

ECE 454 Computer Systems Programming Measuring and profiling

ECE 454 Computer Systems Programming Measuring and profiling ECE 454 Computer Systems Programming Measuring and profiling Ding Yuan ECE Dept., University of Toronto http://www.eecg.toronto.edu/~yuan It is a capital mistake to theorize before one has data. Insensibly

More information

COSC 6385 Computer Architecture. Instruction Set Architectures

COSC 6385 Computer Architecture. Instruction Set Architectures COSC 6385 Computer Architecture Instruction Set Architectures Spring 2012 Instruction Set Architecture (ISA) Definition on Wikipedia: Part of the Computer Architecture related to programming Defines set

More information

Performance Analysis of AERMOD on Commodity Platforms

Performance Analysis of AERMOD on Commodity Platforms Performance Analysis of AERMOD on Commodity Platforms George Delic george@hiperism.com HiPERiSM Consulting, LLC Durham, North Carolina Abstract. This report examines performance of the AERMOD Air Quality

More information

Parallel Performance and Optimization

Parallel Performance and Optimization Parallel Performance and Optimization Gregory G. Howes Department of Physics and Astronomy University of Iowa Iowa High Performance Computing Summer School University of Iowa Iowa City, Iowa 25-26 August

More information

Performance Profiling

Performance Profiling Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Outline History Understanding Profiling Understanding Performance Understanding Performance

More information

COSC 6385 Computer Architecture - Project

COSC 6385 Computer Architecture - Project COSC 6385 Computer Architecture - Project Edgar Gabriel Spring 2018 Hardware performance counters set of special-purpose registers built into modern microprocessors to store the counts of hardwarerelated

More information

The Role of Performance

The Role of Performance Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware

More information

Parallel Code Optimisation

Parallel Code Optimisation April 8, 2008 Terms and terminology Identifying bottlenecks Optimising communications Optimising IO Optimising the core code Theoretical perfomance The theoretical floating point performance of a processor

More information

Basic Communication Operations (Chapter 4)

Basic Communication Operations (Chapter 4) Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430. Parallel Systems

THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430. Parallel Systems THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430 Parallel Systems Study Period: 15 minutes Time Allowed: 3 hours Permitted Materials: Non-Programmable Calculator This

More information

Lecture: Static ILP, Branch Prediction

Lecture: Static ILP, Branch Prediction Lecture: Static ILP, Branch Prediction Topics: compiler-based ILP extraction, branch prediction, bimodal/global/local/tournament predictors (Section 3.3, notes on class webpage) 1 Problem 1 Use predication

More information

CSE 141 Summer 2016 Homework 2

CSE 141 Summer 2016 Homework 2 CSE 141 Summer 2016 Homework 2 PID: Name: 1. A matrix multiplication program can spend 10% of its execution time in reading inputs from a disk, 10% of its execution time in parsing and creating arrays

More information

High Performance Computing and Programming, Lecture 3

High Performance Computing and Programming, Lecture 3 High Performance Computing and Programming, Lecture 3 Memory usage and some other things Ali Dorostkar Division of Scientific Computing, Department of Information Technology, Uppsala University, Sweden

More information

Giving credit where credit is due

Giving credit where credit is due CSCE 23J Computer Organization Cache Memories Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce23j Giving credit where credit is due Most of slides for this lecture are based

More information

Tools and techniques for optimization and debugging. Fabio Affinito October 2015

Tools and techniques for optimization and debugging. Fabio Affinito October 2015 Tools and techniques for optimization and debugging Fabio Affinito October 2015 Profiling Why? Parallel or serial codes are usually quite complex and it is difficult to understand what is the most time

More information

Lecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4)

Lecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4) Lecture 2: Pipelining Basics Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4) 1 Defining Fault, Error, and Failure A fault produces a latent error; it becomes effective when

More information

Program Transformations for the Memory Hierarchy

Program Transformations for the Memory Hierarchy Program Transformations for the Memory Hierarchy Locality Analysis and Reuse Copyright 214, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California

More information

Computer Systems A Programmer s Perspective 1 (Beta Draft)

Computer Systems A Programmer s Perspective 1 (Beta Draft) Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface

More information

Lecture: Branch Prediction

Lecture: Branch Prediction Lecture: Branch Prediction Topics: branch prediction, bimodal/global/local/tournament predictors, branch target buffer (Section 3.3, notes on class webpage) 1 Support for Speculation In general, when we

More information

Memory Hierarchy. Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3. Instructor: Joanna Klukowska

Memory Hierarchy. Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3. Instructor: Joanna Klukowska Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O Hallaron (CMU) Mohamed Zahran (NYU)

More information

Section Notes - Week 1 (9/17)

Section Notes - Week 1 (9/17) Section Notes - Week 1 (9/17) Why do we need to learn bits and bitwise arithmetic? Since this class is about learning about how computers work. For most of the rest of the semester, you do not have to

More information

Memory Hierarchy. Cache Memory Organization and Access. General Cache Concept. Example Memory Hierarchy Smaller, faster,

Memory Hierarchy. Cache Memory Organization and Access. General Cache Concept. Example Memory Hierarchy Smaller, faster, Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Cache Memory Organization and Access Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O

More information

Advanced Programming & C++ Language

Advanced Programming & C++ Language Advanced Programming & C++ Language ~6~ Introduction to Memory Management Ariel University 2018 Dr. Miri (Kopel) Ben-Nissan Stack & Heap 2 The memory a program uses is typically divided into four different

More information

Software Analysis. Asymptotic Performance Analysis

Software Analysis. Asymptotic Performance Analysis Software Analysis Performance Analysis Presenter: Jonathan Aldrich Performance Analysis Software Analysis 1 Asymptotic Performance Analysis How do we compare algorithm performance? Abstract away low-level

More information

ECE Spring 2017 Exam 2

ECE Spring 2017 Exam 2 ECE 56300 Spring 2017 Exam 2 All questions are worth 5 points. For isoefficiency questions, do not worry about breaking costs down to t c, t w and t s. Question 1. Innovative Big Machines has developed

More information

CISC 360. Cache Memories Nov 25, 2008

CISC 360. Cache Memories Nov 25, 2008 CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based

More information

A Portable Programming Interface for Performance Evaluation on Modern Processors

A Portable Programming Interface for Performance Evaluation on Modern Processors A Portable Programming Interface for Performance Evaluation on Modern Processors S. Browne *, J Dongarra, N. Garner *, K. London *, P. Mucci * Abstract The purpose of the PAPI project is to specify a standard

More information

The Public Shared Objects Run-Time System

The Public Shared Objects Run-Time System The Public Shared Objects Run-Time System Stefan Lüpke, Jürgen W. Quittek, Torsten Wiese E-mail: wiese@tu-harburg.d400.de Arbeitsbereich Technische Informatik 2, Technische Universität Hamburg-Harburg

More information

RISC Architecture Ch 12

RISC Architecture Ch 12 RISC Architecture Ch 12 Some History Instruction Usage Characteristics Large Register Files Register Allocation Optimization RISC vs. CISC 18 Original Ideas Behind CISC (Complex Instruction Set Comp.)

More information

PAPI: Performance API

PAPI: Performance API Santiago 2015 PAPI: Performance API Andrés Ávila Centro de Modelación y Computación Científica Universidad de La Frontera andres.avila@ufrontera.cl October 27th, 2015 1 Motivation 2 Motivation PERFORMANCE

More information
