COSC 6374 Parallel Computation. Performance Oriented Software Design. Edgar Gabriel. Spring 2008. Amdahl's Law
COSC 6374 Parallel Computation
Performance Oriented Software Design
Spring 2008

Amdahl's Law

Amdahl's Law describes the performance gain obtained by enhancing one part of the overall system (code, computer):

$$\text{Speedup} = \frac{\text{Performance of entire task using the enhancement}}{\text{Performance of entire task not using the enhancement}}$$

or, equivalently,

$$\text{Speedup} = \frac{\text{Execution time of the task not using the enhancement}}{\text{Execution time of the task using the enhancement}}$$
Amdahl's Law (II)

Amdahl's Law depends on two factors:
- the fraction of the execution time affected by the enhancement
- the improvement gained by the enhancement for this fraction

Thus

$$\text{Execution\_time}_{enh} = \text{Execution\_time}_{org} \left( (1 - \text{Fraction}_{enh}) + \frac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}} \right) \tag{1}$$

$$\text{Speedup}_{overall} = \frac{\text{Execution\_time}_{org}}{\text{Execution\_time}_{enh}} = \frac{1}{(1 - \text{Fraction}_{enh}) + \dfrac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}}} \tag{2}$$

Amdahl's Law (III)

[Figure: overall speedup (Speedup total) plotted against Speedup enhanced, one curve each for a fraction enhanced of 20%, 40%, 60%, and 80%.]
Amdahl's Law (IV)

[Figure: Speedup according to Amdahl's Law. Speedup total plotted against fraction enhanced, one curve per value of Speedup enhanced (2, 4, ...).]

Three big questions

- Where do I spend the most time?
- How efficient are those routines?
- Where do we lose efficiency?
Where do we spend most time?

- Need to profile the application
- Standard tools in UNIX-like environments: gprof, valgrind
- Valgrind: a collection of tools to analyze an application at runtime
  - --tool=memcheck: memory debugger
  - --tool=cachegrind: estimate of the cache usage of an application
  - --tool=callgrind: provides a trace of the function calls
- Most tools produce an output file (e.g. cachegrind.out.<pid>)
- kcachegrind: visualization tool for valgrind output files
How to determine the sources of overhead?

- Get detailed data for different sections of the routine
- Get an estimate of the number of operations executed within these sections
- Scaling issues: for each of the n processes we might end up with
  - a large number of time stamps (e.g. k per process)
  - a large number of measurements per time stamp (e.g. m per time stamp): execution time of MPI functions, various PAPI counters, user-defined values
- This leads to (n * k * m) data values for the performance analysis

Data reduction for performance analysis

- Data reduction for the number of processes analyzed: find processors exhibiting the same behavior and focus the performance analysis on a single processor of each group
- Data reduction per process: eliminate the measurements carrying the same information
- Data reduction in time: find a small, typical cycle in the application and ignore the rest
- Automatic, statistical methods are inevitable
Where do we lose efficiency?

    valgrind --tool=cachegrind ./atf
    =================================================
    ==27050==
    ==27050== I   refs:      7,477,574,763
    ==27050== I1  misses:            1,856
    ==27050== L2i misses:            1,774
    ==27050== I1  miss rate:          0.00%
    ==27050== L2i miss rate:          0.00%
    ==27050==
    ==27050== D   refs:      3,663,973,777  (3,517,790,756 rd + 146,183,021 wr)
    ==27050== D1  misses:       89,705,595  (   85,089,836 rd +   4,615,759 wr)
    ==27050== L2d misses:       85,614,772  (   81,648,115 rd +   3,966,657 wr)
    ==27050== D1  miss rate:           2.4% (          2.4%   +         3.1%  )
    ==27050== L2d miss rate:           2.3% (          2.3%   +         2.7%  )
    ==27050==
    ==27050== L2 refs:          89,707,451  (   85,091,692 rd +   4,615,759 wr)
    ==27050== L2 misses:        85,616,546  (   81,649,889 rd +   3,966,657 wr)
    ==27050== L2 miss rate:            0.7% (          0.7%   +         2.7%  )
PAPI hardware performance counters

- Modern processors expose a set of counters that give some information about performance
- Limited number of counters
- The number of simultaneous counters and the supported combinations of hardware counters depend on the processor
- Available on most modern operating systems:
  - Linux: requires recompiling the kernel (this applied to the kernel-patch-based PAPI releases of the time; PAPI on current Linux kernels uses the built-in perf_event interface)
  - Windows: works right away, but is not very accurate due to some restrictions of the OS on context switches
- Requires modification of your source code to insert the PAPI calls
General Counters

    PAPI_FP_OPS     Floating point operations
    PAPI_TOT_CYC    Total cycles
    PAPI_HW_INT     Hardware interrupts

Instruction Counters

    PAPI_TOT_IIS    Instructions issued
    PAPI_TOT_INS    Instructions completed
    PAPI_INT_INS    Integer instructions
    PAPI_LD_INS     Load instructions
    PAPI_SR_INS     Store instructions
    PAPI_BR_INS     Branch instructions
    PAPI_VEC_INS    Vector/SIMD instructions
    PAPI_LST_INS    Load/store instructions completed
    PAPI_SYC_INS    Synchronization instructions completed

Slides based on a talk and courtesy of Andreas Knuepfer and Matthias Mueller, Center for Information Services and High Performance Computing, Technical University Dresden.
FP Instruction Counters

    PAPI_FP_INS     Floating point instructions
    PAPI_FML_INS    Floating point multiply
    PAPI_FAD_INS    Floating point add
    PAPI_FDV_INS    Floating point divide
    PAPI_FSQ_INS    Floating point square root
    PAPI_FNV_INS    Floating point inverse

Cache Counters

    PAPI_L[1|2|3]_[D|I|T]C[M|H|A|R|W]
        Cache level 1/2/3
        [D|I|T]: data/instruction/total cache
        [M|H|A|R|W]: misses/hits/accesses/reads/writes
    PAPI_L[1|2|3]_[LD|ST]M
        Cache level 1/2/3
        [LD|ST]: load/store misses
    PAPI_PRF_DM
        Data prefetch cache misses
PAPI manual example

    #include <papi.h>

    /* Declarations assumed here; the slide shows only the calls. */
    #define NUM_EVENTS 2
    int Events[NUM_EVENTS];
    long long values[NUM_EVENTS];

    PAPI_library_init(PAPI_VER_CURRENT);

    /* query and set up the right events to monitor */
    if (PAPI_query_event(PAPI_FP_INS) == PAPI_OK) {
        Events[0] = PAPI_FP_INS;
    } else {
        Events[0] = PAPI_TOT_INS;
    }
    Events[1] = PAPI_TOT_CYC;

    PAPI_start_counters((int *) Events, NUM_EVENTS);

    /* Execute the real code (do_flops stands for the application kernel) */
    do_flops(num_flops);

    PAPI_read_counters(values, NUM_EVENTS);

PAPI_start_counters and PAPI_read_counters belong to PAPI's classic high-level API, which was removed in PAPI 6.0; the fragment targets the earlier releases.

[Figure: Vampir process timeline.]
[Figure: example of a low FP rate due to FP exceptions.]

Consequences for software design

Multi-dimensional allocations in C/C++. Typical code sequence:

    double **matrix;

    matrix = (double **) malloc(dim1 * sizeof(double *));
    for (i = 0; i < dim1; i++) {
        /* one separate allocation per row */
        matrix[i] = (double *) malloc(dim2 * sizeof(double));
    }

- The allocated memory might not be contiguous
- This lowers performance
Consequences for software design

Alternative allocation technique:

    double **matrix;
    double *data;

    /* one contiguous block for all dims1 x dims2 elements */
    data = (double *) malloc(dims1 * dims2 * sizeof(double));
    matrix = (double **) malloc(dims1 * sizeof(double *));
    for (i = 0; i < dims1; i++) {
        /* row i starts i * dims2 elements into the block */
        matrix[i] = &(data[i * dims2]);
    }

Consequences for software design

The inner loop should run over the last (rightmost) index of multidimensional arrays in C/C++, because consecutive values of that index are adjacent in memory.

Correct version:

    for (i = 0; i < dims1; i++) {
        for (j = 0; j < dims2; j++) {
            matrix[i][j] = ...;
        }
    }

Wrong (cache-unfriendly) version:

    for (j = 0; j < dims2; j++) {
        for (i = 0; i < dims1; i++) {
            matrix[i][j] = ...;
        }
    }
What should you do if one variable requires access along the rows and another along the columns?

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            for (k = 0; k < dim; k++)
                c[i][j] += a[i][k] * b[k][j];

Blocked code versions optimize cache usage (as written, this assumes that dim is a multiple of block):

    for (i = 0; i < dim; i += block)
        for (j = 0; j < dim; j += block)
            for (k = 0; k < dim; k += block)
                for (ii = i; ii < (i + block); ii++)
                    for (jj = j; jj < (j + block); jj++)
                        for (kk = k; kk < (k + block); kk++)
                            c[ii][jj] += a[ii][kk] * b[kk][jj];

Comparison operators

- Comparing integer values is orders of magnitude faster than comparing strings
  - map options to integers and use if or switch statements
  - avoid strcmp or similar functions wherever possible

Avoid unnecessary memory copy operations

- Minimizing the memory footprint improves cache behavior
- Pass pointers to a subroutine instead of making a copy of the data array
- This might, however, have a negative impact on loops within the subroutine, since the compiler does not know the boundaries of the array/loop
Object structures

Rule of thumb: it is better to have an object containing a vector of data than a vector of objects with one data point each:
- fewer indirections
- better cache usage
More informationCO Computer Architecture and Programming Languages CAPL. Lecture 15
CO20-320241 Computer Architecture and Programming Languages CAPL Lecture 15 Dr. Kinga Lipskoch Fall 2017 How to Compute a Binary Float Decimal fraction: 8.703125 Integral part: 8 1000 Fraction part: 0.703125
More informationAssignment 6: The Power of Caches
Assignment 6: The Power of Caches Due by: April 20, 2018 before 10:00 pm Collaboration: Individuals or Registered Pairs (see Piazza). It is mandatory for every student to register on Piazza. Grading: Packaging
More informationProf. Thomas Sterling
High Performance Computing: Concepts, Methods & Means Performance Measurement 1 Prof. Thomas Sterling Department of Computer Science Louisiana i State t University it February 13 th, 2007 News Alert! Intel
More informationComputer Performance. Relative Performance. Ways to measure Performance. Computer Architecture ELEC /1/17. Dr. Hayden Kwok-Hay So
Computer Architecture ELEC344 Computer Performance How do you measure performance of a computer? 2 nd Semester, 208-9 Dr. Hayden Kwok-Hay So How do you make a computer fast? Department of Electrical and
More informationVAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW
VAMPIR & VAMPIRTRACE INTRODUCTION AND OVERVIEW 8th VI-HPS Tuning Workshop at RWTH Aachen September, 2011 Tobias Hilbrich and Joachim Protze Slides by: Andreas Knüpfer, Jens Doleschal, ZIH, Technische Universität
More informationCS 31: Intro to Systems Operating Systems Overview. Kevin Webb Swarthmore College March 31, 2015
CS 31: Intro to Systems Operating Systems Overview Kevin Webb Swarthmore College March 31, 2015 Reading Quiz OS: Turn undesirable into desirable Turn undesirable inconveniences: reality Complexity of hardware
More informationPrinciples of Operating Systems
Principles of Operating Systems Lecture 18-20 - Main Memory Ardalan Amiri Sani (ardalan@uci.edu) [lecture slides contains some content adapted from previous slides by Prof. Nalini Venkatasubramanian, and
More informationI/O Devices. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)
I/O Devices Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau) Hardware Support for I/O CPU RAM Network Card Graphics Card Memory Bus General I/O Bus (e.g., PCI) Canonical Device OS reads/writes
More information