High Performance Computing and Programming. Performance Analysis

Size: px
Start display at page:

Download "High Performance Computing and Programming. Performance Analysis"

Transcription

1 High Performance Computing and Programming Performance Analysis

2 What is performance? Performance is a total effectiveness of a program What are the measures of the performance? execution time FLOPS memory usage power consumption response time (time from request until result) throughput (number of operations per unit time) bandwidth (number of bits transmitted per unit time)

3 What is performance? Performance is a total effectiveness of a program What are the measures of the performance? execution time FLOPS memory usage power consumption response time (time from request until result) throughput (number of operations per unit time) bandwidth (number of bits transmitted per unit time)

4 What is performance analysis? Computers are systems of components that interact with each other. Complex and hard to model analytically Performance analysis is an experimental discipline of computer science involving: Measurement Interpretation Communication

5 Measuring time What is taking so much time? Which section of my program should be made better? How can you figure out why it is so slow? What are ways to speed it up?

6 CPU clock rate CPU clock speed or rate is the speed at which a microprocessor executes instructions number of cycles per second - Hz Today s processors 2-5GHz (~ thousand million cycles per second) Record (2013): IBM zec12 with 5.5 GH Amount of work different CPUs can do in one cycle varies

7 View of time by system and application System perspective 2 processes: A and B Perspective of the process A The system switches from process to process. Interval time time between interrupts. And it is operating in user or kernel mode. Useful computation are performed just in the user mode

8 Interval Counting OS measures runtimes using Interval Counting Maintains 2 counts per process User time - time executing instructions in current process System time - time executing instructions in kernel on behalf of current process Every X cycles, trigger a timer interrupt and increment a counter for executing process User time if running in user mode System time if running in kernel mode This is called a clock tick

9 Interval Counting OS measures runtimes using Interval Counting Maintains 2 counts per process User time - time executing instructions in current process System time - time executing instructions in kernel on behalf of current process Every X cycles, trigger a timer interrupt and increment a counter for executing process User time if running in user mode System time if running in kernel mode This is called a clock tick

10 Interval Counting Example 2 processes: A and B 10 ms Time segment is assigned to a process as part of either its user (u) or system (s) time A 110u + 40s B 70u + 30s

11 Interval Counting Example 2 processes: A and B 10 ms Time segment is assigned to a process as part of either its user (u) or system (s) time Actual times: A 110u + 40s B 70u + 30s A 120.0u s B 73.3u s Therefore this timing mechanism is only approximate

12 Accuracy of process timer 1 active process Process time unusable below 100 ms 11 active processes

13 Accuracy of process timer Conclusion: Coarseness of the timing method results in big error Interval counting is only useful for measuring relatively long computations Accuracy does not depend strongly on the system load

14 Cycle counter Most modern systems have built-in registers that are incremented every clock cycle Very fine grained No uniform, platform-independent interface Special assembly code instruction to access On (recent model) Intel machines: 64 bit unsigned number accessed with the RDTSC (read time stamp counter) instruction

15 Cycle counter Measuring time between two points in program Count the total number of cycles between this points Does not keep track of which process uses cycles Errors due to context switching, cache operation, and branch prediction But: errors will always cause overestimates of the true execution time => nothing will artificially speed up a program

16 Time on a Computer System Wall Clock = Elapsed Time includes everything from beginning to end (disk, OS activity, I/O ) CPU time time spent on the program (while it is actually running) user CPU time system CPU time Wall time Other processes System time for one core: CPU time User time

17 Time on a Computer System = user time = system time = some other user s time + + = real (wall clock) time Usually the word time refers to user time. cumulative user time

18 Multicore Effects Counters are incremented once per interval per process: Multithreaded programs running on p cores for t seconds can show pt seconds CPU time Use wall-time instead when timing parallel code

19 Timers Timer Usage Wallclock/ CPU Resolution (s) Lang Portability time shell/script both 10-2 / yes gettimeofday function wallclock 10-6 (*) C/C++ yes clock_gettim e function CPU time 10-9 C/C++ No (Unix) * hardware dependent minimum period of time that can be measured

20 Timers in code - clock() uses interval timer gives CPU time Example: #include <time.h> clock_t start = clock(); printf( doing stuff here\n ); clock_t end = clock(); printf( It took me %d ticks (%f seconds)\n,end-start, (double)(endstart)/clocks_per_sec);

21 Timers in code - gettimeofday() Unix gettimeofday() function Return elapsed time since reference time (Jan 1, 1970) Some implementations use interval counting, while others use cycle timers. System Resolution (ms) Latency (ms) Pentium II, Windows-NT Compaq Alpha Pentium III Linux Sun UltraSparc

22 Timers in code - clock_gettime() Support different clocks: CLOCK_REALTIME - system-wide clock that measures real (i.e., wall-clock) time affected by discontinuous jumps in the system time affected by the incremental adjustments performed by adjtime and NTP CLOCK_MONOTONIC - monotonic time since some unspecified starting point not affected by discontinuous jumps in the system time affected by the incremental adjustments performed by adjtime and NTP Strictly linearly increasing, and thus useful for calculating the difference in time between two samplings CLOCK_MONOTONIC_RAW - monotonic time since some unspecified starting point not affected by discontinuous jumps in the system time not affected by the incremental adjustments performed by adjtime and NTP

23 Guidelines for measuring execution time Use a high resolution timer, such as clock_gettime() Check the amount of time your program spends in the OS using the time command Do not trust timings unless they are at least 100 times longer than the CPU time resolution Measure several runs to gauge variability (3 or more) Pick the shortest time as a representative (or an average)

24 Profiling Profiling is a process of analizing the behavior of the program (CPU performance, memory, I/O) while it is running Gives a detailed description of execution times for each part of the program Allows to identify hot spots - big amount of executed instructions or place in the program where it uses a lot of CPU time bottlenecks - parts of the program which uses processor resourses inefficiently Profiling is done on working code!

25 Statistical Sampling and Code Instrumentation Sampling means polling the status of the program at regular time intervals Resulting profiles are statistical not exact Usually has minimal interference with program's runtime e.g. Unix time Instrumenting (tracing) means interposing profiling calls within your program's regular calls More exact, but also more resource-intensive Reproducible results e.g. valgrind

26 Measuring FLOPS Floating-point Operations Per Second FLOPS = # of floating point operations in the program Execution time Peak performance of the computer Frequency of the computer (Hz) Number of floating point operations per cycle MFLOPS = 10 6 FLOPS GFLOPS = 10 9 FLOPS TFLOPS = FLOPS PFLOPS = FLOPS Upper bound on performance FLOPS = CPU speed (Hz) * op per cycle * number of cores

27 Measuring FLOPS (1) Count the number of operations in your source code (or, preferably, examine *.s) (2) Time the operation (3) Calculate the rate How far from peak performance? 10% of the peak performance - room for optimization >50% of the peak performance - probably not going to get much more efficient (unless the basic algorithm is rewritten)

28 Scalar Product, example Sum = 0.0; for( i=0; i<n; i++ ) { sum += a[i]*b[i]; } ~N additions ~N multiplications ~2*N FLOP printf( Scalar product of a and b is %f\n,sum);

29 Debugging Debugging is the stage of development of the program during which one finds, isolates, and fix bugs (errors). In order to find the error you need to know: temporary values of variables how program was executed, which execution path was taken Two ways to debug program: using log files, i.e. using output function write state of the program in critical points using debuggers (includes user interface, can execute program step by step or till some condition is satisfied, write temporary values of variables and so on)

30 Debugging locating errors Ex.: GNU Debugger, GDB step through line by line stop on specified conditions show the values of variables contents of the call stack set breakpoints

31 How the debugger figures out where to find the functions and variables in the machine code? debugging information! (DI) Compile it with: gcc -g main.c - DI is stored in the object file - DI describes the data type of variables and functions and the correspondence between source line numbers and addresses in the executable code

32 objdump -h debug_aranges **0 CONTENTS, READONLY, DEBUGGING 27.debug_info c **0 CONTENTS, READONLY, DEBUGGING 28.debug_abbrev ae ee 2**0 CONTENTS, READONLY, DEBUGGING 29.debug_line c c 2**0 CONTENTS, READONLY, DEBUGGING 30.debug_str cf d8 2**0 CONTENTS, READONLY, DEBUGGING 31.debug_loc a7 2**0 CONTENTS, READONLY, DEBUGGING

33 objdump --dwarf=info <1><2d4>: Abbrev Number: 10 (DW_TAG_subprogram) <2d5> DW_AT_external : 1 <2d6> DW_AT_name 0x8a): main <2da> DW_AT_decl_file : 1 <2db> DW_AT_decl_line : 67 <2dc> DW_AT_type <2e0> DW_AT_low_pc <2e8> DW_AT_high_pc : (indirect string, offset: : <0x57> : 0x40093b : 0x400ac4

34 Valgrind Valgrind is a wrapper around a collection of tools for memory debugging, memory leak detection, and profiling: Memcheck - detects memory-management problems Cachegrind & Callgrind - cache profilers Massif - heap profiler...

35 Valgrind Memcheck Tool to check for memory errors and memory leaks Run the executable inside its own environment Memcheck can detect: Use of uninitialised memory Accesses memory it shouldn't areas not yet allocated areas that have been freed areas past the end of heap blocks Inaccessible areas of the stack Memory leaks - where pointers to malloc blocks are lost forever Mismatched use of malloc and free Overlapping of pointers in memcpy() and related functions...

36 Valgrind Memcheck memory leaks --leak-check=full give more information ==7623== LEAK SUMMARY: ==7623== definitely lost: 2,400 bytes in 3 blocks ==7623== indirectly lost: 120,000 bytes in 300 blocks ==7623== possibly lost: 0 bytes in 0 blocks ==7623== still reachable: 0 bytes in 0 blocks ==7623== suppressed: 0 bytes in 0 blocks ==7623== 40,800 (800 direct, 40,000 indirect) bytes in 1 blocks are definitely lost in loss record 6 of 6 ==7623== at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheckamd64-linux.so) ==7623== by 0x400666: allocate_mem (matmul.c:18) ==7623== by 0x400A57: main (matmul.c:92)

37 Valgrind Memcheck Memcheck runs programs about 10-30x slower than normal Valgrind may not find all the errors especially those that don't happen every time. Valgrind doesn't perform bounds checking on static arrays (allocated on the stack)

38 Valgrind Cachegrind & Valgrind Callgrind Cache memory is a computer memory with very short access time used for storage of frequently or recently used instructions or data Cachegrind finds source location of cache misses Callgrind is Cachegrind + function call graph information Simulates CPU with 2-level cache: 1L and LL. Size of the caches can be specified, default is current machine s cache Large overhead: x times slower

39 Inclusive/Exclusive CPU time Inclusive CPU time is the time spent in a function and all its children Exclusive (self) CPU time is the time spent only executing code in the function itself

40 gprof gprof is a standard UNIX profiling utility for profiling executables and shared libraries (1) compile with and link with -pg (2) run your program > generates gmon.out (3) $ gprof <executable> gprof shows CPU time, not real time, disk delays due to input and output are not included instrumenting the code adds some book-keeping information collects samples from looking at the program counter times per second of run time (system dependend)

41 gprof gprof displays the following information: The parent of each procedure. An index number for each procedure. The percentage of CPU time taken by that procedure and all procedures it calls (the calling tree). A breakdown of time used by the procedure and its descendents. The number of times the procedure was called. The direct descendents of each procedure.

42 gprof output Flat profile: how many times each function got called, total times involved, sorted by time consumed Call-graph: each function, who called it, whom it called, and how many times

43 Profiler output What do do with profiler results? observe which methods are being called the most observe which methods are taking most time relative to others Profiler pitfalls: profiling slows down your code design your profiling tests to be very short it doesn't measure everything

44 Know when to stop Pareto Principle 20% of code takes 80% of execution time Procedure CPU time main() 13% procedure1() 17% procedure2() 20% procedure3() 50% A 20% increase in the performance of procedure3() results in a 10% performance increase overall. A 20% increase in the performance of main() results in only a 2.6% performance increase overall.

45 Optimization Flags to the compiler: -O0 reduce the cost of compilation and to make debugging produce the expected results [default] -O1 minimize code size with small speed optimizations -O2 maximize program speed, Intel default -O3 more aggressive optimizations than -O2, but not always better -Os optimize for size. Enables all -O2 optimizations that do not increase code size. Performs further optimizations designed to reduce code size. Note: on different compiler flags can include different kind of optimizations

46 Optimization optimization can affect debugging and profiling: Debugging & Profiling shows what is really in the code Optimization rearranging the code example: if you define a variable, but never use it

47 Measuring arithmetic intensity Arithmetic intensity: math operations per data movement flops/word High arithmetic intensity: program is computebound Reduce math operations or parallelize Low arithmetic intensity: program is bandwidthbound

48 Matrix-vector Product, example Matrix(n*k)-vector product: n*((k-1)+n) for( i=0; i<n; i++ ) for( j=0; j<k; j++) c[i] += a[i][j]*b[k];

49 Reporting computational experiments The goal is to communicate performance results To coworkers & yourself To an external audience The audience must be able to understand: What is being measured What you are showing them What it means

50 Reporting computational experiments Presentation of algorithms Complete description of the algorithm Specification of the domain of applicability of the algorithm If possible and relevant: Computational complexity in time and space If relevant: Convergence results If relevant: Rate of convergence, order of accuracy

51 Reporting computational experiments Traditional indicators are e.g. CPU time Number of iterations Operation count (flops) Memory footprint Numerical accuracy Scope, applicability Portability

52 Reporting computational experiments Presentation of implementation Programming language used Compiler name and options used Computer environment, system and OS (not a Linux machine with an AMD processor ) Input data Settings Provide a ready-to-run package!

53 Reporting computational experiments Presentation of experiments Objective of the experiment Description of problem generator/input data set (Why) Are the results general?

54 Reporting computational experiments Presentation of results, e.g. runtime: Description of how the timings were produced What parts of the code are included in the measurements? What is the variability of the timings? How is the variability handled?

55 Reporting computational experiments Before: make sure that you understand the task correctly plots must have a good quality (ex. use Matlab) think carefully what you exactly want to show all tables and figures must be numbered all tables and figures must have self-explanatory titles figures must have lables explain what tables and pictures means in the text give just relevant information give correctly references and links to the literature

56 Take-home message Use debuggers to find bugs in code Use profilers and timing utilities to: Understand what the computer is doing Expose performance bottlenecks Target your optimization effort Target your parallelization effort Do not optimize code before it is correct!

Computer Organization: A Programmer's Perspective

Computer Organization: A Programmer's Perspective Profiling Oren Kapah orenkapah.ac@gmail.com Profiling: Performance Analysis Performance Analysis ( Profiling ) Understanding the run-time behavior of programs What parts are executed, when, for how long

More information

Use Dynamic Analysis Tools on Linux

Use Dynamic Analysis Tools on Linux Use Dynamic Analysis Tools on Linux FTF-SDS-F0407 Gene Fortanely Freescale Software Engineer Catalin Udma A P R. 2 0 1 4 Software Engineer, Digital Networking TM External Use Session Introduction This

More information

Debugging and Profiling

Debugging and Profiling Debugging and Profiling Dr. Axel Kohlmeyer Senior Scientific Computing Expert Information and Telecommunication Section The Abdus Salam International Centre for Theoretical Physics http://sites.google.com/site/akohlmey/

More information

CS2141 Software Development using C/C++ Debugging

CS2141 Software Development using C/C++ Debugging CS2141 Software Development using C/C++ Debugging Debugging Tips Examine the most recent change Error likely in, or exposed by, code most recently added Developing code incrementally and testing along

More information

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured System Performance Analysis Introduction Performance Means many things to many people Important in any design Critical in real time systems 1 ns can mean the difference between system Doing job expected

More information

Timing programs with time

Timing programs with time Profiling Profiling measures the performance of a program and can be used to find CPU or memory bottlenecks. time A stopwatch gprof The GNU (CPU) Profiler callgrind Valgrind s CPU profiling tool massif

More information

ECE 454 Computer Systems Programming Measuring and profiling

ECE 454 Computer Systems Programming Measuring and profiling ECE 454 Computer Systems Programming Measuring and profiling Ding Yuan ECE Dept., University of Toronto http://www.eecg.toronto.edu/~yuan It is a capital mistake to theorize before one has data. Insensibly

More information

CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory

CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory Announcements CSCI-1200 Data Structures Spring 2016 Lecture 6 Pointers & Dynamic Memory There will be no lecture on Tuesday, Feb. 16. Prof. Thompson s office hours are canceled for Monday, Feb. 15. Prof.

More information

Valgrind. Philip Blakely. Laboratory for Scientific Computing, University of Cambridge. Philip Blakely (LSC) Valgrind 1 / 21

Valgrind. Philip Blakely. Laboratory for Scientific Computing, University of Cambridge. Philip Blakely (LSC) Valgrind 1 / 21 Valgrind Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) Valgrind 1 / 21 Part I Valgrind Philip Blakely (LSC) Valgrind 2 / 21 Valgrind http://valgrind.org/

More information

Performance Improvement. The material for this lecture is drawn, in part, from The Practice of Programming (Kernighan & Pike) Chapter 7

Performance Improvement. The material for this lecture is drawn, in part, from The Practice of Programming (Kernighan & Pike) Chapter 7 Performance Improvement The material for this lecture is drawn, in part, from The Practice of Programming (Kernighan & Pike) Chapter 7 1 For Your Amusement Optimization hinders evolution. -- Alan Perlis

More information

Time Measurement. CS 201 Gerson Robboy Portland State University. Topics. Time scales Interval counting Cycle counters K-best measurement scheme

Time Measurement. CS 201 Gerson Robboy Portland State University. Topics. Time scales Interval counting Cycle counters K-best measurement scheme Time Measurement CS 201 Gerson Robboy Portland State University Topics Time scales Interval counting Cycle counters K-best measurement scheme Computer Time Scales Microscopic Time Scale (1 Ghz Machine)

More information

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties Overview Course Overview Course theme Five realities Computer Systems 1 2 Course Theme: Abstraction Is Good But Don t Forget Reality Most CS courses emphasize abstraction Abstract data types Asymptotic

More information

Computer Systems A Programmer s Perspective 1 (Beta Draft)

Computer Systems A Programmer s Perspective 1 (Beta Draft) Computer Systems A Programmer s Perspective 1 (Beta Draft) Randal E. Bryant David R. O Hallaron August 1, 2001 1 Copyright c 2001, R. E. Bryant, D. R. O Hallaron. All rights reserved. 2 Contents Preface

More information

Profiling & Optimization

Profiling & Optimization Lecture 18 Sources of Game Performance Issues? 2 Avoid Premature Optimization Novice developers rely on ad hoc optimization Make private data public Force function inlining Decrease code modularity removes

More information

CptS 360 (System Programming) Unit 4: Debugging

CptS 360 (System Programming) Unit 4: Debugging CptS 360 (System Programming) Unit 4: Debugging Bob Lewis School of Engineering and Applied Sciences Washington State University Spring, 2018 Motivation You re probably going to spend most of your code

More information

Debugging with gdb and valgrind

Debugging with gdb and valgrind Debugging with gdb and valgrind Dr. Axel Kohlmeyer Associate Dean for Scientific Computing, CST Associate Director, Institute for Computational Science Assistant Vice President for High-Performance Computing

More information

The Role of Performance

The Role of Performance Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture The Role of Performance What is performance? A set of metrics that allow us to compare two different hardware

More information

Performance Optimization: Simulation and Real Measurement

Performance Optimization: Simulation and Real Measurement Performance Optimization: Simulation and Real Measurement KDE Developer Conference, Introduction Agenda Performance Analysis Profiling Tools: Examples & Demo KCachegrind: Visualizing Results What s to

More information

Performance Profiling

Performance Profiling Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Outline History Understanding Profiling Understanding Performance Understanding Performance

More information

19: I/O Devices: Clocks, Power Management

19: I/O Devices: Clocks, Power Management 19: I/O Devices: Clocks, Power Management Mark Handley Clock Hardware: A Programmable Clock Pulses Counter, decremented on each pulse Crystal Oscillator On zero, generate interrupt and reload from holding

More information

Performance Analysis. HPC Fall 2007 Prof. Robert van Engelen

Performance Analysis. HPC Fall 2007 Prof. Robert van Engelen Performance Analysis HPC Fall 2007 Prof. Robert van Engelen Overview What to measure? Timers Benchmarking Profiling Finding hotspots Profile-guided compilation Messaging and network performance analysis

More information

The course that gives CMU its Zip! Time Measurement Oct. 24, 2002

The course that gives CMU its Zip! Time Measurement Oct. 24, 2002 15-213 The course that gives CMU its Zip! Time Measurement Oct. 24, 2002 Topics Time scales Interval counting Cycle counters K-best measurement scheme class18.ppt Computer Time Scales Microscopic Time

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Profiling & Optimization

Profiling & Optimization Lecture 11 Sources of Game Performance Issues? 2 Avoid Premature Optimization Novice developers rely on ad hoc optimization Make private data public Force function inlining Decrease code modularity removes

More information

Data and File Structures Laboratory

Data and File Structures Laboratory Tools: GDB, Valgrind Assistant Professor Machine Intelligence Unit Indian Statistical Institute, Kolkata August, 2018 1 GDB 2 Valgrind A programmer s experience Case I int x = 10, y = 25; x = x++ + y++;

More information

CS3350B Computer Architecture CPU Performance and Profiling

CS3350B Computer Architecture CPU Performance and Profiling CS3350B Computer Architecture CPU Performance and Profiling Marc Moreno Maza http://www.csd.uwo.ca/~moreno/cs3350_moreno/index.html Department of Computer Science University of Western Ontario, Canada

More information

Performance Measurement

Performance Measurement ECPE 170 Jeff Shafer University of the Pacific Performance Measurement 2 Lab Schedule Activities Today / Thursday Background discussion Lab 5 Performance Measurement Next Week Lab 6 Performance Optimization

More information

Profiling and Workflow

Profiling and Workflow Profiling and Workflow Preben N. Olsen University of Oslo and Simula Research Laboratory preben@simula.no September 13, 2013 1 / 34 Agenda 1 Introduction What? Why? How? 2 Profiling Tracing Performance

More information

Scientific Programming in C IX. Debugging

Scientific Programming in C IX. Debugging Scientific Programming in C IX. Debugging Susi Lehtola 13 November 2012 Debugging Quite often you spend an hour to write a code, and then two hours debugging why it doesn t work properly. Scientific Programming

More information

Praktische Aspekte der Informatik

Praktische Aspekte der Informatik Praktische Aspekte der Informatik Moritz Mühlhausen Prof. Marcus Magnor Optimization valgrind, gprof, and callgrind Further Reading Warning! The following slides are meant to give you a very superficial

More information

Profiling and debugging. Carlos Rosales September 18 th 2009 Texas Advanced Computing Center The University of Texas at Austin

Profiling and debugging. Carlos Rosales September 18 th 2009 Texas Advanced Computing Center The University of Texas at Austin Profiling and debugging Carlos Rosales carlos@tacc.utexas.edu September 18 th 2009 Texas Advanced Computing Center The University of Texas at Austin Outline Debugging Profiling GDB DDT Basic use Attaching

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Module I: Measuring Program Performance

Module I: Measuring Program Performance Performance Programming: Theory, Practice and Case Studies Module I: Measuring Program Performance 9 Outline 10 Measuring methodology and guidelines Measurement tools Timing Tools Profiling Tools Process

More information

CSE 160 Discussion Section. Winter 2017 Week 3

CSE 160 Discussion Section. Winter 2017 Week 3 CSE 160 Discussion Section Winter 2017 Week 3 Homework 1 - Recap & a few points ComputeMandelbrotPoint func() in smdb.cpp does the job serially. You will have to do the same task in parallel manner in

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

Lecture 14 Notes. Brent Edmunds

Lecture 14 Notes. Brent Edmunds Lecture 14 Notes Brent Edmunds October 5, 2012 Table of Contents 1 Sins of Coding 3 1.1 Accessing Undeclared Variables and Pointers...................... 3 1.2 Playing With What Isn t Yours..............................

More information

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang

More information

MPI Performance Tools

MPI Performance Tools Physics 244 31 May 2012 Outline 1 Introduction 2 Timing functions: MPI Wtime,etime,gettimeofday 3 Profiling tools time: gprof,tau hardware counters: PAPI,PerfSuite,TAU MPI communication: IPM,TAU 4 MPI

More information

Computer Time Scales Time Measurement Oct. 24, Measurement Challenge. Time on a Computer System. The course that gives CMU its Zip!

Computer Time Scales Time Measurement Oct. 24, Measurement Challenge. Time on a Computer System. The course that gives CMU its Zip! 5-23 The course that gives CMU its Zip! Computer Time Scales Microscopic Time Scale ( Ghz Machine) Macroscopic class8.ppt Topics Time Measurement Oct. 24, 2002! Time scales! Interval counting! Cycle counters!

More information

Systems software design. Software build configurations; Debugging, profiling & Quality Assurance tools

Systems software design. Software build configurations; Debugging, profiling & Quality Assurance tools Systems software design Software build configurations; Debugging, profiling & Quality Assurance tools Who are we? Krzysztof Kąkol Software Developer Jarosław Świniarski Software Developer Presentation

More information

Performance Analysis

Performance Analysis Performance Analysis EE380, Fall 2015 Hank Dietz http://aggregate.org/hankd/ Why Measure Performance? Performance is important Identify HW/SW performance problems Compare & choose wisely Which system configuration

More information

Introduction to Parallel Performance Engineering

Introduction to Parallel Performance Engineering Introduction to Parallel Performance Engineering Markus Geimer, Brian Wylie Jülich Supercomputing Centre (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray) Performance:

More information

Tools and techniques for optimization and debugging. Fabio Affinito October 2015

Tools and techniques for optimization and debugging. Fabio Affinito October 2015 Tools and techniques for optimization and debugging Fabio Affinito October 2015 Profiling Why? Parallel or serial codes are usually quite complex and it is difficult to understand what is the most time

More information

6.S096: Introduction to C/C++

6.S096: Introduction to C/C++ 6.S096: Introduction to C/C++ Frank Li, Tom Lieber, Kyle Murray Lecture 4: Data Structures and Debugging! January 17, 2012 Today Memory Leaks and Valgrind Tool Structs and Unions Opaque Types Enum and

More information

Performance Evaluation. December 2, 1999

Performance Evaluation. December 2, 1999 15-213 Performance Evaluation December 2, 1999 Topics Getting accurate measurements Amdahl s Law class29.ppt Time on a Computer System real (wall clock) time = user time (time executing instructing instructions

More information

valgrind overview: runtime memory checker and a bit more What can we do with it?

valgrind overview: runtime memory checker and a bit more What can we do with it? Valgrind overview: Runtime memory checker and a bit more... What can we do with it? MLUG Mar 30, 2013 The problem When do we start thinking of weird bug in a program? The problem When do we start thinking

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Oracle Developer Studio Performance Analyzer

Oracle Developer Studio Performance Analyzer Oracle Developer Studio Performance Analyzer The Oracle Developer Studio Performance Analyzer provides unparalleled insight into the behavior of your application, allowing you to identify bottlenecks and

More information

The assignment requires solving a matrix access problem using only pointers to access the array elements, and introduces the use of struct data types.

The assignment requires solving a matrix access problem using only pointers to access the array elements, and introduces the use of struct data types. C Programming Simple Array Processing The assignment requires solving a matrix access problem using only pointers to access the array elements, and introduces the use of struct data types. Both parts center

More information

The control of I/O devices is a major concern for OS designers

The control of I/O devices is a major concern for OS designers Lecture Overview I/O devices I/O hardware Interrupts Direct memory access Device dimensions Device drivers Kernel I/O subsystem Operating Systems - June 26, 2001 I/O Device Issues The control of I/O devices

More information

Example: intmath lib

Example: intmath lib Makefile Goals Help you learn about: The build process for multi-file programs Partial builds of multi-file programs make, a popular tool for automating (partial) builds Why? A complete build of a large

More information

Massif-Visualizer. Memory Profiling UI. Milian Wolff Desktop Summit

Massif-Visualizer. Memory Profiling UI. Milian Wolff Desktop Summit Massif-Visualizer Memory Profiling UI Milian Wolff mail@milianw.de Desktop Summit 2011 08.08.2011 Agenda 1 Valgrind 2 Massif 3 Massif-Visualizer Milian Massif-Visualizer Desktop Summit Berlin 2/32 1 Valgrind

More information

CS 322 Operating Systems Practice Midterm Questions

CS 322 Operating Systems Practice Midterm Questions ! CS 322 Operating Systems 1. Processes go through the following states in their lifetime. time slice ends Consider the following events and answer the questions that follow. Assume there are 5 processes,

More information

What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control.

What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope. C Flow Control. C Flow Control David Chisnall February 1, 2011 Outline What the CPU Sees Basic Flow Control Conditional Flow Control Structured Flow Control Functions and Scope Disclaimer! These slides contain a lot of

More information

Performance Measuring on Blue Horizon and Sun HPC Systems:

Performance Measuring on Blue Horizon and Sun HPC Systems: Performance Measuring on Blue Horizon and Sun HPC Systems: Timing, Profiling, and Reading Assembly Language NPACI Parallel Computing Institute 2000 Sean Peisert peisert@sdsc.edu Performance Programming

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Cache Profiling with Callgrind

Cache Profiling with Callgrind Center for Information Services and High Performance Computing (ZIH) Cache Profiling with Callgrind Linux/x86 Performance Practical, 17.06.2009 Zellescher Weg 12 Willers-Bau A106 Tel. +49 351-463 - 31945

More information

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems

Processes. CS 475, Spring 2018 Concurrent & Distributed Systems Processes CS 475, Spring 2018 Concurrent & Distributed Systems Review: Abstractions 2 Review: Concurrency & Parallelism 4 different things: T1 T2 T3 T4 Concurrency: (1 processor) Time T1 T2 T3 T4 T1 T1

More information

Module 11: I/O Systems

Module 11: I/O Systems Module 11: I/O Systems Reading: Chapter 13 Objectives Explore the structure of the operating system s I/O subsystem. Discuss the principles of I/O hardware and its complexity. Provide details on the performance

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Performance Analysis and Debugging Tools

Performance Analysis and Debugging Tools Performance Analysis and Debugging Tools Performance analysis and debugging intimately connected since they both involve monitoring of the software execution. Just different goals: Debugging -- achieve

More information

CPU Structure and Function. Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition

CPU Structure and Function. Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition CPU Structure and Function Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition CPU must: CPU Function Fetch instructions Interpret/decode instructions Fetch data Process data

More information

Performance Tuning VTune Performance Analyzer

Performance Tuning VTune Performance Analyzer Performance Tuning VTune Performance Analyzer Paul Petersen, Intel Sept 9, 2005 Copyright 2005 Intel Corporation Performance Tuning Overview Methodology Benchmarking Timing VTune Counter Monitor Call Graph

More information

16 Sharing Main Memory Segmentation and Paging

16 Sharing Main Memory Segmentation and Paging Operating Systems 64 16 Sharing Main Memory Segmentation and Paging Readings for this topic: Anderson/Dahlin Chapter 8 9; Siberschatz/Galvin Chapter 8 9 Simple uniprogramming with a single segment per

More information

Both parts center on the concept of a "mesa", and make use of the following data type:

Both parts center on the concept of a mesa, and make use of the following data type: C Programming Simple Array Processing This assignment consists of two parts. The first part focuses on array read accesses and computational logic. The second part requires solving the same problem using

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Intel profiling tools and roofline model. Dr. Luigi Iapichino

Intel profiling tools and roofline model. Dr. Luigi Iapichino Intel profiling tools and roofline model Dr. Luigi Iapichino luigi.iapichino@lrz.de Which tool do I use in my project? A roadmap to optimization (and to the next hour) We will focus on tools developed

More information

Introduction. 1 Measuring time. How large is the TLB? 1.1 process or wall time. 1.2 the test rig. Johan Montelius. September 20, 2018

Introduction. 1 Measuring time. How large is the TLB? 1.1 process or wall time. 1.2 the test rig. Johan Montelius. September 20, 2018 How large is the TLB? Johan Montelius September 20, 2018 Introduction The Translation Lookaside Buer, TLB, is a cache of page table entries that are used in virtual to physical address translation. Since

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

CSCI-1200 Data Structures Fall 2015 Lecture 6 Pointers & Dynamic Memory

CSCI-1200 Data Structures Fall 2015 Lecture 6 Pointers & Dynamic Memory CSCI-1200 Data Structures Fall 2015 Lecture 6 Pointers & Dynamic Memory Announcements: Test 1 Information Test 1 will be held Monday, September 21st, 2015 from 6-:50pm. Students have been randomly assigned

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

CS533 Modeling and Performance Evaluation of Network and Computer Systems

CS533 Modeling and Performance Evaluation of Network and Computer Systems CS533 Modeling and Performance Evaluation of Network and Computer Systems Monitors (Chapter 7) 1 Monitors That which is monitored improves. Source unknown A monitor is a tool used to observe system Observe

More information

Detection and Analysis of Iterative Behavior in Parallel Applications

Detection and Analysis of Iterative Behavior in Parallel Applications Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University

More information

Various optimization and performance tips for processors

Various optimization and performance tips for processors Various optimization and performance tips for processors Kazushige Goto Texas Advanced Computing Center 2006/12/7 Kazushige Goto (TACC) 1 Contents Introducing myself Merit/demerit

More information

Evaluating Performance Via Profiling

Evaluating Performance Via Profiling Performance Engineering of Software Systems September 21, 2010 Massachusetts Institute of Technology 6.172 Professors Saman Amarasinghe and Charles E. Leiserson Handout 6 Profiling Project 2-1 Evaluating

More information

PAGE REPLACEMENT. Operating Systems 2015 Spring by Euiseong Seo

PAGE REPLACEMENT. Operating Systems 2015 Spring by Euiseong Seo PAGE REPLACEMENT Operating Systems 2015 Spring by Euiseong Seo Today s Topics What if the physical memory becomes full? Page replacement algorithms How to manage memory among competing processes? Advanced

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Processes and threads

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective. Part I: Operating system overview: Processes and threads ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part I: Operating system overview: Processes and threads 1 Overview Process concept Process scheduling Thread

More information

Memory Analysis tools

Memory Analysis tools Memory Analysis tools PURIFY The Necessity TOOL Application behaviour: Crashes intermittently Uses too much memory Runs too slowly Isn t well tested Is about to ship You need something See what your code

More information

Introduction to Computer Systems: Semester 1 Computer Architecture

Introduction to Computer Systems: Semester 1 Computer Architecture Introduction to Computer Systems: Semester 1 Computer Architecture Fall 2003 William J. Taffe using modified lecture slides of Randal E. Bryant Topics: Theme Five great realities of computer systems How

More information

CPU Structure and Function

CPU Structure and Function Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com http://www.yildiz.edu.tr/~naydin CPU Structure and Function 1 2 CPU Structure Registers

More information

Performance Measurement

Performance Measurement ECPE 170 Jeff Shafer University of the Pacific Performance Measurement 2 Lab Schedule Ac?vi?es Today Background discussion Lab 5 Performance Measurement Wednesday Lab 5 Performance Measurement Friday Lab

More information

Operating System. Operating System Overview. Structure of a Computer System. Structure of a Computer System. Structure of a Computer System

Operating System. Operating System Overview. Structure of a Computer System. Structure of a Computer System. Structure of a Computer System Overview Chapter 1.5 1.9 A program that controls execution of applications The resource manager An interface between applications and hardware The extended machine 1 2 Structure of a Computer System Structure

More information

Introduction to Computer Systems

Introduction to Computer Systems CSCE 230J Computer Organization Introduction to Computer Systems Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce230j Giving credit where credit is due Most of slides for

More information

Introduction to Computer Systems

Introduction to Computer Systems CSCE 230J Computer Organization Introduction to Computer Systems Dr. Steve Goddard goddard@cse.unl.edu Giving credit where credit is due Most of slides for this lecture are based on slides created by Drs.

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Instruction Set Architectures

Instruction Set Architectures Instruction Set Architectures ISAs Brief history of processors and architectures C, assembly, machine code Assembly basics: registers, operands, move instructions 1 What should the HW/SW interface contain?

More information

EE355 Lab 5 - The Files Are *In* the Computer

EE355 Lab 5 - The Files Are *In* the Computer 1 Introduction In this lab you will modify a working word scramble game that selects a word from a predefined word bank to support a user-defined word bank that is read in from a file. This is a peer evaluated

More information

Using Intel VTune Amplifier XE and Inspector XE in.net environment

Using Intel VTune Amplifier XE and Inspector XE in.net environment Using Intel VTune Amplifier XE and Inspector XE in.net environment Levent Akyil Technical Computing, Analyzers and Runtime Software and Services group 1 Refresher - Intel VTune Amplifier XE Intel Inspector

More information

Prof. Thomas Sterling

Prof. Thomas Sterling High Performance Computing: Concepts, Methods & Means Performance Measurement 1 Prof. Thomas Sterling Department of Computer Science Louisiana i State t University it February 13 th, 2007 News Alert! Intel

More information

Algorithms and Computation in Signal Processing

Algorithms and Computation in Signal Processing Algorithms and Computation in Signal Processing special topic course 18-799B spring 2005 20 th Lecture Mar. 24, 2005 Instructor: Markus Pueschel TA: Srinivas Chellappa Assignment 3 - Feedback Peak Performance

More information

Cache Performance Analysis with Callgrind and KCachegrind

Cache Performance Analysis with Callgrind and KCachegrind Cache Performance Analysis with Callgrind and KCachegrind VI-HPS Tuning Workshop 8 September 2011, Aachen Josef Weidendorfer Computer Architecture I-10, Department of Informatics Technische Universität

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

Time Measurement Nov 4, 2009"

Time Measurement Nov 4, 2009 Time Measurement Nov 4, 2009" Reminder" 2! Computer Time Scales" Microscopic Time Scale (1 Ghz Machine) Macroscopic Integer Add FP Multiply FP Divide Keystroke Interrupt Handler Disk Access Screen Refresh

More information

Cache Performance Analysis with Callgrind and KCachegrind

Cache Performance Analysis with Callgrind and KCachegrind Cache Performance Analysis with Callgrind and KCachegrind 21 th VI-HPS Tuning Workshop April 2016, Garching Josef Weidendorfer Computer Architecture I-10, Department of Informatics Technische Universität

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Simplifying the Development and Debug of 8572-Based SMP Embedded Systems. Wind River Workbench Development Tools

Simplifying the Development and Debug of 8572-Based SMP Embedded Systems. Wind River Workbench Development Tools Simplifying the Development and Debug of 8572-Based SMP Embedded Systems Wind River Workbench Development Tools Agenda Introducing multicore systems Debugging challenges of multicore systems Development

More information

Reporting Performance Results

Reporting Performance Results Reporting Performance Results The guiding principle of reporting performance measurements should be reproducibility - another experimenter would need to duplicate the results. However: A system s software

More information

ARCHER Single Node Optimisation

ARCHER Single Node Optimisation ARCHER Single Node Optimisation Profiling Slides contributed by Cray and EPCC What is profiling? Analysing your code to find out the proportion of execution time spent in different routines. Essential

More information