Performance Profiling

Similar documents
Performance analysis basics

Profiling: Understand Your Application

Performance Tuning VTune Performance Analyzer

Profiling and Workflow

Introduction to Parallel Performance Engineering

In examining performance Interested in several things Exact times if computable Bounded times if exact not computable Can be measured

ECE 454 Computer Systems Programming Measuring and profiling

Introduction to Performance Tuning & Optimization Tools

Operating Systems (2INC0) 2018/19. Introduction (01) Dr. Tanir Ozcelebi. Courtesy of Prof. Dr. Johan Lukkien. System Architecture and Networking Group

Jackson Marusarz Intel Corporation

Evaluating Performance Via Profiling

Lecture 9 Dynamic Compilation

Zing Vision. Answering your toughest production Java performance questions

Performance Evaluation of Virtualization Technologies

Chapter 1 Computer System Overview

CSE 141 Summer 2016 Homework 2

Review: Creating a Parallel Program. Programming for Performance

Stanislav Bratanov; Roman Belenov; Ludmila Pakhomova 4/27/2015

Computer Organization: A Programmer's Perspective

Performance Analysis. HPC Fall 2007 Prof. Robert van Engelen

Method-Level Phase Behavior in Java Workloads

Performance Optimization: Simulation and Real Measurement

Optimize Data Structures and Memory Access Patterns to Improve Data Locality

Intel VTune Amplifier XE

CS533 Modeling and Performance Evaluation of Network and Computer Systems

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Lecture #10 Context Switching & Performance Optimization

What is an Operating System? A Whirlwind Tour of Operating Systems. How did OS evolve? How did OS evolve?

G Robert Grimm New York University

Virtual Memory Outline

Visual Profiler. User Guide

HPC Lab. Session 4: Profiler. Sebastian Rettenberger, Chaulio Ferreira, Michael Bader. November 9, 2015

Main Memory (Part I)

Parallel Performance and Optimization

Intel VTune Performance Analyzer 9.1 for Windows* In-Depth

Operating System. Operating System Overview. Structure of a Computer System. Structure of a Computer System. Structure of a Computer System

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

Profiling & Optimization

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

OPERATING SYSTEM. Functions of Operating System:

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

File Systems. OS Overview I/O. Swap. Management. Operations CPU. Hard Drive. Management. Memory. Hard Drive. CSI3131 Topics. Structure.

Cycle Time for Non-pipelined & Pipelined processors

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ARCHER Single Node Optimisation

Chapter 8: Virtual Memory. Operating System Concepts

Tools and techniques for optimization and debugging. Fabio Affinito October 2015

Trace-based JIT Compilation

Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis

CSE 4/521 Introduction to Operating Systems. Lecture 12 Main Memory I (Background, Swapping) Summer 2018

Module I: Measuring Program Performance

15 Sharing Main Memory Segmentation and Paging

Virtual Memory COMPSCI 386

I/O Systems (3): Clocks and Timers. CSE 2431: Introduction to Operating Systems

CSC630/CSC730 Parallel & Distributed Computing

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

JAVA PERFORMANCE. PR SW2 S18 Dr. Prähofer DI Leopoldseder

Profiling & Optimization

Memory management. Requirements. Relocation: program loading. Terms. Relocation. Protection. Sharing. Logical organization. Physical organization

Performance and Optimization Issues in Multicore Computing

Introduction to Computer Systems and Operating Systems

Memory Management. Minsoo Ryu. Real-Time Computing and Communications Lab. Hanyang University.

Memory Management Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University

Background. Contiguous Memory Allocation

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2.

Scheduling Mar. 19, 2018

First-In-First-Out (FIFO) Algorithm

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001

Design Issues 1 / 36. Local versus Global Allocation. Choosing

Computer Systems A Programmer s Perspective 1 (Beta Draft)

Systems software design. Software build configurations; Debugging, profiling & Quality Assurance tools

Optimisation. CS7GV3 Real-time Rendering

ECE 571 Advanced Microprocessor-Based Design Lecture 2

Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support. Kyle C. Hale and Peter Dinda

Kampala August, Agner Fog

Computer System Overview

Intel Parallel Studio: Vtune

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Principles of Operating Systems

Virtual Memory. Reading: Silberschatz chapter 10 Reading: Stallings. chapter 8 EEL 358

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CS399 New Beginnings. Jonathan Walpole

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

16 Sharing Main Memory Segmentation and Paging

Chapter 6: CPU Scheduling. Operating System Concepts 9 th Edition

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018

ECE 477 Digital Systems Senior Design Project. Module 10 Embedded Software Development

Batch Jobs Performance Testing

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

ECE 571 Advanced Microprocessor-Based Design Lecture 7

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

DB2 is a complex system, with a major impact upon your processing environment. There are substantial performance and instrumentation changes in

Receive Livelock. Robert Grimm New York University

Parallel Performance and Optimization

2

Chapter 8: Main Memory

Capacity Estimation for Linux Workloads. David Boyes Sine Nomine Associates

Chapter 8 Virtual Memory

Computer System Overview

Corrections made in this version not in first posting:

Transcription:

Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr

Outline History Understanding Profiling Understanding Performance Understanding Performance Profiling Performance Profiling Types Performance Counters Event Based Profiling Statistical Profiling Instrumentation Profiling Hypervisor/Simulator Performance Profiling Tools 2 2

History

History 1970 Usually based on timer interrupts which recorded the program status word (PSW) at set timer-intervals to detect "hot spots" in executing code 1980 Instruction-set simulators permitted full trace and other performancemonitoring features in 1974 Profiler-driven program analysis on Unix dates back to at least 1979, when unix systems included a basic tool 1990 Prof which listed each function and how much of program execution time it used Gprof extended the concept to a complete call graph analysis in 1982 The ATOM(analysis tools with OM) platform converts a program into its own profiler It inserts code into the program to be analyzed: instrumentation profiling That inserted code outputs analysis data, at compile time 4 4

Understanding Profiling

Understanding Profiling What is measured Time Control flow Loop counts, Function calls Aliasing facts Cache stats Allocation information Track allocation sites for objects Hardware stats Granularity of what is measured Instructions, basic blocks, line of code, function, modules 6 6

Understanding Profiling How are measurements taken Instrumentation The code is modified to take the measurements When? Interruption At compile time: static time At runtime: dynamic time An outside event triggers inspection and measurement Who? Hardware Timer Another thread 7 7

Understanding Profiling When are measurements taken All the time Expensive Sampling Cheaper When? N_th function call N_th basic block Timer Some property of the hardware 8 8

Understanding Performance

Understanding Performance We need visibility into what the machine is doing To understand low-level performance We mostly only control data and code We need to know how it interacts with HW But to make an existing algorithm/data-structure work better Sometimes it is enough to know time Algorithm-level fix, data-structure-level fix Part of performance measurements CPU SQL DB Memory User Input Network Disk 10 10

Understanding Performance Aspects of performance Availability Availability of a system is typically measured as a factor of its reliability As reliability increases, so does availability Response time Response time is the total amount of time it takes to respond to a request for service The response time is the sum of three numbers Throughput Service time + wait time + transmission time Throughput is the rate of production or the rate at which something can be processed 11 11

Understanding Performance Aspects of performance Latency Latency is a time delay between the cause and the effect of some physical change in the system being observed. Latency is a result of the limited velocity with which any physical interaction can take place. Scalability This velocity is always lower or equal to speed of light Scalability is the ability of a system or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth 12 12

Understanding Performance Profiling

Understanding Performance Profiling What is a hotspot Where in an application or system there is a significant amount of activity Significant: activity that occurs infrequently probably does not have much impact on system performance Activity: time spent or other internal processor event Examples of other events: cache misses, branch mispredictions, floating-point instructions retired, partial register stalls, and so on 14 14

Understanding Performance Profiling In theory, the " algorithm " and " data structure " can be predicted through knowledge. but, performance profiling requires precise measurements for the possible prediction results Even though " the same problem ", substantially significant differences occur The hotspot can be found through precise measurements Performance profiling programming Knowledge of all the layers involved Experience in knowing when and how performance can be a problem Skill in detecting and zooming in on the problems A good dose of common sense 15 15

Performance Profiling Types

Performance Profiling Types Performance Counters Event Based Profiling Statistical Profiling Instrumentation Profiling Hypervisor/Simulator 17 17

Performance Counters Performance counters We need to count the events Program a counter to count a specific event HW provides special programmable counters Software can read and write the counters What's wrong with performance counters Should be solved the course-granularity problem Extend performance counters with a programmable limit value Cause an interrupt when the counter overflows 18 18

Event Based Profiling Event based profiling is triggered by any one of the software-based events or any performance monitor event that occurs on the processor Event-based profiling: advantages The routine addresses are visible when interrupts are disabled The ability to vary the profiling event The ability to vary the sampling frequency Event-based profiling: disadvantages Limited data collection Per run Types of data In-exact Sample point and event cause don t match 19 19

Statistical Profiling A method of profiling an application by taking samples of the program address at regular intervals while the application is executing Cycle accurate profiling methods These samples are then associated with either a specific function or a specific memory range A simple statistical analysis is performed to determine the areas where an applications spends the largest portions of its cycles Statistical Profiling is a very quick and handy way to get a first look at which functions are consuming the largest proportions of cycles 20 20

Statistical Profiling The statistical profiling: advantages No installation required Wide coverage Low overhead The statistical profiling: disadvantages Approximate precision Limited report 21 21

Instrumentation Profiling A method of profiling how to get the information by inserting an additional command to the code The performance impact depends on the amount of detail in the data type To measure the execution frequency of each line VS execution times of the function Method Programmer to insert code by hand Automatic code insertion tool How to insert the code with the help of the compiler The final non-binary conversion code Modify the binaries to run immediately before Operate in protected mode while running 22 22

Instrumentation Profiling The instrumentation profiling: advantages Perfect accuracy Where you visit immediately before and after your Can calculate how much time you spent at each site How many times you visited each site The instrumentation profiling: disadvantages Low granularity Too coarse High overhead High touch I have to build all those area, which expands the space in each site you visit 23 23

Hypervisor/Simulator Hypervisor/Simulator Hypervisor: data are collected by running the unmodified program under a hypervisor Simulator: data collected interactively and selectively by running the unmodified program under an instruction set simulator 24 24

Performance Profiling Tools

Performance Profiling Tools In order to do Performance counters Event based profiling Statistical profiling Instrumentation profiling Hypervisor/Simulator You can use Perf, Oprofile Ruby, JIT Oprofile, Vtune Gprof, Valgrind SIMMON, OLIVER Other tools exist for Network, Disk IO 26 26