Parallel architectures are enforcing the need of managing parallel software efficiently Sw design, programming, compiling, optimizing, running

Similar documents
An Extended GPROF Profiling and Tracing Tool for Enabling Feedback-Directed Optimizations of Multithreaded Applications

Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process

Measuring the impacts of the Preempt-RT patch

Process Concepts. CSC400 - Operating Systems. 3. Process Concepts. J. Sumey

Introduction to Operating Systems Prof. Chester Rebeiro Department of Computer Science and Engineering Indian Institute of Technology, Madras

Processes. Johan Montelius KTH

A process. the stack

Module 1. Introduction:

Recap: Thread. What is it? What does it need (thread private)? What for? How to implement? Independent flow of control. Stack

Review: Easy Piece 1

Operating systems. Lecture 3 Interprocess communication. Narcis ILISEI, 2018

Linux Driver and Embedded Developer

MULTITHREADING AND SYNCHRONIZATION. CS124 Operating Systems Fall , Lecture 10

Lecture Topics. Announcements. Today: Threads (Stallings, chapter , 4.6) Next: Concurrency (Stallings, chapter , 5.

Processes The Process Model. Chapter 2. Processes and Threads. Process Termination. Process Creation

Processes The Process Model. Chapter 2 Processes and Threads. Process Termination. Process States (1) Process Hierarchies

CSC 4320 Test 1 Spring 2017

CISC2200 Threads Spring 2015

Processes (Intro) Yannis Smaragdakis, U. Athens

USCOPE: A SCALABLE UNIFIED TRACER FROM KERNEL TO USER SPACE

OS Structure. User mode/ kernel mode (Dual-Mode) Memory protection, privileged instructions. Definition, examples, how it works?

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

CSI3131 Final Exam Review

3.1 Introduction. Computers perform operations concurrently

Processes Prof. James L. Frankel Harvard University. Version of 6:16 PM 10-Feb-2017 Copyright 2017, 2015 James L. Frankel. All rights reserved.

Multiprocessors 2007/2008

Systems Programming/ C and UNIX

Pre-lab #2 tutorial. ECE 254 Operating Systems and Systems Programming. May 24, 2012

Implementation of Processes

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

Process. Heechul Yun. Disclaimer: some slides are adopted from the book authors slides with permission

Processes and Threads

CS Lecture 3! Threads! George Mason University! Spring 2010!

CSCE 313 Introduction to Computer Systems. Instructor: Dezhen Song

Outline Background Jaluna-1 Presentation Jaluna-2 Presentation Overview Use Cases Architecture Features Copyright Jaluna SA. All rights reserved

CSCE 313: Intro to Computer Systems

CS 471 Operating Systems. Yue Cheng. George Mason University Fall 2017

Notes. CS 537 Lecture 5 Threads and Cooperation. Questions answered in this lecture: What s in a process?

Timers 1 / 46. Jiffies. Potent and Evil Magic

OS Structure. User mode/ kernel mode. System call. Other concepts to know. Memory protection, privileged instructions

Processes. Process Concept

* What are the different states for a task in an OS?

High Performance Computing Lecture 21. Matthew Jacob Indian Institute of Science

Operating Systems. Lecture 4. Nachos Appetizer

Last class: Today: Thread Background. Thread Systems

Processes and Threads

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Problem 1: Concepts (Please be concise)

What is an Operating System? A Whirlwind Tour of Operating Systems. How did OS evolve? How did OS evolve?

Processes & Threads. Today. Next Time. ! Process concept! Process model! Implementing processes! Multiprocessing once again. ! More of the same J

Process and Its Image An operating system executes a variety of programs: A program that browses the Web A program that serves Web requests

Threads Implementation. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Intel Threading Tools

PROCESS CONCEPTS. Process Concept Relationship to a Program What is a Process? Process Lifecycle Process Management Inter-Process Communication 2.

PROCESSES. Jo, Heeseung

Processes. Jo, Heeseung

Operating Systems: William Stallings. Starvation. Patricia Roy Manatee Community College, Venice, FL 2008, Prentice Hall

OS lpr. www. nfsd gcc emacs ls 1/27/09. Process Management. CS 537 Lecture 3: Processes. Example OS in operation. Why Processes? Simplicity + Speed

ECE 598 Advanced Operating Systems Lecture 23

Operating Systems. VI. Threads. Eurecom. Processes and Threads Multithreading Models

Performance analysis basics

Chapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues

CSE 410: Systems Programming

Semaphores. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

OPERATING SYSTEM. Chapter 4: Threads

Threads. Raju Pandey Department of Computer Sciences University of California, Davis Spring 2011

For use by students enrolled in #71251 CSE430 Fall 2012 at Arizona State University. Do not use if not enrolled.

Threads Implementation. Jo, Heeseung

Implementing Lightweight Threads

THREADS. Jo, Heeseung

POSIX PTHREADS PROGRAMMING

Concurrent Server Design Multiple- vs. Single-Thread

CS 571 Operating Systems. Midterm Review. Angelos Stavrou, George Mason University

Introduction. CS3026 Operating Systems Lecture 01

Threads. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

Problem Set 2. CS347: Operating Systems

Lecture 2 Process Management

An Overview of the BLITZ System

Agenda. Threads. Single and Multi-threaded Processes. What is Thread. CSCI 444/544 Operating Systems Fall 2008

CSCE Operating Systems Interrupts, Exceptions, and Signals. Qiang Zeng, Ph.D. Fall 2018

Applications, services. Middleware. OS2 Processes, threads, Processes, threads, communication,... communication,... Platform

ECE 598 Advanced Operating Systems Lecture 10

www nfsd emacs lpr Process Management CS 537 Lecture 4: Processes Example OS in operation Why Processes? Simplicity + Speed

Operating- System Structures

Chapter 5: Processes & Process Concept. Objectives. Process Concept Process Scheduling Operations on Processes. Communication in Client-Server Systems

Chap.6 Limited Direct Execution. Dongkun Shin, SKKU

Intel Thread Building Blocks

MARUTHI SCHOOL OF BANKING (MSB)

LINUX INTERNALS & NETWORKING Weekend Workshop

Real-Time Performance of Linux. OS Latency

Use Dynamic Analysis Tools on Linux

CS 153 Design of Operating Systems Winter 2016

Operating System Architecture. CS3026 Operating Systems Lecture 03

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Introduction to PThreads and Basic Synchronization

Linux multi-core scalability

Chapter 4 Multithreaded Programming

Debugging and Profiling

Parallelization Primer. by Christian Bienia March 05, 2007

Transcription:

S.Bartolini Department of Information Engineering University of Siena, Italy C.A. Prete Department of Information Engineering University of Pisa, Italy GREPS Workshop (PACT 07) Brasov, Romania. 16/09/2007 Parallel architectures are enforcing the need of managing parallel software efficiently Sw design, programming, compiling, optimizing, running Need to provide application-behavior feedback to compiler/linker to enable advanced optimizations e.g., L1 instruction cache opts through procedure remapping Monothreaded application: locality features (call graph) This simple information can be easily available, e.g. GNU GCC More sophisticated information is still not widely available The behavior of a parallel (multithreaded) application is harder to profile Multiple concurrent, interacting execution activities with possible dynamic activation Interaction typically mediated by the Operating System Complex applications runtime profiling overhead, probe eff. 1

Challenge: profile multithreaded (MT) applications enable the extension of existing feedback-directed compiler/linker opts, and, possibly, the design of new ones Desiderata: Easy to perform: Gprof-like user interface collect simple ( Gprof -like) stats on a per-thread basis Additional intra-thread stats Usage of synchronization mechanisms/api Collect inter-thread stats Execution ordering (cooperation) Blocking relationship, blocking time (competition) Idle time (competition for Hw/Os resources) Low overhead Enable profiles from real-execution GPROF intro Extensions for multithreading OS activity Synchro-tracing Results 2

Gprof capabilities Flat-profile: time profile of functions Not very precise and not suitable for parallel applications Function call graph Locality Basic block execution count Quite expensive in terms of overhead Modular approach: Selectable profiling level: overhead vs. detail Usage: Compile programs (or modules) with a special GCC flag (-pg) libraries Execute the application profiling output file Gprof tool to analyze the output index % time self children called name [1] 100.0 0.00 0.05 start [1] 0.00 0.05 1/1 main [2] 0.00 0.00 1/2 on_exit [28] 0.00 0.00 1/1 exit [59] -------------------------------------- 0.00 0.05 1/1 start [1] [2] 100.0 0.00 0.05 1 main [2] 0.00 0.05 1/1 report [3] -------------------------------------- 0.00 0.05 1/1 main [2] [3] 100.0 0.00 0.05 1 report [3] 0.00 0.03 8/8 timelocal [6] 0.00 0.01 1/1 print [9] 0.00 0.01 9/9 fgets [12] 0.00 0.00 12/34 strncmp <cycle 1> [40] 0.00 0.00 8/8 lookup [20] 0.00 0.00 1/1 fopen [21] 0.00 0.00 8/8 chewtime [24] 0.00 0.00 8/16 skipspace [44] 3

Call graph gathering: Instrumentation and Event count Compiler adds a function call (to mcount() ) at the very start of each function mcount is responsible of profiling its execution e.g. call-graph update: parameters(caller addr., callee addr.) Only the main thread is profiled main might do only initializations Possibly short and not representative behavior One profiling structure Life-cycle (allocation, deallocation and write to file) Performed at application start/end One profiling structure for each thread Allocation: Part is pre-allocated in chunks, on demand, in advance The association to the thread is done at thread creation time (fast, flexible) Flush By mcount itself, whenever the thread-specific structure is full Disable bit per thread, to avoid mcount recursion Deallocation Upon thread termination Actions: thread create and finalize patched similarly to process create and process exit in native Gprof Altered mcount to support MT 4

Call graph Relatively fixed size during execution Event history Sequence of pre-defined and/or programmer specified events, in program execution order e.g.: execution of specific instructions (like C statements) e.g., Track locality and/or usage of code sections Event record: {event_type, event_id, timestamp} Each record has a timestamp e.g.: TSC (Intel) or other mechanism Size increases with execution time Flushing needed Context switch between thread-1 and thread-2? The event log is not enough e.g.: OS time quantum expires, blocking operation, preemption, A thread cannot profile its behavior when it is not executing Only indirect, sometimes misleading information Information from the OS needed Linux Tracing Toolkit: is able to detect and log a number of events with limited overhead (less than 2.5%) and is easily customizable Syscalls, interrupts, fork, scheduling, filesystem ops, virtual mem ops, network We specialized LTT tracing Ability to limit to the application-related events Added a timestamp (same format as for intra-thread events) 5

Event Time PID Length Description #################################################################### Syscall exit 1042493319696474 N/A 7 Syscall entry 1042493319696492 N/A 12 SYSCALL : open; EIP : 0x0804A509 File system 1042493319696637 N/A 20 START I/O WAIT Sched change 1042493319696696 1191 19 IN : 1191; OUT : 1215; STATE : 2 File system 1042493319696707 1191 20 SELECT : 3; TIMEOUT : 0 File system 1042493319696713 1191 20 SELECT : 4; TIMEOUT : 0 File system 1042493319696716 1191 20 SELECT : 6; TIMEOUT : 0 File system 1042493319696718 1191 20 SELECT : 7; TIMEOUT : 0 File system 1042493319696722 1191 20 SELECT : 9; TIMEOUT : 0 File system 1042493319696724 1191 20 SELECT : 11; TIMEOUT : 0 Memory 1042493319696737 1191 12 PAGE FREE ORDER : 0 Syscall exit 1042493319696745 1191 7 Syscall entry 1042493319696994 1191 12 SYSCALL : read; EIP : 0x0804D028 File system 1042493319696997 1191 20 READ : 11; COUNT : 4096 Syscall exit 1042493319697012 1191 7 Trap entry 1042493319697722 1191 13 TRAP : page fault; EIP : 0x4126C309 Trap exit 1042493319697729 1191 7 Syscall entry 1042493319698007 1191 12 SYSCALL:gettimeofday;EIP: 0x0804D028 Syscall exit 1042493319698010 1191 7... File system 1042493319722332 1037 20 SELECT : 17; TIMEOUT : 12000 Timer 1042493319722334 1037 17 SET TIMEOUT : 12000 Sched change 1042493319722337 1191 19 IN : 1191; OUT : 1037; STATE : 1 File system 1042493319722340 1191 20 SELECT : 3; TIMEOUT : 2... Syscall entry 1042493319722926 1191 12 SYSCALL : write; EIP : 0x0804D028 File system 1042493319722927 1191 20 WRITE : 3; COUNT : 8 Socket 1042493319722929 1191 16 SO SEND; TYPE : 1; SIZE : 8 Process 1042493319722935 1191 16 WAKEUP PID : 1037; STATE : 1 Syscall exit 1042493319722939 1191 7 Sched change 1042493319722942 1037 19 IN : 1037; OUT : 1191; STATE : 0 File system 1042493319722946 1037 20 SELECT : 1; TIMEOUT : 12000 LTT is able to show the low-level, basic synchro-ops on which the OS relies Low-level mechanisms (far from programmer ones) Potential variability across OS versions The programmer is interested in analyzing, profiling, inspecting the synchronization mechanisms he has in his scope Semaphores, mutexes, condition variables, spinlocks, signals and going into the specificity of their semantics Whatever their implementation inside the OS Aims: Performance profiling (bottlenecks) Behavior assessment (testing) and verification Debugging Execution patterns Action: special events recorded in the thread event log Synchro API invocation trigger event log Modularity and flexibility original API wrapped inside the logging ones Event record: {Syncr.Type, Structure-Id (address), Op.Type, timestamp} 6

Information gathering Intra-thread: statistics and events Inter-thread: thread context switches with motivation (OS), other application-related OS events (e.g.: scheduling, blocking, page faults) Synchronization events per thread in program order Independent logging of different information Limited probe effect on parallel behavior Low overhead (time and memory) Post-processing of the independent stats/log files Possibility to dig into the data and extract complex information e.g.: semaphore usage pattern Low overhead is not magic :) Coarse grain data collection Independent profiling activities Carefully designed profiling data structure Kernel benchmarks Dining philosophers (10 100 load test) Simulation of vehicles in a crossroad (30 1000 cars) Client-server application (30 80 concurrent clients) Processor farm for matrix multiplication (6x6-6 threads 20x20-25 threads) Real multithreaded application MySQL Test procedures coming along the installation bundle: test-check, test-lock, testmyisam A couple of profiling structure size profile-a = 1 M-entry (event record) per thread, 500 pre-allocated structs profile-b = 100 entry per thread, 10 pre-allocated structs Various levels of profiling applic-only: only thread stats/events are collected full-sched: MT application + OS context-switch events full: MT application + full OS events 7

57 187 17 22 8 6 31 21 17:getenv 18:strlen 20:_dl_important_hwcaps 21:malloc 22:libc_internal_tsd_get 31:pthread_mutex_lock 56:pthread_cleanup_push _defflockfile 57:flockfile 7 15 4 15 8 1 44 7:sem_wait 8:pthread_lock 9:pthread_unloc 12:pthread_mutex_lock 15:pthread_mutex_unlock 33:sem_post 44:pthread_destroy_speci fi_exit 4 1 2 56 20 18 Thread 1024 ( main ) 1 1 15 3 9 9 12 33 Thread 6151 ( compute ) Function name Execution count Thread_id 4101 _IO_new_file_xsputn 13 errno_location 12 pthread_lock pthread_unlock 11 Thread_id 1024 Sem_getvalue 151725 _IO_new_file_xsputs 640 _pthread_cleanup_push_defflockfile 189 Intra-thread stats: (top) Call graphs: interaction between functions Processor farm: small execution (left) function execution count: Processor farm: load-condition Pettis&Hansen 1990 I-cache opt relies on this Already in gcc for mono-threaded appl. S Thread Funzione 16 1026 pthread_start_thread 17 1026 nanosleep C 5126 C 2051 C 7176 5 RS 8201 RS 4101 4 RS 9226 2 8 27 9 12 0 19 23 22 7 1 21 14 22 20 6 10 3 26 17 4 9 25 21 28 8 29 4 18 1026 vfprintf 19 2049 pthread_manager 20 2049 clone 21 2049 kill 22 2049 errno_location 23 2049 libc_read 24 2049 poll 25 2051 pthread_start_thread 26 2051 pthread_lock 27 2051 printf C 3076 RS 6151 25 16 21 24 28 2051 errno_location 29 2051 nanosleep 3 Inter-thread graph (thread-granul.) (client-server ) Inter-thread graph (function-granul.) (crossroad simulator) Function/Thread 4101 1026 2051 3076 MutexLock 3/0 3/0 3/1 3/0 MutexUnLock 3/- 3/- 3/- 3/- Wait 2/2 2/1 2/0 2/1 Post 2/- 2/- 2/- 2/- Lock Destroy 1/- 1/- 1/- 1/- Synchro struct usage (philosophers) Synchro struct usage count (philosophers) 8

Normalized execution time Normalized execution time 24/09/2007 Thread_id Not-running time (%) Not-running time (blocked) (%) 1026 72 30 2051 63 20 3076 71 31 4101 75 37 Thread non-running time (philosophers) Synchro struct usage (client-server) Ability to easily derive a variety of stats 1,10 1,08 1,06 1,04 1,02 1,040 1,050 1,061 1,051 1,065 1,082 Profile A applic-only Profile A full-sched Profile A full 1,07 1,06 1,05 1,04 Profile B applic-only 1,03 1,025 Profile B full-sched 1,02 Profile B full 1,010 1,01 1,051 1,021 1,042 1,062 Profile A applic-only Profile A full-sched Profile A full Profile B applic-only Profile B full-sched Profile B full 1,00 1,00 client-server (80 clients) philosophers (100 philosophers) Overall overhead is reasonable for runtime profiling Not much degradation for full vs. applic-only profile Relative overhead of small profile structure (B) vs. large (A) can be high But the absolute overhead value is still reasonable 9

Normalized execution time 24/09/2007 4,5 4,0 3,5 3,0 2,5 2,0 1,5 1,0 0,5 3,92 3,99 3,76 3,49 3,30 3,08 MySQL 1,27 1,35 1,14 1,21 1,01 1,04 Profile A applic-only Profile A full-sched Profile A full Profile B applic-only Profile B full-sched Profile B full 1,66 1,42 2,11 1,53 2,15 2,53 0,0 test-check test-lock test-myisam The overhead of a complex, real application can be higher Still reasonable for real-execution runtime profile Highlighted the limitations of Gprof in case of multithreaded applications per-thread information Thread relationship Thread-related OS activity Profile of programmer-level synchronization mechanism Proposal of a MT runtime profiling approach Gprof inspired code instrumentation Independent gathering of different information Application threads, OS Extension of LTT tool for Linux OS profiling Post-processing of raw, low-level profile to derive high-level information Intra-thread call graph, inter-thread call graph, synchro-structure usage pattern, thread idle/blocking time, Low-overhead: wide applicability Suitable to enable feed-back directed compiler optimizations 10

11