COSC 6374 Parallel Computation. Performance Oriented Software Design. Edgar Gabriel. Spring 2008. Amdahl's Law

COSC 6374 Parallel Computation
Performance Oriented Software Design
Spring 2008

Amdahl's Law

Describes the performance gain obtained by enhancing one part of the overall system (code, computer):

$$\text{Speedup} = \frac{\text{Performance of entire task using the enhancement}}{\text{Performance of entire task not using the enhancement}}$$

or, equivalently,

$$\text{Speedup} = \frac{\text{Execution time of the task not using the enhancement}}{\text{Execution time of the task using the enhancement}}$$

Amdahl's Law (II)

Amdahl's Law depends on two factors:
- the fraction of the execution time affected by the enhancement
- the improvement gained by the enhancement for this fraction

Thus

$$\text{Execution time}_{enh} = \text{Execution time}_{org} \left( (1 - \text{Fraction}_{enh}) + \frac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}} \right) \qquad (1)$$

$$\text{Speedup}_{overall} = \frac{\text{Execution time}_{org}}{\text{Execution time}_{enh}} = \frac{1}{(1 - \text{Fraction}_{enh}) + \dfrac{\text{Fraction}_{enh}}{\text{Speedup}_{enh}}} \qquad (2)$$

Amdahl's Law (III)

[Figure: Speedup total plotted against Speedup enhanced, one curve each for Fraction enhanced = 20%, 40%, 60%, 80%.]
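The overall-speedup formula above can be evaluated directly. The following is a minimal sketch in C; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Overall speedup according to Amdahl's Law.
 * fraction_enh: fraction of the original execution time that is enhanced (0..1)
 * speedup_enh:  speedup achieved for that fraction (> 0) */
double amdahl_speedup(double fraction_enh, double speedup_enh)
{
    assert(fraction_enh >= 0.0 && fraction_enh <= 1.0);
    assert(speedup_enh > 0.0);
    return 1.0 / ((1.0 - fraction_enh) + fraction_enh / speedup_enh);
}

/* Upper bound on the overall speedup as speedup_enh grows without limit.
 * Only defined for fraction_enh < 1. */
double amdahl_limit(double fraction_enh)
{
    assert(fraction_enh >= 0.0 && fraction_enh < 1.0);
    return 1.0 / (1.0 - fraction_enh);
}
```

For example, with 80% of the execution time enhanced by a factor of 4, `amdahl_speedup(0.8, 4.0)` evaluates to 1 / (0.2 + 0.2) = 2.5, and `amdahl_limit(0.8)` shows that no enhancement of that fraction can push the overall speedup beyond 5.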

Amdahl's Law (IV)

[Figure: speedup according to Amdahl's Law. Speedup total plotted against Fraction enhanced, one curve per value of Speedup enhanced (2, 4, ...).]

Three big questions
- Where do I spend the most time?
- How efficient are those routines?
- Where do we lose efficiency?

Where do we spend most time?

- Need to profile the application
- Standard tools in UNIX-like environments: gprof, valgrind
- Valgrind: a collection of tools to analyze an application at runtime
    --tool=memcheck: memory debugger
    --tool=cachegrind: estimate of the cache usage of an application
    --tool=callgrind: provides a trace of the function calls
- Most tools produce an output file (e.g. cachegrind.out.<pid>)
- kcachegrind: visualization tool for valgrind output files

How to determine the sources of overhead?

- Get detailed data for different sections of the routine
- Get an estimate of the number of operations executed within these sections

Scaling issues: for each of the n processes we might end up with
- a large number of time stamps (e.g. k per process)
- a large number of measurements per time stamp (e.g. m per time stamp): execution time of MPI functions, various PAPI counters, user-defined values

This leads to n * k * m data values for the performance analysis.

Data reduction for performance analysis

- Data reduction for the number of processes analyzed: find processors exposing the same behavior and focus the performance analysis on a single processor of each group
- Data reduction per process: eliminate the measurements exposing the same information
- Data reduction in time: find a small, typical cycle in the application and ignore the rest
- Automatic, statistical methods are inevitable

Where do we lose efficiency?

    valgrind --tool=cachegrind ./atf
    =================================================
    ==27050==
    ==27050== I   refs:      7,477,574,763
    ==27050== I1  misses:            1,856
    ==27050== L2i misses:            1,774
    ==27050== I1  miss rate:          0.00%
    ==27050== L2i miss rate:          0.00%
    ==27050==
    ==27050== D   refs:      3,663,973,777  (3,517,790,756 rd + 146,183,021 wr)
    ==27050== D1  misses:       89,705,595  (   85,089,836 rd +   4,615,759 wr)
    ==27050== L2d misses:       85,614,772  (   81,648,115 rd +   3,966,657 wr)
    ==27050== D1  miss rate:           2.4% (          2.4%   +         3.1%  )
    ==27050== L2d miss rate:           2.3% (          2.3%   +         2.7%  )
    ==27050==
    ==27050== L2 refs:          89,707,451  (   85,091,692 rd +   4,615,759 wr)
    ==27050== L2 misses:        85,616,546  (   81,649,889 rd +   3,966,657 wr)
    ==27050== L2 miss rate:            0.7% (          0.7%   +         2.7%  )
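The D1 and L2d miss rates in the listing above are ratios of misses to data references, which can be checked by hand. A small sketch in C, where `miss_rate_percent` is a hypothetical helper and not part of valgrind:

```c
#include <assert.h>

/* Miss rate in percent: misses per reference. */
double miss_rate_percent(unsigned long long misses, unsigned long long refs)
{
    assert(refs > 0);
    return 100.0 * (double) misses / (double) refs;
}
```

With the numbers from the listing, `miss_rate_percent(89705595ULL, 3663973777ULL)` gives roughly 2.45, in line with the reported D1 miss rate of 2.4%. Dividing the L2d misses by the D1 misses (85,614,772 / 89,705,595) gives about 95%, which suggests that almost every data access that missed in the L1 cache also missed in L2.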

PAPI hardware performance counters

- Modern processors expose some counters which give information about performance
- Limited number of counters
- The number of simultaneous counters and the supported combinations of hardware counters depend on the processor
- Available on most modern operating systems:
    Linux: requires recompiling the kernel
    Windows: works right away, however not very accurate due to some restrictions of the OS on context switches
- Requires modification of your source code to insert the PAPI calls

General Counters

    PAPI_FP_OPS     Floating point operations
    PAPI_TOT_CYC    Total cycles
    PAPI_HW_INT     Hardware interrupts

Instruction Counters

    PAPI_TOT_IIS    Instructions issued
    PAPI_TOT_INS    Instructions completed
    PAPI_INT_INS    Integer instructions
    PAPI_LD_INS     Load instructions
    PAPI_SR_INS     Store instructions
    PAPI_BR_INS     Branch instructions
    PAPI_VEC_INS    Vector/SIMD instructions
    PAPI_LST_INS    Load/store instructions completed
    PAPI_SYC_INS    Synchronization instructions completed

Slides based on a talk and courtesy of Andreas Knuepfer and Matthias Mueller, Center for Information Services and High Performance Computing, Technical University Dresden.

FP Instruction Counters

    PAPI_FP_INS     Floating point instructions
    PAPI_FML_INS    Floating point multiply
    PAPI_FAD_INS    Floating point add
    PAPI_FDV_INS    Floating point divide
    PAPI_FSQ_INS    Floating point square root
    PAPI_FNV_INS    Floating point inverse

Cache Counters

    PAPI_L[1|2|3]_[D|I|T]C[M|H|A|R|W]
        Cache level 1/2/3
        [D|I|T]: data/instruction/total cache
        [M|H|A|R|W]: misses/hits/accesses/reads/writes
    PAPI_L[1|2|3]_[LD|ST]M
        Cache level 1/2/3
        [LD|ST]: load/store misses
    PAPI_PRF_DM
        Data prefetch cache misses

PAPI manual example

    /* Events and values are assumed to be arrays of length NUM_EVENTS
       declared elsewhere; do_flops() stands for the code to be measured. */
    PAPI_library_init(PAPI_VER_CURRENT);

    /* query and set up the right events to monitor */
    if (PAPI_query_event(PAPI_FP_INS) == PAPI_OK) {
        Events[0] = PAPI_FP_INS;
    } else {
        Events[0] = PAPI_TOT_INS;
    }
    Events[1] = PAPI_TOT_CYC;

    PAPI_start_counters((int *) Events, NUM_EVENTS);

    /* Execute the real code */
    do_flops(num_flops);

    PAPI_read_counters(values, NUM_EVENTS);

[Figure: Vampir: Process Timeline]

[Figure: Example: low FP rate due to FP exceptions]

Consequences for software design

Multi-dimensional allocations in C/C++. Typical code sequence:

    double **matrix;
    matrix = (double **) malloc(dim1 * sizeof(double *));
    for (i = 0; i < dim1; i++) {
        matrix[i] = (double *) malloc(dim2 * sizeof(double));
    }

The memory allocated this way might not be contiguous, which lowers performance.

Consequences for software design

Alternative allocation technique:

    double **matrix;
    double *data;
    data = (double *) malloc(dims1 * dims2 * sizeof(double));
    matrix = (double **) malloc(dims1 * sizeof(double *));
    for (i = 0; i < dims1; i++) {
        matrix[i] = &(data[i * dims2]);   /* row i starts after i rows of dims2 elements */
    }

Consequences for software design

The inner loop should run over the last (rightmost) index of multi-dimensional arrays in C/C++, since the elements of a row are adjacent in memory.

Correct version:

    for (i = 0; i < dims1; i++) {
        for (j = 0; j < dims2; j++) {
            matrix[i][j] = ...;
        }
    }

Wrong version:

    for (j = 0; j < dims2; j++) {
        for (i = 0; i < dims1; i++) {
            matrix[i][j] = ...;
        }
    }
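The contiguous allocation technique can be wrapped into a pair of helper functions that also handle allocation failure and cleanup. This is a sketch only; the names `alloc_matrix` and `free_matrix` are illustrative and do not appear in the slides.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a dims1 x dims2 matrix whose elements are contiguous in memory.
 * Returns NULL if either allocation fails. */
double **alloc_matrix(size_t dims1, size_t dims2)
{
    assert(dims1 > 0 && dims2 > 0);
    double *data = malloc(dims1 * dims2 * sizeof *data);
    double **matrix = malloc(dims1 * sizeof *matrix);
    if (data == NULL || matrix == NULL) {
        free(data);
        free(matrix);
        return NULL;
    }
    /* Row i starts dims2 elements after row i-1. */
    for (size_t i = 0; i < dims1; i++)
        matrix[i] = &data[i * dims2];
    return matrix;
}

/* Release a matrix obtained from alloc_matrix(). */
void free_matrix(double **matrix)
{
    if (matrix != NULL) {
        free(matrix[0]);   /* matrix[0] points at the start of the data block */
        free(matrix);
    }
}
```

Because all rows live in one block, `matrix[i][j]` and `matrix[0][i * dims2 + j]` refer to the same element, and only two calls to `free` are needed regardless of the number of rows.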

What shall you do if one variable requires access along the rows and one variable along the columns?

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            for (k = 0; k < dim; k++)
                c[i][j] += a[i][k] * b[k][j];

Blocked code versions optimize cache usage:

    /* assumes dim is a multiple of block */
    for (i = 0; i < dim; i += block)
        for (j = 0; j < dim; j += block)
            for (k = 0; k < dim; k += block)
                for (ii = i; ii < (i + block); ii++)
                    for (jj = j; jj < (j + block); jj++)
                        for (kk = k; kk < (k + block); kk++)
                            c[ii][jj] += a[ii][kk] * b[kk][jj];

Comparison operators

- Comparing integer values is orders of magnitude faster than comparing strings
- Map options to integers and use if or switch statements
- Avoid strcmp or similar functions wherever possible

Avoid unnecessary memory copy operations

- Minimizing the memory footprint improves cache behavior
- Pass pointers to a subroutine instead of making a copy of the data array
- This might however have a negative impact on loops within the subroutine, since the compiler does not know the boundaries of the array/loop
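The blocked loop nest above only works when the matrix dimension is a multiple of the block size. A possible variant that clamps the inner loop bounds is sketched below; as an assumption for the sake of a self-contained example, it stores the matrices as flat row-major arrays instead of the `c[i][j]` form used on the slide, and the names `matmul_blocked` and `min_size` are illustrative.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Computes c += a * b for dim x dim matrices stored row-major in flat arrays.
 * Works for any block > 0, including the case where dim is not a multiple
 * of block, because the inner loop bounds are clamped to dim. */
void matmul_blocked(size_t dim, size_t block,
                    const double *a, const double *b, double *c)
{
    assert(block > 0);
    for (size_t i = 0; i < dim; i += block)
        for (size_t j = 0; j < dim; j += block)
            for (size_t k = 0; k < dim; k += block)
                for (size_t ii = i; ii < min_size(i + block, dim); ii++)
                    for (size_t jj = j; jj < min_size(j + block, dim); jj++)
                        for (size_t kk = k; kk < min_size(k + block, dim); kk++)
                            c[ii * dim + jj] += a[ii * dim + kk] * b[kk * dim + jj];
}
```

The caller is expected to zero `c` beforehand if a plain product is wanted. A suitable value for `block` depends on the cache sizes of the target machine and is best determined by measurement.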

Object structures

Rule of thumb: it is better to have an object containing a vector of data than a vector of objects with one data point each:
- fewer indirections
- better cache usage
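The two layouts in this rule of thumb might look as follows in C. This is an illustrative sketch; the type and function names (`point_obj`, `point_set`, `sum_objects`, `sum_set`) are assumptions made for the example and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Vector of objects: each object holds a single data point. */
struct point_obj {
    double value;
    int    id;       /* unrelated field stored next to the value */
};

/* One object containing a vector of data. */
struct point_set {
    size_t  n;
    double *values;  /* all values contiguous in memory */
};

/* One pointer dereference per element; the objects may be scattered in memory. */
double sum_objects(const struct point_obj *const *objs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += objs[i]->value;
    return sum;
}

/* A single indirection, then a linear walk over contiguous doubles. */
double sum_set(const struct point_set *set)
{
    assert(set != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < set->n; i++)
        sum += set->values[i];
    return sum;
}
```

Both functions compute the same result; the second one touches only the data it needs and reads it sequentially, which is the access pattern caches and prefetchers handle best.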
