COSC 6385 Computer Architecture - Project


Slide 1: COSC 6385 Computer Architecture - Project
Edgar Gabriel, Spring 2018

Hardware performance counters
- A set of special-purpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems
- Low overhead compared to software-based methods
- The types and meanings of hardware counters vary from one kind of architecture to another due to the variation in hardware organizations

Some of the subsequent material is based on a tutorial by P. Mucci, S. Moore, N. Smeds, "Performance Tuning Using Hardware Counter Data", Supercomputing.

Slide 2: Performance Application Programming Interface (PAPI)
- Portable API to access the hardware performance monitor counters found on most modern microprocessors
- PAPI provides multiple interfaces to the underlying counter hardware:
  1. The low-level interface manages hardware events in user-defined groups called EventSets.
  2. The high-level interface simply provides the ability to start, stop and read the counters for a specified list of events.

PAPI High-level Interface

Slide 3: High-level Interface
- Meant for application programmers wanting coarse-grained measurements
- Not thread-safe
- Calls the low-level API
- Easier to use and requires less setup (additional code) than the low-level interface
- Allows only PAPI preset events: a standard set of events deemed most relevant for application performance tuning
- Run papi_avail to see the list of PAPI preset events available on a platform

High-level API (C interface):
- PAPI_start_counters()
- PAPI_read_counters()
- PAPI_stop_counters()
- PAPI_accum_counters()
- PAPI_num_counters()
- PAPI_flops()

Slide 4: Setting up the High-level Interface
- int PAPI_num_counters(void)
  - Initializes PAPI (if needed)
  - Returns the number of hardware counters
- int PAPI_start_counters(int *events, int len)
  - Initializes PAPI (if needed)
  - Sets up an event set with the given counters
  - Starts counting in the event set
- int PAPI_library_init(int version)
  - Low-level routine implicitly called by the above

Controlling the counters:
- PAPI_stop_counters(long_long *vals, int alen): stop counters and put counter values in array
- PAPI_accum_counters(long_long *vals, int alen): accumulate counters into array and reset
- PAPI_read_counters(long_long *vals, int alen): copy counter values into array and reset counters
- PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops): wall-clock time, process time, FP instructions since start, Mflop/s since last call
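A minimal self-contained sketch using PAPI_flops() (available in PAPI 5.x, assuming the floating-point presets are supported on the machine; do_work() is a stand-in kernel):

    #include <stdio.h>
    #include "papi.h"

    static void do_work(void)                 /* stand-in compute kernel */
    {
        volatile double s = 0.0;
        for (int i = 0; i < 1000000; i++)
            s += i * 0.5;
    }

    int main(void)
    {
        float rtime, ptime, mflops;
        long long flpins;

        /* First call initializes PAPI and starts the counters */
        if (PAPI_flops(&rtime, &ptime, &flpins, &mflops) != PAPI_OK)
            return 1;

        do_work();

        /* Second call reports values accumulated since the first call */
        if (PAPI_flops(&rtime, &ptime, &flpins, &mflops) != PAPI_OK)
            return 1;
        printf("real time: %f s, FP ins: %lld, rate: %f Mflop/s\n",
               rtime, flpins, mflops);
        return 0;
    }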

Slide 5: PAPI High-level Example

    #include "papi.h"
    #define NUM_EVENTS 2
    long_long values[NUM_EVENTS];
    int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};

    /* Start the counters */
    PAPI_start_counters(Events, NUM_EVENTS);

    do_work();   /* what we are monitoring */

    /* Stop the counters and store the results */
    retval = PAPI_stop_counters(values, NUM_EVENTS);

Return codes:

    Name             Description
    PAPI_OK          No error
    PAPI_EINVAL      Invalid argument
    PAPI_ENOMEM      Insufficient memory
    PAPI_ESYS        A system/C library call failed; check the errno variable
    PAPI_ESBSTR      Substrate returned an error, e.g. an unimplemented feature
    PAPI_ECLOST      Access to the counters was lost or interrupted
    PAPI_EBUG        Internal error
    PAPI_ENOEVNT     Hardware event does not exist
    PAPI_ECNFLCT     Hardware event exists, but resources are exhausted
    PAPI_ENOTRUN     Event or event set is currently not counting
    PAPI_EISRUN      Event or event set is currently running
    PAPI_ENOEVST     No event set available
    PAPI_ENOTPRESET  Argument is not a preset
    PAPI_ENOCNTR     Hardware does not support counters
    PAPI_EMISC       Any other error occurred
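Every PAPI call returns one of these codes; PAPI_strerror() (part of the PAPI API) converts a code into a readable message. A minimal sketch, continuing the example above:

    retval = PAPI_stop_counters(values, NUM_EVENTS);
    if (retval != PAPI_OK)
        fprintf(stderr, "PAPI error %d: %s\n", retval, PAPI_strerror(retval));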

Slide 6: PAPI Low-level Interface
- Increased efficiency and functionality over the high-level PAPI interface
- About 40 functions
- Obtain information about the executable and the hardware
- Thread-safe
- Fully programmable
- Callbacks on counter overflow

Slide 7: Low-level Functionality
- Library initialization: PAPI_library_init, PAPI_thread_init, PAPI_shutdown
- Timing functions: PAPI_get_real_usec, PAPI_get_virt_usec, PAPI_get_real_cyc, PAPI_get_virt_cyc
- Inquiry functions
- Management functions
- Simple lock: PAPI_lock / PAPI_unlock

Event sets
- An event set contains key information:
  - Which low-level hardware counters to use
  - The most recently read counter values
  - The state of the event set (running / not running)
  - Option settings (e.g., domain, granularity, overflow, profiling)
- Event sets can overlap if they map to the same hardware counter set-up; this allows inclusive/exclusive measurements

Slide 8: Event Set Operations
- Event set management: PAPI_create_eventset, PAPI_add_event[s], PAPI_rem_event[s], PAPI_destroy_eventset
- Event set control: PAPI_start, PAPI_stop, PAPI_read, PAPI_accum
- Event set inquiry: PAPI_query_event, PAPI_list_events, ...

Simple example:

    #include "papi.h"
    #define NUM_EVENTS 2
    int Events[NUM_EVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
    int EventSet = PAPI_NULL;
    long_long values[NUM_EVENTS];

    /* Initialize the library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);

    /* Allocate space for the new event set and do setup */
    retval = PAPI_create_eventset(&EventSet);

    /* Add flops and total cycles to the event set */
    retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);

    /* Start the counters */
    retval = PAPI_start(EventSet);

    do_work();   /* what we want to monitor */

    /* Stop counters and store results in values */
    retval = PAPI_stop(EventSet, values);

Slide 9: Overflow Handling
- Generates an overflow signal after every 'threshold' events have been counted
- Each counter has to be registered separately
- The value of each registered hardware counter is maintained separately
- (LONG_)LONG_MAX is 2,147,483,647 on 32 bit and 9,223,372,036,854,775,807 on 64 bit
- overflow_handler(): user-defined function to process overflow events; the function will be called by the PAPI library every time the threshold is reached
- overflow_vector: a bit-array that can be processed to determine which event(s) caused the overflow, e.g. using PAPI_get_overflow_event_index()
- Software vs. hardware overflow: if the processor does not support hardware overflow, software emulates it by periodically checking the counter values
  - Software overflow handling is less accurate and more expensive than hardware handling
  - Hardware overflow is often implemented using a zero-crossing algorithm: the counter is preloaded with the negative of the threshold and incremented until it crosses zero, which raises the interrupt
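A minimal sketch of registering an overflow handler with the low-level API (assumes PAPI 5.x; the threshold of 1,000,000 instructions and the do_work() kernel are placeholders, and error checking is omitted):

    #include <stdio.h>
    #include "papi.h"

    /* Called by the PAPI library every time the threshold is reached */
    void handler(int EventSet, void *address,
                 long long overflow_vector, void *context)
    {
        fprintf(stderr, "overflow at %p, vector 0x%llx\n",
                address, overflow_vector);
    }

    int EventSet = PAPI_NULL;
    long_long values[1];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&EventSet);
    PAPI_add_event(EventSet, PAPI_TOT_INS);

    /* Register the handler: fire after every 1,000,000 instructions */
    PAPI_overflow(EventSet, PAPI_TOT_INS, 1000000, 0, handler);

    PAPI_start(EventSet);
    do_work();
    PAPI_stop(EventSet, values);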

Slide 10: 1st Assignment

Rules
- Each student should deliver:
  - Source code (.c and .h files). Please: no .o files and no executables!
  - Documentation (pdf or docx formats accepted)
- Deliver electronically on Blackboard
- Expected by Tuesday, March 6, 11:59pm
- In case of questions: ask early, not the day before the submission is due

About the project
- You are given the source code for a matrix-multiply operation (file hw-matmul.c). The code contains a trivial implementation of the matrix-multiply operation and a blocked implementation
- The blocked implementation is executed with block sizes of 16, 32, 64 and 128
- You can compile the C file, e.g. with

    cc -O3 hw-matmul.c -o hw-matmul

- Once you have added the PAPI functions:

    cc -O3 hw-matmul.c -o hw-matmul -I/opt/papi/5.6.0/include -L/opt/papi/5.6.0/lib64 -lpapi

- Run: srun ./hw-matmul <matrix-dimension>

Slide 11
- Determine the execution time separately for each of the 5 versions of the matrix-multiply operation (trivial, block size 16, block size 32, block size 64, block size 128) for two matrix sizes (512 and 1024)
- Determine the number of L1 cache misses and the L1 cache miss rate separately for all 5 versions for both matrix sizes
- Determine the number of L2 cache misses and the L2 cache miss rate separately for all 5 versions for both matrix sizes
- Determine the number of L3 cache misses and the L3 cache miss rate separately for all 5 versions for both matrix sizes
- Add the required calls to the PAPI library to the code to determine these properties of the trivial implementation and of the blocked implementation for the different block sizes (one possible sketch follows below)
- Provide measurements for matrices of size 512 and 1024 on the whale cluster. Note that for development purposes you can of course run the code with much smaller matrices, e.g. 64
- Compare the numbers obtained both between the different implementations (e.g. block size x has a higher cache miss rate than block size y, but execution time is highest with block size z) and between matrix sizes (increasing the matrix size from 512 to 1024 increased the cache miss rate by a factor of k for block size x)
- Determine and document the cache hierarchy, sizes and characteristics of the processors used on the whale cluster (note: you can do that using PAPI)
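A minimal sketch of one way to instrument a single version of the multiply for time and L1 behavior (assuming the preset events PAPI_L1_DCM and PAPI_L1_DCA are available on whale; check with papi_avail. The function matmul_trivial() stands in for whichever version is being measured):

    #include <stdio.h>
    #include "papi.h"

    #define NUM_EVENTS 2
    int Events[NUM_EVENTS] = {PAPI_L1_DCM, PAPI_L1_DCA};  /* L1 data misses, accesses */
    long_long values[NUM_EVENTS];
    long_long t0, t1;

    PAPI_start_counters(Events, NUM_EVENTS);
    t0 = PAPI_get_real_usec();

    matmul_trivial(A, B, C, n);          /* placeholder for one of the 5 versions */

    t1 = PAPI_get_real_usec();
    PAPI_stop_counters(values, NUM_EVENTS);

    /* Miss rate = misses / accesses */
    printf("time: %lld us, L1 miss rate: %f\n",
           t1 - t0, (double)values[0] / (double)values[1]);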

Slide 12
- It is ok to submit different code for determining time, L1 cache behavior, L2 cache behavior and L3 cache behavior
- Make sure you run your tests multiple times, and document how often you ran them and whether you show the average, minimum, maximum, etc.
- Comment on your findings on how the parameter values change with the block sizes for each matrix size
- Both graphs and tables are ok for discussing your results

Notes
- The PAPI version installed on whale is 5.6.0
- On the front-end node you can find tons of examples in C and Fortran on how to use PAPI in /opt/papi/5.6.0/ctests, e.g.:
  - low-level.c -> how to use the low-level API of PAPI
  - high-level.c -> example for the high-level API
  - memory.c -> how to extract information about the memory subsystem (e.g. cache sizes)
  - overflow_index.c -> how to handle overflow correctly
- For compiling one of these examples:

    gcc -o high-level high-level.c -I/opt/papi/5.6.0/include -L/opt/papi/5.6.0/lib64/ -lpapi -ltestlib
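For the cache-hierarchy task, ctests/memory.c shows the full walk over the memory-subsystem description; a minimal sketch of the underlying call (PAPI_get_hardware_info() is part of the low-level API; check the exact struct fields against papi.h on whale):

    #include <stdio.h>
    #include "papi.h"

    PAPI_library_init(PAPI_VER_CURRENT);

    /* Returns a pointer to a PAPI-internal description of the machine */
    const PAPI_hw_info_t *hw = PAPI_get_hardware_info();
    if (hw != NULL) {
        printf("CPU: %s\n", hw->model_string);
        printf("memory hierarchy levels: %d\n", hw->mem_hierarchy.levels);
    }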

Slide 13: 1st Assignment

The documentation should contain:
- (Brief) problem description
- Solution strategy
- Results section
  - Description of resources used
  - Description of measurements performed
  - Results (graphs + findings)

The documentation should not contain:
- A replication of the entire source code (that's why you have to deliver the sources)
- Screen shots of every single measurement you made; actually, no screen shots at all
- The slurm output files

Slide 14: How to Use a Cluster
- A cluster usually consists of a front-end node and compute nodes
- You can log in to the front-end node using ssh (from Windows or Linux machines) with the login name and the password assigned to you
- The front-end node is meant for editing and compiling - not for running jobs!
- To run your job interactively during development:

    smith@whale:~> srun ./hw-matmul 64

How to use a cluster (II)
- Once your code is correct and you would like to do the measurements, you have to submit a batch job
- The command you need is sbatch, e.g.

    sbatch -N 1 ./measurements.sh

- Your job goes into a queue and will be executed as soon as a node is available
- You can check the status of your job with squeue:

    smith@whale:~> squeue
    JOBID  PARTITION  NAME  USER   ST  TIME  NODES  NODELIST(REASON)
    489    whale            smith  R   0:02  1      whale
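A minimal sketch of what measurements.sh might contain (the #SBATCH options shown are assumptions; adjust them to your needs):

    #!/bin/bash
    #SBATCH --job-name=hw-matmul
    #SBATCH --ntasks=1

    # Run the instrumented code for both required matrix sizes;
    # the output ends up in slurm-<jobid>.out
    srun ./hw-matmul 512
    srun ./hw-matmul 1024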

Slide 15: How to Use a Cluster (III)
- The output of squeue gives you a job-id for your job
- Once your job finishes, you will have a file called slurm-<jobid>.out in your home directory, which contains all the output of your printf statements etc.
- Note: the batch script used for the job submission (e.g. measurements.sh) has to be executable. This means that after you download it from the webpage and copy it to whale, you have to type

    chmod +x measurements.sh

- Please do not edit the measurements.sh file on MS Windows. Windows does not add the UNIX end-of-line markers, and this confuses slurm when reading the file.

Notes
- PAPI documentation:
- If you need hints on how to use a UNIX/Linux machine through ssh:
- How to use a cluster such as whale/crill: please use the crill documentation for this reference, since for this assignment we are operating the whale cluster as an HPC cluster, not Hadoop!
