BenchIT Performance Measurement and Comparison for Scientific Applications


Guido Juckeland a, Stefan Börner a, Michael Kluge a, Sebastian Kölling a, Wolfgang E. Nagel a, Stefan Pflüger a, Heike Röding a, Stephan Seidl a, Thomas William a, Robert Wloch a

a Center for High Performance Computing, Dresden University of Technology, Dresden, Germany

Introduction

Contrary to common belief, performance evaluation is an art. [1] With an increasing variety of operation fields, from office applications to data-massive, high-performance computing with very different user demands, the programmer's know-how of program optimization, the choice of the compiler version, and the usage of the compiler options have an important influence on the runtime. Current and future microprocessors offer a variety of different levels of parallel processing in combination with an increasing number of intelligently organized functional units and a deeply staged memory hierarchy.

Traditional benchmarks (e.g. [2,3]) highlight only a few aspects of the performance behavior. Often computer architects, system designers, software developers and decision makers want to have more detailed information about the performance of the whole system than only one or a few values of a performance metric. This paper introduces BenchIT, a tool created by the Center for High Performance Computing Dresden to accompany the performance evaluator.

This art of performance evaluation actually contains two steps: performance measurement as well as data validation and comparison. BenchIT's modular design, therefore, consists of three layers (as shown in figure 1): the measuring kernels, a main program for the measurements, and a web based graphing engine to plot and compare the gathered data.

Figure 1: Components of the BenchIT Project (components shown: the webserver for displaying and comparing results, which reads the resultfile; the main program, which runs the measurement, writes the resultfile and provides interface.h; the kernel, which provides the algorithm and fulfills interface.h)
What distinguishes this project is the concept of splitting the evaluation into exactly the two steps mentioned above, which makes it flexible enough to be used for any kind of performance measurement. The Center for High Performance Computing Dresden presents the established infrastructure for this project, which is designed to give the HPC community easy access to a variety of performance measurements and which is easily extendable by one's own measurements and, especially, one's own measuring kernels.

1. Measuring Environment

The BenchIT measuring environment is especially designed for the hazardous conditions on all kinds of measuring platforms. Once all factors that vary between machines are set aside, only two utilities can be relied upon: a shell and a compiler. The BenchIT measuring environment deliberately restricts itself to those two in order to achieve the highest compatibility. The environment on a given operating system is set up by a number of cascading shell scripts that compile the measuring kernel, link it to a main program, and execute the measuring run.

Some common definitions are placed in one small file named COMMONDEFS. This script provides the base name of the directory, the nodename, and the hostname of the machine as environment variables used by the main program. The next file used by each kernel is the file ARCHDEFS, which provides a basic set of system variables depending on the operating system of the machine. They look like the following:

    # Default settings for Linux (excerpt)
    if [ "${uname_minus_s}" = "Linux" ]; then
      HAVE_CC=1
      HAVE_F77=1
      HAVE_F90=0
      HAVE_MPI=1
      CC="cc"
      CC_C_FLAGS="$CC_C_FLAGS -Wall -Werror -Waggregate-return -Wcast-align"
      CC_C_FLAGS_STD="-O2"
      CC_C_FLAGS_HIGH="-O3"
      LIB_PTHREAD="-lpthread"
    fi

These default values enable BenchIT to run on a standard installation of the operating systems included. Nevertheless, each user might want to set machine specific variables. This is possible by defining a set of LOCALDEFS. The LOCALDEFS file is named after the nodename of the machine it runs on and holds exactly the same variables as already defined in the ARCHDEFS file, thus allowing easy customization. Additionally, the LOCALDEFS directory accommodates the two input files for each node. They are named <nodename>_input_architecture and <nodename>_input_display and allow large sections of the output-file (see 2.1) to be filled in, since they are simply copied into the output-files. The last part of the environment consists of the variables used in the shell script of the kernel itself, which usually sets some kernel specific values or overwrites already existing variables (from ARCHDEFS or LOCALDEFS).

2. Module Interfaces

In between the three BenchIT program layers stand two interface files. They ensure that the modules have a common basis to work together. The result-file, also called output-file, is transferred to the BenchIT webserver after it has been created on the local machine. The file interface.h is used as a common basis in the compilation and linking of one measurement run.
The following provides a more detailed view of these two interfaces.

2.1. The Output-File

One way to explain the results of a measuring kernel is to collect all the relevant data in a structured output file. This idea was realized in the BenchIT output-files, which are saved in the subdirectory output. They are coded in ASCII format for easy viewing and editing. The different parts of the structure are bounded by the keywords beginofxxxxx and endofxxxxx and are introduced in the following.

Measurement Information

This part of the output-file includes a kernel-string as a short description of the measuring kernel (for example "Fortran dot product"), a timestamp, a comment, the programming language, the compiler used and its compiler flags, and minima and maxima for the x- and y-values. Additionally, the string code-sequence (for example "do i=1,n# sum=sum+x(i)*y(i)#enddo") shows the characteristic feature of this measuring program.
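The code-sequence string in the example above appears to use '#' to separate source lines. That is an assumption drawn from this single example, not a rule stated in the text. Under that assumption, a minimal shell sketch can expand the string back into readable lines (the variable names are illustrative and not part of BenchIT):

```shell
#!/bin/sh
# Assumption: '#' in the code-sequence string marks a line break.
codesequence='do i=1,n# sum=sum+x(i)*y(i)#enddo'

# Replace each '#' with a newline and strip leading blanks from each line.
expanded=$(printf '%s\n' "$codesequence" | tr '#' '\n' | sed 's/^[ ]*//')

# Count the resulting source lines.
line_count=$(printf '%s\n' "$expanded" | wc -l | tr -d ' ')

printf '%s\n' "$expanded"
```

Only tr and sed are needed, which keeps the sketch within the shell-only spirit of the measuring environment.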

Architecture

Important architectural statements are the node-name and the host-name. Output-files will not be accepted on the project homepage [6] without this information. A collection of architectural information was designed as a guideline for this part of the output-file, first to explain the measurement results and further to identify the machine the measurement ran on. The following characteristics are included (selection): mainboard manufacturer, mainboard type, mainboard chipset, processor name and clock rate, processor serial number, processor version, instruction set architecture and its level, several instruction set architecture extensions, instruction length, processor word length, and the number of integer, floating point, and load-store units. The cache hierarchy is described by the sizes, organization and location of the caches. To characterize the memory system, information about the memory chip type used, the memory bus type and the clock rate is necessary.

Display

This section holds all information needed to set up the plotting engine to display the results contained in the output file. This includes axis texts and labels for all measured functions, the axis setup (linear or logarithmic), and the boundaries of the plotting range. Additionally, information from the sections Measurement Information and Architecture can be placed in the graph.

Identifier-Strings

This section relates easily readable strings prepared for the web menu to all identifier-strings in the output-file, for example "ISA Extension" to the identifier-string processorisaextension2.

Data

The measured physical values are stored in the data section in a two-dimensional ordering: the first value per row is the x value, followed by as many y values as there are measuring functions inside the kernel. Each new x value generates a new row. All values (integers or floating point numbers) are represented as ASCII coded decimal strings.

The design of the output-files is not static.
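The row layout of the data section can be illustrated with a small shell sketch. The sample file below is hypothetical: the keywords beginofdata and endofdata follow the beginofxxxxx/endofxxxxx pattern but are assumed names, as are the whitespace-separated columns and all numbers. The sketch extracts the rows between the two keywords, derives the number of measured functions from the first row, and finds the x value at which the first function peaks:

```shell
#!/bin/sh
# Hypothetical output-file fragment; keyword names, separators and values
# are assumptions made for illustration only.
outfile=$(mktemp)
cat > "$outfile" <<'EOF'
beginofmeasurementinfos
kernelstring="Fortran dot product"
endofmeasurementinfos
beginofdata
100 1.0e+08 9.0e+07
200 2.5e+08 2.0e+08
400 4.0e+08 3.5e+08
800 3.0e+08 2.8e+08
endofdata
EOF

# Keep only the rows between the data keywords.
data=$(awk '/^beginofdata/ { in_data = 1; next }
            /^endofdata/   { in_data = 0 }
            in_data        { print }' "$outfile")

# Each row is "x y1 y2 ...", so the number of functions is the field count minus one.
rows=$(printf '%s\n' "$data" | wc -l | tr -d ' ')
nfuncs=$(printf '%s\n' "$data" | awk 'NR == 1 { print NF - 1 }')

# x value at which the first measured function (column 2) is largest.
best_x=$(printf '%s\n' "$data" |
  awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; x = $1 } END { print x }')

echo "rows=$rows functions=$nfuncs best_x=$best_x"
rm -f "$outfile"
```

For the sample above the script prints rows=4 functions=2 best_x=400. Because every value is a plain ASCII decimal string, standard text tools are sufficient to inspect a result without the web interface.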
Additional parts may be added to the output-file during the further development of the BenchIT project.

2.2. The File interface.h

The two data acquisition layers of the BenchIT project are linked through the C header file interface.h. It defines an info structure in which a kernel provides information about itself. Furthermore, it specifies the functions called by the main program and the service functions to be used by the kernels.

Info structure

Some elements are used to fill out the output file, such as kernelstring, kernellibraries (e.g. PThread, MPI, BLAS), codesequence, axis texts and properties, and legend texts. The main program itself needs a few more details about the kernel, e.g. maxproblemsize, numfunctions, outlier_direction_upwards for the error correction by the main program, and kernel_execs_XXX, which allow an adaptation to the kind of parallelism the kernel wants to execute.

Interface functions

The main program uses the functions bi_getinfo, bi_init, bi_entry, and bi_cleanup: first to inform itself about the kernel to run, then to initialize the kernel, then to run the measurements for various problem sizes, and finally to clean up files and memory used by the kernel. Furthermore, the main program provides two tool functions, bi_gettime and bi_strdup.

3. Module Components

Having introduced the BenchIT module layer interfaces, the paper now turns to the BenchIT modules themselves. BenchIT consists of three module layers: the kernels, the main program, and the website. Each layer offers different services, which are presented together with the modules themselves in the following.

3.1. The Kernels

Within this project a kernel is an algorithm or measuring program. Typical examples are a matrix multiplication or the Jacobi algorithm. Programming a kernel demands a certain discipline from the kernel author. Since BenchIT is to run on a variety of computation platforms, the kernel code has to be compatible with all of them. This can best be accomplished by:

- using only basic program structures,
- avoiding system calls and system specific operations (footnote 1), and
- utilizing the functions provided by the main program.

The professed goal of the BenchIT team is to have every kernel distributed with BenchIT executable on every platform. Nevertheless, it is possible, and no less valued, to write a problem specific kernel. A typical use for this strategy might be the optimization of a certain algorithm on a specific target architecture. As of today, the following kernels are included in the BenchIT package:

- MPI performance measurement (Roundtrip-Message and Binary-Tree-Broadcast, programmed in C),
- performance measurement for the Jacobi algorithm (sequential in C and Java; parallel in Java using Java threads and in C using PThreads),
- matrix multiplication (sequential in C, Fortran 77, and Java; parallel in Fortran 77 using MPI),
- performance measurement for calculating the dot product of large vectors (sequential in Fortran 77; parallel in C using PThreads),
- performance measurement for the mathematical operations sine, cosine, and square root (sequential in C, Java and Fortran 77),
- memory bandwidth (sequential in C), and
- IO performance such as write rate and read rate for small and large files (parallel in C using PThreads).

Every BenchIT user is also able, and invited, to act as an author of a kernel. A custom kernel can be sent to the BenchIT team and will be added to the kernel set if it is considered useful and complies with the kernel rules.

3.2. The Main Program

The first service module within the BenchIT layers is the main program for the measurement. It controls the generation of measurement data by the kernels, offers them service routines (see 2.2), and writes the resultfile (see 2.1). The main program has to operate (just as the kernels) under a wide variety of system environments. However, the environment of the operating system is just one part of this variety. Another issue is the runtime environment. Since BenchIT supports, among others, MPI as a parallel environment, the main program has to adapt itself to that as well (footnote 2). One might argue that it would also be feasible to have different main programs for each runtime environment, yet the BenchIT designers considered this an unnecessary code redundancy, especially since using just one main file has so far been practicable.

One measurement run follows the scheme shown in figure 2. During the measurement the main program calls the kernel with a certain problem size. This is just an internal value and need not correspond directly to the size used in the actual measurement (footnote 3). The translation is done by the kernel. The main program also contains an error correction for the kernels, since performance differences during a measurement run for one problem size, caused by other system processes running on the CPU, are inevitable. BenchIT thus uses the following approach: one problem size is measured n times (footnote 4). Each kernel informs the main program in the init routine whether the outliers of each function are to be expected upwards or downwards. BenchIT then uses the best value of the n runs.

After measuring, the main program analyzes the gathered data. In this step minima and maxima are determined and useful display boundaries are calculated. Furthermore, some environment variables (see 1) are gathered and the two computer specific input files are opened. With all this done, the main program writes the output file (see 2.1) as well as a gnuplot file used by the local QUICKVIEW.

Figure 2: Schematic view of one measurement run (initialize program and kernel; measure one problem size; while there is still time left, measure again; otherwise analyze the data and write the result and quickview files)

Footnote 1: If system calls become necessary, they will have to conform to the POSIX [4] standard.
Footnote 2: In the case of MPI this is done by compiling the main program with the -DUSE_MPI option.
Footnote 3: The internal problem size might be the same as the external one in the case of a matrix multiply, but it could also be scaled by a certain factor.

3.3. The Webserver

The BenchIT web interface [6] complements the BenchIT project by making it possible to plot the results of the measuring kernels and compare them directly. It is the distinguishing feature of the project and allows access to all measurement data with just an internet browser.

3.3.1. Specification

The webserver manages the output-files (see 2.1) uploaded by the registered users. They are held as ASCII files as well as entries in a PostgreSQL database. The PHP web pages use the database to assemble a plot and then write instructions for gnuplot [5], which produces an EPS file that can be downloaded directly (as done in figure 3). Additionally, a JPEG image is created and displayed on the website. It is specified that all kinds of measuring data can be displayed in one graph. The only limitation is that the data may use at most two different units (e.g. FLOPS, seconds, or a number of hits or misses), since gnuplot can display at most two different y-axes. Another important question is how the plots are assembled and how the user can customize them. The BenchIT team has so far implemented two strategies:

Selection by architectural characteristics

The first possibility is to compare different values of one architectural feature.
This shows how sensitive the results of the measuring kernels are to the physical size of one architectural feature. In this way one can look up performance data for a particular architectural feature and compare it across architectures.

Selection by the measuring kernel

The second possibility compares different architectural characteristics, all measured by just one measuring kernel. It can be considered the fast path for adapting a plot, since a plot can be customized in just three steps.

3.3.2. The construction of the BenchIT web interface

The BenchIT web interface consists of two parts: an open and a restricted section. The measurement data is only accessible after registering on the website. This is also a security measure, since it makes it traceable who uploaded which output-file. At the moment only registered users can download the measurement program, because BenchIT is still under development. New accounts are first locked automatically and then unlocked by the web interface administrators. All output-files uploaded to the webserver are backed up daily, which ensures the availability of the data. Additional security measures are implemented so that data classified as non-disclosure can be uploaded and viewed only by one user or a group of users.

Figure 3: The graph for a matrix multiplication (plot title: Matrix Multiply; x-axis: matrix size; y-axis: Flops; one curve for each of the loop orders ijk, ikj, jik, jki, kij, kji)

Footnote 4: The n is set by the compiler option -DERROR_CORRECTION=n-1.

4. First Results of the Project

The project has been running for one year now and most of the immediate goals have been achieved. The measurement (as shown in figure 4) is so flexible that adapting it to a new platform is a matter of filling out one configuration file. The kernels run on all platforms that provide the necessary compilers and libraries. The webserver is well capable of administering and plotting the files. It has been especially designed to work without JavaScript to allow the greatest browser compatibility. After first attempts without a database to support the server in managing the resultfiles for plotting, it was decided that a database for the arrangement of the plots is necessary to achieve acceptable response times on the website.

    Guido@bluerabbit ~/benchit/src/kernel/matmul_c $ ./SUBDIREXEC.SH
    No definitions for your operating system found in ARCHDEFS.
    You will have to set them manually in your LOCALDEFS.
    BenchIT will not run without at least one set of definitions
    No definitions for your operating system found in ARCHDEFS.
    You will have to set them manually in your LOCALDEFS.
    BenchIT will not run without at least one set of definitions
    Warning: the variable 'ENVIRONMENT' is not set
    using NOTHING as default
    BenchIT: Getting info about kernel [ OK ]
    BenchIT: Getting starting time [ OK ]
    BenchIT: Selected kernel: Matrix Multiply
    BenchIT: Initializing kernel [ OK ]
    BenchIT: Allocating memory for results [ OK ]
    BenchIT: Measuring..
    BenchIT: Total time limit reached. Stopping measurement.
    BenchIT: Analyzing results [ OK ]
    BenchIT: Writing resultfile [ OK ]
    BenchIT: Wrote output to "matmul_c0_amk7_1g33_2003_08_15 15_55.bit"
    BenchIT: Writing quickview file [ OK ]
    BenchIT: Finishing [ OK ]
    rm: cannot unlink `matmul_c': No such file or directory
    Guido@bluerabbit ~/benchit/src/kernel/matmul_c $

Figure 4: Output of one measurement run

5. Summary and Outlook

The BenchIT kernels generate a large amount of measurement results, depending on the number of function arguments. Using the web interface, the user can show selected results of different measuring programs in a single coordinate system. Often there are different reasons that can cause characteristic minima, maxima, or a special shape in a graph. It is necessary to collect additional information about the tested system to explain such effects on the basis of well-known system properties and physical values of the implementation. The BenchIT project wants to provide such an evaluation platform by offering a variety of measurement kernels as well as an easily accessible plotting engine, thus enabling an easy way to measure performance on a specific system and to compare the result, which is a full graph instead of just a number, to results contributed by other users.

The further development of the BenchIT project will take place on all module layers. A GUI for the configuration of the measurements is under development; it will provide an easier way to handle the measurements by partially replacing the shell scripts that have run the measurements up to this point. The power of the PCL will be utilized to access more measurement data. Furthermore, an additional way to plot the data on the website by using Java applets and Java graphing tools is planned. The BenchIT project aims to be more than just another tool in the art of performance analysis, yet it will have to prove itself to be a very powerful one.

REFERENCES

[1] Raj Jain: The Art of Computer Systems Performance Analysis. John Wiley, Chichester
[2] Standard Performance Evaluation Corporation (SPEC):
[3] LINPACK:
[4] IEEE POSIX:
[5] Gnuplot:
[6] The BenchIT Webserver:
