Improving graphics processing performance using Intel Cilk Plus
- Marjory Harris
- 6 years ago
Introduction

Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords to implement task parallelism and an array notation syntax to express data parallelism. This article demonstrates how to improve the performance of a graphics processing program using Intel Cilk Plus. To demonstrate the performance increase, you will use a program that converts a bitmap file from a color image to a Sepia tone image. A Sepia tone image is a monochromatic image with a distinctive brown-gray tint, reminiscent of the tone photographs had in the era of black-and-white film. The program works by converting each pixel in the bitmap file to a Sepia tone.

Overview

A Sepia filter converts a color image to a duotone image with a dark brown-gray color. The filter converts each color pixel using a formula (shown as an image in the original article) in which each output component is a weighted sum of the input components, where R, G, and B are the red, green, and blue values of each pixel in the input image and Rs, Gs, and Bs are the corresponding pixel values in the output image. This is a highly data-parallel algorithm: the value of each pixel at (i,j) in the output image depends only on the pixel at (i,j) in the input image. This makes it an ideal candidate for Single Instruction Multiple Data (SIMD) exploitation, where multiple data items in a loop are loaded into vector registers and operated on simultaneously by a single instruction. Below are the bitmap file before and after the Sepia transformation:
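The formula itself appears as an image in the original article. As a sketch, a scalar version of the per-pixel conversion using the commonly cited Sepia coefficients (an assumption — the article's exact coefficients may differ) looks like this:

```cpp
#include <algorithm>
#include <cstdint>

struct rgb { uint8_t b, g, r; };  // 24-bit BMP stores pixels as B, G, R

// Scalar Sepia kernel: each output channel is a weighted sum of the
// input channels, clamped to the 8-bit range. The weights below are
// the widely used Sepia coefficients; they are an illustrative
// assumption, since the article's formula is only shown as an image.
static inline rgb sepia(rgb in) {
    float r = in.r, g = in.g, b = in.b;
    rgb out;
    out.r = (uint8_t)std::min(255.0f, 0.393f * r + 0.769f * g + 0.189f * b);
    out.g = (uint8_t)std::min(255.0f, 0.349f * r + 0.686f * g + 0.168f * b);
    out.b = (uint8_t)std::min(255.0f, 0.272f * r + 0.534f * g + 0.131f * b);
    return out;
}
```

Because the output of each pixel depends only on that pixel's input, a loop over this kernel has no cross-iteration dependences, which is exactly what makes it SIMD friendly.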
We will look at the performance of the serial implementation of the Sepia filter algorithm above, and then create an Intel Cilk Plus implementation of the filter to improve the filter's performance through the vectorization and parallelization features supported by the Intel C++ Compiler.

Optimization Steps

We will start the optimization process by performing the following steps:

1. Establish a performance baseline by building and running the serial version of the Sepia filter with the default Visual Studio* compiler and default options (Release build).
2. Rebuild the project with the Intel C++ Compiler with default options to get a performance boost (Release build).
3. Implement the filter using Intel Cilk Plus array notation.
4. Introduce thread-level parallelism using the Intel Cilk Plus cilk_for construct.
5. Replace the Array of Structures (AOS) implementation with a Structure of Arrays (SOA) implementation to improve performance further.

System Requirements

To compile and run the examples and exercises in this document you will need Intel C++ Composer XE 2013 Update 1 or higher, and an Intel Pentium 4 processor or higher with support for the Intel SSE2 instruction extensions or higher. The exercises in this document were tested on a third-generation Intel Core
i5 system supporting 256-bit vector registers. The instructions in this document show you how to build and run the examples with Microsoft Visual Studio*. A Visual Studio* 2008 project is provided to allow using the examples with older versions of Visual Studio*. The examples can also be built from the command line on Windows*, Linux*, and Mac OS* X using the following commands:

Windows*: icl /Qvec-report2 /Qrestrict /fp:fast SepiaFilterCilkPlus.cpp
Linux* and Mac OS* X: icc -vec-report2 -restrict -fp-model fast SepiaFilterCilkPlus.cpp

For the system requirements for Linux* and Mac OS* X, please refer to the Intel C++ Composer XE 2013 Release Notes.

NOTE: The sample code used in this article reads only RGB images (24-bit format) with the .bmp extension. Three sample images of different sizes are attached with this solution.

Locating the Samples

To build the sample code, open the SepiaFilter-CilkPlus.zip archive attached. Use these files for this tutorial:

- Sample input images RGB_Lines.bmp, test.bmp, and blackbuck.bmp, located in the SepiaFilterCilkPlus directory inside the zip file
- SepiaFilterCilkPlus.sln
- SepiaFilterCilkPlus.cpp
- SepiaFilterCilkPlus.h

Open the Microsoft Visual Studio* solution file, SepiaFilterCilkPlus.sln, and follow the steps below to prepare the project for the exercises in this document:

1. Select the Release Win32 configuration.
2. Clean the solution by selecting Build > Clean Solution.
You just deleted all of the compiled and temporary files associated with this solution. Cleaning a solution ensures that the next build is a full build rather than an incremental build of existing files.

Contents of the Source Code

The program has a main function that takes the input file and output file as command-line arguments and invokes the read_process_write() function. This function reads the .bmp input file. It first reads the header information from the input image file, which describes the type of image, any compression, and the width and height of the input image. Once this information is known, a dynamic data structure is created and the payload image data is copied into it for further processing at the pixel level. In this program, both Array of Structures (AOS) and Structure of Arrays (SOA) versions of the data structures are implemented and their performance is compared. The main Sepia filter kernel is named process_image(), and depending on the macro defined during compilation, the corresponding implementation of the Sepia kernel is enabled (for instance, the array notation version or the cilk_for version, implemented using the SOA and AOS data structures).

To run the executable from the command line, use the following syntax:

<executable> <input file> <output file>

The input image and output image can be supplied as command-line arguments in Visual Studio* as follows: right-click on the project > Properties > Configuration Properties > Debugging > Command Arguments.
Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, build your project with the Microsoft* C++ compiler in Visual Studio* (on Windows*). Run the executable (Debug > Start Without Debugging). Running the program opens a window that displays the program's execution time in clock ticks. Record the execution time reported in the output.

Building the Project with the Intel C++ Compiler

Convert the project to use the Intel C++ Compiler. To do this, right-click on the solution and select Intel Composer XE 20XX > Use Intel C++
The XX above refers to the version of Intel Composer XE (e.g., 2011 or 2013) installed on your system. Once the project is converted to an Intel project, follow the steps below to set the project properties:

1. Select Project > Properties > C/C++ > General > Suppress Startup Banner > No.
Click Language [Intel C++] > Recognize The Restrict Keyword > Yes (/Qrestrict).

The Intel C++ Compiler supports the restrict keyword for C++ even though it is a C99 extension. This qualifier can be applied to a data pointer to indicate that data accessed through that pointer will not alias data accessed through other pointers. The restrict keyword thus enables the compiler to perform certain optimizations based on the premise that a given object cannot be changed through another pointer. You must ensure that restrict-qualified pointers are used as they are intended to be used; otherwise, undefined behavior may result.

2. Select Project > Properties > C/C++ > Optimization > Optimization > Maximize Speed (/O2).
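The aliasing guarantee that /Qrestrict enables can be illustrated with a small sketch (not from the article's source). Standard C++ has no restrict keyword, so this sketch uses the __restrict compiler extension, which GCC, Clang, MSVC, and the Intel compiler all accept; with /Qrestrict set, the Intel compiler additionally accepts plain restrict:

```cpp
#include <cstddef>

// The restrict-qualified pointers promise the compiler that dst and
// src never overlap, so it can vectorize this loop without emitting
// runtime overlap checks or falling back to a scalar path.
void scale(float *__restrict dst, const float *__restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}
```

If dst and src did in fact overlap, this code would have undefined behavior — the promise is the programmer's responsibility, exactly as the paragraph above warns.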
3. Select Project > Properties > C/C++ > Diagnostics [Intel C++] > Vectorizer Diagnostic Level > Loops Successfully and Unsuccessfully Vectorized (2) (/Qvec-report2).
4. Select Project > Properties > C/C++ > Code Generation > Floating Point Model > Fast (/fp:fast).
5. Select Project > Properties > C/C++ > Code Generation [Intel C++] > Add Processor-Optimized Code Path > Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX).
6. Vectorization can improve performance significantly for most applications, and it is enabled by default in the Intel C++ Compiler. To see the performance impact of vectorization on our Sepia filter, let's disable vectorization temporarily and observe the runtime performance. To do this, select Project > Properties > C/C++ > Command Line and add /Qno-vec.
Rebuild the project and run the executable. Record the execution time reported in the output.

7. Now let's re-enable vectorization by removing the /Qno-vec option. Rebuild the project, run the executable, and record the execution time reported in the output. You should see improved performance due to vectorization. This is the baseline against which subsequent improvements will be measured.

When establishing the baseline performance it is good practice to compare the vec-report2 results between the -O2 and -O3 optimization levels, because more vectorization candidates tend to appear at -O3. For this example, however, the -O2 and -O3 results are the same.

SepiaFilterCilkPlus.cpp(202): (col. 2) remark: LOOP WAS VECTORIZED.

The vectorization report indicates that the loop at the above line number in SepiaFilterCilkPlus.cpp was vectorized. This is the for loop that is the call site of the process_image() function, which in this case happens to be inlined; the compiler vectorized the function body using the SIMD registers. The original serial implementation uses an Array of Structures (AOS) layout, which is not vectorization friendly due to the non-sequential memory accesses inherent in the algorithm. Often, the overhead of non-sequential memory access makes vectorization unprofitable or inefficient, but in this example the compiler still deemed it profitable to vectorize the code despite the non-unit-stride memory access.
Implementation of the Sepia Filter Kernel Using Array Notation

Here we rewrite the original loop using array notation with the default vector length. On a CPU with 128-bit vector registers the default vector length is 4 (e.g., four 32-bit float data elements are loaded into a vector register).

1. Select Project > Properties > C/C++ > Preprocessor > Preprocessor Definitions, and add a new macro, AOS_AN.
2. Rebuild the project, then run the executable (Debug > Start Without Debugging) and record the execution time reported in the output.

The array notation version makes use of the SIMD registers and the SIMD instruction set to handle operations on vector operands. The vectorization report shows that the array notation version of the loop was vectorized:

SepiaFilterCilkPlus.cpp(173): (col. 5) remark: LOOP WAS VECTORIZED.

For our Sepia filter example the performance of the array notation implementation will be almost the same as that of the autovectorized version in the previous case. The benefit is that while vectorizing arbitrary code is at the discretion of the compiler and cannot always be guaranteed, using array notation guarantees vectorization.
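The article shows the array notation loop only as a code figure. A minimal sketch of what an AOS array-notation Sepia kernel can look like is below; the field names, coefficients, and float data type are illustrative assumptions, and compiling it requires Intel Cilk Plus support (e.g., Intel C++ Composer XE):

```cpp
struct rgb { float red, green, blue; };

// Illustrative AOS_AN sketch: each statement operates on a whole array
// section [0:n], which the compiler is guaranteed to map onto SIMD
// instructions. Member access on a section (in[0:n].red) produces
// strided vector accesses, since the AOS layout interleaves channels.
void process_image(const rgb *in, rgb *out, int n)
{
    out[0:n].red   = 0.393f * in[0:n].red + 0.769f * in[0:n].green + 0.189f * in[0:n].blue;
    out[0:n].green = 0.349f * in[0:n].red + 0.686f * in[0:n].green + 0.168f * in[0:n].blue;
    out[0:n].blue  = 0.272f * in[0:n].red + 0.534f * in[0:n].green + 0.131f * in[0:n].blue;
}
```

The section syntax array[start:length] replaces the explicit loop, which is what makes the vectorization a guarantee rather than a compiler heuristic.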
Improving Performance by Using cilk_for

Here we introduce thread-level parallelism by using the cilk_for construct. A cilk_for loop is a replacement for the normal C/C++ for loop that permits loop iterations to run in parallel on multiple cores. To enable multithreading in this example, all you need to do is include the cilk header file and replace the for in the loop with cilk_for. To enable the cilk_for version, add the AOS_CILK_FOR macro to the preprocessor definitions.

Rebuilding the project with the above changes ensures that the Sepia filter kernel not only makes use of the SIMD registers (autovectorization) but also makes use of multiple cores, dividing the workload of the loop across multiple threads for additional speedup. The bigger the workload, the closer the speedup comes to the theoretical maximum. The input images provided can be used for testing; in increasing order of workload they are blackbuck.bmp, RGB_Lines.bmp, and test.bmp. The performance of the multi-threaded version increases as these images are used in the order specified, confirming that the bigger the workload, the higher the speedup across the cores.

Improving Performance Further Using a Structure of Arrays (SOA)

Up until now our default implementation has used an Array of Structures layout, which is not very vectorization friendly due to its non-sequential access patterns. The non-sequential access pattern results in gather/scatter instructions that reduce vectorization efficiency due to their long instruction latencies. Despite this, Intel Cilk Plus was able to deliver admirable performance. By rewriting the baseline implementation as a Structure of
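The cilk_for change described above can be sketched as follows. This is an illustrative reconstruction (the names and coefficients are not from the article's source) and requires a compiler with Intel Cilk Plus support:

```cpp
#include <cilk/cilk.h>
#include <algorithm>
#include <cstdint>

struct rgb { uint8_t b, g, r; };

static inline uint8_t clamp255(float v) {
    return (uint8_t)std::min(v, 255.0f);
}

// Illustrative AOS_CILK_FOR sketch: the only source change relative to
// the serial version is replacing the outer for with cilk_for, which
// lets the Cilk runtime divide the rows among its worker threads while
// the inner loop remains autovectorized.
void process_image_parallel(const rgb *in, rgb *out, int width, int height)
{
    cilk_for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            rgb p = in[y * width + x], s;
            s.r = clamp255(0.393f * p.r + 0.769f * p.g + 0.189f * p.b);
            s.g = clamp255(0.349f * p.r + 0.686f * p.g + 0.168f * p.b);
            s.b = clamp255(0.272f * p.r + 0.534f * p.g + 0.131f * p.b);
            out[y * width + x] = s;
        }
    }
}
```

Because each iteration writes a disjoint pixel, the loop is safe to parallelize with no synchronization.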
Arrays (SOA) we can further improve performance due to the unit-stride memory access pattern, which is vectorization friendly. This allows the compiler to generate faster linear vector memory load/store instructions (e.g., movaps or movups, supported on Intel SIMD hardware) rather than the longer-latency gather/scatter instructions it would otherwise have to generate.

The data structure used in the Array of Structures (AOS) implementation, and the corresponding Structure of Arrays (SOA) data structure, appear as code figures in the original article. To demonstrate the performance boost using SOA, there are two different implementations: one exploiting SIMD features using array notation, and the other exploiting both SIMD and multithreading features. To enable this section of the code in the example, simply define the macro SOA_AN.
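Since the article's data-structure figures are not reproduced in the text, here is a representative sketch of the two layouts; the names are illustrative assumptions, not the article's definitions:

```cpp
#include <cstdint>
#include <vector>

// Array of Structures: one struct per pixel, channels interleaved in
// memory (R G B R G B ...). Accessing all the reds, for example,
// requires a stride-3 (gather-style) access pattern.
struct pixel { uint8_t r, g, b; };
using image_aos = std::vector<pixel>;

// Structure of Arrays: one contiguous array per channel
// (R R R ... G G G ... B B B ...). Each channel is accessed with
// unit stride, enabling plain vector loads and stores.
struct image_soa {
    std::vector<uint8_t> r, g, b;
    explicit image_soa(std::size_t n) : r(n), g(n), b(n) {}
};
```

The filter's arithmetic is identical in both versions; only the memory layout, and therefore the vector instructions the compiler can use, changes.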
Rebuild the project with the above setting to vectorize the code:

SepiaFilterCilkPlus.cpp(141): (col. 2) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(147): (col. 2) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(182): (col. 3) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(178): (col. 2) remark: loop was not vectorized: not inner loop.
SepiaFilterCilkPlus.cpp(210): (col. 2) remark: loop was not vectorized: vectorization possible but seems inefficient.

The function process_image() containing the array notation code is invoked and vectorized. All the points made in the section on implementing the Sepia filter kernel using array notation apply here as well, except that the code operates on a different data structure — in this case, one that supports unit-stride memory access. The performance numbers should show a significant improvement over the AOS counterpart earlier.

Improving Performance by Using cilk_for (SOA)

To enable this section of the code in the example, define the macro SOA_CILK_FOR and replace the for with cilk_for.
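Combined, the SOA layout, array notation, and cilk_for can be sketched as below. This is an illustrative reconstruction (names, coefficients, and the float data type are assumptions) and requires an Intel Cilk Plus-capable compiler:

```cpp
#include <cilk/cilk.h>

// Hypothetical SOA_CILK_FOR sketch: rows are distributed across cores
// with cilk_for, and each row is processed with array notation over
// the per-channel arrays, giving unit-stride SIMD accesses.
void process_image(const float *r, const float *g, const float *b,
                   float *rs, float *gs, float *bs,
                   int width, int height)
{
    cilk_for (int y = 0; y < height; ++y) {
        int i = y * width;
        rs[i:width] = 0.393f * r[i:width] + 0.769f * g[i:width] + 0.189f * b[i:width];
        gs[i:width] = 0.349f * r[i:width] + 0.686f * g[i:width] + 0.168f * b[i:width];
        bs[i:width] = 0.272f * r[i:width] + 0.534f * g[i:width] + 0.131f * b[i:width];
    }
}
```

Here thread-level parallelism (cilk_for over rows) and data-level parallelism (array sections within a row) compose without interfering with each other.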
Rebuilding the project produces the same vectorization report as the array notation version, but this time the workload is divided among multiple threads and executed across different cores, gaining more performance than the AOS counterpart earlier.

Using cilk_for and Array Notation Together

To use cilk_for and array notation together, the array needs to be broken into multiple segments that are distributed across multiple Cilk worker threads. Doing so, however, overrides the Cilk runtime heuristics, which in general leads to lower performance, particularly for this example. You will get better performance if you let the Cilk runtime do the load balancing. To experiment with this, enable the array notation code section explained earlier by using the SOA_AN macro. By default the SOA_AN
code section uses no cilk_for and sets num_of_seg = 1, which means that the full array is handled by one thread. To use cilk_for with array notation, simply change the for loop to cilk_for and set num_of_seg to the number of array segments you want to create. You will notice that performance decreases as you increase num_of_seg, because you incur more overhead while there is not enough work for all the threads. The best recommendation for using cilk_for and array notation together is to use short vectors, that is, section lengths equal to the vector register size or a multiple of it. This enables vectorization that needs no peeling (if the data is aligned) and no cleanup loop.

Implementation of the Sepia Filter Kernel Using Elemental Functions

An Intel Cilk Plus elemental function is a regular function that can be invoked either on scalar arguments or, internally by the compiler, on array elements in parallel, to vectorize function calls within a loop that would otherwise prevent vectorization of the loop. In our example, the compiler inlines the call to the process_image() function in the loop, which enables vectorization; an elemental function is therefore not necessary here and would not make any difference in performance. However, if you needed to use an elemental version of the function, all you would need to do is declare it as shown below:

// Declaring process_image() as an elemental function
__declspec(vector) void process_image(rgb &indataset, rgb &outdataset);

For more information on elemental functions, please see Elemental Functions in the References section of this document.

References

For more information on SIMD vectorization, Intel compiler automatic vectorization, elemental functions, and examples of using other Intel Cilk Plus constructs, refer to:
- A Guide to Autovectorization Using the Intel C++ Compilers
- Requirements for Vectorizing Loops
- Requirements for Vectorizing Loops with #pragma SIMD
- Getting Started with Intel Cilk Plus Array Notations
- SIMD Parallelism using Array Notation
- Intel Cilk Plus Language Extension Specification
- Elemental Functions: Writing Data Parallel Code in C/C++ Using Intel Cilk Plus
- Using Intel Cilk Plus to Achieve Data and Thread Parallelism
More informationIntel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes
Intel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes 24 July 2014 Table of Contents 1 Introduction... 2 1.1 Product Contents... 2 1.2 System Requirements...
More informationMAQAO Hands-on exercises LRZ Cluster
MAQAO Hands-on exercises LRZ Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/hpc/a2c06/lu23bud/lrz-vihpstw21/tools/maqao/maqao_handson_lrz.tar.xz
More informationVECTORISATION. Adrian
VECTORISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Vectorisation Same operation on multiple data items Wide registers SIMD needed to approach FLOP peak performance, but your code must be
More informationCilk User s Guide. Document Number: US
Document Number: 322581-001US Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
More informationGetting Reproducible Results with Intel MKL
Getting Reproducible Results with Intel MKL Why do results vary? Root cause for variations in results Floating-point numbers order of computation matters! Single precision example where (a+b)+c a+(b+c)
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming
More informationIntel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes
Intel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes 22 January 2015 Table of Contents 1 Introduction... 2 1.1 Change History... 2 1.1.1 Changes
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationScheduling Image Processing Pipelines
Lecture 14: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationTopics. Java arrays. Definition. Data Structures and Information Systems Part 1: Data Structures. Lecture 3: Arrays (1)
Topics Data Structures and Information Systems Part 1: Data Structures Michele Zito Lecture 3: Arrays (1) Data structure definition: arrays. Java arrays creation access Primitive types and reference types
More informationCilk Plus in GCC. GNU Tools Cauldron Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation
Cilk Plus in GCC GNU Tools Cauldron 2012 Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation July 10, 2012 Presentation Outline Introduction Cilk Plus components Implementation GCC Project Status
More informationProgram Optimization Through Loop Vectorization
Program Optimization Through Loop Vectorization María Garzarán, Saeed Maleki William Gropp and David Padua Department of Computer Science University of Illinois at Urbana-Champaign Simple Example Loop
More informationProgramming Methods for the Pentium III Processor s Streaming SIMD Extensions Using the VTune Performance Enhancement Environment
Programming Methods for the Pentium III Processor s Streaming SIMD Extensions Using the VTune Performance Enhancement Environment Joe H. Wolf III, Microprocessor Products Group, Intel Corporation Index
More informationDownload, Install and Setup the Linux Development Workload Create a New Linux Project Configure a Linux Project Configure a Linux CMake Project
Table of Contents Download, Install and Setup the Linux Development Workload Create a New Linux Project Configure a Linux Project Configure a Linux CMake Project Connect to Your Remote Linux Computer Deploy,
More informationEliminate Memory Errors to Improve Program Stability
Eliminate Memory Errors to Improve Program Stability This guide will illustrate how Parallel Studio memory checking capabilities can find crucial memory defects early in the development cycle. It provides
More informationExploiting the Power of the Intel Compiler Suite. Dr. Mario Deilmann Intel Compiler and Languages Lab Software Solutions Group
Exploiting the Power of the Intel Compiler Suite Dr. Mario Deilmann Intel Compiler and Languages Lab Software Solutions Group Agenda Compiler Overview Intel C++ Compiler High level optimization IPO, PGO
More informationParallel Image Processing
Parallel Image Processing Course Level: CS1 PDC Concepts Covered: PDC Concept Concurrency Data parallel Bloom Level C A Programming Skill Covered: Loading images into arrays Manipulating images Programming
More informationInstallation Guide and Release Notes
Intel Parallel Studio XE 2013 for Linux* Installation Guide and Release Notes Document number: 323804-003US 10 March 2013 Table of Contents 1 Introduction... 1 1.1 What s New... 1 1.1.1 Changes since Intel
More informationKevin O Leary, Intel Technical Consulting Engineer
Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."
More informationGeneric access_type descriptor for the Embedded C Technical Report by Jan Kristoffersen Walter Banks
WG14 document N929 1 Purpose: Generic access_type descriptor for the Embedded C Technical Report by Jan Kristoffersen Walter Banks This document proposes a consistent and complete specification syntax
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationVc: Portable and Easy SIMD Programming with C++
Vc: Portable and Easy SIMD Programming with C++ Matthias Kretz Frankfurt Institute Institute for Computer Science Goethe University Frankfurt May 19th, 2014 HGS-HIRe Helmholtz Graduate School for Hadron
More informationUsing Intel Inspector XE 2011 with Fortran Applications
Using Intel Inspector XE 2011 with Fortran Applications Jackson Marusarz Intel Corporation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
More informationMPLAB XC8 C Compiler Version 2.00 Release Notes for AVR MCU
MPLAB XC8 C Compiler Version 2.00 Release Notes for AVR MCU THIS DOCUMENT CONTAINS IMPORTANT INFORMATION RELATING TO THE MPLAB XC8 C COM- PILER WHEN TARGETING MICROCHIP AVR DEVICES. PLEASE READ IT BEFORE
More informationPresenter: Georg Zitzlsberger. Date:
Presenter: Georg Zitzlsberger Date: 07-09-2016 1 Agenda Introduction to SIMD for Intel Architecture Compiler & Vectorization Validating Vectorization Success Intel Cilk Plus OpenMP* 4.x Summary 2 Vectorization
More informationAdvanced Parallel Programming II
Advanced Parallel Programming II Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Introduction to Vectorization RISC Software GmbH Johannes Kepler
More informationGet an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*
Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling
More informationIntel Array Building Blocks
Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call
More informationSome possible directions for the R engine
Some possible directions for the R engine Luke Tierney Department of Statistics & Actuarial Science University of Iowa July 22, 2010 Luke Tierney (U. of Iowa) Directions for the R engine July 22, 2010
More informationHPC TNT - 2. Tips and tricks for Vectorization approaches to efficient code. HPC core facility CalcUA
HPC TNT - 2 Tips and tricks for Vectorization approaches to efficient code HPC core facility CalcUA ANNIE CUYT STEFAN BECUWE FRANKY BACKELJAUW [ENGEL]BERT TIJSKENS Overview Introduction What is vectorization
More informationPreface... (vii) CHAPTER 1 INTRODUCTION TO COMPUTERS
Contents Preface... (vii) CHAPTER 1 INTRODUCTION TO COMPUTERS 1.1. INTRODUCTION TO COMPUTERS... 1 1.2. HISTORY OF C & C++... 3 1.3. DESIGN, DEVELOPMENT AND EXECUTION OF A PROGRAM... 3 1.4 TESTING OF PROGRAMS...
More informationHPC Fall 2007 Project 1 Fast Matrix Multiply
HPC Fall 2007 Project 1 Fast Matrix Multiply Robert van Engelen Due date: October 11, 2007 1 Introduction 1.1 Account and Login For this assignment you need an SCS account. The account gives you access
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationGe#ng Started with Automa3c Compiler Vectoriza3on. David Apostal UND CSci 532 Guest Lecture Sept 14, 2017
Ge#ng Started with Automa3c Compiler Vectoriza3on David Apostal UND CSci 532 Guest Lecture Sept 14, 2017 Parallellism is Key to Performance Types of parallelism Task-based (MPI) Threads (OpenMP, pthreads)
More informationThis guide will show you how to use Intel Inspector XE to identify and fix resource leak errors in your programs before they start causing problems.
Introduction A resource leak refers to a type of resource consumption in which the program cannot release resources it has acquired. Typically the result of a bug, common resource issues, such as memory
More informationParallel Programming. OpenMP Parallel programming for multiprocessors for loops
Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory
More informationCUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.
Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication
More informationIntel Parallel Studio XE 2011 for Linux* Installation Guide and Release Notes
Intel Parallel Studio XE 2011 for Linux* Installation Guide and Release Notes Document number: 323804-001US 8 October 2010 Table of Contents 1 Introduction... 1 1.1 Product Contents... 1 1.2 What s New...
More informationCOE608: Computer Organization and Architecture
Add on Instruction Set Architecture COE608: Computer Organization and Architecture Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University Overview More
More informationAccelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome
More informationispc: A SPMD Compiler for High-Performance CPU Programming
ispc: A SPMD Compiler for High-Performance CPU Programming Matt Pharr Intel Corporation matt.pharr@intel.com William R. Mark Intel Corporation william.r.mark@intel.com ABSTRACT SIMD parallelism has become
More informationProgress on OpenMP Specifications
Progress on OpenMP Specifications Wednesday, November 13, 2012 Bronis R. de Supinski Chair, OpenMP Language Committee This work has been authored by Lawrence Livermore National Security, LLC under contract
More information