Improving graphics processing performance using Intel Cilk Plus
- Marjory Harris
- 6 years ago
Introduction

Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords to implement task parallelism and an array notation syntax to express data parallelism. This article demonstrates how to improve the performance of a graphics processing program using Intel Cilk Plus. To demonstrate the performance increase, you will use a program that converts a bitmap file from a color image to a Sepia tone image. A Sepia tone image is a monochromatic image with a distinctive brown-gray tint, reminiscent of the tone photographs had in the era of black-and-white film. The program works by converting each pixel in the bitmap file to a Sepia tone.

Overview

A Sepia filter converts a color image to a duotone image with a dark brown-gray color. The filter converts each color pixel using a formula (shown as an image in the original article) in which each output component is a weighted sum of the input components, where R, G, and B are the red, green, and blue values of each pixel in the input image and Rs, Gs, and Bs are the corresponding pixel values in the output image. This is a highly data-parallel algorithm: the value of each pixel at (i,j) in the output image depends only on the pixel at (i,j) in the input image. This makes it an ideal candidate for Single Instruction Multiple Data (SIMD) exploitation, where multiple data items in a loop are loaded into vector registers and operated on simultaneously by a single instruction. Below are the bitmap file before and after the Sepia transformation:
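The formula itself appears as an image in the original article. As a sketch, a scalar version of the per-pixel conversion using the commonly cited Sepia coefficients (an assumption — the article's exact coefficients may differ) looks like this:

```cpp
#include <algorithm>
#include <cstdint>

struct rgb { uint8_t b, g, r; };  // 24-bit BMP stores pixels as B, G, R

// Scalar Sepia kernel: each output channel is a weighted sum of the
// input channels, clamped to the 8-bit range. The weights below are
// the widely used Sepia coefficients; they are an illustrative
// assumption, since the article's formula is only shown as an image.
static inline rgb sepia(rgb in) {
    float r = in.r, g = in.g, b = in.b;
    rgb out;
    out.r = (uint8_t)std::min(255.0f, 0.393f * r + 0.769f * g + 0.189f * b);
    out.g = (uint8_t)std::min(255.0f, 0.349f * r + 0.686f * g + 0.168f * b);
    out.b = (uint8_t)std::min(255.0f, 0.272f * r + 0.534f * g + 0.131f * b);
    return out;
}
```

Because the output of each pixel depends only on that pixel's input, a loop over this kernel has no cross-iteration dependences, which is exactly what makes it SIMD friendly.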
We will look at the performance of the serial implementation of the Sepia filter algorithm above, and then create an Intel Cilk Plus implementation of the filter to improve the filter's performance through the vectorization and parallelization features supported by the Intel C++ Compiler.

Optimization Steps

We will start the optimization process by performing the following steps:

1. Establish a performance baseline by building and running the serial version of the Sepia filter with the default Visual Studio* compiler and default options (Release build).
2. Rebuild the project with the Intel C++ Compiler with default options to get a performance boost (Release build).
3. Implement the filter using Intel Cilk Plus array notation.
4. Introduce thread-level parallelism using the Intel Cilk Plus cilk_for construct.
5. Replace the Array of Structures (AOS) implementation with a Structure of Arrays (SOA) implementation to improve performance further.

System Requirements

To compile and run the examples and exercises in this document you will need Intel C++ Composer XE 2013 Update 1 or higher, and an Intel Pentium 4 processor or higher with support for the Intel SSE2 instruction extensions or higher. The exercises in this document were tested on a third-generation Intel Core
i5 system supporting 256-bit vector registers. The instructions in this document show you how to build and run the examples with Microsoft Visual Studio*. A Visual Studio* 2008 project is provided to allow using the examples with older versions of Visual Studio*. The examples can also be built from the command line on Windows*, Linux*, and Mac OS* X using the following commands:

Windows*: icl /Qvec-report2 /Qrestrict /fp:fast SepiaFilterCilkPlus.cpp
Linux* and Mac OS* X: icc -vec-report2 -restrict -fp-model fast SepiaFilterCilkPlus.cpp

For the system requirements for Linux* and Mac OS* X, please refer to the Intel C++ Composer XE 2013 Release Notes.

NOTE: The sample code used in this article reads only RGB images (24-bit format) with the .bmp extension. Three sample images of different sizes are attached with this solution.

Locating the Samples

To build the sample code, open the SepiaFilter-CilkPlus.zip archive attached. Use these files for this tutorial:

- Sample input images RGB_Lines.bmp, test.bmp, and blackbuck.bmp, located in the SepiaFilterCilkPlus directory inside the zip file
- SepiaFilterCilkPlus.sln
- SepiaFilterCilkPlus.cpp
- SepiaFilterCilkPlus.h

Open the Microsoft Visual Studio* solution file, SepiaFilterCilkPlus.sln, and follow the steps below to prepare the project for the exercises in this document:

1. Select the Release Win32 configuration.
2. Clean the solution by selecting Build > Clean Solution.
You just deleted all of the compiled and temporary files associated with this solution. Cleaning a solution ensures that the next build is a full build rather than an incremental build of existing files.

Contents of the Source Code

The program has a main function that takes the input file and output file as command-line arguments and invokes the read_process_write() function. This function reads the .bmp input file. It first reads the header information from the input image file, which describes the type of image, any compression, and the width and height of the input image. Once this information is known, a dynamic data structure is created and the payload image data is copied into it for further processing at the pixel level. In this program, both Array of Structures (AOS) and Structure of Arrays (SOA) versions of the data structures are implemented and their performance is compared. The main Sepia filter kernel is named process_image(), and depending on the macro defined during compilation, the corresponding implementation of the Sepia kernel is enabled (for instance, the array notation version or the cilk_for version, implemented using the SOA and AOS data structures).

To run the executable from the command line, use the following syntax:

<executable> <input file> <output file>

The input image and output image can be supplied as command-line arguments in Visual Studio* as follows: right-click on the project > Properties > Configuration Properties > Debugging > Command Arguments.
Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, build your project with the Microsoft* C++ compiler in Visual Studio* (on Windows*). Run the executable (Debug > Start Without Debugging). Running the program opens a window that displays the program's execution time in clock ticks. Record the execution time reported in the output.

Building the Project with the Intel C++ Compiler

Convert the project to use the Intel C++ Compiler. To do this, right-click on the solution and select Intel Composer XE 20XX > Use Intel C++
The XX above refers to the version of Intel Composer XE (e.g., 2011 or 2013) installed on your system. Once the project is converted to an Intel project, follow the steps below to set the project properties:

1. Select Project > Properties > C/C++ > General > Suppress Startup Banner > No.
Click Language [Intel C++] > Recognize The Restrict Keyword > Yes (/Qrestrict).

The Intel C++ Compiler supports the restrict keyword for C++ even though it is a C99 extension. This qualifier can be applied to a data pointer to indicate that data accessed through that pointer will not alias data accessed through other pointers. The restrict keyword thus enables the compiler to perform certain optimizations based on the premise that a given object cannot be changed through another pointer. You must ensure that restrict-qualified pointers are used as they are intended to be used; otherwise, undefined behavior may result.

2. Select Project > Properties > C/C++ > Optimization > Optimization > Maximize Speed (/O2).
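The aliasing guarantee that /Qrestrict enables can be illustrated with a small sketch (not from the article's source). Standard C++ has no restrict keyword, so this sketch uses the __restrict compiler extension, which GCC, Clang, MSVC, and the Intel compiler all accept; with /Qrestrict set, the Intel compiler additionally accepts plain restrict:

```cpp
#include <cstddef>

// The restrict-qualified pointers promise the compiler that dst and
// src never overlap, so it can vectorize this loop without emitting
// runtime overlap checks or falling back to a scalar path.
void scale(float *__restrict dst, const float *__restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}
```

If dst and src did in fact overlap, this code would have undefined behavior — the promise is the programmer's responsibility, exactly as the paragraph above warns.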
3. Select Project > Properties > C/C++ > Diagnostics [Intel C++] > Vectorizer Diagnostic Level > Loops Successfully and Unsuccessfully Vectorized (2) (/Qvec-report2).
4. Select Project > Properties > C/C++ > Code Generation > Floating Point Model > Fast (/fp:fast).
5. Select Project > Properties > C/C++ > Code Generation [Intel C++] > Add Processor-Optimized Code Path > Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX).
6. Vectorization can improve performance significantly for most applications, and it is enabled by default in the Intel C++ Compiler. To see the performance impact of vectorization on our Sepia filter, let's disable vectorization temporarily and observe the runtime performance. To do this, select Project > Properties > C/C++ > Command Line and add /Qno-vec.
Rebuild the project and run the executable. Record the execution time reported in the output.

7. Now let's re-enable vectorization by removing the /Qno-vec option. Rebuild the project, run the executable, and record the execution time reported in the output. You should see improved performance due to vectorization. This is the baseline against which subsequent improvements will be measured.

When establishing the baseline performance it is good practice to compare the vec-report2 results between the -O2 and -O3 optimization levels, because more vectorization candidates tend to appear at -O3. For this example, however, the -O2 and -O3 results are the same.

SepiaFilterCilkPlus.cpp(202): (col. 2) remark: LOOP WAS VECTORIZED.

The vectorization report indicates that the loop at the above line number in SepiaFilterCilkPlus.cpp was vectorized. This is the for loop that is the call site of the process_image() function, which in this case happens to be inlined; the compiler vectorized the function body using the SIMD registers. The original serial implementation uses an Array of Structures (AOS) layout, which is not vectorization friendly due to the non-sequential memory accesses inherent in the algorithm. Often, the overhead of non-sequential memory access makes vectorization unprofitable or inefficient, but in this example the compiler still deemed it profitable to vectorize the code despite the non-unit-stride memory access.
Implementation of the Sepia Filter Kernel Using Array Notation

Here we rewrite the original loop using array notation with the default vector length. On a CPU with 128-bit vector registers the default vector length is 4 (e.g., four 32-bit float data elements are loaded into a vector register).

1. Select Project > Properties > C/C++ > Preprocessor > Preprocessor Definitions, and add a new macro, AOS_AN.
2. Rebuild the project, then run the executable (Debug > Start Without Debugging) and record the execution time reported in the output.

The array notation version makes use of the SIMD registers and the SIMD instruction set to handle operations on vector operands. The vectorization report shows that the array notation version of the loop was vectorized:

SepiaFilterCilkPlus.cpp(173): (col. 5) remark: LOOP WAS VECTORIZED.

For our Sepia filter example the performance of the array notation implementation will be almost the same as that of the autovectorized version in the previous case. The benefit is that while vectorizing arbitrary code is at the discretion of the compiler and cannot always be guaranteed, using array notation guarantees vectorization.
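The article shows the array notation loop only as a code figure. A minimal sketch of what an AOS array-notation Sepia kernel can look like is below; the field names, coefficients, and float data type are illustrative assumptions, and compiling it requires Intel Cilk Plus support (e.g., Intel C++ Composer XE):

```cpp
struct rgb { float red, green, blue; };

// Illustrative AOS_AN sketch: each statement operates on a whole array
// section [0:n], which the compiler is guaranteed to map onto SIMD
// instructions. Member access on a section (in[0:n].red) produces
// strided vector accesses, since the AOS layout interleaves channels.
void process_image(const rgb *in, rgb *out, int n)
{
    out[0:n].red   = 0.393f * in[0:n].red + 0.769f * in[0:n].green + 0.189f * in[0:n].blue;
    out[0:n].green = 0.349f * in[0:n].red + 0.686f * in[0:n].green + 0.168f * in[0:n].blue;
    out[0:n].blue  = 0.272f * in[0:n].red + 0.534f * in[0:n].green + 0.131f * in[0:n].blue;
}
```

The section syntax array[start:length] replaces the explicit loop, which is what makes the vectorization a guarantee rather than a compiler heuristic.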
Improving Performance by Using cilk_for

Here we introduce thread-level parallelism by using the cilk_for construct. A cilk_for loop is a replacement for the normal C/C++ for loop that permits loop iterations to run in parallel on multiple cores. To enable multithreading in this example, all you need to do is include the cilk header file and replace the for in the loop with cilk_for. To enable the cilk_for version, add the AOS_CILK_FOR macro to the preprocessor definitions.

Rebuilding the project with the above changes ensures that the Sepia filter kernel not only makes use of the SIMD registers (autovectorization) but also makes use of multiple cores, dividing the workload of the loop across multiple threads for additional speedup. The bigger the workload, the closer the speedup comes to the theoretical maximum. The input images provided can be used for testing; in increasing order of workload they are blackbuck.bmp, RGB_Lines.bmp, and test.bmp. The performance of the multi-threaded version increases as these images are used in the order specified, confirming that the bigger the workload, the higher the speedup across the cores.

Improving Performance Further Using a Structure of Arrays (SOA)

Up until now our default implementation has used an Array of Structures layout, which is not very vectorization friendly due to its non-sequential access patterns. The non-sequential access pattern results in gather/scatter instructions that reduce vectorization efficiency due to their long instruction latencies. Despite this, Intel Cilk Plus was able to deliver admirable performance. By rewriting the baseline implementation as a Structure of
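The cilk_for change described above can be sketched as follows. This is an illustrative reconstruction (the names and coefficients are not from the article's source) and requires a compiler with Intel Cilk Plus support:

```cpp
#include <cilk/cilk.h>
#include <algorithm>
#include <cstdint>

struct rgb { uint8_t b, g, r; };

static inline uint8_t clamp255(float v) {
    return (uint8_t)std::min(v, 255.0f);
}

// Illustrative AOS_CILK_FOR sketch: the only source change relative to
// the serial version is replacing the outer for with cilk_for, which
// lets the Cilk runtime divide the rows among its worker threads while
// the inner loop remains autovectorized.
void process_image_parallel(const rgb *in, rgb *out, int width, int height)
{
    cilk_for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            rgb p = in[y * width + x], s;
            s.r = clamp255(0.393f * p.r + 0.769f * p.g + 0.189f * p.b);
            s.g = clamp255(0.349f * p.r + 0.686f * p.g + 0.168f * p.b);
            s.b = clamp255(0.272f * p.r + 0.534f * p.g + 0.131f * p.b);
            out[y * width + x] = s;
        }
    }
}
```

Because each iteration writes a disjoint pixel, the loop is safe to parallelize with no synchronization.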
Arrays (SOA) we can further improve performance due to the unit-stride memory access pattern, which is vectorization friendly. This allows the compiler to generate faster linear vector memory load/store instructions (e.g., movaps or movups, supported on Intel SIMD hardware) rather than the longer-latency gather/scatter instructions it would otherwise have to generate.

The data structure used in the Array of Structures (AOS) implementation, and the corresponding Structure of Arrays (SOA) data structure, appear as code figures in the original article. To demonstrate the performance boost using SOA, there are two different implementations: one exploiting SIMD features using array notation, and the other exploiting both SIMD and multithreading features. To enable this section of the code in the example, simply define the macro SOA_AN.
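Since the article's data-structure figures are not reproduced in the text, here is a representative sketch of the two layouts; the names are illustrative assumptions, not the article's definitions:

```cpp
#include <cstdint>
#include <vector>

// Array of Structures: one struct per pixel, channels interleaved in
// memory (R G B R G B ...). Accessing all the reds, for example,
// requires a stride-3 (gather-style) access pattern.
struct pixel { uint8_t r, g, b; };
using image_aos = std::vector<pixel>;

// Structure of Arrays: one contiguous array per channel
// (R R R ... G G G ... B B B ...). Each channel is accessed with
// unit stride, enabling plain vector loads and stores.
struct image_soa {
    std::vector<uint8_t> r, g, b;
    explicit image_soa(std::size_t n) : r(n), g(n), b(n) {}
};
```

The filter's arithmetic is identical in both versions; only the memory layout, and therefore the vector instructions the compiler can use, changes.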
Rebuild the project with the above setting to vectorize the code:

SepiaFilterCilkPlus.cpp(141): (col. 2) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(147): (col. 2) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(182): (col. 3) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(178): (col. 2) remark: loop was not vectorized: not inner loop.
SepiaFilterCilkPlus.cpp(210): (col. 2) remark: loop was not vectorized: vectorization possible but seems inefficient.

The function process_image() containing the array notation code is invoked and vectorized. All the points made in the section on implementing the Sepia filter kernel using array notation apply here as well, except that the code operates on a different data structure — in this case, one that supports unit-stride memory access. The performance numbers should show a significant improvement over the AOS counterpart earlier.

Improving Performance by Using cilk_for (SOA)

To enable this section of the code in the example, define the macro SOA_CILK_FOR and replace the for with cilk_for.
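Combined, the SOA layout, array notation, and cilk_for can be sketched as below. This is an illustrative reconstruction (names, coefficients, and the float data type are assumptions) and requires an Intel Cilk Plus-capable compiler:

```cpp
#include <cilk/cilk.h>

// Hypothetical SOA_CILK_FOR sketch: rows are distributed across cores
// with cilk_for, and each row is processed with array notation over
// the per-channel arrays, giving unit-stride SIMD accesses.
void process_image(const float *r, const float *g, const float *b,
                   float *rs, float *gs, float *bs,
                   int width, int height)
{
    cilk_for (int y = 0; y < height; ++y) {
        int i = y * width;
        rs[i:width] = 0.393f * r[i:width] + 0.769f * g[i:width] + 0.189f * b[i:width];
        gs[i:width] = 0.349f * r[i:width] + 0.686f * g[i:width] + 0.168f * b[i:width];
        bs[i:width] = 0.272f * r[i:width] + 0.534f * g[i:width] + 0.131f * b[i:width];
    }
}
```

Here thread-level parallelism (cilk_for over rows) and data-level parallelism (array sections within a row) compose without interfering with each other.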
Rebuilding the project produces the same vectorization report as the array notation version, but this time the workload is divided among multiple threads and executed across different cores, gaining more performance than the AOS counterpart earlier.

Using cilk_for and Array Notation Together

To use cilk_for and array notation together, the array needs to be broken into multiple segments that are distributed across multiple Cilk worker threads. Doing so, however, overrides the Cilk runtime heuristics, which in general leads to lower performance, particularly for this example. You will get better performance if you let the Cilk runtime do the load balancing. To experiment with this, enable the array notation code section explained earlier by using the SOA_AN macro. By default the SOA_AN
code section uses no cilk_for and sets num_of_seg = 1, which means that the full array is handled by one thread. To use cilk_for with array notation, simply change the for loop to cilk_for and set num_of_seg to the number of array segments you want to create. You will notice that performance decreases as you increase num_of_seg, because you incur more overhead while there is not enough work for all the threads. The best recommendation for using cilk_for and array notation together is to use short vectors, that is, section lengths equal to the vector register size or a multiple of it. This enables vectorization that needs no peeling (if the data is aligned) and no cleanup loop.

Implementation of the Sepia Filter Kernel Using Elemental Functions

An Intel Cilk Plus elemental function is a regular function that can be invoked either on scalar arguments or, internally by the compiler, on array elements in parallel, to vectorize function calls within a loop that would otherwise prevent vectorization of the loop. In our example, the compiler inlines the call to the process_image() function in the loop, which enables vectorization; an elemental function is therefore not necessary here and would not make any difference in performance. However, if you needed to use an elemental version of the function, all you would need to do is declare it as shown below:

// Declaring process_image() as an elemental function
__declspec(vector) void process_image(rgb &indataset, rgb &outdataset);

For more information on elemental functions, please see Elemental Functions in the References section of this document.

References

For more information on SIMD vectorization, Intel compiler automatic vectorization, elemental functions, and examples of using other Intel Cilk Plus constructs, refer to:
- A Guide to Autovectorization Using the Intel C++ Compilers
- Requirements for Vectorizing Loops
- Requirements for Vectorizing Loops with #pragma SIMD
- Getting Started with Intel Cilk Plus Array Notations
- SIMD Parallelism using Array Notation
- Intel Cilk Plus Language Extension Specification
- Elemental Functions: Writing Data Parallel Code in C/C++ Using Intel Cilk Plus
- Using Intel Cilk Plus to Achieve Data and Thread Parallelism
More informationIntel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes
Intel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes 24 July 2014 Table of Contents 1 Introduction... 2 1.1 Product Contents... 2 1.2 System Requirements...
More informationMAQAO Hands-on exercises LRZ Cluster
MAQAO Hands-on exercises LRZ Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/hpc/a2c06/lu23bud/lrz-vihpstw21/tools/maqao/maqao_handson_lrz.tar.xz
More informationVECTORISATION. Adrian
VECTORISATION Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Vectorisation Same operation on multiple data items Wide registers SIMD needed to approach FLOP peak performance, but your code must be
More informationCilk User s Guide. Document Number: US
Document Number: 322581-001US Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
More informationGetting Reproducible Results with Intel MKL
Getting Reproducible Results with Intel MKL Why do results vary? Root cause for variations in results Floating-point numbers order of computation matters! Single precision example where (a+b)+c a+(b+c)
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming
More informationIntel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes
Intel Integrated Native Developer Experience 2015 Build Edition for OS X* Installation Guide and Release Notes 22 January 2015 Table of Contents 1 Introduction... 2 1.1 Change History... 2 1.1.1 Changes
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationScheduling Image Processing Pipelines
Lecture 14: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationTopics. Java arrays. Definition. Data Structures and Information Systems Part 1: Data Structures. Lecture 3: Arrays (1)
Topics Data Structures and Information Systems Part 1: Data Structures Michele Zito Lecture 3: Arrays (1) Data structure definition: arrays. Java arrays creation access Primitive types and reference types
More informationCilk Plus in GCC. GNU Tools Cauldron Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation
Cilk Plus in GCC GNU Tools Cauldron 2012 Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation July 10, 2012 Presentation Outline Introduction Cilk Plus components Implementation GCC Project Status
More informationProgram Optimization Through Loop Vectorization
Program Optimization Through Loop Vectorization María Garzarán, Saeed Maleki William Gropp and David Padua Department of Computer Science University of Illinois at Urbana-Champaign Simple Example Loop
More informationProgramming Methods for the Pentium III Processor s Streaming SIMD Extensions Using the VTune Performance Enhancement Environment
Programming Methods for the Pentium III Processor s Streaming SIMD Extensions Using the VTune Performance Enhancement Environment Joe H. Wolf III, Microprocessor Products Group, Intel Corporation Index
More informationDownload, Install and Setup the Linux Development Workload Create a New Linux Project Configure a Linux Project Configure a Linux CMake Project
Table of Contents Download, Install and Setup the Linux Development Workload Create a New Linux Project Configure a Linux Project Configure a Linux CMake Project Connect to Your Remote Linux Computer Deploy,
More informationEliminate Memory Errors to Improve Program Stability
Eliminate Memory Errors to Improve Program Stability This guide will illustrate how Parallel Studio memory checking capabilities can find crucial memory defects early in the development cycle. It provides
More informationExploiting the Power of the Intel Compiler Suite. Dr. Mario Deilmann Intel Compiler and Languages Lab Software Solutions Group
Exploiting the Power of the Intel Compiler Suite Dr. Mario Deilmann Intel Compiler and Languages Lab Software Solutions Group Agenda Compiler Overview Intel C++ Compiler High level optimization IPO, PGO
More informationParallel Image Processing
Parallel Image Processing Course Level: CS1 PDC Concepts Covered: PDC Concept Concurrency Data parallel Bloom Level C A Programming Skill Covered: Loading images into arrays Manipulating images Programming
More informationInstallation Guide and Release Notes
Intel Parallel Studio XE 2013 for Linux* Installation Guide and Release Notes Document number: 323804-003US 10 March 2013 Table of Contents 1 Introduction... 1 1.1 What s New... 1 1.1.1 Changes since Intel
More informationKevin O Leary, Intel Technical Consulting Engineer
Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."
More informationGeneric access_type descriptor for the Embedded C Technical Report by Jan Kristoffersen Walter Banks
WG14 document N929 1 Purpose: Generic access_type descriptor for the Embedded C Technical Report by Jan Kristoffersen Walter Banks This document proposes a consistent and complete specification syntax
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationVc: Portable and Easy SIMD Programming with C++
Vc: Portable and Easy SIMD Programming with C++ Matthias Kretz Frankfurt Institute Institute for Computer Science Goethe University Frankfurt May 19th, 2014 HGS-HIRe Helmholtz Graduate School for Hadron
More informationUsing Intel Inspector XE 2011 with Fortran Applications
Using Intel Inspector XE 2011 with Fortran Applications Jackson Marusarz Intel Corporation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
More informationMPLAB XC8 C Compiler Version 2.00 Release Notes for AVR MCU
MPLAB XC8 C Compiler Version 2.00 Release Notes for AVR MCU THIS DOCUMENT CONTAINS IMPORTANT INFORMATION RELATING TO THE MPLAB XC8 C COM- PILER WHEN TARGETING MICROCHIP AVR DEVICES. PLEASE READ IT BEFORE
More informationPresenter: Georg Zitzlsberger. Date:
Presenter: Georg Zitzlsberger Date: 07-09-2016 1 Agenda Introduction to SIMD for Intel Architecture Compiler & Vectorization Validating Vectorization Success Intel Cilk Plus OpenMP* 4.x Summary 2 Vectorization
More informationAdvanced Parallel Programming II
Advanced Parallel Programming II Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Introduction to Vectorization RISC Software GmbH Johannes Kepler
More informationGet an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*
Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling
More informationIntel Array Building Blocks
Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call
More informationSome possible directions for the R engine
Some possible directions for the R engine Luke Tierney Department of Statistics & Actuarial Science University of Iowa July 22, 2010 Luke Tierney (U. of Iowa) Directions for the R engine July 22, 2010
More informationHPC TNT - 2. Tips and tricks for Vectorization approaches to efficient code. HPC core facility CalcUA
HPC TNT - 2 Tips and tricks for Vectorization approaches to efficient code HPC core facility CalcUA ANNIE CUYT STEFAN BECUWE FRANKY BACKELJAUW [ENGEL]BERT TIJSKENS Overview Introduction What is vectorization
More informationPreface... (vii) CHAPTER 1 INTRODUCTION TO COMPUTERS
Contents Preface... (vii) CHAPTER 1 INTRODUCTION TO COMPUTERS 1.1. INTRODUCTION TO COMPUTERS... 1 1.2. HISTORY OF C & C++... 3 1.3. DESIGN, DEVELOPMENT AND EXECUTION OF A PROGRAM... 3 1.4 TESTING OF PROGRAMS...
More informationHPC Fall 2007 Project 1 Fast Matrix Multiply
HPC Fall 2007 Project 1 Fast Matrix Multiply Robert van Engelen Due date: October 11, 2007 1 Introduction 1.1 Account and Login For this assignment you need an SCS account. The account gives you access
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationAccelerated Library Framework for Hybrid-x86
Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit
More informationGe#ng Started with Automa3c Compiler Vectoriza3on. David Apostal UND CSci 532 Guest Lecture Sept 14, 2017
Ge#ng Started with Automa3c Compiler Vectoriza3on David Apostal UND CSci 532 Guest Lecture Sept 14, 2017 Parallellism is Key to Performance Types of parallelism Task-based (MPI) Threads (OpenMP, pthreads)
More informationThis guide will show you how to use Intel Inspector XE to identify and fix resource leak errors in your programs before they start causing problems.
Introduction A resource leak refers to a type of resource consumption in which the program cannot release resources it has acquired. Typically the result of a bug, common resource issues, such as memory
More informationParallel Programming. OpenMP Parallel programming for multiprocessors for loops
Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory
More informationCUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.
Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication
More informationIntel Parallel Studio XE 2011 for Linux* Installation Guide and Release Notes
Intel Parallel Studio XE 2011 for Linux* Installation Guide and Release Notes Document number: 323804-001US 8 October 2010 Table of Contents 1 Introduction... 1 1.1 Product Contents... 1 1.2 What s New...
More informationCOE608: Computer Organization and Architecture
Add on Instruction Set Architecture COE608: Computer Organization and Architecture Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University Overview More
More informationAccelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture Da Zhang Collaborators: Hao Wang, Kaixi Hou, Jing Zhang Advisor: Wu-chun Feng Evolution of Genome Sequencing1 In 20032: 1 human genome
More informationispc: A SPMD Compiler for High-Performance CPU Programming
ispc: A SPMD Compiler for High-Performance CPU Programming Matt Pharr Intel Corporation matt.pharr@intel.com William R. Mark Intel Corporation william.r.mark@intel.com ABSTRACT SIMD parallelism has become
More informationProgress on OpenMP Specifications
Progress on OpenMP Specifications Wednesday, November 13, 2012 Bronis R. de Supinski Chair, OpenMP Language Committee This work has been authored by Lawrence Livermore National Security, LLC under contract
More information