Improving graphics processing performance using Intel Cilk Plus

Introduction

Intel Cilk Plus is an extension to the C and C++ languages that supports data and task parallelism. It provides three new keywords to implement task parallelism and an array notation syntax to express data parallelism. This article demonstrates how to improve the performance of a graphics processing program using Intel Cilk Plus. To demonstrate the performance increase, you will use a program that converts a bitmap file from a color image to a Sepia tone image. A Sepia tone image is a monochromatic image with a distinctive brownish-gray tint, the tone familiar from photographs of the black-and-white film era. The program works by converting each pixel in the bitmap file to a Sepia tone.

Overview

A Sepia filter converts a color image to a duotone image with a dark brown-gray color. The value of each pixel at (i,j) in the output image depends only on the pixel at (i,j) in the input image, which makes the algorithm highly data parallel and an ideal candidate for Single Instruction Multiple Data (SIMD) exploitation, where multiple data items in a loop are loaded into vector registers and operated on simultaneously by a single instruction. The filter converts each color pixel using the following formula, where R, G, and B are the red, green, and blue values of a pixel in the input image and Rs, Gs, and Bs are the corresponding values of that pixel in the output image.
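The exact coefficients are not reproduced in this text; the widely used Sepia weights, assumed here to match the sample, are:

Rs = 0.393*R + 0.769*G + 0.189*B
Gs = 0.349*R + 0.686*G + 0.168*B
Bs = 0.272*R + 0.534*G + 0.131*B

Results greater than 255 are typically clamped to 255 so they fit in an 8-bit channel.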

We will look at the performance of the serial implementation of the Sepia filter algorithm above, and then create an Intel Cilk Plus implementation of the filter to improve the filter's performance through the vectorization and parallelization features supported by the Intel C++ Compiler.

Optimization Steps

We will start the optimization process by performing the following steps:

1. Establish a performance baseline by building and running the serial version of the Sepia filter with the default Microsoft Visual Studio* compiler and default options (Release build).
2. Rebuild the project with the Intel C++ Compiler with default options to get a performance boost (Release build).
3. Implement the filter using Intel Cilk Plus array notation.
4. Introduce thread-level parallelization using the Intel Cilk Plus cilk_for construct.
5. Replace the Array of Structures (AOS) implementation with a Structure of Arrays (SOA) implementation to improve performance further.

System Requirements

To compile and run the example and exercises in this document you will need Intel C++ Composer XE 2013 Update 1 or higher, and an Intel Pentium 4 processor or higher with support for Intel SSE2 or higher instruction extensions. The exercises in this document were tested on a third generation Intel Core i5 system supporting 256-bit vector registers.

The instructions in this document show you how to build and run the examples with Microsoft Visual Studio*. A Visual Studio* 2008 project is provided to allow using the examples with older versions of Visual Studio*. The examples can also be built from the command line on Windows*, Linux*, and Mac OS* X using the following commands:

Windows*: icl /Qvec-report2 /Qrestrict /fp:fast SepiaFilterCilkPlus.cpp
Linux* and Mac OS* X: icc -vec-report2 -restrict -fp-model fast SepiaFilterCilkPlus.cpp

For system requirements for Linux* and Mac OS* X, please refer to the Intel C++ Composer XE 2013 Release Notes.

NOTE: The sample code used in this article only reads images in RGB (24-bit) format with the .bmp extension. Three sample images of different sizes are attached with this solution.

Locating the Samples

To build the sample code, open the SepiaFilter-CilkPlus.zip archive attached. Use these files for this tutorial:

RGB_Lines.bmp, test.bmp, and blackbuck.bmp (sample input images in the SepiaFilterCilkPlus directory inside the zip file)
SepiaFilterCilkPlus.sln
SepiaFilterCilkPlus.cpp
SepiaFilterCilkPlus.h

Open the Microsoft Visual Studio* solution file, SepiaFilterCilkPlus.sln, and follow the steps below to prepare the project for the exercises in this document:

1. Select the Release Win32 configuration.
2. Clean the solution by selecting Build > Clean Solution.

You just deleted all of the compiled and temporary files associated with this solution. Cleaning a solution ensures that the next build is a full build rather than an incremental build of existing files.

Contents of the Source Code

The program has a main function that takes the input file and output file as command line arguments and invokes the read_process_write() function. This function reads the .bmp input file. It first reads the header information from the input image file, which describes the type of image, any compression, and the width and height of the input image. Once this information is known, a dynamic data structure is created and the payload image data is copied into it for further processing at the pixel level. The program implements both array of structures (AOS) and structure of arrays (SOA) versions of the data structure and compares their performance. The main Sepia filter kernel is named process_image(), and depending on the macro defined during compilation, the corresponding implementation of the Sepia kernel is enabled (for instance, the array notation version or the cilk_for version implemented using the SOA or AOS data structures).

To run the executable from the command line, use:

<executable> <input file> <output file>

The input image and output image can be set as command line arguments in Visual Studio as follows: right-click the project > Properties > Configuration Properties > Debugging > Command Arguments.
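As a point of reference for the variants discussed below, here is a minimal sketch of the serial AOS version. The field names, coefficients, and loop shape are illustrative assumptions and are not code copied from the sample:

struct rgb { float red, green, blue; };   // one structure per pixel (AOS layout)

// Sepia conversion of a single pixel (coefficients are the common Sepia
// weights assumed earlier; results are clamped to the 8-bit range).
static void process_image(const rgb &in, rgb &out)
{
    float r = 0.393f * in.red + 0.769f * in.green + 0.189f * in.blue;
    float g = 0.349f * in.red + 0.686f * in.green + 0.168f * in.blue;
    float b = 0.272f * in.red + 0.534f * in.green + 0.131f * in.blue;
    out.red   = r > 255.0f ? 255.0f : r;
    out.green = g > 255.0f ? 255.0f : g;
    out.blue  = b > 255.0f ? 255.0f : b;
}

// Serial call site: the compiler can inline process_image() and
// auto-vectorize this loop.
void sepia_serial(const rgb *in, rgb *out, int num_pixels)
{
    for (int i = 0; i < num_pixels; ++i)
        process_image(in[i], out[i]);
}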

Establishing a Performance Baseline

To set a performance baseline for the improvements that follow in this tutorial, build the project with the Microsoft* C++ compiler in Visual Studio* (on Windows*). Run the executable (Debug > Start Without Debugging). The program opens a window that displays its execution time in clock ticks. Record the execution time reported in the output.

Building the Project with the Intel C++ Compiler

Convert the project to use the Intel C++ Compiler. To do this, right-click the solution and select Intel Composer XE 20XX > Use Intel C++.

The XX above refers to the version of Intel Composer XE (e.g. 2011, 2013, etc.) installed on your system. Once the project is converted to an Intel project, follow the steps below to set the project properties:

1. Select Project > Properties > C/C++ > General > Suppress Startup Banner > No.

Then click Language [Intel C++] > Recognize The Restrict Keyword > Yes (/Qrestrict).

The Intel C++ Compiler supports the restrict keyword for C++ even though it is a C99 extension. This qualifier can be applied to a data pointer to indicate that data accessed through that pointer will not alias data accessed through other pointers. The restrict keyword therefore enables the compiler to perform certain optimizations on the premise that a given object cannot be changed through another pointer. You must ensure that restrict-qualified pointers are used as they are intended to be used; otherwise, undefined behavior may result.
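The following minimal sketch, not taken from the sample, illustrates what the restrict qualifier promises the compiler:

// With /Qrestrict, the Intel C++ Compiler accepts the restrict keyword in C++.
// Declaring dst and src restrict promises that they never point to
// overlapping memory, so the loop can be vectorized without runtime
// overlap checks.
void scale(float * restrict dst, const float * restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}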

2. Select Project > Properties > C/C++ > Optimization > Optimization > Maximize Speed (/O2).

3. Select Project > Properties > C/C++ > Diagnostics [Intel C++] > Vectorizer Diagnostic Level > Loops Successfully and Unsuccessfully Vectorized (2) (/Qvec-report2).

4. Select Project > Properties > C/C++ > Code Generation > Floating Point Model > Fast (/fp:fast).

5. Select Project > Properties > C/C++ > Code Generation [Intel C++] > Add Processor-Optimized Code Path > Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX).

6. Vectorization can improve performance significantly for most applications and is enabled by default in the Intel C++ Compiler. To see the performance impact of vectorization on the Sepia filter, temporarily disable vectorization and observe the runtime performance. To do this, select Project > Properties > C/C++ > Command Line and add /Qno-vec.

Rebuild the project and run the executable. Record the execution time reported in the output.

7. Now re-enable vectorization by removing the /Qno-vec option. Rebuild the project, run the executable, and record the execution time reported in the output. You should see improved performance due to vectorization. This is the baseline against which subsequent improvements will be measured.

When establishing the baseline performance it is good practice to compare the vec-report2 results between the -O2 and -O3 optimization levels, because more vectorization candidates tend to appear at -O3. For this example, however, the -O2 and -O3 results are the same:

SepiaFilterCilkPlus.cpp(202): (col. 2) remark: LOOP WAS VECTORIZED.

The vectorization report indicates that the loop at the above line in SepiaFilterCilkPlus.cpp was vectorized. This is the for loop that is the call site of the process_image() function, which in this case is inlined, so the compiler vectorized the function body using the SIMD registers. The original serial implementation uses an Array of Structures (AOS) layout, which is not vectorization friendly because of the non-sequential memory accesses inherent in the algorithm. Often the overhead of non-sequential memory access makes vectorization unprofitable or inefficient, but in this example the compiler still deemed it profitable to vectorize the code despite the non-unit-stride memory access.

Implementation of Sepia filter kernel using Array Notation

Here we rewrite the original loop using array notation with the default vector length. On a CPU with 128-bit vector registers the default vector length is 4 (e.g. four 32-bit float data elements are loaded into a vector register).

1. Select Project > Properties > C/C++ > Preprocessor > Preprocessor Definitions, and add a new macro, AOS_AN.
2. Rebuild the project, then run the executable (Debug > Start Without Debugging) and record the execution time reported in the output.

The array notation version makes use of the SIMD registers and SIMD instruction set to handle operations on vector operands. The vectorization report shows that the array notation version of the loop was vectorized:

SepiaFilterCilkPlus.cpp(173): (col. 5) remark: LOOP WAS VECTORIZED.

For our Sepia filter example the performance of the array notation implementation will be almost the same as the auto-vectorized version in the previous case. The benefit is that while vectorizing arbitrary code is at the discretion of the compiler and cannot always be guaranteed, using array notation guarantees vectorization.
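A minimal sketch of what the array notation kernel could look like over the AOS data follows; it is assumed for illustration and is not the sample's exact code (clamping is omitted for brevity):

struct rgb { float red, green, blue; };

// Each statement applies one Sepia formula to a whole array section of
// length n; the compiler maps the section operations onto SIMD registers.
void process_image_an(const rgb *in, rgb *out, int n)
{
    out[0:n].red   = 0.393f * in[0:n].red + 0.769f * in[0:n].green + 0.189f * in[0:n].blue;
    out[0:n].green = 0.349f * in[0:n].red + 0.686f * in[0:n].green + 0.168f * in[0:n].blue;
    out[0:n].blue  = 0.272f * in[0:n].red + 0.534f * in[0:n].green + 0.131f * in[0:n].blue;
}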

Improving Performance by Using cilk_for

Here we introduce thread-level parallelization using the cilk_for construct. A cilk_for loop is a replacement for the normal C/C++ for loop that permits loop iterations to run in parallel on multiple cores. To enable multi-threading in this example, all you need to do is include the cilk header file, replace the for keyword in the loop with cilk_for, and enable the cilk_for version by adding the AOS_CILK_FOR macro (a sketch of the resulting loop follows below). Rebuilding the project with these changes ensures that the Sepia filter kernel not only makes use of SIMD registers (auto-vectorization) but also uses multiple cores, dividing the workload of the loop across multiple threads for additional speedup. The bigger the workload, the closer the speedup gets to the theoretical maximum. The input images provided can be used for testing; in increasing order of workload they are blackbuck.bmp, RGB_Lines.bmp, and test.bmp. The performance of the multi-threaded version improves as these images are used in that order, confirming that the bigger the workload, the higher the speedup across the cores.
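A sketch of the cilk_for version over the AOS data (names, coefficients, and loop shape are assumptions, not the sample's code):

#include <cilk/cilk.h>

struct rgb { float red, green, blue; };

// cilk_for distributes the pixel range across Cilk worker threads; the
// body of each chunk can still be auto-vectorized by the compiler.
void sepia_cilk_for(const rgb *in, rgb *out, int num_pixels)
{
    cilk_for (int i = 0; i < num_pixels; ++i) {
        out[i].red   = 0.393f * in[i].red + 0.769f * in[i].green + 0.189f * in[i].blue;
        out[i].green = 0.349f * in[i].red + 0.686f * in[i].green + 0.168f * in[i].blue;
        out[i].blue  = 0.272f * in[i].red + 0.534f * in[i].green + 0.131f * in[i].blue;
    }
}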

Improving Performance Further Using Structure of Arrays (SOA)

Up to this point the default implementation has used an Array of Structures (AOS) layout, which is not very vectorization friendly because of its non-sequential access patterns. The non-sequential access pattern results in gather/scatter instructions that reduce vectorization efficiency due to their long instruction latencies. Despite this, Intel Cilk Plus was able to deliver admirable performance. By rewriting the baseline implementation as a Structure of Arrays (SOA) we can further improve performance, because the unit-stride memory access pattern is vectorization friendly. It allows the compiler to generate faster linear vector memory load/store instructions (e.g. movaps or movups on Intel SIMD hardware) rather than the longer-latency gather/scatter instructions it would otherwise have to generate. The data structures used in the AOS and SOA implementations are sketched below. To demonstrate the performance boost from SOA, there are two different implementations: one exploiting SIMD features using array notation, and one exploiting both SIMD and multi-threading. To enable this section of the code in the example, add the SOA_AN macro to the preprocessor definitions.
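A sketch of the two layouts (member names are assumptions; the sample's actual declarations may differ):

// Array of Structures (AOS): one struct per pixel, channels interleaved
// in memory, so walking one channel is a stride-3 (non-unit) access.
struct rgb {
    float red;
    float green;
    float blue;
};

// Structure of Arrays (SOA): one contiguous array per channel, so the
// kernel reads and writes each channel with unit stride.
struct rgb_soa {
    float *red;    // width * height elements
    float *green;  // width * height elements
    float *blue;   // width * height elements
};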

Rebuild the project with the above setting to vectorize the code:

SepiaFilterCilkPlus.cpp(141): (col. 2) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(147): (col. 2) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(182): (col. 3) remark: LOOP WAS VECTORIZED.
SepiaFilterCilkPlus.cpp(178): (col. 2) remark: loop was not vectorized: not inner loop.
SepiaFilterCilkPlus.cpp(210): (col. 2) remark: loop was not vectorized: vectorization possible but seems inefficient.
SepiaFilterCilkPlus.cpp(210): (col. 2) remark: loop was not vectorized: vectorization possible but seems inefficient.

The process_image() function containing the array notation code is invoked and vectorized. Everything said in the section "Implementation of Sepia filter kernel using Array Notation" applies here as well, except that the code operates on a different data structure, one that supports unit-stride memory access. The performance numbers should show a significant improvement over the AOS counterpart measured earlier.

Improving Performance by Using cilk_for (SOA)

To enable this section of the code, add the SOA_CILK_FOR macro and replace the for with cilk_for, as sketched below:
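A sketch of an SOA cilk_for kernel (assumed shape, not the sample's exact code); each iteration handles one row, and the row body is written in array notation so it vectorizes as well:

#include <cilk/cilk.h>

void process_image_soa_cilk(const float *r, const float *g, const float *b,
                            float *rs, float *gs, float *bs,
                            int width, int height)
{
    cilk_for (int y = 0; y < height; ++y) {
        int o = y * width;   // start of this row in each channel array
        rs[o:width] = 0.393f * r[o:width] + 0.769f * g[o:width] + 0.189f * b[o:width];
        gs[o:width] = 0.349f * r[o:width] + 0.686f * g[o:width] + 0.168f * b[o:width];
        bs[o:width] = 0.272f * r[o:width] + 0.534f * g[o:width] + 0.131f * b[o:width];
    }
}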

Rebuilding the project produces the same vectorization report as the array notation version, but this time the workload is divided among multiple threads and executed across different cores, gaining more performance than the AOS counterpart earlier.

Using cilk_for and Array Notation Together

To use cilk_for and array notation together, the array needs to be broken into multiple segments that are distributed across multiple Cilk worker threads. Doing so, however, overrides the Cilk runtime heuristics, which in general leads to lower performance, particularly for this example; you will get better performance if you let the Cilk runtime do the load balancing. To experiment with this, enable the array notation code section explained earlier by using the SOA_AN macro. By default the SOA_AN code section has no cilk_for and uses num_of_seg = 1, which means the full array is handled by one thread (a sketch of the segmented variant follows below).
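A hedged sketch of the segmented variant described above: the SOA arrays are split into num_of_seg segments, each processed with array notation inside a cilk_for. Names and shapes are assumptions, not the sample's code:

#include <cilk/cilk.h>

// With num_of_seg = 1 this degenerates to the pure array notation version.
void sepia_segments(const float *r, const float *g, const float *b,
                    float *rs, float *gs, float *bs,
                    int n, int num_of_seg)
{
    int seg = n / num_of_seg;                 // assume n divides evenly for brevity
    cilk_for (int s = 0; s < num_of_seg; ++s) {
        int o = s * seg;                      // start of this segment
        rs[o:seg] = 0.393f * r[o:seg] + 0.769f * g[o:seg] + 0.189f * b[o:seg];
        gs[o:seg] = 0.349f * r[o:seg] + 0.686f * g[o:seg] + 0.168f * b[o:seg];
        bs[o:seg] = 0.272f * r[o:seg] + 0.534f * g[o:seg] + 0.131f * b[o:seg];
    }
}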

To use cilk_for with array notation, simply change the for loop to cilk_for and set num_of_seg to the number of array segments you want to create. You will notice that performance decreases as you increase num_of_seg, because you incur more overhead while there is not enough work for all threads. The best recommendation for using cilk_for and array notation together is to use short vectors, that is, section lengths equal to the vector register size or a multiple of it. This enables vectorization that needs no peeling (if the data is aligned) and no cleanup loop.

Implementation of Sepia filter kernel using Elemental Functions

An Intel Cilk Plus elemental function is a regular function that can be invoked either on scalar arguments or internally by the compiler on array elements in parallel, to vectorize function calls within a loop that could otherwise prevent vectorization of the loop. In our example the compiler inlines the call to process_image() in the loop, which already enables vectorization, so an elemental function is not necessary and using one would not change the performance of the code. However, if you needed an elemental version of the function, all you would need to do is declare it as follows:

// Declaring process_image() as an elemental function
__declspec(vector) void process_image(rgb &indataset, rgb &outdataset);
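For illustration only, here is a self-contained sketch of an elemental version and a calling loop; the body and loop are assumptions, and __declspec(vector) is the Windows spelling (__attribute__((vector)) is the Linux/OS X equivalent):

struct rgb { float red, green, blue; };

// Elemental (SIMD-enabled) function: the compiler generates both a scalar
// and a vector variant of this routine.
__declspec(vector)
void process_image(rgb &in, rgb &out)
{
    out.red   = 0.393f * in.red + 0.769f * in.green + 0.189f * in.blue;
    out.green = 0.349f * in.red + 0.686f * in.green + 0.168f * in.blue;
    out.blue  = 0.272f * in.red + 0.534f * in.green + 0.131f * in.blue;
}

// The compiler may invoke the vector variant from this loop, processing
// several pixels per call instead of one.
void apply_filter(rgb *in, rgb *out, int n)
{
    for (int i = 0; i < n; ++i)
        process_image(in[i], out[i]);
}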

For more information on elemental functions, please see Elemental Functions in the reference section of this document.

References

For more information on SIMD vectorization, Intel Compiler automatic vectorization, elemental functions, and examples of using other Intel Cilk Plus constructs, refer to:

A Guide to Autovectorization Using the Intel C++ Compilers
Requirements for Vectorizing Loops
Requirements for Vectorizing Loops with #pragma SIMD
Getting Started with Intel Cilk Plus Array Notations
SIMD Parallelism using Array Notation
Intel Cilk Plus Language Extension Specification
Elemental functions: Writing data parallel code in C/C++ using Intel Cilk Plus
Using Intel Cilk Plus to Achieve Data and Thread Parallelism
