C Language Constructs for Parallel Programming

Similar documents
C Language Constructs for Parallel Programming

Cilk Plus in GCC. GNU Tools Cauldron Balaji V. Iyer Robert Geva and Pablo Halpern Intel Corporation

GAP Guided Auto Parallelism A Tool Providing Vectorization Guidance

Parallel Programming Features in the Fortran Standard. Steve Lionel 12/4/2012

Using Intel Inspector XE 2011 with Fortran Applications

Intel MKL Data Fitting component. Overview

Getting Compiler Advice from the Optimization Reports

Enabling DDR2 16-Bit Mode on Intel IXP43X Product Line of Network Processors

Software Tools for Software Developers and Programming Models

Intel Parallel Amplifier Sample Code Guide

Techniques for Lowering Power Consumption in Design Utilizing the Intel EP80579 Integrated Processor Product Line

Intel MKL Sparse Solvers. Software Solutions Group - Developer Products Division

Beyond Threads: Scalable, Composable, Parallelism with Intel Cilk Plus and TBB

Intel C++ Compiler Documentation

Intel IT Director 1.7 Release Notes

Open FCoE for ESX*-based Intel Ethernet Server X520 Family Adapters

Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes

Intel(R) Threading Building Blocks

Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation

How to Configure Intel X520 Ethernet Server Adapter Based Virtual Functions on SuSE*Enterprise Linux Server* using Xen*

Overview of Intel Parallel Studio XE

What's new in VTune Amplifier XE

Intel MPI Library for Windows* OS

Overview of Intel MKL Sparse BLAS. Software and Services Group Intel Corporation

Product Change Notification

Intel Xeon Phi Coprocessor. Technical Resources. Intel Xeon Phi Coprocessor Workshop Pawsey Centre & CSIRO, Aug Intel Xeon Phi Coprocessor

Product Change Notification

Parallel Programming Models

Intel IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor: Boot-Up Options

Product Change Notification

Intel(R) Threading Building Blocks

Product Change Notification

Product Change Notification

Installation Guide and Release Notes

Product Change Notification

Intel Platform Controller Hub EG20T

Повышение энергоэффективности мобильных приложений путем их распараллеливания. Примеры. Владимир Полин

ECC Handling Issues on Intel XScale I/O Processors

Product Change Notification

MayLoon User Manual. Copyright 2013 Intel Corporation. Document Number: xxxxxx-xxxus. World Wide Web:

Intel Platform Controller Hub EG20T

Product Change Notification

Product Change Notification

Introduction to Intel Fortran Compiler Documentation. Document Number: US

Product Change Notification

Product Change Notification

Product Change Notification

Bitonic Sorting Intel OpenCL SDK Sample Documentation

Continuous Speech Processing API for Host Media Processing

Third Party Hardware TDM Bus Administration

Product Change Notification

Product Change Notification

Product Change Notification

Expand Your HPC Market Reach and Grow Your Sales with Intel Cluster Ready

Product Change Notification

Product Change Notification

Product Change Notification

Intel IXP400 Software: Integrating STMicroelectronics* ADSL MTK20170* Chipset Firmware

Product Change Notification

Intel IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor PCI 16-Bit Read Implementation

Product Change Notification

Product Change Notification

Using the Intel VTune Amplifier 2013 on Embedded Platforms

VTune(TM) Performance Analyzer for Linux

Product Change Notification

A Simple Path to Parallelism with Intel Cilk Plus

Product Change Notification

Intel Platform Controller Hub EG20T

Installation Guide and Release Notes

Product Change Notification

What s New August 2015

Product Change Notification

Intel Fortran Composer XE 2011 Getting Started Tutorials

Intel Software Development Products Licensing & Programs Channel EMEA

Product Change Notification

Cilk Plus GETTING STARTED

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017

Intel EP80579 Software Drivers for Embedded Applications

OpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel

Product Change Notification

Intel Parallel Studio XE 2011 for Linux* Installation Guide and Release Notes

Installation Guide and Release Notes

Product Change Notification

Enabling Hardware Accelerated Playback for Intel Atom /Intel US15W Platform and IEGD

Intel Thread Profiler

Intel Integrated Performance Primitives for Intel Architecture

Product Change Notification

Obtaining the Last Values of Conditionally Assigned Privates

Installation Guide and Release Notes

Product Change Notification

Recommended JTAG Circuitry for Debug with Intel Xscale Microarchitecture

Continuous Speech Processing API for Linux and Windows Operating Systems

Product Change Notification

Product Change Notification

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Product Change Notification

Product Change Notification

Floating-point control in the Intel C/C++ compiler and libraries or Why doesn t my application always give the same answer?

Intel Xeon Phi Programmability (the good, the bad and the ugly)

Vectorization Advisor: getting started

Transcription:

C Language Constructs for Parallel Programming Robert Geva 5/17/13 1

Cilk Plus Parallel tasks Easy to learn: 3 keywords Tasks, not threads Load balancing Hyper Objects Array notations Elemental Functions SIMD Loops Mitigate data races on non-local variables Data-parallel array operations Targets SIMD Data-parallel function mapping Vectorization annotation for loops Single threaded vector parallelism 2 5/17/13 2

cilk_spawn and cilk_sync Keywords int tree_walk(node *nodep) { int a = 0, b = 0; if (nodep- >left) a = _Cilk_spawn tree_walk(nodep- >left); if (nodep- >right) b = _Cilk_spawn tree_walk(nodep- >right); int c = f(nodep- >value); _Cilk_sync; return a + b + c; Implicit sync at the end of every function keeps code well structured Asynchronous recursive call to tree_wak Call to f() can run in parallel with recursive tree walks 5/17/13 3

Serialization of Tree-walk Example int tree_walk(node *n) { int a = 0, b = 0; if (n- >left) a = _Cilk_spawn tree_walk(n- >left); if (n- >right) b = _Cilk_spawn tree_walk(n- >right); int c = f(n- >value); _Cilk_sync; return a + b + c; 5/17/13 4

Example of keyword vs. pragma X = f1(a,b) + _Cilk_spawn f2(c,d); X = _Cilk_spawn f1(a,b) + f2(c,d); The above is currently disallow in Cilk Plus But this is not a necessary restriction Can be allowed The pragmas are separate from the C expression Hard to point out an exact point within a sub expression

cilk_for Loop Loop invariant. cilk_for (int i = start; i < finish; i += stride) f(); { /* Body of loop uses i */ All iterations complete before f() execute The loops has to be a countable loop Multiple linear increment allowed Iterations can execute in parallel. 5/17/13 6

Reducer Hyperobjects Traditional reduction on a parallel for loop: long a[sz]; reducer_opadd<int> sum = 0; cilk_for (int i = 0; i < sz; ++i) sum += a[i]; Parallel accesses each get their own view Generalized reduction for any code executing in parallel: reducer_opadd<int> sum = 0; void sum_tree(node* nodep) { if (nodep- >left) cilk_spawn sum_tree(nodep- >left); if (nodep- >right) cilk_spawn sum_tree(nodep- >right); sum += nodep- >value; 5/17/13 7

Array Notation Example Serial Example float dot_product(unsigned int sz, float A[], float B[]) { float dp=0.0f; for (int i=0; i<size; i++) dp += A[i] * B[i]; return dp; Array Notation Version float dot_product(unsigned int sz, float A[], float B[]) { return sec_reduce_add(a[0:sz] * B[0:sz]); Intrinsic reduction Array Section Element-wise multiplication 5/17/13 8

Rank and Shape An array section doesn't have a new kind of type the type of an array section is exactly that of the analogous subscript expression. Additionally, an array section has rank and shape. A section implicitly iterates over some elements of an array. Rank is the number of levels of loop nesting (i.e. dimensions) in the iteration space. Shape is a (mathematical) vector of lengths. (The rank is the same as the length of the shape vector.)

Rank and Shape (continued) The rank of an expression is determined statically. In general the shape of a section is determined dynamically. Expression Rank Shape a[0] 0 a[0:n] 1 n a[0][i:10] 1 10 a[i:n][j:m] 2 n m

Shapes have to match If array size is not known, both lower-bound and length must be specified Section ranks and lengths ( shapes ) must match. Scalars are OK. a[0:5] = b[0:6]; // No. Size mismatch. a[0:5][0:4] = b[0:5]; // No. Rank mismatch. a[0:5] = b[0:5][0:5]; // No. No 2D->1D a[0:4] = 5; // OK. 4 elements of A filled w/ 5. a[0:4] = b[i]; // OK. Fill with scalar b[i]. a[10][0:4] = b[1:4]; // OK. Both are 1D sections. b[i] = a[0:4]; // No. 1D 0 D

Array Notations Vector Operations Selection of array elements vector refers to a 1D array. Current implementation is does not allow [:] to be overloaded, e.g., for std::vector. A[:] B[2:6] // All of vector A // Elements 2 to 7 of vector B C[:][5] // Column 5 of matrix C D[0:3:2] // Elements 0,2,4 of vector D Masked vector operations if (a[:] > b[:]) { // Create a (logical) bit-mask, M c[:] = d[:] * e[:]; // For elements where M contains 1 else { c[:] = d[:] * 2; // For elements where M contains 0 Array x scalar operation 5/17/13 12

Vector Loop: Order of Evaluation simd_for (int n = 0; n < N; ++n) { a[n] += b[n]; c[n] += d[n]; for (int n = 0; n < N; n+=2) { t1 = a[n]; t2 = a[n+1]; // a[n+1] can be written // before c[n] and d[n] are read t5 = b[n]; t6 = b[n+1]; t1 += t5; t2 += t6; a[n] = t1; a[n+1] = t2; t3 = c[n]; t4 = c[n+1]; // c[n+1] can only be accessed // after a[n] t5 = d[n]; t6 = d[n+1]; t3 += t5; t4 += t6 c[n] = t3; d[n] = t4;

Uniform vs. Private: Illustration double b = get_position(); simd_for (int i = 0; i < N; ++i) { double t; t = y[i] * cos(z[i]); a[i] = t / b; b is uniform, t is private The proposal is mapping the concepts of a uniform and a private variables onto existing syntax Assignments to b inside the loop shall result in uniform values, otherwise the behavior is undefined.

Elemental Functions - Example Defining an elemental function: double option_price_call_black_scholes( { double S, double K, double r, double sigma, double time) _Simd double time_sqrt = sqrt(time); double d1 = (log(s/k)+r*time)/(sigma*time_sqrt) + 0.5*sigma*time_sqrt; double d2 = d1- (sigma*time_sqrt); return S*N(d1) - K*exp(- r*time)*n(d2); 5/17/13 15

Illustration void vec_add ( float *r, float *op1, float *op2, int i) simd (chunk (N)) simd (uniform (r,op1, op2), linear (i), chunk(n)) { r[i] = op1[i] + op2[i]; Two vector versions and one scalar ssimd_for (int i = 0; i<n; ++i) { vec_add(a,b,c,i); Call matches the version with the uniforms simd_for (int i = 0; i<n; ++i) { vec_add(a[x1[[i]],b[x2[[i]],c[x3[[i]],i); Call matches the version w/o the uniforms

5/17/13 17

Optimization Notice Optimization Notice Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the Intel Compiler User and Reference Guides under Compiler Options." Many library routines that are part of Intel compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel Streaming SIMD Extensions 3 (Intel SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision #20101101 5/17/13 18

Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vpro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vpro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2011. Intel Corporation. http://intel.com/software/products 5/17/13 19

Joint proposal between Cilk Plus and OpenMP A minimal language The language does not mandate a scheduling technique The language allows / does not disallow dynamic load balancing Serial semantics and serial equivalence Well integrated into the C language