Ernesto Su, Hideki Saito, Xinmin Tian Intel Corporation. OpenMPCon 2017 September 18, 2017

Similar documents
OpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing

Bitonic Sorting. Intel SDK for OpenCL* Applications Sample Documentation. Copyright Intel Corporation. All Rights Reserved

Intel Cache Acceleration Software for Windows* Workstation

How to Create a.cibd File from Mentor Xpedition for HLDRC

How to Create a.cibd/.cce File from Mentor Xpedition for HLDRC

Drive Recovery Panel

Sample for OpenCL* and DirectX* Video Acceleration Surface Sharing

Intel Atom Processor E3800 Product Family Development Kit Based on Intel Intelligent System Extended (ISX) Form Factor Reference Design

Theory and Practice of the Low-Power SATA Spec DevSleep

Intel Core TM i7-4702ec Processor for Communications Infrastructure

Desktop 4th Generation Intel Core, Intel Pentium, and Intel Celeron Processor Families and Intel Xeon Processor E3-1268L v3

Intel Atom Processor E6xx Series Embedded Application Power Guideline Addendum January 2012

Intel Core TM Processor i C Embedded Application Power Guideline Addendum

Intel Atom Processor D2000 Series and N2000 Series Embedded Application Power Guideline Addendum January 2012

MICHAL MROZEK ZBIGNIEW ZDANOWICZ

Intel Integrated Native Developer Experience 2015 (OS X* host)

Intel RealSense Depth Module D400 Series Software Calibration Tool

Intel SDK for OpenCL* - Sample for OpenCL* and Intel Media SDK Interoperability

Bitonic Sorting Intel OpenCL SDK Sample Documentation

INTEL PERCEPTUAL COMPUTING SDK. How To Use the Privacy Notification Tool

Intel USB 3.0 extensible Host Controller Driver

Data Center Efficiency Workshop Commentary-Intel

Intel & Lustre: LUG Micah Bhakti

Krzysztof Laskowski, Intel Pavan K Lanka, Intel

Intel RealSense D400 Series Calibration Tools and API Release Notes

6th Generation Intel Core Processor Series

Intel Cache Acceleration Software - Workstation

Obtaining the Last Values of Conditionally Assigned Privates

2013 Intel Corporation

Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes

Intel Open Source HD Graphics Programmers' Reference Manual (PRM)

Intel Ethernet Controller I350 Frequently Asked Questions (FAQs)

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Intel Xeon Phi Coprocessor. Technical Resources. Intel Xeon Phi Coprocessor Workshop Pawsey Centre & CSIRO, Aug Intel Xeon Phi Coprocessor

Using the Intel VTune Amplifier 2013 on Embedded Platforms

Intel Open Source HD Graphics, Intel Iris Graphics, and Intel Iris Pro Graphics

Building on The NVM Programming Model A Windows Implementation

Intel Galileo Firmware Updater Tool

Product Change Notification

Intel Stereo 3D SDK Developer s Guide. Alpha Release

Installation Guide and Release Notes

Clear CMOS after Hardware Configuration Changes

OMNI-PATH FABRIC TOPOLOGIES AND ROUTING

Intel Manycore Platform Software Stack (Intel MPSS)

Software Evaluation Guide for WinZip* esources-performance-documents.html

Product Change Notification

Evolving Small Cells. Udayan Mukherjee Senior Principal Engineer and Director (Wireless Infrastructure)

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017

Intel Desktop Board DZ68DB

Intel Dynamic Platform and Thermal Framework (Intel DPTF), Client Version 8.X

Intel Graphics Virtualization Technology. Kevin Tian Graphics Virtualization Architect

OpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel

The Intel Processor Diagnostic Tool Release Notes

Intel Unite Plugin Guide for VDO360 Clearwater

Product Change Notification

Intel Xeon Phi coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012

Product Change Notification

Intel Parallel Studio XE 2011 SP1 for Linux* Installation Guide and Release Notes

Intel Embedded Media and Graphics Driver v1.12 for Intel Atom Processor N2000 and D2000 Series

The Intel SSD Pro 2500 Series Guide for Microsoft edrive* Activation

LED Manager for Intel NUC

Product Change Notification

High Dynamic Range Tone Mapping Post Processing Effect Multi-Device Version

True Scale Fabric Switches Series

Intel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes

Installation Guide and Release Notes

Product Change Notification

Product Change Notification

Accelerate Finger Printing in Data Deduplication Xiaodong Liu & Qihua Dai Intel Corporation

Intel Cache Acceleration Software (Intel CAS) for Linux* v2.9 (GA)

Lustre Beyond HPC. Presented to the Lustre* User Group Beijing October 2013

Installation Guide and Release Notes

Product Change Notification

HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing. Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF

Reference Boot Loader from Intel

Product Change Notification

Carsten Benthin, Sven Woop, Ingo Wald, Attila Áfra HPG 2017

Customizing an Android* OS with Intel Build Tool Suite for Android* v1.1 Process Guide

Product Change Notification

Product Change Notification

Intel Parallel Studio XE 2011 for Linux* Installation Guide and Release Notes

Intel Manageability Commander User Guide

Data Plane Development Kit

Product Change Notification

Intel Desktop Board D945GCLF2

Intel vpro Technology Virtual Seminar 2010

Intel vpro Technology Virtual Seminar 2010

Product Change Notification

Bring Intelligence to the Edge with Intel Movidius Neural Compute Stick

Product Change Notification

Product Change Notification

What s P. Thierry

Optimizing the operations with sparse matrices on Intel architecture

Product Change Notification

Product Change Notification

Product Change Notification

Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany

Product Change Notification

Product Change Notification

Intel Open Network Platform Release 2.0 Hardware and Software Specifications Application Note. SDN/NFV Solutions with Intel Open Network Platform

Transcription:

Ernesto Su, Hideki Saito, Xinmin Tian Intel Corporation OpenMPCon 2017 September 18, 2017

Legal Notice and Disclaimers By using this document, in addition to any agreements you have with Intel, you accept the terms set forth below. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information about performance and benchmark results, visit Performance Test Disclosure Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to: Learn About Intel Processor Numbers Intel Advanced Vector Extensions (Intel AVX)* are designed to achieve higher throughput to certain integer and floating point operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you should consult your system manufacturer for more information. *Intel Advanced Vector Extensions refers to Intel AVX, Intel AVX2 or Intel AVX-512. For more information on Intel Turbo Boost Technology 2.0, visit http://www.intel.com/go/turbo Optimization Notice Intel's compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Copyright 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 2

Motivation Why can t we do this seemingly simple thing? Financial code from customers has this pattern, and we can t vectorize it with simd X x = #pragma omp simd linear(x:1) Error! x is not of integral type for(int i=0; i<n; i++){ work(x); x ++; // ++ is defined for class X } 3

Inductions Beyond the linear Clause linear : limited to the + operator on integral types We want to extend OpenMP to support More operators Other built-in operators: -,, and User-defined operators More data types Other built-in data types: float, double, etc. User-defined types, including non-pod types 4

Agenda Induction overview The induction clause Syntax Example: polynomial evaluation User-defined induction (UDI): the declare induction construct Syntax Parallelization example Vectorization example Conclusions 5

Induction Overview Recursive form: x i = x i-1 s the inductive operator (or inductor ) i index (zero-normalized) s step expression x 0 initial value Expansion: x i = x 0 s s s (i times) Collection: often the i s-terms can be collected into s i = s i the collective operator (or collector ) s i collective step expression Closed form: when collection is possible, we can express the induction with the closed form x i = x 0 ( s i ) 6

Induction of Basic Arithmetic Operations Recursive form s i Closed form + x i = x i-1 + s s * i x i = x 0 + s * i - x i = x i-1 s s * i x i = x 0 s * i * x i = x i-1 * s s i x i = x 0 * s i / x i = x i-1 / s s i x i = x 0 / s i 7

Proposal: induction Clause induction( induction-id : var-list : step ) To be used with the OpenMP loop, distribute, and simd constructs induction-id : can be a built-in op (+, -, *, /) or user-defined var-list : one or more variables. Intergral or FP for built-in operators. Can be of non-pod types for user-defined operators. step : the step expression Examples induction( + : x, y : 1 ) is equivalent to linear( x, y ) if x, y are integral induction( * : x : s ) describes the nonlinear induction x i = x i-1 * s induction( foo : x : s ) uses a user-defined induction operator foo ; x and s can be of different non-pod types 8

Example: Evaluating a Polynomial Compute the value of i=0 N c i x i float c[n]; float s = x; // the coefficients // step is the value of x for which // to evaluate the polynomial float xi = 1.0; // x^i; initial value x^0 == 1 float value = 0.0; // accumulator for the result #pragma omp simd reduction(+:value) induction(* : xi : s) for(i=0; i<=n; i++){ value += c[i] * xi; xi *= s; } 9

Going Beyond Built-in Types and Operators The user needs to specify the following Type of induction variable Type of the step expression Inductive operator Collective operator (optional) Allows computation, in constant time, of the initial value of the induction variable for each thread or SIMD lane by using the closed-form Omitting the collector may negatively affect performance The proposed syntax has a form similar to UDR s declare reduction construct 10

User-Defined Reduction (UDR) Recap #pragma omp declare reduction ( reduction-id : type-list : combiner ) [ initializer( init-expr ) ] reduction-id : name of this reduction operation type-list : list of type specifiers for the reduction variables combiner : expression to combine partial results of the given type(s) Two special variables: omp_in and omp_out Example1: omp_out = omp_out + omp_in Example2: foo(&omp_out, omp_in) init-expr: Two special variables: omp_priv and omp_orig Example: omp_priv = { MAX_INT, MAX_INT } 11

Proposal: User-Defined Induction (UDI) #pragma omp declare induction ( induction-id : induction-type : step-type : inductor ) [collector( collector )] induction-id : identifier for the operation, to be used in an induction clause induction-type : type specifier for the induction variables step-type : type specifier for the step expression inductor : specifies the inductive operation: x = x s omp_out represents x omp_step represents s C++ Example: omp_out = omp_out + omp_step, where + is overloaded C Example: add(&omp_out, omp_step) 12

Proposal: User-Defined Induction (UDI) (cont) collector : omp_step represents s omp_index represents the logical, zero-normalized index i C++ Example: omp_step = omp_step * omp_index, where * may be overloaded C Example: cs(&omp_step, omp_index) If collector is available, then the closed form is x i = x 0 ( s i ) 13

Example1 (Parallelization): Source class A; class S; // class of induction variables // class of the step expression #pragma omp declare induction ( op : A : S : \ omp_out = omp_out + omp_step ) \ collector ( omp_step = omp_step * omp_index ) // A A+S and S S*int must be defined... A a; S s; // initialized by constructors... #pragma omp parallel for induction( op : a : s ) for(int i=0; i<n; i++) { work(a); a=a+s; } 14

Example1: Compiler Front-End Output void inductor(a *omp_out_ptr, S *omp_step_ptr) { } A_add_S(omp_out_ptr, omp_step_ptr); void collector(s *omp_step_ptr, int omp_index) { } S_mult_int(omp_step_ptr, omp_index); 15

Example1: Threaded Pseudocode //...Call runtime to obtain lb,ub for each thread... firstprivatize(a,s); // privatized and copy-constructed in each thr collector(&s, lb); // Use the closed form to find a s initial inductor(&a, &s) // value for each thread in const time for(int i=lb; i<ub; i++) { work(a); a = a + s; } lastprivatize(a); // lastprivate copy-out for a 16

Example2 (Vectorization): Source class A; class S; // class of induction variables // class of the step expression #pragma omp declare induction ( op : A : S : \ omp_out = omp_out + omp_step ) \ collector ( omp_step = omp_step * omp_index ) // A A+S and S S*int must be defined... A a; S s; // initialized by constructors... #pragma omp simd induction( op : a : s ) for(int i=0; i<n; i++) { work(a); a=a+s; } 17

Example2: Vectorized Pseudocode A a; S s; A aa[vl] = bcast(a); S ss[vl] = bcast(s); // original scalar non-pod variables // initialize vector for a // initialize vector for s int idx[vl] = {VL-1,VL-2,,1,0}; simd_collector(ss, idx); simd_inductor(aa, ss); // Get vector step ss[:]=idx[:]*ss[:] // Get initial vector aa[:]=aa[:]+ss[:] S t = s; collector(&t, VL); // Use collector to create a vector step ss[:]=bcast(t); // ss[:] = bcast( s * VL ) for(int i=0; i<n; i+=vl) { simd_work(aa[:]); aa[:] = aa[:] + ss[:]; } // copy out last aa to a 18

Compiler Implementation Status Intel C/C++ compiler Front-end and IR: done Parallelizer and Vectorizer back-end: In progress Intel Fortran compiler Front-end: TBD Back-end (shared with C/C++) Runtime No changes expected 19

Conclusions Real applications demand OpenMP support for induction beyond that provided by the linear clause The proposed declare induction construct and induction clause will support More induction operations, including user-defined operations More data types, including user-defined class types Examples were presented 20