Compiling for Scalable Computing Systems: the Merit of SIMD
Ayal Zaks, Intel Corporation
Acknowledgements: too many to list

First, thanks to the Technion. FIRST (For Inspiration and Recognition of Science and Technology) is the leading pre-university program of the Technion. 05/02/2015: the Technion honors Dean Kamen, inventor and entrepreneur, with an honorary doctorate.

Takeaways
1. SIMD is mainstream and ubiquitous in HW
2. Compiler support for SIMD is necessary, and it's fun, but it's insufficient
3. Help SIMD become mainstream in SW with Compiler, Programming Language and Tools support

More Cores, Threads, Wider SIMD

Intel Xeon processor 64-bit: SSE3, 1 core, 2 threads, 128-bit SIMD
Intel Xeon processor 5100 series: SSSE3, 2 cores, 2 threads, 128-bit SIMD
Intel Xeon processor 5500 series (Nehalem): SSE4.2, 4 cores, 8 threads, 128-bit SIMD
Intel Xeon processor 5600 series: SSE4.2, 6 cores, 12 threads, 128-bit SIMD
Intel Xeon processor code-named Sandy Bridge EP: AVX, 8 cores, 16 threads, 256-bit SIMD
Intel Xeon processor code-named Ivy Bridge EP: AVX, 12 cores, 24 threads, 256-bit SIMD
Intel Xeon processor code-named Haswell EP: AVX2, 18 cores, 36 threads, 256-bit SIMD
Intel Xeon Phi coprocessor Knights Corner: 61 cores, 244 threads, 512-bit SIMD
Intel Xeon Phi processor & coprocessor Knights Landing (1): AVX-512, >61 cores, >244 threads, 512-bit SIMD

Product specifications for launched and shipped products are available on ark.intel.com. (1) Not launched or in planning.
1. SIMD is mainstream and ubiquitous in HW

Single Instruction, Multiple Data in a Register

AVX/AVX2: 16 x 256-bit registers; each holds, e.g., 8 floats, 4 doubles, 8 ints or 4 longs.

  float a[n], b[n], c[n];
  for (int i = 0; i < 8; ++i) {
    c[i] = a[i] + b[i];
  }
  Compiler vectorization: VADDPS YMM0, YMM1, YMM2

AVX-512: 32 x 512-bit registers; each holds, e.g., 16 floats, 8 doubles, 16 ints or 8 longs.

  float a[n], b[n], c[n];
  for (int i = 0; i < 16; ++i) {
    c[i] = a[i] + b[i];
  }
  Compiler vectorization: VADDPS ZMM0, ZMM1, ZMM2

Wider Vectors + Richer Instructions

Cumulative number of vector instructions, by ISA extension (year introduced): SSE (1999): 62; SSE2 (2001): 294; SSE3 (2004): 318; SSSE3 (2006): 378; SSE4 (2007): 471; SSE4.2 (2008): 485; AVX (2011): 1109; AVX2 (2013): 1557; AVX-512 (2015): 3833. (Chart markers: Nehalem, Sandy Bridge, Haswell, Skylake Xeon.)

2. Compiler support for SIMD is necessary

"The vectorizer is truly such a wonderfully sophisticated and delicate algorithm. Debugging it to find and pin-point the unlikely problem is something I found, simply put, fun." (Dr. Shahar Timnat)
2. Compiler support for SIMD is fun!

Insufficient? Solved in the '70s, no?
"The CRAY-1 Computer System", by Richard M. Russell, Cray Research, Inc., Minneapolis, MN. Communications of the ACM, January 1978 (Vol. 21, No. 1), pages 63-72.
Solved: innermost DO loop auto-vectorization.

Innermost DO Loop Vectorization

    DO 1 k = 1,n
  1   A(k) = B(k) + C(k)

Scalar code: k=1: Ld C(1), Ld B(1), Add, St A(1); k=2: Ld C(2), Ld B(2), Add, St A(2); ...
Vector code: k=1..2: Ld C(1),C(2), Ld B(1),B(2), Add, St A(1),A(2).

Loop auto-vectorizers: GCC by Dorit Nuzman (IBM), LLVM by Nadav Rotem (Apple).
Vector code generation is straightforward; the emphasis is on analysis and disambiguation.
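
As a minimal modern counterpart (my own sketch, not from the slides), the same innermost-loop pattern in C, which compilers such as GCC and LLVM auto-vectorize once they can prove (or we assert via 'restrict') that the arrays do not overlap:

#include <stddef.h>

/* The compiler turns this scalar loop into vector loads, a vector add,
 * and a vector store, exactly as in the DO-loop example above. */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n) {
    for (size_t k = 0; k < n; ++k)
        a[k] = b[k] + c[k];
}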

Vectorization Today: branches, nested loops, function calls

  #pragma omp simd reduction(+:...)
  for (p = 0; p < n; p++) {
    // Brown work
    if (...) {
      // Green work
    } else {
      // Red work
    }
    while (...) {        // Are all lanes done?
      // Gold work
      // Purple work
    }
    y = foo(x);          // Vector function call
    // Pink work
  }

(Diagram: scalar execution calls foo once per lane, p=0 with x1 -> y1 and p=1 with x2 -> y2; vectorized execution makes a single vector function call for p=0..1 with x1,x2 -> y1,y2.)

Insufficient: vector code generation has become a more difficult problem.
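
A compilable sketch of the same shape (hypothetical names and bodies, not the code behind the slide): a SIMD loop containing a branch, an inner while loop whose trip count differs per lane, and a call to a SIMD-enabled function.

#include <math.h>

#pragma omp declare simd
static float foo(float x) {               /* compiler also emits vector variants */
    return sqrtf(x) + 1.0f;
}

float divergent_loop(const float *a, float *c, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int p = 0; p < n; p++) {
        float x = a[p];
        if (x > 0.0f)                      /* control divergence: lanes take different paths */
            x = x * 2.0f;
        else
            x = -x;
        int iters = 0;
        while (x > 1.0f && iters < 16) {   /* per-lane trip counts differ */
            x *= 0.5f;
            ++iters;
        }
        c[p] = foo(x);                     /* resolved to a vector variant of foo */
        sum += c[p];
    }
    return sum;
}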

It's hard to vectorize what diverges. Two main divergence challenges for SIMD: 1. Control, 2. Data.

Vectorization Today: Challenges*
- Divergence analysis
- Predication and masking optimizations
- Outer-loop vectorization
- Less-than-full-vector vectorization
- Gather/scatter optimizations
- Sophisticated idioms vectorization
- Partial vectorization
- Function vectorization
Increasing need for user guidance for correctness and profitability (a small sketch illustrating two of these challenges follows below).
* non-exclusive list
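
As a toy illustration (my own sketch, not from the talk) of two of the challenges above: the loop below needs a gather for the indirect load and predication/masking for the conditional store.

/* Hypothetical example: the indirect access a[idx[i]] maps to a gather,
 * and the 'if' becomes a per-lane mask so only selected lanes store. */
void gather_and_mask(float *out, const float *a, const int *idx,
                     const float *threshold, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        float v = a[idx[i]];          /* gather */
        if (v > threshold[i])
            out[i] = v;               /* masked/predicated store */
    }
}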

Parallel Program Development (trade-offs: applicability, development cost, performance, maintainability, scalability)

1. Threading using operating system calls:
   pthread_create(...); ... pthread_join(...);
2. Industry-standard directives for explicit parallelization:
   #pragma omp for
   for (i = 0; i < ...; ++i) { ... }
   or: cilk_for (i = 0; i < ...; ++i) { ... }
   or ...
3. Automatically, using the compiler:
   for (i = 0; i < ...; ++i) { ... }
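
For illustration only (a minimal sketch of my own, not from the slides), option 1 spelled out: summing an array with raw OS threads via pthread_create/pthread_join. The directive-based options collapse all of this bookkeeping into a single pragma.

#include <pthread.h>

#define NTHREADS 4

typedef struct { const float *a; float partial; int begin, end; } task_t;

static void *worker(void *arg) {
    task_t *t = (task_t *)arg;
    t->partial = 0.0f;
    for (int i = t->begin; i < t->end; i++)    /* each thread sums its own chunk */
        t->partial += t->a[i];
    return NULL;
}

float sum_with_pthreads(const float *a, int n) {
    pthread_t threads[NTHREADS];
    task_t tasks[NTHREADS];
    int chunk = (n + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        tasks[t].a = a;
        tasks[t].begin = t * chunk;
        tasks[t].end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        pthread_create(&threads[t], NULL, worker, &tasks[t]);
    }
    float sum = 0.0f;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(threads[t], NULL);        /* wait, then combine partial sums */
        sum += tasks[t].partial;
    }
    return sum;
}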

Vector Program Development (trade-offs: applicability, development cost, performance, maintainability, scalability)

1. Intrinsics:
   __m128 t1, t2;
   t1 = _mm_mul_ps(_mm_loadu_ps(&a[i]), _mm_loadu_ps(&b[i]));
   t2 = _mm_add_ps(t1, t1);
2. Explicit vectorization:
   #pragma omp simd
   for (i = 0; i < n; i++) c[i] = a[i] + b[i];
3. Automatically, using the compiler:
   for (i = 0; i < n; i++) c[i] = a[i] + b[i];
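
For comparison, a self-contained sketch (my own; array and function names are arbitrary) of the same c[i] = a[i] + b[i] loop written with SSE intrinsics, including the scalar remainder loop that the pragma-based and auto-vectorized versions handle for you:

#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

void add_sse(float *c, const float *a, const float *b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {                 /* 4 floats per 128-bit register */
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)                           /* scalar remainder */
        c[i] = a[i] + b[i];
}

The extra bookkeeping (explicit loads/stores, remainder handling, a fixed vector width) is what drives up development cost and hurts maintainability and scalability relative to the directive and automatic approaches.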

Explicit Vectorization Example

  #pragma omp parallel for
  for (int y = 0; y < ImageHeight; ++y) {
    #pragma omp simd
    for (int x = 0; x < ImageWidth; ++x) {
      count[y][x] = mandelbrot(in_vals[y][x], max_iter);
    }
  }

  #pragma omp declare simd uniform(limit)
  int mandelbrot(fcomplex c, int limit) {
    fcomplex z = c;
    int iters = 0;
    while ((cabsf(z) < 2.0f) && (iters < limit)) {
      z = z * z + c;
      iters++;
    }
    return iters;
  }

Intel Xeon Phi system, Linux64, 61 cores running 244 threads at 1 GHz, 32 KB L1, 512 KB L2 per core. Intel C/C++ Compiler internal build.

(Explicit) Whole Function Vectorization
WFV: PhD thesis of Ralf Karrenberg, 2015. Extends vectorization across function boundaries; originally developed for implicit data-parallel languages.
OpenMP 4.0 employs WFV with new, explicit SIMD annotations. The recently published OpenMP examples include SIMD examples: http://openmp.org/mp-documents/openmp-examples-4.0.2.pdf

Some Recent References re: WFV*
Books:
- Automatic SIMD Vectorization of SSA-based Control Flow Graphs, Ralf Karrenberg, July 2015
- High Performance Parallelism Pearls, volume 2, James Reinders and Jim Jeffers, August 2015
Conference papers:
- Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures, Hee-Seok Kim et al., CGO 2015. Effectively maximizes WFV factors where possible and profitable on CPUs.
- Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures, Yunsup Lee et al., MICRO-47. Compares SW and HW techniques for supporting control divergence on GPUs.
- The Impact of the SIMD Width on Control-Flow and Memory Divergence, Thomas Schaub et al., HiPEAC 2015.
- Optimizing Overlapped Memory Accesses in User-directed Vectorization, Diego Caballero et al., to appear in ICS 2015. Technique and OpenMP proposal to support memory divergence on CPU and MIC.
Workshop papers:
- Predicate Vectors If You Must, Shahar Timnat et al., WPMVP@PPoPP 2014.
- Streamlining Whole Function Vectorization in C using Higher Order Vector Semantics, Gil Rapaport et al., PLC@IPDPS 2015.
* non-exclusive list

Friendly Compiler Optimization Report: feedback from static optimizing compilation is improving.

Intel Advisor XE Vectorization Advisor (in Beta)
- Integrates compiler diagnostics + performance data + SIMD efficiency stats
- Guidance: detects penalties and recommends improvements using OpenMP 4.0
- Deep dive: dependence checker, memory access pattern analysis

Takeaways
1. SIMD is mainstream and ubiquitous in HW.
2. Compiler support is necessary, fun, but insufficient.
3. Help SIMD become mainstream in SW by explicit vectorization:
   - Needs user guidance for today's applications
   - Similar to what OpenMP/Cilk/TBB did for parallelization
   - Maps threaded execution to SIMD hardware
   - Has advantages in development cost, applicability, performance, maintenance, scalability
4. Tools are improving.

The 4th Compiler, Architecture and Tools Conference: Mark Your Calendars and Stay Tuned!
November 23rd, 2015 @ Intel IDC Haifa.
Organized by: Gadi Haber, Ayal Zaks, Michal Nir and Leeor Peled from Intel SSGi and PEG; Dorit Nuzman from IBM; Erez Petrank from the Technion; Yosi Ben-Asher from Haifa U.

Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel s compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

THE END

OpenMP 4.0
Based on a proposal from Intel, building on customer success with the Intel Cilk Plus features in Intel compilers.

  #pragma omp simd reduction(+:val) reduction(+:val2)
  for (int pos = 0; pos < RAND_N; pos++) {
    float callvalue = expectedcall(sval, xval, mubyt, vbysqrtt, l_random[pos]);
    val += callvalue;
    val2 += callvalue * callvalue;
  }
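
The snippet above comes from a larger Intel sample, so its helpers are not shown; a minimal self-contained sketch of the same simd-reduction pattern (hypothetical names) looks like this:

/* Each lane accumulates private partial sums; the compiler combines them
 * at the end of the loop, as the reduction clauses request. */
void sum_and_sumsq(const float *x, int n, float *sum, float *sumsq) {
    float s = 0.0f, s2 = 0.0f;
    #pragma omp simd reduction(+:s) reduction(+:s2)
    for (int i = 0; i < n; i++) {
        s  += x[i];
        s2 += x[i] * x[i];
    }
    *sum = s;
    *sumsq = s2;
}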

OpenMP 4.0 YES VECTORIZE THIS! Note: per the OpenMP standard, the for-loop must have canonical loop form.
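
Roughly, canonical loop form means one induction variable with a simple initialization, a test against a loop-invariant bound, a fixed increment, and no jumps out of the loop. A sketch of my own showing a loop that qualifies and one that does not:

void canonical_examples(float *a, const float *b, int n) {
    /* Canonical form: the trip count is computable before the loop runs. */
    #pragma omp simd
    for (int i = 0; i < n; i += 2)      /* simple init, test, constant increment */
        a[i] = b[i] * 2.0f;

    /* Not in canonical form for omp simd: 'break' makes the trip count
     * data-dependent, so the pragma cannot be applied to this loop. */
    for (int i = 0; i < n; i++) {
        if (a[i] < 0.0f)
            break;
        a[i] = b[i] * 2.0f;
    }
}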

OpenMP 4.0 Make VECTOR versions of this function.
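
For instance (a generic sketch, not the function from the slides), '#pragma omp declare simd' asks the compiler to emit vector variants of a scalar function so that calls from SIMD loops stay vectorized:

#pragma omp declare simd uniform(scale) notinbranch
float scale_and_clamp(float x, float scale) {
    float y = x * scale;                  /* compiler also emits vector variants */
    return (y > 1.0f) ? 1.0f : y;
}

/* Called from a SIMD loop, the vector variant processes one value per lane. */
void apply(float *out, const float *in, float scale, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        out[i] = scale_and_clamp(in[i], scale);
}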

OpenMP 4.0 Parallelize and Vectorize.
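
Combining the two, a single combined construct both threads and vectorizes a loop; again a generic sketch rather than the slide's example:

/* Distributes iterations across threads, then vectorizes each thread's chunk. */
void saxpy(float *y, const float *x, float a, int n) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}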

HW provides many parallel SIMD lanes

SW + Compiler need to run them, or ...