Code modernization and optimization for improved performance using the OpenMP* programming model for threading and SIMD parallelism.


Parallel + SIMD is the Path Forward

The Intel Xeon and Intel Xeon Phi product families are both going parallel: more cores, more threads, wider vectors.

    Intel Xeon processor (code name)     Cores  Threads  SIMD width (bits)
    64-bit Intel Xeon processor              1        2     128
    Woodcrest (EP)                           2        2     128
    Nehalem (EP)                             4        8     128
    Westmere (EP)                            6       12     128
    Sandy Bridge (EP)                        8       16     256
    Ivy Bridge (EP)                         12       24     256
    Haswell (EP)                            18       36     256
    Skylake (EP) [1]                        28       56     512

    Intel Xeon Phi (code name)           Cores  Threads  SIMD width (bits)
    Knights Corner (x100 coprocessor)       61      244     512
    Knights Landing (x200 processor
      and coprocessor)                      72      288     512

OpenMP* is one of the most important vehicles for the parallel + SIMD path forward.

*Product specifications for launched and shipped products are available on ark.intel.com. [1] Not launched or in planning.

Intel tools help modernize your code

Leverage optimization tools and libraries: get ready to tune!

- Build: use Intel compilers (on hotspots) to generate optimal code.
- Build: use optimized performance libraries such as Intel MKL, Intel TBB and Intel IPP when appropriate.
- Profile: Intel VTune Amplifier identifies hotspots; Intel Advisor identifies vectorization and threading opportunities.
- Scalar, serial optimization: use the compiler and libraries with proper precision, flags, and appropriate functions.
- Vectorization: use compiler and library auto-vectorization and SIMD features in conjunction with data layout optimizations.
- Parallelization, part 1, threading (shared memory): profile thread scaling and affinitize threads to cores using Intel Inspector, Intel VTune Amplifier and Intel compilers.
- Parallelization, part 2, from multicore to many-core (distributed memory): profile and scale your application to distributed memory using Intel Trace Analyzer and Collector and the Intel MPI Library.

Agenda

Optimization of a sample kernel:
- Processor type
- Memory access
- Threading synchronization
- Thread affinity
- Explicit vectorization
- Other

Sample program

- Match a feature (sub-image) to a larger image.
- Brute force: test all locations.
- Look for the smallest sum of absolute differences (SAD), in intensity and in color index.
- For our tests, use a random pattern superposed on a random background.
- C and Fortran source versions are available. Issues and optimizations are closely similar for both, and closely similar for Windows* and Linux* (Linux results quoted).
- Hardware: Intel Xeon E7-4850 v3, 56 cores, 2.20 GHz base frequency, with support for Intel Advanced Vector Extensions 2 (Intel AVX2).

SAD kernel (C)

    struct image_data {
        P_TYPE *intens;   // intensity (x, y)
        P_TYPE *color;    // color (x, y)
        A_TYPE xmax;      // image size in x (# pixels)
        A_TYPE ymax;      // image size in y (# pixels)
    };

    int match(struct image_data &image, struct image_data &feature,
              S_TYPE &sabdmini, S_TYPE &sabdminc)
    {
        A_TYPE x, y, u, v, result;
        S_TYPE sabdmin, sabdi, sabdc;
        sabdminc = sabdmini = sabdmin = (S_TYPE) 2000000000;
        // loop over possible locations in image
        for(x = 0; x < (image.xmax-feature.xmax+1); x++) {
            for (y = 0; y < (image.ymax-feature.ymax+1); y++) {
                sabdi = 0;
                sabdc = 0;
                // loop over feature pixels
                for(u = 0; u < feature.xmax; u++) {
                    // calculate sum of absolute differences for intensity and color index
                    for(v = 0; v < feature.ymax; v++) {
                        A_TYPE uv = u + v*feature.xmax;
                        A_TYPE xy = x+u + (y+v)*image.xmax;
                        sabdi = sabdi + abs(int(feature.intens[uv] - image.intens[xy]));
                        // exclude empty feature pixels; penalize saturated image pixels
                        if(feature.intens[uv] > IMIN) {
                            if(image.intens[xy] < IMAX)
                                sabdc = sabdc + abs(int(feature.color[uv] - image.color[xy]));
                            else
                                sabdc = sabdc + CMAX;
                        }
                    } // end of v loop
                } // end of u loop
                // find and record minimum
                if (sabdi + sabdc < sabdmin) {
                    result = x + y*image.xmax;
                    sabdmin = sabdi + sabdc;
                    sabdmini = sabdi;
                    sabdminc = sabdc;
                }
            } // end of y loop
        } // end of x loop
        return result;
    }

Baseline

icc -std=c++14 -O2 set_seed.cpp init.cpp driver.cpp match.cpp -qopt-report=3; ./a.out 1

    LOOP BEGIN at match.cpp(34,5) <Distributed chunk1>
      remark #15335: loop was not vectorized: vectorization possible but seems inefficient
    analyze 1 images SUCCESS
    image 0 match at (x,y) 387709 ref 387709 sabd=32350.000000 7064.000000
    time per image 12.391983

The lowest-common-denominator instruction set (Intel SSE2) is targeted by default, but our Intel Xeon E5 v3 processor (code-named Haswell) supports Intel AVX2:

icc -std=c++14 -O2 -xcore-avx2 set_seed.cpp init.cpp driver.cpp match.cpp -qopt-report=3; ./a.out 1

    LOOP BEGIN at match.cpp(34,5) <Distributed chunk1>
      remark #15301: PARTIAL LOOP WAS VECTORIZED
      remark #15452: unmasked strided loads: 2
    analyze 1 images SUCCESS
    image 0 match at (x,y) 91526 ref 91526 sabd=31112.000000 6578.000000
    time per image 11.214404

Partial vectorization: the loop was split into vectorizable and non-auto-vectorizable parts (more later).

Memory access

"Strided loads" means inefficient access to memory:

    for(u = 0; u < feature.xmax; u++) {
        for(v = 0; v < feature.ymax; v++) {
            A_TYPE uv = u + v*feature.xmax;
            A_TYPE xy = x+u + (y+v)*image.xmax;
            sabdi = sabdi + abs(int(feature.intens[uv] - image.intens[xy])); // line 37

Interchange the u and v loops to get contiguous memory access in u:

icc -std=c++14 -O2 -xcore-avx2 set_seed.cpp init.cpp driver.cpp match1.cpp -qopt-report=3; ./a.out 1

    LOOP BEGIN at match1.cpp(34,5)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed FLOW dependence between sabdc (44:8) and sabdc (42:8)
    analyze 1 images SUCCESS
    image 0 match at (x,y) 305101 ref 305101 sabd=32316.000000 6711.000000
    time per image 9.361800
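The effect of the interchange can be seen in a self-contained sketch (hypothetical helper names, not the original source): both loop orderings compute the same sum, but only the u-innermost version walks memory with unit stride, which is what the auto-vectorizer wants.

```cpp
#include <cstdlib>
#include <vector>

// Data laid out as uv = u + v*xmax, as in the kernel.
// v innermost: consecutive iterations jump by xmax elements (strided).
long sad_v_inner(const std::vector<int>& a, const std::vector<int>& b,
                 int xmax, int ymax) {
    long s = 0;
    for (int u = 0; u < xmax; ++u)
        for (int v = 0; v < ymax; ++v)
            s += std::abs(a[u + v * xmax] - b[u + v * xmax]);
    return s;
}

// u innermost: consecutive iterations touch adjacent elements (unit stride).
long sad_u_inner(const std::vector<int>& a, const std::vector<int>& b,
                 int xmax, int ymax) {
    long s = 0;
    for (int v = 0; v < ymax; ++v)
        for (int u = 0; u < xmax; ++u)
            s += std::abs(a[u + v * xmax] - b[u + v * xmax]);
    return s;
}
```

The interchange is legal here because every iteration is independent; the two functions return identical sums.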

Distribute the loop

The compiler wasn't able to figure out that the new inner loop was safe to vectorize, due to the reductions with multiple if statements, and it didn't realize it was worthwhile to split the loop. We can ask it to split the reductions into two separate loops with a "distribute point" pragma:

    sabdi = sabdi + abs(feature.intens[uv] - image.intens[xy]);  // end chunk 1
    #pragma distribute point
    if(feature.intens[uv] > IMIN) {                              // start chunk 2
        if(image.intens[xy] < IMAX)
            sabdc = sabdc + abs(int(feature.color[uv] - image.color[xy])); // line 42

The first chunk, the simple reduction into sabdi, is now vectorized with unit stride:

icc -std=c++14 -O2 -xcore-avx2 set_seed.cpp init.cpp driver.cpp match1a.cpp -qopt-report=3; ./a.out 1

    LOOP BEGIN at match1a.cpp(34,5) <Distributed chunk1>
      remark #15301: PARTIAL LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 2
    analyze 1 images SUCCESS
    image 0 match at (x,y) 526873 ref 526873 sabd=31868.000000 7081.000000
    time per image 8.613725
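What the pragma requests can also be written by hand as loop fission. The sketch below is simplified (flat arrays and made-up threshold constants standing in for IMIN/IMAX/CMAX, not the real kernel): it separates the plain reduction, which vectorizes easily on its own, from the conditional one.

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// Illustrative thresholds, standing in for IMIN/IMAX/CMAX in the kernel.
const int K_IMIN = 10, K_IMAX = 240, K_CMAX = 100;

std::pair<int, int> sad_fissioned(const std::vector<int>& fi,  // feature intensity
                                  const std::vector<int>& ii,  // image intensity
                                  const std::vector<int>& fc,  // feature color
                                  const std::vector<int>& ic)  // image color
{
    int sabdi = 0, sabdc = 0;
    const size_t n = fi.size();
    // Chunk 1: simple unit-stride reduction, easy for the vectorizer.
    for (size_t k = 0; k < n; ++k)
        sabdi += std::abs(fi[k] - ii[k]);
    // Chunk 2: conditional reduction, compiled as a separate loop.
    for (size_t k = 0; k < n; ++k)
        if (fi[k] > K_IMIN)
            sabdc += (ii[k] < K_IMAX) ? std::abs(fc[k] - ic[k]) : K_CMAX;
    return {sabdi, sabdc};
}
```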

Threading

It's very easy to add threading using an OpenMP* pragma. It is most efficient to thread at the highest level (the outermost loop):

    // loop over possible locations in image
    #pragma omp parallel for private(sabdi, sabdc, x, y, u, v)   // line 26
    for(x = 0; x < (image.xmax-feature.xmax+1); x++) {
        for (y = 0; y < (image.ymax-feature.ymax+1); y++) {

But look what happens:

icc -std=c++14 -O2 -xcore-avx2 -qopenmp set_seed.cpp init.cpp driver.cpp match2.cpp -qopt-report=3; ./a.out 1

    match2.cpp(26:1-26:1):omp:_Z5matchR10image_dataS0_RfS1_: OpenMP DEFINED LOOP WAS PARALLELIZED
    analyze 1 images FAILED
    image 0 match at (x,y) 435900 ref 340101 sabd=285190.000000 36914.000000
    time per image 0.182654

Find threading errors

Run Intel Inspector XE:

    inspxe-cl -collect=ti3 -- ./a.out

    Warning: One or more threads in the application accessed the stack of another thread.
    4 data race problem(s) detected

The results can also be viewed in the GUI (screenshot omitted).

Fixing the threading

Protect the update with an OpenMP* critical section, which can be executed by only one thread at a time. This part is slow, but much other work is still done in parallel:

    // find and record minimum
    #pragma omp critical
    {
        if (sabdi + sabdc < sabdmin) {
            result = x + y*image.xmax;
            sabdmin = sabdi + sabdc;
            sabdmini = sabdi;
            sabdminc = sabdc;
        }
    }

icc -std=c++14 -O2 -xcore-avx2 -qopenmp set_seed.cpp init.cpp driver.cpp match2a.cpp -qopt-report=3; ./a.out 10

    match2a.cpp(51:1-51:1):omp:_Z5matchR10image_dataS0_RfS1_: OpenMP multithreaded code generation for CRITICAL was successful
    match2a.cpp(26:1-26:1):omp:_Z5matchR10image_dataS0_RfS1_: OpenMP DEFINED LOOP WAS PARALLELIZED
    analyze 10 images SUCCESS
    image 0 match at (x,y) 87916 ref 87916 sabd=31800.000000 6563.000000
    time per image 0.675926

The critical section makes the loop slower, but it is still much faster than the unthreaded version.
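The pattern can be reduced to a minimal sketch, in which the per-position SAD work is elided and the scores are passed in precomputed. Because the minimum and its location must be updated together, a scalar reduction(min:) clause would not be enough here; the critical section keeps the pair consistent. Without -qopenmp the pragmas are ignored and the loop simply runs serially, with the same result.

```cpp
#include <climits>
#include <vector>

// Return the index of the smallest score. Updates to the shared minimum
// and its position are serialized so they can never disagree.
int find_min_position(const std::vector<int>& scores) {
    int best = INT_MAX;
    int result = -1;
    #pragma omp parallel for
    for (int i = 0; i < (int)scores.size(); ++i) {
        int s = scores[i];     // per-thread work (the SAD loops go here)
        #pragma omp critical
        {
            // one thread at a time: best and result stay consistent
            if (s < best) { best = s; result = i; }
        }
    }
    return result;
}
```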

Intel VTune Amplifier XE

Better still: distribute (split) the loop

The critical section doesn't contain much work, so it is better to execute it serially. That avoids synchronization between threads:

    // loop over possible locations in image
    #pragma omp parallel for private(sabdi, sabdc, x, y, u, v)
    for(x = 0; x < (image.xmax-feature.xmax+1); x++) {
        for (y = 0; y < (image.ymax-feature.ymax+1); y++) {
            ...
            for(v = 0; v < feature.ymax; v++) {
                for(u = 0; u < feature.xmax; u++) {
                    sabdi = sabdi + ...
                    sabdc = sabdc + ...
                }
            }
            A_TYPE xy = x + y*(image.xmax-feature.xmax+1);
            sabdi_xy[xy] = sabdi;
            sabdc_xy[xy] = sabdc;
        }
    }
    // find and record minimum (serial)
    for(x = 0; x < (image.xmax-feature.xmax+1); x++) {
        for (y = 0; y < (image.ymax-feature.ymax+1); y++) {
            int xy = x + y*(image.xmax-feature.xmax+1);
            if (sabdi_xy[xy] + sabdc_xy[xy] < sabdmin) {
                sabdmin = sabdi_xy[xy] + sabdc_xy[xy];
                result = x + y*image.xmax;
            }
        }
    }

icc -std=c++14 -O2 -xcore-avx2 -qopenmp set_seed.cpp init.cpp driver.cpp match3.cpp -qopt-report=3; ./a.out 10

    match3.cpp(29:1-29:1):omp:_Z5matchR10image_dataS0_RfS1_: OpenMP DEFINED LOOP WAS PARALLELIZED
    analyze 10 images SUCCESS
    image 0 match at (x,y) 648425 ref 648425 ...
    time per image 0.231830

Faster than the critical section.
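A minimal sketch of the same restructuring, again with the SAD work elided: the parallel pass writes one score per position into an array, and a cheap serial pass then finds the minimum, so no thread ever touches shared state.

```cpp
#include <climits>
#include <vector>

int find_min_position_split(const std::vector<int>& work) {
    const int n = (int)work.size();
    std::vector<int> score(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        score[i] = work[i];        // per-position SAD would be computed here
    // Serial reduction over the stored scores: no locking needed.
    int best = INT_MAX, result = -1;
    for (int i = 0; i < n; ++i)
        if (score[i] < best) { best = score[i]; result = i; }
    return result;
}
```

The trade-off is one extra array of scores (one element per candidate position) in exchange for removing all inter-thread synchronization.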

Thread affinity

By default, the OS can move threads from one core to another. That is good for variable workloads and varying numbers of threads, but not for symmetric workloads such as this one: it leads to fluctuations in execution time, e.g. .26, .20, .22, .24, .25 seconds per image.

It is better to bind threads to cores, e.g.:

    export OMP_PROC_BIND=close

Rerunning the same executable:

    time per image .19
    time per image .19
    time per image .19
    time per image .19
    time per image .23

Faster and more consistent.
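A typical way to set this up with the OpenMP 4.0 affinity environment variables; note that OMP_PLACES is an addition here, not quoted from the slide:

```shell
# Keep worker threads close to the master thread...
export OMP_PROC_BIND=close
# ...and give each thread a whole physical core as its place.
export OMP_PLACES=cores
```

Both variables are read by the OpenMP runtime at program start, so they apply to any OpenMP executable launched from this shell.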

Back to the vectorization report

icc -std=c++14 -O2 -xcore-avx2 -qopenmp match3.cpp -c -qopt-report=2 -qopt-report-phase=loop,vec

    LOOP BEGIN at match3.cpp(30,2)
      remark #15542: loop was not vectorized: inner loop was already vectorized
      LOOP BEGIN at match3.cpp(31,3)
        remark #15542: loop was not vectorized: inner loop was already vectorized
        LOOP BEGIN at match3.cpp(36,4)
          remark #15542: loop was not vectorized: inner loop was already vectorized
          LOOP BEGIN at match3.cpp(38,5) <Distributed chunk1>
            remark #15301: PARTIAL LOOP WAS VECTORIZED
          LOOP END
          LOOP BEGIN at match3.cpp(38,5) <Distributed chunk2>
            remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
            remark #15346: vector dependence: assumed FLOW dependence between sabdc (48:8) and sabdc (46:8)
          LOOP END
        LOOP END
      LOOP END
    LOOP END

Vectorization revisited

The compiler couldn't vectorize the highlighted code because it combines several idioms (chunk 2):

    for(u = 0; u < feature.xmax; u++) {
        A_TYPE uv = u + v*feature.xmax;
        A_TYPE xy = x+u + (y+v)*image.xmax;
        sabdi = sabdi + abs(int(feature.intens[uv] - image.intens[xy]));
        // exclude empty feature pixels; penalize saturated image pixels
        if(feature.intens[uv] > IMIN) {
            if(image.intens[xy] < IMAX)
                sabdc = sabdc + abs(int(feature.color[uv] - image.color[xy]));
            else
                sabdc = sabdc + CMAX;
        }
    }

There are two reductions, one with a double conditional and an else clause. The compiler can recognize each idiom individually, but not all of them together (it treats sabdc as a dependency), unless we help.

Explicit vectorization

Use the OpenMP* SIMD directive to tell the compiler how to vectorize:

    #pragma omp simd reduction(+:sabdi,sabdc)

This instructs the compiler to vectorize the loop and to treat sabdi and sabdc as reduction variables.

icc -std=c++14 -O2 -xcore-avx2 -qopenmp set_seed.cpp init.cpp driver.cpp match4.cpp -qopt-report=2; ./a.out 100

    LOOP BEGIN at match4.cpp(39,5)
      remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
    LOOP END
    SUCCESS image 0 match at (x,y) 78922 ref 78922 sabd=31657.000000 6084.000000 ...
    time per image 0.099986

The loop is no longer split; the entire loop is vectorized. SIMD directives are very powerful, but the user, not the compiler, is now responsible for correctness: there must be no dependencies except as handled by REDUCTION, PRIVATE, etc. clauses.
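A self-contained sketch of this step (flat arrays and caller-supplied thresholds rather than the real kernel): the single SIMD loop now carries both reductions. If the pragma is ignored (no -qopenmp, or a non-OpenMP compiler), the loop runs scalar and produces the same answer.

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

std::pair<int, int> sad_simd(const std::vector<int>& fi, const std::vector<int>& ii,
                             const std::vector<int>& fc, const std::vector<int>& ic,
                             int imin, int imax, int cmax) {
    int sabdi = 0, sabdc = 0;
    const int n = (int)fi.size();
    // One loop, both reductions: the clause promises the compiler that
    // sabdi and sabdc are the only cross-iteration dependences.
    #pragma omp simd reduction(+:sabdi,sabdc)
    for (int k = 0; k < n; ++k) {
        sabdi += std::abs(fi[k] - ii[k]);
        if (fi[k] > imin)
            sabdc += (ii[k] < imax) ? std::abs(fc[k] - ic[k]) : cmax;
    }
    return {sabdi, sabdc};
}
```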

More small optimizations

Data locality can be further improved by interchanging the outermost loops over x and y:

icc -std=c++14 -O2 -xcore-avx2 -qopenmp set_seed.cpp init.cpp driver.cpp match5.cpp -qopt-report=3; ./a.out 100

    SUCCESS image 0 match at (x,y) 376316 ref 376316 sabd=31593.000000 6598.000000
    time per image 0.057735

The vectorization report contains the following line:

    remark #15487: type converts: 2

This is a hint that we have been using floats as reduction variables all this time, when everything else is integer. Declare sabdi and sabdc as integers (change S_TYPE, but be sure they won't overflow)! Replacing unsigned int by int also helps slightly (P_TYPE, A_TYPE; see image_data5.h): unsigned ints that overflow must wrap, while signed ints need not.

icc -std=c++14 -O2 -xcore-avx2 -qopenmp set_seed.cpp init.cpp driver.cpp match5.cpp -qopt-report=3; ./a.out 100

    SUCCESS image 0 match at (x,y) 867713 ref 867713 sabd=31004.000000 6241.000000 ...
    time per image 0.044227

The type-convert warnings have gone away, and it made a difference.
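The accumulator-type point can be seen in a small sketch (not the original code): with a float accumulator every addition needs an int-to-float convert, while an integer accumulator of sufficient width stays in integer SIMD lanes and produces the same totals.

```cpp
#include <cstdlib>
#include <vector>

// Float accumulator: forces an int->float conversion on every addition.
float sad_float_acc(const std::vector<int>& a, const std::vector<int>& b) {
    float s = 0.0f;
    for (size_t k = 0; k < a.size(); ++k)
        s += (float)std::abs(a[k] - b[k]);
    return s;
}

// Integer accumulator, wide enough not to overflow: no conversions needed.
long long sad_int_acc(const std::vector<int>& a, const std::vector<int>& b) {
    long long s = 0;
    for (size_t k = 0; k < a.size(); ++k)
        s += std::abs(a[k] - b[k]);
    return s;
}
```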

Summary

Systematic optimization can yield a large speed-up compared to default optimization (more than 200x for this system and kernel), and more than 500x compared to -O0 (no optimization). Worth the effort!

    Optimization             Source     Time per image (secs)
    -O2 baseline             match      12.4
    -xcore-avx2              match      11.2
    unit stride              match1      9.4
    partial vectorization    match1a     8.6
    OpenMP + critical        match2a     0.68
    distribute               match3      0.23
    thread affinity          match3      0.20
    SIMD vectorization       match4      0.10
    outer loop swap          match5      0.055
    no type converts         match5      0.045

Code that performs and outperforms

Download a free, 30-day trial of Intel Parallel Studio XE: makebettercode.com/parallelstudioxe-eval

Resources

OpenMP* information:
- Putting Vector Programming to Work with OpenMP SIMD: software.intel.com/sites/default/files/managed/77/e7/parallel_mag_issue22.pdf
- The Present and Future of OpenMP: https://software.intel.com/sites/default/files/managed/75/b0/parallel-universe-issue-27.pdf
- Modernize Your Code for Intel Xeon Phi Processors: https://software.intel.com/sites/default/files/managed/9d/4d/parallel-universe-magazine-issue-26.pdf

Code modernization links:
- Modern Code Developer Community: software.intel.com/modern-code
- Intel Code Modernization Enablement Program: software.intel.com/code-modernization-enablement
- Intel Parallel Computing Centers: software.intel.com/ipcc
- Technical Webinar Series registration: https://software.intel.com/events/development-tools-webinars
- Intel Parallel Universe Magazine: software.intel.com/intel-parallel-universe-magazine

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

© 2017 Intel Corporation. All rights reserved. Intel, Xeon, Xeon Phi and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804