Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature


Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature
Intel Software Developer Conference Frankfurt, 2017
Klaus-Dieter Oertel, Intel

Agenda
- Intel Advisor for vectorization optimization
- What is the theoretical roofline model?
- How is it implemented in Advisor?
- Some examples
- Resources

Changing Hardware Impacts Software
More Cores, More Threads, Wider Vectors

Processor                                   Cores  Threads  SIMD width (bits)
Intel Xeon Processor 64-bit                     1        2                128
Intel Xeon Processor 5100 series                2        2                128
Intel Xeon Processor 5500 series                4        8                128
Intel Xeon Processor 5600 series                6       12                128
Intel Xeon Processor E5-2600                    8       16                256
Intel Xeon Processor E5-2600 V2                12       24                256
Intel Xeon Processor E5-2600 V3                18       36                256
Intel Xeon Processor E5-2600 V4                22       44                256
Intel Xeon Platinum 8180 processor             28       56                512
Intel Xeon Phi (Knights Landing)               72      288                512

High performance software must be both:
- Parallel (multi-thread, multi-process)
- Vectorized

*Product specification for launched and shipped products available on ark.intel.com.
Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Vectorize and Thread for Dramatic Performance Gains
Together they are more effective than either one alone.
- Automatic vectorization is not enough: explicit pragmas and optimization are often required
- The difference is growing with each new generation of hardware

[Chart: Serial, Vectorized, Threaded, and Vectorized & Threaded performance across Intel Xeon processor generations: 2007 X5472 (Harpertown), 2009 X5570 (Nehalem), 2010 X5680 (Westmere), 2012 E5-2600 (Sandy Bridge), 2013 E5-2600 v2 (Ivy Bridge), 2014 E5-2600 v3 (Haswell), 2016 E5-2600 v4 (Broadwell). The Vectorized & Threaded series is labeled 187x.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configurations at the end of this presentation.

Intel Advisor: Vectorization Advisor
Get breakthrough vectorization performance

Faster vectorization optimization:
- Vectorize where it will pay off most
- Quickly ID what is blocking vectorization
- Tips for effective vectorization
- Safely force compiler vectorization
- Optimize memory stride

The data and guidance you need:
- Compiler diagnostics + performance data + SIMD efficiency
- Detect problems and recommend fixes
- Loop-Carried Dependency Analysis
- Memory Access Patterns Analysis

Optimize for AVX-512 with or without access to AVX-512 hardware.
Part of Intel Parallel Studio XE. http://intel.ly/advisor-xe

The Right Data At Your Fingertips
Get all the data you need for high impact vectorization:
- Filter by which loops are vectorized
- Trip counts
- What prevents vectorization?
- Focus on hot loops
- What vectorization issues do I have?
- Which vector instructions are being used?
- How efficient is the code?

Get faster code faster!

Find Effective Optimization Strategies
Intel Advisor: cache-aware roofline analysis (new; see the roofline tutorial video)

Roofs show platform limits:
- Memory, cache and compute limits

Dots are loops:
- Bigger, red dots take more time, so optimization has a bigger impact
- Dots farther from a roof have more room for improvement

Higher dot = higher GFLOPs/sec:
- Optimization moves dots up
- Algorithmic changes move dots horizontally

Which loops should we optimize?
- A and G are the best candidates
- B has room to improve, but will have less impact
- E, C, D, and H are poor candidates


What is the roofline model?
Do you know how fast you should run?
- Comes from Berkeley
- Performance is limited by equations/implementation and by code generation/hardware
- 2 hardware limitations: PEAK Flops and PEAK Bandwidth
- The application performance is bounded by hardware specifications:

$$\text{Gflop/s} = \min\left(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI}\right)$$

where AI is the Arithmetic Intensity (Flops/Bytes).
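The bound above can be written as a one-line function. This is a minimal sketch; the function and parameter names are illustrative and do not come from the slides or from any Intel tool.

```c
#include <assert.h>

/* Attainable performance in Gflop/s under the roofline model:
   the lower of the compute roof (peak_gflops) and the memory roof
   (peak_bw_gbs * ai), with ai in Flops/Byte. Names are hypothetical. */
double roofline_gflops(double peak_gflops, double peak_bw_gbs, double ai)
{
    assert(peak_gflops > 0.0 && peak_bw_gbs > 0.0 && ai >= 0.0);
    double memory_roof = peak_bw_gbs * ai;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

A kernel with low arithmetic intensity gets the bandwidth term; a kernel with high arithmetic intensity gets the peak term.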

Platform PEAK Flops
How many floating point operations per second?
- Theoretical value can be computed from the specification
- Example with 2 sockets Intel Xeon Processor E5-2697 v2:

$$\text{PEAK FLOP} = 2 \times 2.7 \times 12 \times 8 \times 2 = 1036.8\ \text{Gflop/s}$$

  The factors are, in order: number of sockets (2), core frequency (2.7 GHz), number of cores (12), number of single precision elements in a SIMD register (8), and floating point ports (2: 1 port for addition, 1 for multiplication).
- A more realistic value can be obtained by running Linpack: =~ 930 Gflop/s on a 2 sockets Intel Xeon Processor E5-2697 v2
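The peak-Flops arithmetic can be checked with a few lines of C. This is only a sketch of the multiplication on the slide; the function name and parameter names are made up for illustration.

```c
#include <assert.h>

/* Theoretical single precision peak in Gflop/s as the product of the
   factors listed on the slide. All names here are hypothetical. */
double peak_gflops(int sockets, double core_ghz, int cores_per_socket,
                   int sp_lanes, int fp_ports)
{
    assert(sockets > 0 && core_ghz > 0.0 && cores_per_socket > 0);
    assert(sp_lanes > 0 && fp_ports > 0);
    return sockets * core_ghz * cores_per_socket * sp_lanes * fp_ports;
}
```

Calling `peak_gflops(2, 2.7, 12, 8, 2)` reproduces the 2 sockets E5-2697 v2 example.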

Platform PEAK bandwidth
How many bytes can be transferred per second?
- Theoretical value can be computed from the specification
- Example with 2 sockets Intel Xeon Processor E5-2697 v2:

$$\text{PEAK BW} = 2 \times 1.866 \times 8 \times 4 = 119\ \text{GB/s}$$

  The factors are, in order: number of sockets (2), memory frequency (1.866 GHz), bytes per channel (8), and number of memory channels (4).
- A more realistic value can be obtained by running Stream: =~ 100 GB/s on a 2 sockets Intel Xeon Processor E5-2697 v2
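The same kind of check works for the bandwidth figure. A sketch under the assumption that the four factors are sockets, memory frequency in GHz, bytes per channel, and channels per socket; the names are illustrative.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s. The parameter names are
   hypothetical; the factors are the ones multiplied on the slide. */
double peak_bandwidth_gbs(int sockets, double mem_ghz,
                          int bytes_per_channel, int channels_per_socket)
{
    assert(sockets > 0 && mem_ghz > 0.0);
    assert(bytes_per_channel > 0 && channels_per_socket > 0);
    return sockets * mem_ghz * bytes_per_channel * channels_per_socket;
}
```

`peak_bandwidth_gbs(2, 1.866, 8, 4)` gives a value slightly above 119, which the slide rounds to 119 GB/s.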

Drawing the Roofline
Defining the speed of light
2 sockets Intel Xeon Processor E5-2697 v2: Peak Flop = 1036 Gflop/s, Peak BW = 119 GB/s

[Figure: roofline plot of Gflop/s against AI [Flop/B], built up over three slides: the compute roof at 1036 Gflop/s, the memory roof given by Peak BW x AI, and the point where the two roofs meet at AI = 8.7 Flop/B]

What is the performance boundary?
Manual way to do it: manual counting on matrix/matrix multiplication

    for(i=0; i<n; i++)
      for(j=0; j<n; j++)
        for(k=0; k<n; k++)
          c[i][j] = c[i][j] + a[i][k] * b[k][j];

# add = N * N * N        # Read  = 3 * N * N * 4 bytes
# mul = N * N * N        # Write = N * N * 4 bytes

$$\text{AI} = \frac{2N^3}{16N^2} = \frac{1}{8}N$$
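The manual count can be spelled out in code to make the 2N^3 and 16N^2 terms explicit. A minimal sketch, assuming 4-byte `float` elements as on the slide; `matmul_manual_ai` is a made-up name.

```c
#include <assert.h>

/* Arithmetic intensity of N x N matrix multiplication from the manual
   count: N^3 adds + N^3 muls, three matrices read and one written,
   each of N^2 floats. Hypothetical helper, not part of any tool. */
double matmul_manual_ai(int n)
{
    assert(n > 0);
    double N = (double)n;
    double flops = N * N * N + N * N * N;           /* adds + muls      */
    double bytes_read = 3.0 * N * N * sizeof(float); /* a, b and c       */
    double bytes_written = N * N * sizeof(float);    /* c                */
    return flops / (bytes_read + bytes_written);
}
```

The result grows linearly with N, which is why larger matrices move to the right on the roofline plot.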

Compute the maximum performance
BW * Arithmetic Intensity
2 sockets Intel Xeon Processor E5-2697 v2: Peak Flop = 1036 Gflop/s, Peak BW = 119 GB/s
- For sgemm, AI = 1/8 N
- If N = 8, AI = 1
- If N = 8, sgemm should not be able to perform better than 119 GFlop/s on a 2 sockets Ivy Bridge

[Figure: roofline plot of Gflop/s against AI [Flop/B] with marks at AI = 1 (119 Gflop/s) and AI = 8.7 (1036 Gflop/s)]
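Combining the sgemm estimate AI = N/8 with the roofline formula gives a bound for any N. The sketch below also searches for the smallest N at which this simple model becomes compute bound; both function names are hypothetical, and the model ignores caches entirely.

```c
#include <assert.h>

/* Roofline bound for sgemm using the slide's estimate AI = N / 8. */
double sgemm_bound_gflops(int n, double peak_gflops, double peak_bw_gbs)
{
    assert(n > 0 && peak_gflops > 0.0 && peak_bw_gbs > 0.0);
    double ai = n / 8.0;
    double memory_roof = peak_bw_gbs * ai;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Smallest N for which the memory roof reaches the compute roof. */
int sgemm_first_compute_bound_n(double peak_gflops, double peak_bw_gbs)
{
    assert(peak_gflops > 0.0 && peak_bw_gbs > 0.0);
    int n = 1;
    while (peak_bw_gbs * (n / 8.0) < peak_gflops)
        n++;
    return n;
}
```

With the slide's figures (1036 Gflop/s, 119 GB/s), N = 8 is limited by bandwidth, while the crossover sits near AI = 8.7, that is, around N = 70.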

And NOW? How to get better performance?
- Vectorization + threading
- Optimize memory access

[Figure: the same roofline plot, with marks at AI = 1 and 8.7 Flop/B and at 119 and 1036 Gflop/s]


Roofline in Intel Advisor
The cache aware roofline model

Intel Advisor implements a Cache Aware Roofline Model (CARM):
- Algorithmic, cumulative (L1+L2+LLC+DRAM) traffic-based
- Invariant for the given code / platform combination

How does it work?
- Counts every memory movement
- Bytes and Flops -> Instrumentation
- Time -> Sampling

CARM: Cache aware Roofline Model
DRAM: DRAM aware Roofline Model
TRAM: Theoretical Roofline Model
Typically AI_CARM < AI_DRAM < AI_TRAM
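To see why counting every memory movement lowers the arithmetic intensity, the sketch below tallies Flops and bytes per source-level access in a matrix multiplication. This is a rough illustration only, not how Advisor is implemented: real instrumentation observes the compiled instructions, where for example `c[i*n + j]` may stay in a register. The type and function names are made up.

```c
#include <assert.h>

typedef struct {
    unsigned long long flops;
    unsigned long long bytes;
} traffic_t;

/* Multiply two n x n row-major matrices and count, per iteration,
   2 Flops (add + mul) and 4 float accesses (3 loads + 1 store).
   Hypothetical source-level counting, not Advisor's actual method. */
traffic_t count_matmul_traffic(int n, const float *a, const float *b, float *c)
{
    traffic_t t = {0, 0};
    assert(n > 0 && a && b && c);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                c[i * n + j] = c[i * n + j] + a[i * n + k] * b[k * n + j];
                t.flops += 2;
                t.bytes += 4 * sizeof(float);
            }
    return t;
}
```

Counted this way, the ratio of Flops to bytes is 2/16 for any n, whereas a count based only on the size of the matrices grows with n. That gap is one way to picture the ordering of the AI values stated above.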

Understanding the roofline in Intel Advisor
[Figure: Advisor roofline chart with one region labeled "Purely compute bound" and another labeled "Purely cache/DRAM bound"]


Roofline model and compiler optimizations

Roofline model and optimizations
Matrix/matrix addition

    void addition(float* a, float* b, float* c, int size){
      int i, j;
      for(j=0; j<size; j++){
        for(i=0; i<size; i++){
          c[i*size + j] = a[i*size + j]+b[i*size + j];
        }
      }
    }

Let's have a look at the roofline model.

Roofline model and optimizations
Compilation with O1: very poor performance, far from the DRAM roofline!

Roofline model and optimizations
Let's look at the Memory Access Pattern Analysis: constant stride found! It looks like the two loops should be interchanged.
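One way to act on the constant-stride finding is to swap the loop order so that the inner loop walks contiguous memory. A minimal sketch, assuming the row-major indexing `i*size + j` from the matrix addition example; the function name and the `const` qualifiers are additions for illustration.

```c
#include <assert.h>

/* Matrix addition with i as the outer loop and j as the inner loop,
   so consecutive inner iterations touch adjacent elements (unit stride). */
void addition_interchanged(const float *a, const float *b, float *c, int size)
{
    assert(size >= 0);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            c[i * size + j] = a[i * size + j] + b[i * size + j];
        }
    }
}
```

The result is identical to the original version because each element is computed independently; only the order of memory accesses changes.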

Roofline model and optimizations Compilation with O3

Vectorization of Loop carried dependency

Vectorization of loop carried dependency
Loop carried dependency

    void addition(float* a, float* b, float* c, int size){
      int i, j;
      for(i=0; i<size; i++){
        for(j=pad; j<size; j++){
          c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
        }
      }
    }


Vectorization of loop carried dependency
Loop carried dependency

    void addition(float* a, float* b, float* c, int size){
      int i, j;
      for(i=0; i<size; i++){
        /* In this case, we assume that pad >= 4 */
        #pragma omp simd safelen(4)
        for(j=pad; j<size; j++){
          c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
        }
      }
    }
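A compilable variant of this loop is sketched below. It differs from the slide in two labeled ways: `pad` is passed as a parameter (the slide leaves its definition out), and the unused `b` argument is dropped. The function name is made up. Compile with an OpenMP-aware flag such as `-fopenmp` or `-qopenmp` for the pragma to take effect; without it the pragma is ignored and the loop runs as ordinary scalar code.

```c
#include <assert.h>

/* Each element depends on the element pad positions earlier in the same
   row. With pad >= 4, up to 4 consecutive iterations are independent,
   which is what safelen(4) tells the compiler. */
void addition_with_dependency(const float *a, float *c, int size, int pad)
{
    assert(size >= 0 && pad >= 4);
    for (int i = 0; i < size; i++) {
        #pragma omp simd safelen(4)
        for (int j = pad; j < size; j++) {
            c[i * size + j] = a[i * size + j] + c[i * size + j - pad];
        }
    }
}
```

If `pad` could be smaller than 4, the `safelen(4)` promise would be false and the vectorized result could differ from the scalar one, hence the assertion.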


Vectorization of loop carried dependency Safelen was 4

Vectorization of function call

Vectorization of a function call with OMP
A function call inside a loop can prevent vectorization.

    for(int i=0; i<size; i++){
      for(int j=0; j<size; j++){
        single_line_addition(a, c, i*size + j);
      }
    }

    //function is defined in another compilation unit
    void single_line_addition(float* a, float* c, int ind){
      c[ind] = a[ind]+c[ind];
    }


Vectorization of a function call with OMP Advisor tells you that this pattern can be a problem and proposes a solution

Vectorization of a function call with OMP
omp declare simd

    for(int i=0; i<size; i++){
      #pragma omp simd
      for(int j=0; j<size; j++){
        single_line_addition(a, c, i*size + j);
      }
    }

    #pragma omp declare simd uniform(a, c) linear(ind)
    void single_line_addition(float* a, float* c, int ind);
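For experimenting, the caller and the callee can be placed in one file as sketched below. In the slides the function lives in another compilation unit, in which case the `declare simd` pragma needs to be visible at both the declaration and the definition; the wrapper name `add_all` is made up for this sketch. Without an OpenMP flag the pragmas are ignored and the code still runs correctly as scalar code.

```c
#include <assert.h>

/* uniform(a, c): the pointers are the same for every SIMD lane.
   linear(ind): ind increases by 1 from one lane to the next. */
#pragma omp declare simd uniform(a, c) linear(ind)
void single_line_addition(const float *a, float *c, int ind)
{
    c[ind] = a[ind] + c[ind];
}

/* Hypothetical wrapper around the loop nest shown on the slide. */
void add_all(const float *a, float *c, int size)
{
    assert(size >= 0);
    for (int i = 0; i < size; i++) {
        #pragma omp simd
        for (int j = 0; j < size; j++) {
            single_line_addition(a, c, i * size + j);
        }
    }
}
```

The `linear(ind)` clause matches the call site because `i*size + j` advances by exactly one per iteration of the inner loop.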


Vectorization of a function call with OMP
[Figure: Advisor results before and after the change]


References
- Roofline model proposed by Williams, Waterman, Patterson: http://www.eecs.berkeley.edu/~waterman/papers/roofline.pdf
- Cache-aware Roofline model: Upgrading the loft (Ilic, Pratas, Sousa, INESC-ID/IST, Technical University of Lisbon): http://www.inesc-id.pt/ficheiros/publicacoes/9068.pdf

Resources

Intel Advisor (Threading Design & Prototyping):
- Product page: overview, features, FAQs, support
- Training materials: movies, tech briefs, documentation
- Evaluation guides: step by step walk through
- Reviews

Additional analysis tools:
- Intel VTune Amplifier: performance profiler
- Intel Inspector: memory and thread checker / debugger

Additional development products:
- Intel Software Development Products
- Intel Distribution for Python*: accelerated Python distribution

Code that performs and outperforms
Download a free, 30-day trial of Intel Parallel Studio XE 2018 today: software.intel.com/en-us/intel-parallel-studio-xe

And don't forget to fill out the evaluation survey via a URL that will be provided at the end of the day, or watch your email for a link to the survey.
P.S. Everyone who fills out the survey will receive a personalized certificate indicating completion of the training!

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Advisor works with GCC and Microsoft Compilers
Adds bonus capabilities with the Intel Compiler

Advisor using GCC, Microsoft or Intel Compiler:
- Finds un-vectorized loops
- Analyzes SIMD, AVX, AVX2, AVX-512
- Dependency Analysis: safely force vectorization with a pragma
- Memory Access Pattern Analysis: optimize stride and caching
- Trip Counts
- FLOPS metrics with masking
- Roofline Analysis: balance memory vs. compute optimization

Intel Compiler adds:
- Usually better optimized vectorization
- Better compiler optimization messages

Intel Advisor with Intel Compiler adds:
- Finds inefficiently vectorized loops and estimates performance gain
- Compiler optimization report messages displayed on the source
- More tips for improving vectorization
- Optimize for AVX-512 even without AVX-512 hardware

Configurations for 2007-2016 Benchmarks
Platform Hardware and Software Configuration

Processor and caches:

#  Platform                         Unscaled core frequency  Cores/socket  Num sockets  L1 data cache  L2 cache  L3 cache
1  Intel Xeon 5472 Processor        3.0 GHz                  4             2            32K            6 MB      None
2  Intel Xeon X5570 Processor       2.9 GHz                  4             2            32K            256K      8 MB
3  Intel Xeon X5680 Processor       3.33 GHz                 6             2            32K            256K      12 MB
4  Intel Xeon E5 2690 Processor     2.9 GHz                  8             2            32K            256K      20 MB
5  Intel Xeon E5 2697v2 Processor   2.7 GHz                  12            2            32K            256K      30 MB
6  Intel Xeon E5 2600v3 Processor   2.2 GHz                  18            2            32K            256K      46 MB
7  Intel Xeon E5 2600v4 Processor   2.3 GHz                  18            2            32K            256K      46 MB
8  Intel Xeon E5 2600v4 Processor   2.2 GHz                  22            2            32K            256K      56 MB

Memory and platform settings:

#  Memory  Memory frequency  Memory access  H/W prefetchers enabled  HT enabled  Turbo enabled  C states
1  32 GB   800 MHz           UMA            Y                        N           N              Disabled
2  48 GB   1333 MHz          NUMA           Y                        Y           Y              Disabled
3  48 MB   1333 MHz          NUMA           Y                        Y           Y              Disabled
4  64 GB   1600 MHz          NUMA           Y                        Y           Y              Disabled
5  64 GB   1867 MHz          NUMA           Y                        Y           Y              Disabled
6  128 GB  2133 MHz          NUMA           Y                        Y           Y              Disabled
7  256 GB  2400 MHz          NUMA           Y                        Y           Y              Disabled
8  128 GB  2133 MHz          NUMA           Y                        Y           Y              Disabled

Software:

#  O/S name    Operating system        Compiler version
1  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
2  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
3  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
4  Fedora 20   3.11.10-301.fc20        icc version 14.0.1
5  RHEL 7.1    3.10.0-229.el7.x86_64   icc version 14.0.1
6  Fedora 20   3.13.5-202.fc20         icc version 14.0.1
7  RHEL 7.0    3.10.0-123.el7.x86_64   icc version 14.0.1
8  CentOS 7.2  3.10.0-327.el7.x86_64   icc version 14.0.1