Growth in Cores - A well-rehearsed story


Growth in Cores - A well-rehearsed story

1. Multicore is just a fad!

Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Implications of Moore's Law

Tick-Tock Development Cycles: Integrate. Innovate.

Tick (process): 45nm -> 32nm -> 22nm 3D Tri-Gate
Tock (microarchitecture): Intel Core -> Nehalem -> Sandy Bridge -> Haswell (projection)
ISA: SSE4.2/AESNI -> AVX -> AVX2** -> future ISA

**Intel Architecture Instruction Set Extensions Programming Reference, #319433-012A, February 2012. Potential future options, subject to change without notice.

Software & Services Group Developer Products Division Copyright 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Intel Confidential 5

Intel & Parallelism: more cores, wider vectors, co-processors

(Images do not reflect actual die sizes.)

Processor                                        Cores  Threads  SIMD width  ISA
Intel Xeon processor 64-bit                        1       2        128      SSE2
Intel Xeon processor 5100 series                   2       2        128      SSSE3
Intel Xeon processor 5500 series                   4       8        128      SSE4.2
Intel Xeon processor 5600 series                   6      12        128      SSE4.2
Intel Xeon processor code-named Sandy Bridge       8      16        256      AVX
Intel Xeon processor code-named Ivy Bridge         -       -        256      AVX
Intel Xeon processor code-named Haswell            -       -        256      AVX2 FMA3
Intel MIC coprocessor code-named Knights Ferry    32     128        512      -
Intel MIC coprocessor code-named Knights Corner  >50    >200        512      -

Intel MIC architecture extends established CPU architecture and programming concepts to highly parallel applications.


Parallelism on Intel x86-based architectures: the hardware hierarchy

- Distributed memory level: multiple nodes - needs MPI
- Shared memory level: multiple sockets per node - needs multithreading (a threading tool can help)
- CPU level: multiple physical cores, SMT (Hyper-Threading) - needs multithreading
- x86 SIMD registers: multiple data elements in one xmm/ymm* register - needs vectorization (the vectorizer can help)
- Multiple execution units per core (e.g. integer ALUs on ports 0, 1 and 5) - exploited automatically


Knights Corner Core Architecture

- Improved Intel Pentium core: in-order, short pipeline, minimal speculation
- 4 threads per core: tolerates latencies and keeps the execution units busy
- 2-wide issue: 1 vector (load-op) + 1 scalar op (or prefetch) per cycle, or 2 scalar ops per cycle
- Scalar unit and vector unit, each with its own register file; x87 support
- Area and power efficient: >50 cores

[Block diagram: instruction fetch and decode, scalar and vector units with their register files, L1 I-cache and D-cache, 512K L2 cache local subset, on-die ring network.]

Based on a presentation by Robert Geva, Principal Engineer: Programming Continuity Between Intel Xeon and Intel Xeon Phi Coprocessors for High Performance.

Programming Continuity

[Diagram: a 256-bit register with lanes X8..X1 combined element-wise with lanes Y8..Y1 to produce X8opY8 .. X1opY1.]

Improving parallelism for better utilization of cores and vectors pays off on both Intel Xeon and Intel Xeon Phi products.

Parallel Programming for Intel Architecture (IA)

Cores:
- Use threads, directly or via OpenMP*, or
- Use tasking: Intel Threading Building Blocks (Intel TBB) / Cilk Plus

Vectors:
- Blocking algorithms
- Data layout and alignment
- Intrinsics, auto-vectorization
- Language extensions for vector programming

Memory:
- Use caches to hide memory latency
- Organize memory access for data reuse
- A structure of arrays facilitates vector loads/stores with unit stride
- Align data for vector accesses

Parallel programming to utilize the hardware resources.

Running Example: Monte Carlo

    for (int opt = 0; opt < OPT_N; opt++) {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0f, val2 = 0.0f;
        for (int pos = 0; pos < RAND_N; pos++) {
            float callValue = ExpectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
            val  += callValue;
            val2 += callValue * callValue;
        }
        float exprt = expf(-RISKFREE * T[opt]);
        h_CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = sqrtf(((float)RAND_N * val2 - val * val) /
                             ((float)RAND_N * ((float)RAND_N - 1.f)));
        h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
    }

NOT A STAC BENCHMARK. SFTL003 hands-on lab. Baseline, serial code: no core or vector utilization.

The Same Source Change Improves Performance on Both Targets

[Bar chart: options per second, serial vs. parallel+vector, on an Intel Xeon processor E5 and an Intel Xeon Phi coprocessor.]

Parallelization and vectorization together improve options per second by >800x and by >50x on the two targets. How do we get there?

Performance data generated by Shuo Li as part of the SFTL003 hands-on lab.

Running Example: Monte Carlo

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++) {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0f, val2 = 0.0f;
        for (int pos = 0; pos < RAND_N; pos++) {
            float callValue = ExpectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
            val  += callValue;
            val2 += callValue * callValue;
        }
        float exprt = expf(-RISKFREE * T[opt]);
        h_CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = sqrtf(((float)RAND_N * val2 - val * val) /
                             ((float)RAND_N * ((float)RAND_N - 1.f)));
        h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
    }

SFTL003 hands-on lab.

The Same Source Change Improves Performance on Both Targets

[Bar chart: options per second, serial vs. parallel scalar, on an Intel Xeon processor E5 and an Intel Xeon Phi coprocessor.]

Vector Parallelism in Intel Cilk Plus

- Array notations: syntax to operate on arrays; no ordering constraints, so SIMD can be used
- Elemental functions: a function describes the operation on one element; it is deployed across a collection of elements
- SIMD loops: vector parallelism on a single thread; a guaranteed vector implementation by the compiler

Language support for explicit vector programming.

A social challenge?

Vector execution is well understood; vector programming is not. Programmers expect an auto-vectorizer to vectorize their scalar loops. Why should this be easier than auto-parallelization?

Summary:
- Vector programming is distinct from both serial and parallel programming
- It currently yields a good return on investment, on both AVX and Xeon Phi
- The syntax is machine-independent: start with AVX
- Intel is driving standardization: GCC, OpenMP, C++

Performance with Vector Parallelism

Not STAC benchmarks. Measurements by Xinmin Tian for a paper in IPDPS PLC'12. robert.geva@intel.com

Auto-Vectorization Limited by Serial Semantics

    for (i = 0; i < *p; i++) {
        a[i] = b[i] * c[i];
        sum = sum + a[i];
    }

The compiler checks:
- Is *p loop-invariant?
- Are a, b, and c loop-invariant?
- Does a[] overlap with b[], c[], and/or sum?
- Is the + operator associative here? (Does the order of the adds matter?)
- Is vector computation on the target expected to be faster than scalar code?

Auto-vectorization is limited by the language rules: you can't say what you mean!

SIMD Pragma: Language-Based Vectorization

    #pragma simd reduction(+:sum)
    for (i = 0; i < *p; i++) {
        a[i] = b[i] * c[i];
        sum = sum + a[i];
    }

This loop implies:
- *p is loop-invariant
- a[] is not aliased with b[], c[], or sum
- sum is not aliased with b[] or c[]
- A private copy of sum is generated for each iteration
- The + operation on sum is associative (the compiler can reorder the adds on sum)
- Vector code is to be generated even if it could be slower than scalar code

SIMD Pragma: Definition

Top-level directive. C/C++: #pragma simd; Fortran: !DIR$ SIMD

Attached clauses describe the semantics:
- vectorlength(VL)
- private / firstprivate / lastprivate (var1[, var2, ...])
- reduction(oper1:var1[, ...][, oper2:var2[, ...]])
- linear(var1[:step1][, var2[:step2], ...])

An OpenMP*-like pragma for vector programming. A keyword-based syntax is also being added; not everyone wants to program with pragmas.

Spectrum of guarantees: hint (IVDEP) -> vector (SIMD) -> thread (OpenMP PARALLEL)

Vector Length

    for (i = 0; i <= max; i++)
        c[i] = a[i] + b[i];

[Diagram: element-wise addition of register lanes, a[i..i+3] + b[i..i+3] -> c[i..i+3] for doubles and a[i..i+7] + b[i..i+7] -> c[i..i+7] for singles.]

- 4 doubles in an AVX register; 8 in an Intel Xeon Phi coprocessor register
- 8 singles in an AVX register; 16 in an Intel Xeon Phi coprocessor register

The vector length depends on the hardware register width and the characteristic type in the loop.

Data in Vector Loops

    float sum = 0.0f;
    float *p = a;
    int step = 4;
    #pragma simd
    for (int i = 0; i < N; ++i) {
        sum += *p;
        p += step;
    }

The two statements with += operations have different meanings from each other; the programmer should be able to express them differently, and the compiler has to generate different code. The variables i, p and step also have different meanings from each other.

Data in Vector Loops

    float sum = 0.0f;
    float *p = a;
    int step = 4;
    #pragma simd reduction(+:sum) linear(p:step)
    for (int i = 0; i < N; ++i) {
        sum += *p;
        p += step;
    }

With the clauses added, the two += statements are now expressed differently: sum is a reduction, while p advances linearly by step each iteration, so the compiler can generate the right code for each.

Parallel Loops vs. Vector Loops

- Vector loops allow forward dependences
- Vector loops execute on a single thread
- Parallel loops allow critical sections, whereas vector loops would deadlock with critical sections

Vector:

    for (int i = 1; i < N; ++i) {
        a[i] = expr;
        b[i] += a[i-1];
    }

Parallel:

    for (int i = 1; i < N; ++i) {
        float x = sqrt(b[i] + a[i]);
        b[i] = x;
        omp_set_lock(&lck);
        float y = s += x;
        a[i] = y;
        omp_unset_lock(&lck);
    }

Running Example: Monte Carlo

    for (int opt = 0; opt < OPT_N; opt++) {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0f, val2 = 0.0f;
        #pragma simd reduction(+:val) reduction(+:val2)
        for (int pos = 0; pos < RAND_N; pos++) {
            float callValue = ExpectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
            val  += callValue;
            val2 += callValue * callValue;
        }
        float exprt = expf(-RISKFREE * T[opt]);
        h_CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = sqrtf(((float)RAND_N * val2 - val * val) /
                             ((float)RAND_N * ((float)RAND_N - 1.f)));
        h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
    }

SFTL003 hands-on lab.

The Same Source Change Improves Performance on Both Targets

[Bar chart: options per second, serial vs. serial+vector, on an Intel Xeon processor E5 and an Intel Xeon Phi coprocessor.]

Running Example: Monte Carlo

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++) {
        float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
        float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0f, val2 = 0.0f;
        #pragma simd reduction(+:val) reduction(+:val2)
        for (int pos = 0; pos < RAND_N; pos++) {
            float callValue = ExpectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
            val  += callValue;
            val2 += callValue * callValue;
        }
        float exprt = expf(-RISKFREE * T[opt]);
        h_CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = sqrtf(((float)RAND_N * val2 - val * val) /
                             ((float)RAND_N * ((float)RAND_N - 1.f)));
        h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
    }

SFTL003 hands-on lab.

The Same Source Change Improves Performance on Both Targets

[Bar chart: options per second, serial vs. parallel+vector, on an Intel Xeon processor E5 and an Intel Xeon Phi coprocessor.]

Parallelization and vectorization together improve options per second by >800x and by >50x.

Summary

Both Intel Xeon and Intel Xeon Phi processors benefit from parallel programming, and expressing parallelism can be done consistently between them.

Key considerations:
- Blocking algorithms
- Data layout and alignment
- Parallel programming to utilize the cores
- Vector programming to utilize the vector units

If you expect to port to Xeon Phi soon, the most economical first step is to use vector programming on a Xeon first.

System configuration for the measurements from IPDPS PLC'12: The performance measurements were carried out on an Intel Core i7 CPU X980 system (6 cores with Hyper-Threading on), running at 3.33 GHz, with 4.0 GB RAM, 12 MB smart cache, and 64-bit Windows Server 2008 R2 Enterprise SP1, using the Intel(R) C++ compiler 13.0 beta. Performance will vary depending on the specific hardware and software used.

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO THIS INFORMATION, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804