Intel CPUs
Growth in Cores: A Well-Rehearsed Story. 1. "Multicore is just a fad!" Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Implications of Moore's Law
Tick-Tock Development Cycles: Integrate. Innovate.
- Tick (process): 45nm, 32nm, 22nm (3D Tri-Gate)
- Tock (microarchitecture): Intel Core Microarchitecture, Nehalem, Sandy Bridge, Haswell (projection)
- ISA: SSE4.2/AESNI, AVX, AVX2**, future ISA
**Intel Architecture Instruction Set Extensions Programming Reference, #319433-012A, February 2012. Potential future options, subject to change without notice.
Intel & Parallelism: More cores. Wider vectors. Co-processors. (Images do not reflect actual die sizes.)

  Processor                                       Cores  Threads  SIMD width  ISA
  Intel Xeon processor 64-bit                       1       2        128      SSE2
  Intel Xeon processor 5100 series                  2       2        128      SSSE3
  Intel Xeon processor 5500 series                  4       8        128      SSE4.2
  Intel Xeon processor 5600 series                  6      12        128      SSE4.2
  Intel Xeon processor code-named Sandy Bridge      8      16        256      AVX
  Intel Xeon processor code-named Ivy Bridge        -       -        256      AVX
  Intel Xeon processor code-named Haswell           -       -        256      AVX2, FMA3
  Intel MIC coprocessor code-named Knights Ferry   32     128        512      -
  Intel MIC coprocessor code-named Knights Corner >50    >200       512      -

The Intel MIC architecture extends established CPU architecture and programming concepts to highly parallel applications.
Parallelism on Intel x86-based architectures: the hardware hierarchy
- Distributed memory level: multiple nodes (needs MPI)
- Shared memory level: multiple sockets per node (needs multithreading)
- CPU level: multiple physical cores, SMT (hyperthreading) (needs multithreading; an auto-parallelizer can help)
- Execution units: multiple per core, e.g. integer ALUs on ports 0, 1, 5 (exploited automatically by the hardware)
- x86 SIMD registers: multiple data elements in one xmm/ymm* register (needs vectorization; the vectorizer can help)
Knights Corner Core Architecture
- Improved Intel Pentium core: in-order, short pipeline, minimal speculation
- Instruction fetch and decode feeding 4 hardware threads per core, which tolerates latencies and keeps the execution units busy
- 2-wide issue: 1 vector (load-op) + 1 scalar op (or prefetch) per cycle, or 2 scalar ops per cycle
- Scalar unit (with x87) and vector unit, each with its own register file
- Per-core L1 I-cache and D-cache; 512K/256K local subset of the L2 cache
- Area- and power-efficient: > 50 cores
- Cores connected by an on-die ring network
Copyright 2012, Intel Corporation. All rights reserved.
BASED ON A PRESENTATION FROM: "Programming Continuity Between Intel Xeon and Intel Xeon Phi Coprocessors for High Performance", Robert Geva, Principal Engineer
Programming Continuity
[Diagram: a 256-bit SIMD register, bits 255-128 holding lanes X8-X5 / Y8-Y5 and bits 127-0 holding lanes X4-X1 / Y4-Y1; one vector instruction computes X8opY8 ... X1opY1 in parallel.]
Improving parallelism for better utilization of cores and vectors pays off on both Intel Xeon and Intel Xeon Phi products.
Parallel Programming for Intel Architecture (IA)
Cores:
- Use threads directly, or via OpenMP*, or
- Use tasking: Intel Threading Building Blocks (Intel TBB) / Cilk Plus
Vectors:
- Blocking algorithms
- Data layout and alignment
- Intrinsics, auto-vectorization
- Language extensions for vector programming
Memory:
- Use caches to hide memory latency
- Organize memory access for data reuse
- Structure of arrays facilitates unit-stride vector loads/stores
- Align data for vector accesses
Parallel programming to utilize the hardware resources.
Running Example: Monte Carlo

for (int opt = 0; opt < OPT_N; opt++) {
    float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
    float MuByT    = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
    float Sval = S[opt];
    float Xval = X[opt];
    float val = 0.0f, val2 = 0.0f;
    for (int pos = 0; pos < RAND_N; pos++) {
        float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
        val  += callValue;
        val2 += callValue * callValue;
    }
    float exprt = expf(-RISKFREE * T[opt]);
    h_CallResult[opt] = exprt * val / (float)RAND_N;
    float stdDev = sqrtf(((float)RAND_N * val2 - val * val)
                         / ((float)RAND_N * ((float)RAND_N - 1.0f)));
    h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
}

Baseline, serial code: no core or vector utilization. Not a STAC benchmark. (SFTL003 hands-on lab.)
The Same Source Change Improves Performance on Both Targets
[Chart: options per second, serial vs. parallel vector, on the Intel Xeon processor E5 and the Intel Xeon Phi coprocessor; y-axis 0-20,000.]
Parallelization and vectorization together improve options per second by >800X and by >50X. How do we get there?
Performance data generated by Shuo Li as part of the SFTL003 hands-on lab.
Running Example: Monte Carlo

#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++) {
    float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
    float MuByT    = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
    float Sval = S[opt];
    float Xval = X[opt];
    float val = 0.0f, val2 = 0.0f;
    for (int pos = 0; pos < RAND_N; pos++) {
        float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
        val  += callValue;
        val2 += callValue * callValue;
    }
    float exprt = expf(-RISKFREE * T[opt]);
    h_CallResult[opt] = exprt * val / (float)RAND_N;
    float stdDev = sqrtf(((float)RAND_N * val2 - val * val)
                         / ((float)RAND_N * ((float)RAND_N - 1.0f)));
    h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
}

(SFTL003 hands-on lab.)
The Same Source Change Improves Performance on Both Targets
[Chart: options per second, serial vs. parallel scalar, on the Intel Xeon processor E5 and the Intel Xeon Phi coprocessor; y-axis 0-5,000.]
Performance data generated by Shuo Li as part of the SFTL003 hands-on lab.
Vector Parallelism in Intel Cilk Plus
- Array notations: syntax to operate on arrays; no ordering constraints, so SIMD can be used
- Elemental functions: the function describes operations on one element, and is deployed across a collection of elements
- SIMD loops: vector parallelism on a single thread; guaranteed vector implementation by the compiler
Language support for explicit vector programming.
A social challenge? Vector execution is well understood; vector programming is not. Programmers expect an auto-vectorizer to vectorize their scalar loops. Why is this easier than auto-parallelism?
Summary:
- Vector programming is distinct from both serial and parallel programming
- It currently yields a good return on investment, on both AVX and Xeon Phi
- The syntax is machine-independent; start with AVX
- Intel is driving standardization: GCC, OpenMP, C++
Performance with Vector Parallelism. Not STAC benchmarks. Measurements by Xinmin Tian for a paper in IPDPS, PLC '12. Contact: robert.geva@intel.com
Auto-Vectorization Limited by Serial Semantics

for (i = 0; i < *p; i++) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}

The compiler must check:
- Is *p loop-invariant?
- Are a, b, and c loop-invariant?
- Does a[] overlap with b[], c[], and/or sum?
- Is the + operator associative here? (Does the order of the adds matter?)
- Is vector computation on the target expected to be faster than scalar code?

Auto-vectorization is limited by the language rules: you can't say what you mean!
SIMD Pragma: Language-Based Vectorization

#pragma simd reduction(+:sum)
for (i = 0; i < *p; i++) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}

This loop asserts:
- *p is loop-invariant
- a[] is not aliased with b[], c[], or sum
- sum is not aliased with b[] or c[]
- A private copy of sum is generated for each iteration
- The + operation on sum is associative (the compiler may reorder the adds on sum)
- Vector code is to be generated even if it could be slower than scalar code
SIMD Pragma: Definition

Top-level directive:
- C/C++:   #pragma simd
- Fortran: !DIR$ SIMD

Clauses attached to the directive describe the semantics:
- vectorlength(VL)
- private / firstprivate / lastprivate (var1[, var2, ...])
- reduction(oper1:var1[, ...][, oper2:var2[, ...]])
- linear(var1[:step1][, var2[:step2], ...])

An OpenMP*-like pragma for vector programming. A keyword-based syntax is also being added; not everyone wants to program with pragmas.

                vector    thread
    hint        IVDEP
    directive   SIMD      OpenMP PARALLEL
Vector Length

for (i = 0; i <= max; i++)
    c[i] = a[i] + b[i];

[Diagram: c[i..i+3] = a[i..i+3] + b[i..i+3] for doubles; c[i..i+7] = a[i..i+7] + b[i..i+7] for singles.]
- 4 doubles in an AVX register; 8 in an Intel Xeon Phi coprocessor register
- 8 singles in an AVX register; 16 in an Intel Xeon Phi coprocessor register
The vector length depends on the hardware register width and the characteristic type in the loop.
Data in Vector Loops

float sum = 0.0f;
float *p = a;
int step = 4;
#pragma simd
for (int i = 0; i < N; ++i) {
    sum += *p;
    p += step;
}

- The two statements with the += operator mean different things: one accumulates a value, the other advances a pointer by a fixed stride.
- The programmer should be able to express that difference, and the compiler has to generate different code for each.
- The variables i, p, and step likewise play different roles: i is the induction variable, p is linear in the iteration, and step is loop-invariant.
Data in Vector Loops

float sum = 0.0f;
float *p = a;
int step = 4;
#pragma simd reduction(+:sum) linear(p:step)
for (int i = 0; i < N; ++i) {
    sum += *p;
    p += step;
}

With the clauses, the two += statements are expressed differently: reduction(+:sum) marks sum as an accumulator, and linear(p:step) marks p as advancing by step each iteration, so the compiler can generate the right code for each.
Parallel Loops vs. Vector Loops
- Vector loops allow forward dependences
- Vector loops execute on a single thread
- Parallel loops allow critical sections, whereas vector loops would deadlock with critical sections

Vector:
for (int i = 1; i < N; ++i) {
    a[i] = expr;
    b[i] += a[i-1];
}

Parallel:
for (int i = 1; i < N; ++i) {
    float x = sqrt(b[i] + a[i]);
    b[i] = x;
    omp_set_lock(&lck);
    float y = s += x;
    a[i] = y;
    omp_unset_lock(&lck);
}
Running Example: Monte Carlo

for (int opt = 0; opt < OPT_N; opt++) {
    float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
    float MuByT    = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
    float Sval = S[opt];
    float Xval = X[opt];
    float val = 0.0f, val2 = 0.0f;
    #pragma simd reduction(+:val) reduction(+:val2)
    for (int pos = 0; pos < RAND_N; pos++) {
        float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
        val  += callValue;
        val2 += callValue * callValue;
    }
    float exprt = expf(-RISKFREE * T[opt]);
    h_CallResult[opt] = exprt * val / (float)RAND_N;
    float stdDev = sqrtf(((float)RAND_N * val2 - val * val)
                         / ((float)RAND_N * ((float)RAND_N - 1.0f)));
    h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
}

(SFTL003 hands-on lab.)
The Same Source Change Improves Performance on Both Targets
[Chart: options per second, serial vs. serial vector, on the Intel Xeon processor E5 and the Intel Xeon Phi coprocessor; y-axis 0-1,000.]
Performance data generated by Shuo Li as part of the SFTL003 hands-on lab.
Running Example: Monte Carlo

#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++) {
    float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
    float MuByT    = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
    float Sval = S[opt];
    float Xval = X[opt];
    float val = 0.0f, val2 = 0.0f;
    #pragma simd reduction(+:val) reduction(+:val2)
    for (int pos = 0; pos < RAND_N; pos++) {
        float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
        val  += callValue;
        val2 += callValue * callValue;
    }
    float exprt = expf(-RISKFREE * T[opt]);
    h_CallResult[opt] = exprt * val / (float)RAND_N;
    float stdDev = sqrtf(((float)RAND_N * val2 - val * val)
                         / ((float)RAND_N * ((float)RAND_N - 1.0f)));
    h_CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / sqrtf((float)RAND_N));
}

(SFTL003 hands-on lab.)
The Same Source Change Improves Performance on Both Targets
[Chart: options per second, serial vs. parallel vector, on the Intel Xeon processor E5 and the Intel Xeon Phi coprocessor; y-axis 0-20,000.]
Parallelization and vectorization together improve options per second by >800X and by >50X.
Performance data generated by Shuo Li as part of the SFTL003 hands-on lab.
Summary
- Both Intel Xeon processors and Intel Xeon Phi coprocessors benefit from parallel programming
- Parallelism can be expressed consistently between them
- Key considerations:
  - Blocking algorithms
  - Data layout and alignment
  - Parallel programming to utilize the cores
  - Vector programming to utilize the vector units
- If you expect to port to Xeon Phi soon, the most economical first step is to use vector programming on a Xeon first.
System configuration for measurements from IPDPS, PLC '12: The performance measurements were carried out on an Intel Core i7 CPU X980 system (6 cores with Hyper-Threading on), running at 3.33 GHz, with 4.0 GB RAM, 12 MB smart cache, 64-bit Windows Server 2008 R2 Enterprise SP1, using the Intel(R) C++ Compiler 13.0 beta. Performance will vary depending on the specific hardware and software used.
Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO THIS INFORMATION, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice (8/2/2012): Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804 Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.