PARALLEL PROGRAMMING ON INTEL XEON PHI FOR EFFICIENT LINEAR ALGEBRA
2nd Workshop MIC IFERC
PARALLEL PROGRAMMING ON INTEL XEON PHI FOR EFFICIENT LINEAR ALGEBRA
Ph.D. candidate: Fan Ye
CEA advisor: Christophe Calvin
Supervisor: Serge Petiton
CEA, 18 March 2015
ADVENT OF ACCELERATORS: MIC AND GPU
Power wall => clock frequency can no longer keep rising
Mono-core => multi-core => many-core
Processor design: latency-oriented => throughput-oriented
=> Accelerators!
SOME HISTORY
Goal: reduce the power consumption of Intel Xeon processors.
Low frequency + many cores + appropriate software support -> better performance/watt efficiency.
Will x86 do it? -> Yes. The ISA overhead needed for x86 compatibility dictates less than 10% of power consumption.
The focus:
- In-order cores
- x86 ISA
- Smaller pipeline
- Wider SIMD
- SMT
Started with P54C Pentium cores connected through a ring interconnect, with a texture sampler added to help with graphics: the Larrabee project.
PRECEDING PROJECTS
Intel MIC incorporates ideas from three earlier projects.
Larrabee (codename for a GPGPU chip), a many-core architecture:
- Introduced wide SIMD (512-bit) to an x86 architecture
- Cache-coherent multiprocessor system connected via a ring bus to memory
- Each core was capable of 4-way multithreading
- Specialised hardware for texture sampling
The Teraflops Research Chip, a multicore research project:
- An experimental 80-core chip with 2 floating-point units per core, implementing not x86 but a 96-bit VLIW architecture
- Other features: energy efficiency, core communication, self correction, fixed-function cores, memory stacking
The Intel Single-chip Cloud Computer (SCC), a multicore microprocessor:
- Each SCC chip contained 48 P54C Pentium cores connected with a 4x6 2D mesh
- The cores were divided into 24 tiles; each tile had 2 cores and a message passing buffer (MPB) shared by the two cores
- 4 DDR3 memory controllers per chip, also connected to the 2D mesh
- The design lacked cache coherence and focused on principles that would allow scaling to many more cores
TRILOGY
Knights Ferry (1st generation: Intel's MIC prototype board)
- Cores: 32 in-order x86 cores (modified Pentium), up to 1.2 GHz, 4 SMT threads per core, one 512-bit SIMD unit
- Cache: 32 KB L1; 8 MB coherent L2 (256 KB per core)
- Memory: 2 GB GDDR5
- Interconnect: 1024-bit bidirectional ring bus
- Process size: 45 nm
- Theoretical peak: SP 750 Gflops, DP n/a
- Power consumption: ~300 W
Knights Corner (1st generation: first many-core commercial product)
- Cores: 61 in-order x86 cores (modified Pentium), up to 1.2 GHz, 4 SMT threads per core, one 512-bit SIMD unit
- Cache: 32 KB L1; 32 MB coherent L2 (512 KB per core)
- Memory: up to 16 GB GDDR5
- Interconnect: 1024-bit bidirectional ring bus
- Process size: 22 nm
- Theoretical peak: SP 2 Tflops, DP 1 Tflops
Knights Landing (2nd generation)
- Cores: 72 x86 cores (modified Airmont Atom) divided into 36 tiles, 4 SMT threads per core, two 512-bit SIMD units
- Cache: L1 + L2 shared by the 2 cores of a tile + configurable MCDRAM
- Memory: DDR4 + MCDRAM (stacked 3D on-die)
- Interconnect: 2D mesh
- Process size: 14 nm
- Theoretical peak: SP 7 Tflops, DP 3 Tflops
PRESENCE IN TOP 10
Xeon Phi is the brand name used for all products based on the Intel MIC architecture.
[Figure: top 10 systems of the TOP500 list, with the KNC-based entries marked.]
PORTRAIT
Available as a PCIe device.
Examples: Tianhe-2 (MilkyWay-2) and Stampede; both use an Intel Xeon + Intel Xeon Phi configuration.
CACHE SUBSYSTEM
Main objective: reduce the memory bandwidth/latency bottleneck inherent in the von Neumann architecture.
A cache is added to the processor core and connects through a memory controller (MC) to the main memory.
At a high level, processors are now designed with two distinct but important components known as core and uncore:
- the core components consist of the engines that do the computations
- the uncore components include caches, memory and peripheral components
The uncore components of modern processors play an increasingly fundamental role in scientific application performance, and often consume more power and silicon area than the cores.
CACHE SUBSYSTEM
Cache subsystem:
- L1 data + L1 instruction + L1 data TLB + L1 instruction TLB
- L2 unified & coherent + L2 unified TLB (the L2 cache is inclusive of the L1 cache)
CACHE SUBSYSTEM
Intel Xeon Phi L1 I/D cache configuration:
- Size: 32 KB
- Associativity: 8-way
- Line size: 64 bytes
- Bank size: 8 bytes
- Data return: out of order
The data cache allows simultaneous reads and writes, so a cache line replacement can happen in a single cycle.
L1 cache access latency: 3 cycles.
The L2 cache is 512 KB per core:
- divided into 1024 sets, 8 ways per set, with one 64-byte cache line per way
- divided into 2 logical banks
- latency can be as low as 11 cycles
- can deliver 64 bytes of read data to the corresponding core every two cycles, and 64 bytes of write data every cycle
CACHE SUBSYSTEM
Linear-to-physical address translation in the Intel Xeon Phi coprocessor.
The job of the TLB (translation lookaside buffer) is to avoid repeating the page walk needed to locate a page, by saving the page address once discovered.
L1 data TLB:
- 4 KB pages: 64 entries, 4-way
- 64 KB pages: 32 entries, 4-way
- 2 MB pages: 8 entries, 4-way
L1 instruction TLB:
- 4 KB pages: 64 entries, 4-way
L2 TLB:
- 4 KB, 64 KB, 2 MB pages: 64 entries, 4-way
CACHE SUBSYSTEM
[Figure: energy consumed per byte of data transferred from the memory, L1 and L2 caches.]
The L1 and L2 caches provide aggregate bandwidths approximately 15 and 7 times higher, respectively, than the aggregate memory bandwidth.
INTERCONNECT
The interconnect topology selected for a many-core processor is determined by the latency, bandwidth and cost of implementing the technology.
The interconnect chosen for KNC is a bidirectional ring topology.
All cores talk to each other, and to memory through the memory controllers, over the ring bus.
In the figure, P0-Pn are the cores, C the caches and MC the memory controller; in reality there are 8 memory controllers distributed over the ring to improve the memory bandwidth.
The system interface controller supports an I/O protocol such as PCI Express to communicate with the host.
[Figure: many-core processor architecture with cores connected through a ring bus.]
INTERCONNECT
Core memory interface: 32-bit, 2 channels -> 8.4 GB/s per core.
8 memory controllers, each with 2 GDDR5 channels.
Memory bandwidth:
- Consumable max: 8.4 x 61 = 512.4 GB/s
- Producible max: 5.5 Gtransfers/s x 16 channels x 4 B/transfer = 352 GB/s
- STREAM benchmark: READ 180 GB/s, WRITE 160 GB/s
Each direction of the bidirectional ring comprises 3 independent rings:
- The data block ring (the first, largest, and most expensive): 64 bytes wide
- The address ring (much smaller): used to send read/write commands and memory addresses
- The acknowledgement ring (smallest and least expensive): sends flow control and coherence messages
THEORETICAL PEAK PERFORMANCE
"Peak performance is what the manufacturer guarantees that programs will not exceed." -- Jack Dongarra
For an instantiation of the Intel Xeon Phi coprocessor with 60 usable cores running at 1.1 GHz, the theoretical peak performance is computed as follows:
- Single precision: 16 (SP SIMD lanes) x 2 (FMA) x 1.1 (GHz) x 60 (#cores) = 2112 Gflop/s
- Double precision: 8 (DP SIMD lanes) x 2 (FMA) x 1.1 (GHz) x 60 (#cores) = 1056 Gflop/s
The Intel Xeon Phi coprocessor runs an OS internally, which may take up a core to service hardware/software requests such as interrupts. As such, a 61-core processor often ends up with 60 cores available for pure computation.
INTERCONNECT
What happens on an L2 miss?
An address request is sent on the address (AD) ring to the tag directories. If the requested data block is found in another core's L2 cache, a forwarding request is sent to that core's L2 over the AD ring, and the requested block is subsequently forwarded on the data block ring. Otherwise, a memory address is sent from the tag directories to the memory controllers.
512 KB of L2 per core -> 32 MB of collective L2 in total (kept cache coherent).
- Memory addresses are uniformly distributed among the tag directories on the ring to provide smooth traffic characteristics on the ring.
- Addresses are also evenly distributed across the memory controllers.
- The memory controllers are symmetrically interleaved around the ring.
- There is an all-to-all mapping from the tag directories to the memory controllers.
CORE
Many-core architecture:
- A logical evolution from multithreading: clone the whole core multiple times to allow multiple threads of execution to proceed in parallel (a homogeneous many-core architecture uses similar cores).
- Moves the burden of achieving application performance improvement from hardware engineers towards software engineers!
In the figure, C indicates cache, MC the memory controller, and Px the processor cores.
One big question: which parallel constructs should be used to exploit such machines?
COMPUTE MODES
A process viewpoint of the Intel MIC architecture enabled compute continuum, from Xeon-centric to MIC-centric:
- Native Xeon (Xeon hosted): main() and all functions run on the Xeon
- Offload (MIC co-processed): main() runs on the Xeon, selected functions are offloaded over PCIe to the MIC
- Autonomous mode: both Xeon and MIC run full MPI programs
- Reverse offload (Xeon co-processed): main() runs on the MIC, selected functions are offloaded to the Xeon
- Native MIC (MIC hosted): main() and all functions run on the MIC
Choosing among the available programming methods is the BIG CHALLENGE.
AVAILABLE PARALLEL CONSTRUCTS ON MIC
Considering MIC as a shared memory system:
- Multithreading techniques (from easy to hard): MKL, OpenMP, Cilk+, TBB (Threading Building Blocks), Pthreads
- Explicit vectorization (from easy to hard):
  - OpenMP 4.0 simd pragma: #pragma omp simd
  - Intel simd pragma: #pragma simd
  - Intel Cilk+ C/C++ array notation: a[lower_bound : length : stride], e.g. a[1:2:3] -> a[1], a[4]; along with some array functions
  - IMCI (Intel Initial Many Core Instructions) intrinsic functions
Considering MIC as a distributed memory system: MPI
Other techniques: OpenCL (cross-platform), StarPU, Kaapi (runtimes), SCIF, COI (low-level APIs), etc.
MULTIPROCESSING
Worker: a thread or a process.
On MIC, each core is 4-way multithreaded: each core can concurrently execute instructions from 4 threads/processes. This helps hide the vector pipeline latency and the memory access latencies, keeping the execution units busy.
[Figure: thread state diagram (similar to that of a process).]
MULTITHREADING MODELS
In a nutshell: an abstraction that efficiently maps work onto OS threads.
Task-centric programming: tasks -> threads -> cores.
Schedulers: work-sharing vs. work-stealing.
Main performance bottlenecks:
- For work-sharing: contention on the shared task queue (heavy contention due to the large number of cores)
- For work-stealing: the high cost of stealing tasks from remote threads (the stealing cost is proportional to the distance between the thief thread and the victim thread)
MOTIVATION
Neutronics: neutron transport & interactions.
- Monte Carlo methods: solve the exact model statistically.
- Deterministic methods: solve the linearized Boltzmann transport equation (a first-principles treatment) -> an eigenproblem.
The basic numerical method implemented in the main deterministic neutronics code is the power method.
PARALLELIZATION OF DENSE MATRIX-VECTOR PRODUCT KERNEL
Multithreading techniques: OpenMP, Cilk+, TBB.
Explicit vectorization methods: Intel Cilk+ array notation, Intel simd pragma.
Matrix-vector product kernel:
  1. for i = 1 to n
  2.   b_i = 0
  3.   for j = 1 to n
  4.     b_i = b_i + A_ij * x_j
  5.   end for
  6. end for
Solutions:
1. Step 1 -> multithreading
2. Steps 1 and 3 -> multithreading
3. Step 1 -> multithreading + step 3 -> vectorization
Ref: C. Calvin, F. Ye, S. Petiton, "The Exploration of Pervasive and Fine-Grained Parallel Model Applied on Intel Xeon Phi Coprocessor", 3PGCIC 2013.
IMPLEMENTATION
Worker:
1. Pure OpenMP threads
2. Hybrid MPI/OpenMP (processes/threads)
3. Pure MPI processes
Idea: mixing different dimensions of parallelism (MPI processes x OpenMP threads x VPU).
SIMDized kernel using the CSR format (row_ptrs, col_inds, vals):
  reg_y <- 0
  start <- row_ptrs[row]
  end   <- row_ptrs[row+1]
  for i = start to end, step 8 do
    writemask <- (end - i) > 8 ? 0xff : (0xff >> (8 - end + i))
    reg_ind <- load(writemask, &col_inds[i])
    reg_val <- load(writemask, &vals[i])
    reg_x   <- gather(writemask, reg_ind, x)
    reg_y   <- fmadd(reg_x, reg_val, reg_y, writemask)
  end for
  y[row] <- reduce_add(reg_y)
Two phases:
1. Computing phase: all elements of y are calculated
2. Communication phase: y is copied to x (explicit message passing in the presence of MPI)
RESULTS
OMP vs. MKL: no big difference, except that MKL tends to perform better with more threads per core.
Hybrid gain, cross-platform performance: hybrid MPI/OpenMP helps reduce the scaling overheads and promotes data locality.
PERFORMANCE ANALYSIS
Three main factors, and how they are quantified:
1. Vectorization rate: the average number of nonzeros per row
2. Nonzero dispersion rate: the average number of occurrences where the distance between a pair of contiguous nonzero elements within a row is greater than 2
3. Load balancing: analysis within the slowest process
Proposed model: P_thd(nnz, d) = alpha * [1 - exp(-nnz/eps1)] * exp(-d/eps2)
Estimated parameters: alpha = 187.5, eps1 = 55, eps2 = 40
Ref: F. Ye, C. Calvin, S. Petiton, "A Study of SpMV Implementation using MPI and OpenMP on Intel Many-Core Architecture", VECPAR 2014.
WHY MONTE CARLO
To fully unleash the potential of massively parallel architectures, we need to exploit more parallelism at the algorithmic level. In this direction, the Monte Carlo method appears to be a good candidate.
As our research centers on Krylov subspace methods, we propose to use a Monte Carlo technique as a preconditioner for GMRES, given the suboptimal convergence properties of this stochastic linear solver.
Two steps:
1. Validate the standard GMRES with a Monte Carlo preconditioner
2. Flexible GMRES with smart preconditioning
MONTE CARLO LINEAR SOLVER/PRECONDITIONER (1)
Monte Carlo in linear algebra:
- Dates back to the work of von Neumann and Ulam, with a recent revival of interest
- The use of MC is promising where approximate solutions are sufficient -> preconditioning, graph partitioning, information retrieval, feature extraction
- Parallel MC is very latency tolerant -> intrinsic parallelism
- MC can yield specific components of the solution
- The convergence rate is independent of the size of the matrix
Monte Carlo linear algebra techniques are:
- Based on the ability to perform stochastic matrix-vector multiplication
- Based on stationary iterative methods with poor convergence properties
MONTE CARLO LINEAR SOLVER/PRECONDITIONER (2)
Current MC techniques: stochastic matrix-vector multiplication.
Consider C in R^{n x n} and a vector h in R^n.
Choose a transition-probability matrix P and a weight matrix W such that
  C_ij = P_ij * W_ij, 1 <= i, j <= n, where sum_{i=1}^{n} P_ij = 1 for 1 <= j <= n,
and an initial probability p with initial weights w satisfying
  h_i = p_i * w_i, 1 <= i <= n, with sum_{i=1}^{n} p_i = 1.
MC techniques estimate C^j h, j >= 0, by constructing a Markov chain of length j:
- The random walk visits a set of states in {1, ..., n}; the state visited at the i-th step is k_i, i in [0, j]
- Probability of the initial state: Prob(k_0 = alpha) = p_alpha
- Transition probability: Prob(k_i = alpha | k_{i-1} = beta) = P_{alpha beta}
Consider the random variables X_i defined as
  X_0 = w_{k_0},  X_i = X_{i-1} * W_{k_i k_{i-1}}.
Let delta denote the Kronecker delta (delta_ij = 1 if i = j, 0 otherwise); then it can be shown that
  E(X_j * delta_{i k_j}) = (C^j h)_i, 1 <= i <= n,
i.e. for each random walk, X_j * delta_{i k_j} can be used to estimate the i-th component of C^j h.
MONTE CARLO LINEAR SOLVER/PRECONDITIONER (3)
Linear solvers: Ax = b, A in R^{n x n}, x, b in R^n.
The starting point of MC techniques is to split A as A = N - M and write the fixed-point iteration
  x^{(m+1)} = N^{-1} M x^{(m)} + N^{-1} b = C x^{(m)} + h,
where C = N^{-1} M and h = N^{-1} b. Then we get
  x^{(m)} = C^m x^{(0)} + sum_{i=0}^{m-1} C^i h.
The initial vector x^{(0)} is often taken to be h for convenience, yielding
  x^{(m)} = sum_{i=0}^{m} C^i h.
x^{(m)} converges to the solution as m -> infinity if ||C|| < 1.
TALKS
- C. Calvin, S. Petiton, F. Ye, "Krylov Basis Orthogonalization Algorithms on Many Core Architectures", SIAM Annual Meeting 2013, San Diego, U.S.
- F. Ye, S. Petiton, C. Calvin, "Fine-Grained Multilevel Parallel Programming on Intel Xeon Phi for Eigenproblem", Journée Informatique Intensive et Massive de Proximité, Polytechnique, France
- C. Calvin, F. Boillod-Cerneux, N. Emad, S. Petiton, F. Ye, "FP3C Meeting ongoing research Task 7", ANR-JST FP3C Meeting, ENSEEIHT, Toulouse, France
- C. Calvin, F. Boillod-Cerneux, F. Ye, H. Galicher, S. Petiton, "Programming Paradigms for Emerging Architectures Applied to Asynchronous Krylov Eigensolver", SIAM-PP 14, Portland, U.S.
Commissariat à l'énergie atomique et aux énergies alternatives
Centre de Saclay, Gif-sur-Yvette Cedex
T. +33 (0) F. +33 (0)1 XX XX XX XX
DEN DM2S
Etablissement public à caractère industriel et commercial, R.C.S. Paris B
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution. IT4Innovations, Ostrava,
, Hardware Overview & Native Execution IT4Innovations, Ostrava, 3.2.- 4.2.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi (MIC) Programming models Native mode programming
More informationAUTOMATIC SMT THREADING
AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationA Unified Approach to Heterogeneous Architectures Using the Uintah Framework
DOE for funding the CSAFE project (97-10), DOE NETL, DOE NNSA NSF for funding via SDCI and PetaApps A Unified Approach to Heterogeneous Architectures Using the Uintah Framework Qingyu Meng, Alan Humphrey
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationCOMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationDesigning Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationOpportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory
Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Jongsoo Park, Parallel Computing Lab, Intel Corporation with contributions from MKL team 1 Algorithm/
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationChapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture
An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program
More informationVector Engine Processor of SX-Aurora TSUBASA
Vector Engine Processor of SX-Aurora TSUBASA Shintaro Momose, Ph.D., NEC Deutschland GmbH 9 th October, 2018 WSSP 1 NEC Corporation 2018 Contents 1) Introduction 2) VE Processor Architecture 3) Performance
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationVincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012
Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload
More informationGrowth in Cores - A well rehearsed story
Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationHybrid Architectures Why Should I Bother?
Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationEfficient Parallel Programming on Xeon Phi for Exascale
Efficient Parallel Programming on Xeon Phi for Exascale Eric Petit, Intel IPAG, Seminar at MDLS, Saclay, 29th November 2016 Legal Disclaimers Intel technologies features and benefits depend on system configuration
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationHigh Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA
High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationManycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.
phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this
More informationDoes the Intel Xeon Phi processor fit HEP workloads?
Does the Intel Xeon Phi processor fit HEP workloads? October 17th, CHEP 2013, Amsterdam Andrzej Nowak, CERN openlab CTO office On behalf of Georgios Bitzes, Havard Bjerke, Andrea Dotti, Alfio Lazzaro,
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationKnights Corner: Your Path to Knights Landing
Knights Corner: Your Path to Knights Landing James Reinders, Intel Wednesday, September 17, 2014; 9-10am PDT Photo (c) 2014, James Reinders; used with permission; Yosemite Half Dome rising through forest
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationScientific Computing with Intel Xeon Phi Coprocessors
Scientific Computing with Intel Xeon Phi Coprocessors Andrey Vladimirov Colfax International HPC Advisory Council Stanford Conference 2015 Compututing with Xeon Phi Welcome Colfax International, 2014 Contents
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More information