PARALLEL PROGRAMMING ON INTEL XEON PHI FOR EFFICIENT LINEAR ALGEBRA
2nd Workshop MIC IFERC
PARALLEL PROGRAMMING ON INTEL XEON PHI FOR EFFICIENT LINEAR ALGEBRA
Ph.D. candidate: Fan Ye
CEA advisor: Christophe Calvin
Supervisor: Serge Petiton
CEA, 18 March 2015
ADVENT OF ACCELERATORS: MIC AND GPU
Power wall => clock frequency can no longer keep rising
Mono-core => multi-core => many-core
Processor design: latency-oriented => throughput-oriented
=> Accelerators!
SOME HISTORY
Goal: reduce the power consumption of Intel Xeon processors.
Low frequency + many cores + appropriate software support -> better performance/watt efficiency.
Will x86 do it? -> Yes. The ISA overhead needed for x86 compatibility dictates less than 10% of power consumption.
The focus:
- In-order cores
- x86 ISA
- Smaller pipeline
- Wider SIMD
- SMT
Started with P54C Pentium cores connected through a ring interconnect, with a texture sampler added to help with graphics: the Larrabee project.
PRECEDING PROJECTS
Intel MIC incorporates ideas from three earlier projects.
Larrabee (codename for a GPGPU chip), a many-core architecture:
- Introduced wide SIMD (512-bit) to an x86 architecture
- Cache-coherent multiprocessor system connected via a ring bus to memory
- Each core was capable of 4-way multithreading
- Specialised hardware for texture sampling
The Teraflops Research Chip, a multicore research project:
- An experimental 80-core chip with 2 floating-point units per core, implementing not x86 but a 96-bit VLIW architecture
- Other features: energy efficiency, core communication, self correction, fixed-function cores, memory stacking
The Intel Single-chip Cloud Computer (SCC), a multicore microprocessor:
- Each SCC chip contained 48 P54C Pentium cores connected with a 4x6 2D mesh
- The cores were divided into 24 tiles; each tile had 2 cores and a message passing buffer (MPB) shared by the two cores
- 4 DDR3 memory controllers per chip, also connected to the 2D mesh
- The design lacked cache coherence and focused on principles that would allow scaling to many more cores
TRILOGY
Knights Ferry (1st generation: Intel's MIC prototype board)
- Cores: 32 in-order x86 cores (modified Pentium), up to 1.2 GHz, 4 SMT threads per core, one 512-bit SIMD unit
- Cache: 32 KB L1; 8 MB coherent L2 (256 KB per core)
- Memory: 2 GB GDDR5
- Interconnect: 1024-bit bidirectional ring bus
- Process size: 45 nm
- Theoretical peak: SP 750 Gflops, DP n/a
- Power consumption: ~300 W
Knights Corner (1st generation: first many-core commercial product)
- Cores: 61 in-order x86 cores (modified Pentium), up to 1.2 GHz, 4 SMT threads per core, one 512-bit SIMD unit
- Cache: 32 KB L1; 32 MB coherent L2 (512 KB per core)
- Memory: up to 16 GB GDDR5
- Interconnect: 1024-bit bidirectional ring bus
- Process size: 22 nm
- Theoretical peak: SP 2 Tflops, DP 1 Tflops
Knights Landing (2nd generation)
- Cores: 72 x86 cores (modified Airmont Atom) divided into 36 tiles, 4 SMT threads per core, two 512-bit SIMD units
- Cache: L1 + L2 shared by the 2 cores of a tile + configurable MCDRAM
- Memory: DDR4 + MCDRAM (stacked 3D on-die)
- Interconnect: 2D mesh
- Process size: 14 nm
- Theoretical peak: SP 7 Tflops, DP 3 Tflops
PRESENCE IN TOP 10
Xeon Phi is the brand name used for all products based on the Intel MIC architecture.
[Figure: top 10 systems of the TOP500 list, with the KNC-based entries marked.]
PORTRAIT
Available as a PCIe device.
Examples: Tianhe-2 (MilkyWay-2) and Stampede; both use an Intel Xeon + Intel Xeon Phi configuration.
CACHE SUBSYSTEM
Main objective: reduce the memory bandwidth/latency bottleneck inherent in the von Neumann architecture.
A cache is added to the processor core and connects through a memory controller (MC) to the main memory.
At a high level, processors are now designed with two distinct but important components known as core and uncore:
- the core components consist of the engines that do the computations
- the uncore components include caches, memory and peripheral components
The uncore components of modern processors play an increasingly fundamental role in scientific application performance, and often consume more power and silicon area than the cores.
CACHE SUBSYSTEM
Cache subsystem:
- L1 data + L1 instruction + L1 data TLB + L1 instruction TLB
- L2 unified & coherent + L2 unified TLB (the L2 cache is inclusive of the L1 cache)
CACHE SUBSYSTEM
Intel Xeon Phi L1 I/D cache configuration:
- Size: 32 KB
- Associativity: 8-way
- Line size: 64 bytes
- Bank size: 8 bytes
- Data return: out of order
The data cache allows simultaneous reads and writes, so a cache line replacement can happen in a single cycle.
L1 cache access latency: 3 cycles.
The L2 cache is 512 KB per core:
- divided into 1024 sets, 8 ways per set, with one 64-byte cache line per way
- divided into 2 logical banks
- latency can be as low as 11 cycles
- can deliver 64 bytes of read data to the corresponding core every two cycles, and 64 bytes of write data every cycle
CACHE SUBSYSTEM
Linear-to-physical address translation in the Intel Xeon Phi coprocessor.
The job of the TLB (translation lookaside buffer) is to avoid repeating the page walk needed to locate a page, by saving the page address once discovered.
L1 data TLB:
- 4 KB pages: 64 entries, 4-way
- 64 KB pages: 32 entries, 4-way
- 2 MB pages: 8 entries, 4-way
L1 instruction TLB:
- 4 KB pages: 64 entries, 4-way
L2 TLB:
- 4 KB, 64 KB, 2 MB pages: 64 entries, 4-way
CACHE SUBSYSTEM
[Figure: energy consumed per byte of data transferred from the memory, L1 and L2 caches.]
The L1 and L2 caches provide aggregate bandwidths approximately 15 and 7 times higher, respectively, than the aggregate memory bandwidth.
INTERCONNECT
The interconnect topology selected for a many-core processor is determined by the latency, bandwidth and cost of implementing the technology.
The interconnect chosen for KNC is a bidirectional ring topology.
All cores talk to each other, and to memory through the memory controllers, over the ring bus.
In the figure, P0-Pn are the cores, C the caches and MC the memory controller; in reality there are 8 memory controllers distributed over the ring to improve the memory bandwidth.
The system interface controller supports an I/O protocol such as PCI Express to communicate with the host.
[Figure: many-core processor architecture with cores connected through a ring bus.]
INTERCONNECT
Core memory interface: 32-bit, 2 channels -> 8.4 GB/s per core.
8 memory controllers, each with 2 GDDR5 channels.
Memory bandwidth:
- Consumable max: 8.4 x 61 = 512.4 GB/s
- Producible max: 5.5 Gtransfers/s x 16 channels x 4 B/transfer = 352 GB/s
- STREAM benchmark: READ 180 GB/s, WRITE 160 GB/s
Each direction of the bidirectional ring comprises 3 independent rings:
- The data block ring (the first, largest, and most expensive): 64 bytes wide
- The address ring (much smaller): used to send read/write commands and memory addresses
- The acknowledgement ring (smallest and least expensive): sends flow control and coherence messages
THEORETICAL PEAK PERFORMANCE
"Peak performance is what the manufacturer guarantees that programs will not exceed." -- Jack Dongarra
For an instantiation of the Intel Xeon Phi coprocessor with 60 usable cores running at 1.1 GHz, the theoretical peak performance is computed as follows:
- Single precision: 16 (SP SIMD lanes) x 2 (FMA) x 1.1 (GHz) x 60 (#cores) = 2112 Gflop/s
- Double precision: 8 (DP SIMD lanes) x 2 (FMA) x 1.1 (GHz) x 60 (#cores) = 1056 Gflop/s
The Intel Xeon Phi coprocessor runs an OS internally, which may take up a core to service hardware/software requests such as interrupts. As such, a 61-core processor often ends up with 60 cores available for pure computation.
INTERCONNECT
What happens on an L2 miss?
An address request is sent on the address (AD) ring to the tag directories. If the requested data block is found in another core's L2 cache, a forwarding request is sent to that core's L2 over the AD ring, and the requested block is subsequently forwarded on the data block ring. Otherwise, a memory address is sent from the tag directories to the memory controllers.
512 KB of L2 per core -> 32 MB of collective L2 in total (kept cache coherent).
- Memory addresses are uniformly distributed among the tag directories on the ring to provide smooth traffic characteristics on the ring.
- Addresses are also evenly distributed across the memory controllers.
- The memory controllers are symmetrically interleaved around the ring.
- There is an all-to-all mapping from the tag directories to the memory controllers.
CORE
Many-core architecture:
- A logical evolution from multithreading: clone the whole core multiple times to allow multiple threads of execution to proceed in parallel (a homogeneous many-core architecture uses similar cores).
- Moves the burden of achieving application performance improvement from hardware engineers towards software engineers!
In the figure, C indicates cache, MC the memory controller, and Px the processor cores.
One big question: which parallel constructs should be used to exploit such machines?
COMPUTE MODES
A process viewpoint of the Intel MIC architecture enabled compute continuum, from Xeon-centric to MIC-centric:
- Native Xeon (Xeon hosted): main() and all functions run on the Xeon
- Offload (MIC co-processed): main() runs on the Xeon, selected functions are offloaded over PCIe to the MIC
- Autonomous mode: both Xeon and MIC run full MPI programs
- Reverse offload (Xeon co-processed): main() runs on the MIC, selected functions are offloaded to the Xeon
- Native MIC (MIC hosted): main() and all functions run on the MIC
Choosing among the available programming methods is the BIG CHALLENGE.
AVAILABLE PARALLEL CONSTRUCTS ON MIC
Considering MIC as a shared memory system:
- Multithreading techniques (from easy to hard): MKL, OpenMP, Cilk+, TBB (Threading Building Blocks), Pthreads
- Explicit vectorization (from easy to hard):
  - OpenMP 4.0 simd pragma: #pragma omp simd
  - Intel simd pragma: #pragma simd
  - Intel Cilk+ C/C++ array notation: a[lower_bound : length : stride], e.g. a[1:2:3] -> a[1], a[4]; along with some array functions
  - IMCI (Intel Initial Many Core Instructions) intrinsic functions
Considering MIC as a distributed memory system: MPI
Other techniques: OpenCL (cross-platform), StarPU, Kaapi (runtimes), SCIF, COI (low-level APIs), etc.
MULTIPROCESSING
Worker: a thread or a process.
On MIC, each core is 4-way multithreaded: each core can concurrently execute instructions from 4 threads/processes. This helps hide the vector pipeline latency and the memory access latencies, keeping the execution units busy.
[Figure: thread state diagram (similar to that of a process).]
MULTITHREADING MODELS
In a nutshell: an abstraction that efficiently maps work onto OS threads.
Task-centric programming: tasks -> threads -> cores.
Schedulers: work-sharing vs. work-stealing.
Main performance bottlenecks:
- For work-sharing: contention on the shared task queue (heavy contention due to the large number of cores)
- For work-stealing: the high cost of stealing tasks from remote threads (the stealing cost is proportional to the distance between the thief thread and the victim thread)
MOTIVATION
Neutronics: neutron transport & interactions.
- Monte Carlo methods: solve the exact model statistically.
- Deterministic methods: solve the linearized Boltzmann transport equation (a first-principles treatment) -> an eigenproblem.
The basic numerical method implemented in the main deterministic neutronics code is the power method.
PARALLELIZATION OF DENSE MATRIX-VECTOR PRODUCT KERNEL
Multithreading techniques: OpenMP, Cilk+, TBB.
Explicit vectorization methods: Intel Cilk+ array notation, Intel simd pragma.
Matrix-vector product kernel:
  1. for i = 1 to n
  2.   b_i = 0
  3.   for j = 1 to n
  4.     b_i = b_i + A_ij * x_j
  5.   end for
  6. end for
Solutions:
1. Step 1 -> multithreading
2. Steps 1 and 3 -> multithreading
3. Step 1 -> multithreading + step 3 -> vectorization
Ref: C. Calvin, F. Ye, S. Petiton, "The Exploration of Pervasive and Fine-Grained Parallel Model Applied on Intel Xeon Phi Coprocessor", 3PGCIC 2013.
IMPLEMENTATION
Worker:
1. Pure OpenMP threads
2. Hybrid MPI/OpenMP (processes/threads)
3. Pure MPI processes
Idea: mixing different dimensions of parallelism (MPI processes x OpenMP threads x VPU).
SIMDized kernel using the CSR format (row_ptrs, col_inds, vals):
  reg_y <- 0
  start <- row_ptrs[row]
  end   <- row_ptrs[row+1]
  for i = start to end, step 8 do
    writemask <- (end - i) > 8 ? 0xff : (0xff >> (8 - end + i))
    reg_ind <- load(writemask, &col_inds[i])
    reg_val <- load(writemask, &vals[i])
    reg_x   <- gather(writemask, reg_ind, x)
    reg_y   <- fmadd(reg_x, reg_val, reg_y, writemask)
  end for
  y[row] <- reduce_add(reg_y)
Two phases:
1. Computing phase: all elements of y are calculated
2. Communication phase: y is copied to x (explicit message passing in the presence of MPI)
RESULTS
OMP vs. MKL: no big difference, except that MKL tends to perform better with more threads per core.
Hybrid gain, cross-platform performance: hybrid MPI/OpenMP helps reduce the scaling overheads and promotes data locality.
PERFORMANCE ANALYSIS
Three main factors, and how they are quantified:
1. Vectorization rate: the average number of nonzeros per row
2. Nonzero dispersion rate: the average number of occurrences where the distance between a pair of contiguous nonzero elements within a row is greater than 2
3. Load balancing: analysis within the slowest process
Proposed model: P_thd(nnz, d) = alpha * [1 - exp(-nnz/eps1)] * exp(-d/eps2)
Estimated parameters: alpha = 187.5, eps1 = 55, eps2 = 40
Ref: F. Ye, C. Calvin, S. Petiton, "A Study of SpMV Implementation using MPI and OpenMP on Intel Many-Core Architecture", VECPAR 2014.
WHY MONTE CARLO
To fully unleash the potential of massively parallel architectures, we need to exploit more parallelism at the algorithmic level. In this direction, the Monte Carlo method appears to be a good candidate.
As our research centers on Krylov subspace methods, we propose to use a Monte Carlo technique as a preconditioner for GMRES, given the suboptimal convergence properties of this stochastic linear solver.
Two steps:
1. Validate the standard GMRES with a Monte Carlo preconditioner
2. Flexible GMRES with smart preconditioning
MONTE CARLO LINEAR SOLVER/PRECONDITIONER (1)
Monte Carlo in linear algebra:
- Dates back to the work of von Neumann and Ulam, with a recent revival of interest
- The use of MC is promising where approximate solutions are sufficient -> preconditioning, graph partitioning, information retrieval, feature extraction
- Parallel MC is very latency tolerant -> intrinsic parallelism
- MC can yield specific components of the solution
- The convergence rate is independent of the size of the matrix
Monte Carlo linear algebra techniques are:
- Based on the ability to perform stochastic matrix-vector multiplication
- Based on stationary iterative methods with poor convergence properties
MONTE CARLO LINEAR SOLVER/PRECONDITIONER (2)
Current MC techniques: stochastic matrix-vector multiplication.
Consider C in R^{n x n} and a vector h in R^n.
Choose a transition-probability matrix P and a weight matrix W such that
  C_ij = P_ij * W_ij, 1 <= i, j <= n, where sum_{i=1}^{n} P_ij = 1 for 1 <= j <= n,
and an initial probability p with initial weights w satisfying
  h_i = p_i * w_i, 1 <= i <= n, with sum_{i=1}^{n} p_i = 1.
MC techniques estimate C^j h, j >= 0, by constructing a Markov chain of length j:
- The random walk visits a set of states in {1, ..., n}; the state visited at the i-th step is k_i, i in [0, j]
- Probability of the initial state: Prob(k_0 = alpha) = p_alpha
- Transition probability: Prob(k_i = alpha | k_{i-1} = beta) = P_{alpha beta}
Consider the random variables X_i defined as
  X_0 = w_{k_0},  X_i = X_{i-1} * W_{k_i k_{i-1}}.
Let delta denote the Kronecker delta (delta_ij = 1 if i = j, 0 otherwise); then it can be shown that
  E(X_j * delta_{i k_j}) = (C^j h)_i, 1 <= i <= n,
i.e. for each random walk, X_j * delta_{i k_j} can be used to estimate the i-th component of C^j h.
MONTE CARLO LINEAR SOLVER/PRECONDITIONER (3)
Linear solvers: Ax = b, A in R^{n x n}, x, b in R^n.
The starting point of MC techniques is to split A as A = N - M and write the fixed-point iteration
  x^{(m+1)} = N^{-1} M x^{(m)} + N^{-1} b = C x^{(m)} + h,
where C = N^{-1} M and h = N^{-1} b. Then we get
  x^{(m)} = C^m x^{(0)} + sum_{i=0}^{m-1} C^i h.
The initial vector x^{(0)} is often taken to be h for convenience, yielding
  x^{(m)} = sum_{i=0}^{m} C^i h.
x^{(m)} converges to the solution as m -> infinity if ||C|| < 1.
TALKS
- C. Calvin, S. Petiton, F. Ye, "Krylov Basis Orthogonalization Algorithms on Many Core Architectures", SIAM Annual Meeting 2013, San Diego, U.S.
- F. Ye, S. Petiton, C. Calvin, "Fine-Grained Multilevel Parallel Programming on Intel Xeon Phi for Eigenproblem", Journée Informatique Intensive et Massive de Proximité, Polytechnique, France
- C. Calvin, F. Boillod-Cerneux, N. Emad, S. Petiton, F. Ye, "FP3C Meeting ongoing research Task 7", ANR-JST FP3C Meeting, ENSEEIHT, Toulouse, France
- C. Calvin, F. Boillod-Cerneux, F. Ye, H. Galicher, S. Petiton, "Programming Paradigms for Emerging Architectures Applied to Asynchronous Krylov Eigensolver", SIAM-PP 14, Portland, U.S.
Commissariat à l'énergie atomique et aux énergies alternatives
Centre de Saclay, Gif-sur-Yvette Cedex
T. +33 (0) F. +33 (0)1 XX XX XX XX
DEN DM2S
Etablissement public à caractère industriel et commercial, R.C.S. Paris B
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution. IT4Innovations, Ostrava,
, Hardware Overview & Native Execution IT4Innovations, Ostrava, 3.2.- 4.2.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi (MIC) Programming models Native mode programming
More informationAUTOMATIC SMT THREADING
AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationA Unified Approach to Heterogeneous Architectures Using the Uintah Framework
DOE for funding the CSAFE project (97-10), DOE NETL, DOE NNSA NSF for funding via SDCI and PetaApps A Unified Approach to Heterogeneous Architectures Using the Uintah Framework Qingyu Meng, Alan Humphrey
More information1. Many Core vs Multi Core. 2. Performance Optimization Concepts for Many Core. 3. Performance Optimization Strategy for Many Core
1. Many Core vs Multi Core 2. Performance Optimization Concepts for Many Core 3. Performance Optimization Strategy for Many Core 4. Example Case Studies NERSC s Cori will begin to transition the workload
More informationCOMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationDesigning Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationOpportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory
Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Jongsoo Park, Parallel Computing Lab, Intel Corporation with contributions from MKL team 1 Algorithm/
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationChapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture
An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program
More informationVector Engine Processor of SX-Aurora TSUBASA
Vector Engine Processor of SX-Aurora TSUBASA Shintaro Momose, Ph.D., NEC Deutschland GmbH 9 th October, 2018 WSSP 1 NEC Corporation 2018 Contents 1) Introduction 2) VE Processor Architecture 3) Performance
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationVincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012
Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload
More informationGrowth in Cores - A well rehearsed story
Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationHybrid Architectures Why Should I Bother?
Hybrid Architectures Why Should I Bother? CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationEfficient Parallel Programming on Xeon Phi for Exascale
Efficient Parallel Programming on Xeon Phi for Exascale Eric Petit, Intel IPAG, Seminar at MDLS, Saclay, 29th November 2016 Legal Disclaimers Intel technologies features and benefits depend on system configuration
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationHigh Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA
High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationManycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.
phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this
More informationDoes the Intel Xeon Phi processor fit HEP workloads?
Does the Intel Xeon Phi processor fit HEP workloads? October 17th, CHEP 2013, Amsterdam Andrzej Nowak, CERN openlab CTO office On behalf of Georgios Bitzes, Havard Bjerke, Andrea Dotti, Alfio Lazzaro,
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationKnights Corner: Your Path to Knights Landing
Knights Corner: Your Path to Knights Landing James Reinders, Intel Wednesday, September 17, 2014; 9-10am PDT Photo (c) 2014, James Reinders; used with permission; Yosemite Half Dome rising through forest
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationScientific Computing with Intel Xeon Phi Coprocessors
Scientific Computing with Intel Xeon Phi Coprocessors Andrey Vladimirov Colfax International HPC Advisory Council Stanford Conference 2015 Compututing with Xeon Phi Welcome Colfax International, 2014 Contents
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More information