PARALLEL PROGRAMMING ON INTEL XEON PHI FOR EFFICIENT LINEAR ALGEBRA


1 2nd Workshop MIC IFERC
PARALLEL PROGRAMMING ON INTEL XEON PHI FOR EFFICIENT LINEAR ALGEBRA
Ph.D. candidate: Fan YE. CEA advisor: Christophe Calvin. Supervisor: Serge Petiton.
CEA, 18 March 2015

2 ADVENT OF ACCELERATORS: MIC AND GPU
Power wall => frequency scaling has stalled
Mono-core => multi-core => many-core
Processor design: latency-oriented => throughput-oriented
=> Accelerators!

3 SOME HISTORY
Goal: reduce the power consumption of Intel Xeon processors
  Low frequency + many cores + appropriate software support -> better performance/watt efficiency
Will x86 do it? -> Yes: the ISA overhead needed for x86 compatibility dictates less than 10% of the power consumption.
Design focus: in-order cores, x86 ISA, shorter pipeline, wider SIMD, SMT
Larrabee project: started with Pentium (P54C) cores connected through a ring interface, with a texture sampler added to help with graphics

4 PRECEDING PROJECTS
Intel MIC incorporates the many-core architecture of Larrabee (codename of a GPGPU chip)
  Introduced wide SIMD (512-bit) to an x86 architecture
  Cache-coherent multiprocessor system connected via a ring bus to memory
  Each core was capable of 4-way multithreading
  Specialised hardware for texture sampling
The Teraflops Research Chip multicore research project
  An experimental 80-core chip with 2 floating-point units per core, implementing not x86 but a 96-bit VLIW architecture
  Other features: energy efficiency, core communication, self-correction, fixed-function cores, memory stacking
The Intel Single-Chip Cloud Computer (SCC) multicore microprocessor
  Each SCC chip contained 48 P54C Pentium cores connected with a 4x6 2D mesh
  The cores are divided into 24 tiles; each tile had 2 cores and a message-passing buffer (MPB) shared by the two cores
  4 DDR3 memory controllers were on each chip, also connected to the 2D mesh
  The design lacked cache coherence between cores and focused on principles that would let the design scale to many more cores

5 TRILOGY
Knights Ferry (1st generation: Intel's MIC prototype board)
  Core: 32 in-order x86 cores, up to 1.2 GHz, 4 SMT threads per core, one 512-bit SIMD unit, modified Pentium core
  Cache: 32 KB L1, 8 MB coherent L2 (256 KB per core)
  Memory: 2 GB GDDR5
  Interconnect: 1024-bit bidirectional ring bus
  Process size: 45 nm
  Theoretical performance: SP 750 Gflops, DP N/A
  Power consumption: ~300 W
Knights Corner (1st generation: 1st many-core commercial product)
  Core: 61 in-order x86 cores, up to 1.2 GHz, 4 SMT threads per core, one 512-bit SIMD unit, modified Pentium core
  Cache: 32 KB L1, 32 MB coherent L2 (512 KB per core)
  Memory: up to 16 GB GDDR5
  Interconnect: 1024-bit bidirectional ring bus
  Process size: 22 nm
  Theoretical performance: SP 2 Tflops, DP 1 Tflops
Knights Landing (2nd generation)
  Core: 72 in-order x86 cores divided into 36 tiles, 4 SMT threads per core, two 512-bit SIMD units, modified Airmont (Atom) core
  Cache: L1 + L2 shared by the 2 cores within a tile + configurable MCDRAM
  Memory: DDR4 + MCDRAM (stacked 3D on-die)
  Interconnect: 2D mesh
  Process size: 14 nm
  Theoretical performance: SP 7 Tflops, DP 3 Tflops

6 PRESENCE IN TOP10
Xeon Phi is the brand name used for all products based on the Intel MIC architecture
[Figure: TOP10 list with the KNC-based systems highlighted]

7 PORTRAIT
Available as a PCIe device
Tianhe-2 (MilkyWay-2) and Stampede: both use an Intel Xeon + Intel Xeon Phi configuration

8 CACHE SUBSYSTEM
Main objective: reduce the memory bandwidth/latency bottleneck inherent in the Von Neumann architecture
A cache is added to the processor core and communicates with the main memory through a memory controller (MC)
At a high level, processors are now designed with two distinct but important sets of components, known as core and uncore
  The core components consist of the engines that do the computations
  The uncore components include caches, memory and peripheral components
  The uncore components of modern-day computers play a more fundamental role in scientific application performance, and often consume more power and silicon chip area than the cores

9 CACHE SUBSYSTEM
Cache subsystem: L1 data + L1 instruction + L1 data TLB + L1 instruction TLB; L2 unified & coherent + L2 unified TLB (the L2 cache is inclusive of the L1 cache)

10 CACHE SUBSYSTEM
Intel Xeon Phi L1 I/D cache configuration:
  Size: 32 KB, associativity: 8-way, line size: 64 bytes, bank size: 8 bytes, data return: out of order
The data cache allows simultaneous reads and writes, so a cache line replacement can happen in a single cycle.
L1 cache access latency: 3 cycles
The L2 cache is 512 KB per core
  The cache is divided into 1024 sets of 8 ways, with one 64-byte cache line per way
  The cache is divided into 2 logical banks
  L2 cache latency is still small compared with a memory access
  The L2 cache can deliver 64 bytes of read data to the corresponding core every two cycles and 64 bytes of write data every cycle

11 CACHE SUBSYSTEM
Linear-to-physical address translation in the Intel Xeon Phi coprocessor
The job of the TLB (translation lookaside buffer) is to avoid the page walk necessary to locate a page by caching the page addresses discovered by earlier walks
TLB configuration (page size / entries / associativity):
  L1 data TLB: 4 KB / 64 / 4-way; 64 KB / 32 / 4-way; 2 MB / 8 / 4-way
  L1 instruction TLB: 4 KB / 64 / 4-way
  L2 TLB: 4 KB, 64 KB, 2 MB / 64 / 4-way

12 CACHE SUBSYSTEM
[Figure: energy consumed per byte of data transferred from memory and from the L1 and L2 caches]
The L1 and L2 caches provide an aggregate bandwidth approximately 15 and 7 times higher, respectively, than the aggregate memory bandwidth

13 INTERCONNECT
The interconnect topology selected for a manycore processor is determined by the latency, bandwidth and cost of implementing such a technology
The interconnect technology chosen for KNC is a bidirectional ring topology
All cores talk to each other, and to memory through the memory controllers, over a ring bus
[Figure: manycore processor architecture with cores connected through a ring bus; P0-Pn indicate the cores, C the caches, MC the memory controller]
In reality, there are 8 memory controllers distributed over the ring to improve the memory bandwidth
The system interface controller supports an I/O protocol such as PCI Express to communicate with the host

14 INTERCONNECT
Core memory interface: 32-bit, 2 channels -> 8.4 GB/s per core
8 memory controllers, each with 2 GDDR5 channels
Memory bandwidth
  Consumable max: 8.4 x 61 = 512.4 GB/s
  Producible max: 5.5 Gtransfers/s x 16 channels x 4 B/transfer = 352 GB/s
  STREAM benchmark: READ 180 GB/s, WRITE 160 GB/s
Each direction of the bidirectional ring is comprised of 3 independent rings
  The data block ring (the first, largest, and most expensive), 64 bytes wide
  The address ring (much smaller), used to send read/write commands and memory addresses
  The acknowledgement ring (smallest and least expensive), which carries flow control and coherence messages
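A minimal sketch of the bandwidth arithmetic above (the 8.4 GB/s per-core figure and the GDDR5 parameters are taken from this slide; nothing here is measured):

    #include <stdio.h>

    int main(void) {
        /* Consumable bandwidth: what the 61 cores can absorb in aggregate. */
        double per_core_gbs = 8.4;                 /* 32-bit interface, 2 channels */
        double consumable   = per_core_gbs * 61;   /* 512.4 GB/s */

        /* Producible bandwidth: what the 8 memory controllers (16 GDDR5 channels) deliver. */
        double producible   = 5.5 * 16 * 4;        /* 5.5 GT/s x 16 channels x 4 B = 352 GB/s */

        printf("consumable max: %.1f GB/s\n", consumable);
        printf("producible max: %.1f GB/s\n", producible);
        return 0;
    }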

15 THEORETICAL PEAK PERFORMANCE
"Peak performance is what the manufacturer guarantees that programs will not exceed" - Jack Dongarra
For an instantiation of the Intel Xeon Phi coprocessor with 60 usable cores running at 1.1 GHz, the theoretical performance is computed as follows:
  Single precision: Gflop/s = 16 (SP SIMD lanes) x 2 (FMA) x 1.1 (GHz) x 60 (#cores) = 2112
  Double precision: Gflop/s = 8 (DP SIMD lanes) x 2 (FMA) x 1.1 (GHz) x 60 (#cores) = 1056
The Intel Xeon Phi coprocessor runs an OS on the card, which may take up a core to service hardware/software requests such as interrupts. As such, a 61-core processor often ends up with 60 cores available for pure computation tasks.
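The same computation as a small helper, using only the factors quoted on this slide:

    #include <stdio.h>

    /* Theoretical peak in Gflop/s: SIMD lanes x 2 (FMA) x frequency (GHz) x cores */
    static double peak_gflops(int simd_lanes, double ghz, int cores) {
        return simd_lanes * 2.0 * ghz * cores;
    }

    int main(void) {
        printf("SP peak: %.0f Gflop/s\n", peak_gflops(16, 1.1, 60));   /* 2112 */
        printf("DP peak: %.0f Gflop/s\n", peak_gflops(8,  1.1, 60));   /* 1056 */
        return 0;
    }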

16 INTERCONNECT
What happens on an L2 miss? An address request is sent on the AD ring to the tag directories. If the requested data block is found in another core's L2 cache, a forwarding request is sent to that core's L2 over the AD ring and the requested block is subsequently forwarded on the data block ring. Otherwise a memory address is sent from the tag directories to the memory controllers.
512 KB of L2 per core -> 32 MB of collective L2 in total (kept cache coherent)
The memory addresses are uniformly distributed among the tag directories on the ring to provide a smooth traffic characteristic on the ring
The addresses are also evenly distributed across the memory controllers
The memory controllers are symmetrically interleaved around the ring
All-to-all mapping from the tag directories to the memory controllers

17 CORE
Manycore architecture: a logical evolution from multithreading -> clone the whole core multiple times to allow multiple threads of execution to run in parallel (homogeneous manycore architecture -> similar cores)
This moves the burden of achieving application performance improvements from hardware engineers towards software engineers!
[Figure: C indicates a cache, MC the memory controller and Px the processor cores]
One big question: which parallel constructs should be used to exploit such machines?

18 COMPUTE MODES
Native Xeon -> Offload (Xeon hosted, MIC co-processed) -> Autonomous mode -> Offload (MIC hosted, Xeon co-processed) -> Native MIC
[Figure: a process viewpoint of the Intel MIC Architecture enabled compute continuum, showing where Main(), Foo() and MPI_*() run on the Xeon and MIC sides of the PCIe link]
Choosing among the available programming methods is the BIG CHALLENGE
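As an illustration of the offload mode, a hedged sketch using the Intel compiler's offload extensions for KNC (the function scale, the array size and the scaling factor are made up for the example; only the pragma and attribute syntax follow the Intel offload model):

    #include <stdio.h>

    #define N 1024

    /* Ask the Intel compiler to also build a coprocessor version of this function. */
    __attribute__((target(mic)))
    void scale(double *v, int n, double alpha) {
        for (int i = 0; i < n; i++)
            v[i] *= alpha;
    }

    int main(void) {
        static double v[N];
        for (int i = 0; i < N; i++) v[i] = i;

        /* Offload the call to the first coprocessor; v is copied in and back out. */
        #pragma offload target(mic:0) inout(v:length(N))
        scale(v, N, 2.0);

        printf("v[10] = %f\n", v[10]);
        return 0;
    }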

19 AVAILABLE PARALLEL CONSTRUCTS ON MIC
Consider MIC as a shared-memory system
  Multithreading techniques (from easy to hard): MKL, OpenMP, Cilk+, TBB (Threading Building Blocks), Pthreads
  Explicit vectorization (from easy to hard):
    OpenMP 4.0 simd pragma: #pragma omp simd
    Intel simd pragma: #pragma simd
    Intel Cilk+ C/C++ array notation: something like a[1:2:3] -> a[1], a[4] (a[lower bound : length : stride]), along with some array functions
    IMCI (Intel Initial Many Core Instructions) intrinsic functions
Consider MIC as a distributed-memory system: MPI
Other techniques: OpenCL (cross-platform), StarPU, Kaapi (runtimes), SCIF, COI (low-level APIs), etc.
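A minimal sketch contrasting two of the vectorization styles listed above on an AXPY loop (the function names are invented for the example; the array-notation version needs a Cilk Plus capable compiler, hence the guard):

    #include <stddef.h>

    /* OpenMP 4.0 explicit vectorization of y <- y + a*x */
    void axpy_omp_simd(size_t n, double a, const double *x, double *y) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    #ifdef __INTEL_COMPILER
    /* Same operation with Intel Cilk Plus array notation:
       x[0:n] and y[0:n] denote whole n-element sections (lower bound : length). */
    void axpy_array_notation(size_t n, double a, const double *x, double *y) {
        y[0:n] += a * x[0:n];
    }
    #endif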

20 MULTIPROCESSING
Worker: a thread or a process
In MIC, each core is 4-way multithreaded, which means each core can concurrently execute instructions from 4 threads/processes. This helps reduce the effect of the vector pipeline latency and of memory access latencies, thus keeping the execution units busy.
[Figure: thread state diagram (similar to the process state diagram)]

21 MULTITHREADING MODELS
In a nutshell: an abstraction that maps the work efficiently onto the OS threads
Task-centric programming: tasks -> threads -> cores
Schedulers: work-sharing vs work-stealing
Main performance bottlenecks:
  For work-sharing: contention on the public task queue (heavy contention due to the large number of cores)
  For work-stealing: high cost of stealing tasks from remote threads (the stealing cost is proportional to the distance between the thief thread and the victim thread)
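As a small illustration of the task-centric model, a hedged OpenMP sketch in which one thread creates tasks and the runtime decides which thread of the team executes each one; whether the underlying scheduler uses work-sharing or work-stealing depends on the OpenMP runtime, not on this code:

    #include <stdio.h>
    #include <omp.h>

    #define NTASKS 64

    int main(void) {
        #pragma omp parallel
        #pragma omp single          /* one thread creates all the tasks... */
        for (int i = 0; i < NTASKS; i++) {
            #pragma omp task firstprivate(i)
            {
                /* ...but any thread of the team may end up running them */
                printf("task %d executed by thread %d\n", i, omp_get_thread_num());
            }
        }
        return 0;
    }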

22 MOTIVATION
Neutronics: neutron transport & interactions
  Monte Carlo methods: solve the exact model statistically
  Deterministic methods: linearized Boltzmann transport equation (first-principles treatment) -> eigenproblem
Basic numerical method implemented in the main deterministic neutronics code -> the power method
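For reference, a minimal dense power-iteration sketch (a generic textbook version, not the production neutronics code; x is assumed to arrive normalized and A is a row-major n x n matrix):

    #include <math.h>
    #include <stdlib.h>

    /* Repeated y = A*x followed by normalization; the dominant eigenvalue is
       approximated by the Rayleigh quotient x'Ax (with ||x|| = 1). */
    static double power_method(const double *A, double *x, int n, int iters) {
        double *y = malloc((size_t)n * sizeof *y);
        double lambda = 0.0;
        for (int it = 0; it < iters; it++) {
            for (int i = 0; i < n; i++) {                    /* y = A*x */
                double s = 0.0;
                for (int j = 0; j < n; j++) s += A[(size_t)i * n + j] * x[j];
                y[i] = s;
            }
            double nrm = 0.0, dot = 0.0;
            for (int i = 0; i < n; i++) { nrm += y[i] * y[i]; dot += x[i] * y[i]; }
            nrm = sqrt(nrm);
            lambda = dot;                                    /* Rayleigh quotient */
            for (int i = 0; i < n; i++) x[i] = y[i] / nrm;   /* normalize for the next step */
        }
        free(y);
        return lambda;
    }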

23 PARALLELIZATION OF DENSE MATRIX-VECTOR PRODUCT KERNEL
Multithreading techniques: OpenMP, Cilk+, TBB
Explicit vectorization methods: Intel Cilk+ array notation, Intel simd pragma
Matrix-vector product kernel:
  1. for i = 1 to n
  2.   do b_i = 0
  3.   for j = 1 to n do
  4.     b_i = b_i + A_ij x_j
  5.   end for
  6. end for
Solutions:
  1. Step 1 -> multithreading
  2. Steps 1, 3 -> multithreading
  3. Step 1 -> multithreading + step 3 -> vectorization (see the sketch below)
Ref: C. Calvin, F. Ye, S. Petiton, The Exploration of Pervasive and Fine-Grained Parallel Model Applied on Intel Xeon Phi Coprocessor, 3PGCIC.
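A hedged C sketch of solution 3 (outer loop multithreaded, inner loop vectorized), assuming a dense row-major matrix; the pragmas are standard OpenMP and the function name is invented for the example:

    #include <stddef.h>

    /* b = A*x with A dense, row-major (n x n).
       Step 1 (rows)    -> multithreading: #pragma omp parallel for
       Step 3 (columns) -> vectorization:  #pragma omp simd reduction */
    void dense_matvec(size_t n, const double *A, const double *x, double *b) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            #pragma omp simd reduction(+:sum)
            for (size_t j = 0; j < n; j++)
                sum += A[i * n + j] * x[j];
            b[i] = sum;
        }
    }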

24 IMPLEMENTATION
Worker: 1. pure OpenMP threads 2. hybrid MPI/OpenMP (processes/threads) 3. pure MPI processes
Idea: mixing different dimensions of parallelism
SIMDized kernel using the CSR format (row_ptrs, col_inds, vals), processing 8 doubles per 512-bit vector:
  reg_y <- 0
  start <- row_ptrs[row]
  end   <- row_ptrs[row+1]
  for i = start to end step 8 do
      writemask <- (end - i) > 8 ? 0xff : (0xff >> (8 - (end - i)))   // mask off the tail of the row
      reg_ind   <- load(writemask, &col_inds[i])                      // column indices of up to 8 nonzeros
      reg_val   <- load(writemask, &vals[i])                          // the corresponding nonzero values
      reg_x     <- gather(writemask, reg_ind, x)                      // gather the matching entries of x
      reg_y     <- fmadd(reg_x, reg_val, reg_y, writemask)            // reg_y += reg_val * reg_x
  end for
  y[row] = reduce_add(reg_y)
[Figure: rows distributed over OMP threads / MPI processes; each row is processed by the VPU]
Two phases:
  1. Computing phase: all elements of y are calculated
  2. Communication phase: y is copied to x (explicit message passing in the presence of MPI)
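A portable, hedged rendering of the same per-row kernel in plain C (the masked loads, gather and fmadd of the pseudocode are left to the compiler via #pragma omp simd; the CSR arrays follow the (row_ptrs, col_inds, vals) layout above, and the schedule clause is only an illustrative choice):

    #include <stddef.h>

    /* Dot product of one CSR row with x; the SIMDized pseudocode above does the
       same work 8 doubles at a time with an explicit write mask on the row tail. */
    static double spmv_row(const size_t *row_ptrs, const int *col_inds,
                           const double *vals, const double *x, size_t row) {
        double reg_y = 0.0;
        #pragma omp simd reduction(+:reg_y)
        for (size_t i = row_ptrs[row]; i < row_ptrs[row + 1]; i++)
            reg_y += vals[i] * x[col_inds[i]];      /* gather on x through col_inds */
        return reg_y;
    }

    /* Computing phase: rows distributed over OpenMP threads. */
    void spmv_csr(size_t nrows, const size_t *row_ptrs, const int *col_inds,
                  const double *vals, const double *x, double *y) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (size_t row = 0; row < nrows; row++)
            y[row] = spmv_row(row_ptrs, col_inds, vals, x, row);
    }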

25 RESULTS
OMP vs MKL. Comments: no big difference, except that MKL tends to achieve better performance with more threads per core.
[Figures: hybrid gain; cross-platform performance]
Comments: hybrid MPI/OpenMP helps to reduce the scaling overheads and promotes data locality.

26 PERFORMANCE ANALYSIS
Three main factors: 1. vectorization rate 2. nonzeros dispersion rate 3. load balancing
Quantification:
  1. Average number of nonzeros per row (nnz)
  2. Average number of occurrences where the distance between a pair of contiguous nonzero elements within a row is greater than 2 (d)
  3. Analysis within the slowest process
Proposed model: P_thd(nnz, d) = α [1 - exp(-nnz/ε1)] exp(-d/ε2)
Estimated parameters: α = 187.5, ε1 = 55, ε2 = 40
Ref: F. Ye, C. Calvin, S. Petiton, A Study of SpMV Implementation using MPI and OpenMP on Intel Many-Core Architecture, VECPAR.
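A small sketch that evaluates the proposed per-thread performance model with the estimated parameters (the sample point is purely illustrative):

    #include <math.h>
    #include <stdio.h>

    /* P_thd(nnz, d) = alpha * (1 - exp(-nnz/eps1)) * exp(-d/eps2)
       nnz: average nonzeros per row, d: dispersion measure defined above. */
    static double p_thd(double nnz, double d) {
        const double alpha = 187.5, eps1 = 55.0, eps2 = 40.0;
        return alpha * (1.0 - exp(-nnz / eps1)) * exp(-d / eps2);
    }

    int main(void) {
        printf("P_thd(100, 10) = %.1f\n", p_thd(100.0, 10.0));
        return 0;
    }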

27 WHY MONTE-CARLO
In order to fully unleash the potential of massively parallel architectures, we need to exploit more parallelism at the algorithmic level. In this direction, the Monte-Carlo method appears to us as a good candidate.
As our research centers on Krylov subspace methods, we propose to use the Monte-Carlo technique as a preconditioner for GMRES, given the suboptimal convergence properties of this stochastic linear solver.
2 steps:
  1. Validating the standard GMRES with a Monte-Carlo preconditioner
  2. Flexible GMRES with smart preconditioning

28 MONTE-CARLO LINEAR SOLVER/PRECONDITIONER (1)
Monte-Carlo in linear algebra
  Dates back to the work of Von Neumann and Ulam
  Recent revival of interest
  The use of MC is promising where approximate solutions are sufficient -> preconditioning, graph partitioning, information retrieval, and feature extraction
  Parallel MC is very latency tolerant -> intrinsic parallelism
  MC can also yield specific components of the solution
  The convergence rate is independent of the size of the matrix
Monte-Carlo linear algebra techniques
  Based on the ability to perform stochastic matrix-vector multiplication
  Based on stationary iterative methods with poor convergence properties

29 MONTE-CARLO LINEAR SOLVER/PRECONDITIONER (2)
Current MC techniques: stochastic matrix-vector multiplication
Consider C ∈ R^(n x n) and a vector h ∈ R^n
Transition-probability matrix P and weight matrix W such that C_ij = P_ij W_ij, 1 ≤ i, j ≤ n, where Σ_{i=1..n} P_ij = 1 for 1 ≤ j ≤ n
Initial probability p and initial weight w satisfying h_i = p_i w_i, 1 ≤ i ≤ n, with Σ_{i=1..n} p_i = 1
MC techniques estimate C^j h, j ≥ 0, by constructing a Markov chain of length j
  The random walk visits a set of states in {1, ..., n}
  The state visited at the i-th step: k_i, i ∈ [0, j]
  Probability of the initial state: Prob(k_0 = α) = p_α
  Transition probability: Prob(k_i = α | k_{i-1} = β) = P_αβ
Consider random variables X_i defined as follows: X_0 = w_{k_0}, X_i = X_{i-1} W_{k_i k_{i-1}}
Let δ denote the Kronecker delta (δ_ij = 1 if i = j, 0 otherwise); then it can be shown that E(X_j δ_{i k_j}) = (C^j h)_i, 1 ≤ i ≤ n
For each random walk, X_j δ_{i k_j} can be used to estimate the i-th component of C^j h
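A hedged sketch of this estimator for a dense matrix: the choices P_ij proportional to |C_ij| (column-stochastic) and p_i proportional to |h_i| are common ones but are assumptions here, since the slide does not fix them; the code also assumes h is nonzero and C has no all-zero column:

    #include <stdlib.h>
    #include <math.h>

    /* Monte-Carlo estimate of y = C^j h for a dense n x n matrix C (row-major),
       using E(X_j * delta_{i,k_j}) = (C^j h)_i averaged over nwalks random walks. */
    void mc_power_apply(const double *C, const double *h, double *y,
                        int n, int j, int nwalks) {
        double *p = malloc(n * sizeof *p);               /* initial probabilities   */
        double *P = malloc((size_t)n * n * sizeof *P);   /* P[a*n+b] = Prob(b -> a) */
        double habs = 0.0;
        for (int i = 0; i < n; i++) habs += fabs(h[i]);
        for (int i = 0; i < n; i++) p[i] = fabs(h[i]) / habs;
        for (int b = 0; b < n; b++) {                    /* column-stochastic P     */
            double s = 0.0;
            for (int a = 0; a < n; a++) s += fabs(C[(size_t)a * n + b]);
            for (int a = 0; a < n; a++) P[(size_t)a * n + b] = fabs(C[(size_t)a * n + b]) / s;
        }
        for (int i = 0; i < n; i++) y[i] = 0.0;
        for (int w = 0; w < nwalks; w++) {
            /* draw the initial state k_0 ~ p, set X_0 = w_{k_0} = h_{k_0} / p_{k_0} */
            double r = (double)rand() / RAND_MAX, acc = 0.0;
            int k = n - 1;
            for (int i = 0; i < n; i++) { acc += p[i]; if (r <= acc) { k = i; break; } }
            double X = h[k] / p[k];
            for (int step = 0; step < j; step++) {
                /* draw k_{step+1} = a with probability P(a | k), then update the weight */
                r = (double)rand() / RAND_MAX; acc = 0.0;
                int a = n - 1;
                for (int c = 0; c < n; c++) {
                    acc += P[(size_t)c * n + k];
                    if (r <= acc) { a = c; break; }
                }
                X *= C[(size_t)a * n + k] / P[(size_t)a * n + k];   /* W_{a,k} */
                k = a;
            }
            y[k] += X;                   /* the walk contributes to component k_j */
        }
        for (int i = 0; i < n; i++) y[i] /= nwalks;
        free(p); free(P);
    }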

30 MONTE-CARLO LINEAR SOLVER/PRECONDITIONER (3)
Linear solvers: Ax = b, with A ∈ R^(n x n) and x, b ∈ R^n
The starting point of MC techniques is to split A as A = N - M and write the fixed-point iteration
  x^(m+1) = N^(-1) M x^(m) + N^(-1) b = C x^(m) + h, where C = N^(-1) M and h = N^(-1) b
Then we get x^(m) = C^m x^(0) + Σ_{i=0..m-1} C^i h
The initial vector x^(0) is often taken to be h for convenience, yielding x^(m) = Σ_{i=0..m} C^i h
x^(m) converges to the solution as m -> ∞ if ||C|| < 1
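For comparison with the stochastic estimator of the previous slide, a minimal deterministic sketch of the truncated series x^(m) = Σ_{i=0..m} C^i h, evaluated as repeated x <- C x + h (dense, row-major C; convergence requires ||C|| < 1):

    #include <stdlib.h>
    #include <string.h>

    /* x^(m) = sum_{i=0..m} C^i h, computed as m applications of x <- C*x + h. */
    void neumann_series(const double *C, const double *h, double *x, int n, int m) {
        double *t = malloc((size_t)n * sizeof *t);
        memcpy(x, h, (size_t)n * sizeof *x);                   /* x^(0) = h */
        for (int it = 0; it < m; it++) {
            for (int i = 0; i < n; i++) {                      /* t = C*x   */
                double s = 0.0;
                for (int j = 0; j < n; j++) s += C[(size_t)i * n + j] * x[j];
                t[i] = s;
            }
            for (int i = 0; i < n; i++) x[i] = t[i] + h[i];    /* x <- C*x + h */
        }
        free(t);
    }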

31 TALKS

32 TALKS

33 TALKS
C. Calvin, S. Petiton, F. Ye, Krylov Basis Orthogonalization Algorithms on Many Core Architectures, SIAM Annual Meeting 2013, San Diego, U.S.
F. Ye, S. Petiton, C. Calvin, Fine-Grained Multilevel Parallel Programming on Intel Xeon Phi for Eigenproblem, Journée Informatique Intensive et Massive de Proximité, Polytechnique, France
C. Calvin, F. Boillod-Cerneux, N. Emad, S. Petiton, F. Ye, FP3C Meeting ongoing research Task 7, ANR-JST FP3C Meeting, ENSEEIHT, Toulouse, France
C. Calvin, F. Boillod-Cerneux, F. Ye, H. Galicher, S. Petiton, Programming Paradigms for Emerging Architectures Applied to Asynchronous Krylov Eigensolver, SIAM-PP 14, Portland, U.S.

34 Commissariat à l'énergie atomique et aux énergies alternatives, Centre de Saclay, Gif-sur-Yvette Cedex. T. +33 (0), F. +33 (0)1 XX XX XX XX. DEN DM2S. Etablissement public à caractère industriel et commercial, R.C.S. Paris B
