Intel Xeon Phi Coprocessor

1 Intel Xeon Phi Coprocessor

2 Agenda: Introduction; Intel Xeon Phi Architecture; Programming Models; Outlook; Summary

3 Intel Multicore Architecture and Intel Many Integrated Core Architecture (Intel MIC)
Intel Multicore Architecture:
- Foundation of HPC performance
- Suited for the full scope of workloads
- Industry-leading performance and performance/watt for serial and parallel workloads
- Focus on fast single core/thread performance with a moderate number of cores
Intel Many Integrated Core Architecture (Intel MIC):
- Performance and performance/watt optimized for highly parallelized compute workloads
- Common software tools with Xeon, enabling efficient application readiness and performance tuning
- IA extension to manycore
- Many cores/threads with wide SIMD

4 Intel Architecture: Multicore and Manycore
More cores. Wider vectors. Coprocessors.
Generations compared: Intel Xeon processor 64-bit; Intel Xeon processor 5100 series; Intel Xeon processor 5500 series; Intel Xeon processor 5600 series; Intel Xeon processor E5 Product Family; Intel Xeon processor code name Ivy Bridge; Intel Xeon processor code name Haswell; Intel Xeon Phi Coprocessor.
[Table: core(s) and threads per generation; values not recoverable, one entry marked TBD.]
Intel Xeon Phi Coprocessor extends established CPU architecture and programming concepts to highly parallel applications.

5 Introducing Intel Xeon Phi Coprocessors: Highly-parallel Processing for Unparalleled Discovery
Groundbreaking differences:
- Up to 61 IA cores at 1.1 GHz, 244 threads
- Up to 16GB memory with up to 352 GB/s bandwidth
- 512-bit SIMD vector instructions
- Linux operating system, IP addressable
- Standard programming languages and tools
Leading to groundbreaking results:
- Over 1 TeraFlop/s double precision peak performance
- Up to 2.2x higher memory bandwidth than on an Intel Xeon processor E5 family-based server
- Up to 4x more performance per watt than with an Intel Xeon processor E5 family-based server

6 MPSS (Intel Manycore Platform Software Stack): the MIC Software Stack
- The MIC operating system is Linux!
- Host OS: Linux; Windows* to be added
- No need to deal with the low-level APIs!

7 Intel Xeon Phi Coprocessors: Full Portfolio Introduced (June, ISC)
Family | Positioning | Leadership | Memory | Bandwidth | Peak performance | Models
3xxx | Outstanding Parallel Computing Solution | Performance/$ | 6GB GDDR5 | 240 GB/s | >1 TFlops DP | 3120P, 3120A
5xxx | Optimized for High Density Environments | Performance/watt | 8GB GDDR5 | >300 GB/s | >1 TFlops DP | 5110P, 5120D
7xxx | Highest Level of Features | Performance | 16GB GDDR5 | 352 GB/s | >1.2 TFlops DPT | 7120P, 7120X

8 Announced at ISC'13: Tianhe-2 ("MilkyWay 2")
- Largest supercomputer in the world, installed in Guangzhou, China
- First in the Top500 list
- 55 petaflops peak performance
- Intel Xeon processors E V2, based on 3rd Generation Intel Core architecture (code name Ivy Bridge)
- Intel Xeon Phi processors!

10 Intel Xeon Phi Architecture Overview
- Up to 61 cores at 1.1 GHz; in-order, 4 threads per core
- 512-bit Vector Processing Unit, 32 native registers
- High-speed bi-directional ring interconnect
- Fully coherent L2 cache
- 8 memory controllers (MC), 16-channel GDDR5, up to ~350 GB/sec bandwidth
- PCIe Gen2
- Reliability features: parity on L1 cache, ECC on memory, CRC on memory IO, CAP on memory

11 Core Architecture Overview
Scalar unit, based on the Intel Pentium processor:
- Two pipelines; dual issue with vector instructions
- Pipelined, one-per-clock scalar throughput
SIMD Vector Processing Engine
4 hardware threads per core:
- 4-clock latency, hidden by round-robin scheduling of threads
- Cannot issue back-to-back instructions in the same thread!
Caches: 32K L1 I-cache, 32K L1 D-cache, coherent 512KB L2 cache per core, attached to the ring
[Diagram: instruction decode feeding the scalar unit (scalar registers) and the vector unit (vector registers).]

12 Vector Processing Unit Extends the Scalar IA Core
[Block diagram of one core. Recoverable labels: pipeline stages PPF, PF, D0, D1, D2, E, WB; four in-order hardware threads (Thread 0 to Thread 3 IP); L1 TLB and 32KB L1 instruction cache; decoder fed at 16B/cycle (2 IPC), with ucode; Pipe 0 (u-pipe) and Pipe 1 (v-pipe); VPU, x87, and scalar register files; VPU 512b SIMD, x87, ALU 0, ALU 1; L1 TLB and 32KB L1 data cache; TLB miss handler and L2 TLB; HWP for L2; CRI; 512KB L2 cache; on-die interconnect.]

14 One Source Base, Tuned to Many Targets
[Diagram: one source base passes through compilers, libraries, and parallel models to reach multicore CPUs, the Intel MIC architecture (many-core), multicore clusters, and combined multicore and many-core clusters.]

15 Intel Developer Products - Intel Cluster Studio XE 2013
Columns of the original table: Phase, Product, Feature, Benefit.
- Intel Advisor XE. Feature: threading design assistant (Studio products only). Benefit: simplifies, demystifies, and speeds parallel application design.
Phase: Build
- Intel Composer XE. Feature: C/C++ and Fortran compilers, Intel Threading Building Blocks, Intel Cilk Plus, Intel Integrated Performance Primitives, Intel Math Kernel Library. Benefit: enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core.
- Intel MPI Library. Feature: high performance Message Passing (MPI) library. Benefit: enables high performance scalability, interconnect independence, runtime fabric selection, and application tuning capability.
Phase: Verify & Tune
- Intel VTune Amplifier XE. Feature: performance profiler for optimizing application performance and scalability. Benefit: removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks.
- Intel Inspector XE. Feature: memory and threading dynamic analysis for code quality; static analysis for code quality. Benefit: increases productivity and code quality, lowers cost, finds memory, threading, and security defects before they happen.
- Intel Trace Analyzer & Collector. Feature: MPI performance profiler for understanding application correctness and behavior. Benefit: analyzes performance of MPI programs and visualizes parallel application behavior and communication patterns to identify hotspots.

16 Data Parallelism of Intel Processors (2)
- Intel AVX: vector size 256 bit; data types: 32 and 64 bit float; VL: 4, 8
- Intel MIC: vector size 512 bit; data types: 32 and 64 bit integer, 32 and 64 bit float; VL: 8, 16
[Illustrations: element-wise operation on vectors X1..X8 and Y1..Y8 (AVX) and X1..X16 and Y1..Y16 (MIC); Xi, Yi and results are 32 bit float.]

17 Intel Cilk Plus Array Notation
Syntax: <array base> [<lower bound>:<length>[:<stride>]]+
    A[:]       // All of vector A
    B[2:6]     // Elements 2 to 7 of vector B
    C[:][5]    // Column 5 of matrix C
    D[0:3:2]   // Elements 0,2,4 of vector D
Element-wise conditional:
    if (a[:] > b[:])
        c[:] = d[:] * e[:];
    else
        c[:] = d[:] * 2;
A simple and elegant solution: a language construct for vector-level parallelism.

18 Vectorization for MIC: A New Target
Input: C/C++/Fortran source code.
Ways to express or expose vector parallelism:
- Fully automatic analysis
- Vectorization hints (ivdep/vector pragmas)
- Array notation, elemental functions, and the SIMD pragma (the data parallel part of the Intel Cilk Plus extension)
Vectorizer: maps vector parallelism to the vector ISA, then optimizes and generates code for Intel SSE, Intel AVX, or Intel MIC.
The vectorizer makes retargeting easy!

19 Many-Core Hosted: Native Model
- Enabled by the -mmic compiler switch
- Fully supported by compiler vectorization, Intel MKL, OpenMP*, Intel TBB, Intel Cilk Plus, Intel MPI
- Might be an option for some applications:
  - Needs to fit into memory!
  - Should be highly parallel code; serial parts are slower on MIC than on host!
  - Limited access to external environment like I/O
  - Native MIC file system exists in memory only!
  - NFS allows external I/O but

20 Many-Core Hosted with MPI
- MPI ranks on Intel Xeon Phi coprocessors (only)
- All messages into/out of Intel Xeon Phi coprocessors
- Programmed as a homogeneous network of many-core CPUs
[Diagram: nodes, each a Xeon host with a MIC card, joined by a network; data and MPI reside on the MIC side.]

21 Heterogeneous Programming
- CPU executable: tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL
- MIC native executable: tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL
Parallel programming is the same on MIC and CPU.
[Diagram: parallel compute on the CPU and on the MIC card, connected by PCIe.]

22 Heterogeneous Programming: Offload
- CPU executable: tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL
- MIC native executable: tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL
- The two sides are connected by offload directives (non-shared model) or offload keywords (virtual shared-memory)
Parallel programming is the same on MIC and CPU.
[Diagram: parallel compute on the CPU and on the MIC card, connected by PCIe.]

23 Choices for Offloading Application Code
Two Intel-specific models for offloading are supported (Intel Composer 2013):
- LEO: Language Extensions for Offload, for C/C++ and Fortran; explicit data transfer by compiler directives
- MYO: Mine-Yours-Ours, for C/C++; shared virtual memory, implicit offload controlled by language extensions for variable declaration etc.
Offloading and parallelism are orthogonal:
- Offloading only transfers control to the MIC devices
- Parallelism needs to be exploited by a second model (e.g. OpenMP*)

24 LEO: Explicit Offload
- Completely realized by directives and attributes
- Ignored by other compilers (might result in a warning)
- Requires bit-wise copyable data objects
- Programmer designates, in the offload directive, the variables that need to be copied between host and card
C/C++ example:
    #pragma offload target(mic) in(data:length(size))
Fortran example:
    !dir$ offload target(mic) in(a1:length(size))
Strongly influenced the accelerator ("target") extension of the coming OpenMP* 4.0 standard:
    #pragma omp target map(to(b:count)) map(from(a:count))

25 C/C++ Extensions for Explicit Offload
What | C/C++ syntax | Semantics
Offload pragma | #pragma offload <clauses> <statement block> | Allow next statement block to execute on Intel MIC Architecture or host CPU
Keyword for variable & function definitions | __attribute__((target(mic))) | Compile function for, or allocate variable on, both CPU and Intel MIC Architecture
Entire blocks of code | #pragma offload_attribute(push, target(mic)) ... #pragma offload_attribute(pop) | Mark entire files or large blocks of code for generation on both host CPU and Intel MIC Architecture
Data transfer | #pragma offload_transfer target(mic) <clauses> | Initiates asynchronous data transfer, or initiates and completes synchronous data transfer

26 Example
    float reduction(float *data, int numberof)
    {
        float ret = 0.f;
        #pragma offload target(mic) in(data:length(numberof))
        {
            #pragma omp parallel for reduction(+:ret)
            for (int i=0; i < numberof; ++i)
                ret += data[i];
        }
        return ret;
    }
Note: this copies numberof elements to the coprocessor, not numberof*sizeof(float) bytes; the compiler knows the type of data.

27 Example: Call Intel MKL on Coprocessor
    int main()
    {
        // initialize variables
        #pragma offload target(mic) \
            in(transa, transb, N, alpha, beta) \
            in(a:length(matrix_elements)) \
            in(b:length(matrix_elements)) \
            inout(c:length(matrix_elements))
        sgemm(&transa, &transb, &N, &N, &N, &alpha, a, &N, b, &N, &beta, c, &N);
    }
sgemm performs C=beta*C+alpha*A*B; transa and transb regulate the transposition of A and B, and the Ns define the sizes of the matrices (see documentation). C is input and output, all others are input only. MKL will automatically make optimal use of MIC.

28 MYO: Implicit Offload
- Real language extensions; not accepted by other compilers
- Alternative model for non-compact data objects like a linked list
- Programmer marks variables that should be (virtually) shared between host and card
- Run-time system automatically maintains coherence at the boundary of offloaded code regions
Sample:
    _Cilk_shared double foo;
    _Offload func(y);

29 MYO Memory Model
- A section of memory is maintained at the same (!!) virtual address on both the host and the coprocessor
- Reserving the same address range on both devices allows:
  - Seamless sharing of complex pointer-containing data structures
  - Elimination of user marshaling and data management
  - Use of simple language extensions to C/C++
[Diagram: the C/C++ executable in host memory and the offload code in MIC memory map the shared section to the same address range.]

30 MYO: _Cilk_shared for Data & Routines
What | Syntax | Semantics
Function | int _Cilk_shared f(int x) { return x+1; } | Versions generated for both CPU and card; may be called from either side
Global | _Cilk_shared int x = 0; | Visible on both sides
File/Function static | static _Cilk_shared int x; | Visible on both sides, only to code within the file/function
Class | class _Cilk_shared x { }; | Class methods, members, and operators are available on both sides
Pointer to shared data | int _Cilk_shared *p; | p is local (not shared), can point to shared data
A shared pointer | int *_Cilk_shared p; | p is shared; should only point at shared data
Entire blocks of code | #pragma offload_attribute(push, _Cilk_shared) ... #pragma offload_attribute(pop) | Mark entire files or large blocks of code _Cilk_shared using this pragma

31 Example
    // Shared variable declaration for pi
    _Cilk_shared float pi;

    _Cilk_shared void compute_pi(int count)
    {
        int i;
        #pragma omp parallel for \
            reduction(+:pi)
        for (i=0; i<count; i++) {
            float t = (float)((i+0.5f)/count);
            pi += 4.0f/(1.0f+t*t);
        }
    }

    int main()
    {
        int count = 10000;
        // Initialize shared global variables
        pi = 0.0f;
        // Compute pi on target
        _Offload compute_pi(count);
        pi /= count;
    }

32 Offload Models plus MPI
- MPI ranks on Intel Xeon processors (only)
- All messages into/out of processors
- Offload models used to accelerate MPI ranks
- Homogeneous network of hybrid nodes
[Diagram: nodes, each a Xeon host with a MIC card, joined by a network; data and MPI reside on the Xeon side, with offload to the MIC.]

33 Intel Xeon Phi Coprocessor OpenCL* Compiler
[Compiler pipeline diagram. Recoverable stages and components: Clang* OpenCL* front end producing LLVM* IR; LLVM* standard passes; LLVM* OpenCL* passes (barriers, builtins, kernel arguments); LLVM* vectorizer (scalarizer, divergence analysis, predicator, packetizer, bypasses); code generator emitting Xeon Phi code.]

34 Intel VTune Amplifier XE 2013 for Intel Xeon Phi
- Select which problem areas you want to analyze
- Beginning with update 4, the methodology in this presentation is implemented in the General Exploration profile
- Events to collect will be configured automatically

35 MIC Debugging with GDB*
- Run the Host-MIC GDB* on your localhost (you can't use the default host gdb!):
    /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
- Start gdbserver on the Intel Xeon Phi Coprocessor
- To remote debug using a pipe to ssh:
    (gdb) target extended-remote | ssh -T mic0 ~/gdbserver --multi IP:port
- To remote debug using stdio:
    (gdb) target extended-remote | ssh -T mic0 ~/gdbserver --multi -
- To attach to a running application via the process ID (pid):
    (gdb) file /local/path/to/application
    (gdb) attach <remote-pid>
- To run an application directly from GDB*:
    (gdb) file /local/path/to/application
    (gdb) set remote exec-file /target/path/to/application

36 Extension to Microsoft Windows* OS
- Microsoft Windows is being added as a second host operating system (Windows 2008 Server)
- No change for the operating system of the coprocessor, which remains Linux
- Developer environment same as for Linux; same programming models
- In beta testing today; to be released H2/

37 Future Product Line: Knights Landing
- Knights Landing is the code name for the 2nd generation product for the Intel Many Integrated Core Architecture
- Knights Landing targets Intel's 14nm manufacturing process
- Knights Landing will be productized as a processor (running the host OS) and as a coprocessor (a PCI end-point device)
- Knights Landing will feature on-package high-bandwidth memory

39 Summary
- Intel Xeon Phi coprocessor is a real product now!
- The flexibility of the programming models offers a solution for all needs: from a single MIC card in a workstation up to thousands of MIC cards attached to HPC cluster nodes
- Compilers and other tools make it easy to develop or port code to run applications natively on the coprocessors, heterogeneously on host CPU + Intel MIC systems, and on collections of these connected in a compute cluster
- No new parallel programming models needed: all Intel-supported models are available for MIC too, including innovative models like Coarray Fortran
- Using MIC is a simple extension of CPU programming

40 Get Educated on Intel Xeon Phi Coprocessors: Tools That Enable You to be Ready!
- Intel Xeon Phi Coprocessor (codename Knights Corner)
- Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual
- An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors
- Intel Manycore Platform Software Stack (MPSS)
- Intel Xeon Phi Coprocessor Developer's Quick Start Guide
- Intel Xeon Phi Coprocessor System Software Developers Guide
- Intel and Third Party Tools and Libraries available with support for Intel Xeon Phi Coprocessor
- Optimization and Performance Tuning for Intel Xeon Phi Coprocessors, Part 1: Optimization Essentials
- Optimization and Performance Tuning for Intel Xeon Phi Coprocessors, Part 2: Understanding and Using Hardware Events

43 Vector Instruction Performance
- VPU contains 16 SP ALUs and 8 DP ALUs
- Most VPU instructions have a latency of 4 cycles and a throughput (TPT) of 1 cycle
- Load/Store/Scatter have 7-cycle latency
- Convert/Shuffle have 6-cycle latency
- VPU instructions are issued in the u-pipe
- Certain instructions can also go to the v-pipe: Vector Mask, Vector Store, Vector Packstore, Vector Prefetch, Scalar

44 Spectrum of Execution Models
From CPU-centric (Intel Xeon processor, multi-core) to Intel MIC-centric (Intel Many Integrated Core, many-core), with the heterogeneous models in between:
- Multi-core Hosted: general purpose serial and parallel computing
- Offload: codes with highly-parallel phases
- Symmetric: codes with balanced needs
- Reverse Offload: codes with serial phases
- Manycore Hosted: highly-parallel codes
[Diagram: placement of Main(), Foo(), and MPI_*() on the multi-core and many-core sides of PCIe for each model.]
Supported with Intel tools.

Intel Many Integrated Core (MIC) Programming Intel Xeon Phi

Intel Many Integrated Core (MIC) Programming Intel Xeon Phi Intel Many Integrated Core (MIC) Programming Intel Xeon Phi Dmitry Petunin Intel Technical Consultant 1 Legal Disclaimer & INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED,

More information

Intel Software Development Products for High Performance Computing and Parallel Programming

Intel Software Development Products for High Performance Computing and Parallel Programming Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN

More information

Computer Architecture and Structured Parallel Programming James Reinders, Intel

Computer Architecture and Structured Parallel Programming James Reinders, Intel Computer Architecture and Structured Parallel Programming James Reinders, Intel Parallel Computing CIS 410/510 Department of Computer and Information Science Lecture 17 Manycore Computing and GPUs Computer

More information

Introduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero

Introduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero Introduction to Intel Xeon Phi programming techniques Fabio Affinito Vittorio Ruggiero Outline High level overview of the Intel Xeon Phi hardware and software stack Intel Xeon Phi programming paradigms:

More information

The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation

The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation The Intel Xeon Phi Coprocessor Dr-Ing. Michael Klemm Software and Services Group Intel Corporation (michael.klemm@intel.com) Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED

More information

Intel Xeon Phi Coprocessor Offloading Computation

Intel Xeon Phi Coprocessor Offloading Computation Intel Xeon Phi Coprocessor Offloading Computation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,

More information

Parallel Programming. The Ultimate Road to Performance April 16, Werner Krotz-Vogel

Parallel Programming. The Ultimate Road to Performance April 16, Werner Krotz-Vogel Parallel Programming The Ultimate Road to Performance April 16, 2013 Werner Krotz-Vogel 1 Getting started with parallel algorithms Concurrency is a general concept multiple activities that can occur and

More information

Intel Xeon Phi Coprocessor

Intel Xeon Phi Coprocessor Intel Xeon Phi Coprocessor http://tinyurl.com/inteljames twitter @jamesreinders James Reinders it s all about parallel programming Source Multicore CPU Compilers Libraries, Parallel Models Multicore CPU

More information

High Performance Parallel Programming. Multicore development tools with extensions to many-core. Investment protection. Scale Forward.

High Performance Parallel Programming. Multicore development tools with extensions to many-core. Investment protection. Scale Forward. High Performance Parallel Programming Multicore development tools with extensions to many-core. Investment protection. Scale Forward. Enabling & Advancing Parallelism High Performance Parallel Programming

More information

Intel Many Integrated Core (MIC) Architecture

Intel Many Integrated Core (MIC) Architecture Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products

More information

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further

More information

Introduction to Xeon Phi. Bill Barth January 11, 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013 Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider

More information

Path to Exascale? Intel in Research and HPC 2012

Path to Exascale? Intel in Research and HPC 2012 Path to Exascale? Intel in Research and HPC 2012 Intel s Investment in Manufacturing New Capacity for 14nm and Beyond D1X Oregon Development Fab Fab 42 Arizona High Volume Fab 22nm Fab Upgrades D1D Oregon

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

Accelerator Programming Lecture 1

Accelerator Programming Lecture 1 Accelerator Programming Lecture 1 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de January 11, 2016 Accelerator Programming

More information

Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation

Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation S c i c o m P 2 0 1 3 T u t o r i a l Intel Xeon Phi Product Family Programming Tools Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation Agenda Intel Parallel Studio XE 2013

More information

Intel Xeon Phi Coprocessors

Intel Xeon Phi Coprocessors Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon

More information

Overview of Intel Xeon Phi Coprocessor

Overview of Intel Xeon Phi Coprocessor Overview of Intel Xeon Phi Coprocessor Sept 20, 2013 Ritu Arora Texas Advanced Computing Center Email: rauta@tacc.utexas.edu This talk is only a trailer A comprehensive training on running and optimizing

More information

Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title

Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title Programming for the Intel Many Integrated Core Architecture By James Reinders The Architecture for Discovery PowerPoint Title Intel Xeon Phi coprocessor 1. Designed for Highly Parallel workloads 2. and

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

Beyond Offloading Programming Models for the Intel Xeon Phi Coprocessor. Michael Hebenstreit, Senior Cluster Architect, Intel SFTS001

Beyond Offloading Programming Models for the Intel Xeon Phi Coprocessor. Michael Hebenstreit, Senior Cluster Architect, Intel SFTS001 Beyond Offloading Programming Models for the Intel Xeon Phi Coprocessor Michael Hebenstreit, Senior Cluster Architect, Intel SFTS001 Agenda Overview Automatic offloading Offloading by pragmas and keywords

More information

Using Intel VTune Amplifier XE for High Performance Computing

Using Intel VTune Amplifier XE for High Performance Computing Using Intel VTune Amplifier XE for High Performance Computing Vladimir Tsymbal Performance, Analysis and Threading Lab 1 The Majority of all HPC-Systems are Clusters Interconnect I/O I/O... I/O I/O Message

More information

E, F. Best-known methods (BKMs), 153 Boot strap processor (BSP),

E, F. Best-known methods (BKMs), 153 Boot strap processor (BSP), Index A Accelerated Strategic Computing Initiative (ASCI), 3 Address generation interlock (AGI), 55 Algorithm and data structures, 171. See also General matrix-matrix multiplication (GEMM) design rules,

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Reusing this material

Reusing this material XEON PHI BASICS Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Parallel is the Path Forward Intel Xeon and Intel Xeon Phi Product Families are both going parallel Intel Xeon processor

More information

Intel Math Kernel Library (Intel MKL) Latest Features

Intel Math Kernel Library (Intel MKL) Latest Features Intel Math Kernel Library (Intel MKL) Latest Features Sridevi Allam Technical Consulting Engineer Sridevi.allam@intel.com 1 Agenda - Introduction to Support on Intel Xeon Phi Coprocessors - Performance

More information

extreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA

extreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA extreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA Topics Covered Today 2 Intel s offerings to HPC Update on Intel Architecture Roadmap Overview on Intel Development Tools

More information

Intel Xeon Phi coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012

Intel Xeon Phi coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012 Intel Xeon Phi coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012 Legal Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL

More information

Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,

Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, 27.6.- 29.6.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi Products Programming models Native

More information

Intel Architecture for HPC

Intel Architecture for HPC Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter

More information

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product

More information
