1 XEON PHI BASICS
2 Reusing this material This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This means you are free to copy and redistribute the material and to adapt and build on the material under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original. Note that presentations may contain images owned by others. Please seek their permission before reusing these images.
3 LESSON PLAN Programming models; Parallelisation; Compilers and Tools; Performance Considerations
4 Programming models
5 Programming models (Diagram: Host with Main Memory, connected to a Coprocessor)
6 Programming models Three basic programming models: Native mode, Offload execution, Symmetric execution
8 Programming models Native Mode: Xeon Phi only (Diagram: Host with Main Memory connected via ssh over PCIe to the Coprocessor, which runs int main() { do_stuff(); }) The host is used for preparation work (e.g. compiling, data copy). The user initiates the run from the host, or can use the host to connect to the Xeon Phi via ssh. The programme runs on the Xeon Phi from start to finish as usual.
10 Programming models Native Mode: Xeon Phi only Pros: requires minimal effort to port; works well with flat-profile applications; no memory copy required. Cons: poor performance on codes with large serial regions and on complex codes; limited Xeon Phi memory.
13 Programming models Offload Execution: Hotspot eliminator (Diagram: Host with Main Memory runs int main() { #pragma offload do_stuff(); }; the Coprocessor, reached over PCIe, runs do_stuff()) The application is initiated on the host. Embarrassingly parallel hotspots are offloaded to the Xeon Phi. Results of the offload region are returned to the host, where execution continues.
15 Programming models Offload Execution: Hotspot eliminator Pros: serial code handled by advanced CPU cores; embarrassingly parallel hotspots are executed efficiently on the Xeon Phi; more efficient use of (limited) Xeon Phi memory. Cons: data must be copied to and from the Xeon Phi via the (slow) PCIe bus; may lead to poor utilisation of CPU/Xeon Phi (idle time).
18 Programming models Symmetric Execution: Phi-as-a-node (Diagram: host and coprocessor each run int main() { do_stuff(); } under distinct MPI ranks, connected via ssh over PCIe) The application is initiated on the host but runs across both CPU and Xeon Phi cores, effectively using the Xeon Phi as just another node for MPI to use.
20 Programming models Symmetric Execution: Phi-as-a-node Pros: promise of full hardware utilisation; no need for offloading pragmas and memory copies. Cons: tricky load-balancing; code is rarely optimal for both CPU and Xeon Phi.
21 Parallelisation
22 Parallelisation MPI and/or OpenMP
23 Parallelisation MPI+OpenMP with Offload MPI runs only on hosts MPI processes offload to Xeon Phi OpenMP in MPI processes OpenMP in offload regions Image from Colfax training material
24 Parallelisation Symmetric Pure MPI MPI processes on host MPI processes (native) on Xeon Phi No OpenMP Image from Colfax training material
25 Parallelisation Symmetric hybrid MPI+OpenMP MPI processes on host MPI processes (native) on Xeon Phi All MPI processes use OpenMP multithreading Image from Colfax training material
26 Parallelisation What is best? It depends: what is your goal, what is your system, what is your application? Generally OpenMP is faster than MPI on the Xeon Phi: MPI performs poorly on the Xeon Phi, and OpenMP uses less memory (especially important on the Xeon Phi). Worth checking affinity settings (more later).
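A symmetric MPI launch with Intel MPI might look like the sketch below. The hostnames (node0, node0-mic0), rank counts and the .mic postfix are illustrative assumptions, not values from the slides, and the mpirun command line is shown via echo so the fragment is harmless to run on a machine without a Xeon Phi.

```shell
# Let Intel MPI place ranks on the coprocessor as well as the host.
export I_MPI_MIC=enable
# Host binary is ./app; Intel MPI appends this postfix for Phi ranks.
export I_MPI_MIC_POSTFIX=.mic

# One MPMD launch line: 16 ranks on the host, 60 native ranks on the
# Phi (shown with echo here; drop the echo on a real system):
echo mpirun -host node0 -n 16 ./app : -host node0-mic0 -n 60 ./app
```

The single colon-separated mpirun line is what makes this "symmetric": one MPI job spanning the CPU and the coprocessor, with no offload pragmas.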
27 Compilers & Tools
29 Compilers & Tools Compilers In a word: Intel. Intel C Compiler, Intel C++ Compiler, Intel Fortran Compiler.
30 Compilers & Tools Tools In two words: Intel & Allinea (but mainly Intel)
31 Compilers & Tools Tools Intel Parallel Studio XE: Intel C, C++ and Fortran compilers (MIC-capable); Intel Math Kernel Library (MKL); Intel MPI Library (only in Cluster Edition); Intel Trace Analyzer and Collector / ITAC (MPI profiler); Intel VTune Amplifier XE (multi-threaded profiler); Intel Inspector XE (memory and threading debugging); Intel Threading Building Blocks / TBB (threading library); Intel Performance Primitives / IPP (media and data); Intel Advisor XE (guided parallelism design). Allinea: MAP (lightweight profiler); DDT (debugger); Forge (unified UI for DDT & MAP).
33 Compilers & Tools Tools Runtime: MPSS (Intel Manycore Platform Software Stack), environment variables, Linux commands
34 Compilers & Tools Runtime MPSS tools: micnativeloadex, micinfo, miccheck, micsmc (GUI), micrasd (root). Environment variables: MKL_MIC_ENABLE, MIC_ENV_PREFIX, MIC_LD_LIBRARY_PATH, I_MPI_MIC, I_MPI_MIC_POSTFIX, OFFLOAD_REPORT, KMP_AFFINITY, KMP_BLOCKTIME, MIC_USE_2MB_BUFFERS. Linux commands: lspci | grep Phi; cat /etc/hosts | grep mic; cat /proc/cpuinfo | grep proc | tail -n 3. For more details: E1EC94AE-A13D-463E-B3C3-6D7A7205F5A1.htm
35 Performance Considerations
36 Performance Considerations Four things to consider first: execution mode, vectorisation, alignment, affinity. Then: application design.
37 Performance Considerations Mode of execution: Native, Offload or Symmetric. The mode chosen should depend on the application and the system configuration (as discussed previously).
38 Performance Considerations Vectorisation Xeon Phi performance is greatly dependent on the vector units. Intel Xeon CPUs also use (smaller) vector units, so code optimised for Intel Xeon will run faster on Intel Xeon Phi. KNL (the next generation Xeon Phi) will also use 512-bit (AVX-512) vector units, so code optimised for Intel Xeon Phi KNC will also run faster on Intel Xeon Phi KNL (though KNC and KNL are not binary compatible).
41 Performance Considerations Data Alignment Loop is vectorised != faster. Data alignment is critical for vectorisation to be beneficial. Remember not only to align the data, but also to tell the compiler at the loop that the data is aligned.
44 Performance Considerations Affinity All data moves over the high-speed ring interconnect, so affinity is critical for good performance, and the default settings are not always optimal. In offload mode you may accidentally use poor settings, e.g. 240 threads competing for the use of 30 cores while 30 other cores sit idle.
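For an offload run, the affinity pitfall above is usually addressed through environment variables like those below. The specific values (balanced placement, 60 cores x 4 hardware threads) are illustrative assumptions about the card, not prescriptions from the slides; MIC_ENV_PREFIX tells the offload runtime which host-side variables to forward to the coprocessor.

```shell
# Forward any MIC_-prefixed variables to the coprocessor environment.
export MIC_ENV_PREFIX=MIC
# Spread threads evenly over cores instead of piling them onto a few.
export MIC_KMP_AFFINITY="balanced,granularity=fine"
# Illustrative: 60 cores x 4 hardware threads = 240 OpenMP threads.
export MIC_OMP_NUM_THREADS=240
# Ask the offload runtime to report data transfers and timings.
export OFFLOAD_REPORT=2
```

With settings like these, the 240 offloaded threads are distributed over all 60 cores rather than contending for a subset of them.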
48 Performance Considerations Application Design Design >> Optimisation. Consider all levels of parallelism available and adapt your algorithm to exploit as many of them, as fully, as possible.
49 Performance Considerations Levels of parallelism (diagram, built up across slides 49-57): a Machine contains Nodes; each Node contains NUMA regions and Co-processors; each NUMA region contains Cores; each Core runs Threads; and each Thread drives Vector Units (the Co-processors have their own Vector Units).
58 Summary
60 Summary Programming models: Native, Offload, Symmetric - what's best for you? Parallelisation: MPI, OpenMP -> OpenMP better on Xeon Phi; many ways to mix and match. Compilers and Tools: use Intel compilers (C, C++, Fortran); Intel and Allinea tools: VTune, MAP, etc.; wide variety of runtime tools and environment variables: micinfo, KMP_AFFINITY. Performance Considerations: programming model; vectorisation - needed to exploit Xeon Phi compute; data alignment - needed to make vectorisation useful; thread/process affinity - can be critical for performance; application design: consider levels of parallelism.
64 Thank You!
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationMunara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.
Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies features and benefits depend
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationDebugging, benchmarking, tuning i.e. software development tools. Martin Čuma Center for High Performance Computing University of Utah
Debugging, benchmarking, tuning i.e. software development tools Martin Čuma Center for High Performance Computing University of Utah m.cuma@utah.edu SW development tools Development environments Compilers
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationIntel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel
Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Which performance analysis tool should I use first? Intel Application
More informationCode optimization in a 3D diffusion model
Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18 th 2016, Barcelona Agenda Background Diffusion
More informationParallel Programming on Ranger and Stampede
Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE
More informationPRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava,
PRACE PATC Course: Intel MIC Programming Workshop, MKL Ostrava, 7-8.2.2017 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi Compiler Assisted Offload Automatic Offload Native Execution Hands-on
More informationDmitry Durnov 15 February 2017
Cовременные тенденции разработки высокопроизводительных приложений Dmitry Durnov 15 February 2017 Agenda Modern cluster architecture Node level Cluster level Programming models Tools 2/20/2017 2 Modern
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationextreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA
extreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA Topics Covered Today 2 Intel s offerings to HPC Update on Intel Architecture Roadmap Overview on Intel Development Tools
More informationPreparing for Highly Parallel, Heterogeneous Coprocessing
Preparing for Highly Parallel, Heterogeneous Coprocessing Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar May 17, 2012 What Are We Talking About Here?
More informationIntel Knights Landing Hardware
Intel Knights Landing Hardware TACC KNL Tutorial IXPUG Annual Meeting 2016 PRESENTED BY: John Cazes Lars Koesterke 1 Intel s Xeon Phi Architecture Leverages x86 architecture Simpler x86 cores, higher compute
More information6/14/2017. The Intel Xeon Phi. Setup. Xeon Phi Internals. Fused Multiply-Add. Getting to rabbit and setting up your account. Xeon Phi Peak Performance
The Intel Xeon Phi 1 Setup 2 Xeon system Mike Bailey mjb@cs.oregonstate.edu rabbit.engr.oregonstate.edu 2 E5-2630 Xeon Processors 8 Cores 64 GB of memory 2 TB of disk NVIDIA Titan Black 15 SMs 2880 CUDA
More informationA Simple Path to Parallelism with Intel Cilk Plus
Introduction This introductory tutorial describes how to use Intel Cilk Plus to simplify making taking advantage of vectorization and threading parallelism in your code. It provides a brief description
More informationUnderstanding Dynamic Parallelism
Understanding Dynamic Parallelism Know your code and know yourself Presenter: Mark O Connor, VP Product Management Agenda Introduction and Background Fixing a Dynamic Parallelism Bug Understanding Dynamic
More informationLaurent Duhem Intel Alain Dominguez - Intel
Laurent Duhem Intel Alain Dominguez - Intel Agenda 2 What are Intel Xeon Phi Coprocessors? Architecture and Platform overview Intel associated software development tools Execution and Programming model
More informationIntel Manycore Platform Software Stack (Intel MPSS) User's Guide
Intel Manycore Platform Software Stack (Intel MPSS) User's Guide Copyright 2013-2015 Intel Corporation All Rights Reserved US Revision: 3.5 World Wide Web: http://www.intel.com Table of Contents Disclaimer
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationDouble Rewards of Porting Scientific Applications to the Intel MIC Architecture
Double Rewards of Porting Scientific Applications to the Intel MIC Architecture Troy A. Porter Hansen Experimental Physics Laboratory and Kavli Institute for Particle Astrophysics and Cosmology Stanford
More informationPerformance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures
Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures Dr. Fabio Baruffa Dr. Luigi Iapichino Leibniz Supercomputing Centre fabio.baruffa@lrz.de Outline of the talk
More informationIntel Manycore Platform Software Stack (Intel MPSS)
Intel Manycore Platform Software Stack (Intel MPSS) Copyright 2013-2015 Intel Corporation All Rights Reserved US Revision: 3.6.1 World Wide Web: http://www.intel.com Disclaimer and Legal Information You
More informationTechnical Report. Document Id.: CESGA Date: July 28 th, Responsible: Andrés Gómez. Status: FINAL
Technical Report Abstract: This technical report presents CESGA experience of porting three applications to the new Intel Xeon Phi coprocessor. The objective of these experiments was to evaluate the complexity
More informationDebugging CUDA Applications with Allinea DDT. Ian Lumb Sr. Systems Engineer, Allinea Software Inc.
Debugging CUDA Applications with Allinea DDT Ian Lumb Sr. Systems Engineer, Allinea Software Inc. ilumb@allinea.com GTC 2013, San Jose, March 20, 2013 Embracing GPUs GPUs a rival to traditional processors
More informationIntel + Parallelism Everywhere. James Reinders Intel Corporation
Intel + Parallelism Everywhere James Reinders Intel Corporation How to win at parallel programming 2 My Talk Hardware Parallelism and some insights INNOVATION: vectorization INNOVATION: tasking 3 Helping
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationAn Introduction to the Intel Xeon Phi. Si Liu Feb 6, 2015
Training Agenda Session 1: Introduction 8:00 9:45 Session 2: Native: MIC stand-alone 10:00-11:45 Lunch break Session 3: Offload: MIC as coprocessor 1:00 2:45 Session 4: Symmetric: MPI 3:00 4:45 1 Last
More informationIntel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python
Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python Python Landscape Adoption of Python continues to grow among domain specialists and developers for its productivity benefits Challenge#1:
More informationHands-on with Intel Xeon Phi
Hands-on with Intel Xeon Phi Lab 2: Native Computing and Vector Reports Bill Barth Kent Milfeld Dan Stanzione 1 Lab 2 What you will learn about Evaluating and Analyzing vector performance. Controlling
More informationParallel Applications on Distributed Memory Systems. Le Yan HPC User LSU
Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming
More informationIntel Manycore Platform Software Stack (Intel MPSS)
Intel Manycore Platform Software Stack (Intel MPSS) User's Guide April 2017 Copyright 2013-2017 Intel Corporation All Rights Reserved US Revision: 3.8 World Wide Web: http://www.intel.com Disclaimer and
More information