6/14/2017. The Intel Xeon Phi. Setup. Xeon Phi Internals. Fused Multiply-Add. Getting to rabbit and setting up your account. Xeon Phi Peak Performance
|
|
- Calvin Perry
- 6 years ago
- Views:
Transcription
1 The Intel Xeon Phi 1 Setup 2 Xeon system Mike Bailey mjb@cs.oregonstate.edu rabbit.engr.oregonstate.edu 2 E Xeon Processors 8 Cores 64 GB of memory 2 TB of disk NVIDIA Titan Black 15 SMs 2880 CUDA cores 6 GB of memory OpenGL support OpenCL support Xeon Phi support: icc, icpc, libraries, drivers 1 core for Linux + 56 cores * 4 hyperthreads/core = 224 hyperthreads for you to use 31S1P Xeon Phi system mic0 57 Cores 22 nm 8 GB of memory No disk Application support XeonPhi.pptx Xeon Phi Internals 3 Fused Multiply-Add 4 Each Xeon Phi core has: Instruction Decode Scalar Unit Vector Unit Scalar Registers Vector Registers 32K L1 Instruction Cache The Xeon Phi chip contains almost 5B transistors! Many scientific and engineering computations take the form: D = A + (B*C); A normal multiply-add would handle this as: tmp = B*C; D = A + tmp; A fused multiply-add does it all at once, that is, when the low-order bits of B*C are ready, they are immediately added into the low-order bits of A at the same time the higher-order bits of B*C are being multiplied. 32K L1 Data Cache 512K L2 Cache / Core Ring Bus Cache is 8-way set associative. Cache lines are 64 bytes. Assembly instructions are executed in-order. Thus, hyperthreading is important! Vector registers are 512 bits wide = 16 floats. They can perform Fused Multiply-Add (FMA). Theoretical performance = almost 1 TFLOPS Consider a Base 10 example: ( 123*456 ) 123 x Can start adding the 9 the moment the 8 is produced! 56,877 Note: Normal A+(B*C) FMA A+(B*C) Xeon Phi Peak Performance 5 Getting to rabbit and setting up your account 6 Lowercase letter L To login to rabbit: ssh rabbit.engr.oregonstate.edu l yourengrusername Clock freq x # cores x # vector lanes x 2 FMA / 2 cycles to decode = GHz x 56 x 16 x 2 / 2 = Put this in your rabbit account s.cshrc : setenv INTEL_LICENSE_FILE 28518@linlic.engr.oregonstate.edu setenv SINK_LD_LIBRARY_PATH /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/composer_xe_ /compiler/lib/mic/ setenv ICCPATH /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/composer_xe_2015/bin/ set path=( $path $ICCPATH ) source /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/bin/iccvars.csh intel TFLOPS Then activate these values like this: source.cshrc FMA stands for Fused Multiply-Add. (These will be activated automatically the next time you login.) To verify that the Xeon Phi card is there: ping mic0 To see the Xeon Phi card characteristics: micinfo To run some operational tests on the Xeon Phi: miccheck 1
2 Running ping 7 Running micinfo 8 rabbit 150% ping mic0 PING rabbit-mic0.engr.oregonstate.edu ( ) 56(84) bytes of data. 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=1 ttl=64 time=290 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=2 ttl=64 time=0.385 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=3 ttl=64 time=0.242 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=4 ttl=64 time=0.230 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=5 ttl=64 time=0.225 ms 64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=6 ttl=64 time=0.261 ms rabbit 151% micinfo MicInfo Utility Log Created Mon Jan 12 10:21: System Info HOST OS : Linux OS Version : el6.x86_64 Driver Version : MPSS Version : Host Physical Memory : MB Device No: 0, Device Name: mic0 Coprocessor Model : 0x01 Coprocessor Model Ext : 0x00 Coprocessor Type : 0x00 Coprocessor Family : 0x0b Coprocessor Family Ext : 0x00 Coprocessor Stepping : B1 Board SKU : B1PRQ-31S1P ECC Mode : Enabled SMC HW Revision : Product 300W Passive CS Cores Total No of Active Cores : 57 Voltage : uv Frequency : khz Version Flash Version : SMC Firmware Version : SMC Boot Loader Version : uos Version : mpss3.4.2 Device Serial Number : ADKC Board Vendor ID : 0x8086 Device ID : 0x225e Subsystem ID : 0x2500 Coprocessor Stepping ID : 3 PCIe Width : Insufficient Privileges PCIe Speed : Insufficient Privileges PCIe Max payload size : Insufficient Privileges PCIe Max read req size : Insufficient Privileges Thermal Fan Speed Control Fan RPM Fan PWM Die Temp GDDR GDDR Vendor GDDR Version GDDR Density GDDR Size GDDR Technology GDDR Speed GDDR Frequency GDDR Voltage : 40 C : Elpida : 0x1 : 2048 Mb : 7936 MB : GDDR5 : GT/s : khz : uv Running miccheck 9 Running micsmc, I 10 rabbit 153% micsmc -a rabbit 152% miccheck MicCheck r1 Copyright 2013 Intel Corporation All Rights Reserved Executing default tests for host Test 0: Check number of devices the OS sees in the system... pass Test 1: Check mic driver is loaded... pass Test 2: Check number of devices driver sees in the system... pass Test 3: Check mpssd daemon is running... Pass Executing default tests for device: 0 Test 4 (mic0): Check device is in online state and its postcode is FF... pass Test 5 (mic0): Check ras daemon is available in device... pass Test 6 (mic0): Check running flash version is correct... pass Test 7 (mic0): Check running SMC firmware version is correct... pass mic0 (info): Device Series:... Intel(R) Xeon Phi(TM) coprocessor x100 family Device ID:... 0x225e Number of Cores: OS Version: mpss3.4.2 Flash Version: Driver Version: (root@rabbit.engr.oregonstate.edu) Stepping:... 0x3 Substepping:... 0x0 mic0 (temp): Cpu Temp: C Memory Temp: C Fan-In Temp: C Fan-Out Temp: C Core Rail Temp: C Uncore Rail Temp: C Memory Rail Temp: C Status: OK mic0 (freq): Core Frequency: GHz Total Power: Watts Low Power Limit: Watts High Power Limit: Watts Physical Power Limit: Watts mic0 (mem): Free Memory: MB Total Memory: MB Memory Usage: MB Running micsmc, II 11 Cross-compiling and running from rabbit 12 mic0 (cores): Device Utilization: User: 0.00%, System: 0.09%, Idle: 99.91% Per Core Utilization (57 cores in use) Core #1: User: 0.00%, System: 0.27%, Idle: 99.73% Core #2: User: 0.00%, System: 0.27%, Idle: 99.73% Core #3: User: 0.00%, System: 0.00%, Idle: % Core #4: User: 0.00%, System: 0.00%, Idle: % Core #5: User: 0.00%, System: 0.00%, Idle: % Core #6: User: 0.00%, System: 0.00%, Idle: % Core #7: User: 0.00%, System: 0.00%, Idle: % Core #8: User: 0.00%, System: 0.27%, Idle: 99.73% Core #9: User: 0.00%, System: 0.00%, Idle: % Core #10: User: 0.00%, System: 0.27%, Idle: 99.73% To compile on rabbit for rabbit: icpc -o.cpp -lm -openmp -align -qopt-report=3 -qopt-report-phase=vec To cross-compile on rabbit for the Xeon Phi: Note: the summary of vectorization success or failure is in a *.optvec file Core #50: User: 0.00%, System: 0.00%, Idle: % Core #52: User: 0.00%, System: 0.27%, Idle: 99.73% Core #53: User: 0.00%, System: 0.00%, Idle: % Core #54: User: 0.00%, System: 0.27%, Idle: 99.73% Core #55: User: 0.00%, System: 0.00%, Idle: % Core #56: User: 0.00%, System: 0.27%, Idle: 99.73% Core #57: User: 0.00%, System: 0.54%, Idle: 99.46% To execute on the Xeon Phi, type this on rabbit: 2
3 Gaining Access to the Cores, I 13 Gaining Access to the Cores, II 14 #pragma omp parallel for #pragma omp parallel sections #pragma omp section #pragma omp section float sum = 0.; #pragma omp parallel for reduction(+:sum) sum += A[ i ] * B[ i ] ; #pragma omp task Gaining Access to the Vector Units 15 Turning Off All Vectorization 16 icpc -mmic -o.cpp -lm -openmp -no-vec #pragma omp parallel for simd C[0:N] = A[0:N] * B[0:N] ; The sole (?) reason to do this is when running benchmarks to compare vector vs. scalar array processing. The Intel compiler does a great job of automatically vectorizing where it can. Warning: just because you didn t deliberately vectorize your code doesn t mean it didn t end up vectorized! Use the -no-vec flag. Vectorizing Conditionals 17 Reducing a Vector 18 if( D[ i ] == 0 ) else C[ i ] = A[ i ] + B[ i ] ; float f = sec_reduce_add( A[0:N] ); float f = sec_reduce_mul( A[0:N] ); float f = sec_reduce_max( A[0:N] ); float f = sec_reduce_min( A[0:N] ); int i = sec_reduce_max_ind( A[0:N] ); int i = sec_reduce_min_ind( A[0:N] ); In my tests, this was 3-4x as fast as this. C[ i ] = ( D[ i ] == 0 )? A[ i ] * B[ i ] : A[ i ] + B[ i ] ; boolean b = sec_reduce_all_zero( A[0:N] ); boolean b = sec_reduce_all_nonzero( A[0:N] ); boolean b = sec_reduce_any_zero( A[0:N] ); boolean b = sec_reduce_any_nonzero( A[0:N] ); You must specify the array length. An argument of A[:] will throw a compiler error. 3
4 float sum = 0.; sum += A[ i ]; Reducing a Vector 19 float *A = new float[nums] float *B = new float[nums]; float *C = new float[nums]; double *dtp = new double; double Time0 = omp_get_wtime( ); Offload Mode Data to send over Data to bring back on rabbit #pragma offload target(mic) in(a:length(nums)) in(b:length(nums)) out(c:length(nums)) out(dtp:length(1)) omp_set_num_threads( NUMT ); double time0 = omp_get_wtime( ); 20 In my tests, this was the same speed as this. #pragma omp parallel for simd for( int i = 0; i < NUMS; i++ ) C[i] = A[i] * B[i] ; double time1 = omp_get_wtime( ); *dtp = time1 - time0; on the Xeon Phi float sum = sec_reduce_add( A[0:N] ); double Time1 = omp_get_wtime( ); double overalldt = Time1 - Time0; double offloaddt = *dtp; fprintf( stderr, "%6d\t%6d\t%8.5lf\t%8.5lf\t%8.4f%%\t%8.2lf\n", NUMT, NUMS, overalldt, offloaddt, 100.*offloaddt/overalldt, ((double)nums)/offloaddt/ ); on rabbit Offload Mode 21 Offload Mode: Persistence Between Offloads 22 #define ALLOC #define REUSE alloc_if(1) alloc_if(0) You don t need to do anything special with the compile line: icpc -o.cpp -lm -openmp -align -qopt-report=3 -qopt-report-phase=vec./ #define RETAIN free_if(0) #define FREE free_if(1) #pragma offload target(mic) in(a:length(nums), ALLOC, RETAIN) out(c:length(nums), ALLOC, FREE ) #pragma offload target(mic) in(a:length(nums), REUSE, RETAIN) out(d:length(nums), ALLOC, RETAIN) #pragma offload target(mic) in(a:length(nums), REUSE, FREE) out(d:length(nums), REUSE, FREE ) To ensure alignment, replace this To ensure alignment, replace this float Temperature[NUMN]; float *A = (float *) malloc( NUMS*sizeof(float) ); float *B = (float *) malloc( NUMS*sizeof(float) ); float *C = (float *) malloc( NUMS*sizeof(float) ); with this: #define ALIGN64 declspec(align(64)) with this float *A = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *B = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *C = (float * )_mm_malloc( NUMS*sizeof(float), 64 ); ALIGN64 float Temperature[NUMN]; You then free memory with: _mm_free( A ); _mm_free( B ); _mm_free( C ); 4
5 If you want to ensure alignment, but still want to use C++ s new and delete, replace this: float *A = new float [NUMS]; float *B = new float [NUMS]; float *C = new float [NUMS]; 25 As You Create More and More Threads, On Which Cores Do They End Up? If you want them spread out onto as many cores as possible, execute this: kmp_set_defaults( KMP_AFFINITY=scatter ); 26 with this float *pa = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *pb = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *pc = (float *) _mm_malloc( NUMS*sizeof(float), 64 ); float *A = new(pa) float [NUMS]; float *B = new(pb) float [NUMS]; float *C = new(pc) float [NUMS]; You then free memory with: delete [ ] A; delete [ ] B; delete [ ] C; An advantage of using new and delete instead of malloc is that they allow you to use C++ constructors and destructors. If you want them packed onto the first core until it has 4, than onto the second core until it has 4, etc., execute this: kmp_set_defaults( KMP_AFFINITY=compact ); Use the scatter-mode if you want as much core-power applied to each thread as possible. Use the compact-mode if there is an advantage to some threads sharing a core s local memory with other threads. Running the Volume-Integration Program 27 Running the Volume-Integration Program 28 # of Threads MegaHeights per Second # of Divisions MegaHeights per Second # of Threads # of Divisions Multicore, no vectorization Reservation System
rabbit.engr.oregonstate.edu What is rabbit?
1 rabbit.engr.oregonstate.edu Mike Bailey mjb@cs.oregonstate.edu rabbit.pptx What is rabbit? 2 NVIDIA Titan Black PCIe Bus 15 SMs 2880 CUDA cores 6 GB of memory OpenGL support OpenCL support Xeon system
More informationIntel Xeon Phi Coprocessors
Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon
More informationAccelerator Programming Lecture 1
Accelerator Programming Lecture 1 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de January 11, 2016 Accelerator Programming
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution. IT4Innovations, Ostrava,
, Hardware Overview & Native Execution IT4Innovations, Ostrava, 3.2.- 4.2.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi (MIC) Programming models Native mode programming
More informationIntel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,
Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ, 27.6.- 29.6.2016 1 Agenda Intro @ accelerators on HPC Architecture overview of the Intel Xeon Phi Products Programming models Native
More informationParallel Programming Project Notes
1 Parallel Programming Project Notes Mike Bailey mjb@cs.oregonstate.edu project.notes.pptx Why Are These Notes Here? 2 These notes are here to: 1. Help you setup and run your projects 2. Help you get the
More informationParallel Programming using OpenMP
1 OpenMP Multithreaded Programming 2 Parallel Programming using OpenMP OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard to perform shared-memory multithreading
More informationParallel Programming using OpenMP
1 Parallel Programming using OpenMP Mike Bailey mjb@cs.oregonstate.edu openmp.pptx OpenMP Multithreaded Programming 2 OpenMP stands for Open Multi-Processing OpenMP is a multi-vendor (see next page) standard
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform
More informationIntroduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero
Introduction to Intel Xeon Phi programming techniques Fabio Affinito Vittorio Ruggiero Outline High level overview of the Intel Xeon Phi hardware and software stack Intel Xeon Phi programming paradigms:
More informationGPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017
1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA How Can You Gain Access to GPU Power? 3
More informationGPU 101. Mike Bailey. Oregon State University
1 GPU 101 Mike Bailey mjb@cs.oregonstate.edu gpu101.pptx Why do we care about GPU Programming? A History of GPU Performance vs. CPU Performance 2 Source: NVIDIA 1 How Can You Gain Access to GPU Power?
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationIntel Xeon Phi Coprocessor
Intel Xeon Phi Coprocessor A guide to using it on the Cray XC40 Terminology Warning: may also be referred to as MIC or KNC in what follows! What are Intel Xeon Phi Coprocessors? Hardware designed to accelerate
More informationHeterogeneous Computing and OpenCL
Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction to Intel Xeon Phi
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationParallel Programming on Ranger and Stampede
Parallel Programming on Ranger and Stampede Steve Lantz Senior Research Associate Cornell CAC Parallel Computing at TACC: Ranger to Stampede Transition December 11, 2012 What is Stampede? NSF-funded XSEDE
More informationProgramming Intel R Xeon Phi TM
Programming Intel R Xeon Phi TM An Overview Anup Zope Mississippi State University 20 March 2018 Anup Zope (Mississippi State University) Programming Intel R Xeon Phi TM 20 March 2018 1 / 46 Outline 1
More informationIntroduction to Xeon Phi. Bill Barth January 11, 2013
Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider
More informationDouble Rewards of Porting Scientific Applications to the Intel MIC Architecture
Double Rewards of Porting Scientific Applications to the Intel MIC Architecture Troy A. Porter Hansen Experimental Physics Laboratory and Kavli Institute for Particle Astrophysics and Cosmology Stanford
More informationNative Computing and Optimization. Hang Liu December 4 th, 2013
Native Computing and Optimization Hang Liu December 4 th, 2013 Overview Why run native? What is a native application? Building a native application Running a native application Setting affinity and pinning
More informationTools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,
Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon
More informationthe Intel Xeon Phi coprocessor
the Intel Xeon Phi coprocessor 1 Introduction about the Intel Xeon Phi coprocessor comparing Phi with CUDA the Intel Many Integrated Core architecture 2 Programming the Intel Xeon Phi Coprocessor with
More informationMany-core Processor Programming for beginners. Hongsuk Yi ( 李泓錫 ) KISTI (Korea Institute of Science and Technology Information)
Many-core Processor Programming for beginners Hongsuk Yi ( 李泓錫 ) (hsyi@kisti.re.kr) KISTI (Korea Institute of Science and Technology Information) Contents Overview of the Heterogeneous Computing Introduction
More informationReusing this material
XEON PHI BASICS Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 9 SIMD vectorization using #pragma omp simd force
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationPreparing for Highly Parallel, Heterogeneous Coprocessing
Preparing for Highly Parallel, Heterogeneous Coprocessing Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar May 17, 2012 What Are We Talking About Here?
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationIntel Architecture for HPC
Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationAdvanced OpenMP Features
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationParallel and Distributed Programming Introduction. Kenjiro Taura
Parallel and Distributed Programming Introduction Kenjiro Taura 1 / 21 Contents 1 Why Parallel Programming? 2 What Parallel Machines Look Like, and Where Performance Come From? 3 How to Program Parallel
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationNUMA-aware OpenMP Programming
NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,
More informationIntroduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA
Introduction to the Xeon Phi programming model Fabio AFFINITO, CINECA What is a Xeon Phi? MIC = Many Integrated Core architecture by Intel Other names: KNF, KNC, Xeon Phi... Not a CPU (but somewhat similar
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationNon-uniform memory access (NUMA)
Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access
More informationHPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,
HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:
More informationHands-on with Intel Xeon Phi
Hands-on with Intel Xeon Phi Lab 2: Native Computing and Vector Reports Bill Barth Kent Milfeld Dan Stanzione 1 Lab 2 What you will learn about Evaluating and Analyzing vector performance. Controlling
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationGet Ready for Intel MKL on Intel Xeon Phi Coprocessors. Zhang Zhang Technical Consulting Engineer Intel Math Kernel Library
Get Ready for Intel MKL on Intel Xeon Phi Coprocessors Zhang Zhang Technical Consulting Engineer Intel Math Kernel Library Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationIntroduction to the Intel Xeon Phi on Stampede
June 10, 2014 Introduction to the Intel Xeon Phi on Stampede John Cazes Texas Advanced Computing Center Stampede - High Level Overview Base Cluster (Dell/Intel/Mellanox): Intel Sandy Bridge processors
More informationXeon Phi Coprocessors on Turing
Xeon Phi Coprocessors on Turing Table of Contents Overview...2 Using the Phi Coprocessors...2 Example...2 Intel Vtune Amplifier Example...3 Appendix...8 Sources...9 Information Technology Services High
More informationNative Computing and Optimization on the Intel Xeon Phi Coprocessor. John D. McCalpin
Native Computing and Optimization on the Intel Xeon Phi Coprocessor John D. McCalpin mccalpin@tacc.utexas.edu Intro (very brief) Outline Compiling & Running Native Apps Controlling Execution Tuning Vectorization
More informationGrowth in Cores - A well rehearsed story
Intel CPUs Growth in Cores - A well rehearsed story 2 1. Multicore is just a fad! Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
More informationCOSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors
COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.
More informationBenchmark results on Knight Landing (KNL) architecture
Benchmark results on Knight Landing (KNL) architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Roma 23/10/2017 KNL, BDW, SKL A1 BDW A2 KNL A3 SKL cores per node 2 x 18 @2.3
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationNative Computing and Optimization on the Intel Xeon Phi Coprocessor. John D. McCalpin
Native Computing and Optimization on the Intel Xeon Phi Coprocessor John D. McCalpin mccalpin@tacc.utexas.edu Outline Overview What is a native application? Why run native? Getting Started: Building a
More informationPRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava,
PRACE PATC Course: Intel MIC Programming Workshop, MKL Ostrava, 7-8.2.2017 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi Compiler Assisted Offload Automatic Offload Native Execution Hands-on
More informationE, F. Best-known methods (BKMs), 153 Boot strap processor (BSP),
Index A Accelerated Strategic Computing Initiative (ASCI), 3 Address generation interlock (AGI), 55 Algorithm and data structures, 171. See also General matrix-matrix multiplication (GEMM) design rules,
More informationNative Computing and Optimization on the Intel Xeon Phi Coprocessor. Lars Koesterke John D. McCalpin
Native Computing and Optimization on the Intel Xeon Phi Coprocessor Lars Koesterke John D. McCalpin lars@tacc.utexas.edu mccalpin@tacc.utexas.edu Intro (very brief) Outline Compiling & Running Native Apps
More informationTurbo Boost Up, AVX Clock Down: Complications for Scaling Tests
Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Steve Lantz 12/8/2017 1 What Is CPU Turbo? (Sandy Bridge) = nominal frequency http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/hc23.19.9-desktop-cpus/hc23.19.921.sandybridge_power_10-rotem-intel.pdf
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationLeibniz Supercomputer Centre. Movie on YouTube
SuperMUC @ Leibniz Supercomputer Centre Movie on YouTube Peak Performance Peak performance: 3 Peta Flops 3*10 15 Flops Mega 10 6 million Giga 10 9 billion Tera 10 12 trillion Peta 10 15 quadrillion Exa
More informationBy: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,
More informationOverview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC.
Vectorisation James Briggs 1 COSMOS DiRAC April 28, 2015 Session Plan 1 Overview 2 Implicit Vectorisation 3 Explicit Vectorisation 4 Data Alignment 5 Summary Section 1 Overview What is SIMD? Scalar Processing:
More informationBenchmark results on Knight Landing architecture
Benchmark results on Knight Landing architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Milano, 21/04/2017 KNL vs BDW A1 BDW A2 KNL cores per node 2 x 18 @2.3 GHz 1 x 68
More informationScientific Computing with Intel Xeon Phi Coprocessors
Scientific Computing with Intel Xeon Phi Coprocessors Andrey Vladimirov Colfax International HPC Advisory Council Stanford Conference 2015 Compututing with Xeon Phi Welcome Colfax International, 2014 Contents
More informationFlash Issues & Remedies
Flash Issues & Remedies General Guidance Many flash and SMC update failures can be addressed by simply updating to the latest version of Intel(R) MPSS and reflash the card(s) again. Make sure to follow
More informationAchieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013
Achieving High Performance Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Does Instruction Set Matter? We find that ARM and x86 processors are simply engineering design points optimized
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationIntel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2
Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting
More informationAccelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
More informationIntroduction to tuning on many core platforms. Gilles Gouaillardet RIST
Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need
More informationAdvanced Research Computing. ARC3 and GPUs. Mark Dixon
Advanced Research Computing Mark Dixon m.c.dixon@leeds.ac.uk ARC3 (1st March 217) Included 2 GPU nodes, each with: 24 Intel CPU cores & 128G RAM (same as standard compute node) 2 NVIDIA Tesla K8 24G RAM
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationDay 6: Optimization on Parallel Intel Architectures
Day 6: Optimization on Parallel Intel Architectures Lecture day 6 Ryo Asai Colfax International colfaxresearch.com April 2017 colfaxresearch.com/ Welcome Colfax International, 2013 2017 Disclaimer 2 While
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance
More informationBring your application to a new era:
Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationThe Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation
The Intel Xeon Phi Coprocessor Dr-Ing. Michael Klemm Software and Services Group Intel Corporation (michael.klemm@intel.com) Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED
More informationGe#ng Started with Automa3c Compiler Vectoriza3on. David Apostal UND CSci 532 Guest Lecture Sept 14, 2017
Ge#ng Started with Automa3c Compiler Vectoriza3on David Apostal UND CSci 532 Guest Lecture Sept 14, 2017 Parallellism is Key to Performance Types of parallelism Task-based (MPI) Threads (OpenMP, pthreads)
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationIntel Xeon Phi Coprocessor
Intel Xeon Phi Coprocessor 1 Agenda Introduction Intel Xeon Phi Architecture Programming Models Outlook Summary 2 Intel Multicore Architecture Intel Many Integrated Core Architecture (Intel MIC) Foundation
More informationOverview of Intel Xeon Phi Coprocessor
Overview of Intel Xeon Phi Coprocessor Sept 20, 2013 Ritu Arora Texas Advanced Computing Center Email: rauta@tacc.utexas.edu This talk is only a trailer A comprehensive training on running and optimizing
More informationManycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.
phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationPORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune
PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationCode optimization in a 3D diffusion model
Code optimization in a 3D diffusion model Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 18 th 2016, Barcelona Agenda Background Diffusion
More information. Understanding Performance of Shared Memory Programs. Kenjiro Taura. University of Tokyo
.. Understanding Performance of Shared Memory Programs Kenjiro Taura University of Tokyo 1 / 45 Today s topics. 1 Introduction. 2 Artificial bottlenecks. 3 Understanding shared memory performance 2 / 45
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationIntel Math Kernel Library (Intel MKL) Latest Features
Intel Math Kernel Library (Intel MKL) Latest Features Sridevi Allam Technical Consulting Engineer Sridevi.allam@intel.com 1 Agenda - Introduction to Support on Intel Xeon Phi Coprocessors - Performance
More informationIntroduction of Xeon Phi Programming
Introduction of Xeon Phi Programming Wei Feinstein Shaohao Chen HPC User Services LSU HPC/LONI Louisiana State University 10/21/2015 Introduction to Xeon Phi Programming Overview Intel Xeon Phi architecture
More informationRunning HARMONIE on Xeon Phi Coprocessors
Running HARMONIE on Xeon Phi Coprocessors Enda O Brien Irish Centre for High-End Computing Disclosure Intel is funding ICHEC to port & optimize some applications, including HARMONIE, to Xeon Phi coprocessors.
More informationOpenCL Vectorising Features. Andreas Beckmann
Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Thread-Level Parallelism (TLP) and OpenMP
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Thread-Level Parallelism (TLP) and OpenMP Instructors: John Wawrzynek & Vladimir Stojanovic http://inst.eecs.berkeley.edu/~cs61c/ Review
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More information