The Intel Xeon Phi

Mike Bailey
Oregon State University
mjb@cs.oregonstate.edu
XeonPhi.pptx

Setup

rabbit.engr.oregonstate.edu

Xeon host system:
    2 Xeon processors, 8 cores
    64 GB of memory
    2 TB of disk
    Xeon Phi support: icc, icpc, libraries, drivers

NVIDIA Titan Black:
    15 SMs, 2880 CUDA cores
    6 GB of memory
    OpenGL support
    OpenCL support

31S1P Xeon Phi card (mic0):
    57 cores, 22 nm
    8 GB of memory
    No disk
    1 core for Linux + 56 cores * 4 hyperthreads/core = 224 hyperthreads for you to use

Xeon Phi Internals

Each Xeon Phi core has:
    Instruction Decode
    Scalar Unit and Scalar Registers
    Vector Unit and Vector Registers
    32K L1 Instruction Cache
    32K L1 Data Cache
    512K L2 Cache per core
    a connection to the Ring Bus

The cache is 8-way set associative. Cache lines are 64 bytes.
Assembly instructions are executed in-order -- thus, hyperthreading is important!
Vector registers are 512 bits wide = 16 floats, and they can perform a Fused Multiply-Add (FMA).
The Xeon Phi chip contains almost 5B transistors.
Theoretical performance: almost 1 TFLOPS.

Fused Multiply-Add

Many scientific and engineering computations take the form:
    D = A + (B*C);
A normal multiply-add would handle this as:
    tmp = B*C;
    D = A + tmp;
A fused multiply-add does it all at once: as soon as the low-order bits of B*C are ready, they are
immediately added into the low-order bits of A, while the higher-order bits of B*C are still being multiplied.

Consider a base-10 example, 789 + (123*456) = 56,877: the moment the multiply produces the 8 in the
low-order digit of 123*456 (= 56,088), the 9 of 789 can start being added to it.

Note: a normal A + (B*C) and a fused A + (B*C) can give slightly different results, because the
normal version rounds twice and the fused version rounds only once.

Xeon Phi Peak Performance

    clock frequency x # of cores x # of vector lanes x 2 (for the FMA) / 2 (cycles to decode)
        = (clock GHz) x 56 x 16 x 2 / 2
        = almost 1 TFLOPS

FMA stands for Fused Multiply-Add.

Getting to rabbit and setting up your account

To log in to rabbit:
    ssh rabbit.engr.oregonstate.edu -l yourengrusername
(that is a lowercase letter L)

Put this in your rabbit account's .cshrc:
    setenv INTEL_LICENSE_FILE 28518@linlic.engr.oregonstate.edu
    setenv SINK_LD_LIBRARY_PATH /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/composer_xe_ /compiler/lib/mic/
    setenv ICCPATH /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/composer_xe_2015/bin/
    set path = ( $path $ICCPATH )
    source /nfs/guille/a2/rh80apps/intel/studio.2013-sp1/bin/iccvars.csh intel

Then activate these values like this:
    source .cshrc
(They will be activated automatically the next time you log in.)

To verify that the Xeon Phi card is there:       ping mic0
To see the Xeon Phi card characteristics:        micinfo
To run some operational tests on the Xeon Phi:   miccheck
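Before moving on to the tools, here is a minimal sketch (not from the slides; the array size and
initial values are arbitrary) of the D = A + (B*C) pattern discussed above, written so the compiler
can map it onto the 512-bit vector unit and its FMA. Compile it with -openmp and the vectorization
report flags shown later to see whether the loop vectorized:

    #include <cstdio>

    const int N = 1024*1024;
    float A[N], B[N], C[N], D[N];

    int main( )
    {
        for( int i = 0; i < N; i++ )
        {
            A[i] = 1.f;   B[i] = 2.f;   C[i] = 3.f;
        }

        // D = A + (B*C): one fused multiply-add per element,
        // 16 floats at a time in a 512-bit vector register
        #pragma omp parallel for simd
        for( int i = 0; i < N; i++ )
            D[i] = A[i] + ( B[i] * C[i] );

        fprintf( stderr, "D[0] = %f\n", D[0] );
        return 0;
    }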

Running ping

rabbit 150% ping mic0
PING rabbit-mic0.engr.oregonstate.edu ( ) 56(84) bytes of data.
64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=1 ttl=64 time=290 ms
64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=2 ttl=64 time=0.385 ms
64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=3 ttl=64 time=0.242 ms
64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=4 ttl=64 time=0.230 ms
64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=5 ttl=64 time=0.225 ms
64 bytes from rabbit-mic0.engr.oregonstate.edu ( ): icmp_seq=6 ttl=64 time=0.261 ms

Running micinfo

rabbit 151% micinfo
MicInfo Utility Log
Created Mon Jan 12 10:21

System Info
    HOST OS                  : Linux
    OS Version               : el6.x86_64
    Driver Version           :
    MPSS Version             :
    Host Physical Memory     : MB

Device No: 0, Device Name: mic0
    Coprocessor Model        : 0x01
    Coprocessor Model Ext    : 0x00
    Coprocessor Type         : 0x00
    Coprocessor Family       : 0x0b
    Coprocessor Family Ext   : 0x00
    Coprocessor Stepping     : B1
    Board SKU                : B1PRQ-31S1P
    ECC Mode                 : Enabled
    SMC HW Revision          : Product 300W Passive CS

Cores
    Total No of Active Cores : 57
    Voltage                  : uV
    Frequency                : kHz

Version
    Flash Version            :
    SMC Firmware Version     :
    SMC Boot Loader Version  :
    uOS Version              : mpss3.4.2
    Device Serial Number     : ADKC

Board
    Vendor ID                : 0x8086
    Device ID                : 0x225e
    Subsystem ID             : 0x2500
    Coprocessor Stepping ID  : 3
    PCIe Width               : Insufficient Privileges
    PCIe Speed               : Insufficient Privileges
    PCIe Max payload size    : Insufficient Privileges
    PCIe Max read req size   : Insufficient Privileges

Thermal
    Fan Speed Control        :
    Fan RPM                  :
    Fan PWM                  :
    Die Temp                 : 40 C

GDDR
    GDDR Vendor              : Elpida
    GDDR Version             : 0x1
    GDDR Density             : 2048 Mb
    GDDR Size                : 7936 MB
    GDDR Technology          : GDDR5
    GDDR Speed               : GT/s
    GDDR Frequency           : kHz
    GDDR Voltage             : uV

Running miccheck

rabbit 152% miccheck
MicCheck r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
    Test 0: Check number of devices the OS sees in the system ... pass
    Test 1: Check mic driver is loaded ... pass
    Test 2: Check number of devices driver sees in the system ... pass
    Test 3: Check mpssd daemon is running ... pass

Executing default tests for device: 0
    Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
    Test 5 (mic0): Check ras daemon is available in device ... pass
    Test 6 (mic0): Check running flash version is correct ... pass
    Test 7 (mic0): Check running SMC firmware version is correct ... pass

Running micsmc, I

rabbit 153% micsmc -a

mic0 (info):
    Device Series        : Intel(R) Xeon Phi(TM) coprocessor x100 family
    Device ID            : 0x225e
    Number of Cores      :
    OS Version           : mpss3.4.2
    Flash Version        :
    Driver Version       : (root@rabbit.engr.oregonstate.edu)
    Stepping             : 0x3
    Substepping          : 0x0

mic0 (temp):
    Cpu Temp             : C
    Memory Temp          : C
    Fan-In Temp          : C
    Fan-Out Temp         : C
    Core Rail Temp       : C
    Uncore Rail Temp     : C
    Memory Rail Temp     : C
    Status               : OK

mic0 (freq):
    Core Frequency       : GHz
    Total Power          : Watts
    Low Power Limit      : Watts
    High Power Limit     : Watts
    Physical Power Limit : Watts

mic0 (mem):
    Free Memory          : MB
    Total Memory         : MB
    Memory Usage         : MB

Running micsmc, II

mic0 (cores):
    Device Utilization: User: 0.00%, System: 0.09%, Idle: 99.91%
    Per Core Utilization (57 cores in use)
        Core #1:  User: 0.00%, System: 0.27%, Idle: 99.73%
        Core #2:  User: 0.00%, System: 0.27%, Idle: 99.73%
        Core #3:  User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #4:  User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #5:  User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #6:  User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #7:  User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #8:  User: 0.00%, System: 0.27%, Idle: 99.73%
        Core #9:  User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #10: User: 0.00%, System: 0.27%, Idle: 99.73%
        . . .
        Core #50: User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #52: User: 0.00%, System: 0.27%, Idle: 99.73%
        Core #53: User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #54: User: 0.00%, System: 0.27%, Idle: 99.73%
        Core #55: User: 0.00%, System: 0.00%, Idle: 100.00%
        Core #56: User: 0.00%, System: 0.27%, Idle: 99.73%
        Core #57: User: 0.00%, System: 0.54%, Idle: 99.46%

Cross-compiling and running from rabbit

To compile on rabbit for rabbit:
    icpc -o <executable> <source>.cpp -lm -openmp -align -qopt-report=3 -qopt-report-phase=vec

To cross-compile on rabbit for the Xeon Phi, add the -mmic flag:
    icpc -mmic -o <executable> <source>.cpp -lm -openmp -align -qopt-report=3 -qopt-report-phase=vec

Note: the summary of vectorization success or failure is in a *.optvec file.

To execute on the Xeon Phi, type this on rabbit:
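As a hedged sketch of that workflow: cross-compile a small test program with -mmic, then launch it
from rabbit with micnativeloadex <executable> (which uses the SINK_LD_LIBRARY_PATH set in the .cshrc
earlier) or by logging into the card with ssh mic0 and running the binary there; both launch methods
are assumptions here, not taken from the slides. A handy first test program:

    #include <cstdio>
    #include <omp.h>

    int main( )
    {
        // How many logical processors does this side of the bus see?
        // Run natively on the Xeon Phi, this reports the card's hyperthreads;
        // the deck budgets 224 of them (56 cores x 4) for user code.
        fprintf( stderr, "omp_get_num_procs( )   = %d\n", omp_get_num_procs( ) );

        #pragma omp parallel
        {
            #pragma omp single
            fprintf( stderr, "omp_get_num_threads( ) = %d\n", omp_get_num_threads( ) );
        }
        return 0;
    }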

Gaining Access to the Cores, I

The standard OpenMP constructs give you the Xeon Phi's cores:
    #pragma omp parallel for
    #pragma omp parallel sections
    #pragma omp section
    #pragma omp section

Gaining Access to the Cores, II

    float sum = 0.;
    #pragma omp parallel for reduction(+:sum)
        sum += A[ i ] * B[ i ];

    #pragma omp task

Gaining Access to the Vector Units

Two ways to use the vector units: the OpenMP simd clause on a loop, or Intel's array notation:
    #pragma omp parallel for simd

    C[0:N] = A[0:N] * B[0:N];

The Intel compiler does a great job of automatically vectorizing where it can.

Turning Off All Vectorization

    icpc -mmic -o <executable> <source>.cpp -lm -openmp -no-vec

The sole (?) reason to do this is when running benchmarks to compare vector vs. scalar array processing.

Warning: just because you didn't deliberately vectorize your code doesn't mean it didn't end up
vectorized! Use the -no-vec flag.

Vectorizing Conditionals

    if( D[ i ] == 0 )
        C[ i ] = A[ i ] * B[ i ];
    else
        C[ i ] = A[ i ] + B[ i ];

    C[ i ] = ( D[ i ] == 0 )  ?  A[ i ] * B[ i ]  :  A[ i ] + B[ i ];

In my tests, the ?: form was 3-4x as fast as the if/else form.

Reducing a Vector

    float f = __sec_reduce_add( A[0:N] );
    float f = __sec_reduce_mul( A[0:N] );
    float f = __sec_reduce_max( A[0:N] );
    float f = __sec_reduce_min( A[0:N] );
    int i = __sec_reduce_max_ind( A[0:N] );
    int i = __sec_reduce_min_ind( A[0:N] );
    boolean b = __sec_reduce_all_zero( A[0:N] );
    boolean b = __sec_reduce_all_nonzero( A[0:N] );
    boolean b = __sec_reduce_any_zero( A[0:N] );
    boolean b = __sec_reduce_any_nonzero( A[0:N] );

You must specify the array length. An argument of A[:] will throw a compiler error.
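Putting the parallel-for, reduction, and simd clauses above together, here is a minimal sketch of a
multithreaded, vectorized dot product (the array size, initial values, and thread count are arbitrary
example values, not from the slides):

    #include <cstdio>
    #include <omp.h>

    const int NUMS = 1024*1024;
    float A[NUMS], B[NUMS];

    int main( )
    {
        for( int i = 0; i < NUMS; i++ )
        {
            A[i] = 1.f;   B[i] = 2.f;
        }

        omp_set_num_threads( 16 );        // arbitrary example value

        // reduction(+:sum) gives each thread its own partial sum;
        // simd asks each thread to use the 16-wide vector unit on its chunk
        float sum = 0.;
        #pragma omp parallel for simd reduction(+:sum)
        for( int i = 0; i < NUMS; i++ )
            sum += A[i] * B[i];

        fprintf( stderr, "sum = %f\n", sum );
        return 0;
    }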

Reducing a Vector (continued)

    float sum = 0.;
        sum += A[ i ];

In my tests, this was the same speed as this:

    float sum = __sec_reduce_add( A[0:N] );

Offload Mode

    // on rabbit:
    float *A = new float [NUMS];
    float *B = new float [NUMS];
    float *C = new float [NUMS];
    double *dtp = new double;

    double Time0 = omp_get_wtime( );

    // in( ) = data to send over, out( ) = data to bring back:
    #pragma offload target(mic) in(A:length(NUMS)) in(B:length(NUMS)) out(C:length(NUMS)) out(dtp:length(1))
    {
        // on the Xeon Phi:
        omp_set_num_threads( NUMT );
        double time0 = omp_get_wtime( );

        #pragma omp parallel for simd
        for( int i = 0; i < NUMS; i++ )
            C[i] = A[i] * B[i];

        double time1 = omp_get_wtime( );
        *dtp = time1 - time0;
    }

    // back on rabbit:
    double Time1 = omp_get_wtime( );
    double overalldt = Time1 - Time0;
    double offloaddt = *dtp;
    fprintf( stderr, "%6d\t%6d\t%8.5lf\t%8.5lf\t%8.4f%%\t%8.2lf\n",
        NUMT, NUMS, overalldt, offloaddt, 100.*offloaddt/overalldt, ((double)NUMS)/offloaddt/ );

Offload Mode

You don't need to do anything special with the compile line:
    icpc -o <executable> <source>.cpp -lm -openmp -align -qopt-report=3 -qopt-report-phase=vec
    ./<executable>

Offload Mode: Persistence Between Offloads

    #define ALLOC   alloc_if(1)
    #define REUSE   alloc_if(0)
    #define RETAIN  free_if(0)
    #define FREE    free_if(1)

    #pragma offload target(mic) in(A:length(NUMS), ALLOC, RETAIN) out(C:length(NUMS), ALLOC, FREE  )
    #pragma offload target(mic) in(A:length(NUMS), REUSE, RETAIN) out(D:length(NUMS), ALLOC, RETAIN)
    #pragma offload target(mic) in(A:length(NUMS), REUSE, FREE  ) out(D:length(NUMS), REUSE, FREE  )

To ensure alignment, replace this:
    float Temperature[NUMN];
with this:
    #define ALIGN64 __declspec(align(64))
    ALIGN64 float Temperature[NUMN];

To ensure alignment, replace this:
    float *A = (float *) malloc( NUMS*sizeof(float) );
    float *B = (float *) malloc( NUMS*sizeof(float) );
    float *C = (float *) malloc( NUMS*sizeof(float) );
with this:
    float *A = (float *) _mm_malloc( NUMS*sizeof(float), 64 );
    float *B = (float *) _mm_malloc( NUMS*sizeof(float), 64 );
    float *C = (float *) _mm_malloc( NUMS*sizeof(float), 64 );

You then free the memory with:
    _mm_free( A );
    _mm_free( B );
    _mm_free( C );
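Returning to the Offload Mode program above, here is a stripped-down, self-contained variant of it
(a sketch: the timing code is omitted, and the array size and initial values are arbitrary rather
than taken from the slides):

    #include <cstdio>

    int main( )
    {
        const int NUMS = 1024*1024;
        float *A = new float [NUMS];
        float *B = new float [NUMS];
        float *C = new float [NUMS];
        for( int i = 0; i < NUMS; i++ )
        {
            A[i] = 1.f;   B[i] = 2.f;
        }

        // in( ) arrays are copied to the coprocessor, out( ) arrays are copied back:
        #pragma offload target(mic) in(A:length(NUMS)) in(B:length(NUMS)) out(C:length(NUMS))
        {
            #pragma omp parallel for simd
            for( int i = 0; i < NUMS; i++ )
                C[i] = A[i] * B[i];
        }

        fprintf( stderr, "C[0] = %f\n", C[0] );
        delete [ ] A;   delete [ ] B;   delete [ ] C;
        return 0;
    }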

If you want to ensure alignment, but still want to use C++'s new and delete, replace this:
    float *A = new float [NUMS];
    float *B = new float [NUMS];
    float *C = new float [NUMS];
with this:
    float *pa = (float *) _mm_malloc( NUMS*sizeof(float), 64 );
    float *pb = (float *) _mm_malloc( NUMS*sizeof(float), 64 );
    float *pc = (float *) _mm_malloc( NUMS*sizeof(float), 64 );
    float *A = new(pa) float [NUMS];
    float *B = new(pb) float [NUMS];
    float *C = new(pc) float [NUMS];

You then free the memory with:
    delete [ ] A;
    delete [ ] B;
    delete [ ] C;

An advantage of using new and delete instead of malloc is that they allow you to use C++ constructors and destructors.

As You Create More and More Threads, On Which Cores Do They End Up?

If you want them spread out onto as many cores as possible, execute this:
    kmp_set_defaults( "KMP_AFFINITY=scatter" );

If you want them packed onto the first core until it has 4, then onto the second core until it has 4, etc., execute this:
    kmp_set_defaults( "KMP_AFFINITY=compact" );

Use scatter mode if you want as much core-power applied to each thread as possible.
Use compact mode if there is an advantage to some threads sharing a core's local memory with other threads.
(A compilable sketch of these calls appears at the end of this section.)

Running the Volume-Integration Program

[Performance graphs: MegaHeights computed per second vs. # of threads and vs. # of divisions, including a multicore, no-vectorization curve for comparison.]

Reservation System
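Here is that sketch of the affinity calls from the slide above (the forward declaration of
kmp_set_defaults and the thread count are assumptions, not taken from the slides):

    #include <cstdio>
    #include <omp.h>

    // kmp_set_defaults( ) is part of the Intel OpenMP runtime; this declaration is an
    // assumption, in case the headers being used do not already provide it.
    extern "C" void kmp_set_defaults( char const * );

    int main( )
    {
        // Affinity must be set before the first parallel region is created:
        kmp_set_defaults( "KMP_AFFINITY=scatter" );    // or "KMP_AFFINITY=compact"
        omp_set_num_threads( 224 );                    // 224 = 56 cores x 4 hyperthreads

        #pragma omp parallel
        {
            #pragma omp single
            fprintf( stderr, "Running with %d threads\n", omp_get_num_threads( ) );
        }
        return 0;
    }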
