
1 MANY-CORE COMPUTING 10-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, escience Center

2 Schedule 2 1. Introduction, performance metrics 2. Programming many-cores (10/10) 1. Performance Evaluation and Analysis 2. Programming multi-/many-core CPUs 3. Programming GPUs 3. CUDA Programming (14/10) 4. Case study: LOFAR telescope with many-cores (Rob van Nieuwpoort, on 17/10)

3 3 Performance analysis Operational Intensity and the Roofline model

4 An Example 4 I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want better performance! How fast can you make it? => Execution time is what users care about! How much will parallelism bring? => Compute the speed-up, using the correct reference. How do I know you're not bluffing? => Compute the achievable performance; it depends on the architecture. Which architecture do you *really* need? => Compute the utilization, to decide how much to spend.

5 Software performance metrics (3 P's) 5 Performance: execution time => speed-up; computational throughput (GFLOP/s) => computational efficiency (i.e., utilization); bandwidth (GB/s) => memory efficiency (i.e., utilization). Productivity and Portability: programmability, production costs, maintenance costs.

6 Performance evaluation 6 Measure execution time: Tpar => absolute performance. Calculate speed-up: S = Tseq / Tpar => relative performance. These do not take the application into account! Execution time and speed-up can only be used to compare implementations of the same algorithm.

7 Performance evaluation 7 Calculate throughput: GFLOP/s = #FLOPs / Tpar => takes the application into account! Calculate compute efficiency: Ec = achieved GFLOP/s / peak * 100. Calculate bandwidth: BW = #bytes(RD+WR) / Tpar => takes the application into account! Calculate bandwidth efficiency: Ebw = BW / peak * 100. Achieved bandwidth and throughput can be used to compare *different* algorithms. Efficiency can be used to compare *different* (application, platform) combinations.
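These metrics are simple arithmetic once the run has been timed and the operation and traffic counts are known. A minimal sketch, assuming made-up peak numbers, timings, and FLOP/byte counts (none of them come from the slides):

#include <stdio.h>

/* Illustrative peak numbers for a hypothetical platform (assumptions). */
#define PEAK_GFLOPS 1000.0   /* peak compute throughput, GFLOP/s */
#define PEAK_GBS     100.0   /* peak memory bandwidth, GB/s      */

int main(void) {
    /* Measured / counted quantities for one kernel run (illustrative values). */
    double t_seq = 12.0;     /* sequential execution time, seconds  */
    double t_par =  1.5;     /* parallel execution time, seconds    */
    double flops = 600e9;    /* floating-point operations executed  */
    double bytes = 120e9;    /* bytes read + written                */

    double speedup     = t_seq / t_par;                 /* S = Tseq / Tpar        */
    double gflops      = flops / t_par / 1e9;           /* achieved GFLOP/s       */
    double bw          = bytes / t_par / 1e9;           /* achieved GB/s          */
    double eff_compute = gflops / PEAK_GFLOPS * 100.0;  /* compute efficiency, %  */
    double eff_bw      = bw / PEAK_GBS * 100.0;         /* bandwidth efficiency, % */

    printf("speedup %.2f, %.1f GFLOP/s (%.1f%% of peak), %.1f GB/s (%.1f%% of peak)\n",
           speedup, gflops, eff_compute, bw, eff_bw);
    return 0;
}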

8 Performance analysis 8 Real-life performance vs. theoretical limits: understand the bottlenecks, perform the correct optimizations, and decide when to stop fiddling with the code!!! Computing the theoretical limits is the most difficult challenge in parallel performance analysis. Using the theoretical peak limits alone gives low accuracy => use the application characteristics and the platform characteristics.

9 Arithmetic intensity 9 The number of arithmetic (floating-point) operations per byte of memory that is accessed. Is the application compute-intensive or data-intensive? It's an application characteristic! Ignore overheads: loop counters, array index calculations, branches.

10 RGB to gray 10
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
    }
}

11 RGB to gray 11
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
    }
}
2 x ADD, 3 x MUL = 5 OPs
3 x RD, 1 x WR = 4 memory accesses
AI = 5/4 = 1.25

12 Many-core platforms 12 [Table comparing Cores, Threads/ALUs, GFLOPS, Bandwidth, and FLOPs/Byte for Sun Niagara 2, IBM BG/P, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, two AMD HD GPUs, and Intel MIC; the numeric values did not survive the transcription.]

13 Compute or memory intensive? 13 [Chart comparing the FLOPs/Byte ratio of Sun Niagara 2, IBM BG/P, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, AMD HD 6970, and Intel Xeon Phi (7120) against the RGB-to-Gray arithmetic intensity.] "A multi-/many-core processor is a device built to turn a compute-intensive application into a memory-intensive one" - Kathy Yelick, UC Berkeley

14 Applications' AI 14 Arithmetic intensity spectrum: O(1) - SpMV, BLAS 1/2, stencils (PDEs), lattice methods; O(log(N)) - FFTs; O(N) - dense linear algebra (BLAS3), particle methods.
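To make the low end of this spectrum concrete: a plain CSR sparse matrix-vector multiply does about 2 flops per non-zero but reads a matrix value, a column index, and an x element for each of them, so its arithmetic intensity stays O(1). A minimal sketch (the CSR field names are illustrative, not from the slides):

/* y = A*x for a sparse matrix stored in CSR format (illustrative layout).
 * Per non-zero: 1 multiply + 1 add, but ~3 memory accesses (value, column
 * index, x[col]) => arithmetic intensity of roughly O(1). */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y)
{
    for (int row = 0; row < nrows; row++) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}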

15 Operational intensity 15 The number of operations per byte of DRAM traffic. Differences with arithmetic intensity: it counts operations, not just arithmetic ones, and the memory traffic is counted after it has been filtered by the cache hierarchy - not between processor and cache, but between cache and DRAM memory. Optional assignment (send to A.L.Varbanescu@uva.nl): explain the difference between operational intensity and arithmetic intensity using an example. All correct solutions will receive a small prize.

16 Attainable performance 16 Attainable GFLOP/s = min(Peak floating-point performance, Peak memory bandwidth * Operational Intensity). The first term bounds compute-intensive codes, the second memory-intensive ones. Peak performance is attainable iff OI_app >= PeakFLOPs/PeakBW. Compute-intensive iff OI_app >= (FLOPs/Byte)_platform; memory-intensive iff OI_app < (FLOPs/Byte)_platform.

17 Attainable performance 17 Attainable GFLOP/s = min(Peak floating-point performance, Peak memory bandwidth * Operational Intensity). Example: RGB-to-Gray, AI = 1.25. NVIDIA GTX680: P = min(3090, 1.25 * 192) = 240 GFLOP/s, only 7.8% of the peak. Intel MIC: P = min(2417, 1.25 * 352) = 440 GFLOP/s, only 18.2% of the peak.
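The arithmetic on this slide reduces to a one-line roofline helper. A minimal sketch; the peak GFLOP/s and GB/s numbers are the ones quoted on the slide, the rest is illustrative:

#include <stdio.h>

/* Roofline: attainable GFLOP/s = min(peak GFLOP/s, peak GB/s * operational intensity). */
static double attainable_gflops(double peak_gflops, double peak_gbs, double oi)
{
    double memory_bound = peak_gbs * oi;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double oi = 1.25;  /* RGB-to-Gray, from the slide */

    double gtx680 = attainable_gflops(3090.0, 192.0, oi);  /* -> 240 GFLOP/s */
    double mic    = attainable_gflops(2417.0, 352.0, oi);  /* -> 440 GFLOP/s */

    printf("GTX680: %.0f GFLOP/s (%.1f%% of peak)\n", gtx680, gtx680 / 3090.0 * 100.0);
    printf("MIC:    %.0f GFLOP/s (%.1f%% of peak)\n", mic, mic / 2417.0 * 100.0);
    return 0;
}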

18 The Roofline model 18 AMD Opteron X2 (two cores): 17.6 gflops, 15 GB/s, ops/byte = 1.17

19 Roofline: comparing architectures 19 AMD Opteron X2: 17.6 gflops, 15 GB/s, ops/byte = 1.17 AMD Opteron X4: 73.6 gflops, 15 GB/s, ops/byte = 4.9

20 Roofline: computational ceilings 20 AMD Opteron X2 (two cores): 17.6 gflops, 15 GB/s, ops/byte = 1.17

21 Roofline: bandwidth ceilings 21 AMD Opteron X2 (two cores): 17.6 gflops, 15 GB/s, ops/byte = 1.17

22 22 Roofline: optimization regions

23 Use the Roofline model 23 Determine what to do first to gain performance: increase the memory streaming rate, apply in-core optimizations, or increase the arithmetic intensity. Reader: Samuel Williams, Andrew Waterman, David Patterson, "Roofline: an insightful visual performance model for multicore architectures" (Communications of the ACM, 2009).

24 24 Programming many-cores

25 Programming many-cores 25 = parallel programming: Choose/design the algorithm. Parallelize the algorithm: expose enough layers of parallelism, minimize communication between concurrent operations, overlap computation and communication. Implement the parallel algorithm: choose a parallel programming model, (?) choose a many-core platform. Tune/optimize the application: apply platform-specific optimizations, (?) apply data-specific optimizations.

26 Layers of parallelism 26 Different (nested) levels at which concurrent operations can be defined. Many-cores have MORE parallelism layers than traditional machines. Parallelization goal: match the application's and the platform's parallelism layers. Example: RGB-to-Gray for a W x H image. Parallelism: per pixel, in two layers: vector units (SIMD) process 4 pixels at a time, and cores process image chunks.
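A minimal sketch of how those two layers could be expressed for RGB-to-Gray with OpenMP: the parallel for distributes rows (chunks) over cores, and the simd pragma asks the compiler to vectorize the pixel loop. The flat array layout and the Pixel struct are assumptions made for the sake of a compilable example:

typedef struct { float r, g, b; } Pixel;

/* Layer 1: threads on cores (rows). Layer 2: vector units (pixels within a row). */
void rgb_to_gray(int width, int height, const Pixel *rgb, float *gray)
{
    #pragma omp parallel for
    for (int y = 0; y < height; y++) {
        #pragma omp simd
        for (int x = 0; x < width; x++) {
            Pixel p = rgb[y * width + x];
            gray[y * width + x] = 0.30f * p.r + 0.59f * p.g + 0.11f * p.b;
        }
    }
}

Compiled with gcc, the pragmas only take effect with -fopenmp; without it the code still compiles and simply runs sequentially.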

27 Reason early about performance 27 Amdahl's law: the parallel part is assumed to be perfectly parallel! How fast can it really be? => Use the HW-achievable performance (Roofline). Perform optimization & tuning, guided by the Roofline model.
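A small numeric illustration of Amdahl's law; the parallel fraction and core counts below are made-up values, not from the slides:

#include <stdio.h>

/* Amdahl's law: S(n) = 1 / ((1 - p) + p / n), with p the parallel fraction. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.95;  /* assumed parallel fraction */
    for (int n = 2; n <= 512; n *= 4)
        printf("%3d cores: speedup %.2f\n", n, amdahl_speedup(p, n));
    printf("limit for n -> infinity: %.2f\n", 1.0 / (1.0 - p));
    return 0;
}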

28 28 Multi-core CPUs

29 General Purpose Processors 29 Architecture: a few fat cores; vectorization (Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX)); homogeneous; stand-alone. Memory: shared, multi-layered; per-core cache and shared cache. Programming: multi-threading; OS scheduler; coarse-grained parallelism.

30 30 Intel

31 AMD Magny-Cours 31 Two 6-core processors on a single chip Up to four of these chips in a single compute node 48 cores in total Non-uniform memory access (NUMA) Per-core cache Per-chip cache Local memory Remote memory (hypertransport)

32 32 AMD Magny-Cours

33 33 AMD Magny-Cours

34 AWARI on the Magny-Cours 34 DAS-2 (1999) 51 hours 72 machines / 144 cores 72 GB RAM in total 1.4 TB disk in total Magny-Cours (2011) 45 hours 1 machine, 48 cores 128 GB RAM in 1 machine 4.5 TB disk in 1 machine Less than 12 hours with new algorithm (needs more RAM)

35 Multi-core CPU programming 35 Two levels of parallelism: coarse-grain (threads/processes) and fine-grain (SIMD operations). Instantiate the threads: Pthreads, Java threads, OpenMP, MPI. Vectorize: rely on the compiler, or vectorize manually with vector types and intrinsics.
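A minimal Pthreads sketch of the coarse-grain level: the array is cut into per-thread slices and each thread adds its slice; the fine-grain SIMD level is left to the compiler or to the intrinsics shown on the next slides. The thread count and helper names are illustrative:

#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

typedef struct { const float *a, *b; float *c; int lo, hi; } Slice;

static void *add_slice(void *arg)
{
    Slice *s = (Slice *)arg;
    for (int i = s->lo; i < s->hi; i++)   /* the compiler may vectorize this loop */
        s->c[i] = s->a[i] + s->b[i];
    return NULL;
}

void vectoradd_threads(int size, const float *a, const float *b, float *c)
{
    pthread_t tid[NTHREADS];
    Slice work[NTHREADS];
    int chunk = size / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {  /* coarse-grain: one slice per thread */
        work[t].a = a; work[t].b = b; work[t].c = c;
        work[t].lo = t * chunk;
        work[t].hi = (t == NTHREADS - 1) ? size : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, add_slice, &work[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

The last thread picks up any remainder when size is not a multiple of NTHREADS; the sketch assumes size >= NTHREADS.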

36 Vectorization on x86 architectures 36
Since  Name                                            Bits     SP vector size  DP vector size
1996   MultiMedia eXtensions (MMX)                     64 bit   integer only    integer only
1999   Streaming SIMD Extensions (SSE)                 128 bit  4 float         2 double
2011   Advanced Vector Extensions (AVX)                256 bit  8 float         4 double
2012   Intel Xeon Phi accelerator (was Larrabee, MIC)  512 bit  16 float        8 double

37 Vectorizing with SSE 37 Assembly instructions; 16 vector registers. In C or C++: intrinsics - declare vector variables, name the instruction, and work on variables instead of registers.

38 Vectorizing with SSE examples 38
float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);

// Set all elements in my vector to zero.
__m128 myvector0 = _mm_setzero_ps();

// Load the first 4 elements of the array into my vector.
__m128 myvector1 = _mm_load_ps(data);

// Load the second 4 elements of the array into my vector.
__m128 myvector2 = _mm_load_ps(data + 4);

39 Vectorizing with SSE examples 39
// Add vectors 1 and 2; the instruction performs 4 FLOPs.
__m128 myvector3 = _mm_add_ps(myvector1, myvector2);

// Multiply vectors 1 and 2; the instruction performs 4 FLOPs.
__m128 myvector4 = _mm_mul_ps(myvector1, myvector2);

// Shuffle: the two low result lanes are selected from myvector1 and the two
// high lanes from myvector2, using the element indices given by _MM_SHUFFLE.
__m128 myvector5 = _mm_shuffle_ps(myvector1, myvector2, _MM_SHUFFLE(2, 3, 0, 1));

40 Vector add 40
void vectoradd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

41 Vector add with SSE: unroll loop 41
void vectoradd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        c[i+0] = a[i+0] + b[i+0];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
}

42 Vector add with SSE: vectorize loop 42
void vectoradd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        __m128 veca = _mm_load_ps(a + i);      // load 4 elts from a
        __m128 vecb = _mm_load_ps(b + i);      // load 4 elts from b
        __m128 vecc = _mm_add_ps(veca, vecb);  // add four elts
        _mm_store_ps(c + i, vecc);             // store four elts
    }
}
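One practical caveat: _mm_load_ps and _mm_store_ps require 16-byte aligned addresses (otherwise the unaligned _mm_loadu_ps/_mm_storeu_ps variants are needed), and the loop above assumes size is a multiple of 4. A minimal, illustrative driver that allocates aligned buffers with _mm_malloc:

#include <xmmintrin.h>
#include <stdio.h>

void vectoradd(int size, float* a, float* b, float* c);  /* the version above */

int main(void) {
    int n = 1024;  /* multiple of 4, as the vectorized loop assumes */

    /* _mm_malloc returns 16-byte aligned memory, matching _mm_load_ps/_mm_store_ps. */
    float *a = _mm_malloc(n * sizeof(float), 16);
    float *b = _mm_malloc(n * sizeof(float), 16);
    float *c = _mm_malloc(n * sizeof(float), 16);

    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }
    vectoradd(n, a, b, c);
    printf("c[10] = %f\n", c[10]);  /* expect 30.0 */

    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}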

43 Optional assignment 43 Implement a vectorized version of: element-wise array multiplication with complex numbers, and element-wise array division with complex numbers. Compile with gcc and measure the performance with and without vectorization. Send the (pseudo-)code (and performance numbers, if you have them) by e-mail to A.L.Varbanescu@uva.nl

44 Programming models 44 OpenMP. TBB (Threading Building Blocks): a threading library. ArBB (Array Building Blocks): an array-based language. Cilk: a simple divide-and-conquer model. OpenCL: to be discussed.

45 45 Intel Xeon Phi

46 Intel's Larrabee 46 GPU based on x86 architecture Hardware multithreading Wide SIMD Achieved 1 tflop sustained application performance (SC09) Canceled in Dec 2009, re-targeted to HPC market

47 Intel Xeon Phi 47 First product: Knights Corner, a GPU-like accelerator. ±60 Pentium 1-like cores. L1 cache per core (32KB I$ + 32KB D$). Unified L2 cache (512KB/core => ~30MB/chip). 512-bit SIMD: 16 SP FLOPs, 16 int ops, or 8 DP FLOPs per cycle; no support for MMX, SSE or AVX - it uses its own 512-bit vector instruction set (a precursor to AVX-512). At least 8GB of GDDR5. 1 teraflop double precision. Programming is x86 compatible: OpenMP, OpenCL, Cilk, TBB, parallel libraries.

48 48 Architecture of Xeon Phi

49 Programming Xeon Phi 49 Example:
void VectorAdd(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

50 Programming Xeon Phi: OpenMP 50 OpenMP: compiler + library, auto-vectorization.
// MIC function to add two vectors
__attribute__((target(mic)))
void add_mic(int* a, int* b, int* c, int n) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// main program
int main() {
    int i, in1[SIZE], in2[SIZE], res[SIZE];
    #pragma offload target(mic) in(in1,in2) inout(res)
    {
        add_mic(in1, in2, res, SIZE);
    }
}

51 Programming Xeon Phi: Cilk 51 Write sequential code and add keywords (cilk, spawn, sync) to parallelize it divide-and-conquer style:
cilk void VectorAdd(float *a, float *b, float *c, int n) {
    if (n < grainsize) {
        int i;
        for (i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    } else {
        spawn VectorAdd(a, b, c, n/2);
        spawn VectorAdd(a + n/2, b + n/2, c + n/2, n/2);
        sync;
    }
}

52 52 GPUs Overview

53 It's all about the memory 53 [Diagram: a CPU with a few cores sharing one channel to memory vs. a many-core chip with many cores and multiple memory channels.]

54 Integration into host system 54 Typically PCI Express 2.0 x16. Theoretical speed 8 GB/s; protocol overhead brings this down to ~6 GB/s; in reality 4-6 GB/s. PCIe 3.0 recently available: double the bandwidth and less protocol overhead.

55 Lessons from the Graphics Pipeline 55 Throughput is paramount: every pixel must be painted within the frame time => scalability. Create, run, & retire lots of threads very rapidly: 14.8 billion threads/s measured on an increment() kernel. Use multithreading to hide latency: 1 stalled thread is OK if 100 are ready to run.

56 CPU vs GPU 56 Movie: The Mythbusters (Jamie Hyneman & Adam Savage, Discovery Channel), appearance at NVIDIA's NVISION 2008.

57 Why is this different from a CPU? 57 Different goals produce different designs! GPU assumes work load is highly parallel CPU must be good at everything, parallel or not CPU: minimize latency experienced by 1 thread big on-chip caches sophisticated control logic GPU: maximize throughput of all threads # threads in flight limited by resources => lots of resources (registers, etc.) multithreading can hide latency => no big caches share control logic across many threads

58 Chip area: CPU vs GPU 58 [Diagram: the CPU die is split among control logic, a few ALUs, and a large cache; the GPU die is dominated by ALUs.]

59 Flynn's taxonomy revisited 59
                       Single Data   Multiple Data
Single Instruction     SISD          SIMD
Multiple Instruction   MISD          MIMD
GPUs don't fit!

60 Key architectural Ideas 60 Data parallel, like a vector machine There, 1 thread issues parallel vector instructions SIMT (Single Instruction Multiple Thread) execution Many threads work on a vector, each on a different element They all execute the same instruction HW automatically handles divergence Hardware multithreading HW resource allocation & thread scheduling HW relies on threads to hide latency Context switching is (basically) free

61 61 Extra Slides: Intel SCC

62 Intel Single-chip Cloud Computer 62 Architecture Tile-based many-core (48 cores) A tile is a dual-core Stand-alone Memory Per-core and per-tile Shared off-chip Programming Multi-processing with message passing User-controlled mapping/scheduling Gain performance Coarse-grain parallelism Multi-application workloads (cluster-like)

63 63 Intel Single-chip Cloud Computer

64 Intel SCC Tile 64 2 cores 16K L1 cache per core 256K L2 per core 8K Message passing buffer On-chip network router

65 65 Extra Slides: Cell Broadband Engine

66 Cell/B.E. 66 Architecture Heterogeneous 1 PowerPC (PPE) 8 vector-processors (SPEs) Programming User-controlled scheduling 6 levels of parallelism, all under user control Fine- and coarse-grain parallelism

67 Cell/B.E. memory 67 Normal main memory: the PPE uses normal reads/writes; the SPEs use asynchronous manual transfers (Direct Memory Access, DMA). Per-core fast memory: the Local Store (LS), an application-managed cache of 256 KB; 128 x 128-bit vector registers.

68 68 Cell/B.E.

69 Roadrunner (IBM) 69 Los Alamos National Laboratory. #1 of the Top500 from June 2008 to November 2009, now #19. 122,400 cores, 1.4 petaflops; the first petaflop system. PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz.

70 The Cell s vector instructions 70 Differences with SSE SPEs execute only vector instructions More advanced shuffling Not 16, but 128 registers! Fused Multiply Add support

71 FMA instruction 71 Multiply-Add (MAD): D = A * B + C, where the product A * B is rounded (digits truncated) before C is added. Fused Multiply-Add (FMA): D = A * B + C, where all digits of the product are retained until the addition, so there is no loss of precision.
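The same effect can be observed from C through the standard fmaf() function (C99 <math.h>), which computes a*b+c with a single rounding, just like a hardware FMA. A small illustrative sketch that uses the fused operation to recover the rounding error of an ordinary float multiply (link with -lm under gcc):

#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f / 3.0f, b = 3.0f;

    float prod = a * b;              /* product rounded to float               */
    float err  = fmaf(a, b, -prod);  /* FMA keeps all product digits, so this  */
                                     /* yields the exact rounding error of a*b */

    printf("a*b rounded   : %.9g\n", prod);
    printf("rounding error: %.9g\n", err);
    return 0;
}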

72 Cell Programming models 72 IBM Cell SDK C + MPI OpenCL Many models from academia...

73 Cell SDK 73 Threads, but only on the PPE Distributed memory Local stores = application-managed cache! DMA transfers Signaling and mailboxes Vectorization

74 Direct Memory Access (DMA) 74 Start an asynchronous DMA: mfc_get(local store address, main mem address, #bytes, tag); Wait for the DMA to finish: mfc_write_tag_mask(tag); mfc_read_tag_status_all(); DMA lists. Overlap communication with useful work: double buffering.

75 Vector sum 75
float vectorsum(int size, float* vector) {
    float result = 0.0;
    for (int i = 0; i < size; i++) {
        result += vector[i];
    }
    return result;
}

76 Parallelization strategy 76 Partition problem into 8 pieces (Assuming a chunk fits in the Local Store) PPE starts 8 SPE threads Each SPE processes 1 piece Has to load data from PPE with DMA PPE adds the 8 sub-results

77 Vector sum SPE code (1) 77
float vectorsum(int size, float* PPEVector) {
    float result = 0.0;
    int chunksize = size / NR_SPES;  // Partition the data.

    float localbuffer[chunksize];    // Allocate a buffer in my private local store.
    int tag = 42;

    // Points to my chunk in PPE memory.
    float* myremotechunk = PPEVector + chunksize * MY_SPE_NUMBER;

78 Vector sum SPE code (2) 78
    // Copy the input data from the PPE.
    mfc_get(localbuffer, myremotechunk, chunksize*4, tag);
    mfc_write_tag_mask(tag);
    mfc_read_tag_status_all();

    // The real work.
    for (int i = 0; i < chunksize; i++) {
        result += localbuffer[i];
    }
    return result;
}

79 79 Can we optimize this strategy?

80 Can we optimize this strategy? 80 Vectorization. Overlap communication and computation: double buffering. Strategy: split the data into more chunks than there are SPEs, and let each SPE download the next chunk while processing the current chunk.

81 DMA double buffering example (1) 81
float vectorsum(float* PPEVector, int size, int nrchunks) {
    float result = 0.0;
    int chunksize = size / nrchunks;
    int chunksperspe = nrchunks / NR_SPES;
    int firstchunk = MY_SPE_NUMBER * chunksperspe;
    int lastchunk = firstchunk + chunksperspe;

    // Allocate two buffers in my private local store.
    float localbuffer[2][chunksize];
    int currentbuffer = 0;

    // Start asynchronous DMA of the first chunk.
    float* myremotechunk = PPEVector + firstchunk * chunksize;
    mfc_get(localbuffer[currentbuffer], myremotechunk, chunksize*4, currentbuffer);

82 82 DMA double buffering example (2)
    for (int chunk = firstchunk; chunk < lastchunk; chunk++) {
        // Prefetch the next chunk asynchronously.
        if (chunk != lastchunk - 1) {
            float* nextremotechunk = PPEVector + (chunk+1) * chunksize;
            mfc_get(localbuffer[!currentbuffer], nextremotechunk, chunksize*4, !currentbuffer);
        }

        // Wait for the current buffer's DMA to finish.
        mfc_write_tag_mask(currentbuffer);
        mfc_read_tag_status_all();

        // The real work.
        for (int i = 0; i < chunksize; i++)
            result += localbuffer[currentbuffer][i];

        currentbuffer = !currentbuffer;
    }
    return result;
}

83 Double and triple buffering 83 Read-only data: double buffering. Read-write data: triple buffering! A work buffer, a prefetch buffer (asynchronous download), and a finished buffer (asynchronous upload). This is a general technique: on-chip networks, GPUs (PCI-e), MPI (clusters).

84 84 ATI GPUs

85 Latest generation ATI 85 Southern Islands. 1 chip: HD 7970, 2048 cores, 264 GB/sec memory bandwidth, 3.8 tflops single and 947 gflops double precision, maximum power 250 Watts, 399 euros! 2 chips: HD 7990, 4096 cores, 7.6 tflops. Comparison: the entire 72-node DAS-4 VU cluster has 4.4 tflops.

86 ATI 5870 architecture overview 86

87 ATI 5870 SIMD engine 87 Each of the 20 SIMD engines has: 16 thread processors x 5 stream cores = 80 scalar stream processing units 20 * 16 * 5 = 1600 cores total 32KB Local Data Share its own control logic and runs from a shared set of threads a dedicated fetch unit with 8KB L1 cache a 64KB global data share to communicate with other SIMD engines

88 ATI 5870 thread processor 88 Each thread processor includes: 4 stream cores + 1 special-function stream core; general purpose registers; FMA in a single clock.

89 ATI 5870 Memory Hierarchy 89 EDC (Error Detection Code): CRC checks on data transfers for improved reliability at high clock speeds. Bandwidths: up to 1 TB/sec L1 texture fetch bandwidth; up to 435 GB/sec between L1 & L2; 153.6 GB/s to device memory; PCI-e 2.0 16x: 8 GB/s to main memory.

90 Unified Load/Store Addressing 90 Non-unified address space: separate pointers per memory space (*p_local, *p_shared, *p_device). Unified address space: a single pointer *p can address local, shared, and device memory.
