
1 MANY-CORE COMPUTING 10-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, escience Center

2 Schedule 2 1. Introduction, performance metrics 2. Programming many-cores (10/10) 1. Performance Evaluation and Analysis 2. Programming multi-/many-core CPUs 3. Programming GPUs 3. CUDA Programming (14/10) 4. Case study: LOFAR telescope with many-cores (Rob van Nieuwpoort, on 17/10)

3 3 Performance analysis Operational Intensity and the Roofline model

4 An Example 4 I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want better performance! How fast can you make it? => Execution time is what users care about! How much will parallelism bring? => Compute the speed-up, using the correct reference. How do I know you're not bluffing? => Compute the achievable performance; it depends on the architecture. Which architecture do you *really* need? => Compute the utilization, to decide how much to spend.

5 Software performance metrics (3 P's) 5 Performance: execution time => speed-up; computational throughput (GFLOP/s) => computational efficiency (i.e., utilization); bandwidth (GB/s) => memory efficiency (i.e., utilization). Productivity and Portability: programmability, production costs, maintenance costs.

6 Performance evaluation 6 Measure execution time: Tpar => absolute performance. Calculate speed-up: S = Tseq / Tpar => relative performance. These do not take the application into account! Execution time and speed-up can only be used to compare implementations of the same algorithm.

7 Performance evaluation 7 Calculate throughput: GFLOP/s = #FLOPs / Tpar => takes the application into account! Calculate compute efficiency: Ec = achieved GFLOP/s / peak * 100. Calculate bandwidth: BW = #bytes(RD+WR) / Tpar => takes the application into account! Calculate bandwidth efficiency: Ebw = BW / peak * 100. Achieved bandwidth and throughput can be used to compare *different* algorithms. Efficiency can be used to compare *different* (application, platform) combinations.
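These metrics are simple arithmetic once the run has been timed and the operation and traffic counts are known. A minimal sketch, assuming made-up peak numbers, timings, and FLOP/byte counts (none of them come from the slides):

#include <stdio.h>

/* Illustrative peak numbers for a hypothetical platform (assumptions). */
#define PEAK_GFLOPS 1000.0   /* peak compute throughput, GFLOP/s */
#define PEAK_GBS     100.0   /* peak memory bandwidth, GB/s      */

int main(void) {
    /* Measured / counted quantities for one kernel run (illustrative values). */
    double t_seq = 12.0;     /* sequential execution time, seconds  */
    double t_par =  1.5;     /* parallel execution time, seconds    */
    double flops = 600e9;    /* floating-point operations executed  */
    double bytes = 120e9;    /* bytes read + written                */

    double speedup     = t_seq / t_par;                 /* S = Tseq / Tpar        */
    double gflops      = flops / t_par / 1e9;           /* achieved GFLOP/s       */
    double bw          = bytes / t_par / 1e9;           /* achieved GB/s          */
    double eff_compute = gflops / PEAK_GFLOPS * 100.0;  /* compute efficiency, %  */
    double eff_bw      = bw / PEAK_GBS * 100.0;         /* bandwidth efficiency, % */

    printf("speedup %.2f, %.1f GFLOP/s (%.1f%% of peak), %.1f GB/s (%.1f%% of peak)\n",
           speedup, gflops, eff_compute, bw, eff_bw);
    return 0;
}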

8 Performance analysis 8 Real-life performance vs. theoretical limits: understand the bottlenecks, perform the correct optimizations, and decide when to stop fiddling with the code!!! Computing the theoretical limits is the most difficult challenge in parallel performance analysis. Using the theoretical peak limits alone gives low accuracy => use the application characteristics and the platform characteristics.

9 Arithmetic intensity 9 The number of arithmetic (floating-point) operations per byte of memory that is accessed. Is the application compute-intensive or data-intensive? It's an application characteristic! Ignore overheads: loop counters, array index calculations, branches.

10 RGB to gray 10
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
    }
}

11 RGB to gray 11
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
    }
}
2 x ADD, 3 x MUL = 5 OPs
3 x RD, 1 x WR = 4 memory accesses
AI = 5/4 = 1.25

12 Many-core platforms 12 [Table comparing Cores, Threads/ALUs, GFLOPS, Bandwidth, and FLOPs/Byte for Sun Niagara 2, IBM BG/P, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, two AMD HD GPUs, and Intel MIC; the numeric values did not survive the transcription.]

13 Compute or memory intensive? 13 [Chart comparing the FLOPs/Byte ratio of Sun Niagara 2, IBM BG/P, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, AMD HD 6970, and Intel Xeon Phi (7120) against the RGB-to-Gray arithmetic intensity.] "A multi-/many-core processor is a device built to turn a compute-intensive application into a memory-intensive one" - Kathy Yelick, UC Berkeley

14 Applications' AI 14 Arithmetic intensity spectrum: O(1) - SpMV, BLAS 1/2, stencils (PDEs), lattice methods; O(log(N)) - FFTs; O(N) - dense linear algebra (BLAS3), particle methods.
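To make the low end of this spectrum concrete: a plain CSR sparse matrix-vector multiply does about 2 flops per non-zero but reads a matrix value, a column index, and an x element for each of them, so its arithmetic intensity stays O(1). A minimal sketch (the CSR field names are illustrative, not from the slides):

/* y = A*x for a sparse matrix stored in CSR format (illustrative layout).
 * Per non-zero: 1 multiply + 1 add, but ~3 memory accesses (value, column
 * index, x[col]) => arithmetic intensity of roughly O(1). */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y)
{
    for (int row = 0; row < nrows; row++) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}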

15 Operational intensity 15 The number of operations per byte of DRAM traffic. Differences with arithmetic intensity: it counts operations, not just arithmetic ones, and the memory traffic is counted after it has been filtered by the cache hierarchy - not between processor and cache, but between cache and DRAM memory. Optional assignment (send to A.L.Varbanescu@uva.nl): explain the difference between operational intensity and arithmetic intensity using an example. All correct solutions will receive a small prize.

16 Attainable performance 16 Attainable GFLOP/s = min(Peak floating-point performance, Peak memory bandwidth * Operational Intensity). The first term bounds compute-intensive codes, the second memory-intensive ones. Peak performance is attainable iff OI_app >= PeakFLOPs/PeakBW. Compute-intensive iff OI_app >= (FLOPs/Byte)_platform; memory-intensive iff OI_app < (FLOPs/Byte)_platform.

17 Attainable performance 17 Attainable GFLOP/s = min(Peak floating-point performance, Peak memory bandwidth * Operational Intensity). Example: RGB-to-Gray, AI = 1.25. NVIDIA GTX680: P = min(3090, 1.25 * 192) = 240 GFLOP/s, only 7.8% of the peak. Intel MIC: P = min(2417, 1.25 * 352) = 440 GFLOP/s, only 18.2% of the peak.
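The arithmetic on this slide reduces to a one-line roofline helper. A minimal sketch; the peak GFLOP/s and GB/s numbers are the ones quoted on the slide, the rest is illustrative:

#include <stdio.h>

/* Roofline: attainable GFLOP/s = min(peak GFLOP/s, peak GB/s * operational intensity). */
static double attainable_gflops(double peak_gflops, double peak_gbs, double oi)
{
    double memory_bound = peak_gbs * oi;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    double oi = 1.25;  /* RGB-to-Gray, from the slide */

    double gtx680 = attainable_gflops(3090.0, 192.0, oi);  /* -> 240 GFLOP/s */
    double mic    = attainable_gflops(2417.0, 352.0, oi);  /* -> 440 GFLOP/s */

    printf("GTX680: %.0f GFLOP/s (%.1f%% of peak)\n", gtx680, gtx680 / 3090.0 * 100.0);
    printf("MIC:    %.0f GFLOP/s (%.1f%% of peak)\n", mic, mic / 2417.0 * 100.0);
    return 0;
}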

18 The Roofline model 18 AMD Opteron X2 (two cores): 17.6 gflops, 15 GB/s, ops/byte = 1.17

19 Roofline: comparing architectures 19 AMD Opteron X2: 17.6 gflops, 15 GB/s, ops/byte = 1.17 AMD Opteron X4: 73.6 gflops, 15 GB/s, ops/byte = 4.9

20 Roofline: computational ceilings 20 AMD Opteron X2 (two cores): 17.6 gflops, 15 GB/s, ops/byte = 1.17

21 Roofline: bandwidth ceilings 21 AMD Opteron X2 (two cores): 17.6 gflops, 15 GB/s, ops/byte = 1.17

22 22 Roofline: optimization regions

23 Use the Roofline model 23 Determine what to do first to gain performance: increase the memory streaming rate, apply in-core optimizations, or increase the arithmetic intensity. Reader: Samuel Williams, Andrew Waterman, David Patterson, "Roofline: an insightful visual performance model for multicore architectures" (Communications of the ACM, 2009).

24 24 Programming many-cores

25 Programming many-cores 25 = parallel programming: Choose/design the algorithm. Parallelize the algorithm: expose enough layers of parallelism, minimize communication between concurrent operations, overlap computation and communication. Implement the parallel algorithm: choose a parallel programming model, (?) choose a many-core platform. Tune/optimize the application: apply platform-specific optimizations, (?) apply data-specific optimizations.

26 Layers of parallelism 26 Different (nested) levels at which concurrent operations can be defined. Many-cores have MORE parallelism layers than traditional machines. Parallelization goal: match the application's and the platform's parallelism layers. Example: RGB-to-Gray for a W x H image. Parallelism: per pixel, in two layers: vector units (SIMD) process 4 pixels at a time, and cores process image chunks.
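A minimal sketch of how those two layers could be expressed for RGB-to-Gray with OpenMP: the parallel for distributes rows (chunks) over cores, and the simd pragma asks the compiler to vectorize the pixel loop. The flat array layout and the Pixel struct are assumptions made for the sake of a compilable example:

typedef struct { float r, g, b; } Pixel;

/* Layer 1: threads on cores (rows). Layer 2: vector units (pixels within a row). */
void rgb_to_gray(int width, int height, const Pixel *rgb, float *gray)
{
    #pragma omp parallel for
    for (int y = 0; y < height; y++) {
        #pragma omp simd
        for (int x = 0; x < width; x++) {
            Pixel p = rgb[y * width + x];
            gray[y * width + x] = 0.30f * p.r + 0.59f * p.g + 0.11f * p.b;
        }
    }
}

Compiled with gcc, the pragmas only take effect with -fopenmp; without it the code still compiles and simply runs sequentially.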

27 Reason early about performance 27 Amdahl's law: the parallel part is assumed to be perfectly parallel! How fast can it really be? => Use the HW-achievable performance (Roofline). Perform optimization & tuning, guided by the Roofline model.
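A small numeric illustration of Amdahl's law; the parallel fraction and core counts below are made-up values, not from the slides:

#include <stdio.h>

/* Amdahl's law: S(n) = 1 / ((1 - p) + p / n), with p the parallel fraction. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.95;  /* assumed parallel fraction */
    for (int n = 2; n <= 512; n *= 4)
        printf("%3d cores: speedup %.2f\n", n, amdahl_speedup(p, n));
    printf("limit for n -> infinity: %.2f\n", 1.0 / (1.0 - p));
    return 0;
}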

28 28 Multi-core CPUs

29 General Purpose Processors 29 Architecture: a few fat cores; vectorization (Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX)); homogeneous; stand-alone. Memory: shared, multi-layered; per-core cache and shared cache. Programming: multi-threading; OS scheduler; coarse-grained parallelism.

30 30 Intel

31 AMD Magny-Cours 31 Two 6-core processors on a single chip Up to four of these chips in a single compute node 48 cores in total Non-uniform memory access (NUMA) Per-core cache Per-chip cache Local memory Remote memory (hypertransport)

32 32 AMD Magny-Cours

33 33 AMD Magny-Cours

34 AWARI on the Magny-Cours 34 DAS-2 (1999) 51 hours 72 machines / 144 cores 72 GB RAM in total 1.4 TB disk in total Magny-Cours (2011) 45 hours 1 machine, 48 cores 128 GB RAM in 1 machine 4.5 TB disk in 1 machine Less than 12 hours with new algorithm (needs more RAM)

35 Multi-core CPU programming 35 Two levels of parallelism: coarse-grain (threads/processes) and fine-grain (SIMD operations). Instantiate the threads: Pthreads, Java threads, OpenMP, MPI. Vectorize: rely on the compiler, or vectorize manually with vector types and intrinsics.
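A minimal Pthreads sketch of the coarse-grain level: the array is cut into per-thread slices and each thread adds its slice; the fine-grain SIMD level is left to the compiler or to the intrinsics shown on the next slides. The thread count and helper names are illustrative:

#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

typedef struct { const float *a, *b; float *c; int lo, hi; } Slice;

static void *add_slice(void *arg)
{
    Slice *s = (Slice *)arg;
    for (int i = s->lo; i < s->hi; i++)   /* the compiler may vectorize this loop */
        s->c[i] = s->a[i] + s->b[i];
    return NULL;
}

void vectoradd_threads(int size, const float *a, const float *b, float *c)
{
    pthread_t tid[NTHREADS];
    Slice work[NTHREADS];
    int chunk = size / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {  /* coarse-grain: one slice per thread */
        work[t].a = a; work[t].b = b; work[t].c = c;
        work[t].lo = t * chunk;
        work[t].hi = (t == NTHREADS - 1) ? size : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, add_slice, &work[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

The last thread picks up any remainder when size is not a multiple of NTHREADS; the sketch assumes size >= NTHREADS.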

36 Vectorization on x86 architectures 36
Since  Name                                            Bits     SP vector size  DP vector size
1996   MultiMedia eXtensions (MMX)                     64 bit   integer only    integer only
1999   Streaming SIMD Extensions (SSE)                 128 bit  4 float         2 double
2011   Advanced Vector Extensions (AVX)                256 bit  8 float         4 double
2012   Intel Xeon Phi accelerator (was Larrabee, MIC)  512 bit  16 float        8 double

37 Vectorizing with SSE 37 Assembly instructions; 16 vector registers. In C or C++: intrinsics - declare vector variables, name the instruction, and work on variables instead of registers.

38 Vectorizing with SSE examples 38
float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);

// Set all elements in my vector to zero.
__m128 myvector0 = _mm_setzero_ps();

// Load the first 4 elements of the array into my vector.
__m128 myvector1 = _mm_load_ps(data);

// Load the second 4 elements of the array into my vector.
__m128 myvector2 = _mm_load_ps(data + 4);

39 Vectorizing with SSE examples 39
// Add vectors 1 and 2; the instruction performs 4 FLOPs.
__m128 myvector3 = _mm_add_ps(myvector1, myvector2);

// Multiply vectors 1 and 2; the instruction performs 4 FLOPs.
__m128 myvector4 = _mm_mul_ps(myvector1, myvector2);

// Shuffle: the two low result lanes are selected from myvector1 and the two
// high lanes from myvector2, using the element indices given by _MM_SHUFFLE.
__m128 myvector5 = _mm_shuffle_ps(myvector1, myvector2, _MM_SHUFFLE(2, 3, 0, 1));

40 Vector add 40
void vectoradd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

41 Vector add with SSE: unroll loop 41
void vectoradd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        c[i+0] = a[i+0] + b[i+0];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
}

42 Vector add with SSE: vectorize loop 42
void vectoradd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        __m128 veca = _mm_load_ps(a + i);      // load 4 elts from a
        __m128 vecb = _mm_load_ps(b + i);      // load 4 elts from b
        __m128 vecc = _mm_add_ps(veca, vecb);  // add four elts
        _mm_store_ps(c + i, vecc);             // store four elts
    }
}
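One practical caveat: _mm_load_ps and _mm_store_ps require 16-byte aligned addresses (otherwise the unaligned _mm_loadu_ps/_mm_storeu_ps variants are needed), and the loop above assumes size is a multiple of 4. A minimal, illustrative driver that allocates aligned buffers with _mm_malloc:

#include <xmmintrin.h>
#include <stdio.h>

void vectoradd(int size, float* a, float* b, float* c);  /* the version above */

int main(void) {
    int n = 1024;  /* multiple of 4, as the vectorized loop assumes */

    /* _mm_malloc returns 16-byte aligned memory, matching _mm_load_ps/_mm_store_ps. */
    float *a = _mm_malloc(n * sizeof(float), 16);
    float *b = _mm_malloc(n * sizeof(float), 16);
    float *c = _mm_malloc(n * sizeof(float), 16);

    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }
    vectoradd(n, a, b, c);
    printf("c[10] = %f\n", c[10]);  /* expect 30.0 */

    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}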

43 Optional assignment 43 Implement a vectorized version of: element-wise array multiplication with complex numbers, and element-wise array division with complex numbers. Compile with gcc and measure the performance with and without vectorization. Send the (pseudo-)code (and performance numbers, if you have them) by e-mail to A.L.Varbanescu@uva.nl

44 Programming models 44 OpenMP. TBB (Threading Building Blocks): a threading library. ArBB (Array Building Blocks): an array-based language. Cilk: a simple divide-and-conquer model. OpenCL: to be discussed.

45 45 Intel Xeon Phi

46 Intel's Larrabee 46 GPU based on x86 architecture Hardware multithreading Wide SIMD Achieved 1 tflop sustained application performance (SC09) Canceled in Dec 2009, re-targeted to HPC market

47 Intel Xeon Phi 47 First product: Knights Corner, a GPU-like accelerator. ±60 Pentium 1-like cores. L1 cache per core (32KB I$ + 32KB D$). Unified L2 cache (512KB/core => ~30MB/chip). 512-bit SIMD: 16 SP FLOPs, 16 int ops, or 8 DP FLOPs per cycle; no support for MMX, SSE or AVX - it uses its own 512-bit vector instruction set (a precursor to AVX-512). At least 8GB of GDDR5. 1 teraflop double precision. Programming is x86 compatible: OpenMP, OpenCL, Cilk, TBB, parallel libraries.

48 48 Architecture of Xeon Phi

49 Programming Xeon Phi 49 Example:
void VectorAdd(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

50 Programming Xeon Phi: OpenMP 50 OpenMP: compiler + library, auto-vectorization.
// MIC function to add two vectors
__attribute__((target(mic)))
void add_mic(int* a, int* b, int* c, int n) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// main program
int main() {
    int i, in1[SIZE], in2[SIZE], res[SIZE];
    #pragma offload target(mic) in(in1,in2) inout(res)
    {
        add_mic(in1, in2, res, SIZE);
    }
}

51 Programming Xeon Phi: Cilk 51 Write sequential code and add keywords (cilk, spawn, sync) to parallelize it divide-and-conquer style:
cilk void VectorAdd(float *a, float *b, float *c, int n) {
    if (n < grainsize) {
        int i;
        for (i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
    } else {
        spawn VectorAdd(a, b, c, n/2);
        spawn VectorAdd(a + n/2, b + n/2, c + n/2, n/2);
        sync;
    }
}

52 52 GPUs Overview

53 It's all about the memory 53 [Diagram: a CPU with a few cores sharing one channel to memory vs. a many-core chip with many cores and multiple memory channels.]

54 Integration into host system 54 Typically PCI Express 2.0 x16. Theoretical speed 8 GB/s; protocol overhead brings this down to ~6 GB/s; in reality 4-6 GB/s. PCIe 3.0 recently available: double the bandwidth and less protocol overhead.

55 Lessons from the Graphics Pipeline 55 Throughput is paramount: every pixel must be painted within the frame time => scalability. Create, run, & retire lots of threads very rapidly: 14.8 billion threads/s measured on an increment() kernel. Use multithreading to hide latency: 1 stalled thread is OK if 100 are ready to run.

56 CPU vs GPU 56 Movie: The Mythbusters (Jamie Hyneman & Adam Savage, Discovery Channel), appearance at NVIDIA's NVISION 2008.

57 Why is this different from a CPU? 57 Different goals produce different designs! GPU assumes work load is highly parallel CPU must be good at everything, parallel or not CPU: minimize latency experienced by 1 thread big on-chip caches sophisticated control logic GPU: maximize throughput of all threads # threads in flight limited by resources => lots of resources (registers, etc.) multithreading can hide latency => no big caches share control logic across many threads

58 Chip area: CPU vs GPU 58 [Diagram: the CPU die is split among control logic, a few ALUs, and a large cache; the GPU die is dominated by ALUs.]

59 Flynn's taxonomy revisited 59
                       Single Data   Multiple Data
Single Instruction     SISD          SIMD
Multiple Instruction   MISD          MIMD
GPUs don't fit!

60 Key architectural Ideas 60 Data parallel, like a vector machine There, 1 thread issues parallel vector instructions SIMT (Single Instruction Multiple Thread) execution Many threads work on a vector, each on a different element They all execute the same instruction HW automatically handles divergence Hardware multithreading HW resource allocation & thread scheduling HW relies on threads to hide latency Context switching is (basically) free

61 61 Extra Slides: Intel SCC

62 Intel Single-chip Cloud Computer 62 Architecture Tile-based many-core (48 cores) A tile is a dual-core Stand-alone Memory Per-core and per-tile Shared off-chip Programming Multi-processing with message passing User-controlled mapping/scheduling Gain performance Coarse-grain parallelism Multi-application workloads (cluster-like)

63 63 Intel Single-chip Cloud Computer

64 Intel SCC Tile 64 2 cores 16K L1 cache per core 256K L2 per core 8K Message passing buffer On-chip network router

65 65 Extra Slides: Cell Broadband Engine

66 Cell/B.E. 66 Architecture Heterogeneous 1 PowerPC (PPE) 8 vector-processors (SPEs) Programming User-controlled scheduling 6 levels of parallelism, all under user control Fine- and coarse-grain parallelism

67 Cell/B.E. memory 67 Normal main memory: the PPE uses normal reads/writes; the SPEs use asynchronous manual transfers (Direct Memory Access, DMA). Per-core fast memory: the Local Store (LS), an application-managed cache of 256 KB; 128 x 128-bit vector registers.

68 68 Cell/B.E.

69 Roadrunner (IBM) 69 Los Alamos National Laboratory. #1 of the Top500 from June 2008 to November 2009, now #19. 122,400 cores, 1.4 petaflops; the first petaflop system. PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz.

70 The Cell s vector instructions 70 Differences with SSE SPEs execute only vector instructions More advanced shuffling Not 16, but 128 registers! Fused Multiply Add support

71 FMA instruction 71 Multiply-Add (MAD): D = A * B + C, where the product A * B is rounded (digits truncated) before C is added. Fused Multiply-Add (FMA): D = A * B + C, where all digits of the product are retained until the addition, so there is no loss of precision.
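The same effect can be observed from C through the standard fmaf() function (C99 <math.h>), which computes a*b+c with a single rounding, just like a hardware FMA. A small illustrative sketch that uses the fused operation to recover the rounding error of an ordinary float multiply (link with -lm under gcc):

#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f / 3.0f, b = 3.0f;

    float prod = a * b;              /* product rounded to float               */
    float err  = fmaf(a, b, -prod);  /* FMA keeps all product digits, so this  */
                                     /* yields the exact rounding error of a*b */

    printf("a*b rounded   : %.9g\n", prod);
    printf("rounding error: %.9g\n", err);
    return 0;
}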

72 Cell Programming models 72 IBM Cell SDK C + MPI OpenCL Many models from academia...

73 Cell SDK 73 Threads, but only on the PPE Distributed memory Local stores = application-managed cache! DMA transfers Signaling and mailboxes Vectorization

74 Direct Memory Access (DMA) 74 Start an asynchronous DMA: mfc_get(local store address, main mem address, #bytes, tag); Wait for the DMA to finish: mfc_write_tag_mask(tag); mfc_read_tag_status_all(); DMA lists. Overlap communication with useful work: double buffering.

75 Vector sum 75
float vectorsum(int size, float* vector) {
    float result = 0.0;
    for (int i = 0; i < size; i++) {
        result += vector[i];
    }
    return result;
}

76 Parallelization strategy 76 Partition problem into 8 pieces (Assuming a chunk fits in the Local Store) PPE starts 8 SPE threads Each SPE processes 1 piece Has to load data from PPE with DMA PPE adds the 8 sub-results

77 Vector sum SPE code (1) 77
float vectorsum(int size, float* PPEVector) {
    float result = 0.0;
    int chunksize = size / NR_SPES;  // Partition the data.

    float localbuffer[chunksize];    // Allocate a buffer in my private local store.
    int tag = 42;

    // Points to my chunk in PPE memory.
    float* myremotechunk = PPEVector + chunksize * MY_SPE_NUMBER;

78 Vector sum SPE code (2) 78
    // Copy the input data from the PPE.
    mfc_get(localbuffer, myremotechunk, chunksize*4, tag);
    mfc_write_tag_mask(tag);
    mfc_read_tag_status_all();

    // The real work.
    for (int i = 0; i < chunksize; i++) {
        result += localbuffer[i];
    }
    return result;
}

79 79 Can we optimize this strategy?

80 Can we optimize this strategy? 80 Vectorization. Overlap communication and computation: double buffering. Strategy: split the data into more chunks than there are SPEs, and let each SPE download the next chunk while processing the current chunk.

81 DMA double buffering example (1) 81
float vectorsum(float* PPEVector, int size, int nrchunks) {
    float result = 0.0;
    int chunksize = size / nrchunks;
    int chunksperspe = nrchunks / NR_SPES;
    int firstchunk = MY_SPE_NUMBER * chunksperspe;
    int lastchunk = firstchunk + chunksperspe;

    // Allocate two buffers in my private local store.
    float localbuffer[2][chunksize];
    int currentbuffer = 0;

    // Start asynchronous DMA of the first chunk.
    float* myremotechunk = PPEVector + firstchunk * chunksize;
    mfc_get(localbuffer[currentbuffer], myremotechunk, chunksize*4, currentbuffer);

82 82 DMA double buffering example (2)
    for (int chunk = firstchunk; chunk < lastchunk; chunk++) {
        // Prefetch the next chunk asynchronously.
        if (chunk != lastchunk - 1) {
            float* nextremotechunk = PPEVector + (chunk+1) * chunksize;
            mfc_get(localbuffer[!currentbuffer], nextremotechunk, chunksize*4, !currentbuffer);
        }

        // Wait for the current buffer's DMA to finish.
        mfc_write_tag_mask(currentbuffer);
        mfc_read_tag_status_all();

        // The real work.
        for (int i = 0; i < chunksize; i++)
            result += localbuffer[currentbuffer][i];

        currentbuffer = !currentbuffer;
    }
    return result;
}

83 Double and triple buffering 83 Read-only data: double buffering. Read-write data: triple buffering! A work buffer, a prefetch buffer (asynchronous download), and a finished buffer (asynchronous upload). This is a general technique: on-chip networks, GPUs (PCI-e), MPI (clusters).

84 84 ATI GPUs

85 Latest generation ATI 85 Southern Islands. 1 chip: HD 7970, 2048 cores, 264 GB/sec memory bandwidth, 3.8 tflops single and 947 gflops double precision, maximum power 250 Watts, 399 euros! 2 chips: HD 7990, 4096 cores, 7.6 tflops. Comparison: the entire 72-node DAS-4 VU cluster has 4.4 tflops.

86 ATI 5870 architecture overview 86

87 ATI 5870 SIMD engine 87 Each of the 20 SIMD engines has: 16 thread processors x 5 stream cores = 80 scalar stream processing units 20 * 16 * 5 = 1600 cores total 32KB Local Data Share its own control logic and runs from a shared set of threads a dedicated fetch unit with 8KB L1 cache a 64KB global data share to communicate with other SIMD engines

88 ATI 5870 thread processor 88 Each thread processor includes: 4 stream cores + 1 special-function stream core; general purpose registers; FMA in a single clock.

89 ATI 5870 Memory Hierarchy 89 EDC (Error Detection Code): CRC checks on data transfers for improved reliability at high clock speeds. Bandwidths: up to 1 TB/sec L1 texture fetch bandwidth; up to 435 GB/sec between L1 & L2; 153.6 GB/s to device memory; PCI-e 2.0 16x: 8 GB/s to main memory.

90 Unified Load/Store Addressing 90 Non-unified address space: separate pointers per memory space (*p_local, *p_shared, *p_device). Unified address space: a single pointer *p can address local, shared, and device memory.
