MANY-CORE COMPUTING. 10-Oct-2013. Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, eScience Center
1 MANY-CORE COMPUTING 10-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, eScience Center
2 Schedule 2 1. Introduction, performance metrics. 2. Programming many-cores (10/10): performance evaluation and analysis; programming multi-/many-core CPUs; programming GPUs. 3. CUDA programming (14/10). 4. Case study: the LOFAR telescope with many-cores (Rob van Nieuwpoort, on 17/10).
3 3 Performance analysis Operational Intensity and the Roofline model
4 An Example 4 I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want better performance! How fast can you make it? Execution time => what users care about! How much will parallelism bring? Compute speed-up => use the correct reference. How do I know you're not bluffing? Compute achievable performance => depends on the architecture. Which architecture do you *really* need? Compute utilization => how much to spend.
5 Software performance metrics (3 Ps) 5 Performance: execution time => speed-up; computational throughput (GFLOP/s) => computational efficiency (i.e., utilization); bandwidth (GB/s) => memory efficiency (i.e., utilization). Productivity and Portability: programmability; production costs; maintenance costs.
6 Performance evaluation 6 Measure execution time: Tpar => absolute performance. Calculate speed-up: S = Tseq / Tpar => relative performance. Neither takes the application into account! Execution time and speed-up can be used to compare implementations of the same algorithm.
7 Performance evaluation 7 Calculate throughput: GFLOP/s = #FLOPs / Tpar => takes the application into account! Calculate compute efficiency: Ec = achieved GFLOP/s / peak * 100. Calculate bandwidth: BW = #bytes(RD+WR) / Tpar => takes the application into account! Calculate bandwidth efficiency: Ebw = BW / peak * 100. Achieved bandwidth and throughput can be used to compare *different* algorithms. Efficiency can be used to compare *different* (application, platform) combinations.
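The metrics above translate mechanically into code; a minimal sketch (the function and parameter names are mine, not from the slides):

```cpp
#include <cassert>

// Computational throughput in GFLOP/s: total floating-point
// operations divided by parallel execution time (seconds).
double throughput_gflops(double flops, double t_par) {
    return flops / t_par / 1e9;
}

// Efficiency (utilization) as a percentage of the platform peak.
// Works for both compute (GFLOP/s) and bandwidth (GB/s) peaks.
double efficiency_pct(double achieved, double peak) {
    return achieved / peak * 100.0;
}

// Relative speed-up of a parallel run over a sequential reference.
double speedup(double t_seq, double t_par) {
    return t_seq / t_par;
}
```

For example, 5e9 FLOPs executed in 2 seconds is 2.5 GFLOP/s; against a 10 GFLOP/s peak that is 25% utilization.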
8 Performance analysis 8 Real-life performance vs. theoretical limits. Understand bottlenecks. Perform the correct optimizations. Decide when to stop fiddling with the code!!! Computing the theoretical limits is the most difficult challenge in parallel performance analysis. Using theoretical peak limits alone => low accuracy. Use the application characteristics together with the platform characteristics.
9 Arithmetic intensity 9 The number of arithmetic (floating point) operations per byte of memory that is accessed. Compute-intensive? Data-intensive? It's an application characteristic! Ignore overheads: loop counters, array index calculations, branches.
10 RGB to gray 10
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
    }
}
11 RGB to gray 11
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        Pixel pixel = RGB[y][x];
        gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
    }
}
2 x ADD, 3 x MUL = 5 OPs; 3 x RD, 1 x WR = 4 memory accesses; AI = 5/4 = 1.25
12 Many-core platforms 12 (Table of cores, threads/ALUs, GFLOPS, bandwidth, and FLOPs/byte for: Sun Niagara 2, IBM BG/P, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, AMD HD 6970, AMD HD 7970, Intel MIC.)
13 Compute or memory intensive? 13 (Chart: the FLOPs/byte ratio of Sun Niagara 2, IBM BG/P, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, AMD HD 6970, and Intel Xeon Phi 7120, plotted against RGB-to-gray.) "A multi-/many-core processor is a device built to turn a compute-intensive application into a memory-intensive one" Kathy Yelick, UC Berkeley
14 Applications AI 14 (Figure: arithmetic intensity increasing from O(1) through O(log(n)) to O(N): SpMV and BLAS 1,2 => stencils (PDEs) => lattice methods => FFTs => dense linear algebra (BLAS3) => particle methods.)
15 Operational intensity 15 The number of operations per byte of DRAM traffic. Differences with arithmetic intensity: operations, not just arithmetic; memory operations are counted after they have been filtered by the cache hierarchy, i.e., not between processor and cache, but between cache and DRAM memory. Optional assignment (send to A.L.Varbanescu@uva.nl): explain the difference between operational intensity and arithmetic intensity using an example. All correct solutions will receive a small prize.
16 Attainable performance 16 Attainable GFLOP/s = min(peak floating-point performance [compute intensive], peak memory bandwidth * operational intensity [memory intensive]). Peak performance is reached iff OI_app >= PeakFLOPs / PeakBW. Compute-intensive iff OI_app >= (FLOPs/byte)_platform; memory-intensive iff OI_app < (FLOPs/byte)_platform.
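The min() formula translates directly into a one-line helper; a sketch (the function name is mine):

```cpp
#include <cassert>

// Roofline: attainable performance is capped either by the
// platform's peak FLOP rate (compute-bound) or by how fast
// memory can feed the cores (bandwidth * operational intensity).
double attainable_gflops(double peak_gflops,
                         double peak_bw_gbs,
                         double op_intensity) {
    double memory_bound = peak_bw_gbs * op_intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With the GTX 680 numbers used on the next slide (3090 GFLOP/s peak, 192 GB/s, OI = 1.25) this gives min(3090, 240) = 240 GFLOP/s: the kernel is firmly memory-bound.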
17 Attainable performance 17 Attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth * operational intensity). Example: RGB-to-gray, AI = 1.25. NVIDIA GTX 680: P = min(3090, 1.25 * 192) = 240 GFLOP/s, only 7.8% of the peak. Intel MIC: P = min(2417, 1.25 * 352) = 440 GFLOP/s, only 18.2% of the peak.
18 The Roofline model 18 AMD Opteron X2 (two cores): 17.6 GFLOP/s, 15 GB/s, ops/byte = 1.17
19 Roofline: comparing architectures 19 AMD Opteron X2: 17.6 GFLOP/s, 15 GB/s, ops/byte = 1.17. AMD Opteron X4: 73.6 GFLOP/s, 15 GB/s, ops/byte = 4.9
20 Roofline: computational ceilings 20 AMD Opteron X2 (two cores): 17.6 GFLOP/s, 15 GB/s, ops/byte = 1.17
21 Roofline: bandwidth ceilings 21 AMD Opteron X2 (two cores): 17.6 GFLOP/s, 15 GB/s, ops/byte = 1.17
22 22 Roofline: optimization regions
23 Use the Roofline model 23 Determine what to do first to gain performance: increase the memory streaming rate; apply in-core optimizations; increase the arithmetic intensity. Reading: Samuel Williams, Andrew Waterman, David Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM, 2009.
24 24 Programming many-cores
25 Programming many-cores 25 = parallel programming: Choose/design an algorithm. Parallelize the algorithm: expose enough layers of parallelism; minimize communication between concurrent operations; overlap computation and communication. Implement the parallel algorithm: choose a parallel programming model; (?) choose a many-core platform. Tune/optimize the application: apply platform-specific optimizations; (?) apply data-specific optimizations.
26 Layers of parallelism 26 Different (nested) levels at which concurrent operations can be defined. Many-cores have MORE parallelism layers than traditional machines. Parallelization goal: match the application and platform parallelism layers. Example: RGB-to-gray for a W x H image. Parallelism: per pixel. Two layers: vector units (SIMD) => process 4 pixels at a time; cores => process image chunks.
27 Reason early about performance 27 Amdahl's law: the parallel part is assumed perfectly parallel! How fast can it really be? Use the HW achievable performance (Roofline). Perform optimization & tuning, guided by the Roofline model.
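Amdahl's law itself fits in one line; a sketch (the helper name is mine) of the speedup for a program whose fraction p runs perfectly in parallel on n cores:

```cpp
#include <cassert>

// Amdahl's law: S(n) = 1 / ((1 - p) + p / n), where p is the
// fraction of the program that parallelizes and n is the number
// of cores. The serial fraction (1 - p) bounds the speedup.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / static_cast<double>(n));
}
```

Even with p = 0.9, ten cores give only about a 5.3x speedup, and no number of cores can push it past 10x; this is why the "reason early" advice above matters.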
28 28 Multi-core CPUs
29 General Purpose Processors 29 Architecture: few fat cores; vectorization (Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX)); homogeneous; stand-alone. Memory: shared, multi-layered; per-core cache and shared cache. Programming: multi-threading; OS scheduler; coarse-grained parallelism.
30 30 Intel
31 AMD Magny-Cours 31 Two 6-core processors on a single chip Up to four of these chips in a single compute node 48 cores in total Non-uniform memory access (NUMA) Per-core cache Per-chip cache Local memory Remote memory (hypertransport)
32 32 AMD Magny-Cours
33 33 AMD Magny-Cours
34 AWARI on the Magny-Cours 34 DAS-2 (1999): 51 hours; 72 machines / 144 cores; 72 GB RAM in total; 1.4 TB disk in total. Magny-Cours (2011): 45 hours; 1 machine, 48 cores; 128 GB RAM in 1 machine; 4.5 TB disk in 1 machine. Less than 12 hours with a new algorithm (needs more RAM).
35 Multi-core CPU programming 35 Two levels of parallelism: coarse-grain (threads / processes) and fine-grain (SIMD operations). Instantiate the threads: Pthreads, Java threads, OpenMP, MPI. Vectorize: rely on compilers, or vectorize manually (vector types, intrinsics).
36 Vectorization on x86 architectures 36
Since | Name                                           | Bits    | Single-precision vector | Double-precision vector
1996  | MultiMedia eXtensions (MMX)                    | 64 bit  | integer only            | integer only
1999  | Streaming SIMD Extensions (SSE)                | 128 bit | 4 floats                | 2 doubles
2011  | Advanced Vector Extensions (AVX)               | 256 bit | 8 floats                | 4 doubles
2012  | Intel Xeon Phi accelerator (was Larrabee, MIC) | 512 bit | 16 floats               | 8 doubles
37 Vectorizing with SSE 37 Assembly level: instructions operating on 16 vector registers. C or C++: intrinsics. Declare vector variables; each intrinsic names an instruction; work on variables, not registers.
38 Vectorizing with SSE examples 38
float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);
// Set all elements in my vector to zero.
__m128 myvector0 = _mm_setzero_ps();
// Load the first 4 elements of the array into my vector.
__m128 myvector1 = _mm_load_ps(data);
// Load the second 4 elements of the array into my vector.
__m128 myvector2 = _mm_load_ps(data + 4);
39 Vectorizing with SSE examples 39
// Add vectors 1 and 2; one instruction performs 4 FLOPs.
__m128 myvector3 = _mm_add_ps(myvector1, myvector2);
// Multiply vectors 1 and 2; one instruction performs 4 FLOPs.
__m128 myvector4 = _mm_mul_ps(myvector1, myvector2);
// Shuffle: the low half of the result selects two elements from
// myvector1, the high half two elements from myvector2.
__m128 myvector5 = _mm_shuffle_ps(myvector1, myvector2, _MM_SHUFFLE(2, 3, 0, 1));
40 Vector add 40 void vectoradd(int size, float* a, float* b, float* c) { for(int i=0; i<size; i++) { c[i] = a[i] + b[i]; } }
41 Vector add with SSE: unroll loop 41 void vectoradd(int size, float* a, float* b, float* c) { for(int i=0; i<size; i += 4) { c[i+0] = a[i+0] + b[i+0]; c[i+1] = a[i+1] + b[i+1]; c[i+2] = a[i+2] + b[i+2]; c[i+3] = a[i+3] + b[i+3]; } }
42 Vector add with SSE: vectorize loop 42
void vectoradd(int size, float* a, float* b, float* c) {
    for (int i = 0; i < size; i += 4) {
        __m128 veca = _mm_load_ps(a + i);   // load 4 elts from a
        __m128 vecb = _mm_load_ps(b + i);   // load 4 elts from b
        __m128 vecc = _mm_add_ps(veca, vecb); // add four elts
        _mm_store_ps(c + i, vecc);          // store four elts
    }
}
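A self-contained, compilable variant of the kernel above (x86 only; I use `_mm_loadu_ps`/`_mm_storeu_ps`, the unaligned loads and stores, so the caller's buffers need not be 16-byte aligned, and add a scalar tail loop for sizes that are not a multiple of 4):

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...

// Vectorized add: 4 floats per iteration, scalar tail for the rest.
void vectoradd_sse(int size, const float* a, const float* b, float* c) {
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128 veca = _mm_loadu_ps(a + i);        // load 4 elts of a
        __m128 vecb = _mm_loadu_ps(b + i);        // load 4 elts of b
        _mm_storeu_ps(c + i, _mm_add_ps(veca, vecb)); // add and store
    }
    for (; i < size; i++)                          // scalar tail
        c[i] = a[i] + b[i];
}
```

The aligned variants on the slide are faster when you control the allocation; the unaligned ones are the safe default for arbitrary pointers.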
43 Optional assignment 43 Implement a vectorized version of: element-wise array multiplication, with complex numbers; element-wise array division, with complex numbers. Compile with gcc and measure performance with/without vectorization. Send (pseudo-)code (and performance numbers, if you have them) by e-mail to A.L.Varbanescu@uva.nl
44 Programming models 44 OpenMP. TBB (Threading Building Blocks): threading library. ArBB (Array Building Blocks): array-based language. Cilk: simple divide-and-conquer model. OpenCL: to be discussed.
45 45 Intel Xeon Phi
46 Intel's Larrabee 46 A GPU based on the x86 architecture: hardware multithreading, wide SIMD. Achieved 1 TFLOP sustained application performance (SC09). Canceled in Dec 2009, re-targeted to the HPC market.
47 Intel Xeon Phi 47 First product: Knights Corner. GPU-like accelerator: ±60 Pentium 1-like cores. L1 cache per core (32KB I$ + 32KB D$). Unified L2 cache (512KB/core => ~30MB/chip). 512-bit SIMD: 16 SP FLOPs, 16 int ops, 8 DP FLOPs per cycle; no support for MMX, SSE, or AVX; it has its own 512-bit vector instruction set. At least 8GB of GDDR5. 1 teraflop double precision. Programming is x86-compatible: OpenMP, OpenCL, Cilk, TBB, parallel libraries.
48 48 Architecture of Xeon Phi
49 Programming Xeon Phi 49 Example:
void VectorAdd(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
50 Programming Xeon Phi: OpenMP 50 OpenMP: compiler + library, auto-vectorization.
// MIC function to add two vectors
__attribute__((target(mic)))
void add_mic(int* a, int* b, int* c, int n) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
// main program
int main() {
    int in1[SIZE], in2[SIZE], res[SIZE];
    #pragma offload target(mic) in(in1, in2) inout(res)
    {
        add_mic(in1, in2, res, SIZE);
    }
}
51 Programming Xeon Phi: Cilk 51 Write sequential code, then add keywords to parallelize it divide-and-conquer style:
cilk void VectorAdd(float *a, float *b, float *c, int n) {
    if (n < GRAINSIZE) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    } else {
        spawn VectorAdd(a, b, c, n/2);
        spawn VectorAdd(a + n/2, b + n/2, c + n/2, n - n/2);
        sync;
    }
}
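The same divide-and-conquer pattern can be written with OpenMP tasks, which are more widely available today than the original Cilk keywords; a sketch (GRAINSIZE and the function name are mine). Without `-fopenmp` the pragmas are ignored and the recursion simply runs serially, which is still correct:

```cpp
#include <cassert>

#define GRAINSIZE 1024

// Recursive vector add: below GRAINSIZE, compute directly;
// otherwise split the range and process the halves as independent
// tasks. The second half is n - n/2 so odd sizes lose no elements.
void vectoradd_dc(float *a, float *b, float *c, int n) {
    if (n < GRAINSIZE) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    } else {
        #pragma omp task
        vectoradd_dc(a, b, c, n / 2);
        #pragma omp task
        vectoradd_dc(a + n / 2, b + n / 2, c + n / 2, n - n / 2);
        #pragma omp taskwait
    }
}
```

The taskwait plays the role of Cilk's sync: the parent may not return until both halves are done.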
52 52 GPUs Overview
53 It's all about the memory 53 (Diagram: a CPU with a handful of cores sharing one memory channel, versus a many-core chip whose many cores are fed through multiple memory channels.)
54 Integration into host system 54 Typically PCI Express 2.0 x16. Theoretical speed 8 GB/s; protocol overhead brings this down to about 6 GB/s; in reality 4-6 GB/s. V3.0 recently available: double the bandwidth, less protocol overhead.
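These bus numbers matter because every byte must cross PCIe before the accelerator can touch it; a quick back-of-the-envelope helper (mine, not from the slides):

```cpp
#include <cassert>

// Estimated one-way transfer time in seconds for `bytes` bytes over
// a link with the given effective bandwidth in GB/s (e.g. ~6 GB/s
// for PCIe 2.0 x16 after protocol overhead).
double transfer_seconds(double bytes, double effective_gbs) {
    return bytes / (effective_gbs * 1e9);
}
```

A 6 GB data set at 6 GB/s costs a full second before any kernel runs, which is one reason overlapping transfers with computation (discussed later under double buffering) pays off.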
55 Lessons from the Graphics Pipeline 55 Throughput is paramount: must paint every pixel within the frame time; scalability. Create, run, & retire lots of threads very rapidly: measured 14.8 billion threads/s on an increment() kernel. Use multithreading to hide latency: 1 stalled thread is OK if 100 are ready to run.
56 CPU vs GPU 56 Movie The Mythbusters Jamie Hyneman & Adam Savage Discovery Channel Appearance at NVIDIA s NVISION 2008
57 Why is this different from a CPU? 57 Different goals produce different designs! The GPU assumes the workload is highly parallel; the CPU must be good at everything, parallel or not. CPU: minimize the latency experienced by 1 thread; big on-chip caches; sophisticated control logic. GPU: maximize the throughput of all threads; #threads in flight limited by resources => lots of resources (registers, etc.); multithreading can hide latency => no big caches; share control logic across many threads.
58 Chip area: CPU vs GPU 58 (Diagram: the CPU die is dominated by control logic and cache, with a few ALUs; the GPU die is dominated by ALUs.)
59 Flynn's taxonomy revisited 59
                     | Single Data | Multiple Data
Single Instruction   | SISD        | SIMD
Multiple Instruction | MISD        | MIMD
GPUs don't fit!
60 Key architectural ideas 60 Data parallel, like a vector machine; there, 1 thread issues parallel vector instructions. SIMT (Single Instruction, Multiple Thread) execution: many threads work on a vector, each on a different element; they all execute the same instruction; HW automatically handles divergence. Hardware multithreading: HW resource allocation & thread scheduling; the HW relies on threads to hide latency; context switching is (basically) free.
61 61 Extra Slides: Intel SCC
62 Intel Single-chip Cloud Computer 62 Architecture Tile-based many-core (48 cores) A tile is a dual-core Stand-alone Memory Per-core and per-tile Shared off-chip Programming Multi-processing with message passing User-controlled mapping/scheduling Gain performance Coarse-grain parallelism Multi-application workloads (cluster-like)
63 63 Intel Single-chip Cloud Computer
64 Intel SCC Tile 64 2 cores; 16K L1 cache per core; 256K L2 per core; 8K message passing buffer; on-chip network router.
65 65 Extra Slides: Cell Broadband Engine
66 Cell/B.E. 66 Architecture Heterogeneous 1 PowerPC (PPE) 8 vector-processors (SPEs) Programming User-controlled scheduling 6 levels of parallelism, all under user control Fine- and coarse-grain parallelism
67 Cell/B.E. memory 67 Normal main memory: the PPE does normal reads/writes; the SPEs use asynchronous manual transfers (Direct Memory Access, DMA). Per-core fast memory: the Local Store (LS), an application-managed cache of 256 KB, plus 128 x 128-bit vector registers.
68 68 Cell/B.E.
69 Roadrunner (IBM) 69 Los Alamos National Laboratory. #1 on the Top500 from June 2008 to November 2009, now #19. 122,400 cores, 1.4 petaflops: the first petaflop system. PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz.
70 The Cell's vector instructions 70 Differences with SSE: the SPEs execute only vector instructions; more advanced shuffling; not 16, but 128 registers; Fused Multiply-Add support.
71 FMA instruction 71 Multiply-Add (MAD): D = A * B + C. The product A * B is rounded (digits truncated) before the addition, so precision is lost in the intermediate result. Fused Multiply-Add (FMA): D = A * B + C in one step. The full-precision product is retained and fed directly into the addition: no loss of precision.
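C99/C++11 expose this operation directly as `fma()` in the standard math library, which computes a*b+c with a single rounding step, like the hardware instruction described above:

```cpp
#include <cassert>
#include <cmath>

// D = A * B + C with a single rounding: the full-precision product
// feeds the adder directly, as in a hardware FMA unit.
double fused_madd(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Plain multiply-add: the product a*b is rounded to a double first,
// then the sum is rounded again (two rounding steps).
double plain_madd(double a, double b, double c) {
    return a * b + c;
}
```

For most inputs the two agree; they differ exactly when rounding the intermediate product a*b loses bits, which is why FMA-based dot products and polynomial evaluations are slightly more accurate.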
72 Cell Programming models 72 IBM Cell SDK C + MPI OpenCL Many models from academia...
73 Cell SDK 73 Threads, but only on the PPE. Distributed memory: local stores = application-managed cache! DMA transfers. Signaling and mailboxes. Vectorization.
74 Direct Memory Access (DMA) 74 Start an asynchronous DMA: mfc_get(local store address, main memory address, #bytes, tag); Wait for the DMA to finish: mfc_write_tag_mask(tag); mfc_read_tag_status_all(); DMA lists. Overlap communication with useful work: double buffering.
75 Vector sum 75 float vectorsum(int size, float* vector) { float result = 0.0; for(int i=0; i<size; i++) { result += vector[i]; } return result; }
76 Parallelization strategy 76 Partition the problem into 8 pieces (assuming a chunk fits in the Local Store). The PPE starts 8 SPE threads. Each SPE processes 1 piece; it has to load its data from the PPE with DMA. The PPE adds the 8 sub-results.
77 Vector sum SPE code (1) 77 float vectorsum(int size, float* PPEVector) { float result = 0.0; int chunksize = size / NR_SPES; // Partition the data. float localbuffer[chunksize]; // Allocate a buffer in // my private local store. int tag = 42; // Points to my chunk in PPE memory. float* myremotechunk = PPEVector + chunksize * MY_SPE_NUMBER;
78 Vector sum SPE code (2) 78
    // Copy the input data from the PPE.
    mfc_get(localbuffer, myremotechunk, chunksize*4, tag);
    mfc_write_tag_mask(tag);
    mfc_read_tag_status_all();
    // The real work.
    for(int i=0; i<chunksize; i++) {
        result += localbuffer[i];
    }
    return result;
}
79 79 Can we optimize this strategy?
80 Can we optimize this strategy? 80 Vectorization. Overlap communication and computation: double buffering. Strategy: split into more chunks than SPEs; let each SPE download the next chunk while processing the current chunk.
81 DMA double buffering example (1) 81
float vectorsum(float* PPEVector, int size, int nrchunks) {
    float result = 0.0;
    int chunksize = size / nrchunks;
    int chunksperspe = nrchunks / NR_SPES;
    int firstchunk = MY_SPE_NUMBER * chunksperspe;
    int lastchunk = firstchunk + chunksperspe;
    // Allocate two buffers in my private local store.
    float localbuffer[2][chunksize];
    int currentbuffer = 0;
    // Start asynchronous DMA of the first chunk.
    float* myremotechunk = PPEVector + firstchunk * chunksize;
    mfc_get(localbuffer[currentbuffer], myremotechunk, chunksize*4, currentbuffer);
82 DMA double buffering example (2) 82
    for (int chunk = firstchunk; chunk < lastchunk; chunk++) {
        // Prefetch the next chunk asynchronously.
        if (chunk != lastchunk - 1) {
            float* nextremotechunk = PPEVector + (chunk+1) * chunksize;
            mfc_get(localbuffer[!currentbuffer], nextremotechunk, chunksize*4, !currentbuffer);
        }
        // Wait for the current buffer's DMA to finish.
        mfc_write_tag_mask(currentbuffer);
        mfc_read_tag_status_all();
        // The real work.
        for (int i = 0; i < chunksize; i++)
            result += localbuffer[currentbuffer][i];
        currentbuffer = !currentbuffer;
    }
    return result;
}
83 Double and triple buffering 83 Read-only data: double buffering. Read-write data: triple buffering! A work buffer; a prefetch buffer (asynchronous download); a finished buffer (asynchronous upload). A general technique: on-chip networks, GPUs (PCIe), MPI (clusters).
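The structure of double buffering is platform-neutral; a sketch (mine, not from the slides) in which an ordinary `memcpy` stands in for the asynchronous `mfc_get`. On a real SPE the copy of chunk k+1 would be in flight while chunk k is being summed:

```cpp
#include <cassert>
#include <cstring>

// Sum `size` floats chunk by chunk through two local buffers,
// alternating which one is being "filled" and which is being
// processed. memcpy stands in for the asynchronous DMA.
// Assumes chunksize <= 256 and size is a multiple of chunksize.
float double_buffered_sum(const float* src, int size, int chunksize) {
    float buf[2][256];   // two local buffers ("local store")
    int cur = 0;
    float result = 0.0f;

    // "Start" the transfer of the first chunk.
    std::memcpy(buf[cur], src, chunksize * sizeof(float));

    for (int off = 0; off < size; off += chunksize) {
        // "Prefetch" the next chunk into the other buffer.
        if (off + chunksize < size)
            std::memcpy(buf[!cur], src + off + chunksize,
                        chunksize * sizeof(float));
        // Process the current chunk.
        for (int i = 0; i < chunksize; i++)
            result += buf[cur][i];
        cur = !cur;  // swap work and prefetch buffers
    }
    return result;
}
```

With truly asynchronous copies, the transfer of each chunk is hidden behind the summation of the previous one, which is the entire point of the pattern.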
84 84 ATI GPUs
85 Latest generation ATI 85 Southern Islands. 1 chip (HD 7970): 264 GB/s memory bandwidth; 3.8 TFLOPS single precision, 947 GFLOPS double precision; maximum power 250 Watts; 399 euros! 2 chips: 7.6 TFLOPS. Comparison: the entire 72-node DAS-4 VU cluster has 4.4 TFLOPS.
86 ATI 5870 architecture overview 86
87 ATI 5870 SIMD engine 87 Each of the 20 SIMD engines has: 16 thread processors x 5 stream cores = 80 scalar stream processing units (20 * 16 * 5 = 1600 cores in total); 32KB Local Data Share; its own control logic, running from a shared set of threads; a dedicated fetch unit with 8KB L1 cache; a 64KB global data share to communicate with other SIMD engines.
88 ATI 5870 thread processor 88 Each thread processor includes: 4 stream cores + 1 special-function stream core; general-purpose registers; FMA in a single clock.
89 ATI 5870 Memory Hierarchy 89 EDC (Error Detection Code): CRC checks on data transfers for improved reliability at high clock speeds. Bandwidths: up to 1 TB/s L1 texture fetch bandwidth; up to 435 GB/s between L1 & L2; GB/s to device memory; PCIe 2.0 x16: 8 GB/s to main memory.
90 Unified Load/Store Addressing 90 Non-unified address space: a separate pointer type per memory space (local *p_local, shared *p_shared, device *p_device). Unified address space: a single pointer *p can address local, shared, and device memory.
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationarxiv: v1 [astro-ph.im] 2 Feb 2017
International Journal of Parallel Programming manuscript No. (will be inserted by the editor) Correlating Radio Astronomy Signals with Many-Core Hardware Rob V. van Nieuwpoort John W. Romein arxiv:1702.00844v1
More informationChapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationMassively Parallel Architectures
Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger
More informationCOSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors
COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationChapter 7. Multicores, Multiprocessors, and Clusters
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationIntroduction II. Overview
Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and
More informationSCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING
2/20/13 CS 594: SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE MODELING Heike McCraw mccraw@icl.utk.edu 1. Basic Essentials OUTLINE Abstract architecture model Communication, Computation, and Locality
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationWhat does Heterogeneity bring?
What does Heterogeneity bring? Ken Koch Scientific Advisor, CCS-DO, LANL LACSI 2006 Conference October 18, 2006 Some Terminology Homogeneous Of the same or similar nature or kind Uniform in structure or
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming. Lecturer: Alan Christopher
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 30: GP-GPU Programming Lecturer: Alan Christopher Overview GP-GPU: What and why OpenCL, CUDA, and programming GPUs GPU Performance
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationNumerical Simulation on the GPU
Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationVector Processors and Graphics Processing Units (GPUs)
Vector Processors and Graphics Processing Units (GPUs) Many slides from: Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley TA Evaluations Please fill out your
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationIntroduction to CUDA Programming
Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationCase study: OpenMP-parallel sparse matrix-vector multiplication
Case study: OpenMP-parallel sparse matrix-vector multiplication A simple (but sometimes not-so-simple) example for bandwidth-bound code and saturation effects in memory Sparse matrix-vector multiply (spmvm)
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationCOMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:
More informationHow HPC Hardware and Software are Evolving Towards Exascale
How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationAntonio R. Miele Marco D. Santambrogio
Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationGPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten
GPU Computing: Development and Analysis Part 1 Anton Wijs Muhammad Osama Marieke Huisman Sebastiaan Joosten NLeSC GPU Course Rob van Nieuwpoort & Ben van Werkhoven Who are we? Anton Wijs Assistant professor,
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationAccelerator Programming Lecture 1
Accelerator Programming Lecture 1 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de January 11, 2016 Accelerator Programming
More information