Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique
|
|
- Hannah Johnston
- 6 years ago
- Views:
Transcription
1 Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique
2 Outline of the course Feb 29: Introduction to GPU architecture Let's pretend we are a GPU architect Performance models March 7?: GPU programming The software side Programming model March 14: Performance optimization Possible bottlenecks Common optimization techniques 4 labs
3 GPGPU in HPC, today Graphics Processing Unit (GPU) Made for computer games: high-volume market Low unit price, amortized R&D Inexpensive, high-performance parallel processor 2002: General-Purpose computation on GPU (GPGPU) 2016: the 2 fastest supercomputers use GPUs/accelerators #1 Tianhe-2 (China) #2 Titan (USA) 18,648 (1 AMD Opteron + 1 NVIDIA Tesla K20X) 16,000 (2 Intel Xeon + 3 Intel Xeon Phi) 3
4 GPGPU in the future? Yesterday ( ) Homogeneous multi-core Discrete components Today ( ) Chip-level integration Central Processing Unit (CPU) Graphics Processing Unit (GPU) Many embedded SoCs Intel Sandy Bridge AMD Fusion NVIDIA Denver/Maxwell project Tomorrow Heterogeneous multi-core GPUs to blend into throughput-optimized cores? Latencyoptimized cores Throughputoptimized cores Hardware accelerators Heterogeneous multi-core chip 4
5 Today: GPU internals What makes a GPU tick? NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering! 5
6 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 6
7 The free lunch era... was yesterday 1980's to 2002: Moore's law, Dennard scaling, micro-architecture improvements Exponential performance increase Software compatibility preserved Hennessy, Patterson. Computer Architecture, a quantitative approach. 4 th Ed Do not rewrite software, buy a new machine! 7
8 Technology evolution Memory wall Compute Performance Gap Memory Memory speed does not increase as fast as computing speed Harder to hide memory latency Time Power wall Power consumption of transistors does not decrease as fast as density increases Transistor density Performance is now limited by power consumption Transistor power ILP wall Law of diminishing returns on Instruction-Level Parallelism Total power Time Cost Pollack rule: cost performance² Serial performance 8
9 Usage changes New applications demand parallel processing Computer games : 3D graphics Search engines, social networks big data processing New computing devices are power-constrained Laptops, cell phones, tablets Small, light, battery-powered Datacenters High power supply and cooling costs 9
10 Latency vs. throughput Latency: time to solution Minimize time, at the expense of power Metric: time e.g. seconds Throughput: quantity of tasks processed per unit of time Assumes unlimited parallelism Minimize energy per operation Metric: operations / time e.g. Gflops / s CPU: optimized for latency GPU: optimized for throughput 10
11 Amdahl's law Bounds speedup attainable on a parallel machine 1 S= Time to run sequential portions 1 P P N Time to run parallel portions S P N Speedup Ratio of parallel portions Number of processors S (speedup) N (available processors) G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS
12 Why heterogeneous architectures? Time to run sequential portions S= 1 P 1 P N Time to run parallel portions Latency-optimized multi-core (CPU) Low efficiency on parallel portions: spends too much resources Throughput-optimized multi-core (GPU) Low performance on sequential portions Heterogeneous multi-core (CPU+GPU) Use the right tool for the right job Allows aggressive optimization for latency or for throughput M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer,
13 Example: System on Chip for smartphone Small cores for background activity GPU Big cores for applications Lots of interfaces Special-purpose accelerators 13
14 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 14
15 The (simplest) graphics rendering pipeline Primitives (triangles ) Vertices Fragment shader Vertex shader Textures Z-Compare Blending Clipping, Rasterization Attribute interpolation Pixels Fragments Framebuffer Z-Buffer Programmable stage Parametrizable stage 15
16 How much performance do we need to run 3DMark 11 at 50 frames/second? Element Per frame Per second Vertices 12.0M 600M Primitives 12.6M 630M Fragments 180M 9.0G Instructions 14.4G 720G Intel Core i7 2700K: 56 Ginsn/s peak We need to go 13x faster Make a special-purpose accelerator Source: Damien Triolet, Hardware.fr 16
17 Beginnings of GPGPU Microsoft DirectX 7.x a 9.0b 9.0c Unified shaders NVIDIA NV10 FP 16 NV20 NV30 Programmable shaders Dynamic control flow G70 R200 G80-G90 CTM R300 R400 GT200 GF100 CUDA SIMT FP 24 ATI/AMD R100 FP 32 NV40 R500 FP 64 R600 CAL R700 Evergreen GPGPU traction
18 Today: what do we need GPUs for? 1. 3D graphics rendering for games Complex texture mapping, lighting computations 2. Computer Aided Design workstations Complex geometry 3. GPGPU Complex synchronization, data movements One chip to rule them all Find the common denominator 18
19 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 19
20 What is parallelism? Parallelism: independent operations whose execution can be overlapped Operations: memory accesses or computations How much parallelism do I need? Little's law in queuing theory Average customer arrival rate λ Average time spent W Average number of customers: L = λ W Parallelism = throughput latency Units: Memory: B = GB/s ns Arithmetic: flops = Gflops/s ns J. Little. A proof for the queuing formula L= λ W. JSTOR
21 Throughput and latency: CPU vs. GPU Throughput (GB/s) CPU memory: Core i7 4790, DDR3-1600, 2 channels 25.6 Latency (ns) 67 GPU memory: NVIDIA GeForce GTX 980, GDDR5-7010, 256-bit 224 x8 Area: 56 x6 Need 56 times more parallelism! 410 ns 21
22 Sources of parallelism ILP: Instruction-Level Parallelism Between independent instructions in sequential program TLP: Thread-Level Parallelism Between independent execution contexts: threads DLP: Data-Level Parallelism Between elements of a vector: same operation on several elements add r3 r1, r2 Parallel mul r0 r0, r1 sub r1 r3, r0 Thread 1 add Thread 2 mul vadd r a,b Parallel a1 + b1 a2 + b2 a3 + b3 r1 r2 r3 24
23 Example: X a X In-place scalar-vector product: X a X Sequential (ILP) For i = 0 to n-1 do: X[i] a * X[i] Threads (TLP) Launch n threads: X[tid] a * X[tid] Vector (DLP) X a * X Or any combination of the above 25
24 Uses of parallelism Horizontal parallelism for throughput A B C D More units working in parallel Vertical parallelism for latency hiding A Pipelining: keep units busy when waiting for dependencies, memory B C D A B C A B latency throughput A cycle 1 cycle 2 cycle 3 cycle 4 26
25 How to extract parallelism? Horizontal Vertical ILP Superscalar Pipelined TLP Multi-core SMT Interleaved / switch-on-event multithreading DLP SIMD / SIMT Vector / temporal SIMT We have seen the first row: ILP We will now review techniques for the next rows: TLP, DLP 27
26 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 28
27 Sequential processor for i = 0 to n-1 X[i] a * X[i] Source code move i 0 loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<n? loop Machine code add i 18 Fetch store X[17] Decode mul Execute Memory Memory Sequential CPU Focuses on instruction-level parallelism Exploits ILP: vertically (pipelining) and horizontally (superscalar) 29
28 The incremental approach: multi-core Several processors on a single chip sharing one memory space Intel Sandy Bridge Area: benefits from Moore's law Power: extra cores consume little when not in use e.g. Intel Turbo Boost Source: Intel 30
29 Homogeneous multi-core Horizontal use of thread-level parallelism add i 50 F IF store X[49] D ID mul mul EX X LSU Mem Threads: T0 EX X Memory add i 18 F IF store X[17] D ID LSU Mem T1 Improves peak throughput 31
30 Example: Tilera Tile-GX Grid of (up to) 72 tiles Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz Tile (1,1) Tile (1,2) Tile (1,8) Tile (9,1) Tile (9,8) 32
31 Interleaved multi-threading Vertical use of thread-level parallelism Threads: T0 T1 T2 T3 mul Fetch mul Decode add i 73 add i 50 Execute load X[89] store X[72] load X[17] store X[49] Memory load-store unit Memory Hides latency thanks to explicit parallelism improves achieved throughput 33
32 Example: Oracle Sparc T5 16 cores / chip Core: out-of-order superscalar, 8 threads 15 pipeline stages, 3.6 GHz Thread 1 Thread 2 Thread 8 Core 1 Core 2 Core 16 34
33 Clustered multi-core For each individual unit, select between T0 T1 T2 Cluster 1 T3 Cluster 2 Horizontal replication Vertical time-multiplexing br Examples mul Sun UltraSparc T2, T3 AMD Bulldozer IBM Power 7 add i 73 Fetch store Decode mul add i 50 load X[89] store X[72] load X[17] store X[49] EX Memory L/S Unit Area-efficient tradeoff Blurs boundaries between cores 35
34 Implicit SIMD Factorization of fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers (0-3) load F T1 (0-3) store D T2 T3 (0) mul (0) (1) mul (2) mul (1) (2) (3) mul X (3) Memory T0 Mem In NVIDIA-speak SIMT: Single Instruction, Multiple Threads Convoy of synchronized threads: warp Extracts DLP from multi-thread applications 36
35 Explicit SIMD Single Instruction Multiple Data Horizontal use of data level parallelism add i 20 F vstore X[ D vmul X Memory loop: vload T X[i] vmul T a T vstore X[i] T add i i+4 branch i<n? loop Mem Machine code SIMD CPU Examples Intel MIC (16-wide) AMD GCN GPU (16-wide 4-deep) Most general purpose CPUs (4-wide to 8-wide) 37
36 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 38
37 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 39
38 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 40
39 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 41
40 Example GPU: NVIDIA GeForce GTX 580 Combines all techniques SIMT: warps of 32 threads 16 SMs / chip 2 16 cores / SM, 48 warps / SM Warp 1 Warp 2 Warp 3 Warp 4 Core 32 Core 18 Time Core 17 Warp 47 Core 16 Core 2 Core 1 Warp 48 SM1 SM16 Up to 512 operations per cycle from threads in flight 44
41 Taxonomy of parallel architectures Horizontal Vertical ILP Superscalar / VLIW Pipelined TLP Multi-core SMT Interleaved / switch-onevent multithreading DLP SIMD / SIMT Vector / temporal SIMT 45
42 Takeaway Parallelism for throughput and latency hiding Types of parallelism: ILP, TLP, DLP All modern processors exploit the 3 kinds of parallelism GPUs focus on Thread-level and Data-level parallelism 49
43 Outline Computer architecture crash course The simplest processor Exploiting instruction-level parallelism GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory 50
44 External memory: discrete GPU Classical CPU-GPU model Split memory spaces Need to transfer data explicitely Highest bandwidth from GPU memory PCI Express CPU 26GB/s Transfers to main memory are slower Main memory 8 GB 16GB/s GPU 224GB/s Graphics memory 4 GB Ex: Intel Core i7 4790, Nvidia GeForce GTX
45 External memory: embedded GPU Most GPUs today are integrated Same memory May support memory coherence GPU can read directly from CPU caches More contention on external memory CPU GPU Cache 26GB/s Main memory 8 GB 52
46 CPU: caches for latency Large memories are slower than small memories Computer theorists are liars: in the real world, access in an array of size n costs O(log n), not O(1)! Think about looking up a book in a small or huge library Idea: put frequently-accessed data in small, fast memory Can be applied recursively: hierarchy with multiple levels of cache CPU Registers L1 cache L2 cache Capacity: 4 KB Access time: 0 (fully-bypassed) Capacity: 64 KB Access time: 2 ns 1 MB 10 ns L3 cache 8 MB 30 ns Memory 8 GB 60 ns 55
47 GPU: caches for throughput Cache hierarchy Keep frequently-accessed data SM SM Reduce throughput demand on main memory L1 SMem SM ~2 TB/s L1 SMem L1 Smem 1 MB Managed by hardware (L1, L2) or software (shared memory) GPU Crossbar L2 L2 L2 6 MB 290 GB/s GPU memory 56
48 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 58
49 First-order performance model Questions you should ask yourself, before starting to code or optimize Will my code run faster on the GPU? Is my existing code running as fast as it should? Is performance limited by computations or memory bandwidth? Pen-and-pencil calculations can (often) answer such questions 59
50 Performance: metrics and definitions Optimistic evaluation: upper bound on performance Assume perfect overlap of computations and memory accesses Memory accesses: bytes Only external memory, not caches or registers Computations: flops Only useful computations (usually floating-point) not address calculations, loop iterators.. Arithmetic intensity: flops / bytes = computations / memory accesses Property of the code Arithmetic throughput: flops / s Property of code + architecture 60
51 The roofline model How much performance can I get for a given arithmetic intensity? Upper bound on arithmetic throughput, as a function of arithmetic intensity Property of the architecture Arithmetic throughput (Gflops/s) Arithmetic intensity (flops/byte) Memory-bound Compute-bound S. Williams, A. Waterman, D. Patterson. Roofline: an insightful visual performance model 64 for multicore architectures. Communications of the ACM, 2009
52 Building the machine model Compute or measure: Peak memory throughput GTX 980: 224 GB/s Ideal arithmetic intensity = peak compute throughput / mem throughput GTX 980: 6412 (Gflops/s) / 224 (GB/s) = 28.6 flops/b 4 (B/flop) = 114 (dimensionless) Arithmetic throughput (Gflops/s) 4612 Gflops 28.6 B/flop Beware of units: float=4b, double=8b! Arithmetic intensity (flops/byte) Achievable peaks may be lower than theoretical peaks Lower curves when adding realistic constraints 65
53 Using the model Compute arithmetic intensity, measure performance of program Identify bottleneck: memory or computation Take optimization decision Measured performance Optimize computation Optimize memory accesses Arithmetic throughput (Gflops/s) Reuse data Arithmetic intensity (flops/byte) 66
54 Example: dot product for i = 1 to n r += a[i] * b[i] How many computations? How many memory accesses? Arithmetic intensity? Compute-bound or memory-bound? How many Gflop/s on a GTX 980 GPU? With data in GPU memory? With data in CPU memory? How many Gflop/s on an i CPU? GTX 980: 4612 Gflops/s, 224 GB/s i7 4790: 460 Gflops/s, 25.6 GB/s PCIe link: 16 GB/s 67
55 Example: dot product for i = 1 to n r += a[i] * b[i] How many computations? 2 n flops How many memory accesses? 2 n words Arithmetic intensity? 1 flop/word = 0.25 flop/b Compute-bound or memory-bound? Highly memory-bound How many Gflop/s on a GTX 980 GPU? With data in GPU memory? 224 GB/s 0.25 flop/b 56 Gflops/s With data in CPU memory? 16 GB/s 0.25 flop/b 4 Gflops/s How many Gflop/s on an i CPU? 25.6 GB/s 0.25 flop/b 6.4 Gflops/s Conclusion: don't bother porting to GPU! GTX 980: 4612 Gflops/s, 224 GB/s i7 4790: 460 Gflops/s, 25.6 GB/s PCIe link: 16 GB/s 68
56 Takeaway Result of many tradeoffs Between locality and parallelism Between core complexity and interconnect complexity GPU optimized for throughput Exploits primarily DLP, TLP Energy-efficient on parallel applications with regular behavior CPU optimized for latency Exploits primarily ILP Can use TLP and DLP when available Performance models Back-of-the-envelope calculations and common sense can save time Next time: GPU programming in CUDA 69
57 Computer architecture crash course How does a processor work? Or was working in the 1980s to 1990s: modern processors are much more complicated! An attempt to sum up 30 years of research in 15 minutes 70
58 Machine language: instruction set Registers Example State for computations Keeps variables and temporaries Instructions R0, R1, R2, R3... R31 Perform computations on registers, move data between register and memory, branch Instruction word Binary representation of an instruction Assembly language Readable form of machine language ADD R1, R3 Examples Which instruction set does your laptop/desktop run? Your cell phone? 71
59 The Von Neumann processor PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus State machine Let's look at it step by step 72
60 Step by step: Fetch The processor maintains a Program Counter (PC) Fetch: read the instruction word pointed by PC in memory PC Fetch unit Instruction word Memory 73
61 Decode Split the instruction word to understand what it represents Which operation? ADD Which operands? R1, R3 PC Fetch unit Instruction word Decoder Operands R1, R3 Operation Memory ADD 74
62 Read operands Get the value of registers R1, R3 from the register file PC Fetch unit Instruction word Decoder Operands R1, R3 Register file Memory 42, 17 75
63 Execute operation Compute the result: PC Fetch unit Instruction word Decoder Operands Operation Memory Register file 42, 17 ADD Arithmetic and Logic Unit 59 76
64 Write back Write the result back to the register file PC Fetch unit Instruction word Decoder R1 Operands 59 Operation Memory Register file Arithmetic and Logic Unit Result bus 77
65 Increment PC PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Result bus 78
66 Load or store instruction Can read and write memory from a computed address PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Load/Store Unit Result bus 79
67 Branch instruction Instead of incrementing PC, set it to a computed value PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus 80
68 What about the state machine? The state machine controls everybody Sequences the successive steps Send signals to units depending on current state At every clock tick, switch to next state Clock is a periodic signal (as fast as possible) Fetchdecode Read operands Increment PC Read operands Writeback Execute 81
69 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 82
70 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 2: mul 1: add Execute Writeback Independent instructions can follow each other Exploits ILP to hide instruction latency 83
71 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 3: load 2: mul Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 84
72 Superscalar execution Multiple execution units in parallel Independent instructions can execute at the same time Fetch Decode Execute Writeback Decode Execute Writeback Decode Execute Writeback Exploits ILP to increase throughput 85
73 Locality Time to access main memory: ~200 clock cycles One memory access every few instructions Are we doomed? Fortunately: principle of locality ~90% of memory accesses on ~10% of data Accessed locations are often the same Temporal locality Access the same location at different times Spacial locality Access locations close to each other 86
74 Caches Large memories are slower than small memories The computer theorists lied to you: in the real world, access in an array of size n costs O(log n), not O(1)! Think about looking up a book in a small or huge library Idea: put frequently-accessed data in small, fast memory Can be applied recursively: hierarchy with multiple levels of cache L1 cache L2 cache Capacity: 64 KB Access time: 2 ns 1 MB 10 ns L3 cache 8 MB 30 ns Memory 8 GB 60 ns 87
75 Flynn's taxonomy Classification of parallel architectures Single Instruction F Single Data X SISD Multiple Data F Multiple Instruction FFFF X MISD FFFF XXXX XXXX SIMD MIMD M. Flynn. Some Computer Organizations and Their Effectiveness. IEEE TC
Parallel programming: Introduction to GPU architecture
Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr http://www.irisa.fr/alf/collange/ PPAR - 2018 Outline of the course March
More informationOptimization Techniques for Parallel Code 2. Introduction to GPU architecture
Optimization Techniques for Parallel Code 2. Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 What
More informationOptimization Techniques for Parallel Code 3. Introduction to GPU architecture
Optimization Techniques for Parallel Code 3. Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2018 Graphics
More informationIntroduction to GPU architecture
Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr ADA - 2017 Graphics processing unit (GPU) GPU or GPU Graphics
More informationIntroduction to GPU architecture
Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr ADA - 2017 Graphics processing unit (GPU) GPU or GPU Graphics
More informationGPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: Code optimization part 1 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Analytical performance modeling Optimizing host-device data transfers Optimizing
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationMulticore Hardware and Parallelism
Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3
More informationIntroduction to Multicore architecture. Tao Zhang Oct. 21, 2010
Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More information45-year CPU Evolution: 1 Law -2 Equations
4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationParallelism in Hardware
Parallelism in Hardware Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3 Moore s Law
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationFundamentals of Computer Design
Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationFundamentals of Computers Design
Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2
More informationPARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort
PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More informationGPU for HPC. October 2010
GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationCSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com History of GPUs
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationCurrent Trends in Computer Graphics Hardware
Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)
More informationAntonio R. Miele Marco D. Santambrogio
Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationGPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS
GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge
More informationOverview. CS 472 Concurrent & Parallel Programming University of Evansville
Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University
More informationIntroduction. CSCI 4850/5850 High-Performance Computing Spring 2018
Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationParallel computing and GPU introduction
國立台灣大學 National Taiwan University Parallel computing and GPU introduction 黃子桓 tzhuan@gmail.com Agenda Parallel computing GPU introduction Interconnection networks Parallel benchmark Parallel programming
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationFrom Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)
From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real
More informationComputer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Thread, Multithreading, SMT CMP and multicore Benefits of
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationGPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.
Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationLecture 1: Gentle Introduction to GPUs
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationComputer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra
Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating
More informationGPUs and GPGPUs. Greg Blanton John T. Lubia
GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware
More informationCS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011
CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming
More informationCS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367
More informationMANY-CORE COMPUTING. 7-Oct Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, escience Center
MANY-CORE COMPUTING 7-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, escience Center Schedule 2 1. Introduction, performance metrics & analysis 2. Programming: basics (10-10-2013)
More informationTrends in HPC (hardware complexity and software challenges)
Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationTDT4260/DT8803 COMPUTER ARCHITECTURE EXAM
Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationWhy GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)
Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009
More informationFrom Shader Code to a Teraflop: How Shader Cores Work
From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA
More informationGPU programming: 1. Parallel programming models. Sylvain Collange Inria Rennes Bretagne Atlantique
GPU programming: 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Self-introduction Sylvain Collange Researcher at Inria, Rennes, France Visiting
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationPutting it all Together: Modern Computer Architecture
Putting it all Together: Modern Computer Architecture Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. May 10, 2018 L23-1 Administrivia Quiz 3 tonight on room 50-340 (Walker Gym) Quiz
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationGPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP
GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationA Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis
A Data-Parallel Genealogy: The GPU Family Tree John Owens University of California, Davis Outline Moore s Law brings opportunity Gains in performance and capabilities. What has 20+ years of development
More informationParallel Programming
Parallel Programming Introduction Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Acknowledgements Prof. Felix Wolf, TU Darmstadt Prof. Matthias
More informationImproving the efficiency of parallel architectures with regularity
Improving the efficiency of parallel architectures with regularity Sylvain Collange Arénaire, LIP, ENS de Lyon Dipartimento di Ingegneria dell'informazione Università di Siena February 2, 2011 Where I
More informationCOMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:
More informationECE 571 Advanced Microprocessor-Based Design Lecture 20
ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationChapter 7. Multicores, Multiprocessors, and Clusters
Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationEECS4201 Computer Architecture
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be
More informationVector Processors and Graphics Processing Units (GPUs)
Vector Processors and Graphics Processing Units (GPUs) Many slides from: Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley TA Evaluations Please fill out your
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationLecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)
Lecture 7: The Programmable GPU Core Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today A brief history of GPU programmability Throughput processing core 101 A detailed
More informationGeneral-purpose SIMT processors. Sylvain Collange INRIA Rennes / IRISA
General-purpose SIMT processors Sylvain Collange INRIA Rennes / IRISA sylvain.collange@inria.fr From GPU to heterogeneous multi-core Yesterday (2000-2010) Homogeneous multi-core Discrete components Today
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationCSE 160 Lecture 24. Graphical Processing Units
CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationLecture 5. Performance programming for stencil methods Vectorization Computing with GPUs
Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,
More information