Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique

Size: px
Start display at page:

Download "Parallel programming: Introduction to GPU architecture. Sylvain Collange Inria Rennes Bretagne Atlantique"

Transcription

1 Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique

2 Outline of the course Feb 29: Introduction to GPU architecture Let's pretend we are a GPU architect Performance models March 7?: GPU programming The software side Programming model March 14: Performance optimization Possible bottlenecks Common optimization techniques 4 labs

3 GPGPU in HPC, today Graphics Processing Unit (GPU) Made for computer games: high-volume market Low unit price, amortized R&D Inexpensive, high-performance parallel processor 2002: General-Purpose computation on GPU (GPGPU) 2016: the 2 fastest supercomputers use GPUs/accelerators #1 Tianhe-2 (China) #2 Titan (USA) 18,648 (1 AMD Opteron + 1 NVIDIA Tesla K20X) 16,000 (2 Intel Xeon + 3 Intel Xeon Phi) 3

4 GPGPU in the future? Yesterday ( ) Homogeneous multi-core Discrete components Today ( ) Chip-level integration Central Processing Unit (CPU) Graphics Processing Unit (GPU) Many embedded SoCs Intel Sandy Bridge AMD Fusion NVIDIA Denver/Maxwell project Tomorrow Heterogeneous multi-core GPUs to blend into throughput-optimized cores? Latencyoptimized cores Throughputoptimized cores Hardware accelerators Heterogeneous multi-core chip 4

5 Today: GPU internals What makes a GPU tick? NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering! 5

6 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 6

7 The free lunch era... was yesterday 1980's to 2002: Moore's law, Dennard scaling, micro-architecture improvements Exponential performance increase Software compatibility preserved Hennessy, Patterson. Computer Architecture, a quantitative approach. 4 th Ed Do not rewrite software, buy a new machine! 7

8 Technology evolution Memory wall Compute Performance Gap Memory Memory speed does not increase as fast as computing speed Harder to hide memory latency Time Power wall Power consumption of transistors does not decrease as fast as density increases Transistor density Performance is now limited by power consumption Transistor power ILP wall Law of diminishing returns on Instruction-Level Parallelism Total power Time Cost Pollack rule: cost performance² Serial performance 8

9 Usage changes New applications demand parallel processing Computer games : 3D graphics Search engines, social networks big data processing New computing devices are power-constrained Laptops, cell phones, tablets Small, light, battery-powered Datacenters High power supply and cooling costs 9

10 Latency vs. throughput Latency: time to solution Minimize time, at the expense of power Metric: time e.g. seconds Throughput: quantity of tasks processed per unit of time Assumes unlimited parallelism Minimize energy per operation Metric: operations / time e.g. Gflops / s CPU: optimized for latency GPU: optimized for throughput 10

11 Amdahl's law Bounds speedup attainable on a parallel machine 1 S= Time to run sequential portions 1 P P N Time to run parallel portions S P N Speedup Ratio of parallel portions Number of processors S (speedup) N (available processors) G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS

12 Why heterogeneous architectures? Time to run sequential portions S= 1 P 1 P N Time to run parallel portions Latency-optimized multi-core (CPU) Low efficiency on parallel portions: spends too much resources Throughput-optimized multi-core (GPU) Low performance on sequential portions Heterogeneous multi-core (CPU+GPU) Use the right tool for the right job Allows aggressive optimization for latency or for throughput M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer,

13 Example: System on Chip for smartphone Small cores for background activity GPU Big cores for applications Lots of interfaces Special-purpose accelerators 13

14 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 14

15 The (simplest) graphics rendering pipeline Primitives (triangles ) Vertices Fragment shader Vertex shader Textures Z-Compare Blending Clipping, Rasterization Attribute interpolation Pixels Fragments Framebuffer Z-Buffer Programmable stage Parametrizable stage 15

16 How much performance do we need to run 3DMark 11 at 50 frames/second? Element Per frame Per second Vertices 12.0M 600M Primitives 12.6M 630M Fragments 180M 9.0G Instructions 14.4G 720G Intel Core i7 2700K: 56 Ginsn/s peak We need to go 13x faster Make a special-purpose accelerator Source: Damien Triolet, Hardware.fr 16

17 Beginnings of GPGPU Microsoft DirectX 7.x a 9.0b 9.0c Unified shaders NVIDIA NV10 FP 16 NV20 NV30 Programmable shaders Dynamic control flow G70 R200 G80-G90 CTM R300 R400 GT200 GF100 CUDA SIMT FP 24 ATI/AMD R100 FP 32 NV40 R500 FP 64 R600 CAL R700 Evergreen GPGPU traction

18 Today: what do we need GPUs for? 1. 3D graphics rendering for games Complex texture mapping, lighting computations 2. Computer Aided Design workstations Complex geometry 3. GPGPU Complex synchronization, data movements One chip to rule them all Find the common denominator 18

19 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 19

20 What is parallelism? Parallelism: independent operations whose execution can be overlapped Operations: memory accesses or computations How much parallelism do I need? Little's law in queuing theory Average customer arrival rate λ Average time spent W Average number of customers: L = λ W Parallelism = throughput latency Units: Memory: B = GB/s ns Arithmetic: flops = Gflops/s ns J. Little. A proof for the queuing formula L= λ W. JSTOR

21 Throughput and latency: CPU vs. GPU Throughput (GB/s) CPU memory: Core i7 4790, DDR3-1600, 2 channels 25.6 Latency (ns) 67 GPU memory: NVIDIA GeForce GTX 980, GDDR5-7010, 256-bit 224 x8 Area: 56 x6 Need 56 times more parallelism! 410 ns 21

22 Sources of parallelism ILP: Instruction-Level Parallelism Between independent instructions in sequential program TLP: Thread-Level Parallelism Between independent execution contexts: threads DLP: Data-Level Parallelism Between elements of a vector: same operation on several elements add r3 r1, r2 Parallel mul r0 r0, r1 sub r1 r3, r0 Thread 1 add Thread 2 mul vadd r a,b Parallel a1 + b1 a2 + b2 a3 + b3 r1 r2 r3 24

23 Example: X a X In-place scalar-vector product: X a X Sequential (ILP) For i = 0 to n-1 do: X[i] a * X[i] Threads (TLP) Launch n threads: X[tid] a * X[tid] Vector (DLP) X a * X Or any combination of the above 25

24 Uses of parallelism Horizontal parallelism for throughput A B C D More units working in parallel Vertical parallelism for latency hiding A Pipelining: keep units busy when waiting for dependencies, memory B C D A B C A B latency throughput A cycle 1 cycle 2 cycle 3 cycle 4 26

25 How to extract parallelism? Horizontal Vertical ILP Superscalar Pipelined TLP Multi-core SMT Interleaved / switch-on-event multithreading DLP SIMD / SIMT Vector / temporal SIMT We have seen the first row: ILP We will now review techniques for the next rows: TLP, DLP 27

26 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 28

27 Sequential processor for i = 0 to n-1 X[i] a * X[i] Source code move i 0 loop: load t X[i] mul t a t store X[i] t add i i+1 branch i<n? loop Machine code add i 18 Fetch store X[17] Decode mul Execute Memory Memory Sequential CPU Focuses on instruction-level parallelism Exploits ILP: vertically (pipelining) and horizontally (superscalar) 29

28 The incremental approach: multi-core Several processors on a single chip sharing one memory space Intel Sandy Bridge Area: benefits from Moore's law Power: extra cores consume little when not in use e.g. Intel Turbo Boost Source: Intel 30

29 Homogeneous multi-core Horizontal use of thread-level parallelism add i 50 F IF store X[49] D ID mul mul EX X LSU Mem Threads: T0 EX X Memory add i 18 F IF store X[17] D ID LSU Mem T1 Improves peak throughput 31

30 Example: Tilera Tile-GX Grid of (up to) 72 tiles Each tile: 3-way VLIW processor, 5 pipeline stages, 1.2 GHz Tile (1,1) Tile (1,2) Tile (1,8) Tile (9,1) Tile (9,8) 32

31 Interleaved multi-threading Vertical use of thread-level parallelism Threads: T0 T1 T2 T3 mul Fetch mul Decode add i 73 add i 50 Execute load X[89] store X[72] load X[17] store X[49] Memory load-store unit Memory Hides latency thanks to explicit parallelism improves achieved throughput 33

32 Example: Oracle Sparc T5 16 cores / chip Core: out-of-order superscalar, 8 threads 15 pipeline stages, 3.6 GHz Thread 1 Thread 2 Thread 8 Core 1 Core 2 Core 16 34

33 Clustered multi-core For each individual unit, select between T0 T1 T2 Cluster 1 T3 Cluster 2 Horizontal replication Vertical time-multiplexing br Examples mul Sun UltraSparc T2, T3 AMD Bulldozer IBM Power 7 add i 73 Fetch store Decode mul add i 50 load X[89] store X[72] load X[17] store X[49] EX Memory L/S Unit Area-efficient tradeoff Blurs boundaries between cores 35

34 Implicit SIMD Factorization of fetch/decode, load-store units Fetch 1 instruction on behalf of several threads Read 1 memory location and broadcast to several registers (0-3) load F T1 (0-3) store D T2 T3 (0) mul (0) (1) mul (2) mul (1) (2) (3) mul X (3) Memory T0 Mem In NVIDIA-speak SIMT: Single Instruction, Multiple Threads Convoy of synchronized threads: warp Extracts DLP from multi-thread applications 36

35 Explicit SIMD Single Instruction Multiple Data Horizontal use of data level parallelism add i 20 F vstore X[ D vmul X Memory loop: vload T X[i] vmul T a T vstore X[i] T add i i+4 branch i<n? loop Mem Machine code SIMD CPU Examples Intel MIC (16-wide) AMD GCN GPU (16-wide 4-deep) Most general purpose CPUs (4-wide to 8-wide) 37

36 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 38

37 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 39

38 Quizz: link the words Parallelism Architectures ILP Superscalar processor TLP Homogeneous multi-core DLP Multi-threaded core Use Horizontal: more throughput Clustered multi-core Implicit SIMD Explicit SIMD Vertical: hide latency 40

39 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 41

40 Example GPU: NVIDIA GeForce GTX 580 Combines all techniques SIMT: warps of 32 threads 16 SMs / chip 2 16 cores / SM, 48 warps / SM Warp 1 Warp 2 Warp 3 Warp 4 Core 32 Core 18 Time Core 17 Warp 47 Core 16 Core 2 Core 1 Warp 48 SM1 SM16 Up to 512 operations per cycle from threads in flight 44

41 Taxonomy of parallel architectures Horizontal Vertical ILP Superscalar / VLIW Pipelined TLP Multi-core SMT Interleaved / switch-onevent multithreading DLP SIMD / SIMT Vector / temporal SIMT 45

42 Takeaway Parallelism for throughput and latency hiding Types of parallelism: ILP, TLP, DLP All modern processors exploit the 3 kinds of parallelism GPUs focus on Thread-level and Data-level parallelism 49

43 Outline Computer architecture crash course The simplest processor Exploiting instruction-level parallelism GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory 50

44 External memory: discrete GPU Classical CPU-GPU model Split memory spaces Need to transfer data explicitely Highest bandwidth from GPU memory PCI Express CPU 26GB/s Transfers to main memory are slower Main memory 8 GB 16GB/s GPU 224GB/s Graphics memory 4 GB Ex: Intel Core i7 4790, Nvidia GeForce GTX

45 External memory: embedded GPU Most GPUs today are integrated Same memory May support memory coherence GPU can read directly from CPU caches More contention on external memory CPU GPU Cache 26GB/s Main memory 8 GB 52

46 CPU: caches for latency Large memories are slower than small memories Computer theorists are liars: in the real world, access in an array of size n costs O(log n), not O(1)! Think about looking up a book in a small or huge library Idea: put frequently-accessed data in small, fast memory Can be applied recursively: hierarchy with multiple levels of cache CPU Registers L1 cache L2 cache Capacity: 4 KB Access time: 0 (fully-bypassed) Capacity: 64 KB Access time: 2 ns 1 MB 10 ns L3 cache 8 MB 30 ns Memory 8 GB 60 ns 55

47 GPU: caches for throughput Cache hierarchy Keep frequently-accessed data SM SM Reduce throughput demand on main memory L1 SMem SM ~2 TB/s L1 SMem L1 Smem 1 MB Managed by hardware (L1, L2) or software (shared memory) GPU Crossbar L2 L2 L2 6 MB 290 GB/s GPU memory 56

48 Outline GPU, many-core: why, what for? Technological trends and constraints From graphics to general purpose Forms of parallelism, how to exploit them Why we need (so much) parallelism: latency and throughput Sources of parallelism: ILP, TLP, DLP Uses of parallelism: horizontal, vertical Let's design a GPU! Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD Putting it all together Architecture of current GPUs: cores, memory High-level performance modeling 58

49 First-order performance model Questions you should ask yourself, before starting to code or optimize Will my code run faster on the GPU? Is my existing code running as fast as it should? Is performance limited by computations or memory bandwidth? Pen-and-pencil calculations can (often) answer such questions 59

50 Performance: metrics and definitions Optimistic evaluation: upper bound on performance Assume perfect overlap of computations and memory accesses Memory accesses: bytes Only external memory, not caches or registers Computations: flops Only useful computations (usually floating-point) not address calculations, loop iterators.. Arithmetic intensity: flops / bytes = computations / memory accesses Property of the code Arithmetic throughput: flops / s Property of code + architecture 60

51 The roofline model How much performance can I get for a given arithmetic intensity? Upper bound on arithmetic throughput, as a function of arithmetic intensity Property of the architecture Arithmetic throughput (Gflops/s) Arithmetic intensity (flops/byte) Memory-bound Compute-bound S. Williams, A. Waterman, D. Patterson. Roofline: an insightful visual performance model 64 for multicore architectures. Communications of the ACM, 2009

52 Building the machine model Compute or measure: Peak memory throughput GTX 980: 224 GB/s Ideal arithmetic intensity = peak compute throughput / mem throughput GTX 980: 6412 (Gflops/s) / 224 (GB/s) = 28.6 flops/b 4 (B/flop) = 114 (dimensionless) Arithmetic throughput (Gflops/s) 4612 Gflops 28.6 B/flop Beware of units: float=4b, double=8b! Arithmetic intensity (flops/byte) Achievable peaks may be lower than theoretical peaks Lower curves when adding realistic constraints 65

53 Using the model Compute arithmetic intensity, measure performance of program Identify bottleneck: memory or computation Take optimization decision Measured performance Optimize computation Optimize memory accesses Arithmetic throughput (Gflops/s) Reuse data Arithmetic intensity (flops/byte) 66

54 Example: dot product for i = 1 to n r += a[i] * b[i] How many computations? How many memory accesses? Arithmetic intensity? Compute-bound or memory-bound? How many Gflop/s on a GTX 980 GPU? With data in GPU memory? With data in CPU memory? How many Gflop/s on an i CPU? GTX 980: 4612 Gflops/s, 224 GB/s i7 4790: 460 Gflops/s, 25.6 GB/s PCIe link: 16 GB/s 67

55 Example: dot product for i = 1 to n r += a[i] * b[i] How many computations? 2 n flops How many memory accesses? 2 n words Arithmetic intensity? 1 flop/word = 0.25 flop/b Compute-bound or memory-bound? Highly memory-bound How many Gflop/s on a GTX 980 GPU? With data in GPU memory? 224 GB/s 0.25 flop/b 56 Gflops/s With data in CPU memory? 16 GB/s 0.25 flop/b 4 Gflops/s How many Gflop/s on an i CPU? 25.6 GB/s 0.25 flop/b 6.4 Gflops/s Conclusion: don't bother porting to GPU! GTX 980: 4612 Gflops/s, 224 GB/s i7 4790: 460 Gflops/s, 25.6 GB/s PCIe link: 16 GB/s 68

56 Takeaway Result of many tradeoffs Between locality and parallelism Between core complexity and interconnect complexity GPU optimized for throughput Exploits primarily DLP, TLP Energy-efficient on parallel applications with regular behavior CPU optimized for latency Exploits primarily ILP Can use TLP and DLP when available Performance models Back-of-the-envelope calculations and common sense can save time Next time: GPU programming in CUDA 69

57 Computer architecture crash course How does a processor work? Or was working in the 1980s to 1990s: modern processors are much more complicated! An attempt to sum up 30 years of research in 15 minutes 70

58 Machine language: instruction set Registers Example State for computations Keeps variables and temporaries Instructions R0, R1, R2, R3... R31 Perform computations on registers, move data between register and memory, branch Instruction word Binary representation of an instruction Assembly language Readable form of machine language ADD R1, R3 Examples Which instruction set does your laptop/desktop run? Your cell phone? 71

59 The Von Neumann processor PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus State machine Let's look at it step by step 72

60 Step by step: Fetch The processor maintains a Program Counter (PC) Fetch: read the instruction word pointed by PC in memory PC Fetch unit Instruction word Memory 73

61 Decode Split the instruction word to understand what it represents Which operation? ADD Which operands? R1, R3 PC Fetch unit Instruction word Decoder Operands R1, R3 Operation Memory ADD 74

62 Read operands Get the value of registers R1, R3 from the register file PC Fetch unit Instruction word Decoder Operands R1, R3 Register file Memory 42, 17 75

63 Execute operation Compute the result: PC Fetch unit Instruction word Decoder Operands Operation Memory Register file 42, 17 ADD Arithmetic and Logic Unit 59 76

64 Write back Write the result back to the register file PC Fetch unit Instruction word Decoder R1 Operands 59 Operation Memory Register file Arithmetic and Logic Unit Result bus 77

65 Increment PC PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Result bus 78

66 Load or store instruction Can read and write memory from a computed address PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Arithmetic and Logic Unit Load/Store Unit Result bus 79

67 Branch instruction Instead of incrementing PC, set it to a computed value PC Fetch unit +1 Instruction word Decoder Operands Operation Memory Register file Branch Unit Arithmetic and Logic Unit Load/Store Unit Result bus 80

68 What about the state machine? The state machine controls everybody Sequences the successive steps Send signals to units depending on current state At every clock tick, switch to next state Clock is a periodic signal (as fast as possible) Fetchdecode Read operands Increment PC Read operands Writeback Execute 81

69 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 82

70 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 2: mul 1: add Execute Writeback Independent instructions can follow each other Exploits ILP to hide instruction latency 83

71 Pipelined processor Program 1: add, r1, r3 2: mul r2, r3 3: load r3, [r1] Fetch Decode 3: load 2: mul Execute Writeback 1: add Independent instructions can follow each other Exploits ILP to hide instruction latency 84

72 Superscalar execution Multiple execution units in parallel Independent instructions can execute at the same time Fetch Decode Execute Writeback Decode Execute Writeback Decode Execute Writeback Exploits ILP to increase throughput 85

73 Locality Time to access main memory: ~200 clock cycles One memory access every few instructions Are we doomed? Fortunately: principle of locality ~90% of memory accesses on ~10% of data Accessed locations are often the same Temporal locality Access the same location at different times Spacial locality Access locations close to each other 86

74 Caches Large memories are slower than small memories The computer theorists lied to you: in the real world, access in an array of size n costs O(log n), not O(1)! Think about looking up a book in a small or huge library Idea: put frequently-accessed data in small, fast memory Can be applied recursively: hierarchy with multiple levels of cache L1 cache L2 cache Capacity: 64 KB Access time: 2 ns 1 MB 10 ns L3 cache 8 MB 30 ns Memory 8 GB 60 ns 87

75 Flynn's taxonomy Classification of parallel architectures Single Instruction F Single Data X SISD Multiple Data F Multiple Instruction FFFF X MISD FFFF XXXX XXXX SIMD MIMD M. Flynn. Some Computer Organizations and Their Effectiveness. IEEE TC

Parallel programming: Introduction to GPU architecture

Parallel programming: Introduction to GPU architecture Parallel programming: Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr http://www.irisa.fr/alf/collange/ PPAR - 2018 Outline of the course March

More information

Optimization Techniques for Parallel Code 2. Introduction to GPU architecture

Optimization Techniques for Parallel Code 2. Introduction to GPU architecture Optimization Techniques for Parallel Code 2. Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 What

More information

Optimization Techniques for Parallel Code 3. Introduction to GPU architecture

Optimization Techniques for Parallel Code 3. Introduction to GPU architecture Optimization Techniques for Parallel Code 3. Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2018 Graphics

More information

Introduction to GPU architecture

Introduction to GPU architecture Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr ADA - 2017 Graphics processing unit (GPU) GPU or GPU Graphics

More information

Introduction to GPU architecture

Introduction to GPU architecture Introduction to GPU architecture Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr ADA - 2017 Graphics processing unit (GPU) GPU or GPU Graphics

More information

GPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: Code optimization part 1. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: Code optimization part 1 Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Outline Analytical performance modeling Optimizing host-device data transfers Optimizing

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Multicore Hardware and Parallelism

Multicore Hardware and Parallelism Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Parallelism in Hardware

Parallelism in Hardware Parallelism in Hardware Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3 Moore s Law

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

GPU for HPC. October 2010

GPU for HPC. October 2010 GPU for HPC Simone Melchionna Jonas Latt Francis Lapique October 2010 EPFL/ EDMX EPFL/EDMX EPFL/DIT simone.melchionna@epfl.ch jonas.latt@epfl.ch francis.lapique@epfl.ch 1 Moore s law: in the old days,

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Threading Hardware in G80

Threading Hardware in G80 ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &

More information

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com History of GPUs

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008 Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Current Trends in Computer Graphics Hardware

Current Trends in Computer Graphics Hardware Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)

More information

Antonio R. Miele Marco D. Santambrogio

Antonio R. Miele Marco D. Santambrogio Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS Agenda Forming a GPGPU WG 1 st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018 Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Parallel computing and GPU introduction

Parallel computing and GPU introduction 國立台灣大學 National Taiwan University Parallel computing and GPU introduction 黃子桓 tzhuan@gmail.com Agenda Parallel computing GPU introduction Interconnection networks Parallel benchmark Parallel programming

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading

More information

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real

More information

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 10. Data-Level Parallelism and GPGPU 第十讲 数据级并行化与 GPGPU Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Thread, Multithreading, SMT CMP and multicore Benefits of

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU. Basics of s Basics Introduction to Why vs CPU S. Sundar and Computing architecture August 9, 2014 1 / 70 Outline Basics of s Why vs CPU Computing architecture 1 2 3 of s 4 5 Why 6 vs CPU 7 Computing 8

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

More information

Modern CPU Architectures

Modern CPU Architectures Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011 CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

MANY-CORE COMPUTING. 7-Oct Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, escience Center

MANY-CORE COMPUTING. 7-Oct Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, escience Center MANY-CORE COMPUTING 7-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, escience Center Schedule 2 1. Introduction, performance metrics & analysis 2. Programming: basics (10-10-2013)

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD) Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009 www.gpgpu.org/ppam2009

More information

From Shader Code to a Teraflop: How Shader Cores Work

From Shader Code to a Teraflop: How Shader Cores Work From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA

More information

GPU programming: 1. Parallel programming models. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: 1. Parallel programming models. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr Self-introduction Sylvain Collange Researcher at Inria, Rennes, France Visiting

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Putting it all Together: Modern Computer Architecture

Putting it all Together: Modern Computer Architecture Putting it all Together: Modern Computer Architecture Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. May 10, 2018 L23-1 Administrivia Quiz 3 tonight on room 50-340 (Walker Gym) Quiz

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP INTRODUCTION or With the exponential increase in computational power of todays hardware, the complexity of the problem

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis A Data-Parallel Genealogy: The GPU Family Tree John Owens University of California, Davis Outline Moore s Law brings opportunity Gains in performance and capabilities. What has 20+ years of development

More information

Parallel Programming

Parallel Programming Parallel Programming Introduction Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Acknowledgements Prof. Felix Wolf, TU Darmstadt Prof. Matthias

More information

Improving the efficiency of parallel architectures with regularity

Improving the efficiency of parallel architectures with regularity Improving the efficiency of parallel architectures with regularity Sylvain Collange Arénaire, LIP, ENS de Lyon Dipartimento di Ingegneria dell'informazione Università di Siena February 2, 2011 Where I

More information

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Chapter 7. Multicores, Multiprocessors, and Clusters

Chapter 7. Multicores, Multiprocessors, and Clusters Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Vector Processors and Graphics Processing Units (GPUs)

Vector Processors and Graphics Processing Units (GPUs) Vector Processors and Graphics Processing Units (GPUs) Many slides from: Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley TA Evaluations Please fill out your

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 7: The Programmable GPU Core Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today A brief history of GPU programmability Throughput processing core 101 A detailed

More information

General-purpose SIMT processors. Sylvain Collange INRIA Rennes / IRISA

General-purpose SIMT processors. Sylvain Collange INRIA Rennes / IRISA General-purpose SIMT processors Sylvain Collange INRIA Rennes / IRISA sylvain.collange@inria.fr From GPU to heterogeneous multi-core Yesterday (2000-2010) Homogeneous multi-core Discrete components Today

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

CSE 160 Lecture 24. Graphical Processing Units

CSE 160 Lecture 24. Graphical Processing Units CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information