Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All Rights Reserved.


Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers. This figure assumes that two cores per chip for MIMD will be added every two years and the number of operations for SIMD will double every four years.

Figure 4.2 The basic structure of a vector architecture, VMIPS. This processor has a scalar architecture just like MIPS. There are also eight 64-element vector registers, and all the functional units are vector functional units. This chapter defines special vector instructions for both arithmetic and memory accesses. The figure shows vector units for logical and integer operations so that VMIPS looks like a standard vector processor that usually includes these units; however, we will not be discussing these units. The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations. A set of crossbar switches (thick gray lines) connects these ports to the inputs and outputs of the vector functional units.
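
To make this concrete, the chapter's running example is DAXPY (Y = a*X + Y). A minimal sketch of the scalar C loop, with comments indicating, as an assumption about the compiled output, how a vectorizing compiler would map it onto VMIPS vector instructions when n fits in the 64-element vector registers:

    // DAXPY: Y = a*X + Y, the chapter's running example.
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)   // the whole loop becomes a few vector instructions:
            y[i] = a * x[i] + y[i];   // LV (load X), MULVS.D (a*X), LV (load Y),
    }                                 // ADDVV.D (vector add), SV (store Y)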

Figure 4.4 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. The vector processor (a) on the left has a single add pipeline and can complete one addition per cycle. The vector processor (b) on the right has four add pipelines and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is termed an element group. (Reproduced with permission from Asanovic [1998].)

Figure 4.5 Structure of a vector unit containing four lanes. The vector register storage is divided across the lanes, with each lane holding every fourth element of each vector register. The figure shows three vector functional units: an FP add, an FP multiply, and a load-store unit. Each of the vector arithmetic units contains four execution pipelines, one per lane, which act in concert to complete a single vector instruction. Note how each section of the vector register file only needs to provide enough ports for pipelines local to its lane. This figure does not show the path to provide the scalar operand for vector-scalar instructions, but the scalar processor (or control processor) broadcasts a scalar value to all lanes.
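
A minimal sketch of the element-to-lane mapping the caption describes, assuming four lanes: element i of every vector register is stored in, and processed by, lane i % 4, so lane 0 holds elements 0, 4, 8, ... and elements 0 through 3 form the first element group to enter the four pipelines together.

    // Hypothetical helpers illustrating the four-lane layout described above.
    int lane_of(int element)      { return element % 4; }  // which lane processes it
    int slot_in_lane(int element) { return element / 4; }  // position in that lane's register slice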

Figure 4.6 A vector of arbitrary length processed with strip mining. All blocks but the first are of length MVL, utilizing the full power of the vector processor. In this figure, we use the variable m for the expression (n % MVL). (The C operator % is modulo.)
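
In C, the strip-mining transformation the caption describes looks like the sketch below, following the chapter's code shape. The first strip handles the odd-size piece of m = n % MVL elements, and every later strip runs at the full MVL:

    // Strip-mined DAXPY; MVL is the machine's maximum vector length.
    void daxpy_strip_mined(int n, double a, const double *X, double *Y) {
        enum { MVL = 64 };                 // VMIPS vector registers hold 64 elements
        int low = 0;
        int VL = n % MVL;                  // the caption's m: first, odd-size strip (may be empty)
        for (int j = 0; j <= n / MVL; j++) {
            for (int i = low; i < low + VL; i++)   // inner loop runs at vector length VL
                Y[i] = a * X[i] + Y[i];
            low += VL;                     // advance to the next strip
            VL = MVL;                      // all remaining strips use the full vector length
        }
    }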

Figure 4.10 Arithmetic intensity, specified as the number of floating-point operations to run the program divided by the number of bytes accessed in main memory [Williams et al. 2009]. Some kernels have an arithmetic intensity that scales with problem size, such as dense matrix, but there are many kernels with arithmetic intensities independent of problem size.
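
A quick worked example of an intensity that does not scale: DAXPY performs 2 floating-point operations per element (a multiply and an add) while moving three 8-byte doubles (reads of X[i] and Y[i], a write of Y[i]), so

    AI = 2 FLOPs / 24 bytes ≈ 0.083 FLOP/byte

regardless of n. Dense matrix-matrix multiply, by contrast, performs O(n^3) operations on O(n^2) bytes of data, so its intensity grows with problem size.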

Figure 4.11 Roofline model for one NEC SX-9 vector processor on the left and the Intel Core i7 920 multicore computer with SIMD Extensions on the right [Williams et al. 2009]. This Roofline is for unit-stride memory accesses and double-precision floating-point performance. NEC SX-9 is a vector supercomputer announced in 2008 that costs millions of dollars. It has a peak DP FP performance of 102.4 GFLOP/sec and a peak memory bandwidth of 162 GBytes/sec from the Stream benchmark. The Core i7 920 has a peak DP FP performance of 42.66 GFLOP/sec and a peak memory bandwidth of 16.4 GBytes/sec. The dashed vertical lines at an arithmetic intensity of 4 FLOP/byte show that both processors operate at peak performance. In this case, the SX-9 at 102.4 GFLOP/sec is 2.4x faster than the Core i7 at 42.66 GFLOP/sec. At an arithmetic intensity of 0.25 FLOP/byte, the SX-9 is 10x faster at 40.5 GFLOP/sec versus 4.1 GFLOP/sec for the Core i7.
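
The Roofline model underlying this figure is the minimum of two ceilings:

    Attainable GFLOP/sec = min(Peak floating-point performance, Peak memory bandwidth x Arithmetic intensity)

which reproduces the caption's numbers: at 0.25 FLOP/byte both machines are bandwidth-bound, at 162 x 0.25 = 40.5 GFLOP/sec for the SX-9 and 16.4 x 0.25 = 4.1 GFLOP/sec for the Core i7, while at 4 FLOP/byte both machines have crossed their ridge points and run at their computational peaks.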

Figure 4.14 Simplified block diagram of a Multithreaded SIMD Processor. It has 16 SIMD lanes. The SIMD Thread Scheduler has, say, 48 independent threads of SIMD instructions that it schedules with a table of 48 PCs.
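
In CUDA terms, each "thread of SIMD instructions" is a warp of 32 CUDA Threads, and the 48 PCs correspond to up to 48 resident warps per multiprocessor on Fermi. A minimal sketch of the chapter's DAXPY kernel, whose thread blocks the hardware carves into such warps:

    // Each CUDA Thread computes one element; every 32 consecutive threads of a
    // block form one warp (one thread of SIMD instructions with its own PC).
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    // Host launch: 256 threads per block = 8 warps per block.
    // daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);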

Figure 4.15 Floor plan of the Fermi GTX 480 GPU. This diagram shows 16 multithreaded SIMD Processors. The Thread Block Scheduler is highlighted on the left. The GTX 480 has 6 GDDR5 ports, each 64 bits wide, supporting up to 6 GB of capacity. The Host Interface is PCI Express 2.0 x 16. GigaThread is the name of the scheduler that distributes thread blocks to Multiprocessors, each of which has its own SIMD Thread Scheduler.

Figure 4.16 Scheduling of threads of SIMD instructions. The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD Lanes executing the SIMD thread. Because threads of SIMD instructions are independent, the scheduler may select a different SIMD thread each time.

Figure 4.18 GPU Memory structures. GPU Memory is shared by all Grids (vectorized loops), Local Memory is shared by all threads of SIMD instructions within a thread block (body of a vectorized loop), and Private Memory is private to a single CUDA Thread.
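
In CUDA terminology these three spaces are named differently: the figure's GPU Memory is CUDA global memory, its Local Memory is CUDA shared memory (one copy per thread block), and its Private Memory is per-thread registers and local memory. A minimal sketch touching all three (kernel name is illustrative; assumes 256-thread blocks and a grid that exactly covers the arrays):

    __global__ void scale_through_tile(const float *in, float *out) {
        __shared__ float tile[256];            // "Local Memory": shared by this thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];             // read from "GPU Memory" (CUDA global memory)
        __syncthreads();                       // make the block's writes to tile visible
        float v = 2.0f * tile[threadIdx.x];    // v is "Private Memory" (a per-thread register)
        out[i] = v;                            // write the result back to global memory
    }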

Figure 4.19 Block diagram of Fermi's Dual SIMD Thread Scheduler. Compare this design to the single SIMD Thread Design in Figure 4.16.

Figure 4.20 Block diagram of the multithreaded SIMD Processor of a Fermi GPU. Each SIMD Lane has a pipelined floating-point unit, a pipelined integer unit, some logic for dispatching instructions and operands to these units, and a queue for holding results. The four Special Function units (SFUs) calculate functions such as square roots, reciprocals, sines, and cosines.

Figure 4.22 A vector processor with four lanes on the left and a multithreaded SIMD Processor of a GPU with four SIMD Lanes on the right. (GPUs typically have 8 to 16 SIMD Lanes.) The control processor supplies scalar operands for scalar-vector operations, increments addressing for unit and non-unit stride accesses to memory, and performs other accounting-type operations. Peak memory performance only occurs in a GPU when the Address Coalescing unit can discover localized addressing. Similarly, peak computational performance occurs when all internal mask bits are set identically. Note that the SIMD Processor has one PC per SIMD thread to help with multithreading.
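
"Localized addressing" here means the CUDA Threads of a warp touch consecutive addresses, which the Address Coalescing unit can merge into a few wide memory transactions. A minimal sketch of the two cases (kernel names are illustrative):

    // Coalesced: thread k of a warp accesses in[base + k]; unit stride, few transactions.
    __global__ void copy_unit_stride(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Uncoalesced: neighboring threads access addresses `stride` elements apart,
    // so one warp's accesses scatter across many memory transactions.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i * stride] = in[i * stride];
    }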

Figure 4.28 Roofline model [Williams et al. 2009]. These rooflines show double-precision floating-point performance in the top row and single-precision performance in the bottom row. (The DP FP performance ceiling is also in the bottom row to give perspective.) The Core i7 920 on the left has a peak DP FP performance of 42.66 GFLOP/sec, a SP FP peak of 85.33 GFLOP/sec, and a peak memory bandwidth of 16.4 GBytes/sec. The NVIDIA GTX 280 has a DP FP peak of 78 GFLOP/sec, SP FP peak of 624 GFLOP/sec, and 127 GBytes/sec of memory bandwidth. The dashed vertical line on the left represents an arithmetic intensity of 0.5 FLOP/byte. It is limited by memory bandwidth to no more than 8 DP GFLOP/sec or 8 SP GFLOP/sec on the Core i7. The dashed vertical line to the right has an arithmetic intensity of 4 FLOP/byte. It is limited only computationally to 42.66 DP GFLOP/sec and 64 SP GFLOP/sec on the Core i7 and 78 DP GFLOP/sec and 512 SP GFLOP/sec on the GTX 280. To hit the highest computation rate on the Core i7 you need to use all 4 cores and SSE instructions with an equal number of multiplies and adds. For the GTX 280, you need to use fused multiply-add instructions on all multithreaded SIMD processors. Guz et al. [2009] have an interesting analytic model for these two architectures.
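
The dashed lines follow from the same min(peak, bandwidth x intensity) roofline as in Figure 4.11: at 0.5 FLOP/byte the Core i7 can sustain at most 16.4 x 0.5 = 8.2 GFLOP/sec from memory (the caption's "no more than 8"), whereas at 4 FLOP/byte memory would allow 16.4 x 4 = 65.6 GFLOP/sec, so the 42.66 GFLOP/sec DP compute ceiling binds instead.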

Figure 4.33 Sample tree.