
Parallel Computing: Parallel Architectures
Jin, Hai
School of Computer Science and Technology, Huazhong University of Science and Technology

[Figure: computer system overview - the central processing unit, main memory, and input/output modules connected by the system interconnection, with communication lines to peripherals]

- Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
- Higher levels of device integration have made a large number of transistors available.
- How best to utilize these resources?
- Conventionally, these resources are used to build multiple functional units and execute multiple instructions in the same cycle (instruction-level parallelism, ILP):
  - Pipelining
  - Superscalar execution

Pipelining overlaps the various stages of instruction execution to improve performance.
[Figure: pipeline timing diagram - without pipelining, instructions I1, I2, I3 pass through their Fetch (F) and Execute (E) stages strictly one after another; with pipelining, I2 is fetched while I1 executes, so the F and E stages of successive instructions overlap in time]

Limitations:
- The speed of a pipeline is ultimately limited by its slowest stage, so raising performance requires more stages, i.e., very deep pipelines.
- However, in typical program traces roughly every fifth or sixth instruction is a conditional jump, which requires very accurate branch prediction.
- The penalty of a misprediction grows with the depth of the pipeline, since a larger number of in-flight instructions must be flushed.

Superscalar execution: multiple redundant functional units within each CPU, so that multiple instructions can be executed on separate data items concurrently.
- Early designs: two ALUs and a single FPU.
- Modern designs have more; e.g., the PowerPC 970 includes four ALUs and two FPUs, as well as two SIMD units.

- The performance of the system as a whole suffers if the processor is unable to keep all of the units fed with instructions.
- Factors that affect performance (see the sketch below):
  - True data dependency: the result of one operation is an input to the next.
  - Resource dependency: two operations require the same resource.
  - Branch dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
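As a small, hedged illustration (mine, not from the slides), the host-only snippet below marks a true data dependency, an independent instruction that a superscalar core may issue in the same cycle, and a branch that blocks deterministic scheduling:

```cuda
// Host-only sketch (plain C; compiles with any C compiler or nvcc).
#include <stdio.h>

int main(void) {
    float b = 1.0f, c = 2.0f, e = 3.0f;

    float a = b + c;   /* (1)                                            */
    float d = a * e;   /* (2) true data dependency on (1): needs a first */
    float f = b - e;   /* (3) independent of (1)/(2): a superscalar core */
                       /*     may issue it in the same cycle as (1)      */

    /* Branch dependency: the instructions after the branch cannot be    */
    /* scheduled deterministically before the condition is known.        */
    if (d > f) printf("d wins: %.1f\n", d);
    else       printf("f wins: %.1f\n", f);
    return 0;
}
```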

- The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and, based on these factors, selects an appropriate set of instructions to execute concurrently.
- Very Long Instruction Word (VLIW) processors instead rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.

Limitations:
- The degree of intrinsic parallelism in the instruction stream, i.e., the limited amount of instruction-level parallelism.
- The complexity and time cost of the dispatcher and the associated dependency-checking logic.

Power and Heat: Intel Embraces Multicore
"May 17, 2004: Intel, the world's largest chip maker, publicly acknowledged that it had hit a 'thermal wall' on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects. Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor. Intel's decision to change course and embrace a 'dual core' processor structure shows the challenge of overcoming the effects of heat generated by the constant on-off movement of tiny switches in modern computers. Some analysts and former Intel designers said that Intel was coming to terms with escalating heat problems so severe they threatened to cause its chips to fracture at extreme temperatures." (New York Times, May 17, 2004)

[Figure: multicore chip - multiple processors, each with its own on-chip memory, connected to a shared global memory]
- A handful of processors, each supporting ~1 hardware thread
- On-chip memory near the processors (cache, RAM, or both)
- Shared global memory space (external DRAM)

Multicore:
- Single powerful thread per core
- Thread parallel
- Explicit communication
- Explicit synchronization

Cell processor components:
- PPE: Power Processing Element
- SPE: Synergistic Processing Element
- SPU: Synergistic Processing Unit
- LS: Local Store
- MFC: Memory Flow Controller
- EIB: Element Interconnect Bus

Cell Processor: the PPE
- More or less a standard scalar processor
- Accesses the main memory through load and store instructions
- Standard L1 and L2 caches
- Capable of running scalar (non-vectorized) code fast
- Capable of running a standard operating system, e.g., Linux
- Capable of executing IEEE floating-point arithmetic (double and single precision)

Cell Processor: the SPE
- A pure vector processor
- Only executes code from its local memory
- Only operates on data in its local memory
- Accesses the main memory and the local memories of other SPEs only through DMA messages
- Loads and stores only 128-bit vectors
- Only operates on 128-bit vectors (scalar instructions are emulated in software)
- Supports only a single thread, with a register file of 128 vector registers at its disposal

Cell Processor: the Local Store
- A fast local memory, private to the SPU
- 256 KB of static RAM
- Loads and stores take only a few cycles
- Supports only vector (128-bit) loads and stores
- DMA transfers to and from the main memory
- DMA transfers to and from other Local Stores
- No hardware coherency

Cell: Vectorization
- You need to vectorize as much as possible: scalar operations are not supported directly (they are emulated in software)
- Loads and stores move 4-element vectors
- Arithmetic operations operate on 4-element vectors

Cell: Vectorization (cont.)
- Shuffles, shifts, and rotations allow data to be rearranged within a vector
- The SPU has two pipelines: one for arithmetic, one for shuffles / shifts / rotations / etc.
- The SPU can complete one floating-point operation and one shuffle / shift / rotation per cycle (a small sketch of 4-wide vectorization follows)
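To make the 4-element vector model concrete, here is a minimal sketch of what vectorizing a loop means; it is my illustration, not from the slides. The vec4 type and the vadd helper stand in for a 128-bit SPU register and a single SIMD instruction; real SPE code would use the SPU vector intrinsics instead of plain C.

```cuda
// Host-only sketch (plain C) of 4-wide vectorization.
#include <stdio.h>
#include <string.h>

typedef struct { float x[4]; } vec4;      /* stand-in for one 128-bit vector */

static vec4 vadd(vec4 a, vec4 b) {        /* stand-in for one SIMD add       */
    vec4 r;
    for (int i = 0; i < 4; i++) r.x[i] = a.x[i] + b.x[i];
    return r;
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* Per iteration: two 128-bit vector loads, one vector add, one vector
     * store. Two iterations replace eight scalar additions.               */
    for (int i = 0; i < 8; i += 4) {
        vec4 va, vb, vc;
        memcpy(&va, &a[i], sizeof va);    /* vector load  */
        memcpy(&vb, &b[i], sizeof vb);    /* vector load  */
        vc = vadd(va, vb);                /* vector add   */
        memcpy(&c[i], &vc, sizeof vc);    /* vector store */
    }

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);   /* prints all 9s */
    printf("\n");
    return 0;
}
```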

Cell: Dual-Issue
- The two pipelines can each issue / complete one instruction per cycle
- Even pipeline: arithmetic
- Odd pipeline: loads and stores; shuffles, shifts, rotations

Cell Processor: the MFC
- A DMA engine
- Moves data between the main memory and the Local Store
- Moves data between Local Stores
- DMA messages do not block computation
- Multiple messages can be in flight at the same time

Cell Processor: the EIB
- A fast internal bus that connects all elements on the chip
- Each SPE has a bandwidth of 25.6 GB/s
- The EIB has an aggregate bandwidth of 204.8 GB/s
- The main memory has a bandwidth of 25.6 GB/s
- The main memory is organized into 16 banks (2 KB interleaved)
- Maximum bandwidth is achieved when transferring entire cache lines (aligned, contiguous 128 B blocks of data)

Cell: Communication
- While the SPU is computing, the MFC can transfer data
- Overlap computation and communication (double buffering); a sketch of the pattern follows
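Below is a minimal, hedged sketch of the double-buffering pattern the slide refers to. It is not Cell SDK code: plain memcpy and a no-op wait stand in for the asynchronous MFC DMA commands an SPU would issue, so only the buffer-swapping structure is shown, not the actual overlap.

```cuda
// Host-only sketch (plain C) of double buffering.
#include <stdio.h>
#include <string.h>

#define CHUNK   256
#define NCHUNKS 4

/* Stand-ins for asynchronous MFC DMA: a real SPU program would issue a
 * tagged DMA "get" and later wait on that tag.                           */
static void dma_get(float *local, const float *remote, int n) {
    memcpy(local, remote, n * sizeof(float));
}
static void dma_wait(void) { /* wait for the DMA tag in real code */ }

static void compute(float *buf, int n) {
    for (int i = 0; i < n; i++) buf[i] *= 2.0f;   /* some work on the chunk */
}

int main(void) {
    static float main_memory[NCHUNKS * CHUNK];    /* data in "main memory"   */
    float local[2][CHUNK];                        /* two local-store buffers */
    int cur = 0;

    dma_get(local[cur], &main_memory[0], CHUNK);             /* prefetch chunk 0 */
    for (int c = 0; c < NCHUNKS; c++) {
        int next = cur ^ 1;
        if (c + 1 < NCHUNKS)                                 /* start fetching   */
            dma_get(local[next], &main_memory[(c + 1) * CHUNK], CHUNK);
        dma_wait();                                          /* chunk c is ready */
        compute(local[cur], CHUNK);                          /* ...compute on it */
        cur = next;                                          /* swap buffers     */
    }
    printf("processed %d chunks of %d floats\n", NCHUNKS, CHUNK);
    return 0;
}
```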

- A GPU (Graphics Processing Unit) contains multiple cores that utilise hardware multithreading and SIMD.
- All PCs have a GPU: the chip inside a computer that calculates and generates the positioning of graphics on the screen.
- Games typically render tens of thousands of triangles at 60 fps.
- Screen resolution is typically 1600 x 1200, and every pixel is recalculated each frame.
- This corresponds to processing 1600 x 1200 x 60 = 115,200,000 pixels per second.
- GPUs are designed to make these operations fast.

Obviously, this pattern of computation is common to many other applications.

Flynn's Taxonomy (instruction stream x data stream):
- SISD (single instruction, single data): uniprocessors
- SIMD (single instruction, multiple data): processor arrays, pipelined vector processors
- MISD (multiple instruction, single data): rarely used
- MIMD (multiple instruction, multiple data): multiprocessors, multicomputers

Types of Parallelism

SIMD: Single Instruction, Multiple Data architecture
- A single instruction can operate on multiple data elements in parallel
- Relies on the highly structured nature of the underlying computations (data parallelism)
- Widely applied in multimedia processing (e.g., graphics, image, and video)

Stream Processing
- A stream is a set of input and output data
- Stream processing is a series of operations (kernel functions) applied to each element of a stream
- Uniform streaming is most typical: one kernel at a time is applied to all elements of the stream (see the sketch below)
- Single Instruction, Multiple Data (SIMD)
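As a minimal illustration (mine, not from the slides), the sketch below applies one kernel function uniformly to every element of a stream; a SIMD machine or GPU would execute these per-element applications in parallel, while a plain loop stands in for that here.

```cuda
// Host-only sketch (plain C) of uniform streaming.
#include <stdio.h>

/* The kernel function: the same operation applied to every element. */
static float kernel(float x) {
    return 2.0f * x + 1.0f;
}

int main(void) {
    float in[8]  = {0, 1, 2, 3, 4, 5, 6, 7};   /* input stream  */
    float out[8];                              /* output stream */

    /* Uniform streaming: one kernel, all elements.  On SIMD/GPU hardware
     * these independent applications run in parallel.                   */
    for (int i = 0; i < 8; i++)
        out[i] = kernel(in[i]);

    for (int i = 0; i < 8; i++) printf("%.0f ", out[i]);
    printf("\n");
    return 0;
}
```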

Instruction-Based Processing
- During processing, the data required for an instruction's execution is loaded into the cache, if not already present
- A very flexible model, but with the disadvantage that the data sequence is completely driven by the instruction sequence, yielding inefficient performance for uniform operations on large data blocks

Data Stream Processing
- The processor is first configured with the instructions to be performed; in the next step, a data stream is processed
- The execution can be distributed among several pipelines

- The GPU (a set of multiprocessors) executes many thread blocks
- Each thread block consists of many threads
- Within a thread block, threads are grouped into warps
- Each thread has:
  - Per-thread registers
  - Per-thread memory (in DRAM)
- Each thread block has:
  - Per-thread-block shared memory
- Global memory (DRAM) is accessible to all threads
(a kernel sketch of this hierarchy follows)
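A minimal CUDA sketch of this hierarchy (my illustration, not from the slides): the grid of thread blocks covers the array, and each thread uses its block and thread indices to find the one element it works on. The block size of 256 threads corresponds to 8 warps of 32 threads.

```cuda
#include <cstdio>

// Each thread handles one array element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the last block
        data[i] *= factor;                           // per-thread work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));          // global memory (DRAM)
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;                       // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);   // launch the grid
    cudaDeviceSynchronize();

    printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
    cudaFree(d_data);
    return 0;
}
```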

- SM: Streaming Multiprocessor (more or less a core)
- SP: Streaming Processor ("scalar processor core", a.k.a. "thread processor")
- Register file
- Shared memory
- Constant cache (read-only for the SM)
- Texture cache (read-only for the SM)

- Eight scalar processors (thread processors) sharing one instruction-issue logic (SIMD)
- Long vectors (32 threads = 1 warp)
- Massively multithreaded (512 scalar hardware threads = 16 warps)
- Huge register file (8192 scalar registers shared among all threads)
- Can load and store data directly to and from the main memory
- Can load and store data to and from shared memory

Global (device) memory:
- DRAM memory
- Large latency (hundreds of cycles)
- Large bandwidth (140 GB/s)
- Maximum bandwidth requires coalescing, e.g., transferring aligned 128 B blocks of data (see the sketch below)
- Organized into 6 to 8 partitions (256 B interleaved); maximum bandwidth requires balanced accesses to all partitions
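A hedged CUDA sketch (my example) of what coalescing means in practice: in the first kernel, the consecutive threads of a warp read consecutive addresses, so their loads combine into a few wide memory transactions; in the second, a large stride scatters the accesses across memory and effective bandwidth drops.

```cuda
#include <cstdio>

// Coalesced: thread i reads element i, so each warp touches one
// contiguous, aligned block of global memory per load.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read addresses 128 B apart, so a warp
// needs many separate memory transactions (poor effective bandwidth).
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}

int main() {
    const int n = 1 << 22;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    copy_coalesced<<<grid, block>>>(d_in, d_out, n);       // fast access pattern
    copy_strided<<<grid, block>>>(d_in, d_out, n, 32);     // slow access pattern
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```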

Shared memory:
- A fast local memory, private to a thread block
- 16 KB of static RAM
- Loads and stores take only a few cycles
- Organized into 16 banks
- Allows 16 threads (a half-warp) to load 16 elements (e.g., floats) simultaneously if there are no bank conflicts (e.g., 16 consecutive addresses); otherwise the access is serialized (a sketch follows)
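A minimal CUDA sketch (mine, not from the slides) of shared memory use: each thread of a block stages one element into a __shared__ array at its own index, so the threads of a half-warp touch consecutive addresses in different banks, then the block synchronizes and reads the data back in reversed order.

```cuda
#include <cstdio>

// Reverse each 256-element segment of the array using shared memory.
__global__ void block_reverse(float *data, int n) {
    __shared__ float tile[256];            // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = data[i];       // consecutive addresses: no bank conflicts
    __syncthreads();                       // wait for the whole block

    if (i < n)
        data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

int main() {
    const int n = 1024;
    float h[n], *d;
    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    block_reverse<<<n / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("first block now starts with %.0f\n", h[0]);   // prints 255
    return 0;
}
```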

General-Purpose Computing on GPUs
- Idea:
  - Potential for very high performance at low cost
  - Architecture well suited to certain kinds of parallel applications (data parallel)
- Early challenges:
  - Architectures were highly customized to graphics problems (e.g., vertex and fragment processors)
  - Programmed using graphics-specific programming models or libraries
- Recent trends:
  - Some convergence between commodity processors and GPUs and their associated parallel programming models

CUDA
- Compute Unified Device Architecture, one of the first programming models to support heterogeneous architectures
- A data-parallel programming interface to the GPU
- The data to be operated on is discretized into independent partitions of memory
- Each thread performs roughly the same computation on a different partition of the data
- When appropriate, this yields easy-to-express and very efficient parallelization
- The programmer expresses (see the sketch below):
  - Thread programs to be launched on the GPU, and how to launch them
  - Data organization and movement between the host and the GPU
  - Synchronization, memory management, etc.
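Below is a small, self-contained CUDA program (my sketch, not taken from the slides) showing those programmer responsibilities: the thread program itself, host/device data movement, the launch configuration, and synchronization.

```cuda
#include <cstdio>

// Thread program (kernel): each thread adds one pair of elements.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);

    // Data organization on the host
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Memory management on the device
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Data movement: host -> GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch configuration: how many blocks of how many threads
    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Data movement back to the host (synchronizes with the kernel)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %.1f (expected 300.0)\n", h_c[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```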

[Figure: CUDA software stack]

- Device = GPU = a set of multiprocessors
- Multiprocessor = a set of processors and shared memory
- Kernel = a GPU program
- Grid = an array of thread blocks that execute a kernel
- Thread block = a group of SIMD threads that execute a kernel and can communicate via shared memory

[Figure: CUDA hardware model]

CUDA Memory Model
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read (only) per-grid constant memory
- Read (only) per-grid texture memory
The host can read/write global, constant, and texture memory. (The snippet below maps these spaces to CUDA source qualifiers.)
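As a hedged sketch (my example, not from the slides), this is how most of those spaces appear in CUDA source: ordinary local variables live in per-thread registers (spilling to per-thread local memory), __shared__ declares per-block shared memory, __constant__ declares per-grid constant memory written by the host via cudaMemcpyToSymbol, and device pointers refer to global memory. Texture memory is omitted here.

```cuda
#include <cstdio>

__constant__ float coeff[2];              // per-grid constant memory (read-only in kernels)

__global__ void apply(const float *in, float *out, int n) {   // in/out: global memory
    __shared__ float tile[256];           // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;     // x lives in a per-thread register
    tile[threadIdx.x] = x;
    __syncthreads();
    if (i < n)
        out[i] = coeff[0] * tile[threadIdx.x] + coeff[1];
}

int main() {
    const int n = 256;
    float h_coeff[2] = {2.0f, 1.0f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // host writes constant memory

    float *d_in, *d_out;                                   // host manages global memory
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    apply<<<1, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```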

CUDA Programming Model
- The GPU is viewed as a compute device that:
  - is a coprocessor to the CPU (host)
  - has its own device memory
  - runs many threads in parallel
- Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads

[Figure: CUDA programming model]

Future Computer Systems
