Parallel Computing: Parallel Architectures
Jin, Hai
School of Computer Science and Technology, Huazhong University of Science and Technology
[Figure: top-level view of a computer system - central processing unit, main memory, and input/output peripherals connected by the system interconnection, with communication lines to other computers]
! Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
! Higher levels of device integration have made a large number of transistors available.
! How best to utilize these resources?
! Conventionally, these resources are used to build multiple functional units and execute multiple instructions in the same cycle (instruction-level parallelism, ILP):
" Pipelining
" Superscalar execution
Pipelining
! Overlaps the various stages of instruction execution (F: Fetch, E: Execute) to improve performance.
[Figure: timing diagram - without pipelining, instructions I1, I2, I3 run their F and E stages strictly one after another; with pipelining, the F stage of each instruction overlaps the E stage of its predecessor]
! Limitations:
! The speed of a pipeline is eventually limited by its slowest stage; pushing clock rates further requires more stages, i.e., very deep pipelines.
! However, in typical program traces, every fifth or sixth instruction is a conditional jump - this requires very accurate branch prediction.
! The penalty of a misprediction grows with the depth of the pipeline, since a larger number of in-flight instructions must be flushed.
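The timing argument above can be sketched numerically. This is a minimal model of our own (function names are ours): an ideal k-stage pipeline finishes the first instruction after k cycles and one more each cycle after that; stalls and mispredictions are ignored.

```c
#include <assert.h>

/* Ideal k-stage pipeline: first instruction takes k cycles, then one
 * instruction completes per cycle. Stalls/mispredictions ignored. */
long pipelined_cycles(long k, long n) { return k + (n - 1); }

/* Without pipelining, each instruction occupies all k stages in turn. */
long sequential_cycles(long k, long n) { return k * n; }

/* For large n, speedup = k*n / (k + n - 1) approaches k, which is why
 * deeper pipelines look attractive until misprediction penalties and the
 * slowest stage dominate. */
```

For k = 5 stages the speedup over a million instructions is very close to 5, the pipeline depth.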
Superscalar Execution
! Multiple redundant functional units within each CPU, so that multiple instructions can be executed on separate data items concurrently.
! Early designs: two ALUs and a single FPU.
! Modern designs have more, e.g., the PowerPC 970 includes four ALUs and two FPUs, as well as two SIMD units.
! The performance of the system as a whole will suffer if it is unable to keep all of the units fed with instructions.
! Factors that affect performance:
! True data dependency: the result of one operation is an input to the next.
! Resource dependency: two operations require the same resource.
! Branch dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
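The three dependency types can be seen in a few lines of ordinary code. This is an illustrative example of our own, not from the source; the comments mark which pairs of operations a superscalar scheduler could not issue together.

```c
#include <assert.h>

/* Hypothetical scalar code annotated with the dependency types above. */
int dependencies(int a, int b) {
    int x = a * b;      /* (1) */
    int y = x + 1;      /* (2) true data dependency: needs x from (1)     */
    int p = a / 3;      /* (3) */
    int q = b / 5;      /* (4) resource dependency with (3) if the CPU    */
                        /*     has only one divider unit                  */
    int r;
    if (y > p)          /* branch dependency: which side executes is not  */
        r = q + 1;      /*     known until the condition resolves         */
    else
        r = q - 1;
    return r;
}
```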
! The scheduler - a piece of hardware - looks at a large number of instructions in an instruction queue and selects an appropriate set of instructions to execute concurrently based on these factors.
! Very Long Instruction Word (VLIW) processors instead rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
! Limitations:
! The degree of intrinsic parallelism in the instruction stream, i.e., the limited amount of instruction-level parallelism.
! The complexity and time cost of the dispatcher and the associated dependency-checking logic.
Power and Heat: Intel Embraces Multicore
"May 17, 2004 - Intel, the world's largest chip maker, publicly acknowledged that it had hit a 'thermal wall' on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects. Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor. Intel's decision to change course and embrace a 'dual core' processor structure shows the challenge of overcoming the effects of heat generated by the constant on-off movement of tiny switches in modern computers. Some analysts and former Intel designers said that Intel was coming to terms with escalating heat problems so severe they threatened to cause its chips to fracture at extreme temperatures."
- New York Times, May 17, 2004
[Figure: several processors, each with nearby on-chip memory, connected to a shared global memory]
! Handful of processors, each supporting ~1 hardware thread
! On-chip memory near the processors (cache, RAM, or both)
! Shared global memory space (external DRAM)
! Multicore
! Single powerful thread per core
! Thread parallel
! Explicit communication
! Explicit synchronization
Cell Processor
! PPE: Power Processing Element
! SPE: Synergistic Processing Element
! SPU: Synergistic Processing Unit
! LS: Local Store
! MFC: Memory Flow Controller
! EIB: Element Interconnect Bus
Cell Processor: the PPE
! More or less a standard scalar processor
! Accesses main memory through load and store instructions
! Standard L1 and L2 caches
! Capable of running scalar (non-vectorized) code fast
! Capable of running a standard operating system, e.g., Linux
! Capable of executing IEEE floating-point arithmetic (double and single precision)
Cell Processor: the SPE
! A pure vector processor
! Executes code only from its local memory
! Operates only on data in its local memory
! Accesses main memory and the local memories of other SPEs only through DMA messages
! Loads and stores only 128-bit vectors
! Operates only on 128-bit vectors (scalar instructions are emulated in software)
! Supports only a single thread, with a register file of 128 vector registers at its disposal
Cell Processor: Local Store
! A fast local memory
! Private memory of the SPU
! 256 KB of static RAM
! Loads and stores take only a few cycles
! Supports only vector (128-bit) loads and stores
! DMA transfers to and from main memory
! DMA transfers to and from other Local Stores
! No hardware coherency
Cell: Vectorization
! You need to vectorize as much as possible - scalar operations are not supported (they are emulated in software)
! Loads and stores move 4-element vectors
! Arithmetic operations operate on 4-element vectors
Cell: Vectorization
! Shuffles, shifts, and rotations allow data to be rearranged within a vector
! The SPU has two pipelines: one for arithmetic, one for shuffles/shifts/rotations, etc.
! The SPU can complete one floating-point operation and one shuffle/shift/rotation in the same cycle
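The 4-element vector operations described above can be sketched in portable C. This is a stand-in of our own for SPU-style 128-bit vectors (real SPE code would use the vector types and intrinsics of the Cell SDK, not these names):

```c
#include <assert.h>

/* One "register" is a 4-element float vector; one operation updates all
 * four lanes at once. */
typedef struct { float v[4]; } vec4;

vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] + b.v[i];   /* all 4 lanes in one "instruction" */
    return r;
}

/* A shuffle rearranges lanes within a vector, e.g., rotate left by one;
 * on the SPU this would run on the odd pipeline, in parallel with
 * arithmetic on the even pipeline. */
vec4 vec4_rotate_left(vec4 a) {
    vec4 r = { { a.v[1], a.v[2], a.v[3], a.v[0] } };
    return r;
}
```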
Cell: Dual-Issue
! The two pipelines can each issue/complete one instruction per cycle
! Even pipeline:
! Arithmetic
! Odd pipeline:
! Loads and stores
! Shuffles, shifts, rotations
Cell Processor: MFC
! A DMA engine
! Moves data between main memory and the Local Store
! Moves data between Local Stores
! Messages do not block computation
! Multiple messages can be in flight at the same time
Cell Processor: EIB
! A fast internal bus
! Connects all elements on the chip
! Each SPE has a bandwidth of 25.6 GB/s
! The EIB has an aggregate bandwidth of 204.8 GB/s
! Main memory has a bandwidth of 25.6 GB/s
! Main memory is organized in 16 banks (2 KB interleaved)
! Maximum bandwidth is achieved when transferring entire cache lines (aligned, contiguous 128 B blocks of data)
Cell: Communication
! While the SPU is computing, the MFC can transfer data
! Overlap computation and communication (double-buffering)
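The double-buffering idea can be sketched as follows. This is a minimal model of our own: `memcpy` stands in for an asynchronous DMA transfer (real SPE code would use `mfc_get`/`mfc_put` and tag-group waits), and the function names are ours.

```c
#include <string.h>

#define CHUNK 4   /* elements per DMA transfer (illustrative size) */

/* While one local buffer is being processed, the "DMA engine" fills the
 * other, so transfer and computation overlap. */
void process_stream(const int *main_mem, int *out, int nchunks) {
    int buf[2][CHUNK];
    memcpy(buf[0], main_mem, sizeof buf[0]);           /* prefetch chunk 0 */
    for (int c = 0; c < nchunks; c++) {
        int cur = c & 1;
        if (c + 1 < nchunks)                           /* start "DMA" of   */
            memcpy(buf[!cur], main_mem + (c + 1) * CHUNK, sizeof buf[0]);
        for (int i = 0; i < CHUNK; i++)                /* compute on the   */
            out[c * CHUNK + i] = buf[cur][i] * 2;      /* current buffer   */
    }
}
```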
! A GPU (Graphics Processing Unit) contains multiple cores that utilize hardware multithreading and SIMD.
! All PCs have a GPU - the chip that calculates and generates the positioning of graphics on the computer screen.
! Games typically render tens of thousands of triangles at 60 fps.
! Screen resolution is typically 1600 x 1200, and each pixel is recalculated every frame.
! This corresponds to processing 115,200,000 pixels per second.
! GPUs are designed to make these operations fast.
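The throughput figure above follows directly from the resolution and frame rate; a one-line check (function name is ours):

```c
#include <assert.h>

/* Pixels per second = pixels per frame * frames per second. */
long pixels_per_second(long width, long height, long fps) {
    return width * height * fps;
}

/* 1600 * 1200 * 60 = 115,200,000, matching the figure in the slide. */
```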
Obviously, this pattern of computation is common to many other applications.
Flynn's Taxonomy

                           Data stream
                           Single              Multiple
Instruction   Single       SISD                SIMD
stream                     (uniprocessor)      (processor arrays, pipelined vector processors)
              Multiple     MISD                MIMD
                           (rarely used)       (multiprocessors, multicomputers)
Types of Parallelism
SIMD
! Single Instruction, Multiple Data architecture
! A single instruction can operate on multiple data elements in parallel
! Relies on the highly structured nature of the underlying computations
" Data parallelism
! Widely applied in multimedia processing (e.g., graphics, image, and video)
Stream Processing
! A stream is a set of input and output data
! Stream processing is a series of operations (kernel functions) applied to each element of a stream
! Uniform streaming is the most typical case: one kernel at a time is applied to all elements of the stream
! Single Instruction, Multiple Data (SIMD)
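Uniform streaming can be sketched in a few lines of C. This is an illustration of our own (names are ours): one kernel function is applied to every element of an input stream, and because the elements are independent, the loop is exactly the kind of computation SIMD hardware parallelizes.

```c
#include <assert.h>

/* A kernel function maps one stream element to one output element. */
typedef float (*kernel_fn)(float);

float scale_kernel(float x) { return 2.0f * x + 1.0f; }

/* Uniform streaming: the same kernel applied to all elements. Each
 * iteration is independent, so the loop can run in parallel (SIMD). */
void run_stream(kernel_fn k, const float *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = k(in[i]);
}
```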
Instruction-Based Processing
! During processing, the data required for an instruction's execution is loaded into the cache, if not already present
! A very flexible model, but it has the disadvantage that the data sequence is completely driven by the instruction sequence, yielding inefficient performance for uniform operations on large data blocks
Data-Stream Processing
! The processor is first configured with the instructions to be performed, and in the next step a data stream is processed
! The execution can be distributed among several pipelines
! The GPU (a set of multiprocessors) executes many thread blocks
! Each thread block consists of many threads
! Within a thread block, threads are grouped into warps
! Each thread has:
! Per-thread registers
! Per-thread local memory (in DRAM)
! Each thread block has:
! Per-thread-block shared memory
! Global memory (DRAM) is accessible to all threads
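The hierarchy above gives every thread a unique position in the grid. A minimal 1-D sketch of the index arithmetic (function names are ours; the formula mirrors CUDA's `blockIdx.x * blockDim.x + threadIdx.x`):

```c
#include <assert.h>

/* Global index of a thread from its block index, the block size, and its
 * index within the block. */
int global_thread_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

/* Threads within a block are grouped into warps of warp_size threads. */
int warp_of_thread(int thread_idx, int warp_size) {
    return thread_idx / warp_size;
}
```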
! SM: Streaming Multiprocessor (more or less a core)
! SP: Streaming Processor ("scalar processor core", AKA "thread processor")
! Register file
! Shared memory
! Constant cache (read-only for the SM)
! Texture cache (read-only for the SM)
! Eight scalar processors (thread processors) sharing one instruction issue logic (SIMD)
! Long vectors (32 threads = 1 warp)
! Massively multithreaded (512 scalar hardware threads = 16 warps)
! Huge register file (8192 scalar registers shared among all threads)
! Can load and store data directly to and from main memory
! Can load and store data to and from shared memory
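The figures above relate directly to each other; a quick arithmetic sketch (a simplification assuming one thread per scalar processor per cycle; function names are ours):

```c
#include <assert.h>

/* A 32-thread warp executed on 8 scalar processors needs 32/8 = 4 cycles
 * per instruction under this simplified model. */
int cycles_per_warp_instruction(int warp_size, int num_sps) {
    return warp_size / num_sps;
}

/* 512 hardware threads grouped into warps of 32 gives 16 resident warps. */
int resident_warps(int hw_threads, int warp_size) {
    return hw_threads / warp_size;
}
```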
! DRAM memory
! Large latency (hundreds of cycles)
! Large bandwidth (140 GB/s)
! Maximum bandwidth requires coalescing (e.g., transferring aligned 128 B blocks of data)
! Organized in 6 to 8 partitions (256 B interleaved); maximum bandwidth requires balanced accesses across all partitions
! A fast local memory
! Private to a thread block
! 16 KB of static RAM
! Loads and stores take only a few cycles
! Organized in 16 banks
! 16 threads (a half-warp) can load 16 elements (e.g., floats) simultaneously if there are no bank conflicts (e.g., 16 consecutive addresses); otherwise access is serialized
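The bank-conflict rule above can be sketched with the usual mapping of addresses to banks (an illustrative model of our own: consecutive 4-byte words map to consecutive banks):

```c
#include <assert.h>

/* Bank serving a given byte address, for num_banks banks of word_bytes-
 * wide words. 16 threads reading 16 consecutive floats hit 16 distinct
 * banks (no conflict); two threads whose addresses map to the same bank
 * are serialized. */
int bank_of(unsigned byte_addr, int num_banks, int word_bytes) {
    return (byte_addr / word_bytes) % num_banks;
}
```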
General-Purpose Computing on GPUs
! Idea:
! Potential for very high performance at low cost
! Architecture well suited to certain kinds of (data-parallel) applications
! Early challenges:
! Architectures highly customized to graphics problems (e.g., vertex and fragment processors)
! Programmed using graphics-specific programming models or libraries
! Recent trends:
! Some convergence between commodity processors and GPUs and their associated parallel programming models
CUDA
! Compute Unified Device Architecture, one of the first models to support heterogeneous architectures
! A data-parallel programming interface to the GPU
! The data to be operated on is discretized into independent partitions of memory
! Each thread performs roughly the same computation on a different partition of the data
! When appropriate, parallelization is easy to express and very efficient
! The programmer expresses:
! Thread programs to be launched on the GPU, and how to launch them
! Data organization and movement between the host and the GPU
! Synchronization, memory management, ...
CUDA Software Stack
! Device = GPU = a set of multiprocessors
! Multiprocessor = a set of processors & shared memory
! Kernel = a GPU program
! Grid = an array of thread blocks that execute a kernel
! Thread block = a group of SIMD threads that execute a kernel and can communicate via shared memory
CUDA Hardware Model
CUDA Memory Model
Each thread can:
! Read/write per-thread registers
! Read/write per-thread local memory
! Read/write per-block shared memory
! Read/write per-grid global memory
! Read (only) per-grid constant memory
! Read (only) per-grid texture memory
The host can read/write global, constant, and texture memory.
CUDA Programming Model
! The GPU is viewed as a compute device that:
! is a coprocessor to the CPU (the host)
! has its own device memory
! runs many threads in parallel
! Data-parallel portions of an application are executed on the device as kernels, which run in parallel across many threads
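The kernel/grid execution model above can be sketched host-side in plain C. This is a sequential emulation of our own, not the CUDA API: the "kernel" runs once per (block, thread) pair, and "launching" iterates over the whole grid, whereas on a real GPU those iterations run in parallel.

```c
#include <assert.h>

typedef struct { int block, thread; } thread_id;

/* One kernel invocation: computes its global index (as CUDA threads do
 * from blockIdx/blockDim/threadIdx) and processes one element. */
void vec_add_kernel(thread_id t, int block_dim,
                    const float *a, const float *b, float *c, int n) {
    int i = t.block * block_dim + t.thread;
    if (i < n)                   /* guard: the grid may overshoot n */
        c[i] = a[i] + b[i];
}

/* "Launch": run the kernel for every thread of every block in the grid. */
void launch_vec_add(int grid_dim, int block_dim,
                    const float *a, const float *b, float *c, int n) {
    for (int blk = 0; blk < grid_dim; blk++)
        for (int th = 0; th < block_dim; th++) {
            thread_id t = { blk, th };
            vec_add_kernel(t, block_dim, a, b, c, n);
        }
}
```

The guard `i < n` is the standard idiom for grids whose total thread count exceeds the data size.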
Future Computer Systems