Computer Architecture
Chapter 7: Parallel Processing
Parallelism
Instruction-level parallelism (Ch. 6): pipelining, superscalar operation; latency issues, hazards.
Processor-level parallelism (Ch. 7):
- array/vector processors (one control unit, limited application)
- multiprocessors (multiple CPUs, common memory)
- multicomputers (multiple CPUs, each with its own memory)
Processor-Level Parallelism
Instruction-level parallelism helps a little, but pipelining and superscalar operation rarely win more than a factor of five to ten. To get even more speedup, one approach is an array processor, which consists of a large number of identical processors that perform the same sequence of instructions on different sets of data.
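The lockstep behaviour described above can be sketched in plain Python (a simulation only, for illustration; a real array processor does this in hardware, with one instruction broadcast to all processing elements at once):

```python
# Simulated SIMD array processor: one control unit broadcasts each
# instruction, and every processing element (PE) applies it to its
# own local data item in lockstep.
def broadcast(instruction, pe_data):
    """Apply the same instruction to every PE's local data."""
    return [instruction(x) for x in pe_data]

pe_data = [1, 2, 3, 4]                        # one value per PE
result = broadcast(lambda x: x * x, pe_data)  # same op on all PEs
print(result)                                 # [1, 4, 9, 16]
```

Note that there is only one instruction stream: the per-element work differs only in the data each PE holds, which is exactly the SIMD restriction discussed below.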
Problems
1- Array processors work well only on problems that require the same computation to be performed on many data sets simultaneously.
2- They require much more hardware and are difficult to program.
3- The processing elements are not independent CPUs, since there is only one control unit!
Solution: Multiprocessors
A multiprocessor is a system with more than one CPU sharing a common memory.
Problem: conflicts result when several processors access the common bus!
Solution: use a multicomputer.
DESIGN ISSUES FOR PARALLEL COMPUTERS
What are the nature, size, and number of the processing elements?
What are the nature, size, and number of the memory modules?
How are the processing and memory elements interconnected?
CLASSIFICATION OF PARALLEL STRUCTURES
1) A single-processor system: Single Instruction stream, Single Data stream (SISD system).
2) A single stream of instructions is broadcast to a number of processors, each operating on its own data: Single Instruction stream, Multiple Data stream (SIMD system).
3) A number of independent processors, each executing a different program on its own sequence of data: Multiple Instruction stream, Multiple Data stream (MIMD system).
4) A common data structure is manipulated by separate processors, each executing a different program: Multiple Instruction stream, Single Data stream (MISD system). This form does not occur often in practice!
SIMD COMPUTERS: Array Processing
Idea: a single control unit drives many processing units.
Examples: ILLIAC IV, CM-2, MasPar MP-2.
SIMD COMPUTERS: Vector Processing
Vector processing has been much more successful commercially. Developed by Seymour Cray for Cray Research. The machine takes two n-element vectors as input and operates on the corresponding elements in parallel, using a vector ALU that can operate on all n elements simultaneously; it produces a vector result. Example: the Cray-1 vector supercomputer.
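The effect of one vector instruction can be sketched as follows (a pure-Python simulation; a real vector ALU combines all n pairs of elements simultaneously rather than in a loop):

```python
# Sketch of a vector add instruction: two n-element vector operands
# are combined elementwise, producing an n-element result vector.
def vector_add(a, b):
    if len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    return [x + y for x, y in zip(a, b)]

v1 = [1.0, 2.0, 3.0]
v2 = [10.0, 20.0, 30.0]
print(vector_add(v1, v2))  # [11.0, 22.0, 33.0]
```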
CRAY-1
MIMD SYSTEMS
These systems can be divided into two categories:
Multiprocessors: also called shared-memory systems.
Multicomputers: also called distributed-memory systems.
Multiprocessors
All processes working together on a multiprocessor can share a single virtual address space mapped onto the common memory. The ability of two (or more) processes to communicate simply by reading and writing memory is the reason multiprocessors are popular: it is an easy model for programmers to understand and is applicable to a very wide range of problems.
The system runs one copy of the operating system. When every CPU has equal access to all the memory modules and all the I/O devices, the system is called an SMP (Symmetric MultiProcessor) architecture.
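A minimal sketch of this shared-memory model, using Python threads as a stand-in for CPUs (an assumption for illustration; any shared-memory programming interface behaves analogously): the threads communicate purely by reading and writing the same variable, with a lock to serialise conflicting updates.

```python
import threading

counter = 0                      # shared memory visible to all threads
lock = threading.Lock()          # serialises conflicting writes

def worker(n):
    global counter
    for _ in range(n):
        with lock:               # one writer at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                   # 4000: all communication via shared memory
```

Without the lock, the concurrent read-modify-write updates could conflict, which is the software analogue of the bus-contention problem mentioned earlier.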
Example: UMA Bus-Based Architecture
UMA: Uniform Memory Access. An architecture based on a single bus suffers from bus contention.
Solution: add a cache to each CPU, and also add private memory that each CPU can access over a dedicated (private) bus.
Result: much less bus traffic, so the system supports more CPUs.
NUMA Multiprocessors
NUMA: Non-Uniform Memory Access. To get to more than about 100 CPUs, UMA fails: the requirement that every memory module have the same access time from every CPU makes the hardware too complex. Like UMA machines, NUMA machines provide a single address space across all the CPUs, but access to local memory modules is faster than access to remote ones.
Characteristics of NUMA machines
1) There is a single address space visible to all CPUs.
2) Access to remote memory is done using LOAD and STORE instructions.
3) Access to remote memory is slower than access to local memory.
COMA Multiprocessors
COMA: Cache Only Memory Access. NUMA machines have the disadvantage that accesses to remote memory are much slower than local ones; cached designs give excellent performance but are limited in size and quite expensive.
Solution: use each CPU's main memory as a cache, which greatly increases the hit rate and hence the performance.
MESSAGE-PASSING MULTICOMPUTERS
In MIMD architectures:
Multiprocessors appear to the OS as having a shared memory that can be accessed by LOAD and STORE instructions.
Multicomputers have one address space per CPU: a distributed-memory system. Instead of reading and writing a common memory, multicomputers use another communication mechanism: they pass messages back and forth over the interconnection network, using the software primitives send and receive.
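The send/receive style can be sketched as follows. This is a simulation only (threads and queues standing in for separate machines and the interconnection network): the point is that the "node" never touches the caller's variables directly, and all communication is by explicit messages.

```python
import queue
import threading

# Message-passing sketch: the compute node works only on data it has
# explicitly received, and replies by sending a message back.
def compute_node(inbox, outbox):
    data = inbox.get()           # receive primitive
    outbox.put(sum(data))        # send primitive: reply with the result

inbox, outbox = queue.Queue(), queue.Queue()
node = threading.Thread(target=compute_node, args=(inbox, outbox))
node.start()

inbox.put([1, 2, 3, 4])          # send work to the node
result = outbox.get()            # receive the reply
node.join()
print(result)                    # 10
```

On a real multicomputer the same pattern would use a message-passing library over the network rather than in-process queues.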
Interconnection networks
Multiprocessors are easy to program, so why build multicomputers at all? Answer: large multicomputers are much simpler and cheaper to build than multiprocessors with the same number of CPUs.
MPP: Massively Parallel Processors
MPPs are huge multimillion-dollar supercomputers used in science, engineering, and industry for very large, complex calculations, for handling very large numbers of instructions per second, or for data warehousing (managing immense databases). Most of these machines use standard CPUs as their processors (Pentium, Sun UltraSPARC, IBM RS/6000, DEC Alpha, ...).
Classic (old) example: the Intel/Sandia Option Red machine
4608 CPUs arranged in a 3D mesh (32 x 38 x 2): 4536 compute nodes, 32 service nodes, 32 disk nodes, 6 network nodes, 2 boot nodes. I/O nodes manage 640 disks holding 1 TB of data. Speed: up to 100 teraflops (10^14 floating-point operations per second).
COW: Cluster Of Workstations
Also called NOW (Network Of Workstations). A COW consists of a few hundred PCs or workstations connected by a commercially available network board. The difference between MPPs and COWs is analogous to the difference between a mainframe and a PC!
Parallel computing performance depends on:
Hardware
- CPU speed of individual processors
- I/O speed of individual processors
- Interconnection network
- Scalability
Software
- Parallelizability of algorithms
- Application programming languages
- Operating systems
- Parallel system libraries
Hardware
CPU and I/O speed: same factors as for single-processor machines.
Interconnection network:
- Latency (wait time): distance, collisions / collision resolution
- Bandwidth (bps): bus limitations, CPU and I/O limitations
Scalability: adding more processors affects latency and bandwidth.
Hardware
- Reducing latency
- Reducing collisions
- Resolving collisions
- Increasing bandwidth
Software
Parallelizability of algorithms: number of processors, sequential/parallel parts.
Amdahl's Law:
n = number of processors
f = fraction of code that is sequential
T = time to process the entire algorithm sequentially
speedup = n / (1 + (n - 1)f)
Note: total execution time is fT + (1 - f)T/n.
Example: Software
An algorithm takes 15 seconds to execute on a single 1.8 GHz processor. 30% of the algorithm is sequential. Assuming zero latency and perfect parallelism in the remaining code, how long should the algorithm take on a parallel machine with 20 of the same 1.8 GHz processors?
speedup = n / (1 + (n - 1)f) = 20 / (1 + 0.3 x 19) = 20 / 6.7 ≈ 2.985
Therefore the expected time is T / speedup = 15 / (20 / 6.7) = 5.025 seconds.
Another way: (0.3 x 15) + (0.7 x 15) / 20 = 5.025 (sequential part + parallel part).
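The worked example can be checked directly. This is a small sketch of Amdahl's Law with the numbers from the example above:

```python
def amdahl_speedup(n, f):
    """Amdahl's Law: speedup with n processors when a fraction f
    of the code is inherently sequential."""
    return n / (1 + (n - 1) * f)

T = 15.0                               # sequential execution time (s)
n, f = 20, 0.3                         # 20 processors, 30% sequential
s = amdahl_speedup(n, f)
print(round(s, 3))                     # 2.985
print(round(T / s, 3))                 # 5.025 seconds on 20 processors

# Equivalent view: sequential part + parallel part spread over n CPUs
print(round(f * T + (1 - f) * T / n, 3))  # 5.025
```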
Software
speedup = n / (1 + (n - 1)f)
Assuming perfect scalability, what are the implications of Amdahl's Law when n → ∞?
speedup → 1/f (assuming f ≠ 0)
So if f = 0.4, parallelism can never make the program run more than 2.5 times as fast.
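The asymptote is easy to see numerically (a quick sketch, using f = 0.4 as in the example: no matter how many processors are added, the speedup creeps toward 1/f = 2.5 and never reaches it):

```python
def amdahl_speedup(n, f):
    # Amdahl's Law: n processors, sequential fraction f
    return n / (1 + (n - 1) * f)

f = 0.4
for n in (10, 100, 10_000, 1_000_000):
    print(n, round(amdahl_speedup(n, f), 4))
# Speedup approaches, but never reaches, 1/f = 2.5.
```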
Software
Parallel system libraries: precompiled functions designed for multiprocessing (e.g., matrix transformations) and functions for control of communication (e.g., background printing).
Application programming languages: built-in functions for creating child processes, threads, parallel looping, etc.
Software issues: in order to really take advantage of hardware parallelism
1. Control models
Single vs. multiple instruction threads, single vs. multiple data sets: SISD, SIMD, MISD, MIMD.
Software (including the OS, compilers, etc.) must be designed to use these features.
Software issues (continued)
2. Granularity of parallelism: at what levels is parallelism implemented?
3. Computational paradigms: pipelining, divide and conquer, phased computation, replicated worker.
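One of these paradigms, the replicated worker, can be sketched with a pool of identical workers pulling tasks from a common bag of work (using Python's standard-library thread pool as an assumed stand-in for real parallel workers):

```python
from concurrent.futures import ThreadPoolExecutor

# Replicated-worker paradigm: several identical workers repeatedly take
# a task from a shared bag of work and produce a result.
def work(task):
    return task * task           # the computation every worker replicates

tasks = range(8)                 # the bag of tasks
with ThreadPoolExecutor(max_workers=4) as pool:  # four replicated workers
    results = list(pool.map(work, tasks))
print(results)                   # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same structure extends naturally to a multicomputer, where each worker runs on its own node and the bag of tasks is distributed by message passing.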
Software issues (continued)
4. Communication methods: shared variables, message passing.
5. Synchronization: semaphores, locks, etc.