CIT 668: System Architecture
Parallel Computing (2/14/2011)

Topics
1. What is Parallel Computing?
2. Why use Parallel Computing?
3. Types of Parallelism
4. Amdahl's Law
5. Flynn's Taxonomy of Parallel Computers
6. Parallel Memory Architectures
7. Parallel Programming Models

Images from the LLNL Parallel Computing Tutorial, Wikipedia, or Majd F. Sakr's parallel computation lectures unless otherwise noted.

Serial and Parallel Computation
(figures contrasting serial and parallel execution of a problem)

History of Parallel Computation

Parallel Computation
Breaking a problem into pieces, using multiple computer resources to solve each piece, and reassembling the solution pieces into the final answer.
Parallelism is limited by data dependencies.

Data Dependencies
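To make the data-dependency limit concrete, here is a small illustrative C program (the arrays and values are made up): the first loop's iterations are independent and could run in parallel, while the second loop's iterations each need the result of the previous one and therefore cannot.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[N];

        /* Independent iterations: each b[i] uses only a[i], so the
           iterations could run in parallel. */
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

        /* Dependent iterations: each a[i] needs the new value of a[i-1],
           so iteration i cannot start before iteration i-1 finishes. */
        for (int i = 1; i < N; i++)
            a[i] = a[i] + a[i - 1];

        printf("b[7]=%.1f a[7]=%.1f\n", b[7], a[7]);
        return 0;
    }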

Parallel Terminology
- Task: A logically discrete section of computational work. A task is typically a thread or process at the OS level.
- Communications: Parallel tasks typically need to exchange data over a shared memory bus or over a network.
- Synchronization: Coordination of parallel tasks in real time. Often implemented by establishing a synchronization point within an application where a task cannot proceed further until other tasks reach that point.
- Scalability: Ability of a parallel system to demonstrate a proportionate increase in speed with the addition of more processors.
- Embarrassingly Parallel: Solving many similar but independent tasks in parallel with little or no need for coordination between tasks.

Parallel Granularity
Fine grain
- Relatively small amounts of computation done between communication events.
- Low computation-to-communication ratio.
- Facilitates load balancing.
Coarse grain
- Relatively large amounts of computation done between communication events.
- High computation-to-communication ratio.
- Difficult to load balance.

Why Parallel Computing?

The Real World is Parallel

Modeling Science & Engineering Problems

Reasons for Parallel Computation
- Limits to serial computing: CPU clock speeds have increased only slowly since 2003.
- Solve problems faster: reduce time by using more resources.
- Solve larger problems: scientific problems, web-scale applications.

Types of Parallelism
- Bit-level parallelism
- Instruction-level parallelism
- Data-level parallelism
- Task-level parallelism

The Processor
- The brain: a functional unit that interprets and carries out instructions (mathematical operations).
- Also called a CPU (actually includes a control unit + ALU).
- Consists of hundreds of millions of transistors.

Processor Components: Control
- Control Unit: the processor's supervisor; runs the fetch/execute cycle (fetch, decode, execute, store).
- Program Counter (PC): stores the address of the next instruction to be fetched.
- Instruction Register (IR): holds the instruction most recently fetched.

Processor Components: Datapath
- Register File: general-purpose storage locations inside the processor that hold addresses or values. A 32-bit processor has registers that are typically 32 bits in size.
- Arithmetic Logic Unit (ALU): set of functional units that perform arithmetic and logic operations.

Bit-level Parallelism
Increase processor word size to operate on more bits at once.
Task: add two 64-bit numbers.
- 32-bit CPU: must complete two 32-bit operations plus handle the carry between them.
- 64-bit CPU: adds two 64-bit integers in a single instruction.
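A minimal C sketch of the 64-bit addition task above, under the assumption that only 32-bit operations are available; the helper name add64_via_32 and the test values are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 64-bit values using only 32-bit operations: two 32-bit adds
       plus explicit carry handling, as a 32-bit ALU would have to do. */
    static uint64_t add64_via_32(uint64_t x, uint64_t y) {
        uint32_t xlo = (uint32_t)x, xhi = (uint32_t)(x >> 32);
        uint32_t ylo = (uint32_t)y, yhi = (uint32_t)(y >> 32);

        uint32_t lo = xlo + ylo;              /* first 32-bit add  */
        uint32_t carry = (lo < xlo) ? 1 : 0;  /* detect carry out  */
        uint32_t hi = xhi + yhi + carry;      /* second 32-bit add */

        return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
        uint64_t x = 0xFFFFFFFFULL, y = 1;    /* chosen to force a carry */
        printf("%llu %llu\n",
               (unsigned long long)add64_via_32(x, y),
               (unsigned long long)(x + y));  /* single 64-bit add */
        return 0;
    }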

Evolution of Processor Word Size

Instruction-Level Parallelism (ILP)
Running independent instructions on separate execution units simultaneously.

    x = a + b
    y = c + d
    z = x + y

- Serial execution: if each instruction takes one cycle, the program takes 3 clock cycles.
- Parallel execution: the first two instructions are independent, so they can execute simultaneously. The third instruction depends on the first two, so it must execute afterwards. The program takes 2 clock cycles.

Instruction Pipelining
Improve ILP by splitting the processing of a single machine-language instruction into a series of independent steps. The CPU can issue instructions at the processing rate of the slowest step, increasing clock speed.
Pipeline stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back.

Sequential Laundry
Four loads (A-D), each passing through four 30-minute stages, run back to back starting at 6 PM. Sequential laundry = 8 hours for 4 loads.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Pipelined Laundry
The same four loads overlap: load B starts its first stage as soon as load A moves to its second. Pipelined laundry = 3.5 hours for 4 loads!
http://inst.eecs.berkeley.edu/~cs61c/fa10

Pipelining Lessons
- Pipelining doesn't decrease the latency of a single task; it increases the throughput of the entire workload.
- Multiple tasks operate simultaneously using different resources.
- Potential speedup = number of stages.
- Time to fill the pipeline and time to drain it reduce the speedup: 2.3X vs. 4X in this example.
- Speedup is limited by the slowest pipeline stage.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Processor Pipeline Execution
Pipeline diagram: eight instructions flow through the five stages (IF, ID, ALU, MEM, WR) over twelve clock periods, with a new instruction entering the pipeline each cycle.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Hazards
Problems that prevent pipelined execution.
- Data hazards: must wait for a previous instruction to compute data that the current instruction needs as input.
- Structural hazards: a processor component (execution unit or memory unit) is needed by two simultaneous instructions.
- Control hazards: conditional branch statements offer two alternatives for the next instruction to fetch.

Working Around Hazards
- Instruction re-ordering: re-order instructions to extract more ILP; data dependencies limit re-ordering.
- Branch prediction: predict which branch is likely to be taken, then execute those instructions. Loops usually stay in the loop and only exit once.
- Loop unrolling: a human or the compiler replicates the body of a loop to expose more parallelism (see the sketch below).
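The loop-unrolling sketch referenced above, in illustrative C (the array contents and the unroll factor of 4 are made up; in practice the compiler often does this automatically):

    #include <stdio.h>

    #define N 16

    int main(void) {
        double a[N], sum = 0.0, s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < N; i++) a[i] = i;

        /* Original loop: every iteration adds into the same accumulator,
           creating a chain of data dependencies. */
        for (int i = 0; i < N; i++)
            sum += a[i] * a[i];

        /* Unrolled by 4 with separate accumulators: the four multiply-adds
           in each iteration are independent and can issue in parallel.
           (Assumes N is a multiple of 4 for brevity.) */
        for (int i = 0; i < N; i += 4) {
            s0 += a[i]     * a[i];
            s1 += a[i + 1] * a[i + 1];
            s2 += a[i + 2] * a[i + 2];
            s3 += a[i + 3] * a[i + 3];
        }

        printf("%.1f %.1f\n", sum, s0 + s1 + s2 + s3);
        return 0;
    }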

Superscalar Pipelined Execution

Superscalar Execution
http://arstechnica.com/paedia/c/cpu/part-2/

AMD Athlon Architecture: Superscalar Pipeline

Pipeline Depth and Issue Width

CPU                  Year  Clock Speed  Pipeline Stages  Issue Width  Cores  Power
80486                1989  25 MHz       5                1            1      5W
Pentium              1993  66 MHz       5                2            1      10W
Pentium Pro          1997  150 MHz      10               3            1      29W
Pentium 4            2001  1500 MHz     22               3            1      75W
Pentium 4 Prescott   2004  3600 MHz     31               3            1      103W
Core 2 Conroe        2006  2930 MHz     14               4            2      75W
Core 2 Yorkfield     2008  2930 MHz     16               4            4      95W
Core i7 Gulftown     2010  3460 MHz     16               4            6      130W

Pipeline Depth and Issue Width
Chart (log scale, 1989-2010) plotting clock speed, power, pipeline stages, issue width, and core count over time.
http://inst.eecs.berkeley.edu/~cs61c/fa10

Data Parallelism
- Distribute data across different computing nodes, so the same operation is performed on different parts of the same data structure.
- AKA loop-level parallelism.
- If each loop iteration depends on results from the previous iteration, the loop cannot be parallelized (see the sketch below).
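As a sketch of the parallelizable case, here is a loop whose iterations are independent, expressed with OpenMP (one common way to write loop-level parallelism; the slides do not prescribe a particular tool, and the array and sizes are made up). Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp; without OpenMP the pragma is ignored and the loop simply runs serially.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];

        /* Each iteration touches only its own element, so the iterations
           carry no dependencies and can be split across cores. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[N-1] = %.1f\n", a[N - 1]);
        return 0;
    }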

Task Parallelism
- Different operations on different data sets.
- Each processor performs a different task.
- Each processor communicates with other processes to get inputs and deliver results.

Multiple CPUs

Multicore
- Multicore CPU chips contain multiple complete processors.
- Individual L1 and shared L2 caches.
- The OS and applications see each core as an independent processor (see the sketch below).
- Each core can run a separate task.
- A single application must be divided into multiple tasks to improve performance.
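A tiny illustration of the point about the OS view, assuming Linux/glibc: the OS reports each core (and each hardware thread) as an independent logical processor. _SC_NPROCESSORS_ONLN is a common extension rather than strict POSIX.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Number of logical processors currently online, as seen by the OS. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("OS sees %ld logical processors\n", n);
        return 0;
    }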

Multicore Organization Alternatives

Core 2 Duo and Core i7 Architectures

High-Core-Count CPUs Use an On-Chip Network
- Low-core-count CPUs use ~1000 wires per core to connect to the L3 cache and to each other.
- To scale, the Core i7 uses a ring bus to communicate between cores and L3 cache slices.

Core i7 2600K Ring Bus Architecture

Multicore vs. Multiprocessor

                                Multicore   Multiprocessor
Inter-processor communication   Faster      Slower
Memory bandwidth per CPU        Lower       Higher
Power consumption per CPU       Lower       Higher
Cost per CPU                    Lower       Higher

Simultaneous Multi-Threading
- The CPU presents virtual cores to the OS.
- The CPU duplicates the PC, IR, and thread state registers, but keeps the same number of execution units.
- The OS feeds two threads at once to the CPU.
- Improves ILP by having multiple instruction streams that are unlikely to have cross-stream dependencies.

CPU Parallelism Architectures Compared
- Parallelism types are combined to increase parallel capacity.
- x86 cores have been superscalar since the Pentium in 1993.
- Server x86 CPUs have used simultaneous multithreading since the Pentium 4 Xeon in 2000.

Flynn's Taxonomy

                Single Instruction          Multiple Instruction
Single Data     SISD: Pentium III           MISD: none today
Multiple Data   SIMD: SSE instruction set   MIMD: Xeon E5345 (Clovertown)

Single Instruction Single Data (SISD)
- Serial computation.
- Single instruction: only one instruction stream is acted on by the CPU during one clock cycle.
- Single data: only one data stream is used as input during any clock cycle.

Single Instruction Multiple Data (SIMD)
- Parallel computation.
- Single instruction: all processing units are given the same instruction at any clock cycle.
- Multiple data: each processing unit can operate on a different data element.
- Applications: graphics. Example: Radeon R770 GPU. (An SSE sketch follows after this list.)

Multiple Instruction Single Data (MISD)
- Parallel computation.
- Single data: a single data stream is fed into multiple processing units.
- Multiple instruction: each processing unit operates on the data independently via its own instruction stream.

Multiple Instruction Multiple Data (MIMD)
- Parallel computation.
- Multiple instruction: each processing unit may execute a different instruction stream.
- Multiple data: each processing unit may work with a different data stream.
- Examples: multicore, grid computing.
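The SSE sketch referenced under SIMD above: a single _mm_add_ps intrinsic adds four floats at once (one instruction, multiple data elements). The array values are made up; this assumes an x86 compiler with SSE support.

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats                 */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, 1 instruction    */
        _mm_storeu_ps(c, vc);             /* store 4 results               */

        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }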

Taxonomy of Parallel Architectures

Amdahl's Law

Scaling Parallel Computing
Just add more processors?
- Hardware limitations: memory-CPU bus bandwidth on the local machine; network bandwidth and latency.
- Software limitations: most algorithms have limits to scalability; supporting libraries may have their own limits.

Amdahl's Law
Speedup due to an enhancement E:

    Speedup = (execution time without E) / (execution time with E)

Suppose E accelerates a piece P (P < 1) of the task by a factor S (S > 1) and leaves the remainder unaffected. Then:

    Exec time with E = Exec time without E × [ (1 - P) + P/S ]
    Speedup = 1 / [ (1 - P) + P/S ]

Amdahl's Law: Example
Consider an application whose work is divided into the following four components:

    Work load   Memory Access  Computation  Disk Access  Network Access
    Time        10%            70%          10%          10%

What is the expected percent improvement if:
- Memory access speed is doubled?  5%
- Computation speed is doubled?    35%

Amdahl's Law for Parallelization

    Speedup = 1 / [ (1 - P) + P/S ]

Let P be the parallelizable portion of the code. As the number of processors increases, the time to do the parallel portion of the program, P/S, tends toward zero, reducing the equation to:

    Speedup = 1 / (1 - P)

- If P = 0, then speedup = 1 (no improvement).
- If P = 1, then speedup grows without limit.
- If P = 0.5, then the maximum speedup is 2.

(A short sketch of this arithmetic follows below.)
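A small C sketch of the arithmetic above; it reads "percent improvement" as the reduction in total execution time, which reproduces the 5% and 35% answers from the example (the helper name remaining_time is illustrative):

    #include <stdio.h>

    /* Fraction of the original execution time remaining after a portion p
       of the work is sped up by a factor s: (1 - p) + p/s. */
    static double remaining_time(double p, double s) {
        return (1.0 - p) + p / s;
    }

    int main(void) {
        /* Memory access is 10% of the work; double its speed. */
        printf("memory x2:      %.0f%% faster overall\n",
               100.0 * (1.0 - remaining_time(0.10, 2.0)));
        /* Computation is 70% of the work; double its speed. */
        printf("computation x2: %.0f%% faster overall\n",
               100.0 * (1.0 - remaining_time(0.70, 2.0)));
        /* Speedup form of the same result: 1 / ((1 - P) + P/S). */
        printf("speedup (P=0.70, S=2): %.2fx\n",
               1.0 / remaining_time(0.70, 2.0));
        return 0;
    }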

Amdahl's Law: Parallel Example
Consider an application whose work is divided into the following four functions:

    Work load   f1   f2    f3    f4
    Time        4%   10%   80%   6%

Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially.
- Parallelizing which part would best improve performance?  f3
- What is the best speedup that could be reached by parallelizing all three parallelizable functions?  10X

Amdahl's Law: Time Example
Consider an application whose work is divided into the following four functions:

    Work load   f1     f2     f3     f4
    Time        2ms    5ms    40ms   3ms

Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially, and that running the whole program takes 50ms.
- What is the best running time that can be achieved by parallelizing f1, f3, and f4?  5ms
- Why can't parallelizing the program decrease the total running time below that? 5ms is the time required for the serial part; even if the parallel part takes 0ms, f2 still takes 5ms to run.

Amdahl's Law

Parallel Memory Architectures
- Uniform Memory Architecture
- Non-Uniform Memory Architecture
- Distributed Memory
- Hybrid Distributed-Shared Memory

Uniform Memory Architecture (UMA)
- Global shared address space for all memory.
- Symmetric Multi-Processing (SMP).
- Does not scale to much more than 8 CPUs due to memory contention.

Non-Uniform Memory Architecture (NUMA)
- Global shared address space where processors have fast access to nearby memory; memory access across links is slower.
- Better scaling than UMA, but still limited by memory contention.

Distributed Memory
- Each CPU has its own local address space; changes made by one CPU are not visible to other CPUs.
- CPUs must use the network to exchange data.
- Highly scalable: CPUs have fast local RAM.
- Data communication is the programmer's responsibility.

Hybrid Distributed-Shared Memory
- The shared memory components are SMP nodes.
- The distributed memory component is the network of SMP nodes.
- Used by most large supercomputers and clouds.

Hybrid Distributed-Shared in the Cloud

Parallel Programming Models
- Shared Memory Model
- Threads Model
- Data Parallel Model
- Message Passing Model

Parallel Programming Models
- An abstraction above the hardware level for programmers to use.
- Any programming model can be used with any parallel architecture, though model performance may depend on the architecture.
- Model choice depends on the problem being solved and programmer preference.

Shared Memory
- Tasks share a common global address space, which they read and write asynchronously.
- Software mechanisms such as locks and semaphores are used to control access to shared memory.

Shared Memory Model
Advantages
- Program development is simplified since the process owns the data stored in memory.
- Referencing data in shared memory is similar to traditional serial programming.
Disadvantages
- Difficult to understand and manage data locality. Keeping data local to the CPU working on it is faster, but bus traffic results when other CPUs try to access that data.

Threads
- Divide a single program into multiple concurrent execution paths called threads.
- Threads are distributed across CPUs and can be executed on multiple CPUs simultaneously (see the sketch below).

Threads
- Each thread has local data structures specific to that thread, but all threads share the common process global memory space.
- Threads are associated with shared memory architectures.
- Threads can be scheduled by the OS or by middleware such as the language VM (green threads). Green threads start and synchronize faster, but most implementations cannot use multiple CPUs.
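A minimal sketch of the threads model using POSIX threads (one common implementation; the slides do not name a specific API). All threads share the process's global array and each thread writes only its own chunk. The array size and thread count are made up; compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define N        8
    #define NTHREADS 2

    static double a[N];                 /* shared process global memory */

    static void *worker(void *arg) {
        long id = (long)arg;
        int chunk = N / NTHREADS;
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            a[i] = 2.0 * i;             /* each thread writes only its chunk */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        for (int i = 0; i < N; i++)
            printf("%.1f ", a[i]);
        printf("\n");
        return 0;
    }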

Data Parallel
- The data set is divided into chunks and operations are performed on each chunk concurrently.
- Tasks working on the same data structure perform the same operation on each instance of the data (ex: multiply each array element by X).

Message Passing
- Tasks use only their own local memory. Tasks can be on multiple machines.
- Data is exchanged between tasks by sending and receiving messages (see the sketch below).
- Data transfer requires cooperation between tasks.

Parallel Code Example
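A minimal message-passing sketch using MPI (a common message-passing library; the slides do not prescribe one). Rank 0 sends a value and rank 1 receives it, showing that both sides must cooperate in the transfer. The value and tag are made up; build and run with an MPI implementation, e.g. mpicc and mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* receive from rank 0 */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }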

Example: Array Processing

    if MASTER
        initialize array
        send each WORKER info on the chunk it owns
        send each WORKER its chunk of the initial array
        recv results from WORKERs
    elsif WORKER
        recv info on my chunk
        recv my chunk of array data
        do j = 1st col .. last col
            do i = 1, n
                a(i,j) = f(i,j)
            end
        end
        send results to MASTER
    end

The array is divided into chunks; each CPU owns a chunk and executes the portion of the loop corresponding to it.

Example: Heat Equation
The heat equation describes the change in temperature in a region over time, given an initial temperature distribution and boundary conditions. Divide the region into chunks and iterate, allowing heat from nearby chunks to change the temperature of a chunk in the next iteration.
To compute U(x,y): (update formula shown in the slide figure)

Example: Heat Equation
Serial program: (shown in the slide figure; a sketch follows below)
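The serial program itself appears only as a figure in the slides; the following is a hedged C sketch assuming the usual explicit finite-difference update, with made-up grid size, coefficients, and step count:

    #include <stdio.h>

    #define NX 100
    #define NY 100
    #define STEPS 500

    static double u1[NX][NY], u2[NX][NY];

    int main(void) {
        double cx = 0.1, cy = 0.1;          /* assumed diffusion coefficients */

        u1[NX / 2][NY / 2] = 100.0;         /* simple initial hot spot */

        for (int step = 0; step < STEPS; step++) {
            /* Update every interior point from its four neighbors. */
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++)
                    u2[i][j] = u1[i][j]
                             + cx * (u1[i + 1][j] + u1[i - 1][j] - 2.0 * u1[i][j])
                             + cy * (u1[i][j + 1] + u1[i][j - 1] - 2.0 * u1[i][j]);
            /* Copy new values back for the next iteration (boundaries stay fixed). */
            for (int i = 1; i < NX - 1; i++)
                for (int j = 1; j < NY - 1; j++)
                    u1[i][j] = u2[i][j];
        }

        printf("center temperature after %d steps: %f\n", STEPS, u1[NX / 2][NY / 2]);
        return 0;
    }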

Example: Parallel Heat Equation Solver
- Divide the array into chunks.
- Data dependencies: interior elements are independent of other tasks; border elements depend on other tasks, so tasks must communicate.
- The master sends the initial state to worker tasks and collects results from the workers.
- Workers compute the heat equation, communicating the state of border elements with adjacent worker tasks.

Parallel Code

Key Points
- Granularity refers to the ratio of computation to communication.
- Levels of parallelism:
  - Bit-level parallelism
  - Data-level parallelism (loop-level)
  - Instruction-level parallelism: pipelining, superscalar
  - Task-level parallelism: multi-CPU, multi-core, hyperthreading
- Flynn's Taxonomy: SIMD, MIMD

Key Points
- Amdahl's Law: no matter how many processors, speedup is limited by the sequential portion of the program.
- Parallel Memory Architectures: shared memory for SMP (UMA, NUMA); distributed memory for clusters or large-scale multiprocessing; most clouds use a hybrid shared-distributed architecture.
- Parallel Programming Models: Shared Memory, Threads, Data Parallel, Message Passing.