Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism


Gladys Washington

Motivation for Parallelism

The speed of an application is determined by more than just processor speed:
- Memory speed
- Disk speed
- Network speed
- ...

Multiprocessors typically improve the aggregate speeds:
- Memory bandwidth is improved by separate memories.
- Multiprocessors usually have more aggregate cache memory.
- Each processor in a cluster can have its own disk and network adapter, improving aggregate speeds.

Communication enables parallel applications:
- Harnessing the computing power of distributed systems over the Internet is a popular example of parallel processing (SETI, Folding@home, ...).

Constraints on the location of data:
- Huge data sets could be difficult, expensive, or otherwise infeasible to store in a central location.
- Distributed data and parallel processing is a practical solution.

October 11th, 2007

Types of Parallelism: Instruction Level Parallelism (ILP)

- Instructions near each other in an instruction stream could be independent.
- These can then execute in parallel, either partially (pipelining) or fully (superscalar).
- Hardware is needed for dependency tracking.
- The amount of ILP available per instruction stream is limited.
- ILP is usually considered an implicit parallelism, since the hardware exploits it automatically without programmer/compiler intervention.
- Programmers/compilers could transform applications to expose more ILP.

ILP Example: Loop Unrolling

    for( i=0; i<64; i++ ) {
        vec[i] = a * vec[i];
    }

This loop has one sequence of load, multiply, store per iteration. The amount of ILP is very limited.

    for( i=0; i<64; i+=4 ) {
        vec[i+0] = a * vec[i+0];
        vec[i+1] = a * vec[i+1];
        vec[i+2] = a * vec[i+2];
        vec[i+3] = a * vec[i+3];
    }

Four independent load, multiply, store sequences per iteration. 4-fold loop unrolling increases the amount of ILP exploitable by the hardware.

Types of Parallelism: Task Level Parallelism (TLP)

- Several instruction streams are independent.
- Parallel execution of the streams is possible.
- Much more coarse-grained parallelism compared with ILP.
- Typically requires the programmer and/or compiler to decompose the application into tasks, enforce dependencies, and expose parallelism.
- Some experimental techniques, such as speculative multithreading, are aimed at removing these burdens from the programmer.

TLP Example: Quicksort

    QuickSort( A, B ):
        if( B - A < 10 ) {
            /* Base Case: Use fast sort */
            FastSort( A, B );
        } else {
            /* Continue Recursively */
            Partition [A,B] into [A,C] and [C+1,B];  /* Task X */
            QuickSort( A, C );                       /* Task Y */
            QuickSort( C+1, B );                     /* Task Z */
        }

Both Y and Z depend on X but are mutually independent.

[Figure: task graph with edges from X to Y and from X to Z]

Types of Parallelism: Data Parallelism (DP)

- In many applications a collection of data is transformed in such a way that the operations on each element are largely independent of the others.
- A typical scenario is when we apply the same instruction to a collection of data.
- Example: adding two arrays. The same operation (+) is applied to a collection of data.

[Figure: two arrays added element by element]

Superscalar and OoO execution

- Scalar processors can issue one instruction per cycle.
- Superscalar processors can issue more than one instruction per cycle. This is a common feature in most modern processors.
- By replicating functional units and adding hardware to detect and track instruction dependencies, a superscalar processor takes advantage of Instruction Level Parallelism.
- A related technique (applicable to both scalar and superscalar processors) is out-of-order (OoO) execution: instructions are reordered (by hardware) for better utilization of the pipeline(s).
- Excellent example of the use of extra transistors to speed up execution without programmer intervention.
- Naturally limited by the available ILP.
- Also severely limited by the hardware complexity of dependency checking.
- In practice: 2-way superscalar architectures are common, more than 4-way is unlikely.

Vector Processors

- Vector processors refer to a previously common supercomputer architecture where a vector is a basic memory abstraction. Examples: Cray 1, IBM 3090/VF.
- Vector: 1D array of numbers.
- Example: add two vectors.
  Scalar solution: loop through the vectors and add each scalar element. This means repeated address translations and branches.
  Vector solution: add the vectors via a vector addition instruction. Address translation happens once and there are no branches.

Vector Processors

- Independent operations enable deeper pipelines, higher clock frequencies, and multiple concurrent functional units (e.g., SIMD units).
- Vector processors are suitable for a certain range of applications.
- Traditional, scalar processor designs have been successful over a larger spectrum of applications.
- Economic realities favor large clusters of commodity processors or small-scale SMPs.

Single Instruction Multiple Data (SIMD)

- Several functional units execute the same instruction on different data streams, simultaneously and synchronized.
- Suitable architecture for many data parallel applications: matrix computations, graphics processing, image analysis, ...
- Found primarily in common microprocessors and GPUs: SIMD instruction extensions such as MMX, SSE, AltiVec.

Shared Memory Multiprocessor

- Multiprocessors where all processors share a single address space are commonly called Shared Memory Multiprocessors.
- They can be classified by the access time that different processors have to different memory areas.
  Uniform Memory Access (UMA): each processor has the same access time.
  Non-Uniform Memory Access (NUMA): some memory is closer to a processor, and access time is higher to distant memory.
- Furthermore, their caches could be coherent or not.
- CC-UMA: Cache-Coherent Uniform Memory Access. By another name, Symmetric MultiProcessor (SMP).

Bus-Based UMA MP

[Figure: chips connected by a bus]

NUMA MP

[Figure: chips connected by an interconnect]

Multicore

- When several processor cores are physically located in the same processor socket we refer to it as a multicore processor.
- Both Intel and AMD now have quad-core (4 cores) processors in their product portfolios.
- A new desktop computer today is definitely a multiprocessor.
- Multicores usually have a single address space and are cache-coherent.
- They are very similar to SMPs, but they typically share one or more levels of cache and have more favorable inter-processor/core communication speed.

[Figure: multicore chip]

Multicore

Multicore chips have multiple benefits:
- Higher peak performance.
- Power consumption control: some cores can be turned off.
- Production yield increase: 8-core chips with a defective core can be sold with one core disabled.

...but also some potential drawbacks:
- Memory bandwidth per core is limited by physical limits such as the number of pins.
- Lower peak performance per thread: some inherently sequential applications may actually run slower.

Distributed Memory Multiprocessor

- In contrast to a single address space, machines with multiple private memories are commonly called Distributed Memory Machines.
- Data is exchanged between memories via messages communicated over a dedicated network.
- When the processor/memory pairs are physically separated, such as on different boards or in different casings, such machines are called Clusters.

Cluster

[Figure: nodes connected by a network]

Clusters: Past and Present

- In the past, clusters were exclusively high-end machines with custom supercomputer processors and custom high-performance interconnects.
- These machines were very expensive and therefore limited to research and big corporations.
- From the 1990s onwards, clusters based on off-the-shelf components have become increasingly common: commodity processors and commodity interconnects (e.g. Ethernet with switches).
- The economic benefits and programmer familiarity with commodity components far outweigh the performance issues.
- This has helped democratize supercomputing: many corporations and universities have clusters today.

Networks

- Access to memory on other nodes is very expensive: data must be transferred over a relatively high-latency, low-bandwidth network.
- Algorithms with low data locality will suffer.
- High synchronization requirements will also degrade performance for the same reason.
- The network design is a tradeoff between conflicting goals:
  Maximum bandwidth and low latency: full connectivity.
  Low cost and power consumption: tree network.
- Switch-based networks are common today; other examples of topologies include rings, meshes, hypercubes, and trees.
