
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com

Review

Processor Architecture and Technology trends Processor chips are the key components of computers. Processor chips consist of transistors, and the transistor count can be used as a rough estimate of a chip's complexity and performance. Moore's law is an empirical observation which states that the number of transistors on a typical processor chip doubles every 18-24 months.
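To put the doubling rate in perspective (a back-of-the-envelope figure, not from the slides): at a doubling time of two years, the transistor count grows by a factor of about 2^5 = 32 over a decade.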

Processor Architecture and Technology trends

Processor Architecture and Technology trends Four phases of microprocessor design trends:
1. Parallelism at bit level
2. Parallelism by pipelining
3. Parallelism by multiple functional units
4. Parallelism at processor or thread level

Parallelism at bit level Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes exceed the word length. (For example, consider a case where an 8-bit processor must add two 16-bit integers. The processor must first add the 8 lower-order bits of each integer and then the 8 higher-order bits, requiring two instructions to complete a single operation. A 16-bit processor can complete the operation with a single instruction.)
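To make the example concrete, here is a minimal C sketch (an illustration, not from the slides; the function names are hypothetical) of how a 16-bit addition is composed from two 8-bit additions with an explicit carry, versus the single native addition a wider processor can perform:

    #include <stdint.h>

    /* Add two 16-bit integers using only 8-bit arithmetic, as an 8-bit CPU would:
       first the low bytes, then the high bytes plus the carry out of the low half. */
    uint16_t add16_via_8bit(uint16_t a, uint16_t b) {
        uint8_t lo    = (uint8_t)(a & 0xFF) + (uint8_t)(b & 0xFF);     /* low-order add (wraps mod 256) */
        uint8_t carry = (lo < (uint8_t)(a & 0xFF)) ? 1 : 0;            /* detect the carry-out          */
        uint8_t hi    = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry; /* high-order add                */
        return (uint16_t)(((uint16_t)hi << 8) | lo);
    }

    /* On a 16-bit (or wider) processor the same operation is a single add instruction. */
    uint16_t add16_native(uint16_t a, uint16_t b) {
        return (uint16_t)(a + b);
    }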

Parallelism by pipelining A typical pipeline is partitioned into four stages: a) fetch, b) decode, c) execute, d) write-back.

Parallelism by pipelining ILP (instruction-level parallelism) processors are processors that use pipelining to execute instructions. Processors with a relatively large number of pipeline stages are sometimes called superpipelined.
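As a rough, idealized model (not from the slides): with a k-stage pipeline and n independent instructions, execution takes about k + (n - 1) cycles instead of the n * k cycles needed without pipelining, so the speedup approaches k for large n. For example, with k = 4 stages and n = 100 instructions, that is 103 cycles instead of 400, a speedup of about 3.9.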

Parallelism by multiple functional units Many processors are multiple-issue processors. They use multiple, independent functional units like ALUs (arithmetic logical units), FPUs (floating-point units), load/store units, or branch units. These units can work in parallel, i.e., different independent instructions can be executed in parallel by different functional units.

Parallelism by multiple functional units Multiple-issue processors can be classified as superscalar processors or VLIW (very long instruction word) processors. However, using even more functional units provides little additional gain because of dependencies between instructions and branching of the control flow.
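A small, hypothetical C fragment illustrates both points made above: the first three statements are independent and can, in principle, be dispatched to different functional units (two ALU adds and an FPU multiply) in the same cycle, while the later statements depend on earlier results and must wait.

    /* Hypothetical example of independent vs. dependent instructions. */
    double mixed_work(int a, int b, int c, int d, double p, double q) {
        int    x = a + b;   /* ALU add      -- independent              */
        int    y = c + d;   /* ALU add      -- independent              */
        double z = p * q;   /* FPU multiply -- independent              */
        int    w = x + y;   /* depends on x and y: cannot issue earlier */
        return z + w;       /* depends on z and w                       */
    }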

Parallelism at processor or thread level The degree of parallelism obtained by pipelining and multiple functional units is limited. For typical processors, this limit has already been reached for some time. However, more and more transistors are available per processor chip according to Moore's law. These can be used to integrate larger caches on the chip, but cache sizes cannot be increased arbitrarily.

Parallelism at processor or thread level An alternative approach to use the increasing number of transistors on a chip is to put multiple, independent processor cores onto a single processor chip. This approach has been used for typical desktop processors since 2005. The resulting processor chips are called multicore processors. Each of the cores of a multicore processor must obtain a separate flow of control, i.e., parallel programming techniques must be used. The cores of a processor chip access the same memory and may even share caches. Therefore, memory accesses of the cores must be coordinated.
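A minimal POSIX-threads sketch (an illustration, not from the slides; compile with -pthread) shows both requirements: each core is given its own flow of control in the form of a thread, and the threads' accesses to the shared counter are coordinated with a mutex.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                        /* shared by all threads/cores        */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {                /* one flow of control per thread     */
        (void)arg;                                  /* unused                             */
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);              /* coordinate access to shared memory */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;                           /* e.g., one thread per core          */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);         /* 2000000 with correct coordination  */
        return 0;
    }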

Flynn's Taxonomy of Parallel Architectures Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966. The classification system has stuck and has been used as a tool in the design of modern processors and their functionalities. Since the rise of multiprocessing CPUs, a multiprogramming context has evolved as an extension of the classification system. Source: Wikipedia

Flynn's Taxonomy of Parallel Architectures The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture:
Single Instruction, Single Data stream (SISD)
Single Instruction, Multiple Data streams (SIMD)
Multiple Instruction, Single Data stream (MISD)
Multiple Instruction, Multiple Data streams (MIMD)

Flynn's Taxonomy of Parallel Architectures Single Instruction, Single Data stream (SISD) A sequential computer which exploits no parallelism in either the instruction or data streams. (In the accompanying diagram, "PU" denotes a central processing unit.)

Flynn's Taxonomy of Parallel Architectures Single Instruction, Multiple Data streams (SIMD) A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized. For example, an array processor or GPU.
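As a concrete, hypothetical sketch (x86 SSE intrinsics; assumes an SSE-capable CPU and is not from the slides), one SIMD instruction adds four pairs of floats at once, i.e., a single instruction stream applied to multiple data elements:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add four float pairs with one SIMD addition. */
    void add4(const float *a, const float *b, float *out) {
        __m128 va = _mm_loadu_ps(a);      /* load 4 floats from a         */
        __m128 vb = _mm_loadu_ps(b);      /* load 4 floats from b         */
        __m128 vc = _mm_add_ps(va, vb);   /* 4 additions, one instruction */
        _mm_storeu_ps(out, vc);           /* store the 4 results          */
    }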

Flynn's Taxonomy of Parallel Architectures Multiple Instruction, Single Data stream (MISD) Multiple instructions operate on a single data stream. This is an uncommon architecture which is generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer.

Flynn's Taxonomy of Parallel Architectures Multiple Instruction, Multiple Data streams (MIMD) Multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space. A multi-core superscalar processor is an MIMD processor.

MIMD computer systems

Memory Organization of Parallel Computers
Computers with Distributed Memory Organization
Computers with Shared Memory Organization
Reducing Memory Access Times

Computers with Distributed Memory Organization In computer science, distributed memory refers to a multiple-processor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors.
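A minimal MPI sketch (an illustration, not from the slides; it assumes an MPI installation and at least two processes, e.g. mpirun -np 2) shows the consequence: each process owns its private memory, so remote data has to be transferred by explicit messages.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                    /* process 0 owns the value ...            */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {             /* ... process 1 must receive its own copy */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }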

Computers with Distributed Memory Organization

Computers with Shared Memory Organization In computing, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, for example among its multiple threads, is also referred to as shared memory.
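As a hypothetical POSIX sketch (the object name "/demo_shm" is made up; link with -lrt on some systems; error handling is omitted for brevity), two separate programs that open and map the same shared-memory object see the same bytes and can communicate through them without copying:

    #include <fcntl.h>      /* O_CREAT, O_RDWR */
    #include <sys/mman.h>   /* shm_open, mmap  */
    #include <unistd.h>     /* ftruncate       */

    /* Create (or open) a named shared-memory object and map one int into this
       process's address space; another process mapping "/demo_shm" sees the
       same memory. */
    int *map_shared_counter(void) {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(int));
        return (int *)mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    }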

Thread A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system.

Computers with Shared Memory Organization

Reducing Memory Access Times Memory access time has a large influence on program performance. This can also be observed for computer systems with a shared address space. Processor performance has been improving rapidly for many years, and a significant contribution to these improvements comes from a reduction in processor cycle time. At the same time, the capacity of the DRAM chips that are used for building main memory has been increasing by about 60% per year.

Reducing Memory Access Times In contrast, the access time of DRAM chips has only been decreasing by about 25% per year. Thus, memory access time does not keep pace with processor performance improvement, and there is an increasing gap between processor cycle time and memory access time. A suitable organization of memory access becomes more and more important to get good performance results at program level.

Reducing Memory Access Times This is also true for parallel programs, in particular if a shared address space is used. Reducing the average latency observed by a processor when accessing memory can increase the resulting program performance significantly. Two important approaches have been considered to reduce the average latency for memory access: 1. The simulation of virtual processors by each physical processor (multithreading). 2. the use of local caches to store data values that are accessed often.

Multithreading In computer architecture, multithreading is the ability of a central processing unit, or a single core in a multicore processor, to execute multiple processes or threads concurrently, appropriately supported by the operating system. The idea of interleaved multithreading is to hide the latency of memory accesses by simulating a fixed number of virtual processors for each physical processor. In fine-grained multithreading, a switch is performed after each instruction. In coarse-grained multithreading, the processor switches between virtual processors only on costly stalls.

Multithreading There are two drawbacks of fine-grained multithreading:
1. The programming must be based on a large number of virtual processors. Therefore, the algorithm used must have a sufficiently large potential of parallelism to employ all virtual processors.
2. The physical processors must be specially designed for the simulation of virtual processors. A software-based simulation using standard microprocessors is too slow.

Caches In computing, a cache (pronounced "cash") is a component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a duplicate of data stored elsewhere. A cache is a small but fast memory between the processor and main memory.

Caches A cache can be used to store data that is often accessed by the processor, thus avoiding expensive main memory access. The data stored in a cache is always a subset of the data in the main memory, and the management of the data elements in the cache is done by hardware.
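A standard textbook formula (not from the slides) makes the benefit concrete: average memory access time = hit time + miss rate * miss penalty. With a hypothetical 1 ns cache hit time, a 100 ns main-memory miss penalty, and a 5% miss rate, the average access time is 1 + 0.05 * 100 = 6 ns, far closer to the cache's speed than to main memory's.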