Parallel Real-Time Systems


1 Parallel Real-Time Systems: Parallel Computing Overview References (will be expanded as needed): Website for Parallel & Distributed Computing; selected slides from Introduction to Parallel Computing; Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2004 (Chapter 1 is posted on the website); Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997 (an updated online version is available on the website). 2 1

2 Outline Why use parallel computing Moore's Law Modern parallel computers Flynn's Taxonomy Seeking concurrency Data clustering case study Programming parallel computers 3 Why Use Parallel Computers Solve compute-intensive problems faster Make infeasible problems feasible Reduce design time Solve larger problems in the same amount of time Improve the answer's precision Increase memory size More data can be kept in memory Dramatically reduces the slowdown caused by accessing external storage, which increases computation time Gain competitive advantage 4 2

3 1989 Grand Challenges to Computational Science Categories: Quantum chemistry, statistical mechanics, and relativistic physics Cosmology and astrophysics Computational fluid dynamics and turbulence Materials design and superconductivity Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling Medicine, and modeling of human organs and bones Global weather and environmental modeling 5 Weather Prediction The atmosphere is divided into 3D cells Data includes temperature, pressure, humidity, wind speed and direction, etc., recorded at regular time intervals in each cell The model uses a very large number of cells of roughly one cubic mile each A sequential computer would take over 100 days to perform the calculations needed for a 10-day forecast Details in Ian Foster's 1995 online textbook Designing and Building Parallel Programs, included in the Parallel Reference List, which will be posted on the website. 6 3

4 Moore's Law In 1965, Gordon Moore [87] observed that the density of transistors on a chip doubled every year; that is, the silicon area per transistor was being halved yearly. This is an exponential rate of increase. By the late 1980s, the doubling period had slowed to 18 months. Shrinking the silicon features also causes the speed of the processors to increase. Moore's law is therefore sometimes stated as: processor speed doubles every 18 months 7 Microprocessor Revolution [plot of speed (log scale) versus time for supercomputers, mainframes, minis, and micros] 8 4

5 Some Definitions Concurrent: sequential events or processes that seem to occur or progress at the same time. Parallel: events or processes that actually occur or progress at the same time. Parallel computing provides simultaneous execution of operations within a single parallel computer Distributed computing provides simultaneous execution of operations across a number of systems. 9 Flynn's Taxonomy Best-known classification scheme for parallel computers. Classifies a computer by the parallelism it exhibits in its Instruction stream Data stream A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream) The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M) Four combinations: SISD, SIMD, MISD, MIMD 10 5

6 SISD Single Instruction, Single Data The usual sequential computer is the primary example, i.e., uniprocessors Note: co-processors don't count as additional processors Concurrent processing is still allowed: Instruction prefetching Pipelined execution of instructions Independent concurrent tasks can execute different sequences of operations. 11 SIMD Single Instruction, Multiple Data One instruction stream is broadcast to all processors Each processor, also called a processing element (or PE), is very simple and is essentially an ALU; PEs do not store a copy of the program nor have a program control unit. Individual processors can be inhibited from participating in an instruction (based on a data test). 12 6

7 SIMD (cont.) All active processors execute the same instruction synchronously, but on different data On a memory access, all active processors must access the same location in their local memory. The data items form an array (or vector), and an instruction can act on the complete array in one cycle. 13 SIMD (cont.) Quinn calls this architecture a processor array. Examples include the STARAN and MPP (Dr. Batcher, architect) and the Connection Machine CM2 (built by Thinking Machines). 14 7

8 How to View a SIMD Machine Think of soldiers all in a unit. The commander selects certain soldiers as active, for example, every even-numbered row. The commander barks out an order to all the active soldiers, who execute the order synchronously. 15 MISD Multiple Instruction streams, Single Data stream Primarily corresponds to multiple redundant computation, say for reliability. Quinn argues that a systolic array is an example of an MISD structure (pp. 55-57) Some authors include pipelined architectures in this category This category does not receive much attention from most authors, so we won't discuss it further. 16 8

9 MIMD Multiple Instruction, Multiple Data Processors are asynchronous and can independently execute different programs on different data sets. Communications are handled either through shared memory (multiprocessors) or by use of message passing (multicomputers) MIMDs are considered by many researchers to include the most powerful, least restricted computers. 17 MIMD (cont. 2/4) Have major communication costs when compared to SIMDs Internal housekeeping activities are often overlooked: Maintaining distributed memory & distributed databases Synchronization or scheduling of tasks Load balancing between processors The SPMD method of programming MIMDs: All processors execute the same program. SPMD stands for single program, multiple data. An easy method to program when the number of processors is large. While the processors have the same code, each can be executing a different part of it at any point in time (a minimal sketch follows). 18 9
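A minimal SPMD sketch in C with MPI (not from the slides; the work split shown is purely illustrative): every process runs the same program, but each branches on its own rank, so different processors can be executing different parts of the code at the same time.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        if (rank == 0)
            printf("process 0 of %d: doing coordination work\n", size);
        else
            printf("process %d of %d: working on its own share of the data\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Run with, e.g., mpirun -np 4 ./spmd; all four processes execute this same program but follow different branches.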

10 MIMD (cont 3/4) A more common technique for programming MIMDs is to use multi-tasking: The problem solution is broken up into various tasks. Tasks are distributed among processors initially. If new tasks are produced during execution, these may be handled by the parent processor or distributed to other processors Each processor can execute its collection of tasks concurrently. If some of its tasks must wait for results from other tasks or for new data, the processor will focus on its remaining tasks. Larger programs usually require a load balancing algorithm to rebalance tasks between processors Dynamic scheduling algorithms may be needed to assign a higher execution priority to time-critical tasks E.g., on the critical path, more important, earlier deadline, etc. 19 MIMD (cont 4/4) Recall, there are two principal types of MIMD computers: Multiprocessors (with shared memory) Multicomputers (message passing) Both are important and will be covered in greater detail next.

11 Multiprocessors (Shared Memory MIMDs) There are two types: Centralized Multiprocessors Also called UMA (Uniform Memory Access) or Symmetric Multiprocessor (SMP) Distributed Multiprocessors Also called NUMA (Nonuniform Memory Access) 21 Centralized Multiprocessors (SMPs) 22 11

12 Centralized Multiprocessors (SMPs) Consist of identical CPUs connected by a bus to a common block of memory. Each processor requires the same amount of time to access memory. Usually limited to a few dozen processors due to memory bandwidth. SMPs and clusters of SMPs are currently very popular 23 Distributed Multiprocessors 24 12

13 Distributed Multiprocessors (or NUMA) Have a distributed memory system Each memory location has the same address for all processors. Access time to a given memory location varies considerably for different CPUs. Normally, fast caches are used to reduce the problem of differing memory access times across processors. This creates the problem of ensuring that all copies of the same data held in different memory locations remain identical. 25 Multicomputers (Message-Passing MIMDs) Processors are connected by a network Usually an interconnection network Also may be connected by Ethernet links or a bus. Each processor has a local memory and can only access its own local memory. Data is passed between processors using messages, when specified by the program

14 Multicomputers (cont) Message passing between processors is controlled by a message-passing library (e.g., MPI, PVM) The problem is divided into processes or tasks that can be executed concurrently on individual processors. Each processor is normally assigned multiple tasks. 27 Multiprocessors vs Multicomputers Programming disadvantages of message-passing: Programmers must make explicit message-passing calls in the code This is low-level programming and is error prone. Data is not shared but copied, which increases the total data size. Data integrity: difficulty in maintaining the correctness of multiple copies of a data item

15 Multiprocessors vs Multicomputers (cont) Programming advantages of message-passing No problem with simultaneous access to data. Allows different PCs to operate on the same data independently. Allows PCs on a network to be easily upgraded when faster processors become available. Mixed distributed shared memory systems exist An example is a cluster of SMPs. 29 Types of Parallel Execution Data parallelism Control/Job/Functional parallelism Pipelining Virtual parallelism 30 15

16 Data Parallelism All tasks (or processors) apply the same set of operations to different data. Example: for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor Operations may be executed concurrently Accomplished on SIMDs by having all active processors execute the operations synchronously. Can be accomplished on MIMDs by assigning 100/p iterations to each processor and having each processor calculate its share asynchronously (see the sketch below). 31 Supporting MIMD Data Parallelism SPMD (single program, multiple data) programming is not really data parallel execution, as processors typically execute different sections of the program concurrently. Data parallel programming can be strictly enforced when using SPMD as follows: Processors execute the same block of instructions concurrently but asynchronously No communication or synchronization occurs within these concurrent instruction blocks. Each instruction block is normally followed by a synchronization and communication block of steps 32 16
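A minimal sketch of MIMD data parallelism for the loop above, assuming C with OpenMP (compile with -fopenmp; the array contents are illustrative). The runtime hands each of the p threads roughly 100/p iterations, and each thread computes its share asynchronously.

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) {      /* illustrative input data */
            b[i] = i;
            c[i] = 2.0 * i;
        }

        /* the N iterations are divided among the available threads */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("a[99] = %g\n", a[99]);
        return 0;
    }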

17 MIMD Data Parallelism (cont.) Strict data parallel programming is unusual for MIMDs, as the processors usually execute independently, running their own local program. 33 Data Parallelism Features Each processor performs the same data computation on different data sets Computations can be performed either synchronously or asynchronously Defn: Grain Size is the average number of computations performed between communication or synchronization steps See Quinn textbook, page 411 Data parallel programming usually results in smaller grain size computation SIMD computation is considered to be fine-grain MIMD data parallelism is usually considered to be medium grain 34 17

18 Control/Job/Functional Parallelism Independent tasks apply different operations to different data elements: a ← 2; b ← 3; m ← (a + b) / 2; s ← (a^2 + b^2) / 2; v ← s - m^2 The first and second statements may execute concurrently The third and fourth statements may execute concurrently (a sketch using OpenMP sections follows) 35 Control Parallelism Features The problem is divided into different, nonidentical tasks Tasks are divided between the processors so that their workload is roughly balanced Parallelism at the task level is considered to be coarse-grained parallelism 36 18
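A sketch of the five statements above in C using OpenMP sections (an assumed implementation, not from the slides): the two independent statements computing m and s run concurrently, while v must wait for both.

    #include <stdio.h>

    int main(void) {
        double a = 2.0, b = 3.0;               /* statements 1 and 2 */
        double m, s, v;

        #pragma omp parallel sections
        {
            #pragma omp section
            m = (a + b) / 2.0;                 /* statement 3 */
            #pragma omp section
            s = (a * a + b * b) / 2.0;         /* statement 4 */
        }

        v = s - m * m;                         /* statement 5 needs both m and s */
        printf("m = %g, s = %g, v = %g\n", m, s, v);
        return 0;
    }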

19 Data Dependence Graph Can be used to identify data parallelism and job parallelism. See page 11. Most realistic jobs contain both types of parallelism, which can be viewed as branches in data parallel tasks: - If there is no path from vertex u to vertex v, then job parallelism can be used to execute the tasks u and v concurrently. - If larger tasks can be subdivided into smaller identical tasks, data parallelism can be used to execute these concurrently. 37 For example, mow lawn becomes Mow N lawn, Mow S lawn, Mow E lawn, Mow W lawn If 4 people are available to mow, then data parallelism can be used to do these tasks simultaneously. Similarly, if several people are available to edge the lawn and weed the garden, then we can use data parallelism to provide more concurrency

20 Pipelining Divide a process into stages Produce several items simultaneously 39 Compute Partial Sums Consider the for loop: p[0] ← a[0]; for i ← 1 to 3 do p[i] ← p[i-1] + a[i] endfor This computes the partial sums: p[0] = a[0]; p[1] = a[0] + a[1]; p[2] = a[0] + a[1] + a[2]; p[3] = a[0] + a[1] + a[2] + a[3] The loop is not data parallel as there are dependencies. However, we can stage the calculations in order to achieve some parallelism (see the pipelined MPI sketch below)
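A sketch of the staged (pipelined) partial-sum computation in C with MPI, assuming one a[i] per process arranged in a line (the local values are illustrative): each process waits for the running sum from its left neighbor, adds its own value, and forwards the result to the right.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double a = rank + 1.0;     /* this process's a[i] (illustrative) */
        double partial = a;        /* will become p[i] = a[0] + ... + a[i] */

        if (rank > 0) {            /* stage i waits for p[i-1] from the left */
            double left;
            MPI_Recv(&left, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            partial = left + a;
        }
        if (rank < size - 1)       /* pass p[i] on to the right neighbor */
            MPI_Send(&partial, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);

        printf("p[%d] = %g\n", rank, partial);
        MPI_Finalize();
        return 0;
    }

With a stream of input vectors, the stages overlap and several partial-sum results are in flight at once, which is the point of pipelining.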

21 Partial Sums Pipeline [diagram of the four-stage pipeline computing p[0]..p[3] from a[0]..a[3]] 41 Virtual Parallelism In data parallel applications, it is often simpler to initially design an algorithm or program assuming one data item per processor. Particularly useful for SIMD programming If the actual program has more data items than there are processors, each processor is given a block of ⌈n/p⌉ or ⌊n/p⌋ data items (one common block decomposition is sketched below) Typically requires only a routine adjustment to the program. Will result in a slowdown in running time of at least n/p. Called Virtual Parallelism since each processor plays the role of several processors. A SIMD computer has been built that automatically converts code to handle n/p items per processor: the Wavetracer SIMD computer
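A small C sketch of one common block-decomposition rule (a hypothetical layout, not from the slides): processor i of p is made responsible for a contiguous block of ⌊n/p⌋ or ⌈n/p⌉ items.

    #include <stdio.h>

    int main(void) {
        int n = 10, p = 3;                        /* illustrative sizes */
        for (int i = 0; i < p; i++) {
            int first = (i * n) / p;              /* first item owned by processor i */
            int last  = ((i + 1) * n) / p - 1;    /* last item owned by processor i */
            printf("processor %d handles items %d..%d (%d items)\n",
                   i, first, last, last - first + 1);
        }
        return 0;
    }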

22 Slides from Parallel Architecture Section References: Slides in this section are taken from the Parallel Architecture slides on the course website Book reference is Chapter 2 of Quinn's textbook

23 Interconnection Networks Uses of interconnection networks: Connect processors to shared memory Connect processors to each other Different interconnection networks define different parallel machines. The interconnection network's properties influence the type of algorithm used on a machine, as they affect how data is routed. Terminology for Evaluating Switch Topologies We need to evaluate 4 characteristics of a network in order to understand its effectiveness These are: the diameter, the bisection width, the edges per node, and whether edge length is constant We'll define these and see how they affect algorithm choice. Then we will introduce several different interconnection networks. 23

24 Terminology for Evaluating Switch Topologies Diameter: the largest distance between two switch nodes. A low diameter is desirable The diameter puts a lower bound on the complexity of parallel algorithms that require communication between arbitrary pairs of nodes. Terminology for Evaluating Switch Topologies Bisection width: the minimum number of edges between switch nodes that must be removed in order to divide the network into two halves (or within 1 node of one half if the number of processors is odd). A high bisection width is desirable. In algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the algorithm. 24

25 Terminology for Evaluating Switch Topologies Number of edges per node It is best if the maximum number of edges/node is a constant independent of network size, as this allows the processor organization to scale more easily to a larger number of nodes. Degree is the maximum number of edges per node. Constant edge length? (yes/no) Again, for scalability, it is best if the nodes and edges can be laid out in 3D space so that the maximum edge length is a constant independent of network size. Three Important Interconnection Networks We will consider the following three well known interconnection networks: 2-D mesh linear network hypercube All three of these networks have been used to build commercial parallel computers. 25

26 2-D Meshes Note: Circles represent switches and squares represent processors in all these slides. 2-D Mesh Network Switches arranged into a 2-D lattice or grid Communication allowed only between neighboring switches Torus: Variant that includes wraparound connections between switches on edge of mesh 26

27 Evaluating 2-D Meshes (Assumes the mesh is square) n = number of processors Diameter: Θ(n^(1/2)) Places a lower bound on algorithms that require arbitrary nodes to share data. Bisection width: Θ(n^(1/2)) Places a lower bound on algorithms that require distribution of data to all nodes. Max number of edges per switch: 4 is the degree Constant edge length? Yes Does this scale well? Yes Linear Network Switches arranged into a 1-D mesh Corresponds to a row or column of a 2-D mesh Ring: A variant that allows a wraparound connection between the switches on the ends. The linear and ring networks have many applications Essentially support a pipeline in both directions Although these networks are very simple, they support many optimal algorithms. 27

28 Evaluating Linear and Ring Networks Diameter: Linear: n-1 or Θ(n) Ring: ⌊n/2⌋ or Θ(n) Bisection width: Linear: 1 or Θ(1) Ring: 2 or Θ(1) Degree for switches: 2 Constant edge length? Yes Does this scale well? Yes Hypercube (also called binary n-cube) A hypercube with n = 2^d processors & switches, shown for d = 4 28

29 Hypercube with n = 2^d Processors Number of nodes is a power of 2 Node addresses are 0, 1, ..., n-1 Node i is connected to the d = log n nodes whose addresses differ from i in exactly one bit position. Example: node 0111 is connected to 1111, 0011, 0101, and 0110 (a short sketch for enumerating neighbors follows). Growing a Hypercube Note: For d = 4, it is called a 4-dimensional cube. 29
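A short C sketch for enumerating a hypercube node's neighbors (the values are illustrative): flipping one address bit at a time produces the d nodes whose addresses differ from it in exactly one bit position.

    #include <stdio.h>

    int main(void) {
        int d = 4;                  /* dimension: n = 2^d nodes */
        int node = 0x7;             /* node 0111 in binary */
        for (int bit = 0; bit < d; bit++) {
            int neighbor = node ^ (1 << bit);    /* flip exactly one bit */
            printf("node %d is connected to node %d\n", node, neighbor);
        }
        return 0;
    }

For node 0111 this prints the neighbors 0110, 0101, 0011, and 1111, matching the example above.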

30 Evaluating Hypercube Network with n = 2^d nodes Diameter: d = log n Bisection width: n / 2 Edges per node: log n Constant edge length? No. The length of the longest edge increases as n increases MIMD Message-Passing Slides are still from the Parallel Architecture unit at the Parallel & Distributed Computing website 30

31 Some Interconnection Network Terminology (1/2) References: Wilkinson, et. al. & Grama, et. al. Also, earlier slides on architecture & networks. A link is the connection between two nodes. A switch that enables packets to be routed through the node to other nodes without disturbing the processor is assumed. The link between two nodes can be either bidirectional or use two directional links. Can assume either one wire that carries one bit or parallel wires (one wire for each bit in word). The above choices do not have a major impact on the concepts presented in this course. 61

32 Network Terminology (2/2) The bandwidth is the number of bits that can be transmitted in unit time (i.e., bits per second). The network latency is the time required to transfer a message through the network. The communication latency is the total time required to send a message, including software overhead and interface delay. The message latency or startup time is the time required to send a zero-length message. Includes software & hardware overhead, such as choosing a route and packing and unpacking the message 63 Store-and-forward Packet Switching The message is divided into packets of information Each packet includes source and destination addresses. Packets cannot exceed a fixed maximum size (e.g., 1000 bytes). A packet is stored in a buffer at a node until it can move on to the next node (a rough latency model is sketched below). Different packets typically follow different routes but are re-assembled at the destination as the packets arrive. The movement of packets is asynchronous
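A rough sketch in C of a store-and-forward latency estimate (an assumption-laden model, not from the slides): because every hop must receive and buffer the whole packet before forwarding it, each hop adds roughly one full packet-transmission time, so latency grows with the route length.

    #include <stdio.h>

    int main(void) {
        double startup   = 50e-6;     /* illustrative startup (software) overhead, seconds */
        double bits      = 8000.0;    /* a 1000-byte packet */
        double bandwidth = 1.0e9;     /* link bandwidth in bits per second */
        int    hops      = 4;         /* length of the route */

        /* each hop adds one full packet-transmission time */
        double latency = startup + hops * (bits / bandwidth);
        printf("estimated store-and-forward latency: %g seconds\n", latency);
        return 0;
    }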

33 Packet Switching (cont) At each node, the destination information is examined and used to select which node to forward the packet to. Routing algorithms (often probabilistic) are used to avoid hot spots and to minimize traffic jams. Significant latency is created by storing each packet at each node it reaches. Latency increases linearly with the length of the route. 65 Slides from Performance Analysis 33

34 References on Performance Evaluation Slides are from the course website, on the topic of Performance Evaluation. Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997; an updated online version is available through the website. Michael Quinn, Parallel Programming in C with MPI and OpenMP, Ch. 7, McGraw-Hill, 2004. Outline Speedup Superlinearity Issues Speedup Analysis Cost Efficiency Amdahl's Law Gustafson's Law 34

35 Speedup Speedup measures the increase in speed (i.e., the reduction in running time) due to parallelism. The number of PEs is given by n. S(n) = t_s / t_p, where t_s is the running time on a single processor, using the fastest known sequential algorithm, and t_p is the running time using a parallel processor. In simplest terms, Speedup = Sequential running time / Parallel running time Linear Speedup Usually Optimal Speedup is linear if S(n) = Θ(n) Claim: The maximum possible speedup for parallel computers with n PEs is n. Usual pseudo-proof (assume ideal conditions): Assume a computation is partitioned perfectly into n processes of equal duration. Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.). Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation and the parallel running time will be t_s /n. Then the parallel speedup in this ideal situation is S(n) = t_s /(t_s /n) = n

36 Linear Speedup Usually Optimal (cont) This argument shows that for typical problems, linear speedup is optimal This argument is valid for traditional problems, but is invalid for some types of nontraditional problems. Speedup Usually Smaller Than Linear Unfortunately, the best speedup possible for most applications is considerably smaller than n The ideal-conditions performance mentioned in the earlier argument is usually unattainable. Normally, some parts of programs are sequential and allow only one PE to be active. Sometimes a significant number of processors are idle for certain portions of the program. For example: Some PEs may be waiting to receive or to send data during parts of the program. Congestion may occur during message passing 36

37 Superlinear Speedup Superlinear speedup occurs when S(n) > n Occasionally speedup that appears to be superlinear may occur, but it can be explained by other reasons, such as: the extra memory in the parallel system; a sub-optimal sequential algorithm being compared to the parallel algorithm; or luck, in the case of an algorithm that has a random aspect in its design (e.g., random selection) Superlinearity (cont) Selim Akl has given a multitude of examples that establish that superlinear algorithms are required for many non-standard problems Examples include nonstandard problems involving: Real-time requirements, where meeting deadlines is part of the problem requirements. Problems where all data is not initially available, but has to be processed after it arrives. Some problems are natural to solve using parallelism, and sequential solutions are inefficient. 37

38 Execution Time for the Parallel Portion [plot: time versus processors] Shows a nontrivial parallel algorithm's computation component as a decreasing function of the number of processors used. Time for MIMD Communication [plot: time versus processors] Shows a nontrivial parallel algorithm's communication component as an increasing function of the number of processors. 38

39 Combining Parallel & MIMD Communication Times [plot: time versus processors] Combining these, we see that for a fixed problem size, there is an optimum number of processors that minimizes overall execution time. MIMD Speedup Plot [plot: speedup versus processors] Speedup reaches a maximum and then drops as the number of processors increases 39

40 Cost The cost of a parallel algorithm (or program) is Cost = Parallel running time × #processors The cost of a parallel algorithm should be compared to the running time of a sequential algorithm. Cost removes the advantage of parallelism by charging for each additional processor. A parallel algorithm whose cost is O(running time of an optimal sequential algorithm) is called cost-optimal. Efficiency Efficiency = Sequential running time / (Processors used × Parallel running time) Equivalently, Efficiency = Speedup / Processors used = Sequential running time / Cost For traditional problems, 0 ≤ Efficiency ≤ 1 40
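A small worked example with illustrative numbers: if the fastest sequential algorithm runs in t_s = 100 seconds and the parallel program runs in t_p = 20 seconds on 8 processors, then Speedup = 100/20 = 5, Cost = 20 × 8 = 160, and Efficiency = 100/160 = 5/8 ≈ 0.63 (equivalently, Speedup / Processors = 5/8).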

41 Amdahl's Law Having a detailed understanding of Amdahl's law is not essential for this course. However, a brief, non-technical introduction to this important law could be useful. 81 Amdahl's Law Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup achievable by a parallel computer with n processors is S(n) ≤ 1 / (f + (1 - f)/n) ≤ 1/f The word law is often used by computer scientists for an observed phenomenon (e.g., Moore's Law) rather than a theorem that has been proven in a strict sense. It is easy to prove Amdahl's law for traditional problems, but it is not valid for non-traditional problems. 41

42 Example 1 Assume 95% of a program's execution time occurs inside a loop that can be executed in parallel. Amdahl's law shows the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs is less than Speedup ≤ 1 / (0.05 + (1 - 0.05)/8) ≈ 5.9 Example 2 Assume 5% of a parallel program's execution time is spent within inherently sequential code. Amdahl's law shows that the maximum speedup achievable by this program, regardless of how many PEs are used, is lim (p→∞) 1 / (0.05 + (1 - 0.05)/p) = 1/0.05 = 20 42
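A tiny C sketch that evaluates the Amdahl bound S(n) ≤ 1/(f + (1 - f)/n) for the two examples above (the function name is mine, not from the slides):

    #include <stdio.h>

    /* upper bound on speedup from Amdahl's law */
    static double amdahl_bound(double f, double n) {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void) {
        printf("Example 1: f = 0.05, n = 8  -> S <= %.2f\n", amdahl_bound(0.05, 8.0)); /* about 5.9 */
        printf("Example 2: f = 0.05, n -> infinity -> S <= %.2f\n", 1.0 / 0.05);       /* 20 */
        return 0;
    }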

43 Amdahl's Law The argument used in the proof of Amdahl's law assumes that speedup cannot be superlinear, so the proof is invalid for non-traditional problems. Sometimes Amdahl's law is just stated as S(n) ≤ 1/f Note that S(n) never exceeds 1/f and approaches 1/f as n increases. Consequences of Amdahl's Limitations to Parallelism For a long time, Amdahl's law was viewed as a fatal limit to the usefulness of parallelism. A key flaw in these early arguments is that they were unaware of the impact of Gustafson's Law: Gustafson's Law: The proportion of the computations that are sequential normally decreases as the problem size increases. Note: Gustafson's law is an observed phenomenon and not a theorem. The negative impact of Amdahl's law disappears as the problem size increases. 43

44 Limitations of Amdahl's Law It is now generally accepted by parallel computing professionals that Amdahl's law is not a serious limit to the benefit and future of parallel computing Note that Amdahl's law shows that efforts to further reduce the fraction of the code that is sequential may pay off in huge performance gains.

45 Parallel MIMD Algorithm Design Reference: Chapter 3, Quinn textbook References: Slides on Parallel Algorithm Design at the course website (F08/) Chapter 3 of Quinn's textbook 90 45

46 Task/Channel Model This model is intended for MIMDs (i.e., multiprocessors and multicomputers) and not for SIMDs. Parallel computation = set of tasks A task consists of a Program Local memory Collection of I/O ports Tasks interact by sending messages through channels A task can send local data values to other tasks via output ports A task can receive data values from other tasks via input ports. The local memory contains the program's instructions and its private data 91 Task/Channel Model A channel is a message queue that connects one task's output port with another task's input port. Data values appear at the input port in the same order in which they are placed in the channel. A task is blocked if it tries to receive a value at an input port and the value isn't available; the blocked task must wait until the value is received. A process sending a message is never blocked, even if previous messages it has sent on the channel have not been received yet. Thus, receiving is a synchronous operation and sending is an asynchronous operation.

47 Task/Channel Model Local accesses of private data are assumed to be easily distinguished from nonlocal data accesses done over channels. Local accesses should be considered much faster than nonlocal accesses. In this model: The execution time of a parallel algorithm is the period of time a task is active. The starting time of a parallel algorithm is when all tasks simultaneously begin executing. The finishing time of a parallel algorithm is when the last task has stopped executing. 93 Task/Channel Model [figure: a parallel computation viewed as a directed graph whose nodes are tasks and whose edges are channels; a minimal MPI sketch of one channel follows] 94 47
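A minimal MPI sketch (an assumed mapping of the model onto MPI, not Quinn's code) of one channel between two tasks: the receive blocks until the value arrives, while the send returns without waiting for the receiver, mirroring the synchronous-receive / asynchronous-send semantics above. Run with at least two processes.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                           /* sending task: not blocked */
            double value = 3.14;
            MPI_Request req;
            MPI_Isend(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* the sender is free to keep computing here */
            MPI_Wait(&req, MPI_STATUS_IGNORE);     /* needed only before reusing the buffer */
        } else if (rank == 1) {                    /* receiving task: blocks until the value arrives */
            double value;
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %g on its input port\n", value);
        }

        MPI_Finalize();
        return 0;
    }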

48 Foster's Design Methodology Ian Foster has proposed a 4-step process for designing parallel algorithms for machines that fit the task/channel model. Foster's online textbook is a useful resource here It encourages the development of scalable algorithms by delaying machine-dependent considerations until the later steps. The 4 design steps are called: Partitioning Communication Agglomeration Mapping 95 Foster's Methodology 96 48

49 Partitioning Partitioning: dividing the computation and data into pieces Domain decomposition is one approach: Divide the data into pieces Determine how to associate computations with the data Focuses on the largest and most frequently accessed data structure Functional decomposition is another approach: Divide the computation into pieces Determine how to associate data with the computations This often yields tasks that can be pipelined. 97 Example Domain Decompositions Think of the primitive tasks as processors. In the first, each 2D slice is mapped onto one processor of a system using 3 processors. In the second, a 1D slice is mapped onto a processor. In the last, a single element is mapped onto a processor. The last leaves the most primitive tasks and is usually preferred

50 Example Functional Decomposition 99 Partitioning Checklist for Evaluating the Quality of a Partition At least 10x more primitive tasks than processors in the target computer Minimize redundant computations and redundant data storage Primitive tasks are roughly the same size The number of tasks is an increasing function of problem size Remember we are talking about MIMDs here, which typically have far fewer processors than SIMDs

51 Foster's Methodology 101 Communication Determine values passed among tasks There are two kinds of communication: Local communication A task needs values from a small number of other tasks Create channels illustrating data flow Global communication A significant number of tasks contribute data to perform a computation Don't create channels for them early in design

52 Communication (cont.) Communication is part of the parallel computation overhead since it is something sequential algorithms do not have to do. Costs are larger if some (MIMD) processors have to be synchronized. SIMD algorithms have much smaller communication overhead because: Much of the SIMD data movement is between the control unit and the PEs on broadcast/reduction circuits (especially true for associative machines) Parallel data movement along the interconnection network involves lockstep (i.e., synchronous) moves. 103 Communication Checklist for Judging the Quality of Communications Communication operations should be balanced among tasks Each task communicates with only a small group of neighbors Tasks can perform communications concurrently Tasks can perform computations concurrently

53 Foster's Methodology 105 What We Have Hopefully at This Point and What We Don't Have The first two steps look for parallelism in the problem. However, the design obtained at this point probably doesn't map well onto a real machine. If the number of tasks greatly exceeds the number of processors, the overhead will be strongly affected by how the tasks are assigned to the processors. Now we have to decide what type of computer we are targeting: Is it a centralized multiprocessor or a multicomputer? What communication paths are supported? How must we combine tasks in order to map them effectively onto processors?

54 Agglomeration Agglomeration: grouping tasks into larger tasks Goals: Improve performance Maintain scalability of the program Simplify programming, i.e., reduce software engineering costs. In MPI programming, a goal is to lower communication overhead, often by creating one agglomerated task per processor By agglomerating primitive tasks that communicate with each other, communication is eliminated since the needed data is local to a processor. 107 Agglomeration Can Improve Performance It can eliminate communication between primitive tasks agglomerated into a consolidated task It can combine groups of sending and receiving tasks

55 Scalability Assume we are manipulating a 3D matrix of size 8 x 128 x 256, and our target machine is a centralized multiprocessor with 4 CPUs. Suppose we agglomerate the 2nd and 3rd dimensions. Can we run on our target machine? Yes, because we can have tasks which are each responsible for a 2 x 128 x 256 submatrix. Suppose we change to a target machine that is a centralized multiprocessor with 8 CPUs. Could our previous design basically work? Yes, because each task could handle a 1 x 128 x 256 submatrix. 109 Scalability However, what if we go to more than 8 CPUs? Would our design change if we had agglomerated the 2nd and 3rd dimensions for the 8 x 128 x 256 matrix? Yes. This says that the decision to agglomerate the 2nd and 3rd dimensions has, in the long run, the drawback that the code's portability to more CPUs is impaired

56 Reducing Software Engineering Costs Software engineering: the study of techniques to bring very large projects in on time and on budget. One purpose of agglomeration is to look for places where existing sequential code for a task might exist; use of that code helps bring down the cost of developing a parallel algorithm from scratch. 111 Agglomeration Checklist for Checking the Quality of the Agglomeration Locality of the parallel algorithm has increased Replicated computations take less time than the communications they replace Data replication doesn't affect scalability All agglomerated tasks have similar computational and communications costs The number of tasks increases with problem size The number of tasks is suitable for likely target systems The tradeoff between agglomeration and code modification costs is reasonable 112

57 Foster's Methodology

58 Mapping Mapping: the process of assigning tasks to processors Centralized multiprocessor: mapping is done by the operating system Distributed memory system: mapping is done by the user Conflicting goals of mapping: Maximize processor utilization, i.e., the average percentage of time the system's processors are actively executing tasks necessary for solving the problem. Minimize interprocessor communication 115 Mapping Example (a) is a task/channel graph showing the needed communications over channels. (b) shows a possible mapping of the tasks to 3 processors.

59 Mapping Example If all tasks require the same amount of time and each CPU has the same capability, this mapping would mean the middle processor will take twice as long as the other two Optimal Mapping Optimality is with respect to processor utilization and interprocessor communication. Finding an optimal mapping is NP-hard. Must rely on heuristics applied either manually or by the operating system. It is the interaction of the processor utilization and communication that is important. For example, with p processors and n tasks, putting all tasks on 1 processor makes interprocessor communication zero, but utilization is 1/p

60 A Mapping Decision Tree (Quinn's suggestions; details on pg 72)
Static number of tasks:
  Structured communication:
    Constant computation time per task: agglomerate tasks to minimize communications; create one task per processor
    Variable computation time per task: cyclically map tasks to processors
  Unstructured communication: use a static load balancing algorithm
Dynamic number of tasks:
  Frequent communication between tasks: use a dynamic load balancing algorithm
  Many short-lived tasks, no internal communication: use a run-time task-scheduling algorithm
119 Mapping Checklist to Judge the Quality of a Mapping Consider designs based on one task per processor and on multiple tasks per processor. Evaluate static and dynamic task allocation If dynamic task allocation is chosen, the task allocator (i.e., manager) should not be a bottleneck to performance If static task allocation is chosen, the ratio of tasks to processors should be at least 10:1

Boundary Value Problem An example to illustrate the use of Foster's design method Boundary Value Problem [figure: a rod surrounded by insulation, with ice water at each end] Problem: The ends of a rod of length 1 are in contact with ice water at 0° C. The initial temperature at distance x from the end of the rod is 100 sin(πx). (These are the boundary values.) The rod is surrounded by heavy insulation, so the temperature changes along the rod are a result of heat transfer at the ends of the rod and heat conduction along its length. We want to model the temperature at any point on the rod as a function of time. 122 61

62 Over time the rod gradually cools. A partial differential equation (PDE) models the temperature at any point of the rod at any point in time. PDEs can be hard to solve directly, but a method called the finite difference method is one way to approximate a good solution using a computer. The derivative of f at a point x is defined by the limit: f'(x) = lim (h→0) [f(x+h) - f(x)] / h If h is a fixed non-zero value (i.e., we don't take the limit), then the above expression is called a finite difference. 123 Finite differences approach differential quotients as h goes to zero. Thus, we can use finite differences to approximate derivatives. This is often used in numerical analysis, especially in numerical ordinary differential equations and numerical partial differential equations, which aim at the numerical solution of ordinary and partial differential equations respectively. The resulting methods are called finite-difference methods.

63 An Example of Using a Finite Difference Method for an ODE (Ordinary Differential Equation) Given f'(x) = 3f(x) + 2, the fact that [f(x+h) - f(x)] / h approximates f'(x) can be used to iteratively calculate an approximation to f(x). In our case, a finite difference method finds the temperature at a fixed number of points in the rod at various time intervals. The smaller the steps in space and time, the better the approximation. 125 Rod Cools as Time Progresses A finite difference method computes these temperature approximations (vertical axis) at various points along the rod (horizontal axis) for different times between 0 and the end of the simulated interval.

64 The Finite Difference Approximation Requires the Following Data Structure A matrix is used where columns represent positions and rows represent time. The element u(i,j) contains the temperature at position i on the rod at time j. At each end of the rod the temperature is always 0. At time 0, the temperature at point x is 100 sin(πx) 127 Finite Difference Method Actually Used We have seen that for small h, we may approximate f'(x) by f'(x) ≈ [f(x + h) - f(x)] / h It can be shown that in this case, for small h, f''(x) ≈ [f(x + h) - 2f(x) + f(x - h)] / h^2 Let u(i,j) represent the matrix element containing the temperature at position i on the rod at time j. Using the above approximations, it is possible to determine a positive value r so that u(i,j+1) ≈ r·u(i-1,j) + (1 - 2r)·u(i,j) + r·u(i+1,j) In the finite difference method, the algorithm computes the temperatures for the next time period using the above approximation (a serial sketch of this update follows)
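A serial C sketch of the finite-difference update above; the number of grid points, the number of time steps, and r are illustrative choices (this is not Quinn's exact program).

    #include <stdio.h>
    #include <math.h>

    #define N     10          /* interior grid points along the rod */
    #define STEPS 100         /* number of time steps */
    #define PI    3.14159265358979323846

    int main(void) {
        double u[N + 2], unew[N + 2];    /* u[0] and u[N+1] are the ends held at 0 */
        double r = 0.25;                 /* r depends on the space and time step sizes */

        for (int i = 0; i <= N + 1; i++)                 /* t = 0: 100 sin(pi x) */
            u[i] = 100.0 * sin(PI * i / (N + 1.0));
        u[0] = u[N + 1] = 0.0;                           /* boundary values */

        for (int j = 0; j < STEPS; j++) {                /* advance one row of the matrix */
            for (int i = 1; i <= N; i++)
                unew[i] = r * u[i - 1] + (1.0 - 2.0 * r) * u[i] + r * u[i + 1];
            for (int i = 1; i <= N; i++)
                u[i] = unew[i];
        }

        printf("temperature near the middle after %d steps: %g\n", STEPS, u[(N + 1) / 2]);
        return 0;
    }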

65 Partitioning Step This one is fairly easy to identify initially. There is one data item (i.e., a temperature) per grid point in the matrix. Let's associate one primitive task with each grid point. A primitive task would be the calculation of u(i,j+1) as shown on the last slide. This gives us a two-dimensional domain decomposition. 129 Communication Step Next, we identify the communication pattern between primitive tasks. Each interior primitive task needs three incoming and three outgoing channels: to calculate u(i,j+1) = r·u(i-1,j) + (1 - 2r)·u(i,j) + r·u(i+1,j), the task needs u(i-1,j), u(i,j), and u(i+1,j), i.e., 3 incoming channels, and u(i,j+1) will be needed by 3 other tasks, i.e., 3 outgoing channels. Tasks on the sides don't need as many channels, but we really need to worry about the interior nodes.

66 Agglomeration Step We now have the task/channel graph shown below. It should be clear this is not a good situation even if we had enough processors: the top row depends on values from the bottom rows. Be careful when designing a parallel algorithm that you don't think you have parallelism when the tasks are actually sequential. 131 Collapse the Columns in the First Agglomeration Step The first task/channel graph represents each task as computing one temperature for a given position and time. The second task/channel graph represents each task as computing the temperature at a particular position for all time steps.

67 Mapping Step This graph shows only a few intervals. We are using one processor per task. For the sake of a good approximation, we may want many more intervals than we have processors. We go back to the decision tree on page 72 to see if we can do better when we want more intervals than we have available processors. Note: On a large SIMD with an interconnection network (which the ASC emulator doesn't have), we might stop here, as we could possibly have enough processors. 133 Use the Decision Tree (see the earlier slide on the decision tree or pg 72 of Quinn) The number of tasks is static once we decide on how many intervals we want to use. The communication pattern among the tasks is regular, i.e., structured. Each task performs the same computations. Therefore, the decision tree says to create one task per processor by agglomerating primitive tasks so that computation workloads are balanced and communication is minimized. So, we will associate a contiguous piece of the rod with each task by dividing the rod into n pieces of size h, where n is the number of processors we have. Comment: We can decide how to design the algorithm without use of the decision tree as well

68 Pictorially Our previous task/channel graph assumed 10 consolidated tasks, one per interval. If we now assume 3 processors, we would instead have 3 consolidated tasks. Note this maintains the possibility of using some kind of nearest-neighbor interconnection network and eliminates unnecessary communication. What interconnection networks would work well? (An MPI sketch of the agglomerated solver follows.) 135 Agglomeration and Mapping [figure] 136 68
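A sketch in C with MPI of the agglomerated and mapped design (an assumed implementation, not Quinn's code): each process owns a contiguous block of rod points and exchanges a single boundary temperature with each neighbor every time step, which is exactly the nearest-neighbor communication pattern noted above. The sizes, r, and the even division of points are illustrative.

    #include <stdio.h>
    #include <math.h>
    #include <mpi.h>

    #define TOTAL 64          /* interior points on the whole rod */
    #define STEPS 100
    #define PI    3.14159265358979323846

    int main(int argc, char *argv[]) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int local = TOTAL / p;                   /* assume p divides TOTAL evenly */
        double u[local + 2], unew[local + 2];    /* u[0] and u[local+1] are ghost cells */
        double r = 0.25;

        for (int i = 1; i <= local; i++) {       /* initial condition 100 sin(pi x) */
            double x = (double)(rank * local + i) / (TOTAL + 1);
            u[i] = 100.0 * sin(PI * x);
        }
        u[0] = u[local + 1] = 0.0;               /* rod ends held at 0 degrees */

        for (int j = 0; j < STEPS; j++) {
            if (rank > 0)                        /* swap boundary values with the left neighbor */
                MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, rank - 1, 0,
                             &u[0], 1, MPI_DOUBLE, rank - 1, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (rank < p - 1)                    /* swap boundary values with the right neighbor */
                MPI_Sendrecv(&u[local], 1, MPI_DOUBLE, rank + 1, 0,
                             &u[local + 1], 1, MPI_DOUBLE, rank + 1, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            for (int i = 1; i <= local; i++)     /* apply the finite-difference update */
                unew[i] = r * u[i - 1] + (1.0 - 2.0 * r) * u[i] + r * u[i + 1];
            for (int i = 1; i <= local; i++)
                u[i] = unew[i];
        }

        if (rank == 0)
            printf("u at the first interior point after %d steps: %g\n", STEPS, u[1]);
        MPI_Finalize();
        return 0;
    }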

69 End of Unit This unit covered an overview of general topics on parallel computing. Slides were taken from the website for my Parallel and Distributed Computing course, which can be used for reference, if desired


More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Parallel Processors. Session 1 Introduction

Parallel Processors. Session 1 Introduction Parallel Processors Session 1 Introduction Applications of Parallel Processors Structural Analysis Weather Forecasting Petroleum Exploration Fusion Energy Research Medical Diagnosis Aerodynamics Simulations

More information

BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List) BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside

More information

What is Parallel Computing?

What is Parallel Computing? What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing

More information

IN5050: Programming heterogeneous multi-core processors Thinking Parallel

IN5050: Programming heterogeneous multi-core processors Thinking Parallel IN5050: Programming heterogeneous multi-core processors Thinking Parallel 28/8-2018 Designing and Building Parallel Programs Ian Foster s framework proposal develop intuition as to what constitutes a good

More information

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

Why Multiprocessors?

Why Multiprocessors? Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software

More information

Computer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Parallel Processing Basics Prof. Onur Mutlu Carnegie Mellon University Readings Required Hill, Jouppi, Sohi, Multiprocessors and Multicomputers, pp. 551-560 in Readings in Computer

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally

More information

CS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont.

CS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont. CS4961 Parallel Programming Lecture 4: Memory Systems and Interconnects Administrative Nikhil office hours: - Monday, 2-3PM - Lab hours on Tuesday afternoons during programming assignments First homework

More information

Computer parallelism Flynn s categories

Computer parallelism Flynn s categories 04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

CPS 303 High Performance Computing. Wensheng Shen Department of Computational Science SUNY Brockport

CPS 303 High Performance Computing. Wensheng Shen Department of Computational Science SUNY Brockport CPS 303 High Performance Computing Wensheng Shen Department of Computational Science SUNY Brockport Chapter 2: Architecture of Parallel Computers Hardware Software 2.1.1 Flynn s taxonomy Single-instruction

More information

Fundamentals of. Parallel Computing. Sanjay Razdan. Alpha Science International Ltd. Oxford, U.K.

Fundamentals of. Parallel Computing. Sanjay Razdan. Alpha Science International Ltd. Oxford, U.K. Fundamentals of Parallel Computing Sanjay Razdan Alpha Science International Ltd. Oxford, U.K. CONTENTS Preface Acknowledgements vii ix 1. Introduction to Parallel Computing 1.1-1.37 1.1 Parallel Computing

More information

Parallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam

Parallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Parallel Computer Architectures Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Outline Flynn s Taxonomy Classification of Parallel Computers Based on Architectures Flynn s Taxonomy Based on notions of

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

Parallel Systems Prof. James L. Frankel Harvard University. Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved.

Parallel Systems Prof. James L. Frankel Harvard University. Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Parallel Systems Prof. James L. Frankel Harvard University Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Architectures SISD (Single Instruction, Single Data)

More information

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,

More information

Parallel Architecture, Software And Performance

Parallel Architecture, Software And Performance Parallel Architecture, Software And Performance UCSB CS240A, T. Yang, 2016 Roadmap Parallel architectures for high performance computing Shared memory architecture with cache coherence Performance evaluation

More information

Chapter 11. Introduction to Multiprocessors

Chapter 11. Introduction to Multiprocessors Chapter 11 Introduction to Multiprocessors 11.1 Introduction A multiple processor system consists of two or more processors that are connected in a manner that allows them to share the simultaneous (parallel)

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

CSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing

CSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing Dr Izadi CSE-4533 Introduction to Parallel Processing Chapter 4 Models of Parallel Processing Elaborate on the taxonomy of parallel processing from chapter Introduce abstract models of shared and distributed

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance

More information

HOW TO WRITE PARALLEL PROGRAMS AND UTILIZE CLUSTERS EFFICIENTLY

HOW TO WRITE PARALLEL PROGRAMS AND UTILIZE CLUSTERS EFFICIENTLY HOW TO WRITE PARALLEL PROGRAMS AND UTILIZE CLUSTERS EFFICIENTLY Sarvani Chadalapaka HPC Administrator University of California Merced, Office of Information Technology schadalapaka@ucmerced.edu it.ucmerced.edu

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

PIPELINE AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation Shared Memory and Distributed Multiprocessing Bhanu Kapoor, Ph.D. The Saylor Foundation 1 Issue with Parallelism Parallel software is the problem Need to get significant performance improvement Otherwise,

More information

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein

More information

Overview of High Performance Computing

Overview of High Performance Computing Overview of High Performance Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu http://inside.mines.edu/~tkaiser/csci580fall13/ 1 Near Term Overview HPC computing in a nutshell? Basic MPI - run an example

More information

Interconnection networks

Interconnection networks Interconnection networks When more than one processor needs to access a memory structure, interconnection networks are needed to route data from processors to memories (concurrent access to a shared memory

More information

Multiprocessors 1. Outline

Multiprocessors 1. Outline Multiprocessors 1 Outline Multiprocessing Coherence Write Consistency Snooping Building Blocks Snooping protocols and examples Coherence traffic and performance on MP Directory-based protocols and examples

More information

INTERCONNECTION NETWORKS LECTURE 4

INTERCONNECTION NETWORKS LECTURE 4 INTERCONNECTION NETWORKS LECTURE 4 DR. SAMMAN H. AMEEN 1 Topology Specifies way switches are wired Affects routing, reliability, throughput, latency, building ease Routing How does a message get from source

More information

Goals of this Course

Goals of this Course CISC 849-010 High performance parallel algorithms for computational science Instructor: Dr. Michela Taufer Spring 2009 Goals of this Course This course is intended to provide students with an understanding

More information

Parallel Computing Introduction

Parallel Computing Introduction Parallel Computing Introduction Bedřich Beneš, Ph.D. Associate Professor Department of Computer Graphics Purdue University von Neumann computer architecture CPU Hard disk Network Bus Memory GPU I/O devices

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information