Parallel Architectures
Parallel Architectures
Instructor: Tsung-Che Chiang, Department of Computer Science and Information Engineering, National Taiwan Normal University

Introduction
In the roughly three decades between the mid-1960s and the mid-1990s, scientists and engineers explored a wide variety of parallel computer architectures. Experts passionately debated whether the dominant parallel computer systems would contain at most a few dozen high-performance processors or thousands of less-powerful processors. Today, most contemporary parallel computers are constructed out of commodity CPUs.
Outline
Interconnection Networks; Processor Arrays; Multiprocessors; Multicomputers; Flynn's Taxonomy; Summary

Interconnection Networks: Shared Medium
A shared medium allows only one message at a time. Each processor listens to every message and receives the ones for which it is the destination. Ethernet is a well-known example. Message collisions can significantly degrade the performance of a heavily utilized shared medium.
Interconnection Networks: Switched Medium
A switched medium supports point-to-point messages among pairs of processors; each processor has its own communication path to the switch. It has two advantages over a shared medium: support of concurrent transmission and support of network scaling.

Switch Network Topologies
A switch network can be represented by a graph. Nodes are processors and switches: each processor is connected to one switch, and switches connect processors and/or other switches. Edges are communication paths. A topology is direct when the ratio of switch nodes to processor nodes is 1:1, and indirect when that ratio is greater than 1:1.
Switch Network Topologies: Evaluation Criteria
- Diameter: the largest distance between two switch nodes.
- Bisection width: the minimum number of edges between switch nodes that must be removed to divide the network into two halves.
- Edges per switch node: it is best if this value is a constant independent of the network size (better scalability).
- Constant edge length: it is best if the nodes and edges of the network can be laid out in 3-D space so that the maximum edge length is a constant independent of the network size.

In the following, we discuss six switch network topologies: 2-D mesh, binary tree, hypertree, butterfly, hypercube, and shuffle-exchange.
2-D Mesh Network
Properties (direct topology). Assuming n switch nodes and no wraparound connections:
- minimum diameter: 2(n^(1/2) - 1)
- maximum bisection width: n^(1/2)
- edges/node: 4
- constant edge length

Binary Tree Network
Properties (indirect topology). Assuming n = 2^d processors (with 2n - 1 switches):
- diameter: 2 log n
- bisection width: 1
- edges/node: 3
- non-constant edge length
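The formulas above can be checked with a small sketch. This is Python of my own devising (the slides give only the formulas; the function names and dictionary layout are assumptions), computing the properties for a mesh with no wraparound and for a binary tree of n = 2^d processors:

```python
import math

def mesh_properties(n):
    """2-D mesh with n switch nodes (n a perfect square), no wraparound."""
    side = int(math.isqrt(n))
    assert side * side == n
    return {"diameter": 2 * (side - 1),   # corner to opposite corner
            "bisection_width": side,      # cut one row/column of links
            "edges_per_node": 4}          # constant, independent of n

def binary_tree_properties(n):
    """Binary tree connecting n = 2^d processors through 2n - 1 switches."""
    d = n.bit_length() - 1
    assert 1 << d == n
    return {"switches": 2 * n - 1,
            "diameter": 2 * d,            # up to the root and back down
            "bisection_width": 1}         # removing a root edge splits it

print(mesh_properties(16))        # diameter 6, bisection width 4
print(binary_tree_properties(8))  # 15 switches, diameter 6
```

The contrast the slides draw is visible in the numbers: the mesh's diameter grows as the square root of n while the tree's grows logarithmically, but the tree's bisection width stays stuck at 1.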
Hypertree Network (1/3)
A hypertree shares the low diameter of a binary tree but has an improved bisection width. For a hypertree of degree k and depth d: from the front, it looks like a complete k-ary tree; from the side, it looks like an upside-down binary tree. (Figure: front view and side view of a hypertree with k = 4, d = 2.)

Hypertree Network (2/3)
(Figure: the complete hypertree obtained by combining the two views.)
Hypertree Network (3/3)
Properties (indirect topology). Assuming k = 4, n = 4^d processors, and 2^d(2^(d+1) - 1) switches:
- diameter: 2d (i.e., log n)
- bisection width: 2^(d+1)
- edges/node: no more than 6
- non-constant edge length

Butterfly Network (1/6)
(Figure: a butterfly network for n = 2^d processors, shown with d = 3; switch nodes are labeled (rank, column) from rank 0 through rank d.)
Butterfly Network (2/6)
(Figure: switch node (i, j) on rank i is connected to node (i - 1, j) on the rank above it.)

Butterfly Network (3/6)
(Figure: node (i, j) is also connected to node (i - 1, m), where m is obtained by inverting the i-th most significant bit in the binary representation of j.)
Butterfly Network (4/6)
Where the "butterfly" is: as the rank number decreases, the widths of the wings of the butterflies increase exponentially (hence the non-constant edge length).

Butterfly Network (5/6)
Message routing: each switch node picks off the lead bit of the destination address from the message header; 0 means route left, 1 means route right.
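The routing rule above can be sketched in Python. This is a minimal sketch under the (rank, column) labeling of the earlier figures; the function name and the path representation are my own. At rank i the switch consumes the i-th most significant bit of the destination column: it goes straight when the current column already agrees with the destination in that bit, and takes the cross edge (which inverts that bit) otherwise.

```python
def butterfly_route(src_col, dst_col, d):
    """Path from rank 0 to rank d through a butterfly with 2^d columns."""
    path = [(0, src_col)]
    col = src_col
    for i in range(d):                 # descend from rank 0 to rank d
        bit = d - 1 - i                # i-th most significant bit position
        if ((col >> bit) & 1) != ((dst_col >> bit) & 1):
            col ^= 1 << bit            # cross edge: invert that bit
        path.append((i + 1, col))
    return path

# 8-column butterfly (d = 3): route from column 2 to column 5
print(butterfly_route(2, 5, 3))   # [(0, 2), (1, 6), (2, 4), (3, 5)]
```

After exactly d = log n hops the column equals the destination, which is the diameter claim made for the butterfly on the next slide.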
Butterfly Network (6/6)
Properties (indirect topology). Assuming n = 2^d processors and n(log n + 1) switches, with the switch nodes on ranks 0 and log n treated as the same:
- diameter: log n
- bisection width: n/2
- edges/node: 4
- non-constant edge length

Hypercube Network (1/4)
A hypercube network, also called a binary n-cube, is a butterfly in which each column of switch nodes is collapsed into a single node. (Figure: a 3-D hypercube obtained by collapsing the columns of a butterfly.)
Hypercube Network (2/4)
The processors and their associated switches are labeled 0, 1, ..., 2^d - 1; two switches are adjacent if their binary labels differ in exactly one bit position. (Figure: hypercubes for small values of d.)

Hypercube Network (3/4)
Properties (direct topology). Assuming n = 2^d processors:
- diameter: log n
- bisection width: n/2
- edges/node: log n
- non-constant edge length
Hypercube Network (4/4)
Message routing: note that edges always connect switches whose addresses differ in exactly one bit position. Example: send a message from 0101 to 0011 (the addresses differ in two bit positions, so there are two shortest paths). Path 1: 0101 -> 0111 -> 0011. Path 2: 0101 -> 0001 -> 0011.

Shuffle-Exchange Network (1/5)
Perfect shuffle: like taking a sorted deck of cards, dividing it exactly in half, and shuffling the two halves together perfectly.
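The hypercube routing example above follows directly from the adjacency rule: XOR the source and destination and flip the differing bits one at a time, in any order. A minimal Python sketch (the function name and the lowest-bit-first ordering are my own choices; any flip order gives a shortest path, which is why 0101 -> 0011 has more than one minimal route):

```python
def hypercube_route(src, dst, d):
    """One shortest path in a d-cube: flip the differing bits, lowest first."""
    path = [src]
    node = src
    diff = src ^ dst              # 1-bits mark the positions to flip
    for bit in range(d):
        if (diff >> bit) & 1:
            node ^= 1 << bit      # cross the edge that fixes this bit
            path.append(node)
    return path

path = hypercube_route(0b0101, 0b0011, 4)
print([format(x, "04b") for x in path])   # ['0101', '0111', '0011']
```

The path length equals the number of differing bits, which is at most d = log n, matching the diameter from the previous slide.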
Shuffle-Exchange Network (2/5)
Perfect shuffle: the new position of an element can be calculated by performing a left cyclic rotation of the binary representation of its index.

Shuffle-Exchange Network (3/5)
Connections:
- exchange: links switches whose numbers differ in their least significant bit
- shuffle: links switches i and j, where j is the result of cycling the bits of i left one position
Shuffle-Exchange Network (4/5)
Properties (direct topology). Assuming n = 2^d processors:
- diameter: 2 log n - 1
- bisection width: n/log n
- edges/node: 2
- non-constant edge length

Shuffle-Exchange Network (5/5)
Message routing: the worst-case scenario is routing a message from switch 0 to switch n - 1 (or vice versa). From 0000 to 1111: E S E S E S E. From 0011 to 0101: E S E S S.
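The two link types can be written down in a few lines of Python (a sketch with names of my own; the bit width d is passed explicitly because the rotation depends on it), and the worst-case route quoted above can be replayed step by step:

```python
def shuffle(i, d):
    """Perfect shuffle: left cyclic rotation of i's d-bit representation."""
    mask = (1 << d) - 1
    return ((i << 1) | (i >> (d - 1))) & mask

def exchange(i):
    """Exchange link: toggle the least significant bit."""
    return i ^ 1

# Worst case on n = 16 switches (d = 4): 0000 -> 1111 takes the
# alternating sequence E S E S E S E, i.e. 2 log n - 1 = 7 hops.
node = 0b0000
for op in "ESESESE":
    node = exchange(node) if op == "E" else shuffle(node, 4)
print(format(node, "04b"))   # 1111
```

Each exchange sets the low bit to the next destination bit and each shuffle rotates it out of the way, so d exchanges and d - 1 shuffles suffice, which is where the 2 log n - 1 diameter comes from.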
Interconnection Networks
No network can be optimal in every regard.

Topology          #Processors  #Switches      Diameter       Bisection width  Edges/node  Constant edge len.
2-D mesh          n = d^2      n              2(n^(1/2)-1)   n^(1/2)          4           Yes
Binary tree       n = 2^d      2n - 1         2 log n        1                3           No
4-ary hypertree   n = 4^d      2n - n^(1/2)   log n          2 n^(1/2)        6           No
Butterfly         n = 2^d      n(log n + 1)   log n          n/2              4           No
Hypercube         n = 2^d      n              log n          n/2              log n       No
Shuffle-exchange  n = 2^d      n              2 log n - 1    n/log n          2           No

Processor Arrays (1/11)
Vector computer: a computer whose instruction set includes operations on vectors as well as scalars. There are two general ways of implementation:
- Pipelined vector processor: it streams vectors from memory to the CPU, where pipelined arithmetic units manipulate them. Early supercomputers (such as the Cray-1) are well-known examples.
- Processor array: a set of identical, synchronized processing elements capable of simultaneously performing the same operation on different data. Motivation: the high price of a control unit, and data parallelism.
Processor Arrays (2/11)
Architecture. (Figure: a front-end computer, with its memory and I/O processors, connects over a scalar memory bus, an instruction broadcast bus, and a global result bus to the processor array; each processing element has its own memory, and the elements communicate through an interconnection network attached to parallel I/O devices.)

Processor Arrays (3/11)
Performance: the amount of work accomplished per time unit; it depends on the utilization of the processors.
Example 2.1: a processor array with 1024 processors, each adding two integers in 1 microsecond. When adding two integer vectors of length 1024, assuming each vector is allocated to the processors in a balanced fashion:
Performance = 1024 operations / 1 microsecond = 1.024 x 10^9 operations/second
Processor Arrays (4/11)
Example 2.2: a processor array with 512 processors, each adding two integers in 1 microsecond. When adding two integer vectors of length 600, assuming each vector is allocated to the processors in a balanced fashion:
Performance = 600 operations / 2 microseconds = 3 x 10^8 operations/second
88 processors add 2 pairs of integers; the others add only one pair and sit idle while those 88 processors add their second integer pair.

Processor Arrays (5/11)
Interconnection network: it is used to bring together operands stored in the memories of different processors. The most popular interconnection network for processor arrays is the 2-D mesh. It has the advantage of a relatively straightforward implementation in VLSI, where a single chip may contain a large number of processors. (Figure: 4x4 and 8x12 mesh-connected processor arrays.)
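Examples 2.1 and 2.2 follow one formula: the whole array advances in lock step, so the elapsed time is the number of passes, ceil(length / processors), times the per-addition time. A small Python sketch (function name mine):

```python
import math

def array_performance(vector_len, num_procs, add_time_us):
    """Operations per second for a lock-step processor array."""
    passes = math.ceil(vector_len / num_procs)   # every pass costs full time
    return vector_len / (passes * add_time_us * 1e-6)

# Example 2.1: 1024 processors, vectors of length 1024, 1 us per add
print(array_performance(1024, 1024, 1.0))   # about 1.024e9 ops/sec

# Example 2.2: 512 processors, vectors of length 600 -> two passes,
# with only 600 - 512 = 88 processors busy in the second pass
print(array_performance(600, 512, 1.0))     # about 3.0e8 ops/sec
```

Note the penalty in Example 2.2: halving the processor count did not halve performance, because the second, mostly idle pass costs as much as the first.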
Processor Arrays (6/11)
Enabling and disabling processors: it is possible for only a subset of the processors to perform an instruction, by masking. This is useful when the number of data items is not an exact multiple of the size of the processor array, and for supporting conditionally executed parallel operations.

Processor Arrays (7/11)
Enabling and disabling processors, example (Fig. 2.12):
if (a[i] != 0) a[i] = 1; else a[i] = -1;
(Figure: shading indicates the processors that are masked out, i.e., inactive, during each branch.)
Processor Arrays (8/11)
Enabling and disabling processors: the efficiency of the processor array can drop rapidly when programs enter conditionally executed code. There is the additional overhead of performing the tests to set the mask bits, and there is the inefficiency caused by having to work through the different branches of control structures sequentially. In the previous example, the performance is less than 50% of the performance obtained when operating across the entire processor array, once the additional overhead is considered.

Processor Arrays (9/11)
Additional architecture features. (Figure: the processor-array architecture again, highlighting the scalar memory bus and the global result bus between the front-end computer and the array.)
Processor Arrays (10/11)
Memory bus: it allows particular elements of parallel variables to be used or defined in sequential code. In this way, the processor array can be viewed as an extension of the memory space of the front-end.
Global result bus: it enables values from the processor array to be combined and returned to the front end. The ability to compute a global reduction (such as a global and) is valuable.

Processor Arrays (11/11)
Shortcomings:
1. Not all problems map well into a strict data-parallel solution.
2. The efficiency drops when entering conditionally executed parallel code.
3. They do not easily accommodate multiple users.
4. They do not scale down well, due to the cost of high-bandwidth communication networks.
5. They are built using custom VLSI, thus losing the cost-effectiveness of commodity CPUs.
6. The original motivation, the relatively high cost of control units, is no longer valid for today's CPUs.
Processor arrays are no longer considered a viable option for general-purpose parallel computers.
Multiprocessors
A multiprocessor is a multiple-CPU computer with shared memory: the same address on two different CPUs refers to the same memory location. Compared with processor arrays, multiprocessors can be built out of commodity CPUs, they naturally support multiple users, and they do not lose efficiency when encountering conditionally executed parallel code.

We discuss two fundamental types of multiprocessors:
- centralized multiprocessors, in which all the primary memory is in one place
- distributed multiprocessors, in which the primary memory is distributed among the processors
Centralized Multiprocessors (1/5)
A centralized multiprocessor is a straightforward extension of the uniprocessor. It is also called a uniform memory access (UMA) multiprocessor or a symmetric multiprocessor (SMP). The presence of large and efficient caches is what makes such multiprocessors practical. Still, memory bus bandwidth typically limits the number of processors that can be employed to a few dozen. (Figure: several CPUs, each with its own cache, share a bus to the primary memory and I/O devices.)

Centralized Multiprocessors (2/5)
Data may be private (used only by a single processor) or shared (used by multiple processors). Designers of centralized multiprocessors must address two problems associated with shared data: the cache coherence problem and processor synchronization.
Centralized Multiprocessors (3/5)
The cache coherence problem. (Figure: CPUs A and B both cache the value X = 7; after one CPU updates X to 2, the other CPU's cached copy is stale.)

Centralized Multiprocessors (4/5)
Cache coherence problem: snooping protocols are typically used to maintain cache coherence on centralized multiprocessors. Each CPU's cache controller monitors the bus to identify which cache blocks are being requested by other CPUs. Before a write occurs, all copies of the data item cached by other processors are invalidated. If two processors simultaneously try to write to the same memory location, only one of them wins the race.
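The write-invalidate behavior described above can be mimicked with a toy simulation. This is a deliberately simplified sketch of my own (dictionaries as caches, write-through to memory, no block granularity or real bus arbitration), just to show the invalidation step:

```python
class Bus:
    """Toy write-invalidate protocol: every cache 'snoops' bus writes."""
    def __init__(self, memory, n_cpus):
        self.memory = memory
        self.caches = [dict() for _ in range(n_cpus)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                 # miss: fetch from memory
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, cpu, addr, value):
        for i, cache in enumerate(self.caches):
            if i != cpu:
                cache.pop(addr, None)         # snoop: invalidate other copies
        self.caches[cpu][addr] = value
        self.memory[addr] = value             # write-through, for simplicity

bus = Bus({"X": 7}, n_cpus=2)
print(bus.read(0, "X"), bus.read(1, "X"))     # both CPUs cache X = 7
bus.write(1, "X", 2)                          # CPU 1 writes; CPU 0's copy goes
print("X" in bus.caches[0], bus.read(0, "X"))
```

This reproduces the X = 7 / X = 2 scenario from the figure: after CPU 1's write, CPU 0's stale copy is gone and its next read fetches the new value.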
Centralized Multiprocessors (5/5)
Processor synchronization:
- mutual exclusion: a situation in which at most one process can be engaged in a specified activity at any time
- barrier synchronization: guarantees that no process will proceed beyond a designated point in the program until every process has reached the barrier

Distributed Multiprocessors (1/10)
Architecture. (Figure: in contrast to the centralized design, each CPU-and-cache pair now has its own local memory and I/O devices, and the nodes are joined by an interconnection network.)
Distributed Multiprocessors (2/10)
Rationale and advantages: because of spatial and temporal locality, most memory references are between a processor and its local memory. This gives higher aggregate memory bandwidth and lower memory access time, which in turn allows a higher processor count. Distributing I/O, too, can improve scalability.

Distributed Multiprocessors (3/10)
The same address on different processors refers to the same memory location, but memory access time varies considerably, depending upon whether the address being referenced is in that processor's local memory. Thus, such a machine is also called a nonuniform memory access (NUMA) multiprocessor.
Distributed Multiprocessors (4/10)
Cache coherence.
Alternative 1: store only instructions and private data in a processor's cache. This gives poor performance, due to the huge time difference between a local cache access and a nonlocal memory access.
Alternative 2: snooping. Snooping methods do not scale well as the number of processors grows, because a cache controller cannot simply snoop on a single shared memory bus, and a more complicated protocol is needed.

Distributed Multiprocessors (5/10)
Cache coherence, alternative 3: a directory-based protocol, in which a single directory contains sharing information about every memory block that may be cached. The status of a memory block is one of:
- uncached: not currently in any processor's cache
- shared: cached by one or more processors, and the copy in memory is correct
- exclusive: cached by exactly one processor that has written the block, so that the copy in memory is obsolete
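The three block states can be captured in a minimal state machine. This Python sketch is my own simplification (one directory entry, no actual data movement or write-backs modeled, only the state and sharer bookkeeping the slides describe):

```python
class DirectoryEntry:
    """One memory block: state 'U' (uncached), 'S' (shared), 'E' (exclusive),
    plus the set of CPUs holding a copy, so writes can invalidate them."""
    def __init__(self):
        self.state, self.sharers = "U", set()

    def read(self, cpu):
        if self.state == "E":
            self.state = "S"      # owner's dirty copy is written back first
        elif self.state == "U":
            self.state = "S"
        self.sharers.add(cpu)

    def write(self, cpu):
        stale = self.sharers - {cpu}          # copies that must be invalidated
        self.state, self.sharers = "E", {cpu}
        return stale

entry = DirectoryEntry()
entry.read(0); entry.read(2)
print(entry.state, sorted(entry.sharers))   # shared by CPUs 0 and 2
print(sorted(entry.write(2)))               # CPU 0's copy gets invalidated
print(entry.state, entry.sharers)           # exclusive to CPU 2
```

Distributing such entries among the local memories, with each block's entry living in exactly one place, is what the next slide argues keeps the directory from becoming a bottleneck.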
Distributed Multiprocessors (6/10)
In addition to the block status, we also need to keep track of which processors have copies of each cache block, so that these copies can be invalidated when one processor writes a value to that block. To prevent accesses to the cache directory from becoming a performance bottleneck, the directory itself should be distributed among the computer's local memories; the information about a particular memory block is in exactly one location.

Distributed Multiprocessors (7/10)
(Figure: a walkthrough of the directory protocol for block X. Reads move the entry from uncached, U000, to shared, S100 and then S101, as CPUs fetch copies of X = 7; a write then invalidates the other cached copies and moves the entry to exclusive, E100, leaving the memory copy of X out of date.)
Distributed Multiprocessors (8/10)
(Figure, continued: a read of an exclusive block forces the dirty value back to memory and moves the entry from E100 to shared, S110; a subsequent write by another CPU transfers exclusive ownership, e.g. from E100 to E001.)

Distributed Multiprocessors (9/10)
(Figure, continued: when the owning CPU evicts the block, the dirty value is flushed back to memory and the entry returns from E100 to uncached, U000.)
Multicomputers
A multicomputer is another example of a distributed-memory, multiple-CPU computer. Unlike a NUMA multiprocessor, a multicomputer has disjoint local address spaces: the same address on different processors refers to different physical memory locations, and each processor has direct access only to its local memory. Processors interact with each other by passing messages, so there is no cache coherence problem.

Commercial multicomputers vs. commodity multicomputers: custom vs. mass-produced computers; low-latency vs. high-latency interconnects; expensive vs. cheap.
Multicomputer Designs
(Figure: in an asymmetrical multicomputer, users reach a front-end computer over the Internet, and the front-end drives the back-end nodes through an interconnection network; in a symmetrical multicomputer, every node sits directly on the interconnection network alongside a file server.)

Asymmetrical Multicomputers: Advantages
Back-end processors are used exclusively for executing parallel programs, and they may run a primitive OS; it is easier for the manufacturer to develop such a primitive OS. Without other processes occupying cycles or sending messages, it is also easier to understand, model, and tune the performance of a parallel application.
Asymmetrical Multicomputers: Disadvantages
Users log into the front-end computer, which executes a full, multiprogrammed OS and provides all functions needed for program development. This makes the front-end a single point of failure and limits scalability. Multiple front-end computers raise their own questions: How do users know which front-end computer to log into? How will the workload be balanced? How are back-end nodes assigned to front-end processors? Should the front end itself become a centralized multicomputer? Underutilization might be frustrating.

Asymmetrical Multicomputers: Disadvantages (continued)
Program debugging: because the back-end nodes do not support I/O operations, they must send a message to the front-end computer to print output for users. There is also the requirement of developing two distinct programs: a front-end program for interacting with users and the file system, transmitting data to the back-end processors, and forwarding results to the outside world; and a back-end program for the computationally intensive portion.
Symmetrical Multicomputers
The difficulty of debugging parallel programs is a strong incentive to provide full-featured I/O facilities on the back-end nodes. A straightforward way is to run a multiprogrammed OS on the back-end processors, too. In a symmetrical multicomputer, every computer executes the same OS and has identical functionality. Users may log into any computer to edit and compile their programs, and any or all of the computers may be involved in the execution of a particular parallel program.

Symmetrical Multicomputers: Advantages over Asymmetrical Multicomputers
They alleviate the performance bottleneck caused by the front-end computer. Support for debugging is better, since every computer runs a full-fledged OS. They also eliminate the front-end/back-end programming problem: every processor executes the same program, and an if statement can be used to select the subset of processors that performs each part.
Symmetrical Multicomputers: Disadvantages
It is more difficult to maintain the illusion of a single parallel computer. There is no simple way to balance the program-development workload among all the processors. It is also more difficult to achieve high performance from parallel programs when their processes must compete with other processes for cycles, cache space, and memory bandwidth.

A Mixed Model
The ParPar cluster at the Hebrew University of Jerusalem. (Figure: users reach a front-end computer over the Internet; the front-end, a file server, and the multicomputer nodes share switched Ethernet, while the nodes are also connected by a Myrinet switch for parallel computation.)
Commodity Clusters vs. Networks of Workstations
A network of workstations is a dispersed collection of computers, typically located on users' desks, each serving the needs of the person using it. Individual workstations may have different OSes and executable programs. A commodity cluster, by contrast, is a co-located collection of mass-produced computers. The computers are usually accessible only via the network, and some of them may not allow users to log in. The networking medium should have high speed:

Network            Latency    Bandwidth       Cost/node
Fast Ethernet      100 usec   100 Mbit/sec    < $100
Gigabit Ethernet   100 usec   1,000 Mbit/sec  < $1,000
Myrinet            7 usec     1,920 Mbit/sec  < $2,000

Flynn's Taxonomy
Flynn classifies architectures by the number of instruction streams and data streams:
- SISD (single instruction stream, single data stream): uniprocessors
- SIMD (single instruction stream, multiple data streams): processor arrays, pipelined vector processors
- MISD (multiple instruction streams, single data stream): systolic arrays
- MIMD (multiple instruction streams, multiple data streams): multiprocessors, multicomputers
Flynn's Taxonomy
A systolic array is an example of an MISD computer. (Figure: a primitive sorting element with inputs a, b, and c emits min(a, b, c) in the first phase and med(a, b, c) and max(a, b, c) in the second phase.)

Systolic Array
(Figure: the host inserts the values 7 and 4 into the array; each cell keeps the smaller value and passes the larger one deeper into the array.)
Systolic Array
(Figure: extracting the minimum. The host removes the smallest value from the first cell, and the remaining values, here 4 and 8, shift one cell toward the host.)

Systolic Array
(Figure: a second extract-minimum step, with 5 and 7 shifting toward the host.)
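The net effect of the insert and extract-minimum figures can be sketched in Python. This is my own collapsed model: the hardware ripples values between neighboring cells over several clock phases, but the end result is that each cell keeps the smaller of its value and the incoming one and passes the larger along, so the chain stays sorted with the minimum nearest the host.

```python
def insert(cells, v):
    """Host pushes v in; each cell keeps the smaller value, forwards the larger."""
    for i in range(len(cells)):
        if v < cells[i]:
            cells[i], v = v, cells[i]   # v ripples past this cell
    cells.append(v)                     # the largest ends up deepest

def extract_min(cells):
    """Host reads the first cell; remaining values shift toward the host."""
    return cells.pop(0)

cells = []
for v in [7, 4, 8, 5]:                  # the insertions from the figures
    insert(cells, v)
print([extract_min(cells) for _ in range(4)])   # [4, 5, 7, 8]
```

The point of the MISD classification is that every value streams through the same chain of cells while each cell applies its own fixed comparison, so sorting n values costs O(n) insert and extract steps on the host side.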
Summary: Processor Arrays
1. Not all problems map well into a strict data-parallel solution.
2. The efficiency drops when entering conditionally executed parallel code.
3. They do not easily accommodate multiple users.
4. They do not scale down well, due to the cost of high-bandwidth communication networks.
5. They are built using custom VLSI, thus losing the cost-effectiveness of commodity CPUs.
6. The original motivation, the relatively high cost of control units, is no longer valid for today's CPUs.
Processor arrays are no longer considered a viable option for general-purpose parallel computers.

Summary: Centralized Multiprocessors
Cache coherence problem: handled by a snooping, write-invalidation protocol. Synchronization: mutual exclusion and barriers, relying upon hardware instructions that have the net effect of atomically reading and updating a memory location. Only a small number of CPUs can be supported, limited by the shared memory bus.
Summary: Distributed Multiprocessors
A single global address space; cache coherence is more difficult, requiring a directory-based scheme.

Summary: Multicomputers
Multiple disjoint address spaces, and hence no cache coherence problem: whether a copy of a data item is up to date depends entirely upon the programmer. Multicomputers may be symmetrical (every node on the interconnection network, alongside a file server) or asymmetrical (a front-end computer mediates between users and the back-end nodes).
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationChapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST
Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationCMPE 511 TERM PAPER. Distributed Shared Memory Architecture. Seda Demirağ
CMPE 511 TERM PAPER Distributed Shared Memory Architecture by Seda Demirağ 2005701688 1. INTRODUCTION: Despite the advances in processor design, users still demand more and more performance. Eventually,
More informationAdvanced Parallel Architecture. Annalisa Massini /2017
Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More informationComputer Organization. Chapter 16
William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11
More informationLecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Flynn Categories SISD (Single Instruction Single
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 14 Parallel Processing Pendidikan Teknik Elektronika FT UNY Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple
More informationParallel Systems Prof. James L. Frankel Harvard University. Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved.
Parallel Systems Prof. James L. Frankel Harvard University Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Architectures SISD (Single Instruction, Single Data)
More informationNormal computer 1 CPU & 1 memory The problem of Von Neumann Bottleneck: Slow processing because the CPU faster than memory
Parallel Machine 1 CPU Usage Normal computer 1 CPU & 1 memory The problem of Von Neumann Bottleneck: Slow processing because the CPU faster than memory Solution Use multiple CPUs or multiple ALUs For simultaneous
More informationComputer Architecture: Parallel Processing Basics. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Parallel Processing Basics Prof. Onur Mutlu Carnegie Mellon University Readings Required Hill, Jouppi, Sohi, Multiprocessors and Multicomputers, pp. 551-560 in Readings in Computer
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationDr. Joe Zhang PDC-3: Parallel Platforms
CSC630/CSC730: arallel & Distributed Computing arallel Computing latforms Chapter 2 (2.3) 1 Content Communication models of Logical organization (a programmer s view) Control structure Communication model
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationModule 5 Introduction to Parallel Processing Systems
Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More information06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1
Credits:4 1 Understand the Distributed Systems and the challenges involved in Design of the Distributed Systems. Understand how communication is created and synchronized in Distributed systems Design and
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationChapter 18. Parallel Processing. Yonsei University
Chapter 18 Parallel Processing Contents Multiple Processor Organizations Symmetric Multiprocessors Cache Coherence and the MESI Protocol Clusters Nonuniform Memory Access Vector Computation 18-2 Types
More informationComp. Org II, Spring
Lecture 11 Parallel Processor Architectures Flynn s taxonomy from 1972 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing (Sta09 Fig 17.1) 2 Parallel
More information4. Networks. in parallel computers. Advances in Computer Architecture
4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationComp. Org II, Spring
Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationCSCI 4717 Computer Architecture
CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationMultiprocessors 1. Outline
Multiprocessors 1 Outline Multiprocessing Coherence Write Consistency Snooping Building Blocks Snooping protocols and examples Coherence traffic and performance on MP Directory-based protocols and examples
More informationIntroduction to parallel computing
Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationCS/COE1541: Intro. to Computer Architecture
CS/COE1541: Intro. to Computer Architecture Multiprocessors Sangyeun Cho Computer Science Department Tilera TILE64 IBM BlueGene/L nvidia GPGPU Intel Core 2 Duo 2 Why multiprocessors? For improved latency
More informationChapter 17 - Parallel Processing
Chapter 17 - Parallel Processing Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 17 - Parallel Processing 1 / 71 Table of Contents I 1 Motivation 2 Parallel Processing Categories
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationARCHITECTURAL CLASSIFICATION. Mariam A. Salih
ARCHITECTURAL CLASSIFICATION Mariam A. Salih Basic types of architectural classification FLYNN S TAXONOMY OF COMPUTER ARCHITECTURE FENG S CLASSIFICATION Handler Classification Other types of architectural
More informationTop500 Supercomputer list
Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More information18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013
18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 Readings: Multiprocessing Required Amdahl, Validity of the single processor
More informationCommunication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.
Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationMULTIPROCESSORS AND THREAD LEVEL PARALLELISM
UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared
More informationLecture 17: Multiprocessors. Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections )
Lecture 17: Multiprocessors Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections 4.1-4.2) 1 Taxonomy SISD: single instruction and single data stream: uniprocessor
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationAleksandar Milenkovich 1
Parallel Computers Lecture 8: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection
More informationOutline. Distributed Shared Memory. Shared Memory. ECE574 Cluster Computing. Dichotomy of Parallel Computing Platforms (Continued)
Cluster Computing Dichotomy of Parallel Computing Platforms (Continued) Lecturer: Dr Yifeng Zhu Class Review Interconnections Crossbar» Example: myrinet Multistage» Example: Omega network Outline Flynn
More informationParallel Computers. c R. Leduc
Parallel Computers Material based on B. Wilkinson et al., PARALLEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers c 2002-2004 R. Leduc Why Parallel Computing?
More informationCS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011
CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
More informationIssues in Multiprocessors
Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationA Multiprocessor system generally means that more than one instruction stream is being executed in parallel.
Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More information