CPS 303 High Performance Computing. Wensheng Shen Department of Computational Science SUNY Brockport
1 CPS 303 High Performance Computing Wensheng Shen Department of Computational Science SUNY Brockport
2 Chapter 2: Architecture of Parallel Computers Hardware Software
3 2.1.1 Flynn's taxonomy Single-instruction single-data (SISD) Single-instruction multiple-data (SIMD) Multiple-instruction single-data (MISD) Multiple-instruction multiple-data (MIMD) Michael Flynn classified systems according to the number of instruction streams and the number of data streams.
4 Instruction streams and data streams Data stream: a sequence of digitally encoded signals (data packets) used to transmit or receive information. Instruction stream: a sequence of instructions.
5 Instruction set architecture Stored-program computer: memory stores programs as well as data, so programs must travel from memory to the CPU, where they are executed. Programs consist of many instructions, the 0's and 1's that tell the CPU what to do. The format and semantics of the instructions are defined by the ISA (instruction set architecture). The instructions reside in memory because a CPU can hold very little storage on chip: the more memory the CPU has, the slower it runs. Thus, memory and CPU are separate chips.
6 2.1.2 SISD --- the classic von Neumann machine [Figure: control unit, arithmetic logic unit, memory, input and output devices, and external storage; a single processor P draws instructions (Load X; Load Y; Add Z, X, Y; Store Z) from the instruction pool and operates on the data pool.] A single processor executes a single instruction stream to operate on data stored in a single memory. During any CPU cycle, only one data stream is used. The performance of a von Neumann machine can be improved by caching.
7 Steps to run a single instruction IF (instruction fetch): the instruction is fetched from memory; its address is taken from the program counter (PC), and the instruction is copied from memory to the instruction register (IR). ID (instruction decode): decode the instruction and fetch operands. EX (execute): perform the operation, done by the ALU (arithmetic logic unit). MEM (memory access): happens normally during load and store instructions. WB (write back): write the result of the operation in the EX step to a register in the register file. PC (update program counter): update the value in the program counter, normally PC <- PC + 4.
8 [Pipeline diagram: three instructions, each running IF ID EX MEM WB strictly one after another] Subscalar CPUs: since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result, the subscalar CPU gets "hung up" on instructions that take more than one clock cycle to complete. This process is inherently inefficient: it takes 15 cycles to complete three instructions.
9 2.1.3 Pipeline and vector architecture [Pipeline diagram: five overlapping instructions, each one stage (IF ID EX MEM WB) behind the previous] Scalar CPUs: with this 5-stage pipeline, the CPU can achieve at best a performance of one instruction per CPU clock cycle.
10 [Pipeline diagram: ten instructions, fetched two at a time, each pair one stage (IF ID EX MEM WB) behind the previous] Superscalar CPUs: in the simple superscalar pipeline, two instructions are fetched and dispatched at the same time, so a maximum of two instructions per CPU clock cycle can be achieved.
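The cycle counts implied by the three diagrams above can be captured in a few one-line formulas. A minimal sketch in C (the function names are ours, and the pipelines are assumed ideal, with no stalls or hazards):

```c
/* Cycles to finish n instructions on a subscalar CPU: each
   instruction occupies all `stages` cycles before the next starts. */
long subscalar_cycles(long n, long stages) { return n * stages; }

/* Ideal scalar pipeline: once the pipeline fills, one instruction
   completes per cycle. */
long pipelined_cycles(long n, long stages) { return stages + (n - 1); }

/* Ideal superscalar pipeline issuing `width` instructions per cycle. */
long superscalar_cycles(long n, long stages, long width) {
    return stages + (n + width - 1) / width - 1;
}
```

For the three-instruction, 5-stage example in the slides, subscalar_cycles(3, 5) gives the 15 cycles quoted above, while pipelined_cycles(3, 5) gives 7.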
11 Example float x[100], y[100], z[100]; for (i=0; i<100; i++) z[i] = x[i] + y[i]; The floating-point addition is broken into stages: fetch the operands from memory; compare exponents; shift one operand; add; normalize the result; store the result in memory. The functional units are arranged in a pipeline: the output of one functional unit is the input to the next. Say x[0] and y[0] are being added; then one of x[1] and y[1] can be shifted, the exponents of x[2] and y[2] can be compared, and x[3] and y[3] can be fetched. With pipelining, we can produce results up to six times faster than without it.
12 The pipeline schedule for the vector addition:

clock  fetch   comp    shift   add   norm   store
1      x0,y0
2      x1,y1   x0,y0
3      x2,y2   x1,y1   x0,y0
...
13 Fortran 77: do i=1, 100 z(i) = x(i) + y(i) enddo Fortran 90: z(1:100) = x(1:100) + y(1:100) By adding vector instructions to the basic machine instruction set, we can further improve performance. Without vector instructions, each of the basic instructions has to be issued 100 times; with vector instructions, each has to be issued only once. Using multiple memory banks: operations that access main memory (fetch and store) are several times slower than CPU-only operations (add). For example, suppose we can execute a CPU operation once every CPU cycle, but a memory access only once every four cycles. If we use four memory banks and distribute the data so that z[i] resides in memory bank i mod 4, we can execute one store operation per cycle.
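The interleaved bank assignment described above is just a modulus. A small sketch (the helper name is ours):

```c
#define NBANKS 4

/* With interleaved allocation, element z[i] lives in bank i mod NBANKS,
   so consecutive stores visit the banks round-robin and each bank gets
   NBANKS cycles to recover before it is addressed again -- exactly the
   four-cycle access time assumed in the example. */
int bank_of(int i) { return i % NBANKS; }
```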
14 2.1.4 SIMD [Figure: an instruction pool broadcasting to processors P over a data pool: Load X[1], Load Y[1]; Load X[2], Load Y[2]; ... Load X[n], Load Y[n]; Add Z[1], X[1], Y[1]; Add Z[2], X[2], Y[2]; ... Add Z[n], X[n], Y[n]; Store Z[1]; Store Z[2]; ... Store Z[n]] A type of parallel computer. Single instruction: all processor units execute the same instruction at any given clock cycle. Multiple data: each processing unit can operate on a different data element. It typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity execution units. Best suited to specialized problems characterized by a high degree of regularity, e.g., image processing.
15 A single CPU to control and a large collection of subordinate ALUs, each with its own memory. During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors, and each of the subordinate processors either executes the instruction or is idle.
16 for (i=0; i<100; i++) if (y[i] != 0.0) z[i] = x[i]/y[i]; else z[i] = x[i]; Time step 1: test local_y != 0. Time step 2: if local_y != 0, z[i] = x[i]/y[i]; if local_y == 0, idle. Time step 3: if local_y != 0, idle; if local_y == 0, z[i] = x[i]. Disadvantage: in a program with many conditional branches, or long segments of code whose execution depends on conditionals, it is likely that many processes will remain idle for long periods.
17 2.1.5 MISD A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via independent instruction streams. Very few actual machines: CMU's C.mmp computer (1971). [Figure: three processors P share the data pool; each runs its own stream on the same X[1]: Load X[1], Mul Y[1], A, X[1], Add Z[1], X[1], Y[1], Store Z[1]; Load X[1], Mul Y[2], B, X[1], Add Z[2], X[1], Y[2], Store Z[2]; Load X[1], Mul Y[3], C, X[1], Add Z[3], X[1], Y[3], Store Z[3]]
18 2.1.6 MIMD Multiple instruction streams: every processor may execute a different instruction stream. Multiple data streams: every processor may work with a different data stream. Execution can be synchronous or asynchronous, deterministic or nondeterministic. Examples: most current supercomputers, grids, networked parallel computers, multiprocessor SMP computers. Each processor has both a control unit and an ALU, and is capable of executing its own program at its own pace. [Figure: three processors P, each with its own instruction stream over the data pool: Load X[1], Load Y[1], Add Z[1], X[1], Y[1], Store Z[1]; Load A, Mul Y, A, 10, Sub B, Y, A, Store B; Load X[1], Load C[2], Add Z[1], X[1], C[2], Sub B, Z[1], X[1]]
19 2.1.7 shared-memory MIMD Bus-based architecture Switch-based architecture Cache coherence
20 [Figure: CPUs connected through an interconnection network to memories] Generic shared-memory architecture. Shared-memory systems are sometimes called multiprocessors.
21 Bus-based architecture [Figure: CPUs, each with its own cache, attached to a common bus shared with the memories] The interconnect network is bus-based. The bus becomes saturated if multiple processors simultaneously attempt to access memory, so each processor is given access to a fairly large cache. These architectures do not scale well to large numbers of processors because of the limited bandwidth of a bus.
22 Switch-based architecture [Figure: memories along the top, CPUs along the left, crossbar switches at the intersections] The interconnect network is switch-based. A crossbar can be visualized as a rectangular mesh of wires with switches at the points of intersection and terminals on its left and top edges. The switches can either allow a signal to pass through in both the vertical and horizontal directions simultaneously, or redirect a signal from vertical to horizontal or vice versa. Any processor can access any memory module, and distinct processors can simultaneously access distinct memory modules.
23 The crossbar switch-based architecture is very expensive: a total of mn hardware switches are needed for an m x n crossbar. The crossbar system is a NUMA (nonuniform memory access) system, because when a processor accesses memory attached to another crossbar, the access time is greater.
24 Cache coherence The caching of shared variables must ensure cache coherence. Basic idea: each processor has a cache controller that monitors the bus traffic. When a processor updates a shared variable, it also updates the corresponding main memory location. The cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid. This snooping approach does not carry over to shared-memory machines that are not bus-based.
25 2.1.8 Distributed-memory MIMD [Figure: CPU/memory pairs attached to an interconnection network] In a distributed-memory system, each processor has its own private memory.
26 A static network (mesh) vs. a dynamic network (crossbar). A node is a vertex corresponding to a processor/memory pair. In a static network, all vertices are nodes; in a dynamic network, some vertices are nodes and the others are switches.
27 Fully connected interconnection network The ideal interconnection network is a fully connected network, in which each node is directly connected to every other node. With a fully connected network, each node can communicate directly with every other node, and communication involves no delay. The cost is too high to be practical. Question: how many connections are needed for a 10-processor machine?
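One way to check the answer to the question: a fully connected network needs one link per pair of nodes, i.e. p(p-1)/2 links. A one-line sketch (the function name is ours):

```c
/* Links in a fully connected network of p nodes: one per pair. */
long fully_connected_links(long p) { return p * (p - 1) / 2; }
```

For the 10-processor machine in the question, fully_connected_links(10) evaluates to 45 connections, which is why this topology is impractical at scale.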
28 Crossbar interconnection network Question: for a machine with p processors, how many switches do we need?
29 Multistage switching network For a machine of p nodes, an omega network will use (p/2) log2(p) switches. [Figure: an omega network]
30 Static interconnection networks [Figure: a linear array; a ring] For a system of p processors, a linear array needs p-1 wires and a ring needs p wires. They scale well, but the communication cost is high: in a linear array, two communicating processors may have to forward the message along as many as p-1 wires, and in a ring along as many as p/2 wires.
31 [Figure: hypercubes of dimension 1, 2, and 3] For a hypercube network of dimension d, the number of processors is p = 2^d. The maximum number of wires a message must be forwarded along is d = log2(p). This is much better than the linear array or ring, but it does not scale well: each time we wish to increase the machine size, we must double the number of nodes and add a new wire to each node.
32 [Figure: a two-dimensional mesh; a three-dimensional mesh] If a mesh has dimensions d_1 x d_2 x ... x d_n, then the maximum number of wires a message will have to traverse is sum_{i=1..n} (d_i - 1). If the mesh is square, d_1 = d_2 = ... = d_n = p^{1/n}, and the maximum is n(p^{1/n} - 1). A mesh becomes a torus if wraparound wires are added; for a torus the maximum is (1/2) n p^{1/n}. Meshes and tori scale better than hypercubes: to increase the size of a q x q mesh, we simply add a q x 1 mesh and q wires; to increase the size of a square n-dimensional mesh or torus, we add p^{(n-1)/n} nodes.
33 Characteristics of static networks Diameter: the diameter of a network is the maximum distance between any two nodes in the network. Arc connectivity: The arc connectivity of a network is the minimum number of arcs that must be removed from the network to break it into two disconnected networks Bisection width: The bisection width is the minimum number of communication links that must be removed to partition the network into two equal halves Number of links: the number of links is the total number of links in the network.
34 Characteristics of static networks

Network          Diameter            Bisection width  Arc connectivity  Number of links
Fully connected  1                   p^2/4            p-1               p(p-1)/2
Star             2                   1                1                 p-1
Linear array     p-1                 1                1                 p-1
Ring (p>2)       floor(p/2)          2                2                 p
Hypercube        log2(p)             p/2              log2(p)           (p log2(p))/2
2D mesh          2(sqrt(p)-1)        sqrt(p)          2                 2(p-sqrt(p))
2D torus         2 floor(sqrt(p)/2)  2 sqrt(p)        4                 2p
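A few rows of the table above can be spot-checked with one-line formulas. A sketch in C (the function names are ours; the hypercube functions take the dimension d, with p = 2^d):

```c
/* Metrics for a hypercube with p = 2^d nodes. */
long hypercube_diameter(long d)  { return d; }                 /* log2(p) */
long hypercube_bisection(long d) { return 1L << (d - 1); }     /* p/2 */
long hypercube_links(long d)     { return d * (1L << d) / 2; } /* (p log2(p))/2 */

/* Metrics for a ring of p > 2 nodes. */
long ring_diameter(long p) { return p / 2; }
long ring_links(long p)    { return p; }
```

For the 3-dimensional hypercube (p = 8), these give diameter 3, bisection width 4, and 12 links, matching the cube in the earlier figure.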
35 2.1.9 communication and routing If two nodes are not directly connected or if a processor is not directly connected to memory module, how is data transmitted between the two? If there are multiple routes joining the two nodes or processor and memory, how is the route decided on? Is the route chosen always shortest?
36 Store-and-forward routing [Figure: over time, the four packets w, x, y, z of a message travel from node A to node B; only after node B has received and stored the entire message does it begin sending the packets on to node C.] Store-and-forward routing: A sends a message to C through B; B reads the entire message and then sends it to node C. This takes more time and more memory.
37 Cut-through routing [Figure: over time, the packets w, x, y, z stream from node A through node B to node C; node B holds only one packet at a time.] Cut-through routing: A sends a message through B to C; B immediately forwards each identifiable piece, or packet, of the message to C.
38 Communication unit A message is a contiguous group of bits that is transferred from a source terminal to a destination terminal. A packet is the basic unit of a message; its size is on the order of hundreds to thousands of bytes, and it consists of header flits and data flits.
39 Flit: a flit is the smallest unit of information at the link layer; its size is a few words. Phit: a phit is the smallest physical unit of information at the physical layer, transferred across one physical link in one cycle.
40 Communication cost Startup time: the time required to handle a message at the sending and receiving nodes, which includes (1) preparing the message (adding header, trailer, and error-correction information), (2) executing the routing algorithm, and (3) establishing an interface between the local node and the router. Note: this latency is incurred only once for a single message transfer. Per-hop time: the time taken by the header of a message to travel between two directly connected nodes in the network. Note: the per-hop time is also called node latency. Per-word transfer time: the time taken for one word to traverse one link; it is the reciprocal of the channel bandwidth.
41 When a message traverses a path with multiple links, each intermediate node on the path forwards the message to the next node after it has received and stored the entire message. The total communication cost for a message of size m words to traverse a path of l links is t_comm = t_s + (m t_w + t_h) l
42 Example: communication time for a linear array. (1) Store-and-forward routing: t_comm = t_s + m l t_w, since in modern parallel computers the per-hop time is very small compared to the per-word time. (2) Cut-through routing: t_comm = t_s + l t_h + m t_w; the product of message size and number of links no longer appears.
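The two cost models can be compared directly in code. A sketch (the function names are ours; times are in arbitrary units):

```c
/* Store-and-forward: each of the l links retransmits the whole
   m-word message, so the per-link cost m*t_w + t_h is paid l times. */
double t_store_forward(double ts, double th, double tw, long m, long l) {
    return ts + (m * tw + th) * l;
}

/* Cut-through: only the header pays t_h on every hop; the m words
   cross the path just once. */
double t_cut_through(double ts, double th, double tw, long m, long l) {
    return ts + l * th + m * tw;
}
```

For instance, with t_s = 100, t_h = 1, t_w = 2, a message of m = 1000 words over l = 10 links costs 20110 units store-and-forward but only 2110 units cut-through, because the m*t_w term is no longer multiplied by l.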
43 2.2 software issues A program is parallel if, at any time during its execution, it comprises more than one process. We will see how the processes can be specified, created, and destroyed.
44 2.2.1 Shared memory programming Private and shared variables:

    int private_x;
    shared int sum = 0;
    ...
    sum = sum + private_x;

The statement sum = sum + private_x compiles into several machine steps: fetch sum into register A; fetch private_x into register B; add the contents of register B to register A; store the contents of register A in sum. A possible interleaving (process 0 has private_x = 2, process 1 has private_x = 3):

Time  Process 0              Process 1
0     fetch sum = 0          finish computing private_x
1     fetch private_x = 2    fetch sum = 0
2     add 2 + 0              fetch private_x = 3
3     store sum = 2          add 3 + 0
4                            store sum = 3

The final value is 3 rather than the intended 5: process 1 fetched sum before process 0 stored it, so process 0's update is lost.
45 Mutual exclusion, critical section, binary semaphore, barrier

    shared int s = 1;
    while (!s);              /* busy-wait until s == 1 */
    s = 0;
    sum = sum + private_x;   /* critical section */
    s = 1;

    void P(int *s /* in/out */);
    void V(int *s /* out */);

    P(int *s) { while (!*s); *s = 0; }
    V(int *s) { *s = 1; }

Problem: the test and set of s are not atomic. Between one process reading s == 1 and setting s = 0, another process may also read s == 1, and both enter the critical section.

    int private_x;
    shared int sum = 0;
    shared int s = 1;
    /* compute private_x */
    P(&s);
    sum = sum + private_x;
    V(&s);
    Barrier();
    if (I'm process 0)
        printf("sum = %d\n", sum);
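The busy-wait P/V above fails precisely because its test and set are two separate steps. On a real shared-memory system, an atomic lock primitive such as a POSIX mutex plays the role of the binary semaphore. A minimal sketch (the worker/parallel_sum names and the two-value example are ours, not from the slides; parallel_sum assumes p <= 64):

```c
#include <pthread.h>

static int sum = 0;                        /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int private_x = *(int *)arg;           /* each thread's private value */
    pthread_mutex_lock(&lock);             /* atomic "P": enter critical section */
    sum = sum + private_x;
    pthread_mutex_unlock(&lock);           /* atomic "V": leave critical section */
    return NULL;
}

/* Add the p values in x[] to sum, one thread per value. */
int parallel_sum(int *x, int p) {
    pthread_t t[64];
    for (int i = 0; i < p; i++) pthread_create(&t[i], NULL, worker, &x[i]);
    for (int i = 0; i < p; i++) pthread_join(t[i], NULL);
    return sum;
}
```

Because lock/unlock is atomic, the lost-update interleaving from the previous slide cannot occur: with private values 2 and 3 the result is always 5.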
46 2.2.2 Message passing The most commonly used method of programming distributed-memory MIMD system is message passing, or its variant. We focus on the Message-Passing Interface (MPI)
47 MPI_Send() and MPI_Recv()

    int MPI_Send(void* buffer /* in */, int count /* in */, MPI_Datatype datatype /* in */, int destination /* in */, int tag /* in */, MPI_Comm communicator /* in */)

    int MPI_Recv(void* buffer /* out */, int count /* in */, MPI_Datatype datatype /* in */, int source /* in */, int tag /* in */, MPI_Comm communicator /* in */, MPI_Status* status /* out */)
48 Process 0 sends a float x to process 1: MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); Process 1 receives the float x from process 0: MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status); Different programs or a single program? SPMD (Single-Program-Multiple-Data) model:

    if (my_process_rank == 0)
        MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (my_process_rank == 1)
        MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
49 Buffering Process 0 (A) issues a request to send before process 1 (B) is ready to receive. We can buffer the message: the content of the message is copied into a system-controlled block of memory (on A, B, or both), and 0 can continue executing. Synchronous communication: process 0 waits until process 1 is ready. Buffered communication: the message is buffered in the appropriate memory location controlled by 1. Advantage: the sending process can continue to do useful work if the receiving process is not ready, and the system will not crash even if process 1 never executes a receive. Disadvantage: it uses additional memory, and if the receiving process is ready, the communication actually takes longer because of copying data between the buffer and the user program's memory locations.
50 Blocking and nonblocking communication Blocking communication: a process remains idle until the message is available, as with MPI_Recv(). In blocking communication, it may not be necessary for process 0 to receive permission to go ahead with the send. Nonblocking receive operation: MPI_Irecv(), with an additional parameter, a request. The call notifies the system that process 1 intends to receive a message from 0 with the properties indicated by the arguments. The system initializes the request argument, and the call returns. Process 1 can then perform other useful work and check back later to see whether the message has arrived. Nonblocking communication can provide dramatic improvements in the performance of message-passing programs.
51 2.2.3 Data-parallel languages

          program add_arrays
    !HPF$ PROCESSORS p(10)
          real x(1000), y(1000), z(1000)
    !HPF$ ALIGN y(:) WITH x(:)
    !HPF$ ALIGN z(:) WITH x(:)
    !HPF$ DISTRIBUTE x(BLOCK) ONTO p
    C     initialize x and y ...
          z = x + y
          end

(1) Specify a collection of 10 abstract processors; (2) define the arrays; (3) specify that y should be mapped to the abstract processors in the same way that x is; (4) specify that z should be mapped to the abstract processors in the same way that x is; (5) specify which elements of x will be mapped to which abstract processors; (6) BLOCK specifies that x will be mapped by blocks onto the processors: the first 1000/10 = 100 elements are mapped to the first processor, and so on.
52 2.2.4 RPC and Active messages RPC (remote procedure call) and active messages are two other approaches to programming parallel systems, but we are not going to discuss them in this course.
53 2.2.5 Data mapping Optimal data mapping assigns data elements to processors so that communication is minimized. Let the array be A = (a0, a1, a2, ..., a(n-1)) and the processors be P = (q0, q1, q2, ..., q(p-1)). If the number of processors equals the number of array elements, a_i is assigned to q_i. Block mapping: partition the array elements into blocks of consecutive entries and assign the blocks to the processors. If p = 3 and n = 12: a0, a1, a2, a3 -> q0; a4, a5, a6, a7 -> q1; a8, a9, a10, a11 -> q2. Cyclic mapping: assign the first element to the first processor, the second element to the second processor, and so on: a0, a3, a6, a9 -> q0; a1, a4, a7, a10 -> q1; a2, a5, a8, a11 -> q2. Block-cyclic mapping: partition the array into blocks of consecutive elements as in the block mapping, but the blocks are not necessarily of size n/p; the blocks are then mapped to the processors in the same way that the elements are mapped in the cyclic mapping. With block size 2: a0, a1, a6, a7 -> q0; a2, a3, a8, a9 -> q1; a4, a5, a10, a11 -> q2.
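The three mappings above reduce to one-line owner formulas. A sketch in C (the function names are ours; the block mapping assumes p divides n):

```c
/* Processor that owns element i when n elements are split among p
   processors under each mapping. */
int block_owner(int i, int n, int p)        { return i / (n / p); }
int cyclic_owner(int i, int p)              { return i % p; }
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }
```

With n = 12, p = 3, and block size b = 2 these reproduce the three assignments above; for example, block_cyclic_owner(7, 2, 3) = 0, so a7 -> q0.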
54 How about matrices? [Figure: the blocks of a matrix distributed onto a two-dimensional processor grid.]
More informationConcurrent/Parallel Processing
Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,
More informationCommunication Cost in Parallel Computing
Communication Cost in Parallel Computing Ned Nedialkov McMaster University Canada SE/CS 4F03 January 2016 Outline Cost Startup time Pre-hop time Pre-word time Store-and-forward Packet routing Cut-through
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationInterconnection Networks: Topology. Prof. Natalie Enright Jerger
Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design
More informationInterconnection networks
Interconnection networks When more than one processor needs to access a memory structure, interconnection networks are needed to route data from processors to memories (concurrent access to a shared memory
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationParallel Architecture, Software And Performance
Parallel Architecture, Software And Performance UCSB CS240A, T. Yang, 2016 Roadmap Parallel architectures for high performance computing Shared memory architecture with cache coherence Performance evaluation
More informationModel Questions and Answers on
BIJU PATNAIK UNIVERSITY OF TECHNOLOGY, ODISHA Model Questions and Answers on PARALLEL COMPUTING Prepared by, Dr. Subhendu Kumar Rath, BPUT, Odisha. Model Questions and Answers Subject Parallel Computing
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #8 2/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationModule 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth
Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationHigh Performance Computing in C and C++
High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University Announcement No change in lecture schedule: Timetable remains the same: Monday 1 to 2 Glyndwr C Friday
More informationIntroduction. CSCI 4850/5850 High-Performance Computing Spring 2018
Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel
More informationPipeline and Vector Processing 1. Parallel Processing SISD SIMD MISD & MIMD
Pipeline and Vector Processing 1. Parallel Processing Parallel processing is a term used to denote a large class of techniques that are used to provide simultaneous data-processing tasks for the purpose
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationLecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control
Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control 1 Topology Examples Grid Torus Hypercube Criteria Bus Ring 2Dtorus 6-cube Fully connected Performance Bisection
More informationLecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel
More informationOutline. Distributed Shared Memory. Shared Memory. ECE574 Cluster Computing. Dichotomy of Parallel Computing Platforms (Continued)
Cluster Computing Dichotomy of Parallel Computing Platforms (Continued) Lecturer: Dr Yifeng Zhu Class Review Interconnections Crossbar» Example: myrinet Multistage» Example: Omega network Outline Flynn
More informationInterconnection Networks. Issues for Networks
Interconnection Networks Communications Among Processors Chris Nevison, Colgate University Issues for Networks Total Bandwidth amount of data which can be moved from somewhere to somewhere per unit time
More informationDesign of Parallel Algorithms. The Architecture of a Parallel Computer
+ Design of Parallel Algorithms The Architecture of a Parallel Computer + Trends in Microprocessor Architectures n Microprocessor clock speeds are no longer increasing and have reached a limit of 3-4 Ghz
More informationNon-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD
More informationHigh performance computing. Message Passing Interface
High performance computing Message Passing Interface send-receive paradigm sending the message: send (target, id, data) receiving the message: receive (source, id, data) Versatility of the model High efficiency
More informationProcessor Performance. Overview: Classical Parallel Hardware. The Processor. Adding Numbers. Review of Single Processor Design
Overview: Classical Parallel Hardware Processor Performance Review of Single Processor Design so we talk the same language many things happen in parallel even on a single processor identify potential issues
More informationComputer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1
Pipelining and Vector Processing Parallel Processing: The term parallel processing indicates that the system is able to perform several operations in a single time. Now we will elaborate the scenario,
More informationPCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.
PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed
More informationChapter 2: Parallel Programming Platforms
Chapter 2: Parallel Programming Platforms Introduction to Parallel Computing, Second Edition By Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Contents Implicit Parallelism: Trends in Microprocessor
More informationOverview: Classical Parallel Hardware
Overview: Classical Parallel Hardware Review of Single Processor Design so we talk the same language many things happen in parallel even on a single processor identify potential issues for parallel hardware
More informationCS4961 Parallel Programming. Lecture 4: Memory Systems and Interconnects 9/1/11. Administrative. Mary Hall September 1, Homework 2, cont.
CS4961 Parallel Programming Lecture 4: Memory Systems and Interconnects Administrative Nikhil office hours: - Monday, 2-3PM - Lab hours on Tuesday afternoons during programming assignments First homework
More informationSHARED MEMORY VS DISTRIBUTED MEMORY
OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors
More informationLast Time. Intro to Parallel Algorithms. Parallel Search Parallel Sorting. Merge sort Sample sort
Intro to MPI Last Time Intro to Parallel Algorithms Parallel Search Parallel Sorting Merge sort Sample sort Today Network Topology Communication Primitives Message Passing Interface (MPI) Randomized Algorithms
More informationStructure of Computer Systems
288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationParallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam
Parallel Computer Architectures Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Outline Flynn s Taxonomy Classification of Parallel Computers Based on Architectures Flynn s Taxonomy Based on notions of
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer
More informationInterconnection Network. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Interconnection Network Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Topics Taxonomy Metric Topologies Characteristics Cost Performance 2 Interconnection
More informationUNIT I (Two Marks Questions & Answers)
UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationParallel Computing Ideas
Parallel Computing Ideas K. 1 1 Department of Mathematics 2018 Why When to go for speed Historically: Production code Code takes a long time to run Code runs many times Code is not end in itself 2010:
More informationBlocking SEND/RECEIVE
Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F
More informationMessage Passing and Network Fundamentals ASD Distributed Memory HPC Workshop
Message Passing and Network Fundamentals ASD Distributed Memory HPC Workshop Computer Systems Group Research School of Computer Science Australian National University Canberra, Australia October 30, 2017
More information