EE/CSCI 451: Parallel and Distributed Computation


1 EE/CSCI 451: Parallel and Distributed Computation, Lecture #11, 2/21/2017. Xuehai Qian, University of Southern California.

2 Outline. Midterm 1: 2/26, in class, 2-3:20pm.

3 Example memory system performance (1)
Memory: latency = 10 ns; (peak) bandwidth = 64 bits at 1 GHz (64 Gbits/s or 8 GB/s, the bus frequency).
Processor: 2 GHz; 1 word = 64 bits (8 bytes); unit of data access = 1 word; cycle time = 0.5 ns (processor cycle).
Assume a 2-issue superscalar, single-cycle processor with 2 double-precision multiply-add FPUs (2 multipliers, 2 adders).
[Figure: processor P connected to memory M over a 64-bit bus.]
Peak performance of the processor = 2 GHz (clock rate) × 4 FP ops/cycle = 8 GFlops/s (raw compute power).
Peak performance can also be computed as 2 GHz (clock rate) × 2 pipelines = 4 GFlops/s.

4 Example memory system performance (2)
Example: inner product a · b = Σ_{i=1}^{n} a_i b_i (data in external memory).
2 data fetches for each multiply, add.
The processor can do 8 mult or add per ns = 4 FP ops/cycle. Note: issue bandwidth limits performance to 2 FP ops/cycle.
[Timing diagram: Read, Read, then Mul, Add; if the data can be streamed, the next Read, Read overlaps with the Mul, Add of the previous pair.]
In the best case, i.e., data is streamed (no pipeline stalls): sustained performance (possible best case) = 2 FP ops over 4 processor cycles (2 ns) = 1 GFlops/s.
Processor-memory bandwidth determines the possible best-case sustained performance.
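A minimal C sketch of the inner-product kernel analyzed above (the array length n and type double are illustrative, not from the slides): each iteration issues two loads for one multiply and one add, which is why memory bandwidth, not peak FLOP rate, bounds the sustained performance.

double inner_product(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* 2 loads, 1 multiply, 1 add per iteration */
    return sum;
}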

5 Example memory system performance (3)
[Figure: P connected to M over a 64-bit bus.] Latency = 10 ns; (peak) bandwidth = 64 bits at 1 GHz (64 Gbits/s or 8 GB/s).
Repeat n times: 1. read a_i (10 ns); 2. read b_i (10 ns); 3. multiply (use register data); 4. add (use register data).
Overlap instructions 3, 4 with instructions 1, 2 (pipelining).
Sustained performance (possible worst case) = total # of FP ops / total time = 2n / (20n × 10⁻⁹ s) = 0.1 GFlops/s, where the denominator is the total memory access time.

6 Example memory system performance (4)
[Figure: P (2 GHz, 2-issue, 2 FPUs) connected to M (1 GHz, 10 ns) over a 64-bit bus; inner product example.]
Processor peak = 8 GFlops/s (raw compute power).
Processor peak = 4 GFlops/s (processor organization).
(Best case) sustained = 1 GFlops/s (memory bandwidth).
(Worst case) sustained = 0.1 GFlops/s (memory latency).
This is a memory-bound computation.

7 Cache (1)
Improving effective memory latency.
[Figure: processor P (2 GHz, 4-way superscalar, execution pipelines) with cache C in front of DRAM M (4 GB, 10 ns, 1 GHz).]
Why does it work? Data reuse: accessing DRAM is expensive (DRAM latency), so repeatedly use the data in the cache (fast).
Hit ratio: fraction of the memory references served by the cache.
Cache: small (16 MB), fast (2 GHz).

8 Cache (3)
Locality of references.
Spatial locality: if location i is referenced at time t, then locations near i (i − 1, i + 1, ...) are referenced in a small window of time following t.
Temporal locality: in a small window of time, repeated references are made to a small set of data items.

9 Cache (4)
Example: bubble sort
for i from 1 to N
  for j from 0 to N − 2
    R1 ← a[j]; R2 ← a[j+1]
    if R1 > R2 then swap(R1, R2)
    a[j] ← R1; a[j+1] ← R2
  end for
end for
Assumptions: direct mapped; first-in-first-out policy; write-through policy; N = 8; cache line size = 1 data element; cache size = 2 data elements.
i = 1: Read a[0], a[1]: miss, miss. Read a[1], a[2]: hit, miss. Read a[2], a[3]: hit, miss. ... Read a[6], a[7]: hit, miss.
i = 2: Read a[0], a[1]: miss, miss. Read a[1], a[2]: hit, miss. ...
Cache hit ratio = read hit ratio = (# of times data is found in cache) / (total # of accesses) = (N − 2) / (2(N − 1)) = 6/14 ≈ 0.43.

10 Data layout and data access pattern (1)
Storing a 2-dimensional array in memory.
Row major order: A(i, j) → Memory(i · n + j), 0 ≤ i, j < n.
Column major order: A(i, j) → Memory(i + n · j), 0 ≤ i, j < n.
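To illustrate why the layout matters, here is a hedged C sketch (the size N and the function names sum_by_rows, sum_by_columns are illustrative): C stores 2-D arrays in row-major order, so A[i][j] sits at offset i·N + j; row-order traversal touches consecutive addresses (good spatial locality), while column-order traversal strides by N elements (poor spatial locality for large N).

#define N 1024
double A[N][N];               /* row-major: A[i][j] is at offset i*N + j */

double sum_by_rows(void) {    /* consecutive addresses: cache friendly */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += A[i][j];
    return s;
}

double sum_by_columns(void) { /* stride-N accesses: poor spatial locality */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += A[i][j];
    return s;
}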

11 Shuffle Network
Perfect shuffle (PS) connection: a link exists between input i and output j if
  j = 2i for 0 ≤ i < p/2, or
  j = 2i + 1 − p for p/2 ≤ i < p,
i.e., j is the left rotation (circular left shift) of the binary representation of i; p = power of 2.
For p = 8: 000 = left_rotate(000), 001 = left_rotate(100), 010 = left_rotate(001), 011 = left_rotate(101), 100 = left_rotate(010), 101 = left_rotate(110), 110 = left_rotate(011), 111 = left_rotate(111).
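A small C sketch of the perfect-shuffle connection (the function name shuffle and the k-bit index representation are assumptions for illustration): the output index is the circular left shift of the k-bit input index, matching the j = 2i and j = 2i + 1 − p cases above.

unsigned shuffle(unsigned i, unsigned k) {      /* p = 2^k nodes */
    unsigned msb = (i >> (k - 1)) & 1u;         /* bit that wraps around */
    return ((i << 1) | msb) & ((1u << k) - 1u); /* rotate left by one bit */
}

For p = 8 (k = 3), shuffle(4, 3) returns 1 (100 → 001) and shuffle(5, 3) returns 3 (101 → 011), matching the table above.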

12 Shuffle Exchange Network
Shuffle (shown for n = 8): i → 2i mod n for 0 ≤ i ≤ n/2 − 1; i → (2i + 1) mod n for n/2 ≤ i ≤ n − 1.
Exchange: 2i ↔ 2i + 1.

13 Example: n = 8, 3-bit index.
Shuffle connection: circular left shift of the index.
Exchange connection: 2i ↔ 2i + 1 (complement the least significant bit).
Diameter (discussed later): O(log n).

14 Routing in Shuffle Exchange Network (1)
Source x = x_{k−1} ... x_0; destination d = d_{k−1} ... d_0.
y ← x {current location}
i ← 1
While i ≤ k:
  Shuffle y {rotate left}
  Compare the LSB of y with bit (k − i) of the destination d:
  if the bits are the same, do not Exchange; else Exchange {complement y_0}
  i ← i + 1
End
Total # of hops ≤ 2k (= 2 log₂ n). [Figure: path from source x through intermediate nodes to destination d.]

15 Routing in Shuffle Exchange Network (2)
Example: source x = x₂x₁x₀ = 000, destination d = d₂d₁d₀ = 110, k = 3.
i = 1: Shuffle 000 → 000; compare the LSB of y with bit 2 of the destination (y₀ = d₂? here the same as x₂ = d₂?): 0 ≠ 1, so Exchange → 001. Position at the end of the first iteration: 001.
i = 2: Shuffle 001 → 010; y₀ = 0 ≠ d₁ = 1, so Exchange → 011.
i = 3: Shuffle 011 → 110; y₀ = 0 = d₀ = 0, so no Exchange. Arrived at 110.
End of the i-th iteration: y = x_{k−1−i} ... x_0 d_{k−1} ... d_{k−i}.
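A C sketch of this routing loop (it reuses the hypothetical shuffle() helper from above; variable names are illustrative): after k shuffle steps, each optionally followed by an exchange, the current node equals the destination, giving at most 2k = 2 log₂ n hops.

unsigned route(unsigned x, unsigned d, unsigned k) {
    unsigned y = x;                            /* current location */
    for (unsigned i = 1; i <= k; i++) {
        y = shuffle(y, k);                     /* shuffle: rotate left */
        unsigned bit = (d >> (k - i)) & 1u;    /* bit (k - i) of destination */
        if ((y & 1u) != bit)
            y ^= 1u;                           /* exchange: complement LSB */
    }
    return y;                                  /* y == d after k iterations */
}

For x = 000, d = 110, k = 3, the node at the end of iterations 1, 2, 3 is 001, 011, 110 respectively, matching the trace above.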

16 Routing in Shuffle Exchange Network (3)
[Figure: the route for source x₂x₁x₀ = 000, destination d₂d₁d₀ = 110, k = 3, traversing shuffle and exchange links 000 → 001 → 010 → 011 → 110.]

17 Routing in Shuffle Exchange Network (4)
Theorem: In a shuffle exchange network with n = 2^k nodes, data from any source to any destination can be routed in at most 2 log₂ n steps.

18 Omega Network (1)
p inputs, p outputs; log₂ p stages, each stage having p/2 switches.
Each stage = a perfect shuffle connection followed by a column of 2×2 exchange switches; each switch can pass its two inputs straight through or exchange them.
[Figure: 8×8 Omega network.]

19 Omega Network (2)
Omega network properties:
Multistage network. Cost ~ (p/2) log₂ p (number of switches). Note: in actual hardware design, routing cost dominates!
An Omega network can realize 2^{(p/2) log₂ p} < p! permutations; not all p! permutations can be realized.
Unique (only one) path from any input to any output.
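As an illustrative check of these counts: for p = 8 the network has (8/2) · log₂ 8 = 12 switches, hence at most 2¹² = 4096 distinct switch settings, far fewer than 8! = 40320 permutations.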

20 Omega Network (3)
Example of blocking: one of the messages (010 to 111, or 110 to 100) is blocked at link AB.
[Figure: 8×8 Omega network with the two paths contending for link AB.]

21 Congestion in a Network (1)
Given a routing protocol and a data communication pattern (e.g., a permutation):
Congestion = max over nodes { # of paths passing through the node }.
[Figure: example pattern; the node with the most paths through it determines the congestion.]

22 Congestion in a Network (2)
Interconnection network = graph + routing algorithm.
Assume the routing algorithm provides a unique path (exactly one) from i to j, for all i, j.
For a given permutation: congestion at node k = # of paths that pass through k; congestion in the network = max over nodes k { # of paths that pass through k }.
Worst-case congestion = max over all permutations { congestion in the network }.

23 CLOS network (2)
Structure of the CLOS network. [Figure: three columns of switch boxes with a control unit, forming an n × n network.]

24 CLOS network (3)
Stage i to stage i + 1 connections: any box connects to all boxes in the next stage.
3-stage network built from √n × √n crossbar boxes: number of switches (boxes) = 3√n; cost of each √n × √n crossbar = O(n).
Total cost = O(√n · n) = O(n^{3/2}), versus O(n²) for a single n × n crossbar.
Note: the CLOS network can realize all n! permutations.

25 Butterfly network (1)
n = 2^k for some k. [Figure: 8-input butterfly network, stages 0 through 3.]
log₂ n + 1 stages; power-of-2 connections 2^l, l = 0, 1, ..., log₂ n − 1.
In stage l, 0 ≤ l < log₂ n: node i connects to node i in stage l + 1, and to node (i with bit l complemented) in stage l + 1.

26 Butterfly network (2)
n-input butterfly network:
Total number of nodes = n (log₂ n + 1)  (n nodes per stage × # of stages).
Total number of edges = 2 n log₂ n  (2 out-edges per node in each of the first log₂ n stages).

27 Mesh-connected Network
1-D mesh (without wraparound); 1-D torus (with wraparound) = ring, nodes 0, 1, ..., p − 1.
2-D mesh (without wraparound); 2-D torus (with wraparound).
k-dimensional mesh: p^{1/k} × p^{1/k} × ... × p^{1/k} (k times), p nodes in total.
Number of connections per node = 2k.

28 Hypercube Network (1)
Construction of a hypercube from hypercubes of lower dimension: 0-D hypercube → 1-D hypercube → 2-D hypercube → 3-D hypercube → 4-D hypercube.
[Figure: at each step, two copies of the lower-dimensional hypercube are joined by adding an edge between each pair of corresponding nodes (existing edges vs. added edges).]

29 Hypercube Network (2)
In general, for a k-dimensional hypercube:
p = total number of nodes = 2^k; node i = i_{k−1} i_{k−2} ... i_0.
k connections per node: complement one bit of i.
[Figure: example.]
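A short C sketch (the function name hypercube_neighbors is illustrative): the k neighbors of node i are obtained by complementing each bit of i in turn.

void hypercube_neighbors(unsigned i, unsigned k, unsigned nbr[]) {
    for (unsigned b = 0; b < k; b++)
        nbr[b] = i ^ (1u << b);   /* complement bit b of i */
}

For k = 3, node 5 = 101 gets neighbors 100, 111, 001, i.e., nodes 4, 7, and 1.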

30 Tree-based Network (1)
Static tree network (every node is a processing node) vs. dynamic tree network (internal nodes are switching nodes, leaves are processing nodes).
Height: log p, where p = total number of nodes.

31 Tree-based Network (2) Fat tree (16 processing nodes) 31

32 Performance metrics (1)
Diameter: maximum distance between any two processing nodes in the network: diameter = max over (i, j) { distance(i, j) }.
distance(i, j) = minimum path length between i and j = length of the shortest path between i and j; length of a path = # of edges in the path.

33 Performance metrics (6)
Example: bisection width (2). Partitioning a √p × √p 2-D mesh (no wraparound) into two halves cuts √p links, so the bisection width is √p.

34 Performance metrics (8)
Cost of a static network: cost = number of communication links in the network.
Examples: tree: p − 1; 1-D mesh (no wraparound): p − 1; d-dimensional wraparound mesh: dp; hypercube: (p log₂ p)/2.
k-ary d-cube: a d-dimensional array with k elements in each dimension; number of nodes p = k^d; cost: dp.

35 Summary
Network | Diameter | Bisection width | Cost (No. of links)
Completely connected | 1 | p²/4 | p(p − 1)/2
Star | 2 | 1 | p − 1
Complete binary tree | 2 log((p + 1)/2) | 1 | p − 1
1-D mesh, no wraparound | p − 1 | 1 | p − 1
2-D mesh, no wraparound | 2(√p − 1) | √p | 2(p − √p)
2-D wraparound mesh | 2⌊√p/2⌋ | 2√p | 2p
Hypercube | log p | p/2 | (p log p)/2
Wraparound k-ary d-cube (p = k^d) | d⌊k/2⌋ | 2k^{d−1} | dp

36 Shared Address Space Programming (1)
All threads have access to the same global, shared memory; threads can also have their own private data.
The programmer is responsible for synchronizing access to (protecting) globally shared data, to ensure correctness of the program.
[Figure: four threads, each with private data, attached to a shared memory.]

37 Shared Address Space Programming (5)
Example 1: matrix multiplication with one thread per element of C (A, B, C in shared memory).
Thread 1: for k from 1 to n: C(1, 1) = C(1, 1) + A(1, k) · B(k, 1).
Thread (i − 1) · n + j computes C(i, j): for k from 1 to n: C(i, j) = C(i, j) + A(i, k) · B(k, j).

38 Shared Address Space Programming (7)
Example 2: matrix multiplication with 4 threads, one per quadrant of C (A, B, C in shared memory).
Thread 1: i from 1 to n/2, j from 1 to n/2: C(i, j) = A(i, :) · B(:, j).
Thread 2: i from 1 to n/2, j from n/2 + 1 to n: C(i, j) = A(i, :) · B(:, j).
Thread 3: i from n/2 + 1 to n, j from 1 to n/2: C(i, j) = A(i, :) · B(:, j).
Thread 4: i from n/2 + 1 to n, j from n/2 + 1 to n: C(i, j) = A(i, :) · B(:, j).
Input data shared; no interaction among the threads. A Pthreads sketch of this partitioning follows below.
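The sketch below shows one way to code Example 2 with Pthreads (the size N, the struct Block, and the function names are assumptions for illustration): A, B, C live in shared memory, the four output quadrants are disjoint, and the inputs are read-only, so no locking is needed.

#include <pthread.h>

#define N 4                               /* illustrative size; assumed even */
double A[N][N], B[N][N], C[N][N];         /* shared among all threads */

typedef struct { int i0, i1, j0, j1; } Block;  /* half-open index ranges */

void *worker(void *arg) {                 /* computes one quadrant of C */
    Block *b = (Block *)arg;
    for (int i = b->i0; i < b->i1; i++)
        for (int j = b->j0; j < b->j1; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];   /* C(i,j) = A(i,:) . B(:,j) */
            C[i][j] = s;
        }
    return NULL;
}

int main(void) {
    Block blk[4] = { {0, N/2, 0, N/2}, {0, N/2, N/2, N},
                     {N/2, N, 0, N/2}, {N/2, N, N/2, N} };
    pthread_t t[4];
    for (int q = 0; q < 4; q++)
        pthread_create(&t[q], NULL, worker, &blk[q]);
    for (int q = 0; q < 4; q++)
        pthread_join(t[q], NULL);
    return 0;
}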

39 Synchronization method: Barrier (1)
Barrier objects can be created at certain places in the program.
Any thread that reaches the barrier stops until all the threads have reached the barrier.
[Figure: timelines of T1 through T4; each waits at the barrier until all four have arrived.]

40 Shared Variable Access (1)
Threads acquire a lock to modify a shared variable and release the lock when done. Only 1 thread can hold a lock at any time.
[Figure: example execution sequence: T1's lock succeeds and it accesses the shared variable while T2 and T3 wait on unsuccessful lock attempts; after T1 releases the lock, a waiting thread acquires it, accesses the shared variable, and releases it.]

41 Shared Variable Access (2)
Example: find the max between two threads; each thread has a local value (i and j, respectively). Initialize the shared variable Max to 0.
Thread 1: Acquire_lock(Max); if (i > Max) Max = i; Release_lock(Max).
Thread 2: Acquire_lock(Max); if (j > Max) Max = j; Release_lock(Max).
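In Pthreads, Acquire_lock/Release_lock correspond to a mutex; a minimal sketch (the names Max, max_lock, and update_max are illustrative):

#include <pthread.h>

double Max = 0;                                    /* shared variable */
pthread_mutex_t max_lock = PTHREAD_MUTEX_INITIALIZER;

void update_max(double local) {
    pthread_mutex_lock(&max_lock);     /* Acquire_lock(Max) */
    if (local > Max)
        Max = local;                   /* at most one thread here at a time */
    pthread_mutex_unlock(&max_lock);   /* Release_lock(Max) */
}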

42 Correct Parallel Program
[Figure: Data → Parallel Program (running on a parallel platform) → Output.]
For all data inputs and for all execution sequences, the correct output is produced.

43 A Simple Model of Shared Address Space Parallel Machine (PRAM) (1)
1 unit of time = local access = shared memory access. Synchronous model; parallel time = total number of cycles.
Pthreads programming model? Asynchronous shared memory??
[Figure: p processors 0 ... p − 1 connected to a shared memory.]

44 PRAM (2)
Random Access Machine (RAM): random access = access to any memory location (direct access).
Time: access to memory = 1 unit of time; arithmetic/logic operation = 1 unit of time.
Serial time complexity T_s(n). Example: merge sort, T_s(n) = O(n log n).
[Figure: a processor connected to a memory, each access costing 1 unit of time.]

45 PRAM (5)
PRAM is a synchronous model: processors 0 ... p − 1 share a clock and a shared memory.
Each processor is a RAM running a local program acting on data in the shared memory; each instruction takes 1 unit of time.
Synchronous execution: for all i, the i-th instruction in the execution sequence is executed in the i-th cycle by all the processors.

46 Adding on PRAM (1)
Simple shared memory algorithm for adding n numbers: Output = Σ_{i=0}^{n−1} A(i), left in A(0).
[Figure: shared array A(0) ... A(n − 1).]

47 Adding on PRAM (2)
Key idea, for n = 8 (i = iteration #):
End of i = 0: active processors = {0, 2, 4, 6}; partial sums are 2 elements apart.
End of i = 1: active processors = {0, 4}; partial sums are 4 elements apart.
End of i = 2: active processors = {0}; the sum of all 8 elements is in A(0).

48 Adding on PRAM (3)
Example: n = 8. [Table: active processor indices (j) per iteration (i), over time.]
Total # of additions performed = n/2 + n/4 + ... + 1 = n − 1 = total # of additions performed by a serial program.

49 Adding on PRAM (4)
Algorithm (program in processor j, 0 ≤ j < n; A(0) ... A(n − 1) is shared among all the processors):
1. Do i = 0 to log₂ n − 1
2.   If j = k · 2^{i+1} for some k ∈ N, then A(j) ← A(j) + A(j + 2^i)
3. End
N = set of natural numbers = {0, 1, ...}.
Synchronous operation: for example, all the processors execute instruction 2 during the same cycle; the loop runs log₂ n times.
Parallel time = O(log n).

50 Addition on PRAM, p < n processors
1. Each processor adds within a block of size n/p.
2. Apply Algorithm 1 to the p partial sums.
T_p = parallel time = O(n/p + log p); T_s = serial time = O(n).
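For example (illustrative numbers), with n = 1024 and p = 16: each processor first adds its block of n/p = 64 elements serially, then Algorithm 1 combines the 16 partial sums in log₂ 16 = 4 parallel steps, so T_p = O(64 + 4).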

51 Synchronous and Asynchronous
Synchronous (e.g., PRAM), two processors sharing a clock:
P0: y ← 0; Do i = 0 to 499: y ← y + A(i); End; y ← x + y; OUTPUT y; STOP.
P1: x ← 0; Do i = 500 to 999: x ← x + A(i); End; STOP.
Asynchronous (e.g., Pthreads):
T0: y ← 0; Do i = 0 to 499: y ← y + A(i); End; Barrier; y ← x + y.
T1: x ← 0; Do i = 500 to 999: x ← x + A(i); End; Barrier.

52 Addition: Pthreads model?
Instruction execution is NOT synchronized. Thread j:
Do i = 0 to log₂ n − 1
  If j = k · 2^{i+1} for some k ∈ N, then A(j) ← A(j) + A(j + 2^i)
  BARRIER
End
Correct output? How to measure time?
Number of active threads in iteration i = n / 2^{i+1}.
Each iteration completes before proceeding to the next iteration; within each iteration, threads may execute asynchronously. A Pthreads sketch follows below.
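A runnable Pthreads sketch of this scheme (the size N, the name add_worker, and the one-thread-per-element setup are assumptions for illustration; pthread_barrier_t is POSIX but optional on some platforms): within an iteration the threads run asynchronously, and the barrier guarantees iteration i completes everywhere before iteration i + 1 begins.

#include <pthread.h>

#define N 8                       /* illustrative; assumed a power of 2 */
double A[N];                      /* shared array; the sum ends up in A[0] */
pthread_barrier_t bar;

void *add_worker(void *arg) {
    long j = (long)arg;                        /* thread index */
    for (long step = 1; step < N; step *= 2) { /* step = 2^i */
        if (j % (2 * step) == 0)               /* j = k * 2^(i+1) */
            A[j] = A[j] + A[j + step];
        pthread_barrier_wait(&bar);            /* BARRIER */
    }
    return NULL;
}

int main(void) {
    pthread_t t[N];
    for (long j = 0; j < N; j++) A[j] = (double)(j + 1);  /* sample data */
    pthread_barrier_init(&bar, NULL, N);
    for (long j = 0; j < N; j++)
        pthread_create(&t[j], NULL, add_worker, (void *)j);
    for (long j = 0; j < N; j++)
        pthread_join(t[j], NULL);
    /* A[0] now holds 1 + 2 + ... + N = 36 */
    pthread_barrier_destroy(&bar);
    return 0;
}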

53 OpenMP Programming Model (2)
Directive-based parallel programming:
#pragma omp directive [clause list]
/* structured block */
A structured block is a section of code grouped together. OpenMP directives are not part of a programming language; they can be included in various programming languages (e.g., C, C++, Fortran).

54 OpenMP Programming Model (3)
Fork-join model:
Fork: the master thread creates a team of parallel threads.
Join: when the team threads complete the statements in the parallel region, they synchronize and terminate, leaving only the master thread.
[Figure: the master thread alternating between serial execution and fork/join parallel regions.]

55 OpenMP Directives (1)
OpenMP executes serially until a parallel directive: the parallel region construct.
The same code is executed by multiple threads; this is the fundamental OpenMP parallel construct.
Example:
#pragma omp parallel [clause list]
{
  task();
}
The master thread and each team thread execute task(); there is an implied barrier at the end of the region.
The clause list specifies parameters of parallel execution: # of threads, private variables, shared variables.

56 OpenMP Directives (2)
Work-Sharing Constructs divide the execution of the enclosed code region among the members of the team; there is an implied barrier at the end of a work-sharing construct.
Types of work-sharing constructs:
for - shares iterations of a loop across the team; represents a type of "data parallelism" (see the sketch below).
sections - breaks work into separate, discrete sections, each executed by a thread; represents a type of "functional parallelism".
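A hedged C sketch of the for construct (the function and variable names are illustrative): the loop iterations are divided among the team, and the implied barrier at the end of the construct holds threads until all iterations are done.

#include <omp.h>

void scale(const double *a, double *b, int n, double c) {
    #pragma omp parallel
    {
        #pragma omp for            /* iterations shared across the team */
        for (int i = 0; i < n; i++)
            b[i] = c * a[i];
    }                              /* implied barrier at end of the region */
}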

57 Routing Mechanisms (1)
1. Store and Forward Routing. Message length = m words; number of hops = l.
Store: each intermediate node receives the entire message from its predecessor. Forward: it then forwards the message to the next node.
[Figure: P0 → P1 → P2 → P3 (example: l = 3); after the startup time t_s, each hop costs t_h + m · t_w.]
Total time for communication = t_s + (t_h + m · t_w) · l.
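Illustrative numbers (not from the slide): with t_s = 10 µs, t_h = 1 µs, t_w = 0.1 µs/word, m = 100 words, and l = 3 hops, store-and-forward costs 10 + 3 · (1 + 100 · 0.1) = 43 µs, since the entire message is received and retransmitted at every intermediate node.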

58 Cut Through Routing (3)
Pipelined data communication: t_comm = t_s + l · t_h + t_w · m, where l · t_h is the pipeline (fill) delay and t_w · m is the communication time on a link.
Note: there is a small overhead in processing a FLIT at each intermediate node, smaller than processing a packet, so t_h is smaller than in the store-and-forward scenario.

59 Cut Through Routing (7)
Minimizing l (the number of hops) is hard: a program has little control over the program-to-processor mapping, and machines (platforms) use various routing techniques to minimize congestion (e.g., randomized routing).
The per-hop time is usually small compared with t_s and t_w · m for large m.
Simple communication cost model: t_comm = t_s + t_w · m, i.e., the same amount of time to communicate between any two nodes (as if fully connected?).
Use this cost model to design and optimize algorithms, instead of designing algorithms specific to a parallel architecture (mesh, hypercube).
The simplified cost model does not take congestion into account.

60 LogP Model (3)
[Figure: P processor-memory (P-M) pairs connected by a communication network; parameters: o (overhead), g (gap), L (latency), along with P-M latency and P-M throughput.]

61 LogP Model (4)
L: an upper bound on the latency, or delay, incurred in communicating a message containing a word from its source module to its target module.
o: the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message.
g: the gap, defined as the minimum time interval between consecutive message transmissions or receptions at a processor.
P: the number of processors/memory modules.

62 Program and Data Mapping (1)
Parallel program = collection of processes + interaction among processes. Each process = computation + data access.
Classic problem: given a parallel program, embed (map) the processes and data onto a given parallel architecture such that communication cost is minimized and overall execution time is minimized.

63 Program and Data Mapping (2)
A simple abstraction: the parallel program is a graph G(V, E) (node = process, edge = communication), mapped onto the parallel architecture, a graph G′(V′, E′) (node = processor, edge = interconnection).
This is the graph embedding problem.

64 Program and Data Mapping (3)
Graph embedding problem: given a parallel program G(V, E) (undirected) and a parallel architecture G′(V′, E′):
Function f: V → V′ maps each process to a processor (e.g., a → f(a), b → f(b)).
Function g: E → paths in G′: for every edge (a, b) in E, g specifies a path in G′ whose end points are f(a) and f(b).
Note: 1. V′ can be larger than V. 2. A vertex in V′ may correspond to more than one vertex in V.

65 Program and Data Mapping (5)
Metrics:
Congestion (on edges): max number of edges in E mapped onto a single edge in E′.
Dilation: an edge in E maps to a path in E′; dilation = max such path length in E′.
Expansion: |V′| / |V|. It is also possible that |V′| < |V|, for example under virtualization.

66 Scalability (2) Speedup = Serial time (on a uniprocessor system) Parallel time using p processors If speedup = O(p), then it is a scalable solution. 66

67 Amdahl's Law (1)
Amdahl's Law: a limit on the speedup achievable when a program is run on a parallel machine.
Given an input program with serial portion S and parallelizable portion P, normalized so that S + P = 1:
Time on a uniprocessor machine = S + P = 1.
Time on a parallel machine = S + P/f, where f = speedup factor of the parallelizable portion.
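A quick worked example (illustrative numbers): if S = 0.1 (10% serial) and the parallelizable portion is sped up by f = 10, the parallel time is 0.1 + 0.9/10 = 0.19, a speedup of about 5.3; even as f → ∞, the speedup is bounded by 1/S = 10.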

68 Scaled Speedup (Gustafson's Law) (2)
Amdahl's Law (fixed amount of computation): serial time = S + (1 − S) = 1; parallel time = S + (1 − S)/p, where p = number of processors.
Gustafson's Law (increase p and the amount of computation): parallel time = S + P = S + (1 − S) = 1; serial time = S + (1 − S) · p.
If parallelism scales linearly with p, the number of processors:
Scaled speedup = serial time / parallel time = (S + (1 − S) · p) / (S + P) = S + (1 − S) · p.
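A quick worked example (illustrative numbers): with S = 0.1 and p = 10 processors, the scaled speedup is 0.1 + (1 − 0.1) · 10 = 9.1, close to linear in p.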

69 Performance (1)
Efficiency. Question: if we use p processors, is speedup = p?
Efficiency = fraction of time a processor is usefully employed during the computation.
[Figure: typical execution on a parallel machine with p = 2: P1 and P2 alternate between useful work and idle time (overhead), compared with the serial computation time at p = 1.]
E = speedup / # of processors used. E is the average efficiency over all the processors; the efficiency of each processor can differ from the average value.

70 Performance (3)
Cost = total amount of work done by a parallel system = parallel execution time × number of processors = T_p · p.
Cost is also called the processor-time product.
A parallel algorithm is COST OPTIMAL (or WORK OPTIMAL) if the total work done = the serial complexity of the problem.

71 Performance Analysis (1)
Asymptotic analysis: Big-O (order) notation; worst-case execution time of an algorithm; an upper bound on the growth rate of the execution time.
Example: n × n matrix multiplication
1. Do i = 1 to n
2.   Do j = 1 to n
3.     C(i, j) ← 0
4.     Do k = 1 to n
5.       C(i, j) ← C(i, j) + A(i, k) · B(k, j)
6.     End
7.   End
8. End
T(n) = time complexity function = n² (Line 3) + n³ + n³ (Line 5: one multiply and one add per innermost iteration) = O(n³).
