EE/CSCI 451: Parallel and Distributed Computation
1 EE/CSCI 451: Parallel and Distributed Computation
Lecture #11, 2/21/2017
Xuehai Qian, University of Southern California
2 Outline
Midterm 1: 2/26, in class, 2-3:20pm
3 Example memory system performance (1)
Memory: latency = 10 ns; (peak) bandwidth = 64 bits at 1 GHz (64 Gbits/sec or 8 GB/sec, the bus frequency)
Processor: 2 GHz; 1 word = 64 bits (8 bytes); unit of data access = 1 word; cycle time = 0.5 ns (processor cycle)
Assume a 2-issue superscalar, single-cycle processor with 2 double-precision multiply-add FPUs (2 multipliers, 2 adders)
Peak performance of the processor = 2 GHz (clock rate) x 4 (total # of FP ops/cycle) = 8 GFlops/s (raw compute power)
Peak performance can also be computed as 2 GHz (clock rate) x 2 (pipelines) = 4 GFlops/s
4 Example memory system performance (2)
Example: inner product a·b = sum_{i=1}^{n} a_i b_i (data in external memory)
2 data fetches (Read a_i, Read b_i) for each multiply, add
Processor can do 8 mult or add per ns = 4 FP ops/cycle
Note: issue bandwidth limits performance to 2 FP ops/cycle
In the best case, i.e., data is streamed (no pipeline stalls):
Sustained performance (possible best case) = 2 FP ops (Mul, Add) every 4 processor cycles (2 ns) = 1 GFlops/s
Processor-memory bandwidth determines the possible best case sustained performance
5 Example memory system performance (3)
Latency = 10 ns; (peak) bandwidth = 64 bits at 1 GHz (64 Gbits/sec or 8 GB/sec)
Repeat n times:
1. read a_i (10 ns)
2. read b_i (10 ns)
3. multiply (use register data)
4. add (use register data)
Overlap instructions 3, 4 with instructions 1, 2 (pipelining)
Sustained performance (possible worst case) = total # of FP ops / total time = 2n / (20n x 10^-9 s) = 0.1 GFlops/s
Memory access time determines the possible worst case sustained performance
6 Example memory system performance (4)
Inner product, summarizing the previous slides:
Processor peak = 8 GFlops/s (raw compute power)
Processor peak = 4 GFlops/s (processor organization)
(Best case) sustained = 1 GFlops/s (memory bandwidth)
(Worst case) sustained = 0.1 GFlops/s (memory latency)
A memory bound computation
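The four rates follow directly from the clock rates and memory timings; a small script (a sketch, using only the figures stated on these slides) reproduces them:

```python
# Peak and sustained rates for the inner-product example (figures from the slides).
clock_ghz = 2.0          # processor clock, GHz
fp_ops_per_cycle = 4     # 2 multiply-add FPUs -> 4 FP ops per cycle
pipelines = 2            # 2-issue superscalar -> 2 FP ops per cycle issued

peak_raw = clock_ghz * fp_ops_per_cycle   # 8 GFlops/s (raw compute power)
peak_org = clock_ghz * pipelines          # 4 GFlops/s (processor organization)

# Best case: data streamed, 2 FP ops (mul + add) per two 64-bit reads
# at the 1 GHz bus rate -> 2 ops every 2 ns.
best_sustained = 2 / 2.0                  # 1 GFlops/s

# Worst case: each of the 2 reads pays the full 10 ns latency -> 2 ops per 20 ns.
worst_sustained = 2 / 20.0                # 0.1 GFlops/s

print(peak_raw, peak_org, best_sustained, worst_sustained)
```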
7 Cache (1)
Improving effective memory latency
DRAM: 4 GB, 10 ns, 1 GHz; processor: 2 GHz, 4-way superscalar; cache: small (16 MB), fast (2 GHz)
Why it works: data reuse
Accessing DRAM is expensive (DRAM latency); repeatedly use the data in the cache (fast)
Hit ratio: fraction of the memory references served by the cache
8 Cache (3) Locality of references
Spatial locality: if location i is referenced at time t, then locations near i are referenced in a small window of time following t
Temporal locality: in a small window of time, repeated references to a small set of data items
9 Cache (4)
Example: bubble sort
for i from 1 to N
  for j from 0 to N-2
    R1 <- a[j]; R2 <- a[j+1]
    if R1 > R2 then swap(R1, R2)
    a[j] <- R1; a[j+1] <- R2
  end for
end for
Assumptions: direct mapped; first-in-first-out policy; write-through policy; N = 8; cache line size = 1 data element; cache size = 2 data elements
Each pass (i = 1, i = 2, ...):
Read a[0], a[1]: miss, miss
Read a[1], a[2]: hit, miss
Read a[2], a[3]: hit, miss
...
Read a[6], a[7]: hit, miss
Cache hit ratio = read hit ratio = (# of times data found in cache) / (total # of accesses) = (N-2) / (2(N-1)) = 6/14 ≈ 0.43
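A quick simulation (a sketch; it tracks only the addresses of the read stream above through a direct-mapped, 2-entry cache, since values do not affect hits) confirms the ratio:

```python
def bubble_sort_hit_ratio(n, passes):
    """Hit ratio of the read stream a[j], a[j+1] per bubble-sort pass,
    through a direct-mapped cache: 2 lines, 1 element/line (address mod 2)."""
    cache = [None, None]           # line index -> address of cached element
    hits = accesses = 0
    for _ in range(passes):
        for j in range(n - 1):     # each inner iteration reads a[j], a[j+1]
            for addr in (j, j + 1):
                accesses += 1
                line = addr % 2
                if cache[line] == addr:
                    hits += 1
                else:
                    cache[line] = addr   # miss: fill the line
    return hits / accesses

print(round(bubble_sort_hit_ratio(8, 8), 2))   # -> 0.43
```

Every pass, including the first, scores N-2 hits out of 2(N-1) reads, so the ratio is the same no matter how many passes run.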
10 Data layout and data access pattern (1)
Storing a 2-dimensional array in memory:
Row major order: A(i, j) -> Memory(i·n + j), 0 ≤ i, j < n
Column major order: A(i, j) -> Memory(i + n·j), 0 ≤ i, j < n
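The two mappings are one-line functions (a minimal sketch for an n x n array); walking along a row is unit-stride under row-major order but stride-n under column-major order, which is what matters for spatial locality:

```python
def row_major(i, j, n):
    # A(i, j) at offset i*n + j: row i is contiguous in memory
    return i * n + j

def col_major(i, j, n):
    # A(i, j) at offset i + n*j: column j is contiguous in memory
    return i + n * j

n = 4
print([row_major(2, j, n) for j in range(n)])   # -> [8, 9, 10, 11]
print([col_major(2, j, n) for j in range(n)])   # -> [2, 6, 10, 14]
```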
11 Shuffle Network
Perfect shuffle (PS) connection: a link exists between input i and output j if
j = 2i,         0 ≤ i < p/2
j = 2i + 1 - p, p/2 ≤ i < p
Equivalently, j = left rotation (circular left shift) of the binary representation of i (p = power of 2)
For p = 8:
000 = left_rotate(000)    100 = left_rotate(010)
001 = left_rotate(100)    101 = left_rotate(110)
010 = left_rotate(001)    110 = left_rotate(011)
011 = left_rotate(101)    111 = left_rotate(111)
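The equivalence between the 2i / 2i+1-p rule and the left rotation can be checked exhaustively (a sketch for p = 8, 3-bit labels):

```python
def left_rotate(i, k):
    # circular left shift of the k-bit binary representation of i
    return ((i << 1) | (i >> (k - 1))) & ((1 << k) - 1)

def perfect_shuffle(i, p):
    # link rule from the slide: j = 2i for i < p/2, else j = 2i + 1 - p
    return 2 * i if i < p // 2 else 2 * i + 1 - p

p, k = 8, 3
for i in range(p):
    assert perfect_shuffle(i, p) == left_rotate(i, k)
print([perfect_shuffle(i, p) for i in range(p)])   # -> [0, 2, 4, 6, 1, 3, 5, 7]
```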
12 Shuffle Exchange Network (shown for n = 8)
Shuffle connection: i -> 2i mod n for 0 ≤ i < n/2; i -> (2i + 1) mod n for n/2 ≤ i ≤ n - 1
Exchange connection: 2i <-> 2i + 1
13 Example: n = 8, 3-bit index
Shuffle connection: circular left shift (e.g., 4 = 100 -> 001 = 1)
Exchange connection: 2i <-> 2i + 1 (complement the LSB)
Diameter: O(log n) (discussed later)
14 Routing in Shuffle Exchange Network (1)
Source x = x_{k-1} ... x_0, destination d = d_{k-1} ... d_0
y <- x {current location}
i <- 1
While i ≤ k
  Shuffle y {rotate left}
  Compare the LSB of y with bit (k - i) of the destination d
  If the bits are the same, do not Exchange; else Exchange {complement y_0}
  i <- i + 1
End
Total # of hops ≤ 2k (= 2 log_2 n), via intermediate nodes from source x to destination d
15 Routing in Shuffle Exchange Network (2)
Example: source x = x_2 x_1 x_0 = 000, destination d = d_2 d_1 d_0 = 110, k = 3
(Comparing the LSB of y with bit 2 of the destination asks: y_0 = d_2?)
i = 1: Shuffle 000 -> 000; LSB (0) differs from d_2 (1), so Exchange -> 001
i = 2: Shuffle 001 -> 010; LSB (0) differs from d_1 (1), so Exchange -> 011
i = 3: Shuffle 011 -> 110; LSB (0) equals d_0 (0), no Exchange
Position at the end of the first iteration: 001
End of the i-th iteration: y = x_{k-1-i} ... x_0 d_{k-1} ... d_{k-i}
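The routing procedure is easy to simulate (a sketch; route() returns the final node and the node reached after each shuffle or exchange hop):

```python
def left_rotate(y, k):
    # circular left shift of the k-bit representation of y
    return ((y << 1) | (y >> (k - 1))) & ((1 << k) - 1)

def route(x, d, k):
    """Shuffle-exchange routing from source x to destination d (k-bit labels)."""
    y, hops = x, []
    for i in range(1, k + 1):
        y = left_rotate(y, k)                 # shuffle hop
        hops.append(('S', y))
        if (y & 1) != ((d >> (k - i)) & 1):   # LSB of y vs bit k-i of d
            y ^= 1                            # exchange hop: complement LSB
            hops.append(('E', y))
    return y, hops

final, hops = route(0b000, 0b110, 3)
print(format(final, '03b'), len(hops))   # -> 110 5
```

Exhaustively checking every (source, destination) pair for k = 3 confirms the theorem on the next slides: the route always arrives, in at most 2k hops.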
16 Routing in Shuffle Exchange Network (3)
[figure: the routing path for source x_2 x_1 x_0 = 000, destination d_2 d_1 d_0 = 110, k = 3]
17 Routing in Shuffle Exchange Network (4)
Theorem: In a shuffle exchange network with n = 2^k nodes, data from any source to any destination can be routed in at most 2 log_2 n steps.
18 Omega Network (1)
p inputs, p outputs
log_2 p stages, each stage having p/2 switches
Each stage = a perfect shuffle connection followed by a column of 2x2 switches (each set to pass-through or cross-over)
19 Omega Network (2)
Omega network properties:
Multistage network
Cost ~ (p/2) log_2 p (number of switches); note: in actual hardware design, routing cost dominates!
An Omega network can realize 2^((p/2) log_2 p) = p^(p/2) < p! switch settings
All p! permutations can not be realized
Unique (only one) path from any input to any output
20 Omega Network (3)
Example of blocking: one of the messages (010 to 111 or 110 to 100) is blocked at link AB
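Omega routing can be simulated with destination-tag routing: at each stage, shuffle, then let the switch set the LSB of the position to the next destination bit. Under that scheme (a sketch, assuming the 8-input network of the figure), the two messages above land on the same stage-1 output and so contend for one link:

```python
def left_rotate(y, k):
    return ((y << 1) | (y >> (k - 1))) & ((1 << k) - 1)

def omega_path(src, dst, k):
    """Destination-tag routing: positions occupied after each of the k stages."""
    y, path = src, []
    for i in range(1, k + 1):
        y = left_rotate(y, k)                   # perfect shuffle into stage i
        y = (y & ~1) | ((dst >> (k - i)) & 1)   # switch sets LSB to bit k-i of dst
        path.append(y)
    return path

a = omega_path(0b010, 0b111, 3)   # -> [5, 3, 7]
b = omega_path(0b110, 0b100, 3)   # -> [5, 2, 4]
print(a, b)   # both occupy position 5 (101) after stage 1 -> one is blocked
```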
21 Congestion in a Network (1)
Given a routing protocol and a data communication pattern (e.g., a permutation):
Congestion = max over nodes { # of paths passing through the node }
22 Congestion in a Network (2)
Interconnection network = graph + routing algorithm
Assume the routing algorithm provides unique (exactly one path) communication from i to j, for all i, j
For a given permutation:
Congestion at node k = # of paths that pass through k
Congestion in the network = max over nodes k { # of paths that pass through k }
Worst case = max over all permutations { congestion in the network }
23 CLOS network (2)
Structure of CLOS network [figure: three-stage n x n network with control]
24 CLOS network (3)
Stage i to stage i + 1 connections: any box connects to all boxes in the next stage
3-stage network built from sqrt(n) x sqrt(n) crossbars:
Number of switches (boxes) = 3 sqrt(n)
Cost of a sqrt(n) x sqrt(n) crossbar = O(n)  (an n x n crossbar costs O(n^2))
Total cost = O(sqrt(n) · n) = O(n^(3/2))
Note: a CLOS network can realize all n! permutations
25 Butterfly network (1)
n = 2^k for some k; 8-input butterfly network shown (stage 0 through stage 3)
log_2 n + 1 stages
Power-of-2 connections: 2^l, l = 0, 1, ..., log_2 n - 1
Stage l, 0 ≤ l < log_2 n:
  node i -> node i in stage l + 1 (straight edge)
  node i -> node (i with bit l complemented) in stage l + 1 (cross edge)
26 Butterfly network (2)
n-input butterfly network:
Total number of nodes = n (log_2 n + 1)   [nodes per stage x # of stages]
Total number of edges = 2 n log_2 n   [2 edges/node for each node in the first log_2 n stages]
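Building the graph explicitly from the straight/cross edge rule on the previous slide reproduces both counts (a sketch for n = 8):

```python
def butterfly_edges(n, k):
    """Edges of an n-input butterfly (n = 2**k); nodes labelled (stage, index)."""
    edges = []
    for l in range(k):                    # stages 0 .. k-1 each connect forward
        for i in range(n):
            edges.append(((l, i), (l + 1, i)))              # straight edge
            edges.append(((l, i), (l + 1, i ^ (1 << l))))   # cross edge: flip bit l
    return edges

n, k = 8, 3
edges = butterfly_edges(n, k)
nodes = n * (k + 1)
print(nodes, len(edges))   # -> 32 48, i.e. n(log2 n + 1) nodes, 2 n log2 n edges
```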
27 Mesh-connected Network
1-D mesh (without wraparound); with wraparound: ring (1-D torus), nodes 0, 1, ..., p-1
2-D mesh (without wraparound); with wraparound: 2-D torus
k-dimensional mesh: p^(1/k) x p^(1/k) x ... x p^(1/k) (k times)
Number of connections per node ≤ 2k
28 Hypercube Network (1)
0-D, 1-D, 2-D, 3-D, 4-D hypercubes
Construction of a hypercube from hypercubes of lower dimension: take two copies of the (d-1)-D hypercube and add an edge between each pair of corresponding nodes
29 Hypercube Network (2)
In general, for a k-dimensional hypercube:
p = total number of nodes = 2^k
Node i = i_{k-1} i_{k-2} ... i_0 (binary label)
k connections per node: complement a bit of i
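Neighbors and distances fall straight out of the bit labels (a sketch; the distance between two nodes is the number of differing bits, so the diameter is k):

```python
def neighbors(i, k):
    # the k neighbors of node i: complement one bit of the label
    return [i ^ (1 << b) for b in range(k)]

def distance(i, j):
    # hops between i and j: correct one differing bit per hop
    return bin(i ^ j).count('1')

k = 3
print(neighbors(0b101, k))       # -> [4, 7, 1]
print(distance(0b000, 0b111))    # -> 3 (= k, the diameter)
```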
30 Tree-based Network (1)
Static tree network: all nodes are processing nodes
Dynamic tree network: processing nodes at the leaves, switching nodes internally
Height: log p, where p = total number of processing nodes
31 Tree-based Network (2)
Fat tree (16 processing nodes)
32 Performance metrics (1)
Diameter: maximum distance between any two processing nodes in the network
diameter = max over (i, j) { distance(i, j) }
distance(i, j) = min path length between i and j = length of the shortest path between i and j
length of a path = # of edges in the path
33 Performance metrics (6)
Bisection width (2), example: a partition of a sqrt(p) x sqrt(p) 2-D mesh (no wraparound) into two halves cuts sqrt(p) links
34 Performance metrics (8)
Cost of a static network = number of communication links in the network
Examples:
Tree: p - 1
1-D mesh (no wraparound): p - 1
d-dimensional wraparound mesh: dp
Hypercube: (p log_2 p) / 2
k-ary d-cube: a d-dimensional array with k elements in each dimension; number of nodes p = k^d; cost: dp
35 Summary

Network                           Diameter            Bisection width   Cost (# of links)
Completely connected              1                   p^2 / 4           p(p-1)/2
Star                              2                   1                 p - 1
Complete binary tree              2 log((p+1)/2)      1                 p - 1
1-D mesh, no wraparound           p - 1               1                 p - 1
2-D mesh, no wraparound           2(sqrt(p) - 1)      sqrt(p)           2(p - sqrt(p))
2-D wraparound mesh               2 floor(sqrt(p)/2)  2 sqrt(p)         2p
Hypercube                         log p               p/2               (p log p)/2
Wraparound k-ary d-cube (p=k^d)   d floor(k/2)        2 k^(d-1)         dp
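For concreteness, two rows of the table evaluated at p = 16 (a sketch; the formulas are the ones tabulated above):

```python
import math

def mesh2d_no_wrap(p):
    s = math.isqrt(p)                 # side of the sqrt(p) x sqrt(p) mesh
    return {'diameter': 2 * (s - 1), 'bisection': s, 'cost': 2 * (p - s)}

def hypercube(p):
    d = int(math.log2(p))             # dimension = log2 p
    return {'diameter': d, 'bisection': p // 2, 'cost': p * d // 2}

print(mesh2d_no_wrap(16))   # -> {'diameter': 6, 'bisection': 4, 'cost': 24}
print(hypercube(16))        # -> {'diameter': 4, 'bisection': 8, 'cost': 32}
```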
36 Shared Address Space Programming (1)
All threads have access to the same global, shared memory
Threads can also have their own private data
The programmer is responsible for synchronizing access to (protecting) globally shared data, to ensure correctness of the program
37 Shared Address Space Programming (5)
Example 1 (A, B, C in shared memory):
Thread 1:
  for k from 1 to n: C(1,1) = C(1,1) + A(1,k) · B(k,1)
In general, thread i·n + j computes C(i, j):
  for k from 1 to n: C(i,j) = C(i,j) + A(i,k) · B(k,j)
38 Shared Address Space Programming (7)
Example 2: matrix multiplication (A, B, C in shared memory); each of 4 threads computes one n/2 x n/2 quadrant of C
Thread 1: i from 1 to n/2, j from 1 to n/2:        C(i,j) = A(i,:) · B(:,j)
Thread 2: i from 1 to n/2, j from n/2+1 to n:      C(i,j) = A(i,:) · B(:,j)
Thread 3: i from n/2+1 to n, j from 1 to n/2:      C(i,j) = A(i,:) · B(:,j)
Thread 4: i from n/2+1 to n, j from n/2+1 to n:    C(i,j) = A(i,:) · B(:,j)
Input data shared; no interaction among threads
39 Synchronization method: Barrier (1)
Barrier objects can be created at certain places in the program
Any thread which reaches the barrier stops until all the threads have reached the barrier
(Example: T1, T2, T3, T4 each wait at the barrier until all four have reached it)
40 Shared Variable Access (1)
Threads acquire a lock to modify a shared variable, and release the lock when done
Only 1 thread can acquire a lock at any time
Example execution sequence: T1's lock succeeds; it accesses the shared variable and releases the lock; T2's and T3's lock attempts are unsuccessful, so they wait, then take turns acquiring the lock
41 Shared Variable Access (2)
Example: find the max between two threads; each thread has a local value (i in thread 1, j in thread 2)
Initialize Max to 0 (Max is a shared variable)
Thread 1:
  Acquire_lock(Max)
  If (i > Max) Max = i
  Release_lock(Max)
Thread 2:
  Acquire_lock(Max)
  If (j > Max) Max = j
  Release_lock(Max)
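In Python's threading module the same pattern looks like this (a sketch; the compare and the update must both sit inside the lock so the read-test-write is atomic as a unit):

```python
import threading

max_val = 0                    # the shared variable Max
lock = threading.Lock()

def update_max(local):
    global max_val
    with lock:                 # Acquire_lock(Max) ... Release_lock(Max)
        if local > max_val:    # read and conditional write protected together
            max_val = local

threads = [threading.Thread(target=update_max, args=(v,)) for v in (17, 42)]
for t in threads: t.start()
for t in threads: t.join()
print(max_val)   # -> 42
```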
42 Correct Parallel Program
Data -> Parallel Program (on a Parallel Platform) -> Output
For all data inputs, for all execution sequences, the correct output is produced
43 A Simple Model of Shared Address Space Parallel Machine (PRAM) (1)
p processors (0, ..., p-1) connected to a shared memory
1 unit of time = local access = shared memory access
Synchronous model; parallel time = total number of cycles
(Pthreads programming model? That is asynchronous shared memory)
44 PRAM (2): Random Access Machine (RAM)
Random access: access to any memory location (direct access)
Time: access to memory = 1 unit of time; arithmetic/logic operation = 1 unit of time
Serial time complexity T_s(n); example: merge sort, T_s(n) = O(n log n)
45 PRAM (5)
PRAM is a synchronous model: a single clock drives processors 0, ..., p-1
Each processor is a RAM, running a local program acting on data in shared memory
Synchronous execution: for all i, the i-th instruction in the execution sequence is executed in the i-th cycle by all the processors
46 Adding on PRAM (1)
Simple shared memory algorithm for adding n numbers A(0), ..., A(n-1)
Output = sum_{i=0}^{n-1} A(i), left in A(0)
47 Adding on PRAM (2): key idea, for n = 8 (i = iteration #)
End of i = 0: active processors = {0, 2, 4, 6}, each holding a partial sum of 2 elements
End of i = 1: active processors = {0, 4}, each holding a partial sum of 4 elements
End of i = 2: active processor = {0}, holding the sum of all 8 elements
48 Adding on PRAM (3)
Example: n = 8; active processors per iteration i = 0, 1, 2: {0, 2, 4, 6}, {0, 4}, {0}
Total # of additions performed = n/2 + n/4 + ... + 1 = n - 1
= total # of additions performed by a serial program
49 Adding on PRAM (4)
Algorithm (program in processor j, 0 ≤ j < n):
1. Do i = 0 to log_2 n - 1
2.   If j = k · 2^(i+1) for some k in N, then A(j) <- A(j) + A(j + 2^i)
3. end
Result in A(0); note: A is shared among all the processors
Synchronous operation: e.g., all the processors execute instruction 2 during the same cycle, log_2 n times
N = set of natural numbers = {0, 1, ...}
Parallel time = O(log n)
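A sequential simulation of the PRAM algorithm (a sketch; the inner loop plays the role of the processors that are active in cycle i):

```python
import math

def pram_add(values):
    """Simulate the O(log n) PRAM reduction; len(values) must be a power of 2."""
    A = list(values)
    n = len(A)
    for i in range(int(math.log2(n))):
        # processors j = k * 2^(i+1) are active in this cycle
        for j in range(0, n, 2 ** (i + 1)):
            A[j] += A[j + 2 ** i]
    return A[0]   # the sum accumulates in A(0)

print(pram_add(range(8)))   # -> 28
```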
50 Addition on PRAM, p < n processors
1. Each processor adds within a block of size n/p: O(n/p) time
2. Apply Algorithm 1 to the p partial sums: O(log p) time
T_p = parallel time = O(n/p + log p)
T_s = serial time = O(n)
51 Synchronous and Asynchronous
Synchronous (e.g., PRAM), one clock:
  Processor 0: y <- 0; Do i = 0 to 499: y <- y + A(i); End; y <- x + y; OUTPUT y; STOP
  Processor 1: x <- 0; Do i = 500 to 999: x <- x + A(i); End; STOP
Asynchronous (e.g., Pthreads):
  Thread 1: y <- 0; Do i = 0 to 499: y <- y + A(i); End; Barrier; y <- x + y; OUTPUT y
  Thread 2: x <- 0; Do i = 500 to 999: x <- x + A(i); End; Barrier
52 Addition: Pthreads model?
Instruction execution is NOT synchronized
Thread j:
  Do i = 0 to log_2 n - 1
    If j = k · 2^(i+1) for some k in N, then A(j) <- A(j) + A(j + 2^i)
    BARRIER
  end
Correct output? How to measure time?
Number of active threads in iteration i = n / 2^(i+1)
Complete each iteration before proceeding to the next iteration
Within each iteration threads may execute asynchronously
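The barrier-per-iteration structure can be reproduced with Python threads (a sketch; one thread per even index, and threading.Barrier standing in for the Pthreads barrier — every thread waits at the barrier each iteration, active or not):

```python
import math, threading

def pthread_style_sum(values):
    A = list(values)
    n = len(A)
    k = int(math.log2(n))
    workers = list(range(0, n, 2))            # threads that are ever active
    barrier = threading.Barrier(len(workers))

    def worker(j):
        for i in range(k):
            if j % (2 ** (i + 1)) == 0:       # active in iteration i?
                A[j] += A[j + 2 ** i]
            barrier.wait()   # finish iteration i before anyone starts i+1

    threads = [threading.Thread(target=worker, args=(j,)) for j in workers]
    for t in threads: t.start()
    for t in threads: t.join()
    return A[0]

print(pthread_style_sum(range(8)))   # -> 28
```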
53 OpenMP Programming Model (2)
Directive based parallel programming:
#pragma omp directive [clause list]
/* structured block */
Structured block = a section of code which is grouped together
OpenMP directives are not part of a programming language; they can be included in various programming languages (e.g., C, C++, Fortran)
54 OpenMP Programming Model (3)
Fork-join model:
Fork: the master thread creates a team of parallel threads
Join: when the team threads complete the statements in the parallel region, they synchronize and terminate, leaving only the master thread
55 OpenMP Directives (1)
OpenMP executes serially until a parallel directive
Parallel region construct: the same code will be executed by multiple threads; the fundamental OpenMP parallel construct
Example:
#pragma omp parallel [clause list]
{
  task();
}
The master thread and each team thread execute task()
The clause list specifies parameters of parallel execution: # of threads, private variables, shared variables
Implied barrier at the end of the parallel region
56 OpenMP Directives (2)
Work-sharing constructs:
Divide the execution of the enclosed code region among the members of the team
Implied barrier at the end of a work sharing construct
Types of work-sharing constructs:
for - shares iterations of a loop across the team; represents a type of "data parallelism"
sections - breaks work into separate, discrete sections, each executed by a thread; represents a type of "functional parallelism"
57 Routing Mechanisms (1)
1. Store and Forward Routing
Message length = m words; number of hops = l
Store: each intermediate node receives the entire message from its predecessor
Forward: it then forwards the message to the next node
Example: l = 3 (P0 -> P1 -> P2 -> P3); each hop takes (t_h + m · t_w)
Total time for communication = t_s + (t_h + m · t_w) · l
58 Cut Through Routing (3)
Pipelined data communication:
t_comm = t_s + l · t_h + t_w · m
(pipeline delay l · t_h, plus the communication time t_w · m on a link)
Note: there is a small overhead in processing a FLIT at each intermediate node, smaller than processing a packet; t_h is smaller than in the store and forward scenario
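Plugging sample numbers into the two cost formulas shows why cut-through wins for long messages (a sketch; the parameter values are illustrative, not taken from the slides):

```python
def store_and_forward(ts, th, tw, m, l):
    # the whole m-word message is retransmitted on each of the l hops
    return ts + (th + m * tw) * l

def cut_through(ts, th, tw, m, l):
    # only the header pays the per-hop time; the body pipelines behind it
    return ts + l * th + m * tw

ts, th, tw, m, l = 10.0, 1.0, 0.5, 1000, 4   # illustrative values
print(store_and_forward(ts, th, tw, m, l))   # -> 2014.0
print(cut_through(ts, th, tw, m, l))         # -> 514.0
```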
59 Cut Through Routing (7)
Minimizing l (the number of hops) is hard: the program has little control over program-to-processor mapping, and machines (platforms) use various routing techniques to minimize congestion (e.g., randomized routing)
Per-hop time is usually relatively small compared with t_s and t_w · m for large m
Simple communication cost model: t_comm = t_s + t_w · m
Same amount of time to communicate between any two nodes (as if fully connected)
Use this cost model to design and optimize, instead of parallel architecture (mesh, hypercube) specific algorithms
The simplified cost model does not take congestion into account
60 LogP Model (3)
P processor/memory (P-M) modules connected by a communication network
Parameters: L (latency), o (overhead), g (gap), P (number of processors)
61 LogP Model (4)
L: an upper bound on the latency, or delay, incurred in communicating a message containing a word from its source module to its target module
o: the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message
g: the gap, defined as the minimum time interval between consecutive message transmissions or receptions at a processor
P: the number of processors/memory modules
62 Program and Data Mapping (1)
Parallel program = collection of processes + interaction among processes
Each process = computation + data access
Classic problem: given a parallel program, embed (map) processes and data onto a given parallel architecture such that:
- communication cost is minimized
- overall execution time is minimized
63 Program and Data Mapping (2)
A simple abstraction (the graph embedding problem):
Parallel program: G(V, E), node = process, edge = communication
maps to
Parallel architecture: G'(V', E'), node = processor, edge = interconnection
64 Program and Data Mapping (3)
Graph embedding problem: given parallel program G(V, E) (undirected) and parallel architecture G'(V', E'):
Function f: V -> V' (e.g., a -> f(a), b -> f(b))
Function g: E -> paths in G'; for every edge (a, b) in E, g specifies a path in G' whose end points are f(a) and f(b)
Note:
1. V' can be larger than V
2. A vertex in V' may correspond to more than one vertex in V
65 Program and Data Mapping (5)
Metrics:
Congestion: max over edges in E' of the number of edges in E mapped onto that edge
Dilation: each edge in E maps to a path in G'; dilation = max path length in G'
Expansion: |V'| / |V| (it is also possible that |V'| < |V|, for example with virtualization)
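A classic embedding with dilation 1 maps a ring of 2^k processes onto a k-dimensional hypercube via the reflected Gray code: consecutive labels differ in exactly one bit, so every program edge lands on a single hypercube link (a sketch):

```python
def gray(i):
    # reflected Gray code: consecutive values differ in exactly one bit
    return i ^ (i >> 1)

def ring_dilation(k):
    """Max hypercube distance between the images of adjacent ring nodes."""
    n = 1 << k
    return max(
        bin(gray(i) ^ gray((i + 1) % n)).count('1')   # bits that differ
        for i in range(n)
    )

print(ring_dilation(3))   # -> 1: every ring edge maps onto one hypercube edge
```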
66 Scalability (2)
Speedup = serial time (on a uniprocessor system) / parallel time using p processors
If speedup = O(p), then it is a scalable solution.
67 Amdahl's Law (1)
Amdahl's Law: a limit on the speedup achievable when a program is run on a parallel machine
Given an input program with serial portion S and parallelizable portion P:
Time on a uni-processor machine: 1 = S + P
Time on a parallel machine: S + P/f, where f = speedup factor on the parallelizable portion
Speedup = 1 / (S + P/f)
68 Scaled Speedup (Gustafson's Law) (2)
Amdahl's Law: fixed amount of computation
  Serial time = 1 = S + (1 - S)
  Parallel time = S + (1 - S)/p, where p = number of processors
Gustafson's Law: increase p and the amount of computation together
  Parallel time = S + P = 1, with P = 1 - S
  Serial time = S + (1 - S) · p
If parallelism scales linearly with p:
  Scaled speedup = serial time / parallel time = (S + (1 - S) p) / 1 = S + (1 - S) p
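The two laws give very different answers for the same serial fraction (a sketch; the values S = 0.1, p = 10 are illustrative):

```python
def amdahl_speedup(S, p):
    # fixed problem size: only the parallel part shrinks by a factor of p
    return 1.0 / (S + (1.0 - S) / p)

def gustafson_speedup(S, p):
    # scaled problem size: the parallel part grows with p
    return S + (1.0 - S) * p

print(round(amdahl_speedup(0.1, 10), 2))   # -> 5.26
print(gustafson_speedup(0.1, 10))          # -> 9.1
```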
69 Performance (1)
Efficiency. Question: if we use p processors, is speedup = p?
Efficiency = fraction of time a processor is usefully employed during the computation
(Typical execution on a parallel machine: each processor splits its time between useful work and idle/overhead)
E = speedup / # of processors used
E is the average efficiency over all the processors; the efficiency of each processor can differ from the average value
70 Performance (3)
Cost = total amount of work done by a parallel system = parallel execution time x number of processors = T_p · p
Cost is also called the processor-time product
COST OPTIMAL (or WORK OPTIMAL) parallel algorithm: total work done = serial complexity of the problem
71 Performance Analysis (1)
Asymptotic analysis: Big O notation (order notation)
Worst case execution time of an algorithm; an upper bound on the growth rate of the execution time
Example: n x n matrix multiplication
1. Do i = 1 to n
2.   Do j = 1 to n
3.     C(i, j) <- 0
4.     Do k = 1 to n
5.       C(i, j) <- C(i, j) + A(i, k) · B(k, j)
6.     End
7.   End
8. End
T(n) = time complexity function = n^2 (line 3) + n^3 + n^3 (line 5: multiply and add) = O(n^3)
More informationTDT4260/DT8803 COMPUTER ARCHITECTURE EXAM
Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationLecture: Interconnection Networks
Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet
More informationCS 498 Hot Topics in High Performance Computing. Networks and Fault Tolerance. 9. Routing and Flow Control
CS 498 Hot Topics in High Performance Computing Networks and Fault Tolerance 9. Routing and Flow Control Intro What did we learn in the last lecture Topology metrics Including minimum diameter of directed
More informationVIII. Communication costs, routing mechanism, mapping techniques, cost-performance tradeoffs. April 6 th, 2009
VIII. Communication costs, routing mechanism, mapping techniques, cost-performance tradeoffs April 6 th, 2009 Message Passing Costs Major overheads in the execution of parallel programs: from communication
More informationCOMP4300/8300: Overview of Parallel Hardware. Alistair Rendell. COMP4300/8300 Lecture 2-1 Copyright c 2015 The Australian National University
COMP4300/8300: Overview of Parallel Hardware Alistair Rendell COMP4300/8300 Lecture 2-1 Copyright c 2015 The Australian National University 2.1 Lecture Outline Review of Single Processor Design So we talk
More informationParallel Systems Prof. James L. Frankel Harvard University. Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved.
Parallel Systems Prof. James L. Frankel Harvard University Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Architectures SISD (Single Instruction, Single Data)
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationLecture 2: Topology - I
ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationCommunication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.
Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance
More informationSHARED MEMORY VS DISTRIBUTED MEMORY
OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors
More informationHypercubes. (Chapter Nine)
Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is
More informationLecture 3: Sorting 1
Lecture 3: Sorting 1 Sorting Arranging an unordered collection of elements into monotonically increasing (or decreasing) order. S = a sequence of n elements in arbitrary order After sorting:
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationCS/COE1541: Intro. to Computer Architecture
CS/COE1541: Intro. to Computer Architecture Multiprocessors Sangyeun Cho Computer Science Department Tilera TILE64 IBM BlueGene/L nvidia GPGPU Intel Core 2 Duo 2 Why multiprocessors? For improved latency
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationCOMP4300/8300: Overview of Parallel Hardware. Alistair Rendell
COMP4300/8300: Overview of Parallel Hardware Alistair Rendell COMP4300/8300 Lecture 2-1 Copyright c 2015 The Australian National University 2.2 The Performs: Floating point operations (FLOPS) - add, mult,
More informationCS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010
CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationBlocking SEND/RECEIVE
Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F
More informationParallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. Fall Parallel Sorting
Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationLast Time. Intro to Parallel Algorithms. Parallel Search Parallel Sorting. Merge sort Sample sort
Intro to MPI Last Time Intro to Parallel Algorithms Parallel Search Parallel Sorting Merge sort Sample sort Today Network Topology Communication Primitives Message Passing Interface (MPI) Randomized Algorithms
More informationParallel Computing Platforms
Parallel Computing Platforms Routing, Network Embedding John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 14-15 4,11 October 2018 Topics for Today
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationCommunication Performance in Network-on-Chips
Communication Performance in Network-on-Chips Axel Jantsch Royal Institute of Technology, Stockholm November 24, 2004 Network on Chip Seminar, Linköping, November 25, 2004 Communication Performance In
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing George Karypis Sorting Outline Background Sorting Networks Quicksort Bucket-Sort & Sample-Sort Background Input Specification Each processor has n/p elements A ordering
More informationa. Assuming a perfect balance of FMUL and FADD instructions and no pipeline stalls, what would be the FLOPS rate of the FPU?
CPS 540 Fall 204 Shirley Moore, Instructor Test November 9, 204 Answers Please show all your work.. Draw a sketch of the extended von Neumann architecture for a 4-core multicore processor with three levels
More informationLecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: ABCs of Networks Starting Point: Send bits between 2 computers Queue
More informationA Multiprocessor Memory Processor for Efficient Sharing And Access Coordination
1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationPerformance Optimization Part II: Locality, Communication, and Contention
Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine
More informationMemory Hierarchy. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Memory Hierarchy Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Time (ns) The CPU-Memory Gap The gap widens between DRAM, disk, and CPU speeds
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationLecture 8 Parallel Algorithms II
Lecture 8 Parallel Algorithms II Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Original slides from Introduction to Parallel
More informationLecture 16: On-Chip Networks. Topics: Cache networks, NoC basics
Lecture 16: On-Chip Networks Topics: Cache networks, NoC basics 1 Traditional Networks Huh et al. ICS 05, Beckmann MICRO 04 Example designs for contiguous L2 cache regions 2 Explorations for Optimality
More informationCS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012
CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationParallel Numerics, WT 2017/ Introduction. page 1 of 127
Parallel Numerics, WT 2017/2018 1 Introduction page 1 of 127 Scope Revise standard numerical methods considering parallel computations! Change method or implementation! page 2 of 127 Scope Revise standard
More informationParallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. November Parallel Sorting
Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. November 2014 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks
More informationLecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC
Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview
More informationRecall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms
CS252 Graduate Computer Architecture Lecture 16 Multiprocessor Networks (con t) March 14 th, 212 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
More informationEE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100
EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100 1 [10 points] 1. Task parallelism: The computations in a parallel algorithm can be split into a set of tasks for concurrent execution. Task
More information