PCS - Part 1: Introduction to Parallel Computing
- Georgia Holland
1 PCS - Part 1: Introduction to Parallel Computing Institute of Computer Engineering University of Lübeck, Germany Baltic Summer School, Tartu 2009
2 Part 1 - Overview Reasons for parallel computing: goals and limitations, criteria for high performance computing, overview of parallel computer architectures. Examples of problems demanding parallel processing: problems that are easy and hard to parallelize, influence of algorithmic complexity. Measures and limitations: speedup, efficiency, scaleup, reachable speedup (Amdahl's Law), optimal number of processors. Typical parallel applications.
3 Reasons for Parallel Computing Parallel computing is often considered the main direction for high performance computing. Specific goals can be: solve problems in a shorter, acceptable time; find solutions for big problems (i.e. large set of input variables, very high accuracy) in acceptable time; map a problem into the memory of a computer. Most of these aspects are addressed by technological progress and processor architecture improvements, but there are limitations.
4 Limiting Factors Traditionally, performance growth was driven by: packing more and more functions into a processor chip; scaling up the processor clock frequency. Limiting factors, i.e. aspects working against this: The area of a processor chip (die area) cannot be enlarged without increasing the time for signal propagation. The clock frequency is bounded: signals often must propagate across the chip area within a single clock cycle. A further increase of functional density would lead to structures measuring only a few atoms. More functionality and a higher clock frequency cause more energy consumption and more heating of the processors.
5 How these limitations materialize Example: assume 20 cm/ns as the typical signal velocity in copper (electrical wires). A future processor chip could be clocked with 100 GHz. A clock period then is 0.01 ns. The signal distance in this time is 0.2 cm. A processor chip may then not contain wires that are longer than 0.2 cm, which limits the size and number of functional units on a processor. Different techniques for further performance growth: parallel utilization of smaller sub-components within a processor chip that are not in a common clock domain; using multiple processors, e.g. multicore, multiprocessor machines; using multiple computers, e.g. computer clusters, GRID.
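The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the signal velocity of 20 cm/ns is the value assumed on the slide, and the function name `max_wire_length_cm` is made up for illustration.

```python
# Distance a signal can travel within one clock period.
# Assumption taken from the slide: signals propagate at about 20 cm/ns.
SIGNAL_SPEED_CM_PER_NS = 20.0

def max_wire_length_cm(clock_ghz, speed_cm_per_ns=SIGNAL_SPEED_CM_PER_NS):
    """Return the distance covered during one clock period (hypothetical helper)."""
    period_ns = 1.0 / clock_ghz          # a 1 GHz clock has a period of 1 ns
    return speed_cm_per_ns * period_ns

if __name__ == "__main__":
    for ghz in (1, 3, 10, 100):
        print(f"{ghz:>4} GHz: {max_wire_length_cm(ghz):.2f} cm per clock period")
```

The loop shows how the reachable wire length shrinks in inverse proportion to the clock frequency.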
6 Criteria for high computation performance: a better algorithm (fewer operations), the degree of parallel processing, and the mapping of the code onto processors and into the memory hierarchy. Algorithm with minimal number of computation steps: often several algorithms exist, differing in the number of computation steps and memory consumption. Mapping to processor architecture: use of operations that directly correspond to CPU instructions, exploiting memory locality with proper data structures. Parallel execution: program decomposition into independently executable operation streams / using vector operations.
7 Example of a parallel algorithm (1/3) Calculate the value of a polynomial for a given x: $y = a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0$
Algorithm A1: separate calculation of powers and products
1. Powers $x^1, x^2, x^3, x^4$: 3 multiplications
2. Products $a_i x^i$: 4 multiplications
3. Summation: 4 additions
4. Control (loop over i): 3 additions
A1 requires 7 multiplications and 7 additions.
8 Example of a parallel algorithm (2/3) Algorithm A2: stepwise calculation, with a[4] = $a_4$, a[3] = $a_3$, a[2] = $a_2$, a[1] = $a_1$, a[0] = $a_0$
(1) i := 4; n := 4;
(2) z := a[n];
(3) z1 := x * z + a[i-1];
(4) i := i - 1;
(5) if (i > 0) { z := z1; goto (3); }
(6) result := z1;
A2 requires 4 multiplications, 4 additions and 4 subtractions. A2 is the better algorithm compared to A1, because it needs fewer operations.
9 Example of a parallel algorithm (3/3) Question: Are A1 and A2 suited for parallel execution?
A1: $y = a_4 \cdot x \cdot x \cdot x \cdot x + a_3 \cdot x \cdot x \cdot x + a_2 \cdot x \cdot x + a_1 \cdot x + a_0$
A2: $y = (((a_4 \cdot x + a_3) \cdot x + a_2) \cdot x + a_1) \cdot x + a_0$
[Figure: operation trees of A1 and A2. A1 delivers its result after 5 time steps, A2 after 8 time steps.]
A1 is the better choice for parallel execution. A2 cannot exploit parallelism because of data dependencies: each step needs the result of the previous one.
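The two algorithms can be compared in a short Python sketch. The function names `poly_a1` and `poly_a2` are illustrative, and the counters only track the arithmetic on the data (the loop-control additions and subtractions counted on the slides are left out). The stepwise scheme A2 is commonly known as Horner's scheme.

```python
def poly_a1(coeffs, x):
    """A1: powers first, then products, then the sum.

    coeffs[i] is the coefficient a_i (degree >= 1 assumed).
    Returns (value, multiplications, additions).
    """
    n = len(coeffs) - 1
    mults = adds = 0
    powers = [1, x]                       # x^0 and x^1 need no multiplication
    for _ in range(2, n + 1):
        powers.append(powers[-1] * x)     # x^2 ... x^n
        mults += 1
    products = [coeffs[0]]
    for i in range(1, n + 1):
        products.append(coeffs[i] * powers[i])   # independent of each other
        mults += 1
    total = products[0]
    for term in products[1:]:
        total += term
        adds += 1
    return total, mults, adds

def poly_a2(coeffs, x):
    """A2: stepwise evaluation; every step depends on the previous one."""
    mults = adds = 0
    z = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        z = x * z + a
        mults += 1
        adds += 1
    return z, mults, adds

if __name__ == "__main__":
    coeffs = [1, 2, 3, 4, 5]              # a_0 ... a_4, arbitrary example values
    print("A1:", poly_a1(coeffs, 2))
    print("A2:", poly_a2(coeffs, 2))
```

In A1 the products inside the second loop do not depend on each other, which is what makes them candidates for parallel execution; in A2 each new `z` needs the previous `z`.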
10 Overview: Parallel Computer Architectures (1) Definition Parallel Computer by T. Ungerer: A parallel computer consists of multiple processing units that work coordinated and (at least partly) simultaneously in order to solve a problem cooperatively. high-end computers massively use parallel computation units even low-end computers make use of parallel computation units (e.g. shader units of graphic cards)
11 Overview: Parallel Computer Architectures (2) Classification (Flynn, 1966): a coarse classification based on the number of independent instruction streams and the number of data streams.

                                    Single data stream (SD)   Multiple data streams (MD)
Single instruction stream (SI)      SISD                      SIMD
Multiple instruction streams (MI)   -                         MIMD

The von Neumann computer is SISD. SIMD and MIMD are extensions of the von Neumann architecture, and both are parallel computers.
12 Overview: Parallel Computer Architectures (3) MIMD - Multiple Instruction / Multiple Data - computers:
Shared Memory Multiprocessor Systems
- Servers with many processors (as usual today), with different application components executed on different processors, e.g. database and web server
- Multicore processors
Distributed Memory Multiprocessor Systems - Distributed Systems
- Blade systems (networked computer blades)
- Cluster computers
- Networked workstations, as long as they are used with a parallel run time environment (e.g. MPI)
- Parallel computers connected by wide area networks: GRID
13 Overview: Parallel Computer Architectures (4) SIMD - Single Instruction / Multiple Data - computers: Array computer: a high number of equally structured arithmetical units that work synchronously under control of a single control unit. Vector computer: one or more specialized arithmetical units that work in a pipeline mode for fast floating point calculations (arithmetical pipelining).
14 Classification [Figure: classification tree of parallel computers with the levels architecture, class and technical traits]
Parallel Computer
- von Neumann
  - MIMD
    - Distributed Memory: network topologies, routing
    - Shared Memory: NUMA, UMA; cache coherency & memory consistency
  - SIMD
    - Array Computer
    - Vector Computer
- Non von Neumann
  - Dataflow Computer
  - Systolic array
15 The world's biggest: TOP500 list A ranking of supercomputers, released every 6 months, now with the 33rd TOP500 list (June 2009) ...
1 IBM - Roadrunner - BladeCenter QS22/LS21 Cluster, Processors: PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Network: Voltaire Infiniband, installed at DOE/NNSA/LANL, [...] cores, R_max = [...] TFlops, R_peak = [...] TFlops
2 Cray XT5 - Jaguar - Cray XT5 QC 2.3 GHz, Oak Ridge National Laboratory, U.S., [...] cores, R_max = [...] TFlops, R_peak = [...] TFlops
3 IBM JUGENE - Blue Gene/P, installed at Forschungszentrum Juelich, Germany, [...] cores, R_max = [...] TFlops, R_peak = [...] TFlops
9 Sun Microsystems - Ranger - SunBlade x6420, Opteron QC 2.3 GHz, Infiniband, Texas Advanced Computing Center/Univ. of Texas, [...] cores, R_max = [...] TFlops, R_peak = [...] TFlops ...
16 Top500: 1st rank: Roadrunner Data from 2008: in total 6,562 dual-core AMD Opteron and 12,240 Cell chips; 98 terabytes of memory; housed in 278 refrigerator-sized racks; its 10,000 connections are both Infiniband and Gigabit Ethernet. Hybrid computing system: standard processing (e.g. file system I/O) is handled by the Opteron processors, while mathematically and CPU-intensive elements are directed to the Cell processors.
17 Influence of algorithmic complexity Definition: Time Complexity - number of computation steps related to the problem size. n: size of input data. T(n): exact number of computation steps. O(...): order of complexity (without constant factors, contains only the dominant function of n). Example: $T(n) = n + 3n^2 \Rightarrow O(n^2)$
18 Hierarchy of complexities Useful algorithms: O(1), O(log n). Still useful: O(n), O(n log n), polynomial. Critical, useless algorithms: $O(2^n)$, O(n!). Parallel execution of algorithms is beneficial if: the complexity is between logarithmic and polynomial ($O(\log n)$ to $O(n^x)$); the algorithm contains a high degree of independent calculations.
19 Complexity and Parallel Computing Scaling problem size [Figure: two plots of computation time versus problem size n, each with curves for linear, n log n, polynomial of degree 2, polynomial of degree 3 and factorial (n!) complexity; (left) single processor vs. (right) linearly growing number of processors]
20 Complexity and Parallel Computing Examples:
- Scalar product, O(n): the number of needed processors directly corresponds to the scaled vector size: $n_{new} = d \cdot n_{old} \Rightarrow p_{new} = d \cdot p_{old}$.
- Matrix multiplication, $O(n^3)$: parallel matrix multiplication allows bigger problem sizes in constant time: $n_{new} = d \cdot n_{old} \Rightarrow p_{new} = d^3 \cdot p_{old}$.
- Generate and test binary numbers of length n, $O(2^n)$: practically not scalable: $n_{new} = n_{old} + 1 \Rightarrow p_{new} = 2 \cdot p_{old}$.
- Traveling Salesman Problem, O(n!): practically not scalable: $n_{new} = n_{old} + 1 \Rightarrow p_{new} = p_{old} \cdot n_{new}$.
21 A Good Example (1/3) Matrix Multiplication C = A B
for i := 0 to n-1
  for j := 0 to n-1
    c[i,j] := 0
    for k := 0 to n-1
      c[i,j] := c[i,j] + a[i,k] * b[k,j]
    endfor
  endfor
endfor
Complexity order: $O(n^3)$. Parallel algorithm: input partitioning - the outer two loops (i, j) are split, and different processes/threads cover these different areas.
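The input partitioning can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the lecture: it assumes square matrices stored as nested lists, uses the usual row-by-column convention `c[i][j] = sum over k of a[i][k] * b[k][j]`, and the names `multiply_tile` and `parallel_matmul` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_tile(a, b, rows, cols):
    """Work package of one worker: compute c[i][j] only for the given (i, j) area."""
    n = len(b)
    return {(i, j): sum(a[i][k] * b[k][j] for k in range(n))
            for i in rows for j in cols}

def parallel_matmul(a, b, parts=2):
    """Split the outer two loops (i, j) into parts x parts tiles, one task per tile."""
    n = len(a)
    bounds = [(t * n // parts, (t + 1) * n // parts) for t in range(parts)]
    tiles = [(range(r0, r1), range(c0, c1))
             for r0, r1 in bounds for c0, c1 in bounds]
    c = [[0] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
        # The tiles write to disjoint parts of c, so no locking is needed.
        for tile_result in pool.map(lambda t: multiply_tile(a, b, *t), tiles):
            for (i, j), value in tile_result.items():
                c[i][j] = value
    return c

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(parallel_matmul(a, b, parts=2))
```

Threads are used here only to show the decomposition. In standard CPython, CPU-bound threads do not run truly in parallel, so a real speedup would need processes or a compiled language.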
22 A Good Example (2/3) Matrix multiplication principle [Figure: element c[i,j] of C is the scalar product of a vector of A and a vector of B, both indexed by k] With the scalar product of vectors: $s = \sum_{k=0}^{n-1} a[k] \cdot b[k]$
23 A Good Example (3/3) Matrix Multiplication The table shows the number of steps, divided into steps per loop (i, j, k):

n    input size       T_1(n)              T_2(n)              T_4(n)              T_8(n)
10   2*10^2 = 200     10*10*10 = 1000     5*10*10 = 500       5*5*10 = 250        5*5*5 = 125
20   2*20^2 = 800     20*20*20 = 8000     10*20*20 = 4000     10*10*20 = 2000     5*10*20 = 1000
40   2*40^2 = 3200    40*40*40 = 64000    20*40*40 = 32000    20*20*40 = 16000    10*20*40 = 8000
80   2*80^2 = 12800   80*80*80 = 512000   40*80*80 = 256000   40*40*80 = 128000   20*40*80 = 64000

The problem size can be increased, but doubling the problem size requires the number of processors to be increased by a factor of 8.
24 A Bad Example (1/2) Traveling Salesman Problem (TSP) Input: n objects, and for each two objects i, j a distance cost $d_{i,j}$, with $i, j \in \{1, 2, ..., n\}$. Required result: a permutation p of the objects with p(i) = i-th element, such that $\left( \sum_{i=0}^{n-1} d_{p(i),p(i+1)} \right) + d_{p(n),p(0)}$ is minimal. [Figure: round trip with START / STOP at p(0), next object p(1) and edge cost $d_{p(0),p(1)}$] Time complexity: T = (n-1)!, or T = (n-1)!/2 for the symmetric TSP.
25 A Bad Example (2/2) Experiment: provide (n-1) processors for a traveling salesman problem of size n.

n    T_1(n)           T_{n-1}(n)
4    3! = 6           6/3 = 2
5    4! = 24          24/4 = 6
6    5! = 120         120/5 = 24
10   9! = 362880      362880/9 = 40320
11   10! = 3628800    3628800/10 = 362880

By using n processors we are able to process a problem of size n + 1, compared to a single processor machine with a problem of size n.
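The experiment can be reproduced for small n with a brute-force sketch in Python. It assumes a distance matrix with cities numbered from 0 and a round trip that starts and ends at city 0; the work is split by the second city of the tour, so each of the n-1 (simulated) processors checks (n-1)!/(n-1) tours. The function names and the distance values are illustrative only.

```python
from itertools import permutations

def tour_length(dist, tour):
    """Length of the closed round trip that visits the cities in the given order."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def best_tour_with_second_city(dist, second):
    """Work package for one processor: all tours of the form 0 -> second -> ..."""
    n = len(dist)
    rest = [c for c in range(1, n) if c != second]
    candidates = [(0, second) + perm for perm in permutations(rest)]
    best = min(candidates, key=lambda t: tour_length(dist, t))
    return tour_length(dist, best), best, len(candidates)

def solve_tsp(dist):
    """Combine the results of the n-1 work packages (run one after another here)."""
    n = len(dist)
    results = [best_tour_with_second_city(dist, s) for s in range(1, n)]
    length, tour, _ = min(results)
    tours_per_worker = results[0][2]
    return length, tour, tours_per_worker

if __name__ == "__main__":
    dist = [[0, 1, 4, 3],
            [1, 0, 2, 5],
            [4, 2, 0, 1],
            [3, 5, 1, 0]]          # made-up symmetric distances for 4 cities
    print(solve_tsp(dist))
```

For n = 4 each work package contains 2 tours, which matches the first row of the table; the number of tours per package still grows factorially with n.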
26 Measures and Limitations: Speedup Parameters: p: number of processors used; $T_1$: time steps needed for execution on a single processor; $T_p$: time steps for execution on a parallel computer with p processors. Speedup - how many times faster does the program run: $S_p = \frac{T_1}{T_p}$. The speedup is normally in the range 1 ... p. If $S_p > p$, this is caused by additional effects, e.g. better memory utilization or a parallel operating system.
27 Measures: Efficiency Efficiency - utilization of parallelism: $E_p = \frac{S_p}{p}$. Since $S_p$ normally lies in the range 1 ... p, $E_p$ normally lies in the range 1/p ... 1. Ideal algorithms exhibit $E_p = 1$, independently of p, the number of processors. When $E_p$ on a realistic machine does not drop with an increasing number of processors, we call the algorithm scalable (scalability).
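Both measures are simple ratios, as the following Python sketch shows. The step counts are the products given for n = 20 in the matrix multiplication table (slide 23); the function names are chosen for illustration.

```python
def speedup(t1, tp):
    """S_p = T_1 / T_p"""
    return t1 / tp

def efficiency(t1, tp, p):
    """E_p = S_p / p"""
    return speedup(t1, tp) / p

if __name__ == "__main__":
    t1 = 20 * 20 * 20                      # steps on a single processor
    steps = {2: 10 * 20 * 20, 4: 10 * 10 * 20, 8: 5 * 10 * 20}
    for p, tp in steps.items():
        print(f"p={p}: S_p={speedup(t1, tp):.1f}, E_p={efficiency(t1, tp, p):.2f}")
```

The idealized step counts yield $E_p = 1$ for every p, because they ignore communication and synchronization costs.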
28 Measures: Scaleup Scaleup - how much more data can we process in a fixed period of time. m: size of the small problem; n: size of the big problem, computed with p processors. $SC_p = \frac{n}{m}$, where $T_1(m) = T_p(n)$. Scaleup depends directly on the time complexity of the algorithm.
29 Limitations: Reachable Speedup (1) Ideally, with p processors we can gain a speedup of p. But this is not always true, because most algorithms contain (small) sequential parts. Known parameters: a: fraction of instructions that can be parallelized on p processors; b: fraction of instructions that remain sequential, e.g. due to data dependencies. a and b express fractions of the entire execution time on a single processor. Thus, a + b = 1. By using the speedup formula and normalizing $T_1(n)$ to 1, we obtain:
$S_p = \frac{T_1(n)}{T_p(n)} = \frac{a + b}{b + \frac{a}{p}} = \frac{1}{(1 - a) + \frac{a}{p}}$
30 Limitations: Reachable Speedup (2) Amdahl's Law (1967): $S_p = \frac{1}{b + \frac{a}{p}}$. Maximal speedup, using an infinite number of processors: $\lim_{p \to \infty} S_p = \frac{1}{b}$. [Figure: speedup versus fraction of sequential operations (b), with b varied in the range from 0 to 1 and one curve per processor count, e.g. p = 8 and p = 16] Even a low fraction of non-parallel operations may significantly limit the reachable speedup.
31 Limitations: Reachable Speedup (3) $\lim_{p \to \infty} S_p = \frac{1}{b}$. Example: with b = 0.1, the maximum speedup is 10, independent of how many processors are used (p >= 10). In general: $S_p \le \frac{1}{b}$
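Amdahl's Law is easy to evaluate numerically. The sketch below uses the formula $S_p = 1 / (b + a/p)$ with a = 1 - b; the function name and the chosen processor counts are illustrative.

```python
def amdahl_speedup(b, p):
    """Speedup for sequential fraction b on p processors: 1 / (b + a / p), a = 1 - b."""
    a = 1.0 - b
    return 1.0 / (b + a / p)

if __name__ == "__main__":
    b = 0.1
    for p in (1, 2, 8, 16, 128, 1024, 10**6):
        print(f"b={b}, p={p:>7}: S_p = {amdahl_speedup(b, p):.3f}")
    print("upper bound 1/b =", 1.0 / b)
```

With b = 0.1 the speedup approaches 10 as p grows but never exceeds it.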
32 Measures: Reachable Speedup (4) Vary the number of processors used. [Figure: speedup versus number of processors (p), with curves for several b-values: b = 0.00, 0.01, 0.05, 0.10, 0.20, ...] For b > 0.05, a further increase in speedup can only be reached up to a certain number of processors $p_x$. The bigger b gets, the smaller $p_x$ is.
33 Measures: Optimal number of processors (1) We use another measure: $F_p = \frac{S_p \cdot E_p}{T_1}$. $F_p$ grows with increasing speedup, but drops with decreasing efficiency. The division by $T_1$ normalizes $F_p$; it is not really necessary in our scope. $F_p$ reaches a maximum when the optimal number of processors is used.
34 Measures: Optimal number of processors (2) Apply Amdahl's Law to $S_p$ and calculate $E_p$ and $F_p$. [Figure: $F_p = S_p \cdot E_p$ versus number of processors (p), plotted for several b-values (fractions of non-parallel operations): b = 0.00, 0.01, 0.05, 0.10, 0.20, ...] $F_p$ reaches a maximum when the optimal number of processors is used: search for the top points of the curves.
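35 Measures: Optimal number of processors (3) Analytical approach: with
$F_p = S_p \cdot E_p = (S_p)^2 \cdot \frac{1}{p}$
and Amdahl's Law,
$F_p = \left( \frac{1}{b + \frac{a}{p}} \right)^2 \cdot \frac{1}{p}$
The maximum is located where the derivative vanishes:
$\frac{dF_p}{dp} = \frac{d}{dp} \left[ \left( \frac{1}{b + \frac{a}{p}} \right)^2 \cdot \frac{1}{p} \right] = 0$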
36 Measures: Optimal number of processors (4) We obtain:
$p = \left( \frac{a^2}{1 - 2a + a^2} \right)^{\frac{1}{2}} = \frac{a}{1 - a} = \frac{a}{b}$
Examples for p using the analytical approach (table columns: b, a, optimal p)
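The analytical result can be cross-checked numerically. The sketch below evaluates $F_p = S_p \cdot E_p$ (with $T_1$ normalized to 1) for integer p and compares the position of the maximum with a/b. The b-values are an assumption, taken from the curve labels of the plots, and are not the values of the original table; the function names are made up for this example.

```python
def f_measure(b, p):
    """F_p = S_p * E_p with S_p from Amdahl's Law and E_p = S_p / p."""
    s = 1.0 / (b + (1.0 - b) / p)
    return s * s / p

def optimal_p_analytic(b):
    """p = a / b with a = 1 - b, the zero of dF_p/dp."""
    return (1.0 - b) / b

def optimal_p_search(b, p_max=1000):
    """Integer p in 1..p_max with the largest F_p (brute-force search)."""
    return max(range(1, p_max + 1), key=lambda p: f_measure(b, p))

if __name__ == "__main__":
    for b in (0.01, 0.05, 0.10, 0.20):     # assumed example values
        print(f"b={b:.2f}: analytic p = {optimal_p_analytic(b):.1f}, "
              f"search p = {optimal_p_search(b)}")
```

For b = 0 the formula has no finite optimum, which fits the ideal case in which every additional processor is fully used.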
37 Typical Parallel Applications (1) All common applications exhibit a very high fraction of parallel operations (b very small) Linear Algebra: Operations with vectors and matrices Systems of linear equations: A x = b Solvers may work in a direct way, e.g. Gaussian-Elimination-Algorithm Iterative solvers, e.g. Gauss-Seidel-Iteration, some very efficient solvers for sparse coefficient matrices A
38 Typical Parallel Applications (2) Solution of Differential Equations: equations that contain x, a function y(x) and derivatives y'(x). Numerical solution using discrete differences instead of symbolic differentiation. Calculate approximated values for different values of x in parallel (Runge-Kutta algorithm).
39 Typical Parallel Applications (3) Image processing: Local operators, e.g. spreading of spectrum, smoothing can be executed on different image parts in parallel Object matching, e.g. detection of geometric forms Finding of similar blocks in different images for detection of object movements (soft) real-time multimedia
40 Example: Partial Differential Equation (1) Laplace Partial Differential Equation (Laplace PDE):
$\Delta U(x, y) = \frac{\partial^2}{\partial x^2} U(x, y) + \frac{\partial^2}{\partial y^2} U(x, y) = 0$
The values in U(x, y) express for instance: the spatial distribution of electrical potential fields; the temperature on a surface; the level of ground water (e.g. for planning of building constructions). Boundary values must be known for a solution of $\Delta U(x, y) = 0$.
41 Example: Partial Differential Equation (2) Discretization: express U(x, y) by a two-dimensional array of values at discrete grid points: $U(x, y) \to U(i, j)$ with $x_i = i \cdot h$, $y_j = j \cdot h$, and h as the distance of neighbor points in x and in y direction. Discrete approximation of the differential operator: common practice is a substitution for the first order derivative, according to:
$\frac{d}{dx} f(x) = f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
$\frac{d}{dx} f(x) = f'(x) = \frac{f(x + h) - f(x)}{h} + O(h)$
... we need a discretization of $\frac{d^2}{dx^2} f(x) = f''(x)$
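42 Example: Partial Differential Equation (3) Using Taylor series:
$f(x + h) = f(x) + f'(x) h + \frac{1}{2} f''(x) h^2 + \frac{1}{6} f'''(x) h^3 + \dots$
$f(x - h) = f(x) - f'(x) h + \frac{1}{2} f''(x) h^2 - \frac{1}{6} f'''(x) h^3 + \dots$
$f(x + h) + f(x - h) = 2 f(x) + f''(x) h^2 + \frac{1}{12} f''''(x) h^4 + O(h^6)$
This can be written as
$f''(x) = \frac{f(x + h) + f(x - h) - 2 f(x)}{h^2} + O(h^2)$, with $O(h^2) = -\frac{h^2}{12} f''''(x) + \dots$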
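The second-order accuracy of this difference formula can be observed numerically. The sketch below uses f(x) = sin(x) as a test function (an arbitrary choice, since its second derivative -sin(x) is known exactly); halving h should reduce the error by a factor of about 4.

```python
import math

def second_difference(f, x, h):
    """Central difference approximation of f''(x)."""
    return (f(x + h) + f(x - h) - 2.0 * f(x)) / (h * h)

def approximation_error(h, x=1.0):
    """Absolute error of the approximation for f = sin, where f'' = -sin."""
    exact = -math.sin(x)
    return abs(second_difference(math.sin, x, h) - exact)

if __name__ == "__main__":
    previous = None
    for h in (0.1, 0.05, 0.025):
        err = approximation_error(h)
        ratio = "" if previous is None else f", ratio to previous = {previous / err:.2f}"
        print(f"h={h}: error = {err:.3e}{ratio}")
        previous = err
```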
43 Example: Partial Differential Equation (4) The differential operators can be written as differences:
$\frac{U(i+1, j) + U(i-1, j) - 2U(i, j)}{h^2} + \frac{U(i, j+1) + U(i, j-1) - 2U(i, j)}{h^2} = 0$
Finally, an iterative formula for U(i,j) is obtained:
$U(i, j) = \frac{1}{4} \left( U(i+1, j) + U(i-1, j) + U(i, j+1) + U(i, j-1) \right)$
[Figure: the value of iteration x+1 at a grid point is computed from the neighbor values of iteration x]
44 Example: Partial Differential Equation (5) The iteration can be transformed into a parallel iteration on separate areas, one for each processor. [Figure: grid areas per processor, iteration x+1 computed from iteration x] Access to neighbor areas: multiprocessor: via access to shared memory and synchronization; multicomputer: by exchanging the U(i,j) values that lie on the boundaries of the locally processed area (using messages).
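A minimal sketch of this scheme in Python, in the shared-memory style: the interior rows of the grid are divided into strips, each strip is updated from the values of the previous iteration, and the swap of the two grids acts as the synchronization point. The strips are processed one after another here, so the code only simulates the decomposition; the function names, the strip layout and the iteration count are assumptions made for illustration.

```python
def update_strip(old, new, row_start, row_stop):
    """Update the interior points of rows [row_start, row_stop) from the old grid."""
    cols = len(old[0])
    for i in range(row_start, row_stop):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (old[i + 1][j] + old[i - 1][j]
                                + old[i][j + 1] + old[i][j - 1])

def solve_laplace(grid, workers=2, iterations=500):
    """Iterate U(i,j) = average of the 4 neighbors; boundary values stay fixed."""
    rows = len(grid)
    old = [row[:] for row in grid]
    new = [row[:] for row in grid]
    interior = rows - 2
    # One contiguous strip of interior rows per (simulated) processor.
    strips = [(1 + w * interior // workers, 1 + (w + 1) * interior // workers)
              for w in range(workers)]
    for _ in range(iterations):
        for start, stop in strips:         # independent of each other within one iteration
            update_strip(old, new, start, stop)
        old, new = new, old                # synchronization point between iterations
    return old

if __name__ == "__main__":
    n = 5
    # Boundary values follow U = j (column index); interior starts at 0.
    grid = [[float(j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
             for j in range(n)] for i in range(n)]
    for row in solve_laplace(grid, workers=3):
        print(["%.3f" % v for v in row])
```

Because every strip reads only values of the previous iteration, the result does not depend on the number of strips. On a multicomputer, the rows adjacent to a strip would have to be received as messages from the neighboring processes before each iteration.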
45 Summary Part 1 High performance computing with parallel computers. Goals: solve a problem in shorter time (speedup), or solve bigger problems in a specified/acceptable time (scaleup). Different parallel computer architectures: multiprocessors (shared memory), distributed systems (distributed memory), vector processors and combinations of them. Scaleup directly depends on the time complexity of the algorithm; parallelization helps if the time complexity order is polynomial or less. Speedup is limited by the sequential fraction of operations b: $S_p \le \frac{1}{b}$; most parallel applications have a very small sequential fraction of operations.
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally
More informationPARALLEL COMPUTER ARCHITECTURES
8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different
More informationOverview of High Performance Computing
Overview of High Performance Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu http://inside.mines.edu/~tkaiser/csci580fall13/ 1 Near Term Overview HPC computing in a nutshell? Basic MPI - run an example
More informationParallel Computer Architecture II
Parallel Computer Architecture II Stefan Lang Interdisciplinary Center for Scientific Computing (IWR) University of Heidelberg INF 368, Room 532 D-692 Heidelberg phone: 622/54-8264 email: Stefan.Lang@iwr.uni-heidelberg.de
More informationParallel Computing Introduction
Parallel Computing Introduction Bedřich Beneš, Ph.D. Associate Professor Department of Computer Graphics Purdue University von Neumann computer architecture CPU Hard disk Network Bus Memory GPU I/O devices
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationHigh Performance Computing Systems
High Performance Computing Systems Shared Memory Doug Shook Shared Memory Bottlenecks Trips to memory Cache coherence 2 Why Multicore? Shared memory systems used to be purely the domain of HPC... What
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationWhat are Clusters? Why Clusters? - a Short History
What are Clusters? Our definition : A parallel machine built of commodity components and running commodity software Cluster consists of nodes with one or more processors (CPUs), memory that is shared by
More informationA Multiprocessor system generally means that more than one instruction stream is being executed in parallel.
Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,
More informationLect. 2: Types of Parallelism
Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric
More informationOverview of Parallel Computing. Timothy H. Kaiser, PH.D.
Overview of Parallel Computing Timothy H. Kaiser, PH.D. tkaiser@mines.edu Introduction What is parallel computing? Why go parallel? The best example of parallel computing Some Terminology Slides and examples
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationCDA3101 Recitation Section 13
CDA3101 Recitation Section 13 Storage + Bus + Multicore and some exam tips Hard Disks Traditional disk performance is limited by the moving parts. Some disk terms Disk Performance Platters - the surfaces
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationParallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam
Parallel Computer Architectures Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Outline Flynn s Taxonomy Classification of Parallel Computers Based on Architectures Flynn s Taxonomy Based on notions of
More informationCray XE6 Performance Workshop
Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed
More informationIntroduction to parallel computing
Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationChapter 1. Introduction: Part I. Jens Saak Scientific Computing II 7/348
Chapter 1 Introduction: Part I Jens Saak Scientific Computing II 7/348 Why Parallel Computing? 1. Problem size exceeds desktop capabilities. Jens Saak Scientific Computing II 8/348 Why Parallel Computing?
More informationAmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015
AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015 Agenda Introduction to AmgX Current Capabilities Scaling V2.0 Roadmap for the future 2 AmgX Fast, scalable linear solvers, emphasis on iterative
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationCS650 Computer Architecture. Lecture 10 Introduction to Multiprocessors and PC Clustering
CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors and PC Clustering Andrew Sohn Computer Science Department New Jersey Institute of Technology Lecture 10: Intro to Multiprocessors/Clustering
More informationSpring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University
18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?
More information27. Parallel Programming I
760 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationReview of previous examinations TMA4280 Introduction to Supercomputing
Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with
More informationThread and Data parallelism in CPUs - will GPUs become obsolete?
Thread and Data parallelism in CPUs - will GPUs become obsolete? USP, Sao Paulo 25/03/11 Carsten Trinitis Carsten.Trinitis@tum.de Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) Institut für
More informationCS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2
CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationParallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor
Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
More informationObjectives of the Course
Objectives of the Course Parallel Systems: Understanding the current state-of-the-art in parallel programming technology Getting familiar with existing algorithms for number of application areas Distributed
More informationSHARED MEMORY VS DISTRIBUTED MEMORY
OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationGP-GPU. General Purpose Programming on the Graphics Processing Unit
GP-GPU General Purpose Programming on the Graphics Processing Unit Goals Learn modern GPU architectures and its advantage and disadvantage as compared to modern CPUs Learn how to effectively program the
More informationA study on SIMD architecture
A study on SIMD architecture Gürkan Solmaz, Rouhollah Rahmatizadeh and Mohammad Ahmadian Department of Electrical Engineering and Computer Science University of Central Florida Email: {gsolmaz,rrahmati,mohammad}@knights.ucf.edu
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationHigh-Performance Scientific Computing
High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More information