Goals of this Course


CISC 849: High Performance Parallel Algorithms for Computational Science
Instructor: Dr. Michela Taufer, Spring 2009

Goals of this Course
This course is intended to provide students with an understanding of parallelization with MPI and OpenMP. Case studies for parallelization include Molecular Dynamics and Monte Carlo simulations, their principles, and their sequential and parallel algorithms. Emphasis is placed on the algorithmic and code components of these simulations, their performance analysis, and their scalability. (From the syllabus)

Course Topics
- Parallel programming: parallel architectures; parallel programming with the Message Passing Interface (MPI); parallel programming with OpenMP
- Case study I: Molecular Dynamics (MD) simulations; parallelization of the MD algorithm with MPI and OpenMP
- Case study II: Monte Carlo simulations; parallelization of the Monte Carlo algorithm with MPI
- Hybrid parallelism: combining MPI and OpenMP

Course Information and Deadlines
- Webpage:
- Mailing list: cisc849010_sp09@gcl.cis.udel.edu
- Access to course material: user cisc849student, password Work4Fun!
- Schedule: download it from the course webpage. It is a tentative schedule!
- Syllabus: download it from the course webpage. Read it carefully!

Books
- Parallel Programming with MPI by Peter Pacheco
- Parallel Programming in C with MPI and OpenMP by Michael J. Quinn

- Parallel Programming in OpenMP by Rohit Chandra, Leo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh Menon
- The Art of Molecular Dynamics Simulation by D.C. Rapaport, Cambridge University Press

- Molecular Modeling: Principles and Applications by A.R. Leach, Pearson

Modern Scientific Method (I)
[Figure: the classical scientific method: Nature -> Observation -> Physical experiments and models -> Theory]

Modern Scientific Method (II)
[Figure: contemporary science adds numerical simulations alongside physical experiments and models, for cases where experiments are expensive, time-consuming, unethical, or impossible: Nature -> Observation -> Numerical simulations / Physical experiments and models -> Theory]

Grand Challenges
Grand challenges are complex scientific problems:
- Quantum chemistry, statistical mechanics, and relativistic physics
- Cosmology and astrophysics
- Computational fluid dynamics and turbulence
- Biology, pharmacology, genome sequencing, protein folding, and cell modeling
- Global weather and environmental modeling
They require extraordinarily powerful computers when solved via numerical simulations: they need more computational power, and they benefit from parallel computing.

What is Parallel Computing?
Parallel computing: the use of multiple processors or computers working together to solve a single computational problem.
- Each processor works on its section of the problem
- Processors can exchange information
[Figure: a 2-D grid (x, y) of the problem to be solved, partitioned into four areas; CPUs #1-#4 each work on one area and exchange data at the boundaries]

Why Do Parallel Computing?
Limits of single-CPU computing: performance and available memory.
Parallel computing allows one to:
- solve problems that don't fit on a single CPU
- solve problems that can't be solved in a reasonable time
We can solve larger problems, faster, and more cases.

Example: Weather Modeling and Forecasting
Modeling a hurricane region:
- Assume the region of interest is 1000 x 1000 miles, with a height of 10 miles. Partitioning it into segments of 0.1 x 0.1 x 0.1 miles gives 10^10 grid points.
- Simulate 2 days with 30-minute time steps: about 100 time steps in total.
- Assume the computations at each grid point require 100 instructions. A single time step then requires 10^12 instructions; for two days we need 10^14 instructions.
- On a serial computer executing 10^8 instructions/sec, this takes 10^6 seconds (over 10 days!) to predict the next 2 days.
THIS REQUIRES PARALLELISM FOR PERFORMANCE. It also requires lots of memory, which again implies parallelism. Currently all major weather forecast centers (US, Europe, Asia) have supercomputers with 1000s of processors. (A sketch of this arithmetic appears after the next slide.)

Other Examples
Vehicle design and dynamics; analysis of protein structures; human genome work; quantum chromodynamics; cosmology; ocean modeling; imaging and rendering; petroleum exploration; nuclear weapon design; database query; ozone layer monitoring; natural language understanding; study of chemical phenomena; and many other grand challenge projects.
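As promised above, here is the weather-model arithmetic in runnable form (a minimal C sketch; every number is the slide's assumption):

    #include <stdio.h>

    int main(void) {
        double grid_points  = 1e4 * 1e4 * 1e2; /* 1000x1000x10 miles at 0.1-mile resolution = 10^10 */
        double instr_per_pt = 100.0;           /* assumed instructions per grid point per step */
        double time_steps   = 100.0;           /* ~2 days of 30-minute steps */
        double serial_rate  = 1e8;             /* instructions/sec of the serial machine */

        double total_instr = grid_points * instr_per_pt * time_steps; /* 10^14 */
        double seconds     = total_instr / serial_rate;               /* 10^6 s */
        printf("total instructions: %.0e\n", total_instr);
        printf("serial time: %.0e s (about %.1f days)\n", seconds, seconds / 86400.0);
        return 0;
    }

The program prints roughly 11.6 days; the slide rounds this to "10 days".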

What is a Parallel Computer?
A parallel computer is a computer (or collection of computers) with multiple processors that can work together on solving a complex problem, supporting parallel computing.
- Distributed multiprocessor: a parallel computer constructed out of multiple computers and an interconnection network
- Centralized multiprocessor (or symmetric multiprocessor, SMP): all CPUs share access to a single global memory
How do the processors work together?

Distributed Multiprocessor
[Figure: block diagram of a distributed multiprocessor]

Centralized Multiprocessor
[Figure: block diagram of a centralized multiprocessor]

What is Parallel Programming?
Parallel programming: programming in a language that allows you to explicitly indicate how parts of the computation may be executed in parallel (concurrently). Two routes:
- Entrust the task to compiler technology: the compiler detects and exploits the parallelism in existing code written in sequential languages
- Write your own parallel program: e.g., parallel programs written in C/C++/Fortran with MPI or OpenMP

MPI and OpenMP
MPI (Message Passing Interface): a library specification for message passing, proposed as a standard by a broadly based committee of vendors, implementors, and users. (From the MPI website)
OpenMP: the OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. (From the OpenMP website)

Single Program, Multiple Data (SPMD)
SPMD is the dominant programming model for shared- and distributed-memory machines:
- One source code is written
- The code can have conditional execution based on which processor is executing the copy
- All copies of the code are started simultaneously and communicate and synchronize with each other periodically
MPMD (Multiple Program, Multiple Data) is more general, and possible in hardware, but no system/programming software enables it.
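As a concrete illustration of SPMD (a minimal sketch, not course material; the file name and messages are invented), every process runs the same executable and branches on its MPI rank:

    /* spmd_hello.c -- compile with: mpicc spmd_hello.c -o spmd_hello
       run with:                     mpirun -np 4 ./spmd_hello        */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                /* all copies start here */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* who am I?             */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many copies run?  */

        if (rank == 0)                         /* conditional execution */
            printf("process 0 of %d: I coordinate\n", size);
        else
            printf("process %d of %d: I work\n", rank, size);

        MPI_Finalize();
        return 0;
    }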

SPMD Programming Model
[Figure: the same source.c is loaded onto Processors 0-3; each processor runs its own copy]

Types of Parallelism: Two Extremes
- Data parallelism: each processor performs the same task on different data
- Task parallelism (or functional parallelism): each processor performs a different task
Most applications fall somewhere on the continuum between these two extremes.

Data Parallel Programming Example
One code will run on 2 CPUs. The program has an array of data to be operated on by the 2 CPUs, so the array is split into two parts:

    program:
      if CPU=a then
        low_limit=1; upper_limit=50
      elseif CPU=b then
        low_limit=51; upper_limit=100
      end if
      do I = low_limit, upper_limit
        work on A(I)
      end do
    end program

CPU A effectively runs the loop with low_limit=1, upper_limit=50; CPU B runs it with low_limit=51, upper_limit=100. (A runnable MPI version is sketched after the next example.)

Task Parallel Programming Example
One code will run on 2 CPUs. The program has 2 tasks (a and b) to be done by the 2 CPUs:

    program.f:
      initialize
      if CPU=a then
        do task a
      elseif CPU=b then
        do task b
      end if
    end program

CPU A initializes and does task a; CPU B initializes and does task b.
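A hedged, runnable C/MPI version of the data-parallel example above (the array contents and the "work" are placeholders; the slide's pseudocode hard-codes 2 CPUs, while this version splits the array among however many processes are started):

    #include <stdio.h>
    #include <mpi.h>

    #define N 100

    int main(int argc, char *argv[]) {
        int rank, size;
        double a[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;       /* assumes size divides N evenly        */
        int low   = rank * chunk;   /* with 2 CPUs: rank 0 gets 0..49,      */
        int high  = low + chunk;    /*              rank 1 gets 50..99      */

        for (int i = low; i < high; i++)
            a[i] = 2.0 * i;         /* placeholder for "work on A(I)"       */

        printf("rank %d handled elements %d..%d\n", rank, low, high - 1);
        MPI_Finalize();
        return 0;
    }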

Task Parallelism: Protein Folding
Start from the same initial protein structure and run different MD simulations (independent tasks), producing a final set of folded protein structures. The independent tasks change atom velocities to random seed values while preserving temperature.

Data Parallelism: Protein Folding
One single folding process is performed in parallel.
[Figure: the simulation space is partitioned into four regions, one per machine (PC 0-PC 3)]

Data Dependency Graphs
A formal method to identify parallelism. A data dependency graph is a directed graph:
- Vertices (circles) represent tasks to be completed
- Edges denote dependencies among tasks
- If there is no path between two vertices, then the tasks are independent
- Labels inside the circles represent the kind of task being performed; multiple circles with the same label represent tasks performing the same operation on different operands

Parallelism in Data Dependency Graphs
[Figure: three example graphs: one task A feeding several identical B tasks (data parallelism); one task A feeding different tasks (task parallelism); and a chain A -> B -> C (sequential dependency)]

Pipeline
Divide a process into stages and produce several items simultaneously, e.g., an automobile assembly line.

Pipelining
Given a sequential dependence graph (a chain of tasks or stages), assume that:
- all tasks take the same amount of time
- multiple problem instances need to be processed
Then the output of each functional unit is the input to the next.
[Figure: problem instances i-2, i-1, i, i+1 flowing through stages A -> B -> C]
Examples: the von Neumann model, where the various circuits in the CPU are split up into functional units; an automobile assembly line. (A timing sketch follows below.)
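Under the slide's assumptions (k equal stages of duration t, n independent problem instances), the pipelined time is (k + n - 1)t versus n*k*t sequentially; a small sketch with made-up numbers:

    #include <stdio.h>

    int main(void) {
        int    k = 3;    /* stages A, B, C from the figure      */
        int    n = 100;  /* problem instances                   */
        double t = 1.0;  /* time per stage, arbitrary units     */

        /* first result after k*t, then one new result every t  */
        double pipelined  = (k + n - 1) * t;
        double sequential = (double)n * k * t;
        printf("sequential: %.0f, pipelined: %.0f, speedup: %.2f\n",
               sequential, pipelined, sequential / pipelined);
        return 0;
    }

As n grows, the speedup approaches k, the number of stages.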

Limits of Parallel Computing
- Theoretical upper limits: Amdahl's law
- Practical limits: load balancing, non-computational sections, time to rewrite code
- Hardware/system limits: topology, network bandwidth and latency, number of processors

Amdahl's Law
Amdahl's law places a strict limit on the speedup that can be realized by using multiple processors.
Effect of multiple processors on run time:
    t_N = (f_p / N + f_s) t_1
Effect of multiple processors on speedup:
    S = 1 / (f_s + f_p / N)
where
    f_s = serial fraction of the code
    f_p = parallel fraction of the code (f_s + f_p = 1)
    N   = number of processors
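The formulas translate directly into code; a small sketch (the f_s = 0.01 value is an illustrative assumption):

    #include <stdio.h>

    /* Amdahl's law from the slide: S = 1 / (f_s + f_p / N) */
    double speedup(double f_s, int n) {
        double f_p = 1.0 - f_s;
        return 1.0 / (f_s + f_p / n);
    }

    int main(void) {
        /* even 1% serial code caps speedup far below N for large N */
        for (int n = 2; n <= 1024; n *= 4)
            printf("N = %4d: S = %6.2f  (f_s = 0.01)\n", n, speedup(0.01, n));
        printf("limit as N -> infinity: %.1f\n", 1.0 / 0.01);
        return 0;
    }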

It takes only a small fraction of serial content in a code to degrade the parallel performance.
[Figure: illustration of Amdahl's law; speedup S versus number of processors for several values of f_p, saturating well below ideal linear speedup]

Practical Limits: Amdahl's Law vs. Reality
Amdahl's law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communication. In reality, communication results in a further degradation of performance.
[Figure: for f_p = 0.99, measured speedup falls increasingly below the Amdahl's-law curve as the number of processors grows]

Shared and Distributed Memory
[Figure: a shared-memory machine (processors P connected over a bus to one memory) and a distributed-memory machine (processor/memory pairs P-M connected by a network)]
Shared memory: a single address space; all processors have access to a pool of shared memory (examples: Cray SV1, IBM Power4 node). Methods of memory access: bus, crossbar.
Distributed memory: each processor has its own local memory; processors must do message passing to exchange data (examples: clusters, Cray T3E). Methods of memory access: various topological interconnects.

Bus-Based Shared-Memory Architecture (I)
Processors are connected to global memory by means of a common data path called a bus.
[Figure: CPUs attached to a bus leading to global memory]
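Because shared memory provides one address space, the loop-splitting that needed explicit limits in the MPI example can be expressed with a single OpenMP directive (a hedged sketch, not from the slides; compile with an OpenMP flag such as -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
        double a[N];

        /* OpenMP divides the iterations among threads; a[] lives in the
           single shared memory, so no data exchange has to be coded. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[99] = %.1f, computed by up to %d threads\n",
               a[N - 1], omp_get_max_threads());
        return 0;
    }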

Critical Issues
- Simplicity of construction; provides uniform access to shared memory
- But the bus can carry only a limited amount of data between the memory and the processors
- As the number of processors increases, each processor spends more time waiting for memory access while the bus is used by other processors: saturation of the bus
- For this reason the SGI Challenge XL has only 36 processors

Bus-Based Shared-Memory Architecture (II)
Adding caches to the bus increases performance.
[Figure: each CPU now has a cache between it and the bus to global memory]

Bus with and without Cache
[Figure: performance versus number of processors, with and without caches; performance with cache is higher but still flattens as processors are added]
What is the matter with this picture?

Switch-Based Shared-Memory Architecture
[Figure: a PxM crossbar: CPUs 1..p on one side, memory banks M1..Mm on the other, with a switch element at each crossing]
A PxM crossbar switch connects P processors and M memory banks. Example: a 5x5 crossbar switch is the basic unit of the Convex SPP.

Pros and Cons
Crossbars do not suffer from saturation problems, BUT they are very expensive architectures: an mxn crossbar needs m*n hardware switches.

Cost Estimation
A crossbar switch is a non-blocking network: the connection of a processor to a memory bank does not block the connection of any other processor to any other memory bank.
The total number of switching elements required grows as p*m, which is approximately p^2 (assuming p = m). As p grows, so does the complexity of the switching network. Crossbar switches are therefore not scalable in terms of cost.

Types of Distributed-Memory Architectures
- Bus-based networks, e.g., a cluster of workstations on an Ethernet
- Dynamic interconnects (indirect topology)
- Static interconnects (direct topology)

Dynamic vs. Static Interconnects
DYNAMIC INTERCONNECT (indirect topology): communication links are connected to one another dynamically by switching elements to establish paths among processors. Example: a crossbar network, where specialized switching nodes transfer the messages.
STATIC INTERCONNECT (direct topology): point-to-point communication links among processors (processor/memory pairs). Example: a mesh, where the processors themselves act as the routing nodes.

Dynamic Interconnect: Crossbar Switch Network
[Figure: crossbar switch network]

Dynamic Interconnect: Omega Network (I)
[Figure: an omega network built from small switching elements]

Dynamic Interconnect: Omega Network (II)
[Figure: routing through an omega network]

Examples of Systems with Dynamic Interconnection Networks
- Crossbar switch network: Fujitsu VPP 500 (a 224x224 crossbar with 224 nodes)
- A compromise between crossbar and omega networks: the SP series from IBM; each switch of the omega structure is an 8x8 crossbar, and the largest installation has 512 nodes

Static Interconnection Networks
Completely connected, star connected, linear array, ring, mesh, hypercube.

Static Interconnect
- Completely connected: each processor has a direct communication link to every other processor
- Star connected: the middle processor is the central processor, and every other processor is connected to it; the counterpart of the crossbar switch in dynamic interconnects

Static Interconnect
- Linear array
- Ring
- Mesh network (e.g., 2-D)
- Torus (wraparound mesh)
[Figures: a linear array, a ring, a 2-D mesh, and a 2-D torus]

Static Interconnect
- Hypercube network: a multidimensional mesh of processors with exactly two processors in each dimension. A d-dimensional hypercube consists of p = 2^d processors.
[Figure: 0-, 1-, 2-, and 3-dimensional hypercubes]

Routing
- How is data transmitted between two nodes that are not directly connected? Hardware and hardware+software solutions.
- How is a route between nodes decided if there are multiple routes? Deterministic shortest-path routing algorithms.
- How do intermediate nodes forward communications? Store-and-forward routing or wormhole routing.
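A handy property (not stated on the slide, but standard for hypercubes): label the p = 2^d nodes with d-bit numbers, and each node's neighbors differ from it in exactly one bit, so they can be enumerated by XOR:

    #include <stdio.h>

    int main(void) {
        int d = 3;      /* 3-D hypercube: p = 8 nodes */
        int node = 5;   /* binary 101                 */

        /* flipping each of the d bits yields the d neighbors */
        for (int bit = 0; bit < d; bit++)
            printf("neighbor across dimension %d: %d\n", bit, node ^ (1 << bit));
        /* prints 4 (100), 7 (111), 1 (001): the three links of node 101 */
        return 0;
    }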

Store-and-Forward vs. Wormhole Routing
Store-and-forward routing: data is shipped through the network a packet at a time; the packet is sent to the first intermediate node, then on to the second, and so forth.
Wormhole routing: like a worm crawling through a wormhole. A packet contains a header with routing information, followed by a payload containing the actual data, probably followed by a checksum or something to guarantee integrity. Once the header has arrived at a node, it is possible to make routing decisions and pass it along immediately, rather than waiting for the entire packet to arrive first.
Wormhole routing dramatically reduces latency, but creates new possibilities for deadlock.

Evaluating Network Topologies
Metrics: diameter, connectivity, bisection width, channel width, channel rate, channel bandwidth, bisection bandwidth.
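A simple latency model makes the difference concrete (a sketch under assumed numbers, not taken from the slides: an m-byte packet with an h-byte header crossing l links at B bytes/sec):

    #include <stdio.h>

    int main(void) {
        double m = 4096.0;   /* packet size, bytes            */
        double h = 16.0;     /* header size, bytes            */
        double B = 1e8;      /* link bandwidth, bytes/sec     */
        int    l = 4;        /* number of links on the route  */

        /* store-and-forward retransmits the whole packet at every hop;
           wormhole streams it, paying the per-hop cost only for the header */
        double store_forward = l * (m / B);
        double wormhole      = l * (h / B) + m / B;
        printf("store-and-forward: %.2e s, wormhole: %.2e s\n",
               store_forward, wormhole);
        return 0;
    }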

Metrics
Diameter: the maximum distance between any two processors in the network, where the distance between two processors is defined as the shortest path, in terms of links, between them. This relates to communication time. The diameter of a completely connected network is 1; of a star network, 2; of a ring, p/2 (for p even).

Connectivity: a measure of the multiplicity of paths between any two processors (the number of arcs that must be removed to break the network into two). High connectivity is desired since it lowers contention for communication resources. In the previous examples, connectivity is 1 for a linear array, 1 for a star, 2 for a ring, 2 for a mesh, and 4 for a torus.

Metrics
Bisection width: the minimum number of communication links that have to be removed to partition the network into two equal halves. Bisection width is 2 for a ring, sqrt(p) for a mesh with p processors, p/2 for a hypercube, and p^2/4 for a completely connected network (p even).

Channel width: the number of physical wires in each communication link.
Channel rate: the peak rate at which a single physical wire can deliver bits.
Channel bandwidth: the peak rate at which data can be communicated between the ends of a communication link:
    channel bandwidth = channel width * channel rate
Bisection bandwidth: the minimum volume of communication allowed between any two halves of the network with an equal number of processors:
    bisection bandwidth = bisection width * channel bandwidth
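Applying the two formulas above with assumed example numbers (a 1024-node mesh, so bisection width sqrt(1024) = 32; the wire counts and rates are illustrative):

    #include <stdio.h>

    int main(void) {
        double channel_width = 16;   /* wires per link                       */
        double channel_rate  = 1e9;  /* bits/sec per wire                    */
        double bisection_w   = 32;   /* links cut, sqrt(p) for a mesh, p=1024 */

        double channel_bw   = channel_width * channel_rate;
        double bisection_bw = bisection_w * channel_bw;
        printf("channel BW: %.1e bits/s, bisection BW: %.1e bits/s\n",
               channel_bw, bisection_bw);
        return 0;
    }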

Example 1: 2-D Mesh (without wraparound connections)
    Processor nodes:  n = d^2
    Switch nodes:     n
    Diameter:         2(sqrt(n) - 1)
    Bisection width:  sqrt(n)
    Edges/node:       4
    Edge length:      constant

Example 2: Binary Tree Network
    Processor nodes:  n = 2^d
    Switch nodes:     2n - 1
    Diameter:         2 log n
    Bisection width:  1
    Edges/node:       3
    Edge length:      variable

Next Lecture
Topics: programming with MPI.
Deadlines:
- Course: read the syllabus and tentative schedule; print the slides for the next lecture and take them with you!
- Seminar: choose your seminar day and your paper.
- Project: read the project descriptions; the next deadline is 2/19!
- Homework: no homework assignment this time.

Get Some Practice
Find the memory model and topology of the following machines: Cray T3E, Cray SV1, IBM RS/6000 SP, Hitachi SR8000, Compaq HPC320, IBM eServer p690, SGI Origin 2000, clusters of SMPs, Cray T3D, Fujitsu VPP5000 series.
Next lecture: you give your answers!
