CME342 - Parallel Methods in Numerical Analysis
1 CME342 - Parallel Methods in Numerical Analysis Parallel Architectures April 2, 2014 Lecture 2
2 Announcements
1. Subscribe to the mailing list. Go to lists.stanford.edu and follow the directions in the 1st handout; subscribe to the list named cme342-class.
2. Sign your name up on the list I am passing around if you have not done so. If you are not on the list, you will not have preferred access to cluster resources / GPUs.
3. If you missed lecture notes or handouts, they are on the web page now.
3 Some units
1 MFlop = 10^6 flops/sec    1 MByte = 10^6 bytes
1 GFlop = 10^9 flops/sec    1 GByte = 10^9 bytes
1 TFlop = 10^12 flops/sec   1 TByte = 10^12 bytes
4 Performance goals 4
5 Microprocessor performance 5
6 Top 500 List November
7 Top 500 List Historical Trends 7
8 Parallel Computers
A parallel computer is a collection of CPUs / processing units that cooperate to solve a problem. It can solve a problem faster, or simply solve a bigger problem. How large is the collection? How is the memory organized? How do the units communicate and transfer information? This is not a course on parallel architectures: this lecture is just an overview. Still, you need a certain knowledge of the underlying hardware to use parallel computers efficiently.
9 Categorization of Parallel Architectures Control mechanism: instruction stream and data stream Process granularity Address space organization Interconnection network Static Dynamic 9
10 Control mechanism (Flynn's taxonomy)
SISD: Single Instruction stream, Single Data stream
SIMD: Single Instruction stream, Multiple Data streams
MIMD: Multiple Instruction streams, Multiple Data streams
MISD: Multiple Instruction streams, Single Data stream
11 SIMD Multiple processing elements are under the supervision of a control unit Thinking Machine CM-2, MasPar MP-2, Quadrics SIMD extensions are now present in commercial microprocessors (MMX or SSE in Intel x86, 3DNow in AMD K6 and Athlon, Altivec in Motorola G4) 11
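The SIMD idea can be sketched in plain C: a single instruction stream (the loop body) is applied element-wise to multiple data streams. The function below is an illustrative sketch (the names are ours, not from the lecture); a vectorizing compiler maps such a loop onto MMX/SSE-style vector instructions automatically.

```c
#include <assert.h>

/* Same operation applied to every element of the arrays: the SIMD
 * pattern. Each iteration is independent, so the compiler can issue
 * one vector instruction that processes several elements at once. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```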
12 MIMD Each processing element is capable of executing a different program independent of the other processors Most multiprocessors can be classified in this category 12
13 Process granularity Coarse grain: Cray C90, Fujitsu Medium grain: IBM SP2, CM-5, clusters Fine grain: CM-2, Quadrics, Blue Gene/L? 13
14 Address space organization
Single address space:
- Uniform Memory Access (UMA): SMP
- Non-Uniform Memory Access (NUMA)
Local address spaces:
- Message passing
15 SMP architecture
[Diagram: four CPUs, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O.]
SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors. Cache coherence is maintained.
16 NUMA architecture
[Diagram: four CPUs, each with its own cache and local memory, connected through a bus or crossbar switch.]
Shared address space. Memory latency varies depending on whether you access local or remote memory. Cache coherence is maintained using a hardware or software protocol.
17 Message-passing / Distributed memory
[Diagram: four CPUs, each with its own cache and local memory, connected by a communication network.]
Local address spaces. No cache coherence.
18 Hybrids?
[Diagram: several nodes, each with a CPU (cache and memory) and a GPU with its own device memory attached over PCIx, connected by a communication network.]
Notice that, currently, a CPU can have many cores that share memory within one processor (and a node may have multiple such processors). In addition, accelerator cards with separate memory can be present, giving heterogeneous architectures (GPUs and others). No cache coherence across nodes.
19 Dynamic interconnections
Crossbar switching: the most expensive and extensive interconnection. [Diagram: processors P1, P2 connected to memories M1, M2 through a crossbar.]
Bus connected: processors are connected to memory through a common datapath.
Multistage interconnection: butterfly, Omega network, perfect shuffle, etc. [Diagram: butterfly network.]
20 Static interconnection networks Complete interconnection Star interconnection Linear array Mesh: 2D/3D mesh, 2D/3D torus Tree and fat tree network Hypercube network 20
21 Characteristics of static networks
Diameter: the maximum distance between any two processors in the network.
- Complete connection: D = 1
- Linear array: D = N - 1
- Ring: D = N/2
- 2D mesh: D = 2(sqrt(N) - 1)
- 2D torus: D = 2 floor(sqrt(N)/2)
- Hypercube: D = log N
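The diameter formulas above can be written as small helper functions. This is a sketch (function names are ours); it assumes N is a perfect square for the 2D mesh/torus and a power of two for the hypercube.

```c
#include <assert.h>

/* Integer square root (avoids linking the math library). */
static int isqrt(int n)
{
    int s = 0;
    while ((s + 1) * (s + 1) <= n) s++;
    return s;
}

/* Diameter of each static network with N processors. */
int diam_complete(int n)  { (void)n; return 1; }          /* D = 1          */
int diam_linear(int n)    { return n - 1; }               /* D = N - 1      */
int diam_ring(int n)      { return n / 2; }               /* D = N/2        */
int diam_mesh2d(int n)    { return 2 * (isqrt(n) - 1); }  /* 2(sqrt N - 1)  */
int diam_torus2d(int n)   { return 2 * (isqrt(n) / 2); }  /* 2 flr(sqrtN/2) */
int diam_hypercube(int n) { int d = 0; while (n > 1) { n >>= 1; d++; } return d; }
```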
22 Characteristics of static networks
Bisection width: the minimum number of communication links that have to be removed to partition the network in half.
Channel rate: the peak rate at which a single wire can deliver bits.
Channel bandwidth: the product of channel rate and channel width.
Bisection bandwidth B: the product of bisection width and channel bandwidth.
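Putting these definitions together as arithmetic, a minimal sketch (the figures used in the test are made-up illustrative numbers, not from any real machine):

```c
#include <assert.h>

/* Bisection bandwidth B = bisection width * channel bandwidth,
 * where channel bandwidth = channel rate * channel width. */
long bisection_bw(int bisection_width, long channel_rate_bps, int channel_width_bits)
{
    long channel_bw = channel_rate_bps * (long)channel_width_bits;
    return (long)bisection_width * channel_bw;
}
```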
23 Linear array, ring, mesh, torus
Processors are arranged as a d-dimensional grid or torus.
[Diagram: a linear array, a ring, and a 2D mesh of processors.]
24 Tree, Fat-tree Tree network: there is only one path between any pair of processors. Fat tree network: increase the number of communication links close to the root 24
25 Hypercube 1-D 2-D 3-D 25
26 Binary Reflected Gray code
G(i,d) denotes the i-th entry in a sequence of Gray codes of d bits. G(i,d+1) is derived from G(i,d) by reflecting the table and prefixing the reflected entries with 1 and the original entries with 0.
27 Example of BRG code
1-bit: 0, 1
2-bit: 00, 01, 11, 10
3-bit: 000, 001, 011, 010, 110, 111, 101, 100
[Table also shows the mapping of an 8-processor ring onto an 8-processor hypercube.]
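The reflect-and-prefix construction has a well-known closed form: the i-th entry of the d-bit Gray code sequence is i XOR (i >> 1). A minimal sketch:

```c
#include <assert.h>

/* Binary reflected Gray code, closed form: G(i) = i ^ (i >> 1).
 * Equivalent to the reflect-and-prefix construction above. */
unsigned gray(unsigned i)
{
    return i ^ (i >> 1);
}
```

For 3 bits this reproduces the sequence 000, 001, 011, 010, 110, 111, 101, 100.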
28 Embedding other networks into hypercubes
The hypercube is a rich topology; many other networks can be easily mapped onto it.
Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i,d) of the hypercube.
Mapping a 2^r x 2^s mesh onto a hypercube: processor (i,j) --> G(i,r) || G(j,s), where || denotes concatenation.
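The ring embedding can be checked mechanically: consecutive ring nodes must map to hypercube nodes at Hamming distance 1, i.e. hypercube neighbours. A sketch (function names are ours), here for d = 3:

```c
#include <assert.h>

/* Binary reflected Gray code, closed form. */
unsigned gray_code(unsigned i) { return i ^ (i >> 1); }

/* Number of set bits (Hamming weight). */
int popcount_u(unsigned x)
{
    int c = 0;
    while (x) { c += x & 1u; x >>= 1; }
    return c;
}

/* Returns 1 if ring neighbours i and i+1 (mod 2^d) land on
 * hypercube nodes that differ in exactly one bit. */
int ring_edge_ok(unsigned i, unsigned d)
{
    unsigned n = 1u << d;
    return popcount_u(gray_code(i) ^ gray_code((i + 1) % n)) == 1;
}
```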
29 Trade-off among different networks

Network               Min latency   Max BW per proc   Wires        Switches     Example
Completely connected  Constant      Constant          O(p^2)       -            -
Crossbar              Constant      Constant          O(p)         O(p^2)       Cray
Bus                   Constant      O(1/p)            O(p)         O(p)         SGI Challenge
Mesh                  O(sqrt p)     Constant          O(p)         -            Intel ASCI Red
Hypercube             O(log p)      Constant          O(p log p)   -            SGI Origin
Switched              O(log p)      Constant          O(p log p)   O(p log p)   IBM SP-2
30 Beowulf
Cluster built with commodity hardware components:
- PC hardware (x86, Alpha, PowerPC)
- Commercial high-speed interconnect (100Base-T, Gigabit Ethernet, Myrinet, SCI, InfiniBand)
- Linux or FreeBSD operating system
31 Appleseed: PowerPC cluster 31
32 Clusters of SMPs
The next generation of supercomputers will have thousands of SMP nodes connected: increase the computational power of the single node, keep the number of nodes low. A new programming approach is needed: MPI + threads (OpenMP, Pthreads, ...). Examples: ASCI White, Compaq SC, IBM SP.
33 Multithreaded architecture
The Tera MTA system (now Cray) provides scalable shared memory, in which every processor has equal access to every memory location: no concerns about the layout of memory. Each MTA processor has up to 128 RISC-like virtual processors. Each virtual processor is a hardware stream with its own instruction counter, register set, stream status word, and target and trap registers. A different hardware stream is activated every clock period.
34 Earth Simulator
For a time, the Earth Simulator was the most powerful supercomputer in the world: 40 TFlops peak, 10 TBytes of memory. It uses a crossbar switch to connect 640 nodes: 3,000 km of cables! Each node has 8 vector processors. Sustained performance of up to 20 TFlops on climate simulation and 15 TFlops on a 4096^3 isotropic turbulence simulation.
35 BlueGene/L
For a time, BGL was the #1 computer on the TOP500 supercomputer list.
- 32 x 32 x 64 3D torus, 131,000 processors
- System on a chip: low-cost, low-power processors
- 360 teraflops peak, 280 teraflops sustained (Linpack)
- 32 tebibytes of memory
36 Tianhe-1 and -2
National Super Computer Center in Guangzhou
- 3,120,000 cores: Intel Xeon E5-2692 (2.2 GHz) plus Intel Xeon Phi 31S1P accelerator cards
- TH Express-2 interconnect
- Linpack performance: 33.8 PFlop/s
- Power: 18.8 MWatt
- Intel CC compiler, Intel MKL; MPICH2 for communication
37 Programming models
Shared memory:
- Automatic parallelization
- Pthreads
- Compiler directives: OpenMP
Message passing:
- MPI: Message Passing Interface
- PVM: Parallel Virtual Machine
- HPF: High Performance Fortran
38 Pthreads
POSIX threads: a standard definition, but non-standard implementations. Hard to code.
39 More Recent Models
Extracting high performance from Graphics Processing Units (GPUs):
- CUDA for NVIDIA cards
- OpenCL for CPUs / GPUs / DSPs / FPGAs
- OpenACC for heterogeneous systems
And the list goes on... What should you do?
40 OpenMP New de-facto standard available on all major platforms Easy to implement Single source for parallel and serial code 40
41 MPI
Standard parallel interface: MPI-1, and MPI-2, which extends MPI-1 with one-sided communication. You need to rewrite the code, but the code is then portable across all architectures.
42 PVM
Parallel Virtual Machine: another popular message-passing interface. Useful in environments with multiple vendors and for the MPMD approach.
43 Simple code to compute π

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
44 OpenMP code

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
45 MPI code

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc, ierr
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (myid .eq. 0) then
         print *, 'Enter number of intervals: '
         read *, n
      endif
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = w * sum
46 MPI code (continued)

c     collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $                MPI_COMM_WORLD, ierr)
c     node 0 prints the answer
      if (myid .eq. 0) then
         print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
47 Pthreads code

#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>
#include <pthread.h>
#include <stdlib.h>

float sumall, width;
int i, iend;
/* the following should be thread-private in a correct version */
int ibegin;
int cut;
float x;
float xsum;
void *do_work();

main(argc, argv)
int argc;
char *argv[];
{
    /* Pi - program loops over slices in interval, summing the area of each slice */
    struct tms time_buf;
    float ticks, t1, t2;
    int intrvls, numthreads, istart;
    pthread_t pt[32];

    /* get intervals from command line */
    intrvls = atoi(argv[1]);
    numthreads = atoi(getenv("PL_NUM_THREADS"));
    printf(" intervals = %d PL_NUM_THREADS = %d \n", intrvls, numthreads);

    /* get number of clock ticks per second and initialize timer */
    ticks = sysconf(_SC_CLK_TCK);
    t2 = times(&time_buf);

    /* compute width of cuts */
    width = 1. / intrvls;
    sumall = 0.0;
48 Pthreads code (continued)

    /* loop over interval, summing areas */
    istart = 1;
    iend = intrvls / numthreads;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_create(&pt[i], NULL, do_work, (void *) istart);
        istart += iend;
    }
    do_work(istart);
    istart += iend;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_join(pt[i], NULL);
    }
    /* finish any remaining slices */
    iend = intrvls - (intrvls / numthreads) * numthreads;
    if (iend) do_work(istart);
    /* finish overall timing and write results */
    t1 = times(&time_buf);
    printf("time in main = %20.14e sum = %20.14f \n", (t1 - t2) / ticks, sumall);
    printf("error = %20.15e \n", sumall);
}

void *do_work(istart)
int istart;
{
    ibegin = istart;
    xsum = 0.0;
    for (cut = ibegin; cut < ibegin + iend; cut++)
    {
        x = (((float) cut) - .5) * width;
        xsum += width * 4. / (1. + x * x);
    }
    sumall += xsum;   /* unsynchronized update: needs a mutex in a correct version */
    return NULL;
}
49 PVM code 1/3

      program compute_pi_master
      include '~/pvm3/include/fpvm3.h'
      parameter (NTASKS = 5)
      parameter (INTERVALS = 1000)
      integer mytid
      integer tids(NTASKS)
      real sum, area
      real width
      integer i, numt, msgtype, bufid, bytes, who, info

      sum = 0.0
C     Enroll in PVM
      call pvmfmytid(mytid)
C     spawn off NTASKS workers
      call pvmfspawn('comppi.worker', PVMDEFAULT, ' ',
     +               NTASKS, tids, numt)
      width = 0.0
      i = 0
C     Multi-cast initial dummy message to workers
      msgtype = 0
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)
      call pvmfmcast(NTASKS, tids, msgtype, info)
C     compute interval width
      width = 1.0 / INTERVALS
C     for each interval: 1) receive area from a worker, 2) add area to sum,
C     3) send the worker a new interval number and width
      do i = 1, INTERVALS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfinitsend(PVMDATADEFAULT, info)
         call pvmfpack(INTEGER4, i, 1, 1, info)
         call pvmfpack(REAL4, width, 1, 1, info)
         call pvmfsend(who, msgtype, info)
      enddo
50 PVM code 2/3

C     Signal to workers that tasks are done
      i = -1
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)

C     Collect the last NTASKS areas and send the completion signal
      do i = 1, NTASKS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfsend(who, msgtype, info)
      enddo

      print 10, sum
 10   format(x, 'Computed value of Pi is ', F8.6)

      call pvmfexit(info)
      end
51 PVM code 3/3 (worker)

      include '~/pvm3/include/fpvm3.h'
      integer mytid, master, info
      real area
      real width, int_val, height
      integer int_num
C     Enroll in PVM
      call pvmfmytid(mytid)
C     who is sending me work?
      call pvmfparent(master)

C     receive first job from the master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)

C     While I've not been sent the signal to quit, keep processing
 40   if (int_num .eq. -1) goto 50
C     compute interval value from interval number
      int_val = int_num * width
C     compute height of given rectangle
      height = F(int_val)

C     compute area
      area = height * width

C     send area back to master
      call pvmfinitsend(PVMDATADEFAULT, info)
      call pvmfpack(REAL4, area, 1, 1, info)
      call pvmfsend(master, 9, info)

C     Wait for next job from master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)
      goto 40
C     all done
 50   call pvmfexit(info)
      end

      REAL FUNCTION F(X)
      REAL X
      F = 4.0 / (1.0 + X**2)
      RETURN
      END
Parallel Systems Prof. James L. Frankel Harvard University Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Architectures SISD (Single Instruction, Single Data)
More information30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy
Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationOutline. Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples
Outline Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples OVERVIEW y What is Parallel Computing? Parallel computing: use of multiple processors
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationSHARED MEMORY VS DISTRIBUTED MEMORY
OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors
More informationHigh Performance Computing (HPC) Introduction
High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing
More informationHigh Performance Computing - Parallel Computers and Networks. Prof Matt Probert
High Performance Computing - Parallel Computers and Networks Prof Matt Probert http://www-users.york.ac.uk/~mijp1 Overview Parallel on a chip? Shared vs. distributed memory Latency & bandwidth Topology
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationParallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?
Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More information6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT
6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among
More informationCSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing
Dr Izadi CSE-4533 Introduction to Parallel Processing Chapter 4 Models of Parallel Processing Elaborate on the taxonomy of parallel processing from chapter Introduce abstract models of shared and distributed
More informationEE 4683/5683: COMPUTER ARCHITECTURE
3/3/205 EE 4683/5683: COMPUTER ARCHITECTURE Lecture 8: Interconnection Networks Avinash Kodi, kodi@ohio.edu Agenda 2 Interconnection Networks Performance Metrics Topology 3/3/205 IN Performance Metrics
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationChapter Seven. Idea: create powerful computers by connecting many smaller ones
Chapter Seven Multiprocessors Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) vector processing may be coming back bad news:
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationGeneral introduction: GPUs and the realm of parallel architectures
General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years
More informationCSC630/CSC730: Parallel Computing
CSC630/CSC730: Parallel Computing Parallel Computing Platforms Chapter 2 (2.4.1 2.4.4) Dr. Joe Zhang PDC-4: Topology 1 Content Parallel computing platforms Logical organization (a programmer s view) Control
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationDheeraj Bhardwaj May 12, 2003
HPC Systems and Models Dheeraj Bhardwaj Department of Computer Science & Engineering Indian Institute of Technology, Delhi 110 016 India http://www.cse.iitd.ac.in/~dheerajb 1 Sequential Computers Traditional
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMultiprocessors - Flynn s Taxonomy (1966)
Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The
More informationIntroduction. CSCI 4850/5850 High-Performance Computing Spring 2018
Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationInterconnection Network
Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network
More information