High Performance Computing - Programming Paradigms and Scalability. Part 1: Introduction


High Performance Computing - Programming Paradigms and Scalability, Part 1: Introduction. PD Dr. rer. nat. habil. Ralf-Peter Mundani, Computation in Engineering (CiE), Scientific Computing (SCCS), Summer Term 2015.

General Remarks. Ralf-Peter Mundani, mundani@tum.de, room 3181, consultation hour by appointment; lecture: Tuesday, 12:00-13:30; exercise (fortnightly): Christoph Riesinger, riesinge@in.tum.de, Wednesday, 10:15-11:45. Examination: written, 90 minutes, all printed/written materials allowed (no electronic devices).

Overview of contents: part 1: introduction; part 2: high-performance networks; part 3: foundations; part 4: shared-memory programming; part 5: distributed-memory programming; part 6: examples of parallel algorithms. Part 1 covers motivation, a hardware excursion, supercomputers, the classification of parallel computers, and quantitative performance evaluation.

"If one ox could not do the job they did not try to grow a bigger ox, but used two oxen." (Grace Murray Hopper)

Motivation. Numerical simulation: from phenomena to predictions. Starting from a physical phenomenon or technical process, the pipeline comprises (1) modelling: determination of parameters, expression of relations; (2) numerical treatment: model discretisation, algorithm development; (3) numerical treatment becomes (3) implementation: software development, parallelisation; (4) visualisation: illustration of abstract simulation results; (5) validation: comparison of results with reality; (6) embedding: insertion into the working process. The disciplines involved span mathematics, computer science, and the application domain.

Why numerical simulation? Because experiments are sometimes impossible (life cycle of galaxies, weather forecast, terror attacks such as the bomb attack on the WTC in 1993), sometimes not welcome (avalanches, nuclear tests, medicine), sometimes very costly and time consuming (protein folding, material sciences; cf. the Mississippi basin model in Jackson, MS), and sometimes simply more expensive than simulation (aerodynamics, crash tests).

Why parallel programming and HPC? Complex problems (especially the so-called grand challenges) demand more computing power: climate or geophysics simulation (tsunami, e.g.), structure or flow simulation (crash test, e.g.), development systems (CAD, e.g.), large data analysis (the Large Hadron Collider at CERN, e.g.), military applications (cryptanalysis, e.g.). Performance increases come from faster hardware and more memory ("work harder"), more efficient algorithms and optimisation ("work smarter"), and parallel computing ("get some help").

Motivation: objectives (in case all resources were available N times).
Throughput: compute N problems simultaneously by running N instances of a sequential program with different data sets ("embarrassing parallelism"); SETI@home, e.g. Drawback: limited resources of the single nodes.
Response time: compute one problem in a fraction (1/N) of the time by running one instance (i.e. N processes) of a parallel program that jointly solves the problem; finding prime numbers, e.g. Drawback: writing a parallel program; communication.
Problem size: compute one problem with N-times larger data by running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for larger problem sizes; iterative solution of systems of linear equations (SLE), e.g. Drawback: writing a parallel program; communication.

Motivation: levels of parallelism. Qualitative meaning: the level(s) on which work is done in parallel; ordered by granularity: sub-instruction level, instruction level, block level, process level, program level.
Program level: parallel processing of different programs; independent units without any shared data; organised by the OS.
Process level: a program is subdivided into processes to be executed in parallel; each process consists of a larger amount of sequential instructions and some private data; communication is necessary in most cases (data exchange, e.g.); such a process is often referred to as a heavy-weight process.
Block level: blocks of instructions are executed in parallel; each block consists of few instructions and shares data with others; communication via shared variables, plus synchronisation mechanisms; such a block is often referred to as a light-weight process (thread).
Instruction level: parallel execution of machine instructions; optimising compilers can increase this potential by reordering commands.
Sub-instruction level: instructions are further subdivided into units to be executed in parallel or via overlapping (vector operations, e.g.).

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Definition of parallel computers: "A collection of processing elements that communicate and cooperate to solve large problems" (Almasi and Gottlieb, 1989). Possible appearances of such processing elements: specialised units (steps of a vector pipeline, e.g.); parallel features in modern monoprocessors (instruction pipelining, superscalar architectures, VLIW, multithreading, multicore, ...); several uniform arithmetical units (processing elements of array computers or GPGPUs, e.g.); complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers); parallel computers or clusters connected via WAN (so-called metacomputers).

Instruction pipelining: instruction execution involves several operations, (1) instruction fetch (IF), (2) decode (DE), (3) fetch operands (OF), (4) execute (EX), (5) write back (WB), which are executed successively; hence, only one part of the CPU works at a given moment. Observation: while one particular stage of an instruction is being processed, the other stages are idle. Hence, multiple instructions can be overlapped in execution, i.e. instruction pipelining (similar to assembly lines): while instruction N is in the DE stage, instruction N+1 already enters the IF stage, and so on. Advantage: no additional hardware necessary.
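To make the gain from overlapping concrete, the following short Python sketch (an illustration added here, not part of the original slides) counts cycles for N instructions on a k-stage pipeline: without pipelining every instruction occupies all k stages one after another, with pipelining a new instruction enters the pipe every cycle once it is filled.

```python
def cycles_sequential(n_instructions: int, n_stages: int) -> int:
    # Each instruction passes through all stages before the next one starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    # Fill the pipeline once (n_stages cycles), then retire one instruction per cycle.
    return n_stages + (n_instructions - 1)

if __name__ == "__main__":
    n, k = 1000, 5  # e.g. the 5 stages IF, DE, OF, EX, WB
    print(cycles_sequential(n, k))                            # 5000 cycles
    print(cycles_pipelined(n, k))                             # 1004 cycles
    print(cycles_sequential(n, k) / cycles_pipelined(n, k))   # speed-up close to k
```

For long instruction streams the speed-up approaches the number of pipeline stages, which is exactly the point of the assembly-line analogy.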

Superscalar: faster CPU throughput due to the simultaneous execution of instructions within one clock cycle via redundant functional units (ALUs, multipliers, ...); a dispatcher decides at runtime which instructions read from memory can be executed in parallel and dispatches them to different functional units; for instance, the PowerPC 970 (4 ALUs, 2 FPUs). However, the performance improvement is limited by the intrinsic parallelism of the instruction stream. Pipelining is also possible for superscalar architectures, so several pipelines proceed side by side.

Very long instruction word (VLIW): in contrast to superscalar architectures, the compiler groups parallel executable instructions during compilation (pipelining still possible). Advantage: no additional hardware logic necessary. Drawback: not always fully usable (dummy filling with NOPs).

Vector units: simultaneous execution of one instruction on a one-dimensional array of data (a vector), e.g. the component-wise operation (A_1, A_2, ..., A_N)^T op (B_1, B_2, ..., B_N)^T = (C_1, C_2, ..., C_N)^T in a single vector instruction. Vector units first appeared in the 1970s and were the basis of most supercomputers in the 1980s and 1990s. Drawbacks: specialised hardware and vector registers, very expensive, limited application areas (mostly CFD, CSD, ...).
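As a software-level analogy (a sketch assuming NumPy is available; this is not the vector hardware discussed on the slide), a whole-array operation expresses the same idea as one vector instruction applied to N elements, and NumPy typically maps it to the SIMD/vector instructions of the CPU:

```python
import numpy as np

n = 8
a = np.arange(n, dtype=np.float64)   # A_1 ... A_N
b = np.ones(n, dtype=np.float64)     # B_1 ... B_N

# One "vector operation" on all N elements instead of an explicit element loop.
c = a + b                            # C_i = A_i + B_i
print(c)
```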

Dual core, quad core, many core, and multicore. Observation: the frequency f (and thus the core voltage v) has been increased over the past years; problem: the thermal power dissipation grows proportionally to f*v^2. A 25% reduction in performance (i.e. in core voltage) leads to approximately a 50% reduction in dissipation (normal CPU vs. reduced CPU); a back-of-the-envelope sketch follows below. Idea: installation of two cores per die with the same dissipation as a single-core system. Single-core, dual-core, and quad-core dies mainly differ in their cache organisation: each core has a private L1 cache, pairs of cores share an L2 cache, and all cores connect to memory via the FSB (front side bus, i.e. the connection to memory via the north bridge).
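The voltage/frequency argument can be made concrete with a tiny calculation (a sketch under the simplifying assumption that the frequency scales with the core voltage, so the slide's f*v^2 behaves like v^3; the 0.75 scaling factor is the 25% reduction mentioned above):

```python
# Dissipation model from the slide: P ~ f * v**2; assume (simplification) f ~ v.
def relative_dissipation(voltage_scale: float) -> float:
    f = voltage_scale              # frequency assumed to scale with voltage
    v = voltage_scale
    return f * v**2

single_core = relative_dissipation(1.0)     # reference: 1.0
reduced_core = relative_dissipation(0.75)   # ~0.42, i.e. roughly half the dissipation
dual_core = 2 * reduced_core                # two reduced cores: ~0.84 of the original budget
dual_perf = 2 * 0.75                        # ~1.5x the original throughput in the ideal case
print(single_core, reduced_core, dual_core, dual_perf)
```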

Intel Nehalem (Core i7): four cores (core 0 to core 3), each with private L1 and L2 caches, plus a shared L3 cache. QPI (QuickPath Interconnect) replaces the FSB; QPI is a point-to-point interconnection with the memory controller now on-die in order to allow both reduced latency and higher bandwidth, up to (theoretically) 25.6 GB/s data transfer, i.e. roughly twice the FSB (source: G. Wellein, RRZE).

Intel Xeon E5 Sandy Bridge series: 2 CPUs connected by 2 QPI links (Intel QuickPath Interconnect). Each QPI link has 1 sending and 1 receiving port: 8 GT/s times 16 bit/transfer payload times 2 directions divided by 8 bit/byte = 32 GB/s maximum bandwidth per QPI link; with 2 QPI links this gives 2 x 32 GB/s = 64 GB/s maximum bandwidth (a short sketch of this arithmetic follows at the end of this passage).

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Arrival of clusters: in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices, hence a growing attractiveness for parallel computers. 1994: Beowulf, the first parallel computer built completely out of commodity hardware (NASA Goddard Space Flight Centre): 16 Intel DX4 processors, multiple 10 Mbit Ethernet links, Linux with GNU compilers, MPI library. 1996: a Beowulf cluster performing more than 1 GFlops. 1997: a 140-node cluster performing more than 10 GFlops.
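Returning to the QPI bandwidth figure quoted above, it can be reproduced with a few lines (the numbers are the ones stated on the slide; the code only performs the unit conversion):

```python
transfer_rate_gt_s = 8          # 8 GT/s per QPI link
payload_bits_per_transfer = 16  # 16 bit payload per transfer
directions = 2                  # 1 sending and 1 receiving port
bits_per_byte = 8

gb_per_s_per_link = (transfer_rate_gt_s * payload_bits_per_transfer
                     * directions / bits_per_byte)
print(gb_per_s_per_link)        # 32.0 GB/s per QPI link
print(2 * gb_per_s_per_link)    # 64.0 GB/s for two QPI links
```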

Supercomputers. Supercomputing, or high-performance scientific computing, is the most important application of the big "number crunchers"; national initiatives exist due to the huge budget requirements. Accelerated Strategic Computing Initiative (ASCI) in the U.S., in the sequel of the nuclear testing moratorium; decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.; start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFlops computer); then ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, ...; meanwhile a new high-end computing memorandum (2004).

Federal "Bundeshöchstleistungsrechner" initiative in Germany, decided in the mid-nineties: three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich), one new installation every second year (i.e. a six-year upgrade cycle for each centre), the newest one to be among the top 10 of the world. Overview and state of the art: the Top500 list, updated every six months. Finally, a somewhat different definition: "Supercomputer: turns CPU-bound problems into I/O-bound problems." (Ken Batcher)

Moore's law: an observation of Intel co-founder Gordon E. Moore describing an important trend in the history of computer hardware (1965): the number of transistors that can be placed on an integrated circuit increases exponentially, doubling approximately every eighteen months.
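The doubling rule can be written down directly (a sketch; the 18-month period is the figure used on the slide, and the starting transistor count in the example is a made-up value):

```python
def transistors(years: float, start_count: float,
                doubling_period_years: float = 1.5) -> float:
    # Exponential growth: the count doubles every doubling_period_years.
    return start_count * 2 ** (years / doubling_period_years)

# Hypothetical example: starting from 1 million transistors,
# after 15 years the rule predicts 2**10 = 1024 times as many.
print(transistors(15, 1e6))   # ~1.024e9
```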

Some numbers from the Top500 (cont'd): Citius, altius, fortius! The 10 fastest supercomputers in the world (as of November 2014).

The Earth Simulator, world's #1 from 2002, installed in Yokohama, Japan (ES building approx. 50 m x 65 m x 17 m), based on the NEC SX-6 architecture and developed by three governmental agencies. A highly parallel vector supercomputer: 640 nodes (plus 2 control and 128 data-switching nodes), each with 8 vector processors (8 GFlops each) and 16 GB shared memory; in total 5120 processors (40.96 TFlops peak performance) and 10 TB memory, with a sustained Linpack performance likewise in the TFlops range. The nodes are connected by a single-stage crossbar (83,200 cables with a total length of 2400 km; 8 TB/s total bandwidth); furthermore 700 TB disc space and 1.60 PB mass storage.

BlueGene/L, installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM) and the world's #1 system for several years; a cooperation of the DoE, LLNL, and IBM. A massively parallel supercomputer: 65,536 nodes (plus 12 front-end and 1204 I/O nodes), each with 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory; in total 131,072 processors, with peak and sustained (Linpack) performance in the hundreds of TFlops. The nodes are configured as a 3D torus; a global reduction tree allows fast global operations (max, sum, ...) within a few microseconds; a 1024 Gbps link connects to the global parallel file system; furthermore 806 TB disc space; operating system SuSE SLES 9.

Roadrunner, world's #1 from 2008, installed at LANL, NM, USA; installation costs about $120 million; the first hybrid supercomputer: dual-core Opterons combined with the Cell Broadband Engine; 129,600 cores and 98 TB memory, the first system to exceed a sustained PFlops on Linpack. Standard processing (file system I/O, e.g.) is handled by the Opterons, while mathematically and CPU-intensive tasks are handled by the Cells; 2.35 MW power consumption (about 437 MFlops per watt). Primary usage: ensuring the safety and reliability of the nation's nuclear weapons stockpile, plus real-time applications (cause and effect in capital markets, renderings of bone structures and tissues while patients are being examined, e.g.).

HLRB II, for a time the world's #6, installed in 2006 at LRZ, Garching; installation costs 38 million (EUR), monthly costs approx. 400,000 (EUR); upgrade finished in 2007; one of Germany's 3 federal supercomputers. An SGI Altix 4700 consisting of 19 nodes (SGI NUMAlink 2D torus) of 256 blades each (ccNUMA, NUMAlink fat tree within a partition), with Intel Itanium2 Montecito dual-core processors (12.80 GFlops) and 4 GB memory per core; 9728 cores (62.30 TFlops peak performance) and 39 TB memory, with a sustained Linpack performance likewise in the TFlops range; footprint 24 m x 12 m; total weight 103 metric tons.

SuperMUC, for a time the world's #4, installed in 2012 at LRZ, Garching; an IBM System x iDataPlex, (still) one of Germany's 3 federal supercomputers. It consists of 19 islands (InfiniBand FDR10 pruned tree with a 4:1 intra-island/inter-island ratio): 18 thin islands with 512 nodes each (288 TB memory in total), Sandy Bridge-E Xeon E5 (2 CPUs with 8 cores each per node), and 1 fat island with 205 nodes (52 TB memory in total), Westmere-EX Xeon E7 (4 CPUs with 10 cores each per node); 147,456 cores (3.185 PFlops peak performance, thin islands only) and a sustained Linpack performance likewise in the PFlops range; footprint 21 m x 26 m; warm water cooling.

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Classification of Parallel Computers. Standard classification according to Flynn: global data and instruction streams as criterion; instruction stream: sequence of commands to be executed; data stream: sequence of data subject to instruction streams. Two-dimensional subdivision according to the number of instructions a computer can execute per time and the number of data elements it can process per time. Hence, Flynn distinguishes four classes of architectures: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), MIMD (multiple instruction, multiple data). Drawback: very different computers may belong to the same class.

SISD: one processing unit that has access to one data memory and one program memory; the classical monoprocessor following von Neumann's principle.

SIMD: several processing units, each with separate access to a (shared or distributed) data memory, and one program memory; synchronous execution of instructions. Examples: array computers, vector computers. Advantage: easy programming model due to a control flow with strictly synchronous-parallel execution of all instructions. Drawbacks: specialised hardware is necessary and easily becomes outdated due to recent developments on the commodity market.

MISD: several processing units that have access to one data memory; several program memories. Not a very popular class (mainly for special applications such as digital signal processing); it operates on a single stream of data, forwarding results from one processing unit to the next. Example: systolic arrays (networks of primitive processing elements that "pump" data).

MIMD: several processing units, each with separate access to a (shared or distributed) data memory; several program memories. Classification according to the (physical) memory organisation: shared memory with a shared (global) address space, or distributed memory with a distributed (local) address space. Examples: multiprocessor systems, networks of computers.

Processor coupling: the cooperation of processors and computers as well as their shared use of various resources requires communication and synchronisation. The following types of processor coupling can be distinguished: memory-coupled multiprocessor systems (MemMS) and message-coupled multiprocessor systems (MesMS). A global memory with a shared address space gives MemMS (SMP); distributed memory with a shared address space gives a hybrid form (Mem-MesMS); distributed memory with a distributed address space gives MesMS.

Uniform memory access (UMA): each processor has direct access via the network to each memory module M, with the same access times to all data. The standard programming model can be used (i.e. no explicit send/receive of messages is necessary); communication and synchronisation take place via shared variables, and inconsistencies (write conflicts, e.g.) have in general to be prevented by the programmer, as illustrated by the sketch below.
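A minimal shared-memory sketch in Python (threads standing in for processors; the lock plays the role of the synchronisation mechanism the programmer has to provide, and is only an analogy, not the lecture's own example):

```python
import threading

counter = 0                    # shared variable in the common address space
lock = threading.Lock()

def work(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        with lock:             # prevents the write conflict on the shared variable
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                 # 400000; without the lock, updates may get lost
```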

Symmetric multiprocessor (SMP): only a small number of processors, in most cases a central bus, one address space (UMA), but bad scalability; cache coherence implemented in hardware (i.e. a read always provides a variable's value from its last write). Examples: double or quad boards, SGI Challenge.

Non-uniform memory access (NUMA): the memory modules are physically distributed among the processors; shared address space, but access times depend on the location of the data (i.e. local addresses are faster than remote addresses), and these differences in access times are visible in the program. Examples: DSM/VSM, Cray T3E.

Cache-coherent non-uniform memory access (ccNUMA): caches for local and remote addresses; cache coherence implemented in hardware for the entire address space; scalability problems due to frequent cache updates. Example: SGI Origin 2000.

Cache-only memory access (COMA): each processor has only cache memory; the entirety of all cache memories forms the global shared memory; cache coherence implemented in hardware. Example: Kendall Square Research KSR-1.

No remote memory access (NORMA): each processor has direct access to its local memory only; access to remote memory only via explicit message exchange (due to the distributed address space); synchronisation is possible implicitly via the exchange of messages; performance improvements between memory and I/O are possible due to parallel data transfer (Direct Memory Access, e.g.). Examples: IBM SP2, ASCI Red/Blue/White.

Difference between processes and threads: in the thread model (UMA, NUMA), one program (*.exe, *.out, e.g.) is executed by several threads sharing a common address space; in the process model (NORMA), several instances of a program run as separate processes and communicate via messages.

Overview: motivation, hardware excursion, supercomputers, classification of parallel computers, quantitative performance evaluation.

Quantitative Performance Evaluation. Execution time: the time T of a parallel program between the start of the execution on one processor and the end of all computations on the last processor. During execution, every processor is in one of the following states: compute (T_COMP: time spent for computations), communicate (T_COMM: time spent for send and receive operations), idle (T_IDLE: time spent waiting for sending/receiving messages). Hence T = T_COMP + T_COMM + T_IDLE.

Comparison multiprocessor vs. monoprocessor: correlating the performance of multi- and monoprocessor systems; important: a program that can be executed on both systems. Definitions: P(1) is the number of unit operations of a program on the monoprocessor system; P(p) is the number of unit operations on the multiprocessor system with p processors; T(1) is the execution time on the monoprocessor system (measured in steps or clock cycles); T(p) is the execution time on the multiprocessor system with p processors (measured in steps or clock cycles).

Simplifying preconditions: T(1) = P(1), since one operation is executed per step on the monoprocessor system; T(p) <= P(p), since more than one operation can be executed per step (for p >= 2) on the multiprocessor system with p processors.

Speed-up: S(p) = T(1)/T(p) indicates the improvement in processing speed, with 1 <= S(p) <= p. Efficiency: E(p) = S(p)/p indicates the relative improvement in processing speed, normalised by the number of processors p, with 1/p <= E(p) <= 1 (see the sketch below).

Speed-up and efficiency can be seen in two different ways. Algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system (absolute speed-up, absolute efficiency). Algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system, which is unfair due to the communication and synchronisation overhead (relative speed-up, relative efficiency).
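A direct transcription of the definitions of S(p) and E(p) (illustrative only; the timing values in the example are invented):

```python
def speedup(t1: float, tp: float) -> float:
    # S(p) = T(1) / T(p)
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    # E(p) = S(p) / p
    return speedup(t1, tp) / p

# Hypothetical measurements: 120 s sequentially, 20 s on 8 processors.
print(speedup(120.0, 20.0))         # 6.0
print(efficiency(120.0, 20.0, 8))   # 0.75
```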

Scalability. Objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i.e. a linear performance increase with an efficiency close to 1. Important for scalability is a sufficient problem size: one porter may carry one suitcase in a minute; 60 porters won't do it in a second, but 60 porters may carry 60 suitcases in a minute. In case of a fixed problem size and an increasing number of processors, saturation occurs for a certain value of p, hence scalability is limited; when scaling the number of processors together with the problem size (so-called scaled problem analysis), this effect does not appear for well scalable hard- and software systems.

Amdahl's law: probably the most important and most famous estimate for the speed-up (even if quite pessimistic). Underlying model: each program has a sequential part s, 0 <= s <= 1, that can only be executed sequentially (synchronisation, data I/O, ...); furthermore, each program has a parallelisable part 1-s that can be executed in parallel by several processes (finding the maximum value within a set of numbers, e.g.). Hence, the execution time of the parallel program executed on p processors can be written as
T(p) = s*T(1) + ((1-s)/p)*T(1).
The speed-up can thus be computed as
S(p) = T(1)/T(p) = 1/(s + (1-s)/p),
and when increasing p we finally get Amdahl's law:
lim (p -> infinity) S(p) = 1/s.
The speed-up is bounded: S(p) <= 1/s. Example: for s = 0.1 the speed-up is bounded by 10, independent of p. Where's the error? The sequential part can have a dramatic impact on the speed-up; therefore a central effort of all (parallel) algorithms is to keep s small. Many parallel programs have a small sequential part (s < 0.1).
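Amdahl's bound in executable form (the same formula as above; the values of s and p are just examples):

```python
def amdahl_speedup(s: float, p: int) -> float:
    # S(p) = 1 / (s + (1 - s) / p)
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1
for p in (1, 2, 4, 8, 16, 64, 1024):
    print(p, round(amdahl_speedup(s, p), 2))
# The values approach, but never exceed, 1/s = 10.
```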

Gustafson's law: addresses the shortcomings of Amdahl's law, as it states that any sufficiently large problem can be efficiently parallelised; instead of a fixed problem size it supposes a fixed-time concept. Underlying model: the execution time on the parallel machine is normalised to 1; this contains a non-parallelisable part sigma, 0 <= sigma <= 1. Hence, the execution time of the sequential program on the monoprocessor can be written as
T(1) = sigma + p*(1 - sigma),
and the speed-up can thus be computed as
S(p) = sigma + p*(1 - sigma) = p + sigma*(1 - p).
Difference to Amdahl: the sequential part s(p) is not constant, but gets smaller with increasing p,
s(p) = sigma/(sigma + p*(1 - sigma)), with s(p) -> 0 for p -> infinity.
This is often more realistic, because more processors are used for a larger problem size, and there the parallelisable parts typically increase (more computations, fewer declarations, ...); the speed-up is not bounded for increasing p.

Some more thoughts about speed-up: theory tells us that a superlinear speed-up does not exist, since each parallel algorithm can be simulated on a monoprocessor system by emulating, in a loop, always the next step of one processor of the multiprocessor system. But a superlinear speed-up can be observed when improving an inferior sequential algorithm, or when a parallel program that does not fit into the main memory of the monoprocessor system runs completely in the caches and main memories of the nodes of the multiprocessor system.

Communication-computation ratio (CCR): an important quantity measuring the success of a parallelisation; the relation of pure communication time to pure computing time; a small CCR is favourable; typically, the CCR decreases with increasing problem size. Example: an NxN matrix distributed among p processors (N/p rows each); an iterative method replaces, in each step, each matrix element by the average of its eight neighbour values, hence the two neighbouring rows are always needed from the adjacent processors. Computation time: 8*N*N/p; communication time: 2*N; CCR: (2*N)/(8*N*N/p) = p/(4*N). What does this mean? For a fixed p, the CCR shrinks as N grows, i.e. larger problems spend relatively less time communicating.
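The scaled speed-up and the CCR example can be checked in the same way (a sketch; sigma, p and N below are arbitrary example values):

```python
def gustafson_speedup(sigma: float, p: int) -> float:
    # S(p) = sigma + p * (1 - sigma): grows without bound as p increases.
    return sigma + p * (1.0 - sigma)

def ccr(n: int, p: int) -> float:
    # Iterative stencil on an N x N matrix, N/p rows per processor:
    # computation ~ 8*N*N/p, communication ~ 2*N  =>  CCR = p / (4*N).
    return (2.0 * n) / (8.0 * n * n / p)

print(gustafson_speedup(0.1, 100))   # 90.1
print(ccr(1000, 16))                 # 0.004 == 16 / (4 * 1000)
```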

"Twelve ways to fool the masses when giving performance results on parallel computers." (David H. Bailey, NASA Ames Research Centre)
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.

High Performance Computing Programming Paradigms and Scalability Part 1: Introduction

High Performance Computing Programming Paradigms and Scalability Part 1: Introduction High Performance Computing Programming Paradigms and Scalability Part 1: Introduction PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering (CiE) Scientific Computing (SCCS) Summer Term

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming ATHENS Course on Parallel Numerical Simulation Munich, March 19 23, 2007 Dr. Ralf-Peter Mundani Scientific Computing in Computer Science Technische Universität München

More information

Practical Scientific Computing

Practical Scientific Computing Practical Scientific Computing Performance-optimised Programming Preliminary discussion, 17.7.2007 Dr. Ralf-Peter Mundani, mundani@tum.de Dipl.-Ing. Ioan Lucian Muntean, muntean@in.tum.de Dipl.-Geophys.

More information

Practical Scientific Computing

Practical Scientific Computing Practical Scientific Computing Performance-optimized Programming Preliminary discussion: July 11, 2008 Dr. Ralf-Peter Mundani, mundani@tum.de Dipl.-Ing. Ioan Lucian Muntean, muntean@in.tum.de MSc. Csaba

More information

Parallel Computing. PD Dr. rer. nat. habil. Ralf-Peter Mundani. Computation in Engineering / BGU Scientific Computing in Computer Science / INF

Parallel Computing. PD Dr. rer. nat. habil. Ralf-Peter Mundani. Computation in Engineering / BGU Scientific Computing in Computer Science / INF Parallel Computing PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2018/19 General Remarks Ralf-Peter Mundani email

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

BİL 542 Parallel Computing

BİL 542 Parallel Computing BİL 542 Parallel Computing 1 Chapter 1 Parallel Programming 2 Why Use Parallel Computing? Main Reasons: Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion,

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

What is Parallel Computing?

What is Parallel Computing? What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing

More information

CCS HPC. Interconnection Network. PC MPP (Massively Parallel Processor) MPP IBM

CCS HPC. Interconnection Network. PC MPP (Massively Parallel Processor) MPP IBM CCS HC taisuke@cs.tsukuba.ac.jp 1 2 CU memoryi/o 2 2 4single chipmulti-core CU 10 C CM (Massively arallel rocessor) M IBM BlueGene/L 65536 Interconnection Network 3 4 (distributed memory system) (shared

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Computer Architecture

Computer Architecture Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors

More information

Parallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam

Parallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Parallel Computer Architectures Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam Outline Flynn s Taxonomy Classification of Parallel Computers Based on Architectures Flynn s Taxonomy Based on notions of

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

What are Clusters? Why Clusters? - a Short History

What are Clusters? Why Clusters? - a Short History What are Clusters? Our definition : A parallel machine built of commodity components and running commodity software Cluster consists of nodes with one or more processors (CPUs), memory that is shared by

More information

CS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2

CS 770G - Parallel Algorithms in Scientific Computing Parallel Architectures. May 7, 2001 Lecture 2 CS 770G - arallel Algorithms in Scientific Computing arallel Architectures May 7, 2001 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan Kaufmann

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Lecture 8: RISC & Parallel Computers. Parallel computers

Lecture 8: RISC & Parallel Computers. Parallel computers Lecture 8: RISC & Parallel Computers RISC vs CISC computers Parallel computers Final remarks Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation in computer

More information

Multi-core Programming - Introduction

Multi-core Programming - Introduction Multi-core Programming - Introduction Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

CPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner

CPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner CS104 Computer Organization and rogramming Lecture 20: Superscalar processors, Multiprocessors Robert Wagner Faster and faster rocessors So much to do, so little time... How can we make computers that

More information

Dr. Joe Zhang PDC-3: Parallel Platforms

Dr. Joe Zhang PDC-3: Parallel Platforms CSC630/CSC730: arallel & Distributed Computing arallel Computing latforms Chapter 2 (2.3) 1 Content Communication models of Logical organization (a programmer s view) Control structure Communication model

More information

Parallel Computing Platforms

Parallel Computing Platforms Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

CS Parallel Algorithms in Scientific Computing

CS Parallel Algorithms in Scientific Computing CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan

More information

Chapter 1: Perspectives

Chapter 1: Perspectives Chapter 1: Perspectives Copyright @ 2005-2008 Yan Solihin Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical,

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow.

Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Let s say I give you a homework assignment today with 100 problems. Each problem takes 2 hours to solve. The homework is due tomorrow. Big problems and Very Big problems in Science How do we live Protein

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Intro to Multiprocessors

Intro to Multiprocessors The Big Picture: Where are We Now? Intro to Multiprocessors Output Output Datapath Input Input Datapath [dapted from Computer Organization and Design, Patterson & Hennessy, 2005] Multiprocessor multiple

More information

Performance Report Guidelines. Babak Behzad, Alex Brooks, Vu Dang 12/04/2013

Performance Report Guidelines. Babak Behzad, Alex Brooks, Vu Dang 12/04/2013 Performance Report Guidelines Babak Behzad, Alex Brooks, Vu Dang 12/04/2013 Motivation We need a common way of presenting performance results on Blue Waters! Different applications Different needs Different

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Thread and Data parallelism in CPUs - will GPUs become obsolete?

Thread and Data parallelism in CPUs - will GPUs become obsolete? Thread and Data parallelism in CPUs - will GPUs become obsolete? USP, Sao Paulo 25/03/11 Carsten Trinitis Carsten.Trinitis@tum.de Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) Institut für

More information

COSC 6374 Parallel Computation. Parallel Computer Architectures

COSC 6374 Parallel Computation. Parallel Computer Architectures OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:

More information

BlueGene/L (No. 4 in the Latest Top500 List)

BlueGene/L (No. 4 in the Latest Top500 List) BlueGene/L (No. 4 in the Latest Top500 List) first supercomputer in the Blue Gene project architecture. Individual PowerPC 440 processors at 700Mhz Two processors reside in a single chip. Two chips reside

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Dheeraj Bhardwaj May 12, 2003

Dheeraj Bhardwaj May 12, 2003 HPC Systems and Models Dheeraj Bhardwaj Department of Computer Science & Engineering Indian Institute of Technology, Delhi 110 016 India http://www.cse.iitd.ac.in/~dheerajb 1 Sequential Computers Traditional

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

Supercomputers. Alex Reid & James O'Donoghue

Supercomputers. Alex Reid & James O'Donoghue Supercomputers Alex Reid & James O'Donoghue The Need for Supercomputers Supercomputers allow large amounts of processing to be dedicated to calculation-heavy problems Supercomputers are centralized in

More information

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance

More information

COSC 6374 Parallel Computation. Parallel Computer Architectures

COSC 6374 Parallel Computation. Parallel Computer Architectures OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy

More information

HPC Issues for DFT Calculations. Adrian Jackson EPCC

HPC Issues for DFT Calculations. Adrian Jackson EPCC HC Issues for DFT Calculations Adrian Jackson ECC Scientific Simulation Simulation fast becoming 4 th pillar of science Observation, Theory, Experimentation, Simulation Explore universe through simulation

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Lecture notes for CS Chapter 4 11/27/18

Lecture notes for CS Chapter 4 11/27/18 Chapter 5: Thread-Level arallelism art 1 Introduction What is a parallel or multiprocessor system? Why parallel architecture? erformance potential Flynn classification Communication models Architectures

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Parallel Computer Architecture Concepts

Parallel Computer Architecture Concepts Outline This image cannot currently be displayed. arallel Computer Architecture Concepts TDDD93 Lecture 1 Christoph Kessler ELAB / IDA Linköping university Sweden 2015 Lecture 1: arallel Computer Architecture

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

Advances of parallel computing. Kirill Bogachev May 2016

Advances of parallel computing. Kirill Bogachev May 2016 Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being

More information

Fabio AFFINITO.

Fabio AFFINITO. Introduction to High Performance Computing Fabio AFFINITO What is the meaning of High Performance Computing? What does HIGH PERFORMANCE mean??? 1976... Cray-1 supercomputer First commercial successful

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Cluster Network Products

Cluster Network Products Cluster Network Products Cluster interconnects include, among others: Gigabit Ethernet Myrinet Quadrics InfiniBand 1 Interconnects in Top500 list 11/2009 2 Interconnects in Top500 list 11/2008 3 Cluster

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Introduction to High-Performance Computing

Introduction to High-Performance Computing Introduction to High-Performance Computing Simon D. Levy BIOL 274 17 November 2010 Chapter 12 12.1: Concurrent Processing High-Performance Computing A fancy term for computers significantly faster than

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

CMSC 611: Advanced. Parallel Systems

CMSC 611: Advanced. Parallel Systems CMSC 611: Advanced Computer Architecture Parallel Systems Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Fundamentals of Computers Design

Fundamentals of Computers Design Computer Architecture J. Daniel Garcia Computer Architecture Group. Universidad Carlos III de Madrid Last update: September 8, 2014 Computer Architecture ARCOS Group. 1/45 Introduction 1 Introduction 2

More information

PARALLEL COMPUTER ARCHITECTURES

PARALLEL COMPUTER ARCHITECTURES 8 ARALLEL COMUTER ARCHITECTURES 1 CU Shared memory (a) (b) Figure 8-1. (a) A multiprocessor with 16 CUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different

More information

Chapter 2 Parallel Computer Architecture

Chapter 2 Parallel Computer Architecture Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA

High Performance Computing. Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Leopold Grinberg T. J. Watson IBM Research Center, USA High Performance Computing Why do we need HPC? High Performance Computing Amazon can ship products within hours would it

More information

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines

More information

CSE 262 Spring Scott B. Baden. Lecture 1 Introduction

CSE 262 Spring Scott B. Baden. Lecture 1 Introduction CSE 262 Spring 2007 Scott B. Baden Lecture 1 Introduction Introduction Your instructor is Scott B. Baden, baden@cs.ucsd.edu Office: room 3244 in EBU3B Office hours: Tuesday after class (week 1) or by appointment

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother?

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother? Parallel Programming Concepts Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04 Parallel Background Why Bother? 1 What is Parallel Programming? Simultaneous use of multiple

More information

CS420/CSE 402/ECE 492. Introduction to Parallel Programming for Scientists and Engineers. Spring 2006

CS420/CSE 402/ECE 492. Introduction to Parallel Programming for Scientists and Engineers. Spring 2006 CS420/CSE 402/ECE 492 Introduction to Parallel Programming for Scientists and Engineers Spring 2006 1 of 28 Additional Foils 0.i: Course organization 2 of 28 Instructor: David Padua. 4227 SC padua@uiuc.edu

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Parallel Programming

Parallel Programming Parallel Programming Introduction Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Acknowledgements Prof. Felix Wolf, TU Darmstadt Prof. Matthias

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011 CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Introduction to parallel computing

Introduction to parallel computing Introduction to parallel computing 2. Parallel Hardware Zhiao Shi (modifications by Will French) Advanced Computing Center for Education & Research Vanderbilt University Motherboard Processor https://sites.google.com/

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

ARCHITECTURES FOR PARALLEL COMPUTATION

ARCHITECTURES FOR PARALLEL COMPUTATION Datorarkitektur Fö 11/12-1 Datorarkitektur Fö 11/12-2 Why Parallel Computation? ARCHITECTURES FOR PARALLEL COMTATION 1. Why Parallel Computation 2. Parallel Programs 3. A Classification of Computer Architectures

More information