Parallel Programming Concepts. Summary. Dr. Peter Tröger M.Sc. Frank Feinbube


2 Parallel Programming Concepts Summary Dr. Peter Tröger M.Sc. Frank Feinbube

3 Course Topics
- The Parallelization Problem: power wall, memory wall, Moore's law; terminology and metrics
- Shared Memory Parallelism: theory of concurrency, hardware today and in the past; programming models, optimization, profiling
- Shared Nothing Parallelism: theory of concurrency, hardware today and in the past; programming models, optimization, profiling
- Accelerators
- Patterns
- Future trends

4 Scaring Students with Word Clouds...

5 The Free Lunch Is Over
- Clock speed curve flattened in 2003: heat, power consumption, leakage
- 2-3 GHz since 2001 (!)
- Speeding up serial instruction execution through clock speed improvements no longer works
- "We stumbled into the Many-Core Era" [Herb Sutter, 2009]

6 The Power Wall
- Air cooling capabilities are limited: maximum chip temperature, hot spot problem
- Static and dynamic power consumption must be limited
- Power consumption increases with Moore's law, but growth in hardware performance is still expected
- Further reducing the voltage as compensation does not work endlessly: lower limit around 0.7 V, strange physical effects
- Next-generation processors need to use even less power: lower the frequencies and scale them dynamically, use only parts of the processor at a time ("dark silicon"), build energy-efficient special-purpose hardware
- No chance for faster processors through frequency increase

7 Memory Wall
- Caching: well-established optimization technique for performance
- Relies on data locality: some instructions are often used (e.g. loops), some data is often used (e.g. local variables)
- Hardware keeps a copy of the data in the faster cache: on read attempts, data is taken directly from the cache; on write, data is cached and eventually written to memory
- Similar to ILP, the potential is limited: larger caches do not help automatically; at some point, all data locality in the code is already exploited
- Manual vs. compiler-driven optimization [arstechnica.com]

8 Memory Wall
- If caching is limited, we simply need faster memory. The problem: shared memory is shared
- Interconnect contention, memory bandwidth; memory transfer speed and transfer size are limited by the power wall
- Transfer technology cannot keep up with GHz processors
- Memory is too slow, and the effects cannot be hidden completely through caching → memory wall [dell.com]

9 The Situation
- Hardware people: the number of transistors N is still increasing, but
  - building larger caches no longer helps (memory wall)
  - ILP is out of options (ILP wall)
  - voltage / power consumption is at the limit (power wall); some help with dynamic scaling approaches
  - frequency is stalled (power wall)
  - the only possible offer is to use the increasing N for more cores
- For faster software in the future:
  - Speedup must come from the utilization of an increasing core count, since F is now fixed
  - Software must participate in the power wall handling, to keep F fixed
  - Software must tackle the memory wall

10 Three Ways Of Doing Anything Faster [Pfister]
- Work harder (clock speed) → power wall problem, memory wall problem
- Work smarter (optimization, caching) → ILP wall problem, memory wall problem
- Get help (parallelization): more cores per single CPU, and software needs to exploit them in the right way → memory wall problem

11 Parallelism on Different Levels
- A processor chip (socket):
  - Chip multi-processing (CMP): multiple CPUs per chip, called cores (multi-core / many-core)
  - Simultaneous multi-threading (SMT): interleaved execution of tasks on one core (example: Intel Hyperthreading)
  - Chip multi-threading (CMT) = CMP + SMT
  - Instruction-level parallelism (ILP): parallel processing of single instructions per core
- Multiple processor chips in one machine (multi-processing): symmetric multi-processing (SMP)
- Multiple processor chips in many machines (multi-computer)

12 Parallelism on Different Levels (figure: CMP architecture with ILP and SMT inside each core) [arstechnica.com]

13 Parallelism on Different Levels: Blue Gene/Q (figure, 2011 IBM Corporation)
- 1. Chip: 16+2 cores; 2. Single chip module; 3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for water cooling
- 4. Node card: 32 compute cards, optical modules, link chips, 5D torus
- 5a. Midplane: 16 node cards; 5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
- 6. Rack: 2 midplanes; 7. System: 96 racks, 20 PF/s (~2 GF/W, Green 500 criteria)
- Sustained single-node performance 10x BG/P and 20x BG/L; software and hardware support for programming models that exploit node hardware concurrency

14 Memory on Different Levels (figure: memory hierarchy)
- Fast / expensive / small at the top, slow / cheap / large at the bottom: registers, processor caches, random access memory (RAM) — all volatile; flash / SSD memory, hard drives, tapes — non-volatile

15 A Wild Mixture (figure: heterogeneous systems connected by a network)

16 (figure: NVIDIA GF100 GPU architecture)

17 A Wild Mixture (figure: heterogeneous nodes combining multi-core CPUs with DDR3 memory and QPI links, GPUs and MIC accelerators with GDDR5 memory attached via 16x PCIe, and dual Gigabit LAN)

18 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

19 Hardware Abstraction: Flynn's Taxonomy
- Classifies parallel hardware architectures according to their capabilities in the instruction and data processing dimension
- Single Instruction, Single Data (SISD); Single Instruction, Multiple Data (SIMD); Multiple Instruction, Single Data (MISD); Multiple Instruction, Multiple Data (MIMD)

20 Hardware Abstraction: Tasks + Processing Elements (figure: programs decomposed into tasks and processes, mapped onto processing elements (PEs); each node has PEs and local memory, nodes are connected by a network)

21 Hardware Abstraction: PRAM
- RAM assumptions: constant memory access time, unlimited memory
- PRAM assumptions: non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
- Alternative models: BSP, LogP
- (figure: multiple CPUs on a shared bus between input, memory, and output)

22 Hardware Abstraction: BSP
- Leslie G. Valiant. A Bridging Model for Parallel Computation, 1990
- Success of the von Neumann model: a bridge between hardware and software; high-level languages can be efficiently compiled onto it, and hardware designers can optimize their realization of it
- A similar model is wanted for parallel machines: neutral about the number of processors; programs are written for v virtual processors mapped to p physical ones; when v >> p, the compiler has options
- A BSP computation consists of a series of supersteps (cost model sketched below):
  1. Concurrent computation on all processors
  2. Exchange of data between all processes
  3. Barrier synchronization
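For reference, the standard BSP cost model (not stated on the slide) charges each superstep for its longest local computation, its h-relation, and the barrier latency:

```latex
% BSP cost of one superstep and of a program with S supersteps.
% w_i = maximum local computation, h_i = maximum number of words any
% processor sends or receives, g = bandwidth parameter, l = barrier cost.
T_{\text{superstep}} = w + h \cdot g + l
\qquad\qquad
T_{\text{program}} = \sum_{i=1}^{S} \bigl( w_i + h_i \cdot g + l \bigr)
```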

23 Hardware Abstraction: CSP
- The behavior of real-world objects can be described through their interaction with other objects, leaving out internal implementation details
- The interface of a process is described as a set of atomic events; examples for an ATM: card (insertion of a credit card into the ATM card slot), money (extraction of money from the ATM dispenser); events for a printer: {accept, print}
- Alphabet: the set of relevant (!) events for an object description; an event may never happen in the interaction; the interaction is restricted to this set of events; αATM = {card, money}
- A CSP process is the behavior of an object, described with its alphabet

24 Hardware Abstraction: LogP
- Criticism of the oversimplification in PRAM-based approaches, which encourage the exploitation of "formal loopholes" (e.g. cost-free communication); trend towards multicomputer systems with large local memories
- Characterization of a parallel machine by:
  - P: number of processors
  - g (gap): minimum time between two consecutive transmissions; its reciprocal corresponds to the per-processor communication bandwidth
  - L (latency): upper bound on messaging time
  - o (overhead): exclusive processor time needed for a send / receive operation
- L, o, and g are measured in multiples of processor cycles
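From these parameters the cost of a single small point-to-point message follows directly (a standard consequence of the model, not spelled out on the slide):

```latex
% One small message: send overhead, network latency, receive overhead.
T_{\text{msg}} = o + L + o = L + 2o
% A processor can inject at most one message every g cycles,
% so its sustained injection rate is 1/g messages per cycle.
```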

25 Hardware Abstraction: OpenCL
- Private memory: per work-item
- Local memory: shared within a work-group
- Global / constant memory: visible to all work-groups
- Host memory: on the CPU

26 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

27 Software View: Concurrency vs. Parallelism
- Concurrency means dealing with several things at once: a programming concept for the developer; in shared-memory systems, implemented by time sharing
- Parallelism means doing several things at once: demands parallel hardware
- "Parallel programming" is a misnomer: it is concurrent programming aiming at parallel execution
- Any parallel software is concurrent software (note: some researchers disagree, most practitioners agree)
- Concurrent software is not always parallel software: many server applications achieve scalability by optimizing concurrency only (web server)

28 Server Example: No Concurrency, No Parallelism (figure)

29 Server Example: Concurrency for Throughput (figure)

30 Server Example: Parallelism for Throughput (figure)

31 Server Example: Parallelism for Speedup (figure)

32 Concurrent Execution
- A program is a sequence of atomic statements; "atomic" means executed without interruption
- Concurrent execution is the interleaving of atomic statements from multiple tasks
- Tasks may share resources (variables, operating system handles, ...)
- Operating system timing is not predictable, so the interleaving is not predictable; this may impact the result of the application
- Since parallel programs are concurrent programs, we need to deal with that!
- (figure: interleavings of one task running y=x; y=y-1; x=y with another running z=x; z=z+1; x=z on a shared x — depending on the interleaving, the final value of x differs, e.g. 0, 1, or 2; see the sketch below)
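A minimal C/pthreads sketch (my own illustration, not from the slides) that makes the interleaving problem concrete: two threads perform the same non-atomic read-modify-write on a shared variable, so updates can be lost depending on the schedule.

```c
#include <pthread.h>
#include <stdio.h>

static int x = 0;                   /* shared variable, no synchronization */

/* Each iteration performs "x = x + 1" as a non-atomic read-modify-write;
 * interleavings between the two threads can lose updates. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        int y = x;                  /* y = x     */
        y = y + 1;                  /* y = y + 1 */
        x = y;                      /* x = y     */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but typically prints less because of lost updates. */
    printf("x = %d\n", x);
    return 0;
}
```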

33 Critical Section
- N threads have some code - the critical section - with shared data access
- Mutual exclusion demand: only one thread at a time is allowed into its critical section, among all threads that have critical sections for the same resource
- Progress demand: if no other thread is in the critical section, the decision for entering should not be postponed indefinitely; only threads that wait for entering the critical section are allowed to participate in decisions
- Bounded waiting demand: it must not be possible for a thread requiring access to a critical section to be delayed indefinitely by other threads entering the section (starvation problem)

34 Critical Sections with Mutexes (figure: threads T1-T3 call m.lock() before and m.unlock() after their critical section; a thread that finds the mutex taken is placed in a waiting queue until it is released)
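One possible way to guard the update from the previous sketch with a mutex, here using pthreads for illustration (the slide itself is library-agnostic):

```c
#include <pthread.h>
#include <stdio.h>

static int x = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);    /* enter critical section (blocks if taken) */
        x = x + 1;                 /* only one thread at a time executes this  */
        pthread_mutex_unlock(&m);  /* leave critical section, wake a waiter    */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);         /* now reliably 2000000 */
    return 0;
}
```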

35 Critical Sections with High-Level Primitives
- Today: a multitude of high-level synchronization primitives
- Spinlock: performs busy waiting, lowest overhead for short locks
- Reader / writer lock: special case of mutual exclusion through semaphores; multiple reader processes can enter the critical section at the same time, but a writer process should gain exclusive access; different optimizations possible (minimum reader delay, minimum writer delay, throughput, ...)
- Mutex: semaphore that works amongst operating system processes
- Concurrent collections: blocking queues and key-value maps with concurrency support

36 Critical Sections with High-Level Primitives
- Reentrant lock: the lock can be obtained several times without locking on itself; useful for cyclic algorithms (e.g. graph traversal) and problems where lock bookkeeping is very expensive; a reentrant mutex needs to remember the locking thread(s), which increases the overhead
- Barriers: all concurrent activities stop there and continue together; participants are statically defined at compile or start time; newer dynamic barrier concepts allow late binding of participants (e.g. X10 clocks, Java phasers)
- Memory barriers (memory fences) enforce the separation of memory operations before and after the barrier; needed for low-level synchronization implementations

37 Nasty Stuff
- Deadlock: two or more processes / threads are unable to proceed; each is waiting for one of the others to do something
- Livelock: two or more processes / threads continuously change their states in response to changes in the other processes / threads; no global progress for the application
- Race condition: two or more processes / threads are executed concurrently, and the final result of the application depends on the relative timing of their execution

38 Coffman Conditions
- E. G. Coffman and A. Shoshani. Sequencing tasks in multiprocess systems to avoid deadlocks.
- All conditions must be fulfilled to allow a deadlock to happen:
  - Mutual exclusion condition: individual resources are available or held by no more than one thread at a time
  - Hold and wait condition: threads already holding resources may attempt to hold new resources
  - No preemption condition: once a thread holds a resource, it must voluntarily release it on its own
  - Circular wait condition: it is possible for a thread to wait for a resource held by the next thread in the chain
- Avoiding circular wait turned out to be the easiest solution for deadlock avoidance (see the sketch below)
- Avoiding mutual exclusion leads to non-blocking synchronization; these algorithms no longer have a critical section
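An illustrative C sketch of breaking the circular-wait condition by lock ordering (my example, not from the slides): both code paths acquire the two locks in the same global order, so no cycle of waiting threads can form.

```c
#include <pthread.h>

/* Two resources protected by two locks. Acquiring them in a fixed
 * global order (always lock_a before lock_b) removes the circular-wait
 * condition, so these two functions cannot deadlock with each other. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void use_a_then_b(void) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... work on both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

void use_b_then_a(void) {
    /* Same acquisition order as above, even though resource b is
     * logically needed "first" here. */
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... work on both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}
```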

39 Terminology
- Starvation: a runnable process / thread is overlooked indefinitely; although it is able to proceed, it is never chosen to run (dispatching / scheduling)
- Atomic operation: a function or action implemented as a sequence of one or more instructions that appears to be indivisible - no other process / thread can see an intermediate state or interrupt the operation; executed as a group, or not executed at all
- Mutual exclusion: the requirement that when one process / thread is using a resource, no other shall be allowed to do so

40 Is it worth the pain?
- Parallelization metrics are application-dependent, but follow a common set of concepts
- Speedup: more resources lead to less time for solving the same task; linear speedup: n times more resources → n times speedup
- Scaleup: more resources solve a larger version of the same task in the same time; linear scaleup: n times more resources → an n times larger problem is solvable
- The most important goal depends on the application: transaction processing usually heads for throughput (scalability), decision support usually heads for response time (speedup)

41 Speedup (figure)
- Idealized assumptions: all tasks are equally sized, all code parts can run in parallel
- Tasks: v=12; processing elements: N=1; time needed: T1 = 12
- Tasks: v=12; processing elements: N=3; time needed: T3 = 4
- (Linear) speedup: T1/T3 = 12/4 = 3

42 Speedup with Load Imbalance (figure)
- Assumptions: tasks have different sizes, so the best possible speedup depends on optimized resource usage; all code parts can run in parallel
- Tasks: v=12; processing elements: N=1; time needed: T1 = 16
- Tasks: v=12; processing elements: N=3; time needed: T3 = 6
- Speedup: T1/T3 = 16/6 = 2.67

43 Speedup with Serial Parts
- Each application has inherently non-parallelizable serial parts: algorithmic limitations, shared resources acting as bottlenecks, overhead for program start, communication overhead in shared-nothing systems
- (figure: execution time split into alternating serial and parallel phases t_SER1, t_PAR1, t_SER2, t_PAR2, t_SER3)

44 Amdahl's Law
- Gene Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967
- Serial parts: T_SER = t_SER1 + t_SER2 + t_SER3 + ...
- Parallelizable parts: T_PAR = t_PAR1 + t_PAR2 + t_PAR3 + ...
- Execution time with one processing element: T_1 = T_SER + T_PAR
- Execution time with N parallel processing elements: T_N >= T_SER + T_PAR / N (equality only with perfect parallelization, i.e. no load imbalance)
- Amdahl's Law for the maximum speedup with N processing elements:
  $$ S = \frac{T_1}{T_N} = \frac{T_{SER} + T_{PAR}}{T_{SER} + T_{PAR}/N} $$
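A small worked example with numbers chosen purely for illustration: a serial fraction of 10% limits eight processing elements to a speedup of roughly 4.7, and no number of processing elements can push it beyond 10.

```latex
S_8 = \frac{T_{SER} + T_{PAR}}{T_{SER} + T_{PAR}/8}
    = \frac{0.1 + 0.9}{0.1 + 0.9/8}
    = \frac{1}{0.2125} \approx 4.7
```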

45 Amdahl's Law (figure)

46 Amdahl's Law
- Speedup through parallelism is hard to achieve
- For unlimited resources, the speedup is bound by the serial parts; assuming T_1 = 1:
  $$ S_{N \to \infty} = \frac{T_1}{T_{N \to \infty}} = \frac{1}{T_{SER}} $$
- The parallelization problem relates to all system layers: hardware offers some degree of parallel execution, but the speedup gained is bound by the serial parts:
  - Limitations of hardware components
  - Necessary serial activities in the operating system, virtual runtime system, middleware and the application
  - Overhead for the parallelization itself

47 Gustafson-Barsis Law (1988)
- Gustafson and Barsis pointed out that people are typically not interested in the shortest execution time, but rather want to solve the biggest problem in reasonable time
- The problem size can then scale with the number of processors, which leads to a larger parallelizable part with increasing N; typical goal in simulation problems
- The time spent in the sequential part is usually fixed or grows slower than the problem size → linear speedup possible
- Formally: P_N is the portion of the program that benefits from parallelization, depending on N (and implicitly the problem size)
- Maximum scaled speedup by N processors: see the formula below
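The formula itself is cut off in the transcription; with P_N defined as above, the scaled (Gustafson-Barsis) speedup is commonly written as

```latex
S(N) = (1 - P_N) + N \cdot P_N
```

so a program whose scaled workload is 99% parallelizable reaches S(100) = 0.01 + 100 * 0.99 ≈ 99 on 100 processors.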

48 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

49 Programming Model for Shared Memory
- Different programming models for concurrency in shared memory: concurrent processes with explicitly shared memory, concurrent threads within one process, and concurrent tasks mapped to a thread pool (figure)
- Processes and threads are mapped to processing elements (cores)
- Process- and thread-based programming is typically covered in operating system lectures

50 OpenMP
- Programming with the fork-join model: the master thread forks into declared tasks
- The runtime environment may run them in parallel, based on dynamic mapping to threads from a pool
- Worker tasks hit a barrier before finalization (join) [Wikipedia]
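A minimal fork-join sketch in C with OpenMP (illustrative, not taken from the course material): the master thread forks a team, one thread declares tasks, the runtime maps them to threads from the pool, and the implicit barrier at the end of the parallel region is the join.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel           /* fork: create a team of threads        */
    {
        #pragma omp single         /* one thread declares the tasks         */
        {
            for (int i = 0; i < 8; i++) {
                #pragma omp task firstprivate(i)
                printf("task %d executed by thread %d\n",
                       i, omp_get_thread_num());
            }
            #pragma omp taskwait   /* wait until all declared tasks finish  */
        }
    }                              /* join: implicit barrier of the region  */
    return 0;
}
```

Compile, for example, with gcc -fopenmp.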

51 Task Scheduling
- Classical task scheduling with a central queue: all worker threads fetch tasks from the central queue; scalability issue with increasing thread (resp. core) count
- Work stealing in OpenMP (and other libraries): one task queue per thread; an idling thread steals tasks from another thread
- Independent from thread scheduling, only mutual synchronization, no central queue (figure)

52 PGAS Languages
- Non-uniform memory architectures (NUMA) became the default, but the understanding of memory in programming is flat: all variables are equal in access time, and considering the memory hierarchy is low-level coding (e.g. cache-aware programming)
- Partitioned global address space (PGAS) approach: driven by the high-performance computing community; modern approach for large-scale NUMA
- Explicit notion of a memory partition per processor; data is designated as local (near) or global (possibly far); the programmer is aware of NUMA nodes
- Performance optimization for deep memory hierarchies

53 Parallel Programming for Accelerators
- OpenCL exposes CPUs, GPUs, and other accelerators as devices
- Each device contains one or more compute units, i.e. cores, SMs, ...
- Each compute unit contains one or more SIMD processing elements

54 The BIG idea behind OpenCL
- OpenCL execution model: execute a kernel at each point in a problem domain
- E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions
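A sketch of what such a per-pixel kernel could look like in OpenCL C (hypothetical kernel name and pixel operation; the host code that builds the program and enqueues a 1024 x 1024 global range is omitted):

```c
/* OpenCL C kernel: one work-item ("kernel execution") per pixel.
 * The host would enqueue it with a 2D global work size of 1024 x 1024. */
__kernel void brighten(__global const uchar *in,
                       __global uchar *out,
                       const int width)
{
    int x = get_global_id(0);                     /* column of this work-item */
    int y = get_global_id(1);                     /* row of this work-item    */
    int idx = y * width + x;
    out[idx] = (uchar)min((int)in[idx] + 32, 255); /* per-pixel operation     */
}
```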

55 Message Passing
- Programming paradigm targeting shared-nothing infrastructures
- Implementations for shared memory are available, but typically not the best possible approach there
- Multiple instances of the same application run on a set of nodes (SPMD)
- (figure: a submission host starts instances 0-3 on the execution hosts)

56 Single Program Multiple Data (SPMD)
- A sequential program and its data distribution are translated into a sequential node program with message passing
- Identical copies of the program run as processes P0, P1, P2, P3 with different process identifications
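A minimal SPMD sketch with MPI (my illustration): every process runs the identical program and only its rank, i.e. its process identification, makes the behavior differ.

```c
#include <stdio.h>
#include <mpi.h>

/* SPMD: all processes execute this same program; they branch on the rank. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' identification */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */

    if (rank == 0)
        printf("coordinator among %d processes\n", size);
    else
        printf("worker %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```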

57 Actor Model
- Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. IJCAI 1973
- Another mathematical model for concurrent computation; no global system state concept (relationship to physics)
- The actor is the computation primitive: it makes local decisions, concurrently creates more actors, and concurrently sends / receives messages
- Asynchronous one-way messaging with changing topology (the CSP communication graph is fixed), no order guarantees
- The recipient is identified by a mailing address; everything is an actor

58 Actor Model 58 Interaction with asynchronous, unordered, distributed messaging Fundamental aspects Emphasis on local state, time and name space No central entity Actor A gets to know actor B only by direct creation, or by name transmission from another actor C Computation Not global state sequence, but partially ordered set of events Event: Receipt of a message by a target actor Each event is a transition from one local state to another Events may happen in parallel Messaging reliability declared as orthogonal aspect

59 Message Passing Interface (MPI): MPI_GATHER(IN sendbuf, IN sendcount, IN sendtype, OUT recvbuf, IN recvcount, IN recvtype, IN root, IN comm)
- Each process sends its buffer to the root process (including the root itself)
- Incoming messages are stored in rank order
- The receive buffer is ignored for all non-root processes
- MPI_GATHERV allows a varying count of data to be received per process
- The call returns when the send buffer is reusable; completion at the other processes is not promised
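A minimal usage sketch of MPI_Gather in C (illustrative values; the fixed-size receive buffer assumes at most 64 processes):

```c
#include <stdio.h>
#include <mpi.h>

/* Each process contributes one int; rank 0 receives them in rank order. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;            /* local value of this process        */
    int recv[64];                      /* only examined on the root;         */
                                       /* sketch assumes size <= 64          */
    MPI_Gather(&mine, 1, MPI_INT,      /* send buffer: 1 int per rank        */
               recv, 1, MPI_INT,       /* recvcount counts per sender        */
               0, MPI_COMM_WORLD);     /* root = rank 0                      */

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, recv[i]);

    MPI_Finalize();
    return 0;
}
```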

60 The Parallel Programming Problem (figure: a flexible parallel application must be matched to the configuration type of an execution environment)

61 Execution Environment Mapping (figure: mapping onto Multiple Instruction, Multiple Data (MIMD) and Single Instruction, Multiple Data (SIMD) execution environments)

62 Patterns for Parallel Programming [Mattson] 62 Finding Concurrency Design Space task / data decomposition, task grouping and ordering due to data flow dependencies, design evaluation Algorithm Structure Design Space Task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination Mapping of concurrent design elements to execution units Supporting Structures Design Space SPMD, master / worker, loop parallelism, fork / join, shared data, shared queue, distributed array Program structures and data structures used for code creation Implementation Mechanisms Design Space

63 Designing Parallel Algorithms [Foster] 63 Map workload problem on an execution environment Concurrency for speedup Data locality for speedup Scalability Best parallel solution typically differs massively from the sequential version of an algorithm Foster defines four distinct stages of a methodological approach Example: Parallel Sum

64 Example: Parallel Reduction - reduce a set of elements into one, given an operation; example: sum (see the sketch below)
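One possible realization of the sum reduction with OpenMP (a sketch, not taken from the slides): each thread accumulates a private partial sum, and the runtime combines the partial results with the + operator.

```c
#include <stdio.h>

/* Parallel sum as a reduction: every thread works on a private copy of
 * 'sum', and OpenMP combines the partial sums after the loop. */
int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;            /* partial harmonic sum as example data */

    printf("H(%d) ~= %f\n", n, sum);
    return 0;
}
```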

65 Designing Parallel Algorithms [Foster] 65 A) Search for concurrency and scalability Partitioning Decompose computation and data into small tasks Communication Define necessary coordination of task execution B) Search for locality and other performance-related issues Agglomeration Consider performance and implementation costs Mapping Maximize processor utilization, minimize communication Might require backtracking or parallel investigation of steps

66 Partitioning
- Expose opportunities for parallel execution: fine-grained decomposition
- A good partition keeps computation and data together: data partitioning leads to data parallelism, computation partitioning leads to task parallelism; complementary approaches that can lead to different algorithms
- Reveal hidden structures of the algorithm that have potential; investigate complementary views on the problem
- Avoid replication of either computation or data; this can be revised later to reduce communication overhead
- The step results in multiple candidate solutions

67 Partitioning - Decomposition Types 67 Domain Decomposition Define small data fragments Specify computation for them Different phases of computation on the same data are handled separately Rule of thumb: First focus on large or frequently used data structures Functional Decomposition Split up computation into disjoint tasks, ignore the data accessed for the moment With significant data overlap, domain decomposition is more appropriate

68 Partitioning Strategies [Breshears]
- Produce at least as many tasks as there will be threads / cores; but it might be more effective to use only a fraction of the cores (granularity)
- The computation must pay off with respect to the overhead; avoid synchronization, since it adds up as overhead to the serial execution time
- Patterns for data decomposition: by element (one-dimensional); by row, by column group, by block (multi-dimensional); influenced by the ratio of computation and synchronization

69 Partitioning - Checklist
- Order of magnitude more tasks than processors? -> Keeps flexibility for the next steps
- Avoidance of redundant computation and storage requirements? -> Scalability for large problem sizes
- Tasks of comparable size? -> Goal is to allocate equal work to processors
- Does the number of tasks scale with the problem size? -> The algorithm should be able to solve larger problems with more processors
- Resolve bad partitioning by estimating performance behavior, and eventually reformulating the problem

70 Communication Step 70 Specify links between data consumers and data producers Specify kind and number of messages on these links Domain decomposition problems might have tricky communication infrastructures, due to data dependencies Communication in functional decomposition problems can easily be modeled from the data flow between the tasks Categorization of communication patterns Local communication (few neighbors) vs. global communication Structured communication (e.g. tree) vs. unstructured communication Static vs. dynamic communication structure Synchronous vs. asynchronous communication

71 Communication - Hints
- Distribute computation and communication, don't centralize the algorithm; bad example: a central manager for parallel summation
- Divide and conquer helps as a mental model to identify concurrency
- Unstructured communication is hard to agglomerate, so better avoid it
- Checklist for communication design: Do all tasks perform the same amount of communication? -> Distribute or replicate communication hot spots. Does each task perform only local communication? Can communication happen concurrently? Can computation happen concurrently?

72 Ghost Cells
- Domain decomposition might lead to chunks that demand data from each other for their computation
- Solution 1: copy the necessary portion of data ("ghost cells") if no synchronization is needed after an update; the data amount and the frequency of updates influence the resulting overhead and efficiency; additional memory consumption
- Solution 2: access the relevant data "remotely"; delays thread coordination until the data is really needed; correctness ("old" data vs. "new" data) must be considered on parallel progress

73 Agglomeration Step 73 Algorithm so far is correct, but not specialized for some execution environment Check again partitioning and communication decisions Agglomerate tasks for efficient execution on some machine Replicate data and / or computation for efficiency reasons Resulting number of tasks can still be greater than the number of processors Three conflicting guiding decisions Reduce communication costs by coarser granularity of computation and communication Preserve flexibility with respect to later mapping decisions Reduce software engineering costs (serial -> parallel version)

74 Agglomeration [Foster] (figure)

75 Agglomeration Granularity vs. Flexibility 75 Reduce communication costs by coarser granularity Sending less data Sending fewer messages (per-message initialization costs) Agglomerate, especially if tasks cannot run concurrently Reduces also task creation costs Replicate computation to avoid communication (helps also with reliability) Preserve flexibility Flexible large number of tasks still prerequisite for scalability Define granularity as compile-time or run-time parameter

76 Agglomeration - Checklist
- Are communication costs reduced by increasing locality?
- Does replicated computation outweigh its costs in all cases?
- Does data replication restrict the range of problem sizes / processor counts?
- Do the larger tasks still have similar computation / communication costs?
- Do the larger tasks still act with sufficient concurrency?
- Does the number of tasks still scale with the problem size?
- How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?
- Is the transition to parallel code worth the engineering costs?

77 Mapping Step 77 Only relevant for shared-nothing systems, since shared memory systems typically perform automatic task scheduling Minimize execution time by Place concurrent tasks on different nodes Place tasks with heavy communication on the same node Conflicting strategies, additionally restricted by resource limits In general, NP-complete bin packing problem Set of sophisticated (dynamic) heuristics for load balancing Preference for local algorithms that do not need global scheduling state

78 Surface-To-Volume Effect [Foster, Breshears]
- Visualize the data to be processed (in parallel) as a sliced 3D cube
- The synchronization requirements of a task are proportional to the surface of the data slice it operates upon (the amount of "borders" of the slice)
- The computation work of a task is proportional to the volume of the data slice it operates upon; this represents the granularity of the decomposition
- Ratio of synchronization to computation: high synchronization with low computation (high ratio) → bad; low synchronization with high computation (low ratio) → good
- The ratio decreases for increasing data size per task: coarse granularity by agglomerating tasks in all dimensions; for a given volume, the surface then goes down → good
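For the simplest cubic case (illustrative arithmetic, not on the slide): a sub-block of edge length d contains on the order of d^3 grid points but exposes on the order of 6 d^2 boundary points, so

```latex
\frac{\text{synchronization}}{\text{computation}} \;\propto\; \frac{6\,d^{2}}{d^{3}} \;=\; \frac{6}{d}
```

i.e. the ratio shrinks as the blocks get coarser, which is exactly the agglomeration argument above.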

79 Surface-To-Volume Effect [Foster, Breshears] 79 (C) nicerweb.com

80 Surface-to-Volume Effect [Foster] 80 Computation on 8x8 grid (a): 64 tasks, one point each 64x4=256 synchronizations 256 data values are transferred (b): 4 tasks, 16 points each 4x4=16 synchronizations 16x4=64 data values are transferred

81 Designing Parallel Algorithms [Breshears]
- A parallel solution must keep the sequential consistency property: mentally simulate the execution of the parallel streams, check critical parts of the parallelized sequential application
- Amount of computation per parallel task: overhead is always introduced by moving from serial to parallel code; the speedup must offset the parallelization overhead (Amdahl)
- Granularity: amount of parallel computation done before synchronization is needed; fine-grained granularity overhead vs. coarse-grained granularity concurrency
- Iterative approach to finding the right granularity; the decision might only be correct for a chosen execution environment

82 OK?!?

83 Certificate for free


More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing Introduction to Parallel Computing with MPI and OpenMP P. Ramieri Segrate, November 2016 Course agenda Tuesday, 22 November 2016 9.30-11.00 01 - Introduction to parallel

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

Application Programming

Application Programming Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Comp. Org II, Spring

Comp. Org II, Spring Lecture 11 Parallel Processing & computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1) Computer

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Overview. Processor organizations Types of parallel machines. Real machines

Overview. Processor organizations Types of parallel machines. Real machines Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500, clusters, DAS Programming methods, languages, and environments

More information

Claude TADONKI. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude TADONKI. MINES ParisTech PSL Research University Centre de Recherche Informatique Got 2 seconds Sequential 84 seconds Expected 84/84 = 1 second!?! Got 25 seconds MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Séminaire MATHEMATIQUES

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

ANURADHA DANDE SIMULATION OF MULTIPROCESSOR SYSTEM SCHEDULING

ANURADHA DANDE SIMULATION OF MULTIPROCESSOR SYSTEM SCHEDULING ANURADHA DANDE SIMULATION OF MULTIPROCESSOR SYSTEM SCHEDULING Master of Science Thesis Examiner(s): Professor Jari Nurmi, Doctor Sanna Määttä Examiner and topic approved by the Teaching and research Council

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information