Parallel Programming Concepts: Summary. Dr. Peter Tröger, M.Sc. Frank Feinbube
3 Course Topics
- The Parallelization Problem: power wall, memory wall, Moore's law; terminology and metrics
- Shared Memory Parallelism: theory of concurrency, hardware today and in the past; programming models, optimization, profiling
- Shared Nothing Parallelism: theory of concurrency, hardware today and in the past; programming models, optimization, profiling
- Accelerators
- Patterns
- Future trends
4 Scaring Students with Word Clouds...
5 The Free Lunch Is Over
- Clock speed curve flattened in 2003: heat, power consumption, leakage
- 2-3 GHz since 2001 (!)
- Speeding up serial instruction execution through clock speed improvements no longer works
- We stumbled into the Many-Core Era [Herb Sutter, 2009]
6 OpenHPI Parallel Programming Concepts, Dr. Peter Tröger. The Power Wall
- Air cooling capabilities are limited: maximum chip temperature, hot spot problem
- Static and dynamic power consumption must be limited
- Power consumption increases with Moore's law, but growth of hardware performance is expected
- Further reducing voltage as compensation? We can't do that endlessly; lower limit around 0.7 V, strange physical effects
- Next-generation processors need to use even less power: lower the frequencies and scale them dynamically, use only parts of the processor at a time ("dark silicon"), build energy-efficient special purpose hardware
- No chance for faster processors through frequency increase
7 Memory Wall
- Caching: well-established optimization technique for performance; relies on data locality
- Some instructions are often used (e.g. loops), some data is often used (e.g. local variables)
- Hardware keeps a copy of the data in the faster cache: on read attempts, data is taken directly from the cache; on write, data is cached and eventually written to memory
- Similar to ILP, the potential is limited: larger caches do not help automatically
- At some point, all data locality in the code is already exploited
- Manual vs. compiler-driven optimization [arstechnica.com]
8 Memory Wall
- If caching is limited, we simply need faster memory
- The problem: shared memory is shared (interconnect contention, memory bandwidth)
- Memory transfer speed and transfer size are limited by the power wall
- Transfer technology cannot keep up with GHz processors
- Memory is too slow, and the effects cannot be hidden completely through caching → memory wall [dell.com]
9 The Situation
Hardware people:
- Number of transistors N is still increasing
- Building larger caches no longer helps (memory wall)
- ILP is out of options (ILP wall)
- Voltage / power consumption is at the limit (power wall); some help with dynamic scaling approaches
- Frequency is stalled (power wall)
- The only possible offer is to use the increasing N for more cores
For faster software in the future:
- Speedup must come from the utilization of an increasing core count, since F is now fixed
- Software must participate in the power wall handling, to keep F fixed
- Software must tackle the memory wall
10 Three Ways Of Doing Anything Faster [Pfister]
- Work harder (clock speed): power wall problem, memory wall problem
- Work smarter (optimization, caching): ILP wall problem, memory wall problem
- Get help (parallelization): more cores per single CPU; software needs to exploit them in the right way; memory wall problem remains
11 Parallelism on Different Levels
A processor chip (socket):
- Chip multi-processing (CMP): multiple CPUs per chip, called cores (multi-core / many-core)
- Simultaneous multi-threading (SMT): interleaved execution of tasks on one core (example: Intel Hyperthreading)
- Chip multi-threading (CMT) = CMP + SMT
- Instruction-level parallelism (ILP): parallel processing of single instructions per core
Beyond the chip:
- Multiple processor chips in one machine: symmetric multi-processing (SMP)
- Multiple processor chips in many machines: multi-computer
12 Parallelism on Different Levels (figure: CMP architecture with ILP and SMT inside each core [arstechnica.com])
13 Parallelism on Different Levels: Blue Gene/Q (© 2011 IBM Corporation)
1. Chip: 16+2 cores
2. Single chip module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. IO drawer: 8 IO cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
Sustained single node performance: 10x P, 20x L; MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria). Software and hardware support for programming models for exploitation of node hardware concurrency.
14 Memory on Different Levels (fast / expensive / small at the top, slow / cheap / large at the bottom)
- Registers (volatile)
- Processor caches (volatile)
- Random Access Memory (RAM) (volatile)
- Flash / SSD memory (non-volatile)
- Hard drives (non-volatile)
- Tapes (non-volatile)
15 A Wild Mixture (figure: heterogeneous machines connected by a network)
16 GF100 (figure: GPU architecture diagram)
17 A Wild Mixture (figure: nodes combining multi-core CPUs with DDR3 memory, GPUs and MIC accelerators with GDDR5 memory, connected via QPI, 16x PCIe, and dual Gigabit LAN)
18 The Parallel Programming Problem (diagram: a flexible parallel application must be matched against the configuration type of the execution environment)
19 Hardware Abstraction: Flynn's Taxonomy
- Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimension
- Single Instruction, Single Data (SISD)
- Single Instruction, Multiple Data (SIMD)
- Multiple Instruction, Single Data (MISD)
- Multiple Instruction, Multiple Data (MIMD)
20 Hardware Abstraction: Tasks + Processing Elements (diagram: programs consist of tasks made up of processes, mapped to processing elements (PEs) with local memory on nodes connected by a network)
21 Hardware Abstraction: PRAM
- RAM assumptions: constant memory access time, unlimited memory
- PRAM assumptions: non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
- Alternative models: BSP, LogP
(diagram: CPUs connected via a shared bus to input, memory, and output)
22 Hardware Abstraction: BSP
- Leslie G. Valiant. A Bridging Model for Parallel Computation, 1990
- Success of the von Neumann model: bridge between hardware and software; high-level languages can be efficiently compiled on this model; hardware designers can optimize the realization of this model
- Similar model for parallel machines: should be neutral about the number of processors; programs should be written for v virtual processors that are mapped to p physical ones; when v >> p, the compiler has options
- A BSP computation consists of a series of supersteps: 1) concurrent computation on all processors, 2) exchange of data between all processes, 3) barrier synchronization
23 Hardware Abstraction: CSP
- Behavior of real-world objects can be described through their interaction with other objects, leaving out internal implementation details
- The interface of a process is described as a set of atomic events
- Event examples for an ATM: card (insertion of a credit card in the ATM card slot), money (extraction of money from the ATM dispenser); events for a printer: {accept, print}
- Alphabet: the set of relevant (!) events for an object description; an event may never happen in the interaction; interaction is restricted to this set; αATM = {card, money}
- A CSP process is the behavior of an object, described with its alphabet
24 Hardware Abstraction: LogP
- Criticism of the over-simplification in PRAM-based approaches, which encourages exploitation of "formal loopholes" (e.g. communication)
- Trend towards multicomputer systems with large local memories
- Characterization of a parallel machine by:
  - P: number of processors
  - g (gap): minimum time between two consecutive transmissions; its reciprocal corresponds to the per-processor communication bandwidth
  - L (latency): upper bound on messaging time
  - o (overhead): exclusive processor time needed for a send / receive operation
- L, o, g are given in multiples of processor cycles
25 Hardware Abstraction: OpenCL [4]
- Private memory: per work-item
- Local memory: shared within a work-group
- Global / constant memory: visible to all work-groups
- Host memory: on the CPU
26 The Parallel Programming Problem (diagram: a flexible parallel application must be matched against the configuration type of the execution environment)
27 Software View: Concurrency vs. Parallelism
- Concurrency means dealing with several things at once: a programming concept for the developer; in shared-memory systems, implemented by time sharing
- Parallelism means doing several things at once: demands parallel hardware
- "Parallel programming" is a misnomer: it is concurrent programming aiming at parallel execution
- Any parallel software is concurrent software (note: some researchers disagree, most practitioners agree)
- Concurrent software is not always parallel software: many server applications achieve scalability by optimizing concurrency only (web server)
28 Server Example: No Concurrency, No Parallelism
29 Server Example: Concurrency for Throughput
30 Server Example: Parallelism for Throughput
31 Server Example: Parallelism for Speedup
32 Concurrent Execution
- A program is a sequence of atomic statements ("atomic": executed without interruption)
- Concurrent execution is the interleaving of atomic statements from multiple tasks
- Tasks may share resources (variables, operating system handles, ...)
- Operating system timing is not predictable, so the interleaving is not predictable; this may impact the result of the application
- Since parallel programs are concurrent programs, we need to deal with that!
Example: two tasks share x; task A runs y=x; y=y-1; x=y and task B runs z=x; z=z+1; x=z. Starting from x=1, running the tasks one after the other gives x=1 in either order, but interleaved executions can also produce x=0 or x=2.
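The example above can be checked mechanically. The following sketch (Python used here purely as an executable notation for the slide's pseudo-statements) enumerates every interleaving of the two three-statement tasks and collects the possible final values of x:

```python
from itertools import combinations

def run_interleaving(order):
    """Execute tasks A and B in the given interleaving (sequence of 'A'/'B')."""
    x = 1          # shared variable
    y = None       # private to task A
    z = None       # private to task B
    a_step = b_step = 0
    for who in order:
        if who == 'A':
            if a_step == 0:   y = x        # y = x
            elif a_step == 1: y = y - 1    # y = y - 1
            else:             x = y        # x = y
            a_step += 1
        else:
            if b_step == 0:   z = x        # z = x
            elif b_step == 1: z = z + 1    # z = z + 1
            else:             x = z        # x = z
            b_step += 1
    return x

# Enumerate all interleavings of 3 A-steps and 3 B-steps (C(6,3) = 20)
results = set()
for a_positions in combinations(range(6), 3):
    order = ['B'] * 6
    for i in a_positions:
        order[i] = 'A'
    results.add(run_interleaving(order))

print(sorted(results))  # → [0, 1, 2]
```

Sequential execution in either order yields x=1; only the racy interleavings produce 0 or 2, which is exactly why the result of a concurrent program can depend on scheduling.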
33 Critical Section
- N threads have some code (the critical section) with shared data access
- Mutual exclusion demand: only one thread at a time is allowed into its critical section, among all threads that have critical sections for the same resource
- Progress demand: if no other thread is in the critical section, the decision for entering should not be postponed indefinitely; only threads that wait for entering the critical section are allowed to participate in decisions
- Bounded waiting demand: it must not be possible for a thread requiring access to a critical section to be delayed indefinitely by other threads entering the section (starvation problem)
34 Critical Sections with Mutexes (diagram: threads T1, T2, T3 call m.lock() before entering the critical section and m.unlock() after leaving; threads that find the mutex taken are placed in a waiting queue)
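A minimal sketch of the lock/unlock pattern from the diagram, using Python's threading.Lock as the mutex (the course material uses language-level primitives; Python is only a stand-in here). Four threads increment a shared counter; the lock makes the read-modify-write atomic:

```python
import threading

counter = 0                 # shared resource
lock = threading.Lock()     # the mutex m from the diagram

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:          # m.lock() ... m.unlock() as a context manager
            counter += 1    # critical section: read-modify-write

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # → 400000
```

Without the lock, two threads could both read the same old value and one increment would be lost; with it, the final count is always 4 x 100,000.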
35 Critical Sections with High-Level Primitives
Today there is a multitude of high-level synchronization primitives:
- Spinlock: performs busy waiting; lowest overhead for short locks
- Reader / writer lock: special case of mutual exclusion through semaphores; multiple reader processes can enter the critical section at the same time, but a writer process should gain exclusive access; different optimizations possible (minimum reader delay, minimum writer delay, throughput, ...)
- Mutex: semaphore that works amongst operating system processes
- Concurrent collections: blocking queues and key-value maps with concurrency support
36 Critical Sections with High-Level Primitives
- Reentrant lock: can be obtained several times by the same thread without deadlocking on itself; useful for cyclic algorithms (e.g. graph traversal) and problems where lock bookkeeping is very expensive; a reentrant mutex needs to remember the locking thread(s), which increases the overhead
- Barriers: all concurrent activities stop there and continue together; participants are statically defined at compile or start time; newer dynamic barrier concepts allow late binding of participants (e.g. X10 clocks, Java phasers)
- Memory barriers (memory fences): enforce separation of memory operations before and after the barrier; needed for low-level synchronization implementation
37 Nasty Stuff
- Deadlock: two or more processes / threads are unable to proceed; each is waiting for one of the others to do something
- Livelock: two or more processes / threads continuously change their states in response to changes in the others; no global progress for the application
- Race condition: two or more processes / threads are executed concurrently, and the final result of the application depends on the relative timing of their execution
38 Coffman Conditions
E. G. Coffman and A. Shoshani. Sequencing tasks in multiprocess systems to avoid deadlocks.
All conditions must be fulfilled to allow a deadlock to happen:
- Mutual exclusion condition: individual resources are available or held by no more than one thread at a time
- Hold and wait condition: threads already holding resources may attempt to hold new resources
- No preemption condition: once a thread holds a resource, it must voluntarily release it on its own
- Circular wait condition: it is possible for a thread to wait for a resource held by the next thread in the chain
Avoiding circular wait turned out to be the easiest solution for deadlock avoidance. Avoiding mutual exclusion leads to non-blocking synchronization; these algorithms no longer have a critical section.
39 Terminology
- Starvation: a runnable process / thread is overlooked indefinitely; although it is able to proceed, it is never chosen to run (dispatching / scheduling)
- Atomic operation: a function or action implemented as a sequence of one or more instructions; appears to be indivisible (no other process / thread can see an intermediate state or interrupt the operation); executed as a group, or not executed at all
- Mutual exclusion: the requirement that when one process / thread is using a resource, no other shall be allowed to do so
40 Is It Worth the Pain?
- Parallelization metrics are application-dependent, but follow a common set of concepts
- Speedup: more resources lead to less time for solving the same task; linear speedup: n times more resources → n times speedup
- Scaleup: more resources solve a larger version of the same task in the same time; linear scaleup: n times more resources → an n times larger problem becomes solvable
- The most important goal depends on the application: transaction processing usually heads for throughput (scalability), decision support usually heads for response time (speedup)
41 Speedup
Idealized assumptions: all tasks are equally sized, and all code parts can run in parallel.
- Tasks: v=12, processing elements: N=1, time needed: T1 = 12t
- Tasks: v=12, processing elements: N=3, time needed: T3 = 4t
- (Linear) speedup: T1/T3 = 12/4 = 3
42 Speedup with Load Imbalance
Assumptions: tasks have different sizes, so the best possible speedup depends on optimized resource usage; all code parts can run in parallel.
- Tasks: v=12, processing elements: N=1, time needed: T1 = 16t
- Tasks: v=12, processing elements: N=3, time needed: T3 = 6t
- Speedup: T1/T3 = 16/6 = 2.67
43 Speedup with Serial Parts
Each application has inherently non-parallelizable serial parts:
- Algorithmic limitations
- Shared resources acting as bottleneck
- Overhead for program start
- Communication overhead in shared-nothing systems
Execution is a sequence of serial and parallelizable phases: t_SER1, t_PAR1, t_SER2, t_PAR2, t_SER3, ...
44 Amdahl's Law
Gene Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967.
- Serial parts: T_SER = t_SER1 + t_SER2 + t_SER3 + ...
- Parallelizable parts: T_PAR = t_PAR1 + t_PAR2 + t_PAR3 + ...
- Execution time with one processing element: T_1 = T_SER + T_PAR
- Execution time with N parallel processing elements: T_N >= T_SER + T_PAR / N (equality only on perfect parallelization, e.g. no load imbalance)
- Amdahl's law for the maximum speedup with N processing elements: S = T_1 / T_N = (T_SER + T_PAR) / (T_SER + T_PAR / N)
45 Amdahl's Law (figure: speedup curves flattening as N grows, for different serial fractions)
46 Amdahl's Law
- Speedup through parallelism is hard to achieve
- For unlimited resources, the speedup is bound by the serial parts: assuming T_1 = 1, S_∞ = T_1 / T_∞ = 1 / T_SER
- The parallelization problem relates to all system layers; hardware offers only some degree of parallel execution
- The speedup gained is bound by serial parts: limitations of hardware components; necessary serial activities in the operating system, virtual runtime system, middleware, and the application; overhead for the parallelization itself
47 Gustafson-Barsis Law (1988)
- Gustafson and Barsis pointed out that people are typically not interested in the shortest execution time, but rather want to solve the biggest problem in reasonable time
- The problem size could then scale with the number of processors, leading to a larger parallelizable part with increasing N; typical goal in simulation problems
- Time spent in the sequential part is usually fixed or grows slower than the problem size → linear speedup possible
- Formally: P_N is the portion of the program that benefits from parallelization, depending on N (and implicitly the problem size)
- Maximum scaled speedup with N processors: S = (1 - P_N) + N * P_N
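The two laws are easy to compare numerically. This small sketch evaluates both formulas from the slides above; the 10% serial / 90% parallel split is an arbitrary example value, not from the slides:

```python
def amdahl_speedup(serial_fraction, n):
    """Amdahl: fixed problem size. serial_fraction = T_SER / T_1 (T_1 = 1)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def gustafson_speedup(parallel_fraction, n):
    """Gustafson-Barsis: problem size scales with N. parallel_fraction = P_N."""
    return (1.0 - parallel_fraction) + n * parallel_fraction

# 10% serial code, 100 processing elements
print(round(amdahl_speedup(0.10, 100), 2))   # → 9.17
print(gustafson_speedup(0.90, 100))          # → 90.1
```

With a fixed problem size, 100 processors give barely 9x speedup (and never more than 1 / T_SER = 10x); letting the problem grow with N, the scaled speedup stays nearly linear.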
48 The Parallel Programming Problem (diagram: a flexible parallel application must be matched against the configuration type of the execution environment)
49 Programming Model for Shared Memory
- Different programming models for concurrency in shared memory: concurrent processes (with explicitly shared memory), concurrent threads within a process, and concurrent tasks mapped to a pool of threads
- Processes and threads are mapped to processing elements (cores)
- Process- and thread-based programming is typically part of operating system lectures
50 OpenMP
- Programming with the fork-join model
- The master thread forks into declared tasks
- The runtime environment may run them in parallel, based on dynamic mapping to threads from a pool
- Worker task barrier before finalization (join) [Wikipedia]
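The fork-join structure can be sketched without OpenMP itself. The following Python snippet (a stand-in for OpenMP's C pragmas, not OpenMP code) forks chunk-processing tasks onto a thread pool and joins at the end of the with-block:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Worker task: here it simply sums a chunk of the data."""
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# "Fork": the master thread declares tasks; the pool maps them to threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))
# "Join": leaving the with-block waits for all workers, like the barrier.

total = sum(partial_sums)
print(total)  # → 4950
```

In OpenMP the same shape would be a `#pragma omp parallel for reduction(+:total)` loop; the runtime, not the programmer, decides how tasks map to pool threads.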
51 Task Scheduling
- Classical task scheduling uses a central queue: all worker threads fetch tasks from it; scalability issue with increasing thread (resp. core) count
- Work stealing in OpenMP (and other libraries): one task queue per thread; an idling thread steals tasks from another thread
- Independent of thread scheduling; only mutual synchronization, no central queue
(diagram: each thread takes its next task from its own queue and pushes new tasks locally; an idle thread steals from another thread's queue)
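The stealing policy can be illustrated with a deliberately simplified, single-threaded simulation (hypothetical model, not a real scheduler): each worker owns a deque, pops work from its own end, and steals from the opposite end of the fullest victim queue when it runs dry:

```python
from collections import deque

def run_workers(queues):
    """Simulate round-robin workers with per-worker deques and stealing."""
    completed = [0] * len(queues)              # tasks finished per worker
    while any(queues):
        for wid, q in enumerate(queues):
            if q:
                q.pop()                        # take from own (hot) end
            else:
                # steal from the longest other queue, from the cold end
                victim = max(range(len(queues)), key=lambda i: len(queues[i]))
                if queues[victim]:
                    queues[victim].popleft()
                else:
                    continue                   # nothing to steal anywhere
            completed[wid] += 1
    return completed

# Worker 0 starts with 8 tasks, worker 1 with none: stealing balances the load
queues = [deque(range(8)), deque()]
done = run_workers(queues)
print(done)  # → [4, 4]
```

The initially idle worker ends up doing half the work without any central queue, which is the effect work stealing aims for.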
52 PGAS Languages
- Non-uniform memory architectures (NUMA) became the default, but the understanding of memory in programming is flat: all variables are equal in access time; considering the memory hierarchy is low-level coding (e.g. cache-aware programming)
- Partitioned global address space (PGAS) approach: driven by the high-performance computing community; modern approach for large-scale NUMA
- Explicit notion of a memory partition per processor: data is designated as local (near) or global (possibly far)
- The programmer is aware of NUMA nodes: performance optimization for deep memory hierarchies
53 Parallel Programming for Accelerators [4]
- OpenCL exposes CPUs, GPUs, and other accelerators as devices
- Each device contains one or more compute units, i.e. cores, SMs, ...
- Each compute unit contains one or more SIMD processing elements
54 The BIG Idea behind OpenCL
- OpenCL execution model: execute a kernel at each point in a problem domain
- E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions
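The kernel-per-point model can be mimicked in plain Python (a conceptual stand-in for an OpenCL NDRange launch, not OpenCL API code). One "kernel" invocation runs per (x, y) index of a small 4x4 example image:

```python
# Hypothetical stand-in for OpenCL's data-parallel model: one "kernel"
# invocation per point of the problem domain (here a tiny 4x4 "image").

WIDTH, HEIGHT = 4, 4
image = [[x + y * WIDTH for x in range(WIDTH)] for y in range(HEIGHT)]

def kernel(img, x, y):
    """Per-pixel kernel: invert the 8-bit pixel value."""
    return 255 - img[y][x]

# The "runtime" launches one work-item per (x, y) index of the domain.
output = [[kernel(image, x, y) for x in range(WIDTH)] for y in range(HEIGHT)]

print(output[0])  # → [255, 254, 253, 252]
```

In real OpenCL the kernel body would be written in OpenCL C and the two loops would be replaced by a single `clEnqueueNDRangeKernel` call over the 4x4 index space.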
55 Message Passing
- Programming paradigm targeting shared-nothing infrastructures
- Implementations for shared memory are available, but typically not the best possible approach
- Multiple instances of the same application run on a set of nodes (SPMD)
(diagram: a submission host starts instances 0-3 on the execution hosts)
56 Single Program Multiple Data (SPMD)
- A sequential program and its data distribution are turned into a sequential node program with message passing
- Identical copies P0..P3 run with different process identifications
57 Actor Model
Carl Hewitt, Peter Bishop and Richard Steiger. A Universal Modular Actor Formalism for Artificial Intelligence. IJCAI 1973.
- Another mathematical model for concurrent computation; no global system state concept (relationship to physics)
- Actor as computation primitive: makes local decisions, concurrently creates more actors, concurrently sends / receives messages
- Asynchronous one-way messaging with changing topology (the CSP communication graph is fixed), no order guarantees
- The recipient is identified by a mailing address
- Everything is an actor
58 Actor Model
- Interaction with asynchronous, unordered, distributed messaging
- Fundamental aspects: emphasis on local state, time, and name space; no central entity
- Actor A gets to know actor B only by direct creation, or by name transmission from another actor C
- Computation: not a global state sequence, but a partially ordered set of events
- Event: receipt of a message by a target actor; each event is a transition from one local state to another; events may happen in parallel
- Messaging reliability is declared as an orthogonal aspect
59 Message Passing Interface (MPI)
MPI_GATHER(IN sendbuf, IN sendcount, IN sendtype, OUT recvbuf, IN recvcount, IN recvtype, IN root, IN comm)
- Each process sends its buffer to the root process, including root itself
- Incoming messages are stored in rank order
- The receive buffer is ignored for all non-root processes
- MPI_GATHERV allows a varying count of data to be received
- Returns when the buffer is re-usable (no finishing promised)
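The gather semantics described above can be modeled in a few lines. This is a pure-Python simulation of what MPI_GATHER delivers (not MPI code; `gather` and its dict-of-buffers input are illustrative inventions), showing the rank-ordered concatenation at the root and the ignored receive buffer everywhere else:

```python
def gather(send_buffers, root):
    """Simulate MPI_GATHER: send_buffers[rank] is that rank's send buffer."""
    results = [None] * len(send_buffers)
    recvbuf = []
    for rank in sorted(send_buffers):    # incoming data stored in rank order
        recvbuf.extend(send_buffers[rank])
    results[root] = recvbuf              # recvbuf is only meaningful at root
    return results

# 4 ranks, each sending two values derived from its rank
send = {rank: [rank, rank * 10] for rank in range(4)}
out = gather(send, root=0)
print(out[0])  # → [0, 0, 1, 10, 2, 20, 3, 30]
print(out[1])  # → None  (receive buffer ignored on non-root ranks)
```

With real MPI, each rank would call MPI_Gather collectively and only rank 0 would find its recvbuf filled, in exactly this rank order.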
60 The Parallel Programming Problem (diagram: a flexible parallel application must be matched against the configuration type of the execution environment)
61 Execution Environment Mapping (diagram: mapping programming models to Multiple Instruction, Multiple Data (MIMD) and Single Instruction, Multiple Data (SIMD) hardware)
62 Patterns for Parallel Programming [Mattson]
- Finding Concurrency design space: task / data decomposition, task grouping and ordering due to data flow dependencies, design evaluation
- Algorithm Structure design space: task parallelism, divide and conquer, geometric decomposition, recursive data, pipeline, event-based coordination; mapping of concurrent design elements to execution units
- Supporting Structures design space: SPMD, master / worker, loop parallelism, fork / join, shared data, shared queue, distributed array; program structures and data structures used for code creation
- Implementation Mechanisms design space
63 Designing Parallel Algorithms [Foster]
- Map a workload problem onto an execution environment
- Concurrency for speedup, data locality for speedup, scalability
- The best parallel solution typically differs massively from the sequential version of an algorithm
- Foster defines four distinct stages of a methodological approach
- Example: parallel sum
64 Example: Parallel Reduction
- Reduce a set of elements into one, given an operation
- Example: sum
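A common way to parallelize a reduction is the tree scheme: combine elements pairwise, halving the number of partial results each step, so n elements need only log2(n) parallel steps. A minimal sequential sketch of that combination order (each pair in a level could run on a different processing element):

```python
def tree_reduce(values, op):
    """Tree-based reduction: pairwise combination, level by level."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))   # one task per pair
        if len(level) % 2:                           # odd element moves up
            nxt.append(level[-1])
        level = nxt
    return level[0]

data = list(range(1, 9))                             # 1..8
print(tree_reduce(data, lambda a, b: a + b))         # → 36
```

The scheme assumes the operation is associative (true for sum), which is what allows the pairs to be combined in any order and therefore in parallel.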
65 Designing Parallel Algorithms [Foster]
A) Search for concurrency and scalability:
- Partitioning: decompose computation and data into small tasks
- Communication: define the necessary coordination of task execution
B) Search for locality and other performance-related issues:
- Agglomeration: consider performance and implementation costs
- Mapping: maximize processor utilization, minimize communication
Might require backtracking or parallel investigation of steps.
66 Partitioning
- Expose opportunities for parallel execution → fine-grained decomposition
- A good partition keeps computation and data together: data partitioning leads to data parallelism, computation partitioning leads to task parallelism; complementary approaches that can lead to different algorithms
- Reveal hidden structures of the algorithm that have potential; investigate complementary views on the problem
- Avoid replication of either computation or data; this can be revised later to reduce communication overhead
- This step results in multiple candidate solutions
67 Partitioning - Decomposition Types
- Domain decomposition: define small data fragments and specify the computation for them; different phases of computation on the same data are handled separately; rule of thumb: first focus on large or frequently used data structures
- Functional decomposition: split up the computation into disjoint tasks, ignoring the data accessed for the moment; with significant data overlap, domain decomposition is more appropriate
68 Partitioning Strategies [Breshears]
- Produce at least as many tasks as there will be threads / cores; but it might be more effective to use only a fraction of the cores (granularity)
- Computation must pay off with respect to overhead; avoid synchronization, since it adds up as overhead to the serial execution time
- Patterns for data decomposition: by element (one-dimensional); by row, by column group, by block (multi-dimensional)
- Influenced by the ratio of computation and synchronization
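The "by row" pattern from the list above is easy to make concrete. A small sketch (the function name and interface are illustrative) that splits the rows of a grid into one contiguous range per task, spreading any remainder so the tasks stay comparable in size:

```python
def partition_by_rows(n_rows, n_tasks):
    """Return (start, end) row ranges, one per task, as evenly as possible."""
    base, extra = divmod(n_rows, n_tasks)
    ranges, start = [], 0
    for t in range(n_tasks):
        size = base + (1 if t < extra else 0)   # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges

print(partition_by_rows(8, 3))   # → [(0, 3), (3, 6), (6, 8)]
```

Column-group and block decompositions follow the same idea in the other dimension(s); which one pays off depends on the computation-to-synchronization ratio the slide mentions.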
69 Partitioning - Checklist
Checklist for the resulting partitioning scheme:
- Order of magnitude more tasks than processors? → keeps flexibility for the next steps
- Avoidance of redundant computation and storage requirements? → scalability for large problem sizes
- Tasks of comparable size? → goal is to allocate equal work to processors
- Does the number of tasks scale with the problem size? → the algorithm should be able to solve larger problems with more processors
Resolve bad partitioning by estimating performance behavior, and eventually reformulating the problem.
70 Communication Step
- Specify the links between data consumers and data producers, and the kind and number of messages on these links
- Domain decomposition problems might have tricky communication infrastructures, due to data dependencies
- Communication in functional decomposition problems can easily be modeled from the data flow between the tasks
- Categorization of communication patterns: local (few neighbors) vs. global communication; structured (e.g. tree) vs. unstructured communication; static vs. dynamic communication structure; synchronous vs. asynchronous communication
71 Communication - Hints
- Distribute computation and communication, don't centralize the algorithm (bad example: a central manager for parallel summation)
- Divide and conquer helps as a mental model to identify concurrency
- Unstructured communication is hard to agglomerate; better avoid it
Checklist for communication design:
- Do all tasks perform the same amount of communication? → distribute or replicate communication hot spots
- Does each task perform only local communication?
- Can communication happen concurrently? Can computation happen concurrently?
72 Ghost Cells
- Domain decomposition might lead to chunks that demand data from each other for their computation
- Solution 1: copy the necessary portion of data ("ghost cells") if no synchronization is needed after an update; the data amount and update frequency influence the resulting overhead and efficiency; additional memory consumption
- Solution 2: access the relevant data "remotely"; delays thread coordination until the data is really needed; correctness (old data vs. new data) must be considered on parallel progress
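A minimal sketch of solution 1 on a 1D array (the helper names are illustrative): each chunk receives one ghost cell from each neighbour, so a 3-point stencil can update its interior without further communication:

```python
def add_ghost_cells(data, start, end):
    """Return the chunk data[start:end] plus one ghost cell on each side."""
    left = data[start - 1] if start > 0 else 0    # boundary: pad with 0
    right = data[end] if end < len(data) else 0
    return [left] + data[start:end] + [right]

def stencil_sum(chunk_with_ghosts):
    """3-point stencil over the interior cells only."""
    c = chunk_with_ghosts
    return [c[i - 1] + c[i] + c[i + 1] for i in range(1, len(c) - 1)]

data = [1, 2, 3, 4, 5, 6]
chunk = add_ghost_cells(data, 2, 4)   # owns cells 2..3, ghosts from 1 and 4
print(chunk)                          # → [2, 3, 4, 5]
print(stencil_sum(chunk))             # → [9, 12]
```

The ghost copies go stale as soon as the neighbours update their cells, which is exactly the old-data-vs-new-data correctness question the slide raises for solution 2 as well.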
73 Agglomeration Step
- The algorithm so far is correct, but not specialized for a particular execution environment
- Check the partitioning and communication decisions again: agglomerate tasks for efficient execution on the target machine; replicate data and / or computation for efficiency reasons
- The resulting number of tasks can still be greater than the number of processors
- Three conflicting guiding decisions: reduce communication costs by a coarser granularity of computation and communication; preserve flexibility with respect to later mapping decisions; reduce software engineering costs (serial → parallel version)
74 Agglomeration [Foster] (figure: examples of agglomerating fine-grained tasks into coarser ones)
75 Agglomeration - Granularity vs. Flexibility
Reduce communication costs by coarser granularity:
- Send less data; send fewer messages (per-message initialization costs)
- Agglomerate, especially if tasks cannot run concurrently; this also reduces task creation costs
- Replicate computation to avoid communication (helps also with reliability)
Preserve flexibility:
- A flexible, large number of tasks is still a prerequisite for scalability
- Define granularity as a compile-time or run-time parameter
76 Agglomeration - Checklist
- Are communication costs reduced by increasing locality?
- Does replicated computation outweigh its costs in all cases?
- Does data replication restrict the range of problem sizes / processor counts?
- Do the larger tasks still have similar computation / communication costs?
- Do the larger tasks still act with sufficient concurrency?
- Does the number of tasks still scale with the problem size?
- How much can the task count decrease without disturbing load balancing, scalability, or engineering costs?
- Is the transition to parallel code worth the engineering costs?
77 Mapping Step
- Only relevant for shared-nothing systems, since shared memory systems typically perform automatic task scheduling
- Minimize execution time by placing concurrent tasks on different nodes, and tasks with heavy communication on the same node
- Conflicting strategies, additionally restricted by resource limits; in general an NP-complete bin packing problem
- Set of sophisticated (dynamic) heuristics for load balancing; preference for local algorithms that do not need global scheduling state
78 Surface-To-Volume Effect [Foster, Breshears]
- Visualize the data to be processed (in parallel) as a sliced 3D cube
- Synchronization requirements of a task are proportional to the surface of the data slice it operates upon (the amount of "borders" of the slice)
- Computation work of a task is proportional to the volume of the data slice it operates upon; represents the granularity of the decomposition
- Ratio of synchronization to computation: high synchronization with low computation (high ratio) → bad; low synchronization with high computation (low ratio) → good
- The ratio decreases for increasing data size per task: coarse granularity by agglomerating tasks in all dimensions; for a given volume, the surface then goes down → good
79 Surface-To-Volume Effect [Foster, Breshears] (figure, © nicerweb.com)
80 Surface-to-Volume Effect [Foster]
Computation on an 8x8 grid:
- (a) 64 tasks, one point each: 64 x 4 = 256 synchronizations, 256 data values are transferred
- (b) 4 tasks, 16 points each: 4 x 4 = 16 synchronizations, 16 x 4 = 64 data values are transferred
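The slide's numbers can be reproduced with a small calculation. This sketch follows the slide's simplified counting (every task synchronizes with 4 neighbours, i.e. boundary effects are ignored) and assumes the task count is a perfect square dividing the grid side:

```python
import math

def sync_and_transfer(n, tasks):
    """For an n x n grid split into `tasks` square blocks, return
    (#synchronizations, #values transferred), counting 4 neighbours
    per task as on the slide."""
    per_side = math.isqrt(tasks)        # blocks per grid side
    edge = n // per_side                # points along one block edge
    syncs = tasks * 4                   # one synchronization per neighbour
    values = syncs * edge               # edge points sent per synchronization
    return syncs, values

print(sync_and_transfer(8, 64))   # → (256, 256)
print(sync_and_transfer(8, 4))    # → (16, 64)
```

Agglomerating from 64 single-point tasks to 4 blocks cuts synchronizations by 16x and transferred data by 4x while the computed volume stays the same: the surface shrinks, the volume does not.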
81 Designing Parallel Algorithms [Breshears]
- The parallel solution must keep the sequential consistency property
- Mentally simulate the execution of the parallel streams; check critical parts of the parallelized sequential application
- Amount of computation per parallel task: overhead is always introduced by moving from serial to parallel code; the speedup must offset the parallelization overhead (Amdahl)
- Granularity: the amount of parallel computation done before synchronization is needed; trade-off between fine-grained granularity (overhead) and coarse-grained granularity (concurrency)
- Iterative approach to finding the right granularity; the decision might be correct only for a chosen execution environment
82 OK?!?
83 Certificate for free