Overview: Classical Parallel Hardware (COMP4300/8300 L2-3)


Overview: Classical Parallel Hardware

Review of Single Processor Design
- so we talk the same language
- many things happen in parallel even on a single processor
- identify potential issues for parallel hardware: why use 2 CPUs if you can double the speed of one!

Multiple Processor Design
- architectural classifications
- shared/distributed memory
- hierarchical/flat memory
- dynamic/static processor connectivity
- evaluating static networks
- case study: the NCI Raijin supercomputer

Processor Performance

  FLOP/s   Prefix   Occurrence
  10^3     kilo     very badly written code
  10^6     mega     badly written code
  10^9     giga     single-core
  10^12    tera     multiple chips (NCI)
  10^15    peta     23 machines in the Top500 (Nov 2012, measured)
  10^18    exa      around 2020!

Theoretical peak examples (see the code sketch below):
- PC, 2.5 GHz Core2 Quad: 4 (cores) * 4 (ops) * 2.5 GHz = 40 GFLOP/s
- Bunyip, Pentium III: 96 (nodes) * 2 (sockets) * 1 (op) * 550 MHz = 105 GFLOP/s
- NCI Raijin: 3592 (nodes) * 2 (sockets) * 8 (cores) * 8 (ops) * 2.6 GHz = 1.19 PFLOP/s

The Processor

Performs:
- floating point operations (FLOPS): add, mult, division (sqrt maybe!)
- integer operations (MIPS): adds etc., also logical ops and instruction processing
- MIPS: Millions of Instructions Per Second (historically measured relative to a very old VAX 11/780); anyway, what counts as a machine instruction differs between CPUs!
- our primary focus will be on floating point operations

The clock:
- all ops take a fixed number of clock ticks to complete
- clock speed is measured in GHz (10^9 cycles/second) or nsec (10^-9 seconds): Apple iPhone 6 ARM A8 1.4 GHz (0.71 ns), NCI Raijin Intel Xeon Sandy Bridge 2.6 GHz (0.38 ns), IBM zEC12 processor 5.5 GHz (0.18 ns)
- clock speed is limited by etching + the speed of light, hence motivates parallelism
- (to my knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
- light travels about 1 m in 3.2 ns, and a chip is a few cm!

Adding Numbers

Consider adding two double precision (8 byte) numbers: [sign] [exponent] [significand]
Possible steps:
- determine the largest exponent
- normalize the significand of the number with the smaller exponent to the larger
- add the significands
- re-normalize the significand and exponent of the result
Multiple steps, each taking 1 tick, implies 4 ticks per addition (FLOP).
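The notes stop at the arithmetic; as a minimal sketch (not part of the original slides), the following C program evaluates the same peak-performance product using the Raijin figures quoted above.

    #include <stdio.h>

    /* Sketch: theoretical peak FLOP/s = nodes * sockets/node * cores/socket
     * * FLOPs/cycle * clock (Hz), using the NCI Raijin figures quoted above. */
    int main(void) {
        double nodes = 3592, sockets = 2, cores = 8, flops_per_cycle = 8;
        double clock_hz = 2.6e9;
        double peak = nodes * sockets * cores * flops_per_cycle * clock_hz;
        printf("theoretical peak = %.3f PFLOP/s\n", peak / 1e15);  /* ~1.19 */
        return 0;
    }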

Operation Pipelining

(Diagram: operands X(6) ... X(1) flow through the pipeline stages from Waiting to Done.)
- X(1) takes 4 ticks to appear (startup latency); X(2) appears 1 tick after X(1)
- asymptotically achieves 1 result per clock tick
- the operation (X) is said to be pipelined: steps in the pipeline are running in parallel
- requires the same op consecutively on different (independent) data items
- good for vector operations (note limitations on chaining output data to input)
- not all operations are pipelined, e.g. integer mult., division, sqrt.
E.g. Alpha EV6:

  operation   latency   repeat
  +, -, *     4         1
  /, sqrt     (not pipelined: repeat rate is close to the latency)

Pipelining: Dependent Instructions
- principle: the CPU must ensure the result is the same as if there were no pipelining (or parallelism)
- dependences between instructions requiring only 1 cycle in the EX stage (e.g. r3 = r1 + r2 followed by r5 = r3 + r4) can be solved by pipeline feedback from the EX stage to the next cycle
- (important) instructions requiring c cycles for the EX stage (i.e. f.p. *, +, load, store) are normally implemented by having c EX stages; this requires any dependent instruction to be delayed by c cycles
- e.g. c = 3: instructions I0..I3 each pass through FI DI FO EX1 EX2 EX3 WB, and a dependent instruction may only start its EX stages after the instruction it depends on has completed EX3

Instruction Pipelining
- break instruction execution into k stages; can get k-way parallelism (generally, the circuitry for each stage is independent)
- e.g. (k = 5) stages: FI = Fetch Instruction, DI = Decode Instruction, FO = Fetch Operand, EX = Execute Instruction, WB = Write Back
- (diagram: a branch instruction proceeds through FI DI FO EX WB; the following instructions are issued on a guess, and the guess only becomes sure once the branch completes)
- note: the FO & WB stages may involve memory accesses (and may possibly stall the pipeline)
- conditional branch instructions are problematic: a wrong guess may require flushing succeeding instructions from the pipeline
- tendency to increase the number of stages, but this increases the startup latency; e.g. number of stages: UltraSPARC II (9), UltraSPARC III (14), Intel Prescott (31)

Superscalar (Multiple Instruction Issue)
- a small number (w) of instructions are scheduled by the H/W to execute together
- groups must have an appropriate instruction mix, e.g. UltraSPARC (w = 4): 2 different floating point, 1 load/store, 1 branch, 2 integer/logical instructions per group
- gives w-way parallelism over different types of instructions
- generally requires multiple (>= w) instruction fetches and an extra grouping (G) stage in the pipeline
- problem: amplifies dependencies and other problems of pipelining by a factor of w!
- issue: the instruction mix must be balanced for maximum performance, i.e. floating point * and + must be balanced
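The slides describe pipelining purely in hardware terms; the C sketch below (an illustration, not from the notes) shows the software-level consequence: a reduction with a loop-carried dependence cannot keep a multi-stage FP adder full, whereas independent partial sums can.

    #include <stddef.h>

    /* Illustrative sketch: the first loop has a loop-carried dependence
     * (each add needs the previous result), so a 4-stage FP adder delivers
     * roughly one result every 4 cycles.  The second loop uses 4 independent
     * accumulators, so adds for different accumulators overlap in the
     * pipeline and can approach one result per cycle. */
    double sum_dependent(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];                      /* each iteration waits on s */
        return s;
    }

    double sum_pipelined(const double *x, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 3 < n; i += 4) {    /* 4 independent partial sums */
            s0 += x[i];      s1 += x[i + 1];
            s2 += x[i + 2];  s3 += x[i + 3];
        }
        for (; i < n; i++) s0 += x[i];      /* remainder */
        return (s0 + s1) + (s2 + s3);
    }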

Review: Instruction Level Parallelism
- instruction level parallelism (ILP): pipelining and superscalar together offer k*w-way parallelism
- branch prediction techniques alleviate the issue of conditional branches; they use a table recording the results of recently-taken branches
- out-of-order execution alleviates the issue of dependencies: fetched instructions are pulled into a buffer of size W, W >> w, and executed in any order provided dependencies are not violated; the hardware must keep track of all in-flight instructions and associated registers (O(W^2) area and power!)
- in most situations, the compiler can do as good a job as a human at exposing this parallelism (ILP was part of the Free Lunch)
- an exception: how to get an UltraSPARC III to execute 1 f.p. multiply (and add!) per cycle

Cache Memory
- Main Memory: large, cheap memory; large latency/small bandwidth
- Cache: small, fast, expensive memory; lower latency/higher bandwidth
- CPU registers: part of the memory hierarchy
- cache hit: data is in cache and received in a few cycles
- cache miss: data must be fetched from main memory (or a higher level of cache)
- try to ensure data is in cache (or as close to the CPU as possible): can we block the algorithm to minimize memory traffic?
- cache memory is effective because algorithms often use data that was recently accessed from memory (temporal locality) or was close to other recently accessed data (spatial locality)
- note: duplication of data in the cache will have implications for parallel systems!

Memory Structure

Consider the DAXPY computation: y(i) = a*x(i) + y(i)
- if, theoretically, the CPU can perform the 2 FLOPs in 1 cycle, the memory system must load two doubles (x(i) and y(i), 16 bytes) and store one (y(i), 8 bytes) each clock cycle
- on a 1 GHz system this implies 16 GB/s load traffic and 8 GB/s store traffic
- typically a processor core can only issue one load OR store instruction in a clock cycle
- DDR3-SDRAM memory is available clocked at 1066 MHz, with access times accordingly
- memory latency and bandwidth are critical performance issues
- caches reduce latency and provide improved cache-to-CPU bandwidth
- multiple memory banks improve bandwidth (by parallel access)

Cache Mapping
- blocks of main memory are mapped to a cache line
- cache lines are typically 64-128 bytes wide
- the mapping may be direct, or n-way associative; e.g. 2-way associative: each main memory block may be placed in either of two cache lines (the slide's diagram maps alternate memory blocks onto cache lines "1 or 3" and "2 or 4")
- the entire cache line is fetched from memory, not just one element (why?)
- try to structure code to use an entire cache line of data; it is best to have unit stride
- pointer chasing is very bad! (large address strides guaranteed!)
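DAXPY appears above only as a formula; here is a hedged C sketch of it together with a strided variant, to connect it to the unit-stride advice in the cache-mapping notes (the 64-byte line size in the comment is an assumption, not from the slides).

    #include <stddef.h>

    /* Sketch: DAXPY as described in the notes, plus a strided variant.
     * With stride 1, each (assumed) 64-byte cache line fetched for x and y
     * supplies 8 consecutive doubles; with a large stride, every element
     * costs a full cache-line transfer. */
    void daxpy(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];       /* unit stride: good spatial locality */
    }

    void daxpy_strided(size_t n, double a, const double *x, double *y,
                       size_t stride) {
        for (size_t i = 0; i < n; i++)    /* poor locality for large stride */
            y[i * stride] = a * x[i * stride] + y[i * stride];
    }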

Memory Banks
- memory bandwidth can be improved by having multiple parallel paths to/from memory, organized in b banks (diagram: CPU connected to Bank 1 ... Bank 4)
- the traditional solution used by vector processors
- good performance for unit stride
- very bad performance if bank conflicts occur; worst case: power-of-2 strides, where b is a power of 2 (see the sketch below)

Going Parallel
- inevitably the performance of a single processor is limited by the clock speed
- improved manufacturing increases the clock, but it is ultimately limited by the speed of light
- ILP allows multiple ops at once, but is limited
- it's time to go parallel!
Hardware issues:
- Flynn's taxonomy of parallel processors: SIMD/MIMD
- shared/distributed memory
- hierarchical/flat memory
- dynamic/static processor connectivity
- characteristics of static networks

Parallelism in Memory Access
- Dual Inline Memory Module (DIMM)
- a processor may have 2 or 4 DIMMs (i.e. b = 16, 32), depending on its lowest-level cache line size
- each DRAM chip access is pipelined (select row address, select column address, access byte)

Architecture Classification: Flynn's Taxonomy

Why classify?
- what kind of parallelism is employed?
- which architecture has the best prospects for the future?
- what has already been achieved by current architecture types?
- reveal configurations that have not yet been considered by system architects
- enable the building of performance models

Flynn's taxonomy is based on the degree of parallelism, with 4 categories determined by the number of instruction and data streams:

                           Data stream
                           Single               Multiple
  Instruction   Single     SISD (1 CPU)         SIMD (array/vector processor)
  stream        Multiple   MISD (pipelined?)    MIMD (multiple processor)
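As a small illustration of the bank-conflict point above (not from the slides), this C sketch maps element indices onto b = 8 interleaved banks for several strides; a stride equal to the (power-of-2) number of banks hits the same bank every time.

    #include <stdio.h>

    /* Sketch: which of b interleaved banks does access i * stride hit?
     * A stride that is a multiple of b serialises all accesses on one bank. */
    int main(void) {
        int b = 8;                                 /* number of banks (power of 2) */
        int strides[] = {1, 3, 8};
        for (int s = 0; s < 3; s++) {
            printf("stride %d: banks ", strides[s]);
            for (int i = 0; i < 8; i++)
                printf("%d ", (i * strides[s]) % b);   /* bank = address mod b */
            printf("\n");
        }
        return 0;
    }
    /* stride 1 -> 0 1 2 3 4 5 6 7  (all banks used)
       stride 3 -> 0 3 6 1 4 7 2 5  (all banks used)
       stride 8 -> 0 0 0 0 0 0 0 0  (every access conflicts on bank 0) */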

SIMD and MIMD

SIMD: Single Instruction Multiple Data
- also known as data parallel processors or array processors
- vector processors (to some extent)
- current examples include SSE instructions, the SPEs on the Cell BE, and GPUs
- NVIDIA's SIMT (T = Threads) is a slight variation

MIMD: Multiple Instruction Multiple Data
- examples include a quad-core PC and the octa-core Xeons on Raijin

Address Space Organization: Message Passing
- each processor has local or private memory
- processors interact solely by message passing (see the MPI sketch below)
- commonly known as distributed memory parallel computers
- memory bandwidth scales with the number of processors
- example: between nodes on the NCI Raijin system
- (diagram: processor-memory pairs connected by a network)

MIMD
- the most successful parallel model
- more general purpose than SIMD (e.g. the CM5 could emulate the CM2)
- harder to program, as processors are not synchronized at the instruction level
Design issues for MIMD machines:
- scheduling: efficient allocation of processors to tasks in a dynamic fashion
- synchronization: prevent processors accessing the same data simultaneously
- interconnect design: processor-to-memory and processor-to-processor interconnects; also the I/O network - often processors are dedicated to I/O devices
- overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g. resolving contention for resources
- partitioning: identifying parallelism in algorithms that is capable of exploiting concurrent processing streams is non-trivial

Address Space Organization: Shared Address Space
- processors interact by modifying data objects stored in a shared address space
- the simplest solution is flat or uniform memory access (UMA)
- scalability of memory bandwidth and of processor-processor communication (arising from cache line transfers) are problems, as is synchronizing access to shared data objects
- example: a dual/quad core PC (ignoring cache)
- (diagram: processors connected to a shared memory)
- (Aside: SPMD, Single Program Multiple Data, is more restrictive than MIMD, implying that all processors run the same executable. It simplifies the use of a shared address space.)
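The notes do not show message-passing code; a minimal MPI sketch (MPI being the usual way to program distributed-memory machines such as Raijin across nodes) might look like the following, with rank 0 sending one double to rank 1.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal message-passing sketch: compile with mpicc, run with
     * mpirun -np 2.  Each rank has its own private memory; data moves
     * only via explicit send/receive. */
    int main(int argc, char **argv) {
        int rank;
        double x = 3.14;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", x);
        }
        MPI_Finalize();
        return 0;
    }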

Non-Uniform Memory Access (NUMA)
- the machine includes some hierarchy in its memory structure
- all memory is visible to the programmer (single address space), but some memory takes longer to access than other memory
- in this sense, cache introduces one level of NUMA
- NUMA arises between sockets on the NCI Raijin system, or in a multi-socket Opteron system (some ANU research on an Opteron (2011))

Dynamic Processor Connectivity: Crossbar
- a non-blocking network, in that the connection of two processors does not block connections between other processors
- complexity grows as O(p^2)
- may be used to connect processors each with their own local memory (diagram: processors and memories joined by a crossbar)
- may also be used to connect processors with various memory banks, e.g. the 8 cores on an UltraSPARC T2 access the 8 banks of its level 2 cache

Shared Address Space Access
- Parallel Random Access Machine (PRAM): a model of any shared memory machine
- what happens when multiple processors try to read/write the same memory location at the same time?
PRAM models:
- Exclusive-read, exclusive-write (EREW PRAM)
- Concurrent-read, exclusive-write (CREW PRAM)
- Exclusive-read, concurrent-write (ERCW PRAM)
- Concurrent-read, concurrent-write (CRCW PRAM)
Concurrent read is OK, but concurrent write requires arbitration:
- Common: allowed if all values being written are identical
- Arbitrary: an arbitrary processor is allowed to proceed and the rest fail
- Priority: processors are organized into a predefined prioritized list; the processor with the highest priority succeeds and the rest fail
- Sum: the sum of all written quantities is stored

Dynamic Processor Connectivity: Multi-staged Networks
- consist of lg p stages, where p is the number of processors (lg p = log2(p))
- let s and t be the binary representations of the message source and destination
- at stage 1, route straight through if the most significant bits of s and t are the same; otherwise, cross over
- the process repeats at each subsequent stage using the next most significant bit, etc.
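The stage-by-stage routing rule for the multi-staged network can be written down directly; the following C sketch (an illustration, not from the notes) prints the straight-through/crossover decision at each of the lg p stages by comparing source and destination bits from the most significant bit down.

    #include <stdio.h>

    /* Sketch of the routing rule above: at stage k compare the k-th most
     * significant bit of source s and destination t; equal bits mean
     * straight through, different bits mean crossover. */
    void route(unsigned s, unsigned t, int lg_p) {
        for (int stage = 0; stage < lg_p; stage++) {
            int bit = lg_p - 1 - stage;                 /* MSB first */
            int sb = (s >> bit) & 1, tb = (t >> bit) & 1;
            printf("stage %d: %s\n", stage + 1,
                   sb == tb ? "straight through" : "crossover");
        }
    }

    int main(void) {
        route(2, 5, 3);   /* p = 8 processors: s = 010, t = 101 */
        return 0;
    }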

Dynamic Processor Connectivity: Bus
- a processor gains exclusive access to the bus for some period
- the performance of the bus limits scalability (diagram: processors sharing a single bus)
- Performance: crossbar > multistage > bus
- Cost: crossbar > multistage > bus
- Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?

Static Processor Connectivity: Hypercube
- a multidimensional mesh with exactly two processors in each dimension
- p = 2^d, where d is the dimension of the hypercube
- advantage: at most d hops between any two processors
- disadvantage: the number of connections per processor is d
- can be generalized to k-ary d-cubes, e.g. a 4x4 torus is a 4-ary 2-cube
- examples: Intel iPSC Hypercube, NCube, SGI Origin, Cray T3D, TOFU

Static Processor Connectivity: Complete, Mesh, Tree
- completely connected (becomes very complex!)
- linear array/ring, mesh/2D torus
- tree (static if the nodes are processors)

Static Processor Connectivity: Hypercube Characteristics
- two processors are connected directly ONLY IF their binary labels differ by one bit
- in a d-dimensional hypercube each processor connects directly to d others
- a d-dimensional hypercube can be partitioned into two (d-1)-dimensional sub-cubes, etc.
- the number of links in the shortest path between two processors is the Hamming distance between their labels
- the Hamming distance between two processors labeled s and t is the number of bits that are on in the binary representation of s XOR t, where XOR is the bitwise exclusive-or operation (e.g. labels 010 and 101 give 010 XOR 101 = 111, a distance of 3)
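The Hamming-distance rule above translates directly into code; this C sketch (illustrative only) counts the 1 bits of s XOR t to give the number of hops between two hypercube nodes.

    #include <stdio.h>

    /* Sketch: the shortest-path length between hypercube nodes s and t is
     * the Hamming distance, i.e. the number of 1 bits in s XOR t. */
    int hops(unsigned s, unsigned t) {
        unsigned x = s ^ t;
        int count = 0;
        while (x) {              /* count set bits */
            count += x & 1;
            x >>= 1;
        }
        return count;
    }

    int main(void) {
        /* in a 3-dimensional hypercube (p = 8): nodes 010 and 101 */
        printf("%d hops\n", hops(2, 5));   /* prints 3 */
        return 0;
    }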

Evaluating Static Interconnection Networks #1

Diameter
- the maximum distance between any two processors in the network
- directly determines communication time (latency)

Connectivity
- the multiplicity of paths between any two processors
- high connectivity is desirable, as it minimizes contention (and also enhances fault-tolerance)
- arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  - 1 for linear arrays and binary trees
  - 2 for rings and 2D meshes
  - 4 for a 2D torus
  - d for d-dimensional hypercubes

Evaluating Static Interconnection Networks #2

Channel width
- the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth
- bisection width: the minimum number of communication links that have to be removed to partition the network into two equal halves
- bisection bandwidth: the minimum volume of communication allowed between two halves of the network with equal numbers of processors

Cost
- many criteria can be used; we will use the number of communication links or wires required by the network

Summary: Static Interconnection Characteristics

  Network                Diameter             Bisection width   Arc connectivity   Cost (no. of links)
  Completely-connected   1                    p^2/4             p-1                p(p-1)/2
  Binary tree            2 lg((p+1)/2)        1                 1                  p-1
  Linear array           p-1                  1                 1                  p-1
  Ring                   floor(p/2)           2                 2                  p
  2D mesh                2(sqrt(p)-1)         sqrt(p)           2                  2(p-sqrt(p))
  2D torus               2 floor(sqrt(p)/2)   2 sqrt(p)         4                  2p
  Hypercube              lg p                 p/2               lg p               (p lg p)/2

Note: the binary tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this:
- usually, only the leaf nodes contain a processor
- it can easily be partitioned into sub-networks without degraded performance (useful for supercomputers)

Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?

NCI's Raijin: A Petascale Supercomputer
- 57,472 cores (dual socket, 8-core Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
- 160 TBytes (approx.) of main memory
- Mellanox Infiniband FDR interconnect (52 km of cable); interconnects: ring (cores), full (sockets), fat tree (nodes)
- 10 PBytes (approx.) of usable fast filesystem (for short-term scratch space, apps, home directories)
- power: 1.5 MW maximum load
- cooling systems: 100 tonnes of water
- 24th fastest in the world on debut (November 2012); the first petaflop system in Australia
- fastest (parallel) filesystem in the southern hemisphere
- custom monitoring and deployment; custom Linux kernel; highly customised PBS Pro scheduler
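To get a feel for the summary table, the following C sketch (illustrative, not from the notes) evaluates the 2D torus and hypercube rows at p = 1024; the hypercube has a much smaller diameter and larger bisection width, but needs more links per node and more links overall.

    #include <stdio.h>
    #include <math.h>

    /* Sketch: evaluate the summary-table formulas for a 2D torus and a
     * hypercube at a given p.  Compile with -lm. */
    int main(void) {
        double p = 1024;                      /* number of processors */
        double lgp = log2(p), sq = sqrt(p);
        printf("2D torus : diameter %.0f, bisection %.0f, cost %.0f links\n",
               2 * floor(sq / 2), 2 * sq, 2 * p);
        printf("hypercube: diameter %.0f, bisection %.0f, cost %.0f links\n",
               lgp, p / 2, p * lgp / 2);
        return 0;
    }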

NCI's Integrated High Performance Environment
(figure)

Further Reading: Parallel Hardware
- The Free Lunch Is Over!
- Ch 1 of Introduction to Parallel Computing
- Ch 1, 2 of Principles of Parallel Programming
- lecture notes from Calvin Lin: A Success Story: ISA; Parallel Architectures
