COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell
COMP4300/8300 Lecture 2. Copyright (c) 2015 The Australian National University.

2.1 Lecture Outline
Review of single processor design
- So we talk the same language
- Many things happen in parallel even on a single processor
- Identify potential issues for parallel hardware: why use 2 CPUs if you can double the speed of one!
Multiple processor design
- Hardware models
- Shared/distributed memory
- Hierarchical/flat memory
- Dynamic/static processor connectivity
- Evaluating static networks
- Routing mechanisms

2.2 The Processor
Performs:
- Floating point operations (FLOPS): add, mult, division (sqrt maybe!)
- Integer operations (MIPS): adds etc, also logical ops and instruction processing
- MIPS: Machine Instructions Per Second, historically measured against the very old VAX-11/780 - and anyway, what counts as a machine instruction differs between CPUs!
- Our primary focus will be on floating point operations
Clock:
- All operations take a fixed number of clock ticks to complete
- Clock speed is measured in GHz (10^9 cycles/second) or nsec (10^-9 seconds)
- Apple iPhone 6 ARM A8 1.4GHz (0.71ns); NCI Raijin Intel Xeon Sandy Bridge 2.6GHz (0.38ns); IBM zEC12 processor 5.5GHz (0.18ns)
- Clock speed is limited by etching feature size and the speed of light, hence the motivation for parallel (e.g. dual-core) systems
- (To my knowledge) the IBM zEC12 is the fastest commodity processor at 5.5GHz
- Light travels about 10cm in 0.32ns, and a chip is a few cm across!

2.3 Performance
FLOPS/sec   Prefix   Occurrence
10^3        kilo     very badly written code
10^6        mega     badly written code
10^9        giga     single-core
10^12       tera     multiple chip (NCI)
10^15       peta     23 machines in Top500 (Nov 2012, measured)
10^18       exa      around 2020!
Peak examples:
- PC 2.5GHz Core2 Quad: 4(cores) * 4(ops) * 2.5GHz = 40 GF
- Bunyip Pentium III: 96(nodes) * 2(sockets) * 1(op) * 550MHz = 105 GF
- NCI Raijin: 3592(nodes) * 2(sockets) * 8(cores) * 8(ops) * 2.6GHz = 1.19 PF
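The peak figures above are just products of those factors. A throwaway sketch (not course code) that recomputes two of them:

```c
#include <stdio.h>

/* Peak FLOP/s = nodes * sockets * cores * FLOPs-per-cycle * clock (Hz).
   The factors below are the ones quoted on the slide. */
int main(void) {
    double raijin = 3592.0 * 2 * 8 * 8 * 2.6e9;  /* NCI Raijin */
    double pc     = 1.0 * 1 * 4 * 4 * 2.5e9;     /* 2.5GHz Core2 Quad */
    printf("Raijin peak: %.2f PFLOP/s\n", raijin / 1e15);  /* ~1.19 */
    printf("PC peak:     %.1f GFLOP/s\n", pc / 1e9);       /* 40 */
    return 0;
}
```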
2.4 Adding Numbers
- Consider adding two double precision (8 byte) numbers: [sign | exponent | mantissa]
- Possible steps:
  1. Determine the largest exponent
  2. Normalize the smaller exponent to the larger
  3. Add the mantissas
  4. Renormalize the mantissa and exponent of the result
- Multiple steps each taking 1 tick implies 4 ticks per addition (FLOP)
- (A toy code sketch of these four steps appears after section 2.7 below)

2.5 Pipeline Operations #1
(Diagram: elements X(6)..X(2) waiting or moving through a 4-step pipeline, with X(1) done)
- X(1) takes 4 ticks to appear (startup latency)
- X(2) appears 1 tick after X(1)
- Asymptotically we achieve 1 result per clock tick
- The operation is said to be pipelined: the steps in the pipeline are running in parallel

2.6 Pipeline Operations #2
- Requires the same operation consecutively on different (independent) data items
  - good for vector operations
  - note limitations on chaining output data to input
- Tendency to increase the number of stages in a pipeline, since each shorter stage can run faster
  - the more stages in a pipeline, the greater the startup latency
  - UltraSPARC II has a 9 stage pipeline, UltraSPARC III a 14 stage pipeline
  - the Prescott Pentium 4 processor had a 31 stage pipeline
- Not all operations are pipelined, e.g. integer multiplication, division, sqrt
- Clock cycles for different operations on the Alpha EV6:
  Operation   Latency   Repeat
  +, -, *     4         1
  /, sqrt     (much larger, and not pipelined: the repeat interval is comparable to the latency)

2.7 Instruction Parallelism
- The hardware issues multiple instructions per clock cycle, executed in parallel on different parts of the chip
- Grouping rules restrict what can be done in parallel, e.g. UltraSPARC: 4 instructions drawn from 2*floating point, 2*integer, 1*load/store, 1*branch
(Diagram: independent multiply and addition units consuming inputs 1-3 and producing a result)
- Pentium III: single FLOP per cycle
- Opteron, UltraSPARC, Alpha: 2 (different) FLOPs per cycle
- Core2, Itanium2, IBM Power5: 4 (DP) FLOPs per cycle
- Xeon Sandy Bridge: 8 (DP) FLOPs per cycle
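As promised under 2.4, a toy sketch of the four addition steps. The decimal (mantissa, exponent) format below is hypothetical, chosen for readability; real hardware uses binary IEEE 754, but the align/add/renormalize flow is the same:

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy float: value = mantissa * 10^exponent, mantissa kept below 1000
   (a made-up 3-digit decimal format, purely for illustration). */
typedef struct { long mantissa; int exponent; } Toy;

Toy toy_add(Toy a, Toy b) {
    /* Step 1: determine the largest exponent. */
    if (a.exponent < b.exponent) { Toy t = a; a = b; b = t; }
    /* Step 2: normalize the smaller operand to the larger exponent. */
    while (b.exponent < a.exponent) { b.mantissa /= 10; b.exponent++; }
    /* Step 3: add the mantissas. */
    Toy r = { a.mantissa + b.mantissa, a.exponent };
    /* Step 4: renormalize the result to fit the mantissa width. */
    while (labs(r.mantissa) >= 1000) { r.mantissa /= 10; r.exponent++; }
    return r;
}

int main(void) {
    Toy r = toy_add((Toy){95, -1}, (Toy){472, -2});   /* 9.5 + 4.72 */
    printf("%ld x 10^%d\n", r.mantissa, r.exponent);
    /* Prints 142 x 10^-1, i.e. 14.2: step 2 discarded a digit while
       aligning, which is exactly where rounding error comes from. */
    return 0;
}
```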
2.8 Memory Structure
- Consider DAXPY: Y(i) = a*X(i) + Y(i)
- If, theoretically, the CPU can perform the 2 FLOPs in 1 cycle, memory must deliver (load) two doubles (X(i) and Y(i), or 16 bytes) and store one (Y(i), 8 bytes) every clock cycle
- On a 1GHz system this implies 16GB/sec of load traffic and 8GB/sec of store traffic
- Typically a processor core can only issue one load OR store instruction in a clock cycle
- DDR3-SDRAM memory is available clocked at 1066MHz, with access times to match
- Latency and bandwidth are critical performance issues
  - caches: reduce latency and provide improved cache-to-CPU bandwidth
  - memory banks: improve bandwidth
- (A DAXPY code sketch appears after section 2.11 below)

2.9 Memory Hierarchy
- Main memory: large, cheap; large latency, small bandwidth
- Cache: small, fast, expensive; lower latency, higher bandwidth
- Gives a memory hierarchy, or Non-Uniform Memory Access (NUMA)
- Cache hit: data is in cache and received in a few cycles
- Cache miss: data is fetched from main memory (or a higher level cache)
- Try to ensure data is in cache (or as close to the CPU as possible): can we block the algorithm to minimize memory traffic?
- Cache is effective because algorithms often use data that are close in memory
- (Note: duplication of data in cache will have implications for parallel systems!)

2.10 Cache Mapping
- Blocks of main memory are mapped to a cache line
- A cache line is typically 32-128 bytes wide
- The mapping may be direct, or n-way associative
- The entire cache line is fetched from memory, not just one element
- Structure code to try to use an entire cache line of data
  - best to have unit stride
  - pointer chasing is very bad
(Diagram: blocks of main memory mapped alternately onto four cache lines)

2.11 Memory Banks
- Memory bandwidth is improved by having multiple parallel paths to/from memory
(Diagram: CPU connected to Banks 1-4 in parallel)
- The traditional solution used by vector processors
- High initial latency
- Good performance for unit stride
- Very bad performance if there is a bank conflict
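The DAXPY loop referenced in 2.8, written out in plain C; the comment tallies the per-iteration memory traffic the slide derives. A minimal sketch, not the course's code:

```c
#include <stddef.h>

/* DAXPY: y(i) = a*x(i) + y(i). Each iteration performs 2 FLOPs but
   needs two 8-byte loads (x[i], y[i]) and one 8-byte store (y[i]):
   24 bytes of memory traffic per 2 FLOPs, which is why this loop is
   memory-bound on any machine that can retire 2 FLOPs per cycle. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```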
2.12 Going Parallel
- Inevitably the performance of a single processor is limited by the clock speed
- Improved manufacturing increases the clock rate, but it is ultimately limited by the speed of light
- Superscalar execution allows multiple operations at once, but is not always applicable
- It's time to go parallel!
Hardware issues:
- Flynn's taxonomy of parallel processors
- SIMD/MIMD
- Shared/distributed memory
- Hierarchical/flat memory
- Dynamic/static processor connectivity
- Characteristics of static networks

2.13 Architecture Classification: Flynn's Taxonomy
Why classify?
- What kind of parallelism is employed?
- Which architecture has the best prospects for the future?
- What has already been achieved by current architecture types?
- Reveal configurations not yet considered by system architects
- Enable building of performance models
Flynn's taxonomy is based on the degree of parallelism, with 4 categories determined by the number of instruction and data streams:

                              Data Stream
                        Single            Multiple
Instruction   Single    SISD (1 CPU)      SIMD (array/vector)
Stream        Multiple  MISD (pipelined?) MIMD (multiple processors)

2.14 SIMD and MIMD
SIMD: Single Instruction Multiple Data
- Also known as data parallel processors or array processors
- Vector processors (to some extent)
- Current examples include SSE instructions, SPEs on the CellBE, GPUs
- NVIDIA's SIMT (T = Threads) is a slight variation
- (A small SSE sketch appears after section 2.15 below)
MIMD: Multiple Instruction Multiple Data
- Examples include a quad-core PC, and the octa-core Xeons on Raijin
(Diagram: SIMD - one global control unit driving many CPUs; MIMD - each CPU has its own control unit)

2.15 MIMD
- The most successful parallel model
- More general purpose than SIMD (e.g. the CM5 could emulate the CM2)
- Harder to program, as processors are not synchronized at the instruction level
Design issues for MIMD machines:
- Scheduling: efficient allocation of processors to tasks in a dynamic fashion
- Synchronization: prevent processors accessing the same data simultaneously
- Interconnection design: processor to memory and processor to processor interconnects; also the I/O network - often processors are dedicated to I/O devices
- Overhead: inevitably there is some overhead in coordinating activities between processors, e.g. resolving contention for resources
- Partitioning: identifying parallelism in algorithms that can exploit concurrent processing streams is non-trivial
(Aside - SPMD, Single Program Multiple Data: more restrictive than MIMD, implying that all processors run the same executable. Simplifies use of a shared address space.)
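Slide 2.14 cites SSE as a current SIMD example. A minimal sketch using SSE2 intrinsics, in which a single add instruction performs two double-precision additions at once (the data values are arbitrary):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    double a[2] = {1.0, 2.0}, b[2] = {10.0, 20.0}, c[2];
    __m128d va = _mm_loadu_pd(a);     /* load two doubles into one register */
    __m128d vb = _mm_loadu_pd(b);
    __m128d vc = _mm_add_pd(va, vb);  /* ONE instruction, TWO additions */
    _mm_storeu_pd(c, vc);
    printf("%g %g\n", c[0], c[1]);    /* 11 22 */
    return 0;
}
```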
2.16 Address Space Organization: Message Passing
- Each processor has local or private memory
- Processors interact solely by message passing
- Commonly known as distributed memory machines
- Memory bandwidth scales with the number of processors
- Example: between nodes on the NCI Raijin system
- (An illustrative message-passing sketch appears after section 2.19 below)
(Diagram: processors, each with its own private memory, joined by an interconnect)

2.17 Address Space Organization: Shared Address Space
- Processors interact by modifying data objects stored in a shared address space
- Flat, uniform memory access (UMA)
- Scalability of memory bandwidth and of processor-processor communication is a problem
- Example: a dual/quad core PC (ignoring cache)
(Diagram: four processors all connected to four shared memories)

2.18 Non-Uniform Memory Access (NUMA)
- The machine includes some hierarchy in its memory structure
- All memory is local to the programmer (a single address space), but some memory takes longer to access than other memory
- Cache introduces one level of NUMA
- Example: between sockets on the NCI Raijin system, or in a multisocket Opteron system
(Diagram: processors with caches attached to nearby memories, with slower paths to remote memories)

2.19 Shared Address Space Access
- Parallel Random Access Machine (PRAM): an idealized model of any shared memory machine
- What happens when multiple processors try to read/write the same memory location at the same time?
PRAM models:
- Exclusive-read, exclusive-write (EREW) PRAM
- Concurrent-read, exclusive-write (CREW) PRAM
- Exclusive-read, concurrent-write (ERCW) PRAM
- Concurrent-read, concurrent-write (CRCW) PRAM
Concurrent read is OK, but concurrent write requires arbitration:
- Common: allowed if all values being written are identical
- Arbitrary: an arbitrary processor is allowed to proceed, the rest fail
- Priority: processors are organized into a predefined prioritized list; the processor with the highest priority succeeds, the rest fail
- Sum: the sum of all written quantities is stored
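To make the message-passing model of 2.16 concrete, a minimal sketch using MPI (the slide names no API; MPI is simply the standard one). Two processes with private memories exchange a value only via explicit messages:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 0.0;  /* each rank has its own private copy of x */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 cannot see rank 0's memory; the message is the only link */
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g\n", x);
    }
    MPI_Finalize();
    return 0;
}
```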
2.20 Dynamic Connectivity: Crossbar
- A non-blocking network, in that connecting two processors does not block connections between other processors
- Complexity grows as O(p^2)
- May be used to connect processors each with their own local memory
(Diagram: crossbar switch between processors and memories)

2.21 Dynamic Connectivity: Multistage Networks
(Diagram: processors connected to memories through a switching network; the example shown is an omega network)
- An omega network consists of log2(p) stages, where p is the number of processors
- Let s and t be the binary representations of the message source and destination
- At stage 1: route straight through if the most significant bits of s and t are the same; crossover if they differ
- The process is repeated at each subsequent stage using the next most significant bit, etc
- (A routing sketch appears after section 2.23 below)

2.22 Dynamic Connectivity: Bus
- A processor gains exclusive access to the bus for some period
- The performance of the bus limits scalability
(Diagram: processors with caches and memories all attached to a single shared bus)
- Performance: Crossbar > Multistage > Bus
- Cost: Crossbar > Multistage > Bus

2.23 Static Connectivity: Complete, Mesh, Tree
- Completely connected (becomes very complex!)
- Linear array/ring, mesh/2D torus
- Tree (static if the internal nodes are processors, dynamic if they are switches)
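A sketch of the omega-network routing rule described in 2.21, comparing bits of the source and destination labels from the most significant bit downwards (the function and variable names are mine, not from the slides):

```c
#include <stdio.h>

/* At stage k, compare bit (d-1-k) of source s and destination t.
   Same bit: pass straight through the switch; different: cross over. */
void omega_route(unsigned s, unsigned t, int d /* d = log2 p */) {
    for (int k = 0; k < d; k++) {
        int bit = d - 1 - k;
        int sb = (s >> bit) & 1, tb = (t >> bit) & 1;
        printf("stage %d: %s\n", k + 1, sb == tb ? "through" : "crossover");
    }
}

int main(void) {
    omega_route(2 /* 010 */, 5 /* 101 */, 3);  /* 8-processor network */
    return 0;  /* all three bits differ, so every stage crosses over */
}
```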
2.24 Static Connectivity: Hypercube
- A multidimensional mesh with exactly two processors in each dimension
- p = 2^d, where d is the dimension of the hypercube
- Disadvantage: the number of connections per processor increases rapidly with d
- Examples: Intel iPSC Hypercube, nCUBE, SGI Origin

2.25 Static Connectivity: Hypercube Characteristics
- Two processors are connected directly ONLY IF their binary labels differ by one bit
- In a d-dimensional hypercube each processor connects directly to d others
- A d-dimensional hypercube can be partitioned into two (d-1)-dimensional subcubes, etc
- The number of links in the shortest path between two processors is the Hamming distance between their labels
- The Hamming distance between two processors labeled s and t is the number of bits that are set in the binary representation of s XOR t, where XOR is the bitwise exclusive or operation (e.g. the distance is 3 between labels 000 and 111)
- (A code sketch appears after section 2.27 below)

2.26 Evaluating Static Interconnection Networks #1
Diameter:
- The maximum distance between any two processors in the network
- Diameter directly determines communication time
Connectivity:
- The multiplicity of paths between any two processors
- High connectivity is desirable as it minimizes contention
- Arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  - 1 for linear arrays and binary trees
  - 2 for rings and 2-D meshes
  - 4 for a 2-D torus
  - d for d-dimensional hypercubes

2.27 Evaluating Static Interconnection Networks #2
Channel width:
- The number of bits that can be communicated simultaneously over a link connecting two processors
Bisection width and bisection bandwidth:
- Bisection width is the minimum number of communication links that must be removed to partition the network into two equal halves
- Bisection bandwidth is the minimum volume of communication allowed between any two halves of the network with equal numbers of processors
Cost:
- Many criteria can be used; we will use the number of communication links or wires required by the network
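The Hamming-distance rule from 2.25 as a short function (a sketch; the helper name is mine):

```c
#include <stdio.h>

/* Shortest-path length between hypercube nodes s and t:
   the number of 1 bits in s XOR t. */
int hamming(unsigned s, unsigned t) {
    unsigned x = s ^ t;
    int d = 0;
    while (x) { d += x & 1; x >>= 1; }
    return d;
}

int main(void) {
    /* Nodes 000 and 111 of a 3-cube are 3 hops apart. */
    printf("%d\n", hamming(0, 7));  /* 3 */
    return 0;
}
```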
2.28 Summary of Static Interconnection Characteristics

Network                Diameter             Bisection Width   Arc Connectivity   Cost (No of Links)
Completely connected   1                    p^2/4             p-1                p(p-1)/2
Binary tree            2*log2((p+1)/2)      1                 1                  p-1
Linear array           p-1                  1                 1                  p-1
Ring                   floor(p/2)           2                 2                  p
2-D mesh               2(sqrt(p)-1)         sqrt(p)           2                  2(p-sqrt(p))
2-D torus              2*floor(sqrt(p)/2)   2*sqrt(p)         4                  2p
Hypercube              log2(p)              p/2               log2(p)            (p*log2(p))/2
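As a cross-check of the hypercube row, a throwaway sketch (assuming p is a power of two; compile with -lm):

```c
#include <math.h>
#include <stdio.h>

/* Recompute the hypercube row of the summary table for p = 2^d. */
int main(void) {
    int p = 64;                                  /* 64 = 2^6 processors */
    int d = (int)round(log2(p));
    printf("diameter     = %d\n", d);            /* log2 p        = 6   */
    printf("bisection    = %d\n", p / 2);        /* p/2           = 32  */
    printf("connectivity = %d\n", d);            /* log2 p        = 6   */
    printf("cost (links) = %d\n", p * d / 2);    /* (p log2 p)/2  = 192 */
    return 0;
}
```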