COMP4300/8300 L2-3: Classical Parallel Hardware
Overview; Processor Performance; The Processor; Adding Numbers; Review of Single Processor Design
Overview: Classical Parallel Hardware

Review of Single Processor Design
- so we talk the same language
- many things happen in parallel even on a single processor
- identify potential issues for parallel hardware: why use 2 CPUs if you can double the speed of one?

Multiple Processor Design
- architectural classifications
- shared/distributed memory
- hierarchical/flat memory
- dynamic/static processor connectivity
- evaluating static networks
- case study: the NCI Raijin supercomputer

Processor Performance

FLOP/s   Prefix   Occurrence
10^3     kilo     very badly written code
10^6     mega     badly written code
10^9     giga     single core
10^12    tera     multiple chips (NCI)
10^15    peta     23 machines in the Top500 (Nov 2012, measured)
10^18    exa      around 2020!

Examples (peak rates):
- PC, 2.5 GHz Core2 Quad: 4 (cores) x 4 (ops) x 2.5 GHz = 40 GFLOP/s
- Bunyip, Pentium III cluster: 96 (nodes) x 2 (sockets) x 1 (op) x 550 MHz = 105 GFLOP/s
- NCI Raijin: 3592 (nodes) x 2 (sockets) x 8 (cores) x 8 (ops) x 2.6 GHz = 1.19 PFLOP/s

The Processor

Performs:
- floating point operations (FLOPS): add, multiply, division (sqrt maybe!)
- integer operations (MIPS): adds etc., also logical ops and instruction processing
- MIPS = Machine Instructions Per Second (originally relative to a very old VAX 11/780)
- but what counts as a machine instruction differs between CPUs!
- our primary focus will be on floating point operations

The clock:
- all ops take a fixed number of clock ticks to complete
- clock speed is measured in GHz (10^9 cycles/second), i.e. cycle times in nsec (10^-9 seconds)
- Apple iPhone 6 ARM A8: 1.4 GHz (0.71 ns); NCI Raijin Intel Xeon Sandy Bridge: 2.6 GHz (0.38 ns); IBM zEC12: 5.5 GHz (0.18 ns)
- clock speed is limited by etching feature size and the speed of light, which motivates parallelism
- (to my knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
- light travels about 1 m in 3.2 ns, i.e. only a few cm per clock tick, and a chip is a few cm across!

COMP4300/8300 L2-3: Classical Parallel Hardware
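The peak-rate examples above (PC, Bunyip, Raijin) are just products of the hardware parameters. A minimal sketch (my own illustration; the function name and units are not from the slides):

```python
def peak_flops(nodes, sockets, cores, ops_per_cycle, ghz):
    """Theoretical peak FLOP/s: every unit issues its maximum ops every cycle."""
    return nodes * sockets * cores * ops_per_cycle * ghz * 1e9

# The three systems from the slide.
pc     = peak_flops(1,    1, 4, 4, 2.5)    # Core2 Quad -> 40 GFLOP/s
bunyip = peak_flops(96,   2, 1, 1, 0.55)   # Pentium III cluster -> ~105 GFLOP/s
raijin = peak_flops(3592, 2, 8, 8, 2.6)    # NCI Raijin -> ~1.19 PFLOP/s
```

Remember these are peak figures; no real code sustains them.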
Adding Numbers

Consider adding two double precision (8 byte) numbers, stored as [sign | exponent | significand].

Possible steps:
1. determine the largest exponent
2. normalize the significand of the number with the smaller exponent to the larger
3. add the significands
4. renormalize the significand and exponent of the result

Multiple steps, each taking 1 tick, implies 4 ticks per addition (FLOP).
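The four steps can be mimicked in software. A sketch using Python's `math.frexp`/`math.ldexp` to expose each number as significand x 2^exponent (an illustration only; real hardware works on raw bit fields):

```python
import math

def fp_add(a, b):
    """Add two floats following the slide's four steps."""
    sa, ea = math.frexp(a)          # a = sa * 2**ea, with 0.5 <= |sa| < 1
    sb, eb = math.frexp(b)
    e = max(ea, eb)                 # 1. determine the largest exponent
    sa = math.ldexp(sa, ea - e)     # 2. normalize the significand of the
    sb = math.ldexp(sb, eb - e)     #    smaller exponent to the larger
    s = sa + sb                     # 3. add the significands
    s, de = math.frexp(s)           # 4. renormalize significand and exponent
    return math.ldexp(s, e + de)
```

Each of the four steps is a candidate pipeline stage, which is what the next slide exploits.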
Operation Pipelining

[diagram: a 4-step pipeline with items X(6)..X(1) moving from Waiting through the pipeline steps to Done]

- X(1) takes 4 ticks to appear (startup latency); X(2) appears 1 tick after X(1)
- asymptotically achieves 1 result per clock tick
- the operation (X) is said to be pipelined: the steps in the pipeline run in parallel
- requires the same op applied consecutively to different (independent) data items
- good for vector operations (note the limitations on chaining output data to input)
- not all operations are pipelined, e.g. integer multiply, division, sqrt
- e.g. Alpha EV6: +, -, * have latency 4 and repeat rate 1; / and sqrt are not pipelined

Pipelining: Dependent Instructions

- principle: the CPU must ensure the result is the same as if there were no pipelining (or parallelism)
- instructions requiring only 1 cycle in the EX stage (e.g. simple register-to-register ops) can be handled by pipeline feedback from the EX stage to the next cycle
- (important) instructions requiring c cycles for the EX stage (i.e. f.p. *, +, load, store) are normally implemented by having c EX stages; this requires any dependent instruction to be delayed by c cycles
- e.g. c = 3: [diagram: instructions I0-I3 each passing through FI DI FO EX1 EX2 EX3 WB, staggered]

Instruction Pipelining

- break instruction execution into k stages; can get k-way parallelism (generally, the circuitry for each stage is independent)
- e.g. (k = 5) stages: FI = Fetch Instruction, DI = Decode Instruction, FO = Fetch Operand, EX = Execute Instruction, WB = Write Back

  (branch): FI DI FO EX WB
  (guess):     FI DI FO EX WB
  (guess):        FI DI FO EX WB
  (guess):           FI DI FO EX WB
  (sure):               FI DI FO EX WB

- note: the FO & WB stages may involve memory accesses (and may stall the pipeline)
- conditional branch instructions are problematic: a wrong guess may require flushing the succeeding instructions from the pipeline
- tendency is to increase the number of stages, but this increases the startup latency, e.g.
number of stages: UltraSPARC II (9), UltraSPARC III (14), Intel Prescott (31)

Superscalar (Multiple Instruction Issue)

- a small number (w) of instructions are scheduled by the hardware to execute together
- groups must have an appropriate instruction mix, e.g. UltraSPARC (w = 4): up to 2 different floating point, 1 load/store, 1 branch, 2 integer/logical instructions per group
- gives w-way parallelism over different types of instructions
- generally requires: multiple (w) instruction fetches, and an extra grouping (G) stage in the pipeline
- problem: amplifies dependencies and other problems of pipelining by a factor of w!
- issue: the instruction mix must be balanced for maximum performance, i.e. floating point * and + must be balanced
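The startup-latency behaviour of a pipeline, and the effect of issuing w instructions per tick, can be captured in a toy timing model. `stages=4` matches the 4-tick addition example; everything else is my own illustration:

```python
def pipeline_ticks(n, stages=4, width=1):
    """Ticks to produce n results on a pipelined unit: the first result
    takes `stages` ticks (startup latency), after which up to `width`
    results complete per tick."""
    if n == 0:
        return 0
    return stages + (n - 1 + width - 1) // width   # ceiling division

# X(1) appears at tick 4, X(6) at tick 9 -- as in the slide's diagram.
# For large n the rate approaches width results per tick.
```

The model also shows why deep pipelines (e.g. Prescott's 31 stages) hurt short runs: the startup term dominates until n is large.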
Review: Instruction Level Parallelism

- instruction level parallelism (ILP): pipelining and superscalar together offer kw-way parallelism
- branch prediction alleviates the issue of conditional branches: uses a table recording the results of recently-taken branches
- out-of-order execution alleviates the issue of dependencies: pulls fetched instructions into a buffer of size W (W >= w) and executes them in any order provided dependencies are not violated; must keep track of all in-flight instructions and associated registers (O(W^2) area and power!)
- in most situations, the compiler can do as good a job as a human at exposing this parallelism (ILP was part of the "Free Lunch")
- an exception: how to get an UltraSPARC III to execute 1 f.p. multiply (and add!) per cycle

Cache Memory

- memory hierarchy: Main Memory (large, cheap, large latency / small bandwidth) -> Cache (small, fast, expensive, lower latency / higher bandwidth) -> CPU Registers
- cache hit: data is in cache and received in a few cycles
- cache miss: data must be fetched from main memory (or a higher-level cache)
- try to ensure data is in cache (or as close to the CPU as possible): can we block the algorithm to minimize memory traffic?
- cache memory is effective because algorithms often use data that was recently accessed from memory (temporal locality) or was close to other recently accessed data (spatial locality)
- Note: duplication of data in the cache will have implications for parallel systems!
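The blocking question above can be explored with a toy cache model. The line size, capacity, and access patterns below are assumptions chosen to make the effect visible, not measurements of any real machine:

```python
from collections import OrderedDict

LINE = 8          # doubles per cache line (assumed 64-byte lines)
CAPACITY = 64     # lines the toy cache can hold

def misses(addresses):
    """Count misses in a fully associative LRU cache of CAPACITY lines."""
    cache = OrderedDict()
    n = 0
    for a in addresses:
        line = a // LINE
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position
        else:
            n += 1
            cache[line] = True
            if len(cache) > CAPACITY:
                cache.popitem(last=False)  # evict least-recently-used line
    return n

N = 128
# Column-major walk of a row-major N x N array: stride N, poor spatial locality.
naive = misses(i * N + j for j in range(N) for i in range(N))
# Same accesses visited in B x B blocks: each fetched line is fully reused.
B = 16
blocked = misses(i * N + j
                 for jb in range(0, N, B) for ib in range(0, N, B)
                 for j in range(jb, jb + B) for i in range(ib, ib + B))
```

With these parameters the naive walk misses on every access, while blocking cuts traffic by the line-length factor of 8.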
Memory Structure

Consider the DAXPY computation: y(i) = a*x(i) + y(i)
- if the CPU can theoretically perform the 2 FLOPs in 1 cycle, the memory system must load two doubles (x(i) and y(i), 16 bytes) and store one (y(i), 8 bytes) each clock cycle
- on a 1 GHz system this implies 16 GB/s load traffic and 8 GB/s store traffic
- typically, a processor core can only issue one load OR store instruction in a clock cycle
- DDR3 SDRAM is available clocked at 1066 MHz, with access times to match
- memory latency and bandwidth are critical performance issues
- caches: reduce latency and provide improved cache-to-CPU bandwidth
- multiple memory banks: improve bandwidth (by parallel access)

Cache Mapping

- blocks of main memory are mapped to a cache line; cache lines are typically 32-128 bytes wide
- the mapping may be direct, or n-way associative
- [diagram: 2-way associative mapping, main memory blocks mapping to cache lines 1-4, each line able to hold either of two alternative blocks]
- the entire cache line is fetched from memory, not just one element (why?)
- try to structure code to use an entire cache line of data; best to have unit stride
- pointer chasing is very bad! (large address strides guaranteed!)
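The 16 GB/s and 8 GB/s figures follow directly from the operand sizes; a check of the arithmetic:

```python
BYTES_PER_DOUBLE = 8
CLOCK_HZ = 1e9   # the 1 GHz system from the slide

# Per DAXPY iteration y(i) = a*x(i) + y(i): load x(i) and y(i), store y(i).
load_bytes_per_cycle  = 2 * BYTES_PER_DOUBLE   # 16 bytes
store_bytes_per_cycle = 1 * BYTES_PER_DOUBLE   #  8 bytes

load_traffic  = load_bytes_per_cycle  * CLOCK_HZ   # -> 16 GB/s
store_traffic = store_bytes_per_cycle * CLOCK_HZ   # ->  8 GB/s
```

That is 24 GB/s total for a core that can issue at most one memory operation per cycle, which is why DAXPY is memory-bound.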
Memory Banks

- memory bandwidth can be improved by having multiple parallel paths to/from memory, organized in b banks
- [diagram: CPU connected in parallel to Banks 1-4]
- traditional solution used by vector processors
- good performance for unit stride; very bad performance if bank conflicts occur
- worst case: power-of-2 strides, where b is a power of 2

Going Parallel

- inevitably the performance of a single processor is limited by the clock speed
- improved manufacturing increases the clock, but it is ultimately limited by the speed of light
- ILP allows multiple ops at once, but is limited
- it's time to go parallel!

Hardware issues:
- Flynn's taxonomy of parallel processors: SIMD/MIMD
- shared/distributed memory
- hierarchical/flat memory
- dynamic/static processor connectivity
- characteristics of static networks

Parallelism in Memory Access

- Dual Inline Memory Module (DIMM)
- a processor may have 2 or 4 DIMMs (i.e. b = 16, 32), depending on its lowest-level cache line size
- each DRAM chip access is pipelined (select row address, select column address, access byte)

Architecture Classification: Flynn's Taxonomy

Why classify?
- what kind of parallelism is employed? which architecture has the best prospects for the future?
- what has already been achieved by current architecture types?
- reveal configurations that have not yet been considered by system architects
- enable building of performance models

Flynn's taxonomy is based on the degree of parallelism, with 4 categories determined by the number of instruction and data streams:

                            Single Data Stream    Multiple Data Streams
  Single Instruction Stream   SISD (1 CPU)          SIMD (array/vector processor)
  Multiple Instruction Stream MISD (pipelined?)     MIMD (multiple processor)
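The bank-conflict worst case described above can be illustrated by counting which banks a strided access pattern touches. The bank-assignment rule below (consecutive elements interleaved across banks) is the usual one, but it is an assumption here:

```python
def banks_hit(b, stride, n=1024):
    """Distinct banks touched by n accesses with the given stride,
    assuming access i goes to bank (i * stride) % b."""
    return len({(i * stride) % b for i in range(n)})

unit     = banks_hit(8, 1)   # unit stride: all 8 banks -> full bandwidth
conflict = banks_hit(8, 8)   # power-of-2 stride equal to b: 1 bank only
```

With 8 banks, a stride of 8 serializes every access onto a single bank, giving 1/8 of the available bandwidth.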
SIMD and MIMD

SIMD: Single Instruction Multiple Data
- also known as data parallel processors or array processors; vector processors (to some extent)
- current examples include SSE instructions, SPEs on the CellBE, GPUs
- NVIDIA's SIMT (T = Threads) is a slight variation

MIMD: Multiple Instruction Multiple Data
- examples include quad-core PCs, octa-core Xeons on Raijin
- most successful parallel model; more general purpose than SIMD (e.g. the CM5 could emulate the CM2)
- harder to program, as processors are not synchronized at the instruction level

Address Space Organization: Message Passing

- each processor has local or private memory; processors interact solely by message passing
- commonly known as distributed memory parallel computers
- memory bandwidth scales with the number of processors
- example: between nodes on the NCI Raijin system
- [diagram: processors P, each with private memory, connected by a network]

Design issues for MIMD machines:
- scheduling: efficient allocation of processors to tasks in a dynamic fashion
- synchronization: prevent processors accessing the same data simultaneously
- interconnect design: processor-to-memory and processor-to-processor interconnects; also the I/O network (often processors are dedicated to I/O devices)
- overhead: inevitably there is some overhead associated with coordinating activities between processors, e.g.
resolving contention for resources
- partitioning: identifying parallelism in algorithms that is capable of exploiting concurrent processing streams is non-trivial

MIMD Address Space Organization: Shared Address Space

- processors interact by modifying data objects stored in a shared address space
- the simplest solution is a flat or uniform memory access (UMA) memory
- scalability of memory bandwidth and processor-processor communications (arising from cache line transfers) are problems; so is synchronizing access to shared data objects
- example: dual/quad core PC (ignoring cache)
- (Aside: SPMD, Single Program Multiple Data: more restrictive than MIMD, implying that all processors run the same executable. Simplifies use of a shared address space.)
- [diagram: processors P all connected to a single shared memory]
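A small sketch of the synchronization issue in a shared address space: several threads updating one shared counter, serialized with a lock. Python threads stand in for processors here; this is an illustration, not how a real machine arbitrates memory:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    """Increment the shared counter n times; the lock serializes the
    read-modify-write sequence so no updates are lost."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is now exactly 40_000; without the lock, concurrent
# read-modify-write sequences could interleave and lose updates.
```

The lock is exactly the "synchronizing access to shared data objects" cost the slide warns about: correctness is preserved, but the processors serialize.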
Non-Uniform Memory Access (NUMA)

- the machine includes some hierarchy in its memory structure
- all memory is visible to the programmer (single address space), but some memory takes longer to access than other memory
- in this sense, cache introduces one level of NUMA
- between sockets on the NCI Raijin system, or in a multi-socket Opteron system (some ANU research on an Opteron (2011))

Dynamic Processor Connectivity: Crossbar

- a non-blocking network: the connection of two processors does not block connections between other processors
- complexity grows as O(p^2)
- may be used to connect processors each with their own local memory
- may also be used to connect processors with various memory banks, e.g. the 8 cores on an UltraSPARC T2 access the 8 banks of its level 2 cache
- [diagram: crossbar switch connecting processors P to memory banks]

Shared Address Space Access

- Parallel Random Access Machine (PRAM): an idealized model of any shared memory machine
- what happens when multiple processors try to read/write the same memory location at the same time?
PRAM models:
- Exclusive-read, exclusive-write (EREW PRAM)
- Concurrent-read, exclusive-write (CREW PRAM)
- Exclusive-read, concurrent-write (ERCW PRAM)
- Concurrent-read, concurrent-write (CRCW PRAM)

Concurrent read is OK, but concurrent write requires arbitration:
- Common: allowed if all values being written are identical
- Arbitrary: an arbitrary processor is allowed to proceed and the rest fail
- Priority: processors are organized into a predefined prioritized list; the processor with the highest priority succeeds and the rest fail
- Sum: the sum of all the quantities is written

Dynamic Processor Connectivity: Multi-staged Networks

- consist of lg p stages, where p is the number of processors (lg p = log2(p))
- s and t are the binary representations of the message source and destination
- at stage 1: route straight through if the most significant bits of s and t are the same; otherwise, crossover
- the process repeats at each subsequent stage using the next most significant bit, etc.
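The bit-by-bit routing rule above can be sketched as follows. This is a simplified model that records only each stage's switch setting, not the full network wiring:

```python
def route(s, t, p):
    """Switch settings, stage by stage, for a message from source s to
    destination t in a p-processor multi-staged network (p a power of 2):
    at stage k, pass straight through if bit k (counting from the most
    significant) of s and t agree, otherwise cross over."""
    stages = p.bit_length() - 1            # lg p stages
    settings = []
    for k in range(stages):
        bit = stages - 1 - k               # most significant bit first
        same = (s >> bit) & 1 == (t >> bit) & 1
        settings.append("through" if same else "cross")
    return settings

# p = 8: routing from processor 0b010 to 0b110 crosses at the first
# stage (top bits differ) and passes through the remaining two.
```

Note that, unlike a crossbar, such a network is blocking: two simultaneous messages can demand conflicting settings of the same switch.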
Dynamic Processor Connectivity: Bus

- a processor gains exclusive access to the bus for some period
- the performance of the bus limits scalability
- Performance: crossbar > multistage > bus; Cost: crossbar > multistage > bus
- Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?

Static Processor Connectivity: Hypercube

- a multidimensional mesh with exactly two processors in each dimension
- p = 2^d, where d is the dimension of the hypercube
- advantage: at most d hops between any two processors
- disadvantage: the number of connections per processor is d
- can be generalized to k-ary d-cubes, e.g. a 4x4 torus is a 4-ary 2-cube
- examples: Intel iPSC Hypercube, NCube, SGI Origin, Cray T3D, Tofu

Static Processor Connectivity: Complete, Mesh, Tree

- completely connected (becomes very complex!)
- linear array/ring, mesh/2D torus
- tree (static if the nodes are processors)

Static Processor Connectivity: Hypercube Characteristics

- two processors are connected directly ONLY IF their binary labels differ by one bit
- in a d-dimensional hypercube, each processor directly connects to d others
- a d-dimensional hypercube can be partitioned into two (d-1)-dimensional sub-cubes, etc.
- the number of links in the shortest path between two processors is the Hamming distance between their labels
- the Hamming distance between two processors labeled s and t is the number of bits that are on in the binary representation of s XOR t, where XOR is the bitwise exclusive or operation
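The Hamming-distance rule is one line of code:

```python
def hamming(s, t):
    """Hops between hypercube nodes s and t: count the 1 bits in s XOR t."""
    return bin(s ^ t).count("1")

# In a 3-cube (d = 3), node 0b000 to 0b111 is 3 hops (the diameter, d),
# while 0b101 and 0b100 are direct neighbours: labels differ in one bit.
```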
Evaluating Static Interconnection Networks #1

Diameter:
- the maximum distance between any two processors in the network
- directly determines communication time (latency)

Connectivity:
- the multiplicity of paths between any two processors
- high connectivity is desirable as it minimizes contention (and also enhances fault-tolerance)
- arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  - 1 for linear arrays and binary trees
  - 2 for rings and 2D meshes
  - 4 for a 2D torus
  - d for d-dimensional hypercubes

Summary: Static Interconnection Characteristics

  Network               Diameter         Bisection width  Arc connectivity  Cost (no. of links)
  Completely-connected  1                p^2/4            p-1               p(p-1)/2
  Binary tree           2 lg((p+1)/2)    1                1                 p-1
  Linear array          p-1              1                1                 p-1
  Ring                  floor(p/2)       2                2                 p
  2D mesh               2(sqrt(p)-1)     sqrt(p)          2                 2(p-sqrt(p))
  2D torus              2 floor(sqrt(p)/2)  2 sqrt(p)     4                 2p
  Hypercube             lg p             p/2              lg p              (p lg p)/2

Note: the binary tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this:
- usually, only the leaf nodes contain a processor
- it can easily be partitioned into sub-networks without degraded performance (useful for supercomputers)

Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?
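The table's formulas can be checked mechanically. A sketch (it assumes p fits the topology, e.g. a power of two for the hypercube and a perfect square for mesh/torus):

```python
import math

def metrics(network, p):
    """(diameter, bisection width, arc connectivity, cost in links) for the
    topologies in the summary table."""
    lg = int(math.log2(p))
    r = math.isqrt(p)                   # sqrt(p) for mesh/torus
    return {
        "complete":     (1, p * p // 4, p - 1, p * (p - 1) // 2),
        "linear array": (p - 1, 1, 1, p - 1),
        "ring":         (p // 2, 2, 2, p),
        "2d mesh":      (2 * (r - 1), r, 2, 2 * (p - r)),
        "2d torus":     (2 * (r // 2), 2 * r, 4, 2 * p),
        "hypercube":    (lg, p // 2, lg, p * lg // 2),
    }[network]
```

For example, a 64-processor hypercube has diameter 6 and 192 links, while a 64-processor ring needs only 64 links but has diameter 32: exactly the cost/latency trade-off the discussion point asks about.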
Evaluating Static Interconnection Networks #2

Channel width:
- the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth:
- bisection width: the minimum number of communication links that have to be removed to partition the network into two equal halves
- bisection bandwidth: the minimum volume of communication allowed between any two halves of the network with equal numbers of processors

Cost:
- many criteria can be used; we will use the number of communication links or wires required by the network

NCI's Raijin: A Petascale Supercomputer

- 57,472 cores (dual socket, 8-core Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
- approx. 160 TBytes of main memory
- Mellanox Infiniband FDR interconnect (52 km of cable); interconnects: ring (cores), full (sockets), fat tree (nodes)
- approx. 10 PBytes of usable fast filesystem (for short-term scratch space, apps, home directories)
- power: 1.5 MW max. load; cooling systems: 100 tonnes of water
- 24th fastest in the world at debut (November 2012); first petaflop system in Australia
- fastest (parallel) filesystem in the southern hemisphere
- custom monitoring and deployment; custom Linux kernel; highly customised PBS Pro scheduler
NCI's Integrated High Performance Environment

[diagram: NCI's integrated high performance environment]

Further Reading: Parallel Hardware

- "The Free Lunch Is Over!"
- Ch 1 of Introduction to Parallel Computing
- Ch 1, 2 of Principles of Parallel Programming
- lecture notes from Calvin Lin: "A Success Story: ISA", "Parallel Architectures"
COMP4300/8300: Overview of Parallel Hardware. Alistair Rendell. COMP4300/8300 Lecture 2-1. Copyright (c) 2015 The Australian National University.
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationParallel Processing & Multicore computers
Lecture 11 Parallel Processing & Multicore computers 8th edition: Ch 17 & 18 Earlier editions contain only Parallel Processing Parallel Processor Architectures Flynn s taxonomy from 1972 (Sta09 Fig 17.1)
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationParallel Numerics, WT 2013/ Introduction
Parallel Numerics, WT 2013/2014 1 Introduction page 1 of 122 Scope Revise standard numerical methods considering parallel computations! Required knowledge Numerics Parallel Programming Graphs Literature
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationComputer Organization. Chapter 16
William Stallings Computer Organization and Architecture t Chapter 16 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data
More informationParallel Architectures
Parallel Architectures Instructor: Tsung-Che Chiang tcchiang@ieee.org Department of Science and Information Engineering National Taiwan Normal University Introduction In the roughly three decades between
More informationPipeline and Vector Processing 1. Parallel Processing SISD SIMD MISD & MIMD
Pipeline and Vector Processing 1. Parallel Processing Parallel processing is a term used to denote a large class of techniques that are used to provide simultaneous data-processing tasks for the purpose
More informationCOMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory
COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy
More informationCSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing
Dr Izadi CSE-4533 Introduction to Parallel Processing Chapter 4 Models of Parallel Processing Elaborate on the taxonomy of parallel processing from chapter Introduce abstract models of shared and distributed
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD
More informationMultiprocessors & Thread Level Parallelism
Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationCS Parallel Algorithms in Scientific Computing
CS 775 - arallel Algorithms in Scientific Computing arallel Architectures January 2, 2004 Lecture 2 References arallel Computer Architecture: A Hardware / Software Approach Culler, Singh, Gupta, Morgan
More informationTypes of Parallel Computers
slides1-22 Two principal types: Types of Parallel Computers Shared memory multiprocessor Distributed memory multicomputer slides1-23 Shared Memory Multiprocessor Conventional Computer slides1-24 Consists
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 14 Parallel Processing Pendidikan Teknik Elektronika FT UNY Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:
More informationProcessor Performance and Parallelism Y. K. Malaiya
Processor Performance and Parallelism Y. K. Malaiya Processor Execution time The time taken by a program to execute is the product of n Number of machine instructions executed n Number of clock cycles
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationTDT 4260 lecture 3 spring semester 2015
1 TDT 4260 lecture 3 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU http://research.idi.ntnu.no/multicore 2 Lecture overview Repetition Chap.1: Performance,
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer
More information3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:
BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationLecture 9: MIMD Architecture
Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is
More informationComputer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1
Pipelining and Vector Processing Parallel Processing: The term parallel processing indicates that the system is able to perform several operations in a single time. Now we will elaborate the scenario,
More informationParallel Programming Platforms
arallel rogramming latforms Ananth Grama Computing Research Institute and Department of Computer Sciences, urdue University ayg@cspurdueedu http://wwwcspurdueedu/people/ayg Reference: Introduction to arallel
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationAdvanced Parallel Architecture. Annalisa Massini /2017
Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.
More informationComplexity and Advanced Algorithms. Introduction to Parallel Algorithms
Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical
More informationTools and techniques for optimization and debugging. Fabio Affinito October 2015
Tools and techniques for optimization and debugging Fabio Affinito October 2015 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object, made of different
More informationCS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011
CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More information