Shared-Memory Hardware


1 Shared-Memory Hardware Parallel Programming Concepts Winter Term 2013 / 2014 Dr. Peter Tröger, M.Sc. Frank Feinbube

2 Shared-Memory Hardware
Hardware architecture: processor(s), memory system(s), data path(s)
- Each component may become the performance bottleneck
- Each component can be replicated
- Each parallelization target must be handled separately
Modern processors support:
- Multiple instructions in the same cycle
- Multiple concurrent instruction streams per functional unit
- Multiple functional units (cores)
- The combination of multiple processors with one shared memory
The logical hardware setup as seen by the software typically differs from the physical hardware organization.

3 Shared-Memory Hardware
The basic instruction cycle:
1. Fetch instruction, update program counter (PC)
2. Decode instruction
3. Execute instruction
4. Write back result

4 Shared-Memory Hardware [Stallings]
- Central processing unit (CPU) + volatile memory + I/O devices
- Fetch an instruction and execute it - typically memory access, computation, and / or I/O
- I/O devices and the memory controller may interrupt the instruction processing
- Processor utilization is improved by asynchronous operations

5 RISC vs. CISC
RISC - Reduced Instruction Set Computer (MIPS, ARM, DEC Alpha, Sparc, IBM 801, Power, etc.)
- Small number of instructions
- Few data types in hardware
- Constant instruction size, few addressing modes
- Relies on optimization in software
CISC - Complex Instruction Set Computer (VAX, Intel x86, IBM 360/370, etc.)
- Large number of complex instructions, which may take multiple cycles
- Variable-length instructions
- Smaller code size
- Focus on optimization in hardware
RISC designs lend themselves to the exploitation of instruction-level parallelism.

6 Shared-Memory Hardware
The major constraints of memory are capacity, speed, and cost:
- Faster access time results in greater cost per bit
- Greater capacity results in smaller cost per bit
- Greater capacity results in slower access
Going down the memory hierarchy:
- Decreasing cost per bit
- Increasing capacity for fixed cost
- Increasing access time
I/O devices provide non-volatile memory on the lower levels, which is an additional advantage.

7 Shared-Memory Hardware [Stallings]
Principle of locality:
- Memory referenced by a processor (program and data) tends to cluster (e.g. loops, subroutines); operations on tables and arrays involve access to clustered data sets
- Temporal locality: if a memory location is referenced, it will tend to be referenced again soon
- Spatial locality: if a memory location is referenced, locations whose addresses are close by will tend to be referenced soon
Data should be organized so that the percentage of accesses to the lower levels is substantially less than to the level above. Typically implemented by caching.
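
The effect of locality is easy to demonstrate. The following minimal C sketch (array size and names are our illustrative choices, not from the slides) sums the same matrix twice: the row-major traversal walks memory sequentially and exploits cache lines, while the column-major traversal strides across memory and defeats them.

    #include <stdio.h>

    #define N 2048
    static double a[N][N];

    int main(void) {
        double sum = 0.0;

        /* Row-major traversal: consecutive accesses fall into the
           same cache line (spatial locality). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-major traversal: each access jumps N * 8 bytes ahead,
           so almost every access misses the cache. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

On typical hardware the second loop nest runs several times slower, although both perform exactly the same 2 * N * N additions.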

8 Shared-Memory Hardware [Stallings]
Caching:
- Offers a portion of the lower-level memory as a copy in the faster, smaller memory
- Leverages the principle of locality
- Processor caches work in hardware, but must be considered by the operating system

9 Shared-Memory Hardware [Stallings]
Conflicting caching design goals:
- Cache size per level
- Number of cache levels
- Block size exchanged with the lower-level memory
- Replacement algorithm
- Mapping function
- Write policy for modified cache lines
All decisions are made by the hardware vendor, but can be taken into account by software. Cache-optimized software is needed when parallelization improvements start to depend on memory bottlenecks (see the blocking sketch below).
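
As an illustration of such cache-conscious code, here is a minimal loop-blocking (tiling) sketch in C; the tile size B is a tunable assumption of ours, not a value from the lecture. Transposing a large matrix tile by tile keeps the working set of each step inside the cache:

    #define N 2048
    #define B 64   /* tile edge, chosen so that a pair of tiles fits into the cache */

    /* Cache-friendly transpose: process B x B tiles so that both the
       reads from 'in' and the writes to 'out' stay within cached lines. */
    void transpose_blocked(const double in[N][N], double out[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        out[j][i] = in[i][j];
    }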

10 Parallel Processing
- Inside the processor: instruction-level parallelism (ILP) and multicore - shared memory
- With multiple processing elements in one machine: multiprocessing - shared memory
- With multiple processing elements in many machines: multicomputer - shared nothing (in terms of a globally accessible memory)

11 Instruction-Level Parallelism
Hardware optimizes the execution of a sequential instruction stream:
- Pipelining: sub-steps of sequential instructions are overlapped in their execution to increase throughput; a traditional concept in processor hardware design; relies on mechanisms such as branch prediction or the out-of-order execution of instructions
- Superscalar architecture: execution of multiple instructions in parallel, based on redundant functional units of the processor
- Very Long Instruction Word (VLIW)
- Explicitly Parallel Instruction Computing (EPIC)
- SIMD vectorization support with special instructions

12 Pipelining

13 Pipelining
Pipelining overlaps the stages of instruction execution:
- Fetch, decode, and execute happen in parallel for different instructions
- Increases instruction throughput at the same clock speed
- Analogous to the assembly-line concept
Pipelining hazards are temporal dependencies between sub-steps that limit the achievable speedup:
- Structural hazard: multiple instructions access the same resource
- Data hazard: an instruction needs the result of a previous instruction
- Control hazard: an instruction result changes the control flow (interrupt, branch)

14 Multi-Cycle Pipelining

15 Pipelining Conflicts
Data hazards come in three forms (illustrated in the sketch below):
- Read-after-write: an instruction relies on the result of a previous instruction
- Write-after-read: an instruction uses a register value that is overwritten by a subsequent instruction
- Write-after-write: two subsequent instructions write into the same register
Conflict resolution strategies:
- Replicate commonly needed hardware units
- Insert NOPs to change the timing
- Reorder the instructions to change the timing
- Stall the pipeline until the conflict is resolved
Control conflicts are targeted by branch prediction: static branch prediction (forward never, backward ever) vs. dynamic branch prediction (based on previous jumps). These issues are typically handled by the compiler and the processor hardware.
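
A hedged illustration in C (variable names are ours): assuming each statement compiles to one register instruction, the comments mark the dependency the pipeline has to respect.

    /* Toy sequence illustrating the three data-hazard types. */
    int hazards(int x, int y, int z) {
        int a = x * y;   /* I1: writes a                                   */
        int b = a + z;   /* I2: read-after-write - needs the result of I1  */
        int c = x + 1;   /* I3: reads x                                    */
        x = z * 2;       /* I4: write-after-read - must not overwrite x
                            before I3 has read the old value               */
        int d = y + z;   /* I5: writes d                                   */
        d = b + c;       /* I6: write-after-write - if I6 completed before
                            I5, d would end up with the wrong final value  */
        return d;
    }

Register renaming in hardware removes the write-after-read and write-after-write cases; read-after-write is a true dependency that can only be hidden by forwarding, reordering of independent instructions, or stalls.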

16 Superscalar Architectures

17 SIMD Architectures
- Good for problems with a high degree of regularity, such as graphics / image processing
- Typically exploit data parallelism
- Historic examples: ILLIAC IV (1974), Cray Y-MP, Thinking Machines CM-2 (1985)
- Today: GPGPU computing (e.g. the Fermi GPU), the Cell processor, SSE, AltiVec

18 SIMD Instructions
- Vector instructions for high-level operations on data sets
- Became famous with the Cray architecture in the 70s
- Today, vector instructions are part of standard instruction sets, e.g. AltiVec and the Streaming SIMD Extensions (SSE)
Example: vector addition, first in scalar C, then as three SSE instructions:

    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

    movaps xmm0, address-of-v1      ; xmm0 = (v1.w, v1.z, v1.y, v1.x)
    addps  xmm0, address-of-v2      ; xmm0 = (v1.w+v2.w, v1.z+v2.z, v1.y+v2.y, v1.x+v2.x)
    movaps address-of-vec_res, xmm0

19 Streaming SIMD Extensions (SSE)
- Introduced by Intel with the Pentium III (1999)
- Specifically designed for floating-point and vector operations
- New 128-bit registers can be packed with four 32-bit scalars; an operation is performed simultaneously on all of them
- Typical operations: move data between SSE registers and 32-bit registers / memory; add, subtract, multiply, divide, square root, maximum, minimum, reciprocal, compare, bitwise AND / OR / XOR
- Available as compiler intrinsics - functions known by the compiler that map directly to assembler, with better performance than a linked library
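
The slide's vector addition looks as follows with SSE compiler intrinsics; this is our own minimal sketch (unaligned loads are used so that it works for any array alignment):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add two packed 4-float vectors: one SSE instruction each for
       load, add, and store. */
    void vec_add(const float v1[4], const float v2[4], float vec_res[4]) {
        __m128 a = _mm_loadu_ps(v1);               /* load v1.x .. v1.w */
        __m128 b = _mm_loadu_ps(v2);               /* load v2.x .. v2.w */
        _mm_storeu_ps(vec_res, _mm_add_ps(a, b));  /* addps, then store */
    }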

20 Other Instruction Set Extensions
Fused multiply-add (FMA) instructions:
- Supported in different variations by all processors
- A floating-point multiply-add operation performed in one step
- Improves speed and accuracy of product accumulation: scalar products, matrix multiplication, efficient software implementations of square root and division
Intel Advanced Vector Extensions (AVX):
- Extension of the SSE instruction set
- Introduced with the Sandy Bridge architecture (2011)
- Registers are now 256 bits wide
- 512-bit support announced for the 2015 version of the Xeon Phi
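
The scalar product is the canonical FMA use case. In this small sketch of ours, the C99 fma() function from <math.h> performs the multiply-add with a single rounding; compilers map it to an FMA instruction where the hardware provides one:

    #include <math.h>     /* fma(): fused multiply-add, rounded once */
    #include <stddef.h>

    /* Accumulate a dot product with one fused multiply-add per element. */
    double dot(const double *x, const double *y, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc = fma(x[i], y[i], acc);   /* acc = x[i] * y[i] + acc */
        return acc;
    }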

21 Very Long Instruction Word (VLIW)
Very Long Instruction Word (VLIW), Fisher et al., 1980s:
- The compiler identifies instructions to be executed in parallel
- One VLIW instruction encodes several operations (at least one for each redundant execution unit)
- Less hardware complexity, higher compiler complexity
- VLIW processors are typically designed with multiple RISC execution units
- Very popular in the embedded market and in GPU hardware
Explicitly Parallel Instruction Computing (EPIC):
- Term coined by the HP-Intel alliance, since 1997
- Foundational concept of the Intel Itanium architecture
- Extended version of the VLIW concept
- Turned out to be extremely difficult to target for compilers

22 EPIC
- 64-bit, register-rich, explicitly parallel architecture
- Implements predication, speculation, and branch prediction
- Hardware register renaming for parameter passing
- Parallel execution of loops
- Speculation, prediction, and renaming are controlled by the compiler
- Each 128-bit instruction word contains three instructions; stop bits control parallel execution
- The processor can execute six instructions per clock cycle
- Thirty execution units for subsets of the instruction set, organized in eleven groups; each unit executes at a rate of one instruction per cycle (unless stalled); common instructions can be executed in multiple units

23 Itanium: 30 Functional Units

24 Simultaneous Multi-Threading (SMT) [Tullsen et al., 1995]
- The reasons for bad performance in superscalar architectures depend on the application
- Dynamically schedule the usage of the functional units
- Support multiple instruction streams in one pipeline

25 Hyperthreading
- Intel's implementation of simultaneous multi-threading (SMT)
- Allows one execution core to function as two logical processors
- Main goal is to reduce the number of dependent instructions in the pipeline at the same time
- Works nicely on a cache miss, branch misprediction, or data dependency in one of the threads
- Most core hardware resources are shared: caches, execution units, buses
- Each logical processor has its own architectural state; the register bank is mirrored
- Mainly enables a very fast thread context switch purely in hardware
- More than two logical threads per core would saturate the memory connection and pollute the caches

26 Hyperthreading [Intel]

27 Parallel Processing
- Inside the processor: instruction-level parallelism (ILP) and multicore - shared memory
- With multiple processing elements in one machine: multiprocessing - shared memory
- With multiple processing elements in many machines: multicomputer - shared nothing (in terms of a globally accessible memory)

28 Chip Multi-Processing
- One integrated circuit die (socket) contains multiple computational engines (cores); called a multi-core or many-core architecture
- Cores share some / all cache levels and the memory connection; all other parts are dedicated per core (pipeline, registers, ...)
- An increasing core count leads to a resource contention problem with caches and memory
- Besides Intel / AMD, also available with ARM, MIPS, PPC
Multi-core vs. SMP:
- SMP demands more replicated hardware (fans, bus, ...)
- SMP is a choice, multi-core is given by default
- Cores typically have a lower clock frequency
- Multi-core and SMP programming problems are very similar
- Recent trend towards heterogeneous cores

29 Many-Core / Multi-Core (figures: Intel Core i7, SPARC64 VIIIfx)

30 Parallel Processing
- Inside the processor: instruction-level parallelism (ILP) and multicore - shared memory
- With multiple processing elements in one machine: multiprocessing - shared memory
- With multiple processing elements in many machines: multicomputer - shared nothing (in terms of a globally accessible memory)

31 Multiprocessor: Flynn's Taxonomy (1966)
Classifies multiprocessor architectures along the instruction and data processing dimensions:
- Single Instruction, Single Data (SISD)
- Single Instruction, Multiple Data (SIMD)
- Multiple Instruction, Single Data (MISD)
- Multiple Instruction, Multiple Data (MIMD)
(C) Blaise Barney

32 Multiprocessor Systems
Symmetric multiprocessing (SMP):
- Set of equal processors in one system (more SM-MIMD than SIMD)
- Traditionally a memory bus, today an on-chip network
- Demands synchronization and operating system support
Asymmetric multiprocessing (ASMP):
- Specialized processors for I/O, interrupt handling, or the operating system (DEC VAX-11, OS/360, IBM Cell processor)
- Typically a master processor with main memory access, and slaves

33 Symmetric Multi-Processing [Stallings]
- Two or more processors in one system that can perform the same operations (symmetric)
- Processors share the same main memory and all devices
- Increased performance and scalability for multi-tasking
- No master: any processor can cause another to reschedule
- Challenges for an SMP operating system: reentrant kernel, scheduling policies, synchronization, memory re-use, ...

34 Shared Memory
All processors act independently and use the same global address space; changes in one memory location are visible to all others.
Uniform memory access (UMA) system:
- Equal load and store access times for all processors to all memory
- Default approach for the SMP systems of the past
Non-uniform memory access (NUMA) system:
- Groups of physical processors (called nodes) that have local memory, connected by some interconnect
- Still an SMP system (any processor can access all of the memory), but node-local memory is faster
- The OS tries to schedule related activities on the same node
- Became the default model in shared-memory architectures, as cache-coherent NUMA (CC-NUMA) in hardware
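
On NUMA hardware it matters on which node data is physically placed. Most operating systems use a first-touch policy: a page lands on the node of the thread that first writes it. A common idiom, sketched here in C with OpenMP under that assumption, is therefore to initialize data with the same thread layout that later computes on it:

    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof *a);

        /* First touch: each thread writes its own chunk, so the OS places
           those pages in the local memory of the node the thread runs on. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Same static schedule: each thread now works mostly on node-local
           memory instead of crossing the interconnect. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = a[i] * 2.0 + 1.0;

        free(a);
        return 0;
    }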

35 UMA Example
Two dual-core chips (2 cores/socket):
- P = processor core
- L1D = level 1 data cache (fastest)
- L2 = level 2 cache (fast)
- Memory = main memory (slow)
- Chipset = enforces cache coherence and mediates connections to memory

36 NUMA Example
Eight cores (4 cores/socket); L3 = level 3 cache
Memory interface = establishes a coherent link to enable one logical single address space over physically distributed memory

37 NUMA Example: Intel Nehalem
(Figure: multiple sockets, each with several cores sharing an L3 cache and a memory controller with local memory, interconnected with each other and with the I/O hubs by QPI links.)

38 CC-NUMA [Schöne et al.]
- The uncore part of the chip contains a central crossbar for the interaction of cores, memory controller, and other processors via QPI
- Similar approach by other vendors
- Extended versions of the MESI cache coherence protocol are used for L3 management

39 CC-NUMA
Cache coherency in a multi-core multi-socket system is an extended version of the traditional cache coherency problem in multi-socket SMP systems. QPI applies an extended MESI cache coherence protocol, where each cache line is in one of the following states:
- Modified: written by the local core
- Exclusive: first read by the local core
- Shared: read by two or more cores (cache hit); a write attempt in this state leads to cache invalidation elsewhere, and the new local state is Modified
- Invalid: the cache line contains no valid data (read miss)
- Forwarding (new): direct L3-to-L3 exchange of data
Can be optimized by snooping into other caches.
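
As a toy model (ours, not from the lecture) of how the state of a single cache line reacts to local and remote events, the following C sketch encodes the basic MESI transitions listed above; a real protocol adds the bus transactions and the Forwarding state:

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* The local core reads the line. */
    mesi_t on_local_read(mesi_t s, int other_caches_hold_copy) {
        if (s == INVALID)                 /* read miss: fetch the line */
            return other_caches_hold_copy ? SHARED : EXCLUSIVE;
        return s;                         /* M, E, S: read hit, no change */
    }

    /* The local core writes the line: every remote copy must be
       invalidated first; the local state becomes Modified. */
    mesi_t on_local_write(mesi_t s) {
        (void)s;
        return MODIFIED;
    }

    /* Another core writes the same line: our copy becomes stale. */
    mesi_t on_remote_write(mesi_t s) {
        (void)s;
        return INVALID;
    }

    /* Another core reads the line: a Modified copy is written back
       (or forwarded), and all holders end up in Shared. */
    mesi_t on_remote_read(mesi_t s) {
        return (s == INVALID) ? INVALID : SHARED;
    }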

40 HyperTransport [hypertransport.org]
- Specification of an I/O interconnect, originally developed by AMD, Alpha Processor, and API NetWorks in 2001
- Point-to-point unidirectional links between components
- At least one host device (typically a processor)
- Bridge functionality to PCI, PCI-X, PCI Express, ...
- Tunnel devices connect a link to other HyperTransport devices
- Extremely low overhead, suitable for inter-processor communication in SMP hardware

41 HyperTransport [hypertransport.org]

42 Quick Path Interconnect (QPI) [intel.com]
Competing technology from Intel, since 2008; the result of a continuous improvement of Intel's processor interconnect technology:
- Traditional shared frontside bus (until 2004): up to 4.2 GB/s platform bandwidth
- Dual independent buses with a snoop filter in the chipset (until 2005): up to 12.8 GB/s platform bandwidth
(Figure: processors attached via the frontside buses to the chipset and its memory interface.)

43 Quick Path Interconnect (QPI) [intel.com]
- Dedicated interconnects with a snoop filter in the chipset (until 2007): up to 34 GB/s platform bandwidth
- Quick Path Interconnect: each processor carries its own memory interface; processors are connected to each other and to the I/O chipsets by uni-directional QPI links instead of bi-directional buses

44 Scalable Coherent Interface
- ANSI / IEEE standard for a NUMA interconnect, used in the HPC world
- 64-bit global address space, translation by the SCI bus adapter (I/O window)
- Used as a 2D / 3D torus
(Figure: processors with caches on per-node memory buses; an SCI cache and an SCI bridge per node link the node memories into one system.)

45 Theoretical Models for Parallel Hardware
- Better to use a simplified parallel machine model than a real hardware specification for parallelization optimization
- Allows the theoretical investigation of algorithms, and generic optimization regardless of products
- Should improve algorithm robustness by avoiding optimizations for hardware layout specialties (e.g. network topology)
- Became popular in the 70s and 80s, due to the large diversity in parallel hardware design
- The resulting computational model is independent of the programming model used for the implementation
- Vast body of theoretical research results
- Typically, formal models adapt to hardware developments

46 (Parallel) Random Access Machine
- RAM assumptions: constant memory access time, unlimited memory
- PRAM assumptions: non-conflicting shared bus, no assumption on synchronization support, unlimited number of processors
- Alternative models: BSP, LogP
(Figure: a RAM as one CPU between input, memory, and output; a PRAM as several CPUs on a shared bus in front of the same memory.)

47 PRAM Extensions
Rules for memory interaction classify the hardware support needed by a PRAM algorithm. Memory access is assumed to be in lockstep (synchronous PRAM):
- Concurrent Read, Concurrent Write (CRCW): multiple tasks may read from / write to the same location at the same time; can be simulated with EREW
- Concurrent Read, Exclusive Write (CREW): only one task may write to a given memory location at any time
- Exclusive Read, Concurrent Write (ERCW): only one task may read from a given memory location at any time
- Exclusive Read, Exclusive Write (EREW): only one task may read from / write to a memory location at any time; the memory management must know about the concurrency

48 PRAM Extensions
A concurrent write scenario needs further specification by the algorithm:
- Ensure that the same value is written
- Select an arbitrary value from the parallel write attempts
- Derive the priority of the written value from the processor ID
- Store the result of a combining operation (e.g. sum) into memory
A PRAM algorithm can act as the starting point for a real implementation. The unlimited resource assumption:
- Allows mapping of 'logical' PRAM processors to a restricted number of physical processors
- Enables the design of scalable algorithms based on the unlimited memory assumption
- Focuses only on concurrency opportunities; synchronization and communication come later

49 Example: Parallel Sum
The general parallel sum operation works with any associative and commutative combining operation: multiplication, maximum, minimum, logical operations, ...
Sequential version:

    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += A[i];
    }

PRAM solution:
- Build a binary tree, with the input data items as leaf nodes
- Internal nodes hold partial sums; the root node holds the global sum
- Additions on one level are independent of each other
PRAM algorithm: one processor per leaf node, in-place summation; computation in O(log2 n)

50 Example: Parallel Sum
Example for n = 8:
- l = 1: partial sums in X[1], X[3], X[5], X[7]
- l = 2: partial sums in X[3] and X[7]
- l = 3: the parallel sum result is in X[7]
Correctness relies on the PRAM lockstep assumption (no synchronization).

    for all l levels (1..log2 n) {
        for all i items (0..n-1) in parallel {
            if (((i+1) mod 2^l) == 0) then
                X[i] := X[i - 2^(l-1)] + X[i]
        }
    }
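
Simulated sequentially, the PRAM algorithm is a few lines of C. This sketch of ours assumes n is a power of two and leaves the total in X[n-1]:

    /* In-place tree summation, a sequential simulation of the PRAM
       algorithm above; at level l the stride is 2^(l-1). */
    void parallel_sum(int X[], int n) {
        for (int stride = 1; stride < n; stride *= 2)
            for (int i = 2 * stride - 1; i < n; i += 2 * stride)
                X[i] += X[i - stride];   /* independent within one level */
    }
    /* Afterwards X[n-1] holds the sum of all n elements. */

On a PRAM, the inner loop runs in parallel with one processor per item, so each level costs constant time and the whole sum takes O(log2 n) steps.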

51 Bulk-Synchronous Parallel (BSP) Model
Leslie G. Valiant: A Bridging Model for Parallel Computation, 1990.
The success of the von Neumann model comes from being a bridge between hardware and software: high-level languages can be efficiently compiled onto this model, while hardware designers can optimize its realization. BSP aims to be a similar model for parallel machines:
- Should be neutral about the number of processors
- Programs are written for v virtual processors that are mapped to p physical ones
- When v >> p, the compiler has options
A BSP computation consists of a series of supersteps:
1. Concurrent computation on all processors
2. Exchange of data between all processes
3. Barrier synchronization

52 Bulk-Synchronous Parallel (BSP) Model
The costs of a superstep depend on:
- The cost of the slowest local computation
- The cost of the communication between all processes
- The cost of the barrier synchronization
The cost of an algorithm is the sum of all superstep costs (see the formula below). Synchronization may only involve some of the processes, so long-running serial tasks are not slowed down from the model's perspective.
Recent industrial uptake with Pregel and the ML language; the Apache Hama project implements BSP on top of Hadoop.
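
In the usual formulation of the model (not spelled out on the slide), a superstep in which processor i performs w_i local operations and the largest number of words sent or received by any processor is h costs

    cost(superstep) = max_i(w_i) + g * h + l

where g is the network's cost per word and l is the cost of the barrier; the cost of the whole algorithm is the sum of these terms over all supersteps.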

53 Bulk-Synchronous Parallel (BSP) Model
A bulk-synchronous parallel computer (BSPC) is defined by:
- Components, each performing processing and / or memory functions
- A router that delivers messages between pairs of components
- Facilities to synchronize components at regular intervals of L time units (the periodicity)
A computation consists of a number of supersteps; every L time units, a global check is made whether the current superstep is completed. The router concept separates the computation from the communication aspects, and models memory / storage access explicitly. L is controlled by the application, even at run time.

54 LogP [Culler et al., 1993]
Criticism of the over-simplification in PRAM-based approaches, which encourage the exploitation of 'formal loopholes' (e.g. free communication), and response to the trend towards multicomputer systems with large local memories. LogP characterizes a parallel machine by:
- P: number of processors
- g (gap): minimum time between two consecutive transmissions; its reciprocal corresponds to the per-processor communication bandwidth
- L (latency): upper bound on the messaging time
- o (overhead): exclusive processor time needed for a send / receive operation
L, o, and g are measured in multiples of processor cycles.

55 LogP Architecture Model

56 LogP
An algorithm must produce correct results under all message interleavings, and prove its space and time demands per processor. Simplifications:
- With infrequent communication, the bandwidth limit (g) is not relevant
- With streaming communication, the latency (L) may be disregarded
- Convenient approximation: increase the overhead (o) to be as large as the gap (g)
Encourages careful scheduling of computation, and the overlapping of computation and communication. Can be mapped to shared-memory and shared-nothing machines: reading a remote location requires 2L + 4o processor cycles (see below).
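
The 2L + 4o figure follows directly from the model's parameters, reading a remote access as a request message plus a reply: the request costs o (send) + L (network) + o (receive), and the reply costs the same, so the total is 2 * (o + L + o) = 2L + 4o.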

57 LogP
Matching the model to real machines:
- Saturation effects: latency increases as a function of the network load, with a sharp increase at the saturation point - captured by a capacity constraint
- The internal network structure is abstracted, so 'good' vs. 'bad' communication patterns are not distinguished - can be modeled by multiple values of g
- LogP does not model specialized hardware communication primitives; everything is mapped to send / receive operations
- Separate network processors can be explicitly modeled
The model defines a 4-dimensional parameter space of machines; a vendor's product line can be identified by a curve in this space.

58 LogP Optimal Broadcast Tree

59 LogP Optimal Summation
