Chapter 17 - Parallel Processing


Chapter 17 - Parallel Processing Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 17 - Parallel Processing 1 / 71

Table of Contents

1 Motivation
2 Parallel Processing
  Categories of Computer Systems
  Taxonomy of parallel computing systems
  MIMD parallel processing
    Symmetric Multiprocessor Parallelism
    Processor Organization
    Cache Coherence Protocols
    Clusters
  SIMD parallel processing
    Vector Computation
3 Where to focus your study
4 References

Motivation

Today we will look at ways to increase processor performance: always a fun topic =)

What are some of the techniques you know of to increase processor performance?

Frequency: the easy way out =) Inherent advantages and disadvantages. Can you name them?

Cache: high-speed memory; Idea: diminish the number of reads from high-latency memory (RAM, HD, etc.); Very expensive!

Pipeline: Idea: an assembly line for instructions; Leads to performance increases; But also introduces a host of new considerations (hazards).

Today we will discuss another technique: parallelization.

Original µP design: sequential processing of instructions.

Parallelism: execute multiple instructions at the same time. How? As always in engineering, there are multiple strategies; the choice is made based on computational requirements.

The parallelism choice is made based on computational requirements: can you think of some of the dimensions that influence parallel computation?

What about these? Are we processing a single source of data, or multiple? Do we need to apply the same instruction to the data, or different ones?

Categories of Computer Systems (1/4)

Single instruction, single data (SISD): a single processor executes a single instruction stream to operate on data stored in a single memory.

Categories of Computer Systems (2/4)

Single instruction, multiple data (SIMD): a single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis; each processing element has an associated data memory, so the same instruction is executed on a different data set by each processor.

Categories of Computer Systems (3/4)

Multiple instruction, single data (MISD): a sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence.

Categories of Computer Systems (4/4)

Multiple instruction, multiple data (MIMD): a set of processors simultaneously execute instruction sequences on different data sets.
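The four categories can be illustrated with a toy sketch (not real hardware, just the *pattern* of instructions applied to data; the function names are illustrative):

```python
# Toy illustration of Flynn's categories as patterns of applying
# instruction streams to data streams.

def sisd(data):
    # SISD: one instruction stream stepping through one data stream.
    total = 0
    for x in data:
        total += x
    return total

def simd(data):
    # SIMD: the *same* instruction applied to many data elements
    # in lockstep (emulated here with a comprehension).
    return [x * 2 for x in data]

def mimd(streams, instructions):
    # MIMD: several processors, each running its *own* instruction
    # stream on its *own* data set.
    return [instr(stream) for instr, stream in zip(instructions, streams)]

print(sisd([1, 2, 3]))                      # 6
print(simd([1, 2, 3]))                      # [2, 4, 6]
print(mimd([[1, 2], [3, 4]], [sum, max]))   # [3, 4]
```

MISD is omitted on purpose: it is the rarest category in practice, and the chapter does not dwell on it either.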

Taxonomy of parallel computing systems

Figure: A Taxonomy of Parallel Processor Architectures (Source: [Stallings, 2015])

Today's class will focus on:

MIMD parallel processing: SMP (building block); Clusters (arrangement of building blocks): data centers, supercomputers.

SIMD parallel processing: general-purpose computing on graphics processing units (GPGPU); Intel Xeon Phi.

Symmetric Multiprocessor Parallelism

An SMP has the following characteristics (1/2):

There are two or more similar processors of comparable capability.

Processors share main memory and I/O facilities and devices.

Processors are interconnected by a bus or another connection scheme.

Memory access time is approximately the same for each processor.

An SMP has the following characteristics (2/2):

Processors perform the same functions: hence the term symmetric.

The system is controlled by an operating system managing the processors and programs, and how processes/threads are to be executed on the processors.

The user does not need to worry about anything =)

Processor Organization

Figure: Generic Block Diagram of a Tightly Coupled Multiprocessor (Source: [Stallings, 2015])

Two or more processors:

Each processor includes a control unit, ALU, registers, and cache.

Each processor has access to shared main memory and to the I/O devices.

Processors communicate with each other through memory (messages and status information) or by exchanging signals directly.
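Communication through shared memory can be sketched with two threads standing in for two processors (an assumed scenario; the `queue.Queue` mailbox is just a convenient shared-memory message area):

```python
# Sketch: two "processors" (threads) exchanging messages and status
# information through memory that both can see.
import queue
import threading

mailbox = queue.Queue()          # shared-memory message area

def producer():
    for i in range(3):
        mailbox.put(("msg", i))  # write message into shared memory
    mailbox.put(("done", None))  # status information: no more messages

def consumer(results):
    while True:
        kind, payload = mailbox.get()   # read from shared memory
        if kind == "done":
            break
        results.append(payload)

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)   # [0, 1, 2]
```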

With all the components involved: do you think it would be easy for a programmer to control everything at assembly level? Any ideas?

It would be a nightmare... How are the different computational resources best used, then? Any ideas?

A combination of operating system support tools (processes, threads) alongside application programming interfaces; an integral part of an Operating Systems course.
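A minimal sketch of what those OS tools and APIs buy you: instead of steering each processor by hand, you hand chunks of work to a thread pool and let the OS schedule them (the chunking scheme here is purely illustrative):

```python
# Sketch: summing an array by splitting it into chunks and letting
# the OS schedule the partial sums across worker threads.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # 4950
```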

Most common organization:

Figure: Symmetric Multiprocessor Organization (Source: [Stallings, 2015])

But how is the bus managed with multiple processors? Any ideas?

Figure: Symmetric Multiprocessor Organization (Source: [Stallings, 2015])

Time-sharing: when one module is controlling the bus, the other modules are locked out; if necessary, modules suspend operation until bus access is achieved.
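Time-sharing of the bus can be modeled with a single lock (a toy model, not a real arbiter): whichever thread holds the lock is the bus master; the others suspend until access is granted.

```python
# Toy model of a time-shared bus: one lock, one master at a time,
# everyone else locked out until bus access is achieved.
import threading

bus = threading.Lock()          # only one bus master at a time
memory = {"x": 0}

def module(increments):
    for _ in range(increments):
        with bus:               # acquire the bus; others suspend here
            memory["x"] += 1    # bus transaction (read-modify-write)

threads = [threading.Thread(target=module, args=(100,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(memory["x"])   # 400: serialised bus access keeps the count exact
```

Without the lock the read-modify-write sequences could interleave and lose updates, which is precisely why the bus must be arbitrated.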

The bus has the same structure as in a single-processor system: control, address, and data lines.

To facilitate DMA transfers, the following features are provided:

Addressing: to distinguish modules on the bus;

Arbitration: any I/O module can temporarily function as master; a mechanism arbitrates competing requests for bus control, using some sort of priority scheme.

Now we have a single bus being used by multiple processors. Can you see any problems with this?

All memory accesses pass through the common bus; the bus cycle time limits the speed of the system; this results in a performance bottleneck.

But this leads to the following question: what is the most common way of reducing the number of memory accesses?

Cache ;) In a multiprocessor architecture: can you see any problems with using multiple caches? Any ideas?

Each local cache contains an image of a portion of memory. If a word is altered in one cache, it could conceivably invalidate a word in another cache. To prevent this, the other processors must be alerted that an update has taken place. This is commonly addressed in hardware rather than by software.
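The problem is easy to demonstrate with two dictionaries standing in for two per-processor caches (a toy demonstration, no real caches involved):

```python
# Two per-processor "caches" hold copies of the same memory word;
# a write in one cache leaves a stale image in the other until
# somebody alerts it.
main_memory = {"addr_100": 7}

cache_p1 = {"addr_100": main_memory["addr_100"]}   # P1 reads the word
cache_p2 = {"addr_100": main_memory["addr_100"]}   # P2 reads the same word

cache_p1["addr_100"] = 42        # P1 updates its cached copy

print(cache_p1["addr_100"])      # 42
print(cache_p2["addr_100"])      # 7  <- stale: P2 was never alerted
```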

Cache Coherence Protocols

Dynamic recognition at run time of potential inconsistency conditions:

Memory can always be cached; the problem is only dealt with when it actually arises;

Better performance than a software approach;

Transparent to the programmer and the compiler, reducing the software development burden.

We will look at one such protocol: MESI.

Not this Mesi though...

MESI protocol

Each line of the cache can be in one of four states (1/2):

Modified: the line in the cache has been modified (differs from main memory), and the line is not present in any other cache.

Exclusive: the line in the cache is the same as that in main memory, and the line is not present in any other cache.

Each line of the cache can be in one of four states (2/2):

Shared: the line in the cache is the same as that in main memory, and the line may be present in another cache.

Invalid: the line in the cache does not contain valid data.

This time in table form:

Figure: MESI cache line states (Source: [Stallings, 2015])

At any time a cache line is in a single state:

If the next event is from the attached processor, the transition is dictated by the figure below on the left;

If the next event is from the bus, the transition is dictated by the figure below on the right.

Figure: Line in cache at initiating processor (Source: [Stallings, 2015])

Figure: Line in snooping cache (Source: [Stallings, 2015])

Caption for the previous figures:

Figure: (Source: [Stallings, 2015])

Let's see what all of this means ;)

What happens when a read miss occurs? Any ideas?

Well, it depends... Does any cache have a copy of the line in the exclusive state? Do any caches have a copy in the shared state? Does any cache have a copy in the modified state? What if no other copies exist?

A read miss occurs in the local cache; the processor:

1 Initiates a main memory read of the missing address;
2 Inserts a signal on the bus that alerts all other processor/cache units to snoop the transaction;
3 A number of possible outcomes may occur (1/4):

Possibility 1: one cache has a clean copy of the line in the exclusive state:

1 That cache returns a signal indicating that it shares this line;
2 The responding processor transitions the line from exclusive to shared;
3 The initiating processor reads the line from memory: the line goes from invalid to shared.

Possibility 2: caches have a clean copy of the line in the shared state:

1 Each cache signals that it shares the line;
2 The initiating processor reads the line from memory: the line goes from invalid to shared.

Possibility 3: one other cache has a modified copy of the line:

1 The cache containing the modified line blocks the memory read from the initiating processor;
2 The cache containing the modified line provides the line to the requesting cache;
3 The responding cache then changes its line from modified to shared;
4 The memory controller updates the contents of the line in memory.

Possibility 4: no other cache has a copy of the line, so no signals are returned:

1 The initiating processor reads the line from memory;
2 The initiating processor transitions the line in its cache from invalid to exclusive.

A read hit occurs on a line currently in the local cache:

1 The processor reads the required item;
2 No state change: the state remains modified, shared, or exclusive.

A write miss occurs in the local cache; the processor:

1 Attempts to read the line from main memory;
2 Signals read-with-intent-to-modify (RWITM) on the bus.

Possibility 1: some other cache has a modified copy of this line:

1 The alerted processor signals that it has a modified copy of the line;
2 The initiating processor surrenders the bus and waits;
3 The other processor writes the modified line to main memory;
4 The other processor transitions the state of its cache line to invalid.

The initiating processor then signals RWITM on the bus again: it reads the line from main memory, modifies the line in its cache, and marks the line as modified.

Possibility 2: no other cache has a modified copy of the requested line:

1 No signal is returned by the other processors;
2 The initiating processor proceeds to read in the line and modify it;
3 If one or more caches have a clean copy of the line in the shared state, each such cache invalidates its copy of the line;
4 If one cache has a clean copy of the line in the exclusive state, that cache invalidates its copy of the line.

A write hit occurs on a line in the local cache; the effect depends on the line's state:

Possibility 1 - Shared state: before updating, the processor must gain exclusive ownership of the line:

The processor signals its intent on the bus;
Each processor that has a shared copy of the line goes from shared to invalid;
The initiating processor updates the line and then transitions it from shared to modified.

Possibility 2 - Exclusive state: the processor already has exclusive control of this line:

It simply performs the update;
It transitions its copy of the line from exclusive to modified.

Write hit occurs on a line in the local cache. Possibility 3 - Modified state: the processor already has exclusive control of this line and it is marked as modified:
- Simply performs the update; the line remains in the modified state.

Seems complex? Imagine multiple cache levels that can be shared...

Clusters
Group of computers working together as a unified computing resource; each computer in a cluster is typically referred to as a node.
Some benefits:
- Scalability: new nodes can be added to the cluster;
- Availability: failure of one node does not mean loss of service.

How is information stored within the cluster? Any ideas?

Figure: Shared Disk Cluster Configuration (Source: [Stallings, 2015])

A cluster system raises a series of important design issues:
- Failure Management: How are failures managed by the cluster? A computation can be restarted on a different node, or continued on a different node;
- Load Balancing: How is the processing load distributed among the available computers?
- Parallelizing Computation: How is the computation parallelized? By the compiler? Explicitly programmed? Parametric?

Supercomputers are cluster systems: high-speed networks and physical proximity allow for better performance. This is the type of system used by the Santos Dumont supercomputer at LNCC.

SIMD parallel processing
Vector Computation
Perform arithmetic operations on arrays of floating-point numbers. Example: system simulation, which can typically be expressed as a set of equations; see the Basic Linear Algebra Subprograms (BLAS).
The same operation is performed on different data: thus Single Instruction, Multiple Data (SIMD).

SIMD parallel processing
Let's look at an example: we want to add two arrays element by element: C_i = A_i + B_i
Figure: Example of Vector Addition (Source: [Stallings, 2015])

SIMD parallel processing
A general-purpose computer iterates through each element of the array. Overkill and inefficient! Why? The CPU can do floating-point operations, but it can also do a host of other things:
- I/O operations;
- Memory management;
- Cache management.
We are mostly interested in the arithmetic operations: the Arithmetic Logic Unit (ALU).

SIMD parallel processing
Idea: general-purpose parallel computation based on ALUs, instead of using full processors for parallel computation; ALUs are also cheap =)
Requires careful memory management, since memory is shared by the different ALUs.

SIMD parallel processing
Figure: Summing two vectors with one ALU per element, ALU 0 through ALU N (Source: [Sanders and Kandrot, 2011])
The variable tid is the identification of the ALU.

SIMD parallel processing
Two main categories:
- Pipelined ALU;
- Parallel ALUs (GPGPUs): pipelined or not pipelined.
Figure: Pipelined ALU (Source: [Stallings, 2015])
Figure: Parallel ALU (Source: [Stallings, 2015])

SIMD parallel processing
Floating-point operations are complex: decompose each floating-point operation into stages and pipeline them =)
Figure: The different stages of the pipeline for floating-point operations (Source: [Stallings, 2015])

SIMD parallel processing
Figure: Pipelined ALU (Source: [Stallings, 2015])
Figure: Parallel ALU (Source: [Stallings, 2015])
C - compare exponents, S - shift significand, A - add significands, N - normalize
The same problems that occur with processor pipelining also occur here.

SIMD parallel processing
As a curiosity, let's look at the architecture of a GeForce 8800:
Figure: GeForce 8800 architecture (Source: NVIDIA)

SIMD parallel processing
14 Streaming Multiprocessors (SMs), each containing 8 Streaming Processors (SPs). Each SP has a fully pipelined integer arithmetic logic unit (ALU) and a floating-point unit (FPU): 112 SPs in total, i.e. 224 arithmetic units.
This is an old card from 2006; recent models contain thousands of ALUs!
Supercomputers are a combination of CPU and GPGPU nodes.

SIMD parallel processing
Figure: NVIDIA GPU Architecture (Source: NVIDIA)

Where to focus your study
After this class you should be able to:
- Summarize the types of parallel processor organizations;
- Present an overview of the design features of symmetric multiprocessors;
- Understand the issue of cache coherence in a multiple-processor system;
- Explain the key features of the MESI protocol;
- Understand cluster systems and vector systems.

Where to focus your study
It is less important to know how these solutions were implemented, i.e. the details of specific hardware solutions for parallel computation. Your focus should always be on the building blocks for developing a solution =)

References
Sanders, J. and Kandrot, E. (2011). CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional.
Stallings, W. (2015). Computer Organization and Architecture. Pearson Education.