SMD149 - Operating Systems - Multiprocessing
Roland Parviainen
December 1, 2005

Overview
Multiprocessor systems: multiprocessor, operating system and memory organizations.

Multiprocessor system
A multiprocessor system is a computer that contains more than one processor. Benefits:
- Increased processing power
- Resource use scales to application requirements
Additional operating system responsibilities:
- Keep all processors busy
- Distribute processes evenly throughout the system
- Ensure all processors work on consistent copies of shared data
- Synchronize the execution of related processes
- Enforce mutual exclusion (a lock sketch appears below)
Examples of multiprocessors:
- Dual-processor personal computer
- Powerful server containing many processors
- Cluster of workstations
Classifications of multiprocessor architecture:
- Nature of the datapath
- Interconnection scheme
- How processors share resources

Flynn's taxonomy
Based on the number of concurrent instruction (or control) streams and data streams available:
- SISD: single instruction, single data stream
- MISD: multiple instruction, single data stream
- SIMD: single instruction, multiple data streams
- MIMD: multiple instruction, multiple data streams

SISD
Single instruction, single data stream: the sequential computer. Exploits minor parallelism through:
- Pipelining
- Superscalar processors
- Hyper-threading
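The lock sketch promised above: a minimal example, assuming POSIX threads, of how an operating system responsibility like mutual exclusion keeps a shared counter consistent when two threads update it.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter; the mutex ensures both
   threads always work on a consistent copy of the shared data. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 2000000 with the lock */
    return 0;
}
```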

MISD
Multiple instruction, single data stream. Not commonly used; mainly for redundancy, e.g. three processors perform the same computation and the results are compared.

SIMD
Single instruction, multiple data streams:
- One or more processing units apply the same instruction to a block of data
- Array or vector processors
- Most processors have SIMD instructions: PowerPC's AltiVec; Intel's MMX, SSE, SSE2 and SSE3; AMD's 3DNow!; SPARC's VIS; PA-RISC's MAX; and MIPS MDMX and MIPS-3D
- Widely used for multimedia (an SSE sketch appears below)

MIMD
Multiple instruction, multiple data streams: multiple autonomous processors simultaneously execute different instructions on different data. Examples:
- Multiprocessor systems
- Networks of workstations
- Distributed systems

Processor interconnection schemes
An interconnection scheme describes how the system's components, such as processors and memory modules, are connected. It consists of nodes (components or switches) and links (connections). Parameters used to evaluate interconnection schemes:
- Node degree
- Bisection width
- Network diameter
- Cost of the interconnection scheme

Shared bus
- Single communication path between all nodes
- Contention can build up on the shared bus
- Lacks scalability; useful for interconnecting a small number of processors
- For systems with many processors, a bus can be used to interconnect the nodes within a supernode
- A dynamic network
- Example: dual-processor Intel Pentium
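The SSE sketch promised above: a minimal example, assuming an x86 compiler with SSE support, in which a single instruction operates on a block of four floats at once.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* One SIMD add instruction processes all four lanes in parallel. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}
```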

Crossbar-switch matrix
- Separate path from every processor to every memory module (or, in general, from every node to every other node)
- For n processors and m memory modules there are n x m switches in total
- High fault tolerance: bisection width = (n x m)/2
- High performance: simultaneous transmissions can be supported, and a network diameter of 1 yields low latency
- High cost: proportional to n x m
- Example: Sun UltraSPARC-III

2-D mesh network
- n rows and m columns, in which each node is connected to the nodes directly north, south, east and west of it
- Maximum node degree = 4
- Moderate performance, fault tolerance and cost
- Not very scalable, since the network diameter may be substantial
- Example: Intel Paragon (a large-scale system, but communication is kept between neighboring nodes)

Hypercube
- An n-dimensional hypercube has 2^n nodes, each connected to n neighbor nodes
- Better scalability and performance: a 4-D hypercube (16 nodes) has a network diameter of 4, while a 2-D mesh network with 16 nodes has a network diameter of 6
- More fault tolerant, but more expensive, than crossbar-switch or 2-D mesh networks
- Example: nCUBE (up to 8192 processors)
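To make the diameter comparison concrete, here is a minimal sketch (the node labels are illustrative) that enumerates a hypercube node's neighbors, which differ from it in exactly one bit, and prints the two diameters quoted above.

```c
#include <stdio.h>

/* In an n-dimensional hypercube, node labels are n-bit numbers and
   two nodes are linked iff their labels differ in exactly one bit. */
static void print_neighbors(unsigned node, int n)
{
    printf("node %u neighbors:", node);
    for (int bit = 0; bit < n; bit++)
        printf(" %u", node ^ (1u << bit));  /* flip one bit per dimension */
    printf("\n");
}

int main(void)
{
    int n = 4;                       /* 4-D hypercube: 2^4 = 16 nodes */
    print_neighbors(5, n);           /* 0101 -> 0100 0111 0001 1101   */

    /* Diameter = longest shortest path between any pair of nodes.   */
    printf("hypercube diameter: %d\n", n);            /* flip n bits: 4   */
    printf("4x4 mesh diameter:  %d\n", (4-1)+(4-1));  /* corner-corner: 6 */
    return 0;
}
```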

Multistage networks
- Switch nodes act as hubs routing messages between nodes
- Cheaper (simple switch hardware that can be packed tightly)
- Less fault tolerant and worse performance than a crossbar-switch matrix: contention can occur at switches, and the network diameter is typically larger
- Example: IBM POWER4

Loosely coupled vs. tightly coupled systems
Tightly coupled systems:
- Processors share most resources, including memory
- Communicate over shared buses
- Less burden for operating system programmers
- Less flexible; poor scalability and fault tolerance
Loosely coupled systems:
- Processors do not share most resources; each has dedicated memory
- Communicate through explicit messages over a communication link, or through shared virtual memory
- More flexible, fault tolerant and scalable

Multiprocessor operating system organizations
Multiprocessor operating systems differ significantly from uniprocessor ones. They are classified into three types, based on how processors share operating system responsibilities:
- Master/slave
- Separate kernels
- Symmetrical organization

Master/slave
- The master processor executes the operating system code and handles computation and I/O jobs; slaves execute only user programs (processor-bound jobs)
- Hardware asymmetry: slave processors must wait for the master processor to handle interrupts for OS operations
- Low fault tolerance: failure of the master processor results in system failure
- Good for computationally intensive jobs
- Example: nCUBE system
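A toy user-level analogy of the master/slave split, assuming POSIX threads (the job queue and names are invented for illustration): the master hands out work and keeps all I/O to itself, while the slaves only execute computation.

```c
#include <pthread.h>
#include <stdio.h>

#define SLAVES 3
#define JOBS   9

static int next_job = 0;                 /* master's job queue index */
static long results[JOBS];
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Slaves run only user computation (processor-bound jobs); they do
   no I/O, mirroring the asymmetry described above. */
static void *slave(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        int job = next_job < JOBS ? next_job++ : -1;
        pthread_mutex_unlock(&m);
        if (job < 0) return NULL;
        results[job] = (long)job * job;  /* the "computation" */
    }
}

int main(void)  /* the master: dispatches work and does all the I/O */
{
    pthread_t t[SLAVES];
    for (int i = 0; i < SLAVES; i++)
        pthread_create(&t[i], NULL, slave, NULL);
    for (int i = 0; i < SLAVES; i++)
        pthread_join(t[i], NULL);
    for (int j = 0; j < JOBS; j++)
        printf("job %d -> %ld\n", j, results[j]);
    return 0;
}
```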

Separate kernels
- Each processor executes its own operating system and responds only to interrupts generated by user processes running on that processor
- Loosely coupled; little contention over resources
- Some globally shared operating system data
- Total system failure is unlikely, but failure of one processor terminates the processes running on it
- Useful in environments where processes do not interact much and high availability is important

Symmetrical organization
- The operating system manages a pool of identical processors
- High amount of resource sharing, requiring mutual exclusion
- Allows processes to truly share the processors and exploit parallelism
- Highest degree of fault tolerance of any organization; facilitates graceful degradation
- Some contention for resources can limit the performance improvement gained by adding more processors

UMA
Memory organizations are classified by how memory is shared, with trade-offs between performance, scalability and cost. In a uniform memory access (UMA) multiprocessor system:
- All processors share main memory
- Access time to any memory page is nearly the same for all processors and all memory modules
- A shared bus or crossbar-switch matrix is typically used
- Used in small multiprocessor systems (typically two to eight processors)

NUMA
In a nonuniform memory access (NUMA) multiprocessor:
- Global memory is divided into memory partitions or modules
- Each node contains a few processors and a portion of system memory, which is local to that node
- Access to local memory is faster than access to global memory (the rest of the memory modules)
- More scalable than UMA, with fewer bus collisions
- Most of a processor's memory requests are served by local memory
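The last point is what makes NUMA pay off, and it can be quantified with a simple effective access time model. A minimal sketch; the latencies below are made-up placeholders, not measurements.

```c
#include <stdio.h>

/* Effective memory access time in a NUMA system:
   t_eff = p_local * t_local + (1 - p_local) * t_remote */
int main(void)
{
    double t_local  = 100.0;   /* ns, assumed local access latency  */
    double t_remote = 300.0;   /* ns, assumed remote access latency */

    for (double p = 0.5; p <= 1.0001; p += 0.1) {
        double t_eff = p * t_local + (1.0 - p) * t_remote;
        printf("local hit ratio %.1f -> %.0f ns average\n", p, t_eff);
    }
    return 0;
}
```

The higher the fraction of requests served locally, the closer the average access time gets to the local latency, which is why the page migration and replication strategies below matter.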

COMA
In a cache-only memory architecture (COMA):
- The basic physical interconnection is similar to NUMA
- Main memory is viewed as a cache and is called an attraction memory (AM)
- The system can migrate data to the node that accesses it most often, at the granularity of a memory line (more efficient than a memory page)
- Reduces the number of cache misses serviced remotely
- Overhead: duplicated data items, and a complex protocol to ensure that all updates reach all processors

NORMA
No-remote-memory-access (NORMA) systems:
- Loosely coupled; simple to build, with no complex interconnection
- No shared global memory: each node maintains its own local memory
- Communication is through explicit messages (a message-passing sketch appears below)
- Some implement the illusion of shared global memory, known as shared virtual memory (SVM)
- Distributed systems involve message passing via remote procedure calls (RPCs)
- Example: Google server farms

Memory sharing
- Memory coherence: ensure that all processors have access to the most up-to-date copy of data (cache coherency)
- Strategies to increase the percentage of local memory hits: page migration and page replication
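The NORMA sketch promised above: a minimal POSIX example in which a pipe stands in for the interconnect (illustrative only; real systems use a network), so the two processes share no memory and communicate purely by explicit messages.

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int link[2];
    char buf[64];

    pipe(link);                            /* stand-in interconnect    */
    if (fork() == 0) {                     /* the "remote node"        */
        read(link[0], buf, sizeof buf);    /* receive explicit message */
        printf("remote node got: %s\n", buf);
        return 0;
    }
    const char *msg = "compute f(x)";      /* RPC-like request         */
    write(link[1], msg, strlen(msg) + 1);  /* send explicit message    */
    wait(NULL);
    return 0;
}
```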

Cache coherence
UMA cache coherence can be implemented in three different ways:
- Bus snooping (a.k.a. cache snooping)
- Maintaining a centralized directory that indicates when to remove stale data
- Allowing only one processor to cache a given data item
Cache-coherent NUMA:
- Each address is associated with a home node
- The home node is contacted on reads and writes to obtain the most up-to-date version of the data
- The protocol requires a maximum of three network traversals per coherency operation

Page replication and migration
- Page replication: the system places a copy of a memory page at a new node; good for pages that are read by processes at different nodes but not written
- Page migration: the system moves a memory page from one node to another; good for pages written to by a single remote process
- Both increase the percentage of local memory hits
- Drawbacks: memory overhead, and replicating or migrating is more expensive than referencing remotely, so it is only useful if the page is referenced again

Scheduling
Scheduling determines the order in which processes are dispatched, and to which processors. Two goals:
- Parallelism: timesharing scheduling
- Processor affinity (soft or hard): space-partitioning scheduling
Types of scheduling algorithms: job-blind and job-aware.

Job-blind multiprocessor scheduling
Straightforward extensions of uniprocessor algorithms. Examples:
- First-in-first-out (FIFO) multiprocessor scheduling
- Round-robin (RR) multiprocessor scheduling
- Shortest-process-first (SPF) multiprocessor scheduling

Job-aware multiprocessor scheduling
Algorithms that maximize parallelism:
- Shortest-number-of-processes-first (SNPF) scheduling
- Round-robin job (RRjob) scheduling
- Coscheduling: processes from the same job are placed in adjacent spots in the global run queue; a sliding window moves down the run queue, running the adjacent processes that fit into the window (window size = number of processors)
Algorithms that maximize processor affinity:
- Dynamic partitioning

Coscheduling example
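A toy sketch of the sliding window (the queue contents are invented, and real coscheduling also handles window alignment and fragmentation, which are ignored here):

```c
#include <stdio.h>

#define NPROCS 4   /* processors = sliding-window size */

/* Global run queue: processes of the same job occupy adjacent slots;
   each entry names the job its process belongs to. */
static int run_queue[] = {1, 1, 1, 2, 2, 3, 3, 3, 3, 4};
static int qlen = sizeof run_queue / sizeof run_queue[0];

int main(void)
{
    /* Each time slice, the window covers NPROCS adjacent processes,
       which run simultaneously; because same-job processes sit next
       to each other, a job's processes tend to run together. */
    for (int slice = 0, start = 0; start + NPROCS <= qlen; slice++, start++) {
        printf("slice %d runs processes of jobs:", slice);
        for (int i = start; i < start + NPROCS; i++)
            printf(" %d", run_queue[i]);
        printf("\n");
    }
    return 0;
}
```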

Dynamic partitioning

Process migration
Process migration transfers a process from one node to another. Benefits:
- Increased fault tolerance
- Load balancing
- Reduced communication costs
- Resource sharing

Flow of process migration
1. Either the sender or the receiver initiates the request
2. The sender suspends the migrating process
3. The sender creates a message queue for the migrating process's messages
4. The sender transmits the state to a dummy process at the receiver
5. The sender and receiver notify the other nodes of the process's new location
6. The sender deletes its instance of the process
(A code sketch of this flow appears a few slides below.)

Process migration concepts
There is a trade-off between residual dependency and initial migration latency. Characteristics of a successful migration strategy:
- Minimal residual dependency
- Transparent
- Scalable
- Fault tolerant
- Heterogeneous

Process migration strategies
- Eager migration: transfer the entire state during the initial migration
- Dirty eager migration: transfer only dirty memory pages; clean pages are brought in from secondary storage
- Copy-on-reference migration: like dirty eager migration, except clean pages can also be acquired from the sending node on demand
- Lazy copying: no initial transfer of pages
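The promised sketch of the sender-side migration flow; every function here is a hypothetical placeholder standing in for real kernel and network machinery, not an actual API.

```c
#include <stdio.h>

struct process { int pid; };

/* Hypothetical placeholders for the steps of the flow above. */
static void suspend(struct process *p)
    { printf("suspend process %d\n", p->pid); }
static void queue_messages(struct process *p)
    { printf("queue incoming messages for %d\n", p->pid); }
static void send_state(struct process *p, int node)
    { printf("send state of %d to dummy process at node %d\n", p->pid, node); }
static void notify_all_nodes(struct process *p, int node)
    { printf("announce: %d now lives at node %d\n", p->pid, node); }
static void delete_local(struct process *p)
    { printf("delete local instance of %d\n", p->pid); }

/* Steps 2-6 of the migration flow (step 1 is the initiating request). */
static void migrate(struct process *p, int receiver)
{
    suspend(p);
    queue_messages(p);
    send_state(p, receiver);
    notify_all_nodes(p, receiver);
    delete_local(p);
}

int main(void)
{
    struct process p = { 42 };
    migrate(&p, 3);
    return 0;
}
```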

Process migration strategies (continued)
- Flushing migration: flush all dirty pages to disk before migration; transfer no memory pages during the initial migration
- Precopy migration: begin transferring memory pages before suspending the process; migrate once the number of dirty pages falls below a threshold

Load balancing
The system attempts to distribute the processing load evenly among processors:
- Reduces variance in response times
- Ensures no processors remain idle while other processors are overloaded
Two types: static load balancing and dynamic load balancing.

Static load balancing
- Useful in environments in which jobs exhibit predictable patterns
- Goals: distribute the load and reduce communication costs
- Can be solved using a graph model
- For large systems, approximations must be used to balance loads with reasonable overhead

Dynamic load balancing
More useful than static load balancing in environments where communication patterns change and processes are created or terminated unpredictably.
Types of policies:
- Sender-initiated
- Receiver-initiated
- Symmetric
- Random
Algorithms: the bidding algorithm and the drafting algorithm.
Communication issues:
- Communication with remote nodes can have long delays, and the load on a processor might change in the meantime
- Coordination between processors to balance the load creates many messages and can overload the system
- Solution: communicate only with neighbor nodes; load diffuses throughout the system (see the sketch below)
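A toy sketch of that diffusion idea (the ring topology and the numbers are invented): each node repeatedly averages its load with its two neighbors only, yet the load spreads through the whole system without any global coordination.

```c
#include <stdio.h>

#define NODES  6
#define ROUNDS 8

int main(void)
{
    /* Initial loads: node 0 is overloaded, the rest are nearly idle. */
    double load[NODES] = {60, 2, 2, 2, 2, 2};
    double next[NODES];

    for (int r = 0; r < ROUNDS; r++) {
        for (int i = 0; i < NODES; i++) {
            int left  = (i + NODES - 1) % NODES;   /* ring topology */
            int right = (i + 1) % NODES;
            /* Each node talks only to its neighbors; local averaging
               lets load diffuse gradually through the system. */
            next[i] = (load[left] + load[i] + load[right]) / 3.0;
        }
        for (int i = 0; i < NODES; i++)
            load[i] = next[i];
    }
    for (int i = 0; i < NODES; i++)
        printf("node %d load %.1f\n", i, load[i]);
    return 0;
}
```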

Summary
Next time: networks and distributed systems.