Midterm Exam 02/09/2009


Portland State University
Department of Electrical and Computer Engineering
ECE 588/688 Advanced Computer Architecture II
Winter 2009
Midterm Exam 02/09/2009
Answer All Questions
PSU ID#:

Question 1: (16 points)

Program P runs to completion in 100 seconds on a single, in-order processor. When Program P is parallelized, 20 seconds have to be spent in serial execution that cannot be parallelized; the rest of the execution time can be perfectly parallelized on any number of processors.

(1A) (6 points) When Program P runs on 16 processors, how much time would it take to run to completion?

Time = Serial time + (Parallel time / N) = 20 + 80/16 = 25 seconds

Another solution: speedup = 1 / (S + P/N) = 1 / (0.2 + 0.8/16) = 4
Time = original time / speedup = 100/4 = 25 seconds

(1B) (4 points) Assume we have a system with an infinite number of processors. How much time would it take to run Program P to completion?

When N = infinity, time = Serial time + (Parallel time / infinity) = Serial time = 20 seconds

(1C) (6 points) Suppose we have a fixed area on which we can build one of the following two systems:
- System 1 contains 16 in-order, single-issue processors similar to the ones in part (1A).
- System 2 contains 4 out-of-order, 4-wide superscalar processors. Program P runs to completion on one of these processors in 50 seconds. When parallelized, 4 seconds have to be spent in serial execution, and the rest can be perfectly parallelized.
Which of these two systems would you choose to run Program P?

System 1's runtime = 25 seconds
System 2's runtime = serial time + parallel time / number of processors = 4 + 46/4 = 15.5 seconds
System 2's runtime is lower, so we choose System 2.
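The arithmetic in Question 1 follows Amdahl's law, and can be checked with a short script (a sketch; the function name `runtime` is only for illustration):

```python
# Amdahl's-law runtime model used throughout Question 1:
#   time(N) = serial + parallel / N

def runtime(serial, parallel, n):
    """Total runtime on n processors, assuming perfect parallelization."""
    return serial + parallel / n

# (1A) 16 in-order processors: 100 s total = 20 s serial + 80 s parallel
t_16 = runtime(20, 80, 16)              # 25.0 seconds

# Equivalent speedup form with serial/parallel fractions S and P
speedup_16 = 1 / (0.2 + 0.8 / 16)       # 4.0

# (1B) as N -> infinity, only the serial part remains
t_inf = runtime(20, 80, float("inf"))   # 20.0 seconds

# (1C) System 2: 50 s total = 4 s serial + 46 s parallel, on 4 processors
t_sys2 = runtime(4, 46, 4)              # 15.5 seconds

print(t_16, speedup_16, t_inf, t_sys2)
```

Running it reproduces the answers above: 25 seconds for System 1, 20 seconds in the infinite-processor limit, and 15.5 seconds for System 2.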

Question 2: (16 points)

(2A) (4 points) Identify one advantage and one disadvantage of UMA (compared to NUMA).
Advantage: Simpler to design
Disadvantage: Less scalable

(2B) (4 points) What are the characteristics of programs that perform better on a CMP than on a superscalar processor?
- Programs with high parallelism (e.g., IPC > 40), like most floating-point applications
- Programs with high thread-level parallelism but low instruction-level parallelism
- Parallel programs

(2C) (4 points) Identify one advantage and one disadvantage of a hyper-threading processor compared to a multiprocessor.
Advantage: Area and power efficiency
Disadvantage: Resource contention due to sharing of resources between multiple threads

(2D) (4 points) Identify one advantage and one disadvantage of the distributed memory architecture (compared to the shared memory architecture).
Advantage: More scalable
Disadvantage: Harder to program

Question 3: (16 points)

Consider the following program fragment, which runs in parallel on a multiprocessor system that implements the MSI cache coherence protocol:

I1: LOAD R1, [A]    // load from memory address A to register R1
I2: LOAD R2, [B]    // load from memory address B to register R2
I3: ADD R3, R1, R2  // add the contents of R1 and R2, save result in R3
I4: STORE [A], R3   // store R3 to memory address A
I5: SUB R4, R4, 1   // subtract 1 from R4
I6: BNZ I1          // return to I1 if R4 is not zero

(3A) (6 points) If A and B reside in two different cache blocks, complete the following table to identify the cache state of the blocks containing A and B after each instruction executes. Possible states are M, S, and I.

Instruction     State for block A   State for block B
Initial State   I                   I
I1              S                   I
I2              S                   S
I3              S                   S
I4              M                   S

(3B) (10 points) Consider the following order of instructions on two processors P1 and P2:
P1:I1, P2:I1, P1:I2, P1:I3, P1:I4, P2:I2, P2:I3, P2:I4
Complete the following table to identify the cache state of the two blocks containing A and B in both processors' caches after each instruction executes.

Instruction     P1's state for A   P1's state for B   P2's state for A   P2's state for B
Initial State   I                  I                  I                  I
P1:I1           S                  I                  I                  I
P2:I1           S                  I                  S                  I
P1:I2           S                  S                  S                  I
P1:I3           S                  S                  S                  I
P1:I4           M                  S                  I                  I
P2:I2           M                  S                  I                  S
P2:I3           M                  S                  I                  S
P2:I4           I                  S                  M                  S
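The table in (3B) can be replayed with a minimal MSI trace checker (a sketch, not a full protocol model: it applies only the load/store transitions this question exercises, and treats I3, the register-only ADD, as a no-op since it makes no memory access):

```python
# Minimal MSI state tracker: one state per (processor, block).

def access(states, proc, block, is_write):
    """Update per-processor MSI states for one load or store."""
    if is_write:
        for p in states:                  # a write invalidates every other copy
            if p != proc:
                states[p][block] = "I"
        states[proc][block] = "M"         # the writer's copy becomes Modified
    else:
        for p in states:                  # a read downgrades a remote M copy to S
            if p != proc and states[p][block] == "M":
                states[p][block] = "S"
        if states[proc][block] == "I":    # cold or invalidated copy: fetch Shared
            states[proc][block] = "S"

states = {p: {"A": "I", "B": "I"} for p in ("P1", "P2")}

# The (3B) interleaving, with the non-memory instructions (I3, I5, I6) dropped.
trace = [("P1", "A", False), ("P2", "A", False), ("P1", "B", False),
         ("P1", "A", True),  ("P2", "B", False), ("P2", "A", True)]
for proc, block, is_write in trace:
    access(states, proc, block, is_write)

print(states)  # final table row: P1 has A=I, B=S; P2 has A=M, B=S
```

The final state matches the last row of the answer table: after P2:I4, P1 holds A in I and B in S, while P2 holds A in M and B in S.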

Question 4: (18 points)

(4A) (6 points) Identify one advantage and one disadvantage of write-invalidate cache coherence protocols (compared to write-update protocols).
Advantage: Less bandwidth is used when the same processor performs multiple writes to the same block.
Disadvantage: More bandwidth is used when more than one processor writes to the same block, due to the ping-pong effect of invalidations.

(4B) (6 points) Identify one advantage and one disadvantage of building an inclusive cache system (where an L2 cache has to contain all blocks in the L1 caches of the processors that share it).
Advantage: The L2 cache handles coherence with other processors, so the L1 does not have to snoop the bus or maintain directory information.
Disadvantage: Total effective cache size is reduced, and L1 blocks must be replaced when the corresponding L2 block is evicted.

(4C) (6 points) Identify one advantage and one disadvantage of using a split-transaction bus in a snooping-based cache coherence protocol.
Advantage: Lower bus utilization and an increase in effective bus bandwidth.
Disadvantage: Intermediate states are needed in the coherence protocol to handle the window between when a request is sent and when the cache receives its response.

Question 5: (16 points)

(5A) (6 points) Using the terminology of the DASH paper, identify the owner of a memory block whose directory state is:
(i) Uncached-remote: Home cluster
(ii) Shared-remote: Home cluster
(iii) Dirty-remote: Remote cluster

(5B) (4 points) In the DASH protocol, why does the directory need to keep track of all sharers of a memory block?
To avoid broadcasts: on a remote write, the directory sends invalidate requests only to the sharers of the block.

(5C) (6 points) The SGI Origin coherence protocol added the Exclusive ("E") cache state to the MSI cache coherence protocol implemented in the DASH prototype. Identify two reasons why that change can improve performance.
1. A single fetch from memory suffices for read-modify-write requests (saves bandwidth and reduces latency).
2. A processor can replace a block in the E state without informing the directory, saving bandwidth.
3. A request for a block whose E copy was replaced can be satisfied immediately from memory, without communicating with other processors (saves bandwidth and latency).

Question 6: (18 points)

(6A) (6 points) Identify two reasons why using IPC may be misleading when comparing the performance of multiprocessor systems.
- An architectural enhancement can lead to a completely different execution path, with a different number of instructions, due to operating-system scheduling or different orders of lock acquisition.
- Programs with idle instructions and spin locks may execute instructions that do not contribute to useful work.

(6B) (8 points) In the Data Diffusion Machine (DDM) implementation of COMA, list the sequence of cache coherence states and transactions when a processor needs to read and then write a block in the I state, and then another processor needs to write to that same block. Complete the following table with all the state information:

Current State   Requests and/or Responses   New State
I               Pread / Nread               R
R               Nread                       R
R               Ndata                       S
S               Pwrite / Nerase             W
W               Nexclusive                  E
E               Pwrite                      E
E               Pread                       E
E               Nread / Ndata               S
S               Nerase                      I

(6C) (4 points) Identify one disadvantage of COMA architectures compared to CC-NUMA architectures.
- Overhead of tags and state that is not needed in CC-NUMA.
- The coherence protocol is more complicated, since the system must ensure that the last copy of a replaced block is stored somewhere in the system.
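The (6B) answer can be expressed as a lookup table keyed by (current state, incoming event), where each entry gives the next state and any message emitted. This is a sketch covering only the transitions listed in the table above, not the full DDM protocol:

```python
# DDM transitions from the (6B) table for a single attraction-memory block.
# Key: (current state, event). Value: (new state, emitted message or None).
TRANSITIONS = {
    ("I", "Pread"):      ("R", "Nread"),   # read miss: request data on the network
    ("R", "Nread"):      ("R", None),      # another reader's request: keep waiting
    ("R", "Ndata"):      ("S", None),      # data arrives: block becomes Shared
    ("S", "Pwrite"):     ("W", "Nerase"),  # write to shared copy: erase other copies
    ("W", "Nexclusive"): ("E", None),      # erase acknowledged: block is Exclusive
    ("E", "Pwrite"):     ("E", None),      # exclusive writes hit locally
    ("E", "Pread"):      ("E", None),      # exclusive reads hit locally
    ("E", "Nread"):      ("S", "Ndata"),   # remote read: supply data, downgrade to S
    ("S", "Nerase"):     ("I", None),      # remote write: invalidate this copy
}

# Replay the event sequence from the table, row by row.
state = "I"
events = ["Pread", "Nread", "Ndata", "Pwrite", "Nexclusive",
          "Pwrite", "Pread", "Nread", "Nerase"]
for ev in events:
    state, emitted = TRANSITIONS[(state, ev)]
print(state)  # the block ends in I after the remote write completes
```

Replaying the sequence walks the block through I, R, S, W, E and back to I, exactly as in the answer table.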

Scratch Pad (Not graded)