CS 152 Computer Architecture and Engineering


CS 152 Computer Architecture and Engineering. Lecture 27: Multiprocessors. 2005-04-28. John Lazzaro (www.cs.berkeley.edu/~lazzaro). TAs: Ted Hong and David Marquardt. www-inst.eecs.berkeley.edu/~cs152/

Last Time: Synchronization. (Figure: the shared queue in memory before and after an operation, with head and tail pointers; higher addresses toward the top.)

T1 code (producer):
  ORi  R1, R0, x      ; Load x value into R1
  LW   R2, tail(R0)   ; Load queue tail into R2
  SW   R1, 0(R2)      ; Store x into queue
  ADDi R2, R2, 4      ; Shift tail by one word
  SW   R2, tail(R0)   ; Update tail memory addr

T2 & T3 code (2 copies of the consumer thread):
        LW   R3, head(R0)   ; Load queue head into R3
  spin: LW   R4, tail(R0)   ; Load queue tail into R4
        BEQ  R4, R3, spin   ; If queue empty, wait
        LW   R5, 0(R3)      ; Read x from queue into R5
        ADDi R3, R3, 4      ; Shift head by one word
        SW   R3, head(R0)   ; Update head memory addr

Critical section: T2 and T3 must take turns running the consumer code shown in red on the slide.

Today: Memory System Design
Multiprocessor memory systems: consequences of cache placement.
Write-through cache coherency: a simple, but limited, approach to multiprocessor memory systems.
NUMA and clusters: two different ways to build very large computers.

Two CPUs, two caches, shared DRAM... (Figure: CPU0 and CPU1, each with its own write-through cache, sharing main memory; address 16 initially holds the value 5.)

CPU0: LW R2, 16(R0)   ; CPU0's cache now holds 5 for address 16
CPU1: LW R2, 16(R0)   ; CPU1's cache now holds 5 for address 16
CPU1: SW R0, 16(R0)   ; CPU1 writes 0; its cache and main memory now hold 0, but CPU0's cache still holds 5

The view of memory is no longer coherent: loads of location 16 from CPU0 and CPU1 see different values! Today: what to do...
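To make the scenario concrete, here is a toy software model of it in C: a minimal sketch whose structures and names (cache0, cache1, lw, sw) are made up for illustration, and which mimics the slide's incoherent write-through caches rather than any real machine.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 64
static uint32_t memory[MEM_WORDS];                  /* shared DRAM, word-addressed */

typedef struct { bool valid; uint32_t data; } Line;
static Line cache0[MEM_WORDS], cache1[MEM_WORDS];   /* one private cache per CPU */

static uint32_t lw(Line *cache, int addr) {         /* load: fill from memory on a miss */
    if (!cache[addr].valid)
        cache[addr] = (Line){ true, memory[addr] };
    return cache[addr].data;
}

static void sw(Line *cache, int addr, uint32_t v) { /* store: write-through to memory,
                                                       but no invalidation of the other cache */
    cache[addr] = (Line){ true, v };
    memory[addr] = v;
}

int main(void) {
    memory[16] = 5;
    printf("CPU0 LW 16 -> %u\n", lw(cache0, 16));            /* 5 */
    printf("CPU1 LW 16 -> %u\n", lw(cache1, 16));            /* 5 */
    sw(cache1, 16, 0);                                       /* CPU1: SW R0, 16(R0) */
    printf("CPU0 LW 16 -> %u  (stale!)\n", lw(cache0, 16));  /* still 5: not coherent */
    printf("CPU1 LW 16 -> %u\n", lw(cache1, 16));            /* 0 */
    return 0;
}

The final CPU0 load still returns 5 because nothing told CPU0's cache that address 16 changed; supplying that missing mechanism is the subject of the rest of the lecture.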

The simplest solution... one cache! (Figure: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache, which sits in front of shared main memory.)
The CPUs do not have internal caches. There is only one cache, so different values for a memory address cannot appear in 2 caches!
Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one request is stalled.
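As an aside on how a banked cache steers two requests in the same switch epoch, here is a small sketch; the bank count, line size, and function names are assumptions for illustration, not figures from the course.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS  4    /* assumed number of cache banks (power of two) */
#define LINE_BYTES 32   /* assumed cache line size */

/* Pick a bank from the line address so consecutive lines land in different banks. */
static unsigned bank_of(uint32_t addr) {
    return (addr / LINE_BYTES) % NUM_BANKS;
}

/* Two same-epoch requests conflict only if they target the same bank. */
static bool bank_conflict(uint32_t a, uint32_t b) {
    return bank_of(a) == bank_of(b);
}

int main(void) {
    printf("0x100 vs 0x120: %s\n", bank_conflict(0x100, 0x120) ? "stall one" : "both proceed");
    printf("0x100 vs 0x180: %s\n", bank_conflict(0x100, 0x180) ? "stall one" : "both proceed");
    return 0;
}

Consecutive lines map to different banks, so two CPUs working on nearby data usually proceed in parallel; only same-bank accesses force one request to stall.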

Not a complete solution... good for L2. (Figure: the same CPU0/CPU1, memory switch, shared multi-bank cache, and shared main memory.)
For modern clock rates, access to the shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing the clock down 10X for LWs. Not good.
This approach was a complete solution in the days when DRAM row access time and the CPU clock period were well matched.

Modified form: private L1s, shared L2. (Figure: CPU0 and CPU1 each have private L1 caches; a memory switch or bus connects them to a shared multi-bank L2 cache, backed by shared main memory.)
Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data/instructions. Disadvantage: the CPUs share bandwidth to the L2 cache...
Thus, we need to solve the cache coherency problem for the L1 caches.

Sequentially Consistent Memory Systems

Recall: Sequential Consistency. Sequential consistency: as if each thread takes turns executing, and instructions in each thread execute in program order.

T1 code (producer):
  ORi  R1, R0, x          ; Load x value into R1
  LW   R2, tail(R0)       ; Load queue tail into R2
  SW   R1, 0(R2)      (1) ; Store x into queue
  ADDi R2, R2, 4          ; Shift tail by one word
  SW   R2, tail(R0)   (2) ; Update tail memory addr

T2 code (consumer):
        LW   R3, head(R0)       ; Load queue head into R3
  spin: LW   R4, tail(R0)   (3) ; Load queue tail into R4
        BEQ  R4, R3, spin       ; If queue empty, wait
        LW   R5, 0(R3)      (4) ; Read x from queue into R5
        ADDi R3, R3, 4          ; Shift head by one word
        SW   R3, head(R0)       ; Update head memory addr

Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2... but not 2, 3, 1, 4!
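For comparison, here is a rough C11 translation of the same single-producer, single-consumer queue: a minimal sketch, where the fixed-size array, the variable names, and the reliance on seq_cst atomics (the C11 default) are assumptions made for illustration, not part of the course's MIPS example.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define QSIZE 1024                  /* assumed queue capacity */
static int queue[QSIZE];            /* shared queue storage */
static _Atomic int head = 0;        /* like the slide's head word */
static _Atomic int tail = 0;        /* like the slide's tail word */

static void *producer(void *arg) {              /* T1 */
    int x = *(int *)arg;
    int t = atomic_load(&tail);                 /* LW  R2, tail(R0) */
    queue[t] = x;                               /* SW  R1, 0(R2)      (1) */
    atomic_store(&tail, t + 1);                 /* SW  R2, tail(R0)   (2) */
    return NULL;
}

static void *consumer(void *arg) {              /* T2 */
    (void)arg;
    int h = atomic_load(&head);                 /* LW  R3, head(R0) */
    while (atomic_load(&tail) == h)             /* spin: LW R4, tail(R0) (3); BEQ */
        ;                                       /* queue empty: wait */
    int x = queue[h];                           /* LW  R5, 0(R3)      (4) */
    atomic_store(&head, h + 1);                 /* SW  R3, head(R0) */
    printf("consumed %d\n", x);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int x = 42;
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_create(&t1, NULL, producer, &x);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

(Compile with -pthread.) With two consumer threads, as on the earlier slide, the read-and-update of head would additionally need a critical section so the consumers take turns.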

Sequential consistency requirements... (Figure: the two CPUs with private caches and the shared memory hierarchy, with address 16 present in both caches.)
1. Only one processor at a time has write permission for a memory location. The "sequential" part of sequential consistency.
2. No processor can load a stale copy of a location after a write. The "consistent" part of sequential consistency.

Implementation: Snoopy Caches. (Figure: CPU0 and CPU1, each with a cache and a snooper, attached to a shared memory bus in front of the shared main memory hierarchy.)
Each cache has the ability to snoop on the memory bus transactions of other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs.

Writes from 10,000 feet... for write-thru caches. (Figure: the same snoopy-bus organization.)
To first order, reads will "just work" if write-thru caches implement this policy, a two-state protocol (cache lines are "valid" or "invalid"):
1. The writing CPU takes control of the bus.
2. The address to be written is invalidated in all other caches. Reads will no longer hit in the cache and get stale data.
3. The write is sent to main memory. Reads will cache miss and retrieve the new value from main memory.
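Here is the same toy model from before, now with the invalidation step added: a minimal sketch of this two-state, write-thru invalidate policy, using made-up structures rather than real hardware.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CPUS  2
#define MEM_WORDS 64
static uint32_t memory[MEM_WORDS];               /* shared main memory, word-addressed */

typedef struct { bool valid; uint32_t data; } Line;
static Line cache[NUM_CPUS][MEM_WORDS];          /* toy: one line per word, per CPU */

/* Write: take the bus, invalidate all other copies, update own cache, write through. */
static void cpu_write(int cpu, int addr, uint32_t value) {
    for (int c = 0; c < NUM_CPUS; c++)           /* step 2: snoopers invalidate */
        if (c != cpu) cache[c][addr].valid = false;
    cache[cpu][addr] = (Line){ true, value };
    memory[addr] = value;                        /* step 3: write-through to memory */
}

/* Read: hit if valid; otherwise miss and refill from main memory. */
static uint32_t cpu_read(int cpu, int addr) {
    if (!cache[cpu][addr].valid)
        cache[cpu][addr] = (Line){ true, memory[addr] };
    return cache[cpu][addr].data;
}

int main(void) {
    memory[16] = 5;
    printf("CPU0 reads 16: %u\n", cpu_read(0, 16));   /* 5 */
    printf("CPU1 reads 16: %u\n", cpu_read(1, 16));   /* 5 */
    cpu_write(1, 16, 0);                              /* invalidates CPU0's copy */
    printf("CPU0 reads 16: %u\n", cpu_read(0, 16));   /* 0: miss, refetched, no stale 5 */
    return 0;
}

Replaying the earlier scenario, CPU0's reload of address 16 now misses and fetches 0 from main memory instead of returning the stale 5.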

Limitations of the write-thru approach. (Figure: the same snoopy-bus organization.)
Every write goes to the bus. Total bus write bandwidth does not support more than 2 CPUs, in modern practice. To scale further, we need to use write-back caches.
Write-back big trick: keep track of whether other caches also contain a cached line. If not, a cache has an "exclusive" on the line, and can read and write the line as if it were the only CPU. For details, take CS 252...
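A rough sketch of the bookkeeping behind that trick, using a simplified three-state (MSI-style) line state; the state names and transitions below are a common textbook formulation chosen for illustration, not the specific protocol CS 252 presents.

#include <stdio.h>

/* Simplified per-line states for a write-back snoopy cache (MSI-style). */
typedef enum { INVALID, SHARED, MODIFIED } LineState;

/* Local write: unless we already own the line exclusively, broadcast an
 * invalidate on the bus so other caches drop their copies; then write
 * locally without touching main memory. */
static LineState on_local_write(LineState s) {
    if (s != MODIFIED)
        printf("bus: invalidate other copies\n");
    return MODIFIED;              /* we now hold the only (dirty) copy */
}

/* Snooped remote write: our copy, if any, becomes stale; a dirty copy
 * must be written back before we drop it. */
static LineState on_remote_write(LineState s) {
    if (s == MODIFIED)
        printf("bus: write dirty line back to memory\n");
    return INVALID;
}

int main(void) {
    LineState s = SHARED;
    s = on_local_write(s);        /* SHARED -> MODIFIED: others invalidated, no memory write */
    s = on_remote_write(s);       /* MODIFIED -> INVALID: dirty data written back first */
    printf("final state: %d\n", s);
    return 0;
}

The point of the exclusive (MODIFIED) state is exactly the slide's observation: once no other cache holds the line, repeated reads and writes stay entirely inside the local cache and never touch the bus.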

Other Machine Architectures

NUMA: Non-Uniform Memory Access. (Figure: CPU 0 ... CPU 1023, each with its own cache and local DRAM, joined by an interconnection network.)
Each CPU has part of main memory attached to it. To access other parts of main memory, it uses the interconnection network.
Good for applications that match the machine model... For best results, applications take the non-uniform memory latency into account.
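On a Linux NUMA machine, an application can take that latency into account explicitly with libnuma; below is a minimal sketch (it assumes the libnuma headers are installed and the program is linked with -lnuma) that places a working buffer in node 0's local DRAM.

#include <numa.h>      /* libnuma: NUMA policy API */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {                 /* is NUMA supported here? */
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    size_t bytes = 1 << 20;                     /* 1 MiB working buffer */
    /* Allocate the buffer from node 0's local DRAM; threads running on node 0
     * see local latency, while CPUs on other nodes pay the interconnect cost. */
    double *buf = numa_alloc_onnode(bytes, 0);
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        buf[i] = (double)i;                     /* pages were bound to node 0 by the allocator */
    printf("highest NUMA node on this machine: %d\n", numa_max_node());
    numa_free(buf, bytes);
    return 0;
}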

Clusters... an application-level approach. Connect large numbers of 1-CPU or 2-CPU rack-mount computers together with high-end network technology. University of Illinois: a cluster of 650 2-CPU Apple Xserves, connected with Myrinet (3.5 µs ping time, a low-latency network). Instead of using hardware to create a shared memory abstraction, let an application build its own memory model.

Clusters also used for web servers. (Figure: clients on an IP network reach a single-site server: a load manager in front of many machines, a persistent data store, and an optional backplane.)
In some applications, each machine can handle a net query by itself. Example: serving static web pages. Each machine has a copy of the website. The load manager is a special-purpose computer that assigns incoming HTTP connections to a particular machine.
Image from Eric Brewer's IEEE Internet Computing article.

Clusters also used for web services. (Figure: the single-site server again, with a program running on each of many machines, a load manager, a partitioned data store, and a Myrinet backplane.)
In other applications, many machines work together on each transaction. Example: web searching. The search is partitioned over many machines, each of which holds a part of the database.
The AltaVista web search engine did not use clusters. Instead, AltaVista used shared-memory multiprocessors. This approach could not scale with the web.

CS 152: What's left...
Monday 5/2: Final report due, 11:59 PM. Watch email for the final project peer review request.
Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda. No class on Thursday. Review session on Tuesday 5/2, + HKN (???).
Tuesday 5/10: Final presentations. Deadline to bring up grading issues: Tuesday 5/10 @ 5 PM. Contact John at lazzaro@eecs

Tomorrow: Final Project Checkoff. (Figure: the pipelined CPU connected over the IC and DC buses to the instruction cache and data cache, which connect over the IM and DM buses to the DRAM controller and DRAM.)
TAs will provide secret MIPS machine code tests. Bonus points if these tests run by 2 PM. If not, the TAs give you test code to use over the weekend.