Snoop-Based Multiprocessor Design III: Case Studies

Size: px
Start display at page:

Download "Snoop-Based Multiprocessor Design III: Case Studies"

Transcription

1 Snoop-Based Multiprocessor Design III: Case Studies Todd C. Mowry CS 41 March, Case Studies of Bus-based Machines SGI Challenge, with Powerpath SUN Enterprise, with Gigaplane Take very different positions on the design issues discussed above Overview For each system: Bus design Processor and Memory System Input/Output system Microbenchmark memory access results Application performance and scaling (SGI Challenge) SGI Challenge Overview SUN Enterprise Overview R44 CPUs and caches VME-4 SCSI- Graphics HPPI I/O subsystem Interleaved memory: 1 GB maximum P $ P $ $ $ mem ctrl Bus Interf ace / Switch CPU/Mem Cards Bus Interf ace I/O Cards (a) A four-processor board 3 R44 (peak.7 GFLOPS, 4 per board) or 1 R (peak 5.4 GFLOPS, per board) -way interleaved memory (up to 1 GB) 4 I/O ses of 3 MB/s each 1. GB/s 47. MHz, 1 slots, 39 signals Bytes lines (1 + 4 cycles) Split-transaction with up to outstanding reads all transactions take five cycles 3 Powerpath- (5 data, 4 address, 47. MHz) (b) Machine organization Up to 3 UltraSPARC processors (peak 9 GFLOPs) Gigaplane TM has peak bw.7 GB/s; upto 3GB memory 1 slots, for processing or I/O boards CPUs and 1GB memory per board memory distributed, unlike Challenge, but protocol treats as centralized Each I/O board has 4-bit 5Mhz SBUSes 4 Gigaplane TM (5 data, 41 address, 3 MHz) Page 1

2 Bus Design Issues Multiplexed versus non-multiplexed (separate addr and data lines) Wide versus narrow data ses Bus clock rate Affected by signaling technology, length, number of slots... SGI Powerpath- Bus Non-multiplexed, 5-data/4-address, 47. MHz, o/s requests Wide => more interface chips so higher latency, but more bw at slower clock Large block size also calls for wider Uses Illinois MESI protocol (cache-to-cache sharing) More detail in chapter. Resolution At least one requestor Split transaction versus atomic Flow control strategy 1. Arbitration No requestors Acknowledge 4. Decode 5 Bus Timing Processor and Memory Systems. Resolution At least one requestor 1. Arbitration No requestors 3. R44 R44 R44 R44 5. Acknowledge 4. Decode Arb Rslv Addr Decode Ack Arb Rslv Addr Decode Ack A-chip slice 1 slice slice 3 slice 4 Command Urgent arb arb arb Cmd ack ack Urgent arb arb arb D 1 D D 3 D D 1 D D 3 D resource Inhib resource Inhib Inhib Inhib and inhibit resource ID ID 7 Cmd ack ack 4 R44 processors per board share A and D chips A chip has address interface, request table, control logic CC chip per processor has duplicate set of Processor requests go from CC chip to A chip to 4 bit-sliced D chips interface CC chip to Powerpath- Page

3 Memory Access Latency Challenge I/O Subsystem 5ns access time from address on to data on Peripheral SCSI VME HPPI Personality graphics ASICs But overall latency seen by processor is 1ns! 3ns for request to get from processor to down through cache hierarchy, CC chip and A chip 4ns later, data gets to D chips 3 cycles to address phase of request transaction, to access main memory, 5 to deliver data across to D chips 3ns more for data to get to processor chip up through D chips, CC chip, and 4-bit wide interface to processor chip, load data into primary cache, restart pipeline 9 Multiple I/O cards on system, each has 3MB/s Personality ASICs connect these to devices (standard and graphics) Proprietary 4-bit multiplexed address/data, same clock as system Split read transactions, up to 4 per device Pipelined, but centralized arbitration, with several transaction lengths translation via mapping RAM in system interface Why the decouplings? (Why not connect directly to system?) I/O board acts like a processor to memory system 1 (3 MB/s) map System address System data (1. GB/s) path System to interface Challenge Memory System Performance Read microbenchmark with various strides and array sizes Time (ns) 1,5 TLB MEM 1, M 4 M M 1 M 5 K 5 5 K K 4 K 3 K 1 K L K 4 K 1 K 4 K 5 K 1 M 4 M Stride (bytes) Ping-pong flag-spinning microbenchmark: round-trip time. ms. Sun Gigaplane Bus Non-multiplexed, split-transaction, 5-data/41-address, 3.5 MHz Plus 3 ECC lines, 7 tag, 1 arbitration, etc. Total 3. Cards plug in on both sides: per side 1 outstanding transactions, up to 7 from each board Designed for multiple outstanding transactions per processor Emphasis on reducing latency, unlike Challenge Speculative arbitration if address not scheduled from prev. cycle Else regular 1-cycle arbitration, and 7-bit tag assigned in next cycle Snoop result associated with request phase (5 cycles later) Main memory can stake claim to data 3 cycles into this, and start memory access speculatively Two cycles later, asserts tag to inform others of coming transfer MOESI protocol (owned state for cache-to-cache sharing) 11 Page 3

4 Gigaplane Bus Timing Enterprise Processor and Memory System Arbitration Rd A Rd B Share ~Own Own 1 4,5 7 procs per board, external L caches, mem banks with x-bar lines buffered through UDB to drive internal 1.3 GB/s UPA Wide path to memory so full 4-byte line in 1 mem cycle ( cyc) Addr controller adapts proc and protocols, does cache coherence its keep a subset of states needed by (e.g. no M/E distinction) Status A D A D A D A D A D A D A D A D OK Cancel Memory (1 7-bit SIMMS) s s FiberChannel module () SBUS slots 1/1 Ethernet Fast wide SCSI D D 1 UDB UltraSparc UDB UltraSparc SBUS 5 MHz 4 SysIO SysIO D- controller controller (crossbar) controller controller (crossbar) Control Control Gigaplane connector Gigaplane connector Enterprise I/O System I/O board has same interface ASICs as processor boards But internal half as wide, and no memory path Only cache block sized transactions, like processing boards Uniformity simplifies design ASICs implement single-block cache, follows coherence protocol Two independent 4-bit, 5 MHz Ses One for two dedicated FiberChannel modules connected to disk One for Ethernet and fast wide SCSI Can also support three SBUS interface cards for arbitrary peripherals Performance and cost of I/O scale with no. of I/O boards 1 Memory Access Latency 3ns read miss latency 11 cycle min protocol at 3.5 Mhz is 13ns of this time Rest is path through caches and the DRAM access TLB misses add 34 ns Time (ns) M 4 M M 1 M 5 K 5 K K 4 K 3 K 1 K K 4 K 1 K 4 K 5 K 1 M 4 M Stride (bytes) Ping-pong microbenchmark is 1.7 ms round-trip (5 mem accesses) Page 4

5 Application s (Challenge) Application Scaling under Other Models Problem in Ocean with small problem: communication and barrier cost Problem in Radix: contention on due to very high traffic also leads to high imbalances and barrier wait time 17 LU: n = 1,4 Barnes-Hut: 1-K particles LU: n =,4 Barnes-Hut: 5-K particles Raytrace: balls Ocean: n = 13 Raytrace: car Ocean: n = 1,4 Radiosity: room Radix: 1-M keys Radiosity: large room Radix: 4-M keys Work (instructions) Number of bodies 1, 9, Naive TC Naive MC, TC 7, MC, 5, 4, 3,, 1, Naive TC Naive MC TC MC Naive TC Naive MC TC MC PC Number of points per grid Work (instructions) ,, 1,,,, 4,, 1 1 Naive TC Naive MC TC MC Naive TC Naive MC TC MC PC 4 Naive TC Naive MC TC MC Page 5

SGI Challenge Overview

SGI Challenge Overview CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 2 (Case Studies) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived

More information

Review: Symmetric Multiprocesors (SMP)

Review: Symmetric Multiprocesors (SMP) CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 2 Copyright 2005 Mark D. Hill University of Wisconsin-Madison Slides are derived from work

More information

Lecture 7: Implementing Cache Coherence. Topics: implementation details

Lecture 7: Implementing Cache Coherence. Topics: implementation details Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,

More information

Page 1. What is (Hardware) Shared Memory? Some Memory System Options. Outline

Page 1. What is (Hardware) Shared Memory? Some Memory System Options. Outline What is (Hardware) Shared Memory? ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Shared Memory MPs Coherence & Snooping Copyright 2006 Daniel J. Sorin Duke University

More information

NOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability

NOW Handout Page 1. Recap: Gigaplane Bus Timing. Scalability Recap: Gigaplane Bus Timing 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Address Rd A Rd B Scalability State Arbitration 1 4,5 2 Share ~Own 6 Own 7 A D A D A D A D A D A D A D A D CS 258, Spring 99 David E. Culler

More information

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in

More information

EE 457 Unit 7b. Main Memory Organization

EE 457 Unit 7b. Main Memory Organization 1 EE 457 Unit 7b Main Memory Organization 2 Motivation Organize main memory to Facilitate byte-addressability while maintaining Efficient fetching of the words in a cache block Low order interleaving (L.O.I)

More information

Physical Design of Snoop-Based Cache Coherence on Multiprocessors

Physical Design of Snoop-Based Cache Coherence on Multiprocessors Physical Design of Snoop-Based Cache Coherence on Multiprocessors Muge Guher University Of Ottawa Abstract This report focuses on the hardware design issues associated with the physical implementation

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

Uniprocessor Computer Architecture Example: Cray T3E

Uniprocessor Computer Architecture Example: Cray T3E Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure

More information

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System? Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

A More Sophisticated Snooping-Based Multi-Processor

A More Sophisticated Snooping-Based Multi-Processor Lecture 16: A More Sophisticated Snooping-Based Multi-Processor Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2014 Tunes The Projects Handsome Boy Modeling School (So... How

More information

A Basic Snooping-Based Multi-Processor Implementation

A Basic Snooping-Based Multi-Processor Implementation Lecture 11: A Basic Snooping-Based Multi-Processor Implementation Parallel Computer Architecture and Programming Tsinghua has its own ice cream! Wow! CMU / 清华 大学, Summer 2017 Review: MSI state transition

More information

Input/Output Introduction

Input/Output Introduction Input/Output 1 Introduction Motivation Performance metrics Processor interface issues Buses 2 Page 1 Motivation CPU Performance: 60% per year I/O system performance limited by mechanical delays (e.g.,

More information

Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses

Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses 1 Most of the integrated I/O subsystems are connected to the

More information

Introduction. Motivation Performance metrics Processor interface issues Buses

Introduction. Motivation Performance metrics Processor interface issues Buses Input/Output 1 Introduction Motivation Performance metrics Processor interface issues Buses 2 Motivation CPU Performance: 60% per year I/O system performance limited by mechanical delays (e.g., disk I/O)

More information

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1 Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses,

More information

JBus Architecture Overview

JBus Architecture Overview JBus Architecture Overview Technical Whitepaper Version 1.0, April 2003 This paper provides an overview of the JBus architecture. The Jbus interconnect features a 128-bit packet-switched, split-transaction

More information

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based)

Lecture 8: Snooping and Directory Protocols. Topics: split-transaction implementation details, directory implementations (memory- and cache-based) Lecture 8: Snooping and Directory Protocols Topics: split-transaction implementation details, directory implementations (memory- and cache-based) 1 Split Transaction Bus So far, we have assumed that a

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

A Basic Snooping-Based Multi-Processor Implementation

A Basic Snooping-Based Multi-Processor Implementation Lecture 15: A Basic Snooping-Based Multi-Processor Implementation Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Pushing On (Oliver $ & Jimi Jules) Time for the second

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

SunFire range of servers

SunFire range of servers TAKE IT TO THE NTH Frederic Vecoven Sun Microsystems SunFire range of servers System Components Fireplane Shared Interconnect Operating Environment Ultra SPARC & compilers Applications & Middleware Clustering

More information

Performance study example ( 5.3) Performance study example

Performance study example ( 5.3) Performance study example erformance study example ( 5.3) Coherence misses: - True sharing misses - Write to a shared block - ead an invalid block - False sharing misses - ead an unmodified word in an invalidated block CI for commercial

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley

Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley Fall 2012 EE 6633: Architecture of Parallel Computers Lecture 4: Shared Address Multiprocessors Acknowledgement: Dave Patterson, UC Berkeley Avinash Kodi Department of Electrical Engineering & Computer

More information

14:332:331. Week 13 Basics of Cache

14:332:331. Week 13 Basics of Cache 14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 13 Basics of Cache [Adapted from Dave Patterson s UCB CS152 slides and Mary Jane Irwin s PSU CSE331 slides] 331 Lec20.1 Fall 2003 Head

More information

Chapter Seven Morgan Kaufmann Publishers

Chapter Seven Morgan Kaufmann Publishers Chapter Seven Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be

More information

Lecture 25: Multiprocessors

Lecture 25: Multiprocessors Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Introduction to memory system :from device to system

Introduction to memory system :from device to system Introduction to memory system :from device to system Jianhui Yue Electrical and Computer Engineering University of Maine The Position of DRAM in the Computer 2 The Complexity of Memory 3 Question Assume

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 14 - Cache Design and Coherence 2014-3-6 John Lazzaro (not a prof - John is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ Play: 1 Today:

More information

Where We Are in This Course Right Now. ECE 152 Introduction to Computer Architecture Input/Output (I/O) Copyright 2012 Daniel J. Sorin Duke University

Where We Are in This Course Right Now. ECE 152 Introduction to Computer Architecture Input/Output (I/O) Copyright 2012 Daniel J. Sorin Duke University Introduction to Computer Architecture Input/Output () Copyright 2012 Daniel J. Sorin Duke University Slides are derived from work by Amir Roth (Penn) Spring 2012 Where We Are in This Course Right Now So

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

Lect. 6: Directory Coherence Protocol

Lect. 6: Directory Coherence Protocol Lect. 6: Directory Coherence Protocol Snooping coherence Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere All cache controllers monitor

More information

Lecture 24: Virtual Memory, Multiprocessors

Lecture 24: Virtual Memory, Multiprocessors Lecture 24: Virtual Memory, Multiprocessors Today s topics: Virtual memory Multiprocessors, cache coherence 1 Virtual Memory Processes deal with virtual memory they have the illusion that a very large

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

PCI and PCI Express Bus Architecture

PCI and PCI Express Bus Architecture PCI and PCI Express Bus Architecture Computer Science & Engineering Department Arizona State University Tempe, AZ 85287 Dr. Yann-Hang Lee yhlee@asu.edu (480) 727-7507 7/23 Buses in PC-XT and PC-AT ISA

More information

Scalable Shared-Memory Implementations. Erik Hagersten Uppsala University

Scalable Shared-Memory Implementations. Erik Hagersten Uppsala University Scalable Shared-emory Implementations Erik Hagersten Uppsala University Sweden -to-cache in snoop-based A: B: BusRTS y RTS wait for data Gotta answer $ $ $ Thread Thread Thread Read A... Read A Read A

More information

The University of Adelaide, School of Computer Science 13 September 2018

The University of Adelaide, School of Computer Science 13 September 2018 Computer Architecture A Quantitative Approach, Sixth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Basics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS

Basics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS Basics DRAM ORGANIZATION DRAM Word Line Bit Line Storage element (capacitor) In/Out Buffers Decoder Sense Amps... Bit Lines... Switching element Decoder... Word Lines... Memory Array Page 1 Basics BUS

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Science 46 Computer Architecture Spring 24 Harvard University Instructor: Prof dbrooks@eecsharvardedu Lecture 22: More I/O Computer Science 46 Lecture Outline HW5 and Project Questions? Storage

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Page 1. Cache Coherence

Page 1. Cache Coherence Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Introduction to Input and Output

Introduction to Input and Output Introduction to Input and Output The I/O subsystem provides the mechanism for communication between the CPU and the outside world (I/O devices). Design factors: I/O device characteristics (input, output,

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

Learning Curve for Parallel Applications. 500 Fastest Computers

Learning Curve for Parallel Applications. 500 Fastest Computers Learning Curve for arallel Applications ABER molecular dynamics simulation program Starting point was vector code for Cray-1 145 FLO on Cray90, 406 for final version on 128-processor aragon, 891 on 128-processor

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems

Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems 1 Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy

More information

SONICS, INC. Sonics SOC Integration Architecture. Drew Wingard. (Systems-ON-ICS)

SONICS, INC. Sonics SOC Integration Architecture. Drew Wingard. (Systems-ON-ICS) Sonics SOC Integration Architecture Drew Wingard 2440 West El Camino Real, Suite 620 Mountain View, California 94040 650-938-2500 Fax 650-938-2577 http://www.sonicsinc.com (Systems-ON-ICS) Overview 10

More information

ELE 758 * DIGITAL SYSTEMS ENGINEERING * MIDTERM TEST * Circle the memory type based on electrically re-chargeable elements

ELE 758 * DIGITAL SYSTEMS ENGINEERING * MIDTERM TEST * Circle the memory type based on electrically re-chargeable elements ELE 758 * DIGITAL SYSTEMS ENGINEERING * MIDTERM TEST * Student name: Date: Example 1 Section: Memory hierarchy (SRAM, DRAM) Question # 1.1 Circle the memory type based on electrically re-chargeable elements

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Locality. CS429: Computer Organization and Architecture. Locality Example 2. Locality Example

Locality. CS429: Computer Organization and Architecture. Locality Example 2. Locality Example Locality CS429: Computer Organization and Architecture Dr Bill Young Department of Computer Sciences University of Texas at Austin Principle of Locality: Programs tend to reuse data and instructions near

More information

Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H Katz Computer Science 252 Spring 996 RHKS96 Review: Storage System Issues Historical Context of Storage I/O Storage I/O Performance

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Cache Coherence: Part II Scalable Approaches

Cache Coherence: Part II Scalable Approaches ache oherence: art II Scalable pproaches Hierarchical ache oherence Todd. Mowry S 74 October 27, 2 (a) 1 2 1 2 (b) 1 Topics Hierarchies Directory rotocols Hierarchies arise in different ways: (a) processor

More information

Chapter 6. I/O issues

Chapter 6. I/O issues Computer Architectures Chapter 6 I/O issues Tien-Fu Chen National Chung Cheng Univ Chap6 - Input / Output Issues I/O organization issue- CPU-memory bus, I/O bus width A/D multiplex Split transaction Synchronous

More information

Scalable Distributed Memory Machines

Scalable Distributed Memory Machines Scalable Distributed Memory Machines Goal: Parallel machines that can be scaled to hundreds or thousands of processors. Design Choices: Custom-designed or commodity nodes? Network scalability. Capability

More information

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache

More information

COSC 6385 Computer Architecture - Memory Hierarchies (III)

COSC 6385 Computer Architecture - Memory Hierarchies (III) COSC 6385 Computer Architecture - Memory Hierarchies (III) Edgar Gabriel Spring 2014 Memory Technology Performance metrics Latency problems handled through caches Bandwidth main concern for main memory

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Input/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security

Input/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security Input/Output Today Principles of I/O hardware & software I/O software layers Disks Next Protection & Security Operating Systems and I/O Two key operating system goals Control I/O devices Provide a simple,

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Lecture 25: Busses. A Typical Computer Organization

Lecture 25: Busses. A Typical Computer Organization S 09 L25-1 18-447 Lecture 25: Busses James C. Hoe Dept of ECE, CMU April 27, 2009 Announcements: Project 4 due this week (no late check off) HW 4 due today Handouts: Practice Final Solutions A Typical

More information

EECS 322 Computer Architecture Superpipline and the Cache

EECS 322 Computer Architecture Superpipline and the Cache EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:

More information

QUIZ Ch.6. The EAT for a two-level memory is given by:

QUIZ Ch.6. The EAT for a two-level memory is given by: QUIZ Ch.6 The EAT for a two-level memory is given by: EAT = H Access C + (1-H) Access MM. Derive a similar formula for three-level memory: L1, L2 and RAM. Hint: Instead of H, we now have H 1 and H 2. Source:

More information

SCI Basics. Hugo Kohmann

SCI Basics. Hugo Kohmann Hugo Kohmann Slide 1 What is SCI ½ The SCI standard ANSI/IEEE 1596-1992 defines a point-to-point interface and a set of packet protocols. ½ The SCI protocols use small packet sizes with a 16-byte header

More information

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects

EECS 598: Integrating Emerging Technologies with Computer Architecture. Lecture 12: On-Chip Interconnects 1 EECS 598: Integrating Emerging Technologies with Computer Architecture Lecture 12: On-Chip Interconnects Instructor: Ron Dreslinski Winter 216 1 1 Announcements Upcoming lecture schedule Today: On-chip

More information

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology Review: Multiprocessor CPE 631 Session 21: Multiprocessors (Part 2) Department of Electrical and Computer Engineering University of Alabama in Huntsville Basic issues and terminology Communication: share

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

High Performance Multiprocessor System

High Performance Multiprocessor System High Performance Multiprocessor System Requirements : - Large Number of Processors ( 4) - Large WriteBack Caches for Each Processor. Less Bus Traffic => Higher Performance - Large Shared Main Memories

More information

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 5. Memory Hierachy BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 5 Memory Hierachy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence CS252 Spring 2017 Graduate Computer Architecture Lecture 12: Cache Coherence Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time in Lecture 11 Memory Systems DRAM

More information

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence

ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy

More information

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu CENG 3420 Computer Organization and Design Lecture 08: Memory - I Bei Yu CEG3420 L08.1 Spring 2016 Outline q Why Memory Hierarchy q How Memory Hierarchy? SRAM (Cache) & DRAM (main memory) Memory System

More information

[ 5.4] What cache line size is performs best? Which protocol is best to use?

[ 5.4] What cache line size is performs best? Which protocol is best to use? Performance results [ 5.4] What cache line size is performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer write is part art and part

More information

... The Composibility Question. Composing Scalability and Node Design in CC-NUMA. Commodity CC node approach. Get the node right Approach: Origin

... The Composibility Question. Composing Scalability and Node Design in CC-NUMA. Commodity CC node approach. Get the node right Approach: Origin The Composibility Question Composing Scalability and Node Design in CC-NUMA CS 28, Spring 99 David E. Culler Computer Science Division U.C. Berkeley adapter Sweet Spot Node Scalable (Intelligent) Interconnect

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Static RAM (SRAM) Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 0.5ns 2.5ns, $2000 $5000 per GB 5.1 Introduction Memory Technology 5ms

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

OPENSPARC T1 OVERVIEW

OPENSPARC T1 OVERVIEW Chapter Four OPENSPARC T1 OVERVIEW Denis Sheahan Distinguished Engineer Niagara Architecture Group Sun Microsystems Creative Commons 3.0United United States License Creative CommonsAttribution-Share Attribution-Share

More information